# Kaggle: Intermediate Machine Learning
### Introduction
Load the training and validation features into `X_train` and `X_valid`, and the prediction targets into `y_train` and `y_valid`. The test features are loaded into `X_test`.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../Intro-to-ML/home-data-for-ml-course/train.csv', index_col='Id')
X_test_full = pd.read_csv('../Intro-to-ML/home-data-for-ml-course/test.csv', index_col='Id')

# Obtain target and features
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
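
To see what the 80/20 split does, here is a minimal sketch on a made-up ten-row frame. The names `X_toy` and `y_toy` and their values are hypothetical stand-ins for `X` and `y`, not the housing data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): ten rows standing in for X and y
X_toy = pd.DataFrame({'LotArea': range(1000, 11000, 1000)})
y_toy = pd.Series(range(10), index=X_toy.index)

# Same arguments as the real split: 80% for training, 20% for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=0)

print(X_tr.shape, X_va.shape)  # (8, 1) (2, 1)

# Features and targets are shuffled together, so their indexes still match
print(list(X_tr.index) == list(y_tr.index))  # True
```

Fixing `random_state` makes the shuffle reproducible, so every model is later scored on the same validation rows.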

In [5]:
# Looking at first few rows of data for the train set
X_train.head()

Id,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
619,11694,2007,1828,0,2,3,9
871,6600,1962,894,0,1,2,5
93,13360,1921,964,0,1,2,5
818,13265,2002,1689,0,2,3,7
303,13704,2001,1541,0,2,3,6


Next, define five random forest models that differ in their hyperparameters.

In [7]:
from sklearn.ensemble import RandomForestRegressor

# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]

To select the best of the five models, we define a function `score_model()` below. It fits a model on the training set and returns its mean absolute error (MAE) on the validation set. Recall that the best model is the one with the lowest MAE.

In [10]:
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

# Score every model and keep its MAE together with the fitted model
model_runs = {}
for i, model in enumerate(models, start=1):
    mae = score_model(model)
    model_runs[f'model_{i}'] = [mae, model]
    print(f"Model {i} \t MAE: {mae}")

Model 1 	 MAE: 24015.492818003917
Model 2 	 MAE: 23740.979228636657
Model 3 	 MAE: 23528.78421232877
Model 4 	 MAE: 23996.676789668687
Model 5 	 MAE: 23706.672864217904
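
The MAE reported above is the average of the absolute differences between actual and predicted prices. A small worked example in plain Python, using made-up prices rather than the housing data:

```python
# Hypothetical sale prices and predictions (illustration only)
y_true = [200000, 150000, 320000, 100000]
y_pred = [210000, 140000, 300000, 100000]

# Absolute error per house: how far off each prediction is, ignoring sign
abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
print(abs_errors)  # [10000, 10000, 20000, 0]

# MAE is the mean of those errors
mae = sum(abs_errors) / len(abs_errors)
print(mae)  # 10000.0
```

scikit-learn's `mean_absolute_error` computes the same quantity, so an MAE of about 23,500 means the predictions are off by roughly that many dollars on average.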


### Evaluate and select best model

In [12]:
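# Compare runs by MAE only; without a key, a tie on MAE would try to compare the models themselves
best_model = min(model_runs.values(), key=lambda run: run[0])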
best_model

[23528.78421232877,
 RandomForestRegressor(criterion='absolute_error', random_state=0)]
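
If you also want the name of the winning model, one option is to take the minimum over the dictionary's items instead of its values. The sketch below uses the MAE values printed earlier, with placeholder strings standing in for the fitted models; the name `runs` is hypothetical.

```python
# MAE values copied from the output above; the strings are placeholders for the models
runs = {
    'model_1': [24015.492818003917, 'model_1 object'],
    'model_2': [23740.979228636657, 'model_2 object'],
    'model_3': [23528.78421232877, 'model_3 object'],
    'model_4': [23996.676789668687, 'model_4 object'],
    'model_5': [23706.672864217904, 'model_5 object'],
}

# Each item is (name, [mae, model]); compare on the MAE only
best_name, (best_mae, best) = min(runs.items(), key=lambda item: item[1][0])
print(best_name, best_mae)  # model_3 23528.78421232877
```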

### Generate test predictions
Create a Random Forest model with the variable name `my_model`.

In [14]:
# Create Random Forest model from best_model
my_model = best_model[1]
my_model

RandomForestRegressor(criterion='absolute_error', random_state=0)

Refit the model on all of the labeled data (`X` and `y`, that is, the training and validation sets combined), then generate predictions for the test features saved in `X_test`.

In [17]:
# Fit model to the full training data (training + validation rows)
my_model.fit(X, y)

# Generate predictions
preds_test = my_model.predict(X_test)

# Save predictions in the format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

In [18]:
output

,Id,SalePrice
0,1461,119433.08
1,1462,158367.50
2,1463,185351.21
3,1464,178343.12
4,1465,192898.29
...,...,...
1454,2915,86155.00
1455,2916,89050.00
1456,2917,156296.92
1457,2918,132232.50
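
Before uploading, it can be worth checking the file's layout. The sketch below assumes the competition expects exactly two columns named `Id` and `SalePrice`. It uses a three-row stand-in called `toy_output` (a hypothetical name, with values copied from the first rows shown above) and writes to an in-memory buffer instead of `submission.csv`.

```python
import io
import pandas as pd

# Toy stand-in for `output`
toy_output = pd.DataFrame({'Id': [1461, 1462, 1463],
                           'SalePrice': [119433.08, 158367.50, 185351.21]})

# Write the CSV to memory and read it back, as a scorer would
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Column names must match exactly (no stray punctuation), with no missing predictions
assert list(check.columns) == ['Id', 'SalePrice']
assert check['SalePrice'].notnull().all()

print(buffer.getvalue().splitlines()[0])  # Id,SalePrice
```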


### Missing Values

There are many ways data can end up with missing values. For example:
- A two-bedroom house won't have a value for the size of a third bedroom.
- A survey respondent may choose not to share their income.

Most machine learning libraries (including scikit-learn) raise an error if you try to fit a model on data with missing values, so you'll need to choose one of the strategies below.
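
Before picking a strategy, it helps to see where the gaps are. A minimal sketch with pandas, using a made-up frame whose column names (`Bedrooms`, `Bedroom3Size`, `Income`) are hypothetical and simply echo the two examples above:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): NaN marks a missing value
homes = pd.DataFrame({
    'Bedrooms': [2, 3, 2],
    'Bedroom3Size': [np.nan, 120.0, np.nan],  # two-bedroom houses have no third bedroom
    'Income': [52000.0, np.nan, 61000.0],     # one respondent declined to answer
})

# Count missing entries per column
missing_per_column = homes.isnull().sum()
print(missing_per_column)

# Names of the columns that contain at least one missing value
cols_with_missing = [col for col in homes.columns if homes[col].isnull().any()]
print(cols_with_missing)  # ['Bedroom3Size', 'Income']
```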