# Housing Prices Competition - Supervised Learning
- Description: Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
- Link: https://www.kaggle.com/c/home-data-for-ml-course
- Tools: sklearn, pandas

In [1]:
# Mount google drive to this notebook
from google.colab import drive
drive.mount('/content/drive')

# Change dir
import os
os.chdir('/content/drive/MyDrive/Senior/Kaggle/Courses/Intermediate ML')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read data
X_full = pd.read_csv('train.csv', index_col = 'Id')
X_test_full = pd.read_csv('test.csv', index_col = 'Id')

# Obtain target and predictors
y = X_full.SalePrice
selected_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[selected_features].copy() # make selected features into training features
X_test = X_test_full[selected_features].copy()

# Split the dataset into training and validation sets (80% / 20%)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 0)
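
To illustrate what the split above produces, here is a minimal sketch on a tiny synthetic table. The `demo_*` names and all values are made up for illustration and are not part of the competition data; only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (values are invented for illustration)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, size=10),
    "YearBuilt": rng.integers(1920, 2010, size=10),
})
demo_y = pd.Series(rng.integers(100000, 300000, size=10), name="SalePrice")

# Same arguments as the notebook cell: 80% training rows, 20% validation rows
demo_X_train, demo_X_valid, demo_y_train, demo_y_valid = train_test_split(
    demo_X, demo_y, train_size=0.8, test_size=0.2, random_state=0
)

print(demo_X_train.shape, demo_X_valid.shape)  # (8, 2) (2, 2)

# Features and target keep matching row labels after the split
print(demo_X_train.index.equals(demo_y_train.index))  # True
```

The validation rows are held out of fitting, so an error measured on them gives an estimate of how a model behaves on unseen houses.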

In [3]:
# Preview the first five rows of the training data
X_train.head()

Id,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
619,11694,2007,1828,0,2,3,9
871,6600,1962,894,0,1,2,5
93,13360,1921,964,0,1,2,5
818,13265,2002,1689,0,2,3,7
303,13704,2001,1541,0,2,3,6


In [4]:
# Create five Random Forest models to train, then pick the best-performing one
from sklearn.ensemble import RandomForestRegressor

model_1 = RandomForestRegressor(n_estimators = 50, random_state = 0)
model_2 = RandomForestRegressor(n_estimators = 100, random_state = 0)
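# 'absolute_error' is the current name of the criterion formerly spelled 'mae' (removed in scikit-learn 1.2)
model_3 = RandomForestRegressor(n_estimators = 100, criterion = 'absolute_error', random_state = 0)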
model_4 = RandomForestRegressor(n_estimators = 200, min_samples_split = 20, random_state = 0)
model_5 = RandomForestRegressor(n_estimators = 100, max_depth = 7, random_state = 0)

models = [model_1, model_2, model_3, model_4, model_5]

In [5]:
# Check the mean absolute error (MAE) of each model
from sklearn.metrics import mean_absolute_error

# Helper function: fit a model on the training split and return its MAE on the validation split
def score_model(model, X_t = X_train, X_v = X_valid, y_t = y_train, y_v = y_valid):
  model.fit(X_t, y_t) # train model
  preds = model.predict(X_v)
  return mean_absolute_error(y_v, preds)

# Calculate the MAE of all five models
for i, model in enumerate(models, start=1):
  mae = score_model(model)
  print(f"Model {i} MAE: {mae}")

Model 1 MAE: 24015.492818003917
Model 2 MAE: 23740.979228636657
Model 3 MAE: 23528.78421232877
Model 4 MAE: 23996.676789668687
Model 5 MAE: 23706.672864217904
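
The MAE values above are averages of absolute differences between predicted and actual sale prices. As a small worked sketch (the three prices and predictions below are invented for illustration), the manual computation agrees with `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented example values, not taken from the competition data
y_true = np.array([200000, 150000, 320000])
y_pred = np.array([210000, 140000, 300000])

# Absolute errors are 10000, 10000 and 20000, so the MAE is their mean
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because MAE is expressed in the same unit as the target, a value near 23,500 here means the predictions are off by roughly that many dollars on average for the validation houses.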


In [6]:
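# Model 3 has the lowest validation MAE, so reuse its configuration for the final model
# ('absolute_error' is the current name of the criterion formerly spelled 'mae')
best_model = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)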
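
The best model can also be picked programmatically rather than by reading the printout. The sketch below uses the MAE values reported in the output of cell 5; the dictionary name `mae_by_model` is a hypothetical helper introduced only for this illustration.

```python
# MAE values copied from the output of the scoring cell
mae_by_model = {
    "Model 1": 24015.492818003917,
    "Model 2": 23740.979228636657,
    "Model 3": 23528.78421232877,
    "Model 4": 23996.676789668687,
    "Model 5": 23706.672864217904,
}

# Lower MAE is better, so take the key with the smallest value
best_name = min(mae_by_model, key=mae_by_model.get)
print(best_name, mae_by_model[best_name])  # Model 3 23528.78421232877
```

In the notebook itself, the same idea could be applied by collecting the return values of `score_model` in a dictionary inside the loop.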

In [7]:
# Fit the best model on the full training data (X, y), not only the training split
best_model.fit(X, y)

# Generate test predictions
preds_test = best_model.predict(X_test)

# Save predictions in format used for the competition scoring
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
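
Before uploading, it may be worth checking that the saved file has the expected shape. The sketch below is an assumption-laden illustration: it uses an in-memory buffer and invented rows instead of the real `submission.csv`, and it assumes the only requirements are the two columns written above (`Id`, `SalePrice`) with no missing values.

```python
import io
import pandas as pd

# Invented rows standing in for the real predictions
demo_output = pd.DataFrame({"Id": [1, 2, 3],
                            "SalePrice": [120000.0, 155000.5, 180250.0]})

# Write to an in-memory buffer the same way the notebook writes to disk
buffer = io.StringIO()
demo_output.to_csv(buffer, index=False)
buffer.seek(0)

# Read the file back and run basic sanity checks
check = pd.read_csv(buffer)
columns_ok = list(check.columns) == ["Id", "SalePrice"]
no_missing = not check.isna().any().any()
ids_unique = check["Id"].is_unique

print(columns_ok, no_missing, ids_unique)  # True True True
```

For the real file, `pd.read_csv('submission.csv')` would take the place of the buffer.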