# Introduction

This notebook depicts the code used for the **Housing Pricing Competition** in  kaggle

In the following lines i will do this series of steps:

1. Read test and train data provided by kaggle
2. Explore said data
3. Build a Random Forest model with the data, and see how accurate it can get
4. Create a submission csv file and then upload it to kaggle

In [3]:
#Importing the libraries required
import pandas as pd
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [4]:
#set my file names
test_file = 'test.csv'
train_file = 'train.csv'
submission_file = 'luisreyes_submission.csv'

In [5]:
#load data into pandas frame
train_df = pd.read_csv(train_file)

Basic Exploratory Analysis

In [157]:
#check data description
train_df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [7]:
train_df.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [338]:
#we are going to be predicting the sales price of each house using the features bellow and y as target
y = train_df.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd','OverallQual','OverallCond','Fireplaces','GarageCars','Condition2','MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 
            'BldgType', 'HouseStyle', 'ExterQual', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleCondition','Neighborhood','Exterior1st','Exterior2nd']
X = train_df[features]

In [339]:
#Spliting the data between train and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [340]:
print("Unique values in 'Condition2' column in training data:", train_X['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", val_X['Condition2'].unique())

Unique values in 'Condition2' column in training data: ['Norm' 'PosN' 'Artery' 'RRNn' 'Feedr' 'RRAe' 'PosA']

Unique values in 'Condition2' column in validation data: ['Norm' 'Feedr' 'RRAn']


In [341]:
# All categorical columns
object_cols = [col for col in train_X.columns if train_X[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if 
                   set(train_X[col]) == set(val_X[col])]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

Categorical columns that will be label encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'BldgType', 'HouseStyle', 'ExterQual', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleCondition', 'Neighborhood']

Categorical columns that will be dropped from the dataset: ['Exterior1st', 'Condition2', 'Exterior2nd']


Handling Categorical features
* BldgType,Foundation,Utilities,MSZoning,ExterQual,PavedDrive,Electrical,KitchenQual,Street,LotShape
* LandContour,Condition1,HouseStyle,RoofStyle,RoofMatl,Heating,HeatingQC

In [342]:
# Get list of categorical variables
s = (train_X.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['Condition2', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'BldgType', 'HouseStyle', 'ExterQual', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleCondition', 'Neighborhood', 'Exterior1st', 'Exterior2nd']


Function taken from the Intermediate Machine Learning

In [327]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    
    # Get numerical feature importances
    importances = list(model.feature_importances_)
    # List of tuples with variable and importance
    feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(features, importances)]

    # Sort the feature importances by most important first
    feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
    # Print out the feature and importances 
    [print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]
    
    return mean_absolute_error(y_valid, preds)

In [322]:
#Checking ONE HOT ENCODER TOO
from sklearn.preprocessing import OneHotEncoder

# Drop categorical columns that will not be encoded
label_X_train = train_X.drop(bad_label_cols, axis=1)
label_X_valid = val_X.drop(bad_label_cols, axis=1)

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(label_X_train[good_label_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(label_X_valid[good_label_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = label_X_train.index
OH_cols_valid.index = label_X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = label_X_train.drop(good_label_cols, axis=1)
num_X_valid = label_X_valid.drop(good_label_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, train_y, val_y))

MAE from Approach 3 (One-Hot Encoding):
17818.33291324201


In [343]:
#check missing data
# Shape of training data (num_rows, num_columns)
print(train_X.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (train_X.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(1095, 27)
Series([], dtype: int64)


### Model Building

We'll be using **Random Forest** to evaluate and predict the housing prices

In [344]:
#using the label enconder of scikit-learn
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = train_X.drop(bad_label_cols, axis=1)
label_X_valid = val_X.drop(bad_label_cols, axis=1)


# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in good_label_cols:
    label_X_train[col] = label_encoder.fit_transform(train_X[col])
    label_X_valid[col] = label_encoder.transform(val_X[col])

print("MAE from Approach 2 (Label Encoding):") 
print(score_dataset(label_X_train, label_X_valid, train_y, val_y))

MAE from Approach 2 (Label Encoding):
Variable: OverallQual          Importance: 0.58
Variable: 1stFlrSF             Importance: 0.1
Variable: LotArea              Importance: 0.06
Variable: 2ndFlrSF             Importance: 0.05
Variable: GarageCars           Importance: 0.04
Variable: FullBath             Importance: 0.03
Variable: YearBuilt            Importance: 0.02
Variable: TotRmsAbvGrd         Importance: 0.02
Variable: SaleCondition        Importance: 0.02
Variable: BedroomAbvGr         Importance: 0.01
Variable: OverallCond          Importance: 0.01
Variable: Fireplaces           Importance: 0.01
Variable: HouseStyle           Importance: 0.01
Variable: CentralAir           Importance: 0.01
Variable: PavedDrive           Importance: 0.01
Variable: Condition2           Importance: 0.0
Variable: MSZoning             Importance: 0.0
Variable: Street               Importance: 0.0
Variable: LotShape             Importance: 0.0
Variable: LandContour          Importance: 0.0
Variable

In [337]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: train_X[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Street', 2),
 ('CentralAir', 2),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('SaleCondition', 6),
 ('Condition2', 7),
 ('HouseStyle', 8),
 ('Neighborhood', 25)]

Model building with test data

In [28]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
forest_full_data = RandomForestRegressor(n_estimators=345,random_state=1)

# fit rf_model_on_full_data on all data from the training data
forest_full_data.fit(X,y)

test_df = pd.read_csv(test_file)

test_X = test_df[features]

# Imputation
my_imputer = SimpleImputer()
imputed_X_test = pd.DataFrame(my_imputer.fit_transform(test_X))

# Imputation removed column names; put them back
imputed_X_test.columns = train_X.columns

test_prediction = forest_full_data.predict(imputed_X_test)
test_prediction

array([127777.14782609, 153264.63768116, 167241.72753623, ...,
       180380.06956522, 111491.88405797, 236521.00289855])

In [26]:
#saves submission csv for competition
output = pd.DataFrame({'Id': test_df.Id,
                       'SalePrice': test_prediction})
output.to_csv('submission.csv', index=False)