# [Kaggle] Housing Prices Competition for Kaggle Learn Users

This notebook trains a simple random forest model to predict house prices in Iowa.

Author: Han-Elliot Nguyen<br>Email: hanelliotn@gmail.com

Start date: July 10, 2023<br>End date: July 10, 2023

In [8]:
# Import libraries

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer


In [9]:
# Import train and test data

train_data = pd.read_csv("datasets/train.csv", index_col='Id')
test_data = pd.read_csv("datasets/test.csv", index_col='Id')

In [10]:
# Remove rows with missing target, separate target from predictors

train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice
train_data.drop(['SalePrice'], axis=1, inplace=True)

In [11]:
# Select numerical predictors

X = train_data.select_dtypes(exclude=['object'])
X_test = test_data.select_dtypes(exclude=['object'])

In [12]:
# Split the training data into training and validation sets

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)


In [13]:
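# Display train data
# Note: `X_train.head` without parentheses prints the bound method (its repr includes the frame);
# call `X_train.head()` to show only the first five rows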

print(X_train.head)
print(X_train.shape)

<bound method NDFrame.head of       MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
Id                                                                            
619           20         90.0    11694            9            5       2007   
871           20         60.0     6600            5            5       1962   
93            30         80.0    13360            5            7       1921   
818           20          NaN    13265            8            5       2002   
303           20        118.0    13704            7            5       2001   
...          ...          ...      ...          ...          ...        ...   
764           60         82.0     9430            8            5       1999   
836           20         60.0     9600            4            7       1950   
1217          90         68.0     8930            6            5       1978   
560          120          NaN     3196            7            5       2003   
685           60      

In [14]:
# Check the number of missing values in each column of training data (if any)

missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64


In [15]:
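# Since the data has missing values, impute the training and validation data.
# Fit the imputer on the training data only, then reuse it on the validation data.
# Passing columns and index keeps the feature names and Id labels that fit_transform drops.

final_imputer = SimpleImputer(strategy='constant', fill_value=0)
final_X_train = pd.DataFrame(final_imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
final_X_valid = pd.DataFrame(final_imputer.transform(X_valid), columns=X_valid.columns, index=X_valid.index)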
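
With a constant fill value, the imputed numbers are the same no matter which data the imputer is fitted on. For data-dependent strategies the difference matters: the fill value should be learned from the training rows and then reused on the validation rows. The sketch below illustrates this on a toy frame. The `median` strategy, the `toy_*` names, and the values are assumptions for illustration, not what this notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for X_train / X_valid (hypothetical values, not the Kaggle data)
toy_train = pd.DataFrame(
    {"LotFrontage": [90.0, np.nan, 80.0], "LotArea": [11694, 6600, 13360]},
    index=[619, 871, 93],
)
toy_valid = pd.DataFrame(
    {"LotFrontage": [np.nan, 60.0], "LotArea": [9430, 9600]},
    index=[764, 836],
)

# Learn the fill values (column medians) from the training rows only
imputer = SimpleImputer(strategy="median")
imputed_train = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy_train.columns, index=toy_train.index
)

# Reuse the already-fitted imputer on the validation rows
imputed_valid = pd.DataFrame(
    imputer.transform(toy_valid), columns=toy_valid.columns, index=toy_valid.index
)

print(imputed_valid)
```

The missing `LotFrontage` in the validation row is filled with the training median (85.0 here), so no information from the validation set leaks into preprocessing.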

In [None]:
# Define and fit the model

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)
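
The notebook imports `mean_absolute_error` but stops after fitting. A natural next step would be to score the model on the held-out validation rows. The sketch below shows that step on synthetic data so it runs on its own. The `toy_*` names and the price formula are assumptions for illustration and are not part of the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "LotArea": rng.integers(3000, 15000, size=200),
    "OverallQual": rng.integers(1, 11, size=200),
})
toy_y = 10 * toy_X["LotArea"] + 20000 * toy_X["OverallQual"] + rng.normal(0, 5000, size=200)

# Same split settings as the notebook
toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0
)

# Same model settings as the notebook
toy_model = RandomForestRegressor(n_estimators=100, random_state=0)
toy_model.fit(toy_X_train, toy_y_train)

# Predict on the held-out rows and compute the mean absolute error
preds = toy_model.predict(toy_X_valid)
mae = mean_absolute_error(toy_y_valid, preds)
print(f"Validation MAE: {mae:,.0f}")
```

In this notebook the equivalent calls would presumably be `model.predict(final_X_valid)` followed by `mean_absolute_error(y_valid, ...)`. Predicting on `X_test` would first require imputing it with the same fitted imputer.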