# [Kaggle] Housing Prices Competition for Kaggle Learn Users

This notebook trains a simple random forest model to predict house prices in Iowa.

**Author**: Han-Elliot Phan<br>**Email**: hanelliotphan@gmail.com

Start date: July 10, 2023<br>End date: July 10, 2023

In [1]:
# Import libraries

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer


In [2]:
# Import train and test data

train_data = pd.read_csv("datasets/train.csv", index_col='Id')
test_data = pd.read_csv("datasets/test.csv", index_col='Id')

In [3]:
# Remove rows with missing target, separate target from predictors

train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice
train_data.drop(['SalePrice'], axis=1, inplace=True)

In [4]:
# Select numerical predictors

X = train_data.select_dtypes(exclude=['object'])
X_test = test_data.select_dtypes(exclude=['object'])
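
To illustrate what the column selection above keeps and drops, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are hypothetical; only the column names echo the competition data. It also shows `include="number"` as one possible alternative way to ask for numeric columns, which is not what the notebook uses.

```python
import pandas as pd

# Hypothetical toy frame: values are invented for illustration only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
    # Force object dtype so the example behaves the same across pandas versions.
    "Street": pd.Series(["Pave", "Pave", "Grvl"], dtype=object),
})

# Same call as in the notebook: drop every object-typed (text) column.
numeric_by_exclusion = toy.select_dtypes(exclude=["object"])

# Alternative (an assumption, not the notebook's choice): keep numeric dtypes only.
numeric_by_inclusion = toy.select_dtypes(include="number")

dropped = [c for c in toy.columns if c not in numeric_by_exclusion.columns]
print(list(numeric_by_exclusion.columns))
print(dropped)
```

On this toy frame both calls select the same columns. They can differ on real data, for example when a frame contains boolean or datetime columns, which `exclude=["object"]` keeps.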

In [5]:
# Split the training data into training and validation sets

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [6]:
# Display train data

print(X_train.head())
print(X_train.shape)

     MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
Id                                                                           
619          20         90.0    11694            9            5       2007   
871          20         60.0     6600            5            5       1962   
93           30         80.0    13360            5            7       1921   
818          20          NaN    13265            8            5       2002   
303          20        118.0    13704            7            5       2001   

In [7]:
# Check the number of missing values in each column of training data (if any)

missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64
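
Raw counts like the ones above are easier to judge as a share of rows. The following is a small sketch of that idea on a hypothetical toy frame (the values are invented); it relies on the fact that the mean of a boolean mask is the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four rows, with gaps in two of the three columns.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing entries per column, keeping only columns with gaps.
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

A column with a large share of missing values may call for a different treatment than one with only a handful of gaps.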


In [8]:
# Since the data has missing values, impute the training and validation data
# (fit the imputer on the training split only, then reuse it for validation)

final_imputer = SimpleImputer(strategy='constant', fill_value=0)
final_X_train = pd.DataFrame(final_imputer.fit_transform(X_train))
final_X_valid = pd.DataFrame(final_imputer.transform(X_valid))

In [9]:
# Repeat the imputation with an equivalent imputer (float fill value)

imputer = SimpleImputer(strategy='constant', fill_value=0.0)
final_X_train = pd.DataFrame(imputer.fit_transform(X_train))
final_X_valid = pd.DataFrame(imputer.transform(X_valid))
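
The sketch below illustrates the fit-on-train, transform-on-validation pattern on hypothetical toy data. It swaps in `strategy="median"` instead of the notebook's constant fill, purely because a learned statistic makes the difference between `fit_transform` and `transform` visible. It also passes `columns` and `index` back to `pd.DataFrame`, since the imputer returns a bare NumPy array. The frames `X_tr` and `X_va` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits; values are made up.
X_tr = pd.DataFrame(
    {"LotFrontage": [60.0, np.nan, 80.0, 100.0], "LotArea": [6000, 7000, 8000, 9000]},
    index=[11, 12, 13, 14],
)
X_va = pd.DataFrame(
    {"LotFrontage": [np.nan, 50.0], "LotArea": [5000, 10000]},
    index=[21, 22],
)

# Median imputation is an assumption here, not the notebook's method.
imp = SimpleImputer(strategy="median")

# Learn the per-column medians from the training split only.
X_tr_imp = pd.DataFrame(imp.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)

# Reuse the training medians for validation; do not refit.
X_va_imp = pd.DataFrame(imp.transform(X_va), columns=X_va.columns, index=X_va.index)

print(imp.statistics_)
print(X_va_imp)
```

The missing validation value is filled with the training median of `LotFrontage`. Refitting on the validation split would instead use a statistic computed from the validation rows themselves.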

In [10]:
# Define and fit the model

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)

In [11]:
# Get validation predictions and MAE

preds_valid = model.predict(final_X_valid)
print("MAE (random forest, zero imputation):")
print(mean_absolute_error(y_valid, preds_valid))

MAE (random forest, zero imputation):
18017.665970319635
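
For readers unfamiliar with the metric, the mean absolute error is the average of the absolute differences between the true and predicted prices, so it is expressed in the same unit as `SalePrice`. A minimal sketch with invented numbers (`y_true` and `y_pred` are hypothetical) that checks a hand computation against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, for illustration only.
y_true = np.array([200000.0, 150000.0, 310000.0])
y_pred = np.array([190000.0, 165000.0, 300000.0])

# MAE by hand: average of |true - predicted|.
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same quantity from scikit-learn.
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```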


In [12]:
# Preprocess test data and get test predictions

final_X_test = pd.DataFrame(final_imputer.transform(X_test))
preds_test = model.predict(final_X_test)

In [13]:
# Collect the test predictions in a DataFrame

result = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
print(result)

        Id  SalePrice
0     1461  127436.50
1     1462  159961.25
2     1463  178998.25
3     1464  184179.00
4     1465  197855.00
...    ...        ...
1454  2915   85155.00
1455  2916   87293.50
1456  2917  155623.87
1457  2918  106376.75
1458  2919  230180.90

[1459 rows x 2 columns]


In [14]:
# Create submission file

result.to_csv('datasets/submission.csv', index=False)
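
Before uploading, it can help to read the file back and check its shape. The sketch below does this on a small stand-in frame written to a temporary directory, so it does not touch `datasets/submission.csv`. The expected column names `Id` and `SalePrice` are taken from the notebook's own output; the specific checks are a suggestion, not a requirement stated by the competition.

```python
import os
import tempfile

import pandas as pd

# Stand-in for `result`: the first three rows printed above.
toy_result = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [127436.50, 159961.25, 178998.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    # index=False keeps the row index out of the file, as in the notebook.
    toy_result.to_csv(path, index=False)
    check = pd.read_csv(path)

# Basic sanity checks on the file as it would be uploaded.
columns_ok = list(check.columns) == ["Id", "SalePrice"]
no_missing = not check.isnull().any().any()
ids_unique = check["Id"].is_unique

print(columns_ok, no_missing, ids_unique)
```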