# [Kaggle] Housing Prices Competition for Kaggle Learn Users

This notebook trains a random forest regressor to predict house prices in Iowa.

Author: Han-Elliot Nguyen<br>Email: hanelliotn@gmail.com

Start date: July 10, 2023<br>End date: July 10, 2023

In [9]:
# Import libraries

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [3]:
# Load the train and test data, using `Id` as the index

train_data = pd.read_csv("datasets/train.csv", index_col='Id')
test_data = pd.read_csv("datasets/test.csv", index_col='Id')

In [4]:
# Select `SalePrice` from the train data as the prediction target

y = train_data.SalePrice
y

Id
1       208500
2       181500
3       223500
4       140000
5       250000
         ...  
1456    175000
1457    210000
1458    266500
1459    142125
1460    147500
Name: SalePrice, Length: 1460, dtype: int64

In [6]:
# Select the features used for prediction from the train and test data

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = train_data[features].copy()
X_test = test_data[features].copy()
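Before fitting, it can help to confirm that the selected columns contain no missing values, since `RandomForestRegressor` may reject NaN inputs depending on the scikit-learn version. The following is a minimal sketch on a made-up toy frame; the helper `missing_counts` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the competition data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, None],
        "YearBuilt": [2003, 1976, 2001],
        "FullBath": [2, 2, 1],
    },
    index=pd.Index([1, 2, 3], name="Id"),
)


def missing_counts(df, columns):
    """Hypothetical helper: number of missing values in each selected column."""
    return df[columns].isna().sum()


counts = missing_counts(toy, ["LotArea", "YearBuilt", "FullBath"])

# Show only the columns that actually contain missing values.
print(counts[counts > 0])
```

In the notebook itself, the same check would presumably be `missing_counts(X, features)` and `missing_counts(X_test, features)`.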

In [8]:
# Split the train data into training and validation sets

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
X_train.head()

      LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
Id                                                                     
619     11694       2007      1828         0         2             3   
871      6600       1962       894         0         1             2   
93      13360       1921       964         0         1             2   
818     13265       2002      1689         0         2             3   
303     13704       2001      1541         0         2             3   

      TotRmsAbvGrd  
Id          
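To make the effect of `train_size=0.8` and `test_size=0.2` concrete, the sketch below applies the same call to placeholder arrays with 1460 rows, the length reported for `SalePrice` above. The arrays `X_demo` and `y_demo` are assumptions for illustration only, not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1460  # number of rows reported in the `SalePrice` output

# Placeholder feature matrix and target (not the real data).
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_va2, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)

print(len(X_tr), len(X_va))  # 1168 292
```

Fixing `random_state` is what keeps the validation rows the same between runs, so MAE values from different model settings remain comparable.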

In [12]:
# Define the model and fit it to the training data

model = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model.fit(X_train, y_train)

In [21]:
# Predict on the validation set and compute the mean absolute error

preds_val = model.predict(X_val)
mae_val = mean_absolute_error(y_val, preds_val)
mae_val

8902.216695205481
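The value above is a mean absolute error: the average of the absolute differences between true and predicted prices, in the same units as `SalePrice`. As a small worked example with made-up prices (not taken from the dataset), the manual computation matches `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted prices, for illustration only.
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 285000])

# Absolute errors are 10000, 10000 and 15000; their mean is 35000 / 3.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```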

In [23]:
# Refit the model on the full train data and predict on the test data

model.fit(X, y)
preds = model.predict(X_test)

In [24]:
# Build the results table from the test predictions

result = pd.DataFrame({"Id": X_test.index, "SalePrice": preds})
result

        Id  SalePrice
0     1461  119433.08
1     1462  158367.50
2     1463  185351.21
3     1464  178343.12
4     1465  192898.29
...    ...        ...
1454  2915   86155.00
1455  2916   89050.00
1456  2917  156296.92
1457  2918  132232.50


In [25]:
# Save the results to a CSV file for submission

result.to_csv("datasets/submission.csv", index=False)
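A quick sanity check before uploading is to read the file back and confirm that it has exactly the `Id` and `SalePrice` columns and no missing values. The sketch below does this round trip in a temporary directory with two rows copied from the output above; the names `demo`, `columns_ok`, and `no_missing` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the results shown above, standing in for the full table.
demo = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [119433.08, 158367.50]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    demo.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    loaded = pd.read_csv(path)

columns_ok = list(loaded.columns) == ["Id", "SalePrice"]
no_missing = not loaded.isna().any().any()

print(columns_ok, no_missing)
```

Without `index=False`, the default integer index would be written as an additional unnamed first column in the CSV.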