<a href="https://colab.research.google.com/github/nachoacev/practice-data-science/blob/main/HousingPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction of Housing Prices in Iowa

We will apply a simple Decision Tree model to predict housing prices in the state of Iowa, based on data from Kaggle.

The goal is only to practice a simple machine learning model and the basic notions of modeling in data science.

In this Jupyter notebook we will use:
- Decision Tree Regression algorithm.
- Random Forest Regression algorithm.
- Notions of overfitting/underfitting.

In [1]:
# Import data
import kagglehub

# Download latest version
path = kagglehub.dataset_download("dansbecker/home-data-for-ml-course") + "/train.csv"

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/dansbecker/home-data-for-ml-course?dataset_version_number=1...


100%|██████████| 94.0k/94.0k [00:00<00:00, 22.3MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/dansbecker/home-data-for-ml-course/versions/1/train.csv





## Basic Data Exploration

This step is just to gain some insight into the data.

In [2]:
import pandas as pd

# read data and store it in DataFrame
home_data = pd.read_csv(path)

# print a summary of the data
home_data.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


## Create feature and target

The prediction target $y$ is the `SalePrice` column, which we select using pandas dot notation.

As features to predict the price, we will use the columns
  * LotArea
  * YearBuilt
  * 1stFlrSF
  * 2ndFlrSF
  * FullBath
  * BedroomAbvGr
  * TotRmsAbvGrd
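
As a small illustration of the two ways of selecting columns, here is a sketch on a toy DataFrame. The values are made up for the example; only the column names mirror the dataset. It shows that dot notation and bracket notation return the same Series, and that a name such as `1stFlrSF` can only be reached with brackets because it is not a valid Python identifier.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "SalePrice": [200000, 150000],
    "LotArea": [8000, 9500],
    "1stFlrSF": [900, 1200],
})

# Dot notation works when the column name is a valid Python identifier.
y_dot = toy.SalePrice
# Bracket notation works for any column name.
y_bracket = toy["SalePrice"]
print(y_dot.equals(y_bracket))

# "1stFlrSF" starts with a digit, so toy.1stFlrSF would be a SyntaxError.
# A list of names inside brackets selects several columns as a DataFrame.
features = toy[["LotArea", "1stFlrSF"]]
print(type(y_bracket).__name__, type(features).__name__)
```

This is why the target can be picked with dot notation while the feature list, which contains `1stFlrSF` and `2ndFlrSF`, has to go through brackets.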

In [3]:
# print the list of columns in the dataset to find the name of the prediction target
home_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [4]:
# Select the prediction target with dot notation
y = home_data.SalePrice

# Create the list of features below
feature_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

# Select data corresponding to features in feature_names
X = home_data[feature_names]

In [5]:
X.head()

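| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |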


In [6]:
X.describe()

| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1971.267808 | 1162.626712 | 346.992466 | 1.565068 | 2.866438 | 6.517808 |
| std | 9981.264932 | 30.202904 | 386.587738 | 436.528436 | 0.550916 | 0.815778 | 1.625393 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7553.5 | 1954.0 | 882.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9478.5 | 1973.0 | 1087.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1391.25 | 728.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 4692.0 | 2065.0 | 3.0 | 8.0 | 14.0 |


## Splitting data and defining model

We split the data into four sets: training features, validation features, training target, and validation target. This helps us **avoid overfitting**: measuring the error on data the model has not seen during training gives a realistic error estimate and lets us choose a model that is more robust for future predictions.

Then we define our `DecisionTreeRegressor` model and **fit** it to the training data. Remember that in `sklearn`, fitting refers to training the model.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# The split is random; fixing random_state makes it reproducible.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
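
To see what the split produces, here is a minimal sketch on synthetic arrays. `X_demo` and `y_demo` are stand-ins invented for the example, with 1460 rows to match the row count reported by `describe()`. It assumes the default behaviour of `train_test_split`, which holds out 25% of the rows when no size is given.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y (not the housing data): 1460 rows, 2 features.
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

# With no test_size given, 25% of the rows go to the validation set.
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)
print(train_X_demo.shape, val_X_demo.shape)

# No row appears in both sets.
print(set(train_y_demo) & set(val_y_demo))

# Repeating the call with the same random_state gives the same split.
_, _, train_y_again, _ = train_test_split(X_demo, y_demo, random_state=1)
print(np.array_equal(train_y_demo, train_y_again))
```

Under that default, the validation set has 365 rows and the training set 1095.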


In [8]:
from sklearn.tree import DecisionTreeRegressor

# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

## Making predictions and Validating

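We proceed to make predictions on the validation data.

We measure the quality of the predictions using the *Mean Absolute Error* (MAE). This is simply the average of the absolute differences between `val_y` and the predictions, that is, the $\ell_1$ norm of the error vector divided by the number of observations: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $y_i$ are the actual prices, $\hat{y}_i$ the predicted prices, and $n$ the number of validation observations.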

In [9]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# print the top few validation predictions
print(val_predictions[:5])
# print the top few actual prices from validation data
print(val_y.head())

[186500. 184000. 130000.  92000. 164500.]
258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64


In [10]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

print("The MAE of the model is", val_mae, "dollars")

The MAE of the model is 29652.931506849316 dollars
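
To make the metric concrete, the following sketch computes the MAE by hand with NumPy for the five predictions and actual prices printed by cell 9 above. It covers only those five rows, so the result differs from the MAE over the whole validation set.

```python
import numpy as np

# First five validation predictions and actual prices, copied from the output above.
preds = np.array([186500.0, 184000.0, 130000.0, 92000.0, 164500.0])
actual = np.array([231500.0, 179500.0, 122000.0, 84500.0, 142000.0])

# Absolute error per house, then the average of those errors.
abs_errors = np.abs(actual - preds)
mae_first_five = abs_errors.mean()

print(abs_errors)      # → [45000.  4500.  8000.  7500. 22500.]
print(mae_first_five)  # → 17500.0
```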


## Overfitting

**Definition:** Overfitting occurs when a machine learning model learns the training data too well, capturing noise and details that are specific to the training set but do not generalize to new, unseen data.

**Characteristics:**

- The model performs exceptionally well on the training data but poorly on the test or validation data.

- It often happens when the model is too complex (e.g., too many parameters or features) relative to the amount of training data.

- The model essentially "memorizes" the training data instead of learning the underlying patterns.

**Example:** A decision tree that grows too deep and creates a leaf for every single data point in the training set.
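
The example above can be sketched on synthetic data. The noisy sine curve and all `demo` names are assumptions made for illustration, not the housing data. A tree with no size limit reproduces the training targets almost exactly, while its error on held-out points is much larger.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustration only): a sine curve plus random noise.
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 400).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)

# No limit on depth or leaves: the tree can isolate every training point.
deep_tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

demo_train_mae = mean_absolute_error(y_tr, deep_tree.predict(X_tr))
demo_val_mae = mean_absolute_error(y_va, deep_tree.predict(X_va))
print("leaves:", deep_tree.get_n_leaves(), "training rows:", len(X_tr))
print("train MAE:", demo_train_mae, "validation MAE:", demo_val_mae)
```

The gap between the two errors is the signature of overfitting: the tree has memorized the noise in the training set.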

## Underfitting

**Definition:** Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and test datasets.

**Characteristics:**

- The model performs poorly on both the training and test data.

- It often happens when the model is too simple (e.g., not enough parameters or features) or when it is not trained for enough iterations.

- The model fails to learn the relationships in the data.

**Example:** Using a linear model to fit data that has a nonlinear relationship.
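
A minimal sketch of that example, using synthetic quadratic data and NumPy's polynomial fitting rather than any model from this notebook (the data and the helper `mae_for_degree` are assumptions for illustration). The straight line has a large error on both the training and validation parts, while a degree-2 fit captures the pattern.

```python
import numpy as np

# Synthetic data (illustration only): y depends on x quadratically, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(scale=0.3, size=200)

# Simple hold-out split: first 150 points for training, the rest for validation.
x_tr, y_tr = x[:150], y[:150]
x_va, y_va = x[150:], y[150:]

def mae_for_degree(degree):
    """Fit a polynomial of the given degree and return (train MAE, validation MAE)."""
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    mae_tr = np.mean(np.abs(np.polyval(coeffs, x_tr) - y_tr))
    mae_va = np.mean(np.abs(np.polyval(coeffs, x_va) - y_va))
    return mae_tr, mae_va

line_tr, line_va = mae_for_degree(1)   # too simple: underfits
quad_tr, quad_va = mae_for_degree(2)   # matches the true shape
print("linear:    train %.2f, validation %.2f" % (line_tr, line_va))
print("quadratic: train %.2f, validation %.2f" % (quad_tr, quad_va))
```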

## Comparing different tree sizes to fit the model

We employ an auxiliary function that returns the MAE, so we can compare `DecisionTreeRegressor` models and see which one is most accurate as a function of *max_leaf_nodes*. A small number of leaves tends to underfit, while a large number tends to overfit.

In [11]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
  model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
  model.fit(train_X, train_y)
  preds_val = model.predict(val_X)
  mae = mean_absolute_error(val_y, preds_val)

  return mae

In [20]:
# compare MAE with differing values of max_leaf_nodes
# keep track of the lowest MAE seen so far and the tree size that produced it
best_mae = float("inf")
best_nodes = None
for max_leaf_nodes in [5, 50, 500, 5000]:
  my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
  if my_mae < best_mae:
    best_mae, best_nodes = my_mae, max_leaf_nodes
  print("Max leaf nodes: %d  \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error: 35044
Max leaf nodes: 50  		 Mean Absolute Error: 27405
Max leaf nodes: 500  		 Mean Absolute Error: 28357
Max leaf nodes: 5000  		 Mean Absolute Error: 28942


Of the options listed, 50 leaves gives the lowest validation MAE, so it is the best choice among them.
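
The selection can also be written compactly. The sketch below stores the MAE values printed above in a dictionary (the names `scores` and `best_tree_size` are introduced only for this example) and picks the key with the smallest value.

```python
# Validation MAE per max_leaf_nodes, copied from the printed output above.
scores = {5: 35044, 50: 27405, 500: 28357, 5000: 28942}

# min() over the keys, ranked by their MAE, returns the best tree size.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # → 50
```

In the notebook itself, the same dictionary could be built by calling `get_mae` once for each candidate value.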

## Conclusion

**Overfitting:** capturing spurious patterns that won't recur in the future, leading to less accurate predictions.

**Underfitting:** failing to capture relevant patterns, again leading to less accurate predictions.

We use `validation data`, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

## More sophisticated ML algorithm (RandomForest)

A single decision tree is a fairly basic model, so we will try a `Random Forest`. The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters.

In [23]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
iowa_preds = forest_model.predict(val_X)
print("Mean Absolute Error: {}".format(mean_absolute_error(val_y, iowa_preds)))

Mean Absolute Error: 21857.15912981083
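
The averaging behaviour described above can be checked directly. This sketch uses synthetic data and a small forest of 20 trees, both chosen only to keep the example quick. It relies on the fitted trees being available through the `estimators_` attribute of `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustration only): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(size=200)

# A small forest; n_estimators=20 is an arbitrary choice for the demo.
forest = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Predict with each component tree, then average across trees.
X_new = X_demo[:5]
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction matches the average of its trees.
print(np.allclose(manual_average, forest.predict(X_new)))
```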
