### House Prices - Advanced Regression Techniques


Author: Juan Manuel González Kapnik

Starting date: Jul 29, 2023

"Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home."

**Practice skills**
* Creative feature engineering 
* Advanced regression techniques like random forest and gradient boosting

**Goal**

"It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the `SalePrice` variable."

# 1. Importing Libraries

In [2]:
# Data Handling
import pandas as pd
import numpy as np

# Warnings
import warnings

# Statistics
import scipy.stats as stats
from scipy.stats import skew
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Charts
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

# Machine Learning
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from mlxtend.regressor import StackingCVRegressor

from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import RobustScaler, StandardScaler, LabelEncoder


Libraries config

In [3]:
# Pandas display all columns
pd.set_option('display.max_columns', None)

# Ignore warnings
warnings.filterwarnings('ignore')

# Display chart below code
%matplotlib inline

Plot style config

In [4]:
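def plot_style(ax):
    """Apply the notebook's gray, minimal look to a matplotlib Axes."""
    # With no `ax` argument, despine acts on every Axes of the current figure
    sns.despine(top=True, right=True, left=False, bottom=False)
    # Color the remaining spines, the ticks, and the axis labels gray
    ax.spines['bottom'].set_color('gray')
    ax.spines['left'].set_color('gray')
    ax.tick_params(colors='gray')
    ax.xaxis.label.set_color('gray')
    ax.yaxis.label.set_color('gray')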

# 2. Loading Datasets

In [5]:
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')

# 3. Basic Analysis

Checking data shape

In [6]:
print(f'Train dataframe has {train_df.shape[0]} records and {train_df.shape[1]} features')
print(f'Test dataframe has {test_df.shape[0]} records and {test_df.shape[1]} features')

Train dataframe has 1460 records and 81 features
Test dataframe has 1459 records and 80 features


The one column missing from the test dataframe is `SalePrice`, the target we want to predict.
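Rather than inferring the missing column from the shapes alone, the two column sets can be compared directly. The following is a minimal sketch on toy stand-ins for `train_df` and `test_df`; the frame names `toy_train` and `toy_test` and their values are made up for illustration and are not the competition data.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (hypothetical values, not the real data)
toy_train = pd.DataFrame({"Id": [1, 2], "LotArea": [100, 200], "SalePrice": [1000, 2000]})
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [300, 400]})

# Columns that exist in the train frame but not in the test frame
missing_in_test = toy_train.columns.difference(toy_test.columns).tolist()
print(missing_in_test)
```

Applied to the real frames, the same `columns.difference` call should list only the target column if the two files otherwise share a schema.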

Checking data types

In [7]:
print('Train dataframe features data types:')
train_df.dtypes.value_counts()

Train dataframe features data types:


object     43
int64      35
float64     3
Name: count, dtype: int64

Train dataframe has 81 features: 38 are numerical (35 `int64` and 3 `float64`) and 43 are categorical (`object`)
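The numerical/categorical split can also be materialized as two column lists, which is convenient for later steps. This is a sketch on a toy frame (the name `toy` and its values are illustrative assumptions); it treats every non-numeric column as categorical, which is only a first approximation based on dtype.

```python
import pandas as pd

# Toy frame mixing integer, float, and string columns (hypothetical values)
toy = pd.DataFrame({
    "LotArea": [100, 200],
    "LotFrontage": [65.0, 80.0],
    "Neighborhood": ["NAmes", "CollgCr"],
})

# Numeric dtypes (int and float) versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numerical,", len(categorical_cols), "categorical")
```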

Checking duplicates

In [8]:
print(f'Train dataframe has {train_df.duplicated().sum()} duplicated rows')
print(f'Test dataframe has {test_df.duplicated().sum()} duplicated rows')

Train dataframe has 0 duplicated rows
Test dataframe has 0 duplicated rows
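One caveat: `duplicated()` compares whole rows, so a unique identifier column such as `Id` makes a full-row match impossible. A possible complementary check, sketched here on a toy frame with made-up values, is to drop the identifier before looking for duplicates. Whether this changes the result on the real data is not something this sketch can tell.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical except for their Id (hypothetical values)
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [100, 100, 300],
    "Neighborhood": ["A", "A", "B"],
})

# Full-row comparison: the unique Id hides the repeated record
full_row_dups = int(toy.duplicated().sum())

# Same comparison with the identifier removed
dups_ignoring_id = int(toy.drop(columns="Id").duplicated().sum())

print(full_row_dups, dups_ignoring_id)
```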


# 4. Train Data Descriptive Statistical Analysis

The goal here is to get a first impression of the data; the exploratory analysis later on goes deeper into the statistics.

Analysis for numerical features

In [9]:
train_df.describe().T.sort_values(by='std', ascending=False)

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SalePrice | 1460.0 | 180921.19589 | 79442.502883 | 34900.0 | 129975.0 | 163000.0 | 214000.0 | 755000.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.5 | 9478.5 | 11601.5 | 215245.0 |
| GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.5 | 1464.0 | 1776.75 | 5642.0 |
| MiscVal | 1460.0 | 43.489041 | 496.123024 | 0.0 | 0.0 | 0.0 | 0.0 | 15500.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.0 | 383.5 | 712.25 | 5644.0 |
| BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.0 | 477.5 | 808.0 | 2336.0 |
| TotalBsmtSF | 1460.0 | 1057.429452 | 438.705324 | 0.0 | 795.75 | 991.5 | 1298.25 | 6110.0 |
| 2ndFlrSF | 1460.0 | 346.992466 | 436.528436 | 0.0 | 0.0 | 0.0 | 728.0 | 2065.0 |
| Id | 1460.0 | 730.5 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
| 1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.0 | 1087.0 | 1391.25 | 4692.0 |


Insights:

* **Very high standard deviation** across many features, reflecting widely dispersed data: `SalePrice`, `LotArea`, `GrLivArea`, `MiscVal`, `BsmtFinSF1`, `BsmtUnfSF`, `TotalBsmtSF`, `2ndFlrSF`, `Id` (not relevant, since we will drop it), `1stFlrSF` and `GarageArea` all have a standard deviation above 200. In general, **the standard deviation is above 20**. We will therefore analyze the distributions in depth to see how the data behaves.
* `SalePrice` is the target feature and has the **highest standard deviation**, which can introduce bias, encourage overfitting, and reduce model accuracy, so we will give it its own descriptive analysis.
* The mean and median are similar for most features. However, **this does not guarantee that the distributions are symmetric.**
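Since comparing mean and median only hints at symmetry, skewness can be measured directly. The sketch below uses `scipy.stats.skew` (already imported in section 1) on a toy frame; the column names and values are illustrative assumptions, not results from the Ames data.

```python
import pandas as pd
from scipy.stats import skew

# Toy columns: one symmetric, one with a long right tail (hypothetical values)
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 50.0],
})

# Mean, median, and sample skewness side by side for each column
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.apply(skew),
})
print(summary)
```

A skewness near 0 suggests a roughly symmetric distribution, while a clearly positive value points to a right tail. On the real frame the same idea would apply to the numeric columns only, and columns with missing values would need handling first, because `skew` propagates NaN by default.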

Analysis for categorical features

In [11]:
train_df.describe(include='object').T.sort_values(by='unique', ascending=False)

| Feature | count | unique | top | freq |
|---|---|---|---|---|
| Neighborhood | 1460 | 25 | NAmes | 225 |
| Exterior2nd | 1460 | 16 | VinylSd | 504 |
| Exterior1st | 1460 | 15 | VinylSd | 515 |
| SaleType | 1460 | 9 | WD | 1267 |
| Condition1 | 1460 | 9 | Norm | 1260 |
| Condition2 | 1460 | 8 | Norm | 1445 |
| HouseStyle | 1460 | 8 | 1Story | 726 |
| RoofMatl | 1460 | 8 | CompShg | 1434 |
| Functional | 1460 | 7 | Typ | 1360 |
| BsmtFinType2 | 1422 | 6 | Unf | 1256 |


Insights:

* Cardinality is low in general. `Neighborhood`, `Exterior1st` and `Exterior2nd` are somewhat higher (25, 15 and 16 unique values, respectively), but they can be handled with an appropriate **encoding** technique.
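As one illustration of such an encoding, the sketch below applies frequency encoding, which replaces each category with its share of the rows and so adds a single column regardless of cardinality. This is only one of several options (one-hot and ordinal encoding are others) and is not necessarily the technique used later in this notebook; the toy values and the `Neighborhood_freq` column name are assumptions.

```python
import pandas as pd

# Toy categorical column (hypothetical values)
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "NAmes", "OldTown", "CollgCr"],
})

# Share of rows per category, learned from this frame
freq = toy["Neighborhood"].value_counts(normalize=True)

# Map each category to its share; categories absent from `freq` would become NaN
toy["Neighborhood_freq"] = toy["Neighborhood"].map(freq)
print(toy)
```

In a train/test setting the frequencies would be computed on the training frame only and then mapped onto the test frame, with a fallback value for unseen categories.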