### House Prices - Advanced Regression Techniques


Author: Juan Manuel González Kapnik

Starting date: Jul 29, 2023

"Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home."

**Practice skills**
* Creative feature engineering 
* Advanced regression techniques like random forest and gradient boosting

**Goal**

"It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the `SalePrice` variable."

# 1. Importing Libraries

In [15]:
# Data Handling
import pandas as pd
import numpy as np

# Warnings
import warnings

# Statistics
import scipy.stats as stats
from scipy.stats import skew
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Charts
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

# Machine Learning
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from mlxtend.regressor import StackingCVRegressor

from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import RobustScaler, StandardScaler, LabelEncoder


Libraries config

In [13]:
# Show all columns when displaying dataframes
pd.set_option('display.max_columns', None)

# Ignore warnings
warnings.filterwarnings('ignore')

# Display chart below code
%matplotlib inline

Plot style config

In [16]:
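def plot_style(ax):
    """Apply a light gray style to the given Axes (spines, ticks and labels)."""
    # Remove the top and right spines of this Axes only
    sns.despine(ax=ax, top=True, right=True, left=False, bottom=False)
    ax.spines['bottom'].set_color('gray')
    ax.spines['left'].set_color('gray')
    ax.tick_params(colors='gray')
    ax.xaxis.label.set_color('gray')
    ax.yaxis.label.set_color('gray')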

# 2. Loading Datasets

In [6]:
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')

# 3. Basic Analysis

Checking data shape

In [18]:
print(f'Train dataframe has {train_df.shape[0]} rows and {train_df.shape[1]} columns')
print(f'Test dataframe has {test_df.shape[0]} rows and {test_df.shape[1]} columns')

Train dataframe has 1460 rows and 81 columns
Test dataframe has 1459 rows and 80 columns


The one column missing from the test dataframe is `SalePrice`, the variable we are looking to predict.
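One way to confirm which column differs is to compare the column indexes of the two dataframes. The sketch below uses two tiny stand-in frames (`train_demo` and `test_demo`, hypothetical names with illustrative values) so that it runs without the competition files. With the real data, `train_df` and `test_df` would take their place.

```python
import pandas as pd

# Hypothetical stand-ins for train_df and test_df (illustrative values only)
train_demo = pd.DataFrame({
    'Id': [1, 2],
    'LotArea': [8450, 9600],
    'SalePrice': [208500, 181500],
})
test_demo = pd.DataFrame({
    'Id': [1461, 1462],
    'LotArea': [11622, 14267],
})

# Columns present in the train frame but absent from the test frame
missing_in_test = train_demo.columns.difference(test_demo.columns).tolist()
print(missing_in_test)
```

On these stand-in frames the only column reported is `SalePrice`.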

Checking data types

In [27]:
print('Train dataframe features data types:')
train_df.dtypes.value_counts()

Train dataframe features data types:


object     43
int64      35
float64     3
Name: count, dtype: int64

The train dataframe has 81 columns: 38 are numerical (35 `int64` and 3 `float64`) and 43 are categorical (`object`). These counts include `Id` and the target `SalePrice`.
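The numerical/categorical split can also be derived programmatically with `select_dtypes`, which is handy later when the two groups need different treatment. This is a minimal sketch on a hypothetical stand-in frame (`demo_df`, with illustrative values), not the notebook's own code. With the real data, `train_df` would be used instead.

```python
import pandas as pd

# Hypothetical stand-in for train_df (illustrative values only)
demo_df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],       # integer column
    'LotFrontage': [65.0, 80.0, 68.0],    # float column
    'Street': ['Pave', 'Pave', 'Pave'],   # text column
    'MSZoning': ['RL', 'RL', 'RL'],       # text column
})

# Numeric dtypes (int and float) versus everything else
numerical_cols = demo_df.select_dtypes(include='number').columns.tolist()
categorical_cols = demo_df.select_dtypes(exclude='number').columns.tolist()

print(f'{len(numerical_cols)} numerical, {len(categorical_cols)} categorical')
```

Splitting on `'number'` rather than on `object` keeps the sketch independent of how a given pandas version stores text columns.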