
# **Non-linear Regression Methods**

The goal here is to predict house prices using non-linear regression methods. We train three ensemble models (Random Forest, AdaBoost and Gradient Boosting), evaluate each one on held-out data, and extract the feature importances from each algorithm.

A description of the features follows:

**Id:** Unique key for each house

**MSSubClass:** The building class

**LotFrontage:** Linear feet of street connected to property

**LotArea:** Lot size in square feet

**OverallQual:** Overall material and finish quality

**OverallCond:** Overall condition rating

**YearBuilt:** Original construction date

**YearRemodAdd:** Remodel date

**MasVnrArea:** Masonry veneer area in square feet

**ExterQual:** Exterior material quality

**ExterCond:** Present condition of the material on the exterior

**BsmtFinSF1:** Type 1 finished square feet

**BsmtFinSF2:** Type 2 finished square feet

**BsmtUnfSF:** Unfinished square feet of basement area

**TotalBsmtSF:** Total square feet of basement area

**1stFlrSF:** First Floor square feet

**2ndFlrSF:** Second floor square feet

**LowQualFinSF:** Low quality finished square feet (all floors)

**GrLivArea:** Above grade (ground) living area square feet

**BsmtFullBath:** Basement full bathrooms

**BsmtHalfBath:** Basement half bathrooms

**FullBath:** Full bathrooms above grade

**HalfBath:** Half baths above grade

**BedroomAbvGr:** Number of bedrooms above grade

**KitchenAbvGr:** Number of kitchens above grade

**TotRmsAbvGrd:** Total rooms above grade (does not include bathrooms)

**Fireplaces:** Number of fireplaces

**GarageYrBlt:** Year garage was built

**GarageCars:** Size of garage in car capacity

**GarageArea:** Size of garage in square feet

**WoodDeckSF:** Wood deck area in square feet

**OpenPorchSF:** Open porch area in square feet

**EnclosedPorch:** Enclosed porch area in square feet

**3SsnPorch:** Three season porch area in square feet

**ScreenPorch:** Screen porch area in square feet

**PoolArea:** Pool area in square feet

**MiscVal:** Value of miscellaneous feature, in dollars

**MoSold:** Month Sold

**YrSold:** Year Sold

**SalePrice:** The property's sale price in dollars. This is the target variable that you're trying to predict.


In [7]:
#importing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn import metrics

In [8]:
#reading excel file with dataset
house = pd.read_excel("house.xlsx")
house.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,196.0,Gd,...,0,61,0,0,0,0,0,2,2008,208500
1,2,20,80.0,9600,6,8,1976,1976,0.0,TA,...,298,0,0,0,0,0,0,5,2007,181500
2,3,60,68.0,11250,7,5,2001,2002,162.0,Gd,...,0,42,0,0,0,0,0,9,2008,223500
3,4,70,60.0,9550,7,5,1915,1970,0.0,TA,...,0,35,272,0,0,0,0,2,2006,140000
4,5,60,84.0,14260,8,5,2000,2000,350.0,Gd,...,192,84,0,0,0,0,0,12,2008,250000


In [27]:
# Now we will verify the datatypes to see if some features need to be encoded from categorical to numerical
house.dtypes

MSSubClass         int64
LotFrontage      float64
LotArea            int64
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
MasVnrArea       float64
ExterQual          int64
ExterCond          int64
BsmtFinSF1         int64
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
TotRmsAbvGrd       int64
Fireplaces         int64
GarageYrBlt      float64
GarageCars         int64
GarageArea         int64
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
MiscVal            int64
MoSold             int64
YrSold             int64
SalePrice          int64
dtype: object

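There are two categorical columns, ExterQual and ExterCond. ExterQual holds string ratings such as Gd and TA in the first rows shown earlier, so neither column can be used by the models without conversion. The dtypes listing above reports both as int64 and omits Id, which suggests it was captured after the encoding and Id-drop cells further down had already run.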
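One way to spot columns that need encoding, sketched here on a small hypothetical frame (`toy` and its values are illustrative, not taken from `house.xlsx`), is to ask pandas for every non-numeric column:

```python
import pandas as pd

# Hypothetical toy frame that mimics a few of the dataset's columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "ExterQual": ["Gd", "TA", "Gd"],
    "ExterCond": ["TA", "TA", "Gd"],
    "SalePrice": [208500, 181500, 223500],
})

# Keep only the columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude="number")
print(list(non_numeric.columns))
```

Running the same call on the raw frame, before any encoding, should list the columns that still hold strings.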

In [10]:
# using Label Encoder to transform categorical features to numerical.
categoricalColumns = ['ExterQual', 'ExterCond']
le = LabelEncoder()
for col in categoricalColumns:
    house[col] = le.fit_transform(house[col])
house.head()


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,196.0,2,...,0,61,0,0,0,0,0,2,2008,208500
1,2,20,80.0,9600,6,8,1976,1976,0.0,3,...,298,0,0,0,0,0,0,5,2007,181500
2,3,60,68.0,11250,7,5,2001,2002,162.0,2,...,0,42,0,0,0,0,0,9,2008,223500
3,4,70,60.0,9550,7,5,1915,1970,0.0,3,...,0,35,272,0,0,0,0,2,2006,140000
4,5,60,84.0,14260,8,5,2000,2000,350.0,2,...,192,84,0,0,0,0,0,12,2008,250000
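`LabelEncoder` assigns integer codes in alphabetical order of the labels, which is consistent with the output above (Gd becomes 2, TA becomes 3) but does not follow the quality ranking. The sketch below contrasts it with an explicit ordinal mapping. It assumes the ratings run Po, Fa, TA, Gd, Ex from worst to best, which is an assumption, since only Gd and TA are visible in the rows shown. The `ratings` series and `quality_order` dictionary are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of quality ratings
ratings = pd.Series(["Gd", "TA", "Gd", "Ex", "Fa"])

# LabelEncoder sorts the distinct labels alphabetically before numbering them
le = LabelEncoder()
label_codes = le.fit_transform(ratings)
print(list(le.classes_))   # alphabetical order of the labels
print(list(label_codes))

# Assumed ranking from worst to best; adjust if the data uses other codes
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordinal_codes = ratings.map(quality_order)
print(list(ordinal_codes))
```

With the explicit mapping, a larger code always means a better rating, which is not the case for the alphabetical codes.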


In [23]:
# Fill null values with each column's mean.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about and may not apply.
for col in ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']:
    house[col] = house[col].fillna(house[col].mean())

# Confirm that no null values remain
house.isna().sum()

Id               0
MSSubClass       0
LotFrontage      0
LotArea          0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
MasVnrArea       0
ExterQual        0
ExterCond        0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
TotRmsAbvGrd     0
Fireplaces       0
GarageYrBlt      0
GarageCars       0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
MiscVal          0
MoSold           0
YrSold           0
SalePrice        0
dtype: int64

Before this step, the columns LotFrontage, MasVnrArea and GarageYrBlt contained null values. After filling them with the column means, the count above confirms that no nulls remain.
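The effect of mean imputation can be checked on a small example. This is a minimal sketch on a hypothetical frame (`toy` and its values are made up for illustration), counting nulls before and after the fill:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
})

nulls_before = toy.isna().sum()

# fillna with a Series of column means fills each column with its own mean
toy_filled = toy.fillna(toy.mean())

nulls_after = toy_filled.isna().sum()
print(nulls_before)
print(nulls_after)
```

The mean is computed from the non-null entries only, so the filled value for LotFrontage here is the average of 65, 68 and 60.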

In [22]:
#verifying duplicates
print("There are {} duplicates".format(house.duplicated().sum()))

There are 0 duplicates


For modeling purposes, we drop the 'Id' column: it is only a row identifier and carries no information about the sale price.

In [26]:
house_id = house['Id']
house.drop('Id', axis=1, inplace=True)

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,ExterCond,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,60,65.0,8450,7,5,2003,2003,196.0,2,4,...,0,61,0,0,0,0,0,2,2008,208500
1,20,80.0,9600,6,8,1976,1976,0.0,3,4,...,298,0,0,0,0,0,0,5,2007,181500
2,60,68.0,11250,7,5,2001,2002,162.0,2,4,...,0,42,0,0,0,0,0,9,2008,223500
3,70,60.0,9550,7,5,1915,1970,0.0,3,4,...,0,35,272,0,0,0,0,2,2006,140000
4,60,84.0,14260,8,5,2000,2000,350.0,2,4,...,192,84,0,0,0,0,0,12,2008,250000


In [31]:
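# Separate the features (x) from the target (y), then split into training and test sets
x = house.iloc[:, :-1]    # every column except the last
y = house.iloc[:, -1]     # SalePrice, the last column
x_list = list(x.columns)  # keep the feature names for the tree plots and importances
x = np.array(x)
y = np.array(y)

# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)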
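A variation some readers may prefer is to split first and compute the imputation means from the training rows only, so that no statistic from the test rows enters training. The sketch below uses synthetic data and scikit-learn's `SimpleImputer`. It is not what the notebook does, and the array names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic features with roughly 10% of the entries missing
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the column means on the training rows, then reuse them on the test rows
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)  # per-column means learned from the training set
```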

In [32]:
# Random Forest Regression
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(x_train, y_train)
y_pred_rfr = rfr.predict(x_test)
# Note: this R² is computed on the training data, not on the test set
r_sq_score = rfr.score(x_train, y_train)
print(f"R² (train): {r_sq_score}")

R² (train): 0.9785355978814705
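The R² printed above comes from `score(x_train, y_train)`, so it describes how well the model fits data it has already seen. The sketch below uses synthetic data from `make_regression` (not the housing data) to show how the same call on held-out rows gives a generalization estimate.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with noise, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # fit on seen data
test_r2 = model.score(X_test, y_test)     # fit on unseen data

print(f"R² (train): {train_r2:.3f}")
print(f"R² (test):  {test_r2:.3f}")
```

`score` on a regressor is the same quantity as `r2_score(y_true, model.predict(X))`, so either form can be used for the test set.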


In [33]:
# Metrics for the Random Forest Regression
print('MAE: {}'.format(metrics.mean_absolute_error(y_test, y_pred_rfr)))
print('MSE: {}'.format(metrics.mean_squared_error(y_test, y_pred_rfr)))
print('RMSE: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_pred_rfr))))

MAE: 17966.99546803653
MSE: 839136437.8418005
RMSE: 28967.85179887871


In [34]:
# AdaBoost Regression
adb = AdaBoostRegressor(n_estimators=100, random_state=42)
adb.fit(x_train, y_train)
y_pred_adb = adb.predict(x_test)
# Note: this R² is computed on the training data, not on the test set
adb_sq_score = adb.score(x_train, y_train)
print(f"R² (train): {adb_sq_score}")

R² (train): 0.8719496242910916


In [35]:
# Metrics for AdaBoost Regression
print("MAE: {}".format(metrics.mean_absolute_error(y_test, y_pred_adb)))
print("MSE: {}".format(metrics.mean_squared_error(y_test, y_pred_adb)))
print("RMSE: {}".format(np.sqrt(metrics.mean_squared_error(y_test, y_pred_adb))))


MAE: 24665.647787939375
MSE: 1249478513.4101741
RMSE: 35347.96335590177


In [36]:
# Gradient Boosting Regression
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(x_train, y_train)
y_pred_gbr = gbr.predict(x_test)
# Note: this R² is computed on the training data, not on the test set
gbr_sq_score = gbr.score(x_train, y_train)
print(f"R² (train): {gbr_sq_score}")

R² (train): 0.9641021607500599


In [37]:
# Metrics for Gradient Boosting Regression
print("MAE: {}".format(metrics.mean_absolute_error(y_test, y_pred_gbr)))
print("MSE: {}".format(metrics.mean_squared_error(y_test, y_pred_gbr)))
print("RMSE: {}".format(np.sqrt(metrics.mean_squared_error(y_test, y_pred_gbr))))

MAE: 17547.37690312577
MSE: 822246541.5578891
RMSE: 28674.8416134752


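On the test set, Gradient Boosting delivered the best results, with the lowest MAE (about 17,547) and RMSE (about 28,675). Random Forest was a close second (MAE about 17,967, RMSE about 28,968), while AdaBoost trailed clearly (MAE about 24,666, RMSE about 35,348). Note that the R² values printed for each model were computed on the training data, so they measure fit rather than generalization.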
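A comparison like this is easier to read when the test metrics sit in one table. The sketch below fits the same three regressor classes on synthetic data and collects the results. The `models` dictionary, the `results` frame and the data are hypothetical, and the numbers it prints say nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only to demonstrate the comparison loop
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "R2_test": r2_score(y_test, pred),
    })

# One row per model, best (lowest) RMSE first
results = pd.DataFrame(rows).set_index("model").sort_values("RMSE")
print(results)
```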

# **Tree Visualization**

### **Random Forest**

In [42]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rfr.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree_rfr.dot', feature_names = x_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree_rfr.dot')
# Write graph to a png file
graph.write_png('tree_rfr.png')
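If Graphviz or `pydot` is not installed, a single tree can also be inspected as plain text with `sklearn.tree.export_text`. The sketch below uses a tiny synthetic forest. The data and feature names are hypothetical, and `max_depth=2` is chosen only to keep the printout short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data: the target depends mainly on the first feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

forest = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=42)
forest.fit(X, y)

# Render one tree of the forest as indented text rules
tree_text = export_text(
    forest.estimators_[5], feature_names=["feature_a", "feature_b"]
)
print(tree_text)
```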

### **AdaBoost**

In [None]:
# Pull out one tree from the AdaBoost ensemble
tree = adb.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree_adb.dot', feature_names = x_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree_adb.dot')
# Write graph to a png file
graph.write_png('tree_adb.png')

### **Gradient Boosting**

In [None]:
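# Pull out one tree from the Gradient Boosting ensemble.
# gbr.estimators_ is a 2-D array of shape (n_estimators, 1), so index both axes
tree = gbr.estimators_[5, 0]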
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree_gbr.dot', feature_names = x_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree_gbr.dot')
# Write graph to a png file
graph.write_png('tree_gbr.png')
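The three fitted ensembles do not store their trees the same way, which matters when pulling out a single tree for export. The sketch below uses small synthetic models (all names are hypothetical) to check what each `estimators_` attribute holds:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

rf = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)
ada = AdaBoostRegressor(n_estimators=10, random_state=42).fit(X, y)
gb = GradientBoostingRegressor(n_estimators=10, random_state=42).fit(X, y)

# Random Forest and AdaBoost keep a flat list of fitted trees
rf_tree = rf.estimators_[0]
ada_tree = ada.estimators_[0]

# Gradient Boosting keeps a 2-D array, one column per output
print(gb.estimators_.shape)
gb_tree = gb.estimators_[0, 0]

print(type(rf_tree).__name__, type(ada_tree).__name__, type(gb_tree).__name__)
```

Indexing the Gradient Boosting array with a single integer returns a one-element array rather than a tree, which `export_graphviz` cannot draw.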

# **Feature Importance**

In [39]:
# Get numerical feature importances from the Random Forest model
importances = list(rfr.feature_importances_)
# List of (variable, importance) tuples, rounded to two decimals
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(x_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda pair: pair[1], reverse=True)
# Print out each feature and its importance
for feature, importance in feature_importances:
    print('Variable: {:20} Importance: {}'.format(feature, importance))

Variable: OverallQual          Importance: 0.56
Variable: GrLivArea            Importance: 0.12
Variable: TotalBsmtSF          Importance: 0.04
Variable: 2ndFlrSF             Importance: 0.04
Variable: BsmtFinSF1           Importance: 0.03
Variable: 1stFlrSF             Importance: 0.03
Variable: LotArea              Importance: 0.02
Variable: YearBuilt            Importance: 0.02
Variable: GarageArea           Importance: 0.02
Variable: LotFrontage          Importance: 0.01
Variable: OverallCond          Importance: 0.01
Variable: YearRemodAdd         Importance: 0.01
Variable: BsmtUnfSF            Importance: 0.01
Variable: FullBath             Importance: 0.01
Variable: TotRmsAbvGrd         Importance: 0.01
Variable: GarageYrBlt          Importance: 0.01
Variable: GarageCars           Importance: 0.01
Variable: WoodDeckSF           Importance: 0.01
Variable: OpenPorchSF          Importance: 0.01
Variable: MSSubClass           Importance: 0.0
Variable: MasVnrArea           Importance: [output truncated]

In [40]:
# Get numerical feature importances from the AdaBoost model
importances = list(adb.feature_importances_)
# List of (variable, importance) tuples, rounded to two decimals
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(x_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda pair: pair[1], reverse=True)
# Print out each feature and its importance
for feature, importance in feature_importances:
    print('Variable: {:20} Importance: {}'.format(feature, importance))

Variable: OverallQual          Importance: 0.22
Variable: GrLivArea            Importance: 0.17
Variable: 2ndFlrSF             Importance: 0.14
Variable: GarageCars           Importance: 0.08
Variable: OpenPorchSF          Importance: 0.04
Variable: LotFrontage          Importance: 0.03
Variable: LotArea              Importance: 0.03
Variable: BsmtFinSF1           Importance: 0.03
Variable: TotalBsmtSF          Importance: 0.03
Variable: 1stFlrSF             Importance: 0.03
Variable: GarageYrBlt          Importance: 0.03
Variable: ScreenPorch          Importance: 0.03
Variable: YearBuilt            Importance: 0.02
Variable: MoSold               Importance: 0.02
Variable: YearRemodAdd         Importance: 0.01
Variable: BsmtFullBath         Importance: 0.01
Variable: FullBath             Importance: 0.01
Variable: BedroomAbvGr         Importance: 0.01
Variable: Fireplaces           Importance: 0.01
Variable: WoodDeckSF           Importance: 0.01
Variable: PoolArea             Importance: [output truncated]

In [41]:
# Get numerical feature importances from the Gradient Boosting model
importances = list(gbr.feature_importances_)
# List of (variable, importance) tuples, rounded to two decimals
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(x_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda pair: pair[1], reverse=True)
# Print out each feature and its importance
for feature, importance in feature_importances:
    print('Variable: {:20} Importance: {}'.format(feature, importance))

Variable: OverallQual          Importance: 0.52
Variable: GrLivArea            Importance: 0.14
Variable: GarageCars           Importance: 0.05
Variable: BsmtFinSF1           Importance: 0.04
Variable: TotalBsmtSF          Importance: 0.04
Variable: 1stFlrSF             Importance: 0.03
Variable: 2ndFlrSF             Importance: 0.03
Variable: LotArea              Importance: 0.02
Variable: YearBuilt            Importance: 0.02
Variable: YearRemodAdd         Importance: 0.02
Variable: OverallCond          Importance: 0.01
Variable: ExterQual            Importance: 0.01
Variable: FullBath             Importance: 0.01
Variable: Fireplaces           Importance: 0.01
Variable: GarageYrBlt          Importance: 0.01
Variable: MSSubClass           Importance: 0.0
Variable: LotFrontage          Importance: 0.0
Variable: MasVnrArea           Importance: 0.0
Variable: ExterCond            Importance: 0.0
Variable: BsmtFinSF2           Importance: 0.0
Variable: BsmtUnfSF            Importance: 0. [output truncated]
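The same ranking can be produced more compactly with a pandas Series, which also makes it easy to confirm that the importances of a fitted model sum to 1. This is a minimal sketch on synthetic data. The feature names `strong`, `weak` and `noise` are hypothetical and chosen so the expected ordering is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one dominant feature, one minor feature, one irrelevant feature
rng = np.random.default_rng(0)
feature_names = ["strong", "weak", "noise"]
X = rng.normal(size=(300, 3))
y = 10.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Label each importance with its feature name and sort, most important first
ranked = pd.Series(model.feature_importances_, index=feature_names)
ranked = ranked.sort_values(ascending=False)

print(ranked.round(2))
print("Sum of importances:", ranked.sum())
```

Keeping the unrounded values in `ranked` avoids the ties at 0.0 and 0.01 that appear once everything is rounded to two decimals.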