<a href="https://colab.research.google.com/github/hkaragah/google_colab_repo/blob/main/hands_on_ml_exercises/07_ensemble_learning_Histogram_BasedGradient_Boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Learning: Histogram-Based Gradient Boosting

__Disclaimer:__ This exercise is adapted from the book _Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (Third Edition)_ by _Aurélien Géron_, published by _O'Reilly_. I broke the examples down into smaller, digestible snippets, made some modifications, and added some explanations so that I can understand them better. The purpose of this notebook is solely for me to understand the concepts and get hands-on practice while reading the book.

## Objective
Train and evaluate a histogram-based gradient boosting regressor (an ensemble learning method) on the California housing data.
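
As a reminder of where the name comes from: histogram-based gradient boosting bins each continuous feature into a small number of integer bins before growing trees, so split search only has to consider bin boundaries. The snippet below is a simplified illustration of that binning step on synthetic data. It is my own sketch using NumPy quantiles, not scikit-learn's internal implementation, and `n_bins = 8` is chosen only for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1_000)  # one synthetic continuous feature

n_bins = 8  # illustrative; scikit-learn's default max_bins is 255

# Interior quantile edges split the data into n_bins roughly equal-count bins
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])

# Map every value to an integer bin index in [0, n_bins - 1]
binned = np.searchsorted(edges, feature, side="right")

# Histogram of how many samples fall into each bin
counts = np.bincount(binned, minlength=n_bins)
print("bin edges:", edges.round(2))
print("samples per bin:", counts)
```

After binning, a tree only needs to evaluate at most `n_bins - 1` candidate thresholds per feature instead of one per distinct value.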

## Load Dataset

In [25]:
# The following dataset does not have 'ocean_proximity' needed for this exercise
# from sklearn.datasets import fetch_california_housing

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from graphviz import Source
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import randint, uniform

from copy import deepcopy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, mean_squared_error
from IPython.display import Image
from pathlib import Path
import tarfile
import urllib.request
import time
import math


In [26]:
def load_housing_data():
    tarball_path = Path("datasets") / "housing.tgz"
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [27]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [29]:
housing['ocean_proximity'].value_counts()

| ocean_proximity | count |
|---|---|
| <1H OCEAN | 9136 |
| INLAND | 6551 |
| NEAR OCEAN | 2658 |
| NEAR BAY | 2290 |
| ISLAND | 5 |


In [30]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housting_labels = train_set['median_house_value']
housing = train_set.drop('median_house_value', axis=1)

In [31]:
housting_labels.head()

|   | median_house_value |
|---|---|
| 14196 | 291000.0 |
| 8267 | 156100.0 |
| 17445 | 353900.0 |
| 14265 | 241200.0 |
| 2271 | 53800.0 |


In [36]:
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 14196 | -117.22 | 32.75 | 34.0 | 6001.0 | 1111.0 | 2654.0 | 1072.0 | 4.5878 | NEAR OCEAN |
| 8267 | -117.03 | 32.69 | 10.0 | 901.0 | 163.0 | 698.0 | 167.0 | 4.6648 | NEAR OCEAN |
| 17445 | -122.27 | 37.74 | 28.0 | 6909.0 | 1554.0 | 2974.0 | 1484.0 | 3.6875 | NEAR BAY |
| 14265 | -121.82 | 37.25 | 25.0 | 4021.0 | 634.0 | 2178.0 | 650.0 | 5.1663 | <1H OCEAN |
| 2271 | -115.98 | 33.32 | 8.0 | 240.0 | 46.0 | 63.0 | 24.0 | 1.4688 | INLAND |


In [53]:
transform = make_column_transformer(
    (OrdinalEncoder(), ['ocean_proximity']),
    remainder='drop')

housing_copy = deepcopy(housing)
td = transform.fit_transform(housing_copy)
print(td)

[[4.]
 [4.]
 [3.]
 ...
 [1.]
 [1.]
 [0.]]


In [48]:
# Access categories
transform.named_transformers_['ordinalencoder'].categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

In [49]:
# Alternatively use the following
transform.transformers_[0][1].categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

As the output above shows:
* The 1st and 2nd entries are 4, which corresponds to 'NEAR OCEAN'.
* The 3rd entry is 3, which corresponds to 'NEAR BAY'.
* And so on.

These codes match the 'ocean_proximity' column of the untransformed DataFrame shown above. Let's use the same approach to encode 'ocean_proximity' as integer category codes inside the model pipeline.
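
To make the label-to-code mapping explicit, here is a minimal sketch on a toy DataFrame (the five rows are illustrative, not taken from the housing data). It assumes the default `OrdinalEncoder` behaviour of sorting the category labels, so a label's position in `categories_` is its code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame containing each category once (illustrative values only)
toy = pd.DataFrame(
    {"ocean_proximity": ["NEAR OCEAN", "NEAR BAY", "ISLAND", "INLAND", "<1H OCEAN"]}
)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)  # 2-D float array, one column per input column

# A label's index in categories_[0] is the code assigned to it
mapping = {label: float(i) for i, label in enumerate(encoder.categories_[0])}
print(mapping)
print(codes.ravel())
```

With all five categories present, 'NEAR OCEAN' maps to 4 and 'NEAR BAY' to 3, as in the transformed output above.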

In [54]:
hgb_reg = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(), ['ocean_proximity']),
        remainder='passthrough'),
    HistGradientBoostingRegressor(
        # The transformer output places the encoded 'ocean_proximity' column first (index 0);
        # the passthrough columns follow from index 1 onward.
        categorical_features=[0],
        random_state=42))

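# Fit on the features only: train_set still contains the 'median_house_value' target,
# which remainder='passthrough' would hand to the model as an input feature.
hgb_reg.fit(housing, housting_labels)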

In [55]:
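# Caution: train_set still contains 'median_house_value', so the target leaks into the
# features and the RMSE values below are far too optimistic. Pass `housing` (labels
# dropped) instead of `train_set` for an honest estimate.
hgb_rmse = -cross_val_score(hgb_reg, train_set, housting_labels, scoring='neg_root_mean_squared_error', cv=10)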
hgb_rmse

array([ 871.41754844,  838.23401336, 1274.26447575,  880.33595677,
       1006.13955721,  842.88526531, 1068.29118665, 1068.11085804,
       1044.96786475, 1049.78833514])

In [58]:
print(pd.Series(hgb_rmse).describe())

count      10.000000
mean      994.443506
std       137.622639
min       838.234013
25%       873.647151
50%      1025.553711
75%      1063.530227
max      1274.264476
dtype: float64
