# Gradient Boosted Trees

### Introduction

In this notebook, I use three implementations of gradient boosting (XGBoost, LightGBM and CatBoost) to predict whether a person's income is over or under 50k based on demographic data. The dataset originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income); a more detailed explanation of the dataset and its processing can be found in the finding-donors project in this same repository.

The main objective of this notebook is to illustrate some gradient boosting implementations, complementing a presentation I gave at a Machine Learning Meetup in 2019. Because of this, I keep each model's hyperparameters at their defaults (apart from fixing the number of trees at 200 with `n_estimators=200`) instead of doing the full hyperparameter search that would be expected in a more detailed project.

Finally, the PDF of the Meetup presentation can be found in this repository.

### Importing the data

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

%matplotlib inline

# Load the Census dataset
data = pd.read_csv("../finding_donors/census.csv")

# Display the first ten records
display(data.head(10))

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
5,37,Private,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K
6,49,Private,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K
7,52,Self-emp-not-inc,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K
8,31,Private,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K
9,42,Private,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K


### Data processing

In [9]:
# Split the data into features and target label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

# Log-transform the skewed features (work on a copy so features_raw stays untouched)
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = features_raw.copy()
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))


# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

features_log_minmax_transform = features_log_transformed.copy()
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])

# Show an example of a record with scaling applied
display(features_log_minmax_transform.head(n = 5))

# One-hot encode the 'features_log_minmax_transform' data using pandas.get_dummies()
features_final = pd.get_dummies(data=features_log_minmax_transform)

# Encode the 'income_raw' data to numerical values (0 for "<=50K", 1 otherwise)
income = income_raw.map(lambda x: 0 if x=="<=50K" else 1)



Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,0.30137,State-gov,Bachelors,0.8,Never-married,Adm-clerical,Not-in-family,White,Male,0.667492,0.0,0.397959,United-States
1,0.452055,Self-emp-not-inc,Bachelors,0.8,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,0.122449,United-States
2,0.287671,Private,HS-grad,0.533333,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,0.397959,United-States
3,0.493151,Private,11th,0.4,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,0.397959,United-States
4,0.150685,Private,Bachelors,0.8,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,0.397959,Cuba
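
The two numeric steps in the cell above (a `log(x + 1)` transform followed by min-max scaling) can be traced by hand on a single column. The sketch below is only an illustration: `log_minmax` is a hypothetical helper that is not part of the notebook, and the column maximum of 99999.0 is an assumed value, not something read from the data here.

```python
import numpy as np


def log_minmax(values):
    """Apply log(x + 1), then rescale the result linearly to [0, 1]."""
    logged = np.log1p(np.asarray(values, dtype=float))
    low, high = logged.min(), logged.max()
    if high == low:
        # A constant column has no spread to rescale; map it to zeros.
        return np.zeros_like(logged)
    return (logged - low) / (high - low)


# Hypothetical capital-gain column: 2174.0 and 14084.0 appear in the first
# records shown earlier, while 99999.0 is an assumed column maximum.
capital_gain = [0.0, 2174.0, 14084.0, 99999.0]
scaled = log_minmax(capital_gain)
print(np.round(scaled, 6))
```

Under that assumed maximum, the value for 2174.0 lands close to the 0.667492 shown for the first record in the table above, which suggests the scaled columns are simply the log-transformed values divided by the largest log-transformed value.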


### Shuffling and Splitting the Data


In [10]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the 'features' and 'income' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    income, 
                                                    test_size = 0.2, 
                                                    random_state = 0)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 36177 samples.
Testing set has 9045 samples.
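
The two sample counts above can be reproduced with a little arithmetic. The sketch below assumes that, for a fractional `test_size`, `train_test_split` rounds the test share up and gives the remainder to the training set (this is my understanding of scikit-learn's behaviour, not something the notebook states). `split_sizes` is a hypothetical helper, and the total of 45222 rows is just the sum of the two printed counts.

```python
from math import ceil


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the training set
    receives whatever is left.
    """
    n_test = ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 45222 = 36177 + 9045, the two counts printed above.
n_train, n_test = split_sizes(45222, 0.2)
print("Training set has {} samples.".format(n_train))
print("Testing set has {} samples.".format(n_test))
```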


In [11]:
X_train.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
13181,0.410959,0.6,0.0,0.0,0.5,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
10342,0.438356,0.533333,0.0,0.0,0.397959,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
20881,0.054795,0.666667,0.0,0.0,0.357143,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
24972,0.30137,0.866667,0.0,0.905759,0.44898,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
43867,0.246575,0.6,0.0,0.0,0.5,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0


### XGBoost

In [45]:
import xgboost as xgb

start = time() 
xg = xgb.XGBClassifier(n_estimators=200, random_state=42, n_jobs=4)
xg.fit(X_train, y_train)
end = time() 

xgb_training_time = end - start

train_predictions = xg.predict(X_train)
test_predictions = xg.predict(X_test)

# Note: despite the `_auc` suffix, these variables hold accuracy, not ROC AUC
xgb_train_auc = accuracy_score(y_train, train_predictions)
xgb_test_auc = accuracy_score(y_test, test_predictions)

print('Training time: {}, train_acc: {}, test_acc = {}'.format(round(xgb_training_time,2), 
                                                               round(xgb_train_auc,4), 
                                                               round(xgb_test_auc,4)))

Training time: 3.87, train_acc: 0.8715, test_acc = 0.8672
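
The variables in this cell end in `_auc`, but the reported numbers are accuracies. For readers who want to see how the two metrics differ, here is a minimal pure-Python sketch on made-up toy data. `accuracy` and `roc_auc_pairwise` are hypothetical helpers written for illustration; in practice the scores would come from predicted probabilities (for example `xg.predict_proba(X_test)[:, 1]`), and `sklearn.metrics.roc_auc_score` would do the computation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This is quadratic in the number of samples,
    so it is only suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data, not taken from the census dataset.
y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.6, 0.4, 0.7, 0.9]
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at 0.5

acc = accuracy(y_true, labels)
auc = roc_auc_pairwise(y_true, scores)
print("accuracy: {:.4f}, roc_auc: {:.4f}".format(acc, auc))
```

Accuracy depends on the 0.5 threshold used to turn scores into labels, while ROC AUC only looks at how the scores rank positives against negatives, so the two numbers generally differ.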


### LightGBM

In [41]:
import lightgbm as lgb

start = time()         
lg = lgb.LGBMClassifier(n_estimators=200, random_state=42, n_jobs=4)
lg.fit(X_train, y_train)
end = time() 
lgb_training_time = end - start

train_predictions = lg.predict(X_train)
test_predictions = lg.predict(X_test)

# Note: despite the `_auc` suffix, these variables hold accuracy, not ROC AUC
lgb_train_auc = accuracy_score(y_train, train_predictions)
lgb_test_auc = accuracy_score(y_test, test_predictions)

print('Training time: {}, train_acc: {}, test_acc = {}'.format(round(lgb_training_time,2), 
                                                               round(lgb_train_auc,4), 
                                                               round(lgb_test_auc,4)))

Training time: 0.46, train_acc: 0.8862, test_acc = 0.8704


### CatBoost

In [44]:
import catboost as ctb

start = time()  
cb = ctb.CatBoostClassifier(n_estimators=200, verbose=False, random_state=42)
cb.fit(X_train, y_train)
end = time() 
cb_training_time = end - start

train_predictions = cb.predict(X_train)
test_predictions = cb.predict(X_test)

# Note: despite the `_auc` suffix, these variables hold accuracy, not ROC AUC
cb_train_auc = accuracy_score(y_train, train_predictions)
cb_test_auc = accuracy_score(y_test, test_predictions)

print('Training time: {}, train_acc: {}, test_acc = {}'.format(round(cb_training_time,2), 
                                                               round(cb_train_auc,4), 
                                                               round(cb_test_auc,4)))

Training time: 9.21, train_acc: 0.8769, test_acc = 0.871
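
The three model cells follow the same pattern: time the call to `fit`, predict on both splits, and report accuracy. A possible way to factor that pattern out is sketched below. `fit_and_score` is a hypothetical helper that is not part of the notebook, and `MajorityClassifier` is a made-up stand-in so the example runs without any boosting library installed. The helper only assumes the model exposes `fit` and `predict`, which the three classifiers used above do.

```python
from collections import Counter
from time import perf_counter


class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit the model, timing only the fit call, and return accuracy on both splits."""
    start = perf_counter()
    model.fit(X_train, y_train)
    training_time = perf_counter() - start

    def accuracy(X, y):
        predictions = model.predict(X)
        return sum(int(p == t) for p, t in zip(predictions, y)) / len(y)

    return {
        "training_time": training_time,
        "train_acc": accuracy(X_train, y_train),
        "test_acc": accuracy(X_test, y_test),
    }


# Toy data, named differently from the notebook's variables on purpose.
toy_X_train, toy_y_train = [[0], [1], [2], [3]], [0, 0, 0, 1]
toy_X_test, toy_y_test = [[4], [5]], [0, 1]

result = fit_and_score(MajorityClassifier(), toy_X_train, toy_y_train,
                       toy_X_test, toy_y_test)
print("train_acc: {}, test_acc: {}".format(result["train_acc"], result["test_acc"]))
```

With something like this in place, each model section could in principle reduce to a single call such as `fit_and_score(xg, X_train, y_train, X_test, y_test)`, which would also keep the timing and scoring identical across the three libraries.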
