# Intro to Gradient Boosting

![gradient boosting image](https://media.geeksforgeeks.org/wp-content/uploads/20200721214745/gradientboosting.PNG)

Image thanks to [Geeks for Geeks](https://www.geeksforgeeks.org/ml-gradient-boosting/)

In this assignment you will:
1. import and prepare a dataset for modeling
2. test and evaluate 3 different boosting models and compare the fit times of each.
3. tune the hyperparameters of the best model to reduce overfitting and improve performance.
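
Before using the library models, it can help to see the core idea in a few lines. The following is a minimal sketch, not how XGBoost, LightGBM, or scikit-learn implement it: it does regression with squared loss on a single made-up feature, while the models in this assignment are classifiers with many more refinements. The function names (`fit_stump`, `boost`, `mse`) and the toy data are assumptions for illustration. Each round fits a one-split tree (a "stump") to the current residuals and adds a scaled-down copy of it to the ensemble.

```python
# Toy gradient boosting for regression on one feature, using decision stumps.
# Illustrative only: real libraries add regularization, subsampling, and more.

def fit_stump(xs, residuals):
    """Return the one-split predictor with the lowest squared error on the residuals."""
    best = None
    # Candidate thresholds: every unique x except the largest, so both sides are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Build an additive model: the mean of ys plus many small stump corrections."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model_few = boost(xs, ys, n_rounds=1)
model_many = boost(xs, ys, n_rounds=50)
print("training MSE after 1 round:  ", mse(model_few, xs, ys))
print("training MSE after 50 rounds:", mse(model_many, xs, ys))
```

More rounds lower the training error here, which is also why boosting models can overfit and why hyperparameters such as the number of estimators and tree depth are tuned later in the assignment.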

In [46]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In this assignment you will be working with census data.  Your goal is to predict whether a person will make more or less than $50k per year in income.

The data is available [here](https://drive.google.com/file/d/1drlRzq-lIY7rxQnvv_3fsxfIfLsjQ4A-/view?usp=sharing)

In [47]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [48]:
df = pd.read_csv('/content/drive/MyDrive/Coding Dojo/Raw Data/census_income - census_income.csv')
df.head()

| | Unnamed: 0 | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income-class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Prepare your dataset for modeling.

Remember to: 
1. Check for missing data, bad data, and duplicates.
2. Check your target class balance.
3. Perform your validation split
4. Create a preprocessing pipeline to use with your models.
5. Fit and evaluate your models using pipelines

In [49]:
# drop unnecessary columns
df.drop(columns = 'Unnamed: 0', inplace = True)

In [50]:
# check for null values
df.isna().sum()

age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income-class      0
dtype: int64

In [51]:
# check for duplicates
df.duplicated().any()

True

In [52]:
# drop duplicates
df.drop_duplicates(inplace = True)

In [53]:
# confirm duplicates are dropped
df.duplicated().any()

False

In [54]:
# see range of numeric values
# this is to check for values that are not logical
df.describe()

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| count | 29096.0 | 29096.0 | 29096.0 | 29096.0 |
| mean | 39.25134 | 1197.802206 | 97.175179 | 40.63782 |
| std | 13.687157 | 7778.22522 | 424.008232 | 12.735418 |
| min | 17.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 0.0 | 0.0 | 40.0 |
| 50% | 38.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 99999.0 | 4356.0 | 99.0 |


In [55]:
# check value types
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29096 entries, 0 to 32560
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             29096 non-null  int64 
 1   workclass       29096 non-null  object
 2   education       29096 non-null  object
 3   marital-status  29096 non-null  object
 4   occupation      29096 non-null  object
 5   relationship    29096 non-null  object
 6   race            29096 non-null  object
 7   sex             29096 non-null  object
 8   capital-gain    29096 non-null  int64 
 9   capital-loss    29096 non-null  int64 
 10  hours-per-week  29096 non-null  int64 
 11  native-country  29096 non-null  object
 12  income-class    29096 non-null  object
dtypes: int64(4), object(9)
memory usage: 3.1+ MB


## Pre-processing

Now that we have confirmed the data is clean, we will begin pre-processing for modeling.

In [56]:
# check target balance
df['income-class'].value_counts(normalize = True)

<=50K    0.7522
>50K     0.2478
Name: income-class, dtype: float64
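
The class balance above sets the bar that any model has to clear. As a small sketch (the variable names are illustrative, and the proportions are copied from the output above), a "model" that always predicts the majority class would already reach the following accuracy:

```python
# Class proportions copied from the value_counts output above.
class_share = {"<=50K": 0.7522, ">50K": 0.2478}

# A "classifier" that always predicts the most common class is right
# exactly as often as that class occurs.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

print(f"Always predicting {majority_class!r} gives {baseline_accuracy:.1%} accuracy")
```

This is why accuracy alone can flatter a model on unbalanced data, and why the per-class precision and recall in the classification reports deserve attention.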

In [57]:
# declare features matrix and target vector
X = df.drop(columns = 'income-class')
y = df['income-class']

In [58]:
# split into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)
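
One way to check that `stratify = y` kept the class balance the same in both splits is to compare the class shares. The sketch below uses the `support` counts printed in the classification reports later in this notebook; the helper name `share` is an assumption for illustration.

```python
# Support counts copied from the classification reports in this notebook.
train_support = {"<=50K": 16414, ">50K": 5408}
test_support = {"<=50K": 5472, ">50K": 1802}


def share(counts, label):
    """Fraction of rows in `counts` that belong to `label`."""
    return counts[label] / sum(counts.values())


train_share = share(train_support, ">50K")
test_share = share(test_support, ">50K")

n_train = sum(train_support.values())
n_test = sum(test_support.values())
test_fraction = n_test / (n_train + n_test)

print(f">50K share in train: {train_share:.4f}")
print(f">50K share in test:  {test_share:.4f}")
print(f"fraction of rows held out for testing: {test_fraction:.2f}")
```

Both shares sit very close to the 0.2478 seen in the full dataset, and the held-out fraction matches the default `test_size` of 0.25 in `train_test_split`.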

In [59]:
# instantiate column selectors
num_selector = make_column_selector(dtype_include= 'number')
cat_selector = make_column_selector(dtype_include = 'object')

In [60]:
# instantiate OneHotEncoder and StandardScaler
# note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
ohe = OneHotEncoder(sparse = False, handle_unknown= 'ignore')
scaler = StandardScaler()

In [61]:
# instantiate tuples
ohe_tuple = (ohe, cat_selector)
scale_tuple = (scaler, num_selector)

In [62]:
# combine the tuples into a column transformer (nothing is transformed until the pipeline is fit)
col_transformer = make_column_transformer(ohe_tuple, scale_tuple, remainder = 'passthrough')

# eXtreme Gradient Boosting
We are going to compare both metrics and fit times for our models.  Notice the 'cell magic' at the top of the cell below.  By putting `%%time` on the first line of a notebook cell, we can tell it to output how long that cell took to run.  We can use this to compare the speed of each of our different models.  Fit times can be very important for models in deployment, especially with very large datasets and/or many features.

Instantiate an eXtreme Gradient Boosting Classifier (XGBClassifier) below, fit it, and print out a classification report.  Take note of the accuracy, recall, precision, and f1-score, as well as the run time of the cell to compare to our next models.

In [63]:
%%time
# instantiate and fit XGBoost
xgb = XGBClassifier(random_state = 42)
xgb_pipe = make_pipeline(col_transformer, xgb)
xgb_pipe.fit(X_train, y_train)

CPU times: user 5.69 s, sys: 34.5 ms, total: 5.72 s
Wall time: 5.76 s
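
`%%time` only exists inside IPython and Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard library. The helper name `timed` and the toy workload below are assumptions for illustration; in practice the `with` block would wrap a call such as `xgb_pipe.fit(X_train, y_train)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Measure CPU time and wall-clock time of the enclosed block."""
    result = {}
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    try:
        yield result
    finally:
        result["wall"] = time.perf_counter() - wall_start
        result["cpu"] = time.process_time() - cpu_start
        print(f"{label}: CPU {result['cpu']:.3f} s, wall {result['wall']:.3f} s")


# Stand-in workload; replace with a model fit to time real training.
with timed("toy workload") as t:
    total = sum(i * i for i in range(200_000))
```

Wall time is what a user waits for; CPU time can exceed it when a library trains on several cores at once, so it is worth comparing both.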


In [64]:
# evaluate XGBoost model
xgb_train = xgb_pipe.predict(X_train)
xgb_test = xgb_pipe.predict(X_test)
print('Training Classification Report:\n', classification_report(y_train, xgb_train))
print('Test Classification Report:\n', classification_report(y_test, xgb_test))

Training Classification Report:
               precision    recall  f1-score   support

       <=50K       0.87      0.96      0.91     16414
        >50K       0.81      0.57      0.67      5408

    accuracy                           0.86     21822
   macro avg       0.84      0.76      0.79     21822
weighted avg       0.86      0.86      0.85     21822

Test Classification Report:
               precision    recall  f1-score   support

       <=50K       0.87      0.95      0.91      5472
        >50K       0.80      0.57      0.66      1802

    accuracy                           0.86      7274
   macro avg       0.83      0.76      0.79      7274
weighted avg       0.85      0.86      0.85      7274



Which target class is your model better at predicting?  Is it significantly overfit?

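The XGBoost model is better at predicting the <=50K class: on the test set its recall is 0.95 for <=50K but only 0.57 for >50K.

The model is not significantly overfit: training and test accuracy are both 0.86, and the per-class scores differ by at most 0.01 between the two sets.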

# More Gradient Boosting

Now fit and evaluate a Light Gradient Boosting Machine and the scikit-learn (sklearn) gradient boosting model.  Remember to use the `%%time` cell magic command to get the run time.

## LightGBM

In [65]:
%%time
# instantiate and fit LightGBM
light = LGBMClassifier(random_state = 42)
lgbm_pipe = make_pipeline(col_transformer, light)
lgbm_pipe.fit(X_train, y_train)

CPU times: user 564 ms, sys: 15.7 ms, total: 580 ms
Wall time: 585 ms


In [66]:
# evaluate LightGBM
lgbm_train = lgbm_pipe.predict(X_train)
lgbm_test = lgbm_pipe.predict(X_test)
print('Training Classification Report:\n', classification_report(y_train, lgbm_train))
print('Test Classification Report:\n', classification_report(y_test, lgbm_test))

Training Classification Report:
               precision    recall  f1-score   support

       <=50K       0.90      0.95      0.92     16414
        >50K       0.81      0.69      0.74      5408

    accuracy                           0.88     21822
   macro avg       0.86      0.82      0.83     21822
weighted avg       0.88      0.88      0.88     21822

Test Classification Report:
               precision    recall  f1-score   support

       <=50K       0.89      0.94      0.91      5472
        >50K       0.77      0.65      0.71      1802

    accuracy                           0.87      7274
   macro avg       0.83      0.79      0.81      7274
weighted avg       0.86      0.87      0.86      7274



## GradientBoostingClassifier

In [67]:
%%time
# instantiate and fit Gradient Boosting Classifier
gbc = GradientBoostingClassifier(random_state = 42)
gbc_pipe = make_pipeline(col_transformer, gbc)
gbc_pipe.fit(X_train, y_train)

CPU times: user 8.44 s, sys: 27.5 ms, total: 8.47 s
Wall time: 8.46 s


In [68]:
# evaluate Gradient Boosting Classifier
gbc_train = gbc_pipe.predict(X_train)
gbc_test = gbc_pipe.predict(X_test)
print('Training Classification Report:\n', classification_report(y_train, gbc_train))
print('Test Classification Report:\n', classification_report(y_test, gbc_test))

Training Classification Report:
               precision    recall  f1-score   support

       <=50K       0.88      0.96      0.91     16414
        >50K       0.82      0.59      0.69      5408

    accuracy                           0.87     21822
   macro avg       0.85      0.77      0.80     21822
weighted avg       0.86      0.87      0.86     21822

Test Classification Report:
               precision    recall  f1-score   support

       <=50K       0.87      0.95      0.91      5472
        >50K       0.80      0.59      0.68      1802

    accuracy                           0.86      7274
   macro avg       0.84      0.77      0.79      7274
weighted avg       0.86      0.86      0.85      7274
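
To compare the three models side by side, the wall times and test accuracies reported above can be collected in one place. This is only a convenience sketch with the numbers copied from the cell outputs; a single timing run is noisy, so the ratios are rough.

```python
# Wall times (seconds) and test accuracies copied from the outputs above.
results = {
    "XGBoost": {"wall_s": 5.76, "test_accuracy": 0.86},
    "LightGBM": {"wall_s": 0.585, "test_accuracy": 0.87},
    "GradientBoostingClassifier": {"wall_s": 8.46, "test_accuracy": 0.86},
}

fastest = min(results, key=lambda name: results[name]["wall_s"])
most_accurate = max(results, key=lambda name: results[name]["test_accuracy"])

# List the models from fastest to slowest, relative to the fastest fit.
for name, r in sorted(results.items(), key=lambda item: item[1]["wall_s"]):
    ratio = r["wall_s"] / results[fastest]["wall_s"]
    print(f"{name}: {r['wall_s']:.2f} s ({ratio:.1f}x the fastest), "
          f"test accuracy {r['test_accuracy']:.2f}")
```

On these numbers LightGBM is both the fastest to fit and the most accurate on the test set, which is a reasonable basis for choosing it in the tuning step.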




# Tuning Gradient Boosting Models

Tree-based gradient boosting models have a LOT of hyperparameters to tune.  Here are the documentation pages for each of the 3 models you used today:

1. [XGBoost Hyperparameter Documentation](https://xgboost.readthedocs.io/en/latest/parameter.html)
2. [LightGBM Hyperparameter Documentation](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)
3. [Scikit-learn Gradient Boosting Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

Choose the model you felt performed the best when comparing multiple metrics and the runtime for fitting, and use GridSearchCV to try at least 2 different values each for 3 different hyperparameters in the boosting model you chose.

See if you can create a model with an accuracy between 86% and 90%.


In [69]:
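# display the pipeline to find the step name used in the `step__param` grid keys
# (without parentheses this shows the bound method, whose repr includes the steps)
lgbm_pipe.get_params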

<bound method Pipeline.get_params of Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fd46d914350>),
                                                 ('standardscaler',
                                                  StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fd46d914310>)])),
                ('lgbmclassifier', LGBMClassifier(random_state=42))])>

In [72]:
params = {'lgbmclassifier__max_depth': range(1,25),
          'lgbmclassifier__n_estimators': [50, 100, 200],
          'lgbmclassifier__boosting_type': ['gbdt', 'rf']}

In [73]:
%%time
grid = GridSearchCV(lgbm_pipe, params)
grid.fit(X_train, y_train)

360 fits failed out of a total of 720.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
360 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.7/dist-packages/lightgbm/sklearn.py", line 744, in fit
    callbacks=callbacks)
  File "/usr/local/lib/python3.7/dist-packages/lightgbm/sklearn.py", line 544, in fit
    callbacks=callbacks)
  File "/usr/local/lib/python3.7/dist-packages/l

CPU times: user 3min 28s, sys: 2.45 s, total: 3min 30s
Wall time: 3min 30s
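
The failure count in the warning can be reproduced from the grid itself. The sketch below only counts combinations and does not train anything. It assumes GridSearchCV's default of 5 cross-validation folds, since `cv` was not passed above.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "lgbmclassifier__max_depth": range(1, 25),
    "lgbmclassifier__n_estimators": [50, 100, 200],
    "lgbmclassifier__boosting_type": ["gbdt", "rf"],
}
n_folds = 5  # GridSearchCV's default when `cv` is not given

# Every combination of one value per hyperparameter is one candidate.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * n_folds
rf_fits = n_folds * sum(
    1 for c in candidates if c["lgbmclassifier__boosting_type"] == "rf"
)

print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
print(f"fits that use boosting_type='rf': {rf_fits}")
```

The 360 fits that use `'rf'` match the 360 failures exactly, which is consistent with every `'rf'` candidate failing. The traceback above is cut off before the error message, so the cause is not shown. One possibility, stated here as an assumption, is that LightGBM's random forest mode needs bagging parameters that this grid does not set. Setting `error_score='raise'`, as the warning suggests, would surface the actual message.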


In [74]:
grid.best_params_

{'lgbmclassifier__boosting_type': 'gbdt',
 'lgbmclassifier__max_depth': 17,
 'lgbmclassifier__n_estimators': 100}

In [75]:
best = grid.best_estimator_

# Evaluation

Evaluate your model using a classification report and/or a confusion matrix.  Explain in text how your model performed in terms of precision, recall, and its ability to predict each of the two classes.  Also talk about the benefits or drawbacks of the computation time of that model.

In [76]:
best_train = best.predict(X_train)
best_test = best.predict(X_test)
print('Training Classification Report:\n', classification_report(y_train, best_train))
print('Test Classification Report:\n', classification_report(y_test, best_test))

Training Classification Report:
               precision    recall  f1-score   support

       <=50K       0.90      0.95      0.92     16414
        >50K       0.81      0.69      0.75      5408

    accuracy                           0.88     21822
   macro avg       0.86      0.82      0.84     21822
weighted avg       0.88      0.88      0.88     21822

Test Classification Report:
               precision    recall  f1-score   support

       <=50K       0.89      0.93      0.91      5472
        >50K       0.77      0.65      0.71      1802

    accuracy                           0.87      7274
   macro avg       0.83      0.79      0.81      7274
weighted avg       0.86      0.87      0.86      7274



## Evaluation

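Our tuned model is 87% accurate on the test set, the same as the default LightGBM model. It struggles to predict which individuals will earn over $50k, likely because of the unbalanced dataset: it identifies only 65% of them (recall), and 77% of the individuals it predicts to earn over $50k actually do (precision). It performs well on individuals who earn $50k or less: it identifies 93% of them, and 89% of the individuals it predicts to earn $50k or less actually do.

LightGBM was also the fastest of the three models to fit (585 ms wall time, versus 5.76 s for XGBoost and 8.46 s for GradientBoostingClassifier), although the grid search itself took 3 min 30 s.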
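
For reference, the precision, recall, and f1-score columns in the reports are all derived from three counts per class. The sketch below shows the arithmetic with hypothetical counts that are NOT taken from this notebook's results; the function name is also just illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from its confusion counts.

    tp: rows of this class predicted as this class
    fp: rows of other classes predicted as this class
    fn: rows of this class predicted as another class
    """
    precision = tp / (tp + fp)   # of everything predicted as the class, how much was right
    recall = tp / (tp + fn)      # of everything truly in the class, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts for illustration only, not from the census model.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why the >50K class scores a noticeably lower f1 than its precision alone would suggest.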

# Conclusion

In this assignment you practiced:
1. data cleaning
2. instantiating, fitting, and evaluating boosting models using multiple metrics
3. timing how long it takes a model to fit and comparing run times between multiple models
4. and choosing a final model based on multiple metrics.

