# Machine Learning

In this notebook, it is time to play with ML. I'm gonna use some tree-based algorithms (XGBoost and LightGBM), some neural nets, and Bayesian optimization for hyperparameter tuning. Hope to get good results.

In [1]:
import numpy as np
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import datetime

In [2]:
os.getcwd()

'C:\\Users\\hugo_\\OneDrive\\Documentos\\DataScience\\Repos\\kaggle_credit_risk\\notebooks'

In [3]:
# Importing utils 
os.chdir('C:\\Users\\hugo_\\OneDrive\\Documentos\\DataScience\\Repos\\kaggle_credit_risk\\code')

from utils import *

# Data directory
os.chdir('C:\\Users\\hugo_\\OneDrive\\Documentos\\DataScience\\Repos\\kaggle_credit_risk\\data\\treated')

In [4]:
train = pd.read_csv('train_eng.csv')
test = pd.read_csv('test_eng.csv')

In [6]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,EXT_SOURCE_2^3,EXT_SOURCE_2^2 EXT_SOURCE_3,EXT_SOURCE_2^2 DAYS_BIRTH,EXT_SOURCE_2 EXT_SOURCE_3^2,EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH,EXT_SOURCE_2 DAYS_BIRTH^2,EXT_SOURCE_3^3,EXT_SOURCE_3^2 DAYS_BIRTH,EXT_SOURCE_3 DAYS_BIRTH^2,DAYS_BIRTH^3
0,100002,1,0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,...,0.018181,0.009637,-654.152107,0.005108,-346.733022,23536670.0,0.002707,-183.785678,12475600.0,-846859000000.0
1,100003,0,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,...,0.240927,0.197797,-6491.237078,0.162388,-5329.19219,174891600.0,0.133318,-4375.173647,143583000.0,-4712058000000.0
2,100004,0,0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046,-225,...,0.171798,0.225464,-5885.942404,0.295894,-7724.580288,201657200.0,0.388325,-10137.567875,264650400.0,-6908939000000.0
3,100006,0,0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005,-3039,...,0.275185,0.216129,-8040.528832,0.169746,-6314.981929,234933100.0,0.133318,-4959.747997,184515000.0,-6864416000000.0
4,100007,0,0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932,-3038,...,0.033616,0.05321,-2076.117157,0.084225,-3286.224555,128219000.0,0.133318,-5201.667828,202954000.0,-7918677000000.0


This data was previously treated and engineered (by hand, since automated feature engineering was quite expensive and my computer couldn't handle it). Let's begin the ML process.

## Machine Learning process

In [1]:
# Importing validation process
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

# Ensemble of trees classifiers
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, BaggingClassifier

# Decision tree just for visualization
from sklearn.tree import DecisionTreeClassifier

# Our validation metric
from sklearn.metrics import roc_auc_score

# Importing lgbm and xgboost
import lightgbm as lgb
import xgboost as xgb

Just like on Fail Fast, I'll use some algorithms with default parameters. Then, I'll use xgboost, lgbm and neural nets. And for the best model, I'll tune hyperparameters with Bayesian Optimization.

In [23]:
# models 
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
ada = AdaBoostClassifier()
bc = BaggingClassifier()
etc = ExtraTreesClassifier()
gbc = GradientBoostingClassifier()

# classifiers dicts
classifiers = {
    'Decision Tree': dt,
    'Random Forest': rf,
    'AdaBoost': ada,
    'Bagging Classifier': bc,
    'Extra Tree Classifier': etc,
    'Gradient Boosting': gbc
}

As a validation method, I'll use Stratified Shuffle Split. It creates n different train/validation splits of the data, shuffling the samples while keeping the class proportions in each split. This is necessary in order to find out whether our algorithm is overfitting to the training data. A good algorithm should be able to perform well on data not previously seen in the training process. This is called generalization, and it is one of the key points of machine learning.

In [8]:
sss = StratifiedShuffleSplit(n_splits = 5)
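To make the stratification concrete, here is a minimal sketch on synthetic labels (not the competition data). The toy arrays, `test_size=0.1` and `random_state=0` are assumptions chosen for illustration; when neither `test_size` nor `train_size` is given, scikit-learn holds out 10% of the samples per split, which is what the `sss` object above does.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels (10% positives); illustrative only.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.zeros((len(y_demo), 1))  # features do not matter for the split itself

# test_size and random_state are assumptions made here for reproducibility.
sss_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

val_rates = []
for train_idx, val_idx in sss_demo.split(X_demo, y_demo):
    # Each validation split keeps roughly the same share of positives as the full data.
    val_rates.append(y_demo[val_idx].mean())
    print(len(train_idx), len(val_idx), y_demo[val_idx].mean())
```

Each split should report a positive rate close to the overall 10%, which is what keeps the AUC estimates comparable across splits on an imbalanced target.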

Cool! Let's loop over the classifiers and collect the metrics. But first, I'll drop the ID column and the target column from the training and testing dataframes.

In [5]:
ID_train = train.SK_ID_CURR.values
ID_test = test.SK_ID_CURR.values

y_train = train.TARGET.values

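# Note: the 'Unnamed: 0' column shown in train.head() (a saved row index)
# is not dropped here, so it stays in X_train as a feature.
X_train = train.drop(['SK_ID_CURR', 'TARGET'], axis=1).values
X_test = test.drop(['SK_ID_CURR'], axis=1).values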

In [75]:
tic_model = datetime.datetime.now()
for model, clf in classifiers.items():
    
    roc_aucs = []
    
    tic_cv = datetime.datetime.now()
    for train_index, val_index in sss.split(X_train, y_train):
        X_train_, y_train_ = X_train[train_index, :], y_train[train_index]
        X_val_, y_val_ = X_train[val_index, :], y_train[val_index]
        
        clf.fit(X_train_, y_train_)
        roc_aucs.append(roc_auc_score(y_val_, clf.predict_proba(X_val_)[:,1]))
    toc_cv = datetime.datetime.now()
    
    print('Classifier: ' + model)
    print('-------------------')
    print(roc_aucs)
    print('AUC: {} +/- {}'.format(np.array(roc_aucs).mean(), np.array(roc_aucs).std()))
    print("Time elapsed: {} minutes and {} seconds".format(int((toc_cv - tic_cv).seconds / 60), 
                                                           int((toc_cv - tic_cv).seconds % 60)))
    print('='*20)

toc_model = datetime.datetime.now()
print()   
print("Total time elapsed: {} minutes and {} seconds".format(int((toc_model - tic_model).seconds / 60),
                                                             int((toc_model - tic_model).seconds % 60)))

Classifier: Decision Tree
-------------------
[0.539914882234249, 0.539380105635225, 0.5400495700880246, 0.5406251134834923, 0.5475818608028813]
AUC: 0.5415103064487745 +/- 0.0030615101994478666
Time elapsed: 6 minutes and 55 seconds
Classifier: Random Forest
-------------------
[0.6413934611027277, 0.6486663273968813, 0.6504979055497364, 0.649072114803174, 0.6494542883827652]
AUC: 0.6478168194470569 +/- 0.003268837402287209
Time elapsed: 3 minutes and 9 seconds
Classifier: AdaBoost
-------------------
[0.7464020613082756, 0.7574549705694789, 0.7592203730209601, 0.7467749389470386, 0.7537020176123673]
AUC: 0.752710872291624 +/- 0.005308461885649792
Time elapsed: 16 minutes and 19 seconds
Classifier: Bagging Classifier
-------------------
[0.646937282402861, 0.6487526407417195, 0.6475231546214709, 0.6504828468379277, 0.6317010074392173]
AUC: 0.6450793864086393 +/- 0.006798455507822953
Time elapsed: 44 minutes and 5 seconds




Classifier: Extra Tree Classifier
-------------------
[0.6542854565027115, 0.6393832584194475, 0.6525781305305951, 0.6507107861563624, 0.6511619064112601]
AUC: 0.6496239076040753 +/- 0.0052702013675929435
Time elapsed: 1 minutes and 46 seconds
Classifier: Gradient Boosting
-------------------
[0.7647890191132665, 0.7649323475618499, 0.7483139620885462, 0.7570576171815314, 0.7612006862270643]
AUC: 0.7592587264344516 +/- 0.006183495292752015
Time elapsed: 50 minutes and 8 seconds

Total time elapsed: 122 minutes and 24 seconds
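
The elapsed-time formatting is repeated in every timing cell of this notebook. A small helper could factor it out. This is only a sketch: `format_elapsed` is a hypothetical name, not something from `utils`. It uses `total_seconds()` instead of `.seconds`, because `.seconds` ignores the days component of a `timedelta` and would under-report runs longer than 24 hours.

```python
import datetime

def format_elapsed(delta):
    """Return 'M minutes and S seconds' for a datetime.timedelta."""
    # total_seconds() includes the days component, which .seconds drops.
    total = int(delta.total_seconds())
    minutes, seconds = divmod(total, 60)
    return "{} minutes and {} seconds".format(minutes, seconds)

tic = datetime.datetime.now()
toc = tic + datetime.timedelta(minutes=122, seconds=24)  # stand-in for a real run
print("Total time elapsed: " + format_elapsed(toc - tic))
```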


That's good. GradientBoosting had the best performance. Let's submit a file and see what the score is.

In [18]:
# creating a new gradient boosting classifier
gbc = GradientBoostingClassifier()

# training
gbc.fit(X_train, y_train)

# creating submission file
pd.DataFrame(
    {
        'SK_ID_CURR': ID_test,
        'TARGET': gbc.predict_proba(X_test)[:,1]
    }
).to_csv('..\\..\\submissions\\submission_gbc.csv', index = None)

Just to see if the results are satisfactory, let's calculate the AUC on training data.

In [19]:
print('AUC on training data is: {}'.format(roc_auc_score(y_train, gbc.predict_proba(X_train)[:,1])))

AUC on training data is: 0.7648329039998469


Good! 

The submission got __0.74089__ on the private leaderboard. It is my best result so far. But can I make it better? The answer is: of course! Let's try stronger models, such as LightGBM and XGBoost.

# LightGBM

LightGBM has become very popular because it implements gradient boosting machines in a much lighter way than XGBoost and other implementations. It is perfect for people with low RAM (such as me). Let's see how the algorithm works.

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. But how does it differ from other tree-based algorithms?

LightGBM grows trees __vertically__ while other algorithms grow trees horizontally, meaning that LightGBM grows trees __leaf-wise__ while other algorithms grow level-wise. At each step it chooses the leaf with the maximum delta loss to split. For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise one.

![Explains how LightGBM works](leaf-wise.png)

![Explains how other algorithms work](level-wise.png)
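
The difference between the two growth strategies can be sketched with a toy example. This is not LightGBM's actual implementation: the `GAIN` table is made up, each node's loss reduction is assumed to be known in advance, and the level-wise version simply stops once the leaf budget is reached.

```python
import heapq

# Hypothetical loss reduction obtained by splitting each node of a toy binary tree.
# Node i has children 2*i + 1 and 2*i + 2; the numbers are made up for illustration.
GAIN = {0: 10.0, 1: 1.0, 2: 6.0, 3: 0.5, 4: 0.2, 5: 4.0, 6: 3.0}

def level_wise(gain, max_leaves):
    """Split nodes in breadth-first order until the leaf budget is reached."""
    total, leaves, queue = 0.0, 1, [0]
    while queue and leaves < max_leaves:
        node = queue.pop(0)
        if node in gain:
            total += gain[node]
            leaves += 1  # splitting turns one leaf into two
            queue += [2 * node + 1, 2 * node + 2]
    return total

def leaf_wise(gain, max_leaves):
    """Always split the current leaf with the largest loss reduction."""
    total, leaves = 0.0, 1
    heap = [(-gain[0], 0)]  # max-heap emulated with negated gains
    while heap and leaves < max_leaves:
        neg_gain, node = heapq.heappop(heap)
        total += -neg_gain
        leaves += 1
        for child in (2 * node + 1, 2 * node + 2):
            if child in gain:
                heapq.heappush(heap, (-gain[child], child))
    return total

print(level_wise(GAIN, max_leaves=4))
print(leaf_wise(GAIN, max_leaves=4))
```

With these made-up gains and a budget of four leaves, the level-wise strategy collects a total loss reduction of 17.0 while the leaf-wise one collects 20.0, because it skips the weak split on node 1 and goes deeper where the gain is.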

This algorithm is called light because it requires low memory to run with large data. Also, it supports GPUs, so that's a bonus.

Let's check a few important parameters in order to understand what the algorithm is doing.

- __boosting_type__ (string, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.

- __num_leaves__ (int, optional (default=31)) – Maximum tree leaves for base learners.

- __max_depth__ (int, optional (default=-1)) – Maximum tree depth for base learners, <=0 means no limit.

- __learning_rate__ (float, optional (default=0.1)) – Boosting learning rate. You can use callbacks parameter of fit method to shrink/adapt learning rate in training using reset_parameter callback. Note, that this will ignore the learning_rate argument in training.

- __n_estimators__ (int, optional (default=100)) – Number of boosted trees to fit.

- __subsample_for_bin__ (int, optional (default=200000)) – Number of samples for constructing bins.

- __objective__ (string, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). Default: ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier, ‘lambdarank’ for LGBMRanker.

- __class_weight__ (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. This parameter must be used only for multi-class classification task; for binary classification task you may use __is_unbalance__ or __scale_pos_weight__ parameters. Note, that the usage of all these parameters will result in poor estimates of the individual class probabilities. You may want to consider performing probability calibration (https://scikit-learn.org/stable/modules/calibration.html) of your model. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note, that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

- __min_split_gain__ (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

- __min_child_weight__ (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).

- __min_child_samples__ (int, optional (default=20)) – Minimum number of data needed in a child (leaf).

- __subsample__ (float, optional (default=1.)) – Subsample ratio of the training instance.

- __subsample_freq__ (int, optional (default=0)) – Frequence of subsample, <=0 means no enable.

- __colsample_bytree__ (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.

- __reg_alpha__ (float, optional (default=0.)) – L1 regularization term on weights.

- __reg_lambda__ (float, optional (default=0.)) – L2 regularization term on weights.

- __random_state__ (int or None, optional (default=None)) – Random number seed. If None, default seeds in C++ code will be used.

- __n_jobs__ (int, optional (default=-1)) – Number of parallel threads.

- __silent__ (bool, optional (default=True)) – Whether to print messages while running boosting.

- __importance_type__ (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.

For more information, [this link](https://lightgbm.readthedocs.io/en/latest/Python-API.html) should be helpful.
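
As a worked example of the `'balanced'` formula quoted under __class_weight__, here is a small sketch on toy labels (8 negatives, 2 positives, made up for illustration). The `neg / pos` ratio for __scale_pos_weight__ at the end is a common heuristic and an assumption on my part, not something stated in the parameter list.

```python
import numpy as np

# Toy labels: 8 negatives and 2 positives (made up for illustration).
y_demo = np.array([0] * 8 + [1] * 2)

n_samples = len(y_demo)
n_classes = len(np.unique(y_demo))

# The 'balanced' formula: n_samples / (n_classes * np.bincount(y))
balanced = n_samples / (n_classes * np.bincount(y_demo))
print(balanced)  # one weight per class, larger for the rarer class

# Common heuristic for binary tasks (an assumption here): negatives / positives.
neg, pos = np.bincount(y_demo)
scale_pos_weight = neg / pos
print(scale_pos_weight)
```

Here the balanced weights come out as 0.625 for the majority class and 2.5 for the minority class. Their ratio is 4, the same as `neg / pos`.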


In [20]:
lgb_params = {
        'objective': 'binary',
        'boosting': 'gbdt',
        'learning_rate': 0.2 ,
        'verbose': 0,
        'num_leaves': 100,
        'bagging_fraction': 0.95,
        'bagging_freq': 1,
        'bagging_seed': 1,
        'feature_fraction': 0.9,
        'feature_fraction_seed': 1,
        'max_bin': 256,
        'num_rounds': 100,
        'metric' : 'auc'
    }
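
Note that the dictionary above uses LightGBM's native parameter names, while the list above uses the scikit-learn wrapper names. To my understanding of the LightGBM parameter aliases, they map as in the sketch below. `NATIVE_TO_SKLEARN` and `to_sklearn_names` are hypothetical helpers written for this illustration, not part of LightGBM.

```python
# Native LightGBM names used in lgb_params -> names used by the scikit-learn wrapper.
NATIVE_TO_SKLEARN = {
    'boosting': 'boosting_type',
    'num_rounds': 'n_estimators',
    'bagging_fraction': 'subsample',
    'bagging_freq': 'subsample_freq',
    'feature_fraction': 'colsample_bytree',
}

def to_sklearn_names(params):
    """Rename known aliases; keys without a listed alias pass through unchanged."""
    return {NATIVE_TO_SKLEARN.get(key, key): value for key, value in params.items()}

# A subset of the parameters above, just to show the renaming.
demo_params = {'boosting': 'gbdt', 'num_rounds': 100,
               'bagging_fraction': 0.95, 'feature_fraction': 0.9, 'max_bin': 256}
print(to_sklearn_names(demo_params))
```

So `'num_rounds': 100` plays the role of `n_estimators=100`, and `bagging_fraction` / `feature_fraction` correspond to `subsample` / `colsample_bytree`.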

In [28]:
model = lgb.LGBMClassifier(**lgb_params)

In [22]:
tic_cv = datetime.datetime.now()
roc_aucs = []
for train_index, val_index in sss.split(X_train, y_train):
    X_train_, y_train_ = X_train[train_index, :], y_train[train_index]
    X_val_, y_val_ = X_train[val_index, :], y_train[val_index]

    model.fit(X_train_, y_train_)
    roc_aucs.append(roc_auc_score(y_val_, model.predict_proba(X_val_)[:,1]))
    
toc_cv = datetime.datetime.now()

print('Classifier: LGBM')
print('-------------------')
print(roc_aucs)
print('AUC: {} +/- {}'.format(np.array(roc_aucs).mean(), np.array(roc_aucs).std()))
print("Time elapsed: {} minutes and {} seconds".format(int((toc_cv - tic_cv).seconds / 60), 
                                                       int((toc_cv - tic_cv).seconds % 60)))
print('='*20)



Classifier: LGBM
-------------------
[0.7520520985269431, 0.747631775374966, 0.7390433233155147, 0.754214483953404, 0.748605890247179]
AUC: 0.7483095142836013 +/- 0.005201635654617002
Time elapsed: 1 minutes and 41 seconds
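
The manual validation loop used for each model can also be written with scikit-learn's `cross_val_score`, which accepts a splitter object and a scoring name. The sketch below runs on synthetic data with a small decision tree as a stand-in. The dataset, the `max_depth=3` setting and the `random_state` values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (imbalanced binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

cv = StratifiedShuffleSplit(n_splits=5, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# scoring='roc_auc' computes the AUC from predicted scores on each validation split.
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='roc_auc')
print('AUC: {} +/- {}'.format(scores.mean(), scores.std()))
```

Any estimator with the scikit-learn interface, including `lgb.LGBMClassifier`, can be passed in place of the decision tree.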


It is possible to see good results in a few minutes. Let's create a submission file and see the score.

In [29]:
# training on full data
model.fit(X_train, y_train)

# creating submission file
pd.DataFrame(
    {
        'SK_ID_CURR': ID_test,
        'TARGET': model.predict_proba(X_test)[:,1]
    }
).to_csv('..\\..\\submissions\\submission_lgbm.csv', index = None)



In [30]:
print('AUC on training data is: {}'.format(roc_auc_score(y_train, model.predict_proba(X_train)[:,1])))

AUC on training data is: 0.9152767085082014


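This submission scored __0.73124__ on the private leaderboard, below the 0.74089 from GradientBoosting. The training AUC (about 0.915) is also far above the cross-validated AUC (about 0.748), which suggests that this configuration is overfitting. Let's check XGBoost.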
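
A quick way to compare how much each model overfits is the gap between its training AUC and its mean validation AUC. The numbers below are copied from the outputs above; `overfit_gap` is a hypothetical helper name used only for this sketch.

```python
import statistics

# Scores copied from the cell outputs above.
gbc_cv = [0.7647890191132665, 0.7649323475618499, 0.7483139620885462,
          0.7570576171815314, 0.7612006862270643]
lgbm_cv = [0.7520520985269431, 0.747631775374966, 0.7390433233155147,
           0.754214483953404, 0.748605890247179]
gbc_train_auc = 0.7648329039998469
lgbm_train_auc = 0.9152767085082014

def overfit_gap(train_auc, cv_scores):
    """Difference between the training AUC and the mean validation AUC."""
    return train_auc - statistics.mean(cv_scores)

gbc_gap = overfit_gap(gbc_train_auc, gbc_cv)
lgbm_gap = overfit_gap(lgbm_train_auc, lgbm_cv)
print('GBC gap:  {:.4f}'.format(gbc_gap))
print('LGBM gap: {:.4f}'.format(lgbm_gap))
```

The gap is under 0.01 for GradientBoosting and roughly 0.17 for this LightGBM configuration, which is consistent with LightGBM scoring lower on the leaderboard despite its much higher training AUC.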

# XGBoost

XGBoost is an excellent algorithm. It is also an ensemble of trees optimized by gradient boosting. Its implementation is heavier than LightGBM's, but it can still provide good results (top Kagglers use this algorithm a lot!). I'm not gonna dive into the explanation of XGBoost (you will find some helpful information [here](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)). Just keep in mind that it works like LightGBM, except that by default it grows its trees level-wise.

In [10]:
model = xgb.XGBClassifier(n_estimators = 10) # for memory reasons

In [None]:
tic_cv = datetime.datetime.now()
roc_aucs = []
for train_index, val_index in sss.split(X_train, y_train):
    X_train_, y_train_ = X_train[train_index, :], y_train[train_index]
    X_val_, y_val_ = X_train[val_index, :], y_train[val_index]

    model.fit(X_train_, y_train_)
    roc_aucs.append(roc_auc_score(y_val_, model.predict_proba(X_val_)[:,1]))
    
toc_cv = datetime.datetime.now()

print('Classifier: XGBoost')
print('-------------------')
print(roc_aucs)
print('AUC: {} +/- {}'.format(np.array(roc_aucs).mean(), np.array(roc_aucs).std()))
print("Time elapsed: {} minutes and {} seconds".format(int((toc_cv - tic_cv).seconds / 60), 
                                                       int((toc_cv - tic_cv).seconds % 60)))
print('='*20)

In [None]:
# training on full data
model.fit(X_train, y_train)

# creating submission file
pd.DataFrame(
    {
        'SK_ID_CURR': ID_test,
        'TARGET': model.predict_proba(X_test)[:,1]
    }
).to_csv('..\\..\\submissions\\submission_xgb.csv', index = None)

In [None]:
print('AUC on training data is: {}'.format(roc_auc_score(y_train, model.predict_proba(X_train)[:,1])))

We've seen so far that gradient-boosted trees are the better models for this particular task. In the next kernel I'll be doing hyperparameter tuning with random search and Bayesian optimization.