# 1. Ensemble Learning overview
Classical Machine Learning algorithms often perform poorly on real-world datasets. Models fit with these algorithms tend to suffer from high bias, high variance, or both; such a model is called a *weak learner*. In this topic, we go through some elegant techniques that combine multiple models into a more powerful one with better overall results. This is generally referred to as [Ensemble Learning](https://en.wikipedia.org/wiki/Ensemble_learning). Ensemble Learning methods have proved their effectiveness in many Machine Learning competitions.

## 1.1. Voting
[Voting](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) (for classification) or [Averaging](https://scikit-learn.org/stable/modules/ensemble.html#voting-regressor) (for regression) is the simplest ensembling method. For classification, two voting strategies can be applied: majority voting on the predicted labels (hard voting) and taking the argmax of the weighted average of the predicted probabilities (soft voting).
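
To make the difference between the two strategies concrete, here is a minimal NumPy sketch. The probability values and the equal weights are made up for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 from three classifiers (rows)
# for four samples (columns). The numbers are invented for illustration.
proba_class1 = np.array([
    [0.90, 0.40, 0.45, 0.20],
    [0.60, 0.45, 0.40, 0.30],
    [0.30, 0.95, 0.90, 0.10],
])
weights = np.array([1.0, 1.0, 1.0])  # equal weights (an assumption)

# Hard voting: every classifier casts one vote (its own predicted label),
# and the majority label wins.
votes = (proba_class1 >= 0.5).astype(int)
hard = (votes.sum(axis=0) * 2 > len(votes)).astype(int)

# Soft voting: argmax of the weighted average probability.
avg_proba = np.average(proba_class1, axis=0, weights=weights)
soft = (avg_proba >= 0.5).astype(int)

print('hard voting:', hard)
print('soft voting:', soft)
```

For the second and third samples the two strategies disagree: two classifiers vote for class 0 with low confidence, while one classifier is very confident about class 1, which is enough to pull the average probability above 0.5.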

In [28]:
import numpy as np
import pandas as pd
np.set_printoptions(precision=4, floatmode='maxprec')

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier, VotingRegressor

In [24]:
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

In [25]:
clf1 = KNeighborsClassifier()
clf2 = DecisionTreeClassifier()
clf3 = GaussianNB()

models = [('knn', clf1), ('tree', clf2), ('gnb', clf3)]
ensembler = VotingClassifier(models, voting='soft')

In [26]:
models = [clf1, clf2, clf3, ensembler]
names = ['kNN Classifier', 'Decision Tree', 'Naive Bayes', 'Voting Classifier']
yTrue = y

for name, model in zip(names, models):
    model = model.fit(X, y)
    yPred = model.predict(X)
    auc = roc_auc_score(yTrue, yPred)
    print(f'AUC = {auc:.4f} ({name})')

AUC = 0.9379 (kNN Classifier)
AUC = 1.0000 (Decision Tree)
AUC = 0.9317 (Naive Bayes)
AUC = 0.9783 (Voting Classifier)


## 1.2. Stacking
[Stacking](https://scikit-learn.org/stable/modules/ensemble.html#stacked-generalization) involves these steps:
- Fit a number of weak learners to the dataset.

## 1.3. Bagging
[Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) (Bootstrap Aggregating) uses averaging to reduce the variance of the model. More specifically, bagging consists of 2 parts: bootstrapping and aggregating.
- Bootstrapping: We generate `bootstrap samples` of size n from an initial dataset of size N by drawing n observations `randomly with replacement`. This means some observations can be repeated within a sample. Sampling with replacement keeps the bootstrap samples approximately independent of each other.
- Aggregating: After generating the bootstrap samples, we fit a weak learner on each of them. The results of all models are then combined by averaging (for regression) or by voting (for classification).

*Some assumptions come with bagging*:
- The size of all bootstrap samples is fixed; it can be given as a number of observations or as a ratio of the original size.
- Each sample should follow the same distribution and be representative of the initial dataset:
 + Firstly, the size N of the initial dataset should be large enough to capture most of the complexity of the underlying distribution, so that sampling from the dataset is a good approximation of sampling from the real distribution.
 + Secondly, the size N of the dataset should be large enough compared to the size n of the bootstrap samples, so that the samples are not too correlated.
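
The two parts of bagging can be sketched with NumPy alone. This is only an illustration under stated assumptions: the data is synthetic, and a simple line fit (`np.polyfit`) stands in for the weak learner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (made up): y = 2x + noise
N = 200
X = rng.uniform(0, 1, size=N)
y = 2 * X + rng.normal(0, 0.3, size=N)

n = 100        # size of each bootstrap sample (n <= N)
n_models = 50  # number of bootstrap samples, one weak learner per sample

# Bootstrapping: draw n row indices WITH replacement, then fit a learner.
models = []
for _ in range(n_models):
    idx = rng.integers(0, N, size=n)
    slope, intercept = np.polyfit(X[idx], y[idx], deg=1)
    models.append((slope, intercept))

# Aggregating: average the predictions of all learners (regression case).
x_new = 0.5
preds = np.array([slope * x_new + intercept for slope, intercept in models])
bagged = preds.mean()

print(f'individual predictions spread: {preds.std():.4f}')
print(f'bagged prediction at x=0.5:    {bagged:.4f}')
```

For classification, the averaging step would be replaced by a vote over the predicted labels.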

## 1.4. Boosting 
The idea of boosting is to fit models iteratively, such that the training of the model at a given step depends on the models fitted at the previous steps, in order to reduce bias.\
Unlike bagging, boosting uses the result of the previous model to improve the next model in the sequence. Observations that were badly handled by the previous models receive more weight when the next model is fitted. Intuitively, each new model focuses its efforts on the observations that have been the most difficult to fit so far, so at the end of the process we have a strong learner with lower bias.

# 2. Random Forest

## 2.1. Random Forest
*Reference: [Scikit Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)*


Random Forest is a popular bagging method that consists of a large number of Decision Trees. The trees operate as a committee that outperforms any of its individual weak models. This wonderful effect - the wisdom of crowds - can be explained by the fact that the trees protect each other from their individual errors. If the trees behave the same way, they also make the same mistakes, so low correlation between trees is the key.

### Variable randomness
In a normal Decision Tree, every input variable is considered when searching for the best split at a node (a greedy search). In contrast, a Random Forest only considers a random subset of variables when searching for a split (in Scikit-learn, a new subset is drawn at each split and its size is set by `max_features`), which forces even more variation amongst trees. Variable randomness and bootstrap aggregating both rely on sampling, but one selects rows while the other selects columns.
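
The rows-versus-columns contrast can be shown with a few lines of NumPy. This is a sketch on a random matrix, not Scikit-learn's internal implementation; the variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols = 8, 6
X = rng.normal(size=(n_rows, n_cols))

# Bootstrap aggregating: sample ROW indices with replacement (duplicates allowed).
row_idx = rng.integers(0, n_rows, size=n_rows)

# Variable randomness: sample COLUMN indices without replacement.
max_features = 3
col_idx = rng.choice(n_cols, size=max_features, replace=False)

# The candidate data seen when searching for one split.
X_sub = X[row_idx][:, col_idx]
print(X_sub.shape)
```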

### Implementation
Notable hyperparameters:

Hyperparameter|Meaning|Default value|Common values|
:---|:---|:---|:---|
`n_estimators`|Number of trees in the forest|`100`||
`criterion`|Measure of impurity used to evaluate splits|`gini`|`entropy` `gini`|
`max_features`|Number or ratio of variables considered at each split|`auto` (`sqrt` in recent Scikit-learn versions)|`0.8` `0.9`|
`max_samples`|Number or ratio of instances drawn to train each tree|`None`|`0.5`|

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
credit = pd.read_excel('data/credit_scoring.xlsx')
credit = credit.dropna().reset_index()
credit.head()

Unnamed: 0,index,Bad_customer,credit_balance_percent,age,num_of_group1_pastdue,debt_ratio,income,num_of_loans,num_of_times_late_90days,num_of_estate_loans,num_of_group2_pastdue,num_of_dependents
0,0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [3]:
y = credit.Bad_customer.values
x = credit.drop(columns='Bad_customer')

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [5]:
params = {
    'max_features': [0.8, 0.9],
    'max_samples': [0.5]
}

forest = RandomForestClassifier()
forest = GridSearchCV(forest, params, cv=5)
forest = forest.fit(x_train, y_train)
forest.best_params_

{'max_features': 0.8, 'max_samples': 0.5}

In [7]:
y_train_pred = forest.predict(x_train)
y_test_pred = forest.predict(x_test)

In [8]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98     89543
           1       0.99      0.58      0.73      6672

    accuracy                           0.97     96215
   macro avg       0.98      0.79      0.86     96215
weighted avg       0.97      0.97      0.97     96215



In [9]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.96     22369
           1       0.55      0.17      0.26      1685

    accuracy                           0.93     24054
   macro avg       0.75      0.58      0.61     24054
weighted avg       0.91      0.93      0.92     24054



# 3. AdaBoost

## 3.1. AdaBoost

*Reference: [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)*

AdaBoost is a boosting algorithm originally developed for binary classification. AdaBoost is usually used with decision trees that have 1 node and 2 leaves (called `stumps`). The main ideas of AdaBoost are:
- It combines several stumps to make a strong learner.
- Stumps are weighted, so some stumps are more important than others.
- Each stump is built by taking the mistakes of the previous stump into account, so the order of the stumps matters.

**The workflow** \
Assume that we have $n$ samples, each with $a$ features and the same `sample weight` of $1/n$ (so the sample weights sum to 1). To pick the first stump, we build one stump per feature, compute the Gini index of each, and choose the stump with the smallest Gini index.

After that, we determine the weight of the stump based on how well it classifies the samples; we call this weight the `Amount of say`. The total error of a stump is `the sum of the weights assigned to the incorrectly classified samples` (e.g. if 2 samples are misclassified in the first round, the stump has a total error of $2/n$). The formula for the amount of say is:

$$\mbox{Amount of say} = L*\ln\left(\frac{1-\mbox{total error}}{\mbox{total error}}\right)$$

where $L$ is the learning rate. The equation means that the smaller the total error, the larger the amount of say.

The third idea of AdaBoost is that the next stump is built by taking the mistakes of the previous stump into account. For the next stump to focus on the samples misclassified by the first stump, we increase the sample weights of the misclassified samples and decrease the others. The formulas for the new sample weights are:
- For misclassified samples:

$$\mbox{new sample weight} = \mbox{sample weight} * e^{\mbox{amount of say}}$$

- For the others:

$$\mbox{new sample weight} = \mbox{sample weight} * e^{-\mbox{amount of say}}$$

Then we normalize the new sample weights so that they add up to 1. The new sample weights are used to build the next stump, and the process repeats until all stumps are built.
Lastly, to make a prediction with AdaBoost, we split the stumps into 2 groups: those that classify the observation as A and those that classify it as B. If the sum of the amounts of say of group A is bigger, the observation is assigned to class A, and vice versa.
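
The workflow above can be traced by hand on a tiny example. The labels, the stump predictions and the amounts of say used in the final vote are all made up; the sketch only follows the formulas given in this section.

```python
import numpy as np

n = 8
y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1])
# Hypothetical predictions of the first stump: samples 3 and 7 are wrong.
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, 1])
learning_rate = 1.0

# Every sample starts with weight 1/n.
w = np.full(n, 1 / n)

# Total error = sum of the weights of the misclassified samples (here 2/8).
wrong = y_pred != y_true
total_error = w[wrong].sum()

# Amount of say = L * ln((1 - total error) / total error)
amount_of_say = learning_rate * np.log((1 - total_error) / total_error)

# Increase the weights of misclassified samples, decrease the others, normalize.
new_w = np.where(wrong, w * np.exp(amount_of_say), w * np.exp(-amount_of_say))
new_w = new_w / new_w.sum()

# Final prediction: compare the summed amounts of say of the two groups.
says = np.array([1.10, 0.40, 0.55])   # hypothetical amounts of say of 3 stumps
stump_votes = np.array([1, -1, -1])   # class A = +1, class B = -1
say_a = says[stump_votes == 1].sum()
say_b = says[stump_votes == -1].sum()
final = 1 if say_a > say_b else -1

print(f'total error   = {total_error:.3f}')
print(f'amount of say = {amount_of_say:.4f}')
print('new weights   =', np.round(new_w, 4))
print('final class   =', final)
```

Note that the single stump voting for class A wins the final vote because its amount of say (1.10) exceeds the combined amount of say of the two stumps voting for class B (0.95).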

### Implementation

Notable hyperparameters:

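Hyperparameter|Meaning|Default value|Common values|
:---|:---|:---|:---|
`base_estimator`|Weak learner used to build the ensemble (renamed `estimator` in recent Scikit-learn versions)|`DecisionTreeClassifier(max_depth=1)`||
`n_estimators`|Maximum number of stumps in the ensemble|`50`|`100`|
`learning_rate`|Weight applied to each stump's amount of say|`1`|`1/2`|
`algorithm`|Boosting algorithm variant (`SAMME` or `SAMME.R`)|`SAMME.R`||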


In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [9]:
credit = pd.read_excel('data/credit_scoring.xlsx')
credit = credit.dropna().reset_index()
credit.head()

Unnamed: 0,index,Bad_customer,credit_balance_percent,age,num_of_group1_pastdue,debt_ratio,income,num_of_loans,num_of_times_late_90days,num_of_estate_loans,num_of_group2_pastdue,num_of_dependents
0,0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [12]:
y = credit.Bad_customer.values
x = credit.drop(columns='Bad_customer')

In [13]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [19]:
params = {
    'learning_rate': [0.5, 0.8, 1],
    'n_estimators': [100],
    'algorithm': ['SAMME','SAMME.R']
}

ada = AdaBoostClassifier()
ada = GridSearchCV(ada, params, cv=5)
ada = ada.fit(x_train, y_train)
ada.best_params_

{'algorithm': 'SAMME.R', 'learning_rate': 0.5, 'n_estimators': 100}

In [20]:
y_train_pred = ada.predict(x_train)
y_test_pred = ada.predict(x_test)

In [21]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     89510
           1       0.57      0.20      0.30      6705

    accuracy                           0.93     96215
   macro avg       0.76      0.60      0.63     96215
weighted avg       0.92      0.93      0.92     96215



In [22]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     22402
           1       0.55      0.19      0.29      1652

    accuracy                           0.93     24054
   macro avg       0.75      0.59      0.63     24054
weighted avg       0.92      0.93      0.92     24054



# 4. Gradient Boosting Trees

## 4.1. Gradient Boosting
*Reference: [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier)*\
Gradient boosting is a powerful algorithm that works for both regression and classification problems. The purpose of GBM is to minimize the overall prediction error. The key idea is to set the target outcomes for the next model in the sequence so as to minimize that error.\
Like AdaBoost, gradient boosting builds a sequence of trees in which each tree is fit to the errors of the previous trees. There are some differences between the 2 algorithms:

Feature|AdaBoost|Gradient boosting|
:---|:---|:---|
Weak learners|Stumps|Trees of a fixed size that are larger than stumps (usually with a maximum of 8-32 leaves)|
Weight of tree|The amount of say differs from stump to stump|All trees are scaled by the same learning rate|

### Gradient boosting for regression

Assume that we have input data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function $L(y_i, F(x_i))$. For gradient boosting regression, the common loss function is:
$$L = \frac{1}{2}(y-\hat{y})^2$$
The main purpose of the algorithm is to find the minimizer of the loss function, using gradient descent when the function is complicated.

Firstly, gradient boosting computes the initial prediction $F_0(x) = \arg\min_{\gamma}\sum_{i=1}^{n}L(y_i, \gamma)$, which for the loss above is precisely the average of the dependent variable $\bar{y}$, and assigns this value to a single node.\
Step 2 is a loop over the trees $m = 1, \dots, M$:\
**(A)** The algorithm calculates the error of each prediction, called the `pseudo residual`:
$$r_{i,m} = y_i - F_{m-1}(x_i)$$  for $i = 1,2,\dots,n$

**(B)** The next tree in the sequence is fit to these residuals. Each leaf of the tree is called a `terminal region` $R_{j,m}$. The output value of each region is the average of the residuals in that region. The formula for the output of $R_{j,m}$ is:
$$\gamma_{j,m} = \arg\min_{\gamma}\sum_{x_i\in R_{j,m}}{L(y_i, F_{m-1}(x_i)+\gamma)}$$

**(C)** Then we update the prediction for each sample by adding the output of the new tree to the previous prediction:
$$F_m(x)=F_{m-1}(x)+\nu\sum_{j=1}^{J_m}{\gamma_{j,m}\,\mathbb{1}(x\in R_{j,m})}$$

Here $\nu$ is the learning rate, which helps prevent high variance. All trees are scaled by the same learning rate. Each time a tree is added to the prediction, the residuals get smaller, meaning the predictions get more accurate. $F_M(x)$ is the output of gradient boosting.

The process stops when the residuals no longer get smaller or the model reaches the maximum number of trees.
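
Steps (A) to (C) can be written as a short loop. The sketch below is an illustration, not Scikit-learn's `GradientBoostingRegressor`: it assumes the squared-error loss above, synthetic data, and uses `DecisionTreeRegressor` as the weak learner (its leaf values are the mean residuals, which is exactly $\gamma_{j,m}$ for this loss).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (made up): a noisy sine curve
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

nu, M = 0.1, 100                  # learning rate and number of trees
F0 = y.mean()                     # F_0(x): the average of y
F = np.full_like(y, F0)
mse_start = np.mean((y - F) ** 2)

trees = []
for m in range(M):
    residuals = y - F                                    # (A) pseudo residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                               # (B) fit tree to residuals
    F = F + nu * tree.predict(X)                         # (C) update predictions
    trees.append(tree)

mse_end = np.mean((y - F) ** 2)

def predict(X_new):
    """F_M(x) = F_0 + nu * (sum of the outputs of all trees)."""
    return F0 + nu * sum(tree.predict(X_new) for tree in trees)

print(f'MSE before boosting: {mse_start:.4f}')
print(f'MSE after {M} trees:  {mse_end:.4f}')
```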

### Gradient boosting for classification
GBM for classification works the same way as GBM for regression, but instead of predicting continuous values, the algorithm predicts the probability that an observation belongs to class 0 or class 1.

The loss function of GBM is now the negative log-likelihood, rewritten in terms of the predicted $log(odds)$:
$$L = - y * log(odds) + log(1+e^{log(odds)})$$
As in regression, the initial prediction is the $log(odds)$ that minimizes the loss function.\
Recall that $odds = \frac{p}{1-p}$, where $p$ is the probability of class 1.

The $log(odds)$ is converted to the predicted probability of an observation with the formula:
$$p = \frac{e^{log(odds)}}{1+e^{log(odds)}}$$

Step 2 is a loop over the trees $m = 1, \dots, M$:\
**(A)** The algorithm calculates the `pseudo residual`, where $y_i$ is the observed label and $p_i$ is the previously predicted probability:
$$r_{i,m} = y_i - p_i$$

**(B)** The next tree in the sequence is fit to these residuals. The exact equation for the output of each leaf is:
$$\gamma_{j,m} = \arg\min_{\gamma}\sum_{x_i\in R_{j,m}}{L(y_i, F_{m-1}(x_i)+\gamma)}$$

which is approximated by:
$$\gamma_{j,m} = \frac{\sum_{x_i\in R_{j,m}} r_{i,m}}{\sum_{x_i\in R_{j,m}} p_i*(1-p_i)}$$


**(C)** The new $log(odds)$ of each observation combines the previous prediction with the output of the leaf it falls into:
$$F_m(x)=F_{m-1}(x)+\nu\sum_{j=1}^{J_m}{\gamma_{j,m}\,\mathbb{1}(x\in R_{j,m})}$$
where $\nu$ is the learning rate.

Lastly, the new predicted probability is:
$$p = \frac{e^{log(odds)}}{1+e^{log(odds)}}$$

Continue until the residuals stop getting smaller or the maximum number of trees is reached.
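
One boosting round for classification can be traced numerically. In this sketch the labels and the assignment of samples to leaves are hypothetical (no tree is actually fitted); it only applies the formulas of steps (A) to (C).

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 1])   # observed labels (made up)
nu = 0.1                            # learning rate

# Initial prediction: the log(odds) of class 1, converted to a probability.
p0 = y.mean()
log_odds = np.full(len(y), np.log(p0 / (1 - p0)))
p = np.exp(log_odds) / (1 + np.exp(log_odds))

# (A) pseudo residuals
residuals = y - p

# (B) Suppose a fitted tree placed samples {0, 1, 3} in leaf 0 and
#     samples {2, 4, 5} in leaf 1 (hypothetical assignment).
leaf = np.array([0, 0, 1, 0, 1, 1])
gamma = np.array([
    residuals[leaf == j].sum() / (p[leaf == j] * (1 - p[leaf == j])).sum()
    for j in (0, 1)
])

# (C) update the log(odds), then convert back to probabilities
log_odds = log_odds + nu * gamma[leaf]
p_new = np.exp(log_odds) / (1 + np.exp(log_odds))

print('leaf outputs:', np.round(gamma, 4))
print('old p:', np.round(p, 4))
print('new p:', np.round(p_new, 4))
```

The leaf containing only positive samples pushes its probabilities up, while the leaf dominated by negative samples pushes them down.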

### Implementation
Notable hyperparameters:

Hyperparameter|Meaning|Default value|Common values|
:---|:---|:---|:---|
`loss`|The loss function|`deviance` (logistic regression; named `log_loss` in recent Scikit-learn versions)|`deviance`|
`n_estimators`|Number of boosting stages|`100`||
`learning_rate`|Shrinks the contribution of each tree|`0.1`||
`subsample`|Fraction of samples used to fit each tree|`1.0`||
`criterion`|Measure of the quality of a split|`friedman_mse`||
Tree parameters group|Parameters that help prevent overfitting||`min_samples_split`, `min_samples_leaf`, `max_depth`, ...|


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
credit = pd.read_excel('data/credit_scoring.xlsx')
credit = credit.dropna().reset_index()
credit.head()

Unnamed: 0,index,Bad_customer,credit_balance_percent,age,num_of_group1_pastdue,debt_ratio,income,num_of_loans,num_of_times_late_90days,num_of_estate_loans,num_of_group2_pastdue,num_of_dependents
0,0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [3]:
y = credit.Bad_customer.values
x = credit.drop(columns='Bad_customer')

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [6]:
params = {
    'learning_rate': [0.1,0.001],
    'n_estimators': [100],
    'subsample': [1],
    'criterion': ['friedman_mse'],
    # 'min_samples_split': [10, 5, 2],
}

gbm = GradientBoostingClassifier()
gbm = GridSearchCV(gbm, params, cv=3)
gbm = gbm.fit(x_train, y_train)
gbm.best_params_

{'criterion': 'friedman_mse',
 'learning_rate': 0.1,
 'n_estimators': 100,
 'subsample': 1}

In [7]:
y_train_pred = gbm.predict(x_train)
y_test_pred = gbm.predict(x_test)

In [8]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     89550
           1       0.62      0.20      0.30      6665

    accuracy                           0.94     96215
   macro avg       0.78      0.60      0.64     96215
weighted avg       0.92      0.94      0.92     96215



In [9]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     22362
           1       0.60      0.21      0.31      1692

    accuracy                           0.93     24054
   macro avg       0.77      0.60      0.64     24054
weighted avg       0.92      0.93      0.92     24054



# 5. XGBoost

## 5.1. XGBoost

**Notation**:
- $O_{value}$: Output value
- $\lambda$: Regularization parameter
- $\gamma$: Tree complexity parameter
- $\eta$: Learning rate
- Gradient ($g$): First derivative of the loss function
- Hessian ($h$): Second derivative of the loss function
- Residual: $r$
- Number of residuals: $N_r$

### The idea and mathematics behind XGBoost

The first step of fitting XGBoost to the training data is making the initial prediction $p_0$ (default is `0.5` for both regression and classification). Like GBM, XGBoost fits a tree to the residuals of the observations, but it builds its own kind of tree. XGBoost builds a tree that tries to minimize the expression:
$$\sum_{i=1}^{n}L(y_i,p_{i}^0+O_{value}) + \frac{1}{2}\lambda*{O^2_{value}}  \space \mbox{(1)}$$
where $L$ is the *loss function* and $\frac{1}{2}\lambda*{O^2_{value}}$ is the *regularization term*.\
The goal of XGBoost is to find the optimal output value that minimizes expression $(1)$.

To build an XGBoost tree, we first calculate the residual of each observation with respect to the initial prediction, then compute the similarity score for the root:
$$\mbox{Similarity Score} = \frac{(\sum gradient)^2}{\sum hessian + \lambda}$$
The similarity score tells us how similar the residuals of the observations in a node are: if the residuals differ in sign, they cancel each other out and the similarity score is small. $\lambda$ is the regularization parameter; it is intended to reduce the prediction's sensitivity to individual observations and thereby prevent overfitting. When $\lambda$ increases, the similarity score decreases, and the amount of decrease is inversely proportional to the number of residuals in the node.\
After computing the similarity score for the root, we look for an optimal split point for the tree. We choose a threshold to split the observations and then calculate the similarity score of each resulting node. To quantify how much better the leaves group similar residuals than the root does, we calculate the gain:
$$\mbox{gain} = \mbox{left}_{similarity} + \mbox{right}_{similarity} - \mbox{root}_{similarity}$$

We compare the gain of this threshold with the gains of the other thresholds, and the threshold with the largest gain is chosen. We continue building the tree with the other features until the maximum number of splits is reached.

Once we have the full tree, we prune it using the $\gamma$ parameter. If $gain - \gamma < 0$, we remove that split (prune the branch).

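The next step is to find the output value for each leaf of the tree. To do that, we take the derivative of expression $(1)$ with respect to $O_{value}$ (using a second-order approximation of the loss) and set it to zero, which gives the formula for the optimal output:
$$O_{value}=\frac{-\sum gradient}{\sum hessian + \lambda}$$

The larger $\lambda$ is, the closer the output value is to 0.\
Finally, we scale the output values by the learning rate to make the new prediction:
$$p_0 + \sum_{m=1}^{M}\eta * O_{value}^{(m)}$$
where $O_{value}^{(m)}$ is the output value that tree $m$ assigns to the observation.

Like GBM, the process goes on with the next tree, which is built on the residuals of the previous trees. It ends when the residuals stop getting smaller or the maximum number of trees is reached.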


### Application in regression

In regression, XGBoost uses the same loss function as GBM:
$$L(y_i,p_i) = \frac{1}{2}(y_i-p_i)^2$$

The gradient and hessian of the loss function are then:
$g_i = -(y_i-p_i) = -r_i$ and $h_i = 1$

Substituting into $O_{value}$ and the similarity score, we get:
$$O_{value} = \frac{\sum r}{N_r + \lambda}$$

$$\mbox{similarity score} = \frac{(\sum r)^2}{N_{r} + \lambda} $$
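
The regression formulas can be checked on a handful of residuals. The residuals, the candidate split and the values of $\lambda$, $\gamma$, $\eta$ below are made up; the helper names `similarity` and `output_value` are illustrative and are not part of the XGBoost API.

```python
import numpy as np

# Hypothetical residuals (y_i - p_0) of four observations, sorted by a feature
residuals = np.array([-10.5, 6.5, 7.5, -7.5])
lam, gamma, eta, p0 = 1.0, 50.0, 0.3, 0.5   # assumed hyperparameter values

def similarity(r, lam):
    """Similarity score for regression: (sum of residuals)^2 / (N_r + lambda)."""
    return r.sum() ** 2 / (len(r) + lam)

def output_value(r, lam):
    """Leaf output for regression: sum of residuals / (N_r + lambda)."""
    return r.sum() / (len(r) + lam)

# Candidate split: the first observation goes left, the other three go right.
left, right = residuals[:1], residuals[1:]

root_sim = similarity(residuals, lam)
gain = similarity(left, lam) + similarity(right, lam) - root_sim
keep_split = gain - gamma > 0          # pruning rule with gamma

# New predictions for observations falling in each leaf
pred_left = p0 + eta * output_value(left, lam)
pred_right = p0 + eta * output_value(right, lam)

print(f'root similarity = {root_sim:.4f}')
print(f'gain            = {gain:.4f}, keep split: {keep_split}')
print(f'new predictions = {pred_left:.4f} (left), {pred_right:.4f} (right)')
```

In the root, the positive and negative residuals largely cancel out, so its similarity score is small; separating the strongly negative residual gives a large gain.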

### Application in classification

In classification, the loss function is the negative log-likelihood:
$$L(y_i,p_i) = -[y_i*log(p_i) + (1-y_i)*log(1-p_i)]$$
The gradient and hessian of the loss function are then:
$g_i = -(y_i-p_i) = -r_i$ and $h_i = p_i*(1-p_i)$

Substituting, we get:
$$O_{value} = \frac{\sum r}{cover + \lambda}$$

$$\mbox{similarity score} = \frac{(\sum r)^2}{cover+ \lambda} $$
with $cover = \sum h_i= \sum[\mbox{previous prob}*(1-\mbox{previous prob})]$

Instead of using a probability directly as the new prediction, we predict the $log(odds)$ and then convert it to a probability (because the output value is now on the $log(odds)$ scale):

$$log(odds)\ \mbox{prediction} = log(\frac{p_0}{1-p_0}) + \eta*O_{value}$$
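
A small numeric sketch of the classification case, with made-up labels and an assumed leaf containing three of the four observations. It follows the formulas above and does not call the XGBoost library.

```python
import numpy as np

y = np.array([0, 1, 1, 0])          # observed labels (made up)
p_prev = np.full(4, 0.5)            # initial prediction p_0 = 0.5
lam, eta = 1.0, 0.3                 # assumed hyperparameter values

r = y - p_prev                      # residuals
h = p_prev * (1 - p_prev)           # hessians; their sum in a leaf is the cover

# Suppose one leaf holds observations 1, 2 and 3 (hypothetical).
in_leaf = np.array([False, True, True, True])
cover = h[in_leaf].sum()
sim = r[in_leaf].sum() ** 2 / (cover + lam)
o_value = r[in_leaf].sum() / (cover + lam)

# Update on the log(odds) scale, then convert back to a probability.
log_odds = np.log(0.5 / (1 - 0.5)) + eta * o_value
p_new = np.exp(log_odds) / (1 + np.exp(log_odds))

print(f'cover = {cover:.2f}, similarity = {sim:.4f}, output = {o_value:.4f}')
print(f'new probability for the leaf = {p_new:.4f}')
```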

### Optimization in xgboost

### Implementation

# 6. LightGBM

## 6.1. LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms.

The reasons why LightGBM is powerful are:
1. Optimization in accuracy:
- LightGBM uses a *leaf-wise* strategy to grow the tree vertically: it chooses the leaf with the maximum delta loss to grow.
- Optimal split for categorical features. Many other boosting implementations require one-hot encoding for categorical features, whereas LightGBM can split directly on a categorical feature stored as integer codes (e.g. gender: 0-unknown, 1-man, 2-woman) and treats the codes as unordered categories rather than as numeric magnitudes.
2. Optimization in speed and memory usage:
- LightGBM uses histogram-based algorithms, which bucket continuous feature values into discrete bins. The algorithm then searches for the best split between bins.
- LightGBM uses the Gradient-based One-Side Sampling (GOSS) method. Usually, subsampling is done by taking a random sample from the dataset. GOSS is not only faster but also tries to preserve the distribution of the data. The instances with larger gradients (under-trained) contribute a lot more to the tree building process, so we keep all instances with large gradients; to avoid changing the distribution too much, we perform random sampling on the instances with small gradients. Assume we have $n$ data instances: if we keep $a$ instances with large gradients and randomly sample $x$% of the $(n-a)$ instances with small gradients, the sampled data has size $a+ x$% $*(n-a)$.
3. Sparse optimization:
- LightGBM handles sparse data by using Exclusive Feature Bundling (EFB). EFB is a technique that uses a greedy algorithm to combine (or bundle) mutually exclusive features (features that never take non-zero values at the same time) into a single feature, and thus reduces the dimensionality.
4. Optimization in distributed learning:
- Distributed learning allows the use of multiple machines to produce a single model.
- Feature parallel: Data is split vertically; each worker handles a subset of features, finds the best split on its local feature set, then communicates with the others to get the overall best one.
- Data parallel: Data is split horizontally; each worker constructs local histograms, which are then merged into a global histogram of the data, and the best split is found on the global histogram.
- Voting parallel: It is a special case of data parallel in which the communication cost is constant.

||Data is small|Data is large|
|:--|:--|:--|
|**Number of features is small**|Feature parallel|Data parallel|
|**Number of features is large**|Feature parallel|Voting parallel|

<img src='data/parallel_lgbm.png' style='height:500px; margin: 0 auto 20px;'>

LightGBM also supports GPU learning. The disadvantages of LightGBM are that it can easily overfit on small datasets and that it does not work well with one-hot encoded categorical features, since one-hot encoding bypasses its optimal categorical split.
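
The GOSS sampling step described earlier can be sketched in NumPy. This is an illustration, not LightGBM's internal code: the gradients are random numbers, `top_rate` and `other_rate` are assumed values, and the compensation weight of `1 / other_rate` for the sampled small-gradient instances is the re-weighting idea of GOSS adapted to this section's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
grad = rng.normal(size=n)            # made-up gradients of n instances

top_rate, other_rate = 0.2, 0.1      # assumed: keep top 20%, sample 10% of the rest
a = int(top_rate * n)                # number of large-gradient instances kept

# Sort instances by absolute gradient, largest first.
order = np.argsort(-np.abs(grad))
large_idx = order[:a]

# Randomly sample x% of the remaining (n - a) small-gradient instances.
n_small = int(other_rate * (n - a))
small_idx = rng.choice(order[a:], size=n_small, replace=False)

sampled = np.concatenate([large_idx, small_idx])

# Up-weight the sampled small-gradient instances so that they stand in
# for all (n - a) small-gradient instances.
weights = np.concatenate([np.ones(a), np.full(n_small, 1 / other_rate)])

print('sampled size:', len(sampled))   # a + x% * (n - a)
print('total weight:', weights.sum())
```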


### Implementation

*Reference*: [LightGBM docs](https://lightgbm.readthedocs.io/en/latest/Parameters.html)\
LightGBM has many parameters, which are separated into groups:
1. Core parameters
- task: Type of task to perform on your data. Default is `train`; common values are `train`/`predict`.
- objective: Type of problem - regression, binary or multiclass. Default is `regression`.
- boosting: Type of algorithm. Default is `gbdt`. Another common value is `goss`.
- num_iterations: Number of boosting iterations. Default is `100`.
- learning_rate: Default is `0.1`. Common values are `0.001`, `0.003`.
- num_leaves: Max number of leaves in one tree. Default is `31`.
- tree_learner: Type of distributed learning. Default is `serial`. Common values are `feature`, `data`, `voting`.
- device_type: `cpu` or `gpu`. Training on GPU can be faster than on CPU.
- data: Path of the training data.
- valid: Path of the validation data. Multiple validation datasets are supported.
2. Learning control parameters
- max_depth: The maximum depth of a tree. Default is `-1`, which means no limit.
- min_data_in_leaf: The minimum number of observations in a leaf. Default is `20`.
- feature_fraction: Fraction of features used in each tree. Default is `1.0`.
- bagging_fraction: Fraction of data used for subsampling. Default is `1.0`.
- early_stopping_round: The model stops training if one metric of one validation dataset doesn't improve in the last early_stopping_round rounds. Default is `0`.
- lambda_l1, lambda_l2: L1 and L2 regularization. Default is `0.0`.
- min_gain_to_split: The minimal gain required to perform a split. Default is `0.0`.
- max_cat_threshold: Limit on the number of split points considered for categorical features. Default is `32`.
3. IO parameters
- max_bin: Max number of bins that feature values are bucketed into. Default is `255`.
- categorical_feature: Used to specify categorical features. Fill in the indices or names of the columns.
- is_enable_sparse: Used to enable/disable sparse optimization. Default is `true`.
- enable_bundle: Whether to use EFB to bundle features. Default is `true`.
- weight_column: Used to specify the weight column. Fill in the index or name of the column.
- save_binary: If true, LightGBM saves the dataset (including validation data) to a binary file. This speeds up data loading the next time. Default is `false`.
4. Metric parameters
- metric: Metric(s) to be evaluated on the evaluation set(s). Default is `""`. Common values are `mae`, `mse`, `auc`, `binary_logloss`, ...


In [4]:
import lightgbm as lgbm

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

In [5]:
credit = pd.read_excel('data/credit_scoring.xlsx')
credit = credit.dropna().reset_index()
credit.head()

Unnamed: 0,index,Bad_customer,credit_balance_percent,age,num_of_group1_pastdue,debt_ratio,income,num_of_loans,num_of_times_late_90days,num_of_estate_loans,num_of_group2_pastdue,num_of_dependents
0,0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [6]:
y = credit.Bad_customer.values
x = credit.drop(columns='Bad_customer')

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [8]:
d_train = lgbm.Dataset(x_train, label=y_train)

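In [None]:
# Grid of LightGBM hyperparameters; every value must be given as a list
params = {
    'learning_rate': [0.001, 0.1],
    'boosting_type': ['gbdt'],  # 'goss' may also be accepted, depending on the LightGBM version
    'objective': ['binary'],
    'num_leaves': [31]
}

# Use LightGBM's Scikit-learn wrapper so that it works with GridSearchCV
gbm = lgbm.LGBMClassifier(n_estimators=100)
gbm = GridSearchCV(gbm, params, cv=3)
gbm = gbm.fit(x_train, y_train)
gbm.best_params_

In [None]:
y_train_pred = gbm.predict(x_train)
y_test_pred = gbm.predict(x_test)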

# 7. CatBoost

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*