<a href="https://colab.research.google.com/github/hamagami/is2021/blob/main/boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

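# Ensemble learning methods: Random Forest, AdaBoost, Gradient Boosting, XGBoost

You will compare several classification algorithms, including some ensemble methods. All of them can be used through the scikit-learn interface: most come from scikit-learn itself, and XGBoost comes from the separate xgboost package, which provides a scikit-learn-compatible classifier.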

In [20]:
import pandas as pd #pandas dataframe
from sklearn.datasets import load_breast_cancer #dataset
from sklearn.model_selection import train_test_split # splits data into training and test sets
from sklearn import model_selection

In [21]:
from sklearn.linear_model import LogisticRegression   #Logistic regression
from sklearn.tree import DecisionTreeClassifier       #Decision Tree
from sklearn.ensemble import RandomForestClassifier   #Random Forest(ensemble)
from sklearn.ensemble import AdaBoostClassifier       #AdaBoost(ensemble)
from sklearn.ensemble import GradientBoostingClassifier #GradientBoosting(ensemble)
from xgboost import XGBClassifier                     #XGBoost(ensemble)


## Prepare the dataset

In [24]:
seed = 0
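# Sample data: the breast cancer dataset (binary classification)
loaded_data = load_breast_cancer()
# Hold out part of the data for testing (default split: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(loaded_data.data, loaded_data.target, random_state=seed)
# 5-fold cross-validation splitter
kfold = model_selection.KFold(n_splits=5)
# Scores are collected here, keyed by (model name, score type)
scores = {}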

## Logistic regression
The simplest classifier in this comparison (covered in Part 2). It separates the two classes with a linear decision boundary by fitting a logistic function to the class probabilities. As a result, it cannot handle problems where the classes are not linearly separable.

In [26]:
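# Logistic regression
lr_clf = LogisticRegression(solver='lbfgs', max_iter=10000)
lr_clf.fit(X_train, y_train)
# Mean 5-fold cross-validation accuracy.
# Note: it is computed on the held-out split (X_test, y_test), not on the
# training data, even though it is stored under the 'train_score' key.
results = model_selection.cross_val_score(lr_clf, X_test, y_test, cv=kfold)
scores[('1.Logistic_regression', 'train_score')] = results.mean()
# Accuracy of the model fitted on the training split, measured on the held-out split
scores[('1.Logistic_regression', 'test_score')] = lr_clf.score(X_test, y_test)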
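
To make the "linear boundary plus logistic function" idea concrete, the following is a minimal sketch, not part of the original notebook. It recomputes the class-1 probability by hand from the fitted coefficients and compares it with `predict_proba`. The variable names (`linear_score`, `manual_proba`, and so on) are illustrative assumptions.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic (sigmoid) function
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

clf = LogisticRegression(solver='lbfgs', max_iter=10000).fit(X_train, y_train)

# Linear score w.x + b for every test sample
linear_score = X_test @ clf.coef_.ravel() + clf.intercept_[0]

# The logistic function maps the linear score to P(class 1)
manual_proba = expit(linear_score)
matches = np.allclose(manual_proba, clf.predict_proba(X_test)[:, 1])

# The predicted class is 1 exactly when the linear score is positive,
# so the decision boundary is the hyperplane w.x + b = 0
manual_pred = (linear_score > 0).astype(int)
same_pred = np.array_equal(manual_pred, clf.predict(X_test))

print('probabilities match:', matches)
print('predictions match:', same_pred)
```

Because the boundary is the single hyperplane where the linear score is zero, no choice of coefficients can bend it around a nonlinear class layout.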


## Decision tree
Decision trees were covered in the previous article, Part 6. A decision tree classifies by repeatedly splitting the data using simple rules. This fits the training data very well, but the tree tends to overfit: generalization performance is low, and the accuracy on test data does not improve.

In [27]:
# Decision tree
dtc_clf = DecisionTreeClassifier(random_state=seed) 
dtc_clf.fit(X_train, y_train)
results = model_selection.cross_val_score(dtc_clf, X_test, y_test, cv = kfold) 
scores[('2.decision_tree', 'train_score')] = results.mean() 
scores[('2.decision_tree', 'test_score')] = dtc_clf.score(X_test, y_test)
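
The overfitting described above can be observed by comparing training and test accuracy directly. This is a minimal sketch under the same split as the notebook. The depth limit `max_depth=3` is an arbitrary illustrative choice, not a value taken from the original.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Unconstrained tree: keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-limited tree (illustrative limit) for comparison
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

full_train = full_tree.score(X_train, y_train)
full_test = full_tree.score(X_test, y_test)
shallow_train = shallow_tree.score(X_train, y_train)
shallow_test = shallow_tree.score(X_test, y_test)

print('full tree    depth', full_tree.get_depth(), 'train', full_train, 'test', full_test)
print('shallow tree depth', shallow_tree.get_depth(), 'train', shallow_train, 'test', shallow_test)
```

The gap between the training and test accuracy of the unconstrained tree is the symptom of overfitting. Whether the depth limit helps on the test split depends on the data and should be checked rather than assumed.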


## Random forest
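A random forest is an ensemble method that uses decision trees as its weak learners. Each tree is trained on a bootstrap sample of the training data and considers only a random subset of the attributes at each split. The number of trees, their depth, and the number of attributes considered per split need to be chosen carefully, but the algorithm tends to perform well on average.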

In [28]:
# Random forest
rfc_clf = RandomForestClassifier(max_depth=5, random_state=seed) 
rfc_clf.fit(X_train, y_train)
results = model_selection.cross_val_score(rfc_clf, X_test, y_test, cv = kfold) 
scores[('3.Random Forest', 'train_score')] = results.mean()
scores[('3.Random Forest', 'test_score')] = rfc_clf.score(X_test, y_test)
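
The mechanism behind a random forest can be sketched by hand: draw a bootstrap sample, fit a tree that looks at a random subset of attributes per split, and combine the trees by voting. This is a simplified illustration, not scikit-learn's implementation. `RandomForestClassifier` averages predicted probabilities rather than counting hard votes, and `n_trees = 25` is an assumed value chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a majority vote can never tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features='sqrt' makes each split consider a random subset of attributes
    tree = DecisionTreeClassifier(max_depth=5, max_features='sqrt', random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Hard majority vote over the 0/1 predictions of all trees
votes = np.stack([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_acc = (forest_pred == y_test).mean()

single_accs = [(t.predict(X_test) == y_test).mean() for t in trees]
print('mean single-tree accuracy:', np.mean(single_accs))
print('majority-vote accuracy:  ', forest_acc)
```

Comparing the two printed numbers shows how much the vote gains over a typical individual tree on this split.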


## AdaBoost (Adaptive Boosting)
Boosting improves performance by adding weak learners sequentially. AdaBoost "adaptively" adds each new learner to compensate for the weaknesses of the learners created so far, by giving more weight to the training samples they misclassified. A notable feature of AdaBoost is that it can achieve high performance with a small amount of computation.

In [29]:
# AdaBoost
adb_clf = AdaBoostClassifier(n_estimators=100, random_state=seed) 
adb_clf.fit(X_train, y_train)
results = model_selection.cross_val_score(adb_clf, X_test, y_test, cv = kfold) 
scores[('4.AdaBoost', 'train_score')] = results.mean()
scores[('4.AdaBoost', 'test_score')] = adb_clf.score(X_test, y_test)
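
The "adaptive" part is the sample re-weighting between rounds. The sketch below runs a single round of the textbook discrete AdaBoost update with a depth-1 tree (a stump) as the weak learner. It is an illustration under that assumption. The algorithm inside `AdaBoostClassifier` differs in its details, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

n = len(y_train)
weights = np.full(n, 1.0 / n)  # round 1 starts from uniform sample weights

# Weak learner: a decision stump trained with the current weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_train, y_train, sample_weight=weights)
miss = stump.predict(X_train) != y_train

# Weighted error of the stump and its vote weight alpha
err = weights[miss].sum()
alpha = 0.5 * np.log((1.0 - err) / err)

# Increase the weights of misclassified samples, decrease the others, renormalize
new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
new_weights /= new_weights.sum()

print('weighted error:', err)
print('alpha:', alpha)
print('weight mass on misclassified samples after update:', new_weights[miss].sum())
```

After this update the misclassified samples carry exactly half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.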


## GradientBoosting
Like AdaBoost, Gradient Boosting builds its ensemble sequentially, adding one new weak learner at a time. The difference is in how each new learner is guided: AdaBoost changes the weights of the training samples, whereas Gradient Boosting fits each new learner to the residual errors of the current ensemble.

In [30]:
# GradientBoosting (GBM)
gbm_clf = GradientBoostingClassifier(random_state=seed)
gbm_clf.fit(X_train, y_train)
results = model_selection.cross_val_score(gbm_clf, X_test, y_test, cv = kfold) 
scores[('5.GBM', 'train_score')] = results.mean()
scores[('5.GBM', 'test_score')] = gbm_clf.score(X_test, y_test)
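
Fitting to residuals can be sketched in a few lines. For simplicity, this illustration uses squared error on the 0/1 labels with small regression trees, which is an assumption made here. `GradientBoostingClassifier` optimizes a classification loss instead, so this is not its exact procedure. `learning_rate`, `n_rounds`, and `max_depth=3` are illustrative values.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the training labels
f_train = np.full(len(y_train), y_train.mean())
f_test = np.full(len(y_test), y_train.mean())
mse_history = [np.mean((y_train - f_train) ** 2)]

for _ in range(n_rounds):
    # Residuals of the current ensemble (negative gradient of squared error)
    residual = y_train - f_train
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_train, residual)
    # Add a shrunken copy of the new tree to the ensemble
    f_train += learning_rate * tree.predict(X_train)
    f_test += learning_rate * tree.predict(X_test)
    mse_history.append(np.mean((y_train - f_train) ** 2))

# Threshold the ensemble output at 0.5 to obtain class labels
test_acc = ((f_test > 0.5).astype(int) == y_test).mean()
print('training MSE: first', mse_history[0], 'last', mse_history[-1])
print('test accuracy:', test_acc)
```

Each round reduces the training error a little, which is why the number of rounds and the learning rate are usually tuned together.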


## XGBoost
XGBoost stands for "eXtreme Gradient Boosting". It is an implementation of gradient boosting with decision trees as the weak learners. It has very high generalization capability and is often used in top-ranking solutions in Kaggle competitions.

In [32]:
# xgboost
xgb_clf = XGBClassifier() 
xgb_clf.fit(X_train, y_train)
results = model_selection.cross_val_score(xgb_clf, X_test, y_test, cv = kfold) 
scores[('6.xgboost', 'train_score')] = results.mean()
scores[('6.xgboost', 'test_score')] = xgb_clf.score(X_test, y_test)


## Evaluation of all algorithms

In [33]:
# Model evaluation: one row per model, one column per score type
pd.Series(scores).unstack()

| | test_score | train_score |
|---|---|---|
| 1.Logistic_regression | 0.951049 | 0.936946 |
| 2.decision_tree | 0.881119 | 0.91601 |
| 3.Random Forest | 0.972028 | 0.944335 |
| 4.AdaBoost | 0.986014 | 0.943842 |
| 5.GBM | 0.965035 | 0.936946 |
| 6.xgboost | 0.979021 | 0.930049 |
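
The per-model cells repeat the same three steps (fit, cross-validate, score), so the comparison can also be written as a loop. The sketch below is one possible variant, not the notebook's method. It cross-validates on the training split (`X_train`, `y_train`) instead of the held-out split, so its numbers will differ from the table. The column name `cv_train_score` is an assumption introduced here, and only three of the models are included to keep the example short.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 0
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=seed)
kfold = KFold(n_splits=5)

models = {
    '1.Logistic_regression': LogisticRegression(solver='lbfgs', max_iter=10000),
    '2.decision_tree': DecisionTreeClassifier(random_state=seed),
    '3.Random Forest': RandomForestClassifier(max_depth=5, random_state=seed),
}

rows = {}
for name, model in models.items():
    # Mean 5-fold cross-validation accuracy on the training split only
    cv_scores = cross_val_score(model, X_train, y_train, cv=kfold)
    # Refit on the whole training split, then score once on the held-out split
    model.fit(X_train, y_train)
    rows[name] = {
        'cv_train_score': cv_scores.mean(),
        'test_score': model.score(X_test, y_test),
    }

summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)
```

Keeping cross-validation on the training split leaves the held-out split untouched until the final score, which makes the two columns easier to interpret.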
