*Python Machine Learning 3rd Edition* by [Sebastian Raschka](https://sebastianraschka.com), Packt Publishing Ltd. 2019

Code Repository: https://github.com/rasbt/python-machine-learning-book-3rd-edition

Code License: [MIT License](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/LICENSE.txt)

# Chapter 7 - Combining Different Models for Ensemble Learning

<br>
<br>

### Overview

- [Learning with ensembles](#Learning-with-ensembles)
- [Combining classifiers via majority vote](#Combining-classifiers-via-majority-vote)
    - [Implementing a simple majority vote classifier](#Implementing-a-simple-majority-vote-classifier)
    - [Using the majority voting principle to make predictions](#Using-the-majority-voting-principle-to-make-predictions)
    - [Evaluating and tuning the ensemble classifier](#Evaluating-and-tuning-the-ensemble-classifier)
- [Bagging – building an ensemble of classifiers from bootstrap samples](#Bagging----Building-an-ensemble-of-classifiers-from-bootstrap-samples)
    - [Bagging in a nutshell](#Bagging-in-a-nutshell)
    - [Applying bagging to classify examples in the Wine dataset](#Applying-bagging-to-classify-examples-in-the-Wine-dataset)
- [Leveraging weak learners via adaptive boosting](#Leveraging-weak-learners-via-adaptive-boosting)
    - [How boosting works](#How-boosting-works)
    - [Applying AdaBoost using scikit-learn](#Applying-AdaBoost-using-scikit-learn)
- [Summary](#Summary)

<br>
<br>

# Bagging -- Building an ensemble of classifiers from bootstrap samples

In [1]:
from IPython.display import Image

Image(filename='./images/07_06.png', width=500)

## Bagging in a nutshell

In [None]:
Image(filename='./images/07_07.png', width=400) 

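Bagging draws examples at random from the training dataset to build each bootstrap sample.  
Because the draws are made with replacement, a single bootstrap sample can contain the same example more than once.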
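
As a rough illustration of drawing bootstrap samples, the sketch below uses a toy set of seven example indices rather than the Wine data. The helper name `bootstrap_sample`, the seed, and the number of rounds are assumptions made for this example only.

```python
import numpy as np


def bootstrap_sample(indices, rng):
    """Draw a sample of the same size as `indices`, with replacement."""
    return rng.choice(indices, size=indices.size, replace=True)


rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only
indices = np.arange(1, 8)            # toy dataset: example indices 1..7

for bagging_round in range(1, 4):
    sample = bootstrap_sample(indices, rng)
    n_unique = np.unique(sample).size
    print(f'Bagging round {bagging_round}: {np.sort(sample)} '
          f'({n_unique} unique examples)')
```

Each round typically repeats some indices and omits others, which is what makes the individual classifiers in the ensemble differ from one another.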

## Applying bagging to classify examples in the Wine dataset

In [None]:
import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:

# df_wine = pd.read_csv('wine.data', header=None)

# drop 1 class
df_wine = df_wine[df_wine['Class label'] != 1]

y = df_wine['Class label'].values
X = df_wine[['Alcohol', 'OD280/OD315 of diluted wines']].values

Import the data, then use `Class label` as the target variable and `Alcohol` and `OD280/OD315 of diluted wines` as the features.

In [None]:
df_wine.info()

All columns are of type `float` or `int`, and there are no missing values.

In [None]:
# y before encoding
y

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
            train_test_split(X, y, 
                             test_size=0.2, 
                             random_state=1,
                             stratify=y)

In [None]:
# y after encoding
y

### LabelEncoder
>Encode target labels with value between 0 and n_classes-1  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
### train_test_split
>Split arrays or matrices into random train and test subsets  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

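##### Parameters  
>random_state: by default (`shuffle=True`) the rows are shuffled before the data is split; `random_state` fixes the seed of that shuffle so that the split is reproducible  
stratify: stratified sampling; passing `stratify=y` keeps the class proportions of `y` in both subsets  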
https://bellcurve.jp/statistics/course/8007.html
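
To make the effect of `LabelEncoder` and `stratify` more concrete, here is a small sketch on made-up labels. The arrays `y_raw` and `X_toy` and the 60/40 class ratio are assumptions for illustration, not values taken from the Wine dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Hypothetical labels that mimic having two remaining classes, 2 and 3
y_raw = np.array([2] * 60 + [3] * 40)
X_toy = np.arange(len(y_raw)).reshape(-1, 1)  # placeholder feature column

le_demo = LabelEncoder()
y_enc = le_demo.fit_transform(y_raw)
print(le_demo.classes_)   # the original labels, in sorted order
print(np.unique(y_enc))   # the encoded labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_enc, test_size=0.2, random_state=1, stratify=y_enc)

# With stratify, both subsets keep the class proportions of y_enc
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```

Without `stratify`, the proportions in the two subsets would only match the full dataset approximately, which matters most for small or imbalanced datasets.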


In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# decision tree model
tree = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=None,
                              random_state=1)
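# bagging ensemble
# note: scikit-learn 1.2 renamed `base_estimator` to `estimator`;
# on recent versions, pass `estimator=tree` instead
bag = BaggingClassifier(base_estimator=tree,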
                        n_estimators=500, 
                        max_samples=1.0, 
                        max_features=1.0, 
                        bootstrap=True, 
                        bootstrap_features=False, 
                        n_jobs=1, 
                        random_state=1)

### BaggingClassifier
>n_jobs: the number of jobs to run in parallel  
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
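
Conceptually, bagging fits one classifier per bootstrap sample and combines their predictions. The sketch below is a simplified, hand-rolled version of that idea using a hard majority vote on synthetic data. It is not how `BaggingClassifier` is implemented (scikit-learn differs in details such as how predictions are aggregated), and the names `X_demo`, `y_demo`, `trees`, and `votes` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature, two-class data (a stand-in, not the Wine data)
X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=1)

rng = np.random.default_rng(1)
n_estimators = 25  # odd, to avoid ties; far fewer than the 500 used above

trees = []
for _ in range(n_estimators):
    # bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    t = DecisionTreeClassifier(criterion='entropy', random_state=1)
    trees.append(t.fit(X_demo[idx], y_demo[idx]))

# Hard majority vote: predict class 1 where more than half the trees do
votes = np.array([t.predict(X_demo) for t in trees])  # (n_estimators, n_samples)
y_vote = (votes.mean(axis=0) > 0.5).astype(int)

print('Hand-rolled bagging training accuracy: %.3f'
      % np.mean(y_vote == y_demo))
```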

In [None]:
from sklearn.metrics import accuracy_score

# predictions of the decision tree
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
# training and test accuracy of the decision tree
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

# predictions of the bagging ensemble
bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)
# training and test accuracy of the bagging ensemble
bag_train = accuracy_score(y_train, y_train_pred) 
bag_test = accuracy_score(y_test, y_test_pred) 
print('Bagging train/test accuracies %.3f/%.3f'
      % (bag_train, bag_test))

### accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
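
With its default settings, `accuracy_score` returns the fraction of predictions that match the true labels. The short check below uses made-up label arrays (`y_true` and `y_pred` are illustrative, not outputs of the models above) to compare it with a manual computation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

acc_manual = np.mean(y_true == y_pred)        # 4 of the 5 predictions match
acc_sklearn = accuracy_score(y_true, y_pred)
print('%.3f %.3f' % (acc_manual, acc_sklearn))
```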

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# compute the min and max of each feature -> defines the range of the mesh
x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1

# mesh grid
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# set up the figure and its two subplots
f, axarr = plt.subplots(nrows=1, ncols=2, 
                        sharex='col', 
                        sharey='row', 
                        figsize=(8, 3))

# plot the decision regions of each classifier
for idx, clf, tt in zip([0, 1],  # index, classifier, title
                        [tree, bag],
                        ['Decision tree', 'Bagging']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # ravel -> flatten to 1D
    Z = Z.reshape(xx.shape)  # 1D (2430,) -> back to 2D (45, 54)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)  # contourf -> filled contours; alpha -> opacity
    axarr[idx].scatter(X_train[y_train == 0, 0],  # `Alcohol` values where the class label is 0
                       X_train[y_train == 0, 1],  # `OD280/OD315 of diluted wines` values where the class label is 0
                       c='blue', marker='^')

    axarr[idx].scatter(X_train[y_train == 1, 0],  # `Alcohol` values where the class label is 1
                       X_train[y_train == 1, 1],  # `OD280/OD315 of diluted wines` values where the class label is 1
                       c='green', marker='o')

    axarr[idx].set_title(tt)

axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)

plt.tight_layout()
plt.text(0, -0.2,
         s='Alcohol',
         ha='center',  # horizontal alignment
         va='center',  # vertical alignment
         fontsize=12,
         transform=axarr[1].transAxes)  # interpret (x, y) as relative coordinates within the Axes
plt.show()
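
The grid handling in the plotting code above (`meshgrid`, `ravel`, `np.c_`, `reshape`) may be easier to follow on a tiny grid. In this sketch the ranges are arbitrary and a simple threshold rule stands in for `clf.predict`; both are assumptions for illustration.

```python
import numpy as np

# Small illustrative ranges (not the Wine feature ranges)
gx, gy = np.meshgrid(np.arange(0, 3, 1.0),   # 3 values along the x axis
                     np.arange(0, 2, 1.0))   # 2 values along the y axis
print(gx.shape)                              # 2 rows x 3 columns of grid points

grid_points = np.c_[gx.ravel(), gy.ravel()]  # one (x, y) row per grid point
print(grid_points.shape)

# Stand-in for clf.predict: label a point 1 if x + y > 1.5, else 0
Z_demo = (grid_points.sum(axis=1) > 1.5).astype(int)
Z_demo = Z_demo.reshape(gx.shape)            # back to the 2D grid layout
print(Z_demo)
```

Reshaping the predictions back to the grid's shape is what lets `contourf` draw them as regions over the two feature axes.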

<br>
<br>

# Leveraging weak learners via adaptive boosting

## How boosting works

In [None]:
Image(filename='images/07_09.png', width=400) 

In [None]:
Image(filename='images/07_10.png', width=500) 
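
As a numerical companion to the figures, the sketch below walks through a single weight update, assuming the standard AdaBoost formulation with class labels in {-1, 1}. The ten labels and the weak learner's predictions are made up for this example, and the variable names are not taken from the notebook.

```python
import numpy as np

# Toy labels and weak-learner predictions in {-1, 1} (made up for illustration)
y_true = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])

w = np.full(len(y_true), 1 / len(y_true))      # start from uniform weights

epsilon = np.sum(w[y_pred != y_true])          # weighted error rate
alpha = 0.5 * np.log((1 - epsilon) / epsilon)  # coefficient of this weak learner

# Increase the weights of misclassified examples, decrease the others
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()                           # renormalize so the weights sum to 1

print('epsilon=%.3f alpha=%.3f' % (epsilon, alpha))
print(np.round(w_new, 3))
```

After the update, the three misclassified examples carry more weight than the correctly classified ones, so the next weak learner is pushed to focus on them.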

## Applying AdaBoost using scikit-learn

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# create the weak learner (a decision tree of depth 1)
tree = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=1,
                              random_state=1)
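# create the AdaBoost ensemble
# note: scikit-learn 1.2 renamed `base_estimator` to `estimator`;
# on recent versions, pass `estimator=tree` instead
ada = AdaBoostClassifier(base_estimator=tree,
                         n_estimators=500,
                         learning_rate=0.1,  # scikit-learn's default is 1.0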
                         random_state=1)

In [None]:
# predictions of the weak learner
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
# training and test accuracy of the weak learner
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

# predictions of AdaBoost
ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
# training and test accuracy of AdaBoost
ada_train = accuracy_score(y_train, y_train_pred) 
ada_test = accuracy_score(y_test, y_test_pred) 
print('AdaBoost train/test accuracies %.3f/%.3f'
      % (ada_train, ada_test))

In [None]:
# compute the min and max of each feature -> defines the range of the mesh
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
# mesh grid
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
# set up the figure and its two subplots
f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(8, 3))

# plot the decision regions of each classifier
for idx, clf, tt in zip([0, 1],  # index, classifier, title
                        [tree, ada],
                        ['Decision tree', 'AdaBoost']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # ravel -> flatten to 1D
    Z = Z.reshape(xx.shape)  # 1D (2430,) -> back to 2D (45, 54)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)  # contourf -> filled contours; alpha -> opacity
    axarr[idx].scatter(X_train[y_train == 0, 0],  # `Alcohol` values where the class label is 0
                       X_train[y_train == 0, 1],  # `OD280/OD315 of diluted wines` values where the class label is 0
                       c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train == 1, 0],  # `Alcohol` values where the class label is 1
                       X_train[y_train == 1, 1],  # `OD280/OD315 of diluted wines` values where the class label is 1
                       c='green', marker='o')
    axarr[idx].set_title(tt)

axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)

plt.tight_layout()
plt.text(0, -0.2,
         s='Alcohol',
         ha='center',  # horizontal alignment
         va='center',  # vertical alignment
         fontsize=12,
         transform=axarr[1].transAxes)  # interpret (x, y) as relative coordinates within the Axes
plt.show()

<br>
<br>

# Summary

...

---

Readers may ignore the next cell.

In [None]:
! python ../.convert_notebook_to_script.py --input ch07.ipynb --output ch07.py