<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.1: Bagging

INSTRUCTIONS:

- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the scenario below.
- The baseline results (minimum) are:
    - **Accuracy** = 0.9667
    - **ROC AUC**  = 0.9614
- Try to achieve better results!

# Foreword
It is common that companies and professionals start with the data immediately available. Although this approach works, ideally the first step is to identify the problem or question and only then identify and obtain the set of data that can help to solve or answer the problem.

Also, given the current abundance of data, processing power and some particular machine learning methods, there could be a temptation to use ALL the data available. **Quality** is _**better**_ than **Quantity**!

Part of calling this discipline **Data Science** is that it is supposed to follow a process and not reach conclusions without support from evidence.

Moreover, it is a creative, exploratory, laborious, iterative and interactive process. It is part of the process to repeat, review and change when finding a dead-end.

## Scenario: Predicting Breast Cancer
The dataset you are going to be using for this laboratory is popularly known as the **Wisconsin Breast Cancer** dataset. The task related to it is Classification.

The dataset contains _699_ instances, each described by _10_ attributes and labelled as either **benign** or **malignant**. Of these, _16_ attribute values are missing. All values are numeric, apart from the missing entries, which are coded as `?`.

In [57]:
import itertools
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

from sklearn import datasets

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import GaussianNB

from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions

from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score

# Step 1: Define the problem or question
Identify the subject matter and the given or obvious questions that would be relevant in the field.

## Potential Questions
List the given or obvious questions.

## Actual Question
Choose the **one** question that should be answered.

# Step 2: Find the Data
### Wisconsin Breast Cancer DataSet
- **Citation Request**

    This breast cancer database was obtained from the **University of Wisconsin Hospitals**, **Madison** from **Dr. William H. Wolberg**. If you publish results when using this database, then please include this information in your acknowledgements.

- **Title**

    Wisconsin Breast Cancer Database (January 8, 1991)

- **Sources**
    - **Creator**
            Dr. William H. Wolberg (physician)
            University of Wisconsin Hospitals
            Madison, Wisconsin
            USA
    - **Donor**
            Olvi Mangasarian (mangasarian@cs.wisc.edu)
            Received by David W. Aha (aha@cs.jhu.edu)
    - **Date**
            15 July 1992
        
### UCI - Machine Learning Repository
- Center for Machine Learning and Intelligent Systems

The [**UCI Machine Learning Repository**](http://archive.ics.uci.edu/ml/about.html) is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

# Step 3: Read the Data
- Read the data
- Perform some basic structural cleaning to facilitate the work

In [2]:
df = pd.read_csv('breast-cancer-wisconsin-data-old.csv')

In [5]:
df.head()

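|   | 1000025 | 5 | 1 | 1.1 | 1.2 | 2 | 1.3 | 3 | 1.4 | 1.5 | 2.1 |
|---|---------|---|---|-----|-----|---|-----|---|-----|-----|-----|
| 0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |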


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   1000025  698 non-null    int64 
 1   5        698 non-null    int64 
 2   1        698 non-null    int64 
 3   1.1      698 non-null    int64 
 4   1.2      698 non-null    int64 
 5   2        698 non-null    int64 
 6   1.3      698 non-null    object
 7   3        698 non-null    int64 
 8   1.4      698 non-null    int64 
 9   1.5      698 non-null    int64 
 10  2.1      698 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.1+ KB
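
The summary above reports 698 entries rather than 699, and the column labels (`1000025`, `5`, `1`, `1.1`, ...) look like data values, which suggests that the file has no header row and that the first record was consumed as the header. A minimal sketch of reading such a file with explicit names is shown below. The two sample rows are taken from the output above; the snake_case column names are an assumption based on the attribute list in the UCI documentation and are not part of this lab's file.

```python
import io
import pandas as pd

# Two sample records standing in for the real CSV file (no header row).
raw = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
)

# Assumed names, following the UCI attribute list; adjust to taste.
column_names = [
    "sample_id", "clump_thickness", "cell_size_uniformity",
    "cell_shape_uniformity", "marginal_adhesion", "epithelial_cell_size",
    "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "label",
]

# header=None tells pandas that the first line is data, not column labels.
df_named = pd.read_csv(raw, header=None, names=column_names)
print(df_named)
```

The rest of this lab keeps the original, auto-generated column labels, so adopting this approach would require renaming the columns used in later cells.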


In [15]:
df.drop(columns = ['1000025'], inplace = True)

In [16]:
for col in df:
    print(f'This {col} have {df[col].unique()} values')

This 5 have [ 5  3  6  4  8  1  2  7 10  9] values
This 1 have [ 4  1  8 10  2  3  7  5  6  9] values
This 1.1 have [ 4  1  8 10  2  3  5  6  7  9] values
This 1.2 have [ 5  1  3  8 10  4  6  2  9  7] values
This 2 have [ 7  2  3  1  6  4  5  8 10  9] values
This 1.3 have ['10' '2' '4' '1' '3' '9' '7' '?' '5' '8' '6'] values
This 3 have [ 3  9  1  2  4  5  7  8  6 10] values
This 1.4 have [ 2  1  7  4  5  3 10  6  9  8] values
This 1.5 have [ 1  5  4  2  3  7 10  8  6] values
This 2.1 have [2 4] values


In [17]:
df['1.3'].value_counts()

1     401
10    132
2      30
5      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: 1.3, dtype: int64

In [19]:
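# Treat the 16 '?' entries as the most frequent value ('1').
# Assign the result back instead of replacing in place on a column selection.
df['1.3'] = df['1.3'].replace({'?': '1'})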

In [22]:
df['1.3'] = df['1.3'].astype('int64')
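
Replacing `?` with the most frequent value is one option. An alternative, sketched here on a toy series rather than the lab data, is to convert the placeholder to `NaN` first so that the number of missing entries stays visible, and only then impute. The series name and the choice of the median are assumptions for illustration.

```python
import pandas as pd

# Toy column with one '?' placeholder, mimicking the coded missing values.
toy = pd.Series(["10", "2", "?", "1", "1"], name="bare_nuclei")

# errors="coerce" turns anything non-numeric (here '?') into NaN.
numeric = pd.to_numeric(toy, errors="coerce")
n_missing = int(numeric.isna().sum())

# Impute with the median of the observed values (an illustrative choice).
filled = numeric.fillna(numeric.median())

print(f"missing before imputation: {n_missing}")
print(filled.tolist())
```

Whichever strategy is used, it is worth recording how many values were imputed, since 16 of the records are affected in this dataset.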

# Step 4: Explore and Clean the Data
- Perform some initial simple **EDA** (Exploratory Data Analysis)
- Check for
    - **Number of features**
    - **Data types**
    - **Domains, Intervals**
    - **Outliers** (are they valid or spurious data [read or measurement errors])
    - **Null** (values not present or coded [as zero or empty strings])
    - **Missing Values** (coded [as zero or empty strings] or values not present)
    - **Coded content** (classes identified by numbers or codes to represent absence of data)
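
Several of the checks above can be gathered into one per-column summary. The sketch below uses a small made-up frame; the column names and the list of placeholder codes are assumptions and should be adapted to the data at hand.

```python
import pandas as pd

# Toy frame: one column hides a missing value behind a '?' code.
toy = pd.DataFrame({
    "thickness": [5, 3, 6, 4],
    "bare_nuclei": ["10", "?", "4", "1"],
    "label": [2, 2, 4, 2],
})

# Codes that may stand in for missing data (assumed for this example).
placeholders = ["?", ""]

summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),                 # data type per column
    "n_unique": toy.nunique(),                       # size of each domain
    "n_null": toy.isna().sum(),                      # true nulls
    "n_placeholder": toy.isin(placeholders).sum(),   # coded missing values
})
print(summary)
```

A column that reports a non-numeric type together with a non-zero placeholder count is a strong hint that missing values were coded as text.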

In [23]:
for col in df:
    print(f'This {col} have {df[col].unique()} values')

This 5 have [ 5  3  6  4  8  1  2  7 10  9] values
This 1 have [ 4  1  8 10  2  3  7  5  6  9] values
This 1.1 have [ 4  1  8 10  2  3  5  6  7  9] values
This 1.2 have [ 5  1  3  8 10  4  6  2  9  7] values
This 2 have [ 7  2  3  1  6  4  5  8 10  9] values
This 1.3 have [10  2  4  1  3  9  7  5  8  6] values
This 3 have [ 3  9  1  2  4  5  7  8  6 10] values
This 1.4 have [ 2  1  7  4  5  3 10  6  9  8] values
This 1.5 have [ 1  5  4  2  3  7 10  8  6] values
This 2.1 have [2 4] values


In [24]:
df.describe()

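|       | 5 | 1 | 1.1 | 1.2 | 2 | 1.3 | 3 | 1.4 | 1.5 | 2.1 |
|-------|---|---|-----|-----|---|-----|---|-----|-----|-----|
| count | 698.0 | 698.0 | 698.0 | 698.0 | 698.0 | 698.0 | 698.0 | 698.0 | 698.0 | 698.0 |
| mean  | 4.416905 | 3.137536 | 3.210602 | 2.809456 | 3.217765 | 3.489971 | 3.438395 | 2.869628 | 1.590258 | 2.690544 |
| std   | 2.817673 | 3.052575 | 2.972867 | 2.856606 | 2.215408 | 3.623301 | 2.440056 | 3.055004 | 1.716162 | 0.951596 |
| min   | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25%   | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50%   | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75%   | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max   | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |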


In [25]:
df['2.1'].value_counts()

2    457
4    241
Name: 2.1, dtype: int64

In [26]:
df['2.1'].value_counts(normalize = True)

2    0.654728
4    0.345272
Name: 2.1, dtype: float64

# Step 5: Prepare the Data
- Deal with the data as required by the modelling technique
    - **Outliers** (remove or adjust if possible or necessary)
    - **Null** (remove or interpolate if possible or necessary)
    - **Missing Values** (remove or interpolate if possible or necessary)
    - **Coded content** (transform if possible or necessary [str to number or vice-versa])
    - **Normalisation** (if possible or necessary)
    - **Feature Engineering** (if useful or necessary)

In [27]:
X = df.drop(columns = ['2.1'])
y = df['2.1']
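
The label column holds the codes 2 and 4. According to the UCI documentation these stand for benign and malignant respectively. Recoding them to 0 and 1 is optional for scikit-learn classifiers, but it makes the positive class explicit when reading metrics. A minimal sketch on a toy series (the name `y_raw` is hypothetical):

```python
import pandas as pd

# Toy label column using the dataset's coding: 2 = benign, 4 = malignant.
y_raw = pd.Series([2, 4, 2, 2, 4], name="label")

# Map the codes to 0/1 so that 1 unambiguously means "malignant".
y_binary = y_raw.map({2: 0, 4: 1})

print(y_binary.tolist())
```

The rest of this lab keeps the original 2/4 coding, which scikit-learn handles without modification.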

# Step 6: Modelling
Refer to the Problem and Main Question.
- What are the input variables (features)?
- Is there an output variable (label)?
- If there is an output variable:
    - What is it?
    - What is its type?
- What type of Modelling is it?
    - [ ] Supervised
    - [ ] Unsupervised 
- What type of Modelling is it?
    - [ ] Regression
    - [ ] Classification (binary) 
    - [ ] Classification (multi-class)
    - [ ] Clustering

In [72]:
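# Supervised, binary classification
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; the old name was removed in 1.4.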
np.random.seed(0)

knn_clf = KNeighborsClassifier(n_neighbors = 5)
dt_clf = DecisionTreeClassifier(criterion = 'entropy', max_depth = 2, random_state =0)
reg_clf = LogisticRegression(random_state = 0)
svm_clf = svm.SVC(kernel = 'linear')
nb_clf = GaussianNB()

bagging_knn = BaggingClassifier(
    base_estimator = knn_clf,
    n_estimators = 10,
    max_samples = 0.8,
    max_features = 0.8)

bagging_dt = BaggingClassifier(
    base_estimator = dt_clf,
    n_estimators = 10,
    max_samples = 0.8,
    max_features = 0.8)

bagging_reg = BaggingClassifier(
    base_estimator = reg_clf,
    n_estimators = 10,
    max_samples = 0.8,
    max_features = 0.8)

bagging_svm = BaggingClassifier(
    base_estimator = svm_clf,
    n_estimators = 10,
    max_samples = 0.8,
    max_features = 0.8)

bagging_nb = BaggingClassifier(
    base_estimator = nb_clf,
    n_estimators = 10,
    max_samples = 0.8,
    max_features = 0.8)
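
To make the idea behind the ensembles above concrete, the following is a hand-rolled sketch of bagging on synthetic data: each tree is trained on a bootstrap sample (rows drawn with replacement) and the ensemble predicts by majority vote. It is a simplification for illustration, not how `BaggingClassifier` is implemented: it ignores feature subsampling (`max_features`) and uses hard voting, whereas scikit-learn averages predicted probabilities when the base estimator provides them. The function names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 0/1 labels, standing in for the lab data.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
rng = np.random.default_rng(0)


def fit_bagged_trees(X, y, n_estimators=10, max_samples=0.8):
    """Fit one shallow tree per bootstrap sample of the rows."""
    n_draw = int(max_samples * len(X))
    models = []
    for _ in range(n_estimators):
        # Sampling with replacement: some rows repeat, others are left out.
        idx = rng.integers(0, len(X), size=n_draw)
        tree = DecisionTreeClassifier(max_depth=2, random_state=0)
        models.append(tree.fit(X[idx], y[idx]))
    return models


def predict_majority(models, X):
    """Hard vote: predict class 1 when at least half of the trees do."""
    votes = np.stack([m.predict(X) for m in models])  # (n_estimators, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


bagged_models = fit_bagged_trees(X_demo, y_demo)
bagged_pred = predict_majority(bagged_models, X_demo)
print(f"training accuracy of the vote: {(bagged_pred == y_demo).mean():.4f}")
```

Because each tree sees a different resample, their individual errors are partly independent, and voting tends to reduce variance compared with a single tree.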

# Step 7: Split the Data

Need to check for **Supervised** modelling:
- Number of known cases or observations
- Define the split in Training/Test or Training/Validation/Test and their proportions
- Check for unbalanced classes and how to keep or avoid them when splitting

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state =42)
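
The classes are split roughly 65/35, so a purely random split can shift that ratio between the training and test sets. One way to keep the ratio, sketched here on toy arrays with a similar imbalance, is the `stratify` argument of `train_test_split`. The array names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 130 negatives and 70 positives (35% positive).
y_toy = np.array([0] * 130 + [1] * 70)
X_toy = np.arange(200).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(f"positive rate, train: {y_tr.mean():.3f}")
print(f"positive rate, test:  {y_te.mean():.3f}")
```

With stratification, both subsets should show a positive rate close to the overall 35%.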

# Step 8: Define and Fit Models

Define the model and its hyper-parameters.

Consider the parameters and hyper-parameters of each model at each (re)run and after checking the efficiency of a model against the training and test datasets.

In [55]:
labels = ['K-NN', 'Decision Tree','Logistic Regression','SVM','Gaussian NB',
         'Bagging K-NN', 'Bagging Tree','Bagging Logistic','Bagging SVM','Bagging NB']

clf_list = [knn_clf, dt_clf,reg_clf,svm_clf,nb_clf,
            bagging_knn, bagging_dt,bagging_reg,bagging_svm,bagging_nb]

In [90]:
fig = plt.figure(figsize = (10,8))
gs = gridspec.GridSpec(5,2)
grid = itertools.product([0,1],repeat = 1)


<Figure size 720x576 with 0 Axes>

In [87]:
for clf, label in zip(clf_list,labels):
    scores = cross_val_score(clf, X_train, y_train, cv = 5, scoring = 'accuracy')
    print('Accuracy for train cross val: %.4f (+/- %.4f) [%s]' % (scores.mean(), scores.std(), label))
    
    clf.fit(X_train,y_train)
    print('Accuracy for train: %.4f [%s]' % (clf.score(X_train,y_train), label))
    print('Accuracy for test: %.4f [%s]' % (clf.score(X_test,y_test), label))
    yt_pred = clf.predict(X_train)
    
    print('ROC/AUC Scores: %.4f [%s]\n' % (roc_auc_score(y_train,yt_pred),label))

Accuracy for train cross val: 0.9677 (+/- 0.0092) [K-NN]
Accuracy for train: 0.9803 [K-NN]
Accuracy for test: 0.9857 [K-NN]
ROC/AUC Scores: 0.9812 [K-NN]

Accuracy for train cross val: 0.9140 (+/- 0.0122) [Decision Tree]
Accuracy for train: 0.9211 [Decision Tree]
Accuracy for test: 0.9357 [Decision Tree]
ROC/AUC Scores: 0.9275 [Decision Tree]

Accuracy for train cross val: 0.9642 (+/- 0.0126) [Logistic Regression]
Accuracy for train: 0.9677 [Logistic Regression]
Accuracy for test: 0.9714 [Logistic Regression]
ROC/AUC Scores: 0.9628 [Logistic Regression]

Accuracy for train cross val: 0.9624 (+/- 0.0131) [SVM]
Accuracy for train: 0.9677 [SVM]
Accuracy for test: 0.9786 [SVM]
ROC/AUC Scores: 0.9641 [SVM]

Accuracy for train cross val: 0.9588 (+/- 0.0090) [Gaussian NB]
Accuracy for train: 0.9552 [Gaussian NB]
Accuracy for test: 0.9714 [Gaussian NB]
ROC/AUC Scores: 0.9571 [Gaussian NB]

Accuracy for train cross val: 0.9642 (+/- 0.0056) [Bagging K-NN]
Accuracy for train: 0.9785 [Bagging K-NN

# Step 9: Verify and Evaluate the Training Model
- Use the **training** data to make predictions
- Check for overfitting
- What metrics are appropriate for the modelling approach used
- For **Supervised** models:
    - Check the **Training Results** with the **Training Predictions** during development
- Analyse, modify the parameters and hyper-parameters and repeat (within reason) until the model does not improve
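
One way to check for overfitting is to compare the score on the training folds with the score on the held-out folds. The sketch below does this with `cross_validate` on synthetic data for a shallow and an unrestricted decision tree; the data and the `results` dictionary are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the lab data.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

results = {}
for depth in (2, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv = cross_validate(
        tree, X_demo, y_demo, cv=5, scoring="accuracy", return_train_score=True
    )
    results[depth] = {
        "train": cv["train_score"].mean(),
        "validation": cv["test_score"].mean(),
    }
    gap = results[depth]["train"] - results[depth]["validation"]
    print(f"max_depth={depth}: train={results[depth]['train']:.4f} "
          f"validation={results[depth]['validation']:.4f} gap={gap:.4f}")
```

A large gap between the training and validation scores indicates that the model is memorising the training data rather than generalising.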

In [118]:
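# GridSearchCV is used in the next cell, so import it before that cell runs.
from sklearn.model_selection import GridSearchCV

knn_clf.fit(X_train,y_train)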

params2 = {'n_neighbors':[2,5,10],
          'weights':['uniform','distance'],
          'algorithm': ['auto','brute','kd_tree','ball_tree'],
          'leaf_size': [10,30,50]}

In [119]:
%%time

grid_knn = GridSearchCV(knn_clf,params2)

grid_knn.fit(X_train,y_train)

CPU times: user 1.88 s, sys: 2.56 s, total: 4.44 s
Wall time: 1.35 s


GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'algorithm': ['auto', 'brute', 'kd_tree', 'ball_tree'],
                         'leaf_size': [10, 30, 50], 'n_neighbors': [2, 5, 10],
                         'weights': ['uniform', 'distance']})

In [122]:
grid_knn.best_score_

0.9677284427284427

In [121]:
grid_knn.best_params_

{'algorithm': 'auto', 'leaf_size': 30, 'n_neighbors': 5, 'weights': 'uniform'}

In [108]:
from sklearn.model_selection import GridSearchCV

bagging_knn.fit(X_train,y_train)

params = {'n_estimators':[10,100],
          'max_samples':[0.2,0.8,1.0],
          'max_features': [0.2,0.8,1.0],
          'bootstrap':[True,False],
          'bootstrap_features':[True,False],
          'random_state': [42]}

In [109]:
import time

In [110]:
%%time

grid = GridSearchCV(bagging_knn,params)
grid.fit(X_train,y_train)

CPU times: user 20.3 s, sys: 184 ms, total: 20.4 s
Wall time: 20.6 s


GridSearchCV(estimator=BaggingClassifier(base_estimator=KNeighborsClassifier(),
                                         max_features=0.8, max_samples=0.8),
             param_grid={'bootstrap': [True, False],
                         'bootstrap_features': [True, False],
                         'max_features': [0.2, 0.8, 1.0],
                         'max_samples': [0.2, 0.8, 1.0],
                         'n_estimators': [10, 100], 'random_state': [42]})

In [111]:
grid.best_score_

0.9713481338481339

In [112]:
grid.best_estimator_

BaggingClassifier(base_estimator=KNeighborsClassifier(),
                  bootstrap_features=True, max_features=0.8, max_samples=0.8,
                  random_state=42)

In [113]:
grid.best_params_

{'bootstrap': True,
 'bootstrap_features': True,
 'max_features': 0.8,
 'max_samples': 0.8,
 'n_estimators': 10,
 'random_state': 42}
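
Besides `best_score_` and `best_params_`, a fitted search keeps the scores of every combination in `cv_results_`. Ranking them shows how close the runners-up are, which helps judge whether the winning settings matter. The sketch below uses synthetic data and a small K-NN grid; the variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for the lab data.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [2, 5, 10], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best first.
results = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

If the top few rows differ by less than one standard deviation, the simpler or faster configuration is usually a reasonable choice.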

# Step 10: Make Predictions and Evaluate the Test Model
**NOTE**: **Do this only once you have finished improving the model**.

- Use the **test** data to make predictions
- For **Supervised** models:
    - Check the **Test Results** with the **Test Predictions**
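
A minimal sketch of this final step, on synthetic data and with a logistic regression standing in for whichever model is selected: accuracy is computed from the predicted labels, while ROC AUC is computed from the predicted probability of the positive class. Passing probabilities rather than hard labels to `roc_auc_score` lets the metric use the full ranking of the samples. All names here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the lab data.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit on the training split only; the test split is touched once, at the end.
final_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

test_pred = final_model.predict(X_te)                 # hard labels for accuracy
test_scores = final_model.predict_proba(X_te)[:, 1]   # P(class 1) for ROC AUC

test_accuracy = accuracy_score(y_te, test_pred)
test_auc = roc_auc_score(y_te, test_scores)
print(f"test accuracy: {test_accuracy:.4f}")
print(f"test ROC AUC:  {test_auc:.4f}")
```

These two numbers are the ones to compare against the baseline results stated in the instructions.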

# Step 11: Solve the Problem or Answer the Question
The results of an analysis or modelling can be used:
- As part of a product or process, so the model can make predictions when new input data is available
- As part of a report including text and charts to help understand the problem
- As input for further questions



---

© 2021 Institute of Data



