## __Reducing Errors with Ensembles__
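This notebook compares two ensemble methods, bagging and AdaBoost, on the Wisconsin breast cancer data set to see how ensembles can reduce classification error.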

## Step 1: Import the Required Libraries and Load the Data Set

- Import pandas, NumPy, SimpleImputer, and MinMaxScaler
- Load the breast cancer data set
- SimpleImputer is used to fill in missing values.
- Instead of sklearn's StandardScaler, we will use MinMaxScaler to rescale the data.


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

In [None]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses','Class']

In [None]:
data.head()

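|   | Sample code | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |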


__Observation:__
- Here, we can see the first few rows of the breast cancer data.

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code                  699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


__Observation:__
- The data has 699 rows and 11 columns with no null values. Every column is int64 except Bare Nuclei, which is stored as an object.

In [None]:
# Drop the identifier column, which is not a feature
data.drop(['Sample code'], axis=1, inplace=True)
data.describe()

|   | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


__Observations:__
- Here, we have dropped the Sample code column, which is only an identifier and carries no predictive information.
- Then we used describe() to check the descriptive statistics of the remaining columns.
- Bare Nuclei is stored as an object even though its values should be integers, which is why it is missing from the descriptive statistics.
- Let's find out what went wrong.

In [None]:
data['Bare Nuclei'].value_counts()

1     402
10    132
2      30
5      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare Nuclei, dtype: int64

__Observations:__
- As we can see, 16 entries are a question mark (`?`), which is why the column was read as an object.
- Replace the question marks with 0 and convert the column to an integer type.


In [None]:
data.replace('?', 0, inplace=True)
data['Bare Nuclei'] = data['Bare Nuclei'].astype('int64')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Clump Thickness              699 non-null    int64
 1   Uniformity of Cell Size      699 non-null    int64
 2   Uniformity of Cell Shape     699 non-null    int64
 3   Marginal Adhesion            699 non-null    int64
 4   Single Epithelial Cell Size  699 non-null    int64
 5   Bare Nuclei                  699 non-null    int64
 6   Bland Chromatin              699 non-null    int64
 7   Normal Nucleoli              699 non-null    int64
 8   Mitoses                      699 non-null    int64
 9   Class                        699 non-null    int64
dtypes: int64(10)
memory usage: 54.7 KB


__Observations:__
- After the conversion, data.info() shows that Bare Nuclei is now int64.
- All 10 remaining columns are now defined as integers.
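
Replacing `?` with 0 puts a value outside the 1 to 10 range seen in the other features, and it leaves no missing values for SimpleImputer to fill in the next step. The following is a minimal sketch of an alternative, assuming the goal is to treat `?` as missing: coerce it to NaN with `pd.to_numeric` and let SimpleImputer substitute the column mean. The frame `toy` is made-up data for illustration, not the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column that mimics 'Bare Nuclei': digits stored as strings plus '?'
toy = pd.DataFrame({'Bare Nuclei': ['1', '10', '?', '2', '?', '5']})

# Coerce non-numeric entries such as '?' to NaN instead of replacing them with 0
toy['Bare Nuclei'] = pd.to_numeric(toy['Bare Nuclei'], errors='coerce')
n_missing = int(toy['Bare Nuclei'].isna().sum())

# With strategy='mean', each NaN becomes the mean of the observed values in its column
imputer = SimpleImputer(strategy='mean')
imputed = imputer.fit_transform(toy[['Bare Nuclei']])

print(n_missing)
print(imputed.ravel())
```

With this approach the column becomes float rather than int64, and the imputed values stay inside the range of the observed data.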

## Step 2: Apply a SimpleImputer and Normalize the Data Using MinMaxScaler

In [None]:
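# Convert the DataFrame to a NumPy array
values = data.values

# Fill any remaining NaN values with the column mean (SimpleImputer's default strategy)
imputer = SimpleImputer()
imputeData = imputer.fit_transform(values)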

- Set **feature_range** to **(0, 1)** so that every column is scaled to lie between **0** and **1**

In [None]:
scaler = MinMaxScaler(feature_range=(0,1))
normalizeddata = scaler.fit_transform(imputeData)

- Now, every column, including Class, is scaled to the range 0 to 1.
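
Min-max scaling maps each value x in a column to (x - min) / (max - min), where min and max are taken over that column. As a small sketch on a made-up array (the matrix `toy` is an assumption for illustration), the scaler's output can be checked against the formula computed by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy matrix: two features on a 1-10 scale
toy = np.array([[1.0, 10.0],
                [5.5, 4.0],
                [10.0, 1.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(toy)

# The same result computed by hand: (x - column min) / (column max - column min)
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```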

## Step 3: Import the BaggingClassifier and DecisionTreeClassifier

- Prepare the data set for training
- Train and evaluate BaggingClassifier using DecisionTreeClassifier and 10-fold cross-validation
- AdaBoostClassifier is trained and evaluated with the same 10-fold cross-validation in Step 4


In [None]:
from sklearn import model_selection 
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

- Define the feature matrix X and the target vector y

In [None]:
X = normalizeddata[:, 0:9]
y = normalizeddata[:, 9]
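
Because the scaler was fit on all ten columns, the Class codes 2 (benign) and 4 (malignant) were rescaled along with the features, so `y` holds the floats 0.0 and 1.0. A sketch of a more explicit alternative, assuming one prefers to keep the target out of the scaler: separate the features from the target first, map the class codes to integer labels, and scale only the features. The frame `toy` is made-up data for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame with one feature and the 2/4 class coding
toy = pd.DataFrame({'Clump Thickness': [1, 5, 10, 3],
                    'Class': [2, 4, 4, 2]})

# Separate the features and the target before scaling
X_toy = toy.drop(columns='Class')
y_toy = toy['Class'].map({2: 0, 4: 1}).to_numpy()  # explicit integer labels

# Scale only the features
X_scaled = MinMaxScaler().fit_transform(X_toy)

print(X_scaled.ravel())
print(y_toy)
```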

- Create **kfold** with **n_splits** set to **10**
- Build the **BaggingClassifier** and compute the cross-validation score

In [None]:
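# 10-fold cross-validation with shuffling; the fixed seed makes the folds reproducible
kfold = model_selection.KFold(n_splits=10, random_state=7, shuffle=True)
cart = DecisionTreeClassifier()
num_trees = 100
# Bag 100 decision trees, each trained on a bootstrap sample of the training data
model = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=7)
results = model_selection.cross_val_score(model, X, y, cv=kfold)
# Mean accuracy across the 10 folds
print(results.mean())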

0.9584886128364388


__Observation:__
- With a BaggingClassifier of 100 decision trees, the mean 10-fold cross-validation accuracy is about 95.8%.
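
To see what bagging adds over a single tree, both models can be scored on the same folds. The sketch below does this on synthetic data from `make_classification`, which is a stand-in assumed here so that the example runs without a download; its scores will not match the breast cancer results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 samples with 9 features
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=7)

# Use the same folds for both models so the scores are comparable
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

single_tree = DecisionTreeClassifier(random_state=7)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=7)

tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=kfold)
bag_scores = cross_val_score(bagged_trees, X_demo, y_demo, cv=kfold)

print(f"single tree: {tree_scores.mean():.3f}")
print(f"bagging:     {bag_scores.mean():.3f}")
```

The base tree is passed positionally here because the keyword changed from `base_estimator` to `estimator` in scikit-learn 1.2.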

## Step 4: Use an Ensemble with AdaBoost and Reduce the Errors

**AdaBoost Classifier**

- Set the seed value and the number of trees for the AdaBoost Classifier
- Create an AdaBoost Classifier model
- Evaluate the model using cross-validation
- Print the mean of the cross-validation results

In [None]:
from sklearn.ensemble import AdaBoostClassifier
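seed = 7
num_trees = 70
# Same 10-fold split as before, so the scores are comparable with the bagging model
kfold = model_selection.KFold(n_splits=10, random_state=seed, shuffle=True)
# With no estimator given, AdaBoostClassifier boosts depth-1 decision trees (decision stumps)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, y, cv=kfold)
print(results.mean())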

0.9599378881987578


__Observation:__
- AdaBoost reaches a mean accuracy of about 96.0%, compared with about 95.8% for the bagged decision trees. The gain is less than 0.2 percentage points, so it should be read as a marginal improvement.
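
When two mean scores are this close, it helps to look at the spread across folds as well as the mean. The sketch below reports both for each model, again on synthetic stand-in data from `make_classification` (an assumption for illustration, so the numbers differ from those above). If the standard deviation across folds is larger than the gap between the means, the difference may not be meaningful.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 samples with 9 features
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

models = {
    'bagging': BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=7),
    'adaboost': AdaBoostClassifier(n_estimators=70, random_state=7),
}

# Collect the mean and standard deviation of the per-fold accuracy for each model
summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```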