----

# **Reducing Errors with Ensemble Learning Techniques**

## **Author**   :  **Muhammad Adil Naeem**

## **Contact**   :   **madilnaeem0@gmail.com**
<br>

----



### **Importing Libraries**

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import BaggingClassifier , AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import warnings
warnings.filterwarnings('ignore')

### **Load Dataset**

In [2]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                   header=None)
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']


### **First 5 Rows of Dataset**

In [4]:
data.head()

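| | Sample code | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |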


### **Information About Dataset**

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code                  699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


### **Drop `Sample code` Column**

In [7]:
data.drop(['Sample code'], axis=1, inplace=True)

### **Descriptive Statistics**

In [8]:
data.describe()

| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


### **`Bare Nuclei` Is a Numerical Column, but Its Descriptive Statistics Are Missing**
- `describe()` skipped it because pandas loaded it with dtype `object`. Let's see what went wrong.

In [9]:
data['Bare Nuclei'].value_counts()

| Bare Nuclei | count |
|---|---|
| 1 | 402 |
| 10 | 132 |
| 2 | 30 |
| 5 | 30 |
| 3 | 28 |
| 8 | 21 |
| 4 | 19 |
| ? | 16 |
| 9 | 9 |
| 7 | 8 |


**Because of the presence of `?`, we cannot see its statistics. Let's deal with it.**

### **Data Cleaning and Type Conversion**

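- This code replaces every `?` in `data` with `0` and then converts the `Bare Nuclei` column to integers so it can be analyzed numerically. Since valid values range from 1 to 10, `0` acts as a placeholder for the 16 missing entries rather than as a real measurement.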

In [10]:
data.replace('?', 0, inplace=True)
data['Bare Nuclei'] = data['Bare Nuclei'].astype(int)
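An alternative to the zero placeholder, sketched here on a small made-up column rather than the real dataset, is to turn `?` into `NaN` so that a later imputation step can recognize the value as missing. The toy values and the name `toy` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy column mimicking 'Bare Nuclei' with a '?' placeholder
toy = pd.DataFrame({'Bare Nuclei': ['1', '10', '?', '5']})

# errors='coerce' turns anything non-numeric (here '?') into NaN
toy['Bare Nuclei'] = pd.to_numeric(toy['Bare Nuclei'], errors='coerce')

n_missing = int(toy['Bare Nuclei'].isna().sum())
print(n_missing)                  # number of entries now marked as missing
print(toy['Bare Nuclei'].dtype)   # float64, because NaN requires a float column
```

With this approach the column becomes `float64` instead of `int64`, which is the usual trade-off for keeping missing values explicit.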

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Clump Thickness              699 non-null    int64
 1   Uniformity of Cell Size      699 non-null    int64
 2   Uniformity of Cell Shape     699 non-null    int64
 3   Marginal Adhesion            699 non-null    int64
 4   Single Epithelial Cell Size  699 non-null    int64
 5   Bare Nuclei                  699 non-null    int64
 6   Bland Chromatin              699 non-null    int64
 7   Normal Nucleoli              699 non-null    int64
 8   Mitoses                      699 non-null    int64
 9   Class                        699 non-null    int64
dtypes: int64(10)
memory usage: 54.7 KB


### **Apply Simple Imputer**

In [12]:
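# All columns of `data` (features and 'Class') as a NumPy array
values = data.values
# Default SimpleImputer fills NaN with the column mean; `data` has no NaN left
# (the '?' entries are now 0), so this step only converts the values to float
imputer = SimpleImputer()
scaled = imputer.fit_transform(values)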
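By default, `SimpleImputer()` replaces `NaN` entries with the column mean and does not treat `0` as missing. The sketch below uses made-up arrays (the values and variable names are illustrative assumptions) to show the default behaviour next to an imputer that is told to treat `0` as the missing marker.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column: one NaN and one 0 placeholder
toy_nan = np.array([[1.0], [3.0], [np.nan], [0.0]])

# Default settings: missing_values=np.nan, strategy='mean'
default_imputer = SimpleImputer()
filled_nan = default_imputer.fit_transform(toy_nan).ravel()
print(filled_nan)   # the NaN becomes the mean of 1, 3 and 0; the 0 is left unchanged

# Treating 0 as the missing marker (only sensible if 0 is never a valid value)
toy_zero = np.array([[1.0], [3.0], [0.0], [5.0]])
zero_imputer = SimpleImputer(missing_values=0, strategy='median')
filled_zero = zero_imputer.fit_transform(toy_zero).ravel()
print(filled_zero)  # the 0 becomes the median of 1, 3 and 5
```

Whether the mean, the median, or another strategy is appropriate depends on the feature; the choice of median here is only an example.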

### **Scale Data using Min-Max Scaler**

In [14]:
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(scaled)
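Min-max scaling maps each column to the range [0, 1] with the formula `(x - min) / (max - min)`. As a small check on a made-up column (an assumption, not taken from the dataset), the manual formula can be compared with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature column on the dataset's 1-10 scale
col = np.array([[1.0], [4.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
manual = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

# The same transformation via scikit-learn
sk = MinMaxScaler(feature_range=(0, 1)).fit_transform(col)

print(manual.ravel())
print(np.allclose(manual, sk))   # the two results should agree
```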

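### **Splitting Data into Dependent and Independent Variables**

In [18]:
# Caution: `scaled` was built from every column of `data`, so its last column is the
# scaled 'Class' label and every model trained on X sees the label as a feature
X = pd.DataFrame(scaled)
y = data['Class']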

### **Create a `KFold` Splitter with 10 Splits**

In [20]:
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
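To make the splitting concrete, the following sketch applies the same `KFold` settings to 20 hypothetical row indices (the array `rows` is a stand-in for the rows of `X`, not the real data) and collects which rows each fold holds out.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 hypothetical sample indices standing in for the rows of X
rows = np.arange(20)

kf = KFold(n_splits=10, shuffle=True, random_state=7)

test_indices = []
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(rows)):
    # With 20 rows and 10 folds, each fold trains on 18 rows and holds out 2
    print(fold, len(train_idx), test_idx)
    fold_sizes.append(len(test_idx))
    test_indices.extend(test_idx.tolist())

# Every row is used for testing exactly once across the folds
print(sorted(test_indices) == list(range(20)))
```

Setting `shuffle=True` with a fixed `random_state` keeps the fold assignment reproducible between runs.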

### **Bagging with Decision Tree Classifier**

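- This code uses a Decision Tree Classifier as the base estimator for a Bagging Classifier with 100 estimators, evaluates it with 10-fold cross-validation on `X` and `y`, and prints the mean accuracy.
- Note that `X` was built from every column of `data`, including `Class`, so the model sees the label as a feature. The perfect mean accuracy of 1.0 reflects this leakage rather than real predictive performance.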

In [23]:
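# Decision tree used as the base learner for bagging
cart = DecisionTreeClassifier()
# `estimator` replaces the `base_estimator` argument, which was renamed in scikit-learn 1.2
model = BaggingClassifier(estimator=cart, n_estimators=100, random_state=7)
# Accuracy on each of the 10 folds defined by `kfold`
result = model_selection.cross_val_score(model, X, y, cv=kfold)
print(result.mean())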

1.0
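A leakage-free variant of this evaluation could look like the sketch below. It is a minimal illustration under stated assumptions: the data is synthetic (`make_classification`, not the breast cancer dataset), the names `X_demo`, `y_demo`, and `pipeline` are made up, and only 25 estimators are used to keep it quick. The base tree is passed positionally so the call does not depend on whether the keyword is `estimator` or `base_estimator` in the installed scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the breast cancer dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=7
)

# Features and target are separate arrays, so the label never enters X_demo.
# Imputer and scaler sit inside the pipeline, so each fold fits them on its
# training part only.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=7)
pipeline = make_pipeline(SimpleImputer(strategy='median'), MinMaxScaler(), bagged_trees)

cv = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print(scores.mean())
```

On the notebook's own data, the equivalent separation would be something like `data.drop(columns='Class')` for the features and `data['Class']` for the target, applied before any imputation or scaling.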


### **AdaBoost Classifier with Cross-Validation**

- This code fixes the random seed at 7 and sets the number of trees for an AdaBoost Classifier to 70. It evaluates the model with 10-fold cross-validation on `X` and `y` and prints the mean accuracy.

- Using `AdaBoostClassifier` to reduce error

In [25]:
seed = 7
num_trees = 70
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
result = model_selection.cross_val_score(model, X, y, cv=kfold)
print(result.mean())

1.0
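To see what boosting adds, one option is to compare AdaBoost with a single depth-1 tree (a decision stump, which is the default weak learner of `AdaBoostClassifier`). The sketch below does this on synthetic data; the dataset and the names `X_demo`, `y_demo`, `stump`, and `boosted` are assumptions for illustration, and no particular scores are implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the breast cancer dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=7
)
cv = KFold(n_splits=10, shuffle=True, random_state=7)

# Baseline: a single decision stump
stump = DecisionTreeClassifier(max_depth=1, random_state=7)
stump_scores = cross_val_score(stump, X_demo, y_demo, cv=cv)

# AdaBoost with 70 boosting rounds, as in the cell above
boosted = AdaBoostClassifier(n_estimators=70, random_state=7)
boosted_scores = cross_val_score(boosted, X_demo, y_demo, cv=cv)

print('single stump:', stump_scores.mean())
print('AdaBoost    :', boosted_scores.mean())
```

Evaluating both models on the same folds keeps the comparison fair, since any difference in mean accuracy then comes from the models and not from the split.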
