<a href="https://colab.research.google.com/github/iAmKankan/MachineLearning_With_Python/blob/master/Supervised/Ensemble-Learning_and_Randon-Forest/ADA_Boost001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



### Implementing the AdaBoost Algorithm From Scratch
![grape](https://user-images.githubusercontent.com/12748752/126882595-d1f5449e-14bb-4ab3-809c-292caf0858a1.png)

### What is AdaBoost
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)
**AdaBoost**, short for **Adaptive Boosting**, is an **ensemble learning** technique used in machine learning for **classification** and **regression** problems. The main idea is to train a sequence of **weak classifiers** on the training dataset, with each successive classifier giving more weight to the data points that were **misclassified** by the previous ones.

The final **AdaBoost** model combines all the weak classifiers used during training, weighting each one according to its accuracy: the most accurate weak model receives the highest weight, and the least accurate receives the lowest.

### Intuition Behind the AdaBoost Algorithm
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)

**AdaBoost** combines many weak machine-learning models into a single powerful classification model. The steps to build and combine these models are as follows.

$\large Step \#1 :$ Initialize the weights

For a dataset with ${\color{Purple}N}$ training data points, initialize ${\color{Purple}N}$ weights, one per data point, with ${\color{Purple} W_{i} = \frac{1}{N}}$  
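
A minimal sketch of this initialization with NumPy. The value `N = 5` is an arbitrary choice for illustration, not something taken from the dataset used later.

```python
import numpy as np

N = 5  # hypothetical number of training points
weights = np.full(N, 1.0 / N)  # every point starts with the same weight 1/N

print(weights)
print(weights.sum())
```

Because every weight starts equal, the first weak classifier treats all training points as equally important.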

$\large Step \#2 :$ Train weak classifiers

Train a weak classifier ${\color{Purple}M_k}$, where ${\color{Purple}k}$ is the current iteration.
The weak classifier should have an accuracy greater than ${\color{Purple}0.5}$, which means it should perform better than a random guess.

$\large Step \#3 :$ Calculate the error rate and importance of each weak model ${\color{Purple}M_k}$

Calculate the error rate ${\color{Purple}error_k}$ of every weak classifier ${\color{Purple}M_k}$ on the training dataset, counting each misclassified point with its current weight ${\color{Purple}W_i}$.
Calculate the importance ${\color{Purple}\alpha_k}$ of each model using the formula ${\color{Purple}\alpha_k = \frac{1}{2} \ln{\frac{1 - error_k}{error_k}}}$
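
To get a feel for the importance formula, the sketch below evaluates it for a few error rates. The function name `model_importance` and the clipping constant `eps` (which guards against division by zero when a model is perfect) are assumptions added for illustration.

```python
import numpy as np

def model_importance(error_rate, eps=1e-10):
    """alpha = 0.5 * ln((1 - error) / error), with the error clipped away from 0 and 1."""
    error_rate = np.clip(error_rate, eps, 1 - eps)
    return 0.5 * np.log((1 - error_rate) / error_rate)

for err in (0.1, 0.3, 0.5, 0.7):
    print(f"error={err:.1f} -> alpha={model_importance(err):+.3f}")
```

A model with an error of 0.5 gets an importance of zero, lower errors give a larger positive importance, and an error above 0.5 gives a negative importance, which effectively inverts that model's vote.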

$\large Step \#4 :$ Update the weight ${\color{Purple}W_i}$ of each data point

After applying the weak classifier to the training data, we update the weight of each point using the model's importance ${\color{Purple}\alpha_k}$. The update formula is ${\color{Purple}W_i = W_i \exp{(-\alpha_k y_i M_k(x_i))}}$. Here ${\color{Purple}y_i \in \{-1, +1\}}$ is the true label, ${\color{Purple}x_i}$ is the corresponding input vector and ${\color{Purple}M_k(x_i)}$ is the model's prediction, so misclassified points gain weight and correctly classified points lose weight.

$\large Step \#5 :$ Normalize the instance weights

We normalize the instance weights so that they sum to 1, using the formula ${\color{Purple}W_i = W_i / \sum_{j=1}^{N} W_j}$
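
A small worked example of Steps 3 to 5 for one boosting round. The five labels and predictions are made up for illustration, and labels are assumed to be encoded as -1 and +1.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])    # true labels in {-1, +1}
y_pred = np.array([1, -1, -1, -1, 1])   # weak classifier output; index 1 is wrong
weights = np.full(len(y_true), 1 / len(y_true))

# Step 3: weighted error rate and model importance
error = weights[y_true != y_pred].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Step 4: raise the weight of misclassified points, lower the others
weights = weights * np.exp(-alpha * y_true * y_pred)

# Step 5: normalize so the weights sum to 1
weights = weights / weights.sum()
print(weights)
```

After normalization the single misclassified point carries half of the total weight, so the next weak classifier is pushed to get that point right.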

$\large Step \#6 :$ Repeat steps ${\color{Purple}2-5}$ for ${\color{Purple}K}$ iterations

We train ${\color{Purple}K}$ classifiers, calculating each model's importance and updating the instance weights with the formulas above.
The final model ${\color{Purple}M(X)}$ is an ensemble obtained by combining these weak models, weighted by their importances: ${\color{Purple}M(X) = \text{sign}\left(\sum_{k=1}^{K} \alpha_k M_k(X)\right)}$
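
The six steps can be put together in a short from-scratch sketch for the binary case with labels in {-1, +1}. Using decision stumps (single-threshold rules) as the weak classifiers is an assumption, as are the function names `fit_stump`, `stump_predict`, `adaboost_fit` and `adaboost_predict`, the clipping constant `eps`, and the randomly generated toy data.

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Predict +1 or -1 by thresholding feature j."""
    return np.where(polarity * (X[:, j] - thr) > 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, polarity, err)
    return best

def adaboost_fit(X, y, K=10, eps=1e-10):
    """Train K stumps; y must contain labels in {-1, +1}."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                         # Step 1
    ensemble = []
    for _ in range(K):                              # Step 6
        j, thr, polarity, err = fit_stump(X, y, w)  # Step 2
        err = np.clip(err, eps, 1 - eps)
        alpha = 0.5 * np.log((1 - err) / err)       # Step 3
        pred = stump_predict(X, j, thr, polarity)
        w = w * np.exp(-alpha * y * pred)           # Step 4
        w = w / w.sum()                             # Step 5
        ensemble.append((alpha, j, thr, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    """Sign of the importance-weighted vote of all stumps."""
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy data, made up for illustration: the label depends on both features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

model = adaboost_fit(X_toy, y_toy, K=20)
accuracy = (adaboost_predict(X_toy, model) == y_toy).mean()
print(f"Training accuracy with {len(model)} stumps: {accuracy:.3f}")
```

No single stump can separate this diagonal boundary, but the weighted vote of several stumps approximates it. This sketch covers two classes only; problems with more classes need a multi-class variant of the algorithm.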

### Python Implementation of AdaBoost
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)

Python packages such as scikit-learn provide a ready-made **AdaBoost** implementation. We will see how to use it on a **machine learning** problem.

In this problem, we are given a dataset containing **3 species of flowers** along with four features of each flower: **sepal length**, **sepal width**, **petal length** and **petal width**. The task is to classify each flower into its species. The dataset can be downloaded from here

### Import Libraries
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
import warnings
warnings.filterwarnings("ignore")

### Reading And Describing The Dataset
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)
After importing the libraries, we load our dataset using the pandas read_csv function:

In [13]:
# Reading the dataset from the csv file
# the file is comma-separated, which is the default separator of read_csv
data = pd.read_csv("/Iris.csv")

# Printing the shape of the dataset
print(data.shape)

(150, 6)


In [14]:
data.head()


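|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |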


In [15]:
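# Drop the Id column, which is only a row identifier
data = data.drop('Id',axis=1)
# Features are all columns except the last; the target is the last column (Species)
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
print("Shape of X is %s and shape of y is %s"%(X.shape,y.shape))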

Shape of X is (150, 4) and shape of y is (150,)


### Unique values in our Target Variable
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)
Since this is a classification task, we check how many categories the inputs must be classified into and how many samples each category contains.

In [16]:
total_classes = y.nunique()
print("Number of unique species in dataset are: ",total_classes)

Number of unique species in dataset are:  3


In [17]:
distribution = y.value_counts()
print(distribution)

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64


### Splitting The Dataset
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)

Now we split the dataset into training and validation sets, with the validation set holding 25% of the data. For this we use the train_test_split function from sklearn.model_selection.  

In [18]:
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.25, random_state=28)

### Applying AdaBoost
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)
After creating the training and validation sets, we build our AdaBoost classifier and fit it on the training set. Fitting requires the independent variables X_train and the dependent variable Y_train.

In [19]:
# Creating adaboost classifier model
adb = AdaBoostClassifier()
adb_model = adb.fit(X_train,Y_train)
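
The classifier above uses default settings. The sketch below shows two commonly adjusted parameters, `n_estimators` and `learning_rate`, and inspects the fitted ensemble. It is self-contained, so it assumes scikit-learn's bundled Iris data as a stand-in for Iris.csv; the parameter values and variable names are illustrative, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=28)

# n_estimators caps the number of boosting rounds (K in the steps above);
# learning_rate scales the contribution of each weak model.
adb_tuned = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=28)
adb_tuned.fit(X_tr, y_tr)

print("Weak learners trained:", len(adb_tuned.estimators_))
print("First model weights:", adb_tuned.estimator_weights_[:3])
print("Validation accuracy:", adb_tuned.score(X_va, y_va))
```

Boosting can stop before reaching `n_estimators` rounds, so the number of fitted weak learners is read from `estimators_` rather than assumed.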

### Accuracy of the AdaBoost Model
![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)
Having fit the model on the training set, we now check its accuracy on the validation set created earlier with train_test_split.

In [20]:
print("The accuracy of the model on validation set is", adb_model.score(X_val,Y_val))

The accuracy of the model on validation set is 0.9210526315789473
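
A single accuracy number does not show which species get confused with each other. One possible follow-up is a confusion matrix and a per-class report, sketched below. To stay self-contained it assumes scikit-learn's bundled Iris data as a stand-in for Iris.csv, so its numbers are not guaranteed to match the output above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()  # assumption: stand-in for Iris.csv
X_tr, X_va, y_tr, y_va = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=28)

clf = AdaBoostClassifier(random_state=28).fit(X_tr, y_tr)
y_hat = clf.predict(X_va)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_va, y_hat, labels=[0, 1, 2])
print(cm)
print(classification_report(
    y_va, y_hat, labels=[0, 1, 2], target_names=iris.target_names))
```

Off-diagonal entries in the matrix count the validation samples assigned to the wrong species.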
