# Adaboost Implementation

## Adaboost

AdaBoost (Adaptive Boosting) is an ensemble learning method in which multiple weak models are combined into one stronger master model.

### Workflow:

Using the training data, we build model 1. What model 1 learned, in particular which samples it misclassified, is used as input to build model 2, which gives those samples more weight. The same happens from model 2 to model 3, and so on. The final prediction is a weighted vote of all the models.

![](./Images/img1.jpg)
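
The workflow above can be sketched from scratch. This is a minimal illustration of discrete AdaBoost with decision stumps, not how scikit-learn implements it internally. The function names `fit_adaboost` and `predict_adaboost` and the synthetic data are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, n_rounds=20):
    """Fit discrete AdaBoost with decision stumps; y must contain -1 and +1."""
    n_samples = X.shape[0]
    weights = np.full(n_samples, 1.0 / n_samples)  # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        # Weighted error of this round's weak learner.
        err = np.sum(weights[pred != y]) / np.sum(weights)
        err = np.clip(err, 1e-10, 1 - 1e-10)  # avoid division by zero and log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote in the final model
        # Raise the weight of misclassified samples, lower the weight of the rest.
        weights = weights * np.exp(-alpha * y * pred)
        weights = weights / weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def predict_adaboost(stumps, alphas, X):
    """Combine all weak learners by a weighted vote."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(scores >= 0, 1, -1)


# Synthetic stand-in data (hypothetical, not the income dataset).
X_demo, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)

stumps, alphas = fit_adaboost(X_demo, y_demo, n_rounds=20)
demo_pred = predict_adaboost(stumps, alphas, X_demo)
train_acc = np.mean(demo_pred == y_demo)
print(f"training accuracy of the sketch: {train_acc:.3f}")
```

Each round trains on the same rows but with different sample weights, which is how the "learning" of one model is passed to the next.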



## Import libraries

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")


## Dataset Details

We use the income.csv file here. It has a feature called **income_level**, which tells whether a person's income is less than or greater than 50K a year.

The **income_level** feature has two values, 0 and 1.
1 means the person has an income greater than 50K a year.
0 means the person has an income less than 50K a year.
It is the target column (the dependent variable) of our dataset.

<b>Import Dataset</b>

In [6]:
## Import the data
inc_df = pd.read_csv('income.csv')
inc_df.head()

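| Unnamed: 0 | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | income_level |
|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | 0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | 0 |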


Note:

This data relates to a loan use case, that is, whether a bank loan is granted to a person or not.

<b>Split the data into dependent and independent variables</b>

In [7]:
X = inc_df.drop('income_level', axis=1)
y = inc_df['income_level']

In [8]:
X.head()

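| Unnamed: 0 | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week |
|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 |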


In [9]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: income_level, dtype: int64

<b>Split into training and test set</b>

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# test_size=0.2 gives 80% training data and 20% test data
# random_state seeds the shuffling, so the same split is produced on every run
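
To see what `random_state` does, the sketch below splits a small toy dataset twice with the same seed and checks that both splits match. It also shows the `stratify` option of `train_test_split`, which keeps the class ratio the same in both parts. The toy frame and all variable names are made up for this illustration; the notebook itself does not use `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the income dataset.
toy_X = pd.DataFrame({"age": range(20, 40)})
toy_y = pd.Series([0] * 15 + [1] * 5, name="income_level")

# The same seed gives an identical split on every call.
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
same_split = a_test.index.equals(b_test.index)

# stratify keeps the 75% / 25% class ratio in both the training and the test part.
_, _, s_train_y, s_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
print("same split:", same_split)
print("share of class 1 in stratified test set:", s_test_y.mean())
```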

In [17]:
X_train.head()

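| Unnamed: 0 | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week |
|---|---|---|---|---|---|---|
| 37193 | 32 | 50753 | 9 | 0 | 0 | 40 |
| 31093 | 45 | 144351 | 14 | 0 | 0 | 40 |
| 33814 | 35 | 252217 | 8 | 0 | 0 | 40 |
| 14500 | 64 | 69525 | 9 | 0 | 0 | 20 |
| 23399 | 63 | 28612 | 9 | 0 | 0 | 70 |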


In [18]:
y_test.head()

7762     0
23881    0
30507    0
28911    0
19484    0
Name: income_level, dtype: int64

## Model Training and Evaluation

<b>Initiate AdaBoost classifier</b>

Here we create the AdaBoost model object. It is not trained yet; training happens when we call `fit`.

In [19]:
adaboost_model = AdaBoostClassifier(n_estimators=100, learning_rate=1)

#### Parameters:

1. *n_estimators* is the maximum number of weak learners to train iteratively, that is, how many individual models go into the final model. *n_estimators=100* means train up to 100 models.

2. *learning_rate* is the weight applied to each weak learner's contribution at every boosting iteration. The default is 1. A smaller value shrinks the contribution of each model, so more estimators are usually needed to reach the same fit.
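
The interplay of the two parameters can be observed with `staged_score`, which yields the accuracy after each boosting round. The sketch below is only an illustration on synthetic data; the variable names and the chosen values (30 rounds, learning rates 0.1 and 1.0) are assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the income dataset).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

staged = {}
for lr in (0.1, 1.0):
    clf = AdaBoostClassifier(n_estimators=30, learning_rate=lr, random_state=0)
    clf.fit(Xs_train, ys_train)
    # staged_score yields the test accuracy after each boosting round.
    staged[lr] = list(clf.staged_score(Xs_test, ys_test))
    print(
        f"learning_rate={lr}: after 1 round {staged[lr][0]:.3f}, "
        f"after {len(staged[lr])} rounds {staged[lr][-1]:.3f}"
    )
```

Comparing the two curves shows how quickly each setting improves with additional rounds on a given dataset.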


<b>Train AdaBoost classifier</b>

In [20]:
model = adaboost_model.fit(X_train, y_train)
model

AdaBoostClassifier(learning_rate=1, n_estimators=100)

<b>Prediction</b>

In [22]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

<b>Model accuracy</b>

In [25]:
accuracy = accuracy_score(y_test, y_pred)
print(f'model accuracy is {accuracy}')

model accuracy is 0.8380591667519706


**Inference:**

The AdaBoost model has an accuracy of about 84% on the test set.
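
An accuracy figure is easier to judge next to a baseline. If one class is much more frequent than the other, a model that always predicts the majority class already scores well. The sketch below compares AdaBoost with such a baseline and prints a confusion matrix. It runs on synthetic, imbalanced data with made-up variable names, so the numbers say nothing about the income dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 75% class 0); not the income data.
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=6, weights=[0.75, 0.25], random_state=0
)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(Xi_train, yi_train)
booster = AdaBoostClassifier(n_estimators=100, random_state=0).fit(Xi_train, yi_train)

baseline_acc = accuracy_score(yi_test, baseline.predict(Xi_test))
booster_acc = accuracy_score(yi_test, booster.predict(Xi_test))
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(yi_test, booster.predict(Xi_test))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"AdaBoost accuracy: {booster_acc:.3f}")
print(cm)
```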

## More practice with AdaBoost

In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
log_regressor = LogisticRegression()

<b>Create AdaBoost classifier</b>

In [31]:
# Note: scikit-learn 1.2 renamed base_estimator to estimator; base_estimator was removed in 1.4
adaboost_model = AdaBoostClassifier(base_estimator=log_regressor, n_estimators=50, learning_rate=1)

Parameters

1. *base_estimator*: the weak learner to boost, set here to ``LogisticRegression``. If ``None``, the base estimator is a ``DecisionTreeClassifier`` with ``max_depth=1`` (a decision stump), which is the default.
2. *n_estimators* is the maximum number of weak learners to train iteratively, that is, how many individual models go into the final model. *n_estimators=50* means train up to 50 models.
3. *learning_rate* is the weight applied to each weak learner's contribution at every boosting iteration. The default is 1. A smaller value shrinks the contribution of each of the 50 models.
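
The name of the first parameter depends on the installed scikit-learn version: `base_estimator` was renamed to `estimator` in version 1.2 and removed in 1.4. A minimal sketch of one way to write code that runs on either side of that change is shown below. The variable names `keyword` and `compat_model` are made up for this example.

```python
import inspect

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# Pick whichever keyword the installed scikit-learn version accepts.
params = inspect.signature(AdaBoostClassifier.__init__).parameters
keyword = "estimator" if "estimator" in params else "base_estimator"

compat_model = AdaBoostClassifier(
    **{keyword: LogisticRegression()}, n_estimators=50, learning_rate=1.0
)
print(keyword, type(getattr(compat_model, keyword)).__name__)
```

Passing the base estimator as the first positional argument, as in `AdaBoostClassifier(LogisticRegression(), n_estimators=50)`, is another option that avoids the keyword altogether.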


This is what a data scientist is expected to do: try different models on the data and see which one works best. Here we check whether ``logistic regression`` gives a better accuracy than the ``decision tree classifier``.

In [32]:
adaboost_model

AdaBoostClassifier(base_estimator=LogisticRegression(), learning_rate=1)

<b>Train AdaBoost classifier</b>

In [33]:
model = adaboost_model.fit(X_train, y_train)
model

AdaBoostClassifier(base_estimator=LogisticRegression(), learning_rate=1)

<b>Prediction</b>

In [34]:
prediction = model.predict(X_test)
prediction

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

<b>Model accuracy</b>

In [35]:
accuracy = accuracy_score(y_test, prediction)
print(f'model accuracy is {accuracy}')

model accuracy is 0.7992629747159382


**Inference:**

The AdaBoost model has an accuracy of about 80% on the test set, which is lower than the 84% obtained with the default decision tree classifier. That is, setting the base estimator to ``logistic regression`` does not give a better accuracy than the ``decision tree classifier`` here.

Just as we tested logistic regression, we can use SVM, Random Forest and other models as the base estimator, provided they support sample weights, and compare the model accuracy with the decision tree classifier.
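
Such a comparison can be written as a loop over candidate base estimators. The sketch below is an illustration only: it uses synthetic data, and it swaps in two decision trees and a logistic regression as candidates rather than SVM or Random Forest. The dictionary keys and variable names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the income dataset).
X_cmp, y_cmp = make_classification(n_samples=400, n_features=6, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cmp, y_cmp, test_size=0.2, random_state=42
)

candidates = {
    "stump": DecisionTreeClassifier(max_depth=1),
    "tree_depth_3": DecisionTreeClassifier(max_depth=3),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, base in candidates.items():
    # The base estimator is passed positionally, so the keyword name
    # (base_estimator or estimator) does not matter here.
    clf = AdaBoostClassifier(base, n_estimators=20, random_state=0)
    scores[name] = clf.fit(Xc_train, yc_train).score(Xc_test, yc_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```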

We can change ``base_estimator``, ``n_estimators`` and ``learning_rate`` to see how the model accuracy changes.
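
Instead of changing the values by hand, a grid search can try several combinations with cross-validation. The following is a minimal sketch on synthetic data. The grid values, the 3-fold setting and the variable names are assumptions chosen to keep the example fast, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the income dataset).
X_gs, y_gs = make_classification(n_samples=300, n_features=6, random_state=0)

param_grid = {
    "n_estimators": [10, 50],
    "learning_rate": [0.1, 1.0],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    AdaBoostClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_gs, y_gs)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `search.fit(X_train, y_train)` would be used, and the test set would be kept aside for the final accuracy check.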

### Advantages:

1. Easy to implement.
2. It iteratively corrects the mistakes of the weak classifiers and improves the accuracy of the weak learners.
3. Can use many base classifiers along with adaboost.
4. AdaBoost is often less prone to overfitting than many other algorithms, although it can still overfit noisy data.

### Disadvantages:

1. Sensitive to noise and outliers.