<a href="https://www.kaggle.com/code/hossamahmedsalah/ensemble-learning-msp?scriptVersionId=142791575" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="padding: 35px;color:white;margin:10;font-size:200%;text-align:center;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://github.com/hossamAhmedSalah/Machine_Learning_MSP/blob/main/Assets/247.jpg?raw=true?)">
<b>
<span style='color:skyblue'>MSP Machine Learning workshop 2023 </span>
</b>
<div>
<span style='color:Salmon'>Ensemble Learning</span>

</div>

</div>

<br>


# You can view other sessions via 
[GitHub - hossamAhmedSalah/Machine_Learning_MSP: MSP 23 workshop of machine learning](https://github.com/hossamAhmedSalah/Machine_Learning_MSP/tree/main)
![MSP Logo](https://github.com/hossamAhmedSalah/Machine_Learning_MSP/blob/main/Assets/image-removebg-preview.png?raw=true)

<a id="key"></a>
# Table of Contents
1) [Importing libraries](#1)

2) [Reading the dataset](#2)
   - 2.1 [Heart dataset💖](#21)
   
3) [Preprocessing](#3)
   - 3.1 [Encoding](#31)
   - 3.2 [Why we don't need to scale the data in tree🌳-based models](#32)
   
4) [EDA](#eda)

5) [Preparing the data](#4)
   - 5.1 [Splitting the data into X features and y the target](#41)
   - 5.2 [Splitting the data into Train and Test](#42)
   
6) [Modeling](#6)
   - 6.1 [Max Voting](#61)
   
7) [Random Forest](#7)
   - [Bootstrapping](#71)
   - [How does the Forest work?](#72) 
   - [Grid search](#73)
   - [Out of Bag (OOB) score](#74)
   
8) [Ada Boost](#8)
   - [Ada Boost Parameters](#81)
   - [Grid Search](#82)

# <a id="1">importing libraries </a>
[🌟go to table of content🌟](#key)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from the ensemble module we import some classifiers
from sklearn.ensemble import (
    RandomForestClassifier, 
    ExtraTreesClassifier,
    VotingClassifier,
    AdaBoostClassifier, 
    GradientBoostingClassifier,
    )
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree


# <a id="2">Reading the dataset </a>
[🌟go to table of content🌟](#key)

In [None]:
# reading the dataset 
heart = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')
heart

# <a id="21">Heart dataset💖</a>
1. Age: age of the patient [years]
2. Sex: sex of the patient [M: Male, F: Female]
3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
4. RestingBP: resting blood pressure [mm Hg]
5. Cholesterol: serum cholesterol [mg/dl]
6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
10. Oldpeak: oldpeak = ST [Numeric value measured in depression]
11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
12. <span style= "color :red">HeartDisease: output class [1: heart disease, 0: Normal]</span>

[🌟go to table of content🌟](#key)

In [None]:
# dataset info
# no missing values
heart.info()

In [None]:
# describing 
heart.describe()

In [None]:
# describing the categories 
heart.describe(include='O')

## <a id="3">Preprocessing</a>
[🌟go to table of content🌟](#key)

## <a id="31">Encoding</a>


In [None]:
# selecting columns that would be encoded
heart.select_dtypes(include="O")

In [None]:
# selecting columns that would be encoded
cols = heart.select_dtypes(include="O").columns
cols

In [None]:
# Encoding 
from sklearn.preprocessing import LabelEncoder

# initiating the encoder class
lenc = LabelEncoder()

# looping over each of the object-type columns
for col in cols:
    # encode these columns
    heart[col] = lenc.fit_transform(heart[col])

# displaying the first rows
heart.head(4)

# <a id="32">Why don't we need to scale the data in tree🌳-based models?</a>


<br>

<div style="border-radius:10px;border:salmon solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
    Scaling is a common preprocessing step for many machine learning algorithms, particularly those that rely on <strong><mark style ="background-color:salmon;color:white;border-radius:4px;opacity:1.0">distance-based metrics</mark></strong> or <mark style="background-color:salmon;border-radius:4px;opacity:1.0;color:white"><strong>gradient-based optimization methods</strong></mark>. Tree-based models work differently: each split compares a single feature against a threshold, so only the ordering of the values within a feature matters. Rescaling a feature moves the thresholds but leaves the resulting partitions of the data unchanged, which makes scaling unnecessary.
</div>
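
To make this concrete, here is a minimal sketch on synthetic data (not the heart dataset). The names `X_demo`, `tree_raw` and `tree_scaled` are made up for this illustration, and `StandardScaler` is only one possible choice of scaler. The same decision tree is trained on raw and on standardized features, and the predictions are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic data (an assumption for the demo), with one feature on a much larger scale
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_demo[:, 0] *= 1000
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# fit the scaler on the training part only, then train one tree per version of the data
scaler = StandardScaler().fit(Xd_train)
tree_raw = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(Xd_train), yd_train)

# fraction of test samples on which both trees predict the same class
pred_raw = tree_raw.predict(Xd_test)
pred_scaled = tree_scaled.predict(scaler.transform(Xd_test))
agreement = np.mean(pred_raw == pred_scaled)
print(f"Agreement between raw and scaled trees: {agreement:.3f}")
```

The two trees should agree on essentially every test sample, because standardizing keeps the order of the values inside each feature. A distance-based model such as k-nearest neighbours would not behave this way.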


<br>

# <a id="eda">EDA</a>

In [None]:
import seaborn as sns
sns.pairplot(data=heart, hue="HeartDisease");

In [None]:
fig = plt.figure(figsize=(10,5), dpi=170)
sns.heatmap(heart.corr(), cmap="Blues", annot=True)

# <a id="4">Preparing the data</a>

# <a id="41">Splitting the data into X features and y the target</a>

In [None]:
X = heart.drop(columns=["HeartDisease"])
y = heart["HeartDisease"]

# <a id="42">Splitting the data into Train and Test</a>

In [None]:
from sklearn.model_selection import train_test_split

# splitting the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# <a id="6">Modeling</a>

## <a id="61">Max Voting</a>

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
model1 = LogisticRegression(max_iter=1000)
# LogisticRegression may warn that the solver hit its iteration limit without converging,
# so max_iter is raised to 1000; the slow convergence is likely a side effect of not scaling the features
model1.fit(X_train, y_train)
print(f"Training score {model1.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(model1, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {model1.score(X_test, y_test)}")

In [None]:
model2 = DecisionTreeClassifier()
model2.fit(X_train, y_train)
print(f"Training score {model2.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(model2, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {model2.score(X_test, y_test)}")
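# a training score far above the cross-validation score signals overfitting
# this is where cross-validation is useful: it exposes the gap without touching the test set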

In [None]:
# let's combine these models
combined = VotingClassifier(estimators=[('lr',model1), ('dt', model2)], voting='hard')
combined.fit(X_train, y_train)

print(f"Training score {combined.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(combined, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {combined.score(X_test, y_test)}")
# voting reduced the overfitting by about 7%, and combining more models could improve on this
# cross-validation also gives a realistic estimate of the model's quality: its score on the training folds is close to the score on the real test data
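
To see what `voting='hard'` does under the hood, here is a minimal sketch of majority voting written with NumPy only. The helper `majority_vote` and the toy prediction arrays are hypothetical and exist only for this illustration; this is not scikit-learn's actual implementation.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most frequent label per sample.

    predictions: array-like of shape (n_models, n_samples) holding
    non-negative integer class labels, one row per model.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # count the votes per class for every sample (one column per sample);
    # argmax returns the smallest label when two classes get the same count
    return np.array(
        [np.bincount(column, minlength=n_classes).argmax() for column in predictions.T]
    )

# three toy models voting on four samples
three_models = [
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
voted = majority_vote(three_models)
print(voted)  # [1 1 1 0]

# two toy models that disagree on the first sample: the tie goes to the lower label
two_models = [
    [1, 0],
    [0, 0],
]
tied = majority_vote(two_models)
print(tied)  # [0 0]
```

The second case matters for the cell above: with only two estimators every disagreement is a tie, and scikit-learn's `VotingClassifier` also resolves ties in favour of the class that comes first in ascending order. An odd number of models avoids ties in binary problems.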

# <a id="7"> Random Forest 🌲🌲🌲</a>

## <a id="71">Bootstrapping</a>

<br>

<div style="border-radius:10px;border:lightgreen solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
   <strong>Random sampling <mark style="color:white; background-color:lightgreen; border-radius:4px;opacity:1.0">with replacement</mark></strong>, allowing the same observation to be chosen multiple times.
</div>


<br>

![Bootstrapping](https://github.com/hossamAhmedSalah/Machine_Learning_MSP/blob/main/Assets/OutOfBag.jpg?raw=true)
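
A minimal sketch of one bootstrap sample, using row indices only. The sample size `n_rows` and the variable names are assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000  # pretend the training set has 1000 rows

# draw n_rows indices WITH replacement: some rows repeat, some are never drawn
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# rows that appear at least once are "in the bag"; the rest are "out of bag"
in_bag = np.unique(bootstrap_idx)
out_of_bag = np.setdiff1d(np.arange(n_rows), in_bag)

print(f"Unique rows in the bootstrap sample: {len(in_bag) / n_rows:.1%}")
print(f"Out-of-bag rows: {len(out_of_bag) / n_rows:.1%}")
```

For large datasets roughly 63% of the rows (1 - 1/e) end up in each bootstrap sample and roughly 37% are left out. Those left-out rows are what the out-of-bag score later in this notebook is computed on.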

## <a id="72">How does the Forest work?</a>

![Forest](https://github.com/hossamAhmedSalah/Machine_Learning_MSP/blob/main/Assets/RandomForestBagging.jpg?raw=true)

In [None]:
model3 = RandomForestClassifier()
model3.fit(X_train, y_train)

# evaluating
print(f"Training score {model3.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(model3, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {model3.score(X_test, y_test)}")

In [None]:
mo = RandomForestClassifier(ccp_alpha=0.01)
mo.fit(X_train, y_train)

# evaluating
print(f"Training score {mo.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(mo, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {mo.score(X_test, y_test)}")

In [None]:
mo1 = RandomForestClassifier(ccp_alpha=0.001)
mo1.fit(X_train, y_train)

# evaluating
print(f"Training score {mo1.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(mo1, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {mo1.score(X_test, y_test)}")

In [None]:
mo2 = RandomForestClassifier(n_estimators=70)
mo2.fit(X_train, y_train)

# evaluating
print(f"Training score {mo2.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(mo2, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {mo2.score(X_test, y_test)}")

In [None]:
# how about max voting 🫣 
voter = VotingClassifier(estimators=[('mo', mo), ('mo1', mo1), ('mo2', mo2)], n_jobs=-1, voting='hard')

voter.fit(X_train, y_train)
# evaluating
print(f"Training score {voter.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(voter, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {voter.score(X_test, y_test)}")

# <a id="73">Grid Search</a>

In [None]:
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42)
# Specify the hyperparameters to search over
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_features': ['sqrt', 'log2', None],  # Number of features to consider at each split ('auto' was removed in scikit-learn 1.3; None means all features)
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4]  # Minimum number of samples required at each leaf node
}

# Create a grid search instance
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs=-1)

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)


<br>
<div style="border-radius:10px;border:skyblue solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
The **`max_features`** hyperparameter in scikit-learn's Random Forest and Decision Tree models controls how many features are considered when searching for the best split at each node of a tree. It can take several values, including:

1. **`'auto'`**: A legacy option. For classifiers it set **`max_features`** to the <strong><mark style="background-color:skyblue;color:white;border-radius:4px;opacity:1.0;">square root of the total number of features</mark></strong>, so with 16 features **`'auto'`** would consider 4 features per split. It was deprecated in scikit-learn 1.1 and removed in 1.3, so use **`'sqrt'`** instead.
2. **`'sqrt'`**: Sets **`max_features`** to the square root of the total number of features. This is the default for `RandomForestClassifier` in current scikit-learn versions.
3. **`'log2'`**: Sets **`max_features`** to the base-2 logarithm of the total number of features. It's another way to control the number of features considered at each split.
4. An integer value: You can specify an integer value for **`max_features`**, indicating the <strong><mark style="background-color:skyblue;color:white;border-radius:4px;opacity:1.0;">exact number of features to consider at each split</mark></strong>. For example, **`max_features=10`** would consider 10 features at each split.
5. A float value between 0 and 1: If you specify a floating-point value between 0 and 1 (e.g., **`max_features=0.5`**), the algorithm will consider that <strong><mark style="background-color:skyblue;color:white;border-radius:4px;opacity:1.0;">fraction of the total features at each split</mark></strong>.
6. **`None`**: All features are considered at each split. This is the default for a single `DecisionTreeClassifier`.

The choice of **`max_features`** can impact the behavior of the Random Forest or Decision Tree model. Smaller values make the individual trees less similar to each other, which tends to reduce overfitting of the ensemble, while larger values let every tree rely on the same strongest features and can lead to overfitting on the training data. The **`'sqrt'`** and **`'log2'`** settings are convenient because they adapt to the number of features in your dataset.
</div>
<br>
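
The settings described above can be checked directly. This is a small sketch on random toy data with 16 features (the data and the names `X_toy`, `y_toy` and `resolved` are assumptions for this illustration). A fitted `DecisionTreeClassifier` exposes the number of features it actually uses per split in its `max_features_` attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 16))     # 16 features, as in the example above
y_toy = rng.integers(0, 2, size=50)   # random binary target, only needed so fit() runs

resolved = {}
for setting in ["sqrt", "log2", 10, 0.5, None]:
    tree = DecisionTreeClassifier(max_features=setting, random_state=0).fit(X_toy, y_toy)
    # max_features_ holds the number of features considered at each split
    resolved[setting] = tree.max_features_
    print(f"max_features={setting!r:>7} -> {tree.max_features_} features per split")
```

With 16 features this gives 4 for `'sqrt'`, 4 for `'log2'`, 10 for the integer, 8 for the fraction 0.5 and 16 for `None`.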

In [None]:
# Get the best model and its hyperparameters
best_rf = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Evaluate the best model on your test data
test_accuracy = best_rf.score(X_test, y_test)

# Print the best hyperparameters and test accuracy
print("Best Hyperparameters:", best_params)
print("Best Cross-Validation Accuracy:", best_score)
print("Test Accuracy:", test_accuracy)

In [None]:
pd.DataFrame(grid_search.cv_results_)

In [None]:
grid_search.best_params_

## <a id="74">Out Of Bag score (OOB)</a>

In [None]:
model4 = RandomForestClassifier(random_state=42, min_samples_leaf=2, min_samples_split=10, n_estimators=100, oob_score=True)
# Fit the model to the data
model4.fit(X_train, y_train)

# Access the OOB score (accuracy)
oob_score = model4.oob_score_
print("OOB Score:", oob_score)
# note: oob_score is False by default; if you don't pass oob_score=True,
# accessing model4.oob_score_ raises an AttributeError because the attribute is never created
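
As a rough sanity check of what the OOB score measures, the sketch below compares it with the accuracy on a held-out split. It uses synthetic data from `make_classification` rather than the heart dataset, and all variable names are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic binary classification data (an assumption for the demo)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# each tree is scored on the training rows it did NOT see in its bootstrap sample
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(Xs_train, ys_train)
oob_accuracy = forest.oob_score_
held_out_accuracy = forest.score(Xs_test, ys_test)
print(f"OOB accuracy:      {oob_accuracy:.3f}")
print(f"Held-out accuracy: {held_out_accuracy:.3f}")

# without oob_score=True the attribute does not exist
plain_forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(Xs_train, ys_train)
has_oob = hasattr(plain_forest, "oob_score_")
print("oob_score_ available without oob_score=True:", has_oob)
```

The two accuracies are usually close, which is why the OOB score is often used as a cheap validation estimate that does not need a separate split.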

# <a id= "8">Ada Boost</a>

In [None]:
boost_model = AdaBoostClassifier(random_state=42)

# fitting the model
boost_model.fit(X_train, y_train)

# Evaluating 
print(f"Training score {boost_model.score(X_train, y_train)}")
print(f"Cross Validation {cross_val_score(boost_model, X_train, y_train, cv=5, n_jobs=-1).mean()}")
print(f"Testing score {boost_model.score(X_test, y_test)}")

### <a id="81">Ada Boost Parameters</a> 

1. **`n_estimators`**:
    - This hyperparameter determines the number of weak learners (base estimators) to train. Increasing **`n_estimators`** can lead to a more powerful ensemble, but it may also increase the risk of overfitting. You can perform cross-validation or use other methods to find an appropriate value for this parameter.
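2. **`estimator`** (named **`base_estimator`** before scikit-learn 1.2; the old name was removed in 1.4):
    - The weak learner that gets boosted. The default is a decision stump (`DecisionTreeClassifier(max_depth=1)`), but AdaBoost can use other classifiers that support sample weights in `fit` (e.g., DecisionTreeClassifier, RandomForestClassifier, etc.). The choice should depend on the characteristics of your data. For example, if a single-split stump is too weak to capture non-linear interactions, a slightly deeper decision tree might be suitable.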
3. **`learning_rate`**:
    - The learning rate shrinks the contribution of each weak learner in the ensemble. A smaller learning rate (e.g., 0.1 or lower) may improve generalization but require more estimators. A larger learning rate (e.g., 1.0) makes the ensemble learn faster but can lead to overfitting.
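4. **`algorithm`**:
    - Older scikit-learn releases offer two algorithms: 'SAMME.R' (the default there) and 'SAMME'. 'SAMME' boosts on the discrete class predictions of each weak learner, while 'SAMME.R' uses the predicted class probabilities, so it requires the base estimator to have a **`predict_proba`** method; it typically converges in fewer boosting rounds. 'SAMME.R' was deprecated in scikit-learn 1.4, and newer releases use 'SAMME' only.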
5. **`random_state`**:
    - Set a random seed (**`random_state`**) for reproducibility if needed.
6. **Base Estimator Parameters**:
    - If you use a decision tree as the base estimator, you can also tune its hyperparameters, such as **`max_depth`**, **`min_samples_split`**, and **`min_samples_leaf`**, to control the complexity of the individual trees.
7. **Feature Scaling and Preprocessing**:
    - Depending on the choice of base estimator, you might need to consider feature scaling and preprocessing steps that are suitable for that estimator.
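
The interplay between **`n_estimators`** and **`learning_rate`** can be inspected with `staged_score`, which yields the accuracy after each boosting round. This is a minimal sketch on synthetic data; the dataset and the names `X_ada`, `staged` and `booster` are assumptions for this illustration, not part of the workshop data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# synthetic binary classification data (an assumption for the demo)
X_ada, y_ada = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_ada, y_ada, test_size=0.25, random_state=0
)

staged = {}
for lr in (0.1, 1.0):
    booster = AdaBoostClassifier(n_estimators=100, learning_rate=lr, random_state=0)
    booster.fit(Xa_train, ya_train)
    # one test accuracy per boosting round actually performed
    staged[lr] = list(booster.staged_score(Xa_test, ya_test))
    print(
        f"learning_rate={lr}: first round {staged[lr][0]:.3f}, "
        f"last round {staged[lr][-1]:.3f} ({len(staged[lr])} rounds)"
    )
```

Plotting `staged[lr]` against the round number shows where adding more weak learners stops helping, which is a cheap way to pick `n_estimators` for a given learning rate.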

## <a id="82"> Grid search </a>

In [None]:
# Create an AdaBoostClassifier instance
ada_boost = AdaBoostClassifier(random_state=42)

# Define the hyperparameters and their possible values for the grid search
param_grid = {
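    # the parameter is called 'estimator' since scikit-learn 1.2 (use 'base_estimator' on older versions)
    'estimator': [DecisionTreeClassifier(max_depth=1), LogisticRegression(max_iter=1000)],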
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.1, 0.5, 1.0],
}

# Create a GridSearchCV instance
grid_search = GridSearchCV(estimator=ada_boost, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs=-1)

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)

In [None]:
best_model = grid_search.best_estimator_
best_model

In [None]:
best_model.score(X_test, y_test)

In [None]:
best_model.score(X_train, y_train)

In [None]:
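# note: feature_importances_ only works if the winning base estimator is tree-based (not LogisticRegression)
importance = best_model.feature_importances_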
features = X_train.columns
plt.figure(figsize=(10,10))
plt.bar(features, importance)
plt.xticks(rotation = 45);

In [None]:
ax = [None for i in range(2)]

plt.figure(figsize=(15,7), dpi=200)

# making the structure 
ax[0] = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax[1] = plt.subplot2grid((1, 2), (0, 1), colspan=1)

# seaborn correlation heatmap at ax[0]
sns.heatmap(data=heart.corr(), annot=True, cmap="Blues", ax=ax[0], fmt=".2f")

# feature importance at ax[1]
ax[1].bar(features, importance)
ax[1].set_xticks(range(len(features)))
ax[1].set_xticklabels(features, rotation=45)  # Use set_xticklabels to rotate x-axis labels
ax[1].set_title("Feature importance")  # Use set_title to set the title

# Adjust layout and display the plots
plt.tight_layout()
plt.show()

[🌟go to table of content🌟](#key)