Agenda

    Understand the theory and applications of bagging, boosting, and stacking.
    Implement ensemble learning techniques using scikit-learn, XGBoost, and LightGBM.
    Evaluate and compare ensemble models with baseline models.
    Interpret feature importance in ensemble models.


Training models on the heart disease dataset

In [8]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score


In [2]:
df = pd.read_csv('./datasets/heart.csv')
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   52    1   0       125   212    0        1      168      0      1.0      2   2     3       0
1   53    1   0       140   203    1        0      155      1      3.1      0   0     3       0
2   70    1   0       145   174    0        1      125      1      2.6      0   0     3       0
3   61    1   0       148   203    0        1      161      0      0.0      2   1     3       0
4   62    0   0       138   294    1        1      106      0      1.9      1   3     2       0


In [3]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [4]:
X = df.drop(columns= ['target'])
y = df['target']

x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=42)
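
The agenda calls for comparing ensembles against baseline models, so it may help to record baseline scores before training any ensemble. The sketch below is one way to do that. It runs on synthetic data from `make_classification` (an assumption, used only so the snippet is self-contained), and the names `X_demo`, `baselines`, and `baseline_results` are hypothetical. In the notebook, `x_train`, `x_test`, `y_train`, and `y_test` would take the place of the synthetic split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in with 13 numeric features and a binary target,
# roughly mirroring the shape of the heart disease table.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=13, n_informative=6, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Two simple single-model baselines to compare the ensembles against.
baselines = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

baseline_results = {}
for name, model in baselines.items():
    model.fit(Xb_train, yb_train)
    pred = model.predict(Xb_test)
    # Metric functions take the true labels first, then the predictions.
    baseline_results[name] = {
        "accuracy": accuracy_score(yb_test, pred),
        "f1": f1_score(yb_test, pred),
    }
    print(name, baseline_results[name])
```

Scaling is applied only to the logistic regression pipeline, since tree-based models do not depend on feature scale.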

Bagging
Popular methods: Random Forest, Bagging Classifier.
Key idea: Train models in parallel using bootstrap sampling.
Goal: Reduce overfitting by averaging predictions.
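
To make the key idea concrete, here is a minimal hand-rolled sketch of bagging: draw bootstrap samples with replacement, fit one tree per sample, and combine the trees by majority vote. It is an illustration under stated assumptions, not scikit-learn's implementation. The data is synthetic, and names such as `Xd_train`, `trees`, and `votes` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Assumption: synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

n_models = 25  # odd, so a majority vote over 0/1 labels cannot tie
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement, same size as the training set.
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 1_000_000)))
    tree.fit(Xd_train[idx], yd_train[idx])
    trees.append(tree)

# Aggregate: each tree votes 0 or 1, and the majority wins.
votes = np.stack([tree.predict(Xd_test) for tree in trees])  # (n_models, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_accuracy = (bagged_pred == yd_test).mean()
print(f"Hand-rolled bagging accuracy: {bagged_accuracy:.3f}")
```

The trees are independent of one another, which is why bagging can be trained in parallel.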

In [5]:
# Train a BaggingClassifier with decision trees as the base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42,
)
bagging_model.fit(x_train, y_train)

y_pred = bagging_model.predict(x_test)

# accuracy_score expects (y_true, y_pred)
bagging_model_accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy Score : {bagging_model_accuracy}')

# Repeat with random forests as the base estimator. Each forest is already
# a bagged ensemble of trees, so this nests bagging inside bagging.
bagging_model = BaggingClassifier(
    estimator=RandomForestClassifier(),
    n_estimators=100,
    random_state=42,
)
bagging_model.fit(x_train, y_train)

y_pred = bagging_model.predict(x_test)

bagging_model_accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy Score : {bagging_model_accuracy}')


Accuracy Score : 0.9853658536585366
Accuracy Score : 0.9853658536585366


Boosting
Popular methods: Gradient Boosting, AdaBoost, XGBoost, LightGBM, CatBoost.
Key idea: Train models sequentially, correcting errors made by previous models.
Goal: Achieve higher accuracy by focusing on hard-to-predict instances.
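
The sequential idea can be sketched with a small hand-rolled version of discrete AdaBoost: after each round, misclassified rows gain weight, so the next weak learner concentrates on them. This is a simplified illustration under stated assumptions, not the scikit-learn implementation. The data is synthetic, labels are recoded to -1/+1, and names such as `weights`, `stumps`, `alphas`, and `boosted_predict` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data, with labels recoded from {0, 1} to {-1, +1}.
X_toy, y01 = make_classification(
    n_samples=400, n_features=13, n_informative=6, random_state=0
)
y_pm = np.where(y01 == 1, 1, -1)

n_rounds = 30
weights = np.full(len(y_pm), 1.0 / len(y_pm))  # start with uniform row weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree fitted to the current row weights.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_toy, y_pm, sample_weight=weights)
    pred = stump.predict(X_toy)

    # Weighted error, clipped to keep the logarithm finite.
    err = np.clip(weights[pred != y_pm].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's say in the final vote

    # Misclassified rows (y * pred == -1) gain weight; correct rows lose weight.
    weights *= np.exp(-alpha * y_pm * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def boosted_predict(X):
    """Weighted vote of all stumps, returned as -1/+1 labels."""
    score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)


first_stump_accuracy = (stumps[0].predict(X_toy) == y_pm).mean()
boosted_accuracy = (boosted_predict(X_toy) == y_pm).mean()
print(f"First stump (train): {first_stump_accuracy:.3f}")
print(f"Boosted vote (train): {boosted_accuracy:.3f}")
```

Unlike bagging, each round depends on the weights produced by the previous one, so the learners cannot be trained in parallel.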


In [6]:
# Train GradientBoosting and AdaBoost classifiers and evaluate their performance metrics
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42,
)

gb_model.fit(x_train, y_train)
y_pred = gb_model.predict(x_test)

# Metric functions expect (y_true, y_pred)
gb_model_accuracy = accuracy_score(y_test, y_pred)
gb_model_f1_score = f1_score(y_test, y_pred)

print(f'Gradient Boosting Model Accuracy: {gb_model_accuracy}')
print(f'Gradient Boosting Model F1 Score: {gb_model_f1_score}')

# AdaBoost with a random forest as the base estimator. AdaBoost is more
# commonly run with shallow trees; a forest is used here as a stronger learner.
adaBoost_model = AdaBoostClassifier(
    estimator=RandomForestClassifier(max_depth=10),
    n_estimators=50,
    learning_rate=0.1,
    random_state=42,
)

adaBoost_model.fit(x_train, y_train)

y_pred = adaBoost_model.predict(x_test)

adaBoost_model_accuracy = accuracy_score(y_test, y_pred)
adaBoost_model_f1_score = f1_score(y_test, y_pred)

print(f'Ada Boost Model Accuracy: {adaBoost_model_accuracy}')
print(f'Ada Boost Model F1 Score: {adaBoost_model_f1_score}')





Gradient Boosting Model Accuracy: 0.9317073170731708
Gradient Boosting Model F1 Score: 0.9333333333333333
Ada Boost Model Accuracy: 0.9853658536585366
Ada Boost Model F1 Score: 0.9852216748768473


In [14]:
# Stacking: a random forest and a shallow decision tree act as base models,
# and a gradient boosting classifier learns from their predictions
stacking_model = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier()),
        ('dt', DecisionTreeClassifier(max_depth=2, random_state=42)),
    ],
    final_estimator=GradientBoostingClassifier(n_estimators=100, random_state=42),
)
stacking_model.fit(x_train, y_train)
y_pred = stacking_model.predict(x_test)

# Metric functions expect (y_true, y_pred)
stacking_model_accuracy = accuracy_score(y_test, y_pred)
stacking_model_f1_score = f1_score(y_test, y_pred)

print(f'Stacking Model Accuracy : {stacking_model_accuracy}')
print(f'Stacking Model F1 Score : {stacking_model_f1_score}')

# 10-fold cross-validation on the training set, scored by F1
scores = cross_val_score(stacking_model, x_train, y_train, cv=10, scoring='f1')
print(f'Cross Validation F1 Scores : {scores}')

# Same procedure, scored by accuracy
scores = cross_val_score(stacking_model, x_train, y_train, cv=10, scoring='accuracy')
print(f'Cross Validation Accuracy Scores : {scores}')

Stacking Model Accuracy : 0.9853658536585366
Stacking Model F1 Score : 0.9852216748768473
Cross Validation F1 Scores : [1.         0.98795181 0.97674419 0.97619048 0.98823529 1.
 0.98823529 0.98850575 0.96385542 1.        ]
Cross Validation Accuracy Scores : [1.         0.98780488 1.         0.96341463 1.         0.98780488
 0.98780488 0.97560976 0.96341463 1.        ]
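
The agenda also lists interpreting feature importance. One possible approach, sketched below, reads the impurity-based `feature_importances_` attribute that fitted random forest and gradient boosting models expose. The synthetic data, the `feature_i` column names, and the names `models` and `importance_table` are assumptions for illustration. In the notebook, `x_train` and `y_train` would be used, and the real column names would appear in the index.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Assumption: synthetic data with placeholder column names.
feature_names = [f"feature_{i}" for i in range(13)]
X_arr, y_fi = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=42
)
X_fi = pd.DataFrame(X_arr, columns=feature_names)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# One column of impurity-based importances per model, one row per feature.
importance_table = pd.DataFrame(
    {name: model.fit(X_fi, y_fi).feature_importances_ for name, model in models.items()},
    index=feature_names,
)

print(importance_table.sort_values("random_forest", ascending=False).round(3))
```

Each column sums to 1, so the values are relative shares rather than absolute effect sizes. Impurity-based importances are computed from the training data only, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.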
