# Table of Contents
- [Bagging](#Bagging)
- [Random Forest](#Random-Forest)
- [Boosting](#Boosting)
  - [1_ Gradient Boosting](#1_-Gradient-Boosting)
  - [2_ XGBoost](#2_-XGBoost)
  - [3_ AdaBoost](#3_-AdaBoost)
- [Parameters](#Parameters)
- [Grid Search](#Grid-Search)
- [Randomized Search](#Randomized-Search)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import xgboost as xgb
import warnings

In [2]:
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('modified_data.csv')

In [4]:
df.drop(columns=['Unnamed: 0'], inplace=True)
df

,fasting blood sugar,triglyceride,serum creatinine,systolic,dental caries,Cholesterol,AST_ALT_ratio,waist_weight_ratio,smoking
0,94.0,297.0,1.0,135.0,0,172.0,0.880000,1.350000,1.0
1,122.5,55.0,1.1,146.0,1,194.0,1.173913,1.369231,0.0
2,79.0,197.0,0.8,118.0,0,178.0,0.870967,1.080000,1.0
3,91.0,203.0,1.0,131.0,1,180.0,0.740740,1.105263,0.0
4,91.0,87.0,0.8,121.0,0,155.0,1.461537,1.341667,1.0
...,...,...,...,...,...,...,...,...,...
139301,91.0,248.0,0.9,110.0,0,220.0,0.590909,1.075294,1.0
139302,70.5,47.0,0.8,127.0,0,238.0,0.961538,1.533333,0.0
139303,84.0,45.0,0.6,114.0,0,189.0,1.666665,1.320000,0.0
139304,122.0,148.0,1.1,121.0,0,165.0,1.294117,1.226667,1.0


In [5]:
X = df.drop('smoking', axis=1)
Y = df['smoking']

In [6]:
# 60% train, 20% validation (X_cv), 20% test
X_train, x_, y_train, y_ = train_test_split(X, Y, test_size=0.4, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.5, random_state=1)

In [12]:
def evaluate_model(name, y_true, y_pred):
    print(f"\n{name} Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("Classification Report:\n", classification_report(y_true, y_pred))

## Bagging

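- Bagging (bootstrap aggregating) trains multiple models independently, and therefore in parallel if desired, each on a bootstrap sample of the training data (rows drawn with replacement), and combines their predictions by voting or averaging. This reduces variance and helps limit overfitting.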
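
The resample-and-combine idea can be sketched by hand. This is a minimal illustration, not the internals of `BaggingClassifier`: it assumes synthetic data from `make_classification` (a stand-in, not the notebook's dataset), and the names `X_demo`, `y_demo`, `n_models`, and `bagged_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean vote per sample is enough.
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_models, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```

scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, so its output can differ slightly from this hard vote.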

In [7]:
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)
y_pred_bag = bagging_model.predict(X_cv)

In [97]:
evaluate_model("Bagging", y_cv, y_pred_bag)


Bagging Accuracy: 0.7098
Confusion Matrix:
 [[ 9209  4754]
 [ 3331 10567]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.73      0.66      0.69     13963
         1.0       0.69      0.76      0.72     13898

    accuracy                           0.71     27861
   macro avg       0.71      0.71      0.71     27861
weighted avg       0.71      0.71      0.71     27861



## Random Forest

- Random Forest applies bagging to decision trees and adds one step: each split considers only a random subset of the features. This decorrelates the trees, which increases robustness and further reduces overfitting.
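
The per-split feature subset is controlled by `max_features`. The sketch below assumes synthetic data with 9 features (chosen only so that the square root is a whole number; it is not the notebook's dataset) and inspects what a fitted forest records. `X_demo`, `y_demo`, and `forest` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 9 features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=0)

# max_features="sqrt": each split looks at a random subset of sqrt(9) = 3 features.
forest = RandomForestClassifier(n_estimators=20, max_features="sqrt", random_state=0)
forest.fit(X_demo, y_demo)

# Every fitted tree stores the resolved number of features sampled per split.
features_per_split = forest.estimators_[0].max_features_

# Impurity-based importances, averaged over the trees and normalized to sum to 1.
importances = forest.feature_importances_
```

Setting `max_features=None` would make every split consider all features, which reduces the forest to plain bagging of decision trees.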

In [101]:
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_cv)

In [105]:
evaluate_model("Random Forest", y_cv, y_pred_rf)


Random Forest Accuracy: 0.7185
Confusion Matrix:
 [[ 9302  4661]
 [ 3183 10715]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.75      0.67      0.70     13963
         1.0       0.70      0.77      0.73     13898

    accuracy                           0.72     27861
   macro avg       0.72      0.72      0.72     27861
weighted avg       0.72      0.72      0.72     27861



## Boosting
- Boosting trains models sequentially: each new model focuses on the errors of the ensemble built so far, which mainly reduces bias and improves accuracy.

### 1_ Gradient Boosting

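- Gradient Boosting builds models sequentially: each new model is fit to the negative gradient of the loss function with respect to the current ensemble's predictions (for squared error, these are the residuals), and its output is added to the ensemble scaled by a learning rate.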
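
The residual-fitting loop can be written out directly. The sketch below is an illustration under stated assumptions, not what `GradientBoostingClassifier` does internally: it uses squared error on a synthetic regression problem, because in that case the negative gradient is literally the residual `y - F(x)`. All names (`X_demo`, `y_demo`, `learning_rate`, `mse_per_stage`) are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # stage 0: a constant model
mse_per_stage = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Add the new tree's correction, shrunk by the learning rate.
    prediction += learning_rate * tree.predict(X_demo)
    mse_per_stage.append(np.mean((y_demo - prediction) ** 2))
```

For classification with log loss the same loop applies, but the quantity being fit is the difference between the label and the predicted probability rather than a plain residual.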

In [9]:
gb_model = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_cv)

In [103]:
evaluate_model("Gradient Boosting", y_cv, y_pred_gb)


Gradient Boosting Accuracy: 0.7229
Confusion Matrix:
 [[ 9118  4845]
 [ 2875 11023]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.76      0.65      0.70     13963
         1.0       0.69      0.79      0.74     13898

    accuracy                           0.72     27861
   macro avg       0.73      0.72      0.72     27861
weighted avg       0.73      0.72      0.72     27861



### 2_ XGBoost

- XGBoost is an optimized implementation of gradient boosting that adds L1/L2 regularization on the leaf weights and system-level optimizations, such as parallelized split finding, to improve speed and performance.
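
To make the regularization concrete, here is a sketch of the regularized objective as it is commonly presented for XGBoost. The symbols are notation introduced here, not taken from the notebook: $T$ is the number of leaves of a tree $f$, $w_j$ is the weight of leaf $j$, and $G_j$, $H_j$ are the sums of the first and second derivatives of the loss over the samples in leaf $j$.

$$
\mathrm{Obj} = \sum_{i} l\big(y_i, \hat{y}_i\big) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
$$

$$
w_j^{*} = -\frac{G_j}{H_j + \lambda}
$$

A larger $\lambda$ shrinks every leaf weight toward zero, and a larger $\gamma$ penalizes each additional leaf. In `XGBClassifier` these correspond to the `reg_lambda` and `gamma` parameters.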

In [111]:
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_cv)

In [113]:
evaluate_model("XGBoost", y_cv, y_pred_xgb)


XGBoost Accuracy: 0.7250
Confusion Matrix:
 [[ 9124  4839]
 [ 2822 11076]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.76      0.65      0.70     13963
         1.0       0.70      0.80      0.74     13898

    accuracy                           0.73     27861
   macro avg       0.73      0.73      0.72     27861
weighted avg       0.73      0.73      0.72     27861



### 3_ AdaBoost

- AdaBoost increases model accuracy by assigning higher weights to misclassified instances so that subsequent models focus more on the harder cases.
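
The reweighting step can be sketched with decision stumps. This is the classic discrete AdaBoost update written out by hand on synthetic data; it is an illustration under those assumptions and not necessarily the exact variant `AdaBoostClassifier` applies. All names (`X_demo`, `y_demo`, `weights`, `alphas`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption), relabelled from {0, 1} to {-1, +1}.
X_demo, y01 = make_classification(n_samples=400, n_features=6, random_state=0)
y_demo = 2 * y01 - 1

n = len(y_demo)
weights = np.full(n, 1.0 / n)  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this round's stump, clipped to keep the log finite.
    err = np.clip(np.sum(weights[pred != y_demo]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # the stump's say in the final vote

    # Misclassified samples (y * pred = -1) gain weight; correct ones lose it.
    weights *= np.exp(-alpha * y_demo * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes.
scores = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(scores >= 0, 1, -1)
train_accuracy = np.mean(ensemble_pred == y_demo)
```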

In [115]:
ab_model = AdaBoostClassifier(n_estimators=50, random_state=42)
ab_model.fit(X_train, y_train)
y_pred_ab = ab_model.predict(X_cv)

In [117]:
evaluate_model("AdaBoost", y_cv, y_pred_ab)


AdaBoost Accuracy: 0.7168
Confusion Matrix:
 [[ 9430  4533]
 [ 3358 10540]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.74      0.68      0.71     13963
         1.0       0.70      0.76      0.73     13898

    accuracy                           0.72     27861
   macro avg       0.72      0.72      0.72     27861
weighted avg       0.72      0.72      0.72     27861



## Parameters

## Grid Search

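- Grid Search exhaustively evaluates every combination of the listed hyperparameter values with cross-validation and keeps the best-scoring one. The grid below has 3 × 3 × 3 = 27 combinations; with scikit-learn's default 5-fold cross-validation (`cv=None`) that is 135 model fits, plus one final refit on the full training set.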
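
The cost of an exhaustive search can be checked before running it. The sketch below uses scikit-learn's `ParameterGrid` to enumerate a grid with three values for each of three hyperparameters, like the one used in this section; `param_grid_demo`, `candidates`, and `n_fits` are hypothetical names.

```python
from sklearn.model_selection import ParameterGrid

# A copy of the three-by-three-by-three grid, for illustration only.
param_grid_demo = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 5, 10],
    'min_samples_split': [2, 5, 10],
}

# ParameterGrid yields one dict per combination, as GridSearchCV would try them.
candidates = list(ParameterGrid(param_grid_demo))
n_candidates = len(candidates)  # 3 * 3 * 3 combinations

# With 5-fold cross-validation, each candidate is fitted five times.
n_fits = n_candidates * 5
```

Adding a fourth hyperparameter with three values would triple `n_fits`, which is why exhaustive grids are usually kept small.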

In [125]:
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 5, 10],
    'min_samples_split': [2, 5, 10]
}

In [17]:
# cv=None falls back to the default 5-fold cross-validation on the training set
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=None, scoring='accuracy')
grid_rf.fit(X_train, y_train)

In [18]:
print("\nBest Random Forest Params:", grid_rf.best_params_)
y_pred_rf_tuned = grid_rf.predict(X_cv)
evaluate_model("Tuned Random Forest", y_cv, y_pred_rf_tuned)


Best Random Forest Params: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}

Tuned Random Forest Accuracy: 0.7256
Confusion Matrix:
 [[ 9033  4930]
 [ 2716 11182]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.77      0.65      0.70     13963
         1.0       0.69      0.80      0.75     13898

    accuracy                           0.73     27861
   macro avg       0.73      0.73      0.72     27861
weighted avg       0.73      0.73      0.72     27861



## Randomized Search

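- Randomized Search evaluates only a fixed number (`n_iter`) of randomly sampled hyperparameter combinations, which is cheaper than an exhaustive grid when the search space is large. Here it samples 10 of the 3 × 4 × 3 × 3 = 108 possible combinations.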
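
What the search draws can be previewed with scikit-learn's `ParameterSampler`. The sketch below uses a four-parameter search space of the same shape as the one in this section; `param_dist_demo`, `total_combinations`, and `sampled` are hypothetical names.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# A copy of the search space, for illustration only.
param_dist_demo = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
}

# Size of the full grid that an exhaustive search would have to cover.
total_combinations = len(ParameterGrid(param_dist_demo))  # 3 * 4 * 3 * 3

# Draw 10 candidates; with list-valued parameters, sampling is without replacement.
sampled = list(ParameterSampler(param_dist_demo, n_iter=10, random_state=42))
```

Because every parameter is given as a list, the ten sampled candidates are distinct; passing distributions (for example from `scipy.stats`) instead of lists would allow sampling continuous values.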

In [123]:
param_dist_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0]
}

In [21]:
# Sample n_iter=10 of the 108 possible combinations; cv=None means default 5-fold CV
random_xgb = RandomizedSearchCV(xgb.XGBClassifier(random_state=42), param_distributions=param_dist_xgb,
                                n_iter=10, cv=None, scoring='accuracy', random_state=42)
random_xgb.fit(X_train, y_train)

In [22]:
print("\nBest XGBoost Params:", random_xgb.best_params_)
y_pred_xgb_tuned = random_xgb.predict(X_cv)
evaluate_model("Tuned XGBoost", y_cv, y_pred_xgb_tuned)


Best XGBoost Params: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.2}

Tuned XGBoost Accuracy: 0.7272
Confusion Matrix:
 [[ 9101  4862]
 [ 2738 11160]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.77      0.65      0.71     13963
         1.0       0.70      0.80      0.75     13898

    accuracy                           0.73     27861
   macro avg       0.73      0.73      0.73     27861
weighted avg       0.73      0.73      0.73     27861



In [23]:
# Final check on the held-out test set, which played no part in training or model selection
y_pred_xgb_test = random_xgb.predict(X_test)
evaluate_model("Tuned XGBoost", y_test, y_pred_xgb_test)


Tuned XGBoost Accuracy: 0.7348
Confusion Matrix:
 [[ 9038  4780]
 [ 2609 11435]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.78      0.65      0.71     13818
         1.0       0.71      0.81      0.76     14044

    accuracy                           0.73     27862
   macro avg       0.74      0.73      0.73     27862
weighted avg       0.74      0.73      0.73     27862

