# Classification - Ensemble Methods (Without Outliers)

An ensemble method combines the predictions of several individual models into a single, stronger predictive model. Because different models tend to make different errors, aggregating them often improves accuracy and robustness compared with any single model. This notebook uses gradient boosting, an ensemble that builds decision trees sequentially, with each new tree fitted to correct the errors of the trees before it.
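
To make the combination idea concrete, here is a minimal sketch of majority voting, one of the simplest ways to merge model predictions. The three prediction lists and the helper names `majority_vote` and `accuracy` are hypothetical and exist only for illustration. Gradient boosting, used later in this notebook, combines its trees additively rather than by voting.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model class predictions by majority vote, sample by sample."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)]; on a tie the label seen first wins
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined

def accuracy(y_true, y_pred):
    """Fraction of samples where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predictions from three imperfect models on five samples
y_true = [0, 1, 1, 0, 1]
model_a = [0, 1, 1, 1, 1]  # wrong on the fourth sample
model_b = [0, 1, 0, 0, 1]  # wrong on the third sample
model_c = [1, 1, 1, 0, 0]  # wrong on the first and fifth samples

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("A", model_a), ("B", model_b), ("C", model_c), ("ensemble", ensemble)]:
    print(name, accuracy(y_true, preds))
```

In this toy case each model errs on different samples, so the vote outscores every individual model. That only holds when the models' errors are not strongly correlated.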

In [11]:
# Import packages
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

In [7]:
# Load the SMOTE-resampled dataset with outliers removed
# (new_df_without_outliers_copy_smote_resampled.xlsx); the raw string keeps the backslashes literal
x_without_outlier = pd.read_excel(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_without_outliers_copy_smote_resampled.xlsx')
x_without_outlier

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,smoking_history_encoded,gender_encoded,diabetes
0,1.349487,0,1,-0.580455,0.629251,-0.317312,0.263730,-0.974068,0
1,0.149555,0,0,-0.241118,0.629251,-1.649552,-1.579747,-0.974068,0
2,-1.050377,0,0,-0.241118,-0.272192,0.082360,0.263730,1.211318,0
3,-0.681167,0,0,-0.857661,-0.973313,0.015748,0.331039,-0.974068,0
4,1.164882,1,1,-1.384988,-1.173634,0.015748,0.331039,1.211318,0
...,...,...,...,...,...,...,...,...,...
181139,1.349487,0,0,-0.261522,0.979109,-0.628168,1.103451,1.211318,1
181140,1.349487,0,0,-0.241118,-0.172031,0.015748,0.624467,-0.399610,1
181141,-0.727318,0,0,-0.241061,1.025419,0.104564,0.763605,1.211318,1
181142,0.334160,0,0,-0.241118,0.045409,0.015748,-1.579747,1.211318,1


In [8]:
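# Split the dataset into features and target variable.
# 'Unnamed: 0' is the row index saved with the file, not a predictor, so it is dropped too.
X = x_without_outlier.drop(columns=['diabetes', 'Unnamed: 0'], errors='ignore')
y = x_without_outlier['diabetes']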

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Best parameters for Gradient Boosting

In [13]:
# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.5, 0.8, 1.0]
}

# Initialize the GradientBoostingClassifier
gb_classifier = GradientBoostingClassifier(random_state=42)

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=gb_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Retrieve the best hyperparameters and the best estimator (refitted on the full training set)
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best Hyperparameters:", best_params)
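
Before launching a search like this, it can help to estimate how many models it will train. The sketch below repeats the grid values from the cell above under the illustrative name `grid` and counts the candidates. The count is exact for an exhaustive grid, but the runtime it implies depends on the hardware and is not estimated here.

```python
from math import prod

# Same values as the search grid above, copied under an illustrative name
grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.5, 0.8, 1.0],
}
cv_folds = 5

# An exhaustive grid search tries every combination of values
n_candidates = prod(len(values) for values in grid.values())

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

With six parameters of three values each, the grid holds 729 candidates, which means 3645 cross-validation fits plus one final refit of the best model. On a training set of this size that can take a long time, so a smaller grid is worth considering if runtime is a concern.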

In [None]:
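# Train a Gradient Boosting classifier with the best hyperparameters.
# best_params is a dict, so it must be unpacked into keyword arguments with **.
gb_classifier = GradientBoostingClassifier(**best_params, random_state=42)
gb_classifier.fit(X_train, y_train)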

In [None]:
# Evaluate the Gradient Boosting classifier
y_pred_gb = gb_classifier.predict(X_test)
print("Gradient Boosting Classifier Report:")
print(classification_report(y_test, y_pred_gb))
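
The report above lists precision, recall, and F1-score per class. As a rough sketch of how those numbers are derived for one class, the helper below computes them from true and predicted labels. The function name `binary_metrics` and the toy label lists are hypothetical and are not results from this dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels for illustration only
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

For a diabetes classifier, recall on the positive class shows how many actual cases the model catches, while precision shows how many flagged cases are real.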