<a href="https://colab.research.google.com/github/kishon45229/Customer-Churn-Prediction-in-Telecom-Industry/blob/main/Evaluation_and_Interpretation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Customer Churn Prediction in Telecom Industry**

# **Evaluation and Interpretation**

In the field of machine learning, understanding the performance and reliability of models is crucial for developing robust predictive systems. Several key aspects contribute to this understanding, including Performance Metrics, Cross-Validation, Model Selection and Hyperparameter Tuning, and the issues of Overfitting and Underfitting. Performance metrics such as accuracy, precision, recall, F1 score, and AUC-ROC provide a quantitative measure of how well a model predicts outcomes. Cross-validation techniques, like K-Fold and Stratified K-Fold, help assess a model's ability to generalize to unseen data by systematically splitting the dataset into training and validation sets. Model selection and hyperparameter tuning, through methods like Grid Search, Random Search, and Bayesian Optimization, ensure that the most effective model is chosen and fine-tuned for optimal performance. Finally, understanding and mitigating overfitting, where a model memorizes the training data but fails to generalize, and underfitting, where a model is too simplistic to capture data patterns, are essential to building models that perform well on both training and unseen datasets.

Gnanaraj Kishon\
ITBIN-2110-0054

## **Import Necessary Libraries**

In [2]:
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from google.colab import files

## **Add Dataset**

This notebook continues from where the data mining part stopped, so I read the CSV file that was generated at the end of that part.

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Nature Inspired Algorithms/Mini Project/Data Mining completed dataset.csv')
df.head()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn,MonthlyTenure,MonthlyChargesBinned,TotalChargesBinned
0,0.84507,0.063063,0.171414,0.0,0.197946,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0,2.718129,0,0
1,0.253521,0.06006,0.051674,0.0,0.243191,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0,0.860377,0,0
2,0.169014,0.836336,0.154913,1.0,0.793704,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1,0.185228,4,0
3,0.507042,0.363864,0.23256,0.0,0.221083,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0,0.639141,1,1
4,0.070423,0.107608,0.016489,0.0,0.312633,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.153231,0,0


## **Preprocess the Data**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1406 entries, 0 to 1405
Data columns (total 24 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   tenure                                1406 non-null   float64
 1   MonthlyCharges                        1406 non-null   float64
 2   TotalCharges                          1406 non-null   float64
 3   Cluster                               1406 non-null   float64
 4   PCA2                                  1406 non-null   float64
 5   InternetService_Fiber optic           1406 non-null   float64
 6   InternetService_No                    1406 non-null   float64
 7   OnlineSecurity_No internet service    1406 non-null   float64
 8   OnlineSecurity_Yes                    1406 non-null   float64
 9   OnlineBackup_No internet service      1406 non-null   float64
 10  DeviceProtection_No internet service  1406 non-null   float64
 11  TechSupport_No in

In the next steps, I evaluated each model with cross-validation and checked it for overfitting and underfitting. Where a model showed performance issues, I addressed them with hyperparameter tuning.

# **1. Logistic Regression**

I separated the features (`X`) from the target variable (`y`). I chose the `Churn` column as the target and `tenure`, `MonthlyCharges`, and `TotalCharges` as the features.

In [6]:
X = df[['tenure', 'MonthlyCharges', 'TotalCharges']]
y = df['Churn']

Next, I created an instance of `StandardScaler` that standardizes the features by removing the mean and scaling them to unit variance. `scaler.fit_transform(X)` first fits the `StandardScaler` to the data and then transforms the data using these calculated statistics.

In [8]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
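As a rough illustration of what the scaler computes, the sketch below standardizes a small matrix with NumPy alone. The array `X_demo` and its values are assumptions made for this example and do not come from the churn dataset.

```python
import numpy as np

# Hypothetical feature matrix: 4 rows x 2 features (illustrative values only).
X_demo = np.array([
    [1.0, 20.0],
    [2.0, 40.0],
    [3.0, 60.0],
    [4.0, 80.0],
])

# Column-wise statistics; ddof=0 (population std) is the convention StandardScaler uses.
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0, ddof=0)

# Standardize: subtract the mean, then divide by the standard deviation.
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column
```

After this transformation every column has mean 0 and unit variance, so features measured on different scales contribute comparably to models such as Logistic Regression.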

* The `train_test_split` function splits the data into training and test sets.

* `X_train` and `X_test` are the feature sets for training and testing, respectively.

* `y_train` and `y_test` are the target sets for training and testing, respectively.

* 20% of the data is reserved for testing (`test_size=0.2`), while 80% is used for training.

* `random_state=42` ensures that the split is reproducible by setting a specific seed for random number generation.

* The split is applied to `X`, not `X_scaled`; `X_scaled` is used only in the cross-validation calls.

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## **Cross-Validation**

Cross-validation is a technique used to evaluate the performance of a machine learning model by splitting the data into multiple subsets (folds) and training/testing the model on these different subsets. This provides a more reliable estimate of the model's performance than a single train-test split.



In the next step, I defined 5-fold cross-validation.

In [23]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
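To make the fold mechanics concrete, here is a minimal pure-Python sketch of a shuffled k-fold split. The helper `k_fold_indices` is hypothetical and will not reproduce the exact row assignment of scikit-learn's `KFold`; it only shows that every row lands in exactly one validation fold. The sample count 1406 is taken from the `df.info()` output above.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=1406, n_splits=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train_idx)}, validation={len(val_idx)}")
```

With 1406 rows and 5 folds, one fold holds 282 validation rows and the other four hold 281 each, so each model is trained on roughly 80% of the data in every round.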

In the code below, `cross_val_score` runs the Logistic Regression model with `max_iter=1000` on the dataset `X_scaled, y` using the 5-fold cross-validation strategy defined by `kf`.

In [24]:
cv_scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=kf, scoring='accuracy')

Next, I printed the cross-validation accuracy.

In [25]:
print(f'Cross-Validation Accuracy: {cv_scores_lr.mean():.4f} ± {cv_scores_lr.std():.4f}')

Cross-Validation Accuracy: 0.7824 ± 0.0240


The mean accuracy of 0.7824 with a standard deviation of 0.0240 indicates that the model performs consistently across different subsets of the data.
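One possible refinement, which is not what this notebook does, is to put the scaler inside a pipeline so that it is refit on the training part of every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features; the names `X_demo`, `y_demo`, `pipeline`, and `kf_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features: 3 numeric columns, binary target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# The scaler is fitted on the training rows of each fold only, so the
# validation rows never influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=kf_demo, scoring="accuracy")
print(f"Cross-Validation Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

Scaling the whole dataset before cross-validation lets statistics from the validation rows leak into training. The effect is usually small for standardization, but the pipeline form avoids it at no extra cost.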

## **Analyze Overfitting and Underfitting**

Overfitting shows up as a large difference between training and test accuracy. Underfitting shows up as low accuracy on both the training and test sets.

I calculated training accuracy and test accuracy for each model. Based on the values we can make some decisions such as:

* **High Train Accuracy & Low Test Accuracy**: This indicates overfitting. The model has learned the training data too well but fails to generalize to new data.
* **Low Train Accuracy & Low Test Accuracy**: This indicates underfitting. The model has not learned the underlying patterns in the training data, leading to poor performance on both the training and test sets.
* **Similar Train and Test Accuracy**: This indicates that the model is well-fitted to the data, with a good balance between bias and variance.
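The three rules above can be sketched as a small helper. The function `diagnose_fit` and its thresholds (`max_gap=0.10`, `min_acc=0.70`) are hypothetical choices for illustration, not values used elsewhere in this notebook; suitable cut-offs depend on the problem.

```python
def diagnose_fit(train_acc, test_acc, max_gap=0.10, min_acc=0.70):
    """Classify a train/test accuracy pair using two illustrative thresholds."""
    gap = train_acc - test_acc
    # Low accuracy on both sets: the model has not captured the patterns.
    if train_acc < min_acc and test_acc < min_acc:
        return "underfitting"
    # Training accuracy far above test accuracy: the model does not generalize.
    if gap > max_gap:
        return "overfitting"
    return "well-fitted"

# Illustrative accuracy pairs, one for each rule.
print(diagnose_fit(0.99, 0.71))  # overfitting
print(diagnose_fit(0.62, 0.60))  # underfitting
print(diagnose_fit(0.79, 0.76))  # well-fitted
```

A rule of thumb like this is only a first check; cross-validation scores give a more reliable picture than a single train/test comparison.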

To understand if the model is too complex (overfitting) or too simple (underfitting) I trained the Logistic Regression model using `X_train, y_train` data.

In [26]:
model_lr = LogisticRegression(max_iter=1000, random_state=42)
model_lr.fit(X_train, y_train)

Next, I made predictions on training data. For that, I used `model_lr.predict(X_train)`. To calculate the accuracy of the model I used `accuracy_score(y_train, train_predictions_lr)`.

In [27]:
train_predictions_lr = model_lr.predict(X_train)
train_accuracy_lr = accuracy_score(y_train, train_predictions_lr)

Next, I made predictions on the test data with `model_lr.predict(X_test)` and calculated the accuracy with `accuracy_score(y_test, y_test_pred_lr)`.

In [28]:
y_test_pred_lr = model_lr.predict(X_test)
test_accuracy_lr = accuracy_score(y_test, y_test_pred_lr)

The code below prints the training accuracy and test accuracy of the Logistic Regression model.

In [29]:
print(f'Training Accuracy: {train_accuracy_lr:.4f}')
print(f'Test Accuracy: {test_accuracy_lr:.4f}')

Training Accuracy: 0.7874
Test Accuracy: 0.7589


A higher training accuracy compared to test accuracy can sometimes be a sign of overfitting, but in this case, the difference is small, so overfitting is not a significant concern here.

The small differences between cross-validation accuracy, training accuracy, and test accuracy suggest that the model generalizes well and does not suffer significantly from overfitting or underfitting.

With cross-validation accuracy close to training and test accuracies, the Logistic Regression model appears to be reliable and performs consistently on new data.

Therefore, there is no need for hyperparameter tuning of the Logistic Regression model.

Using the Logistic Regression model, I computed churn probabilities for the test set and added them, together with the predicted labels, to `X_test`.

In [30]:
churn_probabilities = model_lr.predict_proba(X_test)[:, 1]

X_test['Churn_Probabilities'] = churn_probabilities
X_test['Churn_Predictions'] = y_test_pred_lr

X_test

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn_Probabilities,Churn_Predictions
1075,0.042254,0.377377,0.024973,0.469186,0
1015,0.774648,0.411411,0.397472,0.065765,0
650,1.000000,0.457958,0.534290,0.034022,0
447,0.873239,0.008008,0.138280,0.019468,0
1290,1.000000,0.711712,0.750367,0.057853,0
...,...,...,...,...,...
188,0.732394,0.773273,0.580040,0.164714,0
1237,0.197183,0.832833,0.190544,0.637892,1
380,0.507042,0.011011,0.081756,0.062786,0
354,0.464789,0.463463,0.247211,0.198289,0


In the output, two new columns, `Churn_Probabilities` and `Churn_Predictions`, show the predicted probability and the predicted label for each customer.

I selected those two columns and added a percentage column to make the probabilities easier to read.

In [31]:
X_test['Churn_Probabilities_Percentage'] = X_test['Churn_Probabilities'] * 100
X_test[['Churn_Probabilities', 'Churn_Predictions', 'Churn_Probabilities_Percentage']]

Unnamed: 0,Churn_Probabilities,Churn_Predictions,Churn_Probabilities_Percentage
1075,0.469186,0,46.918560
1015,0.065765,0,6.576502
650,0.034022,0,3.402208
447,0.019468,0,1.946755
1290,0.057853,0,5.785318
...,...,...,...
188,0.164714,0,16.471432
1237,0.637892,1,63.789196
380,0.062786,0,6.278558
354,0.198289,0,19.828854


Every row with a `Churn_Probabilities_Percentage` below 50% has 0 in `Churn_Predictions`.

Therefore, if the `Churn_Probabilities_Percentage` is greater than or equal to 50%, the model predicts that the customer is likely to leave the service provider. If it is less than 50%, the model predicts that the customer is likely to stay.

The `Churn_Probabilities_Percentage` can also support decisions about an individual customer.
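The 50% rule can be written as a small helper. The function `predict_churn` is hypothetical, and the 0.40 cut-off is an assumption added only to show how a different threshold changes the labels; the probabilities are copied from rows 1075, 1015, and 1237 of the table above.

```python
def predict_churn(probabilities, threshold=0.5):
    """Label a customer 1 (churn) when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Churn probabilities from rows 1075, 1015, and 1237 of the output above.
sample_probabilities = [0.469186, 0.065765, 0.637892]

default_labels = predict_churn(sample_probabilities)         # 50% cut-off
cautious_labels = predict_churn(sample_probabilities, 0.40)  # hypothetical lower cut-off

print(default_labels)   # [0, 0, 1]
print(cautious_labels)  # [1, 0, 1]
```

Lowering the threshold flags more customers as potential churners, which tends to raise recall at the cost of precision.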

I then saved the trained model with `joblib.dump` and downloaded it from Colab so that it can be reused outside this notebook.

In [32]:
joblib.dump(model_lr, 'churn_prediction_model.pkl')
files.download('churn_prediction_model.pkl')
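A saved model can later be restored with `joblib.load`. The sketch below is a self-contained round trip on synthetic data, assuming a temporary directory instead of the Colab download; `X_demo`, `y_demo`, and `restored_model` are names introduced for this example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the churn features: 3 numeric columns, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save the model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "churn_prediction_model.pkl")
    joblib.dump(model, model_path)
    restored_model = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored_model.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

The restored model expects the same feature columns, in the same order, as the data it was trained on.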


# **2. Decision Tree**

Adding the prediction columns changed `X_test`, so I defined the features again.

I separated the features (`X`) from the target variable (`y`), again using the `Churn` column as the target and `tenure`, `MonthlyCharges`, and `TotalCharges` as the features.

In [35]:
X = df[['tenure', 'MonthlyCharges', 'TotalCharges']]
y = df['Churn']

Next, I split the dataset into train and test.

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## **Cross-Validation**

In the code below, `cross_val_score` runs the Decision Tree model on the dataset `X_scaled, y` using the 5-fold cross-validation strategy defined by `kf`.

In [37]:
cv_scores_dt = cross_val_score(DecisionTreeClassifier(random_state=42), X_scaled, y, cv=kf, scoring='accuracy')

Next, I printed the cross-validation accuracy.

In [38]:
print(f'Cross-Validation Accuracy: {cv_scores_dt.mean():.4f} ± {cv_scores_dt.std():.4f}')

Cross-Validation Accuracy: 0.7119 ± 0.0143


The standard deviation of 0.0143 shows that the model performs consistently across folds, although its mean accuracy of 0.7119 is lower than that of Logistic Regression (0.7824).

## **Analyze Overfitting and Underfitting**

To find out whether the Decision Tree model is overfitting or underfitting, I started by training the model on the training data.

In [39]:
model_dt = DecisionTreeClassifier(random_state=42)
model_dt.fit(X_train, y_train)

Next, I made predictions on training data. For that, I used `model_dt.predict(X_train)`. To calculate the accuracy of the model I used `accuracy_score(y_train, train_predictions_dt)`.

In [40]:
train_predictions_dt = model_dt.predict(X_train)
train_accuracy_dt = accuracy_score(y_train, train_predictions_dt)

Next, I made predictions on the test data with `model_dt.predict(X_test)` and calculated the accuracy with `accuracy_score(y_test, y_test_pred_dt)`.

In [41]:
y_test_pred_dt = model_dt.predict(X_test)
test_accuracy_dt = accuracy_score(y_test, y_test_pred_dt)

The code below prints the training accuracy and test accuracy of the Decision Tree model.

In [42]:
print(f'Training Accuracy: {train_accuracy_dt:.4f}')
print(f'Test Accuracy: {test_accuracy_dt:.4f}')

Training Accuracy: 0.9956
Test Accuracy: 0.7057


The near-perfect training accuracy (0.9956) suggests that the model is overfitting the training data, capturing noise or overly specific patterns that do not generalize well.

The test accuracy (0.7057) indicates that the model still performs reasonably well on unseen data.

However, the gap of about 0.29 between training and test accuracy reinforces the idea that the model is overfitting the training data.

Therefore, I performed hyperparameter tuning for the Decision Tree model.

## **Hyperparameter Tuning**

Hyperparameter tuning is the process of finding the best values for a learning algorithm's hyperparameters. Hyperparameters are settings chosen before training, such as the maximum depth of a tree; they are not learned from the data during fitting. Tuning them can help the model generalize better and produce the best possible results.

I tuned the maximum depth of the tree (`max_depth`), the minimum samples required to split an internal node (`min_samples_split`), and the minimum samples required to be at a leaf node (`min_samples_leaf`).

In [None]:
param_grids_dt = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

`grid_search_dt` is a `GridSearchCV` object built from the Decision Tree model and the specified hyperparameter grid. It uses 5-fold cross-validation (`cv=5`) to evaluate the performance of each combination of parameters, optimizing for accuracy (`scoring='accuracy'`).

In [None]:
grid_search_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grids_dt, cv=5, scoring='accuracy')
grid_search_dt.fit(X_train, y_train)
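To get a feel for the cost of this search, the sketch below enumerates the grid with `itertools.product`. It only counts combinations and does not train anything; the variable names are assumptions for this example.

```python
from itertools import product

# Same grid as param_grids_dt above.
param_grid = {
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5
n_fits = len(candidates) * n_folds  # each candidate is trained once per fold

print(len(candidates), n_fits)  # 36 180
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the whole training set, so the search above trains 181 trees in total.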

I printed the best parameters and the best cross-validation accuracy for the Decision Tree model.

In [None]:
print(f'Best parameters: {grid_search_dt.best_params_}')
print(f'Best cross-validation accuracy: {grid_search_dt.best_score_:.4f}\n')

To evaluate the performance of the best Decision Tree model identified by `GridSearchCV` on the test data, I retrieved best model from the grid search using `best_estimator_`.

In [None]:
best_model_dt = grid_search_dt.best_estimator_

Next, I used it to make predictions on the test data (`X_test`).

In [None]:
y_pred_dt = best_model_dt.predict(X_test)

I calculated and printed the performance metrics of the Decision Tree model.

Performance metrics are used to evaluate how well a model performs on the given data. Common metrics include:
*   **Accuracy**: The proportion of correctly classified instances.
*   **Precision**: The proportion of true positive predictions relative to the total positive predictions.
*   **Recall**: The proportion of true positive predictions relative to the total actual positives.
*   **F1 Score**: The harmonic mean of precision and recall, balancing the two metrics.
*   **AUC-ROC**: The area under the ROC curve, which summarizes how well the model separates the two classes across all classification thresholds.
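As a worked example of the first four definitions, the sketch below computes them from the confusion counts of a small set of labels. The helper `classification_metrics` and the labels are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = churn)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 4 churners and 6 non-churners.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true_demo, y_pred_demo)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.7000, precision: 0.6667, recall: 0.5000, f1: 0.5714
```

Here the model finds only 2 of the 4 churners (recall 0.50) even though 70% of all predictions are correct, which is why accuracy alone can be misleading when churners are the minority class.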

In [None]:
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt  = f1_score(y_test, y_pred_dt)
roc_auc_dt  = roc_auc_score(y_test, best_model_dt.predict_proba(X_test)[:, 1])

print(f'Accuracy: {accuracy_dt :.4f}')
print(f'Precision: {precision_dt :.4f}')
print(f'Recall: {recall_dt:.4f}')
print(f'F1 Score: {f1_dt :.4f}')
print(f'ROC AUC: {roc_auc_dt :.4f}')

The output above shows the performance metrics of the best Decision Tree model.
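Unlike the other metrics, ROC AUC is computed from predicted probabilities rather than labels, which is why `predict_proba` is passed to `roc_auc_score`. One way to read the value is as the share of (churner, non-churner) pairs in which the churner receives the higher score. The sketch below implements that pairwise definition on hypothetical scores; `pairwise_auc` is an illustrative helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted churn probabilities.
y_true_demo = [1, 1, 1, 0, 0, 0, 0]
scores_demo = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(y_true_demo, scores_demo)
print(f"{auc:.4f}")  # 0.9167
```

Of the 12 possible pairs, 11 are ranked correctly; only the churner scored 0.4 falls below the non-churner scored 0.7. A value of 0.5 would correspond to random ranking and 1.0 to perfect separation.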

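To find out whether the best Decision Tree model is overfitting or underfitting, I trained it again on the training data. Because `GridSearchCV` refits the best estimator on the whole training set by default (`refit=True`), this step is not strictly required.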

In [None]:
best_model_dt.fit(X_train, y_train)

To check whether the best Decision Tree model works well, I made predictions on the training data with `best_model_dt.predict(X_train)` and calculated the accuracy with `accuracy_score(y_train, best_train_predictions_dt)`.

In [None]:
best_train_predictions_dt = best_model_dt.predict(X_train)
best_train_accuracy_dt = accuracy_score(y_train, best_train_predictions_dt)

Next, I made predictions on the test data with `best_model_dt.predict(X_test)` and calculated the accuracy with `accuracy_score(y_test, best_y_test_pred_dt)`.

In [None]:
best_y_test_pred_dt = best_model_dt.predict(X_test)
best_test_accuracy_dt = accuracy_score(y_test, best_y_test_pred_dt)

The code below prints the training accuracy and test accuracy of the best Decision Tree model.

In [None]:
print(f'Training Accuracy: {best_train_accuracy_dt:.4f}')
print(f'Test Accuracy: {best_test_accuracy_dt:.4f}')

After hyperparameter tuning, the training accuracy decreased to 0.8683, while the test accuracy improved slightly to 0.7163. The decrease in training accuracy suggests that the model has become less complex and is no longer overfitting as severely. The test accuracy's slight improvement indicates better generalization to unseen data.

Overall, hyperparameter tuning reduced overfitting and improved the model's reliability on unseen data. However, the model still shows signs of overfitting: the gap between training and test accuracy remains about 0.15. Therefore, the Decision Tree model is not well suited to our project.

# **3. Random Forest**

## **Cross-Validation**

In the code below, `cross_val_score` runs the Random Forest model on the dataset `X_scaled, y` using the 5-fold cross-validation strategy defined by `kf`.

In [None]:
cv_scores_rf = cross_val_score(RandomForestClassifier(random_state=42), X_scaled, y, cv=kf, scoring='accuracy')

Next, I printed the cross-validation accuracy.

In [None]:
print(f'Cross-Validation Accuracy: {cv_scores_rf.mean():.4f} ± {cv_scores_rf.std():.4f}')

The mean and standard deviation of the fold scores show how consistently the model performs across different subsets of the data.

## **Analyze Overfitting and Underfitting**

To find out whether the Random Forest model is overfitting or underfitting, I started by training the model on the training data.

In [None]:
model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(X_train, y_train)

Next, I made predictions on the training data with `model_rf.predict(X_train)` and calculated the accuracy with `accuracy_score(y_train, train_predictions_rf)`.

In [None]:
train_predictions_rf = model_rf.predict(X_train)
train_accuracy_rf = accuracy_score(y_train, train_predictions_rf)

Next, I made predictions on the test data with `model_rf.predict(X_test)` and calculated the accuracy with `accuracy_score(y_test, y_test_pred_rf)`.

In [None]:
y_test_pred_rf = model_rf.predict(X_test)
test_accuracy_rf = accuracy_score(y_test, y_test_pred_rf)

The code below prints the training accuracy and test accuracy of the Random Forest model.

In [None]:
print(f'Training Accuracy: {train_accuracy_rf:.4f}')
print(f'Test Accuracy: {test_accuracy_rf:.4f}')

The model performs exceptionally well on the training data, which could indicate overfitting, as it's nearly perfect on the data it was trained on.

The model's performance on unseen data is significantly lower than on the training data. This accuracy is fairly close to the cross-validation accuracy, suggesting some generalization but still indicating potential overfitting.

Therefore, I performed hyperparameter tuning for Random Forest model.

## **Hyperparameter Tuning**

The `param_grids_rf` grid for the Random Forest model is similar to the Decision Tree grid, with an additional parameter for the number of trees (`n_estimators`).

In [None]:
param_grids_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

`grid_search_rf` is a `GridSearchCV` object built from the Random Forest model and the specified hyperparameter grid. It uses 5-fold cross-validation (`cv=5`) to evaluate the performance of each combination of parameters, optimizing for accuracy (`scoring='accuracy'`).

In [None]:
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grids_rf, cv=5, scoring='accuracy')
grid_search_rf.fit(X_train, y_train)

I printed the best parameters and the best cross-validation accuracy for the Random Forest model.

In [None]:
print(f'Best parameters: {grid_search_rf.best_params_}')
print(f'Best cross-validation accuracy: {grid_search_rf.best_score_:.4f}\n')

The output shows the best hyperparameters and the best cross-validation accuracy found for the Random Forest model.
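The Random Forest grid contains 108 combinations, so an exhaustive search is noticeably slower than for the Decision Tree. Random search, mentioned in the introduction, is a cheaper alternative that samples only some of them. The sketch below is not what this notebook ran; it assumes synthetic data and an arbitrary budget of `n_iter=5` candidates with 3 folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the churn features: 3 numeric columns, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# Same value lists as param_grids_rf above.
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Sample 5 of the 108 combinations instead of trying all of them.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(f"{random_search.best_score_:.4f}")
```

Random search may miss the single best combination, but it often gets close at a fraction of the cost, which makes it a reasonable first pass before a narrower grid search.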

To evaluate the performance of the best Random Forest model identified by `GridSearchCV` on the test data, I retrieved best model from the grid search using `best_estimator_`.

In [None]:
best_model_rf = grid_search_rf.best_estimator_

Next, I used it to make predictions on the test data (`X_test`).

In [None]:
y_pred_rf = best_model_rf.predict(X_test)

Finally, I calculated and printed the performance metrics of the best Random Forest model.

In [None]:
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf  = f1_score(y_test, y_pred_rf)
roc_auc_rf  = roc_auc_score(y_test, best_model_rf.predict_proba(X_test)[:, 1])

print(f'Accuracy: {accuracy_rf :.4f}')
print(f'Precision: {precision_rf :.4f}')
print(f'Recall: {recall_rf:.4f}')
print(f'F1 Score: {f1_rf :.4f}')
print(f'ROC AUC: {roc_auc_rf :.4f}')

The above output provides insights into the model's accuracy, precision, recall, F1 score, and its ability to separate classes (ROC AUC).

To find out whether the best Random Forest model is overfitting or underfitting, I started by training the model on the training data.

In [None]:
best_model_rf.fit(X_train, y_train)

To check if the best Random Forest model works well or not, I made predictions on training data. For that, I used `best_model_rf.predict(X_train)`. To calculate the accuracy of the model I used `accuracy_score(y_train, train_predictions_rf)`.

In [None]:
train_predictions_rf = best_model_rf.predict(X_train)
train_accuracy_rf = accuracy_score(y_train, train_predictions_rf)

Next, I made predictions on test data. For that I used `best_model_rf.predict(X_test)`. To calculate the accuracy of the model, I used `accuracy_score(y_test, y_test_pred_rf)`.

In [None]:
y_test_pred_rf = best_model_rf.predict(X_test)
test_accuracy_rf = accuracy_score(y_test, y_test_pred_rf)

The code below prints the training accuracy and test accuracy of the best Random Forest model.

In [None]:
print(f'Training Accuracy: {train_accuracy_rf:.4f}')
print(f'Test Accuracy: {test_accuracy_rf:.4f}')

After hyperparameter tuning, the training accuracy decreased from 0.9956 to 0.8692, which indicates that the model is less overfitted and generalizes better. The test accuracy increased slightly, from 0.7340 to 0.7553.

This suggests that hyperparameter tuning made the model more balanced between training and test performance, but the remaining gap of about 0.11 between training and test accuracy indicates the need for further adjustments or possibly different strategies to improve generalization.

Therefore, the Random Forest model is comparatively suitable for our project.

# **4. Gradient Boosting**

## **Cross-Validation**

In the code below, `cross_val_score` runs the Gradient Boosting model on the dataset `X_scaled, y` using the 5-fold cross-validation strategy defined by `kf`.

In [None]:
cv_scores_gb = cross_val_score(GradientBoostingClassifier(random_state=42), X_scaled, y, cv=kf, scoring='accuracy')

Next, I printed the cross-validation accuracy.

In [None]:
print(f'Cross-Validation Accuracy: {cv_scores_gb.mean():.4f} ± {cv_scores_gb.std():.4f}')

The mean and standard deviation of the fold scores show how consistently the model performs across different subsets of the data.

## **Analyze Overfitting and Underfitting**

To find out whether the Gradient Boosting model is overfitting or underfitting, I started by training the model on the training data.

In [None]:
model_gb = GradientBoostingClassifier(random_state=42)
model_gb.fit(X_train, y_train)

Next, I made predictions on training data. For that, I used `model_gb.predict(X_train)`. To calculate the accuracy of the model I used `accuracy_score(y_train, train_predictions_gb)`.

In [None]:
train_predictions_gb = model_gb.predict(X_train)
train_accuracy_gb = accuracy_score(y_train, train_predictions_gb)

Next, I made predictions on the test data with `model_gb.predict(X_test)` and calculated the accuracy with `accuracy_score(y_test, y_test_pred_gb)`.

In [None]:
y_test_pred_gb = model_gb.predict(X_test)
test_accuracy_gb = accuracy_score(y_test, y_test_pred_gb)

The code below prints the training accuracy and test accuracy of the Gradient Boosting model.

In [None]:
print(f'Training Accuracy: {train_accuracy_gb:.4f}')
print(f'Test Accuracy: {test_accuracy_gb:.4f}')

The training accuracy is somewhat higher than the test accuracy, which suggests that there might be slight overfitting.

I therefore performed hyperparameter tuning for the Gradient Boosting model as well.

## **Hyperparameter Tuning**

I tuned the number of boosting stages (`n_estimators`), the learning rate (`learning_rate`), and the maximum depth (`max_depth`) of the individual regression estimators.

In [None]:
param_grids_gb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

`grid_search_gb` is a `GridSearchCV` object built from the Gradient Boosting model and the specified hyperparameter grid. It uses 5-fold cross-validation (`cv=5`) to evaluate the performance of each combination of parameters, optimizing for accuracy (`scoring='accuracy'`).

In [None]:
grid_search_gb = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grids_gb, cv=5, scoring='accuracy')
grid_search_gb.fit(X_train, y_train)

I printed the best parameters and the best cross-validation accuracy for the Gradient Boosting model.

In [None]:
print(f'Best parameters: {grid_search_gb.best_params_}')
print(f'Best cross-validation accuracy: {grid_search_gb.best_score_:.4f}\n')

The output shows the best hyperparameters and the best cross-validation accuracy found for the Gradient Boosting model.

To evaluate the performance of the best Gradient Boosting model identified by `GridSearchCV` on the test data, I retrieved best model from the grid search using `best_estimator_`.

In [None]:
best_model_gb = grid_search_gb.best_estimator_

Next, I used it to make predictions on the test data (`X_test`).

In [None]:
y_pred_gb = best_model_gb.predict(X_test)

Finally, I calculated and printed the performance metrics of the best Gradient Boosting model.

In [None]:
accuracy_gb= accuracy_score(y_test, y_pred_gb)
precision_gb = precision_score(y_test, y_pred_gb)
recall_gb = recall_score(y_test, y_pred_gb)
f1_gb  = f1_score(y_test, y_pred_gb)
roc_auc_gb  = roc_auc_score(y_test, best_model_gb.predict_proba(X_test)[:, 1])

print(f'Accuracy: {accuracy_gb :.4f}')
print(f'Precision: {precision_gb :.4f}')
print(f'Recall: {recall_gb:.4f}')
print(f'F1 Score: {f1_gb :.4f}')
print(f'ROC AUC: {roc_auc_gb :.4f}')

The above output provides insights into the model's accuracy, precision, recall, F1 score, and its ability to separate classes (ROC AUC).

To find out whether the best Gradient Boosting model is overfitting or underfitting, I started by training the model on the training data.

In [None]:
best_model_gb.fit(X_train, y_train)

To check if the best Gradient Boosting model works well or not, I made predictions on training data. For that, I used `best_model_gb.predict(X_train)`. To calculate the accuracy of the model I used `accuracy_score(y_train, train_predictions_gb)`.

In [None]:
train_predictions_gb = best_model_gb.predict(X_train)
train_accuracy_gb = accuracy_score(y_train, train_predictions_gb)

Next, I made predictions on test data. For that I used `best_model_gb.predict(X_test)`. To calculate the accuracy of the model, I used `accuracy_score(y_test, y_test_pred_gb)`.

In [None]:
y_test_pred_gb = best_model_gb.predict(X_test)
test_accuracy_gb = accuracy_score(y_test, y_test_pred_gb)

The code below prints the training accuracy and test accuracy of the best Gradient Boosting model.

In [None]:
print(f'Training Accuracy: {train_accuracy_gb:.4f}')
print(f'Test Accuracy: {test_accuracy_gb:.4f}')

The hyperparameter tuning process slightly reduced the training accuracy while slightly improving the test accuracy.

The smaller gap between training and test accuracy suggests that the model's tendency to overfit has decreased, indicating better generalization to unseen data.

Therefore, the Gradient Boosting model is suitable for our project.

# **Conclusion**

In conclusion, evaluating the models with performance metrics, cross-validation, and hyperparameter tuning provided valuable insights into their reliability and effectiveness. Initial results indicated overfitting, particularly in the decision tree and random forest models, and hyperparameter tuning narrowed the gap between their training and test accuracy without removing it entirely. Logistic regression and gradient boosting showed the most consistent behavior between training and unseen data. This process underscored the importance of iterative model refinement and careful assessment to achieve models that perform well on both training and unseen data.