<h1 style='text-align:center;border-radius:20px ;border:3px solid #c54E58 ;color : white;  padding: 15px; font-size: 15pt; background-color:blue'>DESIGN AND DEVELOPMENT OF A WEB-BASED BANK CUSTOMER CHURN PREDICTION PLATFORM USING ADVANCED MACHINE LEARNING MODELS TO ENHANCE CUSTOMER RETENTION STRATEGIES AND USER EXPERIENCE</h1>

<div style="border-radius:20px ;border:3px solid red ;color : Blue;  padding: 15px; font-size: 14pt; background-color: ; text-align:left">

This project aims to predict customer churn in banks, helping institutions identify at-risk customers and improve retention. By analyzing factors like credit scores, account balances, and customer activity, the model will provide insights to enhance loyalty strategies.

Using Python and machine learning, the project will develop a predictive model based on historical customer data. Feature engineering, exploratory data analysis (EDA), and visualization techniques will reveal key churn patterns. Advanced models, including ensemble learning, will ensure high accuracy.

The final output will be a web-based platform where banks can input customer data, receive churn predictions, and access retention recommendations. With real-time insights, banks can improve customer experience, reduce attrition, and optimize retention strategies.
</div>

<div style="border-radius:20px ;border:3px solid red ;color : blue; padding: 15px; background-color: rgba(135, 206, 235, 0.4); text-align:left"> <a id="met"></a> <h2> Methodology for Analyzing Features</h2>
    
The methodology for analyzing features in this bank customer churn prediction project follows a structured approach to uncover key patterns and factors influencing customer attrition.

* **Data Overview:**

    - Inspect dataset for missing values, duplicates, and inconsistencies.
    - Analyze distributions of key features like credit scores, account balances, and tenure.
    - Identify outliers and assess data quality.
    
* **Feature Categorization:**

    - Classify features into demographic (e.g., age, gender, country), financial (e.g., balance, estimated salary), and engagement-related (e.g., active member status, number of products used).
    
* **Formulate Hypotheses:**

    - Define the null hypothesis (H0): There is no significant relationship between customer demographics, financial behaviors, and churn.
    - Test hypotheses to determine the strongest churn predictors.

* **Statistical Analysis:**

    - Correlation Analysis to assess relationships between numerical variables like credit score, account balance, and churn probability.
    - Chi-square tests for categorical variables to evaluate their association with churn.
    
* **Visual Analysis:**

    - Use heatmaps, bar charts, and scatter plots to explore trends in customer churn.
    - Box plots and histograms to detect skewness and distribution of key numerical features.
    
* **Observations & Insights:**

    - Summarize key findings from statistical and visual analyses.
    - Identify the most influential features driving customer churn for model training.
</div>
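The chi-square step in the methodology can be sketched on a small made-up table. The counts below are illustrative only and are not taken from the bank dataset, and `chi_square_association` is a hypothetical helper name.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_association(feature: pd.Series, target: pd.Series) -> dict:
    """Run a chi-square test of independence between two categorical series."""
    table = pd.crosstab(feature, target)
    chi2, p_value, dof, _ = chi2_contingency(table)
    return {"chi2": chi2, "p_value": p_value, "dof": dof}

# Toy data (illustrative, not from the dataset): segment "a" churns far more often than "b"
toy = pd.DataFrame({
    "segment": ["a"] * 50 + ["b"] * 50,
    "churn": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})
result = chi_square_association(toy["segment"], toy["churn"])
print(result)
```

A small p-value would lead us to reject H0 for that feature. On the real data the call would look like `chi_square_association(df["country"], df["churn"])`.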

# <div style="padding:20px;color:white;margin:0;font-size:35px;text-align:center;display:fill;border-radius:10px;background-color:#c54E58;overflow:hidden; font-family: 'Lucida Console'"><b> DATA OVERVIEW </b></div>
<a id="data"></a>

In [1]:
# import dependencies
import pandas as pd
import numpy as np
from collections import Counter
import os

# For visualization
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns


# scikit-learn
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder, LabelBinarizer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import chi2_contingency
from sklearn.utils import resample
from tensorflow.keras.models import load_model
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
import optuna
import joblib
from sklearn.decomposition import PCA

# Classification metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve, auc 

ImportError: DLL load failed while importing _errors: The specified procedure could not be found.

In [None]:
# Load the dataset into the DataFrame as df
df = pd.read_csv("Bank Customer Churn Prediction.csv")
df.head()

In [None]:
# Check the shape of the dataset
print(f"Dataframe dimensions: {df.shape}")

The dataset contains 10,000 rows and 12 columns.

In [None]:
# Additional information about the columns
df.info()

There are no "nulls" in our dataframe.

#### Summary statistics for the numeric features

In [None]:
# Further statistical description of the dataset, giving the count, mean, std, min, quartiles & max values for each numeric column
df.describe()

From the summary statistics we can conclude that all features look OK. We do not see any extreme values for any feature.
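A quick numeric check can complement this reading of the summary table. The sketch below counts values outside Tukey's 1.5 × IQR fences per numeric column. `iqr_outlier_counts` is a hypothetical helper and the toy frame is made up, so its result says nothing about the bank data.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum()

# Toy data (illustrative only): one clearly extreme age
toy = pd.DataFrame({"age": [30, 32, 35, 38, 40, 41, 95], "tenure": [1, 2, 3, 4, 5, 6, 7]})
counts = iqr_outlier_counts(toy)
print(counts)
```

Running `iqr_outlier_counts(df)` on the real dataframe would show whether any column deserves a closer look before modelling.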

In [None]:
# Check data types and unique values to confirm there are no unusual data entries or types
df.dtypes  ,   df.nunique()

In [None]:
df.duplicated().sum()

To make the dataframe easier to read, we drop the feature that is not needed for machine learning (`customer_id`).

In [None]:
# Drop unused features
df.drop(['customer_id'], axis=1, inplace=True)
print(f"Dataframe dimensions: {df.shape}")
df.head()

### Distributions of Numeric Features

In [None]:
# Plot histogram grid
df.hist(figsize=(14,14))

plt.show()

## Histogram Findings:
Credit Score: Mostly normal around 600–700.

Age: Skewed right; most are 30–40. Older customers may churn differently.

Tenure: Evenly spread, but a spike at 10 years.

Balance: Bimodal—many have zero balance, others cluster around 50K–150K. Zero balance might indicate inactivity.

Products Number: Most have 1 or 2; few have 3+. Fewer products might mean higher churn.

Active Member: Majority are active; inactivity might predict churn.

Estimated Salary: Evenly spread, likely not a key churn factor.

Churn: Imbalanced—most didn’t churn, but a significant minority did.

### Distributions of Categorical Features

In [None]:
# Summarize categorical features
df.describe(include=['object'])

This shows the number of unique classes for each categorical feature, along with the most frequent class. For example, there are more males (5457) than females, and France is the most common of the 3 countries in our dataframe. There are no sparse classes.

In [None]:
# Bar plot for "gender"
plt.figure(figsize=(4,4))
df['gender'].value_counts().plot.bar(color=['b', 'g'])
plt.ylabel('Count')
plt.xlabel('gender')
plt.xticks(rotation=0)
plt.show()

print("In our data sample there are more males than females.")

Counter(df.gender) 

This bar chart shows the gender distribution of customers:

Males outnumber females, but not by a huge margin.

In [None]:
# Bar plot for "country"
plt.figure(figsize=(6,4))
df['country'].value_counts().plot.bar(color=['b', 'g', 'r'])
plt.ylabel('Count')
plt.xlabel('country')
plt.xticks(rotation=0)
plt.show()

Counter(df.country)

In [None]:
# Given distribution
country_counts = Counter({'France': 5014, 'Spain': 2477, 'Germany': 2509})
total_entries = sum(country_counts.values())

# Calculate percentages rounded to 2 decimal places
country_percentages = {country: round((count / total_entries) * 100, 2) for country, count in country_counts.items()}

country_percentages

About 50% of customers are from France, while Germany and Spain account for around 25% each.
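The same shares can be computed directly from a column instead of typing the counts by hand. This is a minimal sketch: `category_percentages` is a hypothetical helper, and the toy series only stands in for the `country` column.

```python
import pandas as pd

def category_percentages(column: pd.Series, decimals: int = 2) -> pd.Series:
    """Return the percentage share of each category, rounded to `decimals` places."""
    return (column.value_counts(normalize=True) * 100).round(decimals)

# Toy stand-in for df["country"] (illustrative only)
toy_country = pd.Series(["France"] * 2 + ["Spain"] + ["Germany"])
shares = category_percentages(toy_country)
print(shares)
```

On the real data, `category_percentages(df["country"])` would stay correct even if the underlying file changes.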

In [None]:
# Churn distribution
plt.figure(figsize=(6,4))
sns.countplot(x="churn", data=df, palette="coolwarm")
plt.title("Churn Distribution")
plt.xlabel("Churn (0 = No, 1 = Yes)")
plt.ylabel("Count")
plt.show()

In [None]:
# Distribution of Numerical Features
numerical_cols = ['credit_score', 'age', 'tenure', 'balance', 'products_number', 'estimated_salary']

plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(2, 3, i)
    sns.histplot(df[col], bins=30, kde=True, color="blue")
    plt.title(f"Distribution of {col}")

plt.tight_layout()
plt.show()

In [None]:
# Boxplot of age vs churn
plt.figure(figsize=(6, 4))
sns.boxplot(x="churn", y="age", data=df, palette="coolwarm")
plt.title("Age Distribution by Churn Status")
plt.show()

In [None]:
# Churn rate by country
plt.figure(figsize=(8,5))
sns.countplot(x="country", hue="churn", data=df, palette="coolwarm")
plt.title("Churn Rate by Country")
plt.show()

In [None]:
# Churn rate by active membership
plt.figure(figsize=(6,4))
sns.countplot(x="active_member", hue="churn", data=df, palette="Set1")
plt.title("Churn Rate by Active Membership")
plt.show()

In [None]:
# Churn rate by number of products
plt.figure(figsize=(6,4))
sns.countplot(x="products_number", hue="churn", data=df, palette="Dark2")
plt.title("Churn Rate by Number of Products")
plt.show()

# <div style="padding:20px;color:white;margin:0;font-size:35px;text-align:center;display:fill;border-radius:10px;background-color:#c54E58;overflow:hidden; font-family: 'Lucida Console'"><b> UNIVARIATE ANALYSIS</b></div>
<a id="analysis"></a>

In [None]:
# 1. Skewness & Kurtosis
numerical_cols = ['credit_score', 'age', 'tenure', 'balance', 'products_number', 'estimated_salary']
for col in numerical_cols:
    skewness = df[col].skew()
    kurtosis = df[col].kurt()
    print(f"{col}: Skewness = {skewness:.2f}, Kurtosis = {kurtosis:.2f}")

In [None]:
# 2. Boxplots (Outliers Detection)
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[numerical_cols], palette="Set2")
plt.xticks(rotation=45)
plt.title("Boxplot of Numerical Features")
plt.show()


In [None]:
# 3. Countplot for Categorical Features
categorical_cols = ["country", "gender", "credit_card", "active_member", "churn"]
plt.figure(figsize=(12, 6))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(2, 3, i)
    sns.countplot(x=df[col], palette="Pastel2")
    plt.title(f"Distribution of {col}")

plt.tight_layout()
plt.show()

# <div style="padding:20px;color:white;margin:0;font-size:35px;text-align:center;display:fill;border-radius:10px;background-color:#c54E58;overflow:hidden; font-family: 'Lucida Console'"><b>MULTIVARIATE ANALYSIS </b></div>
<a id="multivariate"></a>

## Color palette was set to "deep" for clearer visualization

In [None]:
# 1. Pairplot (Numerical Variables & Churn)
sns.pairplot(df, hue="churn", vars=numerical_cols, palette="deep")
plt.show();

This pair plot visualizes relationships between features, with churn (0 = stayed, 1 = left) color-coded. Key insights:

Age & Churn: Higher churn among older customers.

Balance & Churn: Churners seem to cluster in the mid-to-high balance range.

Products Number & Churn: More churn in customers with 2+ products.

Credit Score & Churn: No clear trend, churn is spread across scores.

Estimated Salary & Churn: No strong correlation with churn.

In [None]:
# 2. Churn Rate by Age & Balance
plt.figure(figsize=(10, 5))
sns.scatterplot(x=df["age"], y=df["balance"], hue=df["churn"], palette="deep", alpha=0.6)
plt.title("Age vs Balance by Churn Status")
plt.show()


This scatter plot shows Age vs. Balance, with churn status highlighted:

## Key Insights:

Higher churn (orange) in customers aged 40–60 with mid-to-high balances.

Younger customers (<30) rarely churn, even with low balances.

Churn is less common among elderly customers (70+).

Many customers with zero balance don’t churn, possibly inactive but not exiting.

In [None]:
# 3. Categorical Feature Relationships (Cramér’s V)
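def cramers_v(x, y):
    # Contingency table of the two categorical variables
    # (named `contingency` so it is not confused with sklearn's confusion_matrix)
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.sum().sum()
    return np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))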

print("\nCramér’s V for Country & Churn:", cramers_v(df["country"], df["churn"]))
print("Cramér’s V for Gender & Churn:", cramers_v(df["gender"], df["churn"]))
print("Cramér’s V for Credit Card & Churn:", cramers_v(df["credit_card"], df["churn"]))

Cramér’s V values indicate the strength of association between categorical variables and churn:

**Country & Churn (0.1735)** → Weak but strongest influence among these three. Country might slightly impact churn.

**Gender & Churn (0.1063)** → Very weak correlation; gender is not a strong churn predictor.

**Credit Card & Churn (0.0069)** → Almost no correlation; credit card ownership doesn’t impact churn.
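For reference, the statistic computed by `cramers_v` can be written out as follows, where $\chi^2$ is the chi-square statistic of the contingency table, $n$ is the total number of observations, and $r$ and $c$ are the number of rows and columns of the table:

```latex
V = \sqrt{\frac{\chi^2}{n \,\bigl(\min(r, c) - 1\bigr)}}
```

Values near 0 indicate almost no association and values near 1 a very strong one. Note that for 2 × 2 tables (gender and credit card against churn) `chi2_contingency` applies Yates' continuity correction by default, which makes $\chi^2$, and therefore $V$, slightly smaller than the uncorrected value.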

In [None]:
# Segment "churn" by gender and display the frequency and percentage within each class
grouped = df.groupby('gender')['churn'].agg(Count='value_counts')
grouped

In [None]:
# Calculate percentage within each class
dfgp = grouped.groupby(level=[0]).apply(lambda g: round(g * 100 / g.sum(), 2))
dfgp.rename(columns={'Count': 'Percentage'}, inplace=True)
dfgp

In [None]:
# Reorganize dataframe for plotting percentage
dfgp = dfgp.pivot_table(values='Percentage', index='gender', columns=['churn'])
dfgp

In [None]:
# Reorganize dataframe for plotting count
dfgc = grouped
dfgc = dfgc.pivot_table(values='Count', index='gender', columns=['churn'])
dfgc

In [None]:
# Churn distribution by gender, count + percentage

labels= ['Stays', 'Exits']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

dfgc.plot(kind='bar',
          color=['g', 'r'],
          rot=0, 
          ax=ax1)
ax1.legend(labels)
ax1.set_title('Churn Risk per Gender (Count)', fontsize=14, pad=10)
ax1.set_ylabel('Count',size=12)
ax1.set_xlabel('gender', size=12)


dfgp.plot(kind='bar',
          color=['g', 'r'],
          rot=0, 
          ax=ax2)
ax2.legend(labels)
ax2.set_title('Churn Risk per Gender (Percentage)', fontsize=14, pad=10)
ax2.set_ylabel('Percentage',size=12)
ax2.set_xlabel('gender', size=12)

plt.show()

These charts compare churn rates between males and females:

More males than females overall, but female customers churn at a higher rate (~25% vs. ~15%).

Males are more likely to stay (higher retention percentage).
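Because `churn` is coded 0/1, the churn rate per group is simply the group mean, which gives a more compact route to the same percentages. This is a sketch under that assumption: `churn_rate_by` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def churn_rate_by(frame: pd.DataFrame, group_col: str, target_col: str = "churn") -> pd.Series:
    """Percentage of rows with target == 1 within each group."""
    return (frame.groupby(group_col)[target_col].mean() * 100).round(2)

# Toy data (illustrative only, not the bank dataset)
toy = pd.DataFrame({
    "gender": ["Female"] * 4 + ["Male"] * 5,
    "churn": [1, 0, 0, 0, 1, 0, 0, 0, 0],
})
rates = churn_rate_by(toy, "gender")
print(rates)
```

The same helper works for any categorical column, for example `churn_rate_by(df, "country")`.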

In [None]:
# Segment "churn" by country and display the frequency and percentage within each class
grouped = df.groupby('country')['churn'].agg(Count='value_counts')
grouped

In [None]:
# Reorganize dataframe for plotting count
dfgeoc = grouped
dfgeoc = dfgeoc.pivot_table(values='Count', index='country', columns=['churn'])
dfgeoc

In [None]:
# Calculate percentage within each class
dfgeop = grouped.groupby(level=[0]).apply(lambda g: round(g * 100 / g.sum(), 2))
dfgeop.rename(columns={'Count': 'Percentage'}, inplace=True)
dfgeop

In [None]:
# Churn distribution by country (count only; the percentages are in dfgeop above)

labels= ['Stays', 'Exits']

fig, (ax1) = plt.subplots(1, figsize=(12, 4))

dfgeoc.plot(kind='bar',
          color=['g', 'r'],
          rot=0, 
          ax=ax1)
ax1.legend(labels)
ax1.set_title('Churn Risk per Country (Count)', fontsize=14, pad=10)
ax1.set_ylabel('Count',size=12)
ax1.set_xlabel('country', size=12)

plt.show()

## Feature Encoding and Train/Test Split

In [None]:
# Encode categorical variables
df = pd.get_dummies(df, columns=["country", "gender"], drop_first=True)


In [None]:
# Define Features and Target
X = df.drop(columns=["churn"])  
y = df["churn"]

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [None]:
# Variance Threshold (Removing low variance features)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)  
X_train_var = selector.fit_transform(X_train)

In [None]:
# Feature Importance using Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
feature_importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Plot Feature Importance
plt.figure(figsize=(10,5))
sns.barplot(x=feature_importances, y=feature_importances.index, palette="coolwarm")
plt.title("Feature Importance using Random Forest")
plt.show()

## Feature Importance (Random Forest)
Age is the most important factor in predicting churn. Older customers might be at higher risk.

Estimated Salary, Credit Score, and Balance are also strong predictors.

Number of Products & Tenure have moderate influence.

Being an Active Member, Country, and Gender have lower impact.

Credit Card ownership has minimal effect, confirming previous findings.
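Impurity-based importances from a random forest tend to favour continuous features with many distinct values, which may partly explain the high rank of estimated salary and credit score. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data from `make_classification` (an assumption, not the bank dataset) purely to show the call pattern.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the bank dataset)
X_arr, y_arr = make_classification(n_samples=400, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_arr, test_size=0.25,
                                          random_state=0, stratify=y_arr)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in AUC
perm = permutation_importance(forest, X_te, y_te, scoring="roc_auc",
                              n_repeats=5, random_state=0)
perm_importances = pd.Series(perm.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(perm_importances)
```

With the notebook's objects, the equivalent call would be `permutation_importance(rf, X_test, y_test, scoring="roc_auc")`. Features that rank high in both views are the safer candidates for "most influential".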

In [None]:
# Bootstrap Resampling to Balance Classes
X_majority = X_train[y_train == 0]
y_majority = y_train[y_train == 0]
X_minority = X_train[y_train == 1]
y_minority = y_train[y_train == 1]

X_minority_resampled, y_minority_resampled = resample(
    X_minority, y_minority, replace=True,
    n_samples=len(X_majority),
    random_state=42
)

# Combine back with majority class; pd.concat keeps the column names and dtypes
# that np.vstack/np.hstack would discard
X_train_resampled = pd.concat([X_majority, X_minority_resampled])
y_train_resampled = pd.concat([y_majority, y_minority_resampled])
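The oversampling step can also be packaged as a small reusable function. This is a sketch, not the notebook's own code: `oversample_minority` is a hypothetical helper, the toy frame is made up, and the function is meant to be applied to the training split only so that no duplicated rows leak into the test set.

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(X: pd.DataFrame, y: pd.Series, random_state: int = 42):
    """Bootstrap the rarer class until both classes have the same number of rows."""
    counts = y.value_counts()
    minority_label, majority_label = counts.idxmin(), counts.idxmax()
    X_min_up, y_min_up = resample(
        X[y == minority_label], y[y == minority_label],
        replace=True, n_samples=int(counts.max()), random_state=random_state,
    )
    X_bal = pd.concat([X[y == majority_label], X_min_up])
    y_bal = pd.concat([y[y == majority_label], y_min_up])
    return X_bal, y_bal

# Toy data (illustrative only): 4 stayers, 2 churners
toy_X = pd.DataFrame({"age": [25, 30, 35, 40, 45, 50],
                      "active": [True, False, True, True, False, True]})
toy_y = pd.Series([0, 0, 0, 0, 1, 1], name="churn")
X_bal, y_bal = oversample_minority(toy_X, toy_y)
print(y_bal.value_counts())
```

Keeping the result as DataFrames means models are fitted with the same column names that `X_test` carries at prediction time.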

In [None]:
# 3. Countplot for Categorical Features
categorical_cols = ["credit_card", "active_member", "churn"]
plt.figure(figsize=(12, 6))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(2, 3, i)
    sns.countplot(x=df[col], palette="Pastel2")
    plt.title(f"Distribution of {col}")

plt.tight_layout()
plt.show()

In [None]:
from sklearn.utils import resample


# Separate the majority and minority class
df_majority = df[df['churn'] == 0]
df_minority = df[df['churn'] == 1]


# 3. **Hybrid Approach**: Balance with both over- and under-sampling
df_majority_reduced = resample(df_majority, 
                               replace=False, 
                               n_samples=int(len(df_majority) * 0.5),  # Reduce majority class
                               random_state=42)

df_minority_increased = resample(df_minority, 
                                 replace=True, 
                                 n_samples=int(len(df_minority) * 1.5),  # Increase minority class
                                 random_state=42)

# Create new balanced datasets
df_balanced_hybrid = pd.concat([df_majority_reduced, df_minority_increased])  # Hybrid approach

# Save the new dataset so the comparison cell below can load it
df_balanced_hybrid.to_csv("balanced_dataset.csv", index=False)

print("Dataset balancing complete!")


In [None]:
# Load the original dataset
df_original =  pd.read_csv("Bank Customer Churn Prediction.csv")

# Load the balanced dataset (change filename based on the method used)
df_balanced = pd.read_csv("balanced_dataset.csv")

# Set up the figure
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original dataset distribution
sns.countplot(x='churn', data=df_original, palette="pastel", ax=axes[0])
axes[0].set_title("Original Dataset Distribution")
axes[0].set_xlabel("Churn")
axes[0].set_ylabel("Count")

# Balanced dataset distribution
sns.countplot(x='churn', data=df_balanced, palette="pastel", ax=axes[1])
axes[1].set_title("Balanced Dataset Distribution")
axes[1].set_xlabel("Churn")
axes[1].set_ylabel("Count")

plt.tight_layout()
plt.show()


In [None]:
# Standardize Features 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

In [None]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# 1. Histogram of Scaled Features
def plot_histograms(data, title="Feature Distributions After Standardization"):
    data.hist(figsize=(12, 8), bins=30, edgecolor='black')
    plt.suptitle(title, fontsize=14)
    plt.show()

plot_histograms(X_train_scaled_df)


In [None]:
# Initialize Models
models = {
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss"),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Neural Network (MLP)": MLPClassifier(hidden_layer_sizes=(64,32), max_iter=500, random_state=42)
}


In [None]:
# Train & Evaluate Models
results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Fit the model
    if name == "Neural Network (MLP)":
        model.fit(X_train_scaled, y_train_resampled)  # Scaled input for MLP
        y_pred = model.predict(X_test_scaled)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train_resampled, y_train_resampled)
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test)[:, 1]
    
    # Compute Metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_proba)  # not named `auc`, which would shadow sklearn.metrics.auc
    
    results[name] = [accuracy, precision, recall, f1, roc_auc]
    
    print(f"{name} - Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}, AUC-ROC: {roc_auc:.4f}")


In [None]:
# Convert results to DataFrame
results_df = pd.DataFrame(results, index=["Accuracy", "Precision", "Recall", "F1-Score", "AUC-ROC"]).T

In [None]:
# Visualize Model Performance
plt.figure(figsize=(10,6))
sns.barplot(x=results_df.index, y=results_df["AUC-ROC"], palette="coolwarm")
plt.ylabel("AUC-ROC Score")
plt.title("Model Performance Comparison (AUC-ROC)")
plt.xticks(rotation=45)
plt.show()

## Model Performance Comparison (AUC-ROC Score)

LightGBM performs the best, closely followed by CatBoost and Random Forest.

XGBoost also performs well, slightly below the top models.

Neural Network (MLP) has the lowest AUC-ROC score, suggesting it may not be the best fit for this problem.
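When the top models score this close together, a single train/test split may not be enough to rank them reliably. Stratified k-fold cross-validation reports a mean and spread for each model. The sketch below uses synthetic data and two scikit-learn classifiers as stand-ins (assumptions made to keep it dependency-light), not the notebook's boosted models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (assumption: roughly 80/20 classes)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Five stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_auc[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two models is smaller than their fold-to-fold spread, the ranking should be read as a tie rather than a win.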

In [None]:
# Print Best Model
best_model = results_df["AUC-ROC"].idxmax()
print(f"\nBest Model: {best_model} with AUC-ROC Score: {results_df.loc[best_model, 'AUC-ROC']:.4f}")

## Hyperparameter Tuning

In [None]:
# Optuna for Hyperparameter Tuning
import lightgbm as lgb

def objective(trial):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'num_leaves': trial.suggest_int('num_leaves', 20, 50),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.7, 1.0)
    }
    
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train)
    # Note: each trial is scored on the test set, so the tuned test AUC is optimistic
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


In [None]:
# Run Optuna Optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)

# Best Parameters from Optuna
best_params_optuna = study.best_params
print("\nBest Parameters (Optuna):", best_params_optuna)
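The objective above scores every trial on `X_test`, so the test AUC reported after tuning is likely to be optimistic. One way around this is to tune against a separate validation split and use the test set only once at the end. The sketch below shows that pattern on synthetic data, with a small manual search over a random forest's `max_depth` swapped in for Optuna and LightGBM to keep the example dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the bank dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

# 60% train, 20% validation, 20% test
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                            random_state=0, stratify=y_demo)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                            random_state=0, stratify=y_tmp)

def validation_auc(max_depth: int) -> float:
    """Fit on the training split and score on the validation split only."""
    model = RandomForestClassifier(n_estimators=50, max_depth=max_depth, random_state=0)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Pick the depth with the best validation AUC, then touch the test set exactly once
val_scores = {depth: validation_auc(depth) for depth in (2, 4, 8)}
best_depth = max(val_scores, key=val_scores.get)
final_model = RandomForestClassifier(n_estimators=50, max_depth=best_depth,
                                     random_state=0).fit(X_tmp, y_tmp)
test_auc = roc_auc_score(y_te, final_model.predict_proba(X_te)[:, 1])
print(best_depth, round(test_auc, 3))
```

With Optuna, the body of `validation_auc` would move into `objective(trial)`, returning the validation AUC instead of the test AUC.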


## Training the Best Balanced Model

In [None]:
# Train Best Optuna Model
best_lgb_optuna = lgb.LGBMClassifier(**best_params_optuna)
best_lgb_optuna.fit(X_train, y_train)

# Predictions
y_pred_optuna = best_lgb_optuna.predict(X_test)
y_proba_optuna = best_lgb_optuna.predict_proba(X_test)[:, 1]

# Evaluate Optuna Model
accuracy_optuna = accuracy_score(y_test, y_pred_optuna)
precision_optuna = precision_score(y_test, y_pred_optuna)
recall_optuna = recall_score(y_test, y_pred_optuna)
f1_optuna = f1_score(y_test, y_pred_optuna)
auc_optuna = roc_auc_score(y_test, y_proba_optuna)


In [None]:
print(f"\nOptuna Tuned LightGBM - Accuracy: {accuracy_optuna:.4f}, Precision: {precision_optuna:.4f}, Recall: {recall_optuna:.4f}, F1-Score: {f1_optuna:.4f}, AUC-ROC: {auc_optuna:.4f}")

In [None]:
# Plot Confusion Matrix
def plot_confusion_matrix(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.title("Confusion Matrix")
    plt.show()

# Plot AUC-ROC Curve
def plot_roc_curve(y_test, y_proba):
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    
    plt.figure(figsize=(6, 5))
    plt.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {roc_auc:.2f}')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Diagonal line
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()

# Call the functions using your Optuna model's predictions
plot_confusion_matrix(y_test, y_pred_optuna)
plot_roc_curve(y_test, y_proba_optuna)


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Load dataset (Assuming df is preloaded and cleaned)
# Define Features and Target
X = df.drop(columns=["churn"])
y = df["churn"]

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize Models
models = {
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss"),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Neural Network (MLP)": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
}

def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # ROC AUC needs scores, not hard labels
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
        "ROC AUC": roc_auc_score(y_test, y_proba)
    }

# Step 1: Train on Imbalanced Dataset
imbalanced_results = {name: evaluate_model(model, X_train, y_train, X_test, y_test) for name, model in models.items()}

# Step 2: Balance Dataset using Bootstrap Resampling (training split only)
X_minority = X_train[y_train == 1]
y_minority = y_train[y_train == 1]
X_minority_resampled, y_minority_resampled = resample(X_minority, y_minority, replace=True, 
                                                       n_samples=len(X_train[y_train == 0]), random_state=42)
# pd.concat keeps the column names and dtypes that np.vstack/np.hstack would discard
X_train_resampled = pd.concat([X_train[y_train == 0], X_minority_resampled])
y_train_resampled = pd.concat([y_train[y_train == 0], y_minority_resampled])

# Step 3: Train on Balanced Dataset
balanced_results = {name: evaluate_model(model, X_train_resampled, y_train_resampled, X_test, y_test) for name, model in models.items()}

# Step 4: Optimize Best Model (Assuming XGBoost performed best)
best_model = XGBClassifier(use_label_encoder=False, eval_metric="logloss", n_estimators=200, learning_rate=0.05)
best_model.fit(X_train_resampled, y_train_resampled)
y_pred_best = best_model.predict(X_test)
y_proba_best = best_model.predict_proba(X_test)[:, 1]
optimized_results = {
    "Accuracy": accuracy_score(y_test, y_pred_best),
    "Precision": precision_score(y_test, y_pred_best),
    "Recall": recall_score(y_test, y_pred_best),
    "F1 Score": f1_score(y_test, y_pred_best),
    "ROC AUC": roc_auc_score(y_test, y_proba_best)
}

# Display Results
print("Imbalanced Dataset Results:", imbalanced_results)
print("Balanced Dataset Results:", balanced_results)
print("Optimized Model Results:", optimized_results)


## Save the Model

In [None]:
# # Save the trained model to a file
# joblib.dump(best_lgb_optuna, 'best_lgbm_model.pkl')
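Since the web platform would load the saved model to serve predictions, a save/load round trip is worth checking before deployment. The sketch below uses a small random forest on synthetic data as a stand-in for the tuned model, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model on synthetic data (assumption: not the tuned LightGBM model)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory, reload, and confirm predictions are unchanged
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model_demo, model_path)
    reloaded = joblib.load(model_path)

same_predictions = np.array_equal(model_demo.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

A saved model is generally best loaded with the same library versions it was trained with, so pinning those versions in the web application's environment is a reasonable precaution.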