## Import Libraries

In [0]:
%pip install --upgrade pip
%pip install -U tensorflow
%pip install scikeras
%pip install imbalanced-learn

In [0]:
dbutils.library.restartPython()

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import SparkSession

## Import Dataset

In [0]:
# Read train.csv
train_ps_df = spark.read.csv("dbfs:/FileStore/tables/Series/train_with_ext_indicator.csv", header=True, inferSchema=True)

# Read test.csv
test_ps_df = spark.read.csv("dbfs:/FileStore/tables/Series/test_with_ext_indicator.csv", header=True, inferSchema=True)

# Read sample_submission.csv
submission_ps_df = spark.read.csv("dbfs:/FileStore/tables/Series/sample_submission.csv", header=True, inferSchema=True)


In [0]:
train_ps_df

In [0]:
# Convert Spark DataFrames to pandas DataFrames
df_train = train_ps_df.toPandas()
df_test = test_ps_df.toPandas()
df_submission = submission_ps_df.toPandas()

## Univariate EDA of Train Dataset

Univariate Exploratory Data Analysis (EDA) focuses on examining each variable in isolation to summarize and find patterns in the train dataset.

**To Do:**
> *Understanding Variable Types:* Identifying whether variables are numerical (continuous or discrete) or categorical (ordinal or nominal).

> *Summary Statistics:* For numerical variables, we'll look at measures like mean, median, mode, range, variance, and standard deviation. For categorical variables, we'll identify the number of categories and count the frequency of each category.

> *Visualization:* We'll create visualizations such as histograms, boxplots, or bar charts to understand the distribution of each variable.

> *Identifying Anomalies:* Detecting any outliers or unusual data points.

> *Missing Values:* Assessing whether there are any missing values in the dataset.


### Dataset Head

In [0]:
df_head = df_train.head()
df_col = df_train.columns
df_head, df_col

In [0]:
print("Shape of train_data:", df_train.shape)
print("Shape of test_data:", df_test.shape)

### Data Dictionary
>  *id:* A numerical identifier for each record.

>  *CustomerId:* A unique number assigned to each customer.

>  *Surname:* The surname of the customer.

>  *CreditScore:* A numerical value representing the customer's credit score.

>  *Geography:* The country of the customer.

>  *Gender:* The gender of the customer.

>  *Age:* The age of the customer.

>  *Tenure:* The number of years the customer has been with the bank.

>  *Balance:* The account balance of the customer.

>  *NumOfProducts:* The number of products the customer has with the bank.

>  *HasCrCard:* Indicates whether the customer has a credit card (1) or not (0).

>  *IsActiveMember:* Indicates whether the customer is an active member (1) or not (0).

>  *EstimatedSalary:* The estimated salary of the customer.

>  *Exited:* Indicates whether the customer has exited (1) or not (0).

### Variable Types and Missing Values

In [0]:
# Identifying variable types and checking for missing values
variable_types = df_train.dtypes
missing_values = df_train.isnull().sum()

# Summarize the findings
variable_types, missing_values

The output above shows that:
> **Numerical variables:** id, CustomerId, CreditScore, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Exited. HasCrCard, IsActiveMember, and Exited are stored as numbers but are binary flags.

> **Categorical variables:** Surname, Geography, Gender.

> There are no missing values in the dataset.

### Summary Statistics

In [0]:
# Summary statistics for numerical variables
numerical_summary = df_train.describe()
# Summarize the findings
numerical_summary

Summary Statistics for Numerical Variables

> **CreditScore:** Ranges from 350 to 850.

> **Age:** Ranges from 18 to 92 years.

> **Tenure:** Ranges from 0 to 10 years.

> **Balance:** Ranges from 0 to 250,898.09.

> **NumOfProducts:** Ranges from 1 to 4 products.

> **HasCrCard and IsActiveMember:** Binary variables (0 or 1).

> **EstimatedSalary:** Ranges from 11.58 to 199,992.48.

> **Exited:** Indicates customer churn (0 or 1).

### Visualization for Distribution of the Target Variable


In [0]:
# Create a subplot with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Count plot for 'Exited' column
sns.countplot(x='Exited', data=df_train, hue='Exited', palette='Blues', ax=axes[1])
axes[1].set_title('Count Plot of Exited')

# Pie chart for 'Exited' column
status_counts = df_train['Exited'].value_counts()
axes[0].pie(status_counts, labels=status_counts.index, autopct='%1.1f%%', startangle=90, colors=sns.color_palette('Blues'))
axes[0].set_title('Distribution of Exited')



plt.tight_layout()
plt.show()

> The above pie chart displays the distribution of the "Exited" class. The light blue segment represents 78.8% of instances labeled as '0' (did not exit), while the dark blue segment represents 21.2% labeled as '1' (exited). Most data points fall into the 'did not exit' category, with the 'exited' category being smaller.

> The count plot shows the number of instances in each class and reinforces the observation that most data points belong to the 'did not exit' class. This class imbalance may impact model performance and is addressed later with SMOTE.

> SMOTE can improve classifier sensitivity for the minority class by balancing the classes. It enhances generalization by helping the model learn more general features of each class instead of overfitting to the majority class. SMOTE creates synthetic examples by selecting similar examples in the feature space, drawing a line between them, and generating new examples along that line. This expands the feature space for the minority class, achieving better class distribution balance.
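The interpolation step described above can be illustrated with a minimal sketch. This is not the `imbalanced-learn` implementation used later in the notebook; `smote_sketch` is a hypothetical helper, and the toy points are made up purely to show how a synthetic sample is placed on the line between a minority example and one of its nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Hypothetical helper: create n_new synthetic points by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between base and neighbour
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features (made-up values)
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.5), (3.0, 3.0)]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority points, it always falls inside the region already spanned by the minority class.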

In [0]:
# Filter out the numerical columns
numerical_columns = ['CreditScore','Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard','IsActiveMember', 'EstimatedSalary', 'Exited', 'CUUR0000SA0R_change','DFF_change', 'HOUST_change', 'MPRIME_change', 'UNRATE_change']


In [0]:
for_ratio_numerical_columns1 =  ['CreditScore','Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard','IsActiveMember', 'EstimatedSalary']
for_ratio_numerical_columns2 =  ['CUUR0000SA0R_change','DFF_change', 'HOUST_change', 'MPRIME_change', 'UNRATE_change']

In [0]:
# Density histograms split by exit status
plt.figure(figsize=(12, 25))

# Overlay the distributions of exited and non-exited customers for each column
for i, column in enumerate(for_ratio_numerical_columns1, 1):
    plt.subplot(len(for_ratio_numerical_columns1), 1, i)
    # Plotting the histogram for both Exited = 1 and Exited = 0
    sns.histplot(df_train[df_train['Exited'] == 1], x=column, color='red', label='Exited', kde=True, stat='density')
    sns.histplot(df_train[df_train['Exited'] == 0], x=column, color='blue', label='Not Exited', kde=True, stat='density')
    plt.legend()
    plt.title(f'Distribution of {column} with Exit Status')

plt.tight_layout()
plt.show()


In [0]:
# Density histograms split by exit status (external indicator columns)
plt.figure(figsize=(12, 25))

# Overlay the distributions of exited and non-exited customers for each column
for i, column in enumerate(for_ratio_numerical_columns2, 1):
    plt.subplot(len(for_ratio_numerical_columns2), 1, i)
    # Plotting the histogram for both Exited = 1 and Exited = 0
    sns.histplot(df_train[df_train['Exited'] == 1], x=column, color='red', label='Exited', kde=True, stat='density')
    sns.histplot(df_train[df_train['Exited'] == 0], x=column, color='blue', label='Not Exited', kde=True, stat='density')
    plt.legend()
    plt.title(f'Distribution of {column} with Exit Status')

plt.tight_layout()
plt.show()

> Customers who stay tend to have higher credit scores and to be younger, while older customers are more likely to exit. Age therefore looks like an important factor in churn, with credit score playing a smaller role.

> Tenure shows no clear pattern between the two groups. Customers with higher balances are slightly more likely to exit, which suggests that financial stability alone does not guarantee customer loyalty.

> Customers with two products are the most likely to stay, whereas exits spike among customers with three or more products. Estimated salary does not appear to be a significant factor in churn. Understanding these patterns can help the bank tailor strategies to reduce churn and retain valuable customers.

### Visualization for Distribution of the Numerical Variables

In [0]:
import math
import matplotlib.pyplot as plt



# Histogram of numerical variables
plt.figure(figsize=(18, 12)) # Increase figure size to (18, 12)

# Calculate the number of rows and columns for subplots
num_variables = len(numerical_columns)
num_rows = math.ceil(num_variables / 3)
num_cols = min(num_variables, 3)

for i, var in enumerate(numerical_columns, 1):
    plt.subplot(num_rows, num_cols, i)
    plt.hist(df_train[var], bins=20, color='skyblue', edgecolor='black')
    plt.title(f'Distribution of {var}')
    plt.xlabel(var)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

The histograms above provide insights into the distribution of each numerical variable in the dataset:

> **CreditScore:** Appears normally distributed with a slight left skew.

> **Age:** Shows a right-skewed distribution, indicating a larger proportion of younger customers.

> **Tenure:** Fairly uniform distribution, with slight decreases at the lowest and highest tenure.

> **Balance:** Significant peak at zero balance, suggesting many customers have no balance, followed by a fairly normal distribution for positive balances.

> **NumOfProducts:** Majority of customers have 1 or 2 products, with few having 3 or 4.

> **EstimatedSalary:** Uniformly distributed across different salary ranges.


### Visualization for Distribution of the Categorical Variables

In [0]:
# pie chart for categorical variables
categorical_vars = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
plt.figure(figsize=(15, 10))
for i, var in enumerate(categorical_vars, 1):
    plt.subplot(2, 2, i)
    counts = df_train[var].value_counts()
    plt.pie(counts, labels=counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('Blues'))
    plt.title(f'{var} Pie Chart')
plt.tight_layout()
plt.show()


The pie charts above provide insights into the distribution of the categorical variables:

>  France has the largest customer base of the three countries (France, Germany, and Spain), accounting for 57.1% of the total.

>  The customer base is 56.4% male and 43.6% female, a moderate male majority.

>  75.4% of customers have a credit card, while 24.6% do not. The charts do not break card ownership down by gender.

>  50.2% of customers are active members, while 49.8% are not.

In [0]:
# Bar plot for mean Balance across Geographical Locations
plt.figure(figsize=(8, 6))
sns.barplot(x='Geography', y='Balance', data=df_train, hue='Geography', palette='Blues', estimator=lambda x: sum(x) / len(x))
plt.title('Mean Balance across Geographical Locations')
plt.xlabel('Geography')
plt.ylabel('Mean Balance')
plt.show()

> Germany leads in average balance, indicating German customers maintain higher account balances compared to customers in other countries.

## Bivariate EDA of Train Dataset

Bivariate Exploratory Data Analysis (EDA) involves examining the relationships between two variables in the dataset.

**To Do:**
> *Numerical vs. Numerical:* Scatter plots or correlation coefficients.

> *Categorical vs. Numerical:* Box plots, violin plots.

### Numerical vs. Numerical

In [0]:
# Computing the correlation matrix
correlation_matrix = df_train[numerical_columns].corr()
correlation_matrix

> **CreditScore:**
No strong correlations with other variables. Negligible negative correlation with Exited (-0.02), suggesting that lower credit scores may be marginally associated with higher exit rates.

> **Age:**
Moderate positive correlation with Exited (0.30), indicating older customers are more likely to leave the bank. Weak negative correlations with NumOfProducts and IsActiveMember.

> **Tenure:**
No significant correlations with other variables.

> **Balance:**
Moderate negative correlation with NumOfProducts (-0.31), suggesting customers with more products tend to have lower balances. Weak positive correlation with Exited (0.12), implying higher balances may be slightly associated with a higher likelihood of leaving the bank.

> **NumOfProducts:**
Moderate negative correlation with Balance (-0.31). Moderate positive correlation with IsActiveMember (0.32), suggesting active members tend to use more bank products. Moderate negative correlation with Exited (-0.37), the largest in magnitude for the target, indicating customers with more products are less likely to leave.

> **HasCrCard:**
Very weak correlations with all other variables, indicating having a credit card is not strongly associated with other factors in the dataset.

> **IsActiveMember:**
Moderate positive correlation with NumOfProducts (0.32). Weak negative correlation with Exited (-0.16), suggesting active members are less likely to leave the bank.

> **EstimatedSalary:**
Very weak correlations with all other variables, showing no clear pattern.
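For reference, `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below shows that calculation by hand for one numerical column against a binary target. `pearson_r` is a hypothetical helper, and the `ages` and `exited` values are made up for illustration, not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy data: older customers exit in this example
ages = [25, 32, 41, 48, 55, 63]
exited = [0, 0, 0, 1, 1, 1]

r = pearson_r(ages, exited)
print(round(r, 2))
```

A coefficient near 0 means no linear association, while values near -1 or 1 mean a strong one. With a 0/1 target, the sign simply shows whether larger values of the feature go with exiting or with staying.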

In [0]:
# Correlation heatmap for numerical variables
plt.figure(figsize=(10, 8))
# Include the target so the heatmap shows each variable's correlation with Exited
sns.heatmap(df_train[for_ratio_numerical_columns1 + ['Exited']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap for Numerical Variables')
plt.show()

The heatmap above is consistent with the correlation matrix and shows that:

> Age, number of products, and active membership status are the most indicative factors in relation to customer exit.

> Balance shows some association with customer exit, though it's not as strong as age or number of products.

> Credit score, tenure, having a credit card, and estimated salary have very weak to negligible correlations with customer exit.

In [0]:
# df_train stays a pandas DataFrame because the later cells rely on the pandas API

# Correlation heatmap for the external indicator columns together with the target
plt.figure(figsize=(10, 8))
sns.heatmap(df_train[for_ratio_numerical_columns2 + ['Exited']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap for External Indicators and Exited')
plt.show()

### Continuous Columns Analysis

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

# Filter out continuous columns
continuous_vars = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'CUUR0000SA0R_change','DFF_change', 'HOUST_change', 'MPRIME_change', 'UNRATE_change']

# Define the number of rows and columns for subplots
num_rows = len(continuous_vars)
num_cols = 2  # Two plots for each column (box plot and KDE plot)

# Create subplots; squeeze=False keeps axes two-dimensional even with a single row
fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 5 * num_rows), squeeze=False)

# Color for plots
color = 'skyblue'

# Iterate over each continuous variable and create box plots and KDE plots
for i, column in enumerate(continuous_vars):
    # Box plot
    sns.boxplot(x=df_train[column], ax=axes[i, 0], color=color)
    axes[i, 0].set_title(f'Boxplot of {column}', fontsize=14)
    axes[i, 0].set_xlabel(column, fontsize=12)

    # KDE plot
    sns.kdeplot(data=df_train[column], ax=axes[i, 1], color=color, fill=True)
    axes[i, 1].set_title(f'KDE Plot of {column}', fontsize=14)
    axes[i, 1].set_xlabel(column, fontsize=12)
    axes[i, 1].legend([column], loc='upper right')

# Adjust layout
plt.tight_layout()
plt.show()



> The plots above cover credit score, age, account balance, estimated salary, and the five `*_change` indicator columns. Most credit scores fall between 600 and 700, and most customers are in their late 20s to early 40s. Many customers have a zero balance, with a second peak around $100,000. Estimated salary is close to uniform, with no concentration at any level. These distributions are worth keeping in mind when relating each variable to the decision to exit.

> The Kernel Density Estimation (KDE) plots for credit score and age are roughly bell-shaped, whereas balance has two peaks and estimated salary is flat, so not every variable resembles a normal curve.

# Data Preprocessing

Data preprocessing prepares the data for modelling.

**To Do:**
> *Dropping irrelevant features*

> *Checking for outliers and handling them*

> *Processing each column according to its data type*

> *Data transformation, encoding, and scaling*

> *Splitting the data for training and testing*

### Dropping Irrelevant Features

In [0]:
# Dropping irrelevant features
df_train = df_train.drop(['id', 'CustomerId', 'Surname'], axis=1)
df_test = df_test.drop(['id', 'CustomerId', 'Surname'], axis=1)

# Display the first few rows of the processed dataset
df_train.head(), df_test.head()

> To simplify the model and potentially improve its performance, irrelevant features have been removed: `id` is just a row identifier, `CustomerId` is unique to each customer and holds no predictive power, and the customer's surname is unlikely to influence the decision to churn.

### Checking the Percentage of Outliers in Continuous Variables

In [0]:
import numpy as np

# Define continuous variables
continuous_vars = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary','CUUR0000SA0R_change','DFF_change', 'HOUST_change', 'MPRIME_change', 'UNRATE_change']

# Improved function to calculate percentage of outliers using IQR method
def percentage_outliers_improved(df, columns):
    outliers_percentage = {}
    for column in columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        total_count = df[column].count()
        outliers_count = df[column][(df[column] < lower_bound) | (df[column] > upper_bound)].count()
        outliers_percentage[column] = (outliers_count / total_count) * 100

    return outliers_percentage

# Calculate the percentage of outliers for each continuous variable
outliers_percentage_improved = percentage_outliers_improved(df_train, continuous_vars)
outliers_percentage_improved

The highest percentage of outliers is found in 'Age', but even this is relatively low (under 4%). The low percentages of outliers in these variables suggest that the data is fairly consistent and doesn't contain many extreme values that could skew the analysis.

Considering these percentages, it seems reasonable to proceed without removing these outliers, as they represent a small portion of the dataset and could contain valuable information for predicting churn.
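To make the IQR rule used above concrete, here is a small worked example on made-up ages. `quantile_linear` is a hypothetical helper that mirrors the linear-interpolation quantile pandas uses by default, and the values are illustrative only, not drawn from the dataset.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two closest ranks."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


# Made-up ages with one extreme value
ages = [22, 25, 29, 31, 34, 36, 38, 41, 45, 92]

q1 = quantile_linear(ages, 0.25)   # 29.5
q3 = quantile_linear(ages, 0.75)   # 40.25
iqr = q3 - q1                      # 10.75
lower_bound = q1 - 1.5 * iqr       # 13.375
upper_bound = q3 + 1.5 * iqr       # 56.375

# Anything outside [lower_bound, upper_bound] counts as an outlier
outliers = [a for a in ages if a < lower_bound or a > upper_bound]
outlier_percentage = len(outliers) / len(ages) * 100

print(outliers, outlier_percentage)  # [92] 10.0
```

The rule flags only the single extreme age here. On a large, right-skewed column such as Age, a few percent of values beyond the upper bound is expected and does not by itself indicate bad data.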

### Process Data According to Their Data Type for the Pipeline

In [0]:
# Separate numerical and categorical columns for data_processed
numerical_cols_processed = df_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols_processed = df_train.select_dtypes(include=['object']).columns.tolist()

# Remove 'Exited' column from numerical_cols if present
if 'Exited' in numerical_cols_processed:
    numerical_cols_processed.remove('Exited')

numerical_cols_processed, categorical_cols_processed

### Transformation, Scaling and Encoding for Pipeline

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OrdinalEncoder
import numpy as np

# Log transformation function
log_transform_func = FunctionTransformer(np.log1p)

# Define transformations for numerical and categorical columns
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('ordinal_encoder', OrdinalEncoder())
])

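# Define column transformer
# Note: 'Balance' and 'EstimatedSalary' are also in numerical_cols_processed,
# so the output contains both a scaled and a log-transformed copy of each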
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols_processed),
        ('cat', categorical_transformer, categorical_cols_processed),
        ('log', log_transform_func, ['Balance', 'EstimatedSalary'])
    ]
)

# Define final pipeline
final_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

In [0]:
final_pipeline

### Splitting the Data

In [0]:
# Split train data into features and target variable
X_train = df_train.drop('Exited', axis=1)
y_train = df_train['Exited']

In [0]:
# Split test data into features
X_test = df_test

In [0]:
from sklearn.model_selection import train_test_split

# Split train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [0]:
# Fit the pipeline on the training data
final_pipeline.fit(X_train, y_train)

### Apply Transformations on the Split Data

In [0]:
# Apply the pipeline transformations on training and validation data
X_train_transformed = final_pipeline.transform(X_train)
X_val_transformed = final_pipeline.transform(X_val)

# Modelling

Modelling builds the machine learning models for the project.

**To Do:**
> *Build a Logistic Regression model with hyperparameter tuning and cross-validation*

> *Build a Random Forest model with cross-validation*

> *Build a Neural Network with cross-validation*

> *Train the models, get their classification reports, and evaluate them on the validation data*

> *Plot a comparison chart for the models*

### Logistic Regression

In [0]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Define the hyperparameter grid
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga']
}

# Create an instance of LogisticRegression
logistic_model = LogisticRegression(random_state=42)

# Create an instance of GridSearchCV
grid_search = GridSearchCV(estimator=logistic_model,
                           param_grid=param_grid,
                           cv=5,
                           scoring='roc_auc',
                           n_jobs=-1)

# Applying SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_transformed, y_train)

# Fit the GridSearchCV object on the resampled training data
grid_search.fit(X_train_resampled, y_train_resampled)

# Access the best hyperparameters and the best estimator
print("Best hyperparameters: ", grid_search.best_params_)
best_model = grid_search.best_estimator_

# Evaluate the best model on the validation data
lr_y_val_pred = best_model.predict(X_val_transformed)
lr_y_val_pred_proba = best_model.predict_proba(X_val_transformed)[:, 1]

# Calculate the accuracy
accuracy = accuracy_score(y_val, lr_y_val_pred)
print(f"Accuracy: {accuracy}")

# Calculate AUC-ROC score
auc_score = roc_auc_score(y_val, lr_y_val_pred_proba)
print(f"AUC-ROC score: {auc_score}")

# Print the classification report
print("Classification Report:")
print(classification_report(y_val, lr_y_val_pred))

# Print the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_val, lr_y_val_pred))


The logistic regression model was tuned using GridSearchCV to find the optimal hyperparameters. The best parameters were L1 regularization (penalty='l1'), an inverse regularization strength of 0.1 (C=0.1), and the 'liblinear' solver. On the validation set, this model achieved an accuracy of 0.73 and an AUC-ROC score of 0.80. While the model demonstrated high precision (0.91) for the majority class (not exited), its precision for the minority class (exited) was lower at 0.42, meaning that more than half of the customers it flagged as exited had actually stayed (false positives).
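As a reminder of how the class-level precision and recall in a classification report relate to the confusion matrix, the sketch below computes both from raw counts. `precision_recall` is a hypothetical helper and the counts are made up for illustration; they are not the notebook's actual confusion matrix.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return precision, recall


# Made-up counts for the 'exited' class
tp = 60   # exited, predicted exited
fp = 80   # stayed, predicted exited (false positives)
fn = 20   # exited, predicted stayed (false negatives)

precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision points to many false positives, while low recall points to many false negatives. After oversampling with SMOTE, a classifier often trades some precision on the minority class for higher recall.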

### Random Forest

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix
from imblearn.pipeline import Pipeline as ImbPipeline

# Create an instance of the Random Forest Classifier
rf_classifier = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42))
])
# Perform cross-validation
cv_scores = cross_val_score(rf_classifier, X_train_transformed, y_train, cv=5)

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Calculate and print the mean cross-validation score
mean_cv_score = cv_scores.mean()
print("Mean cross-validation score:", mean_cv_score)

# Train the model on the entire training data
rf_classifier.fit(X_train_transformed, y_train)

# Evaluate the model on the validation data
rf_y_val_pred = rf_classifier.predict(X_val_transformed)
rf_y_val_pred_proba = rf_classifier.predict_proba(X_val_transformed)[:, 1]

# Calculate the accuracy
accuracy = accuracy_score(y_val, rf_y_val_pred)
print(f"Accuracy: {accuracy}")

# Calculate AUC-ROC score
auc_score = roc_auc_score(y_val, rf_y_val_pred_proba)
print(f"AUC-ROC score: {auc_score}")

# Print the classification report
print("Classification Report:")
print(classification_report(y_val, rf_y_val_pred))

# Print the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_val, rf_y_val_pred))

The random forest model, trained using a pipeline with SMOTE oversampling and 100 estimators, exhibited the highest performance among the three models. Cross-validation scores indicated a mean accuracy of 0.78, suggesting good generalization capabilities. On the validation set, the model achieved an accuracy of 0.78 and an AUC-ROC score of 0.82, outperforming the logistic regression and neural network models. However, similar to the logistic regression model, the random forest classifier demonstrated a higher precision (0.90) for the majority class and a lower precision (0.48) for the minority class.

### Neural Network

In [0]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

# Wrap the MLPClassifier in an imbalanced-learn pipeline so SMOTE is applied only when fitting
nn_model = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('nn', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300, random_state=42))
])

# Perform cross-validation on the transformed data
scores = cross_val_score(nn_model, X_train_transformed, y_train, cv=5, scoring='accuracy')

# Print the cross-validation scores
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())

# Fit the model on the entire training data
nn_model.fit(X_train_transformed, y_train)

# Evaluate the neural network (not the logistic regression best_model) on the validation data
nn_y_val_pred = nn_model.predict(X_val_transformed)
nn_y_val_pred_proba = nn_model.predict_proba(X_val_transformed)[:, 1]

# Calculate the accuracy
accuracy = accuracy_score(y_val, nn_y_val_pred)
print(f"Accuracy: {accuracy}")

# Calculate AUC-ROC score
auc_score = roc_auc_score(y_val, nn_y_val_pred_proba)
print(f"AUC-ROC score: {auc_score}")

# Print the classification report
print("Classification Report:")
print(classification_report(y_val, nn_y_val_pred))

# Print the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_val, nn_y_val_pred))

The neural network model is a multi-layer perceptron with two hidden layers (100 and 50 nodes) and SMOTE oversampling. Cross-validation scores showed a mean accuracy of 0.76. The validation figures recorded for this model (accuracy 0.73, AUC-ROC 0.80, and the classification report and confusion matrix) were identical to those of the logistic regression model. Identical results from two different model families indicate that the evaluation used the logistic regression estimator (`best_model`) rather than `nn_model`, so these validation numbers should not be read as the neural network's own performance.

# Graphical Representation of Model Performance

In [0]:
import matplotlib.pyplot as plt

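from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score

# Compute the metrics from each model's validation predictions instead of hard-coding them
models = ['Logistic Regression', 'Random Forest', 'Neural Network']
val_preds = [lr_y_val_pred, rf_y_val_pred, nn_y_val_pred]
val_probas = [lr_y_val_pred_proba, rf_y_val_pred_proba, nn_y_val_pred_proba]
accuracy_scores = [accuracy_score(y_val, pred) for pred in val_preds]
auc_roc_scores = [roc_auc_score(y_val, proba) for proba in val_probas]
precision_scores = [precision_score(y_val, pred) for pred in val_preds]
recall_scores = [recall_score(y_val, pred) for pred in val_preds]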

# Plot comparison chart
plt.figure(figsize=(10, 6))

# Accuracy comparison
plt.plot(models, accuracy_scores, marker='o', label='Accuracy')

# AUC-ROC comparison
plt.plot(models, auc_roc_scores, marker='o', label='AUC-ROC Score')

# Precision comparison
plt.plot(models, precision_scores, marker='o', label='Precision (Class 1)')

# Recall comparison
plt.plot(models, recall_scores, marker='o', label='Recall (Class 1)')

plt.title('Model Performance Comparison')
plt.xlabel('Models')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
