# Bank Customer

## Objectives

Predict whether a customer will subscribe to a term deposit based on their demographic and account information using logistic regression.

## Dataset Overview

We use the [Bank Customers Data](https://www.kaggle.com/datasets/kidoen/bank-customers-data) dataset from Kaggle, loaded below from `BankCustomerData.csv`, with the following columns:

* **age**: The age of the customer.
* **job**: Type of job (e.g., admin, technician, services, management).
* **marital**: Marital status (e.g., married, single, divorced).
* **education**: Level of education (e.g., secondary, tertiary, primary, unknown).
* **default**: Has credit in default? (yes, no).
* **balance**: Average yearly balance, in euros.
* **housing**: Has housing loan? (yes, no).
* **loan**: Has personal loan? (yes, no).
* **contact**: Communication type (e.g., unknown, cellular, telephone).
* **day**: Last contact day of the month.
* **month**: Last contact month of the year (e.g., jan, feb, mar, ...).
* **duration**: Last contact duration, in seconds.
* **campaign**: Number of contacts performed during this campaign for this client.
* **pdays**: Number of days since the client was last contacted in a previous campaign (-1 in the data shown below means the client was not previously contacted).
* **previous**: Number of contacts performed before this campaign and for this client.
* **poutcome**: Outcome of the previous marketing campaign (e.g., unknown, other, failure, success).
* **subscribed**: Has the client subscribed to a term deposit? (yes, no) - Target Variable. This column is named `term_deposit` in the raw file and is renamed during processing.

## Steps

In [None]:
# Importing necessary libraries for logistic regression
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

### Data Processing

* Load the dataset into a Pandas DataFrame.
* Convert categorical variables into dummy variables.
* Handle missing values if any.
* Convert the target variable (`term_deposit`, renamed to `subscribed`) into a binary format (1 for yes, 0 for no).
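
Before filling missing values, it helps to see where they are. The sketch below uses a small hypothetical frame (`toy`, not the bank data) to count missing entries per column, then fills numeric columns with the mean and a categorical column with its most frequent value. The choice of fill strategy is an assumption for illustration; in the bank data, gaps in categorical columns appear to be coded as the string `unknown` rather than as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bank data
toy = pd.DataFrame({
    "age": [58, 44, np.nan, 47],
    "balance": [2143.0, 29.0, 2.0, np.nan],
    "job": ["management", "technician", None, "blue-collar"],
})

# Count missing values per column before deciding how to fill them
missing_counts = toy.isna().sum()
print(missing_counts)

# Fill numeric columns with the column mean
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Fill the categorical column with its most frequent value
toy["job"] = toy["job"].fillna(toy["job"].mode()[0])
```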

In [None]:
# Loading a csv file called "BankCustomerData.csv"
df = pd.read_csv("BankCustomerData.csv")

# Displaying the first five rows of the data frame
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,term_deposit
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [None]:
# Converting categorical variables into dummy variables
categorical_var = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
df = pd.get_dummies(df, columns = categorical_var)
df.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,term_deposit,job_admin.,job_blue-collar,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,58,2143,5,261,1,-1,0,no,0,0,...,0,0,1,0,0,0,0,0,0,1
1,44,29,5,151,1,-1,0,no,0,0,...,0,0,1,0,0,0,0,0,0,1
2,33,2,5,76,1,-1,0,no,0,0,...,0,0,1,0,0,0,0,0,0,1
3,47,1506,5,92,1,-1,0,no,0,1,...,0,0,1,0,0,0,0,0,0,1
4,33,1,5,198,1,-1,0,no,0,0,...,0,0,1,0,0,0,0,0,0,1
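
A full set of dummy columns for one categorical variable always sums to 1 per row, so any one of them is determined by the others. The sketch below, on a hypothetical one-column frame, shows how the `drop_first=True` option of `pd.get_dummies` removes that redundancy. Whether to use it here is a modelling choice, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy column standing in for 'marital'
toy = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

# Full encoding: one column per category
full = pd.get_dummies(toy, columns=["marital"], dtype=int)

# Reduced encoding: the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True, dtype=int)

# In the full encoding, each row's dummy columns sum to exactly 1
row_sums = full.sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
```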


In [None]:
# Handling missing values (numeric columns only; 'term_deposit' still holds strings at this point)
df = df.fillna(df.mean(numeric_only=True))


In [None]:
# Renaming the 'term_deposit' column to 'subscribed' to avoid confusion
df.rename(columns={'term_deposit': 'subscribed'}, inplace=True)

# Converting 'yes' to 1 and 'no' to 0 in the 'subscribed' column
df['subscribed'] = df['subscribed'].apply(lambda x: 1 if x == 'yes' else 0)
df['subscribed'].head()

0    0
1    0
2    0
3    0
4    0
Name: subscribed, dtype: int64

### Feature Selection

Decide which features to include in the model. Here, one feature from each highly correlated pair
(absolute correlation above 0.8) is dropped to reduce multicollinearity.

In [None]:
# Calculating the absolute correlation matrix of the features only (target excluded)
corr_matrix = df.drop(columns = 'subscribed').corr().abs()

# Keeping only the upper triangle so that each pair of features is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype = bool), k = 1))

# Finding features with high correlation (> 0.8) to avoid multicollinearity
df_drop = [column for column in upper.columns if any(upper[column] > 0.8)]

# Dropping those features by column name
df = df.drop(columns = df_drop)
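
To see what the upper-triangle filter does, here is a minimal sketch on hypothetical synthetic data: column `b` is almost a rescaled copy of `a`, while `c` is independent. The column names, seed, and sample size are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    "c": rng.normal(size=200),                       # independent of 'a' and 'b'
})

# Absolute pairwise correlations
corr = toy.corr().abs()

# Mask everything except the strict upper triangle, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
```

Because only the later column of each pair is flagged, `a` is kept and `b` is removed; the result depends on column order.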


### Data Splitting

Split the dataset into training and testing sets (typically a 70-30 or 80-20 split).

In [None]:
# Separating features and target variable
x = df.drop('subscribed', axis = 1)
y = df['subscribed']

# Splitting the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
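
When one class is much rarer than the other, a plain random split can leave the two sets with slightly different class ratios. The sketch below shows the `stratify` argument of `train_test_split` on hypothetical labels with 10% positives; it is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 positives, 900 negatives
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```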


### Model Training

Train a logistic regression model on the training set.

In [None]:
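# Creating an instance of StandardScaler
scaler = StandardScaler()

# Standardizing the features: fit the scaler on the training set only,
# then apply the same transformation to the test set to avoid data leakage
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)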

In [None]:
# Creating a Logistic Regression model
model = LogisticRegression()

# Training the model using the scaled training data
model.fit(x_train_scaled, y_train)
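
One way to keep the scaler's statistics tied to the training split is to wrap scaling and the model in a scikit-learn pipeline. The sketch below uses hypothetical synthetic features on very different scales; the data, seed, and `max_iter` value are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3)) * [1, 100, 1000]
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only and
# reuses those statistics when transforming the test data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```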

### Model Evaluation

Evaluate the model's performance on the testing set using metrics such as accuracy, precision,
recall, F1-score, and the confusion matrix.

In [None]:
# Predict using the trained model on the scaled test data
y_pred = model.predict(x_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Generate classification report
class_report = classification_report(y_test, y_pred)

# Display the evaluation metrics
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Accuracy: 0.9158067542213884
Confusion Matrix:
[[7584  144]
 [ 574  226]]
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.98      0.95      7728
           1       0.61      0.28      0.39       800

    accuracy                           0.92      8528
   macro avg       0.77      0.63      0.67      8528
weighted avg       0.90      0.92      0.90      8528
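
Accuracy alone can be misleading when most customers do not subscribe. As a rough check, the sketch below recomputes a few numbers directly from the confusion matrix reported above and compares the model with a baseline that always predicts "no".

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
conf_matrix = np.array([[7584, 144],
                        [574, 226]])
tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

accuracy = (tn + tp) / total           # share of correct predictions
baseline_accuracy = (tn + fp) / total  # accuracy of always predicting class 0
recall_pos = tp / (tp + fn)            # share of actual subscribers found
precision_pos = tp / (tp + fp)         # share of predicted subscribers that are real

print(f"Accuracy:          {accuracy:.3f}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Recall (class 1):  {recall_pos:.2f}")
print(f"Precision (class 1): {precision_pos:.2f}")
```

The model's accuracy is only about one percentage point above the always-"no" baseline, and it finds roughly 28% of the customers who actually subscribed, so recall on class 1 is the number to watch.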




### Conclusion

Summarize the model's performance and discuss any insights or implications for the bank's
marketing strategies.

In [None]:
# Displaying the summary of the model performance
print("Model Performance Summary:")
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

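# Interpreting the accuracy (about 91% of test customers are in class 0,
# so accuracy should be read alongside the recall for class 1)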
if accuracy > 0.8:
  print("\nModel Accuracy Interpretation: The model has a high overall accuracy in predicting whether a customer will subscribe to a term deposit.")
else:
  print("\nModel Accuracy Interpretation: The model has a relatively low overall accuracy in predicting whether a customer will subscribe to a term deposit.")

# Displaying the classification report
print("\nClassification Report:")
print(class_report)

# Providing insights or implications for the bank's marketing strategies
print("\nTo improve the model's performance, the bank could consider the following strategies:")
print("- Analyze the confusion matrix and classification reports in order to identify the type of customers that the model struggles with.")
print("- Try different machine learning algorithms.")
print("- Collect more data or obtain more relevant features.")
print("- Monitor the model's performance over time.")
print("- Update it to adapt with changes as customer behavior and market conditions change over time.")

Model Performance Summary:
Accuracy: 0.92
Confusion Matrix:
[[7584  144]
 [ 574  226]]

Model Accuracy Interpretation: The model has a high overall accuracy in predicting whether a customer will subscribe to a term deposit.

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.98      0.95      7728
           1       0.61      0.28      0.39       800

    accuracy                           0.92      8528
   macro avg       0.77      0.63      0.67      8528
weighted avg       0.90      0.92      0.90      8528


To improve the model's performance, the bank could consider the following strategies:
- Analyze the confusion matrix and classification reports in order to identify the type of customers that the model struggles with.
- Try different machine learning algorithms.
- Collect more data or obtain more relevant features.
- Monitor the model's performance over time.
- Update it to adapt with changes as customer behavior and mar