# 01 Data Preprocessing

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(r'/content/BankCustomerData.csv')

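# One-hot encode the categorical columns; drop_first=True drops one level per column,
# so a yes/no column becomes a single `<column>_yes` indicator.
df_dummies = pd.get_dummies(df, drop_first=True)

missing_val = df.isnull().sum().sum()
# no missing values

# Safeguard: map any column that still holds raw 'yes'/'no' strings to 1/0.
# If get_dummies has already encoded every such column, this loop changes nothing.
for column in df_dummies.columns:
    if set(df_dummies[column].unique()) == {'yes', 'no'}:
        df_dummies[column] = df_dummies[column].map({'yes': 1, 'no': 0})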
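
To see what `pd.get_dummies(..., drop_first=True)` does to yes/no columns, here is a small sketch on a toy frame. The column names and values are made up for illustration and are not taken from `BankCustomerData.csv`.

```python
import pandas as pd

# Hypothetical toy data: one numeric, one yes/no, and one multi-category column.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "housing": ["yes", "no", "yes"],
    "marital": ["single", "married", "divorced"],
})

# Same call as in the preprocessing cell above.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())

# Columns that still contain the raw strings 'yes'/'no' after encoding.
remaining_yes_no = [
    col for col in encoded.columns
    if set(encoded[col].unique()) == {"yes", "no"}
]
print(remaining_yes_no)
```

On this toy frame, `housing` comes out as a single `housing_yes` indicator and no raw yes/no column is left, so a follow-up yes/no mapping has nothing to convert.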


# 02 Feature Selection

In [None]:
# Drop the "unknown" category dummies, the month dummies, and the target column.
columns_to_drop = [
    'job_unknown', 'education_unknown', 'contact_unknown',
    'month_aug', 'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
    'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
    'poutcome_unknown', 'term_deposit_yes',
]
x = df_dummies.drop(columns_to_drop, axis=1)
y = df_dummies['term_deposit_yes']


# 03 Data Splitting

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

# 04 Model Training

In [None]:
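# Fit the scaler on the training split only, then apply the same transform to the test split.
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Logistic regression with scikit-learn's default settings.
model = LogisticRegression()

model.fit(x_train_scaled, y_train)

# Predicted class labels for the held-out test set.
y_pred = model.predict(x_test_scaled)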

# 05 Model Evaluation

In [None]:
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"""
Accuracy : {accuracy}
Confusion Matrix :
{conf_matrix}
Classification Report :
{class_report}
""")


Accuracy : 0.9154549718574109
Confusion Matrix :
[[7606  122]
 [ 599  201]]
Classification Report :
              precision    recall  f1-score   support

           0       0.93      0.98      0.95      7728
           1       0.62      0.25      0.36       800

    accuracy                           0.92      8528
   macro avg       0.77      0.62      0.66      8528
weighted avg       0.90      0.92      0.90      8528




# 06 Conclusion

Model Performance Summary

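*   The model classifies about 91.55% of test customers correctly. This figure needs context: 7728 of the 8528 test customers (about 90.6%) did not subscribe, so a model that always predicts "no" would already reach roughly 90.6% accuracy.
*   The model is reliable at identifying customers who do not subscribe to a term deposit: recall for class 0 is 0.98.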

* The confusion matrix shows where the model succeeds and where it fails. It correctly identifies 7606 non-subscribers (True Negatives, TN) and wrongly flags only 122 non-subscribers as subscribers (False Positives, FP). It is much weaker on actual subscribers: it finds 201 of them (True Positives, TP) and misses 599 (False Negatives, FN).

* The classification report quantifies this for the subscriber class (class 1). Precision, the positive predictive value, is 0.62: of the customers predicted to subscribe, 62% actually do. Recall, the true positive rate or sensitivity, is only 0.25: the model captures 25% of the actual subscribers. The F1-score, the harmonic mean of precision and recall, is 0.36, which reflects the low recall.
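
The figures in the bullets above can be recomputed by hand from the four confusion-matrix counts. The sketch below uses only the counts printed in the evaluation output. The variable names, and the `baseline_accuracy` of a model that always predicts "no", are illustrative additions and not part of the original notebook.

```python
# Counts read off the confusion matrix [[TN, FP], [FN, TP]] printed above.
tn, fp, fn, tp = 7606, 122, 599, 201

total = tn + fp + fn + tp

accuracy = (tp + tn) / total                        # share of all predictions that are correct
precision = tp / (tp + fp)                          # predicted subscribers who really subscribe
recall = tp / (tp + fn)                             # actual subscribers the model finds
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
specificity = tn / (tn + fp)                        # recall for the non-subscriber class

# Accuracy of a trivial model that always predicts "no" (class 0).
baseline_accuracy = (tn + fp) / total

print(f"accuracy          : {accuracy:.4f}")
print(f"precision (1)     : {precision:.4f}")
print(f"recall (1)        : {recall:.4f}")
print(f"f1 (1)            : {f1:.4f}")
print(f"recall (0)        : {specificity:.4f}")
print(f"baseline accuracy : {baseline_accuracy:.4f}")
```

Comparing `accuracy` with `baseline_accuracy` shows how little of the headline accuracy comes from finding subscribers.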

In summary, the model identifies non-subscribers well but misses most potential subscribers. Lowering the classification threshold would raise recall, usually at the cost of precision, so this precision-recall trade-off should be weighed when choosing an operating point. Possible improvements include adjusting the classification threshold or exploring additional features.
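
A minimal sketch of the threshold adjustment mentioned above, under stated assumptions: it runs on synthetic, imbalanced data from `make_classification` rather than on the bank dataset, and the 0.3 threshold is an arbitrary illustrative value, not a recommendation. It uses `predict_proba` to obtain class-1 probabilities and compares precision and recall at two cut-offs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 10% positives (NOT the bank dataset).
x_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# Same scaling and model steps as in the notebook.
scaler_demo = StandardScaler()
x_tr_scaled = scaler_demo.fit_transform(x_tr)
x_te_scaled = scaler_demo.transform(x_te)
model_demo = LogisticRegression().fit(x_tr_scaled, y_tr)

# Probability of the positive class (second column of predict_proba).
proba = model_demo.predict_proba(x_te_scaled)[:, 1]

results = {}
for threshold in (0.5, 0.3):  # 0.5 approximates the default; 0.3 is illustrative
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat, zero_division=0),
    }
    print(threshold, results[threshold])
```

Every sample flagged at 0.5 is also flagged at 0.3, so recall cannot decrease when the threshold is lowered. How much precision is lost in exchange depends on the data, so the same comparison would need to be run on the real test split before picking a threshold.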

