# Project 2 - Classification
## Predict customers likely to respond to a marketing campaign
### This notebook covers the evaluation and deployment of the Logistic Regression model

# **1. Evaluation of the Logistic Regression Model**

## 1.1 Evaluate Model Performance


**Metrics Summary and Business Alignment**

The logistic regression model demonstrates strong and balanced performance across all key classification metrics.

Accuracy, precision, recall, F1 score, and AUC are all slightly higher on the test set than on the training set, which indicates good generalization with no sign of overfitting. Specifically, precision (91.0%) and recall (91.4%) are both high, supporting two core business goals: improving targeting precision to reduce wasted contacts, and ensuring broad reach to capture potential responders.

The high F1 score (91.2%) confirms a strong balance between precision and recall, making the model reliable in managing trade-offs between cost-efficiency and conversion potential. Additionally, the strong AUC (90.8%) shows that the model effectively distinguishes between responders and non-responders, which is crucial for making confident decisions on campaign inclusion based on predicted probabilities.
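
As a quick check, the precision, recall, and F1 values quoted above can be recomputed from confusion-matrix counts. The sketch below assumes TP = 171, FP = 17, FN = 16, and TN = 155. These counts are not stated explicitly in this notebook; they are inferred from the per-class contact costs listed in section 1.3, divided by the €3 cost per contact.

```python
# Confusion-matrix counts (assumption: inferred from the per-class contact
# costs in section 1.3 divided by the EUR 3 cost per contact).
tp, fp, fn, tn = 171, 17, 16, 155

precision = tp / (tp + fp)   # share of predicted responders who actually respond
recall = tp / (tp + fn)      # share of actual responders the model finds
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Under these assumed counts, the results round to the reported 91.0% precision, 91.4% recall, and 91.2% F1.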

**Conclusion:**
The model performs well technically and is well aligned with the business objectives, enabling smarter, more efficient marketing outreach with clear potential to improve campaign profitability.

## 1.2 Analysis of the Modeling Process




The use of Logistic Regression was appropriate for this classification problem, particularly because it is a well-established, interpretable algorithm suited for binary outcomes—such as predicting whether a client will convert or not. The modeling process appears methodical: multiple models were tested, evaluation metrics (precision, recall, cost/profit) were used, and the model showing the best trade-off between performance and overfitting was selected. Additionally, adjusting the threshold to optimize for business objectives (e.g., profit) shows a practical, value-oriented approach beyond just accuracy.

## 1.3 Compare the results with Business Objectives


After analyzing all the model results, we identified one that shows no signs of overfitting and demonstrates excellent performance for our specific context.

- The best-performing Logistic Regression model achieves a precision of 91%.

- The recall is also 91%, which indicates a good balance between capturing actual positive cases and avoiding false positives.

  - This balance between precision and recall is crucial: high precision ensures we are mostly targeting customers likely to respond, while high recall ensures we don't miss too many potential responders. A trade-off between the two can impact campaign efficiency and profitability.

- The model allows for data-driven decision-making on whether a customer should be included in a campaign, based on the predicted probability of conversion.
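
One way to turn a predicted probability into an inclusion decision is an expected-value rule. The sketch below assumes the €3 cost per contact and €11 revenue per conversion quoted in the financial conclusion of this section. The function name `should_contact` and the example probabilities are hypothetical, and this is not necessarily the rule used to build the selected model.

```python
COST_PER_CONTACT = 3.0          # EUR, as stated in this notebook
REVENUE_PER_CONVERSION = 11.0   # EUR, as stated in this notebook

# Contacting a client pays off on average when p * revenue > cost,
# i.e. when p exceeds the break-even probability cost / revenue.
BREAK_EVEN = COST_PER_CONTACT / REVENUE_PER_CONVERSION  # 3 / 11, about 0.273


def should_contact(p_convert: float) -> bool:
    """Return True when the expected revenue of a contact exceeds its cost."""
    return p_convert * REVENUE_PER_CONVERSION > COST_PER_CONTACT


# Hypothetical predicted probabilities for three clients.
for p in (0.10, 0.30, 0.85):
    print(p, should_contact(p))
```

Under these assumptions, any client with a predicted conversion probability above roughly 27% is worth contacting on expected value alone.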



We experimented with adjusting the classification threshold to maximize profit while avoiding overfitting. Although overall performance improved, recall began to show signs of overfitting, suggesting that this version might be memorizing the training data rather than generalizing. Choosing it could lead to unreliable results in production.
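
A threshold search of this kind can be sketched as follows. The labels and probabilities are small synthetic values chosen for illustration, the profit definition (€11 per converted contact minus €3 per contact) is an assumption based on the figures in this notebook, and `profit_at` is a hypothetical helper.

```python
COST_PER_CONTACT = 3
REVENUE_PER_CONVERSION = 11

# Synthetic example: true outcomes and predicted conversion probabilities.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.80, 0.65, 0.40, 0.30, 0.20, 0.10, 0.05]


def profit_at(threshold, y_true, y_prob):
    """Profit when only clients with probability >= threshold are contacted."""
    contacted = [y for y, p in zip(y_true, y_prob) if p >= threshold]
    conversions = sum(contacted)
    return REVENUE_PER_CONVERSION * conversions - COST_PER_CONTACT * len(contacted)


# Try every observed probability as a candidate threshold.
candidates = sorted(set(y_prob))
best_threshold = max(candidates, key=lambda t: profit_at(t, y_true, y_prob))
best_profit = profit_at(best_threshold, y_true, y_prob)

print(best_threshold, best_profit)
```

Running the same search separately on training and test data is one way to see whether a profit-optimal threshold generalizes or only fits the training set.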

Therefore, we decided to proceed with the initial model evaluated (Evaluation A) found in the notebook 'Project_Logistic_Regression_Group W'.

**Financial implications**
- The financial implications of the chosen model's predictions are as follows:
  - Assuming all clients are contacted, we have a total cost of €1077, broken down as follows:
    - €465 in True Negatives (TN)
    - €513 in True Positives (TP)
    - €48 in False Negatives (FN)
    - €51 in False Positives (FP)

  - Given the high precision of the model, we can confirm the following:
    - A real profit of €1881 from True Positives (TP) — clients correctly identified and who converted
    - A revenue loss of €1705 due to True Negatives (TN) — clients who didn’t convert but were still contacted
    - An extra (incorrect) profit of €187 from False Positives (FP) — clients wrongly predicted to convert
    - A missed profit of €176 from False Negatives (FN) — clients who would have converted but the model failed to identify
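
The figures in this breakdown follow from simple arithmetic on the confusion matrix. The sketch below assumes counts of TN = 155, TP = 171, FN = 16, and FP = 17, which are inferred from the listed costs at €3 per contact rather than stated in the notebook, together with the €3 cost per contact and €11 revenue per conversion given in the conclusion that follows.

```python
COST_PER_CONTACT = 3          # EUR per contacted client
REVENUE_PER_CONVERSION = 11   # EUR per conversion

# Assumed counts, inferred from the listed costs (e.g. EUR 513 / EUR 3 = 171).
counts = {"TN": 155, "TP": 171, "FN": 16, "FP": 17}

# Contact cost per group when every client is contacted.
contact_cost = {group: n * COST_PER_CONTACT for group, n in counts.items()}
total_cost = sum(contact_cost.values())

# Each group valued at EUR 11 per client, matching the second list above.
value_at_revenue = {group: n * REVENUE_PER_CONVERSION for group, n in counts.items()}

print(contact_cost, total_cost)
print(value_at_revenue)
```

With these counts the per-group costs sum to €1077, and the €11 valuation reproduces the €1881, €1705, €187, and €176 figures.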


**Conclusion**:
The financial analysis shows that the chosen model is profitable, generating a real profit of €1881 from true positives against a total cost of €1077 for contacting all clients. Given a fixed cost of €3 per contacted client and €11 in revenue per conversion, the model demonstrates a strong cost-benefit ratio. The losses from false negatives (€176 in missed profit) and false positives (€51 in cost without return) are relatively minor.

The most significant financial impact comes from true negatives: a €1705 loss from contacting clients who did not convert. Nevertheless, the strong balance between precision and recall makes the model effective for profit-driven decision-making.

Overall, despite some inefficiencies, the model's ability to accurately target likely converters ensures a clear net gain, making it a financially sound tool for guiding marketing campaigns.

## 1.4 Check for Any Issues


**Overfitting or Underfitting:**
The selected logistic regression model does not show signs of overfitting, as its performance remains consistent across training and test sets. Metrics such as precision and recall are balanced, and generalization to unseen data is reliable. There is also no evidence of underfitting, as the model captures relevant patterns in the data effectively.
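
A train/test comparison of this kind can be made explicit with a small helper. In the sketch below the test values are the ones reported in section 1.1, while the training values are hypothetical placeholders (the notebook only says they are slightly lower). Both the 0.05 tolerance and the function name `overfitting_gaps` are assumptions.

```python
def overfitting_gaps(train, test, tolerance=0.05):
    """Return metrics whose training score exceeds the test score by more than tolerance."""
    return {
        name: round(train[name] - test[name], 4)
        for name in train
        if train[name] - test[name] > tolerance
    }


# Test metrics as reported in section 1.1.
test_metrics = {"precision": 0.910, "recall": 0.914, "f1": 0.912, "auc": 0.908}
# Hypothetical training metrics, for illustration only.
train_metrics = {"precision": 0.905, "recall": 0.910, "f1": 0.907, "auc": 0.905}

flagged = overfitting_gaps(train_metrics, test_metrics)
print(flagged or "no metric exceeds the tolerance")
```

An empty result means no training metric is more than the tolerance above its test counterpart; a metric that does appear would be a candidate sign of overfitting.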

**Model Interpretability:**
One of the key advantages of logistic regression is its high interpretability. The model provides clear coefficients that indicate the direction and strength of influence of each feature. This transparency allows stakeholders to understand and trust how decisions are being made, which is critical in business contexts where explainability is important for adoption.
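
To make the coefficient interpretation concrete, each coefficient can be converted to an odds ratio by exponentiating it. The sketch below uses hypothetical feature names and coefficient values. With a fitted scikit-learn `LogisticRegression`, the inputs would typically come from the training column list and `model.coef_[0]`, assuming that is the model type stored in `model.pickle`.

```python
import math


def odds_ratios(feature_names, coefficients):
    """Map each feature to exp(coefficient), largest absolute effect first."""
    pairs = sorted(zip(feature_names, coefficients), key=lambda nc: abs(nc[1]), reverse=True)
    return {name: math.exp(coef) for name, coef in pairs}


# Hypothetical values for illustration only.
names = ["feature_a", "feature_b", "feature_c"]
coefs = [0.69, -1.20, 0.05]

ratios = odds_ratios(names, coefs)
for name, ratio in ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of response)")
```

An odds ratio describes the multiplicative change in the odds of response for a one-unit increase in the feature, so the comparison across features is only meaningful when the features are on comparable scales.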

**Ethical Concerns or Bias:**
The modeling process followed a careful and fair approach, ensuring that the data used was representative and free from discriminatory patterns. As a result, there is no indication of ethical concerns or bias in the predictions. The model treats different groups consistently, and no disparities were observed in performance across segments. This alignment with ethical standards strengthens the model’s reliability and fairness.
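
A claim of consistent performance across groups can be checked by computing the same metric per segment. The sketch below uses synthetic labels, predictions, and segment names ("A" and "B"); `recall_by_segment` is a hypothetical helper and is not part of the original analysis.

```python
from collections import defaultdict


def recall_by_segment(y_true, y_pred, segments):
    """Recall (TP / actual positives) computed separately for each segment."""
    tp = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, seg in zip(y_true, y_pred, segments):
        if truth == 1:
            positives[seg] += 1
            if pred == 1:
                tp[seg] += 1
    return {seg: tp[seg] / positives[seg] for seg in positives}


# Synthetic example data.
y_true = [1, 1, 0, 1, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
segments = ["A", "A", "A", "A", "B", "B", "B", "B"]

segment_recall = recall_by_segment(y_true, y_pred, segments)
print(segment_recall)
```

Similar values across segments support the consistency claim, while a large gap would warrant a closer look at the affected group.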





**Conclusion**:

Given the strong performance of the logistic regression model — with no signs of overfitting, high interpretability, and no ethical concerns or bias identified — the most appropriate next step is to proceed to deployment.

The model has demonstrated that it can reliably predict customer conversion with high precision and recall, providing actionable insights for campaign targeting. Additionally, its financial implications were clearly evaluated, showing a potential for optimized marketing decisions and improved profitability.

Before full deployment, it is recommended to:

- Monitor the model in a real-world setting using a pilot campaign.

- Set up performance tracking (e.g. precision, recall, ROI).

- Periodically reassess the model using updated data to ensure continued fairness and relevance.
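
The tracking step could start from a small KPI function evaluated after each pilot wave. This sketch assumes that only clients predicted as responders are contacted, reuses the €3 cost per contact and €11 revenue per conversion from section 1.3, and takes hypothetical pilot counts; `campaign_kpis` is an illustrative name.

```python
def campaign_kpis(tp, fp, fn, cost_per_contact=3.0, revenue_per_conversion=11.0):
    """Precision, recall, and ROI for a campaign that contacts predicted responders."""
    contacted = tp + fp
    cost = contacted * cost_per_contact
    revenue = tp * revenue_per_conversion
    return {
        "precision": tp / contacted,
        "recall": tp / (tp + fn),
        "roi": (revenue - cost) / cost,  # net return per euro spent
    }


# Hypothetical pilot outcome: 40 converted contacts, 10 wasted contacts,
# and 5 responders the model missed.
kpis = campaign_kpis(tp=40, fp=10, fn=5)
print(kpis)
```

Note that recall can only be measured in production if missed responders are observable, for example through a small randomly contacted control group.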

There is no current need for further modeling, data preparation, or problem reframing, as the model meets both business and technical objectives.



# 2. Deployment

In [None]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import category_encoders as ce
from sklearn import preprocessing
import pickle

In [None]:
with open("model.pickle", "rb") as file:
    model = pickle.load(file)

In [None]:
with open("feature_columns.pickle", "rb") as f:
    feature_columns = pickle.load(f)

In [None]:
# Loading pre-processed dataset from excel
X = pd.read_excel('transformed_data2.xlsx')

In [None]:
X.describe()

In [None]:
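# Apply the same pre-processing as in training
X = pd.read_excel('transformed_data2.xlsx')
# One-hot encode the categorical columns, dropping the first level of each
X = pd.get_dummies(X, columns=['Marital_Status', 'Education'], drop_first=True)
# The target is not a model input, so exclude it from the expected columns
feature_columns = [col for col in feature_columns if col != 'Response']
# Align with the training columns: add missing dummies as 0, drop extras, fix the order
X = X.reindex(columns=feature_columns, fill_value=0)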

In [None]:
print("Model expects:", model.n_features_in_, "features")
print("Columns in X:", list(X.columns))
print("Total columns in X:", X.shape[1])

In [None]:
# Drop the Marital_Status_Widow dummy column if it is present
X = X.drop(['Marital_Status_Widow'], axis=1, errors='ignore')

print("Shape of X:", X.shape)
print("Columns in X:", list(X.columns))
print("Model expects:", model.n_features_in_, "features")

In [None]:
y_pred = model.predict(X.values)
print(y_pred)

In [None]:
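# Export the features together with the predicted class as an Excel file
# (the column name 'Predicted_Response' is an assumed label)
X.assign(Predicted_Response=y_pred).to_excel('logisticregression_estimated.xlsx', index=False)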