## Logistic Regression for Customer Conversion Prediction in Bank Marketing
### Dataset choice
##### Bank Marketing (UCI / Kaggle) — binary classification problem (y: subscription yes/no).

###### - Real, tabular, mixed categorical + numeric features (age, job, balance, campaign, poutcome, contact, etc.).

###### - Good size for experiments and realistic results.

##### - Logistic Regression is appropriate because:
###### &nbsp;&nbsp;&nbsp;&nbsp;- it has interpretable coefficients
###### &nbsp;&nbsp;&nbsp;&nbsp;- it works well for binary outcomes
###### &nbsp;&nbsp;&nbsp;&nbsp;- it allows regularization
###### &nbsp;&nbsp;&nbsp;&nbsp;- it is a solid baseline you can improve

### 1. Problem

##### Predict whether a client will subscribe to a term deposit (y = 'yes' / 'no').

###### Binary classification → Logistic Regression is a natural baseline.
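
For reference, the model behind this baseline can be written out as below. This is the standard logistic regression formulation; the symbols (x for the feature vector, w for the coefficients, b for the intercept) are chosen here for illustration and do not appear in the notebook code.

```latex
P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}},
\qquad
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = w^\top x + b
```

Because the log-odds are linear in the features, a one-unit increase in feature x_j multiplies the odds of subscription by e^{w_j}, holding the other features fixed. This is what makes the coefficients interpretable.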



In [None]:
# ---------- Imports ----------
import pandas as pd                                   # data manipulation
import numpy as np                                    # numerical operations
import matplotlib.pyplot as plt                       # plotting
import seaborn as sns                                 # plotting helper
from sklearn.model_selection import train_test_split, GridSearchCV  # splitting & hyperparam search
from sklearn.preprocessing import StandardScaler      # feature scaling
from sklearn.linear_model import LogisticRegression   # logistic regression model
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, confusion_matrix,
                             roc_curve, roc_auc_score)  # evaluation metrics
import warnings                                       # ignore warnings for clean output
warnings.filterwarnings("ignore")                     # suppress warnings

### 2. Input Data (features)

##### Typical columns (UCI bank-full):

###### - age (numeric)

###### - job, marital, education (categorical)

###### - default (has credit in default: yes/no)

###### - balance (numeric)

###### - housing (yes/no), loan (yes/no)

###### - contact (categorical: telephone / cellular)

###### - day, month (time of contact)

###### - duration (call duration in seconds). Note: many studies drop duration for realistic predictive modelling because it is known only after the call ends; it is excluded by default here, as explained in the code comments.

###### - campaign, pdays, previous (campaign-related numeric features)

###### - poutcome (outcome of previous campaign)

###### - y — target: subscribe yes/no

##### - Read CSV with correct delimiter.

##### - Inspect columns and target distribution.

##### - Drop duration for realistic prediction (it leaks post-call information). (Optional: keep it if you want best possible offline performance.)

##### - Encode target y → binary (1 for yes, 0 for no).

##### - Handle missing values (this dataset typically has no NAs, but code checks).



In [None]:
# ---------- 1. Load dataset ----------
# (UCI Bank Marketing CSV uses semicolon ; as delimiter)
df = pd.read_csv("data/bank-full.csv", sep=';')       # load CSV into DataFrame (forward slashes work on every OS)

# ---------- 2. Quick inspection ----------
print("\n--- Dataset Shape ---")
print("Shape:", df.shape)                             # print number of rows and columns
print("\n--- Columns ---")
print(df.columns.tolist())                            # list column names
print("\n--- Sample Rows ---")
print(df.head())                                      # preview the first five rows
print("\n--- Target Distribution ---")
print(df['y'].value_counts(dropna=False))             # see target distribution (yes/no)

# ---------- 3. Clean / Basic preprocessing ----------
# For realistic predictive modelling, drop 'duration' because it is only known after the call
# (including it will inflate offline performance but is not available when deciding to call).
if 'duration' in df.columns:
    df = df.drop(columns=['duration'])                # remove duration to avoid target leakage

# Convert target 'y' from yes/no to binary 1/0
df['y'] = df['y'].map({'yes': 1, 'no': 0})           # map yes->1, no->0

# Quick NA check (this dataset normally has no missing values)
print("Missing values per column:\n", df.isnull().sum())
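
Note that `isnull()` only detects true NaN values. In the UCI Bank Marketing files, missing categorical values are recorded as the string 'unknown', so they do not show up in that check. A minimal sketch of counting them, using a tiny hand-made `toy` DataFrame (hypothetical rows, not the real file):

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's 'unknown' placeholder
toy = pd.DataFrame({
    "age": [30, 45, 52, 38],
    "job": ["admin.", "unknown", "technician", "unknown"],
    "contact": ["cellular", "unknown", "telephone", "cellular"],
})

# True NaNs: there are none, so isnull() reports zero for every column
nan_counts = toy.isnull().sum()

# Count the 'unknown' placeholder in each categorical column
categorical = ["job", "contact"]
unknown_counts = (toy[categorical] == "unknown").sum()

print(nan_counts)
print(unknown_counts)
```

Whether to keep 'unknown' as its own category (as one-hot encoding does implicitly) or to treat it as missing is a modelling choice; this sketch only makes the counts visible.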


### 3. Preprocessing (what we’ll do and why)

###### - Convert categorical variables → numeric via pd.get_dummies (one-hot); for high-cardinality features we may group rare categories.

###### - Feature engineering: e.g., previous > 0 → has_previous, derive age_group optionally. (I include a simple has_previous example.)

###### - Scale numeric features with StandardScaler (important for regularized Logistic Regression).
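
One alternative way to express these preprocessing steps is a scikit-learn `Pipeline` with a `ColumnTransformer`, which learns the scaling statistics and category lists from the training rows only. The sketch below runs on synthetic data; the frame `demo`, the two column names used in it, and the target rule are illustrative assumptions, not the notebook's actual feature set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: one numeric and one categorical feature
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "balance": rng.normal(1000, 500, size=n),
    "contact": rng.choice(["cellular", "telephone", "unknown"], size=n),
})
# Synthetic target that depends on balance plus noise
target = (demo["balance"] + rng.normal(0, 300, size=n) > 1000).astype(int)

# Scale numeric columns and one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contact"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.25, random_state=0, stratify=target
)
pipe.fit(X_tr, y_tr)                  # preprocessing is fitted on training rows only
test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

A pipeline like this can also be passed directly to `GridSearchCV`, so that preprocessing is refitted inside every cross-validation fold.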


In [None]:

# ---------- 4. Feature engineering ----------
# create a simple engineered feature: whether customer had previous contacts
if 'previous' in df.columns:
    df['has_previous'] = (df['previous'] > 0).astype(int)  # 1 if previous contacts > 0 else 0

# Optionally group rare job categories (example)
if 'job' in df.columns:
    job_counts = df['job'].value_counts()
    rare_jobs = job_counts[job_counts < 100].index.tolist()  # threshold=100, adjust as needed
    df['job_group'] = df['job'].replace(rare_jobs, 'other')   # replace rare jobs with 'other'

# ---------- 5. Select features and target ----------
# Choose a list of candidate features (numeric + categorical)
candidate_features = [
    'age', 'balance', 'campaign', 'pdays', 'previous',
    'has_previous', 'job_group', 'marital', 'education',
    'default', 'housing', 'loan', 'contact', 'month', 'day', 'poutcome'
]
# Keep only features that exist in the dataset (some versions differ)
features = [f for f in candidate_features if f in df.columns]
X = df[features].copy()                               # feature DataFrame copy
y = df['y'].copy()                                    # target series

# ---------- 6. Categorical encoding ----------
# Identify categorical columns in X (object or category dtype)
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns to encode:", cat_cols)

# One-hot encode categorical variables; drop_first=True reduces multicollinearity
X_encoded = pd.get_dummies(X, columns=cat_cols, drop_first=True)

# ---------- 7. Numeric columns to scale ----------
# Identify numeric columns from the pre-encoding frame X, so the one-hot dummy columns are left unscaled
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns to scale:", numeric_cols)

### 4. Train/Test split
###### - Split the dataset into training (80%) and test (20%) sets using train_test_split; stratify by target to maintain class balance. The scaler is fitted on the training set only, so test-set statistics do not leak into preprocessing.

In [None]:
# ---------- 8. Train/Test split (80/20) ----------
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()       # explicit copies so the assignments below are safe

# Scale numeric columns (important for regularized logistic regression)
scaler = StandardScaler()                             # create scaler
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])  # fit on training data only
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])        # reuse training mean and std

### 5. Model (Baseline)
###### - Fit the baseline Logistic Regression model on the training data.

In [None]:
# ---------- 9. Baseline Logistic Regression (no tuning) ----------
baseline_clf = LogisticRegression(solver='liblinear', random_state=42)  # initialize model
baseline_clf.fit(X_train, y_train)                    # train model on training set

# Baseline predictions
y_pred_base = baseline_clf.predict(X_test)            # predicted labels
y_proba_base = baseline_clf.predict_proba(X_test)[:, 1]  # predicted probabilities for positive class


### 6. Evaluation (Baseline)
###### - Evaluate the baseline model using metrics: accuracy, precision, recall, F1-score, confusion matrix, ROC-AUC.

In [None]:
# ---------- 10. Baseline evaluation ----------
print("Baseline Accuracy:", accuracy_score(y_test, y_pred_base))          # overall accuracy
print("Baseline Precision:", precision_score(y_test, y_pred_base))       # precision
print("Baseline Recall:", recall_score(y_test, y_pred_base))             # recall
print("Baseline F1:", f1_score(y_test, y_pred_base))                     # F1-score
print("\nClassification Report:\n", classification_report(y_test, y_pred_base))  # detailed report

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_base)            # compute confusion matrix
print("Confusion Matrix:\n", cm)                      # print confusion matrix

# ROC AUC
roc_auc_base = roc_auc_score(y_test, y_proba_base)    # compute AUC
print("Baseline ROC AUC:", roc_auc_base)              # print AUC

# Plot ROC curve for baseline
fpr_base, tpr_base, _ = roc_curve(y_test, y_proba_base)
plt.figure(figsize=(6,4))
plt.plot(fpr_base, tpr_base, label=f'Baseline (AUC = {roc_auc_base:.3f})')
plt.plot([0,1], [0,1], '--', color='gray')            # random guess line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Baseline Logistic Regression")
plt.legend()
plt.grid(True)
plt.show()
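
To make the metrics above concrete, here is a small worked example that derives precision, recall, F1 and accuracy by hand from a confusion matrix and checks them against scikit-learn. The ten labels are made up for illustration and are not results from the bank data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 10 clients, 3 of whom actually subscribed
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# scikit-learn's binary layout: rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)            # of the predicted subscribers, how many were real
recall = tp / (tp + fn)               # of the real subscribers, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

In this toy case accuracy is 0.70 while precision is only 0.50, which shows how accuracy alone can look better than the positive-class metrics when negatives dominate.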

### 7. Model (Tuned)
###### - Hyperparameter tuning with GridSearchCV or other techniques (e.g., class weights, regularization) to optimize model performance.

In [None]:
# ---------- 11. Hyperparameter tuning (GridSearchCV) ----------
# We tune penalty (l1/l2) and C (inverse of regularization strength). Also try class_weight='balanced'.
param_grid = {
    'penalty': ['l1', 'l2'],                          # L1 -> feature selection; L2 -> ridge-like
    'C': [0.01, 0.1, 1, 10],                          # inverse regularization strength (smaller C = stronger penalty)
    'class_weight': [None, 'balanced']                # handle class imbalance
}
# Use liblinear solver because it supports l1 penalty
grid = GridSearchCV(LogisticRegression(solver='liblinear', random_state=42),
                    param_grid, cv=5, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train)                            # run grid search on training data

# Best params and CV score
print("Best params:", grid.best_params_)              # print best hyperparameters
print("Best CV F1:", grid.best_score_)                # best cross-validated F1
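
Beyond the single best combination, it can be useful to look at how every candidate scored. The sketch below does this through `cv_results_` on synthetic data from `make_classification`; the names `demo_grid`, `X_demo` and `y_demo` and the reduced grid are illustrative assumptions, not the notebook's actual search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical imbalanced data (about 15% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)

demo_grid = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=0),
    {"penalty": ["l1", "l2"], "C": [0.1, 1], "class_weight": [None, "balanced"]},
    cv=5,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

# One row per hyperparameter combination, best rank first
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[
    ["param_penalty", "param_C", "param_class_weight",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

Comparing `mean_test_score` with `std_test_score` shows whether the winning setting is clearly better or within fold-to-fold noise of the runners-up.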


### 8. Evaluation (Tuned)
###### - Evaluate the tuned model on the test set with the same metrics as baseline: accuracy, precision, recall, F1, confusion matrix, ROC-AUC.

In [None]:
# ---------- 12. Evaluate tuned model on test set ----------
best_clf = grid.best_estimator_                       # retrieve best estimator
y_pred_tuned = best_clf.predict(X_test)               # predictions from tuned model
y_proba_tuned = best_clf.predict_proba(X_test)[:, 1]  # probabilities from tuned model

# Metrics for tuned model
print("Tuned Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Tuned Precision:", precision_score(y_test, y_pred_tuned))
print("Tuned Recall:", recall_score(y_test, y_pred_tuned))
print("Tuned F1:", f1_score(y_test, y_pred_tuned))
print("\nTuned Classification Report:\n", classification_report(y_test, y_pred_tuned))

# Confusion matrix for tuned model
cm_tuned = confusion_matrix(y_test, y_pred_tuned)
print("Tuned Confusion Matrix:\n", cm_tuned)

# ROC AUC for tuned model
roc_auc_tuned = roc_auc_score(y_test, y_proba_tuned)
print("Tuned ROC AUC:", roc_auc_tuned)


### Analyze Current Model

###### - Baseline accuracy is high (~0.89), but precision and recall for the positive class are low: accuracy is dominated by the majority 'no' class, which points to a class imbalance problem.

###### - The tuned model improved recall (~0.63) but precision dropped (~0.27) → more true positives, but also many more false positives.

##### Goal: improve F1-score and balance precision/recall.
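
One way to trade precision against recall without retraining is to move the decision threshold away from the default 0.5. The sketch below picks the threshold that maximizes F1 on a held-out validation split of synthetic data; the variable names and the data are illustrative assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (about 12% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_proba = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the final point
precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
denom = precision[:-1] + recall[:-1]
f1_per_threshold = np.divide(
    2 * precision[:-1] * recall[:-1], denom,
    out=np.zeros_like(denom), where=denom > 0,
)
best_idx = int(np.argmax(f1_per_threshold))
best_threshold = float(thresholds[best_idx])

f1_default = f1_score(y_val, (val_proba >= 0.5).astype(int))
f1_best = f1_score(y_val, (val_proba >= best_threshold).astype(int))
print(f"threshold=0.50 -> F1={f1_default:.3f}")
print(f"threshold={best_threshold:.2f} -> F1={f1_best:.3f}")
```

The threshold should be chosen on validation data (or via cross-validation) and then applied once to the test set; tuning it on the test set would overstate performance.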

### Possible Improvements

##### Handle Class Imbalance

###### &nbsp;&nbsp;&nbsp;&nbsp; - Use class_weight='balanced' in Logistic Regression.

###### &nbsp;&nbsp;&nbsp;&nbsp; - Try SMOTE or undersampling the majority class.
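
As a rough illustration of the undersampling option, the sketch below randomly downsamples the majority class of a synthetic training frame until both classes have the same size. The frame `train` and its single feature are hypothetical. SMOTE is not shown because it lives in the separate imbalanced-learn package.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame: 880 negatives, 120 positives
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "balance": rng.normal(1000, 400, size=1000),
    "y": np.r_[np.zeros(880, dtype=int), np.ones(120, dtype=int)],
})

minority = train[train["y"] == 1]
majority = train[train["y"] == 0]

# Keep every minority row and an equally sized random sample of the majority
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = (
    pd.concat([minority, majority_down])
    .sample(frac=1, random_state=42)   # shuffle the rows
    .reset_index(drop=True)
)
print(balanced["y"].value_counts())
```

Resampling should be applied to the training split only; the test set keeps its original class ratio so the reported metrics reflect real-world conditions.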

##### Feature Selection / Engineering

###### &nbsp;&nbsp;&nbsp;&nbsp; - Check which features are most important (correlations, coefficients).

###### &nbsp;&nbsp;&nbsp;&nbsp; - Combine or transform features (e.g., interaction terms).

##### Regularization

###### &nbsp;&nbsp;&nbsp;&nbsp; - Tune C parameter for L1/L2 regularization.

###### &nbsp;&nbsp;&nbsp;&nbsp; - L1 can also do feature selection.
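
The feature-selection effect of L1 can be seen by counting coefficients that are exactly zero. This sketch compares L1 and L2 penalties at the same C on synthetic data with 4 informative features out of 20; the data and the value C=0.05 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 informative features, 16 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Same (strong) regularization strength, different penalty type
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear", random_state=0)
l2_model = LogisticRegression(penalty="l2", C=0.05, solver="liblinear", random_state=0)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

n_zero_l1 = int(np.sum(l1_model.coef_[0] == 0))
n_zero_l2 = int(np.sum(l2_model.coef_[0] == 0))
print("Zero coefficients with L1:", n_zero_l1)
print("Zero coefficients with L2:", n_zero_l2)
```

L2 shrinks coefficients toward zero but rarely makes them exactly zero, whereas L1 drops weak features entirely, which is why it doubles as a selection step.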

##### Scaling

###### &nbsp;&nbsp;&nbsp;&nbsp; - Standardize numeric features for better convergence (e.g., StandardScaler).

##### Cross-validation

###### &nbsp;&nbsp;&nbsp;&nbsp; - Use StratifiedKFold CV to evaluate more robustly than a single train/test split.
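
A minimal sketch of the cross-validation idea, combined with `class_weight='balanced'`, is shown below. It uses `StratifiedKFold` with `cross_validate` on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for the encoded bank features and target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced data (about 12% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.88, 0.12], random_state=0
)

# Each fold keeps roughly the same class ratio as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ["precision", "recall", "f1", "roc_auc"]
scores = cross_validate(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    X_demo, y_demo, cv=cv, scoring=metric_names,
)

for name in metric_names:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the mean and spread across folds gives a sense of how much a single 80/20 split could mislead.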

### 9. Improvement

In [None]:
from sklearn.metrics import RocCurveDisplay
# ---------- 1. Load dataset ----------
# (UCI Bank Marketing CSV uses semicolon ; as delimiter)
df = pd.read_csv("data/bank-full.csv", sep=';')       # load CSV into DataFrame (forward slashes work on every OS)

# ---------- 2. Quick inspection ----------
print("\n--- Dataset Shape ---")
print("Shape:", df.shape)                             # print number of rows and columns
print("\n--- Columns ---")
print(df.columns.tolist())                            # list column names
print("\n--- Sample Rows ---")
print(df.head())                                      # preview the first five rows
print("\n--- Target Distribution ---")
print(df['y'].value_counts(dropna=False))             # see target distribution (yes/no)

# ---------- 3. Check Missing Values ----------
print("\n--- Missing Values ---")
print(df.isnull().sum())

# The dataset usually has no true NaN values, but we check to be safe.

# ---------- 5. Encode Categorical Variables ----------
# Identify categorical and numerical columns
categorical_cols = df.select_dtypes(include=["object"]).columns
numeric_cols = df.select_dtypes(exclude=["object"]).columns

print("\n--- Categorical Columns ---")
print(categorical_cols)

# One-hot encode categorical columns
X_encoded = pd.get_dummies(df.drop("y", axis=1), drop_first=True)

# ---------- 6. Feature Engineering ----------
# Here we can add new features if needed (none added yet)
# Example: interaction terms, binning, ratios, etc.

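# ---------- 7. Drop Unnecessary Columns ----------
# Only 'y' is dropped (handled above). Note: unlike the first pass, 'duration' is kept here,
# so these metrics benefit from post-call information and are optimistic compared with the earlier results.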

# ---------- 8. Select Features and Target ----------
X = X_encoded.copy()
y = df["y"].map({"no": 0, "yes": 1})  # convert target to 0/1

# ---------- 9. Sanity Check ----------
print("\n--- Sanity Check ---")
print("X shape:", X.shape)
print("y shape:", y.shape)
print("Any NaN in X?", X.isnull().sum().sum())
print("Any NaN in y?", y.isnull().sum())

# Fill any unexpected NaNs with 0
X = X.fillna(0)

# ---------- 10. Train/Test Split ----------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ---------- 11. Baseline Model (Default Logistic Regression) ----------
baseline_model = LogisticRegression(max_iter=1000, solver="liblinear")
baseline_model.fit(X_train, y_train)

y_pred_base = baseline_model.predict(X_test)
y_prob_base = baseline_model.predict_proba(X_test)[:, 1]

print("\n--- Baseline Evaluation ---")
print("Accuracy:", accuracy_score(y_test, y_pred_base))
print("Precision:", precision_score(y_test, y_pred_base))
print("Recall:", recall_score(y_test, y_pred_base))
print("F1:", f1_score(y_test, y_pred_base))
print("\nClassification Report:\n", classification_report(y_test, y_pred_base))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_base))
print("ROC AUC:", roc_auc_score(y_test, y_prob_base))

# ---------- 12. Hyperparameter Tuning ----------
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"]
}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

print("\n--- Best Hyperparameters ---")
print(grid.best_params_)

# ---------- 13. Tuned Model Evaluation ----------
tuned_model = grid.best_estimator_
y_pred_tuned = tuned_model.predict(X_test)
y_prob_tuned = tuned_model.predict_proba(X_test)[:, 1]

print("\n--- Tuned Evaluation ---")
print("Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Precision:", precision_score(y_test, y_pred_tuned))
print("Recall:", recall_score(y_test, y_pred_tuned))
print("F1:", f1_score(y_test, y_pred_tuned))
print("\nClassification Report:\n", classification_report(y_test, y_pred_tuned))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tuned))
print("ROC AUC:", roc_auc_score(y_test, y_prob_tuned))

# ---------- 14. Visualization ----------
# Confusion Matrix Heatmap
plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_test, y_pred_tuned), annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix (Tuned Model)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# ROC Curve
RocCurveDisplay.from_estimator(tuned_model, X_test, y_test)
plt.title("ROC Curve (Tuned Model)")
plt.show()

### 10. Visualization
###### - Plot results or analyze feature importance (coefficients), confusion matrix heatmaps, ROC curves, etc.

In [None]:
# ---------- 13. Compare ROC curves (baseline vs tuned) ----------
plt.figure(figsize=(6,4))
plt.plot(fpr_base, tpr_base, label=f'Baseline (AUC={roc_auc_base:.3f})', color='blue')  # baseline ROC
fpr_tuned, tpr_tuned, _ = roc_curve(y_test, y_proba_tuned)    # tuned ROC
plt.plot(fpr_tuned, tpr_tuned, label=f'Tuned (AUC={roc_auc_tuned:.3f})', color='green') # tuned ROC
plt.plot([0,1], [0,1], '--', color='gray')                  # random guess
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison: Baseline vs Tuned")
plt.legend()
plt.grid(True)
plt.show()

### 11. Save Model
###### - Save the trained/tuned model for later use (e.g., joblib.dump).

In [None]:
# ---------- 14. Inspect logistic regression coefficients (feature importances) ----------
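# Use the feature names stored on the fitted estimator (scikit-learn >= 1.0, fitted on a DataFrame).
# X_encoded was rebuilt in the Improvement cell, so its columns no longer match best_clf.
coefficients = pd.Series(best_clf.coef_[0], index=best_clf.feature_names_in_)  # map coef vector to feature names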
coefficients = coefficients.sort_values(key=lambda x: np.abs(x), ascending=False)  # sort by absolute value
print("Top coefficients:\n", coefficients.head(20))    # print top 20 influential features
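
Saving the tuned model, as described in this section's heading, could look roughly like the following. The sketch trains a small stand-in model on synthetic data and writes it to a temporary directory; `demo_model` and the file name `logreg_demo.joblib` are hypothetical, and in the notebook the object to persist would be the tuned estimator (ideally together with the fitted scaler).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the tuned model
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "logreg_demo.joblib")
    joblib.dump(demo_model, model_path)     # serialize the fitted estimator to disk
    restored = joblib.load(model_path)      # load it back as a new object

# The restored model should reproduce the original predictions exactly
same_predictions = bool(
    np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
)
print("Predictions match after reload:", same_predictions)
```

A saved model should be loaded with the same scikit-learn version that produced it, and new data must go through the same encoding and scaling before prediction.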

