In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import recall_score, mean_absolute_error, accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10,6)

df = pd.read_csv('../data/model_ready_data.csv')

# Splitting the data

First, we split the data into a training set and a test set (80/20 split). Because the `Churn` label is imbalanced, we stratify on it so that both sets keep the same class proportions.


In [25]:
X = df.drop("Churn", axis = 1)
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

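# Scale only the continuous columns; fit on the training set only to avoid leakage
cols_to_scale = ["tenure", "TotalCharges"]

scaler = MinMaxScaler()

scaler.fit(X_train[cols_to_scale])

# Work on explicit copies so the assignments below do not raise SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
X_train[cols_to_scale] = scaler.transform(X_train[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])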

#chosen_columns = ["tenure","Contract_One year","Contract_Two year", "FiberOptic", "InternetServiceButNoOnlineSecurity","InternetServiceButNoTechSupport","ElectronicCheck","PaperlessBilling"]

#X_train = X_train[chosen_columns]
#X_test = X_test[chosen_columns]

print(f"Training rows: {len(X_train)}")
print(f"Testing rows: {len(X_test)}")

Training rows: 5625
Testing rows: 1407
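
As a quick sanity check on the stratified split, the sketch below compares the positive-class rate in the two partitions. It uses synthetic labels rather than the churn data, and the 27% positive rate, `X_demo`, and `y_demo` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn label: roughly 27% positives (assumed rate).
rng = np.random.default_rng(42)
y_demo = (rng.random(5000) < 0.27).astype(int)
X_demo = rng.normal(size=(5000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y, both partitions should carry nearly the same positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Train positive rate: {train_rate:.3f}")
print(f"Test positive rate:  {test_rate:.3f}")
```

On the real data, the same comparison would be `y_train.mean()` against `y_test.mean()`, assuming `Churn` is encoded as 0/1.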


# Initialize and train the model

In [26]:

model = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)

model.fit(X_train, y_train)



In [27]:
#Get the coefficients from the trained model
coeffs = pd.DataFrame({
    'Feature': X_train.columns,
    'Weight': model.coef_[0]
})

# Sorting by magnitude
coeffs['Abs_Weight'] = coeffs['Weight'].abs()
coeffs = coeffs.sort_values(by='Abs_Weight', ascending=False)

print(coeffs)

                               Feature    Weight  Abs_Weight
0                               tenure -3.047744    3.047744
1                         TotalCharges  1.977612    1.977612
10                   Contract_Two year -1.443932    1.443932
9                    Contract_One year -0.779440    0.779440
3                           FiberOptic  0.733989    0.733989
4   InternetServiceButNoOnlineSecurity  0.520640    0.520640
2                      ElectronicCheck  0.442808    0.442808
5      InternetServiceButNoTechSupport  0.421224    0.421224
8                     PaperlessBilling  0.309919    0.309919
7                           Dependents -0.299334    0.299334
6                              Partner  0.003456    0.003456
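
Logistic regression weights are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below does this for a few weights copied from the table above; it is an illustrative aid, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Weights copied from the coefficient table above (subset for brevity).
weights = pd.Series({
    "tenure": -3.047744,
    "TotalCharges": 1.977612,
    "Contract_Two year": -1.443932,
    "Contract_One year": -0.779440,
    "FiberOptic": 0.733989,
    "Partner": 0.003456,
})

# exp(weight) is the multiplicative change in the odds of churn for a
# one-unit increase in the feature, holding the other features fixed.
odds_ratios = np.exp(weights)
print(odds_ratios.round(3))
```

Two caveats apply when reading these numbers. `tenure` and `TotalCharges` were min-max scaled, so "one unit" for them means moving across the whole range seen in training, which makes their weights look large next to the other features. They are also likely correlated with each other, so their opposite signs should be interpreted with care.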


# Evaluation

In [28]:
predictions = model.predict(X_test)

#Accuracy score:
print(f"Accuracy Score: {accuracy_score(y_test,predictions):.2%}")

#Recall score:
print(f"Recall Score: {recall_score(y_test,predictions):.2%}")



Accuracy Score: 72.85%
Recall Score: 79.68%


## Modeling Summary: Logistic Regression (Baseline)


### Performance Metrics

* **Recall (Sensitivity): 79.68%**
    * **Interpretation:** The model successfully identified approximately 80% of all customers who actually churned. This is the critical metric for this business problem, as the primary goal is to intervene before a customer leaves.

* **Accuracy: 72.57%** (ordinal `Contract` encoding; the 72.85% printed above comes from the later one-hot run)
    * **Interpretation:** Accuracy is lower than for the unweighted baseline (a model trained without `class_weight='balanced'`), which is an expected trade-off. By forcing the model to pay attention to the minority class (churners), we accept more false positives (predicting that a customer will leave when they stay) in exchange for catching the vast majority of actual churners.
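
The trade-off described above can be inspected directly from the confusion matrix. The following is a minimal sketch on synthetic imbalanced data (not the churn dataset) that fits the same estimator with and without `class_weight='balanced'` and reports accuracy, recall, and the error counts. The data generator settings and the `results` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 25% positives); not the churn dataset.
X_syn, y_syn = make_classification(
    n_samples=4000, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

results = {}
for label, cw in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=cw, max_iter=1000, random_state=42)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # For binary labels, ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    results[label] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "false_positives": int(fp),
        "false_negatives": int(fn),
    }
    print(label, results[label])
```

Class weighting typically lowers the number of false negatives at the cost of more false positives, which is the pattern the interpretation above relies on. Running the same loop on `X_train`/`X_test` would show the actual counts for this project.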

### Conclusion & Next Steps
The current model serves as a strong baseline for identifying churn risk.
* **Observation:** Treating `Contract` as ordinal (0-1-2) imposes a linear constraint that may not reflect the true difference in risk between contract types.
* **Next Iteration:** We will attempt to improve performance by applying One-Hot Encoding to the `Contract` feature to allow for non-linear weighting.
* **Interpretability:** In the following notebook, we will employ SHAP values to explain individual predictions and visualize feature impact.

## Modeling Summary: Logistic Regression (Refined)

### Experiment
We iterated on the baseline model by replacing the ordinal encoding of `Contract` (0, 1, 2) with **One-Hot Encoding**.
* **Hypothesis:** We hypothesized that the relationship between contract duration and churn risk is non-linear, and that separating the contract types would allow the model to capture distinct risk profiles for "One-year" vs "Two-year" customers.
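
To make the difference between the two encodings concrete, here is a small sketch on a hypothetical raw `Contract` column. The category labels are assumed from the feature names in the coefficient table, and the `raw` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical raw column; category labels are assumed from the feature names.
raw = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

# Ordinal encoding: a single column, so the model learns one slope and the
# step from 0 to 1 is forced to equal the step from 1 to 2.
ordinal = raw["Contract"].map({"Month-to-month": 0, "One year": 1, "Two year": 2})

# One-hot encoding with the first category dropped as the reference level:
# each remaining contract type gets its own coefficient.
one_hot = pd.get_dummies(
    raw["Contract"], prefix="Contract", drop_first=True, dtype=int
)

print(ordinal.tolist())
print(one_hot)
```

With `drop_first=True`, "Month-to-month" becomes the reference level, which matches the two contract columns (`Contract_One year`, `Contract_Two year`) that appear in the trained model.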

### Results Comparison
| Metric | Baseline (Ordinal) | New (One-Hot) | Change |
| :--- | :--- | :--- | :--- |
| **Recall** | 79.68% | 79.68% | **0.00 pp** |
| **Accuracy** | 72.57% | 72.85% | **+0.28 pp** |
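
To put the accuracy change in perspective, the sketch below converts the two rounded accuracies back into approximate counts of correct predictions. It assumes both models were evaluated on the same 1,407-row test set, which holds if the split used the same `random_state`.

```python
# Test-set size printed by the split cell.
n_test = 1407

baseline_acc = 0.7257  # ordinal encoding
one_hot_acc = 0.7285   # one-hot encoding

# Convert the rounded accuracies back to approximate counts of correct predictions.
baseline_correct = round(baseline_acc * n_test)
one_hot_correct = round(one_hot_acc * n_test)

diff_pp = (one_hot_acc - baseline_acc) * 100
print(f"Accuracy change: {diff_pp:+.2f} percentage points")
print(f"Approximate extra correct predictions: {one_hot_correct - baseline_correct}")
```

Under that assumption the difference amounts to only a handful of test rows. One possible reason is visible in the coefficient table: the weight for `Contract_Two year` (-1.44) is a little under twice the weight for `Contract_One year` (-0.78), which is close to the evenly spaced effect that a 0-1-2 ordinal encoding imposes.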

### Strategic Conclusion
1.  **Diminishing Returns:** The switch to One-Hot Encoding yielded a negligible improvement (+0.28 percentage points of accuracy, with recall unchanged). This indicates that the encoding of the `Contract` feature was not the primary bottleneck.
2.  **Linear Ceiling:** The stability of the metrics suggests that we may be close to the performance ceiling of a linear classifier (Logistic Regression) on this feature set, although a single encoding change is limited evidence. The model correctly identifies the broad "Month-to-Month" risk group (high Recall) but cannot capture non-linear patterns without further feature engineering.
3.  **Decision:** We will finalize this Logistic Regression model as our baseline. While a more complex model (e.g., Random Forest) might achieve higher accuracy, this linear model offers transparency.
4.  **Next Step:** We will proceed to Model Explainability using SHAP values to investigate exactly which features are driving the predictions for our caught churners.
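
One way to test the "linear ceiling" idea before moving on would be a cross-validated comparison against a non-linear model. The sketch below uses `cross_val_score` with recall as the metric on synthetic data. The data, the candidate settings, and the `cv_recall` dictionary are assumptions for illustration, and no result for the churn data is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data as a placeholder for X_train / y_train.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(
        class_weight="balanced", max_iter=1000, random_state=42
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_recall = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_syn, y_syn, cv=cv, scoring="recall")
    cv_recall[name] = scores
    print(f"{name}: mean recall {scores.mean():.3f} (std {scores.std():.3f})")
```

Using several folds rather than a single 80/20 split also gives a sense of how much a difference as small as 0.28 percentage points could be due to sampling noise.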

In [None]:
# Save the model for the SHAP notebook
import joblib
from pathlib import Path

model_filename = Path('../models/logistic_regression_ohe.pkl')
# joblib.dump does not create missing folders, so make sure ../models exists first
model_filename.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(model, model_filename)
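
`joblib.dump` does not create missing parent directories, which is one common cause of a `FileNotFoundError` at this step. The sketch below shows a save-and-reload round trip in a temporary directory. The tiny `demo_model` and the temporary path are stand-ins for illustration; the notebook itself would save the fitted `model`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; the real notebook would save the fitted `model` instead.
X_demo = np.array([[0.0], [0.2], [0.8], [1.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical path that mirrors a models/ folder layout.
    model_path = Path(tmp) / "models" / "logistic_regression_ohe.pkl"
    # Create the folder first, because joblib.dump will not do it.
    model_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(demo_model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded estimator should reproduce the original predictions.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), reloaded.predict(X_demo)
)
print(same_predictions)
```

The SHAP notebook would also need the same feature columns and the same fitted scaler, so it may be worth saving `scaler` alongside the model.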