1. Load the dataset and explore the variables.
2. We will try to predict the variable `Churn` using a logistic regression on the variables `tenure`, `SeniorCitizen`, and `MonthlyCharges`.

In [2]:
import pandas as pd

In [4]:
df = pd.read_csv("files_for_lab/customer_churn.csv")

3. Extract the target variable.
4. Extract the independent variables and scale them.
5. Build the logistic regression model.
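Step 4 asks for scaled features. As a rough sketch of what standardization computes (the z-score that `sklearn.preprocessing.StandardScaler` also applies, based on the population standard deviation), the example below uses only the standard library. The tenure values and the helper names `fit_standardizer` and `standardize` are hypothetical and serve only as illustration.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation from a training column."""
    mean = fmean(column)
    std = pstdev(column)  # population std (ddof=0)
    if std == 0:
        std = 1.0  # guard against a constant column
    return mean, std


def standardize(column, mean, std):
    """Apply the z-score transform with previously learned statistics."""
    return [(x - mean) / std for x in column]


# Hypothetical tenure values in months, not taken from the dataset.
train_tenure = [1, 12, 24, 48, 72]
test_tenure = [6, 60]

mean, std = fit_standardizer(train_tenure)
train_scaled = standardize(train_tenure, mean, std)
# The test column reuses the training statistics, so no test information leaks in.
test_scaled = standardize(test_tenure, mean, std)

print(train_scaled)
print(test_scaled)
```

Learning the statistics on the training rows only and reusing them for the test rows mirrors the usual `fit_transform` / `transform` split.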


In [10]:
from sklearn.model_selection import train_test_split

In [9]:
target = df.Churn
features = df[["tenure", "SeniorCitizen", "MonthlyCharges"]]

In [11]:
X_train, X_test, y_train, y_test = train_test_split(features, target)

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
lg_model = LogisticRegression()
lg_model.fit(X_train, y_train)

6. Evaluate the model.
7. Even a simple model will give us more than 70% accuracy. Why?
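-> The classes are imbalanced: about 73% of the customers did not churn (1292 of the 1761 test rows are `No`), so a model that always predicts `No` already reaches roughly 73% accuracy. Misclassifying most churners therefore barely lowers accuracy, even though it makes the churn forecast poor.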
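To make the point concrete, here is a small sketch that computes the accuracy of a majority-class baseline. It assumes the class counts from the `support` column of the classification report printed further down (1292 `No`, 469 `Yes`).

```python
# Class counts taken from the "support" column of the test-set report.
support = {"No": 1292, "Yes": 469}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = support[majority_class] / total

print(f"Always predicting {majority_class!r}: accuracy = {baseline_accuracy:.3f}")
# This baseline never flags a churner, so its recall on "Yes" is 0.
```

Any accuracy figure for this dataset should be read against that baseline rather than against 50%.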

In [14]:
lg_model.score(X_test, y_test)

0.7932992617830777

In [19]:
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

In [16]:
y_pred_os = lg_model.predict(X_test)
y_prob_os = lg_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred_os))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_os))

              precision    recall  f1-score   support

          No       0.82      0.92      0.87      1292
         Yes       0.67      0.45      0.53       469

    accuracy                           0.79      1761
   macro avg       0.74      0.68      0.70      1761
weighted avg       0.78      0.79      0.78      1761

ROC-AUC: 0.8114392654155143


8. **Synthetic Minority Oversampling TEchnique (SMOTE)** is an oversampling technique based on nearest neighbors that adds new minority-class points between existing ones. Apply `imblearn.over_sampling.SMOTE` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
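The idea of "adding new points between existing points" can be sketched in a few lines. This is a simplified illustration of the interpolation step, not the `imblearn` implementation. The function name `smote_like_samples`, the choice `k=2`, and the sample points are all hypothetical.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class points (tenure, MonthlyCharges), for illustration only.
minority = [(1.0, 70.0), (2.0, 80.0), (3.0, 75.0), (5.0, 95.0)]
new_points = smote_like_samples(minority, n_new=3)
print(new_points)
```

Because every synthetic point is an interpolation, it always stays inside the region already covered by the minority class.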


In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

model_sm = LogisticRegression(max_iter=1000)
model_sm.fit(X_resampled, y_resampled)

y_pred_sm = model_sm.predict(X_test)
y_prob_sm = model_sm.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred_sm))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_sm))



              precision    recall  f1-score   support

          No       0.88      0.73      0.80      1292
         Yes       0.50      0.73      0.59       469

    accuracy                           0.73      1761
   macro avg       0.69      0.73      0.70      1761
weighted avg       0.78      0.73      0.74      1761

ROC-AUC: 0.8075313393228463
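To judge whether SMOTE helped, it can be useful to turn the two reports into approximate customer counts. The sketch below assumes the rounded precision and recall values for the `Yes` class from the two reports above, so the resulting counts are estimates only.

```python
n_churners = 469  # support of the "Yes" class in both reports

# Rounded "Yes" metrics copied from the two classification reports.
reports = {
    "baseline": {"precision": 0.67, "recall": 0.45},
    "SMOTE": {"precision": 0.50, "recall": 0.73},
}

estimates = {}
for name, r in reports.items():
    caught = r["recall"] * n_churners   # approx. true positives
    flagged = caught / r["precision"]   # approx. customers predicted "Yes"
    estimates[name] = {
        "caught": round(caught),
        "false_alarms": round(flagged - caught),
    }
    print(name, estimates[name])
```

Read this way, SMOTE trades precision for recall: more churners are caught, at the cost of more false alarms, while ROC-AUC stays nearly unchanged (0.811 vs. 0.808).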


9. **Tomek links** are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which makes classification easier. Apply `imblearn.under_sampling.TomekLinks` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
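As a minimal sketch of the definition, two points form a Tomek link when they are each other's nearest neighbour and carry different labels. The brute-force code below illustrates this on hypothetical 2-D points. It is not how `imblearn` implements the search, and the helper names are made up for this example.

```python
import math
from collections import Counter


def nearest_neighbour(i, points):
    """Index of the point closest to points[i] (excluding itself)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return index pairs that are mutual nearest neighbours with opposite labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbour(j, points) == i:
            links.append((i, j))
    return links


# Hypothetical points and labels, for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0),
          (3.2, 3.0), (6.0, 6.0), (6.0, 7.0)]
labels = ["No", "No", "No", "No", "Yes", "Yes", "Yes"]

links = tomek_links(points, labels)
majority = Counter(labels).most_common(1)[0][0]

# Drop only the majority-class member of each link.
to_drop = {i for pair in links for i in pair if labels[i] == majority}
kept = [i for i in range(len(points)) if i not in to_drop]

print("links:", links)
print("kept indices:", kept)
```

Only the borderline `No` point next to a `Yes` point is removed. Points deep inside their own class are untouched, which is why Tomek links usually remove few rows.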

In [21]:
from imblearn.under_sampling import TomekLinks
from sklearn.preprocessing import StandardScaler

In [22]:
# ================================
# Apply Tomek Links under-sampling
# Note: resampling happens before the train/test split, so the test set
# below is drawn from the resampled data, not from the original distribution.
# ================================
tl = TomekLinks(sampling_strategy="majority")
X_res, y_res = tl.fit_resample(features, target)

print("\nResampled dataset shape (Tomek Links):", y_res.value_counts().to_dict())

# ================================
# Logistic Regression after Tomek Links
# ================================
X_train_res, X_test_res, y_train_res, y_test_res = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42, stratify=y_res
)

scaler_res = StandardScaler()
X_train_res_scaled = scaler_res.fit_transform(X_train_res)
X_test_res_scaled = scaler_res.transform(X_test_res)

tomek_model = LogisticRegression(max_iter=1000, random_state=42)
tomek_model.fit(X_train_res_scaled, y_train_res)

y_pred_tomek = tomek_model.predict(X_test_res_scaled)
y_prob_tomek = tomek_model.predict_proba(X_test_res_scaled)[:,1]

print("\n=== Logistic Regression after Tomek Links ===")
print(confusion_matrix(y_test_res, y_pred_tomek))
print(classification_report(y_test_res, y_pred_tomek))
print("ROC AUC:", roc_auc_score(y_test_res, y_prob_tomek))


Resampled dataset shape (Tomek Links): {'No': 4712, 'Yes': 1869}

=== Logistic Regression after Tomek Links ===
[[1262  152]
 [ 270  291]]
              precision    recall  f1-score   support

          No       0.82      0.89      0.86      1414
         Yes       0.66      0.52      0.58       561

    accuracy                           0.79      1975
   macro avg       0.74      0.71      0.72      1975
weighted avg       0.78      0.79      0.78      1975

ROC AUC: 0.827455014409004
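The figures in the report can be checked by hand from the confusion matrix. The sketch below assumes the usual scikit-learn layout: rows are true labels, columns are predictions, in the order `No`, `Yes`.

```python
# Confusion matrix printed above: rows = true labels, columns = predictions,
# both in the order ["No", "Yes"].
cm = [[1262, 152],
      [270, 291]]

tn, fp = cm[0]
fn, tp = cm[1]

precision_yes = tp / (tp + fp)   # share of predicted churners who really churn
recall_yes = tp / (tp + fn)      # share of real churners that were caught
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision(Yes) = {precision_yes:.2f}")
print(f"recall(Yes)    = {recall_yes:.2f}")
print(f"f1(Yes)        = {f1_yes:.2f}")
print(f"accuracy       = {accuracy:.2f}")
```

Note that this test set has 1975 rows and was taken after resampling, so these numbers are not directly comparable with the 1761-row test set used in the earlier reports.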
