## Title: Customer Churn Prediction (MySQL + Logistic Regression)

Objective: Train a logistic regression model to predict customer churn, write predicted probabilities back to MySQL, and export a CSV for visualization (e.g., Tableau).

### 0. Environment & Configuration

Set database credentials in a `.env` file (not tracked in git). See `.env.example` for the structure.

In [None]:
import os

from dotenv import load_dotenv

# Read .env if present. IMPORTANT: override=True lets values from .env
# replace variables that were already set in the environment.
load_dotenv(override=True)

MYSQL_HOST = os.getenv("MYSQL_HOST", "127.0.0.1")
MYSQL_PORT = int(os.getenv("MYSQL_PORT", "3306"))
MYSQL_USER = os.getenv("MYSQL_USER", "root")
MYSQL_PASSWORD = os.getenv("MYSQL_PASSWORD")  # keep secrets out of notebook
MYSQL_DB = os.getenv("MYSQL_DB", "churn_project")
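
Because `MYSQL_PASSWORD` has no default, a missing `.env` entry would otherwise surface only when the connection attempt fails. A minimal sketch of a fail-fast check follows; the helper `missing_settings` is hypothetical (not part of the original notebook), and treating only the password as required is an assumption:

```python
import os


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Hypothetical usage: MYSQL_PASSWORD is the only setting without a default
missing = missing_settings(["MYSQL_PASSWORD"])
if missing:
    print(f"Missing settings in .env: {', '.join(missing)}")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.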

In [None]:
import pandas as pd
import mysql.connector as mc

# Pull the id, the three model features, and the churn label.
# short_tenure is derived in SQL: 1 when tenure < 12, else 0.
QUERY = """
SELECT
  customerID AS id,
  is_monthly,
  auto_pay,
  CASE WHEN tenure < 12 THEN 1 ELSE 0 END AS short_tenure,
  churn
FROM customers_clean;
"""

# Reuse the settings defined in the configuration cell above
conn = mc.connect(
    host=MYSQL_HOST,
    port=MYSQL_PORT,
    user=MYSQL_USER,
    password=MYSQL_PASSWORD,
    database=MYSQL_DB,
)
try:
    # pandas may emit a UserWarning here because it only tests SQLAlchemy connectables
    df = pd.read_sql(QUERY, conn)
finally:
    conn.close()  # release the connection even if the query fails

df.head(10)

### 1. EDA sanity checks (feat/notebook-eda-sanity)

Goal: quickly validate the data loaded from `customers_clean`.

In [None]:
import numpy as np
import pandas as pd

print(df.shape)
display(df.head(3))

# Basic label balance: counts and percentages.
# display() is needed because a cell only renders its last expression.
display(df['churn'].value_counts().rename_axis('label').to_frame('count'))
display(df['churn'].value_counts(normalize=True).mul(100).round(2).rename('pct'))

# Schema check: list any columns beyond the five expected ones (should be an empty set)
set(df.columns) - {'id','is_monthly','auto_pay','short_tenure','churn'}
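
Two further checks that fit this step are counting missing values and confirming that the flag columns contain only 0 and 1. A minimal sketch on a tiny made-up frame follows; the helper name `sanity_report` and the sample rows are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd


def sanity_report(frame, binary_cols):
    """Summarize null counts and list columns with values outside {0, 1}."""
    nulls = {col: int(n) for col, n in frame[binary_cols].isna().sum().items()}
    # A column is flagged if any non-null value is neither 0 nor 1
    non_binary = [
        col for col in binary_cols
        if not frame[col].dropna().isin([0, 1]).all()
    ]
    return {"nulls": nulls, "non_binary": non_binary}


# Made-up rows for illustration only
sample = pd.DataFrame({
    "is_monthly": [1, 0, 1],
    "auto_pay": [0, 1, None],      # one missing value
    "short_tenure": [1, 0, 2],     # 2 is deliberately invalid
    "churn": [0, 1, 0],
})
report = sanity_report(sample, ["is_monthly", "auto_pay", "short_tenure", "churn"])
print(report)
```

On the real data the same helper could be called as `sanity_report(df, ['is_monthly', 'auto_pay', 'short_tenure', 'churn'])`.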


### 2. Baseline logistic regression (feat/notebook-baseline-logreg)

Goal: fit a simple baseline on the three engineered features already loaded (`is_monthly`, `auto_pay`, `short_tenure`).
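
For context, a logistic regression turns a weighted sum of the features into a probability through the sigmoid function. A minimal sketch in plain Python follows; the intercept and weights are made-up numbers (not fitted values from this notebook), `churn_probability` is an illustrative helper, and the sketch ignores the scaling step used in the pipeline:

```python
import math


def churn_probability(features, weights, intercept):
    """Sigmoid of the linear score: 1 / (1 + exp(-(b + w . x)))."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for (is_monthly, auto_pay, short_tenure)
weights = [1.2, -0.8, 0.9]
intercept = -1.5

p_risky = churn_probability([1, 0, 1], weights, intercept)   # monthly, no auto-pay, short tenure
p_stable = churn_probability([0, 1, 0], weights, intercept)  # not monthly, auto-pay, long tenure
print(f"risky: {p_risky:.3f}, stable: {p_stable:.3f}")
```

With these made-up weights, the first profile scores higher than the second, which is the kind of ordering the AUC metric evaluates.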

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

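# Features as floats, label as int; ids are carried through the split
# so predictions can later be matched back to customers
X = df[['is_monthly','auto_pay','short_tenure']].astype(float)
y = df['churn'].astype(int)
ids = df['id'].astype(str)

# 75/25 split, stratified so both parts keep a similar churn rate
X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X, y, ids, test_size=0.25, random_state=42, stratify=y
)

# Scale each feature by its standard deviation (no centering), then fit the model
pipe = Pipeline([
    ('scaler', StandardScaler(with_mean=False)),
    ('logreg', LogisticRegression(max_iter=2000, solver='lbfgs', class_weight=None))
]).fit(X_tr, y_tr)

# Column 1 is the probability of the second class in pipe.classes_ (churn = 1 for a 0/1 label)
p_tr = pipe.predict_proba(X_tr)[:,1]
p_te = pipe.predict_proba(X_te)[:,1]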

print(f"AUC train: {roc_auc_score(y_tr, p_tr):.3f}")
print(f"AUC test : {roc_auc_score(y_te, p_te):.3f}")
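
The AUC values printed above can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as one half. A minimal sketch of that pairwise definition in plain Python follows, using made-up labels and scores; `pairwise_auc` is an illustrative helper and not how `roc_auc_score` is implemented internally:

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(f"{auc:.3f}")  # prints 0.750
```

The double loop is quadratic in the number of rows, so it is only meant to make the definition concrete on small inputs.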