# Telco Customer Churn – Univariate Predictor Challenge

![Churn Analysis](data/churn.png)

### Project Brief

You have received anonymised customer account records from a fictitious telecoms provider. Your task is to find **which single column, on its own, is the most powerful predictor of churn** (that is, whether a customer left the company within the last month). You will train a one‑feature logistic‑regression model for every candidate variable, compare the models by cross‑validated AUC, and report the champion feature in a one‑row, two‑column DataFrame called `best_feature_df`.

> **Business context** – Reducing voluntary churn is a primary driver of profitability in subscription businesses.  Pinpointing the one factor that most strongly signals attrition helps the retention team focus interventions and craft targeted offers.


### The Dataset


| Attribute | Description |
|-----------|-------------|
| **File** | Telco-Customer-Churn.csv (≈ 7k rows, 21 columns) |
| **Target column** | Churn – "Yes" / "No" |
| **Identifier** | customerID (must be excluded from modelling) |
| **Predictors** | Mixed types: <br>• Demographics (gender, SeniorCitizen, Partner, Dependents)<br>• Services (InternetService, StreamingTV, …)<br>• Contract details (Contract, PaperlessBilling, PaymentMethod)<br>• Usage (tenure, MonthlyCharges, TotalCharges) |
| **Missing data** | TotalCharges contains 11 blank strings → treat as NaN |

A full data‑dictionary is available on the Kaggle page.
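
The blank `TotalCharges` entries can be handled along the following lines. This is a minimal sketch on a toy frame whose values are made up, and it assumes the blanks are whitespace-only strings; the real column should be checked before relying on that.

```python
import pandas as pd

# Toy stand-in for the real column; values are illustrative only.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.50", " "]})

# Strings that cannot be parsed as numbers (including blanks) become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Mean-impute the gaps, as step 2 of the task checklist asks.
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())
```

Counting the `NaN` values before imputing (`n_missing` here) gives a quick check against the 11 blanks quoted in the table.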

### Learning Objectives

1. Practise **feature‑type diagnostics** and appropriate encoding (boolean, numerical, nominal, ordinal).
2. Implement a **loop‑driven modelling workflow** with `statsmodels.formula.api.logit`.
3. Compute **out‑of‑sample performance** via stratified 5‑fold cross‑validation.
4. Consolidate results into a tidy ranking table and extract the top scorer.
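
One possible way to approach the first objective is a small helper that decides how a column should enter the model formula. The helper name `formula_term` and the rule "non-numeric dtype means categorical" are assumptions made for this sketch; ordinal columns, or numeric codes that are really categories, would need a more careful rule.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def formula_term(df: pd.DataFrame, column: str) -> str:
    """Return the right-hand-side term for a one-feature logit formula.

    Hypothetical helper: numeric columns enter as-is, everything else
    is wrapped in C() so it is treated as categorical.
    """
    return column if is_numeric_dtype(df[column]) else f"C({column})"


# Toy frame with one numeric and one nominal column (illustrative values).
toy = pd.DataFrame({"tenure": [1, 34, 2], "Contract": ["Month-to-month", "One year", "Month-to-month"]})
terms = {col: formula_term(toy, col) for col in toy.columns}
```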


### Task Checklist

| Step | Action | Expected artefact |
|------|--------|-------------------|
| 1 | Load the CSV into a DataFrame named `telco`. | – |
| 2 | Convert `TotalCharges` to numeric, coercing errors to `NaN`, then mean‑impute. | Cleaned `telco` DataFrame |
| 3 | Map `Churn` to binary **0 = "No", 1 = "Yes"**. | `telco['Churn']` as `int` |
| 4 | Drop the `customerID` column. | – |
| 5 | Identify candidate predictors: every remaining column except `Churn`. | Python list `features` |
| 6 | For **each** feature:<br>a. Wrap with `C()` if categorical.<br>b. Fit `logit` on the training fold, predict on the validation fold.<br>c. Record the **AUC** for that fold. | List of dicts `results` |
| 7 | Compute the **mean AUC** over the 5 folds for each feature. | `all_results_df` (feature, auc) |
| 8 | Sort descending, take the top row, rename columns to `best_feature`, `best_auc`, and reset the index. | **`best_feature_df`** |

<br>

> *Tip*  Use `StratifiedKFold(n_splits=5, shuffle=True, random_state=42)` to maintain the churn ratio across folds.
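
Steps 6 and 7 could be sketched as below. The data are synthetic and exist only to make the loop runnable; the two-feature `terms` mapping stands in for the full `features` list, and the per-fold `fold` key in `results` is an extra added for inspection. Treat this as one possible shape for the workflow rather than the reference solution.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (not the Telco file): churn gets less likely as tenure grows.
rng = np.random.default_rng(0)
n = 400
tenure = rng.integers(1, 72, size=n)
contract = rng.choice(["Month-to-month", "One year", "Two year"], size=n)
p_churn = 1 / (1 + np.exp(-(1.0 - 0.05 * tenure)))
toy = pd.DataFrame({"tenure": tenure, "Contract": contract, "Churn": rng.binomial(1, p_churn)})

# Formula term per feature: categorical columns are wrapped in C().
terms = {"tenure": "tenure", "Contract": "C(Contract)"}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = []
for feature, term in terms.items():
    for fold, (train_idx, valid_idx) in enumerate(skf.split(toy, toy["Churn"])):
        train, valid = toy.iloc[train_idx], toy.iloc[valid_idx]
        # Fit on the training fold only, then score the held-out fold.
        model = smf.logit(f"Churn ~ {term}", data=train).fit(disp=0)
        auc = roc_auc_score(valid["Churn"], model.predict(valid))
        results.append({"feature": feature, "fold": fold, "auc": auc})

# Mean AUC per feature, highest first.
all_results_df = (
    pd.DataFrame(results)
    .groupby("feature", as_index=False)["auc"]
    .mean()
    .sort_values("auc", ascending=False)
)
```

Reusing the same `StratifiedKFold` object for every feature means each candidate is scored on identical splits, so differences in mean AUC reflect the feature rather than the fold assignment.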


### Deliverables

1. **`best_feature_df`** – a one‑row DataFrame with columns:

   * `best_feature`  (string) – name of the champion predictor.
   * `best_auc`      (float) – its mean AUC, rounded to three decimals.
2. (Optional) `all_results_df` – full leaderboard of feature AUCs for inspection.
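
Assuming `all_results_df` already holds one row per feature with its mean AUC, the deliverable could be derived roughly as follows. The feature names and AUC values here are placeholders invented for the sketch, not results from the Telco data.

```python
import pandas as pd

# Placeholder leaderboard; names and scores are made up for illustration.
all_results_df = pd.DataFrame(
    {"feature": ["feature_a", "feature_b", "feature_c"], "auc": [0.6123, 0.7046, 0.5581]}
)

best_feature_df = (
    all_results_df.sort_values("auc", ascending=False)
    .head(1)  # keep only the top scorer
    .rename(columns={"feature": "best_feature", "auc": "best_auc"})
    .round({"best_auc": 3})  # three decimals, as the deliverable specifies
    .reset_index(drop=True)
)
```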

### Extension Ideas

* **Regularised Logit** – repeat the exercise with `fit_regularized` and observe whether the leader changes.
* **Alternative Metrics** – compare rankings by accuracy, F1, and Matthews CC.
* **Explainability** – draw a violin/box plot of the champion feature split by churn status to visualise separation.
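
For the alternative-metrics idea, note that accuracy, F1, and Matthews CC all need hard class labels, so the predicted probabilities must be thresholded first. The sketch below uses made-up labels and probabilities and assumes a 0.5 cut-off; with imbalanced churn data a different threshold may be more appropriate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Made-up validation labels and predicted churn probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.1, 0.7])

# Assumed cut-off: probabilities of at least 0.5 count as predicted churn.
y_pred = (y_prob >= 0.5).astype(int)

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "mcc": matthews_corrcoef(y_true, y_pred),
}
```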


In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [4]:
url = "https://raw.githubusercontent.com/jhlopesalves/data-science-practice-notebook/refs/heads/main/Python/projects/telcom_customer_churn/data/Telco-Customer-Churn.csv"
# Skip any stray "Unnamed: N" index column that an earlier CSV export may have written.
telcom = pd.read_csv(url, usecols=lambda col: not col.startswith("Unnamed"))
telcom.head()

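|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|------------|--------|---------------|---------|------------|--------|--------------|---------------|-----------------|----------------|-----|------------------|-------------|-------------|-----------------|----------|------------------|---------------|----------------|--------------|-------|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |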
