## Goal 

This notebook generates a churn probability score (PredictedChurnProb) for each customer using a logistic regression model trained on the cleaned IBM Telco Customer Churn dataset. 

The resulting predictions will later support ROI simulations in Tableau.

### Import and Load Data

In [3]:

import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import os 
import json

# Load the cleaned dataset 
df = pd.read_csv(os.path.join('data', 'processed', 'telco_turnaround.csv'))
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,ChurnFlag,CLTV_Est
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No,0,29.85
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,No,No,One year,No,Mailed check,56.95,1889.5,No,0,1936.3
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,107.7
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No,0,1903.5
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,141.4


### Encode Categorical Columns + Save Map

In [5]:

# A dictionary that stores the encoding map for each column
label_encoders = {}

# A list that tracks which columns were encoded
encoded_columns = []

# Create copy of the processed df
df_model = df.copy()

# Loop through all object-dtype columns (i.e., string or categorical variables)
for col in df_model.select_dtypes(include='object').columns:
    if col != 'customerID':
        le = LabelEncoder()  # Initialize label encoder
        df_model[col] = le.fit_transform(df_model[col])  # Apply label encoding to the column

        # Save the readable mapping (e.g., {'Female': 0, 'Male': 1}).
        # zip pairs each class label with its integer code; int() converts
        # np.int64 to a native Python int for JSON compatibility.
        label_encoders[col] = {
            cls: int(code) for cls, code in zip(le.classes_, le.transform(le.classes_))
        }

        # Keep track of encoded columns
        encoded_columns.append(col)

# Save encodings for transparency
with open(os.path.join('data', 'processed', 'encoding_map.json'), 'w') as f:
    json.dump(label_encoders, f, indent=4)

print("✅ Encoded columns:", encoded_columns)

✅ Encoded columns: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
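
Because `LabelEncoder` assigns integer codes in sorted class order, the saved map can be inverted to turn codes back into readable labels when inspecting the model's inputs. The following is a minimal sketch, assuming a small hand-written excerpt of the map; `sample_map` and the `invert_encoding_map` helper are illustrative names, not part of the notebook.

```python
import json

# Hypothetical excerpt of encoding_map.json; the real file is written by the cell above.
sample_map = {
    "gender": {"Female": 0, "Male": 1},
    "Contract": {"Month-to-month": 0, "One year": 1, "Two year": 2},
}

def invert_encoding_map(encoding_map):
    """Return {column: {code: label}} from {column: {label: code}}."""
    return {
        col: {code: label for label, code in mapping.items()}
        for col, mapping in encoding_map.items()
    }

# Round-trip through JSON to mimic loading the saved file.
loaded = json.loads(json.dumps(sample_map))
decoders = invert_encoding_map(loaded)

print(decoders["Contract"][1])
```

With the real file, `loaded` would come from `json.load` on `data/processed/encoding_map.json` instead of the in-memory sample.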


### Train Logistic Regression

In [7]:

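# Select features and target variable.
# Note: Contract, InternetService, and PaymentMethod are label-encoded integers here,
# so the model treats their codes as ordered numeric values.
X = df_model[['tenure', 'MonthlyCharges', 'Contract', 'InternetService', 'PaymentMethod']]  # returns a DataFrame
y = df_model['ChurnFlag']  # returns a Series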

# Split the dataset into training and testing sets.
# - X: feature matrix (e.g., tenure, charges, contract type)
# - y: target variable (churn flag: 0 = no churn, 1 = churn)
# - stratify=y preserves the proportion of churn vs. non-churn in both sets,
#   so the test set reflects the same class balance as the training set.
# - random_state=42 fixes the random seed so the split is reproducible.
# - No test_size is given, so scikit-learn holds out 25% of the rows by default.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Initialize the logistic regression model.
# max_iter=1000 raises the cap on solver iterations (the default is 100),
# giving the optimizer more room to converge.
model = LogisticRegression(max_iter=1000)

# fit(X_train, y_train) learns the optimal feature weights using the training data,
# so the model can later predict the probability of customer churn.
model.fit(X_train, y_train)
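
The split above produces `X_test` and `y_test`, but the notebook does not score the model on them. One way to check generalization is to compute hold-out metrics on the test rows. The sketch below assumes synthetic data from `make_classification` as a stand-in for the Telco features (the class imbalance and all `*_demo` names are arbitrary choices), so its printed numbers say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (5 features, imbalanced classes).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class on rows the model has not seen.
test_probs = demo_model.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_probs)
test_acc = accuracy_score(y_te, demo_model.predict(X_te))

print(f"Hold-out ROC AUC: {test_auc:.3f}")
print(f"Hold-out accuracy: {test_acc:.3f}")
```

In the notebook itself, the same two metric calls would take `X_test`, `y_test`, and the fitted `model`.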




### Add Predicted Churn Probabilities

In [8]:

# Use the trained model to predict the probability of churn (class = 1) for each customer.
# model.predict_proba(X) returns two columns: [P(not churn), P(churn)].
# We select the second column (index 1) to get the probability of churn for each row (customer).
# Scoring uses the full feature matrix X (training and test rows alike) so every customer gets a score.
df['PredictedChurnProb'] = model.predict_proba(X)[:, 1]

df[['customerID', 'PredictedChurnProb']].head()

Unnamed: 0,customerID,PredictedChurnProb
0,7590-VHVEG,0.303021
1,5575-GNVDE,0.108249
2,3668-QPYBK,0.478793
3,7795-CFOCW,0.041016
4,9237-HQITU,0.584504
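
For a binary logistic regression, the churn column of `predict_proba` is the logistic (sigmoid) function applied to a weighted sum of the features plus an intercept. The following is a minimal standard-library sketch of that calculation; the weights, intercept, and feature values are made-up placeholders, not the fitted model's coefficients, and `churn_probability` is a hypothetical helper.

```python
import math

def churn_probability(features, weights, intercept):
    """Logistic function of the linear score w . x + b."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients for [tenure, MonthlyCharges]; placeholders only.
weights = [-0.05, 0.02]
intercept = -1.0

# Same monthly charge, different tenure.
p_short_tenure = churn_probability([2, 70.0], weights, intercept)
p_long_tenure = churn_probability([45, 70.0], weights, intercept)

print(round(p_short_tenure, 3), round(p_long_tenure, 3))
```

For the fitted model, the corresponding values are stored in `model.coef_` and `model.intercept_`.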


### Export Final Dataset

In [9]:

# Save the enriched dataset
output_path = os.path.join('data', 'processed', 'telco_turnaround_with_churn_scores.csv')
df.to_csv(output_path, index=False)
print(f"✅ Exported enriched dataset → {output_path}")

✅ Exported enriched dataset → data/processed/telco_turnaround_with_churn_scores.csv


### Next Steps 

The enriched dataset, now containing churn probability scores, is ready to be imported into Tableau.

You can use it to:
- Identify high-risk customer segments
- Simulate the ROI of targeted retention strategies
- Drive executive-level decision-making with dynamic insights

All label encodings for categorical variables are saved in `data/processed/encoding_map.json` for reference and reproducibility.
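
As one possible starting point for the first item in the list above, the scores can be bucketed into risk tiers before or after the Tableau import. This sketch uses the five scores from the preview output earlier in this notebook; the cut-offs and the `risk_tier` helper are assumptions chosen for illustration, not values from the analysis.

```python
# Scores copied from the preview output earlier in this notebook.
scores = {
    "7590-VHVEG": 0.303021,
    "5575-GNVDE": 0.108249,
    "3668-QPYBK": 0.478793,
    "7795-CFOCW": 0.041016,
    "9237-HQITU": 0.584504,
}

# Hypothetical cut-offs; tune them to the business's tolerance for risk.
HIGH_RISK = 0.5
MEDIUM_RISK = 0.3

def risk_tier(prob):
    """Map a churn probability to a coarse risk label."""
    if prob >= HIGH_RISK:
        return "high"
    if prob >= MEDIUM_RISK:
        return "medium"
    return "low"

tiers = {cid: risk_tier(p) for cid, p in scores.items()}
for cid, tier in tiers.items():
    print(cid, tier)
```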