# 📊 Customer Churn Prediction

## Project Portfolio Entry

**Objective:** Demonstrate ability to build an end-to-end data analysis and predictive modeling project.

**Business Question:**
👉 Can we predict which customers are likely to churn, and provide actionable insights to reduce churn?

## Project Workflow:
1️⃣ Import Libraries  
2️⃣ Load Data  
3️⃣ Clean Data  
4️⃣ Explore Data  
5️⃣ Statistical Testing  
6️⃣ Build Predictive Model  
7️⃣ Predict Risk Scores  
✅ Conclusion & Business Recommendations  


## 1️⃣ Import Libraries (Purpose: Load Python tools needed for data analysis & modeling)

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

## 2️⃣ Load CSV (Purpose: Import raw dataset into DataFrame to prepare for cleaning & analysis)

In [None]:
project_root = os.path.abspath('..')  # Go up 1 level to project root
data_path = os.path.join(project_root, 'data', 'telco_churn.csv')

df = pd.read_csv(data_path)
df.head()

## 3️⃣ Clean Data (Purpose: Prepare the dataset for analysis & modeling)
- Fix data types (TotalCharges)
- Handle missing values
- Encode categorical columns (gender, Churn) as 0/1

In [None]:
# Strip stray whitespace from column names
df.columns = df.columns.str.strip()

# Normalise Churn labels, then encode Yes/No as 1/0
df['Churn'] = df['Churn'].str.strip()
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Convert TotalCharges to numeric; entries that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Impute the resulting NaNs with the column median
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# Encode gender as Male=1, Female=0
df['gender'] = df['gender'].map({'Male': 1, 'Female': 0})

df.head()
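
To illustrate what `errors='coerce'` does in the cleaning step, here is a minimal sketch on a toy frame. The values are hypothetical, and it assumes the unparseable entries are blank strings; the real column may contain other kinds of bad values.

```python
import pandas as pd

# Toy frame (hypothetical values) with one blank TotalCharges entry
toy = pd.DataFrame({"TotalCharges": ["20.0", " ", "100.0"]})

# Blank strings cannot be parsed, so errors='coerce' turns them into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Fill the NaN with the median of the values that did parse
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

print(f"Coerced to NaN: {n_missing}")
print(toy)
```

Counting the NaNs before filling them is a cheap way to see how many rows the imputation actually touches.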

## 4️⃣ Explore Data (Purpose: Visualize key patterns & trends)
- Tenure vs Churn
- Monthly Charges vs Churn
- Identify segments with higher risk

In [None]:
sns.histplot(data=df, x='tenure', hue='Churn', multiple='stack', bins=30)
plt.title('Tenure Distribution by Churn')
plt.xlabel('Tenure (months)')
plt.ylabel('Number of Customers')
plt.show()

In [None]:
sns.boxplot(data=df, x='Churn', y='MonthlyCharges')
plt.title('Monthly Charges vs Churn')
plt.xlabel('Churn')
plt.ylabel('Monthly Charges ($)')
plt.show()
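
The "identify segments with higher risk" goal can also be approached numerically by grouping customers into tenure bands and comparing churn rates. The sketch below uses a toy frame with hypothetical values, and the band edges (12, 24, 72 months) are assumptions chosen for illustration, not thresholds derived from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): tenure in months and a 0/1 churn flag
toy = pd.DataFrame({
    "tenure": [1, 5, 10, 14, 20, 30, 40, 60],
    "Churn":  [1, 1, 0, 1, 0, 0, 0, 0],
})

# Assumed band edges; include_lowest=True keeps tenure == 0 in the first band
bins = [0, 12, 24, 72]
labels = ["0-12", "13-24", "25-72"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels, include_lowest=True)

# The mean of a 0/1 flag is the churn rate within each band
churn_by_band = toy.groupby("tenure_band", observed=False)["Churn"].mean()
print(churn_by_band)
```

On the real data, the same pattern would apply to `df` with the `tenure` and `Churn` columns prepared in the cleaning step.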

## 5️⃣ Statistical Testing (Purpose: Test whether the tenure difference is statistically significant)
👉 Perform an independent two-sample t-test on tenure between churned & retained customers

In [None]:
churned = df[df['Churn'] == 1]['tenure']
retained = df[df['Churn'] == 0]['tenure']

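# Independent two-sample t-test (SciPy's default assumes equal variances)
t_stat, p_value = ttest_ind(churned, retained)
# General format keeps very small p-values from being rounded to 0.0000
print(f"T-test for tenure difference: t = {t_stat:.2f}, p-value = {p_value:.4g}")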
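
The test above pools the variance of both groups. If churned and retained customers differ in spread, Welch's variant (`equal_var=False`) may be the safer choice, and an effect size helps judge whether a significant difference is also practically large. The sketch below runs on synthetic tenure values (the means, spreads, and sample sizes are made up), and `cohens_d` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic tenure samples (assumed parameters, for illustration only)
rng = np.random.default_rng(0)
churned_toy = rng.normal(loc=18, scale=10, size=200)
retained_toy = rng.normal(loc=38, scale=20, size=500)

# Welch's t-test: does not assume the two groups share a variance
t_stat, p_value = ttest_ind(churned_toy, retained_toy, equal_var=False)


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


d = cohens_d(churned_toy, retained_toy)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.3g}, Cohen's d = {d:.2f}")
```

A negative `t` and `d` here simply reflect that the first group (churned) has the lower mean tenure.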

## 6️⃣ Build Logistic Regression Model (Purpose: Predict likelihood of churn)

In [None]:
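# Numeric / already-encoded columns used as predictors
features = ['gender', 'SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
X = df[features]
y = df['Churn']

# Hold out 30% for testing; stratify so both splits keep the same churn rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# max_iter is raised so the solver can converge on unscaled features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Class predictions at the default 0.5 probability threshold
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))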
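
Because churners are usually the minority class, a threshold-independent metric such as ROC AUC can be a useful complement to the classification report. The sketch below is one possible setup, not the notebook's method: it uses synthetic data from `make_classification` (an assumed 75/25 class balance), adds feature scaling in a pipeline, and stratifies the split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumed ~25% positive class)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.75, 0.25], random_state=42,
)

# Stratified split keeps the class ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Scaling inside the pipeline is fitted on the training data only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# ROC AUC is computed from predicted probabilities, not hard labels
auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print(f"Test ROC AUC: {auc:.3f}")
```

Putting the scaler in the pipeline avoids leaking test-set statistics into training, which matters once cross-validation is introduced.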

## 7️⃣ Predict Risk Scores (Purpose: Rank customers by risk of churn)
👉 Generate risk scores for each customer (Probability of churn)

In [None]:
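# Probability of the positive class (Churn = 1) for every customer.
# Note: this also scores rows the model was trained on, so those scores may look optimistic.
df['RiskScore'] = model.predict_proba(X)[:, 1]
# Show the ten customers with the highest predicted churn risk
df[['customerID', 'RiskScore']].sort_values(by='RiskScore', ascending=False).head(10)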
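
For stakeholders, raw probabilities are often easier to act on once grouped into tiers. A minimal sketch, assuming tier cut-offs of 0.3 and 0.6 (arbitrary values that would need to be agreed with the business) and hypothetical customer IDs and scores:

```python
import pandas as pd

# Toy risk scores (hypothetical IDs and probabilities)
scores = pd.DataFrame({
    "customerID": ["A-001", "A-002", "A-003", "A-004"],
    "RiskScore": [0.82, 0.45, 0.10, 0.67],
})

# Assumed cut-offs: up to 0.3 Low, up to 0.6 Medium, above that High
scores["RiskTier"] = pd.cut(
    scores["RiskScore"],
    bins=[0.0, 0.3, 0.6, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# Highest-risk customers first, ready for a retention call list
ranked = scores.sort_values("RiskScore", ascending=False).reset_index(drop=True)
print(ranked)
```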

## ✅ Conclusion & Business Recommendations for Stakeholders

**Key Findings:**
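- Shorter-tenure customers are more likely to churn (see the tenure histogram and t-test)
- Churned customers tend to have higher monthly charges (see the box plot)
- The logistic regression model produces a churn probability for each customer; its precision and recall for the churn class are shown in the classification report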

**Recommendations:**
1. Launch targeted retention campaigns for customers in their first 12 months of tenure
2. Offer discounts or service bundles to customers with high monthly charges
3. Use risk scores in CRM systems to proactively engage at-risk customers
4. Automate a weekly churn-monitoring dashboard for continuous tracking

**Next Steps:**
- Explore more advanced models (Random Forest, XGBoost)
- Add new features (customer complaints, NPS score)
- Deploy into production using automated pipelines
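
As a starting point for the model comparison mentioned above, the sketch below scores logistic regression against a random forest with 5-fold cross-validation. It runs on synthetic data, and the hyperparameters are untuned defaults chosen for illustration; XGBoost is left out because it needs a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn features and target
X_toy, y_toy = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Candidate models with illustrative, untuned settings
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean ROC AUC across 5 folds for each candidate
cv_auc = {
    name: cross_val_score(est, X_toy, y_toy, cv=5, scoring="roc_auc").mean()
    for name, est in candidates.items()
}

for name, score in cv_auc.items():
    print(f"{name}: mean CV ROC AUC = {score:.3f}")
```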

---
🎓 This project demonstrates my skills in:
- Data cleaning & wrangling
- Exploratory data analysis (EDA)
- Statistical testing
- Predictive modeling
- Communicating insights to business stakeholders
