# Project: Reducing Customer Churn at Telco

## Data-Driven Approach to Reducing Customer Attrition

### Motivation & Personal Interest

As an aspiring Data Scientist, I’m passionate about tackling real-world business problems where data analysis can drive measurable impact. The Telco Customer Churn dataset presents a perfect opportunity to apply my skills in machine learning and interpretable statistics while addressing a critical industry challenge: Why do customers leave?

What excites me about this project:

- Complexity: The dataset includes demographic, contractual, and service-related factors—ideal for isolating key drivers of churn

- Business Relevance: Customer attrition costs telecom companies billions annually

- Skill Development: This project combines EDA, predictive modeling, and actionable insights

### Goals & Methodology

Using Python, I analyze the data and create visuals to answer the following questions:

- **Question 1: Which customer segments churn most—and why?**
    - Approach: Explorative data analysis to visualize relationships between **`InternetService`**, **`MonthlyCharges`** and **`Churn`**
    - Statistics: Logistic regression (**`statsmodels`**) to quantify impact
- **Question 2: Can I build a simple yet effective churn prediction model?**
    - Approach: Train a Random Forest (**`scikit-learn`**) using interpretable features (e.g., **`Contract`**, **`tenure`**)
    - Business Value: Such a model could help telecoms proactively retain at-risk customers
- **Question 3: Do add-on services (e.g., Tech Support) actually reduce churn?**
    - Approach: Compare churn rates between customers with/without services, stratified by contract type

 ### Findings

- **Question 1: Which customer segments churn most—and why?**
    - dffd

 ### Next steps 🏃

- Explore the full code and my other projects on GitHub.
- Connect with me on LinkedIn to discuss churn strategies!

### Loading Packages and Initial Settings

In this section, all relevant Python packages are imported and basic settings are defined. These include libraries for data handling (`pandas`, `numpy`), visualization (`matplotlib`, `seaborn`), statistical modeling (`statsmodels`, `scipy`), and machine learning utilities (`sklearn`). Additionally, the random seed is set for reproducibility of results.

In [3]:
import os
from kaggle.api.kaggle_api_extended import KaggleApi
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.formula.api import logit
from statsmodels.stats.proportion import proportions_ztest
import pingouin
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

sns.set_style("darkgrid")
SEED = 9

### Data Downloading and Importing

In this step, the Telco Customer Churn dataset is downloaded directly from Kaggle using the Kaggle API. The data is saved in a local directory and automatically unzipped for further use.  
Note: Authentication via the Kaggle API is required beforehand.

In [8]:
cwd = Path.cwd()

project_root = cwd.parent

target_dir = project_root / "data"

api = KaggleApi()
api.authenticate()

dataset = "blastchar/telco-customer-churn"
api.dataset_download_files(dataset, path=target_dir, unzip=True)

print(f" Dataset downloaded to: {target_dir.resolve()}")

data_path = os.path.join(project_root, "data", "WA_Fn-UseC_-Telco-Customer-Churn.csv")

df_telco = pd.read_csv(data_path)

Projekt-Root: C:\Users\kevin\OneDrive\Documents\Data Science\Github\Repositories\telco-customer-churn-ml
Data-Folder: C:\Users\kevin\OneDrive\Documents\Data Science\Github\Repositories\telco-customer-churn-ml\data


### Identifying High-Risk Churn Segments: Key Drivers and Patterns

This analysis investigates which customer segments are most prone to churning and examines the underlying reasons behind their attrition. The study focuses on four key hypothesized drivers of customer churn, which will be explored through exploratory data analysis and logistic regression modeling. First, we hypothesize that customers with shorter duration of contract duration (`tenure`) are more likely to churn (`Churn`) due to weaker customer-company bonding effects. Second, we expect to find that higher monthly charges (`MonthlyCharges`) lead to increased churn as customers become more price-sensitive. Third, we anticipate that fiber optic internet service subscribers (`InternetService`) will demonstrate higher churn rates compared to DSL users, potentially due to unmet higher expectations regarding service quality. Finally, we predict that customers with month-to-month contracts will churn more frequently than those with longer-term contracts (`Contract`), as the absence of contractual commitments reduces switching costs. These hypotheses will guide our investigation into the primary factors influencing customer retention in the telecommunications sector.

#### Visualizing Churn Patterns
These plots compare tenure and monthly charges between churned vs. retained customers. Boxplots show distribution spreads, while pointplots highlight mean differences.

Results:
- Contract duration (tenure): A boxplot revealed that customers who did not churn had significantly longer contract tenures (median above 35 months), whereas churners typically left within the first 20 months. → Customer loyalty increases over time.
- Monthly charges (MonthlyCharges): Customers who churned paid considerably more on average ($74/month) than those who stayed ($61–62/month). → Higher costs appear to be a key driver of churn.

In [None]:
print(df_telco.info())

fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (14, 10))
fig.suptitle('Churn Analysis: Tenure vs Monthly Charges', y = 1.02, fontsize = 14)

sns.boxplot(x = 'Churn', y = 'tenure', data = df_telco, ax = axes[0,0])
axes[0,0].set_title('Tenure Distribution by Churn Status')
axes[0,0].set_xlabel('Has Churned')
axes[0,0].set_ylabel('Tenure (months)')

sns.pointplot(x ='Churn', y = 'tenure', data = df_telco, join = False, ax = axes[1,0])
axes[1,0].set_xlabel('Has Churned')
axes[1,0].set_ylabel('Tenure (months)')

sns.boxplot(x = 'Churn', y = 'MonthlyCharges', data = df_telco, ax = axes[0,1])
axes[0,1].set_title('Monthly Charges Distribution by Churn Status')
axes[0,1].set_xlabel('Has Churned')
axes[0,1].set_ylabel('Monthly Costs ($)')

sns.pointplot(x = 'Churn', y = 'MonthlyCharges', data = df_telco, join = False, ax = axes[1,1])
axes[1,1].set_xlabel('Has Churned')
axes[1,1].set_ylabel('Monthly Costs ($)')

plt.tight_layout()
plt.show()

#### Hypotheses Testing: Churn Drivers Analysis

**Research Focus**: 
Statistically validate three key hypotheses about churn drivers:  
1. **Age**: Senior citizens have different churn rates  
2. **Internet Service**: Fiber optic/DSL services show different churn patterns  
3. **Contract Type**: Month-to-month vs longer contracts affect churn differently  

**Statistical Methods**:  
- Z-test for proportions (age groups)  
- Chi-square tests (categorical variables)

**Results**:
- There is a statistically significant difference in churn rates between senior and non-senior customers.
    - The variable SeniorCitizen is associated with churn behavior – the proportion of churn differs between the two age groups.
- The test indicates a meaningful statistical association between the type of internet service and churn.
    - Interpretation: The churn behavior differs significantly across service types, suggesting that internet service category plays a role in how churn is distributed.
- The analysis reveals a significant relationship between contract type and customer churn.
    - Churn is not evenly distributed across different contract models – the distribution of churn status varies by contract type.

In [10]:
df_telco.groupby("SeniorCitizen")['Churn'].value_counts()

n_churned_senior = np.array([1393, 476])
n_rows_senior = np.array([1393 + 4508, 476 + 666])
z_score_senior, p_value_senior = proportions_ztest(count = n_churned_senior, nobs = n_rows_senior, alternative = "two-sided")
print(z_score_senior, p_value_senior)

expected_internet, observed_internet, stats_internet = pingouin.chi2_independence(data = df_telco, x = "InternetService", y = "Churn")
print(stats_internet)

expected_contract, observed_contract, stats_contract = pingouin.chi2_independence(data = df_telco, x = "Contract", y = "Churn")
print(stats_contract)

NameError: name 'df_telco' is not defined

#### Logistic Regression: Predicting Churn Probability

**Model Purpose:**  
Quantify how different factors influence churn likelihood using logistic regression. The model predicts the probability of a customer churning based on:
- Monthly charges
- Tenure (duration as customer)
- Age (SeniorCitizen status)
- Internet service type
- Contract type

**Key Outputs:**
- Coefficients: Direction and magnitude of each factor's impact
- Odds Ratios: How much each factor increases/decreases churn risk
- Statistical Significance: p-values indicating reliable effects

**Results**:
- Internet type:
  - Customers with Fiber Optic internet are significantly more likely to churn than those with DSL (+0.98)
    - In contrast, customers with no internet service are less likely to churn (–0.88)
    - → This suggests that high-performance internet (likely more expensive) may come with higher expectations — and greater risk of dissatisfaction.
- Contract type:
  - A 1-year contract significantly reduces churn risk (–0.83)
  - A 2-year contract has an even stronger effect (–1.66)
  - → This confirms earlier insights and provides a quantitative foundation for loyalty-focused strategies.
- Tenure (contract duration): Each additional month with the company reduces the likelihood of churn (–0.033 per month), underlining the value of long-term customer relationships.
- Monthly charges: Although initially suspected to drive churn, monthly cost shows no significant effect in the full model (p = 0.122). This indicates that pricing effects may be explained by other variables like contract type or internet service.
- Age: Surprisingly, older customers have a higher likelihood of churn (+0.39). This could be due to tech-related frustration, lack of perceived value, or pricing concerns — and suggests a need for targeted support strategies.

In [11]:
df_telco.loc[df_telco["Churn"] == "Yes", "Churn"] = 1
df_telco.loc[df_telco["Churn"] == "No", "Churn"] = 0

df_telco['Churn'] = df_telco['Churn'].astype("int")

logit_model_churn = logit("Churn ~ MonthlyCharges + tenure + SeniorCitizen + InternetService + Contract",
                         data = df_telco).fit()

print(logit_model_churn.summary())


odds_ratios = np.exp(logit_model_churn.params)

results_df = pd.DataFrame({
    'Variable': odds_ratios.index,
    'Odds Ratio': odds_ratios.values,
    'P-value': logit_model_churn.pvalues
})

print(results_df)

NameError: name 'df_telco' is not defined

### Building an Effective Churn Prediction Model?

The next section examines whether we can develop both simple and accurate churn prediction models by leveraging machine learning techniques. While traditional logistic regression offers valuable interpretability, machine learning approaches provide distinct advantages for this predictive modeling task. Specifically, ML algorithms can identify and model complex non-linear patterns in the data, account for intricate interactions between multiple variables, and ultimately deliver superior predictive performance compared to conventional statistical methods. For this analysis, we implement a **Random Forest classifier** - an ensemble method that combines multiple decision trees to reduce overfitting while maintaining robust performance across diverse data types. The model's ability to generate feature importance scores adds valuable interpretability to its predictions. Our optimization strategy focuses on three key aspects: First, we conduct comprehensive **hyperparameter tuning** to identify the optimal model configuration. Second, we **prioritize recall** as our primary evaluation metric, aiming to correctly identify at least 70% of actual churn cases, since failing to detect potential churners (false negatives) carries greater business risk than false alarms. This recall-focused approach ensures we minimize missed opportunities for customer retention interventions.

#### Data Preprocessing for Machine Learning

**Objective:** Convert all categorical variables to numerical representations and handle missing values to prepare data for machine learning algorithms.

**Key Steps:**
1. Binary encoding for Yes/No features  
2. Ordinal encoding for multi-category features  
3. Missing value handling for TotalCharges  
4. Type conversion for all encoded features  

In [12]:
df_telco_num = df_telco

df_telco_num.loc[df_telco_num["gender"] == "Male", "gender"] = 1
df_telco_num.loc[df_telco_num["gender"] == "Female", "gender"] = 0

yes_no_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']

df_telco_num[yes_no_cols] = df_telco_num[yes_no_cols].replace({'Yes': 1, 'No': 0})

yes_no_other_cols = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for col in yes_no_other_cols:
    third_value = list(set(df_telco_num[col].unique()) - {'Yes', 'No'})[0]
    df_telco_num[col] = df_telco_num[col].replace({
        'Yes': 1,
        'No': 0,
        third_value: 2
    })

df_telco_num.loc[df_telco_num["InternetService"] == "Fiber optic", "InternetService"] = 3
df_telco_num.loc[df_telco_num["InternetService"] == "DSL", "InternetService"] = 2
df_telco_num.loc[df_telco_num["InternetService"] == "No", "InternetService"] = 1

df_telco_num.loc[df_telco_num["Contract"] == "Month-to-month", "Contract"] = 1
df_telco_num.loc[df_telco_num["Contract"] == "Two year", "Contract"] = 2
df_telco_num.loc[df_telco_num["Contract"] == "One year", "Contract"] = 3

df_telco_num.loc[df_telco_num["PaymentMethod"] == "Electronic check", "PaymentMethod"] = 1
df_telco_num.loc[df_telco_num["PaymentMethod"] == "Mailed check", "PaymentMethod"] = 2
df_telco_num.loc[df_telco_num["PaymentMethod"] == "Bank transfer (automatic)", "PaymentMethod"] = 3
df_telco_num.loc[df_telco_num["PaymentMethod"] == "Credit card (automatic)", "PaymentMethod"] = 4

df_telco_num["gender"] = df_telco_num["gender"].astype(int)
df_telco_num["InternetService"] = df_telco_num["InternetService"].astype(int)
df_telco_num["Contract"] = df_telco_num["Contract"].astype(int)
df_telco_num["PaymentMethod"] = df_telco_num["PaymentMethod"].astype(int)

df_telco_num['TotalCharges'] = df_telco_num['TotalCharges'].replace(' ', np.nan)
df_telco_num['TotalCharges'] = df_telco_num['TotalCharges'].astype(float)
df_telco_num['TotalCharges'] = df_telco_num['TotalCharges'].fillna(df_telco_num['TotalCharges'].mean())

NameError: name 'df_telco' is not defined

#### Data Splitting for Model Development

**Purpose:**  
Split the preprocessed data into training and test sets to:  
- Train models on one subset (training)  
- Evaluate performance on unseen data (test)  
- Prevent data leakage and overfitting  

**Key Parameters:**  
- Test size: 20% of total data  
- Random state: Fixed for reproducibility (`SEED = 9`)  
- Stratification: Preserves original churn distribution in splits  

In [None]:
X = df_telco_num.drop(columns = ["customerID", "Churn"])
y = df_telco_num["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = SEED)

#### Random Forest Model Optimization

**Objective**: Find the optimal Random Forest configuration through hyperparameter tuning, focusing on maximizing recall to best identify potential churners.

**Tuning Strategy:**
- **10-fold cross-validation** for robust performance estimation  
- **Recall-focused** (prioritize identifying true churners)  
- **Key Parameters:**  
  - `n_estimators`: Number of trees (100-300)  
  - `max_depth`: Tree complexity (shallow to deep)  
  - `min_samples_split`: Controls overfitting  
  - `class_weight`: Handles imbalanced data  

In [13]:
rf = RandomForestClassifier(random_state = SEED)
rf.get_params()

params_rf = {
                'n_estimators': [100, 200, 300],
                'max_depth': [None, 10, 20],
                'min_samples_split': [2, 5, 10],
                'class_weight': [None, 'balanced']
}

grid_rf = GridSearchCV(estimator = rf,
                      param_grid = params_rf,
                      cv = 10,
                      scoring = "recall",
                      verbose = 1,
                      n_jobs = -1)

grid_rf.fit(X_train, y_train)

print(grid_rf.best_params_)

NameError: name 'X_train' is not defined

#### Final Random Forest Model Evaluation

**Model Configuration**:
- n_estimators: 100 trees
- max_depth: 10 levels
- class_weight: Balanced (handles class imbalance)
- min_samples_split: 10 samples required to split a node
- random_state: `SEED` for reproducibility

**Model Performance**:
- The model achieved a recall of 0.71 for churners, surpassing our target of 0.70.
- The Precision for churners is 0.55, which is acceptable given the high cost of false negatives in this scenario.
- For non-churners, both metrics are strong, ensuring a stable prediction base.

→ The model performs reasonably well in identifying customers who are likely to churn, while maintaining a manageable rate of false positives. It can serve as an operational tool for early intervention.

**Feature Importance:**
The key features influencing the model’s decisions were `Contract`, `Tenure`, `TotalCharges`, `MonthlyCharges`, `OnlineSecurity`, and `TechSupport`. These variables were used most frequently by the model to distinguish between churners and non-churners.
It is important to note that high feature importance **does not imply causation**; rather, it indicates how **relevant** a feature is within the model’s context.
Combined with results from the logistic regression and exploratory analysis, these features consistently highlight **specific risk patterns**—such as short contract duration, low usage of add-on services, and lower customer value—that can guide targeted retention efforts.

In [None]:
best_model_rf = RandomForestClassifier(n_estimators = 100,
                                    max_depth = 10,
                                    class_weight = 'balanced',
                                    min_samples_split = 10,
                                    random_state = SEED)
best_model_rf.fit(X_train, y_train)
y_pred_rf = best_model_rf.predict(X_test)

best_recall_rf = recall_score(y_test, y_pred_rf)

print(best_recall_rf)

print(classification_report(y_test, y_pred_rf, target_names=["No Churn", "Churn"]))


importances_rf = pd.Series(best_model_rf.feature_importances_, index = X.columns)

sorted_importances_rf = importances_rf.sort_values(ascending = False)

sorted_importances_rf.plot(kind = 'barh', color = 'lightgreen'); plt.show()

###  Analyzing Add-On Service Impact on Churn

This investigation examines whether value-added services like Tech Support, Online Security, Online Backup, and Device Protection effectively reduce customer churn in telecommunications. We hypothesize that customers who subscribe to these premium services demonstrate lower attrition rates due to three key mechanisms: (1) enhanced perceived value of their service package, making them less likely to switch providers; (2) increased switching costs as these add-ons create more complex service integrations; and (3) improved overall customer experience through better support and protection features. The analysis specifically focuses on measuring the churn impact of four critical services: Technical Support, which provides troubleshooting assistance; Online Security, offering protection against digital threats; Online Backup for data preservation; and Device Protection covering hardware maintenance. By comparing churn rates between subscribers and non-subscribers of each service, we aim to quantify their true retention value and identify which offerings provide the strongest protective effect against customer attrition.

#### Hypothesis Testing: Impact of Value-Added Services on Churn

**Objective**: Statistically validate whether subscribing to value-added services significantly reduces churn probability.

**Methodology:**  
For each service (Tech Support, Online Security, Online Backup, Device Protection):
1. Conduct chi-square test of independence between service subscription and churn status
2. Compare observed vs. expected churn frequencies
3. Report Pearson's chi-square statistic with p-values

**Null Hypothesis (H₀)**: No association exists between service subscription and churn rate.

**Alternative Hypothesis (H₁)**: Service subscribers have different churn rates than non-subscribers.

**Results**: Initial Chi-Square tests revealed highly significant associations between churn status and all four services. In each case, the distribution of churned vs. retained customers differed meaningfully depending on whether the service was used – suggesting that service usage correlates with churn behavior.

In [None]:
services = ["TechSupport", "OnlineSecurity", "OnlineBackup", "DeviceProtection"]

for service in services:
    expected, observed, stats = pingouin.chi2_independence(data = df_telco, x = service, y = "Churn")
    print(stats)

#### Logistic Regression: Quantifying Service Impact on Churn

**Objective:** Measure how each value-added service affects churn probability while controlling for base predictors (monthly charges, tenure, age, and contract type).

**Model Specifications:**
- **Dependent Variable**: `Churn`
- **Base Control Variables**:
  - Monthly Costs (`MonthlyCharges`)
  - Duration of Customer (`tenure`)
  - Age (`SeniorCitizen`)
  - Type of Contract (`Contract`)
- **Tested Service Variables**:
  - Tech Support
  - Online Security  
  - Online Backup
  - Device Protection

**Analytical Approach:** Separate logistic models for each service to isolate individual effects while maintaining consistent controls.

**Results**:  All coefficients are negative, these services are associated with lower churn probabilities, even after adjusting for pricing, contract type, and customer profile
- While we cannot infer causality, the consistency across models suggests that customers who opt into additional services – especially Tech Support and Online Security – may have stronger engagement or switching barriers, lowering their churn likelihood.
- These services could therefore serve as potential retention levers, especially if bundled strategically or promoted to at-risk customers.

In [None]:
base_predictors = "MonthlyCharges + tenure + SeniorCitizen + Contract"

for service in services:
    # Build the formula dynamically
    formula = f"Churn ~ {base_predictors} + {service}"
    model = logit(formula, data=df_telco).fit()
    print(model.summary())