# Customer Churn Prediction in the Telecommunications Industry

## Problem Statement

In the telecommunications industry, retaining existing customers is a major challenge due to high competition and the ease with which customers can switch service providers. Customer churn — when customers stop using a company’s services — leads to significant revenue loss and increased costs associated with acquiring new customers.

Telecom companies collect large amounts of customer data, including usage patterns, service subscriptions, and billing information. This data can be analyzed using machine learning techniques to identify patterns that indicate whether a customer is likely to churn.

The problem addressed in this project is to build a machine learning model that predicts customer churn based on available telecom customer data. By accurately identifying customers who are at risk of leaving, the company can take proactive measures such as targeted promotions or improved services to reduce churn and improve customer retention.

## Objective

The objective of this modelling task is to accurately identify telecom customers at risk of churn while balancing predictive performance, interpretability, and efficient use of retention resources.

## Data Description

The dataset contains customer information for a telecom company, including demographic, account, and usage details. It consists of **3,333 rows** and **21 columns** (the `churn` target plus 20 other fields), where each row represents a unique customer.

### Target Variable
- `churn`: A binary variable indicating whether the customer has churned (`1`) or stayed (`0`). The target is imbalanced, with fewer churners than non-churners.

### Numerical Features
**Account-related:**  
- `account length`: Number of months the customer has been with the company  
- `total day charge`, `total evening charge`, `total night charge`, `total international charge`  

**Usage-related:**  
- `number vmail messages`  
- `total day minutes`, `total day calls`  
- `total evening minutes`, `total evening calls`  
- `total night minutes`, `total night calls`  
- `total international minutes`, `total international calls`  

**Customer service interactions:**  
- `customer service calls`  

### Categorical Features
- `state`, `area code`, `international plan`, `voice mail plan`  

This dataset provides a mix of numerical and categorical features, enabling predictive modeling to identify customers at risk of churn. Proper preprocessing, scaling, and feature selection are essential to ensure accurate and interpretable results.


## Exploratory Data Analysis

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,precision_score, recall_score, f1_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree


In [None]:
df = pd.read_csv('./data/bigml_59c28831336c6604c800002a.csv')
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

In [None]:
df['churn'].value_counts()

In [None]:
df['churn'].value_counts(normalize= True) * 100

In [None]:
# pie chart
labels = df['churn'].value_counts().index
sizes = df['churn'].value_counts().values
plt.figure(figsize=(6,6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140, colors=['darkblue', 'pink'])
plt.title('Churn Distribution')
plt.show()

In [None]:
# Boxplots for total day minutes, total night calls, customer service calls to show spread and outliers.
plt.figure(figsize=(15,5))
plt.subplot(1, 3, 1)
plt.boxplot(df['total day minutes'])
plt.title('Total Day Minutes')

plt.subplot(1, 3, 2)
plt.boxplot(df['total night calls'])
plt.title('Total Night Calls')

plt.subplot(1, 3, 3)
plt.boxplot(df['customer service calls'])
plt.title('Customer Service Calls')

plt.tight_layout()
plt.show()

In [None]:
#Histogram for account length
plt.figure(figsize=(8,5))
plt.hist(df['account length'], bins=20, color='skyblue', edgecolor='black')
plt.title('Account Length Distribution')
plt.show()

In [None]:
# Boxplots comparing churners and non-churners on customer service calls and total day minutes
plt.figure(figsize=(8,5))
sns.boxplot(x='churn', y='customer service calls', data=df)
plt.title('Customer Service Calls by Churn Status')
plt.show()

#churn vs total day minutes
plt.figure(figsize=(8,5))
sns.boxplot(x='churn', y='total day minutes', data=df)
plt.title('Total Day Minutes by Churn Status')
plt.show()


In [None]:
df['area code'].value_counts()

In [None]:
df['area code'] = df['area code'].astype(object)


In [None]:
df.drop(columns= 'phone number', inplace= True)

In [None]:
df.info()

In [None]:
corr = df.select_dtypes(include=np.number).corr()
corr

In [None]:
plt.figure(figsize=(10, 8))
plt.imshow(corr)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

In [None]:
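# numeric_only=True skips object columns such as state and area code,
# which would otherwise raise an error in recent pandas versions
corr = df.corr(numeric_only=True)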

# correlations with target
target_corr = corr['churn'].drop('churn')

# sort by absolute correlation
target_corr_sorted = target_corr.abs().sort_values(ascending=False)

target_corr_sorted.head(10)

In [None]:
# numeric_only=True restricts the matrix to numeric and boolean columns
corr_matrix = df.corr(numeric_only=True).abs()

# remove self-correlation
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# highly correlated pairs
high_corr = (
    upper.stack()
    .sort_values(ascending=False)
)

high_corr.head(10)


## Model Preprocessing

Before building predictive models, the dataset underwent several preprocessing steps to ensure the features were clean, consistent, and suitable for modeling. The dataset contained no missing or duplicated values.

### 1. Encoding Categorical Variables
- Categorical features (`state`, `area code`, `international plan`, and `voice mail plan`) were encoded as numeric values.
- Binary features (`international plan`, `voice mail plan`) were label-encoded (0/1).
- Multi-class features (`state`, `area code`) were one-hot encoded with the first level dropped, which avoids imposing an artificial ordering on the categories.

### 2. Feature Scaling
- Numerical features were scaled using `StandardScaler` for logistic regression.  
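The sketch below illustrates, on hypothetical numbers, the usual way to apply `StandardScaler`: the mean and standard deviation are learned from the training rows only and then reused for the test rows, so no information from the test set leaks into the scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values for two numeric columns)
X_train_toy = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]])
X_test_toy = np.array([[400.0, 7.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_std = scaler.fit_transform(X_train_toy)
# Reuse the training statistics for the test rows
X_test_std = scaler.transform(X_test_toy)

print(X_train_std.mean(axis=0))  # close to zero for each column
print(X_test_std)
```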

### 3. Train-Test Split
- The dataset was split into training and testing sets to evaluate model performance on unseen data.  
- Stratified splitting was used to maintain the class distribution of the target variable in both sets.
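To illustrate what stratification does, the sketch below splits synthetic labels with an assumed 15% positive rate (chosen for illustration, not taken from the dataset) and compares the positive share in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 15% positives (an assumed rate for illustration only)
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep roughly the same share of positives as the full data
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```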

### 4. Addressing Class Imbalance
- The target variable (`churn`) is imbalanced.
- Strategies considered:
  - Using performance metrics sensitive to imbalance (recall, F1-score)
  - Class weights (`class_weight='balanced'`)
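For reference, scikit-learn's `'balanced'` class weights are computed as `n_samples / (n_classes * count_per_class)`. The sketch below checks that formula on synthetic labels; the 85/15 split is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with an assumed 85/15 class split
y_toy = np.array([0] * 850 + [1] * 150)

# Weights scikit-learn would use for class_weight='balanced'
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# The same weights computed by hand from the documented formula
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(weights, manual)
```

The minority class receives the larger weight, so errors on churners cost more during training.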

These preprocessing steps ensured that the dataset was clean, numerical features were scaled, and categorical variables were encoded. This prepared the data for building accurate and interpretable classification models such as Logistic Regression, Decision Tree, and Random Forest.


In [None]:
X = df.drop(columns= 'churn')
y = df['churn'].astype(int)

In [None]:
binary_cols = ['international plan', 'voice mail plan']
multi_cols = ['state', 'area code']

In [None]:
binary_cols, multi_cols

In [None]:
le = LabelEncoder()
for col in binary_cols:
    X[col] = le.fit_transform(X[col])



In [None]:
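# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False, drop='first')
multi_encoded = ohe.fit_transform(X[multi_cols])

# get_feature_names_out replaces the removed get_feature_names method
multi_col_names = ohe.get_feature_names_out(multi_cols)
multi_encoded_df = pd.DataFrame(multi_encoded, columns=multi_col_names, index=X.index)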
multi_encoded_df



In [None]:
X = X.drop(columns= multi_cols)
X = pd.concat([X, multi_encoded_df], axis= 1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Model Training

After preprocessing, the dataset was used to train multiple classification models to predict customer churn. The goal was to compare model performance and select the most suitable model for the task.

### 1. Logistic Regression
- **Purpose:** Provides a baseline linear model for classification.  
- **Training:**  
  - Numerical features were scaled using `StandardScaler`.  
  - Categorical features were encoded as numeric values.  
  - The model was trained with `max_iter=1000` to ensure convergence.  
- **Evaluation Metrics:** Accuracy, precision, recall, and F1-score, with special focus on the recall for the churn class due to business importance.
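The evaluation metrics can be read directly off a confusion matrix. The sketch below uses ten hypothetical labels and predictions, not project results, and checks the hand-computed values against scikit-learn's metric functions.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels and predictions (1 = churn, 0 = stay)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of flagged customers who really churned
recall = tp / (tp + fn)     # share of actual churners who were flagged
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```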

### 2. Decision Tree
- **Purpose:** Captures non-linear relationships between features and churn.  
- **Training:**  
  - No scaling required for tree-based models.  
  - The model was first trained with default parameters, then constrained with `max_depth=5` and `class_weight='balanced'` to limit overfitting and account for class imbalance.
- **Evaluation Metrics:** Same as above, emphasizing F1-score for the churn class.
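As a rough illustration of how a depth limit constrains a tree, the sketch below fits an unconstrained and a depth-limited tree on synthetic data from `make_classification` and prints their sizes. The data and parameter values are assumptions for illustration, not the project's settings or results.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Unconstrained tree: grows until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# Depth-limited tree: at most 5 levels of splits
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_syn, y_syn)

print("full:", full_tree.get_depth(), full_tree.get_n_leaves())
print("shallow:", shallow_tree.get_depth(), shallow_tree.get_n_leaves())
```

A tree limited to depth 5 can have at most 32 leaves, which is part of what keeps it readable.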

### 3. Random Forest
- **Purpose:** Ensemble model to improve predictive performance and reduce overfitting.  
- **Training:**  
  - Multiple decision trees were trained on random subsets of the data and features.  
  - Predictions were aggregated across trees (scikit-learn averages the trees' predicted class probabilities rather than counting hard votes).
  - Class imbalance was addressed using class weights.  
- **Evaluation Metrics:** Accuracy, precision, recall, F1-score, and comparison with logistic regression and decision tree results.
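The sketch below shows, on synthetic data, how a random forest's output relates to its individual trees. scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities, and the snippet reproduces that average by hand. The data and the number of trees are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_syn, y_syn)

# Average the per-tree class probabilities by hand
tree_probas = np.mean(
    [tree.predict_proba(X_syn) for tree in forest.estimators_], axis=0
)
forest_probas = forest.predict_proba(X_syn)

# Check whether the manual average matches the forest's own output
print(np.allclose(tree_probas, forest_probas))
```

Because the forest exposes averaged probabilities, the decision threshold can be moved away from 0.5 after training.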

#### Logistic Regression

##### Baseline model

In [None]:
# Baseline logistic regression
baseline_model = LogisticRegression(max_iter=1000, random_state=42)
baseline_model.fit(X_train, y_train)

# Predictions
y_train_pred = baseline_model.predict(X_train)
y_test_pred = baseline_model.predict(X_test)

# Evaluation
print("Train Classification Report:\n", classification_report(y_train, y_train_pred))
print("Test Classification Report:\n", classification_report(y_test, y_test_pred))


##### Tuned model

In [None]:
model = LogisticRegression(max_iter=1000, C=0.5, random_state=42, class_weight='balanced')
model.fit(X_train_scaled, y_train)

In [None]:
y_pred = model.predict(X_test_scaled)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision) # Of the customers we flagged as churners, how many actually churned? Low precision = annoying customers with retention offers they didn’t need.
print("Recall:", recall)   # most important - Of all customers who actually churned, how many did we catch? High recall = fewer churners slipping through the cracks.
print("F1:", f1)


In [None]:
print(classification_report(y_test, y_pred))

In [None]:
y_proba = model.predict_proba(X_test_scaled)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC AUC:", roc_auc)

In [None]:
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': model.coef_[0]
})

feature_importance['abs_coeff'] = feature_importance['coefficient'].abs()

feature_importance.sort_values('abs_coeff', ascending=False)


In [None]:
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': model.coef_[0]
})

feature_importance['importance'] = feature_importance['coefficient'].abs()

top_5_features = (
    feature_importance
    .sort_values('importance', ascending=False)
    .head(5)
)

top_5_features


#### Random Forest

In [None]:
rf = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # handles imbalance
    random_state=42
)

rf.fit(X_train_scaled, y_train)


In [None]:
y_pred = rf.predict(X_test_scaled)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


In [None]:
y_proba = rf.predict_proba(X_test_scaled)[:, 1]
y_pred_custom = (y_proba >= 0.3).astype(int)  # lower threshold → catch more churners

print("\nClassification Report (Threshold 0.3):")
print(classification_report(y_test, y_pred_custom))


In [None]:
# ROC curve for the Random Forest (built from predicted probabilities, so it does not depend on the 0.3 threshold)
y_proba = rf.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Bar chart of the 10 most important features according to the Random Forest
importances = rf.feature_importances_
feature_names = X_train.columns
indices = np.argsort(importances)[::-1]
top_n = 10
plt.figure(figsize=(10,6))
plt.title("Top 10 Feature Importances - Random Forest")
plt.bar(range(top_n), importances[indices][:top_n], align='center')
plt.xticks(range(top_n), feature_names[indices][:top_n], rotation=45)
plt.tight_layout()
plt.show()


#### Decision Tree

##### Baseline model

In [None]:
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))

##### Tuned model

In [None]:
tree = DecisionTreeClassifier(
    max_depth=5,          # limits tree size to prevent overfitting
    class_weight='balanced',  # handle churn imbalance
    random_state=42
)

tree.fit(X_train, y_train)  # use X_train_scaled if you scaled


In [None]:
y_pred = tree.predict(X_test)  # use X_test_scaled if scaled

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


In [None]:
plt.figure(figsize=(20,10))
plot_tree(
    tree, 
    feature_names=list(X.columns),  # a plain list is accepted by all scikit-learn versions
    class_names=['Stay','Churn'], 
    filled=True,
    fontsize=10
)
plt.show()


## Model Comparison and Recommendation

The performance of four models—Logistic Regression, Decision Tree, Random Forest, and Random Forest with a threshold adjustment—was evaluated using accuracy, precision, recall, and F1-score, with special focus on the churn class due to business importance.

### 1. Model Comparison

| Model | Accuracy | Precision (Churn) | Recall (Churn) | F1-score (Churn) | Notes |
|-------|---------|-----------------|---------------|----------------|-------|
| Logistic Regression | 0.75 | 0.33 | 0.70 | 0.44 | Low precision → many false positives; recall decent but overall impractical for retention campaigns. |
| Decision Tree | 0.90 | 0.65 | 0.72 | 0.69 | Good balance between recall and precision; interpretable; suitable for proactive retention. |
| Random Forest | 0.92 | 0.91 | 0.53 | 0.67 | High precision → fewer false positives; lower recall → misses many churners; excellent overall accuracy. |
| Random Forest (Threshold 0.3) | 0.92 | 0.70 | 0.78 | 0.74 | Threshold adjustment improves recall while maintaining good precision; highest F1 for churners. |

### 2. Analysis

- **Recall is crucial**: Detecting as many churners as possible is key for retention; threshold-adjusted Random Forest achieves the highest recall (0.78).  
- **Precision controls costs**: High precision reduces unnecessary retention offers; Random Forest without threshold is very conservative (0.91 precision) but misses many churners.  
- **F1-score balances both**: Threshold-adjusted Random Forest has the highest F1 (0.74), providing the best overall trade-off.  
- **Interpretability vs Performance**: Decision Tree is slightly lower in F1 (0.69) but highly interpretable, making it easier to explain to stakeholders.
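F1 is the harmonic mean of precision and recall, so the reported values can be sanity-checked from the table. The sketch below recomputes F1 from the rounded precision and recall figures; small differences from the reported F1 are expected because the inputs are already rounded to two decimals.

```python
# (precision, recall, reported F1) for the churn class, copied from the comparison table
reported = {
    "Logistic Regression": (0.33, 0.70, 0.44),
    "Decision Tree": (0.65, 0.72, 0.69),
    "Random Forest": (0.91, 0.53, 0.67),
    "Random Forest (Threshold 0.3)": (0.70, 0.78, 0.74),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: recomputed F1 = {f1_from(p, r):.3f}, reported = {f1_reported:.2f}")
```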

### 3. Recommendation

- **Best model for business objective (maximizing churn detection while limiting false positives):** **Random Forest with threshold 0.3**.  
- **Best model for interpretability:** **Decision Tree** — slightly lower F1 but easy to explain decisions.  
- **Logistic Regression** is not recommended due to low precision and F1.  
- **Standard Random Forest** is too conservative without threshold adjustment, missing too many churners.

> By adjusting the classification threshold, model predictions can be aligned with business goals, ensuring retention strategies target the right customers effectively.
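To make the threshold trade-off concrete, the sketch below sweeps a few thresholds on synthetic data generated with `make_classification`. The data, model settings, and thresholds are assumptions for illustration and do not reproduce the project's results. Lowering the threshold can only add positive predictions, so recall never decreases, while precision usually falls.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.5, 0.4, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(threshold, results[threshold])
```

In practice the threshold would be chosen on a validation split rather than the test set, so the reported test metrics stay unbiased.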


In [None]:
# Comparison of logistic regression, random forest with threshold, and decision tree,
# using the churn-class metrics reported in the model comparison table
comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest (Thresh=0.3)', 'Decision Tree'],
    'Accuracy': [0.75, 0.92, 0.90],
    'Precision': [0.33, 0.70, 0.65],
    'Recall': [0.70, 0.78, 0.72],
    'F1 Score': [0.44, 0.74, 0.69]
})
print(comparison_df)

comparison_df.set_index('Model').plot.bar(figsize=(10,6))
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.ylim(0,1)
plt.xticks(rotation=0)
plt.legend(loc='lower right')
plt.show()


## Conclusion

After evaluating Logistic Regression, Decision Tree, Random Forest, and Random Forest with threshold adjustment, the **Random Forest with a threshold of 0.3** performed best. It offers the best balance between recall (0.78) and precision (0.70) for churners, giving the highest F1-score (0.74). This makes it the most effective model for detecting potential churners while controlling false positives, in line with the business goal of proactive customer retention.

## Recommendations

1. **Deploy the Random Forest (Threshold 0.3) model** for churn prediction.  
2. **Focus retention efforts on high-risk customers** identified by the model, prioritizing those with high predicted churn probability.  
3. **Monitor model performance regularly** and retrain with new customer data to maintain accuracy.  
4. **Adjust thresholds as needed** to optimize recall or precision based on changing business priorities or retention budgets.  
5. **Use insights from key features** (e.g., `customer service calls`, `international plan`, `total day minutes`) to guide targeted marketing and retention campaigns.  
6. **Address class imbalance in future modeling** by exploring oversampling or undersampling alongside the class weighting already used, to improve churn detection.
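One simple way to prioritize high-risk customers under a fixed retention budget is to rank them by predicted churn probability and contact the top few. The sketch below uses hypothetical probabilities and a hypothetical budget of three offers.

```python
import numpy as np

# Hypothetical churn probabilities for six customers (illustrative values only)
churn_proba = np.array([0.10, 0.80, 0.35, 0.92, 0.05, 0.60])
budget = 3  # assumed number of retention offers available

# Sort customer indices from highest to lowest predicted risk
ranked = np.argsort(churn_proba)[::-1]
targeted = ranked[:budget]

print(targeted)               # indices of the customers to contact first
print(churn_proba[targeted])  # their predicted churn probabilities
```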
