# Sprint 17: Telecom Churn Forecasting Project

# Overview

The goal of this project is to forecast customer churn for the telecom operator Interconnect. Customers predicted to churn will be targeted with retention offers. The project involves building a churn model, evaluating it with AUC-ROC as the primary metric (target: AUC-ROC ≥ 0.88) and accuracy as a secondary metric, and tuning it to improve both.
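
For reference, AUC-ROC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below spells out that definition in plain Python. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the project data; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def pairwise_auc(y_true, y_score):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The loop is quadratic in the number of
    rows, so this is only meant to illustrate the definition on small inputs.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example with made-up labels (1 = churned) and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
print(f"AUC-ROC: {auc:.2f}")
```

Because the metric depends only on how customers are ranked, it is unaffected by the 0.5 decision threshold that accuracy uses.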

---

## Data Exploration and Preprocessing

### Data Loading
Let's start by loading the datasets from `/datasets/final_provider/`.

In [1]:
pip install --user imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
import warnings
from datetime import datetime

import lightgbm as lgb
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

warnings.filterwarnings('ignore')




In [3]:
# File paths for datasets
contract_path = '/datasets/final_provider/contract.csv'
personal_path = '/datasets/final_provider/personal.csv'
internet_path = '/datasets/final_provider/internet.csv'
phone_path = '/datasets/final_provider/phone.csv'

# Load datasets
contract = pd.read_csv(contract_path)
personal = pd.read_csv(personal_path)
internet = pd.read_csv(internet_path)
phone = pd.read_csv(phone_path)

# Display the first few rows of each dataset
display(contract.head(), personal.head(), internet.head(), phone.head())

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Fiber optic,No,No,No,No,No,No


Unnamed: 0,customerID,MultipleLines
0,5575-GNVDE,No
1,3668-QPYBK,No
2,9237-HQITU,No
3,9305-CDSKC,Yes
4,1452-KIOVK,Yes


### Data Overview

In [4]:
# Checking data structure
print("Contract Data Info:")
contract.info()
print()
print("\nPersonal Data Info:")
personal.info()
print()
print("\nInternet Data Info:")
internet.info()
print()
print("\nPhone Data Info:")
phone.info()
print()

# Checking for duplicates
print(f"Contract duplicates: {contract.duplicated().sum()}")
print(f"Personal duplicates: {personal.duplicated().sum()}")
print(f"Internet duplicates: {internet.duplicated().sum()}")
print(f"Phone duplicates: {phone.duplicated().sum()}")


Contract Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB


Personal Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Pa

### Data Cleaning

In [5]:
# Checking for missing values
print("Missing values in contract data:")
print(contract.isna().sum())

print("Missing values in personal data:")
print(personal.isna().sum())

print("Missing values in internet data:")
print(internet.isna().sum())

print("Missing values in phone data:")
print(phone.isna().sum())

# The service tables have no NaNs of their own, so these calls are no-ops here;
# the real gaps appear after the left merge below and are filled in In [7]
internet = internet.fillna('No')
phone = phone.fillna('No')

# Merge datasets on 'customerID'
merged_data = contract.merge(personal, on='customerID', how='left')\
                      .merge(internet, on='customerID', how='left')\
                      .merge(phone, on='customerID', how='left')

# Display the merged dataset
display(merged_data.head())


Missing values in contract data:
customerID          0
BeginDate           0
EndDate             0
Type                0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
dtype: int64
Missing values in personal data:
customerID       0
gender           0
SeniorCitizen    0
Partner          0
Dependents       0
dtype: int64
Missing values in internet data:
customerID          0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
dtype: int64
Missing values in phone data:
customerID       0
MultipleLines    0
dtype: int64


Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


### Data Inspection

In [6]:
# Inspect the merged data
merged_data.info()
print()

# Check for any remaining missing values
print("Missing values in merged data:")
print(merged_data.isna().sum())
print()

# Descriptive statistics
print("Descriptive statistics for numeric features:")
display(merged_data.describe())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   gender            7043 non-null   object 
 9   SeniorCitizen     7043 non-null   int64  
 10  Partner           7043 non-null   object 
 11  Dependents        7043 non-null   object 
 12  InternetService   5517 non-null   object 
 13  OnlineSecurity    5517 non-null   object 
 14  OnlineBackup      5517 non-null   object 
 15  DeviceProtection  5517 non-null   object 
 16  TechSupport       5517 non-null   object 


Unnamed: 0,MonthlyCharges,SeniorCitizen
count,7043.0,7043.0
mean,64.761692,0.162147
std,30.090047,0.368612
min,18.25,0.0
25%,35.5,0.0
50%,70.35,0.0
75%,89.85,0.0
max,118.75,1.0


In [7]:
# Fill missing values for service-related columns with 'No'
service_columns = ['InternetService', 'OnlineSecurity', 'OnlineBackup', 
                   'DeviceProtection', 'TechSupport', 'StreamingTV', 
                   'StreamingMovies', 'MultipleLines']
merged_data[service_columns] = merged_data[service_columns].fillna('No')

# info() prints its report and returns None, so call it directly
merged_data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   gender            7043 non-null   object 
 9   SeniorCitizen     7043 non-null   int64  
 10  Partner           7043 non-null   object 
 11  Dependents        7043 non-null   object 
 12  InternetService   7043 non-null   object 
 13  OnlineSecurity    7043 non-null   object 
 14  OnlineBackup      7043 non-null   object 
 15  DeviceProtection  7043 non-null   object 
 16  TechSupport       7043 non-null   object 

## Feature Engineering

In [8]:
# Convert 'BeginDate' and 'EndDate' to datetime. An 'EndDate' of 'No' marks an
# active customer, so it is replaced with the data snapshot date (2020-02-01).
merged_data['BeginDate'] = pd.to_datetime(merged_data['BeginDate'])
today = datetime(2020, 2, 1)
merged_data['EndDate'] = pd.to_datetime(merged_data['EndDate'], errors='coerce')
merged_data['EndDate'] = merged_data['EndDate'].fillna(today)

# Approximate contract duration in months (whole 30-day periods)
merged_data['ContractDuration'] = (merged_data['EndDate'] - merged_data['BeginDate']).dt.days // 30


In [9]:
# Binary service columns (Yes/No). 'InternetService' is left out because it holds
# 'DSL' / 'Fiber optic' / 'No' and is one-hot encoded in a later cell.
binary_service_columns = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                          'TechSupport', 'StreamingTV', 'StreamingMovies',
                          'MultipleLines']

# Convert 'Yes'/'No' to 1/0
merged_data[binary_service_columns] = merged_data[binary_service_columns].replace({'Yes': 1, 'No': 0})

# Create a new feature 'TotalServices' counting the add-on services subscribed to
merged_data['TotalServices'] = merged_data[binary_service_columns].sum(axis=1)


In [10]:
# Convert 'TotalCharges' to numeric; entries that cannot be parsed become NaN
merged_data['TotalCharges'] = pd.to_numeric(merged_data['TotalCharges'], errors='coerce')

# Average monthly charges over the contract. A zero-month duration is treated as
# missing so the division cannot produce inf; missing results are then set to 0.
contract_months = merged_data['ContractDuration'].replace(0, np.nan)
merged_data['AvgMonthlyCharges'] = (merged_data['TotalCharges'] / contract_months).fillna(0)


In [11]:
# One-hot encoding categorical variables like 'PaymentMethod', 'Type', 'InternetService'
merged_data = pd.get_dummies(merged_data, columns=['PaymentMethod', 'Type', 'InternetService'], drop_first=True)

# Convert binary categorical variables like 'PaperlessBilling', 'Partner', 'Dependents', 'gender' to 1/0
binary_columns = ['PaperlessBilling', 'Partner', 'Dependents', 'gender']
merged_data['gender'] = merged_data['gender'].replace({'Male': 1, 'Female': 0})
merged_data[binary_columns] = merged_data[binary_columns].replace({'Yes': 1, 'No': 0})


In [12]:
# Define the target variable 'Churn'
# (1 = contract ended before the snapshot date, 0 = still active)
merged_data['Churn'] = (merged_data['EndDate'] != today).astype(int)


In [13]:
# Check the structure and types of the dataset after feature engineering
merged_data.info()

# Preview the first few rows of the dataset
print(merged_data.head())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 28 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   customerID                             7043 non-null   object        
 1   BeginDate                              7043 non-null   datetime64[ns]
 2   EndDate                                7043 non-null   datetime64[ns]
 3   PaperlessBilling                       7043 non-null   int64         
 4   MonthlyCharges                         7043 non-null   float64       
 5   TotalCharges                           7032 non-null   float64       
 6   gender                                 7043 non-null   int64         
 7   SeniorCitizen                          7043 non-null   int64         
 8   Partner                                7043 non-null   int64         
 9   Dependents                             7043 non-null   int64   

In [14]:
# Check the distribution of the target variable 'Churn'
print(merged_data['Churn'].value_counts())


0    5174
1    1869
Name: Churn, dtype: int64


## Data Splitting

In [15]:
# Fill missing values in the 'TotalCharges' column with 0
merged_data['TotalCharges'] = merged_data['TotalCharges'].fillna(0)

# Verify that there are no more missing values in 'TotalCharges'
print(merged_data['TotalCharges'].isna().sum())  # Should print 0 if all NaNs are filled


0


In [16]:
# Features (X) and Target (y)
X = merged_data.drop(columns=['customerID', 'BeginDate', 'EndDate', 'Churn'])  # Features
y = merged_data['Churn']  # Target variable

# Split data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

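# Apply SMOTE to balance the classes in the training set.
# Note: the models in the following cells are fit on X_train_scaled / y_train,
# so these resampled arrays are not used downstream as written.
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)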

# Check the new class distribution after SMOTE
from collections import Counter
print(f"Class distribution before SMOTE: {Counter(y_train)}")
print(f"Class distribution after SMOTE: {Counter(y_train_resampled)}")

Class distribution before SMOTE: Counter({0: 4139, 1: 1495})
Class distribution after SMOTE: Counter({0: 4139, 1: 4139})


In [17]:
# Scaling features (optional for certain models)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [18]:
# Check for NaNs in both X_train and X_test
print("NaN values in X_train:", X_train.isna().sum().sum())
print("NaN values in X_test:", X_test.isna().sum().sum())

# Check for infinite values in both X_train and X_test
print("Infinite values in X_train:", np.isinf(X_train).sum().sum())
print("Infinite values in X_test:", np.isinf(X_test).sum().sum())

# Check for very large values in both X_train and X_test
large_value_threshold = 1e10  # Define a threshold for large values (adjust if necessary)
print("Large values in X_train:", (X_train > large_value_threshold).sum().sum())
print("Large values in X_test:", (X_test > large_value_threshold).sum().sum())


NaN values in X_train: 0
NaN values in X_test: 0
Infinite values in X_train: 0
Infinite values in X_test: 0
Large values in X_train: 0
Large values in X_test: 0


In [19]:
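# Fill any NaNs with the training-set median of each column; the same statistics
# are applied to the test set so no test information leaks into preprocessing
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)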

# Confirm no NaNs or infinite values are left
print("NaN values in X_train after handling:", X_train.isna().sum().sum())
print("NaN values in X_test after handling:", X_test.isna().sum().sum())

NaN values in X_train after handling: 0
NaN values in X_test after handling: 0
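
The scaling step above fits `StandardScaler` once on the whole training split, so during cross-validation each validation fold has already influenced the scaler. One way to avoid that, sketched here under assumptions, is to put the scaler and the model in a single scikit-learn pipeline so the scaler is refit inside every fold. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `cv_auc` are stand-ins for illustration, not the project data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (roughly 27% positives)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.73, 0.27], random_state=42
)

# The scaler is fit on the training part of each fold only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One AUC-ROC value per fold
cv_auc = cross_val_score(pipeline, X_demo, y_demo, scoring="roc_auc", cv=5)
print(cv_auc.round(3), cv_auc.mean().round(3))
```

The same pipeline object can be passed to `GridSearchCV` in place of a bare estimator, with parameter names prefixed by the step name (for example `logisticregression__C`).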


## Model Training

In [20]:
# Logistic Regression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_scaled, y_train)

# Evaluate Logistic Regression
y_pred_proba_logreg = logreg.predict_proba(X_test_scaled)[:, 1]
logreg_auc = roc_auc_score(y_test, y_pred_proba_logreg)
logreg_accuracy = accuracy_score(y_test, logreg.predict(X_test_scaled))

print(f"Logistic Regression AUC-ROC: {logreg_auc:.3f}")
print(f"Logistic Regression Accuracy: {logreg_accuracy:.3f}")


Logistic Regression AUC-ROC: 0.842
Logistic Regression Accuracy: 0.807


In [21]:
# Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf_grid = GridSearchCV(rf, param_grid=rf_param_grid, scoring='roc_auc', cv=5)
rf_grid.fit(X_train_scaled, y_train)

# Evaluate Random Forest
rf_best = rf_grid.best_estimator_
y_pred_proba_rf = rf_best.predict_proba(X_test_scaled)[:, 1]
rf_auc = roc_auc_score(y_test, y_pred_proba_rf)
rf_accuracy = accuracy_score(y_test, rf_best.predict(X_test_scaled))

print(f"Random Forest AUC-ROC: {rf_auc:.3f}")
print(f"Random Forest Accuracy: {rf_accuracy:.3f}")

Random Forest AUC-ROC: 0.843
Random Forest Accuracy: 0.803


In [22]:
# XGBoost Classifier
xgb = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
xgb_param_grid = {
    'n_estimators': [10, 20],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}

xgb_grid = GridSearchCV(xgb, param_grid=xgb_param_grid, scoring='roc_auc', cv=5)
xgb_grid.fit(X_train_scaled, y_train)

# Evaluate XGBoost
xgb_best = xgb_grid.best_estimator_
y_pred_proba_xgb = xgb_best.predict_proba(X_test_scaled)[:, 1]
xgb_auc = roc_auc_score(y_test, y_pred_proba_xgb)
xgb_accuracy = accuracy_score(y_test, xgb_best.predict(X_test_scaled))

print(f"XGBoost AUC-ROC: {xgb_auc:.3f}")
print(f"XGBoost Accuracy: {xgb_accuracy:.3f}")

XGBoost AUC-ROC: 0.842
XGBoost Accuracy: 0.796


In [23]:
# LightGBM Classifier
lgb_model = lgb.LGBMClassifier(random_state=42)
lgb_param_grid = {
    'n_estimators': [10, 20],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}

lgb_grid = GridSearchCV(lgb_model, param_grid=lgb_param_grid, scoring='roc_auc', cv=5)
lgb_grid.fit(X_train_scaled, y_train)

# Evaluate LightGBM
lgb_best = lgb_grid.best_estimator_
y_pred_proba_lgb = lgb_best.predict_proba(X_test_scaled)[:, 1]
lgb_auc = roc_auc_score(y_test, y_pred_proba_lgb)
lgb_accuracy = accuracy_score(y_test, lgb_best.predict(X_test_scaled))

print(f"LightGBM AUC-ROC: {lgb_auc:.3f}")
print(f"LightGBM Accuracy: {lgb_accuracy:.3f}")

LightGBM AUC-ROC: 0.843
LightGBM Accuracy: 0.794


In [24]:
# Ensemble Model using Voting Classifier
ensemble_model = VotingClassifier(estimators=[
    ('logreg', logreg),
    ('rf', rf_best),
    ('xgb', xgb_best),
    ('lgb', lgb_best)
], voting='soft')

ensemble_model.fit(X_train_scaled, y_train)

# Evaluate Ensemble Model
y_pred_proba_ensemble = ensemble_model.predict_proba(X_test_scaled)[:, 1]
ensemble_auc = roc_auc_score(y_test, y_pred_proba_ensemble)
ensemble_accuracy = accuracy_score(y_test, ensemble_model.predict(X_test_scaled))

print(f"Ensemble AUC-ROC: {ensemble_auc:.3f}")
print(f"Ensemble Accuracy: {ensemble_accuracy:.3f}")


Ensemble AUC-ROC: 0.846
Ensemble Accuracy: 0.801


## Model Evaluation

In [25]:
# Refit the soft-voting ensemble on the unscaled features (X_train / X_test);
# In [24] fit the same ensemble on the scaled arrays
ensemble_model = VotingClassifier(estimators=[
    ('logreg', logreg),
    ('rf', rf_best),
    ('xgb', xgb_best),
    ('lgb', lgb_best)
], voting='soft')

# Train ensemble model
ensemble_model.fit(X_train, y_train)

# Evaluate ensemble model
y_pred_proba_ensemble = ensemble_model.predict_proba(X_test)[:, 1]
ensemble_auc = roc_auc_score(y_test, y_pred_proba_ensemble)
ensemble_accuracy = accuracy_score(y_test, ensemble_model.predict(X_test))

# Print model evaluation results
print(f"Logistic Regression AUC-ROC: {logreg_auc:.3f}, Accuracy: {logreg_accuracy:.3f}")
print(f"Random Forest AUC-ROC: {rf_auc:.3f}, Accuracy: {rf_accuracy:.3f}")
print(f"XGBoost AUC-ROC: {xgb_auc:.3f}, Accuracy: {xgb_accuracy:.3f}")
print(f"LightGBM AUC-ROC: {lgb_auc:.3f}, Accuracy: {lgb_accuracy:.3f}")
print(f"Ensemble AUC-ROC: {ensemble_auc:.3f}, Accuracy: {ensemble_accuracy:.3f}")


Logistic Regression AUC-ROC: 0.842, Accuracy: 0.807
Random Forest AUC-ROC: 0.843, Accuracy: 0.803
XGBoost AUC-ROC: 0.842, Accuracy: 0.796
LightGBM AUC-ROC: 0.843, Accuracy: 0.794
Ensemble AUC-ROC: 0.846, Accuracy: 0.798


In [26]:
# Store models and their AUC-ROC scores in a dictionary
model_auc_scores = {
    'Logistic Regression': logreg_auc,
    'Random Forest': rf_auc,
    'XGBoost': xgb_auc,
    'LightGBM': lgb_auc,
    'Ensemble Model': ensemble_auc
}

# Find the model with the highest AUC-ROC score
best_model_name = max(model_auc_scores, key=model_auc_scores.get)
best_model_auc = model_auc_scores[best_model_name]

print(f"The model with the highest AUC-ROC is: {best_model_name} with AUC-ROC: {best_model_auc:.3f}")


The model with the highest AUC-ROC is: Ensemble Model with AUC-ROC: 0.846


HELP

I am unable to get a model with an AUC-ROC above 0.87. I tried hyperparameter tuning, adding interaction terms, SMOTE, ensemble methods, and cross-validation, but every attempt scored below 0.88, and some scored lower than the original models. Any ideas on what the next step could be?

## Conclusion

### Project Summary

The goal of this project was to forecast customer churn for the telecom operator **Interconnect**. Churn prediction is crucial for retaining valuable customers and proactively addressing issues before they leave. To achieve this goal, we used a variety of machine learning models, including **Logistic Regression**, **Random Forest**, **XGBoost**, **LightGBM**, and an ensemble model to combine their strengths.

### Feature Engineering

We created several new features, including:
- **Contract Duration**: Representing how long the customer has been with the company.
- **Total Services**: The total number of services the customer has subscribed to.
- **Interaction Features**: Features that represent interactions between important variables, such as `ContractDuration × MonthlyCharges`.
- **Average Monthly Spending**: Calculated to normalize the customer's total charges over their tenure.
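
The features listed above can be sketched on a few made-up rows as follows. The column name `Duration_x_Monthly` is a hypothetical label for the interaction term, and the values are invented for illustration; only the `BeginDate`, `EndDate`, `MonthlyCharges`, and `TotalCharges` column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) using the column names from this notebook
toy = pd.DataFrame({
    "BeginDate": pd.to_datetime(["2019-02-01", "2019-10-01", "2020-02-01"]),
    "EndDate": pd.to_datetime(["2020-02-01", "2019-12-01", "2020-02-01"]),
    "MonthlyCharges": [50.0, 53.85, 20.0],
    "TotalCharges": [600.0, 108.15, np.nan],
})

# Contract duration in whole 30-day months, as in the notebook
toy["ContractDuration"] = (toy["EndDate"] - toy["BeginDate"]).dt.days // 30

# Interaction feature: contract duration x monthly charge (hypothetical name)
toy["Duration_x_Monthly"] = toy["ContractDuration"] * toy["MonthlyCharges"]

# Average monthly spending; a zero duration becomes NaN first so the division
# cannot produce inf, then missing results are set to 0
safe_duration = toy["ContractDuration"].replace(0, np.nan)
toy["AvgMonthlyCharges"] = (toy["TotalCharges"] / safe_duration).fillna(0)

print(toy[["ContractDuration", "Duration_x_Monthly", "AvgMonthlyCharges"]])
```

The third row models a customer who joined on the snapshot date: the duration is 0 months and the total charge is missing, so the average falls back to 0.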

### Model Performance

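To address the imbalanced nature of the dataset, **SMOTE (Synthetic Minority Oversampling Technique)** was applied to the training data, giving 4,139 rows per class. The scores reported below, however, come from models fit on the original scaled training split rather than on the resampled one. Hyperparameters for Random Forest, XGBoost, and LightGBM were tuned with **GridSearchCV** using 5-fold cross-validation scored on AUC-ROC.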

The following AUC-ROC scores were achieved by each model:

- **Logistic Regression**: AUC-ROC = `0.842`
- **Random Forest**: AUC-ROC = `0.843`
- **XGBoost**: AUC-ROC = `0.842`
- **LightGBM**: AUC-ROC = `0.843`
- **Ensemble Model**: AUC-ROC = `0.846`

Among these, the **Ensemble Model** provided the highest AUC-ROC score, though only by a small margin (0.846 versus 0.842 to 0.843 for the individual models), so combining the models gave a modest improvement in predictive performance.
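
The ensemble uses soft voting, which averages the churn probabilities predicted by each model. A minimal sketch of that idea in plain Python is shown below; the helper name `soft_vote` and the probabilities are made up for illustration, and equal model weights are assumed.

```python
def soft_vote(probabilities_by_model):
    """Average per-customer churn probabilities across models (equal weights)."""
    n_models = len(probabilities_by_model)
    per_customer = zip(*probabilities_by_model.values())
    return [sum(column) / n_models for column in per_customer]


# Made-up churn probabilities for three customers from three models
model_probs = {
    "logreg": [0.9, 0.2, 0.6],
    "rf": [0.7, 0.4, 0.4],
    "xgb": [0.8, 0.3, 0.2],
}

ensemble_probs = soft_vote(model_probs)

# Predict churn when the averaged probability exceeds 0.5
ensemble_labels = [int(p > 0.5) for p in ensemble_probs]
print(ensemble_probs, ensemble_labels)
```

For the third customer the models disagree (0.6, 0.4, 0.2), and the average of 0.4 settles the prediction as "no churn". When the individual models already rank customers similarly, averaging changes the ranking very little, which is consistent with the small gain reported here.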

### Key Findings

1. **Contract Duration** and **Total Services** were among the most important features in predicting churn. Customers with longer contract durations and more services were generally less likely to churn.
2. Customers with **month-to-month contracts** had a significantly higher likelihood of churn compared to those with longer-term contracts (e.g., one-year or two-year contracts). This finding suggests that offering incentives for customers to switch to longer-term contracts could help reduce churn.
3. **Service-related features**, such as whether customers had **online security**, **tech support**, or **device protection**, were also highly predictive. Customers subscribing to multiple services were generally less likely to churn.
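
A quick way to check a finding such as the contract-type effect is to compare observed churn rates per group. The sketch below does this on a made-up frame; in the notebook it would have to run before `Type` is one-hot encoded, and the toy values here are not project results.

```python
import pandas as pd

# Made-up rows: contract type and churn flag (1 = churned)
toy = pd.DataFrame({
    "Type": ["Month-to-month", "Month-to-month", "Month-to-month",
             "One year", "One year", "Two year"],
    "Churn": [1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 flag is the churn rate; sort so the riskiest group comes first
churn_rate_by_type = (
    toy.groupby("Type")["Churn"].mean().sort_values(ascending=False)
)
print(churn_rate_by_type)
```

The same pattern applies to any categorical or binary column, for example `TechSupport` or `OnlineSecurity`.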

### Recommendations

Based on the analysis, the following recommendations can be made to reduce churn:

- **Offer incentives for long-term contracts**: Customers with month-to-month contracts are more likely to churn, so offering discounts or perks for switching to one-year or two-year contracts may help retain them.
- **Promote value-added services**: Services like **online security**, **tech support**, and **device protection** are associated with lower churn rates. Encouraging customers to subscribe to these services can help reduce churn.
- **Monitor high-spending customers**: Customers who spend more on monthly charges but have a shorter contract duration should be closely monitored, as they may be at a higher risk of churn.
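
The last recommendation can be turned into a simple watch list. The sketch below is one possible rule, not a tested policy: the function name, the 80.0 monthly-charge cutoff, and the 6-month cutoff are all hypothetical and would need to be chosen with the business.

```python
def flag_high_spend_short_tenure(customers, min_monthly_charge=80.0,
                                 max_duration_months=6):
    """Return IDs of customers who pay at least `min_monthly_charge` per month
    and whose contract has lasted at most `max_duration_months` months."""
    return [
        c["customerID"]
        for c in customers
        if c["MonthlyCharges"] >= min_monthly_charge
        and c["ContractDuration"] <= max_duration_months
    ]


# Made-up customers for illustration
toy_customers = [
    {"customerID": "toy-001", "MonthlyCharges": 95.0, "ContractDuration": 3},
    {"customerID": "toy-002", "MonthlyCharges": 95.0, "ContractDuration": 40},
    {"customerID": "toy-003", "MonthlyCharges": 25.0, "ContractDuration": 2},
]

watch_list = flag_high_spend_short_tenure(toy_customers)
print(watch_list)
```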

### Final Thoughts

This project demonstrated that machine learning models can predict customer churn with reasonable discrimination. Combining **feature engineering**, **oversampling techniques**, and **model tuning**, the best model (the soft-voting ensemble) reached an AUC-ROC of 0.846 on the test set, below the 0.88 target set in the work plan. The findings above can still inform marketing and retention strategies aimed at improving customer satisfaction and reducing churn.


# Proposed Work Plan for Telecom Churn Forecasting Project

# 1 Data Exploration and Preprocessing
- **Load and inspect the data**: Load all relevant datasets (contract, personal, internet, and phone) and understand their structure (data types, missing values, duplicates).
- **Clean the data**: Address missing values by filling in where appropriate (e.g., assuming 'No' for missing service data) and removing duplicates.
- **Merge the datasets**: Combine all the datasets into a single DataFrame using customerID as the key.
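
The merge step can be sketched with two small made-up tables. This example assumes each `customerID` appears at most once per table, which `validate="one_to_one"` checks, and uses `indicator=True` to count customers with no row in the service table. The frame names and values are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the contract and internet tables
toy_contract = pd.DataFrame({
    "customerID": ["toy-A", "toy-B", "toy-C"],
    "Type": ["Month-to-month", "One year", "Two year"],
})
toy_internet = pd.DataFrame({
    "customerID": ["toy-A", "toy-C"],
    "OnlineSecurity": ["Yes", "No"],
})

# Left join keeps every contract row; validate raises if an ID is duplicated
merged = toy_contract.merge(
    toy_internet, on="customerID", how="left",
    validate="one_to_one", indicator=True,
)

# Customers missing from the service table are assumed not to have the service
n_without_internet = (merged["_merge"] == "left_only").sum()
merged["OnlineSecurity"] = merged["OnlineSecurity"].fillna("No")

print(merged.drop(columns="_merge"))
print("Customers without an internet row:", n_without_internet)
```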

# 2 Feature Engineering
- **Create new features**: Generate relevant features like the total number of services subscribed by the customer, contract duration, monthly charges, and tenure.
- **Convert categorical data**: Use one-hot encoding or label encoding to convert categorical features into numerical form.

# 3 Data Splitting
- **Define target variable**: Create a target variable that identifies customer churn (i.e., EndDate != 'No').
- **Split data**: Use the `train_test_split()` function to divide the data into training and test sets.

# 4 Model Training
- **Baseline Model**: Train a Logistic Regression model to establish a baseline performance using the AUC-ROC and accuracy metrics.
- **Advanced Models**: Train additional models like Random Forest, XGBoost, and Ensemble models. Use hyperparameter tuning techniques (e.g., GridSearchCV) to optimize model performance.

# 5 Model Evaluation
- **Evaluate the models**: Use the AUC-ROC score as the primary evaluation metric. Also, calculate the accuracy of the models on the test set.
- **Compare the results**: Identify the model with the highest AUC-ROC score and assess whether it meets the performance criteria (AUC-ROC ≥ 0.88).

# 6 Final Model Deployment
- **Deploy the best model**: Once the best-performing model is identified, evaluate its final performance on the test set and document the results.
- **Report findings**: Provide a summary of the data insights and model performance, along with recommendations for using the model in churn prevention campaigns.

# 7 Clarifying Questions for Stakeholders
- What is the expected churn rate threshold for which interventions should be applied?
- Should the model prioritize minimizing false negatives (i.e., failing to detect churn)?
- Are there any specific business constraints for handling missing data in the `PaymentMethod` or service fields?
- Can customers suspend their contracts temporarily, and how should we handle these cases in our analysis?
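
If stakeholders do want to prioritize catching churners (fewer false negatives), one option is to lower the decision threshold until a target recall is met. The sketch below illustrates that under assumptions: the function name `threshold_for_recall`, the target values, and the toy labels and scores are all hypothetical.

```python
def threshold_for_recall(y_true, y_score, target_recall):
    """Return the highest score threshold whose recall on churners meets the target.

    A customer is predicted to churn when their score is >= the threshold.
    """
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("No churners in y_true, so recall is undefined")

    # Try thresholds from strictest to most lenient
    for threshold in sorted(set(y_score), reverse=True):
        caught = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        if caught / positives >= target_recall:
            return threshold
    raise ValueError("target_recall cannot be reached; it must be at most 1.0")


# Made-up labels (1 = churned) and model scores
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.2]

moderate = threshold_for_recall(labels, scores, target_recall=0.6)
strict = threshold_for_recall(labels, scores, target_recall=1.0)
print(moderate, strict)
```

Lowering the threshold raises recall at the cost of more false positives, so the right target depends on the cost of a retention offer relative to a lost customer. In practice the threshold should be chosen on validation data, not on the test set.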
