
# Telco Customer Churn Prediction

Churn prediction is a critical aspect for businesses, especially in the subscription-based sector, to retain their customers and enhance profitability. In this notebook, we will be analyzing a dataset of telco customers and developing a predictive model to understand the key factors that influence customer churn. 

## Objectives
- Perform exploratory data analysis (EDA) to understand the data distribution and key characteristics.
- Preprocess the data including feature selection, transformation, and scaling.
- Train several machine learning models and select the one with the best performance.
- Interpret the results to understand the most important features driving churn.



# Step 1: Importing Essential Libraries

In this step, we import the essential libraries required for data manipulation, visualization, and model building.


In [2]:
# Importing Essential Libraries
# Import the core libraries for data manipulation (pandas), numerical computation (numpy), 
# and visualization (matplotlib, seaborn) that we will need for EDA and model building.
# Import others libraries for processing and moduling 

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from sklearn.cluster import KMeans
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import  StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, recall_score, precision_score, ConfusionMatrixDisplay, confusion_matrix, classification_report, f1_score
from sklearn.model_selection import RandomizedSearchCV
import joblib


# Setting global plot style for consistency
sns.set(style="whitegrid")


# Step 2: Data Loading and Initial Exploration

Here, we load the Telco Customer Churn dataset and conduct initial exploration to understand the data types, number of columns, and any missing values.


In [3]:
# Data Loading
# Loading the dataset into a pandas DataFrame to start our data analysis and model building.
data = pd.read_excel("Telco_customer_churn.xlsx")

# Basic Data Exploration
# Let's explore the structure of our dataset to understand the types of features available.
data.info()  # Displays information about columns and data types

# Checking for missing values
print("Missing values per column:")
print(data.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Count              7043 non-null   int64  
 2   Country            7043 non-null   object 
 3   State              7043 non-null   object 
 4   City               7043 non-null   object 
 5   Zip Code           7043 non-null   int64  
 6   Lat Long           7043 non-null   object 
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object 
 10  Senior Citizen     7043 non-null   object 
 11  Partner            7043 non-null   object 
 12  Dependents         7043 non-null   object 
 13  Tenure Months      7043 non-null   int64  
 14  Phone Service      7043 non-null   object 
 15  Multiple Lines     7043 non-null   object 
 16  Internet Service   7043 

Here are some quick comments for the Data Loading and Initial Exploration step:

1. **Dataset Size**: The dataset contains 7,043 entries across 33 columns. This is a moderately sized dataset suitable for typical machine learning models.

2. **Data Types**: The dataset has mixed data types, including numerical (`int64`, `float64`) and categorical (`object`). Some columns like `Total Charges` are of type `object` and might need conversion for numerical analysis.

3. **Non-Null Counts**: Most columns have no missing values, except for `Churn Reason`, which has a large number of missing entries (5,174 out of 7,043). We need to decide how to handle this column—either by imputing, dropping, or analyzing its impact.

4. **Unique Identifiers**: Columns like `CustomerID` are unique identifiers and should be dropped for modeling purposes as they do not contribute to predicting churn.

5. **Redundant Columns**: Some columns such as `Count`, `Lat Long`, `Zip Code`, and `Churn Score` may not provide additional predictive power and could be dropped during data cleaning.

These observations will help guide the feature selection and preprocessing steps.

# Step 3: Data Cleaning and Preprocessing
In this section, we will clean the dataset by handling missing values, dropping unnecessary features, and encoding categorical variables.

In [None]:
# Identify Categorical Columns
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns
print(f"Categorical Columns: {numerical_cols}")

print("\n-------numerical Values---------")
for col in numerical_cols:
    print(f"{col}: {data[col].unique()}")

I will drop ((Count)) variable, because it is has unique value and it does not contain any pattern or relationship with the target variable.

In [None]:
data.drop('Count',axis=1, inplace=True)

I will CLTV (drop Customer Lifetime Value) variable because it is already a prediction based on many variables known to affect churn, including it in your machine learning model introduces target leakage.

In [None]:
data.drop('CLTV',axis=1, inplace=True)

In [None]:
# Compute the correlation matrix for numerical features
correlation_matrix = data.corr()

# Plot a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

#### Key Observations from the Correlation Matrix:
- The Tenure Months is correlated negatively with the target about - 0.35, that indicates the cunstumers  with longer tenure are less likely to churn.
- Monthly Charge is weak corrected positively with the target about 0.19, that indictes the cunstumers with Higher monthly charges slightly increase the likelihood of churn.
- CLTV (Customer Lifetime Value) is  weak corrected negatively with the target about -0.13,that indictes the cunstumers with higher CLTV customers are slightly less likely to churn.
- Churn Score is strong corrected positively with the target about 0.66, however, this is a data leakage risk if Churn Score is derived from or influenced by churn data.
- Zip Code, Latitude, and Longitude:
 -- Zip Code & Latitude: 0.90
 -- Latitude & Longitude: -0.88
 --These high correlations indicate these geographical features are redundant.


I will drop ((Churn Score)) variable because it is already a prediction based on many variables known to affect churn, including it in your machine learning model introduces target leakage.

In [None]:
data.drop('Churn Score', axis=1, inplace=True)

I will drop ip Code, Latitude, and Longitud because they are a redundant atures and are not directly predictive of churn (correlation with Churn Value is close to 0).

In [None]:
data_city = data[['Latitude', 'Longitude']] #I will save this dataset to use it for creation the Feature Engineering for City.
data.drop(['Zip Code', 'Latitude', 'Longitude'], axis=1, inplace=True)

In [None]:
# Identify columns with only one unique value
single_value_cols = [col for col in data.columns if data[col].nunique() == 1]

# Print the columns with a single unique value
print(f"Columns with a single unique value: {single_value_cols}")

I have to drop these variables: ((Count, Country, State)) because they have only one unique value across all rows.
This columns don't add any useful information to the model because they have no variability. they are essentially constant 
for all data points and will not help differentiate between target outcomes.

I will drop ((CustomerID)), because it is an Unique Identifiers and it does not contain any pattern or relationship with the target variable.

In [None]:
# Drop these columns from the dataset
data.drop(single_value_cols, axis=1, inplace=True)
data.drop('CustomerID', axis=1,inplace=True)

In [None]:
# Identify columns with a binary types (type of categorical) and tronsform them to numerical 0 and 1
binary_values_cols_catg = [col for col in data.columns if data[col].nunique() == 2 and data[col].dtype == 'object']
# Print the columns with a binary values
print(f"-- Columns with a binary values: {binary_values_cols_catg}")

print("\n-------Binary Values---------")
for col in binary_values_cols_catg:
    print(f"{col}: {data[col].unique()}")

In [None]:
print('these are the columns that we sill have :', data.columns)


# Step 4: Feature Engineering

We perform feature engineering by encoding categorical variables. Binary features are transformed into numeric values, and one-hot encoding is used for multi-category features.


In [None]:
# Encoding categorical variables into numerical representations is essential for most machine learning algorithms.
# Binary features are mapped to 0 and 1 for simplicity.

binary_cols = ['Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Paperless Billing']
for col in binary_cols:
    data[col] = data[col].map({'Yes': 1, 'No': 0})

# Encoding 'Gender' as a binary variable: Male -> 1, Female -> 0.
data['Gender'] = data['Gender'].map({'Male': 1, 'Female': 0})

# One-Hot Encoding for multi-category features such as Internet Service, Contract, and Payment Method.
# One-hot encoding helps the model to interpret categorical variables effectively by creating binary columns for each category.
data = pd.get_dummies(data, columns=['Internet Service', 'Contract', 'Payment Method'], drop_first=True)
# Feature Engineering - Categorical Encoding
# We convert categorical variables into numerical forms. Binary features are mapped to 0 and 1.
binary_cols = ['Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Paperless Billing']
for col in binary_cols:
    data[col] = data[col].map({'Yes': 1, 'No': 0})

# Encoding 'Gender' into a binary feature: 1 for Male, 0 for Female
data['Gender'] = data['Gender'].map({'Male': 1, 'Female': 0})

# One-Hot Encoding for multi-category features such as Internet Service, Contract, and Payment Method
data = pd.get_dummies(data, columns=['Internet Service', 'Contract', 'Payment Method'], drop_first=True)
#I will drop Churn Label because it is the same of Churn value and it is our target
data.drop('Churn Label', axis=1,inplace=True)
# Transform them to 1/0 values
for col in ['Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Paperless Billing']:
    data[col] = data[col].map({'Yes': 1, 'No': 0})

# Binary encoding for Gender
data['Gender'] = data['Gender'].map({'Male': 1, 'Female': 0})

In [None]:
categorical_cols = data.select_dtypes(include=['object', 'category']).columns
print(f"Categorical Columns: {categorical_cols}")

print("\n-------categorical Values---------")
for col in categorical_cols:
    print(f"{col}: {data[col].nunique()}")

# Step 5: Feature Exploration and Cleaning base for further processing
In this section, we will visualize some of the key features to understand their relationship with the target variable (Churn Value).

In [None]:
# Plot each categorical feature against Churn Value
 for col in categorical_cols:
        plt.figure(figsize=(8, 4))
        sns.countplot(x=col, hue='Churn Value', data=data)
        plt.title(f'Churn Distribution by {col}')
        plt.show()


In [None]:
# Function to plot stacked bar charts
def plot_stacked_bar(column):
    # Create a contingency table
    crosstab = pd.crosstab(data[column], data['Churn Value'], normalize='index')

    # Plot the stacked bar chart
    crosstab.plot(kind='bar', stacked=True, figsize=(8, 4), colormap='coolwarm')
    plt.title(f'Stacked Bar Plot: {column} vs Churn Value')
    plt.xlabel(column)
    plt.ylabel('Proportion')
    plt.legend(title='Churn Value', loc='upper right')
    plt.show()

# Plot stacked bar charts for all categorical columns
for col in categorical_cols:
    plot_stacked_bar(col)

In [None]:
# Perform Chi-Square test for each categorical feature
for col in categorical_cols:
    crosstab = pd.crosstab(data[col], data['Churn Value'])
    chi2, p, dof, expected = chi2_contingency(crosstab)
    
    print(f'{col}: Chi2 = {chi2:.2f}, p = {p:.4f}')
    if p < 0.05:
        print(f"  -> Significant relationship with Churn (p < 0.05)\n")
    else:
        print(f"  -> No significant relationship with Churn (p >= 0.05)\n")

- Multiple Lines (keep it): Customers with multiple lines or no phone service churn at similar rates.Single line customers seem to churn less frequently. Significant relationship with Churn (p < 0.05)
- Internet Service (keep it): Fiber optic customers churn more than DSL customers. Customers with no internet service have minimal churn. Significant relationship with Churn (p < 0.05)
- Online Security & oline Backup  & Device Protection & Tech Support (testing for multicollinearity (correlation between these features) during feature selection because they could be redundant): CMore churn for customers without these services. Again, no internet service customers rarely churn. Significant relationship with Churn (p < 0.05)
- Streaming Movies & Streaming TV (testing for multicollinearity (correlation between these features) during feature selection because they could be redundant) : Observation: Both "Yes" and "No" categories have a similar churn proportion. Customers with no internet service have minimal churn. Significant relationship with Churn (p < 0.05)
- Lat Long (drop it): The Lat Long is highly fragmented with too many unique values, making it hard to derive meaningful patterns from these visualizations. No significant relationship with Churn (p >= 0.05)
- City (keep it): It has a significant relationship with Churn (p < 0.05)
- Total Charges (drop it): The Total Charges values are scattered across a wide range, with no clear pattern in the churn distribution. No significant relationship with Churn (p >= 0.05)
- Payment Method (keep it): Customers paying with Electronic Check show higher churn than those using Bank Transfer, Credit Card, or Mailed Check. Significant relationship with Churn (p < 0.05)
- Contract (keep it): Month-to-Month contracts have higher churn compared to One-Year and Two-Year contracts. One-Year and Two-Year contracts have very low churn rates. Significant relationship with Churn (p < 0.05)
- Churn Reason (drop it): It has a significant relationship with Churn (p < 0.05)

In [None]:
##Churn Reason, Lat Long
data.drop(['Churn Reason', 'Lat Long'], axis=1,inplace=True)

In [None]:
data.select_dtypes(include=['object', 'category']).columns

In [None]:
##Multiple Lines
# Map Multiple Lines to 3 values
data['Multiple Lines'] = data['Multiple Lines'].map({'Yes':1, 'No':0, 'No phone service':0})

In [None]:
##Internet Service, Contract,Payment Method
# Apply one-hot encoding to these categorical features
catg_to_oneHot = ['Internet Service', 'Contract','Payment Method']


- City: I will Cluster Cities Using K-Means Based on Latitude and Longitude with K=5
- I will create a new variable called ((City Cluster)) and delet the old one (City)

In [None]:
##City
# Select Latitude and Longitude columns for clustering
coordinates = data_city

# Perform K-means clustering with an arbitrary number of clusters (e.g., 5)
kmeans = KMeans(n_clusters=5, random_state=42)
data['City Cluster'] = kmeans.fit_predict(coordinates)

# Drop city
data.drop(['City'], axis=1, inplace=True)


# Perform one-hot encoding
data = pd.get_dummies(data, columns=catg_to_oneHot, drop_first=True)

In [None]:
data['City Cluster'].info()

In [None]:
##Online Security & oline Backup & Device Protection & Tech Support
data['Online Security'].value_counts()

In [None]:
# Calculation of Confusion Matrix to see if there is any correlation between these variables

# Map Online Security & oline Backup & Device Protection & Tech Support to 3 values
data['Online Security'] = data['Online Security'].map({'Yes':1, 'No':0, 'No internet service':0})
data['Online Backup'] = data['Online Backup'].map({'Yes':1, 'No':0, 'No internet service':0})
data['Device Protection'] = data['Device Protection'].map({'Yes':1, 'No':0, 'No internet service':0})
data['Tech Support'] = data['Tech Support'].map({'Yes':1, 'No':0, 'No internet service':0})

# Compute the correlation matrix
correlation_matrix = data[['Online Security', 'Online Backup','Device Protection','Tech Support']].corr()

# Display the matrix
print(correlation_matrix)

# Optional: Plot the heatmap for better visualization
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Calculation of Variance Inflation Factor to see if there is any correlation between these variables (Methos statistic)

# Select relevant features
X = data[['Online Security', 'Online Backup','Device Protection','Tech Support']]

# Add a constant (intercept) to the features
X = sm.add_constant(X)

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)


- Based on Confusion Matrix; we got a Moderate Correlation (~0.5 - 0.8): Some overlap, but both may still add value.
- Based on VIF (Variance Inflation Factor); all the variables their VIF between 1 < VIF < 5 and thwy consider as a Moderate correlation so they don't have a strong corelation.
- Decision: i will keep all of them

In [None]:
##Streaming Movies & Streaming TV
data['Streaming Movies'].value_counts()

In [None]:
# Map Streaming Movies and Streaming TV to 3 values
data['Streaming Movies'] = data['Streaming Movies'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
data['Streaming TV'] = data['Streaming TV'].map({'Yes': 1, 'No': 0, 'No internet service': 0})

# Compute the correlation matrix
correlation_matrix = data[['Streaming Movies', 'Streaming TV']].corr()

# Display the matrix
print(correlation_matrix)

# Optional: Plot the heatmap for better visualization

plt.figure(figsize=(5, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix: Streaming Movies & Streaming TV')
plt.show()

High Correlation (> 0.8): Indicates multicollinearity. Both features contain redundant information, and you may want to drop one of them.
- I will drop (('Streaming Movies'))

In [None]:
data.drop('Streaming Movies', axis=1, inplace=True)


# Step 6: Exploratory Data Analysis - Target Variable

In this step, we visualize the distribution of the target variable (`Churn Value`) to understand the class balance, which is crucial for model evaluation.


In [None]:
# Visualizing the distribution of the target variable 'Churn Value' to understand the class balance.
# Understanding class distribution is important for choosing the right evaluation metric and handling class imbalance, if necessary.

plt.figure(figsize=(6, 4))
sns.countplot(x='Churn Value', data=data, palette='Set2')
plt.title('Distribution of Churn Value')
plt.xlabel('Churn Value (0 = No Churn, 1 = Churn)')
plt.ylabel('Count')
plt.show()

In [None]:
# Calculate the proportion of Churn vs Non-Churn
churn_counts = data['Churn Value'].value_counts()

# Plot the pie chart
plt.figure(figsize=(6, 6))
plt.pie(churn_counts, labels=['No Churn', 'Churn'], autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title('Churn vs Non-Churn Proportion')
plt.show()


In [None]:
# Get a summary of the target variable
churn_stats = data['Churn Value'].describe()

print("Statistics for Churn Value:")
print(churn_stats)

- 26.5370 % in the churn rate is means that about 1 in 4 customers in the dataset has churned.
- While not perfectly balanced, the churn rate is within a reasonable range (~25-30%), because most standard machine learning algorithms like Logistic Regression, Random Forest ect, can handle this case.
- If the churn rate low than 20%, then we have to use Smote or other method to make it balanced.
- std = 0.441561 close to 0.5 indicates that the classes are mostly balanced.

 ---> Introduce Smote methode to balance the dataset

In [None]:
#Checking for Missing Values:
print("\nMissing values per column:")
data.isnull().sum()

In [None]:
#Cheking for Outliers:
# Create boxplots for each numerical feature
for feature in data.columns:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=data[feature])
    plt.title(f'Boxplot of {feature}')
    plt.show()

Based on the box plots, we do not have any outliers in the dataset, although some features showed some outliers and after re-examining them, they turned out to be non-outliers.

In [None]:
## replacing blank strings with NULL in `Total Charges`
data['Total Charges'] = data['Total Charges'].apply(lambda x: np.nan if type(x) == str else x)
## confirming NULL values in `Total Charges`
data['Total Charges'].isna().sum()

data = data.dropna()
data['Total Charges'] = data['Total Charges'].astype(float)

In [None]:
## fetching duplicated rows (as a whole) in the dataframe
data.duplicated().sum()

In [None]:
# Fetch the duplicated rows
duplicated_rows = data[data.duplicated()]

# Display duplicated rows
print(duplicated_rows)

In [None]:
# Remove duplicated rows, keeping the first occurrence
data = data.drop_duplicates()


# Step 7:  Applying SMOTE and Splitting the Dataset

- Resampling the dataset using SMOTE oversampling
- We split the data into training and testing sets, typically with an 80/20 split, to evaluate the model's ability to generalize to unseen data.


In [None]:
# We split the data into training and testing sets to evaluate our model's performance on unseen data.
# 80% of the data is used for training, while 20% is held back for testing.
# This helps in assessing the model's generalization capability.

# Define the features (X) and target (y)
X = data.drop(['Churn Value'], axis=1)  # Drop target column from features
y = data['Churn Value']  # Define the target


## creating an instance of SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(k_neighbors=5, random_state=101, sampling_strategy=1)
## resampling the dataset using SMOTE oversampling
X, y = smote.fit_resample(X, y)


# Split the data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the datasets
print(f"Train Set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test Set: X_test={X_test.shape}, y_test={y_test.shape}")


# Step 8: Feature Scaling

Scaling the features ensures that they are on the same scale, which is particularly important for algorithms like SVM, K-Nearest Neighbors, and others sensitive to feature magnitude.


In [None]:
# Many machine learning algorithms, such as Logistic Regression and SVM, benefit from scaled features.
# Here we use StandardScaler to standardize features by removing the mean and scaling to unit variance.

scaler = StandardScaler()
# Fit and transform the training data, and transform the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

# Step 9: Model Training and Evaluation

We will train a Random Forest model to predict customer churn. The performance of the model will be evaluated using metrics such as accuracy, precision, recall, and F1-score.

In [None]:
def create_models(seed=42):
    '''
    Create a list of machine learning models.
            Parameters:
                    seed (integer): random seed of the models
            Returns:
                    models (list): list containing the models
    '''
    models = []
    
    # Add various models
    models.append(('logistic_regression', LogisticRegression(random_state=seed)))
    models.append(('support_vector_machines', SVC(random_state=seed)))
    models.append(('random_forest', RandomForestClassifier(random_state=seed)))
    models.append(('gradient_boosting', GradientBoostingClassifier(random_state=seed)))
    models.append(('k_nearest_neighbors', KNeighborsClassifier()))
    models.append(('naive_bayes', GaussianNB()))
    models.append(('xgboost', XGBClassifier(random_state=seed, use_label_encoder=False, eval_metric='logloss')))
    models.append(('lightgbm', LGBMClassifier()))
    
    return models

# Create a list with all the algorithms we are going to assess
models = create_models()

# Test the accuracy of each model using cross-validation
results = []
names = []
scoring = 'accuracy'

for name, model in models:
    # Perform cross-validation
    score = cross_val_score(model, X_train_scaled, y_train, cv=10, scoring=scoring)
    results.append(score)
    names.append(name)
    # Print classifier accuracy (mean) and standard deviation
    print(f'Classifier: {name}, Accuracy: {score.mean():.4f}, Std: {score.std():.4f}')

# Step 10: Fine Tuning (Hyperparameter Tuning: RandomSearchCV)


In [None]:
## defining a function to print classification report and plot the confusion matrix
def modelPerformance(model, test, result):
    y_pred = model.predict(test)
    print(classification_report(result, y_pred))
    print('Accuracy - ',accuracy_score(result, y_pred))
    print('Recal- ',recall_score(result, y_pred))
    print('Precisio - ',precision_score(result, y_pred))
    print('F1-Score - ',f1_score(result, y_pred))

#### RandomForestClassifier

In [None]:
## setting up the hyperparameters for RandomForestClassifier
paramsRF = {
    "n_estimators": np.linspace(64, 256, 10, dtype = int),
    "criterion": ['gini', 'log_loss'],
    "max_depth": [6, 8, None],
    "max_features": ['sqrt', 'log2', None]
}
## creating the RandomizedSearchCV object
randomRF = RandomizedSearchCV(RandomForestClassifier(random_state=101), paramsRF, n_jobs=4, verbose=1, cv=5, scoring='f1', refit=True)
## fiting the RandomizedSearchCV model to find the best parameters for Random Forest
randomRF.fit(X_train_scaled, y_train)
## fetching the best parameters four
randomRF.best_params_

    
## evaluating the result on validation set
modelPerformance(randomRF, X_test_scaled, y_test)

####  SVM

In [None]:
## setting up the hyperparameters for SVC
paramsSVC = {
    'kernel': ['linear', 'rbf'], 
    'C': [0.1, 1, 10],
    'gamma': ['auto', 'scale']
}
## creating a RandomizedSearchCV object for SVC
randomSVC = RandomizedSearchCV(SVC(random_state=101, max_iter=-1, probability=True), paramsSVC, n_jobs=4, verbose=1, cv=5, scoring = 'f1', refit=True)
## fiting the RandomizedSearchCV model to find the best parameters for Random Forest
randomSVC.fit(X_train_scaled, y_train)
best_randomSVC_model = randomSVC.best_estimator_
    
## evaluating the result on validation set
modelPerformance(best_randomSVC_model, X_test_scaled, y_test)

#### XGBoostClassifier 

In [None]:
## setting the hyperparameters for XGBoost 
paramsXG = {
    "n_estimators": np.linspace(64, 256, 10, dtype = int),
    "eta": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [6, 8, None],
    "subsample": [0.5, 0.8, 1.0]    
}

## creating the RandomizedSearchCV object for XGBoostClassifier
randomXGB = RandomizedSearchCV(XGBClassifier(seed=101, eval_metric='logloss', objective='binary:logistic'), paramsXG, n_jobs=4, verbose=1, cv=5, scoring='f1', refit=True)
## fiting the model
randomXGB.fit(X_train_scaled, y_train)
best_randomXGB_model = randomXGB.best_estimator_
    
## evaluating the result on validation set
modelPerformance(best_randomXGB_model, X_test_scaled, y_test)

### LGBMClassifier

In [None]:
# Define the parameter grid
param_grid = {
    'num_leaves': [20, 30, 40, 50],
    'max_depth': [10, 15, 20, 25],
    'learning_rate': [0.01, 0.05, 0.1, 0.15],
    'n_estimators': [100, 200, 300, 400],
    'min_child_samples': [10, 20, 30],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.5, 1],
    'reg_lambda': [0, 0.1, 0.5, 1]
}

# Initialize the LGBMClassifier
lgbm_model = LGBMClassifier()

# Initialize RandomizedSearchCV
randomlgbm = RandomizedSearchCV(estimator=lgbm_model, 
                                   param_distributions=param_grid, 
                                   n_iter=20, # Number of parameter settings that are sampled
                                   cv=5,      # 5-fold cross-validation
                                   verbose=1, # To show the progress of the search
                                   random_state=42,
                                   n_jobs=-1) # Use all processors

# Fit the model
randomlgbm.fit(X_train_scaled, y_train)
best_lgbm_model = randomlgbm.best_estimator_

## evaluating the result on validation set
modelPerformance(best_lgbm_model, X_test_scaled, y_test)

## Model Interpretation 

In our decision-making process for Churn Prediction models, we choose to go with LGBMClassifie, we chose to use LGBMClassifier, because it provides better metrics than other classifiers.
- Accuracy -  0.8506304558680893
- Recal-  0.8592092574734812
- Precisio -  0.8461538461538461
- F1-Score -  0.8526315789473684


# Step 11: Feature Importance Analysis

Analyzing the feature importance helps us understand which features contribute the most to predicting churn, which can provide valuable business insights.


In [None]:
# Feature Importance Analysis
# Analyzing feature importances to understand which features contribute most to the prediction of churn.

# Get the best estimator (the final trained model)
best_model = randomlgbm.best_estimator_

# Get feature importances from the best model
feature_importances = best_model.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Plot the feature importances

plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.title('Feature Importances')
plt.show()


 Based on the plot, it looks like the most important features are:

- Total Charges: This has the highest importance, indicating that it plays a significant role in predicting churn.
- Monthly Charges: Also a very important feature.
- Tenure Months: The length of time a customer has been with the company is a strong indicator of churn likelihood.
- Gender: It also appears to have a moderate impact.
- Other features like Internet Service, Payment Method, and Contract have much smaller impacts.


# Step 12: Saving the Model

The trained model is saved to disk using joblib. This allows us to reuse the model for predictions without retraining, which is efficient for deployment.


In [None]:
# Saving the trained model using joblib for future use.
# This allows us to easily deploy or re-use the model without retraining it from scratch.

# Save the model
joblib.dump(best_model, 'lgbm_churn_model.pkl')

# To load the model later
#loaded_model = joblib.load('lgbm_churn_model.pkl')

# Step 13: Deploy the Model

See app.py file