## **Predictive Modelling of Customer Churn and Retention Strategies for a Telecommunication Company: An Analysis for Vodafone Corporation**
## BUSINESS UNDERSTANDING

As a leading telecommunication corporation, Vodafone recognizes the criticality of customer retention in sustaining business growth. Customer churn can have a detrimental impact on a company's revenue and profitability, as acquiring new customers is often more costly than retaining existing ones. To address this challenge, this project aims to assist the company in analysing customer churn patterns, identifying the factors that influence churn, and developing effective customer retention strategies by leveraging machine learning techniques.

The primary objective is to develop robust machine-learning models to predict customer churn accurately. By analysing historical customer data, the aim is to identify key indicators of churn and formulate targeted retention strategies to reduce customer attrition to achieve higher profitability.

The dataset contains the following features:

* Gender — Whether the customer is a male or a female

* SeniorCitizen — Whether a customer is a senior citizen or not

* Partner — Whether the customer has a partner or not (Yes, No)

* Dependents — Whether the customer has dependents or not (Yes, No)

* Tenure — Number of months the customer has stayed with the company

* PhoneService — Whether the customer has a phone service or not (Yes, No)

* MultipleLines — Whether the customer has multiple lines  

* InternetService — Customer’s internet service provider (DSL, Fiber Optic, No)

* OnlineSecurity — Whether the customer has online security or not (Yes, No, No Internet)

* OnlineBackup — Whether the customer has online backup or not (Yes, No, No Internet)

* DeviceProtection — Whether the customer has device protection or not (Yes, No, No internet service)

* TechSupport — Whether the customer has tech support or not (Yes, No, No internet)

* StreamingTV — Whether the customer has streaming TV or not (Yes, No, No internet service)

* StreamingMovies — Whether the customer has streaming movies or not (Yes, No, No Internet service)

* Contract — The contract term of the customer (Month-to-Month, One year, Two year)

* PaperlessBilling — Whether the customer has paperless billing or not (Yes, No)

* PaymentMethod — The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

* MonthlyCharges — The amount charged to the customer monthly

* TotalCharges — The total amount charged to the customer

* Churn — Whether the customer churned or not (Yes or No)

## HYPOTHESIS

* Null Hypothesis (H0): There is no significant difference in churn rates between customers with longer contract terms and those using automatic payment methods.

* Alternative Hypothesis (H1): There is a significant difference in churn rates between customers with longer contract terms and those using automatic payment methods.

## QUESTIONS

 
1. How do contract terms and payment methods correlate with customer churn?

2. Are there specific services that significantly impact churn rates?

3. Are there specific services that customers with longer contract terms tend to use more frequently?

4. Do customers using automatic payment methods show different churn patterns compared to other payment methods?

5. Are senior citizens more or less likely to churn compared to non-senior citizens?





## DATA UNDERSTANDING
## Importation

In [None]:
# Import necessary libraries for data handling 
import pyodbc
import pandas as pd
import numpy as np
from dotenv import dotenv_values

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px 

# hypothesis testing
from scipy.stats import chi2_contingency

# Machine learning classification models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Feature Processing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler
from sklearn.preprocessing import FunctionTransformer,OneHotEncoder,LabelEncoder,OrdinalEncoder
from sklearn.base import TransformerMixin
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from functools import partial
from sklearn.metrics import roc_curve, auc

# class imbalance
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline as impipeline

# Hyperparameters Fine-tuning
from sklearn.model_selection import GridSearchCV

# Other utilities
import os
import pickle





: 

## Data Loading
#### **Training Set_1 (SQL)**

In [None]:
#PULLING VARIABLES from an environment
environment_variables = dotenv_values('.env')
 
database = environment_variables.get("database_name")
server = environment_variables.get("server_name")
username = environment_variables.get("Login")
password = environment_variables.get("password")
 
# defining a connection string for connecting to our SQL Server database
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
 
# establish a database connection using the 'pyodbc' library
connection = pyodbc.connect(connection_string)

: 

In [None]:
query = 'SELECT * FROM dbo.LP2_Telco_churn_first_3000'

data1 = pd.read_sql(query,connection)
data1.head()

: 


#### **Training Set_2 (CSV)**

In [None]:
data2 = pd.read_csv('LP2_Telco-churn-last-2000.csv')
data2

: 

In [None]:
# Checking the column headers for Data1
data1.columns

: 

In [None]:
# Checking the column headers for Data2 
data2.columns

: 


#### **Concatenating the dataset from the two sources into one dataframe**

In [None]:
# Concatenate both sources and reset the index of the combined dataset
df = pd.concat([data1, data2], ignore_index=True)

df

: 

* INSIGHTS: The variables 'Churn', 'PaperlessBilling', 'StreamingMovies', 'StreamingTV', 'TechSupport', 'DeviceProtection', 'OnlineBackup', 'OnlineSecurity', 'MultipleLines', 'PhoneService', 'Dependents' and 'Partner' have inconsistent entries (Yes, No, False and True).

In [None]:
df.info()

: 

* INSIGHTS: The above (.info()) shows that:
 
Categorical variable: SeniorCitizen column is of datatype int64.
Numeric variable: TotalCharges column is of datatype object.

The variables: MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection,
TechSupport, StreamingTV, StreamingMovies, TotalCharges and Churn have missing values.
 


In [None]:
# Change the datatype of the variable 'TotalCharges' to a float
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Verify Changes
df.info()

: 

In [None]:
# Standardizing the entries in specific columns
# List the columns with inconsistent entries
boolean_columns = ['Churn', 'PaperlessBilling', 'StreamingMovies','StreamingTV', 'TechSupport', 'DeviceProtection', 'OnlineBackup', 'OnlineSecurity', 'MultipleLines', 'PhoneService', 'Dependents', 'Partner']

# Iterate through each column and replace True/False with 'Yes'/'No'
for column in boolean_columns:
    df[column] = df[column].replace({True: 'Yes', False: 'No'})

# drop unneeded column
columns_drop = ['customerID']
df = df.drop(columns=columns_drop)

# Verify changes
df

: 

In [None]:
# check for duplicates
dup = df.duplicated().sum()
print(f'This dataset has {dup} duplicates')

: 

In [None]:
# Drop duplicates
df = df.drop_duplicates()

# reset the index
df = df.reset_index(drop=True)

# Verify Changes
ver = df.duplicated().sum()
print(f'This dataset has {ver} duplicates')

: 

In [None]:
# check null values
df.isnull().sum()

: 

In [None]:
# Check the percentage of missing values
(df.isnull().sum()/(len(df)))*100

: 

In [None]:
# Viewing the columns that contain null values
selected_columns = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'TotalCharges', 'Churn']
df_selected = df[selected_columns]
df_selected


: 

In [None]:
# View the unique entries in the selected categorical columns
for column in ['Churn', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
               'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']:
    print(f"\nUnique values in '{column}' column:")
    print(df_selected[column].unique())

: 

* INSIGHTS: 
* The percentage of missing values is ~5% for MultipleLines and ~13% for OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies. 
* Null values are of the type 'None' and not 'NaN'.

In [None]:
# Replace 'None' with NaN for consistency
df.replace({None: np.nan}, inplace=True)

: 

## EDA

In [None]:
# Getting the summary statistics of numerical columns
df.describe().T

: 

* INSIGHTS: 
1. **Tenure:**

* The average tenure of customers is around 32.58 months, with a wide range from 0 to 72 months.
* Half of the customers have a tenure of 29 months or less (the median), and three quarters have a tenure of 56 months or less (the 75th percentile).

2. **Monthly Charges:**

* The average monthly charge is $65.09, with a standard deviation of $30.07.
* Monthly charges range from $18.40 to $118.65, indicating variability in pricing.
* The median monthly charge is $70.55, which is higher than the mean, suggesting a left-skewed (negatively skewed) distribution.

3. **Total Charges:**

* The total charges have a wide range, with an average of $2302.06 and a standard deviation of $2269.48.
* Some customers have significantly higher total charges, as indicated by the high maximum value of $8670.10.
* The mean is well above the median ($1401.15), suggesting a right-skewed distribution or possible outliers.
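
The link between the mean, the median and the direction of skew used in the notes above can be checked on synthetic data. This is a minimal sketch on hypothetical samples (not the churn dataset): a long right tail pulls the mean above the median, and a long left tail pulls it below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample: most values are small, a few are very large
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Mirror the sample to get a left-skewed one: most values are large, a few are very small
left_skewed = right_skewed.max() - right_skewed

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean={sample.mean():.2f}, "
          f"median={sample.median():.2f}, skew={sample.skew():.2f}")
```

In the right-skewed sample the mean exceeds the median and the skew is positive; in the mirrored sample both relationships flip.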

In [None]:
# Separating numerical and categorical variables for easy analysis
cat_cols = df.select_dtypes(include=['object', 'bool']).columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

: 

### Univariate Analysis

In [None]:
# Visualizing the distribution of the numerical columns using histogram and box plot side by side while printing the skewness
for col in num_cols:
    print(col)
    print('Skew :', round(df[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    df[col].hist(grid=False, color ="blue")
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col], color='blue')
    plt.show()

: 

* INSIGHTS: The visualization reveals that the distributions of the numerical columns are notably uneven, predominantly exhibiting positive skewness, with MonthlyCharges negatively skewed. This suggests that these columns may benefit from transformations to achieve a more balanced distribution, which can improve the performance of machine learning models.
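
As an illustration of such a transformation, the sketch below applies `np.log1p` to a hypothetical right-skewed sample (the values are synthetic and only stand in for a charges-like column) and compares the skewness before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed values standing in for a charges-like column
charges = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=5000))

# log1p computes log(1 + x), which is safe for zero values
log_charges = np.log1p(charges)

skew_before = charges.skew()
skew_after = log_charges.skew()
print(f"Skew before: {skew_before:.2f}")
print(f"Skew after log1p: {skew_after:.2f}")
```

A log transform compresses a long right tail, so it is suited to positively skewed columns; it does not correct a negatively skewed column.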

In [None]:
plt.figure(figsize=(8,4))
ax = sns.countplot(data = df, x="Contract",color = "blue")
ax.set(ylabel=None)
plt.show()

plt.figure(figsize=(12,4))
ax = sns.countplot(x ="PaymentMethod", data = df, color = "blue")
ax.set(ylabel=None)
plt.show()

plt.figure(figsize=(8,4))
ax = sns.countplot(x="SeniorCitizen",data= df,color = "blue")
ax.set(ylabel=None)
plt.show()


plt.figure(figsize=(8,4))
ax = sns.countplot(x="PhoneService",data= df,color = "blue")
ax.set(ylabel=None)
plt.show()

plt.figure(figsize=(8,4))
ax = sns.countplot(x="InternetService",data= df,color = "blue")
ax.set(ylabel=None)
plt.show()

: 

* INSIGHT: The visualization reveals that most of these categorical features are imbalanced, with one category dominating the others. This imbalance in our key features may adversely impact the quality and predictive accuracy of our model, emphasizing the need to address it to enhance model performance.

### Bi-variate Analysis

In [None]:
# relationship among features
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix

: 

In [None]:
# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()

: 

* INSIGHTS: The correlation matrix reveals positive correlations between tenure and both monthly and total charges, with a stronger correlation observed between tenure and total charges, while senior citizenship exhibits a modest positive correlation with both monthly and total charges in the dataset.

In [None]:
# Create a contingency table 
contingency_table = pd.crosstab(df['Contract'], df['Churn'])
contingency_table 

: 

In [None]:
# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(contingency_table, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()

: 

* INSIGHTS: Customers with month-to-month contracts exhibit the highest churn rate, suggesting that this contract type may be associated with a greater likelihood of customer attrition. In contrast, customers with two-year contracts show a substantially lower churn rate, indicating a potential correlation between contract duration and customer retention. This insight emphasizes the importance of considering contract terms when analyzing and addressing customer churn in the dataset, and it aligns closely with our hypothesis.
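
The contingency table above holds raw counts, so groups of different sizes are hard to compare directly. A minimal sketch of turning the counts into row-wise churn rates with `pd.crosstab(..., normalize="index")` follows; the toy data is hypothetical and only mirrors the notebook's column names.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's df
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No",
              "Yes", "No", "No", "No",
              "No", "No", "No", "No"],
})

# normalize="index" divides each row by its total, giving the share of
# 'Yes' and 'No' within each contract type (shown here as percentages)
churn_rate_by_contract = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index") * 100
print(churn_rate_by_contract.round(1))
```

Applied to the real data, the same call on `df['Contract']` and `df['Churn']` would give the churn rate per contract type.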

In [None]:
# Create a contingency table 
contingency_table2 = pd.crosstab(df['PaymentMethod'], df['Churn'])
contingency_table2 

: 

In [None]:
# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(contingency_table2, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()

: 

* INSIGHTS: The observed variations in churn across different payment methods, notably higher churn among customers using Electronic check compared to Bank transfer (automatic), Credit card (automatic), and Mailed check, indicate that the choice of payment method could be a significant factor influencing customer churn. These findings offer valuable insights for further investigation and validation of our initial hypothesis.

### Multivariate Analysis

In [None]:
data = df[["tenure","MonthlyCharges",'TotalCharges','Churn']]
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(data, palette={'Yes':'Firebrick', 'No':'blue'}, hue = 'Churn')
plt.show()

: 

In [None]:
# Defining mode value
most_frequent = df['Churn'].mode()[0]

# Filling null values in 'Churn' column with the most frequent value
# (assigning back avoids the chained in-place fillna, which newer pandas versions warn about)
df['Churn'] = df['Churn'].fillna(most_frequent)

# Verify Changes
df['Churn'].isnull().sum()

: 

### Handle null values and save the clean dataframe to an Excel file

In [None]:
# Work on a copy so that the original df is left untouched
df_preprocessed = df.copy()

# Check for null values after preprocessing
print("Null values after preprocessing:\n", df_preprocessed.isnull().sum())


: 

In [None]:
# Specify columns with missing values
numerical_cols = ['TotalCharges']
categorical_cols = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

# Impute missing values for numerical columns
numerical_imputer = SimpleImputer(strategy='median')
df_preprocessed[numerical_cols] = numerical_imputer.fit_transform(df_preprocessed[numerical_cols])

# Impute missing values for categorical columns
categorical_imputer = SimpleImputer(strategy='most_frequent')
df_preprocessed[categorical_cols] = categorical_imputer.fit_transform(df_preprocessed[categorical_cols])

# Verify Changes
print("Null values after preprocessing:\n", df_preprocessed.isnull().sum())


: 

In [None]:
# Saving the DataFrame to an Excel file
desktop_path = r'C:\Users\USER\Desktop'
file_path = desktop_path + r'\df_preprocessed.xlsx'

df_preprocessed.to_excel(file_path, index=False)

: 

## Answering Business Questions

1. How do contract terms and payment methods correlate with customer churn?

In [None]:
# Create a contingency table
contingency_table3 = pd.crosstab(index=df_preprocessed['Contract'], columns=[df_preprocessed['PaymentMethod'], df_preprocessed['Churn']])


# plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(contingency_table3, annot=True, cmap='coolwarm', fmt='g')
plt.title('Correlation between Contract, Payment Method, and Churn')
plt.show()

: 

**INSIGHTS:**
1. **Payment Method Impact:**
Customers using "Electronic check" as their payment method exhibit a higher churn rate compared to other methods.
Customers using "Bank transfer (automatic)" and "Credit card (automatic)" generally have lower churn rates.

2. **Contract Duration:**
Customers with a "Two-year" contract have the lowest churn rate across all payment methods, indicating that longer-term contracts are associated with higher customer retention.
"Month-to-month" contract customers show higher churn rates, emphasizing the importance of contract duration in customer retention.

3. **Churn Across Payment Methods:**
Among customers with "Month-to-month" contracts, "Electronic check" users experience the highest churn, while "Bank transfer (automatic)" users have a comparatively lower churn rate.

4. **Variability in Churn Rates:**
Churn rates vary significantly across different contract durations and payment methods, underscoring the importance of understanding these factors when devising customer retention strategies.

In summary, the data suggests a correlation between payment methods, contract duration, and customer churn. Exploring strategies to encourage longer-term contracts and promoting specific payment methods might help mitigate churn and enhance overall customer retention.

2. Are there specific services that significantly impact churn rates?

In [None]:
# bar plot
plt.figure(figsize=(8, 4))
ax = sns.countplot(x='InternetService', hue='Churn', data=df_preprocessed, palette={'Yes':'Firebrick', 'No':'blue'})
ax.set(ylabel=None)
plt.title('Impact of InternetService on Churn Rates')
plt.show()

# bar plot 
plt.figure(figsize=(8, 4))
ax = sns.countplot(x='PhoneService', hue='Churn', data=df_preprocessed, palette={'Yes':'Firebrick', 'No':'blue'})
ax.set(ylabel=None)
plt.title('Impact of PhoneService on Churn Rates')
plt.show()

: 

**INSIGHTS:** 
The visualization indicates that the selected services, InternetService and PhoneService, have a notable impact on the churn rate. The predominance of "No Churn" in the visualization implies that customers utilizing these services are less likely to churn. This observation aligns with the understanding that specific services play a significant role in influencing customer churn rates.

3. Are there specific services that customers with longer contract terms tend to use more frequently?

In [None]:
plt.figure(figsize=(12, 8))

# InternetService
plt.subplot(2, 2, 1)
sns.barplot(x='Contract', y='tenure', hue='InternetService', data=df_preprocessed, ci= None, palette={'Fiber optic':'Firebrick','DSL':'blue','No':'orange'})
plt.title('Average Tenure for Different Internet Services and Contract Types')

# PhoneService
plt.subplot(2, 2, 2)
sns.barplot(x='Contract', y='tenure', hue='PhoneService', data=df_preprocessed, ci= None, palette={'Yes':'Firebrick', 'No':'blue'})
plt.title('Average Tenure for Different Phone Services and Contract Types')

plt.tight_layout()
plt.show()


: 

**INSIGHTS:**
Examining the bar plots for the two services, InternetService and PhoneService, reveals that customers with extended contract terms predominantly choose Fiber Optic for internet service and use phone services more frequently. This observation implies a positive correlation between longer contract terms and increased usage of these selected services.

4. Do customers using automatic payment methods show different churn patterns compared to other payment methods?

In [None]:
fig = px.histogram(df_preprocessed, x='PaymentMethod', color='Churn', barmode='stack',
                   color_discrete_map={'Yes':'Firebrick', 'No':'blue'},
                   labels={'PaymentMethod': 'Payment Method', 'Churn': 'Churn'},
                   title='Churn Patterns by Payment Method')

fig.update_layout(xaxis_title='Payment Method', yaxis_title='Count', showlegend=True)
fig.show()

: 

**INSIGHTS:**
Analyzing the churn patterns associated with the different payment methods, it becomes apparent that customers using the automatic payment methods tend to exhibit a pattern of 'No churn' (retaining services). In contrast, the Electronic check payment method displays a pattern of 'Yes churn' (churning). This observation aligns with the initial hypothesis, offering valuable insights into the relationship between payment methods and churn behavior.

5. Are senior citizens more or less likely to churn compared to non-senior citizens?

In [None]:
# Count senior and non-senior citizens (sorted so that 0 = non-senior comes first)
senior_counts = df_preprocessed['SeniorCitizen'].value_counts().sort_index()

# Plotting a pie chart
plt.figure(figsize=(6, 6))
plt.pie(senior_counts, labels=['Non-Senior Citizens', 'Senior Citizens'], autopct='%1.1f%%', colors=['blue', 'Firebrick'])
plt.title('Customer share: Senior vs Non-Senior Citizens')
plt.show()


: 

**INSIGHTS:**
The chart above shows the composition of the customer base: Senior Citizens make up 16.3% of customers and non-Senior Citizens 83.7%. These shares are not churn rates, so on their own they do not show which group is more likely to churn. Answering the question requires comparing the proportion of churners within each group.
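
Comparing churn between the two groups calls for a churn rate computed within each group rather than the share of each group in the data. A minimal sketch on hypothetical toy data (column names follow the notebook; the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: 8 non-senior and 2 senior customers
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "Churn": ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No"],
})

# Share of churners inside each group, as a percentage
churn_rate = (
    toy["Churn"].eq("Yes")
    .groupby(toy["SeniorCitizen"])
    .mean()
    .mul(100)
    .rename(index={0: "Non-Senior", 1: "Senior"})
)
print(churn_rate)
```

In this toy example seniors are only 20% of the customers, yet half of them churn, which shows why group share and churn rate must not be confused. The same expression could be run on `df_preprocessed`.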

## Hypothesis Testing

In [None]:

# create a contingency table, extract relevant columns (Contract and PaymentMethod)
contingency_table = pd.crosstab(df_preprocessed['Contract'], df_preprocessed['PaymentMethod'])

# Perform chi-squared test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output the results
print(f"Chi-squared value: {chi2}")
print(f"P-value: {p}")

# Define significance level
alpha = 0.05

# Check the p-value against the significance level
if p < alpha:
    print("Reject the null hypothesis. Contract type and payment method are significantly associated.")
else:
    print("Fail to reject the null hypothesis. No significant association between contract type and payment method.")


: 

Insight:

* There is sufficient statistical evidence to reject the null hypothesis. Because the contingency table crosses Contract with PaymentMethod, the test shows that contract term and payment method are associated with each other; evidence about churn rates themselves requires testing each variable against Churn.
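
To relate each factor to churn directly, the chi-squared test can be run on a contingency table that has Churn as one of its dimensions. The sketch below does this for contract type and for an automatic-payment flag. The toy data and the `AutoPay` helper column are hypothetical and only illustrate the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data; column names mirror the notebook's df_preprocessed
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 100 + ["Two year"] * 100,
    "PaymentMethod": (["Electronic check"] * 60 + ["Credit card (automatic)"] * 40
                      + ["Electronic check"] * 30 + ["Bank transfer (automatic)"] * 70),
    "Churn": (["Yes"] * 45 + ["No"] * 55) + (["Yes"] * 5 + ["No"] * 95),
})

# Hypothetical helper flag: True for the two automatic payment methods
toy["AutoPay"] = toy["PaymentMethod"].str.contains("automatic")

p_values = {}
for feature in ["Contract", "AutoPay"]:
    table = pd.crosstab(toy[feature], toy["Churn"])
    chi2, p, dof, _ = chi2_contingency(table)
    p_values[feature] = p
    print(f"{feature} vs Churn: chi2={chi2:.2f}, dof={dof}, p={p:.4g}")
```

A p-value below the chosen significance level for a given table indicates that churn rates differ across the levels of that feature.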

## Data Preparation

Split data into input (x) and target (y) features

In [None]:
# Drop unnecessary columns and split the data
X = df.drop(['Churn'], axis=1)  # Features
y = df['Churn']  # Target variable

# Convert boolean values to strings
y_stratify = y.astype(str)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y_stratify)

: 

### Feature Engineering

In [None]:
X.info()

: 

In [None]:
X_num_cols = X.select_dtypes(include=np.number).columns

X_cat_cols = X.select_dtypes(include=['object']).columns

# Verify changes
print("Categorical Variables:")
print(X_cat_cols)
print("Numerical Variables:")
print(X_num_cols)

: 

#### Creating a pipeline

In [None]:
# LogTransformer class
class LogTransformer:
    def __init__(self, constant=1):
        self.constant = constant
 
    def transform(self, X_train):
        return np.log1p(X_train + self.constant)
 
 
# Numerical transformer with LogTransformer
numerical_pipeline = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('log_transform', FunctionTransformer(LogTransformer().transform)),
    ('scaler', StandardScaler())
])
 
class BooleanToStringTransformer(TransformerMixin):
    def fit(self, X, y=None):
        # Fit logic here, if needed
        return self
 
    def transform(self, X):
        # Cast values to strings but keep missing values as NaN,
        # so the imputer in the next step can still detect and fill them
        return X.where(X.isna(), X.astype(str))
 
 
# Categorical transformer
categorical_pipeline = Pipeline(steps=[
    ('bool_to_str', BooleanToStringTransformer()),
    ('cat_imputer', SimpleImputer(strategy='most_frequent')),
    ('cat_encoder', OneHotEncoder())
])
 
 
# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, X_num_cols),
        ('cat', categorical_pipeline, X_cat_cols)
    ])

: 

### Label Encoder

In [None]:
# Fit and transform the label encoder on y_train
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

: 

### Machine Learning Models

#### Compare Models - Unbalanced

In [None]:
# List of models to evaluate
models = [
    ('tree_classifier', DecisionTreeClassifier(random_state=42)),
    ('logistic_classifier', LogisticRegression(random_state=42)),
    ('K-nearest_classifier', KNeighborsClassifier()),
    ('svm_classifier', SVC(random_state=42, probability=True)),
    ('sgd_classifier', SGDClassifier(random_state=42)),
    ('rf_classifier', RandomForestClassifier(random_state=42))
]

# Iterate through models
for model_name, classifier in models:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)                
    ])
    
    # Train the model
    pipeline.fit(X_train, y_train_encoded)

    # Make predictions
    y_pred = pipeline.predict(X_test)

    # Print classification report
    print(f'Report for {model_name}')
    print(classification_report(y_test_encoded, y_pred))
    print('=' * 58)

: 

* Insight: The logistic classifier appears to demonstrate strong overall performance by effectively identifying churn cases (recall) and minimizing false positives (precision). It achieves a good balance between precision and recall for both classes ('No' and 'Yes'). This stands in contrast to other models, where precision and recall are higher for the 'No' class, signifying superior predictive accuracy for the 'No' class but potentially sacrificing performance on the 'Yes' class.
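
The precision and recall figures discussed above come from the confusion matrix. As a minimal sketch on hypothetical labels (1 stands for 'Yes', 0 for 'No'), the two metrics for the churn class can be derived by hand and compared with scikit-learn's helpers:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = churned, 0 = stayed)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_yes = tp / (tp + fp)  # of the predicted churners, how many really churned
recall_yes = tp / (tp + fn)     # of the real churners, how many were caught

print(f"Precision (Yes): {precision_yes:.2f} | sklearn: {precision_score(y_true, y_pred):.2f}")
print(f"Recall (Yes):    {recall_yes:.2f} | sklearn: {recall_score(y_true, y_pred):.2f}")
```

For churn prediction, recall on the 'Yes' class is usually the figure to watch, because a missed churner cannot be targeted with a retention offer.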

### Balanced Dataset - RandomOverSampler

In [None]:
# Balance the dataset using RandomOverSampler
samplier = RandomOverSampler(random_state=42)

X_train_resampled, y_train_resampled = samplier.fit_resample(X_train, y_train_encoded)

: 

In [None]:
# List of models to evaluate
models = [
    ('tree_classifier', DecisionTreeClassifier(random_state=42)),
    ('logistic_classifier', LogisticRegression(random_state=42)),
    ('K-nearest_classifier', KNeighborsClassifier()),
    ('svm_classifier', SVC(random_state=42, probability=True)),
    ('sgd_classifier', SGDClassifier(random_state=42)),
    ('rf_classifier', RandomForestClassifier(random_state=42))
]
# Iterate through models
for model_name, classifier in models:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)                
    ])
    
    # Train the model
    pipeline.fit(X_train_resampled, y_train_resampled)

    # Make predictions
    y_pred = pipeline.predict(X_test)

    # Print classification report
    print(f'Report for {model_name}')
    print(classification_report(y_test_encoded, y_pred))
    print('=' * 58)


: 

##### Insights
* Balancing the dataset has positively impacted the models, particularly in their ability to identify customers who will churn (improved recall for 'Yes').
* The logistic_classifier and svm_classifier stand out as models with balanced improvements in precision and recall for the 'Yes' class.
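
To make explicit what random oversampling does to the training set, here is a minimal sketch that uses `sklearn.utils.resample` in place of imblearn's `RandomOverSampler`. The toy training frame is hypothetical. Minority rows are duplicated at random until both classes have the same size, and only the training split is resampled.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training data: 2 churners (1) and 6 non-churners (0)
train = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72],
    "Churn": [1, 1, 0, 0, 0, 0, 0, 0],
})

majority = train[train["Churn"] == 0]
minority = train[train["Churn"] == 1]

# Draw minority rows with replacement until they match the majority count
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([majority, minority_upsampled], ignore_index=True)
print(balanced["Churn"].value_counts())
```

Because the added rows are exact copies, no new information is created; SMOTE differs in that it interpolates synthetic minority samples instead.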

### Balanced Dataset - Using the SMOTE

In [None]:
# Initialize SMOTE for oversampling the minority class
smote = SMOTE(random_state=42)

# List of models to evaluate
models = [
    ('tree_classifier', DecisionTreeClassifier(random_state=42)),
    ('logistic_classifier', LogisticRegression(random_state=42)),
    ('K-nearest_classifier', KNeighborsClassifier()),
    ('svm_classifier', SVC(random_state=42, probability=True)),
    ('sgd_classifier', SGDClassifier(random_state=42)),
    ('rf_classifier', RandomForestClassifier(random_state=42))
]
# Iterate through models and apply SMOTE within the pipeline
for model_name, classifier in models:
    pipeline = impipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('classifier', classifier)                
    ])
    
    # Train the model
    pipeline.fit(X_train, y_train_encoded)

    # Make predictions
    y_pred = pipeline.predict(X_test)

    # Print classification report
    print(f'Report for {model_name}')
    print(classification_report(y_test_encoded, y_pred))
    print('=' * 58)


: 

Insights:

* Balancing the dataset with SMOTE has positively impacted the models, particularly in their ability to identify customers who will churn (improved recall for 'Yes').
* The logistic_classifier, svm_classifier, and sgd_classifier stand out as models with balanced improvements in precision and recall for the 'Yes' class.

Generally: Both SMOTE and RandomOverSampler have improved the models' ability to identify customers who will churn (improved recall for 'Yes') compared to the original imbalanced dataset. However, SMOTE tends to yield slightly better recall for the 'Yes' class in most models than RandomOverSampler.

### Feature Importance and Selection 

In [None]:
# Initialize SelectKBest for feature selection and setting the number of features
selection = SelectKBest(score_func=partial(mutual_info_classif, random_state=42), k=15)


# List of models to evaluate
models = [
    ('tree_classifier', DecisionTreeClassifier(random_state=42)),
    ('logistic_classifier', LogisticRegression(random_state=42)),
    ('K-nearest_classifier', KNeighborsClassifier()),
    ('svm_classifier', SVC(random_state=42, probability=True)),
    ('sgd_classifier', SGDClassifier(loss='log_loss', random_state=42)),
    ('rf_classifier', RandomForestClassifier(random_state=42))
]
all_pipeline = {}

# Iterate through models and apply SMOTE within the pipeline
for model_name, classifier in models:
    pipeline = impipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('feature_importance', selection),
        ('classifier', classifier)                
    ])
    
    # Train the model
    pipeline.fit(X_train, y_train_encoded)

    all_pipeline[model_name] = pipeline

    # Make predictions
    y_pred = pipeline.predict(X_test)

    # Print classification report
    print(f'Report for {model_name}')
    print(classification_report(y_test_encoded, y_pred))
    print('=' * 58)

: 

### Visualizing ROC Curves - Overlapping

In [None]:
# Create a figure and axis for the plot
fig, ax = plt.subplots(figsize=(8, 8))

roc_curve_data = {}
all_pipeline = {}
    
# Iterate through models and apply SMOTE within the pipeline
for model_name, classifier in models:
    pipeline = impipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('feature_importance', selection),
        ('classifier', classifier)                
    ])

    # Train the model
    pipeline.fit(X_train, y_train_encoded)

    y_score = pipeline.predict_proba(X_test)[:, 1]
    all_pipeline[model_name] = pipeline
    
    fpr, tpr, threshold = roc_curve(y_test_encoded, y_score)

    roc_auc = auc(fpr, tpr)

    roc_curve_df = pd.DataFrame({'False Positive Rate': fpr, 'True Positive Rate': tpr, 'Threshold': threshold})

    roc_curve_data[model_name] = roc_curve_df

    ax.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.2f})')

# Plot the diagonal line
ax.plot([0, 1], [0, 1], color='navy', linestyle='-')

# Set labels and title
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic (ROC) Curve')

# Add legend to the plot
ax.legend(loc='lower right')

# Show the plot after the for loop
plt.show()


: 

* Insight:

From the curves above, the Logistic Classifier stands out as the preferred model because of its strong overall performance, particularly the balance it achieves between precision and recall for both the churn and non-churn classes. It shows good discrimination ability, with a high ROC AUC (0.85), and its precision, recall, and F1-score give a comprehensive view of its performance. Using the Logistic Classifier to predict customer churn therefore allows potential churners to be identified effectively while limiting false positives.
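
As an aid to reading the AUC figure: ROC AUC can be interpreted as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way on four toy scores. The function name `pairwise_auc` and the data are illustrative assumptions, not part of the notebook.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy example: two non-churners (0) and two churners (1)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_true, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

Read this way, an AUC of 0.85 means that in roughly 85% of such pairs the model ranks the churner above the non-churner.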

In [None]:
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
roc_curve_data['logistic_classifier']

: 

In [None]:
logistic_pipeline = all_pipeline['logistic_classifier']

logistic_y_pred = logistic_pipeline.predict(X_test)

matrix = confusion_matrix(y_test_encoded, logistic_y_pred)
matrix

: 

In [None]:
# visualizing the matrix
sns.heatmap(data=matrix, annot=True, fmt='d', cmap='coolwarm')

: 

In [None]:
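# Use a decision threshold below the default 0.5 so that more churners are flagged
threshold = 0.28

# Predicted probability of class 1 (churn, assuming the encoder maps 'Yes' to 1)
y_pred_proba = logistic_pipeline.predict_proba(X_test)[:, 1]

# Convert probabilities to 0/1 labels so they match the encoded test labels
binary_prediction = (y_pred_proba >= threshold).astype(int)

threshold_matrix = confusion_matrix(y_test_encoded, binary_prediction)
threshold_matrix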

: 

In [None]:
sns.heatmap(data=threshold_matrix, annot=True, fmt='d', cmap='coolwarm')

: 
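
The effect of moving the cut-off from the default 0.5 down to 0.28 can be illustrated on a small toy example. The labels and probabilities below are hypothetical values chosen for illustration, not the notebook's data; the matrix layout follows the same `[[TN, FP], [FN, TP]]` convention as `confusion_matrix`.

```python
def confusion_counts(y_true, proba, threshold):
    """Return [[TN, FP], [FN, TP]] for a given probability cut-off."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, proba):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]


# Hypothetical labels (1 = churn) and predicted churn probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.25, 0.40, 0.55, 0.90]

default_matrix = confusion_counts(y_true, proba, 0.5)
lowered_matrix = confusion_counts(y_true, proba, 0.28)

print(default_matrix)  # [[5, 1], [2, 2]]
print(lowered_matrix)  # [[3, 3], [1, 3]]
```

In this toy case the lower threshold catches one more churner (recall rises from 2/4 to 3/4) at the cost of two additional false alarms, which is the trade-off to weigh when the cost of a missed churner exceeds the cost of an unnecessary retention offer.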

### Hyperparameter Tuning

In [None]:
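param_grid = {
    'feature_importance__k': [5, 10, 20],
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    # The default 'lbfgs' solver does not support the 'l1' penalty; 'liblinear' supports both
    'classifier__solver': ['liblinear'],
    'classifier__max_iter': [100, 200, 300],
}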

grid_search = GridSearchCV(
    logistic_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='f1')

grid_search.fit(X_train, y_train_encoded)

: 
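
Before running an exhaustive search it can help to estimate its cost. The sketch below counts the candidate combinations from the four multi-valued entries of the grid and multiplies by the number of folds. The variable names `n_candidates` and `n_fits` are illustrative.

```python
from math import prod

# The multi-valued entries of the grid used in the search
param_grid = {
    'feature_importance__k': [5, 10, 20],
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__max_iter': [100, 200, 300],
}
cv = 5

# Every combination of values is one candidate; each is fitted once per fold
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 108 540
```

Since each fit also reruns preprocessing, SMOTE and feature selection inside the pipeline, narrowing the grid (for example, fewer `max_iter` values) is a simple way to shorten the search.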

In [None]:
best_parameters = grid_search.best_params_
best_parameters

: 

In [None]:
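# best_estimator_ is the pipeline refit on the full training set with the best parameters
best_estimator = grid_search.best_estimator_

# Pipeline.score delegates to the classifier, so this is plain accuracy
test_accuracy = best_estimator.score(X_test, y_test_encoded)
print("Test Accuracy:", test_accuracy)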

: 
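
Note that the grid search optimises F1 while the cell above reports accuracy, and on an imbalanced target accuracy alone can be misleading. The sketch below uses a hypothetical 27/73 churn split (an assumption for illustration, not the notebook's actual class balance) to show that a model which never predicts churn still reaches a seemingly reasonable accuracy.

```python
# Hypothetical test set: 27 churners (1) and 73 non-churners (0)
y_true = [1] * 27 + [0] * 73

# A trivial "model" that always predicts no churn
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_positives / (true_positives + false_negatives)

print(accuracy, recall)  # 0.73 0.0
```

For that reason it may be worth reading the test accuracy alongside the recall and F1-score for the churn class.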

### Retrain Model with Best Parameters

In [None]:
logistic_pipeline.set_params(**best_parameters)
logistic_pipeline.fit(X_train, y_train_encoded)

: 

### Model Persistence

In [None]:
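import os
import joblib

# Make sure the target directory exists before saving
os.makedirs('./models', exist_ok=True)

# Persist the tuned pipeline and the label encoder
joblib.dump(logistic_pipeline, './models/finished_model.joblib')
joblib.dump(label_encoder, './models/encoder.joblib')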

: 

## LOAD TEST DATASET

In [None]:

# Load the new test data from Excel
test_data = pd.read_excel('LP2_Test_Dataset_Telco-churn-last-2000.xlsx')
test_data

: 

In [None]:
# viewing column names
test_data.columns

: 

In [None]:
# Drop 'customerID' column
test_data.drop('customerID', axis=1, inplace=True)

# Verify changes
test_data.columns

: 

In [None]:
test_data.info()

: 

In [None]:
# Change the datatype of the variable 'TotalCharges' to a float
test_data['TotalCharges'] = pd.to_numeric(test_data['TotalCharges'], errors='coerce')

: 

In [None]:
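# Reload the persisted pipeline and label encoder
logistic_pipeline_2 = joblib.load('./models/finished_model.joblib')
encoder = joblib.load('./models/encoder.joblib')

# Predictions at the default 0.5 threshold, kept for reference
prediction = logistic_pipeline_2.predict(test_data)

# Same decision threshold as used on X_test above
threshold = 0.28

# Predicted churn probability for each customer in the new data
y_pred_proba = logistic_pipeline_2.predict_proba(test_data)[:, 1]

# Flag a customer as a churner when the probability reaches the threshold
binary_prediction_2 = (y_pred_proba >= threshold)

print(binary_prediction_2)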

: 

In [None]:
test_data['Churn'] = binary_prediction_2

: 

In [None]:
test_data

: 

In [None]:
# Replace True/False in the 'Churn' column with 'Yes'/'No'
test_data['Churn'] = test_data['Churn'].replace({True: 'Yes', False: 'No'})

test_data

: 

In [None]:
test_data['Churn'].value_counts()

: 