# Title: Maximizing Customer Retention: A Churn Prediction Analysis for the Vodafone Group

## Project description

Customer attrition is a prevalent problem for many businesses, resulting in large financial losses. We intend to investigate customer churn or attrition in this project, which refers to the percentage of consumers that discontinue using a company's product or service within a specified time frame. Understanding the primary causes of customer churn can assist businesses in developing effective retention strategies to reduce customer attrition and boost revenue.

This project's dataset includes information about users' demographics, service usage, and billing information. We will use this dataset to conduct an exploratory data analysis in order to find patterns and trends linked to customer attrition. We will next use machine learning techniques to create a predictive model that will estimate the likelihood of a customer leaving the firm.

__Our project's objectives are to:__

1. Explore and visualize the data to uncover patterns and trends in customer attrition.

2. Use machine learning methods to build a predictive model that estimates the likelihood of customer churn.

3. Identify the major churn indicators, such as customer demographics, service usage, and billing information.

4. Propose retention strategies that help reduce customer turnover while increasing customer loyalty.

5. Based on the model's results, assess the success of retention efforts and recommend changes.

The project's outcome will provide valuable insights for Vodafone to understand customer churn and implement effective retention strategies to reduce customer attrition and increase revenue.


## Hypothesis 

Null hypothesis:

Gender does not have a significant impact on churn for Vodafone customers.

Alternative hypothesis:

Gender has a significant impact on churn for Vodafone customers.
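
One common way to test a hypothesis like this is a chi-square test of independence between gender and churn. The sketch below is only an illustration under stated assumptions: the data is a small hypothetical frame (in the notebook the inputs would be the `gender` and `Churn` columns), and the 0.05 significance level is an assumed choice, not something fixed by this project.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for the gender and Churn columns
toy = pd.DataFrame({
    "gender": ["Female", "Male"] * 50,
    "Churn": (["Yes"] * 2 + ["No"] * 6) * 12 + ["No"] * 4,
})

# Cross-tabulate the two categorical variables
contingency = pd.crosstab(toy["gender"], toy["Churn"])

# Chi-square test of independence on the contingency table
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(contingency)
print("p-value:", p_value, "| reject null:", reject_null)
```

If the p-value is below the chosen level, the null hypothesis of no association would be rejected; otherwise there is not enough evidence that gender affects churn.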
    

## Questions 

1. Which age group (SeniorCitizen column) paid the highest total charges?
2. Which gender has the highest churn count?
3. Which Internet service is patronized the most?
4. How much total charge and monthly charge revenue do churners generate?
5. Which payment method is the most popular?

## Library Installation

In [1]:

##!pip install imblearn ##for handling imbalanace data
##! pip install phik ##for our phik correlation

# Library Importation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns 

%matplotlib inline
import matplotlib.ticker as mtick
import random

import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

# Feature Processing (Scikit-learn processing, etc. )
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

#Algorithms and pipeline

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.compose import ColumnTransformer
# imblearn's Pipeline is used instead of sklearn's because it also accepts samplers such as SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import BaggingClassifier

##handling imbalance datasets

from imblearn.over_sampling import SMOTE

from sklearn.utils.class_weight import compute_class_weight

##hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# statistic model
import statsmodels.api as sm
...

# Other packages
import os


# Data Loading

In [None]:
df =pd.read_csv('../LP2_Telco-churn-last-2000.csv')

In [None]:
##taking a look at our dataset
df

# Exploratory Data Analysis

## Dataset Overview

In [None]:
df.info()

### Notes on .info():

- The TotalCharges column has an object data type even though it holds charges, so we will need to check it.
- .info() reports no missing (null) values.
- We have a total of 7043 rows.
- We have a total of 21 columns.
- Our target variable has Yes/No values, so this is a binary classification problem.

In [None]:
df.describe(include="all").transpose()

### __Key Notes__:

__SeniorCitizen__: 

- This is a binary variable that indicates whether or not the customer is a senior citizen. 

- The output shows a mean of 0.162, meaning that about 16.2% (roughly 1,142) of the 7,043 customers are senior citizens.

__tenure__: 

- This variable represents the number of months the customer has been with the company. 

- The output shows that on average, customers have been with the company for 32.4 months, but the standard deviation is quite large (24.6 months), indicating that there is a wide range of values for this variable. 

- 75% of the customers have been with the company for 55 months or less.

__MonthlyCharges__:  

- This variable represents the amount the customer pays each month for the company's services. 

- The output shows that on average, customers pay \$64.76 per month, with 75 percent paying less than \$89.85.

- Again, the standard deviation is quite large (\$30.1), indicating that there is a wide range of values for this variable as well.


#### Note:

Since the TotalCharges column has an object data type, we will create a copy of our dataset, convert TotalCharges to numeric in the copy, and use the copy for our EDA. We will use the original dataframe for our modelling and build a pipeline that handles the conversion.

In [None]:
##let's create a copy to make  it easy for us to revert to our original dataframe if we make a mistake

df_copy= df.copy()

### Converting the totalcharges column to a numerical variable

From .info(), we saw that the TotalCharges feature has an object data type even though its values are numeric, so we will convert it to a numeric data type.

In [None]:
df_copy["TotalCharges"]= pd.to_numeric(df_copy["TotalCharges"], errors= "coerce")

In [None]:
##Let's check to see if there are any further missing values

df_copy.isna().sum()

There are 11 missing values in TotalCharges after the conversion, so we will replace them with the mean of the TotalCharges column.
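
To illustrate why missing values only appear after the conversion, here is a minimal sketch on a hypothetical four-element series: blank strings are not nulls, so `.info()` does not count them, but `pd.to_numeric` with `errors="coerce"` turns them into NaN, which can then be filled with the column mean.

```python
import pandas as pd

# Hypothetical values: two valid charges and two blank entries
raw = pd.Series(["29.85", " ", "108.15", ""])

# Blank strings are not null, so isna() finds nothing yet
print(raw.isna().sum())

# Coercion turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the NaN entries with the mean of the valid values
filled = numeric.fillna(numeric.mean())
print(filled)
```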

In [None]:
df_copy["TotalCharges"].mean()

In [None]:
# fill only the TotalCharges column so that no other column is touched
df_copy["TotalCharges"] = df_copy["TotalCharges"].fillna(df_copy["TotalCharges"].mean())

In [None]:
##Let's check again for missing values
df_copy.isna().sum()

In [None]:
##Since there are no more missing values, we can proceed to use the df_copy for our analysis

In [None]:
##We would like to drop the ID column from both datasets since it consists only of unique values

In [None]:
df= df.drop("customerID", axis= 1)

df_copy= df_copy.drop("customerID", axis= 1)

##Note we will be using the df_copy for our analysis and the df for our modeling 

In [None]:
### I am saving it so that I can use it for power BI:

df_copy.to_csv("Documents/Vodafone_churn.csv")

## Univariate Analysis 

In [None]:
##Let's take a look at the distribution of the columns
##this function allows you to plot any number of columns

def plot_distribution(df, cols):
    for col in cols:
        # displot is figure-level, so the size is set with height rather than plt.figure
        sns.displot(df[col], kde=True, height=5)

In [None]:
##we are using the function above to plot the distribution of the columns below
plot_distribution(df_copy, ['SeniorCitizen', 'tenure', 'MonthlyCharges']);


In [None]:
df_copy.skew(numeric_only=True)  # skewness is only defined for the numeric columns

__Observations__:

- Most Vodafone customers are non-senior citizens

- The most common monthly charge is around 20 units

- Tenure peaks at both ends of the range, near 0 months and near 70 months

In [None]:
##checking for outliers 

In [None]:
sns.boxplot(data=df_copy);

##### Observation:

- It can be seen that there are no outliers 

### Univariate on the Churn Column

In [None]:
##the idea here is to check to see the ratio of our label variables 

In [None]:
##sns.countplot(x = df_copy["Churn"], data = df_copy)
##plt.title("Plot of Ratio of the Label Variables(Churn)")

fig = px.pie(df_copy, names='Churn', title='Plot of Ratio of the Label Variables (Churn)')
fig.show()

In [None]:
##let's see the percentage

count = df_copy["Churn"].value_counts()

percen = count / len(df_copy["Churn"]) * 100
# index by label rather than position so the result does not depend on the sort order
print("The percentage of No is: ", round(percen["No"], 2))
print("The percentage of Yes is: ", round(percen["Yes"], 2))

In [None]:
percen["No"] / percen["Yes"]

#### Notes from exploring the Label Variable:

- We can see there is an imbalance in our dataset; therefore we will have to deal with that.

- 73.46% of the Vodafone customers are still with the company

- 26.54% of the customers in the current dataset have left the company

- Customers who are still with the company are 2.77 times as numerous as those who have left
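
The arithmetic behind these notes can be sketched as follows. The counts are hypothetical and were chosen only so that the shares mirror the percentages reported above; they are not the dataset's actual counts.

```python
import pandas as pd

# Hypothetical label column whose shares mirror the reported percentages
churn = pd.Series(["No"] * 7346 + ["Yes"] * 2654)

# Share of each class in percent, indexed by label
share = churn.value_counts(normalize=True).mul(100)

# How many retained customers there are per churned customer
ratio = share["No"] / share["Yes"]
print(share.round(2))
print(round(ratio, 2))
```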

In [None]:
df_copy.nunique()

In [None]:
##We can see that tenure has 73 unique values, so it is a good idea to bin it to make our analysis easier.

##each bin covers 12 months and is closed on the left: [0, 12), [12, 24), ..., [72, 84)
labels = ["{0} - {1}".format(i, i + 11) for i in range(0, 73, 12)]

df_copy["tenure_group"] = pd.cut(df_copy.tenure, range(0, 85, 12), right=False, labels=labels)
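
As a quick check of how half-open binning behaves, the sketch below applies `pd.cut` to a few hypothetical tenure values. It assumes bin edges that start at 0 and step by 12 months, so that a label such as "0 - 11" covers exactly the months it names.

```python
import pandas as pd

# Hypothetical tenure values, including both boundary cases
tenure = pd.Series([0, 1, 11, 12, 35, 71, 72])

# Edges 0, 12, ..., 84 give seven left-closed bins: [0, 12), [12, 24), ...
edges = list(range(0, 85, 12))
labels = [f"{lo} - {lo + 11}" for lo in edges[:-1]]

groups = pd.cut(tenure, bins=edges, right=False, labels=labels)
print(pd.DataFrame({"tenure": tenure, "tenure_group": groups}))
```

With `right=False`, a value equal to an edge falls into the bin that starts at that edge, which is why 12 lands in "12 - 23" rather than "0 - 11".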

#### Visualizing how each categorical column varies with Churn

In [None]:
for  col in df_copy.drop(["Churn", "tenure", "MonthlyCharges", "TotalCharges"], axis= 1):
    fig= px.histogram(df_copy, color= "Churn", x= col, title= f"Count of {col}, with respect to Churn", 
                      color_discrete_sequence=random.sample(["red", "pink", "blue", "green"], k=2))
    fig.show()

In [None]:
##Since there is an unequal number of both attributes, it will be a good idea to see how they vary based on percentages

In [None]:
##in this code, we are checking to see the churn YES to NO churn percentage of each categorical feature

for col in df_copy.drop(["Churn", "tenure", "MonthlyCharges", "TotalCharges"], axis= 1):
    churn_by_col = df_copy.groupby(col)['Churn'].value_counts(normalize=True).mul(100).rename('percent').reset_index()
    fig = px.bar(churn_by_col, x=col, y='percent',color="Churn", text='percent', title= f"Percentage of: {col} with respect to Churn", 
                      color_discrete_sequence=random.sample(["chartreuse","darkblue","darkviolet","aqua", "aquamarine", "beige", "burlywood"], k=2) )
    fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
    fig.show()
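
To make the percentage calculation in the loop above concrete, here is a minimal sketch on a hypothetical eight-row frame. The column name `Contract` and its values are only illustrative.

```python
import pandas as pd

# Hypothetical data: one categorical feature and the churn label
toy = pd.DataFrame({
    "Contract": ["Monthly"] * 4 + ["Yearly"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})

# Within each category, the share of Yes and No in percent
pct = (
    toy.groupby("Contract")["Churn"]
    .value_counts(normalize=True)
    .mul(100)
    .rename("percent")
    .reset_index()
)
print(pct)
```

Because the shares are normalized within each group, categories of very different sizes can be compared directly.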

### Notes after Analysis:

- Customers with 0 to 11 months of tenure are the most likely to churn
- Customers who pay by electronic check churn the most
- Customers on the fiber optic internet service had the highest churn
- Female customers churned more than male customers
- Even though there were more non-senior citizens than senior citizens (from our count plot), senior citizens churned more percentage-wise. This could be because of old age, death or other factors.
- Customers with no dependents are the most likely to churn

## Bivariate Analysis

In [None]:
"""Let's see the relationship between monthly charges and totalcharges. We will like to see if total charges increase 
with Monthlycharges """

px.scatter(df_copy, x="MonthlyCharges", y= "TotalCharges", color_discrete_sequence=["chartreuse"])

In [None]:
##Let's see the relationship between TotalCharges and tenure: how does staying on the network longer affect a customer's total charges?

px.scatter(df_copy, x="tenure", y= "TotalCharges")

In [None]:
##Let's see which age group pays the highest total charges on average

sns.set_theme(style="whitegrid")
plt.figure(figsize=(20, 15))  # the figure must be created before plotting for the size to apply
sns.barplot(x="SeniorCitizen", y="TotalCharges", data=df_copy, palette="pastel")
plt.title("Average Total Charges of the SeniorCitizen Column")
#plt.xticks(rotation=90)
plt.show();

#### Notes From Bivariate Analysis:

- Total charges and monthly charges are positively correlated: customers who pay higher monthly charges tend to have higher total charges

- Total charges and tenure are also positively correlated: the longer a customer stays (tenure), the more total charges they pay

- On average, senior citizens pay a higher total charge than non-senior citizens

### Multivariate

In [None]:
sns.heatmap(df_copy.corr(numeric_only=True), annot=True)  # correlate the numeric columns only

## Answering Questions 

We will be answering the questions below:

1. Which age group paid the highest total charge?
2. Which gender has the highest count of churn ? 
3. Which Internet Service is patronized the most?
4. How much total charge and monthly charge revenue do churners generate?
5. Which payment method is the most popular?

#### Question 1: Which Age Group paid the highest Total Charges?

In [None]:
##The only age group given to us was the seniors column, therefore, we will see the amount each group paid 

sum_agegroup= df_copy.groupby("SeniorCitizen").agg({"TotalCharges": "sum"}).reset_index()

sns.barplot(x= "SeniorCitizen", y= "TotalCharges", data= sum_agegroup, palette= "colorblind")

#### Answer to Question1:

- From the chart above, we can see that non-senior citizens spend more money in total (in terms of summed total charges) than senior citizens

#### Question 2: Which Gender Recorded the highest Churn

In [None]:
churners= df_copy[df_copy["Churn"]== "Yes"]

In [None]:
px.histogram(x="gender", data_frame=churners, color= "gender", color_discrete_sequence=["teal", "grey"], title= "Plot of the Occurrence Of Churn Across the Genders")

#### Answer to Question2:
- Females tend to churn more than males 

### Question 3: Which Internet Service Is Patronized The Most?



In [None]:
internet_count= df.InternetService.value_counts().to_frame(name= "count")

In [None]:
internet_count

In [None]:
fig = px.scatter(internet_count, x=internet_count.index, y='count', size='count', color= internet_count.index, hover_name=internet_count.index,
                 log_y=False, size_max=60, title= "Plot of The Popularity Of The Various Internet Services")

fig.show()

#### Answer to Question3:

- Fiber optic was the most patronized Internet service

#### Question 4: How much total charge and monthly charge revenue do churners generate?

In [None]:
##since we are looking at churners, we will use the churners dataframe

In [None]:
money = churners.agg({"MonthlyCharges": "sum", "TotalCharges": "sum"}).reset_index()

money.columns = ["Charge", "Amount"]

px.bar(data_frame=money, x="Charge", y="Amount", text="Amount", color="Charge",
       title="Amount Paid By Churners In Terms of Monthly and Total Charges",
       color_discrete_sequence=["yellow", "black"])

#### Answer to Question 4:

From the chart above, we can see that customers who churned generated a whopping \$139130 in monthly charges and \$2,862,927 in total charges. This means Vodafone stands to lose \$139130 in monthly revenue from customers who had contributed \$2,862,927 in total charges.







#### Question 5: Which payment method is the most popular?

In [None]:
plt.figure(figsize=(15, 5))  # the figure must be created before plotting for the size to apply
sns.countplot(x="PaymentMethod", data=df, palette="pastel")
plt.title("Plot of Counts of Various Payment Methods")
plt.xticks(rotation=45)
plt.show()

#### Answer to Question 5:
- We can see that electronic check is the most popular payment method

### Note:

- Now we are done with our analysis, therefore, we will be using our df for our modelling. 

## Primary Feature Selection:

- In this section we will select the best features for our algorithm using the Phi_K (phik) correlation.

In [None]:
import phik

In [None]:
##getting the correlation of other features with churn

churn_corr= df.phik_matrix().loc["Churn"]

In [None]:
##sorting the values 
churn_cor=churn_corr.sort_values()

churn_cor

In [None]:
##plotting the phi-k correlation matrix
plt.figure(figsize=(10, 15))  # the figure must be created before plotting for the size to apply
sns.heatmap(churn_cor.to_frame(), annot=True, cmap="coolwarm")

plt.title("Phi_k Correlation Matrix for all Features");

From our feature selection, we will be dropping the columns whose correlation coefficient with Churn is less than 0.1:

- gender
- PhoneService
- MultipleLines
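
The drop list can also be derived programmatically instead of being typed by hand. The sketch below assumes a series of correlations indexed by feature name, like the one computed above; the feature names and values here are hypothetical placeholders, and the threshold is an assumed setting.

```python
import pandas as pd

# Hypothetical correlations of each feature with the target
corr_with_target = pd.Series(
    {"feature_a": 0.00, "feature_b": 0.02, "feature_c": 0.47, "Churn": 1.00}
)

threshold = 0.1  # assumed cut-off for a weak association

# Exclude the target itself, then keep the features below the threshold
candidates = corr_with_target.drop("Churn")
to_drop = candidates[candidates < threshold].index.tolist()
print(to_drop)
```

The resulting list could then be passed to `DataFrame.drop(columns=...)`.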

In [None]:
df_drop= df.drop(["gender", "PhoneService", "MultipleLines"], axis= 1)

In [None]:
df_drop

## Modeling 

Note:
- We will be using the df_drop dataframe for our modeling

In [None]:
df_drop.replace(r'^\s*$', np.nan, regex=True).isna().sum()

### Step 1: Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
##creating our features and label

X= df_drop.drop("Churn", axis=1)
y= df_drop.Churn

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
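
The `stratify=y` argument keeps the class proportions the same in both splits, which matters for an imbalanced label. A minimal sketch on hypothetical data with an 80/20 class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# Both splits keep the 20% share of positives
print(y_tr.mean(), y_te.mean())
```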

In [None]:
###converting our label to a numeric variable for easy analysis

LE = LabelEncoder()  ##initializing the encoder

num_y_train = LE.fit_transform(y_train)  ##fitting and transforming on the train data

num_y_test = LE.transform(y_test)  ##transforming the test data with the encoder fitted on train
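
`LabelEncoder` assigns integer codes in sorted order of the class names, so with Yes/No labels "No" becomes 0 and "Yes" becomes 1. A small sketch on hypothetical labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Hypothetical labels in the same Yes/No format as the Churn column
encoded = encoder.fit_transform(["No", "Yes", "No", "Yes", "Yes"])

print(list(encoder.classes_))  # class names in the order of their codes
print(list(encoded))

# The mapping can be reversed when reporting predictions
decoded = encoder.inverse_transform([1, 0])
```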

### Step 2: Creating Our Attributes

In [None]:

##getting our categorical attributes 
cat_attr= [i for i in df_drop.drop(["TotalCharges", "MonthlyCharges", "tenure", "Churn"], axis= 1)]


##getting our numerical attributes
num_attr= ["TotalCharges", "MonthlyCharges", "tenure"]


In [None]:
cat_attr

In [None]:
num_attr

### Step 3: Creating Pipeline

#### Creating numeric pipeline

##### For Our Empty Rows:

- We will create a function to handle that

##### For our numeric values, we need to:

- Scale them, since MonthlyCharges and TotalCharges are of different magnitudes and we will be using models that are sensitive to unscaled values.

- Also, we will impute the missing values in the numeric attributes.


##### For our categorical values, we need to:

- Transform our categorical features to numeric using a OneHotEncoder
- We will also handle class imbalance using scikit-learn's class weights



In [None]:
### handling the empty space. The aim of this function is to replace blank strings with NaN values

def remove_space(in_df):
    out_df = in_df.copy()  # work on a copy so the caller's dataframe is not modified
    out_df["TotalCharges"] = out_df["TotalCharges"].replace(r"^\s*$", np.nan, regex=True)
    return out_df

In [None]:
"""Since we cannot fit and transform the function above, we will create a class with the function embedded to help 

us call, fit, and transform with the function above"""

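from sklearn.base import BaseEstimator, TransformerMixin

class SpaceImputer(BaseEstimator, TransformerMixin):
    """Wraps a dataframe-cleaning function so it can be used as a pipeline step."""

    def __init__(self, func):
        self.func = func

    def transform(self, in_df, **transform_params):
        return self.func(in_df)

    def fit(self, X, y=None, **fit_params):
        # nothing to learn; the wrapped function is stateless
        return self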
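
As an alternative to a hand-written wrapper class, scikit-learn's `FunctionTransformer` can turn a plain function into a pipeline step. This is a sketch of that swapped-in technique, not the approach used in this notebook; the helper name `blank_to_nan` and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def blank_to_nan(frame):
    """Return a copy in which unparseable TotalCharges entries become NaN."""
    out = frame.copy()
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    return out

# Wrap the stateless function so it exposes fit and transform
space_step = FunctionTransformer(blank_to_nan)

# Hypothetical input with one blank entry
toy = pd.DataFrame({"TotalCharges": ["10.5", " ", "3"]})
cleaned = space_step.fit_transform(toy)
print(cleaned)
```

The step could then take the place of the custom class at the start of a pipeline.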
        

In [None]:
X_train.isna().sum()

In [None]:
##This pipeline will handle the NaN values in our dataset and also standardize our numeric features

## we are using the mean because, from our previous analysis, there were no outliers

num_pipeline= Pipeline([("mean_imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])

cat_pipeline= Pipeline([("one_hot", OneHotEncoder())])

In [None]:
##we are combining our numeric and categorical pipelines with a Columntransformer

col_pipe= ColumnTransformer([("num_pipe", num_pipeline, num_attr),("cat_pipe", cat_pipeline, cat_attr)])
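
To see what the combined preprocessing produces, the sketch below runs the same kind of `ColumnTransformer` on a hypothetical four-row frame with one numeric and one categorical column. The column names and values are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one missing numeric value
toy = pd.DataFrame({
    "tenure": [1, 12, None, 40],
    "Contract": ["Monthly", "Yearly", "Monthly", "Yearly"],
})

numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_steps = Pipeline([("one_hot", OneHotEncoder())])

preprocess = ColumnTransformer([
    ("num", numeric_steps, ["tenure"]),
    ("cat", categorical_steps, ["Contract"]),
])

# One scaled numeric column plus one column per category
transformed = preprocess.fit_transform(toy)
print(transformed.shape)
```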


### Creating a pipeline for each Classifier (ML Algorithm)

#### DecisionTree CLassifier


In [None]:

DTP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
              ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
              ("model", DecisionTreeClassifier(random_state= 100))
              ])
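
The `feature_selection` step keeps the `k` columns with the highest ANOVA F-scores. Because it runs after one-hot encoding, it selects among the encoded columns rather than the original features. A minimal sketch on synthetic data, where the dataset shape and `k=2` are assumed values:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 6 features, of which 2 are informative
X_toy, y_toy = make_classification(
    n_samples=100, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_toy, y_toy)

# Boolean mask showing which input columns were kept
print(selector.get_support())
print(X_selected.shape)
```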

In [None]:
DTP.fit(X_train, num_y_train)

In [None]:
result_1= DTP.predict(X_test)

In [None]:

print(classification_report(num_y_test,result_1))

#### Logistic Regressor Pipeline

In [None]:

LRP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
              ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
              ("model", LogisticRegression(random_state= 100))
              ])

In [None]:
LRP.fit(X_train, num_y_train)

In [None]:
result_2= LRP.predict(X_test)

In [None]:
print(classification_report(num_y_test,result_2))

#### Random Forest pipeline

In [None]:

RFP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
              ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
              ("model", RandomForestClassifier(n_estimators= 50, random_state= 100))
              ])

In [None]:
RFP.fit(X_train, num_y_train)

In [None]:
result_3= RFP.predict(X_test)

In [None]:
print(classification_report( num_y_test,result_3))

#### XGBoost Pipeline

In [None]:
XGP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),
               ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", XGBClassifier(random_state= 100))
              ])

In [None]:
XGP.fit(X_train, num_y_train)

In [None]:
result_4= XGP.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_4))

#### SVM Pipeline

In [None]:
SVP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
               ("feature_selection", SelectKBest(score_func=f_classif, k= 10)),
               ("model", SVC(random_state= 100))
              
              ])

In [None]:
SVP.fit(X_train, num_y_train)

In [None]:
result_5= SVP.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_5))

In [None]:
#### Results after base_modeling:


base_result= {"DTP": result_1, "LRP":result_2, "RFP": result_3, "XGP": result_4, "SVP":result_5}


for key, value in base_result.items():
    
    print(f"The performance of {key} is: \n\n", classification_report(num_y_test, value))
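
Printed reports are hard to compare side by side. One option is to collect the positive-class metrics into a single table with `classification_report(..., output_dict=True)`. The sketch below uses hypothetical labels and predictions; in the notebook the inputs would be `num_y_test` and the entries of `base_result`.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions from two models
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
predictions = {
    "model_a": [0, 0, 0, 1, 0, 0, 1, 0],
    "model_b": [0, 1, 0, 1, 1, 0, 1, 1],
}

# Keep only the metrics of the positive class ("1") for each model
rows = {
    name: classification_report(y_true, y_pred, output_dict=True)["1"]
    for name, y_pred in predictions.items()
}
summary = pd.DataFrame(rows).T[["precision", "recall", "f1-score"]]
print(summary)
```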

### Notes After Baseline Modeling:

- All our models did relatively well in predicting non-churners (the "No" or 0 class)

However, for the "Yes" class:

- Logistic regression did quite well in reducing false negatives (not classifying churners as non-churners), that is, it had good recall

- SVM had the highest precision for the "Yes" class, so it did well at avoiding false positives (not classifying non-churners as churners)
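
The link between these error types and the two metrics can be worked through by hand. The sketch below uses hypothetical labels: precision is the share of predicted churners that really churned, and recall is the share of actual churners that were caught.

```python
# Hypothetical labels: 1 = churner, 0 = non-churner
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-churner flagged as churner
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)  # churner that was missed

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Fewer false negatives raise recall, while fewer false positives raise precision.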

### Dealing with Imbalance

##### In this section, we are going to see how techniques like:

- class_weight: for models that use class weights, I will append "_CW" to the name to signify class_weight

- SMOTE: for models that use SMOTE, I will append "_SM" to the name to signify SMOTE

affect a model's performance

### Using Class_Weight to Handle imbalance

In [None]:
##initializing our class weight for each class

# classes is passed as a NumPy array, which is the type compute_class_weight expects
class_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=num_y_train)


In [None]:
##assigning our weight to the respective class 

weight= dict(zip([0, 1], class_weights))

In [None]:
##viewing the weight of each class 

weight
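
The "balanced" weights follow the formula `n_samples / (n_classes * count_of_class)`, so the rarer class receives the larger weight. The sketch below checks the formula against `compute_class_weight` on hypothetical labels with a 75/25 split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 75 non-churners and 25 churners
y_toy = np.array([0] * 75 + [1] * 25)

# Manual formula: n_samples / (n_classes * samples per class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

# The same weights from scikit-learn
auto = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)

print(manual, auto)
```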

#### Decision Tree with class weight

In [None]:

CW_DTP = Pipeline([
    ("spaceImputer", SpaceImputer(remove_space)),
    ("coltrans", col_pipe),
    ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
    ("model", DecisionTreeClassifier(
        random_state= 100, 
        class_weight= weight))
])

In [None]:
CW_DTP.fit(X_train, num_y_train)

In [None]:
result_6= CW_DTP.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_6))

#### Logistic Regression with Class Weight

In [None]:
CW_LRP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
            ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", LogisticRegression(
                   random_state=100,
                   class_weight=weight))
              
              ])

In [None]:
CW_LRP.fit(X_train, num_y_train)

In [None]:
result_7=CW_LRP.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_7))

#### Random Forest Class Weight

In [None]:
CW_RFC= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
                ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model",RandomForestClassifier(random_state= 100, n_estimators= 50,
                                               class_weight=weight))
              ])

In [None]:
CW_RFC.fit(X_train, num_y_train)

In [None]:
result_8= CW_RFC.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_8))

#### XGBoost with Class Weights

In [None]:
# Set "scale_pos_weight" based on class balance
##we divide the majority class count by the minority class count, using the training labels only

pos_weight = (num_y_train == 0).sum() / (num_y_train == 1).sum()

In [None]:
CW_XGB= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
                ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", XGBClassifier(random_state= 100,  scale_pos_weight=pos_weight))
              ])

In [None]:
CW_XGB.fit(X_train, num_y_train)

In [None]:
result_9= CW_XGB.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_9))

#### SVM with Class Weights

In [None]:

CW_SVM= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
                ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", SVC( random_state= 100, class_weight=weight))
              ])

In [None]:
CW_SVM.fit(X_train, num_y_train)

In [None]:
result_10= CW_SVM.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_10))

#### Results after dealing with imbalance with class_weight

In [None]:
result_cw= {"Decision Tree":result_1, "Decision_Tree_CW":result_6, "Logistic Regression":result_2, 
            
            "Logistic_Regression_CW": result_7, 
            
            "Random Forest": result_3, "Random_Forest_CW":result_8, "XGB":result_4, "XGB_CW":result_9, 
            
            "SVM": result_5,"SVM_CW":result_10}

for key, value in result_cw.items():
    print(f"Classification Report for {key}, is: \n\n",(classification_report(num_y_test,value)))

#### Notes: 

- After adding class weights, our models performed better on the "Yes" class predictions. Recall for the "Yes" class increased, which means the models produced fewer false negatives, which is what we want

- Again, SVM and Logistic Regression topped the charts. However, it is worth noting that while the recall of the "Yes" class increased, the precision decreased; this is expected because of the precision-recall trade-off

- Overall, adding class weights improved the performance of the models for both classes as opposed to not using weights

### Trying our model with SMOTE:

#### SMOTE with DecisionTree

In [None]:
DTP_SM= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
               ("feature_selection", SelectKBest(score_func=f_classif, k=10)),# Perform feature selection
               ("smote", SMOTE(random_state=100)),  # Apply SMOTE for oversampling
               ("model", DecisionTreeClassifier(random_state= 100))  
              ])
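
The SMOTE step creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation idea in NumPy. It is a simplified illustration, not imblearn's implementation: it always uses the single nearest neighbour, whereas SMOTE picks randomly among several, and the function name `synthesize` and the points are hypothetical.

```python
import numpy as np

def synthesize(points, n_new, rng):
    """Create n_new samples on segments between a point and its nearest neighbour."""
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        base = points[i]
        # Distance from the base point to every other minority point
        dists = np.linalg.norm(points - base, axis=1)
        dists[i] = np.inf  # exclude the point itself
        neighbour = points[np.argmin(dists)]
        gap = rng.random()  # position along the segment, in [0, 1)
        new_rows.append(base + gap * (neighbour - base))
    return np.vstack(new_rows)

rng = np.random.default_rng(0)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])  # hypothetical minority samples
synthetic = synthesize(minority, n_new=4, rng=rng)
print(synthetic)
```

Because SMOTE sits inside the pipeline, the oversampling is applied only when fitting and never to the test data.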

In [None]:
DTP_SM.fit(X_train, num_y_train)

In [None]:
result_11= DTP_SM.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_11))

#### SMOTE with Logistic Regression

In [None]:
LGR_SM= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
               ("feature_selection", SelectKBest(score_func=f_classif, k=10)),# Perform feature selection
               ("smote", SMOTE(random_state=100)),  # Apply SMOTE for oversampling
               ("model", LogisticRegression(random_state= 100))  
              ])

In [None]:
LGR_SM.fit(X_train, num_y_train)

In [None]:
result_12= LGR_SM.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_12))

#### SMOTE with Random Forest

In [None]:
RF_SM= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
               ("feature_selection", SelectKBest(score_func=f_classif, k=10)),# Perform feature selection
               ("smote", SMOTE(random_state=100)),  # Apply SMOTE for oversampling
               ("model",RandomForestClassifier(random_state= 100, n_estimators= 50))  
              ])

In [None]:
RF_SM.fit(X_train, num_y_train)

In [None]:
result_13= RF_SM.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_13))

#### SMOTE with XGBOOST

In [None]:
XGB_SM= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
               ("feature_selection", SelectKBest(score_func=f_classif, k=10)),# Perform feature selection
               ("smote", SMOTE(random_state=100)),  # Apply SMOTE for oversampling
               ("model", XGBClassifier(random_state= 100))  
              ])

In [None]:
XGB_SM.fit(X_train, num_y_train)

In [None]:
result_14= XGB_SM.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_14))

#### SMOTE with SVM

In [None]:
SVM_SM= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe), 
               ("feature_selection", SelectKBest(score_func=f_classif, k=10)),# Perform feature selection
               ("smote", SMOTE(random_state=100)),  # Apply SMOTE for oversampling
               ("model", SVC(random_state= 100))  
              ])

In [None]:
SVM_SM.fit(X_train, num_y_train)

In [None]:
result_15= SVM_SM.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_15))

### Comparing results of Class_weight vs SMOTE vs Baseline

In [None]:
imbalance_result= {"Decision Tree":result_1, "Decision Tree_SM":result_11, "Decision_Tree_CW":result_6, 
               
                   "Logistic Regression": result_2, "Logistic Regression_SM":result_12, "Logistic_Regression_CW": result_7, 
            
             "Random Forest": result_3, "Random Forest_SM": result_13, "Random_Forest_CW":result_8, 
               
               "XGBoost": result_4, "XGB_SM":result_14, "XGB_CW":result_9, 
                   
                   "SVM": result_5, "SVM_SM": result_15,"SVM_CW":result_10}

for key, value in imbalance_result.items():
    
    print(f"Classification Report for {key}, is: \n\n",(classification_report(num_y_test,value)))

### Notes:

- Balancing did improve the performance of our model, especially for the yes class. 

- At the end of this, we realized that the class-weights method did relatively better than the SMOTE method

### Ensemble Techniques

In this section, we will look at how various ensemble learning techniques affect our models' performance. Some models have low bias and high variance, while others have high bias and low variance, so we will use the appropriate ensemble technique to curb each problem.



### Framework:

- Bagging works well for models with high variance and low bias. These include algorithms such as Random Forest, XGBoost, Decision Tree, and SVM

- Boosting works well for models with low variance and high bias, like Logistic Regression

The idea here is to see how these affect our models, and which to choose for our hyperparameter tuning
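
One way to check whether bagging helps a high-variance model is to compare cross-validated scores for a single estimator and its bagged version. The sketch below does this on synthetic data; the dataset, the 10 estimators and the 5 folds are assumed settings, and the scores it prints say nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=10, random_state=0
)

# Accuracy on each of the 5 folds for both models
tree_scores = cross_val_score(single_tree, X_toy, y_toy, cv=5)
bagged_scores = cross_val_score(bagged_trees, X_toy, y_toy, cv=5)

print("single tree: mean", tree_scores.mean(), "std", tree_scores.std())
print("bagged trees: mean", bagged_scores.mean(), "std", bagged_scores.std())
```

A lower spread across folds for the bagged model would be consistent with reduced variance.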

### Bagging Decision Tree with Class Weight 

In [None]:
en_DTP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
            ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", BaggingClassifier(DecisionTreeClassifier(
                   random_state=100,
                   class_weight=weight), bootstrap_features= True,random_state= 100))
              
              ])

In [None]:
en_DTP.fit(X_train, num_y_train)

In [None]:
result_16= en_DTP.predict(X_test)
print(classification_report(num_y_test, result_16))

### Logistic Regression with Class Weight

In [None]:
en_LRP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
            ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", LogisticRegression(
                   random_state=100,
                   class_weight=weight))
              
              ])

In [None]:
en_LRP.fit(X_train, num_y_train)

In [None]:
result_17= en_LRP.predict(X_test)
print(classification_report(num_y_test, result_17))

### Bagging Random Forest with Class Weight

In [None]:
en_RFP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
            ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", BaggingClassifier(RandomForestClassifier(
                   random_state=100,
                   class_weight=weight), bootstrap_features= True,random_state= 100))
              
              ])

In [None]:
en_RFP.fit(X_train, num_y_train)

In [None]:
# Predict with the bagged Random Forest pipeline (not the Decision Tree one)
result_18= en_RFP.predict(X_test)
print(classification_report(num_y_test, result_18))

### Bagging XGBOOST with Class Weight

In [None]:
en_XGB= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
            ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", BaggingClassifier(XGBClassifier(
                   random_state=100,
                  scale_pos_weight=pos_weight), bootstrap_features= True,random_state= 100))
              
              ])

In [None]:
en_XGB.fit(X_train, num_y_train)

In [None]:
result_19= en_XGB.predict(X_test)
print(classification_report(num_y_test, result_19))

### Bagging SVM with Class Weight

In [None]:
en_SVM= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
            ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", BaggingClassifier(SVC(
                   random_state=100,
                   class_weight=weight), bootstrap_features= True,random_state= 100))
              
              ])

In [None]:
en_SVM.fit(X_train, num_y_train)

In [None]:
result_20= en_SVM.predict(X_test)
print(classification_report(num_y_test, result_20))

### Ensemble Result Comparison:

- In this section, we compare the results of our base models, our class-weighted models without ensembling, and our ensembled models.

In [None]:
imbalance_result= {"Decision Tree":result_1, "Decision_Tree_CW":result_6, "Decision Tree_en":result_16,

                   "Logistic Regression": result_2, "Logistic_Regression_CW": result_7, "Logistic Regression_en":result_17,

                   "Random Forest": result_3, "Random_Forest_CW":result_8, "Random Forest_en": result_18,

                   # result_19 holds the bagged XGBoost predictions
                   "XGBoost": result_4, "XGB_CW":result_9, "XGB_en":result_19,

                   "SVM": result_5, "SVM_CW":result_10, "SVM_en": result_20}

for key, value in imbalance_result.items():
    
    print(f"Classification Report for {key}, is: \n\n",(classification_report(num_y_test,value)))
   

### Notes after Ensembling:

- Our models' predictions for the Yes class improved greatly once we balanced the class weights and used ensemble methods.

- The base SVM model had the highest precision for the Yes class, while Logistic Regression with class weights had the highest recall for the Yes class.

In summary:

- The class_weights imbalance handling method performed better than SMOTE 

- Overall, Logistic Regression and SVM performed best, in that order.
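
The comparison above relies on reading many classification reports one after another. A minimal sketch of collecting the Yes-class precision, recall, and F1 into a single table is shown below. The labels and predictions are made up for illustration (`y_true_demo` and `demo_results` are hypothetical names); in the notebook the same loop could run over `imbalance_result` and `num_y_test`, assuming the Yes class is encoded as 1.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up ground truth: 1 = Yes (churn), 0 = No
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Made-up predictions from two hypothetical models
demo_results = {
    "cautious_model": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "eager_model": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
}

rows = []
for name, preds in demo_results.items():
    rows.append({
        "model": name,
        "precision_yes": precision_score(y_true_demo, preds, pos_label=1),
        "recall_yes": recall_score(y_true_demo, preds, pos_label=1),
        "f1_yes": f1_score(y_true_demo, preds, pos_label=1),
    })

# One row per model, sorted so the model with the best Yes-class recall comes first
summary = (
    pd.DataFrame(rows)
    .set_index("model")
    .sort_values("recall_yes", ascending=False)
)
print(summary.round(3))
```

Sorting by a single column makes the precision and recall trade-off between models easier to see than scanning separate reports.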

### Using Stacking to Create a Hybrid Model

From our initial modelling, we saw that the base SVM model and the class-weighted Logistic Regression model both did well. Therefore, we would like to combine them to see how well the two perform together.

In [None]:
from sklearn.ensemble import StackingClassifier

In [None]:
##building our stacking classifier

# Define the stacking classifier
stacking_clf = StackingClassifier(
    estimators=[
        ("logistic", LogisticRegression(
                   random_state=100,
                   class_weight=weight)),
        ("svm", SVC(random_state=100))
    ],
    final_estimator=LogisticRegression(
                   random_state=100, class_weight=weight))

In [None]:
st_LRP= Pipeline([("spaceImputer", SpaceImputer(remove_space)),
               ("coltrans", col_pipe),  
            ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
               ("model", stacking_clf)
              
              ])

In [None]:
st_LRP.fit(X_train, num_y_train)

In [None]:
result_21= st_LRP.predict(X_test)

In [None]:
print(classification_report(num_y_test, result_21))

In [None]:
print(classification_report(num_y_test, result_17))

### Results after stacking:

- The F1 score and accuracy of our stacked model were 1 percent higher than those of the class-weighted Logistic Regression.


## Hyperparameter Tuning

From our modeling, Logistic Regression stood out when we considered the recall and relative precision of the Yes class (churners). So, here we will run a randomized search with cross-validation (`RandomizedSearchCV`) to find hyperparameters that improve its performance.

#### What will we do?

- We will run the hyperparameter search over the model step of our pipeline, find the best parameters, and then set them on the pipeline.
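
Not every `penalty` and `solver` pair is valid for `LogisticRegression` (for example, `lbfgs`, `newton-cg`, and `sag` do not support `l1`, and `elasticnet` needs `saga` plus an `l1_ratio`). With a single flat grid such as the one in this section, the invalid combinations fail during fitting and receive a NaN score. One option is to pass `RandomizedSearchCV` a list of dictionaries, so that only valid combinations are sampled. The sketch below assumes synthetic data and a simplified pipeline (`demo_pipe`, `valid_params`, and `demo_search` are hypothetical names); optimising for F1 is also an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.73, 0.27], random_state=100
)

# Simplified stand-in for the notebook's pipeline
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=2000)),
])

C_values = np.logspace(-4, 4, 20)

# Each dictionary only pairs solvers with penalties they support
valid_params = [
    {"model__solver": ["lbfgs", "newton-cg", "sag"],
     "model__penalty": ["l2"],
     "model__C": C_values},
    {"model__solver": ["liblinear", "saga"],
     "model__penalty": ["l1", "l2"],
     "model__C": C_values},
]

demo_search = RandomizedSearchCV(
    demo_pipe,
    param_distributions=valid_params,
    n_iter=10,
    cv=5,
    scoring="f1",       # assumption: optimise F1 of the positive (Yes) class
    random_state=100,   # makes the sampled candidates reproducible
)
demo_search.fit(X_demo, y_demo)

print("Best Parameters:", demo_search.best_params_)
print("Best Score:", demo_search.best_score_)
```

Setting `random_state` on the search itself, rather than tuning the model's `random_state`, keeps the sampled candidates repeatable between runs.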

In [None]:
CW_LRP

In [None]:
###creating the params

params = {
    "model__penalty": ["l1", "l2", "elasticnet", None],
    "model__C": np.logspace(-4, 4, 20),
    "model__intercept_scaling": [1, 2, 3, 4, 5],
    "model__solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
    "model__max_iter": [100, 1200, 2000, 3000],
    "model__random_state": [24, 42, 57, 100, 500]
}

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV

In [None]:
Random_HPT= RandomizedSearchCV(estimator=CW_LRP, param_distributions=params, cv=5)

In [None]:
Random_HPT.fit(X_train, num_y_train)

In [None]:
best_params = Random_HPT.best_params_
best_score = Random_HPT.best_score_
cv_results = Random_HPT.cv_results_

In [None]:
print("Best Parameters:", best_params)
print("Best Score:", best_score)


In [None]:
CW_LRP.set_params(model__solver='sag',
                  model__random_state= 24,
                  model__penalty='l2',
                  model__max_iter=3000,
                  model__intercept_scaling=4,
                  model__C=0.08858667904100823)

In [None]:
CW_LRP

In [None]:
CW_LRP.fit(X_train, num_y_train)

In [None]:
final_result= CW_LRP.predict(X_test)

In [None]:
print(classification_report(num_y_test, final_result))

## Hypothesis Testing

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Specify the dependent variable and independent variables
dependent_variable = 'Churn'
independent_variable = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 
                        'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                        'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 
                        'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges']

# Perform ANOVA for each independent variable
for iv in independent_variable:
    formula = f"{dependent_variable} ~ C({iv})"
    model = ols(formula, data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(f"ANOVA for {iv}")
    print(anova_table)
    print("\n")


- The p-value (PR(>F)) for "gender" is 0.469905. Interpretation: this is greater than the typical significance level of 0.05, so there is no statistically significant difference in churn between genders.
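
As a cross-check, a chi-square test of independence is a common alternative when both variables are categorical, as gender and churn are. This is a different technique from the ANOVA used above, shown here only as a sketch. The counts are invented for illustration (`demo_df` is a hypothetical DataFrame, not the project data), so the resulting p-value says nothing about the real dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented example data: 50 customers per gender with similar churn rates
demo_df = pd.DataFrame({
    "gender": ["Female"] * 50 + ["Male"] * 50,
    "Churn": (["Yes"] * 12 + ["No"] * 38) + (["Yes"] * 15 + ["No"] * 35),
})

# Cross-tabulate the two categorical variables
contingency = pd.crosstab(demo_df["gender"], demo_df["Churn"])
print(contingency)

# Test whether churn is independent of gender
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")

alpha = 0.05
if p_value < alpha:
    print("Churn and gender appear to be associated at the 5% level.")
else:
    print("No statistically significant association between churn and gender.")
```

On the real data, the same two calls could be applied to `df["gender"]` and `df["Churn"]`, assuming those columns hold the raw categories.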