# TELCO CHURN ANALYSIS

## Hypothesis

Null Hypothesis: There is no relationship between the monthly charges and the churn of customers.

Alternate Hypothesis: There is a relationship between the monthly charges and the churn of customers.

## Analytical Questions
1. What is the overall churn rate of the telecommunication company?
2. What is the average monthly charges to churn customers compared to non-churn customers?
3. What percentage of the top 100 most charged customers churned?
4. What percentage of the top 100 least charged customers churned?
5. What is the churn rate of male customers with partners, dependents and high monthly charges?
6. What is the churn rate of customers without online security?
7. What is the churn rate of customers without online backup?
8. What is the churn rate of customers without device protection?
9. What is the churn rate of customers without Tech support?
10. How does the absence of online security, online backup, device protection and Tech support add up to lead to churning?
11. How does the length of customers' contract affect their likelihood of churn?
12. How does the length of customers' tenure affect their likelihood of churn?

In [980]:
# Importing the needed packages
import pandas as pd
import numpy as np

# Libraries to create connection string to SQL server
import pyodbc
from dotenv import dotenv_values

# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library for testing the hypothesis
import scipy.stats as stats

# Library for splitting the train data
from sklearn.model_selection import train_test_split

# Library for feature scaling
from sklearn.preprocessing import MinMaxScaler

# Library for feature encoding
from sklearn.preprocessing import OneHotEncoder

# Libraries for balancing the dataset
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Libraries for modelling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Libraries for evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import warnings

warnings.filterwarnings('ignore')

There are three datasets for this analysis, stored in three different places. The first is held on a SQL Server database and is accessed remotely. The second is an Excel file on OneDrive; the access link was provided and used to download it. The third is a CSV file in a GitHub repository, whose link was also provided and used to clone the dataset onto the local machine.

The first and last datasets have been identified as the train datasets, while the second has been identified as the test dataset. The train datasets will be assessed, merged together and used to build the models, while the test dataset will be used separately to test them.

## Accessing the first dataset from SQL database

In [981]:
# Load the environment variable in the .env file into a dictionary

environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the .env file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")

# The connection string is an f-string that includes all the variables above to establish a connection to the server.
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"


In [982]:
# Using the connect method of the pyodbc library to pass in the connection string.

# N/B: This will connect to the server and might take a few seconds to be complete.
# Check your internet connection if it takes more time than necessary.

connection = pyodbc.connect(connection_string)

In [983]:
# Get the dataset using the SQL query shown below
# dbo.LP2_Telco_churn_first_3000 is the name of the dataset, dbo being a naming convention in Microsoft SQL Server.

query = "Select * from dbo.LP2_Telco_churn_first_3000"
df1 = pd.read_sql(query, connection)

df1.head()

## Accessing the last dataset

In [984]:
# Loading the last dataset.

df3 = pd.read_csv('LP2_Telco-churn-last-2000.csv')

In [985]:
# Displaying the first five rows of the last dataset.

df3.head()

In [986]:
# Inspecting the columns of the first dataset.

df1.columns

In [987]:
# Inspecting the columns of the last dataset.

df3.columns

In [988]:
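# Since both datasets have the same column names, they can be concatenated to form the train dataset.
# ignore_index=True gives the combined frame a fresh, non-repeating row index.

train_data = pd.concat([df1, df3], ignore_index=True)

# Saving the train dataset to a new csv file (index=False keeps the row index out of the file).
train_data.to_csv('Train-Data.csv', index=False)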

train_data.head()

## Accessing the second dataset (test dataset)

In [989]:
# Loading the test dataset.

test_data = pd.read_excel('Telco-churn-second-2000.xlsx')
test_data.head()

## EDA

In [990]:
# Checking the number of rows and columns on the train dataset.

train_data.shape

In [991]:
# Checking the number of rows and columns on the test dataset.

test_data.shape

The train dataset has 5043 rows and 21 columns while the test dataset has 2000 rows and 20 columns. Let's identify the column in the train dataset that is absent in the test dataset.

In [992]:
# Inspecting the columns of the train dataset.

train_data.columns

In [993]:
# Inspecting the columns of the test dataset.

test_data.columns

As can be seen, the test dataset has the same columns as the train dataset, with the exception of the 'Churn' column, which is found on the train dataset alone. This is understandable: the 'Churn' column on the train dataset records whether each customer churned, and it is the target used to build the ML models. The column is not needed on the test dataset; instead, the model is applied to the test dataset to check its ability to predict whether a customer will churn.
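
The same comparison can be made programmatically rather than by reading the two column listings. The snippet below is a minimal sketch on two hypothetical miniature frames (`train_demo` and `test_demo` are illustrative stand-ins, not the notebook's data).

```python
import pandas as pd

# Hypothetical miniature frames standing in for train_data and test_data.
train_demo = pd.DataFrame({"customerID": ["a"], "tenure": [1], "Churn": ["No"]})
test_demo = pd.DataFrame({"customerID": ["b"], "tenure": [2]})

# Columns present in the train frame but absent from the test frame.
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()
print(only_in_train)
```

Applied to the real frames, the same call would be `train_data.columns.difference(test_data.columns)`.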

In [994]:
# Checking the datatypes and the presence of missing values on the train dataset.

train_data.info()

In [995]:
# Checking the datatypes and the presence of missing values on the test dataset.

test_data.info()

Note that the datatype of the 'SeniorCitizen' column is an object on the train data but an integer on the test data.

In [996]:
# Confirming the number of cells with missing values on each column of the train dataset.

train_data.isna().sum()

The 'MultipleLines' column has 269 missing values. The 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV' and 'StreamingMovies' columns all have 651 missing values. This needs to be evaluated further to find out if these missing values are exactly on the same rows. The 'TotalCharges' column has 5 missing values while the 'Churn' column has 1 missing value.
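
One way to check whether the missing values fall on exactly the same rows is to count, per row, how many of the service columns are missing. The sketch below uses a small hypothetical frame (`demo`) with only three of the columns; it is an illustration of the idea, not a result from the actual dataset.

```python
import pandas as pd

service_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection"]

# Hypothetical sample: rows 1 and 3 are missing in every service column.
demo = pd.DataFrame({
    "OnlineSecurity": ["Yes", None, "No", None],
    "OnlineBackup": ["No", None, "Yes", None],
    "DeviceProtection": ["No", None, "No", None],
})

# Number of service columns that are missing in each row.
missing_per_row = demo[service_cols].isna().sum(axis=1)

# True only if every row is either fully present or missing in all listed columns.
same_rows = bool(missing_per_row.isin([0, len(service_cols)]).all())
print(missing_per_row.tolist(), same_rows)
```

If `same_rows` is True on the real data, the missing values in those columns sit on the same set of rows.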

In [997]:
# Confirming that there are no missing values on the test dataset.

test_data.isna().sum()

There are no missing values on the test dataset.

In [998]:
# Checking for the presence of duplicates on the train dataset.

train_data.duplicated().sum()

There are no duplicate rows on the train dataset.

In [999]:
# Checking for the presence of duplicates on the test dataset.

test_data.duplicated().sum()

There are no duplicate rows on the test dataset.

# Data Transformation

In [1000]:
# Investigating the columns on the train dataset.

train_data.columns
for column in train_data.columns:
    print('column: {} - unique value: {}'.format(column, train_data[column].unique()))

There are many columns with inconsistent values such as True, False, 'No phone service' and/or 'No internet service'.
These values will be replaced with 'Yes' or 'No' as appropriate to ensure consistency.

True will be replaced with 'Yes' because it means that the customer receives the service. False, 'No phone service' and 'No internet service' will be replaced with 'No' because they mean that the customer does not (or cannot) receive the service.

In [1001]:
# Replace True with 'Yes' and replace False, 'No internet service' and 'No phone service' with 'No' in the train dataset.
train_data = train_data.replace({
    True: 'Yes',
    False: 'No',
    'No internet service': 'No',
    'No phone service': 'No'
}, inplace = False)

# Confirm that the changes on the train dataset have been effected.
train_data.columns
for column in train_data.columns:
    print('column: {} - unique value: {}'.format(column, train_data[column].unique()))

None is Python's representation of a missing value here. The columns containing None are all categorical, so the mode of each column will be obtained and used to replace None.

In [1002]:
# Replace None on each column with the mode of the column.
def replace_none_with_mode(train_data):
   categorical_cols = train_data.select_dtypes(include='object').columns  # Select categorical columns

   for col in categorical_cols:
       mode_val = train_data[col].mode()[0]  # Calculate the mode of the column
       train_data[col] = train_data[col].replace({None: mode_val})  # Replace None values with the mode

   return train_data
train_data = replace_none_with_mode(train_data)

# Confirm that the changes on the train dataset have been effected.
train_data.columns
for column in train_data.columns:
    print('column: {} - unique value: {}'.format(column, train_data[column].unique()))

In [1003]:
# Investigate the columns on the test dataset.

test_data.columns
for column in test_data.columns:
    print('column: {} - unique value: {}'.format(column, test_data[column].unique()))

In [1004]:
# Replace True with 'Yes' and replace False, 'No internet service' and 'No phone service' with 'No' in the test dataset.
test_data = test_data.replace({
    True: 'Yes',
    False: 'No',
    'No internet service': 'No',
    'No phone service': 'No'
}, inplace = False)

# Confirm that the changes on the test dataset have been effected.
test_data.columns
for column in test_data.columns:
    print('column: {} - unique value: {}'.format(column, test_data[column].unique()))

In [1005]:
# Drop the customerID column from the train dataset since it has unique values that are non-beneficial to our modelling

train_data.drop(columns='customerID', inplace=True)

In [1006]:
# Drop the customerID column from the test dataset since it has unique values that are non-beneficial to our modelling

test_data.drop(columns='customerID', inplace=True)

In [1007]:
# Checking for missing values on the train dataset.

train_data.isna().sum()

The train dataset no longer has missing values, because the None values (which represent missing values) were replaced with the mode of each column.

In [1008]:
# Investigating the 'SeniorCitizen' column in the test dataset.

test_data['SeniorCitizen'].unique()

The 'SeniorCitizen' column has numerical values (0 and 1). These will be changed to 'No' and 'Yes' respectively in order to change the column datatype to an object.

In [1009]:
# Replace 0 with 'No' and 1 with Yes in the 'SeniorCitizen' column of the test dataset.
test_data['SeniorCitizen'] = test_data['SeniorCitizen'].replace({
    0: 'No',
    1: 'Yes'
}, inplace = False)

# Confirm that the changes on the test data have been effected
test_data.info()

In [1010]:
# Checking for missing values on the test dataset.

test_data.isna().sum()

The test data has no missing values.

In [1011]:
train_data.info()

The 'TotalCharges' column has an object datatype. It needs to be converted to float to enable calculations on the column. The column will first be investigated to see some of its features.
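
As a small illustration of how `pd.to_numeric` with `errors='coerce'` behaves, the sketch below converts a hypothetical series of strings. The blank string is an assumption about the kind of entry that cannot be parsed; the actual column may contain different values.

```python
import pandas as pd

# Hypothetical raw values: two valid numbers, one blank string, one missing entry.
raw = pd.Series(["29.85", " ", "1889.5", None])

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

This is why the number of missing values in a column can increase after the conversion: entries that were text, but not numbers, become NaN.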

In [1012]:
# Investigating the 'TotalCharges' column in the train dataset.

train_data['TotalCharges'].unique()

In [1013]:
# Converting the datatype of the 'TotalCharges' column in the train dataset to float by changing the contents to
# numerical values.

train_data['TotalCharges'] = pd.to_numeric(train_data['TotalCharges'], errors = 'coerce')
train_data.info()

In [1014]:
# Checking for missing values

train_data.isna().sum()

The 'TotalCharges' column now has three missing values. These will be replaced with the mean value since it's now a numerical column.

In [1015]:
# Calculate the mean of the 'TotalCharges' column
mean_value = train_data['TotalCharges'].mean()

# Fill missing values with the mean value
train_data['TotalCharges'] = train_data['TotalCharges'].fillna(mean_value)

# Confirm that there are no missing values
train_data.isna().sum()

In [1016]:
# Investigating the 'TotalCharges' column in the test dataset.

test_data['TotalCharges'].unique()

In [1017]:
# Converting the datatype of the 'TotalCharges' column in the test dataset to float by changing the contents to
# numerical values.

test_data['TotalCharges'] = pd.to_numeric(test_data['TotalCharges'], errors = 'coerce')
test_data.info()

In [1018]:
# Checking for missing values

test_data.isna().sum()

In [1019]:
# Calculate the mean of the 'TotalCharges' column
mean_value = test_data['TotalCharges'].mean()

# Fill missing values with the mean value
test_data['TotalCharges'] = test_data['TotalCharges'].fillna(mean_value)

# Confirm that there are no missing values
test_data.isna().sum()

In [1020]:
# Obtain the categorical and numerical columns of the train dataset
train_cat = train_data.select_dtypes(include=['object']).columns
train_num = train_data.select_dtypes(include=['float64', 'int64']).columns

# Obtain the categorical and numerical columns of the test dataset
test_cat = test_data.select_dtypes(include=['object']).columns
test_num = test_data.select_dtypes(include=['float64', 'int64']).columns

In [1021]:
# Evaluate the categorical values on the train dataset.

train_data[train_cat].describe()

In [1022]:
# Evaluate the numerical values on the train dataset.

train_data[train_num].describe()

In [1023]:
# Evaluate the correlation of the numerical values on the train dataset.

train_data[train_num].corr()

In [1024]:
# Visualizing the correlation with a heatmap

sns.heatmap(train_data[train_num].corr(), annot=True)

Since there are calculations to be done on the 'Churn' column of the train dataset while answering the analytical questions, 'No' and 'Yes' values will be changed to 0 and 1 respectively in order to convert the column to a numerical column.

In [1025]:
# Changing 'No' and 'Yes' to 0 and 1 respectively on the 'Churn' column of the train dataset.

train_data['Churn'] = train_data['Churn'].replace(['No', 'Yes'], [0,1])
train_data['Churn'].unique()

# Hypothesis Testing

Hypothesis
Null Hypothesis: There is no relationship between the monthly charges and the churn of customers.

Alternate Hypothesis: There is a relationship between the monthly charges and the churn of customers.

The hypothesis was tested using the chi-square test of independence.
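
Because monthly charges are continuous, a cross-tabulation on the raw values produces one row per distinct charge, and many cells can hold very few customers. A common variation, offered here only as a hedged sketch and not as the method used in this notebook, is to group the charges into bands first. The data below is synthetic, and the quartile bands and the `ChargeBand` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data (assumption): churn probability rises with the monthly charge.
rng = np.random.default_rng(0)
n = 1000
charges = rng.uniform(20, 120, size=n)
churn_prob = 0.1 + 0.3 * (charges - 20) / 100
churn = (rng.random(n) < churn_prob).astype(int)
demo = pd.DataFrame({"MonthlyCharges": charges, "Churn": churn})

# Group the continuous charges into four equally sized bands (quartiles).
demo["ChargeBand"] = pd.qcut(demo["MonthlyCharges"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Chi-square test of independence on the 4 x 2 table of band against churn.
observed = pd.crosstab(demo["ChargeBand"], demo["Churn"])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(observed)
print("Chi-square statistic:", chi2)
print("Degrees of freedom:", dof)
print("P-value:", p_value)
```

With banded charges every expected cell count is large, which is the condition under which the chi-square approximation is usually considered reliable.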

In [1026]:
# Define the null hypothesis.
null_hypothesis = "There is no relationship between the monthly charge and churn of customers."

# Define the alternative hypothesis.
alternative_hypothesis = "There is a relationship between the monthly charge and churn of customers."

# Perform the chi-square test
observed = pd.crosstab(train_data['MonthlyCharges'], train_data['Churn'])

chi2, p_value, _, _ = stats.chi2_contingency(observed)


# Set the significance level
alpha = 0.05


# Print the test results
print("Null Hypothesis:", null_hypothesis)

print("Alternative Hypothesis:", alternative_hypothesis)

print("Significance Level (alpha):", alpha)

print("Chi-square statistic:", chi2)

print("P-value:", p_value)


# Compare the p-value with the significance level
if p_value < alpha:

    print("Result: Reject the null hypothesis. There is a relationship between the monthly charges and churn of customers.")

else:

    print("Result: Fail to reject the null hypothesis. There is not enough evidence of a relationship between monthly charges and churn of customers.")

# Answering Questions with Visualizations

# Question 1
What is the overall churn rate of the telecommunication company?

In [1027]:
# Calculate the churn rate
total_customers = len(train_data)
churned_customers = train_data['Churn'].sum()
churn_rate = (churned_customers / total_customers) * 100

# Display the churn rate
print('Total Customers:', total_customers)
print('Churned Customers:', churned_customers)
print(f'Churn Rate: {churn_rate.round(1)}%')

# Plot the churn rate
plt.bar(['Churned', 'Not Churned'], [churn_rate, 100-churn_rate])
plt.title('Overall Churn Rate Of The Telecommunication Network')
plt.xlabel('Churn Status')
plt.ylabel('Percentage')
plt.ylim([0, 100])
plt.show()

Of the 5043 customers in the train dataset, 1336 churned, giving the telecommunication network a churn rate of 26.5%.

# Question 2
What is the average monthly charges to churn customers compared to non-churn customers?

In [1028]:
# Separate churn and non-churn customers
churn_customers = train_data[train_data['Churn'] == 1]
non_churn_customers = train_data[train_data['Churn'] == 0]

# Calculate the average monthly charges for churn and non-churn customers
avg_churn_charges = churn_customers['MonthlyCharges'].mean()
avg_non_churn_charges = non_churn_customers['MonthlyCharges'].mean()

# Display the average charges
print(f'Average Monthly Charges To Churn Customers: ${round(avg_churn_charges, 2)}')
print(f'Average Monthly Charges To Non-Churn Customers: ${round(avg_non_churn_charges, 2)}')

# Plot the average charges
labels = ['Churn Customers', 'Non-Churn Customers']
charges = [avg_churn_charges, avg_non_churn_charges]
plt.bar(labels, charges)
plt.ylabel('Average Monthly Charges ($)')
plt.title('Average Monthly Charges To Churn Customers Compared to Non-Churn Customers')
plt.show()

As shown, the average monthly charge for churned customers is 75.21 dollars, higher than the average monthly charge for non-churned customers (61.44 dollars).

# Question 3
What percentage of the top 100 most charged customers churned?

In [1029]:
# Sort the DataFrame by TotalCharges in descending order

most_charged_data = train_data.sort_values(by='TotalCharges', ascending=False)

# Select the top 100 most charged customers
top_100_most_charged_customers = most_charged_data.head(100)

# Count the number of churned customers among the top 100
most_charged_churned_customers = top_100_most_charged_customers[top_100_most_charged_customers['Churn'] == 1]
most_charged_churned_count = most_charged_churned_customers.shape[0]

# Calculate the percentage of churned customers
most_charged_percentage_churned = (most_charged_churned_count / 100) * 100

# Display the result
print(f'The Percentage Of The Top 100 Most Charged Customers Who Churned: {most_charged_percentage_churned}%')

# Create a pie chart to visualize the results
labels = ['Churned Customers', 'Non-Churned Customers']
sizes = [most_charged_percentage_churned, 100 - most_charged_percentage_churned]
colors = ['#FF7F7F', '#7FB3FF']

# Plot the top 100 most charged customers who churned
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.title('The Percentage Of The Top 100 Most Charged Customers Who Churned')
plt.legend(loc=(1,0.5))
plt.show()

Against an overall churn rate of 26.5%, only 7% of the top 100 most charged customers churned. This suggests that total charges are not the only determinant of churn; other factors also influence the churn rate.

# Question 4
What percentage of the top 100 least charged customers churned?

In [1030]:
# Sort the DataFrame by TotalCharges in ascending order

least_charged_data = train_data.sort_values(by='TotalCharges', ascending=True)

# Select the top 100 least charged customers
top_100_least_charged_customers = least_charged_data.head(100)

# Count the number of churned customers among the top 100
least_charged_churned_customers = top_100_least_charged_customers[top_100_least_charged_customers['Churn'] == 1]
least_charged_churned_count = least_charged_churned_customers.shape[0]

# Calculate the percentage of churned customers
least_charged_percentage_churned = (least_charged_churned_count / 100) * 100

# Display the result
print(f'The Percentage Of The Top 100 Least Charged Customers Who Churned: {least_charged_percentage_churned}%')

# Create a pie chart to visualize the results
labels = ['Churned Customers', 'Non-Churned Customers']
sizes = [least_charged_percentage_churned, 100 - least_charged_percentage_churned]
colors = ['#FF7F7F', '#7FB3FF']

# Plot the top 100 least charged customers who churned
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.title('The Percentage Of The Top 100 Least Charged Customers Who Churned')
plt.legend(loc=(1,0.5))
plt.show()

33% of the top 100 least charged customers churned. This strengthens the argument that charges are not the only determinant of churn: a good number of the customers with the lowest charges churned even though the amounts they paid were reasonably low.

# Question 5
What is the churn rate of male customers with partners, dependents and high monthly charges?

N/B: It is assumed that high monthly charges refer to monthly charges above $100, which matches the `MonthlyCharges > 100` filter used in the code.

In [1031]:
# Filter the data to include only male customers with partners and dependents and high monthly charges
filtered_train_data = train_data[(train_data['gender'] == 'Male') & (train_data['Partner'] == 'Yes') & (train_data['Dependents'] == 'Yes') & (train_data['MonthlyCharges'] > 100)]

# Calculate the churn rate.
rate_churned = filtered_train_data[filtered_train_data['Churn'] == 1].shape[0]/len(filtered_train_data) * 100

# Calculate the non-churn rate.
rate_non_churned = filtered_train_data[filtered_train_data['Churn'] == 0].shape[0]/len(filtered_train_data) * 100

# Display the result
print(f'Churn Rate of Male Customers with Partners, Dependents and High Monthly Charges: {rate_churned}%')
print(f'Non-Churn Rate of Male Customers with Partners, Dependents and High Monthly Charges: {rate_non_churned}%')

# Create a bar plot to visualize the churn rate and non-churn rate.
labels = ['Churned', 'Non-Churned']
charges = [rate_churned, rate_non_churned]

# Plot the churn rate of male customers with partners, dependents and high monthly charges
plt.bar(labels, charges)
plt.xlabel('Churn Status')
plt.ylabel('Churn Rate (%)')
plt.title('Churn Rate Of Male Customers With Partners, Dependents \nAnd High Monthly Charges')
plt.show()

On the assumption that male customers with partners and dependents have higher financial demands to meet under average conditions, this question examined how far high charges influenced the churn rate of this set of customers. Only 20% of male customers with partners, dependents and high monthly charges churned. This again suggests that charges are not the only determinant of churn and that other factors strongly influence the churn rate.

# Question 6
What is the churn rate of customers without online security?

In [1032]:
# Filter the DataFrame to include only customers without online security
no_security_customers = train_data[train_data['OnlineSecurity'] == 'No']

# Calculate the churn rate of customers without online security
churned_customers = no_security_customers[no_security_customers['Churn'] == 1]
churned_rate = (churned_customers.shape[0] / no_security_customers.shape[0]) * 100

# Display the result
print(f'Churn Rate Of Customers Without Online Security: {churned_rate}%')

# Create a pie chart to visualize the results
labels = ['Churned', 'Non-Churned']
sizes = [churned_rate, 100 - churned_rate]
colors = ['#FF7F7F', '#7FB3FF']

# Plot the churn rate of customers without online security
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Churn Rate Of Customers Without Online Security')
plt.legend(loc=(1,0.5))
plt.show()

31.3% of customers without online security churned, which is above the overall churn rate of 26.5%. This suggests that the absence of online security is associated with a higher churn rate.

# Question 7
What is the churn rate of customers without online backup?

In [1033]:
# Filter the DataFrame to include only customers without online backup
no_backup_customers = train_data[train_data['OnlineBackup'] == 'No']

# Calculate the churn rate of customers without online backup
churned_customers = no_backup_customers[no_backup_customers['Churn'] == 1]
churned_rate = (churned_customers.shape[0] / no_backup_customers.shape[0]) * 100

# Display the result
print(f'Churn Rate Of Customers Without Online Backup: {churned_rate}%')

# Create a pie chart to visualize the results
labels = ['Churned', 'Non-Churned']
sizes = [churned_rate, 100 - churned_rate]
colors = ['#FF7F7F', '#7FB3FF']

# Plot the churn rate of customers without online backup
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Churn Rate Of Customers Without Online Backup')
plt.legend(loc=(1,0.5))
plt.show()

29.2% of customers without online backup churned, which is above the overall churn rate of 26.5%. This suggests that the absence of online backup is associated with a higher churn rate.

# Question 8
What is the churn rate of customers without device protection?

In [1034]:
# Filter the DataFrame to include only customers without device protection
no_device_protection_customers = train_data[train_data['DeviceProtection'] == 'No']

# Calculate the churn rate of customers without device protection
churned_customers = no_device_protection_customers[no_device_protection_customers['Churn'] == 1]
churned_rate = (churned_customers.shape[0] / no_device_protection_customers.shape[0]) * 100

# Display the result
print(f'Churn Rate Of Customers Without Device Protection: {churned_rate}%')

# Create a pie chart to visualize the results
labels = ['Churned', 'Non-Churned']
sizes = [churned_rate, 100 - churned_rate]
colors = ['#FF7F7F', '#7FB3FF']

# Plot the churn rate of customers without device protection
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Churn Rate Of Customers Without Device Protection')
plt.legend(loc=(1,0.5))
plt.show()

28.6% of customers without device protection churned, which is above the overall churn rate of 26.5%. This suggests that the absence of device protection is associated with a higher churn rate.

# Question 9
What is the churn rate of customers without Tech support?

In [1035]:
# Filter the DataFrame to include only customers without Tech support
no_tech_support_customers = train_data[train_data['TechSupport'] == 'No']

# Calculate the churn rate of customers without Tech support
churned_customers = no_tech_support_customers[no_tech_support_customers['Churn'] == 1]
churned_rate = (churned_customers.shape[0] / no_tech_support_customers.shape[0]) * 100

# Display the result
print(f'Churn Rate Of Customers Without Tech Support: {churned_rate}%')

# Create a pie chart to visualize the result
labels = ['Churned', 'Non-Churned']
sizes = [churned_rate, 100 - churned_rate]
colors = ['#FF7F7F', '#7FB3FF']

# Plot the churn rate of customers without Tech support
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Churn Rate Of Customers Without Tech Support')
plt.legend(loc=(1,0.5))
plt.show()

31.4% of customers without Tech support churned, which is above the overall churn rate of 26.5%. This suggests that the absence of Tech support is associated with a higher churn rate.
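
Questions 6 to 9 repeat the same calculation for different columns, so it can be wrapped in one small helper. The sketch below assumes a numeric 0/1 `Churn` column as in the train dataset; `churn_rate_where` and the `demo` frame are hypothetical names and data used only for illustration.

```python
import pandas as pd

# Hypothetical sample with two service columns and a 0/1 churn flag.
demo = pd.DataFrame({
    "OnlineSecurity": ["No", "No", "Yes", "No", "Yes"],
    "TechSupport": ["No", "Yes", "Yes", "No", "No"],
    "Churn": [1, 0, 0, 1, 1],
})


def churn_rate_where(df, column, value="No"):
    """Return the churn rate (%) among rows where df[column] equals value."""
    subset = df.loc[df[column] == value, "Churn"]
    return subset.mean() * 100


# One call per service column instead of one copied cell per question.
rates = {col: churn_rate_where(demo, col) for col in ["OnlineSecurity", "TechSupport"]}
print(rates)
```

The mean of a 0/1 column is the share of ones, which is why `subset.mean()` gives the churn rate directly.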

# Question 10
How does the absence of online security, online backup, device protection and Tech support add up to lead to churning?

In [1036]:
# Calculate the number of churned customers based on absence of each feature
churned_customers = train_data[train_data['Churn'] == 1]

absence_of_security = churned_customers[churned_customers['OnlineSecurity'] == 'No'].shape[0]
absence_of_backup = churned_customers[churned_customers['OnlineBackup'] == 'No'].shape[0]
absence_of_protection = churned_customers[churned_customers['DeviceProtection'] == 'No'].shape[0]
absence_of_tech_support = churned_customers[churned_customers['TechSupport'] == 'No'].shape[0]

# Calculate the rate at which the absence of each feature contributes to churning
total_churned_customers = churned_customers.shape[0]

rate_of_security = (absence_of_security / total_churned_customers) * 100
rate_of_backup = (absence_of_backup / total_churned_customers) * 100
rate_of_protection = (absence_of_protection / total_churned_customers) * 100
rate_of_tech_support = (absence_of_tech_support / total_churned_customers) * 100

# Display the results (each figure is the share of churned customers who lacked the feature)
print(f'Share Of Churned Customers Without Online Security: {rate_of_security}%')
print(f'Share Of Churned Customers Without Online Backup: {rate_of_backup}%')
print(f'Share Of Churned Customers Without Device Protection: {rate_of_protection}%')
print(f'Share Of Churned Customers Without Tech Support: {rate_of_tech_support}%')

# Create a bar plot to visualize the results
labels = ['No Online Security', 'No Online Backup', 'No Device Protection', 'No Tech Support']
rates = [rate_of_security, rate_of_backup, rate_of_protection, rate_of_tech_support]

# Plot the share of churned customers who lacked each feature
plt.figure(figsize=(10, 6))
sns.barplot(x=labels, y=rates, palette='viridis')
plt.title('Share Of Churned Customers Lacking Each Feature')
plt.xlabel('Features')
plt.ylabel('Rate (%)')
plt.xticks(rotation=45)
plt.show()

This means that out of all the churned customers, 84.0% did not have online security, 72.4% did not have online backup, 70.6% did not have device protection, and 83.5% did not have Tech support. Most churned customers therefore lacked these services, although these shares on their own do not show that the absence of the services caused the churn.
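
The share of churned customers who lack a service and the churn rate of customers who lack that service are different quantities, and the first depends on how common the service is overall. The sketch below shows the distinction on a hypothetical ten-customer frame; the numbers are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical sample: 8 of 10 customers have no tech support.
demo = pd.DataFrame({
    "TechSupport": ["No"] * 8 + ["Yes"] * 2,
    "Churn": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Share of churned customers who had no tech support.
share_of_churners_without = (demo.loc[demo["Churn"] == 1, "TechSupport"] == "No").mean() * 100

# Churn rate among customers who had no tech support.
churn_rate_without = demo.loc[demo["TechSupport"] == "No", "Churn"].mean() * 100

# Share of all customers who had no tech support, as a baseline.
share_of_all_without = (demo["TechSupport"] == "No").mean() * 100

print(share_of_churners_without, churn_rate_without, share_of_all_without)
```

In this invented sample most churners lack tech support simply because most customers lack it, so comparing the first figure against the baseline share is what indicates whether the service is over-represented among churners.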

# Question 11
How does the length of customers' contract affect their likelihood of churn?

In [1037]:
# Calculate churn rates for different contract lengths
contract_lengths = train_data['Contract'].unique()
churn_rates = []

for length in contract_lengths:
   churn_rate = train_data[train_data['Contract'] == length]['Churn'].value_counts(normalize=True).get(1, 0) * 100
   churn_rates.append(churn_rate)

# Display the contract lengths and churn rates
print(f'Contract Lengths: {contract_lengths}')
print(f'Churn Rates: {churn_rates}')

# Create a bar plot to visualize the churn rates by contract length
plt.figure(figsize=(8, 6))
plt.bar(contract_lengths, churn_rates)
plt.xlabel('Contract Length')
plt.ylabel('Churn Rate (%)')
plt.title('Churn Rates by Contract Length')
plt.show()

As shown, the length of a customer's contract affects the churn rate. Customers on month-to-month contracts have a high churn rate of 43.1%. They are followed by customers on one-year contracts, with a churn rate of 11.6%, while customers on two-year contracts have a low churn rate of 2.4%.
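
The per-contract churn rates can also be computed in a single `groupby` call instead of a loop. This is a minimal sketch on a hypothetical frame, assuming a numeric 0/1 `Churn` column as in the train dataset.

```python
import pandas as pd

# Hypothetical sample of contract types and 0/1 churn flags.
demo = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": [1, 0, 0, 0, 1, 1],
})

# Mean of the 0/1 flag per group is the churn rate; multiply by 100 for a percentage.
rates = demo.groupby("Contract")["Churn"].mean().mul(100).sort_values(ascending=False)
print(rates)
```

The resulting Series keeps the contract labels as its index, so it can be passed to a bar plot directly.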

# Question 12
How does the length of customers' tenure affect their likelihood of churn?

In [1038]:
# Calculate churn rates for different tenure lengths
tenure_lengths = train_data['tenure'].unique()
churn_rates = []

for length in tenure_lengths:
   churn_rate = train_data[train_data['tenure'] == length]['Churn'].value_counts(normalize=True).get(1, 0)
   churn_rates.append(churn_rate)

# Create a bar plot to visualize the churn rates by tenure length
plt.figure(figsize=(8, 6))
plt.bar(tenure_lengths, churn_rates)
plt.xlabel('Tenure Length (Number of Months)')
plt.ylabel('Churn Rate')
plt.title('Churn Rates by Tenure Length')
plt.show()

As shown, the longer a customer's tenure (number of months) with the telecommunication company, the lower the likelihood of churn. This is consistent with the churn rates observed for the different contract lengths.
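
With one bar per month the trend can be noisy, so grouping tenure into bands may make it easier to read. The sketch below uses made-up data, and the band edges (12, 24, 48 and 72 months) are arbitrary choices for illustration, not values taken from this analysis.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'tenure': [1, 3, 10, 14, 30, 40, 60, 70],
    'Churn':  [1, 1, 1, 0, 0, 1, 0, 0],
})

# Hypothetical tenure bands in months; include_lowest keeps a tenure of 0 in the first band
bins = [0, 12, 24, 48, 72]
labels = ['0-12', '13-24', '25-48', '49-72']
toy['tenure_group'] = pd.cut(toy['tenure'], bins=bins, labels=labels, include_lowest=True)

# Churn rate (as a fraction) within each tenure band
churn_by_group = toy.groupby('tenure_group', observed=False)['Churn'].mean()
print(churn_by_group)
```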

# Recommendation

In [1039]:
#

In [1040]:
# # Pandas Profiling
# # TRAIN

# profile = ProfileReport(train_data, title = "Train Dataset", html = {'style': {full_width: True}})
# profile.to_notebook_iframe()
# profile.to_file('(Trainset) Pandas-Profiling_Report.html')

# Feature Engineering

In [1041]:
# First create a copy of the train and test datasets on which to carry out the feature engineering processes

train_data_transformed = train_data.copy()
test_data_transformed = test_data.copy()

### Feature scaling

In [1042]:
# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Define the columns to scale
columns_to_scale = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Use MinMaxScaler to scale the numerical columns on the train dataset
train_data_transformed[columns_to_scale] = scaler.fit_transform(train_data_transformed[columns_to_scale])
train_data_transformed = pd.DataFrame(train_data_transformed, columns=train_data.columns)

# View the scaled data
train_data_transformed[columns_to_scale].head()

In [1043]:
# Scale the numerical columns of the test dataset with the scaler fitted on the train dataset
# (transform only: refitting on the test data would leak test statistics into preprocessing)
test_data_transformed[columns_to_scale] = scaler.transform(test_data_transformed[columns_to_scale])
test_data_transformed = pd.DataFrame(test_data_transformed, columns=test_data.columns)

# View the scaled data
test_data_transformed[columns_to_scale].head()
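
A scaler is normally fitted on the train data only and then reused, unchanged, on any other data. The following minimal sketch on made-up numbers shows what that looks like with `MinMaxScaler`; the names `demo_scaler`, `train_values` and `test_values` are illustrative and not part of this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data for illustration only
train_values = np.array([[0.0], [50.0], [100.0]])
test_values = np.array([[25.0], [120.0]])

demo_scaler = MinMaxScaler()

# Learn the minimum and maximum from the train data and scale it
train_scaled = demo_scaler.fit_transform(train_values)

# Reuse the train minimum and maximum on the test data
test_scaled = demo_scaler.transform(test_values)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values outside the train range end up outside [0, 1] (here 120 maps to 1.2), which is expected when the scaler is not refitted.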

### Feature encoding

The values of the Churn column in the train dataset were changed from 'Yes' and 'No' to 1 and 0 respectively to aid the calculations used to answer the analytical questions, and the numerical columns of both the train and test datasets have been scaled. It is therefore important to relist the numerical and categorical columns of both datasets.

In [1044]:
# Obtain the categorical and numerical columns of the train dataset
train_cat = train_data.select_dtypes(include=['object']).columns
train_num = train_data.select_dtypes(include=['float64', 'int64']).columns

# Obtain the categorical and numerical columns of the test dataset
test_cat = test_data.select_dtypes(include=['object']).columns
test_num = test_data.select_dtypes(include=['float64', 'int64']).columns

In [1045]:
# Create an encoder object using OneHotEncoder
# Note: scikit-learn 1.2 and later rename the `sparse` argument to `sparse_output`
encoder = OneHotEncoder(sparse=False, drop="first")

# Create separate DataFrames for categorical columns and numerical columns for the train dataset
train_cat_df = train_data_transformed[train_cat]
train_num_df = train_data_transformed[train_num]


# Use OneHotEncoder to encode the categorical columns on the train dataset
encoder.fit(train_cat_df)
train_encoded = encoder.transform(train_cat_df).tolist()
train_encoded_data = pd.DataFrame(train_encoded, columns=encoder.get_feature_names_out())
train_encoded_data.head()

In [1046]:
# Concatenate train_encoded_data and train_num_df to form the final train dataset

train = pd.concat([train_encoded_data, train_num_df.set_axis(train_encoded_data.index)], axis=1)
train.head()

In [1047]:
# Create separate DataFrames for categorical columns and numerical columns for the test dataset
test_cat_df = test_data_transformed[test_cat]
test_num_df = test_data_transformed[test_num]


# Encode the categorical columns of the test dataset with the encoder fitted on the train dataset
# (transform only; this assumes test_cat lists the same columns as train_cat)
test_encoded = encoder.transform(test_cat_df).tolist()
test_encoded_data = pd.DataFrame(test_encoded, columns=encoder.get_feature_names_out())
test_encoded_data.head()

In [1048]:
# Concatenate test_encoded_data and test_num_df to form the final test dataset

test = pd.concat([test_encoded_data, test_num_df.set_axis(test_encoded_data.index)], axis=1)
test.head()

# Splitting the train dataset

In [1049]:
X = train.drop('Churn', axis=1)
y = train['Churn']

# Split train dataset into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Print the shapes of the train and validation sets
print("Train set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
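
The `stratify=y` argument keeps the share of churned customers the same in the train and validation sets. A minimal sketch on synthetic labels (20% positives, chosen arbitrarily for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 rows, 20 of them labelled 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each split
print("Train positive share:", y_tr.mean())
print("Validation positive share:", y_va.mean())
```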

# Balancing the dataset

In [1050]:
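# Perform oversampling using SMOTE
# (stored under separate names so the result is not overwritten by the undersampling below)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Perform undersampling using RandomUnderSampler
# The undersampled data is the balanced dataset used for model training below
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)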

# Print the class distribution before and after balancing
print("Before balancing:")
print(y_train.value_counts())

print("After balancing:")
print(pd.Series(y_train_resampled).value_counts())
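
Conceptually, random undersampling keeps every row of the minority class and draws an equally sized random sample from the majority class. The sketch below reproduces that idea with pandas only, on made-up data; it is an illustration of the principle and not the imbalanced-learn implementation.

```python
import pandas as pd

# Toy data for illustration only: 3 churned and 7 non-churned customers
toy = pd.DataFrame({
    'feature': range(10),
    'Churn':   [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Size of the smallest class
minority_size = toy['Churn'].value_counts().min()

# Sample that many rows from every class and stack the samples
balanced = pd.concat(
    [group.sample(n=minority_size, random_state=42) for _, group in toy.groupby('Churn')]
)

print(balanced['Churn'].value_counts())
```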

# Model Training and Evaluation

In [1051]:
# Create a list of models to train and evaluate
models = [
    ('Logistic Regression', LogisticRegression(random_state=42, solver='liblinear', max_iter=1000)),
    ('Decision Tree', DecisionTreeClassifier(random_state=42, criterion='gini', min_samples_leaf=8, max_depth=5)),
    ('Random Forest', RandomForestClassifier(random_state=42, n_estimators=100, max_depth=5)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42, n_estimators=100, learning_rate=0.1)),
    ('Support Vector Machine', SVC(random_state=42, kernel='rbf', C=1.0)),
    ('Gaussian Naive Bayes', GaussianNB()),
    ('K-Nearest Neighbors', KNeighborsClassifier(n_neighbors=5)),
]

### Model training and evaluation with unbalanced dataset

In [1052]:
# Create an empty DataFrame to store the performance metrics of unbalanced dataset
unbal_performance_metrics = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC_AUC'])

# Model training, evaluation and result calculation
for model_name, model in models:
    # Model training with unbalanced dataset
    model.fit(X_train, y_train)
    
    # Using models to make predictions on validation set
    y_pred = model.predict(X_val)
    
    # Calculate performance metrics of unbalanced dataset
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred)
    roc_auc = roc_auc_score(y_val, y_pred)
    
    # Store the calculation results in the unbalanced performance metrics DataFrame
    # (values follow the column order defined when the DataFrame was created)
    unbal_performance_metrics.loc[len(unbal_performance_metrics)] = [
        model_name, accuracy, precision, recall, f1, roc_auc
    ]

# Print the performance metrics DataFrame
print('The performance metrics of the unbalanced dataset')
unbal_performance_metrics

Based on the F1 scores of the models, Gaussian Naive Bayes is the best model for the unbalanced dataset, with an F1 score of 0.617647.

### Model training and evaluation with balanced dataset

In [1053]:
# Create an empty DataFrame to store the performance metrics of balanced dataset
bal_performance_metrics = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC_AUC'])

# Model training, evaluation and result calculation
for model_name, model in models:
    # Model training with balanced dataset
    model.fit(X_train_resampled, y_train_resampled)
    
    # Using models to make predictions on validation set
    y_pred = model.predict(X_val)
    
    # Calculate performance metrics of balanced dataset
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred)
    roc_auc = roc_auc_score(y_val, y_pred)
    
    # Store the calculation results in the balanced performance metrics DataFrame
    # (values follow the column order defined when the DataFrame was created)
    bal_performance_metrics.loc[len(bal_performance_metrics)] = [
        model_name, accuracy, precision, recall, f1, roc_auc
    ]

# Print the performance metrics DataFrame
print('The performance metrics of the balanced dataset')
bal_performance_metrics

Based on the F1 scores of the models, Random Forest is the best model for the balanced dataset, with an F1 score of 0.630339.
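
A note on the ROC_AUC column in both tables: it is computed from the hard 0/1 predictions, which evaluates each model at a single threshold. `roc_auc_score` also accepts probability scores, which evaluates the ranking over all thresholds. The sketch below compares the two on synthetic data with a single logistic regression; the data and the names `clf`, `auc_from_labels` and `auc_from_scores` are illustrative, and the numbers are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration only
X_demo, y_demo = make_classification(n_samples=500, weights=[0.75, 0.25], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC from hard class labels (one operating point)
auc_from_labels = roc_auc_score(y_va, clf.predict(X_va))

# ROC AUC from predicted probabilities of the positive class (all thresholds)
auc_from_scores = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

print("ROC AUC from labels:", round(auc_from_labels, 3))
print("ROC AUC from scores:", round(auc_from_scores, 3))
```

Not every estimator exposes `predict_proba` by default (for example, `SVC` needs `probability=True` or its `decision_function`), so adopting this in the training loops would need a per-model check.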

# Confusion Matrix

# Hyper-parameter tuning