# Telco Customer Churn Prediction

# 1.0 Introduction

### 1.1 Business Understanding / Project Objective

Per [Paddle](https://www.paddle.com/resources/customer-attrition#:~:text=Customer%20attrition%20is%20defined%20as,of%20business%20health%20over%20time.), customer churn may be defined as the loss of customers by a business. Despite being a normal part of the customer cycle, it is viewed as a key indicator of business health over time and must be managed to ensure some stability in the business' survival, (retention) strategy development, and/or growth. 

It is also known as customer attrition or customer turnover, and is calculated as the percentage of customers that stopped using a company's product or service within a specified timeframe. To better manage customer churn, companies should be able to predict it with reasonable accuracy, and that is where machine learning comes in.

This project is focused on Vodafone - a telecommunications company - and  aims to predict the likelihood that a customer will churn by identifying and modelling based on the key indicators of churn. Possible strategies that may be explored and implemented to improve retention (or reduce churn) may be recommended in this project.

### 1.2 Data Understanding

The dataset contains demographic information about customers, the services that they use, the related charges, and their churn status. The columns in the dataset are described below:

- **Gender**: Whether the customer is a male or a female

- **SeniorCitizen**: Whether a customer is a senior citizen or not

- **Partner**: Whether the customer has a partner or not. Expressed as (Yes, No)

- **Dependents**: Whether the customer has dependents or not. Expressed as (Yes, No)

- **Tenure**: Number of months the customer has stayed with the company.

- **Phone Service**: Whether the customer has a phone service or not. Expressed as (Yes, No)

- **MultipleLines**: Whether the customer has multiple lines or not.

- **InternetService**: Customer's internet service provider. Categorized as (DSL, Fiber Optic, No)

- **OnlineSecurity**: Whether the customer has online security or not. Expressed as (Yes, No, No Internet)

- **OnlineBackup**: Whether the customer has online backup or not. Expressed as (Yes, No, No Internet)

- **DeviceProtection**: Whether the customer has device protection or not. Expressed as (Yes, No, No internet service)

- **TechSupport**: Whether the customer has tech support or not. Expressed as (Yes, No, No internet)

- **StreamingTV**: Whether the customer has streaming TV or not. Expressed as (Yes, No, No internet service)

- **StreamingMovies**: Whether the customer has streaming movies or not. Expressed as (Yes, No, No Internet service)

- **Contract**: The contract term of the customer. Categorized as (Month-to-Month, One year, Two year)

- **PaperlessBilling**: Whether the customer has paperless billing or not. Expressed as (Yes, No)

- **Payment Method**: The customer's payment method. Categorized as (Electronic check, mailed check, Bank transfer(automatic), Credit card(automatic)).

- **MonthlyCharges**: The amount charged to the customer monthly.

- **TotalCharges**: The total amount charged to the customer.

- **Churn**: Whether the customer churned or not. Expressed as (Yes or No).

## 2.0 Hypotheses and Questions

1. Customers with partners & dependents churn less
2. What is the distribution of customers by senior citizenship and how do they churn?
3. Female non-senior citizens churn more than female senior citizens
4. In terms of tenure, which range of users have churned least?
5. At what tenure levels do we lose most customers?
6. Customers who exceed the average tenure are less likely to churn
7. Users who don't use phone service churn more than phone service users
8. Does the use of multiple lines lead to reduced churn?
9. DSL users churn more than fibre-optic users
10. What is the demographic distribution of the service lines with the highest churn proportion? (Demographics: gender, senior citizen, partner, dependent, tenure)
11. Customers with tech support churn less
12. Users who stream both TV & movies churn less than those who stream only one
13. Month-to-month users who stream only one service churn more than other user classes
14. Users with paperless billing & automated payment methods churn less than those with manual payments
15. Customers who stream with fiber optic churn less than DSL users

# Importing Libraries 

In [10]:
!pip install sweetviz







[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [11]:
pip install plotly

Collecting plotly
  Downloading plotly-5.13.1-py2.py3-none-any.whl (15.2 MB)
     ---------------------------------------- 15.2/15.2 MB 1.3 MB/s eta 0:00:00
Collecting tenacity>=6.2.0
  Downloading tenacity-8.2.2-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.13.1 tenacity-8.2.2
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [12]:
pip install imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
     -------------------------------------- 226.0/226.0 kB 1.3 MB/s eta 0:00:00
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.10.1
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [19]:
pip install catboost

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement catboost (from versions: 0.1.1)
ERROR: No matching distribution found for catboost

[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [21]:
pip install pip --upgrade

Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 1.9 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.3.1
    Uninstalling pip-22.3.1:
      Successfully uninstalled pip-22.3.1
Successfully installed pip-23.0.1
Note: you may need to restart the kernel to use updated packages.


In [25]:
pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-3.3.5-py3-none-win_amd64.whl (1.0 MB)
     ---------------------------------------- 0.0/1.0 MB ? eta -:--:--
     - -------------------------------------- 0.0/1.0 MB 991.0 kB/s eta 0:00:01
     ---- ----------------------------------- 0.1/1.0 MB 1.6 MB/s eta 0:00:01
     ---- ----------------------------------- 0.1/1.0 MB 1.0 MB/s eta 0:00:01
     ------- -------------------------------- 0.2/1.0 MB 1.2 MB/s eta 0:00:01
     -------- ------------------------------- 0.2/1.0 MB 1.1 MB/s eta 0:00:01
     --------- ------------------------------ 0.3/1.0 MB 1.1 MB/s eta 0:00:01
     ------------- -------------------------- 0.3/1.0 MB 1.2 MB/s eta 0:00:01
     ----------------- ---------------------- 0.4/1.0 MB 1.3 MB/s eta 0:00:01
     ------------------- -------------------- 0.5/1.0 MB 1.3 MB/s eta 0:00:01
     --------------------- ------------------ 0.6/1.0 MB 1.3 MB/s eta 0:00:01
     ----------------------- ---------------- 0.6/1.0 MB 1.3 MB/s 

In [22]:
pip install catboost

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement catboost (from versions: 0.1.1)
ERROR: No matching distribution found for catboost


In [27]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.7.4-py3-none-win_amd64.whl (89.1 MB)
     ---------------------------------------- 0.0/89.1 MB ? eta -:--:--
     ---------------------------------------- 0.0/89.1 MB 1.4 MB/s eta 0:01:06
     --------------------------------------- 0.1/89.1 MB 787.7 kB/s eta 0:01:54
     ---------------------------------------- 0.1/89.1 MB 1.0 MB/s eta 0:01:27
     ---------------------------------------- 0.2/89.1 MB 1.2 MB/s eta 0:01:17
     ---------------------------------------- 0.3/89.1 MB 1.4 MB/s eta 0:01:03
     ---------------------------------------- 0.3/89.1 MB 1.4 MB/s eta 0:01:06
     ---------------------------------------- 0.3/89.1 MB 1.2 MB/s eta 0:01:12
     ---------------------------------------- 0.5/89.1 MB 1.4 MB/s eta 0:01:05
     ---------------------------------------- 0.5/89.1 MB 1.4 MB/s eta 0:01:02
     ---------------------------------------- 0.5/89.1 MB 1.2 MB/s eta 0:01:12
     ---------------------------------------- 0.6/89.1 MB

In [28]:
# Data manipulation

import numpy as np
import pandas as pd 
import sweetviz as viz

#Visualization 
import matplotlib.pyplot as plt 
import plotly.express as px 
import seaborn as sns 

#Feature Engineering 
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectKBest, chi2  # Univariate Selection using KBest
from sklearn.model_selection import *  #cross_val_score, fbeta_score, KFold, make_scorer, train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
#from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb
from xgboost import *

# Model evaluation

from sklearn import metrics
from sklearn.metrics import *



In [29]:
# removing restriction on columns 
pd.set_option('display.max_columns',None)


# 4.0 Data Manipulation

In [30]:
dataset = pd.read_csv('Telco-Customer-Churn.csv')

dataset

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [31]:
# Looking at information about the columns
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [32]:
# checking for dupilicates 


dataset[dataset.duplicated()]


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn


From the dataset preview and the info above, we note the following:
- There are no missing values in any of the columns
- There are no duplicates in the dataset
- Senior Citizenship status is expressed as 0 or 1. It would be ideal to convert to "Yes" or "No" for initial data exploration.
- Despite seeming to have numeric values, the datatype of the TotalCharges column is "object". It will therefore have to be converted to numeric

In [33]:
# Checking the minimum and maximum values in the tenure column
dataset['tenure'].min(),dataset['tenure'].max()

(0, 72)

In [35]:
# What is the total charges for customers with 0 tenure?
dataset[dataset['tenure']==0]


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,No,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,3115-CZMZD,Male,0,No,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,Yes,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,4367-NUYAO,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,1371-DWPAZ,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,Yes,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,7644-OMVMY,Male,0,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,3213-VVOLG,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,2520-SGTTA,Female,0,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,2923-ARZLG,Male,0,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,4075-WKNIU,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,Yes,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


From here, we note that there are 11 customers with 0 tenure and "missing" values for total charges. The values are not present because they do not exist as those customers have not been with Vodafone long enough to incur any actual charges. The values will have to be dropped since they cannot be used in the analysis and predictions.

In [40]:
# Performing initial cleaning on the dataset

dataset['TotalCharges'] = dataset['TotalCharges'].replace(" ",np.nan)

dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges']) # changing the datatype of the column to float

dataset['SeniorCitizen'] = np.where(dataset['SeniorCitizen']==0,"No","Yes") # dropping the null values in the dataset

dataset.dropna(inplace = True)

#dataset.drop(columns=['customerID'],inplace = True)

dataset.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 7032 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   object 
 1   SeniorCitizen     7032 non-null   object 
 2   Partner           7032 non-null   object 
 3   Dependents        7032 non-null   object 
 4   tenure            7032 non-null   int64  
 5   PhoneService      7032 non-null   object 
 6   MultipleLines     7032 non-null   object 
 7   InternetService   7032 non-null   object 
 8   OnlineSecurity    7032 non-null   object 
 9   OnlineBackup      7032 non-null   object 
 10  DeviceProtection  7032 non-null   object 
 11  TechSupport       7032 non-null   object 
 12  StreamingTV       7032 non-null   object 
 13  StreamingMovies   7032 non-null   object 
 14  Contract          7032 non-null   object 
 15  PaperlessBilling  7032 non-null   object 
 16  PaymentMethod     7032 non-null   object 


*With the columns with no total charges dropped, the dataset now has 7032 non-null distinct entries. We can therefore proceed with further exploration and analysis.*

In [42]:
# Looking at the unique values in the columns

for column in dataset.columns:
    if dataset[column].dtype =="O":
          print(f"Distribution of value counts in the {column} column", "\n", dataset[column].value_counts(), "\n")



Distribution of value counts in the gender column 
 Male      3549
Female    3483
Name: gender, dtype: int64 

Distribution of value counts in the SeniorCitizen column 
 Yes    7032
Name: SeniorCitizen, dtype: int64 

Distribution of value counts in the Partner column 
 No     3639
Yes    3393
Name: Partner, dtype: int64 

Distribution of value counts in the Dependents column 
 No     4933
Yes    2099
Name: Dependents, dtype: int64 

Distribution of value counts in the PhoneService column 
 Yes    6352
No      680
Name: PhoneService, dtype: int64 

Distribution of value counts in the MultipleLines column 
 No                  3385
Yes                 2967
No phone service     680
Name: MultipleLines, dtype: int64 

Distribution of value counts in the InternetService column 
 Fiber optic    3096
DSL            2416
No             1520
Name: InternetService, dtype: int64 

Distribution of value counts in the OnlineSecurity column 
 No                     3497
Yes                    2015
