# Predicting Customer Churn Rate in a Telecommunications Company

'Churn' is the number or percentage of customers who stop using a company's services during a given period. It is a key performance indicator, especially for subscription services, and is often used synonymously with "customer loss". A high churn rate can indicate problems with customer satisfaction, product quality, or pricing, and is a major concern because it directly impacts revenue and growth.
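As a small worked illustration of the definition above, one common convention computes the churn rate for a period as the number of customers lost divided by the number of customers at the start of that period. The function name `churn_rate` and the figures below are made up for the example and are not taken from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of customers lost during a period.

    Hypothetical helper: assumes churn rate = lost / customers at period start.
    """
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Illustrative numbers only: 1,000 customers at the start, 50 cancelled.
rate = churn_rate(customers_at_start=1000, customers_lost=50)
print(f"Churn rate: {rate:.1%}")
```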

Project Overview: A telecommunications company wants to understand why customers are leaving its service and predict which customers are most likely to cancel their subscriptions next month.

## Data preparation

In [30]:
# load dataset
import pandas as pd
churn_data = pd.read_csv('./data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
churn_data.sample(6)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
2419,2450-ZKEED,Female,0,No,No,11,Yes,No,DSL,No,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),53.8,651.55,No
2520,5788-YPOEG,Female,0,Yes,Yes,34,Yes,No,DSL,Yes,Yes,Yes,Yes,Yes,Yes,Two year,No,Mailed check,84.75,2839.45,No
4624,0325-XBFAC,Male,0,No,No,8,Yes,No,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,94.7,740.3,Yes
3080,1777-JYQPJ,Male,0,No,No,2,No,No phone service,DSL,No,No,No,No,No,No,Month-to-month,No,Mailed check,24.3,38.45,No
429,6152-ONASV,Female,0,Yes,No,68,Yes,Yes,DSL,Yes,No,No,Yes,No,No,One year,No,Bank transfer (automatic),58.25,3975.7,No
6814,2267-WTPYD,Female,1,Yes,No,57,Yes,Yes,Fiber optic,No,Yes,Yes,No,No,Yes,Month-to-month,No,Credit card (automatic),94.0,5438.95,No


In [31]:
churn_data.shape

(7043, 21)

In [33]:
churn_data.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

- customerID – Unique identifier for each customer.
- gender – Customer's gender (Male/Female).
- SeniorCitizen – Indicates whether the customer is a senior citizen (1 = Yes, 0 = No).
- Partner – Indicates if the customer has a partner (Yes/No).
- Dependents – Indicates if the customer has dependents (Yes/No).
- tenure – Number of months the customer has been with the company.
- PhoneService – Whether the customer subscribes to phone service (Yes/No).
- MultipleLines – Indicates if the customer has multiple phone lines (Yes/No/No phone service).
- InternetService – Type of internet service subscribed (DSL, Fiber optic, No).
- OnlineSecurity – Indicates whether the customer has online security add-on (Yes/No/No internet service).
- OnlineBackup – Indicates if the customer uses online backup service (Yes/No/No internet service).
- DeviceProtection – Indicates if the customer has device protection (Yes/No/No internet service).
- TechSupport – Indicates if the customer has tech support subscription (Yes/No/No internet service).
- StreamingTV – Indicates if the customer streams TV (Yes/No/No internet service).
- StreamingMovies – Indicates if the customer streams movies (Yes/No/No internet service).
- Contract – Type of contract (Month-to-month, One year, Two year).
- PaperlessBilling – Whether the customer uses paperless billing (Yes/No).
- PaymentMethod – Payment method used by the customer (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
- MonthlyCharges – The amount charged to the customer monthly.
- TotalCharges – Total amount charged to the customer over the tenure.
- Churn – Indicates whether the customer has canceled the service (Yes/No).

In [34]:
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


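The `TotalCharges` column was loaded as `object` (strings) instead of as a float, which suggests that some of its entries are not valid numbers.

Let's fix it by converting the column with `pd.to_numeric`. Passing `errors='coerce'` replaces any value that cannot be parsed with `NaN` instead of raising an error.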

In [37]:
churn_data['TotalCharges'] = pd.to_numeric(churn_data['TotalCharges'], errors='coerce')
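Before relying on the coerced column, it can help to see which rows failed to parse. The sketch below uses a made-up toy frame rather than the real dataset; the blank string is an assumed example of an unparseable entry, not a confirmed description of the source file.

```python
import pandas as pd

# Hypothetical toy data; the blank string stands in for an unparseable entry.
toy = pd.DataFrame({
    "customerID": ["A", "B", "C"],
    "tenure": [0, 12, 5],
    "TotalCharges": [" ", "651.55", "100.0"],
})

# Unparseable strings become NaN instead of raising an error.
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Rows that were non-null before the conversion but are NaN afterwards failed to parse.
failed = toy[converted.isna() & toy["TotalCharges"].notna()]
print(failed)
```

Running the same mask against `churn_data` before overwriting the column would list the raw values that ended up as `NaN`.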

In [40]:
churn_data.describe()

| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7032.0 |
| mean | 0.162147 | 32.371149 | 64.761692 | 2283.300441 |
| std | 0.368612 | 24.559481 | 30.090047 | 2266.771362 |
| min | 0.0 | 0.0 | 18.25 | 18.8 |
| 25% | 0.0 | 9.0 | 35.5 | 401.45 |
| 50% | 0.0 | 29.0 | 70.35 | 1397.475 |
| 75% | 0.0 | 55.0 | 89.85 | 3794.7375 |
| max | 1.0 | 72.0 | 118.75 | 8684.8 |


In [41]:
churn_data.isna().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

`TotalCharges` has 11 missing values after the conversion. That is 11 of 7,043 rows (about 0.16%), so dropping these rows should have a negligible effect on the analysis.
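A minimal sketch of that clean-up step, shown on a made-up toy frame rather than on `churn_data` itself: it measures the share of missing values first and then drops only the rows where `TotalCharges` is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing TotalCharges value.
toy = pd.DataFrame({
    "customerID": ["A", "B", "C", "D"],
    "TotalCharges": [np.nan, 651.55, 2839.45, 740.3],
})

# Share of rows with a missing TotalCharges value.
missing_share = toy["TotalCharges"].isna().mean()

# Drop only rows where TotalCharges is missing, then renumber the index.
cleaned = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)

print(f"{missing_share:.2%} of rows dropped, {len(cleaned)} remain")
```

Restricting `dropna` to `subset=["TotalCharges"]` keeps the intent explicit, so rows are not removed because of missing values in some other column.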