# Churn Analysis of a Telecom Company

# About Dataset

**Content:**<br>
The **Telco Customer Churn** dataset describes the customers of a telecommunications company and their interactions with it. Each row represents one customer, and each column holds one customer attribute:
1. **Customer ID**: A unique identifier for each customer.
2. **Gender**: The gender of the customer (e.g., Male, Female).
3. **Senior Citizen**: Whether the customer is a senior citizen (1 = yes, 0 = no).
4. **Partner**: Whether the customer has a partner (Yes/No).
5. **Dependents**: Whether the customer has dependents such as children or other relatives (Yes/No).
6. **Tenure**: The number of months the customer has stayed with the company.
7. **Phone Service**: Whether the customer has phone service provided by the company (Yes/No).
8. **Multiple Lines**: Whether the customer has multiple lines (e.g., Yes, No, No phone service).
9. **Internet Service**: Type of internet service subscribed (e.g., DSL, Fiber optic, No).
10. **Online Security**: Whether the customer has online security service (e.g., Yes, No, No internet service).
11. **Online Backup**: Whether the customer has online backup service (e.g., Yes, No, No internet service).
12. **Device Protection**: Whether the customer has device protection service (e.g., Yes, No, No internet service).
13. **Tech Support**: Whether the customer has tech support service (e.g., Yes, No, No internet service).
14. **Streaming TV**: Whether the customer has streaming TV service (e.g., Yes, No, No internet service).
15. **Streaming Movies**: Whether the customer has streaming movie service (e.g., Yes, No, No internet service).
16. **Contract**: The type of contract the customer has (e.g., Month-to-month, One year, Two year).
17. **Paperless Billing**: Whether the customer has opted for paperless billing (Yes/No).
18. **Payment Method**: The method of payment used by the customer (e.g., Electronic check, Credit card, Bank transfer, Mailed check).
19. **Monthly Charges**: The amount charged to the customer on a monthly basis.
20. **Total Charges**: The total amount charged to the customer over the entire tenure.
21. **Churn**: Whether the customer churned (cancelled the service) or not (Yes/No).

**Objective**
- Our target variable is **Churn**. The objective is to analyze which factors affect the company's churn rate.

# 1. Importing Libraries and Reading the Dataset

In [274]:
import pandas as pd # for data processing
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# 2. Inspecting the data

In [275]:
# Inspecting the first few rows 
df.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes


In [276]:
# Inspecting the last few rows
df.tail(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.6,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.4,306.6,Yes
7042,3186-AJIEK,Male,0,No,No,66,Yes,No,Fiber optic,Yes,...,Yes,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),105.65,6844.5,No


In [277]:
df.shape
#(rows,columns)

(7043, 21)

In [278]:
# Checking that customer IDs are unique: the number of distinct IDs should equal the number of rows
df['customerID'].nunique()

7043
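
The same uniqueness check can be phrased more directly. The sketch below is a minimal illustration on a toy frame, `sample`, built from the three customer IDs shown in the `df.head(3)` output. The names `sample`, `ids_are_unique` and `n_duplicates` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame with the three customer IDs shown by df.head(3); illustrative only.
sample = pd.DataFrame({"customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"]})

# True when no ID appears more than once.
ids_are_unique = sample["customerID"].is_unique

# Number of rows whose ID repeats an earlier row (0 when all IDs are unique).
n_duplicates = int(sample["customerID"].duplicated().sum())

print(ids_are_unique, n_duplicates)
```

On the full data set the same two expressions could be applied to `df['customerID']`. A non-zero duplicate count would indicate repeated customers.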

In [279]:
# Looking at information about each column: name, number of non-null values and data type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


The variable **TotalCharges** is numeric, but it is stored as type 'object' in our data set, so we need to convert it to float.

In [280]:
# Description of some numerical variables
df.describe().round(3)

|  | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162 | 32.371 | 64.762 |
| std | 0.369 | 24.559 | 30.09 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |


- About 16% of customers are senior citizens.
- Customers have stayed about 32 months on average, which is roughly 2.7 years.
- All values fall in plausible ranges. Note that the minimum tenure is 0 months.

# 3. Data Wrangling

### Changing data type

In [281]:
# Attempting to convert this variable to numeric:
# df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])
# This raises a ValueError: the column contains blank strings (a single space, ' ').
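
One way to locate the entries that block the conversion, without knowing in advance what they look like, is to let `pd.to_numeric` coerce unparseable values to `NaN` and then inspect the rows that became missing. The sketch below uses a toy series, `charges`, which stands in for the real column. The names `charges`, `as_number` and `bad_rows` are illustrative.

```python
import pandas as pd

# Toy column mimicking TotalCharges: numeric strings plus one blank (' ') entry.
charges = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
as_number = pd.to_numeric(charges, errors="coerce")

# Entries that were present in the original column but failed to parse.
bad_rows = charges[as_number.isna() & charges.notna()]

print(bad_rows)
```

Here `bad_rows` holds the offending raw values together with their index labels, so they can be examined before deciding how to handle them.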

In [282]:
# Rows where TotalCharges is a blank string
df.loc[df['TotalCharges'] == ' ', :]

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,3115-CZMZD,Male,0,No,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,4367-NUYAO,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,1371-DWPAZ,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,7644-OMVMY,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,3213-VVOLG,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,2520-SGTTA,Female,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,2923-ARZLG,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,4075-WKNIU,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,...,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


In [283]:
# Looking for a relation between tenure, monthly charges and total charges
df[['tenure', 'MonthlyCharges', 'TotalCharges']].head()

|  | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| 0 | 1 | 29.85 | 29.85 |
| 1 | 34 | 56.95 | 1889.5 |
| 2 | 2 | 53.85 | 108.15 |
| 3 | 45 | 42.3 | 1840.75 |
| 4 | 2 | 70.7 | 151.65 |


It seems that ***TotalCharges ≈ tenure * MonthlyCharges***, but the relation is not exact: there could be discounts or extra charges for late payments.
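
The approximation can be checked on the five rows shown above. The following sketch rebuilds those rows in a small frame, `rows`, and compares the product of tenure and monthly charges with the recorded total. The column names `estimate` and `rel_diff` are illustrative additions, not columns of the data set.

```python
import pandas as pd

# The five rows displayed above, copied by hand.
rows = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 151.65],
})

# Naive estimate of the total: months stayed times the current monthly charge.
rows["estimate"] = rows["tenure"] * rows["MonthlyCharges"]

# Relative gap between the recorded total and the estimate.
rows["rel_diff"] = (rows["TotalCharges"] - rows["estimate"]).abs() / rows["TotalCharges"]

print(rows.round(3))
```

For these five rows the estimate stays within about 7% of the recorded total, which supports treating the relation as approximate rather than exact. Five rows are only a spot check and say nothing definitive about the full data set.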

Every customer with a blank **TotalCharges** has a **Tenure** value of 0, meaning they have not yet stayed with the company for any length of time. Since total charges are approximately tenure times monthly charges, we can replace these blank values with 0. Tenure and total charges are also positively correlated.
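
Before filling in 0, it may be worth confirming that blank totals and zero tenure really coincide. The sketch below runs that check on a toy frame, `toy`, made of two blank rows and two ordinary rows copied from the outputs above. The names `toy`, `blank`, `blank_implies_zero_tenure` and `zero_tenure_implies_blank` are illustrative.

```python
import pandas as pd

# Toy frame: two rows with blank TotalCharges (index 488 and 753 above)
# and two ordinary rows (index 0 and 1); values copied from the outputs shown.
toy = pd.DataFrame(
    {
        "tenure": [0, 0, 1, 34],
        "MonthlyCharges": [52.55, 20.25, 29.85, 56.95],
        "TotalCharges": [" ", " ", "29.85", "1889.5"],
    },
    index=[488, 753, 0, 1],
)

# Mask of rows whose TotalCharges contains only whitespace.
blank = toy["TotalCharges"].str.strip() == ""

# Every blank row should have tenure 0 ...
blank_implies_zero_tenure = bool((toy.loc[blank, "tenure"] == 0).all())

# ... and every tenure-0 row should be blank.
zero_tenure_implies_blank = bool(blank[toy["tenure"] == 0].all())

print(blank_implies_zero_tenure, zero_tenure_implies_blank)
```

On the full data set the same two checks could be run against `df` before the replacement step. If either came back `False`, replacing blanks with 0 would need a second look.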

In [284]:
# Replacing blank strings with 0 (assigning the result back avoids in-place modification of a column selection)
df['TotalCharges'] = df['TotalCharges'].replace(' ', 0)

In [285]:
# Now we can convert it to float
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])
df['TotalCharges'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 7043 entries, 0 to 7042
Series name: TotalCharges
Non-Null Count  Dtype  
--------------  -----  
7043 non-null   float64
dtypes: float64(1)
memory usage: 55.1 KB


In [296]:
# Confirming the correlation
df[['tenure', 'MonthlyCharges','TotalCharges']].corr()

|  | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| tenure | 1.0 | 0.2479 | 0.826178 |
| MonthlyCharges | 0.2479 | 1.0 | 0.651174 |
| TotalCharges | 0.826178 | 0.651174 | 1.0 |
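
By default, `DataFrame.corr()` computes the Pearson correlation coefficient. As a rough illustration of what that number means, the sketch below computes the coefficient by hand for the five rows displayed earlier and compares it with the value pandas returns. The frame `rows` and the names `manual_r` and `pandas_r` are illustrative, and the result on five rows will differ from the full-data value in the table above.

```python
import numpy as np
import pandas as pd

# Five rows copied from the earlier tenure / TotalCharges output.
rows = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 151.65],
})

# Deviations of each variable from its mean.
x = rows["tenure"] - rows["tenure"].mean()
y = rows["TotalCharges"] - rows["TotalCharges"].mean()

# Pearson r: sum of products of deviations divided by the product of their norms.
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity as computed by pandas (Pearson is the default method).
pandas_r = rows["tenure"].corr(rows["TotalCharges"])

print(manual_r, pandas_r)
```

The two values agree up to floating-point error. A coefficient close to 1 means that the two variables rise together almost linearly.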
