# IART Project - Supervised Learning - Sample Telco Customer Churn Dataset
 
 Developed by:  
 Carlos Gomes – up201906622  
 Domingos Santos – up201906680  
 Filipe Pinto – up201907747  

 

## Introduction

Retaining customers is essential for a company to maintain or even increase its profit, so being able to predict customer behaviour can be very useful.

Given a dataset with information about telco customers, the main goal of this project is to predict whether a customer will churn, that is, stop buying the company's products or services.

## Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

## Data Pre-processing

Column Description:
- customerID: A unique ID that identifies each customer.
- gender: The customer’s gender: Male (1), Female (0).
- SeniorCitizen: Indicates if the customer is 65 or older: No (0), Yes (1).
- Partner: Service contract is resold by the partner: No (0), Yes (1).
- Dependents: Indicates if the customer lives with any dependents: No (0), Yes (1).
- Tenure: Indicates the total number of months that the customer has been with the company.
- PhoneService: Indicates if the customer subscribes to home phone service with the company: No (0), Yes (1).
- MultipleLines: Indicates if the customer subscribes to multiple telephone lines with the company: No (0), Yes (1).
- InternetService: Indicates if the customer subscribes to Internet service with the company: No (0), DSL (1), Fiber optic (2).
- OnlineSecurity: Indicates if the customer subscribes to an additional online security service provided by the company: No (0), Yes (1), NA (2).
- OnlineBackup: Indicates if the customer subscribes to an additional online backup service provided by the company: No (0), Yes (1), NA (2).
- DeviceProtection: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: No (0), Yes (1), NA (2).
- TechSupport: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: No (0), Yes (1), NA (2).
- StreamingTV: Indicates if the customer uses their Internet service to stream television programming from a third party provider: No (0), Yes (1), NA (2). The company does not charge an additional fee for this service.
- StreamingMovies: Indicates if the customer uses their Internet service to stream movies from a third party provider: No (0), Yes (1), NA (2). The company does not charge an additional fee for this service.
- Contract: Indicates the customer’s current contract type: Month-to-Month (0), One Year (1), Two Year (2).
- PaperlessBilling: Indicates if the customer has chosen paperless billing: No (0), Yes (1).
- PaymentMethod: Indicates how the customer pays their bill: Bank transfer - automatic (0), Credit card - automatic (1), Electronic cheque (2), Mailed cheque (3).
- MonthlyCharges: Indicates the customer’s current total monthly charge for all their services from the company.
- TotalCharges: Indicates the customer’s total charges.
- Churn: Indicates whether the customer churned: No (0), Yes (1).
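
Since every categorical column is stored as an integer code, it can help to translate codes back into labels when reading tables or plots. The sketch below does this for two columns, using the mappings listed above. `decode_codes` is a hypothetical helper (not part of the original notebook) and the example rows are made up.

```python
import pandas as pd

# Code-to-label mappings copied from the column description above.
CONTRACT_LABELS = {0: "Month-to-Month", 1: "One Year", 2: "Two Year"}
PAYMENT_LABELS = {
    0: "Bank transfer - automatic",
    1: "Credit card - automatic",
    2: "Electronic cheque",
    3: "Mailed cheque",
}


def decode_codes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Contract and PaymentMethod codes replaced by labels."""
    decoded = frame.copy()
    decoded["Contract"] = decoded["Contract"].map(CONTRACT_LABELS)
    decoded["PaymentMethod"] = decoded["PaymentMethod"].map(PAYMENT_LABELS)
    return decoded


# Illustrative rows only (made up, not taken from the dataset).
example = pd.DataFrame({"Contract": [0, 2], "PaymentMethod": [3, 1]})
decoded_example = decode_codes(example)
print(decoded_example)
```

Any code missing from a mapping becomes `NaN` after `map`, which makes unexpected values easy to spot.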



We first need to load the data and see some information such as statistics and possible missing values.

In [15]:
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_R2_Test.csv')
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1122-JWTJW,1,0,1,1,1,1,0,2,0,...,0,0,0,0,0,1,3,70.65,70.65,1
1,9710-NJERN,0,0,0,0,39,1,0,0,2,...,2,2,2,2,2,0,3,20.15,826.0,0
2,9837-FWLCH,1,0,1,1,12,1,0,0,2,...,2,2,2,2,0,1,2,19.2,239.0,0
3,1699-HPSBG,1,0,0,0,12,1,0,1,0,...,0,1,1,0,1,1,2,59.8,727.8,1
4,7203-OYKCT,1,0,0,0,72,1,1,2,0,...,1,0,1,1,1,1,2,104.95,7544.3,0


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        21 non-null     object 
 1   gender            21 non-null     int64  
 2   SeniorCitizen     21 non-null     int64  
 3   Partner           21 non-null     int64  
 4   Dependents        21 non-null     int64  
 5   Tenure            21 non-null     int64  
 6   PhoneService      21 non-null     int64  
 7   MultipleLines     21 non-null     int64  
 8   InternetService   21 non-null     int64  
 9   OnlineSecurity    21 non-null     int64  
 10  OnlineBackup      21 non-null     int64  
 11  DeviceProtection  21 non-null     int64  
 12  TechSupport       21 non-null     int64  
 13  StreamingTV       21 non-null     int64  
 14  StreamingMovies   21 non-null     int64  
 15  Contract          21 non-null     int64  
 16  PaperlessBilling  21 non-null     int64  
 17 

In [13]:
data.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
Tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

As we can see, there are no missing values in our dataset.
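
Note that `isna()` only counts values that pandas stores as `NaN`, so entries coded as NA (2) in the six add-on service columns do not show up in the counts above. A minimal sketch for counting them, assuming the coding from the column description; `count_na_codes` is a hypothetical helper and the toy frame is made up:

```python
import pandas as pd

# Columns in which the code 2 means "NA", per the column description.
NA_CODED_COLUMNS = [
    "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies",
]


def count_na_codes(frame: pd.DataFrame, na_code: int = 2) -> pd.Series:
    """Count, per NA-coded column, how many entries equal the NA code."""
    present = [col for col in NA_CODED_COLUMNS if col in frame.columns]
    return (frame[present] == na_code).sum()


# Made-up rows; Contract also contains 2 but is deliberately not counted,
# because there 2 means "Two Year" rather than NA.
toy = pd.DataFrame({
    "OnlineSecurity": [0, 2, 2],
    "TechSupport": [1, 2, 0],
    "Contract": [2, 2, 2],
})
na_code_counts = count_na_codes(toy)
print(na_code_counts)
```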

In [14]:
data.describe()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
count,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0,7011.0
mean,0.504636,0.162316,0.482955,0.298959,32.426615,0.903295,0.422051,1.224076,0.719298,0.77735,0.776351,0.722579,0.816574,0.820996,0.68906,0.592212,1.573242,64.798645,2283.620126,0.265868
std,0.500014,0.368767,0.499745,0.457834,24.542847,0.295577,0.493922,0.778727,0.796531,0.77822,0.778575,0.795621,0.763104,0.761254,0.83317,0.491458,1.067423,30.09403,2266.680399,0.441826
min,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.25,18.8,0.0
25%,0.0,0.0,0.0,0.0,9.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,35.55,401.4,0.0
50%,1.0,0.0,0.0,0.0,29.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,2.0,70.35,1397.3,0.0
75%,1.0,0.0,1.0,1.0,55.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,89.9,3798.375,1.0
max,1.0,1.0,1.0,1.0,72.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,118.75,8684.8,1.0
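
Because Churn is coded 0/1, its mean in the summary above (about 0.266) is the fraction of customers who churned, so the two classes are imbalanced. A small sketch of checking the class shares directly; `churn_distribution` is a hypothetical helper and the toy series is made up:

```python
import pandas as pd


def churn_distribution(churn: pd.Series) -> pd.Series:
    """Return the share of each Churn class (0 = stayed, 1 = churned)."""
    return churn.value_counts(normalize=True).sort_index()


# Made-up labels: three customers stayed, one churned.
toy_churn = pd.Series([0, 0, 0, 1], name="Churn")
toy_shares = churn_distribution(toy_churn)
print(toy_shares)
```

On the real data this would be called as `churn_distribution(data['Churn'])`.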


In [4]:
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_R2_Test.csv', na_values=[2])

In [5]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1122-JWTJW,1,0,1,1,1.0,1,0,,0.0,...,0.0,0.0,0.0,0.0,0.0,1,3.0,70.65,70.65,1
1,9710-NJERN,0,0,0,0,39.0,1,0,0.0,,...,,,,,,0,3.0,20.15,826.0,0
2,9837-FWLCH,1,0,1,1,12.0,1,0,0.0,,...,,,,,0.0,1,,19.2,239.0,0
3,1699-HPSBG,1,0,0,0,12.0,1,0,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,1,,59.8,727.8,1
4,7203-OYKCT,1,0,0,0,72.0,1,1,,0.0,...,1.0,0.0,1.0,1.0,1.0,1,,104.95,7544.3,0
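
Passing `na_values=[2]` as a list marks every 2 as missing in all columns. This is why InternetService, Contract and PaymentMethod show `NaN` in the preview above, even though 2 is a legitimate category there. `read_csv` also accepts a dict for `na_values`, which limits the replacement to chosen columns. The sketch below assumes only the six add-on service columns use 2 as NA, as the column description states. `load_with_column_na` is a hypothetical helper and the CSV text is made up.

```python
import io

import pandas as pd

# Columns in which the code 2 means "NA", per the column description.
NA_CODED_COLUMNS = [
    "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies",
]


def load_with_column_na(source) -> pd.DataFrame:
    """Read the CSV, treating 2 as missing only in the NA-coded columns."""
    return pd.read_csv(source, na_values={col: [2] for col in NA_CODED_COLUMNS})


# Tiny made-up sample standing in for the real file.
sample_csv = io.StringIO(
    "customerID,Tenure,InternetService,OnlineSecurity,OnlineBackup,"
    "DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract\n"
    "a,2,2,0,1,0,0,1,1,2\n"
    "b,39,0,2,2,2,2,2,2,2\n"
)
sample = load_with_column_na(sample_csv)
print(sample)
```

With this variant Tenure, InternetService and Contract keep their value 2, while the add-on service columns become `NaN` where they were coded 2.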


In [7]:
# dropna() is the pandas method (drop_na does not exist); 'Churn' is the target
# column, since the dataset has no 'class' column
sb.pairplot(data.dropna(), hue='Churn');