# Machine Learning for Classification

## Churn Prediction

In this exercise we work for a telco and want to score each customer with a churn prediction (the probability that the customer will leave). The score can be used to send promotions to customers with a high churn probability.

In this case we use **Binary Classification**, because we want to assign each customer to one of two categories. The target y is either 0 (no churn) or 1 (churn).

The output of the model, g(Xi), is a number between 0 and 1: the churn probability.

To build the training data, we take last month's customers and assign 0 to those who stayed and 1 to those who left. We then look at data about each customer, such as the type of contract.
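As a minimal sketch of how such scores could be used, the snippet below picks the customers whose churn probability is at or above a cut-off. The customer ids, the probabilities and the 0.5 threshold are all made-up assumptions for illustration; in practice the probabilities would come from a trained model.

```python
import pandas as pd

# Hypothetical model outputs g(Xi): one churn probability per customer
scores = pd.DataFrame({
    'customerid': ['a-001', 'b-002', 'c-003', 'd-004'],
    'churn_probability': [0.08, 0.61, 0.35, 0.92],
})

# Assumed cut-off; the right value depends on the cost of a promotion
threshold = 0.5

# Customers at or above the threshold are candidates for a promotion
to_promote = scores[scores.churn_probability >= threshold]
promo_ids = to_promote.customerid.tolist()
print(promo_ids)
```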



## Data Preparation

- Download the data, read it with pandas
- Look at the data
- Make column names and values look uniform
- Check if all the columns read correctly
- Check if the churn variable needs any preparation

In [26]:
import pandas as pd
import numpy as np

# First we take a look at the columns
df = pd.read_csv('Telco-Customer-Churn.csv')
df.head(n=5)

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [27]:
# With many columns, we can transpose the DataFrame to see all of them without scrolling
df.head(n=5).T

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |


In [28]:
# Standardize the column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# We list the categorical columns
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

# And for each categorical column we standardize the contents of each row
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.head(n=5)

|  | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | ... | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-vhveg | female | 0 | yes | no | 1 | no | no_phone_service | dsl | no | ... | no | no | no | no | month-to-month | yes | electronic_check | 29.85 | 29.85 | no |
| 1 | 5575-gnvde | male | 0 | no | no | 34 | yes | no | dsl | yes | ... | yes | no | no | no | one_year | no | mailed_check | 56.95 | 1889.5 | no |
| 2 | 3668-qpybk | male | 0 | no | no | 2 | yes | no | dsl | yes | ... | no | no | no | no | month-to-month | yes | mailed_check | 53.85 | 108.15 | yes |
| 3 | 7795-cfocw | male | 0 | no | no | 45 | no | no_phone_service | dsl | yes | ... | yes | yes | no | no | one_year | no | bank_transfer_(automatic) | 42.3 | 1840.75 | no |
| 4 | 9237-hqitu | female | 0 | no | no | 2 | yes | no | fiber_optic | no | ... | no | no | no | no | month-to-month | yes | electronic_check | 70.7 | 151.65 | yes |
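The same normalization can be wrapped in a small reusable function. This is only a sketch on a made-up toy frame; `normalize_frame` is a hypothetical helper name, and it detects text columns with `pd.api.types.is_string_dtype` rather than comparing the dtype to `'object'`, which is an assumption made so that it also works when strings are stored in a dedicated string dtype.

```python
import pandas as pd

# Toy frame with mixed case and spaces (made-up values)
toy = pd.DataFrame({
    'Customer ID': ['7590-VHVEG', '5575-GNVDE'],
    'Payment Method': ['Electronic check', 'Mailed check'],
    'Monthly Charges': [29.85, 56.95],
})

def normalize_frame(frame):
    """Return a copy with lower-case, underscore-separated names and text values."""
    out = frame.copy()
    out.columns = out.columns.str.lower().str.replace(' ', '_')
    # Only text columns are touched; numeric columns are left unchanged
    text_columns = [c for c in out.columns if pd.api.types.is_string_dtype(out[c])]
    for c in text_columns:
        out[c] = out[c].str.lower().str.replace(' ', '_')
    return out

clean = normalize_frame(toy)
print(clean)
```

Working on a copy keeps the raw frame available for comparison, at the cost of some extra memory.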


In [29]:
# Checking the dtypes reveals two things worth noting.
# seniorcitizen is an int; that is fine, it uses 0/1 instead of yes/no
# totalcharges is an object, which is unexpected for a numeric column, so we take a closer look
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [30]:
# totalcharges looks like a number, but its dtype is object
df.totalcharges

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: totalcharges, Length: 7043, dtype: object

In [8]:
# If we try to convert to numeric we get an error, so some values are not numeric
pd.to_numeric(df.totalcharges)

ValueError: Unable to parse string "_" at position 488

In [31]:
# With errors='coerce', pd.to_numeric replaces values it cannot parse with NaN instead of raising an error
tc = pd.to_numeric(df.totalcharges, errors='coerce')
tc

0         29.85
1       1889.50
2        108.15
3       1840.75
4        151.65
         ...   
7038    1990.50
7039    7362.90
7040     346.45
7041     306.60
7042    6844.50
Name: totalcharges, Length: 7043, dtype: float64

In [32]:
# And the values that were not numbers are left as null
tc.isnull().sum()

np.int64(11)

In [33]:
# If we check the rows that are null, we see that their value is '_', which is why the column was read as object
# Those were probably whitespace in the original file, and we replaced whitespace with '_'
df[tc.isnull()][['customerid', 'totalcharges']]

|  | customerid | totalcharges |
|---|---|---|
| 488 | 4472-lvygi | _ |
| 753 | 3115-czmzd | _ |
| 936 | 5709-lvoeq | _ |
| 1082 | 4367-nuyao | _ |
| 1340 | 1371-dwpaz | _ |
| 3331 | 7644-omvmy | _ |
| 3826 | 3213-vvolg | _ |
| 4380 | 2520-sgtta | _ |
| 5218 | 2923-arzlg | _ |
| 6670 | 4075-wkniu | _ |
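The mechanism can be reproduced on a tiny, made-up Series. The sketch assumes the original blank value was a single space: the string normalization turns it into `'_'`, and `errors='coerce'` then turns that unparseable string into NaN.

```python
import pandas as pd

# Made-up raw values; the middle one is a single space
raw = pd.Series(['29.85', ' ', '1889.5'])

# Same normalization as applied to the categorical columns
cleaned = raw.str.lower().str.replace(' ', '_')

# Unparseable strings become NaN instead of raising ValueError
numeric = pd.to_numeric(cleaned, errors='coerce')
n_missing = int(numeric.isnull().sum())

# Filling with 0 is one possible choice for the missing values
filled = numeric.fillna(0)
print(cleaned.tolist(), n_missing, filled.tolist())
```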


In [34]:
# So we convert totalcharges to a numeric dtype and fill the null values with 0
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce').fillna(0)
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                object
dtype: object

In [35]:
# Now we don't have nulls
df.totalcharges.isnull().sum()

np.int64(0)

In [36]:
# We check the churn column
df.churn.head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

In [37]:
# We are going to transform the yes/no to 1/0
(df.churn == 'yes').astype(int).head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

In [38]:
# This line replaces the churn column with 1/0
df.churn = (df.churn == 'yes').astype(int)
df.churn.head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

## Setting Up The Validation Framework

As before, we want to split the dataset into 60% for training, 20% for validation and 20% for testing. Previously we did this with pandas; this time we do it with Scikit-Learn.

In [40]:
from sklearn.model_selection import train_test_split

# train_test_split?

# Split the data into two parts: 80% | 20%
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

len(df_full_train), len(df_test)

(5634, 1409)

In [43]:
# We split df_full_train again, 75% | 25%: 25% of the remaining 80% is 20% of the original dataset, which gives 60% | 20% | 20% overall
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

len(df_train), len(df_val), len(df_test)

(4225, 1409, 1409)
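The arithmetic behind the second split is easier to see on a toy frame of 100 rows, where the target sizes are exactly 60, 20 and 20. This is a sketch on made-up data that uses the same two `train_test_split` calls as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset with 100 rows (made-up data)
toy = pd.DataFrame({'x': range(100)})

# First split: 80 rows for full train, 20 rows for test
full_train, test = train_test_split(toy, test_size=0.2, random_state=1)

# Second split: 20 validation rows out of 80 is 20 / 80 = 0.25
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

sizes = (len(train), len(val), len(test))
print(sizes)
```

`train_test_split` also accepts a `stratify` argument, which may be worth considering when the share of churned customers should stay the same in every part.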

In [44]:
# We reset the indexes (This is not needed)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [45]:
# We extract our y
y_train = df_train.churn
y_val = df_val.churn
y_test = df_test.churn

In [46]:
# And we delete the churn variables from our datasets

# del df_train["churn"]
df_train = df_train.drop(columns=['churn'])

# del df_val["churn"]
df_val = df_val.drop(columns=['churn'])

# del df_test["churn"]
df_test = df_test.drop(columns=['churn'])

## EDA (Exploratory Data Analysis)

- Check missing values
- Look at the target variable (churn)
- Look at the numerical and categorical variables

In [48]:
# We will use df_full_train

# We reset the index (not needed)
df_full_train = df_full_train.reset_index(drop=True)

# We check if we have missing values...
# We don't have any missing values
df_full_train.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [49]:
# We want to check how many clients have churned
df_full_train.churn.value_counts()

churn
0    4113
1    1521
Name: count, dtype: int64

In [50]:
# With normalize=True we get proportions instead of counts (the churn rate)
df_full_train.churn.value_counts(normalize=True)

churn
0    0.730032
1    0.269968
Name: proportion, dtype: float64

In [51]:
# We can also get the churn rate with the mean
# It works because the column only contains 0 and 1, so the mean is the share of 1s
df_full_train.churn.mean()

np.float64(0.26996805111821087)

In [52]:
global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2) # Clients are churning at 27%

np.float64(0.27)
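A short check on a made-up 0/1 Series shows why the mean and the normalized count agree: the sum of the values counts the 1s, and dividing by the length gives their share.

```python
import pandas as pd

# Made-up target: 3 churned customers out of 10
churn = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

# Mean of a 0/1 column = (number of 1s) / (number of rows)
rate_from_mean = churn.mean()

# Same quantity read from the normalized value counts
rate_from_counts = churn.value_counts(normalize=True).loc[1]

print(rate_from_mean, rate_from_counts)
```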

In [53]:
# We check the types to determine numerical and categorical columns (seniorcitizen is an int64 but it is a categorical value)
df_full_train.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object

In [61]:
# We can create a list of the columns that are numerical
numerical = ['tenure', 'monthlycharges', 'totalcharges']

# And we create a list of the columns that are categorical (we remove customerid because it is an id, not a category)
# Note that churn, the target, is still part of this list
categorical = [c for c in list(df.columns) if c not in numerical + ['customerid']]

In [63]:
# We get the count of unique values for categorical columns (to see how many categories each column has)
df_full_train[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
churn               2
dtype: int64
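When the categorical list is derived by exclusion, a misspelled name in the numerical list does not raise an error; the real column silently ends up among the categoricals. One way to guard against this is a small check that every listed name exists. The sketch below uses a made-up toy frame; `split_feature_columns` is a hypothetical helper, and excluding `churn` along with `customerid` is an assumption about how the lists would be used as model features.

```python
import pandas as pd

# Toy frame with a subset of the columns (made-up values)
toy = pd.DataFrame({
    'customerid': ['a', 'b'],
    'tenure': [1, 34],
    'monthlycharges': [29.85, 56.95],
    'totalcharges': [29.85, 1889.5],
    'contract': ['month-to-month', 'one_year'],
    'churn': [0, 1],
})

def split_feature_columns(frame, numerical, exclude):
    """Return the categorical columns, failing loudly on unknown names."""
    missing = [c for c in numerical + exclude if c not in frame.columns]
    if missing:
        raise KeyError(f'columns not found: {missing}')
    return [c for c in frame.columns if c not in numerical + exclude]

numerical = ['tenure', 'monthlycharges', 'totalcharges']
categorical = split_feature_columns(toy, numerical, exclude=['customerid', 'churn'])
print(categorical)
```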