# Module 3 - Classification

## Churn prediction project

For this project we'll use the **[Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download)** dataset, available on Kaggle.

In this case, the file has already been downloaded and placed in this repository.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [18]:
df = pd.read_csv('./WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [20]:
df.columns = df.columns.str.lower().str.replace(' ','_')

In [21]:
df_categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

In [22]:
# lowercase the values of every string column and replace spaces with underscores
for column in df_categorical_columns:
    df[column] = df[column].str.lower().str.replace(' ','_')
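
To make the effect of these two normalization steps concrete, here is a minimal sketch on a tiny hand-made frame. The `toy` frame, its column names and its values are made up for illustration and are not the dataset itself; the string columns are created with an explicit `object` dtype so that the `dtypes == 'object'` check behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy frame (illustration only, not the Telco data)
toy = pd.DataFrame({
    "Customer ID": pd.Series(["7590-VHVEG", "5575-GNVDE"], dtype="object"),
    "Internet Service": pd.Series(["Fiber optic", "DSL"], dtype="object"),
    "Tenure": [1, 34],
})

# Step 1: normalize the column names
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Step 2: normalize the values, but only in string (object) columns
string_columns = list(toy.dtypes[toy.dtypes == "object"].index)
for column in string_columns:
    toy[column] = toy[column].str.lower().str.replace(" ", "_")

print(toy)
```

Numeric columns such as `tenure` are left untouched, because the `.str` accessor is only applied to the columns selected in step 2.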

In [24]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |


In [25]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

Looking at the totalcharges column, we would expect it to be numeric, but pandas reports it as an object column.

To understand what is happening, we can try to convert it to a numeric type and see the result:

In [26]:
pd.to_numeric(df.totalcharges)

ValueError: Unable to parse string "_" at position 488

Pandas is telling us that some values contain "_", which cannot be parsed as a number. These values were originally spaces, and they became underscores when we replaced spaces with underscores in the string columns.

To solve this we can do the following:

In [27]:
tc = pd.to_numeric(df.totalcharges, errors='coerce') # values that cannot be parsed become NaN

In [28]:
tc.isnull().sum()

11
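
The difference between the default behaviour and `errors='coerce'` can be sketched on a small hand-made Series. The values below are hypothetical and only mimic the situation above (two valid amounts and one "_" placeholder).

```python
import pandas as pd

# Hypothetical values: two parseable amounts and one "_" placeholder
raw = pd.Series(["29.85", "_", "1889.5"], dtype="object")

# Default behaviour (errors="raise"): the "_" value triggers a ValueError
try:
    pd.to_numeric(raw)
    strict_failed = False
except ValueError:
    strict_failed = True

# With errors="coerce", unparseable values become NaN instead of raising
coerced = pd.to_numeric(raw, errors="coerce")
n_missing = int(coerced.isnull().sum())

print(strict_failed, n_missing)
```

Counting the NaN values after coercion is a quick way to find out how many entries could not be parsed, which is what the `tc.isnull().sum()` call above does for the real column.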

Now we can see which rows have null values and decide what to do about them:

In [38]:
df[tc.isnull()][['customerid', 'totalcharges']]

| | customerid | totalcharges |
|---|---|---|
| 488 | 4472-lvygi | _ |
| 753 | 3115-czmzd | _ |
| 936 | 5709-lvoeq | _ |
| 1082 | 4367-nuyao | _ |
| 1340 | 1371-dwpaz | _ |
| 3331 | 7644-omvmy | _ |
| 3826 | 3213-vvolg | _ |
| 4380 | 2520-sgtta | _ |
| 5218 | 2923-arzlg | _ |
| 6670 | 4075-wkniu | _ |


Now that we have identified the problem, we can apply the same conversion to the actual column:

In [39]:
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce') # values that cannot be parsed become NaN

In [43]:
# fill the null values with 0
df.totalcharges = df.totalcharges.fillna(0)

In [44]:
df.totalcharges.isnull().sum()

0

Now there are no null values, and totalcharges is a numeric (float64) column, as we can see:

In [45]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                object
dtype: object

Finally, let's check our target column **churn**:

In [46]:
df.churn.head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

The values are yes and no. We need them to be 1's and 0's.

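For this we can use the following trick: comparing the column with 'yes' produces a boolean Series, and casting it to int turns True into 1 and False into 0.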

In [49]:
df.churn = (df.churn == 'yes').astype(int)

In [50]:
df.churn.head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

Now we have 1's and 0's instead of yes and no.
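
As a small self-contained sketch of the same encoding, the snippet below rebuilds the five rows shown in the output above and encodes them. One convenient side effect of a 0/1 target is that its mean is the fraction of positive cases; the `churn_rate` name is introduced here for illustration only, and the value it holds describes these five sample rows, not the full dataset.

```python
import pandas as pd

# The five values shown by df.churn.head() above, before encoding
churn = pd.Series(["no", "no", "yes", "no", "yes"], dtype="object")

# The comparison gives a boolean Series; astype(int) maps True -> 1, False -> 0
encoded = (churn == "yes").astype(int)

# With a 0/1 target, the mean is the share of 1's (sample rows only)
churn_rate = encoded.mean()

print(encoded.tolist(), churn_rate)
```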