# Machine Learning for Classification

Logistic Regression to predict churn

## 3.1 Telco Churn Dataset

- Dataset: https://www.kaggle.com/blastchar/telco-customer-churn
- Raw CSV: https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv

## 3.2 Data Preparation

- Download data, read it with pandas
- Look at the data
- Make column names and values look uniform
- Check if all the columns read correctly
- Check if the churn variable needs any preparation

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [3]:
data = "https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv"

In [50]:
# A leading '!' runs the rest of the line as a shell command.
# '$data' substitutes the value of the Python variable data into that command.
!wget $data -O data-week-3.csv

--2025-08-14 20:18:02--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 977501 (955K) [text/plain]
Saving to: ‘data-week-3.csv’


2025-08-14 20:18:03 (25.9 MB/s) - ‘data-week-3.csv’ saved [977501/977501]
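
The `!wget` line only works inside IPython/Jupyter on a machine where `wget` is installed. A minimal sketch of a pure-Python alternative, assuming the standard library's `urllib.request.urlretrieve` is acceptable here; the helper name `download_if_missing` is hypothetical and not part of the original notebook:

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, destination: str) -> Path:
    """Download url to destination unless that file already exists."""
    path = Path(destination)
    if not path.exists():
        # urlretrieve streams the response body into the given file name.
        urlretrieve(url, str(path))
    return path


# Hypothetical usage with the notebook's variable:
# download_if_missing(data, "data-week-3.csv")
```

pandas can also read a CSV directly from a URL with `pd.read_csv(data)`, which skips the local copy entirely.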



In [7]:
df = pd.read_csv('data-week-3.csv')
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [51]:
# To display all columns simultaneously, we can use the transpose function. 
# This will switch the rows to become columns and the columns to become rows.
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |


In [9]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [15]:
# Column names and values are not in a consistent format, so normalize both:
# lowercase everything and replace spaces with underscores.

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for col in categorical_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')

In [16]:
# Verification
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |
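
The normalization step can be tried in isolation on a small frame. The sketch below uses a made-up `toy` dataframe (not the Telco data) and a hypothetical helper `is_text_column`. The helper is an assumption added here because some pandas versions may store text under a dedicated string dtype rather than `object`, in which case a plain `dtypes == 'object'` comparison would miss those columns:

```python
import pandas as pd

# Made-up frame with mixed-case names and values, for illustration only.
toy = pd.DataFrame({
    "Customer ID": ["7590-VHVEG", "5575-GNVDE"],
    "PaymentMethod": ["Electronic check", "Mailed check"],
    "MonthlyCharges": [29.85, 56.95],
})


def is_text_column(series: pd.Series) -> bool:
    """True for object columns and for dedicated string-dtype columns."""
    return series.dtype == object or pd.api.types.is_string_dtype(series)


# Normalize the column names.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

# Normalize the values of every text column; numeric columns are untouched.
text_columns = [col for col in toy.columns if is_text_column(toy[col])]
for col in text_columns:
    toy[col] = toy[col].str.lower().str.replace(" ", "_", regex=False)

print(toy)
```

Passing `regex=False` makes the replacement a literal string substitution, so the result does not depend on the pandas default for that parameter.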


In [17]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object
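
In the listing above, `totalcharges` is stored as `object` even though it holds amounts. Rather than spotting this by eye, one could flag non-numeric columns whose values mostly parse as numbers. This is only a sketch on a made-up `toy` frame; the 0.5 threshold and the name `suspicious` are arbitrary assumptions:

```python
import pandas as pd

# Made-up frame: 'totalcharges' looks numeric but contains one bad value.
toy = pd.DataFrame({
    "gender": ["female", "male", "male"],
    "totalcharges": ["29.85", "_", "108.15"],
    "tenure": [1, 34, 2],
})

suspicious = []
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        continue  # already numeric, nothing to check
    parsed = pd.to_numeric(toy[col], errors="coerce")
    # Share of values that could be parsed as numbers.
    if parsed.notna().mean() > 0.5:
        suspicious.append(col)

print(suspicious)
```

Columns flagged this way are candidates for a closer look, not proof of a parsing problem.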

In [31]:
df.totalcharges

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: totalcharges, Length: 7043, dtype: object

In [25]:
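# pd.to_numeric(df.totalcharges) raises
# ValueError: Unable to parse string "_" at position 488.
# The "_" values are most likely blanks that the space replacement above
# turned into underscores.

# errors='coerce' turns values that cannot be parsed into NaN instead of raising.
tc = pd.to_numeric(df.totalcharges, errors='coerce')

# Show the customers whose totalcharges could not be parsed
df[tc.isnull()][['customerid', 'monthlycharges', 'totalcharges']]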

| | customerid | monthlycharges | totalcharges |
|---|---|---|---|
| 488 | 4472-lvygi | 52.55 | _ |
| 753 | 3115-czmzd | 20.25 | _ |
| 936 | 5709-lvoeq | 80.85 | _ |
| 1082 | 4367-nuyao | 25.75 | _ |
| 1340 | 1371-dwpaz | 56.05 | _ |
| 3331 | 7644-omvmy | 19.85 | _ |
| 3826 | 3213-vvolg | 25.35 | _ |
| 4380 | 2520-sgtta | 20.0 | _ |
| 5218 | 2923-arzlg | 19.7 | _ |
| 6670 | 4075-wkniu | 73.35 | _ |
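
The difference between the strict and the coercing conversion is easier to see on a three-value series. A minimal sketch with made-up values; the variable names are illustrative only:

```python
import pandas as pd

raw = pd.Series(["29.85", "_", "1889.5"], name="totalcharges")

# Strict mode (the default) stops at the first value it cannot parse.
strict_error = None
try:
    pd.to_numeric(raw)
except ValueError as exc:
    strict_error = exc

# Coercing mode keeps going and marks unparseable values as NaN.
parsed = pd.to_numeric(raw, errors="coerce")

# The NaN mask points back at the offending rows in the raw data.
bad_rows = raw[parsed.isnull()]

print(strict_error)
print(bad_rows)
```

Inspecting `bad_rows` before filling them in helps confirm that the unparseable entries are placeholders rather than real data that needs a different fix.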


In [42]:
# Convert totalcharges to numeric, then replace the resulting NaN values with 0
df.totalcharges = pd.to_numeric(df['totalcharges'], errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

In [44]:
df.totalcharges.isna().sum()

np.int64(0)

In [46]:
df.churn.head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

In [48]:
# Map yes -> 1 and no -> 0 (preview only; df is not modified yet)
(df.churn == 'yes').astype('int').head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

In [49]:
# Apply the transformation to the whole column and store the result back in df.churn.
df.churn = (df.churn == 'yes').astype('int')
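
One caveat with `(df.churn == 'yes').astype('int')`: every value other than `'yes'` becomes 0, so running the cell a second time on the already converted column silently produces all zeros. A guarded variant might look like the sketch below; `encode_yes_no` is a hypothetical helper, not part of the original notebook, and the sample series is made up:

```python
import pandas as pd


def encode_yes_no(values: pd.Series) -> pd.Series:
    """Map 'yes' to 1 and 'no' to 0, refusing anything else."""
    unexpected = set(values.dropna().unique()) - {"yes", "no"}
    if unexpected or values.isna().any():
        raise ValueError(f"expected only 'yes'/'no', found: {unexpected!r}")
    return (values == "yes").astype(int)


churn = pd.Series(["no", "no", "yes", "no", "yes"], name="churn")
encoded = encode_yes_no(churn)
print(encoded.tolist())

# A second pass on the encoded column now fails loudly instead of zeroing it.
try:
    encode_yes_no(encoded)
except ValueError as exc:
    print("refused:", exc)
```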

Commands, functions, and methods:

- !wget - shell command for downloading data (the leading ! runs it from a notebook cell)
- pd.read_csv() - read a CSV file into a dataframe
- df.head() - show the first rows of a dataframe
- df.head().T - show the first rows transposed, so that every column is visible
- df.columns - retrieve the column names of a dataframe
- df.columns.str.lower() - lowercase the column names of a dataframe
- df.columns.str.replace(' ', '_') - replace spaces in the column names with underscores
- df.dtypes - retrieve the data type of every column
- df.index - retrieve the index (row labels) of a dataframe
- pd.to_numeric() - convert a series to numeric values. With errors='coerce', values that cannot be parsed become NaN instead of raising an error.
- df.fillna() - replace missing values (NaN) with a given value
- (df.x == "yes").astype(int) - convert a yes/no series x to 1/0 integers