# Week 3 Notes

## 3.1 [Churn prediction project](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/01-churn-project.md)

We want to predict the likelihood of a customer churning, that is, of the customer no longer using the product or service. For example, a telecom company would like to know which customers are likely to churn so that it can offer them a promotional discount to keep them. Wrong predictions are costly in both directions:
- If a customer is predicted to churn but doesn't (**false positive**), the discount was unnecessary
- If a customer is not predicted to churn but does (**false negative**), we lose that customer
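
Both error types can be counted by comparing predictions with actual outcomes. The sketch below uses made-up arrays; `y_true` and `y_pred` are illustrative names, not data from the course.

```python
import numpy as np

# Made-up example: 1 = churned, 0 = did not churn
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

# False positive: predicted churn, but the customer stayed
false_positives = int(((y_pred == 1) & (y_true == 0)).sum())
# False negative: predicted no churn, but the customer left
false_negatives = int(((y_pred == 0) & (y_true == 1)).sum())

print(false_positives, false_negatives)
```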

We will create a classification model $g$ that predicts $y_i$, which is either 0 (did not churn) or 1 (churned). $g$ outputs a continuous number between 0 and 1 representing the likelihood of churn, where values close to 1 mean the customer is likely to churn. $i$ refers to the customer.

$$
g(x_i) \approx y_i \quad y_i \in \{0, 1\}
$$
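
As a rough illustration of the notation, the sketch below treats the outputs of $g$ as made-up likelihoods and turns them into 0/1 decisions. The 0.5 cut-off and all the numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical outputs of g(x_i) for five customers (made-up numbers)
churn_likelihood = np.array([0.05, 0.80, 0.35, 0.62, 0.49])

# Assumed decision threshold; the notes do not prescribe one here
threshold = 0.5
churn_decision = (churn_likelihood >= threshold).astype(int)

print(churn_decision)
```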

## 3.2 [Data preparation](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/02-data-preparation.md)

In [1]:
# data = "https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/refs/heads/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv"
# !wget $data -O data-week-3.csv

In [1]:
import pandas as pd


df = pd.read_csv("data-week-3.csv")

In [2]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Let's check all the columns:

In [3]:
df.head().T

Unnamed: 0,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


Let's normalize the columns:

In [4]:
df.columns = (
    df.columns
    .str.replace(" ", "_")
    .str.lower()
)

In [5]:
df.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

Let's do the same for the values in `object` columns.

In [6]:
for col in df.select_dtypes(object).columns:
    df[col] = (
        df[col]
        .str.lower()
        .str.replace(" ", "_")
    )

In [7]:
df.head().T

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerid        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   seniorcitizen     7043 non-null   int64  
 3   partner           7043 non-null   object 
 4   dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   phoneservice      7043 non-null   object 
 7   multiplelines     7043 non-null   object 
 8   internetservice   7043 non-null   object 
 9   onlinesecurity    7043 non-null   object 
 10  onlinebackup      7043 non-null   object 
 11  deviceprotection  7043 non-null   object 
 12  techsupport       7043 non-null   object 
 13  streamingtv       7043 non-null   object 
 14  streamingmovies   7043 non-null   object 
 15  contract          7043 non-null   object 
 16  paperlessbilling  7043 non-null   object 


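We will clean the data. Yes/no columns are converted to 0/1 integers. Columns with numeric values are converted to numeric dtypes such as `float` and `int`; `pd.to_numeric(..., downcast=...)` picks the smallest dtype that can still hold the values (here `int8` and `float32`). `totalcharges` contains blank entries, which became `_` after the normalization above, so we replace them with `0` before converting. The remaining text columns become `category`, and `customerid` becomes `string`.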

In [9]:
df = (
    df
    .assign(
        partner=lambda df_: pd.to_numeric((df_.partner=="yes").astype(int), downcast="integer"),
        # seniorcitizen is already stored as 0/1 integers, so comparing it to "yes" would zero it out; only downcast
        seniorcitizen=lambda df_: pd.to_numeric(df_.seniorcitizen, downcast="integer"),
        dependents=lambda df_: pd.to_numeric((df_.dependents=="yes").astype(int), downcast="integer"),
        phoneservice=lambda df_: pd.to_numeric((df_.phoneservice=="yes").astype(int), downcast="integer"),
        paperlessbilling=lambda df_: pd.to_numeric((df_.paperlessbilling=="yes").astype(int), downcast="integer"),
        tenure=lambda df_: pd.to_numeric(df_.tenure, downcast="integer"),
        monthlycharges=lambda df_: pd.to_numeric(df_.monthlycharges, downcast="float"),
        totalcharges=lambda df_: pd.to_numeric(df_.totalcharges.replace("_", "0"), errors="coerce", downcast="float"),
        churn=lambda df_: pd.to_numeric((df_.churn=="yes").astype(int), downcast="integer"),
    )
    .astype(
        {
            "customerid": "string",
            "gender": "category",
            "multiplelines": "category",
            "internetservice": "category",
            "onlinesecurity": "category",
            "onlinebackup": "category",
            "deviceprotection": "category",
            "techsupport": "category",
            "streamingtv": "category",
            "streamingmovies": "category",
            "contract": "category",
            "paymentmethod": "category",
        }
    )
)
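
The effect of `downcast` can be checked on a small example. This is a minimal sketch with made-up values chosen to resemble `tenure` and `monthlycharges`; the variable names are illustrative.

```python
import pandas as pd

# Made-up values, chosen to resemble tenure and monthlycharges
months = pd.Series([1, 34, 2, 45, 72])
charges = pd.Series([29.85, 56.95, 53.85])

# downcast picks the smallest integer/float dtype that can hold the values
months_small = pd.to_numeric(months, downcast="integer")
charges_small = pd.to_numeric(charges, downcast="float")

print(months.dtype, "->", months_small.dtype)
print(charges.dtype, "->", charges_small.dtype)
```

`float32` keeps roughly seven significant digits, which is why a value such as 56.95 can display as 56.950001 after the conversion.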

In [10]:
df.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,7590-vhveg,female,0,1,0,1,0,no_phone_service,dsl,no,...,no,no,no,no,month-to-month,1,electronic_check,29.85,29.85,0
1,5575-gnvde,male,0,0,0,34,1,no,dsl,yes,...,yes,no,no,no,one_year,0,mailed_check,56.950001,1889.5,0
2,3668-qpybk,male,0,0,0,2,1,no,dsl,yes,...,no,no,no,no,month-to-month,1,mailed_check,53.849998,108.150002,1
3,7795-cfocw,male,0,0,0,45,0,no_phone_service,dsl,yes,...,yes,yes,no,no,one_year,0,bank_transfer_(automatic),42.299999,1840.75,0
4,9237-hqitu,female,0,0,0,2,1,no,fiber_optic,no,...,no,no,no,no,month-to-month,1,electronic_check,70.699997,151.649994,1


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   customerid        7043 non-null   string  
 1   gender            7043 non-null   category
 2   seniorcitizen     7043 non-null   int8    
 3   partner           7043 non-null   int8    
 4   dependents        7043 non-null   int8    
 5   tenure            7043 non-null   int8    
 6   phoneservice      7043 non-null   int8    
 7   multiplelines     7043 non-null   category
 8   internetservice   7043 non-null   category
 9   onlinesecurity    7043 non-null   category
 10  onlinebackup      7043 non-null   category
 11  deviceprotection  7043 non-null   category
 12  techsupport       7043 non-null   category
 13  streamingtv       7043 non-null   category
 14  streamingmovies   7043 non-null   category
 15  contract          7043 non-null   category
 16  paperlessbilling  7043 n

## 3.3 [Setting up the validation framework](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/03-validation.md)


We will split the data into train, validation, and test. Instead of doing it using `numpy`, we will use a popular ML library called `scikit-learn`.

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
train_test_split?

Signature:
train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
)
Docstring:
Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation,
``next(ShuffleSplit().split(X, y))``, and application to input data
into a single call for splitting (and optionally subsampling) data into a
one-liner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
*arrays : sequence of indexables with sa

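We will make a 60-20-20 split again. We first hold out 20% of the data for testing, then take 25% of the remaining 80% for validation, which is 20% of the full dataset: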

In [14]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [15]:
print(df_train.shape[0])
print(df_val.shape[0])
print(df_test.shape[0])

4225
1409
1409
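
The sizes above can be reproduced by hand. The sketch assumes that `train_test_split` rounds the held-out part up and assigns the remainder to the training part, which matches the printed counts.

```python
import math

n_total = 7043  # number of rows, taken from df.info()

# Assumption: the held-out part is rounded up, the rest goes to training
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
```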


In [16]:
df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [17]:
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

In [18]:
y_train

array([0, 0, 1, ..., 1, 0, 1], dtype=int8)

## 3.4 [EDA](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/04-eda.md)


For EDA, we will look only at `df_full_train` and leave `df_test` untouched. Let's start by checking for missing values:

In [19]:
df_full_train.isna().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

There are no missing values. Next, let's look at the distribution of our target variable `churn`:

In [20]:
import matplotlib.pyplot as plt

df_full_train.churn.value_counts(normalize=True)

churn
0    0.730032
1    0.269968
Name: proportion, dtype: float64

In [21]:
# Another way to get the churn rate: the mean of a 0/1 column is the share of 1s
global_churn_rate = df_full_train.churn.mean().round(2)
global_churn_rate

np.float64(0.27)

The global churn rate is about 27%.
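
Both approaches give the same number because, for a 0/1 column, the mean equals the fraction of rows equal to 1. A minimal sketch on a made-up series (not the course data):

```python
import pandas as pd

# Made-up target column: 1 = churned, 0 = did not churn
churn = pd.Series([0, 1, 0, 0, 1, 0, 0, 0])

# Share of rows labelled 1, computed two ways
rate_from_counts = churn.value_counts(normalize=True).loc[1]
rate_from_mean = churn.mean()

print(rate_from_counts, rate_from_mean)
```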

In [22]:
numeric = ["monthlycharges", "totalcharges", "tenure"]
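# customerid is a unique identifier rather than a feature, so it is excluded along with the target
categorical = [col for col in df_full_train.columns if col not in numeric + ["churn", "customerid"]]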


## 3.5 [Feature importance: Churn rate and risk ratio](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/05-risk.md)


## 3.6 [Feature importance: Mutual information](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/06-mutual-info.md)


## 3.7 [Feature importance: Correlation](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/07-correlation.md)


## 3.8 [One-hot encoding](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/08-ohe.md)


## 3.9 [Logistic regression](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/09-logistic-regression.md)


## 3.10 [Training logistic regression with Scikit-Learn](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/10-training-log-reg.md)


## 3.11 [Model interpretation](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/11-log-reg-interpretation.md)


## 3.12 [Using the model](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/12-using-log-reg.md)


## 3.13 [Summary](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/13-summary.md)


## 3.14 [Explore more](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/14-explore-more.md)


## 3.15 [Homework](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/03-classification/homework.md)