# 3. Machine Learning for Classification

[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) for churn prediction

## 3.1 The Project Data
- **Dataset**: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

## 3.2 Data Preparation

- Downloading the data and reading it with pandas
- Looking at the data (not yet EDA)
- Make column names and values look uniform
- Check if all columns are read correctly
- Check if the churn variable needs any preparation

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
data_source = "https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv"
!wget -c $data_source -O data-week-3.csv

--2023-09-25 16:45:36--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8003::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [4]:
# Loading the data
df = pd.read_csv("data-week-3.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Steps done here:
- Lower-casing all column names and string values
- Replacing all spaces in column names and string values with `_`

In [5]:
# Lower-case the column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Apply the same normalization to the values of all string (object) columns
categorical_columns = list(df.dtypes[df.dtypes == "object"].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(" ", "_")

In [6]:
df.head().T  # For visualization purposes (transpose the dataframe)

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [7]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

**General overview of the data**:

| **`Feature`**     | **`Type`**          |**`Description`**                                                  |
| ----------------- | ------------------- |------------------------------------------------------------------ |
| **customerid**      | `object (String)`     | The customer ID.                                                  |
| **gender**          | `object (String)`     | Whether the customer is a `male` or a `female`.                   |
| **seniorcitizen**   | `numerical (int64)`   | Whether the customer is a senior citizen or not (`1`, `0`).       |
| **partner**         | `object (String)`     | Whether the customer has a partner or not (`yes`, `no`).          |
| **dependents**      | `object (String)`     | Whether the customer has dependents or not (`yes`, `no`).         |
| **tenure**          | `numerical (int64)`   | Number of months the customer has stayed with the company.        |
| **phoneservice**    | `object (String)`     | Whether the customer has a phone service or not (`yes`, `no`).    |
| **multiplelines**   | `object (String)`     | Whether the customer has multiple lines or not (`yes`, `no`, `no_phone_service`). |
| **internetservice** | `object (String)`     | Customer’s internet service provider (`dsl`, `fiber_optic`, `no`).|
| **onlinesecurity**  | `object (String)`     | Whether the customer has online security or not (`yes`, `no`, `no_internet_service`). |
| **onlinebackup**    | `object (String)`     | Whether the customer has online backup or not (`yes`, `no`, `no_internet_service`). |
| **deviceprotection**| `object (String)`     | Whether the customer has device protection or not (`yes`, `no`, `no_internet_service`). |
| **techsupport**     | `object (String)`     | Whether the customer has tech support or not (`yes`, `no`, `no_internet_service`). |
| **streamingtv**     | `object (String)`     | Whether the customer has streaming TV or not (`yes`, `no`, `no_internet_service`). |
| **streamingmovies** | `object (String)`     | Whether the customer has streaming movies or not (`yes`, `no`, `no_internet_service`). |
| **contract**        | `object (String)`     | The contract term of the customer (`month-to-month`, `one_year`, `two_year`). |
| **paperlessbilling**| `object (String)`     | Whether the customer has paperless billing or not (`yes`, `no`). |
| **paymentmethod**   | `object (String)`     | The customer’s payment method (`electronic_check`, `mailed_check`, `bank_transfer_(automatic)`, `credit_card_(automatic)`) | 
| **monthlycharges**  | `numerical (float64)` | The amount charged to the customer monthly. |
| **totalcharges**    | `object (String)`     | The total amount charged to the customer. |
| **churn**           | `object (String)`     | Whether the customer churned or not (`yes`, `no`). | 

In [12]:
print(df["seniorcitizen"].unique())
print(df["totalcharges"].dtypes)
df["totalcharges"].head()

[0 1]
object


0      29.85
1     1889.5
2     108.15
3    1840.75
4     151.65
Name: totalcharges, dtype: object

`seniorcitizen` is already encoded numerically (`0`/`1`), whereas `totalcharges` is stored as strings instead of numbers. The latter is a problem and has to be fixed by converting the column to numerical values:

In [14]:
# A plain pd.to_numeric(df["totalcharges"]) would raise an error, because missing values were stored as " ".
# Since " " was replaced with "_" above, these entries cannot be parsed as numbers.

tc = pd.to_numeric(df["totalcharges"], errors="coerce")  # coerce unparsable values to NaN

In [18]:
df[tc.isnull()][["customerid", "totalcharges"]] # There are 11 rows with NaN-values / missing entries

Unnamed: 0,customerid,totalcharges
488,4472-lvygi,_
753,3115-czmzd,_
936,5709-lvoeq,_
1082,4367-nuyao,_
1340,1371-dwpaz,_
3331,7644-omvmy,_
3826,3213-vvolg,_
4380,2520-sgtta,_
5218,2923-arzlg,_
6670,4075-wkniu,_


In [20]:
# Same conversion as above, then filling the missing values with 0
df["totalcharges"] = pd.to_numeric(df["totalcharges"], errors="coerce")
df["totalcharges"] = df["totalcharges"].fillna(0)
df[df["totalcharges"].isnull()][["customerid", "totalcharges"]] # Empty!

Unnamed: 0,customerid,totalcharges


Filling with 0 may not be the best approach, since these customers most likely did spend some money, but the strategy works reasonably well in practice.

**Transforming the `churn`-Variable**

In [22]:
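# Encode the target as integers: "yes" -> 1, "no" -> 0.
# Assigning the result back avoids calling replace(..., inplace=True) on a
# selected column, which recent pandas versions warn about.
df["churn"] = (df["churn"] == "yes").astype(int)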
print("Churn: ", df["churn"].unique())

Churn:  [0 1]


## 3.3 Setting up the validation framework


- Splitting the data into Train-/Val-/Test-set with `scikit-learn`

In [30]:
from sklearn.model_selection import train_test_split
random_state = 1

In [31]:
df_full_train, df_test = train_test_split(df, test_size=0.2, 
                                          random_state=random_state)
print(df_full_train.shape, df_test.shape)

(5634, 21) (1409, 21)


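Next, `df_full_train` is split into `df_train` and `df_val`. The validation set should hold $20\%$ of the whole dataset, but `df_full_train` only contains $80\%$ of it, so the `test_size` has to be expressed relative to that $80\%$:
-  $\frac{20}{80} = \frac{1}{4} = 0.25 = 25\%$

We now split the full training set again with the computed ratio: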

In [36]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25,
                                    random_state=random_state)
print("df_train: ", len(df_train), f" | ratio: {len(df_train) / len(df)*100:.0f}%")
print("df_val: ", len(df_val), f"   | ratio: {len(df_val) / len(df) * 100:.0f}%")
print("df_test: ", len(df_test), f"  | ratio: {len(df_test) / len(df) * 100:.0f}%")


df_train:  4225  | ratio: 60%
df_val:  1409    | ratio: 20%
df_test:  1409   | ratio: 20%


**Extracting the target-variable `churn` as $y$ from the dataframe**

In [39]:
# Resetting the shuffled indices
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
# Extracting the churn-values as numpy arrays
y_train = df_train["churn"].values
y_val = df_val["churn"].values
y_test = df_test["churn"].values

In [43]:
# Removing the target-value from the dataframe
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]



## 3.4 EDA

- Checking missing values
- Look at target variable `churn`
- Look at numerical and categorical variables

In [44]:
df_full_train = df_full_train.reset_index(drop=True)

In [47]:
# Share of churned vs. non-churned customers
df_full_train["churn"].value_counts(normalize=True).round(2)

churn
0    0.73
1    0.27
Name: proportion, dtype: float64

The 27% seen above is the (global) **churn rate**, i.e. the mean of the binary target:
- Formula: $\mu_{\text{churn}} = \frac{1}{n}\sum_i x_i,\quad x_i\in\{0, 1\}$
- Equivalently: $\frac{\text{\# of 1's}}{n} = \text{churn rate}$


In [48]:
global_churn_rate = df_full_train["churn"].mean()
round(global_churn_rate, 2)

0.27

**Next: looking at the other columns of the dataframe**

In [49]:
df_full_train.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object

There are 3 truly numerical variables (`seniorcitizen` is stored as `int64` but is a binary flag):
- `tenure`
- `monthlycharges`
- `totalcharges`

In [50]:
numerical = ["tenure", "monthlycharges", "totalcharges"]

The categorical columns of the dataframe (everything except `customerid`, the numerical columns, and the target `churn`):

In [53]:
categorical = [
    'gender', 'seniorcitizen', 'partner', 
    'dependents', 'phoneservice', 'multiplelines', 
    'internetservice', 'onlinesecurity', 'onlinebackup', 
    'deviceprotection', 'techsupport', 'streamingtv', 
    'streamingmovies', 'contract', 'paperlessbilling',
    'paymentmethod'
]

['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod']


In [54]:
# Most of the columns have only a few discrete values (typical)
df_full_train[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## 3.5 Feature importance: Churn rate and risk ratio

**`Feature importance (part of EDA)`**: identifying which features affect our target variable
- Churn rate
- Risk ratio
- Mutual information (later)

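**Churn rate**: the churn rate within a group (for example, all customers with `gender == "female"`), compared with the global churn rate by taking the difference.

**Risk ratio**: the churn rate within a group divided by the global churn rate. A value above 1 means the group churns more often than average; a value below 1 means it churns less often.

Both quantities, expressed as an SQL-style query (with `global_churn` denoting the global churn rate):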

```SQL
SELECT
    gender,
    AVG(churn),
    AVG(churn) - global_churn AS diff,
    AVG(churn) / global_churn AS risk
FROM
    data
GROUP BY
    gender;
```
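The same aggregation can be sketched in pandas. The snippet below is only an illustration on a made-up toy frame: `toy` and its values are assumptions standing in for `df_full_train`, and `df_group` is a hypothetical name.

```python
import pandas as pd

# Hypothetical toy data standing in for df_full_train (values are made up)
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "churn":  [1, 0, 0, 0, 1, 1],
})

global_churn = toy["churn"].mean()

# Equivalent of the GROUP BY query: churn rate per group ...
df_group = toy.groupby("gender")["churn"].agg(["mean", "count"])
# ... its difference from the global churn rate ...
df_group["diff"] = df_group["mean"] - global_churn
# ... and the risk ratio (group rate divided by global rate)
df_group["risk"] = df_group["mean"] / global_churn

print(df_group)
```

With the real data, the same lines would presumably be run on `df_full_train` once per column in `categorical`.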

## 3.6 Feature Importance: Mutual Information

**Mutual information**: a concept from information theory that tells us how much we can learn about one variable if we know the value of another
- https://en.wikipedia.org/wiki/Mutual_information
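A minimal sketch of how mutual information between each categorical column and `churn` could be computed with `sklearn.metrics.mutual_info_score`. The toy frame and the helper name `mutual_info_churn_score` are assumptions for illustration; on the real data the helper would be applied to `df_full_train[categorical]`.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical toy data (made up): contract determines churn, gender barely does
toy = pd.DataFrame({
    "contract": ["month-to-month", "month-to-month", "two_year",
                 "two_year", "month-to-month", "two_year"],
    "gender":   ["female", "male", "female", "male", "male", "female"],
    "churn":    [1, 1, 0, 0, 1, 0],
})

def mutual_info_churn_score(series):
    # Mutual information between one categorical column and the target
    return mutual_info_score(series, toy["churn"])

# One score per column, most informative first
mi = toy[["contract", "gender"]].apply(mutual_info_churn_score)
mi = mi.sort_values(ascending=False)
print(mi)
```

Higher values indicate that knowing the column tells us more about churn; the scores are mainly useful for ranking features against each other.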

## 3.7 Feature Importance: Correlation

How about numerical columns?
- Correlation coefficient
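For the numerical columns, the (Pearson) correlation with the binary target can be sketched with `DataFrame.corrwith`. The values below are made up and only stand in for `df_full_train[numerical]` and `df_full_train["churn"]`.

```python
import pandas as pd

# Hypothetical toy data (made up) standing in for df_full_train
toy = pd.DataFrame({
    "tenure":         [1, 2, 5, 24, 48, 60],
    "monthlycharges": [70.0, 85.5, 90.0, 55.0, 40.0, 25.0],
    "churn":          [1, 1, 1, 0, 0, 0],
})

# Correlation of every numerical column with the target
corr = toy[["tenure", "monthlycharges"]].corrwith(toy["churn"])
print(corr)
```

A negative coefficient means churn becomes less frequent as the value grows, a positive one means the opposite, and the absolute value indicates the strength of the linear relationship.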

## 3.8 One-Hot Encoding

- Use `scikit-learn` to encode categorical features

## 3.9 Logistic Regression

- Binary classification
- Linear- vs. Logistic-Regression
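The difference between the two models can be sketched in a few lines: both compute the same linear score, but logistic regression passes it through the sigmoid function so that the output lies between 0 and 1. The function names and the example weights below are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def linear_regression_output(x, w0, w):
    # Plain linear score: bias plus weighted sum of the features
    return w0 + x @ w

def logistic_regression_output(x, w0, w):
    # Same linear score, squashed into a probability
    return sigmoid(w0 + x @ w)

# Made-up feature vector and weights
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
w0 = 0.0

score = linear_regression_output(x, w0, w)
prob = logistic_regression_output(x, w0, w)
print(score, prob)
```

A score of 0 corresponds to a probability of 0.5, which is why 0.5 is a natural default threshold for a binary decision.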

## 3.10 Training a logistic regression model with `scikit-learn`

- Train a model with `scikit-learn`
- Apply it to the validation dataset
- Calculate the accuracy
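The three steps above could look roughly as follows. This is a sketch on made-up toy records, not the notebook's actual run: on the real data, the dictionaries would come from `df_train` and `df_val`, and the targets from `y_train` and `y_val`. The 0.5 threshold is an assumed default.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy records (made up) standing in for df_train / df_val
train_dicts = [
    {"contract": "month-to-month", "tenure": 1},
    {"contract": "month-to-month", "tenure": 3},
    {"contract": "month-to-month", "tenure": 2},
    {"contract": "two_year", "tenure": 40},
    {"contract": "two_year", "tenure": 55},
    {"contract": "two_year", "tenure": 30},
]
y_train = np.array([1, 1, 1, 0, 0, 0])

val_dicts = [
    {"contract": "month-to-month", "tenure": 2},
    {"contract": "two_year", "tenure": 48},
]
y_val = np.array([1, 0])

# One-hot encode: fit on the training data only, then reuse for validation
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class (churn)
y_pred = model.predict_proba(X_val)[:, 1]

# Turn probabilities into hard decisions and compare with the true labels
churn_decision = y_pred >= 0.5
accuracy = (y_val == churn_decision).mean()
print(y_pred.round(3), accuracy)
```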

## 3.11 Model interpretation

- Look at the coefficients
- Train a smaller model with fewer features

## 3.12 Using the model

## 3.13 Summary

- **Feature importance**: 
    - risk, mutual information, correlation
- **One-hot encoding**: 
    - can be implemented with `DictVectorizer`
- **Logistic Regression**: 
    - linear model like linear regression
- **Output of Logistic Regression**: 
    - probability
- **Interpretation of weights/parameters**: 
    - similar to linear regression