# 3. Machine Learning for Classification

[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) for churn prediction

## 3.1 The Project Data
- **Dataset**: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

## 3.2 Data Preparation

- Download the data and read it with pandas
- Look at the data (not yet EDA)
- Make column names and values look uniform
- Check if all columns are read correctly
- Check if the churn variable needs any preparation

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [8]:
data_source = "https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv"
!wget -c $data_source -O data-week-3.csv

--2023-09-25 12:59:02--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [9]:
# Loading the data
df = pd.read_csv("data-week-3.csv")
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [12]:
# Lowercase the column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Apply the same normalization to the values of all string (object) columns
categorical_columns = list(df.dtypes[df.dtypes == "object"].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(" ", "_")

In [13]:
df.head().T  # Transposed view to confirm that names and values are now uniform

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |
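
The last two items of the checklist (column types and the churn variable) are not covered by the cells above. A minimal sketch of how they might be handled, assuming `totalcharges` was read as text because a few entries are not numeric, and assuming the target should become 0/1. The `toy` frame and its values are made up for illustration, and filling with 0 is only one possible choice:

```python
import pandas as pd

# Toy frame standing in for the cleaned Telco data (values are made up).
toy = pd.DataFrame({
    "totalcharges": ["29.85", "1889.5", "_"],  # "_" mimics a blank entry after cleaning
    "churn": ["no", "no", "yes"],
})

# Coerce non-numeric entries to NaN, then fill them with 0 (an assumption, not the only option).
toy["totalcharges"] = pd.to_numeric(toy["totalcharges"], errors="coerce").fillna(0)

# Encode the target as 1 (churned) and 0 (stayed).
toy["churn"] = (toy["churn"] == "yes").astype(int)
```

After this step both columns are numeric, so they can be averaged and passed to a model.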


## 3.3 Setting up the validation framework


- Splitting the data into Train-/Val-/Test-set with `scikit-learn`
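
One way to do the split is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and a fixed `random_state`; neither is stated in these notes, and the `toy` frame is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame standing in for the Telco data (values are made up).
toy = pd.DataFrame({"tenure": np.arange(100), "churn": np.arange(100) % 2})

# Hold out 20% for testing, then take 25% of the remaining 80% (20% overall) for validation.
df_full_train, df_test = train_test_split(toy, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# Separate the target from the features so it cannot leak into the model inputs.
y_train = df_train["churn"].to_numpy()
y_val = df_val["churn"].to_numpy()
y_test = df_test["churn"].to_numpy()
X_train_df = df_train.drop(columns="churn")
```

Splitting off the test set first keeps it untouched while the validation set is used for model selection.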

## 3.4 EDA

- Checking missing values
- Look at target variable `churn`
- Look at numerical and categorical variables
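
The three checks above could look roughly like this in pandas. The `toy` frame is a made-up stand-in, and the column names are assumed to match the cleaned dataset:

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "contract": ["month-to-month", "one_year", "month-to-month", "one_year", "month-to-month"],
    "churn": [0, 0, 1, 0, 1],
})

# Missing values per column.
missing_per_column = toy.isnull().sum()

# Distribution of the target; with a 0/1 target the mean is the global churn rate.
churn_distribution = toy["churn"].value_counts(normalize=True)
global_churn = toy["churn"].mean()

# Number of distinct values in each categorical column.
categorical = ["contract"]
unique_counts = toy[categorical].nunique()
```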

## 3.5 Feature importance: Churn rate and risk ratio

**`Feature importance (part of EDA)`**: identifying which features affect our target variable
- Churn rate
- Risk ratio
- Mutual information (later)

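**Churn rate**: the share of customers who churned, i.e. the mean of the binary `churn` column, computed globally and per group.

**Risk ratio**: the churn rate within a group divided by the global churn rate; values above 1 mark groups that churn more often than average.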

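```SQL
-- Per-group churn rate compared with the global churn rate.
-- Assumes churn is stored as a numeric 0/1 column in a table named data.
SELECT
    gender,
    AVG(churn) AS churn_rate,
    AVG(churn) - (SELECT AVG(churn) FROM data) AS diff,
    AVG(churn) / (SELECT AVG(churn) FROM data) AS risk
FROM
    data
GROUP BY
    gender;
```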
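
The same aggregation can be sketched in pandas with `groupby`. The `toy` frame is made up, and `churn` is assumed to be encoded as 0/1:

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "male", "female", "female"],
    "churn": [0, 0, 1, 0, 1, 1],
})

# Global churn rate: the mean of the 0/1 target.
global_churn = toy["churn"].mean()

# Per-group churn rate, its difference from the global rate, and the risk ratio.
group_stats = toy.groupby("gender")["churn"].agg(["mean", "count"])
group_stats["diff"] = group_stats["mean"] - global_churn
group_stats["risk"] = group_stats["mean"] / global_churn
```

Keeping `count` next to the rate helps to spot groups that are too small for the ratio to be trusted.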

## 3.6 Feature importance: Mutual information

**Mutual information**: a concept from information theory that tells us how much we can learn about one variable if we know the value of another
- https://en.wikipedia.org/wiki/Mutual_information
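
For categorical features, scikit-learn's `mutual_info_score` can be applied column by column. A small sketch on made-up data, where `contract` is constructed to determine `churn` completely and `gender` is not:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "contract": ["month-to-month", "one_year", "month-to-month",
                 "one_year", "month-to-month", "two_year"],
    "gender": ["female", "male", "male", "male", "female", "female"],
    "churn": [1, 0, 1, 0, 1, 0],
})

categorical = ["contract", "gender"]

# Mutual information between each categorical column and the target, highest first.
mi = toy[categorical].apply(lambda col: mutual_info_score(col, toy["churn"]))
mi = mi.sort_values(ascending=False)
```

The absolute values are hard to interpret on their own; the ranking of features is what matters here.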

## 3.7 Feature Importance: Correlation

How about numerical columns?
- Correlation coefficient
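
For numerical columns, the Pearson correlation with the 0/1 target can be computed with `DataFrame.corrwith`. The sketch below uses made-up values and assumes the column names of the cleaned dataset:

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 60],
    "monthlycharges": [29.85, 56.95, 53.85, 42.30, 70.70, 25.00],
    "churn": [0, 0, 1, 0, 1, 0],
})

numerical = ["tenure", "monthlycharges"]

# Correlation of each numerical column with the target.
# Negative: higher values go with less churn. Positive: higher values go with more churn.
corr = toy[numerical].corrwith(toy["churn"])
```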

## 3.8 One-Hot Encoding

- Use `scikit-learn` to encode categorical features

## 3.9 Logistic Regression

- Binary classification
- Linear vs. logistic regression
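
The difference between the two models can be illustrated in a few lines of NumPy: both compute the same linear score, and logistic regression additionally passes it through the sigmoid to obtain a value between 0 and 1. The function names, weights, and inputs below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def linear_regression(x, w0, w):
    """Linear score: bias plus weighted sum of the features."""
    return w0 + np.dot(x, w)

def logistic_regression(x, w0, w):
    """Same linear score, squashed by the sigmoid into a probability."""
    return sigmoid(linear_regression(x, w0, w))

# Made-up feature vector and weights, for illustration only.
x = np.array([1.0, 2.0])
w0 = -0.5
w = np.array([0.25, 0.5])

score = linear_regression(x, w0, w)   # unbounded real number
prob = logistic_regression(x, w0, w)  # value strictly between 0 and 1
```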

## 3.10 Training a logistic regression model with `scikit-learn`

- Train a model with `scikit-learn`
- Apply it to the validation dataset
- Calculate the accuracy
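
These three steps can be sketched as follows. The data is synthetic rather than the Telco dataset, and the 0.5 decision threshold is an assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the encoded feature matrix and the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Train the model on the training part.
model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the probability of class 1.
y_pred = model.predict_proba(X_val)[:, 1]

# Turn probabilities into hard decisions and compare them with the true labels.
churn_decision = y_pred >= 0.5
accuracy = (churn_decision == y_val).mean()
```

Using `predict_proba` instead of `predict` keeps the probabilities available, so the threshold can be changed later.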

## 3.11 Model interpretation

- Look at the coefficients
- Train a smaller model with fewer features
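
A sketch of both steps on synthetic data: pair each coefficient with its feature name, then refit on a subset of the columns. The feature names and the data-generating rule are made up so that longer tenure lowers churn and higher charges raise it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, standardized features; the names are only labels for illustration.
rng = np.random.default_rng(1)
feature_names = ["tenure", "monthlycharges"]
X = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
y = (X[:, 1] - X[:, 0] + noise > 0).astype(int)

model = LogisticRegression().fit(X, y)

# intercept_ is the bias term; coef_[0] holds one weight per feature.
bias = model.intercept_[0]
weights = dict(zip(feature_names, model.coef_[0].round(3)))

# Smaller model: keep only the first feature and refit.
small_features = ["tenure"]
X_small = X[:, [0]]
small_model = LogisticRegression().fit(X_small, y)
small_weights = dict(zip(small_features, small_model.coef_[0].round(3)))
```

A positive weight pushes the predicted probability up as the feature grows, and a negative weight pushes it down.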

## 3.12 Using the model
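
A hedged sketch of scoring a single customer: the fitted `DictVectorizer` turns the customer's dictionary into a feature row, and the model returns a churn probability. The training records and the `customer` dictionary are made up:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set; a real model would be trained on the full training data.
train_dicts = [
    {"contract": "month-to-month", "tenure": 1},
    {"contract": "month-to-month", "tenure": 2},
    {"contract": "one_year", "tenure": 34},
    {"contract": "two_year", "tenure": 45},
    {"contract": "month-to-month", "tenure": 3},
    {"contract": "two_year", "tenure": 60},
]
y_train = [1, 1, 0, 0, 1, 0]

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
model = LogisticRegression().fit(X_train, y_train)

# Score one customer: transform (not fit_transform) so the columns match training.
customer = {"contract": "month-to-month", "tenure": 1}
X_customer = dv.transform([customer])
churn_probability = model.predict_proba(X_customer)[0, 1]
```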

## 3.13 Summary

- **Feature importance**: 
    - risk, mutual information, correlation
- **One-hot encoding**: 
    - can be implemented with `DictVectorizer`
- **Logistic Regression**: 
    - linear model like linear regression
- **Output of Logistic Regression**: 
    - probability
- **Interpretation of weights/parameters**: 
    - similar to linear regression