# Group Assignment

The goal of this assignment is to predict the probability of churn for telco customers. The data is available in the file `churn.csv`.

The data contains the following columns:

- `customerID`: A unique identifier for each customer.
- `gender`: The customer's gender (Female, Male).
- `SeniorCitizen`: Whether the customer is a senior citizen or not (1, 0).
- `Partner`: Whether the customer has a partner or not (Yes, No).
- `Dependents`: Whether the customer has dependents or not (Yes, No).
- `tenure`: Number of months the customer has stayed with the company.
- `PhoneService`: Whether the customer has a phone service or not (Yes, No).
- `MultipleLines`: Whether the customer has multiple lines or not (Yes, No, No phone service).
- `InternetService`: Customer’s internet service provider (DSL, Fiber optic, No).
- `OnlineSecurity`: Whether the customer has online security or not (Yes, No, No internet service).
- `OnlineBackup`: Whether the customer has online backup or not (Yes, No, No internet service).
- `DeviceProtection`: Whether the customer has device protection or not (Yes, No, No internet service).
- `TechSupport`: Whether the customer has tech support or not (Yes, No, No internet service).
- `StreamingTV`: Whether the customer has streaming TV or not (Yes, No, No internet service).
- `StreamingMovies`: Whether the customer has streaming movies or not (Yes, No, No internet service).
- `Contract`: The contract term of the customer (Month-to-month, One year, Two year).
- `PaperlessBilling`: Whether the customer has paperless billing or not (Yes, No).
- `PaymentMethod`: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
- `MonthlyCharges`: The amount charged to the customer monthly.
- `TotalCharges`: The total amount charged to the customer.
- `Churn`: Whether the customer churned or not (Yes or No).

## Instructions

* The target variable is `Churn`.
* You should use Logistic Regression to make the predictions.
* Follow the steps below to prepare the data and build the model.


In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

## Step 1 (1 point)

Load the data in the file `churn.csv` and explore it.

What are you going to do with `customerID`?

In [4]:
df_churn = pd.read_csv('churn.csv')

# Add target variable to the dataframe
df_churn['target'] = df_churn['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

# Drop customerID since it's not a feature
df_churn = df_churn.drop(columns=['customerID'])

df_churn.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,target
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,...,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No,0
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,...,No,No,No,One year,No,Mailed check,56.95,1889.5,No,0
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,...,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,...,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No,0
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,...,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1


In [5]:
df_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 


## Step 2 (1 point)

Explore the dataset.

What's the deal with the `TotalCharges` column? Fix the column `TotalCharges` and convert it to a numerical data type.

What about missing values?

In [6]:
# Convert from object to numeric; values that cannot be parsed become NaN
df_churn['TotalCharges'] = pd.to_numeric(df_churn['TotalCharges'], errors='coerce')

df_churn.fillna(0, inplace=True)
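
To see what the conversion does, here is a minimal sketch on a few hypothetical values. It assumes the non-numeric entries in `TotalCharges` are blank strings; inspect the real column to confirm this. The names `sample`, `converted`, `n_missing`, and `filled` are illustrative only.

```python
import pandas as pd

# Hypothetical sample: the blank string stands in for a non-numeric entry.
sample = pd.Series(['29.85', ' ', '1889.5'], name='TotalCharges')

# errors='coerce' turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(sample, errors='coerce')
n_missing = int(converted.isna().sum())

# Replace the resulting NaN with 0, as in the cell above.
filled = converted.fillna(0)
print(converted.dtype, n_missing, filled.tolist())
```

Counting the NaN values before filling them shows how many rows are affected, which helps when justifying the choice of fill value.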

## Step 3 (1 point)

Build new features. Don't sweat it too much, just create a few new features that you think could be useful.

In [7]:
dummies = pd.get_dummies(df_churn[['gender', 'Partner', 'Dependents', 'PhoneService',
                         'MultipleLines', 'InternetService', 'OnlineSecurity',
                         'OnlineBackup', 'DeviceProtection', 'TechSupport',
                         'StreamingTV', 'StreamingMovies', 'Contract',
                         'PaperlessBilling', 'PaymentMethod'
                        ]])

dummies.columns

Index(['gender_Female', 'gender_Male', 'Partner_No', 'Partner_Yes',
       'Dependents_No', 'Dependents_Yes', 'PhoneService_No',
       'PhoneService_Yes', 'MultipleLines_No',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'InternetService_No', 'OnlineSecurity_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No', 'OnlineBackup_No internet service',
       'OnlineBackup_Yes', 'DeviceProtection_No',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No', 'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No', 'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No', 'StreamingMovies_No internet service',
       'StreamingMovies_Yes', 'Contract_Month-to-month', 'Contract_One year',
       'Contract_Two year', 'PaperlessBilling_No', 'PaperlessBilling_Yes',
       'PaymentMethod_

In [8]:
dummies.drop( ['gender_Male', 'Partner_No', 'Dependents_No',
              'PhoneService_No', 'MultipleLines_No', 'MultipleLines_No phone service',
               'InternetService_No', 'OnlineSecurity_No internet service', 'OnlineSecurity_No',
               'OnlineBackup_No internet service', 'OnlineBackup_No', 'DeviceProtection_No',
               'DeviceProtection_No internet service', 'TechSupport_No', 'TechSupport_No internet service',
               'StreamingTV_No', 'StreamingTV_No internet service', 'StreamingMovies_No',
               'StreamingMovies_No internet service', 'PaperlessBilling_No', 'Contract_Month-to-month'
              ], axis = 1, inplace = True)

dummies.head()

Unnamed: 0,gender_Female,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_Yes,InternetService_DSL,InternetService_Fiber optic,OnlineSecurity_Yes,OnlineBackup_Yes,DeviceProtection_Yes,TechSupport_Yes,StreamingTV_Yes,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,True,True,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False,False,True,False
1,False,False,False,True,False,True,False,True,False,True,False,False,False,True,False,False,False,False,False,True
2,False,False,False,True,False,True,False,True,True,False,False,False,False,False,False,True,False,False,False,True
3,False,False,False,False,False,True,False,True,False,True,True,False,False,True,False,False,True,False,False,False
4,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False,True,False


In [10]:
a = pd.DataFrame(df_churn[['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges','target']])

X = pd.concat([a, dummies], axis = 1)
y = X['target']
X = X.drop('target', axis=1)
X.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Female,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_Yes,InternetService_DSL,...,TechSupport_Yes,StreamingTV_Yes,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,29.85,True,True,False,False,False,True,...,False,False,False,False,False,True,False,False,True,False
1,0,34,56.95,1889.5,False,False,False,True,False,True,...,False,False,False,True,False,False,False,False,False,True
2,0,2,53.85,108.15,False,False,False,True,False,True,...,False,False,False,False,False,True,False,False,False,True
3,0,45,42.3,1840.75,False,False,False,False,False,True,...,True,False,False,True,False,False,True,False,False,False
4,0,2,70.7,151.65,True,False,False,True,False,False,...,False,False,False,False,False,True,False,False,True,False


In [11]:
X.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
       'gender_Female', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
       'MultipleLines_Yes', 'InternetService_DSL',
       'InternetService_Fiber optic', 'OnlineSecurity_Yes', 'OnlineBackup_Yes',
       'DeviceProtection_Yes', 'TechSupport_Yes', 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year',
       'PaperlessBilling_Yes', 'PaymentMethod_Bank transfer (automatic)',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
      dtype='object')

In [29]:
# Create 5 new features

# 1. Ratio of total and monthly charges
X['ChargeRatio'] = X['TotalCharges'] / X['MonthlyCharges']

# 2. Difference between total and monthly charges
X['ChargeDifference'] = X['TotalCharges'] - X['MonthlyCharges']

# 3. Total services
X['TotalServices'] = ((X['PhoneService_Yes']).astype(int) + (X['InternetService_DSL']).astype(int) + 
                      (X['InternetService_Fiber optic']).astype(int) + (X['OnlineSecurity_Yes']).astype(int) + (X['OnlineBackup_Yes']).astype(int) +
                      (X['DeviceProtection_Yes']).astype(int) + (X['TechSupport_Yes']).astype(int) + (X['StreamingTV_Yes']).astype(int) +
                      (X['StreamingMovies_Yes']).astype(int)
                     )

# 4. Average total charges per subscribed service
X['AvgChargePerService'] = X['TotalCharges']/X['TotalServices']


# 5. Partner & Dependents (True only when the customer has both)
X['Partner&Dependents'] = X['Partner_Yes'] & X['Dependents_Yes']


X.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Female,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_Yes,InternetService_DSL,...,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,ChargeRatio,ChargeDifference,TotalServices,AvgChargePerService,Partner&Dependents
0,0,1,29.85,29.85,True,True,False,False,False,True,...,True,False,False,True,False,1.0,0.0,2,14.925,False
1,0,34,56.95,1889.5,False,False,False,True,False,True,...,False,False,False,False,True,33.178227,1832.55,4,472.375,False
2,0,2,53.85,108.15,False,False,False,True,False,True,...,True,False,False,False,True,2.008357,54.3,4,27.0375,False
3,0,45,42.3,1840.75,False,False,False,False,False,True,...,False,True,False,False,False,43.516548,1798.45,4,460.1875,False
4,0,2,70.7,151.65,True,False,False,True,False,False,...,True,False,False,True,False,2.144979,80.95,2,75.825,False


## Step 4 (1 point)

Split the data into train and test sets, use 20% of the data for the test set.

Use `42` as the random state.

Is the dataset balanced? Justify your answer and split your data accordingly, using the `stratify` parameter if necessary.
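
One way to approach this step is sketched below on synthetic data. The frame `X_demo`, the target `y_demo`, and the 27% positive rate are assumptions made for illustration; they are not taken from `churn.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real features and target (about 27% positives).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({'tenure': rng.integers(0, 72, size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.27).astype(int), name='target')

# Class shares: values far from 50/50 indicate an imbalanced target.
print(y_demo.value_counts(normalize=True))

# stratify=y_demo keeps the class shares nearly identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_train.mean(), y_test.mean())
```

Comparing the positive rate of `y_train` and `y_test` is a quick check that the stratified split preserved the class proportions.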

## Step 5 (1 point)

Encode the categorical variables using `OneHotEncoder`.

Remove the original categorical columns and add the encoded columns.

## Step 6 (1 point)

Prepare the target variable for the model.
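
A minimal sketch of one way to do this, using a hypothetical sample of the `Churn` column. Unlike a conditional that sends every non-`Yes` value to 0, an explicit mapping surfaces unexpected labels as missing values.

```python
import pandas as pd

# Hypothetical sample of the `Churn` column.
churn = pd.Series(['No', 'No', 'Yes', 'No', 'Yes'], name='Churn')

# Map the labels to integers; any label not in the dict becomes NaN.
y_demo = churn.map({'No': 0, 'Yes': 1})

# Fail early if an unexpected label slipped through.
if y_demo.isna().any():
    raise ValueError('unexpected label in Churn')

y_demo = y_demo.astype(int)
print(y_demo.tolist())
```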

## Step 7 (1 point)

Train a Logistic Regression model instantiated with the following baseline hyperparameters:

```python
LogisticRegression(random_state=random_state, max_iter=1000, class_weight='balanced')
```

This will be your baseline model and performance metric.
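
As a rough illustration, the baseline could be trained and scored as follows. The data comes from `make_classification` and is only a synthetic stand-in for the prepared churn features; `random_state = 42` follows the value requested in Step 4.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

random_state = 42

# Synthetic, imbalanced stand-in for the prepared churn features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=random_state
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=random_state, stratify=y_demo
)

baseline = LogisticRegression(
    random_state=random_state, max_iter=1000, class_weight='balanced'
)
baseline.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the probability of class 1.
churn_proba = baseline.predict_proba(X_test)[:, 1]
f1_baseline = f1_score(y_test, baseline.predict(X_test))
print(round(f1_baseline, 3))
```

The predicted probabilities address the stated goal of the assignment, while the F1-score computed from the class predictions is the metric compared in the later steps.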

## Step 8 (1 point)

Find the best hyperparameters for the model using GridSearchCV, using the following hyperparameters:
- `penalty`
- `C`
- `class_weight`

The documentation for the Logistic Regression model can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Use as many or as few values as you want for the number of folds and the hyperparameters.

Return the best hyperparameters and the best F1-score.
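
The search could look roughly like the sketch below, again on synthetic data. The grid values and `cv=5` are illustrative assumptions. The `liblinear` solver is chosen here because it supports both the `l1` and `l2` penalties, whereas the default `lbfgs` solver does not support `l1`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=42
)

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42, max_iter=1000),
    param_grid,
    scoring='f1',  # select the combination with the best cross-validated F1-score
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

In the assignment itself, the search would be fit on the training split only, so that the test split stays untouched for the final comparison.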

## Step 9 (1 point)

Train a new Logistic Regression model using the best hyperparameters found in the previous step, and compare the F1-score with the baseline model.
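
A possible shape for the comparison is sketched below. The `best_params` dictionary is a hypothetical placeholder for the grid search result, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical result of a grid search; substitute the real best parameters.
best_params = {'penalty': 'l1', 'C': 0.1, 'class_weight': 'balanced'}

models = {
    'baseline': LogisticRegression(
        random_state=42, max_iter=1000, class_weight='balanced'
    ),
    'tuned': LogisticRegression(
        solver='liblinear', random_state=42, max_iter=1000, **best_params
    ),
}

# Fit both models on the same training split and score them on the same test split.
f1_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    f1_scores[name] = f1_score(y_test, model.predict(X_test))

print(f1_scores)
```

Scoring both models on the same held-out split keeps the comparison fair; a tuned model is not guaranteed to beat the baseline there.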

## Step 10 (1 point)

How much did the F1-score improve when using the best hyperparameters?

Calculate it using the formula:

$$ \text{F1-score improvement (\%)} = 100 \cdot \frac{\text{F1-score best model} - \text{F1-score baseline model}}{\text{F1-score baseline model}} $$
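
The formula translates directly into a small helper. The function name `f1_improvement_pct` and the two scores below are hypothetical and serve only as a worked example.

```python
def f1_improvement_pct(f1_best, f1_baseline):
    """Relative F1-score improvement over the baseline, in percent."""
    if f1_baseline == 0:
        raise ValueError('baseline F1-score must be non-zero')
    return 100 * (f1_best - f1_baseline) / f1_baseline


# Hypothetical scores, for illustration only.
improvement = f1_improvement_pct(f1_best=0.62, f1_baseline=0.60)
print(round(improvement, 2))  # 3.33
```

With these example scores the improvement is 100 · 0.02 / 0.60 ≈ 3.33%. Note that the improvement is relative to the baseline, not an absolute difference in F1-score.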

Grading:

* No improvement: 0 points
* 0-1%: 0.25 points
* 1-2%: 0.5 points
* 2-3%: 0.75 points
* 3% or more: 1 point
