# Group Assignment

The goal of this assignment is to predict the probability of churn for telco customers. The data is available in the file `churn.csv`.

The data contains the following columns:

- `customerID`: A unique identifier for each customer.
- `gender`: The customer’s gender (Female, Male).
- `SeniorCitizen`: Whether the customer is a senior citizen or not (1, 0).
- `Partner`: Whether the customer has a partner or not (Yes, No).
- `Dependents`: Whether the customer has dependents or not (Yes, No).
- `tenure`: Number of months the customer has stayed with the company.
- `PhoneService`: Whether the customer has a phone service or not (Yes, No).
- `MultipleLines`: Whether the customer has multiple lines or not (Yes, No, No phone service).
- `InternetService`: Customer’s internet service provider (DSL, Fiber optic, No).
- `OnlineSecurity`: Whether the customer has online security or not (Yes, No, No internet service).
- `OnlineBackup`: Whether the customer has online backup or not (Yes, No, No internet service).
- `DeviceProtection`: Whether the customer has device protection or not (Yes, No, No internet service).
- `TechSupport`: Whether the customer has tech support or not (Yes, No, No internet service).
- `StreamingTV`: Whether the customer has streaming TV or not (Yes, No, No internet service).
- `StreamingMovies`: Whether the customer has streaming movies or not (Yes, No, No internet service).
- `Contract`: The contract term of the customer (Month-to-month, One year, Two year).
- `PaperlessBilling`: Whether the customer has paperless billing or not (Yes, No).
- `PaymentMethod`: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
- `MonthlyCharges`: The amount charged to the customer monthly.
- `TotalCharges`: The total amount charged to the customer.
- `Churn`: Whether the customer churned or not (Yes or No).

## Instructions

* The target variable is `Churn`.
* You should use Logistic Regression to make the predictions.
* Follow the steps below to prepare the data and build the model.


## Step 1 (1 point)

Load the data in the file `churn.csv` and explore it.

What are you going to do with `customerID`?

In [1]:
import pandas as pd

df_churn = pd.read_csv('churn.csv')

# Add target variable to the dataframe
df_churn['target'] = df_churn['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

# Drop customerID since it's not a feature
df_churn = df_churn.drop(columns=['customerID'])

df_churn.head()

|  | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No | 0 |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | 0 |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 1 |


## Step 2 (1 point)

Explore the dataset.

What's the deal with the `TotalCharges` column? Fix the column `TotalCharges` and convert it to a numerical data type.

What about missing values?

In [2]:
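# Entries that cannot be parsed as numbers (such as blank strings) become NaN
df_churn['TotalCharges'] = pd.to_numeric(df_churn['TotalCharges'], errors='coerce')

# Replace the resulting missing values with 0
df_churn.fillna(0, inplace=True)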

## Step 3 (1 point)

Build new features. Don't sweat it too much, just create a few new features that you think could be useful.

In [3]:
# Create new features

# 1. Ratio of total to monthly charges
df_churn['charge_ratio'] = df_churn['TotalCharges'] / df_churn['MonthlyCharges']

# 2. Difference between total and monthly charges
df_churn['charge_difference'] = df_churn['TotalCharges'] - df_churn['MonthlyCharges']

df_churn.head()

|  | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | target | charge_ratio | charge_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | ... | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 | 1.0 | 0.0 |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | ... | No | One year | No | Mailed check | 56.95 | 1889.5 | No | 0 | 33.178227 | 1832.55 |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | ... | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | 2.008357 | 54.3 |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | ... | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | 0 | 43.516548 | 1798.45 |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | ... | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 1 | 2.144979 | 80.95 |


## Step 4 (1 point)

Split the data into train and test sets, using 20% of the data for the test set.

Use `42` as the random state.

Is the dataset balanced? Justify your answer and split your data accordingly, using the `stratify` parameter if necessary.
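
One possible way to check the class balance and split accordingly is sketched below. It uses a small synthetic frame in place of `df_churn`; the name `df_demo` and the 27% churn share are made up for illustration and are not taken from `churn.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_churn: 100 rows, 27 churners (illustrative values only)
df_demo = pd.DataFrame({
    "tenure": range(100),
    "target": [1] * 27 + [0] * 73,
})

# Class shares: a large gap between the two classes indicates an imbalanced target
class_share = df_demo["target"].value_counts(normalize=True)
print(class_share)

X = df_demo.drop(columns=["target"])
y = df_demo["target"]

# stratify=y keeps the class shares close to identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a random split of an imbalanced target can leave the test set with a noticeably different churn rate than the train set, which makes the F1-score harder to interpret.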

## Step 5 (1 point)

Encode the categorical variables using `OneHotEncoder`.

Remove the original categorical columns and add the encoded columns.

## Step 6 (1 point)

Prepare the target variable for the model.
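
One way to do this is to map the `Yes`/`No` labels to `1`/`0` and keep the label out of the feature matrix. The sketch below uses a three-row stand-in frame (`df_demo`, illustrative values only); in the notebook itself, both `Churn` and the derived `target` column would need to be excluded from the features.

```python
import pandas as pd

# Three-row stand-in for the churn data (illustrative values only)
df_demo = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

# Map the labels to integers; any unexpected label would become NaN
y = df_demo["Churn"].map({"Yes": 1, "No": 0})
if y.isna().any():
    raise ValueError("Unexpected label in the Churn column")
y = y.astype(int)

# Drop the label from the features to avoid leaking it into the model
X = df_demo.drop(columns=["Churn"])
print(y.tolist())
```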

## Step 7 (1 point)

Train a Logistic Regression model instantiated with the following baseline hyperparameters:

```python
LogisticRegression(random_state=random_state, max_iter=1000, class_weight='balanced')
```

This will be your baseline model and performance metric.
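
The sketch below shows how such a baseline could be trained and scored. It runs on synthetic data from `make_classification` rather than on the encoded churn features, and it assumes `random_state = 42` as in Step 4; the resulting score says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

random_state = 42

# Synthetic, imbalanced stand-in for the encoded churn features
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=random_state
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state, stratify=y
)

baseline = LogisticRegression(
    random_state=random_state, max_iter=1000, class_weight="balanced"
)
baseline.fit(X_train, y_train)

# F1-score on the held-out test set is the baseline performance metric
f1_baseline = f1_score(y_test, baseline.predict(X_test))

# Column 1 of predict_proba holds the probability of the positive class (churn)
churn_proba = baseline.predict_proba(X_test)[:, 1]
print(f"Baseline F1: {f1_baseline:.3f}")
```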

## Step 8 (1 point)

Use `GridSearchCV` to find the best values for the following hyperparameters:
- `penalty`
- `C`
- `class_weight`

The documentation for the Logistic Regression model can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Use as many or as few values as you want for the number of folds and the hyperparameters.

Return the best hyperparameters and the best F1-score.
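
A possible setup for the search is sketched below on synthetic data. The grid values and the 5-fold setting are arbitrary choices for illustration, and the `liblinear` solver is an assumption made here because it accepts both the `l1` and `l2` penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

random_state = 42

# Synthetic, imbalanced stand-in for the encoded training data
X_train, y_train = make_classification(
    n_samples=400, n_features=8, weights=[0.73, 0.27], random_state=random_state
)

# Illustrative grid; any other set of values is equally valid
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [None, "balanced"],
}

grid = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=random_state, max_iter=1000),
    param_grid,
    scoring="f1",  # select the combination with the best cross-validated F1-score
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
```

Note that `best_score_` is the mean cross-validated F1-score on the training data, so it is not directly comparable to a score computed on the test set.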

## Step 9 (1 point)

Train a new Logistic Regression model using the best hyperparameters found in the previous step, and compare the F1-score with the baseline model.
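
The comparison could look like the sketch below. The `best_params` dictionary holds placeholder values, not results from any real search, and the helper `fit_and_score` is a hypothetical name; both models are scored on the same synthetic test set so that the two F1-scores are comparable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

random_state = 42

# Synthetic, imbalanced stand-in for the encoded churn data
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=random_state
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state, stratify=y
)

# Placeholder values; in the assignment these would come from the grid search
best_params = {"penalty": "l1", "C": 0.1, "class_weight": "balanced"}


def fit_and_score(model):
    """Fit on the train split and return the F1-score on the test split."""
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))


f1_baseline = fit_and_score(
    LogisticRegression(random_state=random_state, max_iter=1000, class_weight="balanced")
)
# liblinear is assumed here because it supports the l1 penalty
f1_best = fit_and_score(
    LogisticRegression(
        solver="liblinear", random_state=random_state, max_iter=1000, **best_params
    )
)
print(f"Baseline F1: {f1_baseline:.3f}, tuned F1: {f1_best:.3f}")
```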

## Step 10 (1 point)

How much did the F1-score improve when using the best hyperparameters?

Calculate it using the formula:

$$ \text{F1-score improvement (\%)} = 100 \cdot \frac{\text{F1-score best model} - \text{F1-score baseline model}}{\text{F1-score baseline model}} $$
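
As a worked example of the formula, the hypothetical helper below computes the relative improvement. The scores 0.60 and 0.63 are made-up inputs, not results from the churn data.

```python
def f1_improvement_pct(f1_best, f1_baseline):
    """Relative F1-score improvement over the baseline, in percent."""
    if f1_baseline == 0:
        raise ValueError("The baseline F1-score must be non-zero")
    return 100 * (f1_best - f1_baseline) / f1_baseline


# Made-up scores: going from 0.60 to 0.63 is a 5% relative improvement
print(round(f1_improvement_pct(0.63, 0.60), 2))  # 5.0
```

The improvement is relative to the baseline, so a gain of 0.03 in absolute F1-score counts as 5% here, not 3%.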

Grading:

* No improvement: 0 points
* 0-1%: 0.25 points
* 1-2%: 0.5 points
* 2-3%: 0.75 points
* 3% or more: 1 point
