# Group Assignment

The goal of this assignment is predict probabilities of churn for telco customers. The data is available in the file `churn.csv`. 

The data contains the following columns:

- `customerID`: A unique identifier for each customer.
- `gender`:
- `SeniorCitizen`: Whether the customer is a senior citizen or not (1, 0).
- `Partner`: Whether the customer has a partner or not (Yes, No).
- `Dependents`: Whether the customer has dependents or not (Yes, No).
- `tenure`: Number of months the customer has stayed with the company.
- `PhoneService`: Whether the customer has a phone service or not (Yes, No).
- `MultipleLines`: Whether the customer has multiple lines or not (Yes, No, No phone service).
- `InternetService`: Customer’s internet service provider (DSL, Fiber optic, No).
- `OnlineSecurity`: Whether the customer has online security or not (Yes, No, No internet service).
- `OnlineBackup`: Whether the customer has online backup or not (Yes, No, No internet service).
- `DeviceProtection`: Whether the customer has device protection or not (Yes, No, No internet service).
- `TechSupport`: Whether the customer has tech support or not (Yes, No, No internet service).
- `StreamingTV`: Whether the customer has streaming TV or not (Yes, No, No internet service).
- `StreamingMovies`: Whether the customer has streaming movies or not (Yes, No, No internet service).
- `Contract`: The contract term of the customer (Month-to-month, One year, Two year).
- `PaperlessBilling`: Whether the customer has paperless billing or not (Yes, No).
- `PaymentMethod`: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
- `MonthlyCharges`: The amount charged to the customer monthly.
- `TotalCharges`: The total amount charged to the customer.
- `Churn`: Whether the customer churned or not (Yes or No).

## Instructions

* The target variable is `Churn`.
* You should use Logistic Regression to make the predictions.
* Follow the steps below to prepare the data and build the model.


## Step 1 (1 point)

Load the data in the file `churn.csv` and explore it.

What are you going to do with `customerID`?

In [197]:
import pandas as pd

df_churn = pd.read_csv('churn.csv')

# Add target variable to the dataframe
df_churn['target'] = df_churn['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

# Drop customerID and Churn since it's not a feature
df_churn = df_churn.drop(columns=['customerID', 'Churn'])

df_churn.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,target
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,0
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,0
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,1
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,0
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,1


## Step 2  (1 point)

Explore the dataset.

What's the deal with the `TotalCharges` column? Fix the column `TotalCharges` and convert it to a numerical data type.

What about missing values?

In [198]:
df_churn['TotalCharges'] = pd.to_numeric(df_churn['TotalCharges'], errors='coerce')

df_churn.fillna(0, inplace=True)

## Step 3 (1 point)

Build new features. Don't sweat it too much, just create a few new features that you think could be useful.

In [199]:
# Create new features
import numpy as np

# 1. Ratio of monthly and total charges (avoid division by zero)
df_churn['charge_ratio'] = np.where(
    df_churn['MonthlyCharges'] == 0, 
    0,  # Set to 0 when MonthlyCharges is 0
    df_churn['TotalCharges'] / df_churn['MonthlyCharges']
)

# 2. Log of monthly charges (avoid log of zero)
df_churn['log_monthly_charges'] = np.where(
    df_churn['MonthlyCharges'] == 0,
    -1,  # Set to a small negative value when MonthlyCharges is 0
    np.log(df_churn['MonthlyCharges'])
)

# 3. Log of total charges (avoid log of zero)
df_churn['log_total_charges'] = np.where(
    df_churn['TotalCharges'] == 0,
    -1,  # Set to a small negative value when TotalCharges is 0
    np.log(df_churn['TotalCharges'])
)


df_churn.head()

  result = getattr(ufunc, method)(*inputs, **kwargs)


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,target,charge_ratio,log_monthly_charges,log_total_charges
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,...,No,Month-to-month,Yes,Electronic check,29.85,29.85,0,1.0,3.396185,3.396185
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,...,No,One year,No,Mailed check,56.95,1889.5,0,33.178227,4.042174,7.544068
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,...,No,Month-to-month,Yes,Mailed check,53.85,108.15,1,2.008357,3.986202,4.683519
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,...,No,One year,No,Bank transfer (automatic),42.3,1840.75,0,43.516548,3.744787,7.517928
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,...,No,Month-to-month,Yes,Electronic check,70.7,151.65,1,2.144979,4.258446,5.021575


## Step 4 (1 point)

Split the data into train and test sets, use 20% of the data for the test set.

Use `42` as the random state.

Is the dataset balanced? Justify your question and split your data accordingly, using the `stratify` parameter if necessary.

In [200]:
from sklearn.model_selection import train_test_split

# Split the data into features and target

y = df_churn.pop('target') # Pop the target variable

x = df_churn # Keep the features

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

## Step 5 (1 point)

Encode the categorical variables using `OneHotEncoder`.

Remove the original categorical columns and add the encoded columns.

In [201]:
# Find the categorical columns

def is_categorical(series: pd.Series) -> bool:
    return series.dtype == 'object'

categorical_columns = [col for col in x.columns if is_categorical(x[col])]
numerical_columns = [col for col in x.columns if not is_categorical(x[col])]

print(f"Categorical columns: {categorical_columns}")
print(f"Numerical columns: {numerical_columns}")

Categorical columns: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
Numerical columns: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'charge_ratio', 'log_monthly_charges', 'log_total_charges']


In [202]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Create pipeline for categorical variables 
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

# Create pipeline for numerical variables
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Combine both pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    [
        ('categorical', categorical_pipeline, categorical_columns),
        ('numerical', numerical_pipeline, numerical_columns)
    ], 
    remainder="drop"
)

In [203]:
# Fit the preprocessor to the training data and transform both training and test data
x_train_preprocessed = preprocessor.fit_transform(x_train)
x_test_preprocessed = preprocessor.fit_transform(x_test)

## Step 6 (1 point)

Prepare the target variable for the model.

In [204]:
y.head()

0    0
1    0
2    1
3    0
4    1
Name: target, dtype: int64

## Step 7 (1 point)

Train a Logistic Regression model instantiated with the following baseline hyperparameters:

```python
LogisticRegression(random_state=random_state, max_iter=1000, class_weight='balanced')
```

This will be your baseline model and performance metric.

In [205]:
# 1. Baseline model - Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Train baseline model
baseline_model = LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced')
baseline_model.fit(x_train_preprocessed, y_train)

# Make predictions with baseline model
y_pred_baseline = baseline_model.predict(x_test_preprocessed)

# Calculate F1-score
f1_score_baseline = f1_score(y_test, y_pred_baseline)

# Print F1-score
print(f"F1-score of baseline model: {f1_score_baseline:.4f}")


F1-score of baseline model: 0.6199


## Step 8 (1 point)

Find the best hyperparameters for the model using GridSearchCV, using the following hyperparameters:
- `penalty`
- `C`
- `class_weight`

The documentation for the Logistic Regression model can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Use as many or as few values you want for the number of folds and hyperparameters.

Return the best hyperparameters and the best F1-score.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# TODO: @luisguareschi: Increase the model f1-score

# Define the parameter grid
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10, 100],
    'class_weight': ['balanced', None, {0: 0.4, 1: 0.6}],
    'solver': ['liblinear']
}

# Initialize the Logistic Regression model

grid_search_model = GridSearchCV(
    estimator=LogisticRegression(random_state=42),
    param_grid=param_grid,
    scoring='f1',
    cv=3,
    verbose=False,
)

# Fit the GridSearchCV model
grid_search_model.fit(x_train_preprocessed, y_train)

# Print the best hyperparameters
print(f"Best hyperparameters: {grid_search_model.best_params_}")

# Print the best F1-score
print(f"Best F1-score: {grid_search_model.best_score_:.4f}")

Best hyperparameters: {'C': 1, 'class_weight': 'balanced', 'penalty': 'l2', 'solver': 'liblinear'}
Best F1-score: 0.6344


## Step 9 (1 point)

Train a new Logistic Regression model using the best hyperparameters found in the previous step, and compare the F1-score with the baseline model.

In [207]:
# Make predictions with fine-tuned Random Forest
y_pred_rf_tuned = grid_search_model.predict(x_test_preprocessed)

# Calculate F1-score
f1_score_rf_tuned = f1_score(y_test, y_pred_rf_tuned)

# Print F1-score
print(f"F1-score of fine-tuned Random Forest: {f1_score_rf_tuned:.4f}")

F1-score of fine-tuned Random Forest: 0.6199


## Step 10 (1 point)

How much did the F1-score improve when using the best hyperparameters?

Calculate it using the formula:

$$ \text{F1-score improvement (\%)} = 100 \cdot \frac{\text{F1-score best model} - \text{F1-score baseline model}}{\text{F1-score baseline model}} $$

Grading:

* No improvement: 0 points
* 0-1%: 0.25 point
* 1-2%: 0.5 points
* 2-3%: 0.75 points
* 3% or more: 1 point


In [208]:
f1_improvement = 100 * (f1_score_rf_tuned - f1_score_baseline) / f1_score_baseline

print(f"F1-score improvement: {f1_improvement:.2f}%")

F1-score improvement: 0.00%
