# Individual Assignment

The goal of this assignment is to predict the probability of churn for telco customers. The data is available in the file `churn.csv`.

The data contains the following columns:

- `customerID`: A unique identifier for each customer.
- `gender`: The customer's gender (Female, Male).
- `SeniorCitizen`: Whether the customer is a senior citizen or not (1, 0).
- `Partner`: Whether the customer has a partner or not (Yes, No).
- `Dependents`: Whether the customer has dependents or not (Yes, No).
- `tenure`: Number of months the customer has stayed with the company.
- `PhoneService`: Whether the customer has a phone service or not (Yes, No).
- `MultipleLines`: Whether the customer has multiple lines or not (Yes, No, No phone service).
- `InternetService`: Customer’s internet service provider (DSL, Fiber optic, No).
- `OnlineSecurity`: Whether the customer has online security or not (Yes, No, No internet service).
- `OnlineBackup`: Whether the customer has online backup or not (Yes, No, No internet service).
- `DeviceProtection`: Whether the customer has device protection or not (Yes, No, No internet service).
- `TechSupport`: Whether the customer has tech support or not (Yes, No, No internet service).
- `StreamingTV`: Whether the customer has streaming TV or not (Yes, No, No internet service).
- `StreamingMovies`: Whether the customer has streaming movies or not (Yes, No, No internet service).
- `Contract`: The contract term of the customer (Month-to-month, One year, Two year).
- `PaperlessBilling`: Whether the customer has paperless billing or not (Yes, No).
- `PaymentMethod`: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
- `MonthlyCharges`: The amount charged to the customer monthly.
- `TotalCharges`: The total amount charged to the customer.
- `Churn`: Whether the customer churned or not (Yes or No).

## Instructions

* The target variable is `Churn`.
* You should use Logistic Regression to make the predictions.
* Follow the steps below to prepare the data and build the model.


## Step 1 (1 point)

Load the data in the file `churn.csv` and explore it.

What are you going to do with `customerID`?

In [31]:
import pandas as pd
df = pd.read_csv('churn.csv')
df
# customerID is a unique identifier and carries no predictive information, so we drop it
df = df.drop(columns = 'customerID')
df

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


## Step 2 (1 point)

Explore the dataset.

What's the deal with the `TotalCharges` column? Fix the column `TotalCharges` and convert it to a numerical data type.

What about missing values?

In [32]:
print('Original Datatype: ',df['TotalCharges'].dtype)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'],errors='coerce')
print('New Datatype:',df['TotalCharges'].dtype)

Original Datatype:  object
New Datatype: float64
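
The conversion above relies on `errors='coerce'`. As a minimal sketch on a hypothetical toy Series (not the assignment data, and assuming the malformed entries are blank strings), the following shows how unparseable values become `NaN`, which `isna()` can then count:

```python
import pandas as pd

# Hypothetical toy values; the blank string stands in for a malformed entry
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 1
```

Without `errors="coerce"`, the same call would raise on the blank entry, which is why the column was read as `object` in the first place.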


In [33]:
print(df['TotalCharges'].isna().sum())
# Assign the result back: fillna(inplace=True) on a column selection triggers a FutureWarning and will not update df in pandas 3.0
df['TotalCharges'] = df['TotalCharges'].fillna(0)

11


## Step 3 (1 point)

Build new features. Don't sweat it too much, just create a few new features that you think could be useful.

In [34]:
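# Average spend per month of tenure; tenure 0 is replaced by 1 to avoid division by zero
df['AvgMonthlyTotalSpend'] = df['TotalCharges'] / df['tenure'].replace(0, 1)
# Tenure buckets in months: (-1, 12], (12, 24], (24, 48], (48, 60], (60, max]
df['TenureGroups'] = pd.cut(df['tenure'], bins=[-1, 12, 24, 48, 60, df['tenure'].max()], labels=['New', 'Normal', 'Long', 'Loyal', 'Lifetime'])
# 1 if the customer pays by electronic check and uses paperless billing
df['ElectronicExpert'] = ((df['PaymentMethod'] == 'Electronic check') & (df['PaperlessBilling'] == 'Yes')).astype(int)
# 1 if the customer has both streaming TV and streaming movies
df['StreamingCust'] = ((df['StreamingTV'] == 'Yes') & (df['StreamingMovies'] == 'Yes')).astype(int)
# Monthly charge buckets: (0, 30], (30, 60], (60, 100], (100, max]
df['MonthlyChargeGroups'] = pd.cut(df['MonthlyCharges'], bins=[0, 30, 60, 100, df['MonthlyCharges'].max()], labels=['Low', 'Medium', 'High', 'Very High'])
df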

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,AvgMonthlyTotalSpend,TenureGroups,ElectronicExpert,StreamingCust,MonthlyChargeGroups
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,...,Yes,Electronic check,29.85,29.85,No,29.850000,New,1,0,Low
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,...,No,Mailed check,56.95,1889.50,No,55.573529,Long,0,0,Medium
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,...,Yes,Mailed check,53.85,108.15,Yes,54.075000,New,0,0,Medium
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,...,No,Bank transfer (automatic),42.30,1840.75,No,40.905556,Long,0,0,Medium
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,...,Yes,Electronic check,70.70,151.65,Yes,75.825000,New,1,0,High
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,...,Yes,Mailed check,84.80,1990.50,No,82.937500,Normal,0,1,High
7039,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,...,Yes,Credit card (automatic),103.20,7362.90,No,102.262500,Lifetime,0,1,Very High
7040,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,...,Yes,Electronic check,29.60,346.45,No,31.495455,New,1,0,Low
7041,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,...,Yes,Mailed check,74.40,306.60,Yes,76.650000,New,0,0,High
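
The tenure buckets depend on how `pd.cut` treats bin edges. A small sketch with hypothetical tenure values (the upper edge of 72 is an assumption standing in for `df['tenure'].max()`) illustrates that intervals are right-inclusive by default, so a tenure of 12 falls into the first bucket and 13 into the second:

```python
import pandas as pd

# Hypothetical tenures in months, chosen to sit on or next to bin edges
tenure = pd.Series([0, 12, 13, 60, 72])

groups = pd.cut(
    tenure,
    bins=[-1, 12, 24, 48, 60, 72],  # intervals are (left, right] by default
    labels=["New", "Normal", "Long", "Loyal", "Lifetime"],
)

print(groups.tolist())  # ['New', 'New', 'Normal', 'Loyal', 'Lifetime']
```

Starting the first bin at -1 is what keeps a tenure of 0 inside a bucket instead of mapping it to `NaN`.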


## Step 4 (1 point)

Split the data into train and test sets, use 20% of the data for the test set.

Use `42` as the random state.

Is the dataset balanced? Justify your answer and split your data accordingly, using the `stratify` parameter if necessary.

In [35]:
from sklearn.model_selection import train_test_split
print(df['Churn'].value_counts(normalize=True))
# The classes are imbalanced (about 73% No vs. 27% Yes), so we stratify the split on the target
x = df.drop(columns='Churn')
y = df['Churn'].map({'No': 0, 'Yes': 1})
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)


Churn
No     0.73463
Yes    0.26537
Name: proportion, dtype: float64
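
To illustrate what `stratify` does, here is a minimal sketch on a hypothetical toy target (70 negatives, 30 positives; not the churn data). With stratification, the positive rate in the train and test sets matches the rate in the full data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 negatives and 30 positives
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the 30% positive rate of the full toy target
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate of each split would vary with the random seed, which matters more the smaller the minority class is.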


## Step 5 (1 point)

Encode the categorical variables using `OneHotEncoder`.

Remove the original categorical columns and add the encoded columns.

In [41]:
from sklearn.preprocessing import OneHotEncoder
# Categorical columns, including the binned features created in Step 3
categorical_features = list(x.select_dtypes(include=["object","category"]).columns)
ohe = OneHotEncoder(drop="first",handle_unknown="ignore",sparse_output=False)
# Fit the encoder on the training set only, then apply it to the test set
x_train_encoded = ohe.fit_transform(x_train[categorical_features])
x_test_encoded = ohe.transform(x_test[categorical_features])
encoded_cols = ohe.get_feature_names_out(categorical_features)
x_train_encoded = pd.DataFrame(x_train_encoded, columns=encoded_cols, index=x_train.index)
x_test_encoded  = pd.DataFrame(x_test_encoded,  columns=encoded_cols, index=x_test.index)
# Remove the original categorical columns and append the encoded ones
x_train_num = x_train.drop(columns=categorical_features)
x_test_num  = x_test.drop(columns=categorical_features)

x_train_encoded = pd.concat([x_train_num.reset_index(drop=True),
                             x_train_encoded.reset_index(drop=True)], axis=1)

x_test_encoded = pd.concat([x_test_num.reset_index(drop=True),
                            x_test_encoded.reset_index(drop=True)], axis=1)

## Step 6 (1 point)

Prepare the target variable for the model.

In [47]:
print(y_train.value_counts())
print(y_test.value_counts())
print(y_train.mean())
print(y_test.mean())
# The target was already mapped to 0/1 in Step 4, so no further preparation is needed

Churn
0    4139
1    1495
Name: count, dtype: int64
Churn
0    1035
1     374
Name: count, dtype: int64
0.2653532126375577
0.2654364797728886


## Step 7 (1 point)

Train a Logistic Regression model instantiated with the following baseline hyperparameters:

```python
LogisticRegression(random_state=random_state, max_iter=1000, class_weight='balanced')
```

This will be your baseline model and performance metric.

In [48]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
logistic_reg = LogisticRegression(random_state = 42,max_iter=1000,class_weight='balanced')
logistic_reg.fit(x_train_encoded,y_train)
y_pred = logistic_reg.predict(x_test_encoded)
print('F1 Score: ', f1_score(y_test,y_pred))


F1 Score:  0.6178010471204188


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
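
The warning above suggests scaling the data. As a hedged sketch of that suggestion (run on synthetic data from `make_classification`, not on the churn features, and not part of the required baseline), a `StandardScaler` can be chained in front of the model with a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features; class weights mimic the churn imbalance
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# The scaler is fit on the training data only, inside the pipeline
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=1000, class_weight="balanced"),
)
scaled_model.fit(X_tr, y_tr)

score = f1_score(y_te, scaled_model.predict(X_te))
print("F1 Score:", score)
```

Whether scaling changes the F1-score on the actual churn data is not shown here; the sketch only demonstrates the mechanics the warning points to.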


## Step 8 (1 point)

Find the best hyperparameters for the model using GridSearchCV, using the following hyperparameters:
- `penalty`
- `C`
- `class_weight`

The documentation for the Logistic Regression model can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Use as many or as few values as you want for the number of folds and hyperparameters.

Return the best hyperparameters and the best F1-score.

In [59]:
from sklearn.model_selection import GridSearchCV
logistic_reg = LogisticRegression(max_iter = 5000,random_state = 42)
param_grid = {
    "penalty": ["l1","l2","elasticnet"],
    "C": [0.01,0.1,1,10],
    "class_weight": [None,"balanced"],
    "solver":["liblinear","saga"],
    "l1_ratio":[0.3,0.5,0.7]
}
grid_search = GridSearchCV(
    estimator=logistic_reg,
    param_grid = param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1
)
grid_search.fit(x_train_encoded,y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)


120 fits failed out of a total of 720.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
120 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/linear_model/_logistic.py", line 1193, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/anaconda3/lib/python3.13/site-packag

{'C': 1, 'class_weight': 'balanced', 'l1_ratio': 0.3, 'penalty': 'l1', 'solver': 'liblinear'}
0.6298876488107334
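
The failed fits come from grid combinations that the solver rejects, as the traceback through `_check_solver` indicates (for example, `liblinear` does not support the `elasticnet` penalty). One possible way to avoid them, sketched here on synthetic data with a deliberately small hypothetical grid, is to pass `param_grid` as a list of dictionaries so that only compatible solver and penalty pairs are tried:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded training data
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

# Each dictionary is expanded separately, so l1_ratio is only paired with elasticnet
param_grid_toy = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.3, 0.7], "C": [0.1, 1]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=5000, random_state=42),
    param_grid=param_grid_toy,
    scoring="f1",
    cv=3,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
# No candidate should have a NaN score, since every combination is valid
print(np.isnan(search.cv_results_["mean_test_score"]).any())
```

This also keeps `l1_ratio` out of the reported best parameters whenever the winning penalty is not `elasticnet`.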


## Step 9 (1 point)

Train a new Logistic Regression model using the best hyperparameters found in the previous step, and compare the F1-score with the baseline model.

In [77]:
# Best hyperparameters from Step 8; l1_ratio is omitted because scikit-learn only uses it with penalty="elasticnet"
best_logistic_reg = LogisticRegression(random_state=42, max_iter=5000, class_weight='balanced', C=1, penalty="l1", solver="liblinear")
best_logistic_reg.fit(x_train_encoded,y_train)
best_y_pred = best_logistic_reg.predict(x_test_encoded)
print('F1 Score: ', f1_score(y_test,best_y_pred))

F1 Score:  0.6190975865687304


## Step 10 (1 point)

How much did the F1-score improve when using the best hyperparameters?

Calculate it using the formula:

$$ \text{F1-score improvement (\%)} = 100 \cdot \frac{\text{F1-score best model} - \text{F1-score baseline model}}{\text{F1-score baseline model}} $$

Grading:

* No improvement: 0 points
* 0-1%: 0.25 point
* 1-2%: 0.5 points
* 2-3%: 0.75 points
* 3% or more: 1 point
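
As a worked instance of the formula above, the following sketch plugs in the two test-set F1-scores printed in Steps 7 and 9 (the variable names are illustrative):

```python
# F1-scores copied from the outputs of Step 7 (baseline) and Step 9 (tuned model)
f1_baseline = 0.6178010471204188
f1_best = 0.6190975865687304

# Relative improvement in percent, following the formula above
improvement_pct = 100 * (f1_best - f1_baseline) / f1_baseline
print(f"F1-score improvement: {improvement_pct:.2f}%")  # 0.21%
```

With these scores, the improvement is about 0.21%, which falls in the 0-1% band of the grading scale.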
