# Assignment 1 - Logistic Regression

In this assignment, we will train a logistic regression model to predict customer churn.
The first step is to load the dataset and display its first 5 rows to confirm that the table loaded correctly.

In [None]:
import pandas as pd
df = pd.read_parquet('https://huggingface.co/datasets/scikit-learn/churn-prediction/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet', engine='pyarrow')

df.head()

## Load default scripts
The cell below runs prepared code. Just hit the "play" button and move on to the next cell.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

def train_logistic_regression(X, y):
  # Split the columns into categorical (text) and numeric features
  categorical_cols = [col for col in X.columns if X[col].dtype == 'object']
  numeric_features = [col for col in X.columns if X[col].dtype in ["float64", "int64"]]

  # Create a pipeline for preprocessing categorical variables
  categorical_transformer = Pipeline(steps=[
      ('onehot', OneHotEncoder(handle_unknown='ignore'))
  ])

  numeric_transformer = Pipeline(
      steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
  )

  # Combine preprocessing steps with ColumnTransformer
  preprocessor = ColumnTransformer(
      transformers=[
          ("num", numeric_transformer, numeric_features),
          ('cat', categorical_transformer, categorical_cols)
  ])

  # Append classifier to preprocessing pipeline
  # Now we have a full prediction pipeline
  clf = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression())])

  # Fit the model
  clf.fit(X, y)

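  # Get the one-hot feature names (formatted as "<column>_<category>").
  # Skip this when no categorical columns were selected, because the
  # encoder is then never fitted.
  feature_names = []
  if categorical_cols:
    feature_names = (
        clf.named_steps['preprocessor']
        .named_transformers_['cat']
        .named_steps['onehot']
        .get_feature_names_out()
    )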

  # Get the coefficients
  coefficients = clf.named_steps['classifier'].coef_

  nested_dict = { }
  i = 0

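  # Match each coefficient to its column (numeric) or column category (categorical).
  # Coefficients follow the ColumnTransformer order: numeric features first,
  # then the one-hot columns of each categorical feature.
  for column in (numeric_features + categorical_cols):
    if column in categorical_cols:
      sub_dict = { }
      prefix = column + "_"
      for feature_name in feature_names:
        # Require the "_" separator so that a column whose name is a prefix of
        # another column name is not matched by accident
        if feature_name.startswith(prefix):
          name = feature_name[len(prefix):]
          sub_dict[name] = coefficients[:,i].item()
          i+=1
      nested_dict[column] = sub_dict
    else:
      nested_dict[column] = coefficients[:,i].item()
      i+=1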

  return clf, nested_dict
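
To make the preprocessing inside `train_logistic_regression` more concrete, here is a plain-Python sketch of what the three steps do to a single column: median imputation, standard scaling, and one-hot encoding. The helper names and the toy values are made up for illustration; the real work is done by the scikit-learn classes imported above.

```python
import statistics

def impute_and_scale(values):
    """Fill missing values (None) with the median, then standardize."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)  # population std, as StandardScaler uses
    return [(v - mean) / std for v in filled]

def one_hot(values):
    """Turn each category into its own 0/1 column (categories sorted)."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

# Hypothetical toy columns, not taken from the churn dataset
toy_charges = [20.0, None, 40.0, 60.0]
toy_gender = ["Male", "Female", "Female", "Male"]

scaled_charges = impute_and_scale(toy_charges)
categories, encoded_gender = one_hot(toy_gender)

print(scaled_charges)
print(categories, encoded_gender)
```

After scaling, a numeric column has mean 0 and standard deviation 1, which is why the numeric coefficients reported later are expressed per standard deviation rather than per original unit.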

## Execute machine learning run
The cell below will train a logistic regression model, and show the covariate weights of the prediction model.

If you want to use additional features in your model, copy and paste the **exact** column name (mind the capital letters!) from the table above (or from Excel).
You can add as many variables as you want, as long as you wrap each name in quotes and separate the names with commas.
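
As a small illustration of why the exact spelling matters, the sketch below selects columns from a toy table and reports misspelled names before pandas raises a less readable error. The `select_features` helper, the toy values, and the `tenure` column are assumptions made for this example; check the table above for the columns that are really available.

```python
import pandas as pd

# Toy stand-in for the real table; only the column names matter here
toy_df = pd.DataFrame({
    "gender": ["Female", "Male"],
    "MonthlyCharges": [29.85, 56.95],
    "tenure": [1, 34],
    "Churn": ["No", "Yes"],
})

def select_features(frame, wanted):
    """Return frame[wanted], with a readable error for unknown column names."""
    missing = [name for name in wanted if name not in frame.columns]
    if missing:
        raise KeyError(f"Columns not found (check spelling and capitals): {missing}")
    return frame[wanted]

features = select_features(toy_df, ["gender", "MonthlyCharges", "tenure"])
print(list(features.columns))

# A lowercase typo is caught, because column names are case-sensitive
try:
    select_features(toy_df, ["gender", "monthlycharges"])
except KeyError as error:
    print(error)
```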

In [None]:
inputFeatures = df[["gender", "MonthlyCharges"]]
outcome = (df["Churn"]=="Yes")

lr_model, coefficients_dict = train_logistic_regression(inputFeatures, outcome)
import json
print(json.dumps(coefficients_dict, indent=2))
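
To read the printed weights, it can help to see how logistic regression combines them into a probability. The sketch below uses made-up numbers (not results from the churn model) and assumes the numeric feature has already been standardized, as the pipeline above does. Note that the printed dictionary does not contain the intercept; scikit-learn stores it separately in the classifier's `intercept_` attribute.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration only
intercept = -1.0
coef_monthly_charges = 0.5          # per one standard deviation of MonthlyCharges
coef_gender = {"Female": 0.02, "Male": -0.02}

def churn_probability(scaled_monthly_charges, gender):
    """Intercept plus weighted features, passed through the sigmoid."""
    z = (intercept
         + coef_monthly_charges * scaled_monthly_charges
         + coef_gender[gender])
    return sigmoid(z)

# exp(coefficient) is the odds ratio for a one-unit increase in that feature
odds_ratio = math.exp(coef_monthly_charges)

print(churn_probability(0.0, "Female"))
print(churn_probability(1.0, "Female"))
print(odds_ratio)
```

A positive weight pushes the predicted probability of churn up and a negative weight pushes it down; the further a weight is from zero, the stronger the effect.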

# Model performance
**Hold off on the steps below until we address model performance!**

## Confusion matrix
Below you can generate the confusion matrix for the logistic regression model developed above.
You can change the threshold, that is, the predicted probability above which we classify a customer as a positive (churn) case. The threshold is set in the first line of the cell.

In [None]:
threshold = 0.5
from sklearn.metrics import ConfusionMatrixDisplay
# Column 1 of predict_proba holds the predicted probability of churn (the True class)
probabilities = lr_model.predict_proba(inputFeatures)
predicted_churn = probabilities[:, 1] > threshold
ConfusionMatrixDisplay.from_predictions(outcome, predicted_churn)
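
The effect of the threshold can also be seen by counting the four cells of the confusion matrix by hand. The sketch below does this in plain Python for a few thresholds; the `confusion_counts` helper and the toy outcomes and probabilities are invented for illustration and are not output of the churn model.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, FN and TN when predicting positive if prob > threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, prob in zip(y_true, y_prob):
        predicted = prob > threshold
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif not predicted and actual:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical toy data: True means the customer churned
toy_outcome = [True, True, True, False, False, False, False, False]
toy_prob = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

for toy_threshold in (0.25, 0.5, 0.75):
    print(toy_threshold, confusion_counts(toy_outcome, toy_prob, toy_threshold))
```

Lowering the threshold finds more of the true churners but also raises more false alarms; raising it does the opposite.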

## ROC curve (discriminative ability)
Run the code below to generate the ROC curve and determine the AUC value (see legend). How good/bad is the model?

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay
roc = RocCurveDisplay.from_estimator(lr_model, inputFeatures, outcome)
plt.show()
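
One way to interpret the AUC: it is the probability that a randomly chosen churner receives a higher predicted probability than a randomly chosen non-churner. The sketch below computes this directly by comparing every positive/negative pair; the `pairwise_auc` helper and the toy data are assumptions for illustration, and for real work the scikit-learn functions above are the better choice.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y]
    negatives = [s for y, s in zip(y_true, y_score) if not y]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: True means the customer churned
toy_outcome = [True, True, True, False, False, False, False, False]
toy_prob = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

toy_auc = pairwise_auc(toy_outcome, toy_prob)
print(toy_auc)
```

An AUC of 0.5 means the model ranks churners no better than chance, while 1.0 means every churner is ranked above every non-churner.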