# Logistic Regression

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd
import time

First, we load the training data and the held-out validation set, which serves as the test set below.

In [2]:
train = pd.read_parquet('../../../data/model_input/train_sets/breast_cancer.parquet')
test = pd.read_parquet('../../../data/model_input/validation_sets/breast_cancer.parquet')

In [3]:
y_train = train.diagnosis
X_train = train.drop(columns=['diagnosis'])

In [4]:
y_test = test.diagnosis
X_test = test.drop(columns=['diagnosis'])

We fit several models that differ only in the penalty term. Where the solver changes from the default, it is because not every solver supports every penalty.
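
The solver/penalty pairing can be summarised as a small lookup table. The sketch below encodes the combinations listed in the scikit-learn documentation for `LogisticRegression` (assuming version 1.2 or later, where `penalty=None` is accepted). The dictionary `SUPPORTED_PENALTIES` and the helper `solvers_for` are illustrative names, not part of scikit-learn, and the table should be checked against the installed version.

```python
# Penalties supported by each LogisticRegression solver, as listed in the
# scikit-learn documentation (assumed version >= 1.2; verify for your install).
SUPPORTED_PENALTIES = {
    'lbfgs': {'l2', None},
    'liblinear': {'l1', 'l2'},
    'newton-cg': {'l2', None},
    'newton-cholesky': {'l2', None},
    'sag': {'l2', None},
    'saga': {'elasticnet', 'l1', 'l2', None},
}


def solvers_for(penalty):
    """Return the solvers (sorted by name) that accept the given penalty."""
    return sorted(s for s, p in SUPPORTED_PENALTIES.items() if penalty in p)


print(solvers_for('l1'))          # ['liblinear', 'saga']
print(solvers_for('elasticnet'))  # ['saga']
```

This is consistent with the cells that follow: the $l_1$ model uses `liblinear`, the ElasticNet models use `saga`, and the other two keep the default solver.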

In [5]:
metrics = {}

Without penalty term:

In [6]:
start_time = time.time()

lr = LogisticRegression(penalty=None, max_iter=10000)
lr.fit(X_train, y_train)

train_pred = lr.predict_proba(X_train)[:, 1]
test_pred = lr.predict_proba(X_test)[:, 1]

metrics['LR'] = {
    'Train_AUC': roc_auc_score(y_train, train_pred),
    'Test_AUC': roc_auc_score(y_test, test_pred),
    'Run_Time': time.time() - start_time
}

Adding the $l_2$ term to the cost function (this is the default penalty of `LogisticRegression`, so no `penalty` argument is passed): $$ \frac{1}{2}||\beta||_2^2 = \frac{1}{2}\beta^T\beta \ , $$ where $\beta$ is the vector of coefficients.
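
As a quick numerical check of the formula, the sketch below evaluates the $l_2$ term for a made-up coefficient vector (`beta` is illustrative, not taken from a fitted model) and confirms that the norm form and the inner-product form agree.

```python
import numpy as np

# Hypothetical coefficient vector, chosen so the arithmetic is easy to follow.
beta = np.array([3.0, -4.0])

l2_from_norm = 0.5 * np.linalg.norm(beta, ord=2) ** 2  # (1/2) * ||beta||_2^2
l2_from_dot = 0.5 * (beta @ beta)                      # (1/2) * beta^T beta

print(l2_from_norm, l2_from_dot)
```

Both expressions evaluate to $\frac{1}{2}(3^2 + 4^2) = 12.5$.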

In [7]:
start_time = time.time()

lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)

train_pred = lr.predict_proba(X_train)[:, 1]
test_pred = lr.predict_proba(X_test)[:, 1]

metrics['LR_l2'] = {
    'Train_AUC': roc_auc_score(y_train, train_pred),
    'Test_AUC': roc_auc_score(y_test, test_pred),
    'Run_Time': time.time() - start_time
} 

Replacing the $l_2$ penalty with the $l_1$ penalty: $$||\beta||_1=\sum_{i=1}^n|\beta_i| \ ,$$ where $n$ is the length of $\beta$, that is, the number of features.
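
For comparison, a minimal sketch of the $l_1$ term on a made-up coefficient vector (`beta` is illustrative only). The sum of absolute values is checked against NumPy's built-in 1-norm.

```python
import numpy as np

# Hypothetical coefficients; the zero entry contributes nothing to the penalty.
beta = np.array([3.0, -4.0, 0.0])

l1_from_sum = np.sum(np.abs(beta))           # sum_i |beta_i|
l1_from_norm = np.linalg.norm(beta, ord=1)   # same quantity via the 1-norm

print(l1_from_sum, l1_from_norm)
```

Both expressions evaluate to $|3| + |-4| + |0| = 7$.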

In [8]:
start_time = time.time()

lr = LogisticRegression(penalty='l1', solver='liblinear', max_iter=10000)
lr.fit(X_train, y_train)

train_pred = lr.predict_proba(X_train)[:, 1]
test_pred = lr.predict_proba(X_test)[:, 1]

metrics['LR_l1'] = {
    'Train_AUC': roc_auc_score(y_train, train_pred),
    'Test_AUC': roc_auc_score(y_test, test_pred),
    'Run_Time': time.time() - start_time
} 

Lastly, the ElasticNet penalty: $$\frac{1-\rho}{2}||\beta||_2^2 + \rho||\beta||_1 \ ,$$ where $\rho$ is a parameter between 0 and 1 that controls the balance between the $l_1$ and $l_2$ regularizations (the `l1_ratio` argument in the code). Note that if $\rho=1$, ElasticNet is equivalent to $l_1$, and if $\rho=0$, it is equivalent to $l_2$.
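
The two limiting cases can be verified numerically. The sketch below implements the penalty exactly as written in the formula; `elastic_net_penalty` and the vector `beta` are illustrative and are not part of scikit-learn or of the fitted models.

```python
import numpy as np


def elastic_net_penalty(beta, rho):
    """(1 - rho)/2 * ||beta||_2^2 + rho * ||beta||_1, for rho in [0, 1]."""
    beta = np.asarray(beta, dtype=float)
    l2_term = 0.5 * (beta @ beta)      # (1/2) * ||beta||_2^2
    l1_term = np.sum(np.abs(beta))     # ||beta||_1
    return (1.0 - rho) * l2_term + rho * l1_term


beta = [3.0, -4.0]  # hypothetical coefficients
for rho in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(rho, elastic_net_penalty(beta, rho))
```

For this vector the $l_2$ term is 12.5 and the $l_1$ term is 7, so the penalty moves linearly from 12.5 at $\rho=0$ to 7 at $\rho=1$.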

In [9]:
for rho in [0.25, 0.5, 0.75]:
    start_time = time.time()

    lr = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=rho, max_iter=10000)
    lr.fit(X_train, y_train)

    train_pred = lr.predict_proba(X_train)[:, 1]
    test_pred = lr.predict_proba(X_test)[:, 1]

    metrics['LR_en_'+str(rho)] = {
        'Train_AUC': roc_auc_score(y_train, train_pred),
        'Test_AUC': roc_auc_score(y_test, test_pred),
        'Run_Time': time.time() - start_time
    } 

In [10]:
metrics_lr = pd.DataFrame.from_dict(metrics, orient='index', columns=['Run_Time', 'Train_AUC', 'Test_AUC'])
metrics_lr['delta%'] = 100*(metrics_lr.Test_AUC - metrics_lr.Train_AUC) / metrics_lr.Train_AUC
metrics_lr

| | Run_Time | Train_AUC | Test_AUC | delta% |
|---|---|---|---|---|
| LR | 1.538563 | 1.0 | 0.946742 | -5.325815 |
| LR_l2 | 0.677528 | 0.996764 | 0.976817 | -2.001173 |
| LR_l1 | 0.257306 | 0.997138 | 0.966165 | -3.10619 |
| LR_en_0.25 | 0.572656 | 0.968951 | 0.964912 | -0.41677 |
| LR_en_0.5 | 0.559031 | 0.968897 | 0.964912 | -0.411273 |
| LR_en_0.75 | 0.547248 | 0.968897 | 0.964912 | -0.411273 |
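
The `delta%` column is the relative change from train AUC to test AUC, in percent. As a worked example, the sketch below recomputes it for `LR_l2` from the rounded values shown in the output; `delta_pct` is an illustrative helper name, and small differences in the last digits are expected because the inputs are rounded.

```python
def delta_pct(train_auc, test_auc):
    """Relative change from train AUC to test AUC, in percent."""
    return 100 * (test_auc - train_auc) / train_auc


# AUC values for LR_l2, as reported (rounded) in the output above.
print(round(delta_pct(0.996764, 0.976817), 3))  # -2.001
```

A negative value means the model scores lower on the test set than on the training set, so values closer to zero indicate a smaller train/test gap.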


In [11]:
metrics_lr.to_parquet('../../../data/metrics/breast_cancer/logistic_regression.parquet')

As expected, the model without a penalty overfits the most: it reaches a train AUC of 1.0 but the lowest test AUC, which gives the largest gap (a delta of about -5.3%). The ElasticNet models show the smallest gap between train and test (a delta of about -0.4%), in exchange for a lower test AUC than the $l_2$ and $l_1$ models. We choose **LR_l2** as the best model because it has the highest test AUC and an acceptable delta.
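
The selection by test AUC can be reproduced directly from the reported numbers. The sketch below copies the `Test_AUC` column into a plain dictionary (the name `test_auc` is illustrative) and picks the model with the highest value.

```python
# Test AUC per model, copied from the results reported above.
test_auc = {
    'LR': 0.946742,
    'LR_l2': 0.976817,
    'LR_l1': 0.966165,
    'LR_en_0.25': 0.964912,
    'LR_en_0.5': 0.964912,
    'LR_en_0.75': 0.964912,
}

best_model = max(test_auc, key=test_auc.get)
print(best_model)  # LR_l2
```

With the DataFrame at hand, `metrics_lr['Test_AUC'].idxmax()` would return the same label.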