# Logistic Regression

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd
import time

First, we load the data. Note that the `test` frame is read from the validation set.

In [2]:
train = pd.read_parquet('../../data/model_input/train_sets/breast_cancer.parquet')
test = pd.read_parquet('../../data/model_input/validation_sets/breast_cancer.parquet')

In [3]:
y_train = train.diagnosis
X_train = train.drop(columns=['diagnosis'])

In [4]:
y_test = test.diagnosis
X_test = test.drop(columns=['diagnosis'])

We fit several models that differ only in the penalty term. Where the solver changes, it is because the default solver does not support that penalty.
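As a quick reference for the solver choices, the sketch below lists which penalties each `LogisticRegression` solver accepts. The mapping is transcribed from the scikit-learn documentation for recent releases, so treat it as an assumption and check it against your installed version; the `SOLVER_PENALTIES` dictionary and the `solvers_for` helper are hypothetical names used only for illustration.

```python
# Penalties accepted by each LogisticRegression solver, as listed in the
# scikit-learn documentation (assumption: a recent release; verify locally).
SOLVER_PENALTIES = {
    "lbfgs": {"l2", None},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", None},
    "newton-cholesky": {"l2", None},
    "sag": {"l2", None},
    "saga": {"elasticnet", "l1", "l2", None},
}


def solvers_for(penalty):
    """Return the solvers that accept the given penalty, sorted by name."""
    return sorted(s for s, allowed in SOLVER_PENALTIES.items() if penalty in allowed)


print(solvers_for("l1"))          # ['liblinear', 'saga']
print(solvers_for("elasticnet"))  # ['saga']
```

This is why the $l_1$ model uses `liblinear` and the ElasticNet models use `saga`, while the other two models keep the default solver.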

In [5]:
metrics = {}

Without penalty term:

In [7]:
start_time = time.time()

lr = LogisticRegression(penalty=None, max_iter=10000)
lr.fit(X_train, y_train)

train_pred = lr.predict_proba(X_train)[:, 1]
test_pred = lr.predict_proba(X_test)[:, 1]

metrics['LR'] = {
    'Train_Gini': 2*roc_auc_score(y_train, train_pred)-1,
    'Test_Gini': 2*roc_auc_score(y_test, test_pred)-1,
    'Run_Time': time.time() - start_time
}
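The Gini coefficient stored in `metrics` is a rescaling of the ROC AUC, $\text{Gini} = 2\,\text{AUC} - 1$, so that a random ranking scores 0 and a perfect one scores 1. The sketch below shows the idea without scikit-learn by computing the AUC as the fraction of (positive, negative) pairs ranked correctly; `pairwise_auc` and `gini` are hypothetical helpers, and the labels and scores are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs where the positive
    example gets the higher score; ties count as half a correct pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient derived from the AUC: 2 * AUC - 1."""
    return 2 * pairwise_auc(y_true, scores) - 1


# Toy data (illustrative only): 3 of the 4 positive/negative pairs are ordered correctly.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y, s))  # 0.75
print(gini(y, s))          # 0.5
```

The quadratic pair loop is only meant to make the definition explicit; the notebook itself relies on `roc_auc_score`.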

Adding the $l_2$ term to the cost function: $$ \frac{1}{2}||\beta||_2^2 = \frac{1}{2}\beta^T\beta \ , $$ where $\beta$ is the vector of coefficients.

In [9]:
start_time = time.time()

lr = LogisticRegression(max_iter=10000)  # penalty='l2' is the scikit-learn default
lr.fit(X_train, y_train)

train_pred = lr.predict_proba(X_train)[:, 1]
test_pred = lr.predict_proba(X_test)[:, 1]

metrics['LR_l2'] = {
    'Train_Gini': 2*roc_auc_score(y_train, train_pred)-1,
    'Test_Gini': 2*roc_auc_score(y_test, test_pred)-1,
    'Run_Time': time.time() - start_time
} 

Replacing the $l_2$ penalty with the $l_1$ penalty: $$||\beta||_1=\sum_{i=1}^n|\beta_i| \ ,$$ where $n$ is the length of $\beta$, that is, the number of features.

In [10]:
start_time = time.time()

lr = LogisticRegression(penalty='l1', solver='liblinear', max_iter=10000)
lr.fit(X_train, y_train)

train_pred = lr.predict_proba(X_train)[:, 1]
test_pred = lr.predict_proba(X_test)[:, 1]

metrics['LR_l1'] = {
    'Train_Gini': 2*roc_auc_score(y_train, train_pred)-1,
    'Test_Gini': 2*roc_auc_score(y_test, test_pred)-1,
    'Run_Time': time.time() - start_time
} 

Lastly, the ElasticNet penalty: $$\frac{1-\rho}{2}||\beta||_2^2 + \rho||\beta||_1 \ ,$$ where $\rho$ is a parameter between 0 and 1 that controls the balance between the $l_1$ and $l_2$ regularizations (the `l1_ratio` argument in the code). Note that if $\rho=1$, ElasticNet is equivalent to $l_1$, and if $\rho=0$, it is the same as $l_2$.
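To make the three penalty formulas concrete, here is a minimal sketch that evaluates them on a small coefficient vector and checks the two limiting cases of $\rho$. The function names and the example vector are hypothetical and chosen only for illustration.

```python
def l1_penalty(beta):
    """||beta||_1: sum of absolute coefficient values."""
    return sum(abs(b) for b in beta)


def l2_penalty(beta):
    """(1/2) * ||beta||_2^2: half the sum of squared coefficients."""
    return 0.5 * sum(b * b for b in beta)


def elasticnet_penalty(beta, rho):
    """(1 - rho)/2 * ||beta||_2^2 + rho * ||beta||_1."""
    return (1 - rho) * l2_penalty(beta) + rho * l1_penalty(beta)


beta = [3.0, -4.0, 0.0]  # illustrative coefficients
print(l1_penalty(beta))               # 7.0
print(l2_penalty(beta))               # 12.5
print(elasticnet_penalty(beta, 0.5))  # 9.75

# Limiting cases: rho = 1 recovers l1, rho = 0 recovers l2.
assert elasticnet_penalty(beta, 1.0) == l1_penalty(beta)
assert elasticnet_penalty(beta, 0.0) == l2_penalty(beta)
```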

In [11]:
for rho in [0.25, 0.5, 0.75]:
    start_time = time.time()

    lr = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=rho, max_iter=10000)
    lr.fit(X_train, y_train)

    train_pred = lr.predict_proba(X_train)[:, 1]
    test_pred = lr.predict_proba(X_test)[:, 1]

    metrics['LR_en_'+str(rho)] = {
        'Train_Gini': 2*roc_auc_score(y_train, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(y_test, test_pred)-1,
        'Run_Time': time.time() - start_time
    } 

In [12]:
metrics_lr = pd.DataFrame.from_dict(metrics, orient='index', columns=['Run_Time', 'Train_Gini', 'Test_Gini'])
metrics_lr['delta%'] = 100*(metrics_lr.Test_Gini - metrics_lr.Train_Gini) / metrics_lr.Train_Gini
metrics_lr

| Model | Run_Time | Train_Gini | Test_Gini | delta% |
|---|---|---|---|---|
| LR | 1.663978 | 1.0 | 0.893484 | -10.651629 |
| LR_l2 | 0.765762 | 0.993528 | 0.953634 | -4.015382 |
| LR_l1 | 0.310524 | 0.994384 | 0.932331 | -6.240347 |
| LR_en_0.25 | 0.823723 | 0.937901 | 0.929825 | -0.861135 |
| LR_en_0.5 | 0.834715 | 0.937794 | 0.929825 | -0.849826 |
| LR_en_0.75 | 0.873662 | 0.937794 | 0.929825 | -0.849826 |


In [13]:
metrics_lr.to_parquet('../../data/metrics/breast_cancer/logistic_regression.parquet')

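As expected, the unpenalized model overfits the most: its Train Gini is 1.0, but its Test Gini drops to 0.893, a relative loss of about 10.7%. The ElasticNet models, which combine both penalties, show the smallest gap between train and test (under 1%), at the cost of a lower Test Gini (0.930) than the $l_2$ (0.954) and $l_1$ (0.932) models.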
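The `delta%` column is the relative change from train to test Gini, so a value near zero means the model performs about as well on unseen data as on the training data. A minimal sketch recomputing it for two rows, using the rounded values printed in the table (the `delta_pct` helper is a hypothetical name):

```python
def delta_pct(train_gini, test_gini):
    """Relative change from train to test Gini, in percent."""
    return 100 * (test_gini - train_gini) / train_gini


# Rounded values copied from the results table above.
print(round(delta_pct(1.0, 0.893484), 2))       # -10.65  (LR, no penalty)
print(round(delta_pct(0.937794, 0.929825), 2))  # -0.85   (LR_en_0.5)
```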