# Logistic Regression

In [1]:
from sklearn.linear_model import LogisticRegression
import pandas as pd

In [2]:
import sys
sys.path.append('F:\\Users\\Manuel García Plaza\\Desktop\\TFG\\')

In [3]:
from notebooks.utils.classification_metrics import classification

First, we load the training set and the validation set (used here as the test set).

In [4]:
train = pd.read_parquet('../../../data/model_input/train_sets/breast_cancer.parquet')
test = pd.read_parquet('../../../data/model_input/validation_sets/breast_cancer.parquet')

In [5]:
y_train = train.diagnosis
X_train = train.drop(columns=['diagnosis'])

In [6]:
y_test = test.diagnosis
X_test = test.drop(columns=['diagnosis'])

We fit several models that differ only in the penalty term. Where the solver is changed from the default, it is because not every solver supports every penalty.
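As a quick reference for the solver changes, the lookup below transcribes the solver/penalty combinations listed in the scikit-learn documentation for `LogisticRegression`. It is only a sketch: the names `SOLVER_PENALTIES` and `solvers_for` are hypothetical helpers, not part of scikit-learn, and the table should be checked against the installed version. The default solver, `lbfgs`, accepts only `l2` or no penalty, which is why the $l_1$ and ElasticNet models need a different solver.

```python
# Penalties accepted by each LogisticRegression solver, transcribed from the
# scikit-learn documentation (assumption: verify against your installed version).
# None stands for "no penalty".
SOLVER_PENALTIES = {
    "lbfgs": {"l2", None},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", None},
    "newton-cholesky": {"l2", None},
    "sag": {"l2", None},
    "saga": {"elasticnet", "l1", "l2", None},
}


def solvers_for(penalty):
    """Return the sorted list of solvers that accept the given penalty."""
    return sorted(s for s, allowed in SOLVER_PENALTIES.items() if penalty in allowed)


print(solvers_for("l1"))          # ['liblinear', 'saga']
print(solvers_for("elasticnet"))  # ['saga']
```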

Without penalty term:

In [7]:
lr = LogisticRegression(penalty=None, max_iter=10000)

Adding the $l_2$ term to the cost function (this is the default penalty in scikit-learn, so `penalty` does not need to be set): $$ \frac{1}{2}||\beta||_2^2 = \frac{1}{2}\beta^T\beta \ , $$ where $\beta$ is the vector of coefficients.

In [8]:
lr_l2 = LogisticRegression(max_iter=10000)

Replacing the $l_2$ penalty with the $l_1$ penalty: $$ ||\beta||_1=\sum_{i=1}^n|\beta_i| \ , $$ where $n$ is the length of $\beta$, that is, the number of features.

In [9]:
lr_l1 = LogisticRegression(penalty='l1', solver='liblinear', max_iter=10000)

Lastly, the ElasticNet penalty: $$ \frac{1-\rho}{2}||\beta||_2^2 + \rho||\beta||_1 \ , $$ where $\rho$ (the `l1_ratio` argument) is a parameter between 0 and 1 that controls the balance between the $l_1$ and $l_2$ regularizations. Note that if $\rho=1$, ElasticNet is equivalent to $l_1$, and if $\rho=0$, it is the same as $l_2$.
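To make the three penalty formulas concrete, here is a minimal sketch in plain Python that evaluates them for a small coefficient vector and checks the two limiting cases of ElasticNet. The function names and the example vector are illustrative assumptions; scikit-learn computes these terms internally and additionally scales the loss by the `C` parameter.

```python
def l1_penalty(beta):
    """||beta||_1: sum of absolute values of the coefficients."""
    return sum(abs(b) for b in beta)


def l2_penalty(beta):
    """(1/2) * ||beta||_2^2: half the sum of squared coefficients."""
    return 0.5 * sum(b * b for b in beta)


def elasticnet_penalty(beta, rho):
    """(1 - rho)/2 * ||beta||_2^2 + rho * ||beta||_1, with rho in [0, 1]."""
    return (1 - rho) * l2_penalty(beta) + rho * l1_penalty(beta)


beta = [3.0, -4.0]  # illustrative coefficients, not taken from a fitted model

# The limiting cases recover the pure penalties.
assert elasticnet_penalty(beta, 1.0) == l1_penalty(beta)
assert elasticnet_penalty(beta, 0.0) == l2_penalty(beta)

# Intermediate values of rho interpolate between the two.
for rho in (0.25, 0.5, 0.75):
    print(rho, elasticnet_penalty(beta, rho))
```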

In [10]:
lr_en1 = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.25, max_iter=10000)
lr_en2 = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=10000)
lr_en3 = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.75, max_iter=10000)

In [11]:
models_list = [lr, lr_l2, lr_l1, lr_en1, lr_en2, lr_en3]
names_list = ['LR', 'LR_l2', 'LR_l1', 'LR_en_0.25', 'LR_en_0.5', 'LR_en_0.75']

In [12]:
metrics = classification(models_list, names_list, '../../../data/metrics/breast_cancer/logistic_regression.parquet', X_train, y_train, X_test, y_test)
metrics

| Model | Run_Time | Train_AUC | Test_AUC | delta% |
|---|---|---|---|---|
| LR | 2.0909 | 1.0 | 0.946742 | -5.325815 |
| LR_l2 | 0.858113 | 0.996764 | 0.976817 | -2.001173 |
| LR_l1 | 0.279603 | 0.997005 | 0.966792 | -3.030349 |
| LR_en_0.25 | 0.803109 | 0.968951 | 0.964912 | -0.41677 |
| LR_en_0.5 | 0.799593 | 0.968897 | 0.964912 | -0.411273 |
| LR_en_0.75 | 0.705775 | 0.968897 | 0.964912 | -0.411273 |
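The notebook does not show how `delta%` is defined inside `classification`. The reported values are consistent with the relative change from training AUC to test AUC, expressed as a percentage, so the sketch below assumes that definition. `delta_percent` is a hypothetical helper, and small differences in the last digits come from the rounded AUC values shown in the table.

```python
def delta_percent(train_auc, test_auc):
    """Relative change from train AUC to test AUC, in percent (assumed definition)."""
    return (test_auc - train_auc) / train_auc * 100


# (Train_AUC, Test_AUC, reported delta%) copied from the results table.
results = {
    "LR": (1.0, 0.946742, -5.325815),
    "LR_l2": (0.996764, 0.976817, -2.001173),
    "LR_l1": (0.997005, 0.966792, -3.030349),
    "LR_en_0.25": (0.968951, 0.964912, -0.41677),
}

for name, (train_auc, test_auc, reported) in results.items():
    computed = delta_percent(train_auc, test_auc)
    # Agreement is only up to the rounding of the displayed AUC values.
    print(f"{name}: computed={computed:.4f}, reported={reported:.4f}")
```

A negative value means the model scores lower on the test set than on the training set; the closer to zero, the smaller the gap.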


Observe that, as expected, the unpenalized model overfits the most: it reaches a perfect training AUC but has the lowest test AUC. The ElasticNet models show the smallest gap between training and test AUC, at the cost of a lower test AUC than the $l_2$ and $l_1$ models (all of them are nonetheless good models). We choose **LR_l2** as the best model because it has the highest test AUC and an acceptable delta.