<h2>Logistic Regression and SoftMax Regression</h2>

In this module, we discuss two classification models: Logistic regression and SoftMax regression.

<h3>Binary Logistic Regression</h3>

In this example, we use the credit approval data. The problem is to predict whether a credit transaction is approved (+) or not (-). The target is the last column in the data.

In [3]:
import pandas as pd
import numpy as np

In [4]:
crx = pd.read_csv('crx.data', header=None)
crx.head()

| |0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|b|30.83|0.0|u|g|w|v|1.25|t|t|1|f|g|202.0|0|+|
|1|a|58.67|4.46|u|g|q|h|3.04|t|t|6|f|g|43.0|560|+|
|2|a|24.5|0.5|u|g|q|h|1.5|t|f|0|f|g|280.0|824|+|
|3|b|27.83|1.54|u|g|w|v|3.75|t|t|5|t|g|100.0|3|+|
|4|b|20.17|5.625|u|g|w|v|1.71|t|f|0|f|s|120.0|0|+|


We will convert '+' and '-' to 1 and 0, which is easier to work with in sklearn.

In [5]:
Y = np.zeros(crx.shape[0])           #create a vector of zeros with size = the data
Y[crx[15]=='+'] = 1                  #when the actual target is +, Y is assigned 1
crx[15] = Y                          #assign the new labels back to the data 
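The same conversion can also be written as a single vectorized expression. The sketch below uses a small made-up frame named `toy` (an assumption, not the credit data) and produces integer labels rather than the float labels created above.

```python
import pandas as pd

# Toy frame standing in for crx (hypothetical data); column 15 holds '+' / '-'
toy = pd.DataFrame({0: ['a', 'b', 'a', 'b'], 15: ['+', '-', '-', '+']})

# The comparison yields True/False; astype(int) maps them to 1/0
toy[15] = (toy[15] == '+').astype(int)

print(toy[15].tolist())
```

Either form works with sklearn; the one-liner only avoids the intermediate vector of zeros.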

<h4>Train/Test Split </h4>

Recall that the target in this data is column 15. We will use a stratified split based on that column, so that the training and testing sets keep the same proportion of approved transactions.

In [6]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)

for train_index, test_index in split.split(crx, crx[15]):
    strat_train_set = crx.loc[train_index]
    strat_test_set = crx.loc[test_index]
    
trainX = strat_train_set.loc[:,:14]
trainY = strat_train_set.loc[:,15]
trainX.shape, trainY.shape

((517, 15), (517,))
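To illustrate what stratification does, here is a minimal sketch on made-up data (the frame `toy` and its 30% positive rate are assumptions for illustration). It checks that both parts of the split keep roughly the same share of positives.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 200 rows, 60 of them in the positive class
toy = pd.DataFrame({'x': np.arange(200), 'y': [1] * 60 + [0] * 140})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)

# With n_splits=1 there is a single (train, test) pair of index arrays
train_idx, test_idx = next(split.split(toy, toy['y']))

train_rate = toy.loc[train_idx, 'y'].mean()   # share of positives in training part
test_rate = toy.loc[test_idx, 'y'].mean()     # share of positives in testing part
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Using `next(...)` instead of a `for` loop is only a style choice when `n_splits=1`.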

In [8]:
trainX.dtypes

0      object
1     float64
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13    float64
14      int64
dtype: object

<h4>Preprocessing</h4>

A log transformation is often useful for numeric features in binary classification, especially when they are right-skewed. We will use FunctionTransformer to apply a log transformation to the numeric columns of this data.

In [13]:
#get a list of numeric columns
#we use '|' to combine the two conditions because the columns include both int64 and float64
num_cols = trainX.columns[(trainX.dtypes == np.int64) | (trainX.dtypes == np.float64)]

#create a transform function for FunctionTransformer
def log_transform(X):                             #input of the function is any dataset X
    log_X = np.log(X + 0.1)                       #log of all columns of X, we add 0.1 to avoid log(0)
    return np.c_[X,log_X]                         #return X concatenated with log columns
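A quick check of what `log_transform` returns, on a tiny made-up array (the values in `demo` are assumptions chosen for easy arithmetic). The sketch assumes non-negative inputs; any value at or below -0.1 would make the logarithm undefined.

```python
import numpy as np

def log_transform(X):
    log_X = np.log(X + 0.1)        # shifted log, so zeros are allowed
    return np.c_[X, log_X]         # original columns followed by their logs

# Hypothetical 2 x 2 input
demo = np.array([[0.0, 1.0],
                 [9.9, 99.9]])

out = log_transform(demo)
print(out.shape)                   # the number of columns doubles
print(out)
```

The first half of the output columns is the input unchanged, and the second half holds the shifted logs, so the downstream scaler sees both versions of every numeric feature.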

Now we can build our numeric pipeline

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer

num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log transform', FunctionTransformer(log_transform, validate=False)),
    ('standardize', StandardScaler())
])

And the pipeline for the categorical (class) columns. We impute missing values with a 'missing' category, then apply OneHotEncoder.

In [15]:
from sklearn.preprocessing import OneHotEncoder

#get a list of class columns
cat_cols = trainX.columns[trainX.dtypes==object]

cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='constant',fill_value='missing')),
    ('encode', OneHotEncoder())
])

Then combine the two pipelines with ColumnTransformer

In [16]:
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])

Finally, transform the data through the pipeline

In [17]:
trainX_prc = full_pipeline.fit_transform(trainX)

trainX_prc.shape

(517, 57)
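The 57 columns can be accounted for as follows: the dtype listing shows 6 numeric columns, which become 12 after the log columns are appended, so the remaining 45 columns come from one-hot encoding. The sketch below reproduces the same bookkeeping on a made-up frame; the column names `amount`, `years`, and `kind` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

def log_transform(X):
    return np.c_[X, np.log(X + 0.1)]

# Hypothetical data: 2 numeric columns, 1 categorical column with a missing value
toy = pd.DataFrame({
    'amount': [1.0, 2.0, np.nan, 4.0],
    'years': [0.0, 3.0, 5.0, 1.0],
    'kind': pd.Series(['a', 'b', 'a', np.nan], dtype=object),
})

num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log transform', FunctionTransformer(log_transform, validate=False)),
    ('standardize', StandardScaler()),
])
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encode', OneHotEncoder()),
])
pre = ColumnTransformer([
    ('numeric', num_pipe, ['amount', 'years']),
    ('class', cat_pipe, ['kind']),
])

out = pre.fit_transform(toy)
# 2 numeric columns -> 4 after the log step; 'kind' -> 3 one-hot columns (a, b, missing)
print(out.shape)
```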

And the test data

In [18]:
#Similarly for testing data
testX = strat_test_set.loc[:,:14]
testY = strat_test_set.loc[:,15]

testX_prc = full_pipeline.transform(testX)  
testX_prc.shape

(173, 57)

<h4>Modeling with Logistic Regression</h4>

As with other sklearn models, getting started with the analysis is easy. First, let's try the default model. I will only consider accuracy and the F1 score; you may want to practice with other measurements and curves.

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

#create new model
logistic = LogisticRegression()

#train 
logistic.fit(trainX_prc, trainY)

#get CV accuracy
accuracy_3cv = cross_val_score(logistic, trainX_prc, trainY, cv=3, scoring="accuracy")

#get prediction for computation of F1 score
y_train_pred = cross_val_predict(logistic, trainX_prc, trainY, cv=3)

print('Training Accuracy: ', logistic.score(trainX_prc, trainY))
print('Cross-Validation Accuracy: ',accuracy_3cv.mean())
print('Cross-Validation F1: ', f1_score(trainY, y_train_pred))

Training Accuracy:  0.874274661508704
Cross-Validation Accuracy:  0.8509992382488686
Cross-Validation F1:  0.8351177730192719


The default model seems to be OK, with no significant signs of overfitting. Let's try to fine-tune it.
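Before tuning, here is a minimal sketch of a few of the other measurements mentioned earlier: the confusion matrix, precision, recall, and ROC AUC, all computed from cross-validated predictions. It runs on synthetic data from `make_classification` (an assumption, so the numbers do not describe the credit data).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for (trainX_prc, trainY)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Hard class predictions for the confusion matrix, precision and recall
pred = cross_val_predict(model, X, y, cv=3)
# Probability of the positive class for ROC AUC
proba = cross_val_predict(model, X, y, cv=3, method='predict_proba')[:, 1]

cm = confusion_matrix(y, pred)
auc = roc_auc_score(y, proba)
print(cm)
print('Precision:', precision_score(y, pred))
print('Recall:', recall_score(y, pred))
print('ROC AUC:', auc)
```

On the real data, `X` and `y` would be replaced by `trainX_prc` and `trainY`.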

<h4>L2 Regularization</h4>

Recall that this is similar to Ridge regression: we add the sum of squared coefficients to the training objective. However, the hyperparameter is <b>C</b>, NOT alpha. C is the inverse of the regularization strength, so a smaller C means stronger regularization.
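To see the effect of C concretely, the sketch below fits the same model with three values of C on synthetic data (an assumption, not the credit data) and prints the Euclidean norm of the coefficient vector. In scikit-learn's L2 formulation the log-loss is multiplied by C and the penalty term is half the squared norm of the coefficients, so a small C lets the penalty dominate and pulls the coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

norms = {}
for C in [0.01, 1, 100]:
    # The default penalty is l2, so only C changes between fits
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = np.linalg.norm(model.coef_)
    print(f'C={C}: coefficient norm = {norms[C]:.3f}')
```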

In [23]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]

#the default penalty is l2 anyway, but specifying it makes the code clearer
#logistic regression is trained iteratively; increase max_iter if sklearn raises a convergence warning
logistic = LogisticRegression(penalty='l2', max_iter=5000)

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=5000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10,
                                50, 100]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='accuracy', verbose=0)

Let's look at the best model

In [24]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.1}
0.856814787154593
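Because the search was created with `return_train_score=True`, the fitted object also stores training and validation scores for every value of C in `cv_results_`. The sketch below shows one way to view them as a table; it uses synthetic data and a shorter grid (both assumptions) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (trainX_prc, trainY)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    [{'C': [0.01, 0.1, 1, 10]}],
    cv=5, scoring='accuracy', return_train_score=True,
)
search.fit(X, y)

# One row per candidate C; mean_test_score is the mean validation accuracy
results = pd.DataFrame(search.cv_results_)[
    ['param_C', 'mean_train_score', 'mean_test_score', 'std_test_score']
]
print(results)
```

A large gap between `mean_train_score` and `mean_test_score` at large C would suggest overfitting, while low values of both at small C would suggest underfitting.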


And apply it on the testing data

In [25]:
best_l2_logistic = grid_search.best_estimator_

best_l2_logistic.score(testX_prc, testY)

0.861271676300578

<h4>L1 Regularization</h4>

This is similar to LASSO: we add the sum of absolute values of the coefficients to the training objective.
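As with LASSO, a practical consequence of the L1 penalty is that some coefficients can become exactly zero. The sketch below counts zero coefficients for a strong and a weak penalty on synthetic data in which only 3 of 20 features are informative (the data and the two C values are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of them carry signal
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

zero_counts = {}
for C in [0.01, 10]:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                               max_iter=5000).fit(X, y)
    # Count coefficients that the penalty has set exactly to zero
    zero_counts[C] = int(np.sum(model.coef_ == 0))
    print(f'C={C}: {zero_counts[C]} of {model.coef_.size} coefficients are zero')
```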

In [27]:
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]

#now we need to specify penalty to l1
#also, we need to set solver to 'liblinear' because the default solver doesn't support l1
logistic = LogisticRegression(penalty='l1', max_iter=5000, solver='liblinear')

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=5000, multi_class='auto',
                                          n_jobs=None, penalty='l1',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10,
                                50, 100]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='accuracy', verbose=0)

Best model performance:

In [28]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.5}
0.8606422703510083


And in the testing data

In [29]:
best_l1_logistic = grid_search.best_estimator_

best_l1_logistic.score(testX_prc, testY)

0.861271676300578

<h4>Elastic Net Regularization</h4>

Similar to the Elastic Net model in regression, we use both l1 and l2 regularization in the same model. Now we need to fine-tune both C and l1_ratio.
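For reference, the objective can be written out as below. This follows the form given in the scikit-learn user guide as I understand it, with labels coded as -1 and +1; the symbols $w$ (coefficients), $c$ (intercept), and $\rho$ (the `l1_ratio` hyperparameter) are notation introduced here, not taken from the notebook.

```latex
\min_{w,\,c}\;
  \frac{1-\rho}{2}\, w^{\top} w
  \;+\; \rho\, \lVert w \rVert_1
  \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right),
  \qquad y_i \in \{-1, +1\}
```

Setting $\rho = 1$ gives the L1 penalty and $\rho = 0$ gives the L2 penalty, so a grid over values strictly between 0 and 1 explores mixtures of the two.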

In [39]:
#now penalty is changed to elasticnet
#and we need to change solver to saga
logistic = LogisticRegression(penalty='elasticnet', max_iter=5000, solver='saga')

param_grid = [{
    'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100],
    'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
}]

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=5000, multi_class='auto',
                                          n_jobs=None, penalty='elasticnet',
                                          random_state=None, solver='saga',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10,
                                50, 100],
                          'l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                                       0.9]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='a

Best model performance

In [40]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.5, 'l1_ratio': 0.5}
0.8625840179238239


And apply to testing data

In [41]:
best_enet_logistic = grid_search.best_estimator_

best_enet_logistic.score(testX_prc, testY)

0.861271676300578

<h4>No-Regularization Model</h4>

Finally, let's see how a model without regularization performs.

In [45]:
#we need to set penalty to 'none' (recent scikit-learn versions expect penalty=None instead)
logistic = LogisticRegression(penalty='none', max_iter=5000)

#train 
logistic.fit(trainX_prc, trainY)

#get CV accuracy
accuracy_3cv = cross_val_score(logistic, trainX_prc, trainY, cv=3, scoring="accuracy")

#get prediction for computation of F1 score
y_train_pred = cross_val_predict(logistic, trainX_prc, trainY, cv=3)

print('Training Accuracy: ', logistic.score(trainX_prc, trainY))
print('Cross-Validation Accuracy: ',accuracy_3cv.mean())
print('Cross-Validation F1: ', f1_score(trainY, y_train_pred))
print('Testing Accuracy: ', logistic.score(testX_prc, testY))

Training Accuracy:  0.8820116054158608
Cross-Validation Accuracy:  0.8336357933413989
Cross-Validation F1:  0.8138528138528138
Testing Accuracy:  0.8497109826589595


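Both the cross-validation accuracy (0.834) and the testing accuracy (0.850) are lower than those of the regularized models, while the training accuracy (0.882) is higher than that of the default model (0.874). The model without regularization may be overfitting the data slightly.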

<h4>Result Summary</h4>

We can summarize all model performances in a table. I'll just focus on accuracy

|Model|Cross-Validation Accuracy (training set)|Testing Accuracy|
|-----|--------------------|-----------------|
|No Regularization|0.834|0.850|
|L2 Regularization|0.857|0.861|
|L1 Regularization|0.861|0.861|
|ENet Regularization|0.863|0.861|

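Except for the model without regularization, which performs slightly worse, all regularized models have very similar cross-validation accuracy on the training data and identical accuracy (0.861) on the testing data. Note that the cross-validation accuracy of the model without regularization comes from 3-fold cross-validation, while the others come from 5-fold grid searches, so that comparison is only approximate.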