<h2>Multiclass Classification</h2>

In this module, we look at Logistic Regression and SoftMax Regression for multiclass classification problems. Multiclass means the target has at least three unique classes. In this case, a single binary Logistic Regression model cannot be applied directly.

There are a few strategies for this. SKLearn implements a one vs. all (or one vs. rest) strategy: building one binary model per class, each predicting whether the instance belongs to that class or not.

For example, if the target has four unique values (1, 2, 3, 4), we fit four models:
- 1 vs. not 1 (i.e. y = 2,3, or 4)
- 2 vs. not 2 (i.e. y = 1,3, or 4)
- 3 vs. not 3 (i.e. y = 1,2, or 4)
- 4 vs. not 4 (i.e. y = 1,2, or 3)

Then pick the class whose model gives the highest probability.
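The strategy above can be sketched by hand. This is a minimal illustration on synthetic data generated with `make_classification` (an assumption for the example, not the body performance data), and it is not necessarily how SKLearn implements the strategy internally:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class data (assumption for illustration only)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
y = y + 1  # relabel the classes as 1, 2, 3, 4 to match the example above

classes = np.unique(y)

# Fit one binary model per class: "class c" vs. "not class c"
binary_models = {}
for c in classes:
    model = LogisticRegression(max_iter=1000)
    model.fit(X, (y == c).astype(int))
    binary_models[c] = model

# Column j holds P(y == classes[j]) according to the j-th binary model
scores = np.column_stack(
    [binary_models[c].predict_proba(X)[:, 1] for c in classes])

# Predict the class whose binary model is the most confident
ovr_pred = classes[scores.argmax(axis=1)]
print("training accuracy:", (ovr_pred == y).mean())
```

Note that the four columns of `scores` come from four independent models, so a row does not have to sum to 1.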

Fortunately, this is transparent in SKLearn: we don't need to fit multiple models ourselves, because SKLearn does that for us automatically. To use the one vs. all strategy, we set <b>multi_class='ovr'</b> when creating the model. The data is the body performance dataset, which can be obtained from https://www.kaggle.com/datasets/kukuroo3/body-performance-data. The columns are listed below.

1. age : 20 ~64
2. gender : F,M
3. height_cm : (If you want to convert to feet, divide by 30.48)
4. weight_kg
5. body fat_%
6. diastolic : diastolic blood pressure (min)
7. systolic : systolic blood pressure (max)
8. gripForce
9. sit and bend forward_cm
10. sit-ups counts
11. broad jump_cm
12. class : A,B,C,D ( A: best) / stratified <== <b>target</b>

For simplicity, we will only use accuracy to evaluate the models in this example.

In [4]:
import numpy as np
import pandas as pd

data = pd.read_csv('bodyPerformance.csv')
data.head()

Unnamed: 0,age,gender,height_cm,weight_kg,body fat_%,diastolic,systolic,gripForce,sit and bend forward_cm,sit-ups counts,broad jump_cm,class
0,27.0,M,172.3,75.24,21.3,80.0,130.0,54.9,18.4,60.0,217.0,C
1,25.0,M,165.0,55.8,15.7,77.0,126.0,36.4,16.3,53.0,229.0,A
2,31.0,M,179.6,78.0,20.1,92.0,152.0,44.8,12.0,49.0,181.0,C
3,32.0,M,174.5,71.1,18.4,76.0,147.0,41.4,15.2,53.0,219.0,B
4,28.0,M,173.8,67.7,17.1,70.0,127.0,43.5,27.1,45.0,217.0,B


In [5]:
#all columns are numeric except for gender and the target (class)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13393 entries, 0 to 13392
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      13393 non-null  float64
 1   gender                   13393 non-null  object 
 2   height_cm                13393 non-null  float64
 3   weight_kg                13393 non-null  float64
 4   body fat_%               13393 non-null  float64
 5   diastolic                13393 non-null  float64
 6   systolic                 13393 non-null  float64
 7   gripForce                13393 non-null  float64
 8   sit and bend forward_cm  13393 non-null  float64
 9   sit-ups counts           13393 non-null  float64
 10  broad jump_cm            13393 non-null  float64
 11  class                    13393 non-null  object 
dtypes: float64(10), object(2)
memory usage: 1.2+ MB


A quick check on the target shows that the four classes are almost perfectly balanced:

In [6]:
data['class'].value_counts()

C    3349
D    3349
A    3348
B    3347
Name: class, dtype: int64

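As with any other data, we split it into training and testing sets first. The code below uses a plain random split; because the classes are almost perfectly balanced this is adequate here, although passing stratify=y to train_test_split would preserve the class proportions exactly.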

In [8]:
from sklearn.model_selection import train_test_split


X = data.drop('class', axis=1)
y = data['class']

trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2)

trainX.shape, testX.shape, trainY.shape, testY.shape

((10714, 11), (2679, 11), (10714,), (2679,))
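Stratification is requested through the `stratify` argument of `train_test_split`. A minimal sketch on a hypothetical, deliberately imbalanced toy label vector (not the body performance data), where the effect is easier to see:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 80 rows of class 'A', 20 rows of class 'B'
y_toy = np.array(['A'] * 80 + ['B'] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# With stratify, the 20-row test set keeps the 80/20 class ratio
print(np.unique(y_te, return_counts=True))
```

With classes as balanced as the ones in this dataset, a random split and a stratified split give nearly the same class proportions.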

No column has missing data, so no imputation is needed. We standardize the numeric columns and one hot encode gender.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

#numeric columns are selected by dtype; gender is the only categorical feature
num_cols = trainX.columns[(trainX.dtypes == np.int64) | (trainX.dtypes == np.float64)]
cat_cols = ['gender']

#standardize numeric features to zero mean and unit variance
num_pipeline = Pipeline([
    ('standardize', StandardScaler())
])

#one hot encode gender into two indicator columns (F and M)
cat_pipeline = Pipeline([
    ('encode', OneHotEncoder())
])

#apply each pipeline to its own set of columns
full_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('categorical', cat_pipeline, cat_cols)
])

trainX_prc = full_pipeline.fit_transform(trainX)

trainX_prc.shape

(10714, 12)

In [11]:
testX_prc = full_pipeline.transform(testX)  
testX_prc.shape

(2679, 12)

<h3>One-Versus-Rest Logistic Regression</h3>

We will tune models using all three regularization methods. In terms of code, nothing changes compared to the binary classification case besides adding multi_class='ovr'. We go through each regularization type as before.
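Recent scikit-learn releases deprecate the `multi_class` argument (to my knowledge starting with version 1.5), so cells that pass it may emit a FutureWarning on newer installations. A hedged alternative is to wrap a binary model in `OneVsRestClassifier` from `sklearn.multiclass`. The sketch below uses synthetic data as a stand-in for the preprocessed training set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data (assumption; stands in for trainX_prc and trainY)
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, n_clusters_per_class=1,
                                     random_state=0)

# The wrapper fits one binary LogisticRegression per class
ovr_model = OneVsRestClassifier(
    LogisticRegression(penalty='l2', C=1.0, max_iter=5000))
ovr_model.fit(X_demo, y_demo)

print(len(ovr_model.estimators_))          # number of fitted binary models
print(ovr_model.score(X_demo, y_demo))     # training accuracy
```

When this wrapper is tuned with GridSearchCV, the parameters of the inner model are addressed with the `estimator__` prefix, for example `{'estimator__C': [0.01, 0.1, 1]}`.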

<h4>L2 Regularization</h4>

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

#candidate values for C, the inverse of the regularization strength
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}]

#the default penalty is l2 anyway, but we specify it so that the code is clear
#logistic regression is trained iteratively; increase max_iter if sklearn warns about convergence
logistic = LogisticRegression(penalty='l2', multi_class='ovr', max_iter=5000)

#5-fold cross-validated grid search, scored by accuracy
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc, trainY)

Let's look at the best model

In [39]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.05}
0.5896026975107977


And apply it on the testing data

In [40]:
best_l2_logistic = grid_search.best_estimator_

best_l2_logistic.score(testX_prc, testY)

0.5961179544606197

<h4>L1 Regularization</h4>

As in LASSO, the sum of the absolute values of the coefficients is added to the training objective.

In [41]:
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]

#now we need to set penalty to l1
#also, we need a solver that supports l1; the default lbfgs doesn't, so we use 'saga' here ('liblinear' would also work)
logistic = LogisticRegression(penalty='l1', multi_class='ovr', max_iter=5000, solver='saga')

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)



Best model performance:

In [42]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 1}
0.5893224983258196


And in the testing data

In [43]:
best_l1_logistic = grid_search.best_estimator_

best_l1_logistic.score(testX_prc, testY)

0.5949981336319522
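The LASSO-like effect of the L1 penalty is that some coefficients are driven to exactly zero, whereas L2 only shrinks them. A small sketch of this on a synthetic binary problem (the data, the `liblinear` solver, and the value of C are assumptions chosen to make the effect visible quickly, not the settings used above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (assumption): 3 informative features, 17 noise features
X_s, y_s = make_classification(n_samples=500, n_features=20, n_informative=3,
                               n_redundant=0, random_state=0)
X_s = StandardScaler().fit_transform(X_s)

# Same strong regularization (small C) for both penalties
l1_model = LogisticRegression(penalty='l1', C=0.01, solver='liblinear').fit(X_s, y_s)
l2_model = LogisticRegression(penalty='l2', C=0.01, solver='liblinear').fit(X_s, y_s)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("exactly-zero coefficients  L1:", n_zero_l1, " L2:", n_zero_l2)
```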

<h4>Elastic Net Regularization</h4>

As with the Elastic Net model in regression, we use both l1 and l2 regularization in the same model. Now we need to tune both C and l1_ratio.
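As far as I understand the scikit-learn documentation, the penalty that is added to the (C-weighted) loss mixes the two terms through l1_ratio as follows, where the $\beta_j$ are the model coefficients:

$\text{penalty}(\beta)=\text{l1\_ratio}\sum_j |\beta_j| + \dfrac{1-\text{l1\_ratio}}{2}\sum_j \beta_j^2$

Setting l1_ratio to 1 recovers the L1 penalty and setting it to 0 recovers the L2 penalty. Because C multiplies the loss rather than the penalty, a smaller C means stronger regularization.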

In [44]:
#now penalty is changed to elasticnet
#elasticnet is only supported by the saga solver
#this model is taking a long time to train due to the data size
#so I reduce the number of values in each parameter
logistic = LogisticRegression(penalty='elasticnet', multi_class='ovr', max_iter=5000, solver='saga')

param_grid = [{
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'l1_ratio': [0.25, 0.5, 0.75]
}]

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

Best model performance

In [45]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 100, 'l1_ratio': 0.5}
0.5893224547557396


And apply to testing data

In [46]:
best_enet_logistic = grid_search.best_estimator_

best_enet_logistic.score(testX_prc, testY)

0.595371407241508

<h3>SoftMax Regression</h3>

SoftMax Regression is fairly similar to Logistic Regression; however, it can directly model multiclass classification problems.

The Softmax model equation is a bit more complicated than that of Logistic Regression. Given that the target has $n$ unique values $(1, 2, \dots, n)$, we have $n$ sets of coefficients $\beta_i=(\beta_{0,i},\beta_{1,i},\beta_{2,i},\dots,\beta_{k,i})$, where $k$ is still the number of input features. The probability of an instance belonging to class $i$ is

$P(y=i)=\dfrac{e^{\beta_{0,i} + \beta_{1,i} x_1+\beta_{2,i} x_2 + \dots + \beta_{k,i} x_k}}{\sum_{j=1}^{n} e^{\beta_{0,j}+\beta_{1,j} x_1 + \beta_{2,j} x_2 + \dots + \beta_{k,j} x_k}}$

The output of the model for each instance is a vector of $n$ values between $0$ and $1$, $\hat{y}=(\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n)$, representing the probabilities of the instance belonging to each class $1, 2, \dots, n$. The difference from one vs. rest Logistic Regression is that all sets of coefficients are trained simultaneously, instead of separately as $n$ independent one vs. rest models. Furthermore, the probabilities for each instance in Softmax regression sum to 1 by construction.
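The probability formula can be checked numerically. The sketch below uses made-up coefficients (hypothetical numbers, not fitted values) for $n=3$ classes and $k=2$ features; the function name `softmax_probabilities` is introduced only for this illustration:

```python
import numpy as np

def softmax_probabilities(X, intercepts, coefs):
    """Return an (m, n) matrix of class probabilities.

    X: (m, k) features; intercepts: (n,) values beta_{0,i};
    coefs: (n, k) values beta_{1,i} ... beta_{k,i}.
    """
    scores = intercepts + X @ coefs.T                      # linear score per class
    scores = scores - scores.max(axis=1, keepdims=True)    # stabilizes exp, result unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical numbers: m = 2 instances, k = 2 features, n = 3 classes
X_demo = np.array([[1.0, 2.0], [0.5, -1.0]])
intercepts = np.array([0.1, 0.0, -0.1])
coefs = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]])

probs = softmax_probabilities(X_demo, intercepts, coefs)
print(probs)
print(probs.sum(axis=1))   # each row sums to 1
```

Subtracting the row maximum before exponentiating is a common numerical safeguard; it cancels in the ratio, so the probabilities are the same as in the formula.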

The training objective of Softmax regression is called <b>Maximum Likelihood</b>, which aims to maximize the predicted probability of each instance belonging to its true class while minimizing that of the other classes.
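Maximizing the likelihood is equivalent to minimizing the average negative log of the probability assigned to the true class (the cross-entropy, or log loss). A minimal sketch with hypothetical predicted probabilities, ignoring the regularization term that the models in this module add on top:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical predicted probabilities for 3 instances over classes A, B, C
labels = ['A', 'B', 'C']
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.3, 0.5]])
y_true = np.array(['A', 'B', 'A'])

# Probability the model assigned to the true class of each instance
true_idx = np.array([labels.index(c) for c in y_true])
p_true = probs[np.arange(len(y_true)), true_idx]

# Mean negative log-likelihood: lower is better
nll = -np.mean(np.log(p_true))
print(nll)
print(log_loss(y_true, probs, labels=labels))   # same quantity from scikit-learn
```

The third instance is the costly one: its true class is A but the model gives A a probability of only 0.2.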

In SKLearn, we still use LogisticRegression. Besides changing multi_class to 'multinomial', everything else is exactly the same as above. We again try all three regularization methods and tune each of them.

<h4>L2 Regularization</h4>

In [32]:
from sklearn.model_selection import GridSearchCV

#candidate values for C, the inverse of the regularization strength
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}]

#the default penalty is l2 anyway, but we specify it so that the code is clear
#multi_class='multinomial' requests SoftMax regression; the lbfgs solver supports it with l2
#logistic regression is trained iteratively; increase max_iter if sklearn warns about convergence
logistic = LogisticRegression(multi_class='multinomial', penalty='l2', max_iter=5000, solver='lbfgs')

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc, trainY)

Let's look at the best model

In [33]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 100}
0.6141492963649917


And apply it on the testing data

In [34]:
best_l2_logistic = grid_search.best_estimator_

best_l2_logistic.score(testX_prc, testY)

0.6248600223964166

<h4>L1 Regularization</h4>

As in LASSO, the sum of the absolute values of the coefficients is added to the training objective.

In [24]:
param_grid = [{'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1 , 5, 10, 50, 100]}]

#now we need to set penalty to l1
#also, we need to set solver to 'saga' because it is the only one that supports both multinomial and l1 regularization
logistic = LogisticRegression(multi_class='multinomial', penalty='l1', max_iter=5000, solver='saga')

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

Best model performance:

In [25]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.1}
0.6143364734290044


And in the testing data

In [26]:
best_l1_logistic = grid_search.best_estimator_

best_l1_logistic.score(testX_prc, testY)

0.6297125793206421

<h4>Elastic Net Regularization</h4>

As with the Elastic Net model in regression, we use both l1 and l2 regularization in the same model. Now we need to tune both C and l1_ratio.

In [31]:
#now penalty is changed to elasticnet
#elasticnet is only supported by the saga solver
#this model is taking a long time to train due to the data size
#so I reduce the number of values in each parameter
logistic = LogisticRegression(multi_class='multinomial',penalty='elasticnet', max_iter=5000, solver='saga')

param_grid = [{
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'l1_ratio': [0.25, 0.5, 0.75]
}]

grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring='accuracy', return_train_score=True)

grid_search.fit(trainX_prc,trainY)

Best model performance

In [28]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.1, 'l1_ratio': 0.75}
0.61442962626021


And apply to testing data

In [29]:
best_enet_logistic = grid_search.best_estimator_

best_enet_logistic.score(testX_prc, testY)

0.6289660321015305

<h3>Summary of All Models</h3>

|Model|Training CV Accuracy|Testing Accuracy|
|-----|--------------------|----------------|
|L2 Logistic|0.5896|0.5961|
|L1 Logistic|0.5893|0.5950|
|ENet Logistic|0.5893|0.5954|
|L2 SoftMax|0.6141|0.6249|
|L1 SoftMax|0.6143|0.6297|
|ENet SoftMax|0.6144|0.6290|

Overall, the three types of regularization yield very similar performance within each strategy. Softmax regression, however, outperforms OVR Logistic Regression consistently, by roughly 3 percentage points in accuracy.
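The size of the gap can be computed from the testing accuracies printed in the cells above. A small sketch with pandas (the column names are chosen for this illustration):

```python
import pandas as pd

# Testing accuracies copied from the outputs above
summary = pd.DataFrame({
    'penalty': ['L2', 'L1', 'ENet', 'L2', 'L1', 'ENet'],
    'strategy': ['OVR'] * 3 + ['SoftMax'] * 3,
    'test_accuracy': [0.5961179544606197, 0.5949981336319522, 0.595371407241508,
                      0.6248600223964166, 0.6297125793206421, 0.6289660321015305],
})

# Average testing accuracy per strategy, and the difference between the two
mean_by_strategy = summary.groupby('strategy')['test_accuracy'].mean()
gap = mean_by_strategy['SoftMax'] - mean_by_strategy['OVR']
print(mean_by_strategy)
print("gap:", round(gap, 4))
```

Since all of these numbers come from a single random train/test split, the gap should be read as indicative rather than exact.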