# Models - Transformation into binary classification

**Transformation techniques used:**

* Binary Relevance
* Classifier Chains, with classifier orders:
    * Most frequent first
    * Least frequent first
    * Random order

**Base classification models used:**

* Logistic Regression
* Multinomial Naive Bayes
* Support Vector Machines
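
As a rough illustration of the two transformation techniques listed above, the sketch below shows how each one turns a toy multi-label dataset into a set of binary problems. It uses plain Python lists rather than scikit-multilearn, and the function names `binary_relevance_tasks` and `classifier_chain_tasks` are hypothetical helpers invented for this example, not part of any library.

```python
def binary_relevance_tasks(X, Y):
    """One independent binary problem per label: same features, one label column."""
    n_labels = len(Y[0])
    return [(X, [row[j] for row in Y]) for j in range(n_labels)]


def classifier_chain_tasks(X, Y, order):
    """One binary problem per label, visited in `order`.

    The features of each problem are extended with the true values of the
    labels that come earlier in the chain (as done at training time).
    """
    tasks = []
    for pos, j in enumerate(order):
        previous = order[:pos]
        X_aug = [list(x) + [y[k] for k in previous] for x, y in zip(X, Y)]
        tasks.append((X_aug, [y[j] for y in Y]))
    return tasks


# Toy data: 3 samples, 2 features, 3 labels (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]

br = binary_relevance_tasks(X, Y)
cc = classifier_chain_tasks(X, Y, order=[2, 0, 1])

print(len(br), "binary relevance tasks; target of task 1:", br[1][1])
print("last chain link, first sample features:", cc[2][0][0])
```

In the chain, the last link sees the original two features plus the two labels predicted before it, which is why the chain order can matter.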

## Necessary downloads and library imports

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
project_path = '/content/drive/My Drive/Colab Notebooks/MATF_ML_project/'

In [3]:
!pip install scikit-multilearn
!pip install ipynb



In [4]:
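# Project helpers from utility.ipynb; GridSearchCVMultipleFits, used below,
# is not imported anywhere else, so it presumably comes from this wildcard import.
from ipynb.fs.full.utility import *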

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

import pickle
import time
import datetime
import copy
import warnings

from scipy import sparse

import skmultilearn
from skmultilearn.model_selection import iterative_train_test_split
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import PredefinedSplit
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import sklearn.metrics as metrics
from sklearn.metrics import make_scorer

## Load data

In [5]:
X_train_vect_concat = sparse.load_npz(project_path + 'data/X_train_vect_concat.npz')
y_train_vect_concat = sparse.load_npz(project_path + 'data/y_train_vect_concat.npz')
X_test_vect_concat = sparse.load_npz(project_path + 'data/X_test_vect_concat.npz')
y_test_vect_concat = sparse.load_npz(project_path + 'data/y_test_vect_concat.npz')

In [6]:
X_train_concat_vect = sparse.load_npz(project_path + 'data/X_train_concat_vect.npz')
y_train_concat_vect = sparse.load_npz(project_path + 'data/y_train_concat_vect.npz')
X_test_concat_vect = sparse.load_npz(project_path + 'data/X_test_concat_vect.npz')
y_test_concat_vect = sparse.load_npz(project_path + 'data/y_test_concat_vect.npz')

In [7]:
X_train_vect_concat.shape, y_train_vect_concat.shape, X_test_vect_concat.shape, y_test_vect_concat.shape

((6058, 17416), (6058, 100), (1570, 17416), (1570, 100))

In [8]:
X_train_concat_vect.shape, y_train_concat_vect.shape, X_test_concat_vect.shape, y_test_concat_vect.shape

((6058, 12007), (6058, 100), (1570, 12007), (1570, 100))

In [9]:
N_labels = y_train_vect_concat.shape[1]
N_labels

100

In [10]:
data_list = [(X_train_vect_concat, y_train_vect_concat), (X_train_concat_vect, y_train_concat_vect)]

## Binary Relevance

In [11]:
parameters = [
    {
        'classifier': [MultinomialNB()],
    },
    {
        'classifier': [SVC()],
        'classifier__kernel': ['linear'],
        'classifier__verbose': [0],
        'require_dense': [[False, True]],
    },
    {
        'classifier': [LogisticRegression()],
        'classifier__solver': ['lbfgs'],
        'classifier__verbose': [0],
        'require_dense': [[False, True]],
    },
]

Precision is undefined when true positives + false positives == 0, and recall is undefined when true positives + false negatives == 0. In such cases scikit-learn sets the metric to 0 by default, sets the F-score to 0 as well, and raises an `UndefinedMetricWarning`. We suppress this warning.
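
The convention described above can be sketched in a few lines of plain Python. This is only an illustration of the zero-division rule, not scikit-learn's implementation; the function name `precision_recall_f1` and its `zero_division` argument are assumptions made for this example.

```python
def precision_recall_f1(tp, fp, fn, zero_division=0.0):
    """Precision, recall and F1 from raw counts, falling back to
    `zero_division` whenever a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else zero_division
    recall = tp / (tp + fn) if (tp + fn) > 0 else zero_division
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# A label that is never predicted: tp + fp == 0, so precision is undefined.
never_predicted = precision_recall_f1(tp=0, fp=0, fn=3)

# A label with ordinary counts.
ordinary = precision_recall_f1(tp=2, fp=1, fn=1)

print(never_predicted)
print(ordinary)
```

With 100 labels and a small validation split, labels that are never predicted are likely, which is why the warning would otherwise fire repeatedly during the search.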

In [12]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_binary_relevance = GridSearchCVMultipleFits(BinaryRelevance(), parameters, 
                              cv=ShuffleSplit(test_size=0.20, n_splits=1, random_state=0), 
                              scoring=['f1_weighted', 'f1_micro'],
                              refit='f1_weighted', verbose=0,
                              data_list=data_list)

In [13]:
print(result_binary_relevance)

{'mean_test_f1_micro': [0.0979631425800194, 0.5157314304249107, 0.408115489660554, 0.008243173621844409, 0.4787354158825744, 0.24944121591417073], 'mean_test_f1_weighted': [0.08975093197691675, 0.4928020027549865, 0.3752544096145575, 0.007722291226121143, 0.4343919861098135, 0.21771416730921878], 'best_score_': 0.4928020027549865, 'best_estimator_': BinaryRelevance(classifier=SVC(C=1.0, break_ties=False, cache_size=200,
                               class_weight=None, coef0=0.0,
                               decision_function_shape='ovr', degree=3,
                               gamma='scale', kernel='linear', max_iter=-1,
                               probability=False, random_state=None,
                               shrinking=True, tol=0.001, verbose=0),
                require_dense=[False, True])}


In [14]:
print("Best F1-score weighted:", result_binary_relevance['best_score_'])

Best F1-score weighted: 0.4928020027549865
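
The search reports two averages, `f1_micro` and `f1_weighted`. The sketch below shows how they differ, starting from per-label counts of true positives, false positives and false negatives. It is a plain-Python illustration under the usual definitions (micro pools the counts over all labels, weighted averages the per-label F1 by label support); the helper names and the toy counts are made up for this example.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn), set to 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def f1_micro(counts):
    """Pool the counts over all labels, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


def f1_weighted(counts):
    """Average the per-label F1, weighting each label by its support (tp + fn)."""
    supports = [tp + fn for tp, _, fn in counts]
    total = sum(supports)
    if total == 0:
        return 0.0
    return sum(f1_from_counts(*c) * s for c, s in zip(counts, supports)) / total


# Toy counts (tp, fp, fn): one well-predicted label, one label never predicted.
counts = [(8, 2, 2), (0, 0, 5)]

micro = f1_micro(counts)
weighted = f1_weighted(counts)
print(round(micro, 4), round(weighted, 4))
```

A label that is never predicted contributes an F1 of 0 at its full support weight to the weighted average, so the weighted score tends to fall below the micro score when rare labels are missed.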


In [15]:
names = ['Binary Relevance - Multinomial Naive Bayes, vectorization -> concatenate\t%.3f',
         'Binary Relevance - Support Vector Machines, vectorization -> concatenate\t%.3f',
         'Binary Relevance - Logistic Regression, vectorization -> concatenate\t\t%.3f',
         'Binary Relevance - Multinomial Naive Bayes, concatenate -> vectorization\t%.3f',
         'Binary Relevance - Support Vector Machines, concatenate -> vectorization\t%.3f',
         'Binary Relevance - Logistic Regression, concatenate -> vectorization\t\t%.3f']

In [16]:
print("--------------------------------------------")
print("F1-scores Micro averaged on validation data:")
print("--------------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result_binary_relevance['mean_test_f1_micro'][i]))

print()
print("--------------------------------------")
print("F1-scores Weighted on validation data:")
print("--------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result_binary_relevance['mean_test_f1_weighted'][i]))

--------------------------------------------
F1-scores Micro averaged on validation data:
--------------------------------------------
Binary Relevance - Multinomial Naive Bayes, vectorization -> concatenate	9.796
Binary Relevance - Support Vector Machines, vectorization -> concatenate	51.573
Binary Relevance - Logistic Regression, vectorization -> concatenate		40.812
Binary Relevance - Multinomial Naive Bayes, concatenate -> vectorization	0.824
Binary Relevance - Support Vector Machines, concatenate -> vectorization	47.874
Binary Relevance - Logistic Regression, concatenate -> vectorization		24.944

--------------------------------------
F1-scores Weighted on validation data:
--------------------------------------
Binary Relevance - Multinomial Naive Bayes, vectorization -> concatenate	8.975
Binary Relevance - Support Vector Machines, vectorization -> concatenate	49.280
Binary Relevance - Logistic Regression, vectorization -> concatenate		37.525
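Binary Relevance - Multinomial Naive Bayes, concatenate -> vectorization	0.772
Binary Relevance - Support Vector Machines, concatenate -> vectorization	43.439
Binary Relevance - Logistic Regression, concatenate -> vectorization		21.771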

## Classifier Chains

In [17]:
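# Chain orders over the label indices. The variable names assume that the label
# columns were already sorted by frequency in an earlier step (not shown here).
order_freq_incr = list(range(N_labels))

order_freq_decr = list(range(N_labels-1,-1,-1))

# Random permutation of the labels. random.shuffle is not seeded here,
# so this order differs between runs.
order_random = copy.deepcopy(order_freq_incr)
random.shuffle(order_random)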
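
The orders above are built from label indices alone, so they depend on how the label columns were arranged upstream. As a hedged alternative, the sketch below derives the three orders directly from the column sums of a label matrix and seeds the random order so that it is reproducible. The function name `chain_orders`, its `seed` argument, and the toy matrix are assumptions for this example; it works on nested lists, not on the sparse matrices used in this notebook.

```python
import random


def chain_orders(Y, seed=0):
    """Return (most_frequent_first, least_frequent_first, random_order)
    as lists of label indices, based on per-label positive counts."""
    n_labels = len(Y[0])
    freq = [sum(row[j] for row in Y) for j in range(n_labels)]

    # Stable sort: ties keep their original index order.
    most_first = sorted(range(n_labels), key=lambda j: -freq[j])
    least_first = most_first[::-1]

    # A dedicated, seeded generator keeps the shuffle reproducible.
    rng = random.Random(seed)
    random_order = list(range(n_labels))
    rng.shuffle(random_order)
    return most_first, least_first, random_order


# Toy label matrix: label 0 occurs 3 times, label 1 once, label 2 twice.
Y_toy = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
most_first, least_first, random_order = chain_orders(Y_toy)
print(most_first, least_first, random_order)
```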

In [18]:
parameters = [
    {
        'classifier': [SVC()],
        'classifier__kernel': ['linear'],
        'classifier__verbose': [0],
        'order': [order_freq_decr, order_freq_incr, order_random],
        'require_dense': [[False, True]],
    },
    {
        'classifier': [LogisticRegression()],
        'classifier__solver': ['lbfgs'],
        'classifier__verbose': [0],
        'order': [order_freq_decr, order_freq_incr, order_random],
        'require_dense': [[False, True]],
    },
]
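
To see how many candidates a grid like this produces, the sketch below expands a list of parameter dictionaries the way scikit-learn's `ParameterGrid` does (keys sorted, Cartesian product of the value lists). Whether `GridSearchCVMultipleFits` enumerates candidates in exactly this way is an assumption, since its source is not shown here. The grid uses string placeholders instead of real estimators and order lists, and `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per parameter combination, grid by grid."""
    for grid in param_grid:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))


# Placeholder version of the grid above (strings stand in for real objects).
toy_parameters = [
    {
        'classifier': ['SVC'],
        'classifier__kernel': ['linear'],
        'order': ['order_freq_decr', 'order_freq_incr', 'order_random'],
        'require_dense': [[False, True]],
    },
    {
        'classifier': ['LogisticRegression'],
        'classifier__solver': ['lbfgs'],
        'order': ['order_freq_decr', 'order_freq_incr', 'order_random'],
        'require_dense': [[False, True]],
    },
]

candidates = list(expand_grid(toy_parameters))
for c in candidates:
    print(c['classifier'], c['order'])
```

Under this assumption there are 6 candidates per dataset, which is consistent with the 12 scores reported for the two vectorization variants in `data_list`.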

As in the Binary Relevance search, precision or recall can be undefined for labels with no predicted or no true positives, so we again suppress the resulting `UndefinedMetricWarning`.

In [19]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_classifier_chain = GridSearchCVMultipleFits(ClassifierChain(), parameters, 
                              cv=ShuffleSplit(test_size=0.20, n_splits=1, random_state=0), 
                              scoring=['f1_weighted', 'f1_micro'],
                              refit='f1_weighted', verbose=0,
                              data_list=data_list)

In [20]:
print(result_classifier_chain)

{'mean_test_f1_micro': [0.010701920050361977, 0.5143769968051118, 0.012670256572695597, 0.008627450980392156, 0.4114906832298137, 0.0078125, 0.008743169398907104, 0.49926362297496313, 0.007272727272727273, 0.006200177147918512, 0.2492184010719071, 0.0026666666666666666], 'mean_test_f1_weighted': [0.008500626531863183, 0.4912067788355804, 0.009194326816866178, 0.0036331335865344757, 0.37920325436753, 0.006048715218993447, 0.0038371507863128626, 0.45530347243177727, 0.00504472102637848, 0.0017225407035095428, 0.21511021005097158, 0.002172377831974314], 'best_score_': 0.4912067788355804, 'best_estimator_': ClassifierChain(classifier=SVC(C=1.0, break_ties=False, cache_size=200,
                               class_weight=None, coef0=0.0,
                               decision_function_shape='ovr', degree=3,
                               gamma='scale', kernel='linear', max_iter=-1,
                               probability=False, random_state=None,
                               shrinkin

In [21]:
print("Best F1-score weighted:", result_classifier_chain['best_score_'])

Best F1-score weighted: 0.4912067788355804


In [22]:
names = ['Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Least frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Most frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Random order\t\t%.3f',
         'Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Least frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Most frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Random order\t\t%.3f',

         'Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Least frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Most frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Random order\t\t%.3f',
         'Classifier Chains - Logistic Regression, concatenate -> vectorization, order: Least frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, concatenate -> vectorization, order: Most frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, concatenate -> vectorization, order: Random order\t\t%.3f']

In [23]:
print("--------------------------------------------")
print("F1-scores Micro averaged on validation data:")
print("--------------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result_classifier_chain['mean_test_f1_micro'][i]))

print()
print("--------------------------------------")
print("F1-scores Weighted on validation data:")
print("--------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result_classifier_chain['mean_test_f1_weighted'][i]))

--------------------------------------------
F1-scores Micro averaged on validation data:
--------------------------------------------
Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Least frequent first	1.070
Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Most frequent first	51.438
Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Random order		1.267
Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Least frequent first	0.863
Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Most frequent first	41.149
Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Random order		0.781
Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Least frequent first	0.874
Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Most frequent first	49.926
Classifier Chains -

## Save models

In [24]:
# Persist the best estimator of each search; the files are opened for writing.
with open(project_path + 'models/model_best_binary_relevance.pk', 'wb') as fout:
    pickle.dump(result_binary_relevance['best_estimator_'], fout)
with open(project_path + 'models/model_best_classifier_chain.pk', 'wb') as fout:
    pickle.dump(result_classifier_chain['best_estimator_'], fout)
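
For completeness, a saved model can be restored with `pickle.load`. The sketch below shows the round trip on a placeholder object in a temporary directory, so it runs without the project files. The file name `model_demo.pk` and the dictionary standing in for a fitted estimator are hypothetical.

```python
import os
import pickle
import tempfile

# Placeholder standing in for a fitted estimator (illustrative only).
model = {'name': 'placeholder for a fitted estimator', 'score': 0.49}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.pk')

    # Write the object in binary mode ...
    with open(path, 'wb') as fout:
        pickle.dump(model, fout)

    # ... and read it back, also in binary mode.
    with open(path, 'rb') as fin:
        restored = pickle.load(fin)

print(restored == model)
```

Unpickling a real estimator generally requires the same library versions that were used when it was saved, and pickle files should only be loaded from trusted sources.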