# <center>Models - Transformation into binary classification</center>

**Transformation techniques used:**

* Binary Relevance
* Classifier Chains, with classifier orders:
    * Most frequent first
    * Least frequent first
    * Random order

**Base classification models used:**

* Logistic Regression
* Multinomial Naive Bayes
* Support Vector Machines

## Necessary downloads and library imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
project_path = '/content/drive/My Drive/Colab Notebooks/MATF_ML_project/'

In [None]:
!pip install scikit-multilearn
!pip install ipynb

Collecting scikit-multilearn
  Downloading scikit_multilearn-0.2.0-py3-none-any.whl (89 kB)
[?25l[K     |███▊                            | 10 kB 13.3 MB/s eta 0:00:01[K     |███████▍                        | 20 kB 19.2 MB/s eta 0:00:01[K     |███████████                     | 30 kB 18.0 MB/s eta 0:00:01[K     |██████████████▊                 | 40 kB 12.0 MB/s eta 0:00:01[K     |██████████████████▍             | 51 kB 5.5 MB/s eta 0:00:01[K     |██████████████████████          | 61 kB 5.6 MB/s eta 0:00:01[K     |█████████████████████████▊      | 71 kB 5.3 MB/s eta 0:00:01[K     |█████████████████████████████▍  | 81 kB 6.0 MB/s eta 0:00:01[K     |████████████████████████████████| 89 kB 3.6 MB/s 
[?25hInstalling collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0
Collecting ipynb
  Downloading ipynb-0.5.1-py3-none-any.whl (6.9 kB)
Installing collected packages: ipynb
Successfully installed ipynb-0.5.1


In [None]:
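# Project helper notebook (utility.ipynb); presumably the source of
# GridSearchCVMultipleFits, which is used below but not imported elsewhere.
from ipynb.fs.full.utility import *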

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

import pickle
import time
import datetime
import copy

from scipy import sparse

import skmultilearn
from skmultilearn.model_selection import iterative_train_test_split
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import PredefinedSplit
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import sklearn.metrics as metrics

## Load data

In [None]:
X_train_vect_concat = sparse.load_npz(project_path + 'data/X_train_vect_concat.npz')
y_train_vect_concat = sparse.load_npz(project_path + 'data/y_train_vect_concat.npz')
X_test_vect_concat = sparse.load_npz(project_path + 'data/X_test_vect_concat.npz')
y_test_vect_concat = sparse.load_npz(project_path + 'data/y_test_vect_concat.npz')

In [None]:
X_train_concat_vect = sparse.load_npz(project_path + 'data/X_train_concat_vect.npz')
y_train_concat_vect = sparse.load_npz(project_path + 'data/y_train_concat_vect.npz')
X_test_concat_vect = sparse.load_npz(project_path + 'data/X_test_concat_vect.npz')
y_test_concat_vect = sparse.load_npz(project_path + 'data/y_test_concat_vect.npz')

In [None]:
X_train_vect_concat.shape, y_train_vect_concat.shape, X_test_vect_concat.shape, y_test_vect_concat.shape

((6051, 15348), (6051, 100), (1577, 15348), (1577, 100))

In [None]:
X_train_concat_vect.shape, y_train_concat_vect.shape, X_test_concat_vect.shape, y_test_concat_vect.shape

((6062, 9110), (6062, 100), (1566, 9110), (1566, 100))

In [None]:
N_labels = y_train_vect_concat.shape[1]
N_labels

100

In [None]:
data_list = [(X_train_vect_concat, y_train_vect_concat), (X_train_concat_vect, y_train_concat_vect)]

## Binary Relevance
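
Binary Relevance trains one independent binary classifier per label column and stacks the per-label predictions back into a label matrix. The following is a minimal sketch of that idea, not the `skmultilearn` implementation: `CentroidClassifier`, `fit_binary_relevance` and `predict_binary_relevance` are hypothetical names, and the toy nearest-centroid learner only stands in for the real base models (Naive Bayes, SVM, Logistic Regression).

```python
import numpy as np


class CentroidClassifier:
    """Toy binary base learner (hypothetical): predicts the class whose
    mean feature vector is closest to the sample."""

    def fit(self, X, y):
        self.centroids_ = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        return self

    def predict(self, X):
        classes = sorted(self.centroids_)
        dists = np.stack(
            [np.linalg.norm(X - self.centroids_[c], axis=1) for c in classes],
            axis=1,
        )
        return np.array(classes)[dists.argmin(axis=1)]


def fit_binary_relevance(X, Y, make_base):
    """Fit one independent binary model per label column of Y."""
    return [make_base().fit(X, Y[:, j]) for j in range(Y.shape[1])]


def predict_binary_relevance(models, X):
    """Stack the per-label predictions into an (n_samples, n_labels) matrix."""
    return np.column_stack([m.predict(X) for m in models])


# Toy data: label 0 depends on the first feature, label 1 on the second.
X = np.array([[0.0, 0.0], [0.1, 0.9], [1.0, 0.0], [0.9, 1.0]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

models = fit_binary_relevance(X, Y, CentroidClassifier)
Y_pred = predict_binary_relevance(models, X)
print(Y_pred)
```

Because every label is fitted in isolation, this transformation cannot exploit correlations between labels, which is what Classifier Chains try to address.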

In [None]:
parameters = [
    {
        'classifier': [MultinomialNB()],
    },
    {
        'classifier': [SVC()],
        'classifier__kernel': ['linear'],
        'classifier__verbose': [0],
        'require_dense': [[False, True]],
    },
    {
        'classifier': [LogisticRegression()],
        'classifier__solver': ['lbfgs'],
        'classifier__verbose': [0],
        'require_dense': [[False, True]],
    },
]
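
The list of dictionaries above describes three candidate configurations, and the result lists printed further down contain one score per candidate and per dataset. The sketch below shows how such a grid could expand into runs. It is an illustration only: `expand_grid` is a hypothetical helper written in the spirit of scikit-learn's `ParameterGrid`, the classifiers are replaced by plain strings, and the dataset-major run order is an assumption inferred from the order of the result names, since the internals of `GridSearchCVMultipleFits` are not shown here.

```python
from itertools import product


def expand_grid(param_grids):
    """Expand a list of {name: [values]} grids into single-value candidates."""
    candidates = []
    for grid in param_grids:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


# Stand-in grid: strings replace the estimator objects used in the notebook.
grids = [
    {'classifier': ['MultinomialNB']},
    {'classifier': ['SVC'],
     'classifier__kernel': ['linear'],
     # A single candidate value that happens to be a list, so it does not
     # multiply the number of candidates.
     'require_dense': [[False, True]]},
    {'classifier': ['LogisticRegression'],
     'classifier__solver': ['lbfgs'],
     'require_dense': [[False, True]]},
]

candidates = expand_grid(grids)

# Assumed run order: every candidate on the first dataset, then on the second.
datasets = ['vect_concat', 'concat_vect']
runs = [(d, c['classifier']) for d in datasets for c in candidates]
for run in runs:
    print(run)
```

With two datasets and three candidates this gives six runs, which matches the six scores reported for Binary Relevance.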

In [None]:
result = GridSearchCVMultipleFits(BinaryRelevance(), parameters, 
                              cv=ShuffleSplit(test_size=0.20, n_splits=1, random_state=0), 
                              scoring=['f1_weighted', 'f1_micro'],
                              refit='f1_weighted', verbose=0,
                              data_list=data_list)

In [None]:
print(result)

{'mean_test_f1_micro': [0.12464319695528069, 0.5054151624548736, 0.4045831687080205, 0.015368852459016393, 0.46258503401360535, 0.2506690454950937], 'mean_test_f1_weighted': [0.11299516660530481, 0.49058426992079085, 0.37579481442693113, 0.013725937531791767, 0.417152690093167, 0.21969465213710476], 'best_score_': 0.49058426992079085, 'best_estimator_': BinaryRelevance(classifier=SVC(C=1.0, break_ties=False, cache_size=200,
                               class_weight=None, coef0=0.0,
                               decision_function_shape='ovr', degree=3,
                               gamma='scale', kernel='linear', max_iter=-1,
                               probability=False, random_state=None,
                               shrinking=True, tol=0.001, verbose=0),
                require_dense=[False, True])}


In [None]:
print("Best F1-score weighted:", result['best_score_'])

Best F1-score weighted: 0.49058426992079085
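
The two reported metrics aggregate per-label results differently: micro averaging pools true positives, false positives and false negatives over all labels before computing F1, while weighted averaging computes F1 per label and averages those values weighted by each label's support. A small hand-rolled sketch of both follows, assuming dense 0/1 NumPy arrays. The function name `f1_micro_and_weighted` is hypothetical; the notebook itself relies on scikit-learn's `f1_micro` and `f1_weighted` scorers.

```python
import numpy as np


def f1_micro_and_weighted(Y_true, Y_pred):
    """Return (micro F1, support-weighted F1) for binary label matrices."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0)

    # Micro: pool the counts over all labels, then compute a single F1.
    pooled = 2 * tp.sum() + fp.sum() + fn.sum()
    micro = 2 * tp.sum() / pooled if pooled else 0.0

    # Weighted: per-label F1 (0 where undefined), weighted by label support.
    denom = 2 * tp + fp + fn
    per_label = np.divide(2 * tp, denom, out=np.zeros(len(tp)), where=denom > 0)
    support = Y_true.sum(axis=0)
    weighted = (per_label * support).sum() / support.sum()
    return micro, weighted


# Label 0 is common (support 3), label 1 is rare (support 1) and never predicted.
Y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
Y_pred = np.array([[1, 0], [1, 0], [0, 0], [0, 0]])

micro, weighted = f1_micro_and_weighted(Y_true, Y_pred)
print(round(micro, 3), round(weighted, 3))  # 0.667 0.6
```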


In [None]:
names = ['Binary Relevance - Multinomial Naive Bayes, vectorization -> concatenate\t%.3f',
         'Binary Relevance - Support Vector Machines, vectorization -> concatenate\t%.3f',
         'Binary Relevance - Logistic Regression, vectorization -> concatenate\t\t%.3f',
         'Binary Relevance - Multinomial Naive Bayes, concatenate -> vectorization\t%.3f',
         'Binary Relevance - Support Vector Machines, concatenate -> vectorization\t%.3f',
         'Binary Relevance - Logistic Regression, concatenate -> vectorization\t\t%.3f']

In [None]:
print("--------------------------------------------")
print("F1-scores Micro averaged on validation data:")
print("--------------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result['mean_test_f1_micro'][i]))

print()
print("--------------------------------------")
print("F1-scores Weighted on validation data:")
print("--------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result['mean_test_f1_weighted'][i]))

--------------------------------------------
F1-scores Micro averaged on validation data:
--------------------------------------------
Binary Relevance - Multinomial Naive Bayes, vectorization -> concatenate	12.464
Binary Relevance - Support Vector Machines, vectorization -> concatenate	50.542
Binary Relevance - Logistic Regression, vectorization -> concatenate		40.458
Binary Relevance - Multinomial Naive Bayes, concatenate -> vectorization	1.537
Binary Relevance - Support Vector Machines, concatenate -> vectorization	46.259
Binary Relevance - Logistic Regression, concatenate -> vectorization		25.067

--------------------------------------
F1-scores Weighted on validation data:
--------------------------------------
Binary Relevance - Multinomial Naive Bayes, vectorization -> concatenate	11.300
Binary Relevance - Support Vector Machines, vectorization -> concatenate	49.058
Binary Relevance - Logistic Regression, vectorization -> concatenate		37.579
Binary Relevance - Multinomial Naive 

## Classifier Chains
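
In a classifier chain the labels are processed in a fixed order, and each link is trained on the original features extended with the labels that precede it in the chain. The sketch below shows only how the training inputs grow along the chain. It illustrates the general idea, not the internals of `skmultilearn`'s `ClassifierChain`, and `chain_training_sets` is a hypothetical helper name.

```python
import numpy as np


def chain_training_sets(X, Y, order):
    """Yield (label_index, X_augmented, y) for each link of the chain.

    Link k is trained on the original features plus the true values of the
    k labels that come before it in `order`.
    """
    X_aug = X
    for j in order:
        yield j, X_aug, Y[:, j]
        # The next link sees label j as one additional feature column.
        X_aug = np.hstack([X_aug, Y[:, [j]]])


X = np.arange(8, dtype=float).reshape(4, 2)   # 4 samples, 2 features
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])                     # 4 samples, 3 labels
order = [2, 0, 1]

links = list(chain_training_sets(X, Y, order))
shapes = [(j, X_aug.shape) for j, X_aug, _ in links]
print(shapes)  # [(2, (4, 2)), (0, (4, 3)), (1, (4, 4))]
```

At prediction time the true labels are unavailable, so a chain feeds each link the predictions of the earlier links instead. This is one reason the chosen order can have a large effect on the scores.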

In [None]:
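# Chain orders are permutations of the label column indices 0 .. N_labels-1.
# Going by the order of the result names printed later, ascending index order
# is reported as "Most frequent first" and descending index order as
# "Least frequent first".
order_freq_incr = list(range(N_labels))

order_freq_decr = list(range(N_labels-1, -1, -1))

# Random permutation of the label indices (unseeded, so it differs between runs).
order_random = copy.deepcopy(order_freq_incr)
random.shuffle(order_random)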
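
The orders above are built purely from column indices, which only corresponds to a frequency ordering if the label columns were already sorted by frequency in an earlier step. As a hedged alternative, the orders can be derived from the label matrix itself. `label_orders` and its `seed` parameter are hypothetical names, and the input may be a dense array or a SciPy sparse matrix.

```python
import numpy as np


def label_orders(Y, seed=0):
    """Return (most_frequent_first, least_frequent_first, random_order)."""
    # Column sums give the number of positive samples per label.
    freq = np.asarray(Y.sum(axis=0)).ravel()
    most_first = np.argsort(-freq, kind="stable").tolist()
    least_first = np.argsort(freq, kind="stable").tolist()
    # Seeded generator so the random order is reproducible between runs.
    rng = np.random.default_rng(seed)
    random_order = rng.permutation(len(freq)).tolist()
    return most_first, least_first, random_order


Y = np.array([[1, 0, 1],
              [1, 0, 0],
              [1, 1, 0],
              [0, 0, 1]])   # label frequencies: 3, 1, 2

most, least, rnd = label_orders(Y)
print(most, least)  # [0, 2, 1] [1, 2, 0]
```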

In [None]:
parameters = [
    {
        'classifier': [SVC()],
        'classifier__kernel': ['linear'],
        'classifier__verbose': [0],
        'order': [order_freq_decr, order_freq_incr, order_random],
        'require_dense': [[False, True]],
    },
    {
        'classifier': [LogisticRegression()],
        'classifier__solver': ['lbfgs'],
        'classifier__verbose': [0],
        'order': [order_freq_decr, order_freq_incr, order_random],
        'require_dense': [[False, True]],
    },
]

In [None]:
result = GridSearchCVMultipleFits(ClassifierChain(), parameters, 
                              cv=ShuffleSplit(test_size=0.20, n_splits=1, random_state=0), 
                              scoring=['f1_weighted', 'f1_micro'],
                              refit='f1_weighted', verbose=0,
                              data_list=data_list)

In [None]:
print(result)

{'mean_test_f1_micro': [0.006981910504601713, 0.509778775248477, 0.012763241863433313, 0.006329113924050632, 0.40456513183785914, 0.007889546351084813, 0.00885282183696053, 0.48807339449541287, 0.0036643459142543062, 0.006208425720620843, 0.25963668586619404, 0.003525782282944028], 'mean_test_f1_weighted': [0.004737152697258611, 0.49291218051463515, 0.00826308589067634, 0.002252857741337004, 0.3759615058062378, 0.0034054753518418302, 0.00375022156077076, 0.4460137318089905, 0.0013143002679248067, 0.0014148231085193684, 0.2247720182738548, 0.0003983047978537032], 'best_score_': 0.49291218051463515, 'best_estimator_': ClassifierChain(classifier=SVC(C=1.0, break_ties=False, cache_size=200,
                               class_weight=None, coef0=0.0,
                               decision_function_shape='ovr', degree=3,
                               gamma='scale', kernel='linear', max_iter=-1,
                               probability=False, random_state=None,
                          

In [None]:
print("Best F1-score weighted:", result['best_score_'])

Best F1-score weighted: 0.49291218051463515


In [None]:
names = ['Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Least frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Most frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Random order\t\t%.3f',
         'Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Least frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Most frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Random order\t\t%.3f',

         'Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Least frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Most frequent first\t%.3f',
         'Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Random order\t\t%.3f',
         'Classifier Chains - Logistic Regression, concatenate -> vectorization, order: Least frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, concatenate -> vectorization, order: Most frequent first\t%.3f',
         'Classifier Chains - Logistic Regression, concatenate -> vectorization, order: Random order\t\t%.3f']

In [None]:
print("--------------------------------------------")
print("F1-scores Micro averaged on validation data:")
print("--------------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result['mean_test_f1_micro'][i]))

print()
print("--------------------------------------")
print("F1-scores Weighted on validation data:")
print("--------------------------------------")
for i in range(len(names)):
    print(names[i] % (100 * result['mean_test_f1_weighted'][i]))

--------------------------------------------
F1-scores Micro averaged on validation data:
--------------------------------------------
Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Least frequent first	0.698
Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Most frequent first	50.978
Classifier Chains - Support Vector Machines, vectorization -> concatenate, order: Random order		1.276
Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Least frequent first	0.633
Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Most frequent first	40.457
Classifier Chains - Logistic Regression, vectorization -> concatenate, order: Random order		0.789
Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Least frequent first	0.885
Classifier Chains - Support Vector Machines, concatenate -> vectorization, order: Most frequent first	48.807
Classifier Chains -