#### Multiclass and Multilabel Classifiers
- The binary classifier is the simplest case: True/False, 0/1, M/F, sentiment analysis, and so on.
- We have also met the **multiclass** classifier, which outputs a probability distribution over mutually exclusive classes (e.g. iris, or sports/news/politics).
- **Multilabel**: sometimes a classifier must not single out one category but assign several at once (for example a post tagger or a recommender).
 - This can be handled by hand, building one binary classifier per category
 - Or by using the scikit-learn algorithms that support multilabel natively
   https://scikit-learn.org/stable/modules/multiclass.html#multiclass
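The "by hand" option can be sketched as follows. This is a minimal illustration on synthetic data from `make_multilabel_classification`, not on the dataset used in this notebook; the names `X_demo`, `Y_demo` and `per_label_models` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression

# Synthetic multilabel data: Y_demo has one 0/1 column per label
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)

# Train one independent binary classifier per label column
per_label_models = [
    LogisticRegression(max_iter=1000).fit(X_demo, Y_demo[:, j])
    for j in range(Y_demo.shape[1])
]

# Stack the per-label predictions back into an (n_samples, n_labels) matrix
Y_pred = np.column_stack([m.predict(X_demo) for m in per_label_models])
print(Y_pred.shape)
```

Each column of `Y_pred` comes from a different model, so a sample can receive zero, one, or several labels.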

In [1]:
from sklearn.datasets import fetch_openml, make_multilabel_classification
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier

from sklearn.metrics import accuracy_score, classification_report, multilabel_confusion_matrix

In [2]:
# dataset about yeast DNA (features in X) and the functions it is associated with (labels in y)
X, y = fetch_openml('yeast', version=4, return_X_y=True, as_frame=True)
#X, y = make_multilabel_classification()

In [3]:
X.head()

Unnamed: 0,Att1,Att2,Att3,Att4,Att5,Att6,Att7,Att8,Att9,Att10,...,Att94,Att95,Att96,Att97,Att98,Att99,Att100,Att101,Att102,Att103
0,0.004168,-0.170975,-0.156748,-0.142151,0.058781,0.026851,0.197719,0.04185,0.066938,-0.056617,...,0.006166,-0.012976,-0.014259,-0.015024,-0.010747,0.000411,-0.032056,-0.018312,0.030126,0.124722
1,-0.103956,0.011879,-0.098986,-0.054501,-0.00797,0.049113,-0.03058,-0.077933,-0.080529,-0.016267,...,0.00768,0.027719,-0.085811,0.111123,0.050541,0.027565,-0.063569,-0.041471,-0.079758,0.017161
2,0.509949,0.401709,0.293799,0.087714,0.011686,-0.006411,-0.006255,0.013646,-0.040666,-0.024447,...,0.096277,-0.044932,-0.08947,-0.009162,-0.01201,0.308378,-0.028053,0.02671,-0.066565,-0.122352
3,0.119092,0.004412,-0.002262,0.072254,0.044512,-0.051467,0.074686,-0.00767,0.079438,0.062184,...,-0.083809,0.200354,-0.075716,0.196605,0.152758,-0.028484,-0.074207,-0.089227,-0.049913,-0.043893
4,0.042037,0.007054,-0.069483,0.081015,-0.048207,0.089446,-0.004947,0.064456,-0.133387,0.068878,...,-0.060467,0.044351,-0.057209,0.028047,0.029661,-0.050026,0.023248,-0.061539,-0.03516,0.067834


In [4]:
# scikit-learn usually expects 0 and 1 as values here
y.head()

Unnamed: 0,Class1,Class2,Class3,Class4,Class5,Class6,Class7,Class8,Class9,Class10,Class11,Class12,Class13,Class14
0,False,False,False,False,False,False,True,True,False,False,False,True,True,False
1,False,False,True,True,False,False,False,False,False,False,False,False,False,False
2,False,True,True,False,False,False,False,False,False,False,False,True,True,False
3,False,False,True,True,False,False,False,False,False,False,False,False,False,False
4,False,False,True,True,True,True,False,False,False,False,False,False,False,False


In [5]:
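def convert_cell(c):
    # the labels are expected as the strings 'TRUE'/'FALSE';
    # str() makes the check work for real booleans as well
    if str(c).upper() == 'FALSE':
        return 0
    return 1

# note: from pandas 2.1 on, applymap is deprecated in favor of DataFrame.map
y = y.applymap(convert_cell)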

In [6]:
y.head()

Unnamed: 0,Class1,Class2,Class3,Class4,Class5,Class6,Class7,Class8,Class9,Class10,Class11,Class12,Class13,Class14
0,0,0,0,0,0,0,1,1,0,0,0,1,1,0
1,0,0,1,1,0,0,0,0,0,0,0,0,0,0
2,0,1,1,0,0,0,0,0,0,0,0,1,1,0
3,0,0,1,1,0,0,0,0,0,0,0,0,0,0
4,0,0,1,1,1,1,0,0,0,0,0,0,0,0


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [8]:
#model = LogisticRegression() ## ERROR: on its own it does not accept a 2D multilabel target
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

model.score(X_test, y_test)

0.13553719008264462
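A score this low is less alarming than it looks: for multilabel targets, `score` reports subset accuracy, which counts a sample as correct only when every one of its labels is predicted exactly. The sketch below uses small hand-made arrays (not the yeast data) to compare it with the per-label accuracy obtained from `hamming_loss`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],   # every label correct
                   [0, 1, 1],   # one label wrong
                   [1, 0, 0],   # one label wrong
                   [0, 0, 1]])  # every label correct

# Subset accuracy: 2 rows out of 4 match exactly
subset_acc = accuracy_score(y_true, y_pred)

# Per-label accuracy: 10 cells out of 12 are correct
per_label_acc = 1 - hamming_loss(y_true, y_pred)

print(subset_acc, per_label_acc)
```

With 14 labels per sample, as in this dataset, a single wrong label is enough to mark the whole row as an error.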

- scikit-learn has metrics that work directly on multilabel targets

In [9]:
p_test = model.predict(X_test)
print( classification_report(y_test, p_test) )

              precision    recall  f1-score   support

           0       0.51      0.47      0.49       203
           1       0.53      0.51      0.52       268
           2       0.55      0.56      0.56       252
           3       0.53      0.56      0.54       218
           4       0.51      0.49      0.50       179
           5       0.42      0.42      0.42       148
           6       0.26      0.33      0.29        89
           7       0.22      0.31      0.25        91
           8       0.02      0.03      0.02        37
           9       0.10      0.10      0.10        62
          10       0.18      0.18      0.18        66
          11       0.72      0.75      0.73       437
          12       0.71      0.75      0.73       432
          13       0.00      0.00      0.00         9

   micro avg       0.53      0.55      0.54      2491
   macro avg       0.38      0.39      0.38      2491
weighted avg       0.54      0.55      0.54      2491
 samples avg       0.54   

In [10]:
multilabel_confusion_matrix(y_test, p_test)

array([[[310,  92],
        [108,  95]],

       [[216, 121],
        [132, 136]],

       [[238, 115],
        [110, 142]],

       [[279, 108],
        [ 96, 122]],

       [[341,  85],
        [ 91,  88]],

       [[370,  87],
        [ 86,  62]],

       [[433,  83],
        [ 60,  29]],

       [[413, 101],
        [ 63,  28]],

       [[522,  46],
        [ 36,   1]],

       [[489,  54],
        [ 56,   6]],

       [[483,  56],
        [ 54,  12]],

       [[ 42, 126],
        [110, 327]],

       [[ 44, 129],
        [109, 323]],

       [[582,  14],
        [  9,   0]]], dtype=int64)
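Each 2x2 block above refers to one label and is laid out as `[[TN, FP], [FN, TP]]`. As a quick check, the sketch below copies the first block from the output (label 0) and recomputes precision, recall and support by hand; the variable names are illustrative.

```python
import numpy as np

# First 2x2 block from the output above (label 0): [[TN, FP], [FN, TP]]
cm_label0 = np.array([[310, 92],
                      [108, 95]])
tn, fp, fn, tp = cm_label0.ravel()

precision = tp / (tp + fp)   # 95 / 187
recall = tp / (tp + fn)      # 95 / 203
support = fn + tp            # number of true positives-class examples for this label

print(round(precision, 2), round(recall, 2), support)
```

The values agree with the row for label 0 in the classification report (0.51, 0.47, 203).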

- Any algorithm can be used, even one that does not support multilabel, by applying the one-vs-rest strategy, and without writing much code

In [20]:
single = LogisticRegression()
model  = OneVsRestClassifier(single)

model.fit(X_train, y_train)

model.score(X_test, y_test)


0.15537190082644628

In [21]:
model.multilabel_

True

In [22]:
model.estimators_

[LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression(),
 LogisticRegression()]

In [23]:
model.predict(X_test[:3])

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]])
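Since every label has its own estimator, `OneVsRestClassifier` can also return one probability per label through `predict_proba`, and these do not need to sum to 1 across a row. The sketch below shows this on synthetic data (not the yeast dataset); the 0.3 threshold is an arbitrary value chosen for illustration only.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data, used only for this illustration
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_demo, Y_demo)

# One independent probability per label: shape (3, 4)
proba = ovr.predict_proba(X_demo[:3])

# Thresholding at 0.5 reproduces predict() for this estimator
default_labels = (proba > 0.5).astype(int)

# A lower, hypothetical threshold assigns labels more generously
loose_labels = (proba >= 0.3).astype(int)

print(proba.shape)
```

Lowering the threshold trades precision for recall, which can help on rare labels.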