<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| Luca Mossina and [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://supaerodatascience.github.io/machine-learning/">https://supaerodatascience.github.io/machine-learning/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">An application of SVMs in Multi-Label Classification (MLC)</div>

We'll see an application which is both harder and less common than binary classification: **multi-label classification** (MLC).  
Given a list of possible labels, the problem consists in finding one **or more** labels associated with a data point.  
For instance, imagine extracting the key topics from a newspaper article, or classifying the elements composing an image. Several labels may be associated with each item.

Given a set of labels $\mathcal{L} = \{l_1, l_2, ..., l_k\}$, we want to map elements of a feature space $\mathcal{X}$ to a subset of $\mathcal{L}$ (equivalently, to a binary indicator vector in $\{0,1\}^k$):  

$$h : \mathcal{X} \longrightarrow \mathcal{P}(\mathcal{L})$$

The two typical approaches for such problems are known as **Binary Relevance** (BR) and **Label Powerset** (LP).  

 - BR: each label in $\mathcal{L}$ defines its own binary classification problem, $h_{i} : \mathcal{X} \longrightarrow \{0,1\}, i = 1, ..., |\mathcal{L}|$, where $h_{i}(x) = 1$ means that label $l_{i}$ is assigned to $x$.  
 This method ignores any correlation between labels (it treats them as independent).

 - LP: transforms a problem of MLC into one of multiclass classification, mapping elements $x \in \mathcal{X}$ directly to $s \in \mathcal{P}(\mathcal{L})$.  
 This method rapidly becomes inapplicable, as the number of elements in $\mathcal{P}(\mathcal{L})$, $2^{|\mathcal{L}|}$, grows exponentially with the number of labels.
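
To make the two transformations concrete, here is a minimal sketch on a hand-made label matrix. The matrix `Y` and the names `br_targets`, `subset_to_class` and `lp_target` are made up for illustration; they do not come from the yeast dataset or from any library.

```python
import numpy as np

# Toy label matrix (hypothetical): 5 samples, k = 3 labels.
# Each row is a label subset encoded as a 0/1 indicator vector.
Y = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
])

# Binary Relevance: one binary target vector per label (one column each).
br_targets = {f"label_{j + 1}": Y[:, j] for j in range(Y.shape[1])}

# Label Powerset: each distinct label subset becomes one class of a
# single multiclass problem. Class ids follow the order of first appearance.
subset_to_class = {}
lp_target = []
for row in Y:
    key = tuple(int(v) for v in row)
    if key not in subset_to_class:
        subset_to_class[key] = len(subset_to_class)
    lp_target.append(subset_to_class[key])
lp_target = np.array(lp_target)

print("BR target for label_2:", br_targets["label_2"])
print("LP target:", lp_target)
print(len(subset_to_class), "observed subsets out of", 2 ** Y.shape[1], "possible")
```

BR yields $k$ small binary problems, while LP yields a single problem whose number of classes is bounded by $2^k$ (only the subsets actually observed in the data become classes).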
 
If you are curious about the topic of MLC, you are encouraged to read these references:  
J. Read, P. Reutemann, B. Pfahringer, and Geoff Holmes. **MEKA: A multi-label/multi-target extension to Weka**. Journal of Machine Learning Research, 17(21):1-5, 2016.  
G. Tsoumakas and I. Katakis. **Multi-label classification: An overview**. International Journal on Data Warehousing and Mining, 3(3):1-13, 2007.  
G. Tsoumakas, I. Katakis, and I. Vlahavas. **Mining multi-label data**. Data mining and knowledge discovery handbook, pages 667-685. Springer, 2010.
 
Many other variations exist, but for today we'll focus on BR, the most straightforward to implement. What we will implement below is a good starting point if you want to explore what is done in:  
J. Read, B. Pfahringer, G. Holmes, and E. Frank. **Classifier chains for multi-label classification**. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 254-269, 2009.

The equivalent approach for LP is found in:  
G. Tsoumakas, I. Katakis, and I. Vlahavas. **Random k-labelsets for multi-label classification**. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079-1089, 2011.

For this exercise, we will use a biology dataset from [Elisseeff and Weston 2001]: this dataset contains micro-array expressions and phylogenetic profiles for 2417 yeast genes. Each gene is annotated with a subset of 14 functional categories (e.g. metabolism, energy, etc.) of the top level of the functional catalogue.

<div class="alert alert-warning">

**Exercise**<br>
<ul>

<li> Find a suitable package to load the file `yeast.arff`.  <br>
    Hint: <a href="https://docs.scipy.org/doc/scipy/reference/io.html">scipy.io</a> and _read the doc_.<br>
<li> Store the data in a pandas dataframe.<br>
    Hint: the class columns are loaded as bytes (utf-8 encoded) but we need integers; look for 'str.decode('utf-8')'
<li> Check the dataset: you should have 2417 samples $\times$ 117 columns (103 features + 14 labels)
</ul>
</div>
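
The bytes-to-integer hint can be tried on a tiny example before touching the real file. This is only a sketch: the ARFF content below is hand-written to mimic the layout of `yeast.arff` (one nominal class column, one numeric attribute), and `toy_df` is a name chosen for illustration.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Tiny hand-written ARFF content (hypothetical, not the yeast data).
content = """@relation toy
@attribute Class1 {0,1}
@attribute Att1 numeric
@data
0,0.5
1,-0.25
1,0.125
"""

# loadarff accepts a file name or a file-like object
raw_data, metadata = arff.loadarff(StringIO(content))
toy_df = pd.DataFrame(raw_data)
print(type(toy_df.loc[0, "Class1"]))  # nominal values arrive as bytes

# Decode the bytes to text, then cast to plain integers
toy_df["Class1"] = toy_df["Class1"].str.decode("utf-8").astype(int)
print(toy_df.dtypes)
```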

In [15]:
# %load solutions/code1.py
### WRITE YOUR CODE HERE
# If you get stuck, uncomment the line above to load a correction in this cell (then you can execute this code).

# Read arff data
import pandas as pd
from scipy.io import arff

# Load yeast.arff via dedicated scipy.io function
raw_data, metadata = arff.loadarff('../data/yeast/yeast.arff')
print("nrows:", len(raw_data))    # 2417
print("ncols:", len(raw_data[0])) #  117

# Data to pandas, converting unicode columns to integers
df = pd.DataFrame(raw_data)
# print(df.shape)           # -> (2417, 117)
# print(df.head(5))         # for free, we get column names
# print(type(df.iloc[0,0])) # -> <class 'bytes'> ## we want to have plain {0,1} integers

classes_list = [name for name in df.columns if "Class" in name]
# print(classes_list)  # -> ['Class1', 'Class2', ... , 'Class14']

for col in classes_list:
    df[col] = df[col].str.decode('utf-8').astype(int)

print(type(df.iloc[0,0]))  # -> int: as expected
print(type(df.iloc[0,15])) # -> float: as expected
print("dataframe dimensions:", df.shape)    # -> (2417, 117)


nrows: 2417
ncols: 117
<class 'numpy.int64'>
<class 'numpy.float64'>
dataframe dimensions: (2417, 117)


In [18]:
df.shape[0]

2417

<div class="alert alert-warning">

**Exercise**<br>
<ul>
<li> Manually fit an SVM classifier for each label in the dataset
<li> Use a 60 ∕ 40 hold-out split: 60% of the data points to train the model, 40% to test it  <br>
   Remember: it is good practice to <b>randomly shuffle</b> the data, in case the data are ordered w.r.t. some data-dependent criterion.
<li> Report some performance measure
</ul>
</div>
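
For the last item, several measures are possible. The sketch below shows a few common ones from `sklearn.metrics` on made-up predictions; the arrays `Y_true` and `Y_pred` are hypothetical and only serve to show what each measure counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical ground truth and predictions: 4 samples, 3 labels
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

# Accuracy of each label taken separately (what a BR loop reports)
per_label_acc = (Y_true == Y_pred).mean(axis=0)

# Fraction of individual label assignments that are wrong
h_loss = hamming_loss(Y_true, Y_pred)

# Subset accuracy: a sample counts only if ALL its labels are right
subset_acc = accuracy_score(Y_true, Y_pred)

# Micro-averaged F1: pools true/false positives over all labels
micro_f1 = f1_score(Y_true, Y_pred, average="micro")

print("per-label accuracy:", per_label_acc)
print("Hamming loss:", h_loss)
print("subset accuracy:", subset_acc)
print("micro F1:", micro_f1)
```

Per-label accuracy is the easiest to obtain with BR, but it says nothing about whether the whole label set of a sample is correct; subset accuracy is the strictest of the measures shown here.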

In [31]:
# %load solutions/code2.py
### WRITE YOUR CODE HERE
# If you get stuck, uncomment the line above to load a correction in this cell (then you can execute this code).

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np

CVRATIO = 0.4

# Features list
features_list = [name for name in df.columns if "Att" in name]

# Shuffle dataset
df = df.sample(frac=1, random_state=0)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    df[features_list],
    df[classes_list],   # this contains Class1, Class2, ...
    test_size=CVRATIO,
    random_state=0
)

clf = SVC(gamma='auto')

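# Loop through each class column: one binary SVM per label (Binary Relevance)
for col in classes_list:
    # Train on the 60% training split
    clf.fit(X_train, y_train[col])

    # 5-fold cross-validation accuracy, computed on the training split only
    # (cross_val_score refits clones of clf; X_test / y_test are not used here)
    scores = cross_val_score(clf, X_train, y_train[col], cv=5)

    print(f"for {col} the cross validation score is {np.mean(scores)}")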

print("* done!")


for Class1 the cross validation score is 0.6841379310344828
for Class2 the cross validation score is 0.5641379310344827
for Class3 the cross validation score is 0.5882758620689655
for Class4 the cross validation score is 0.6503448275862069
for Class5 the cross validation score is 0.696551724137931
for Class6 the cross validation score is 0.7496551724137931
for Class7 the cross validation score is 0.8303448275862069
for Class8 the cross validation score is 0.8068965517241379
for Class9 the cross validation score is 0.9255172413793102
for Class10 the cross validation score is 0.886896551724138
for Class11 the cross validation score is 0.8703448275862069
for Class12 the cross validation score is 0.7531034482758621
for Class13 the cross validation score is 0.7455172413793105
for Class14 the cross validation score is 0.983448275862069
* done!
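
The manual loop over labels can also be delegated to scikit-learn. As a sketch, `MultiOutputClassifier` fits one clone of the base estimator per label column, which amounts to Binary Relevance. The data below are synthetic (generated with `make_multilabel_classification`) and only stand in for the yeast dataset, so the printed accuracies say nothing about the real problem.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the yeast data (assumption: NOT the real dataset)
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0
)

# Manual Binary Relevance: one SVM per label column
manual_pred = np.column_stack([
    SVC(gamma="auto").fit(X_train, Y_train[:, j]).predict(X_test)
    for j in range(Y.shape[1])
])

# Same idea through the scikit-learn wrapper
br = MultiOutputClassifier(SVC(gamma="auto")).fit(X_train, Y_train)
wrapped_pred = br.predict(X_test)

print("identical predictions:", np.array_equal(manual_pred, wrapped_pred))
print("per-label test accuracy:", (wrapped_pred == Y_test).mean(axis=0))
```

Unlike the cross-validation scores printed above, the accuracies in this sketch are computed on the 40% held-out split.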


In [26]:
from sklearn.utils import shuffle
import numpy as np
from sklearn import svm
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

n = df.shape[0]
Att_list = [name for name in df.columns if "Att" in name]
X = df[Att_list]

def shuffle_and_split(X, y, n):
    # Shuffle X and y together, then keep the first n rows for training
    X0, y0 = shuffle(X, y)
    Xtrain, Xtest = np.split(X0, [n])
    ytrain, ytest = np.split(y0, [n])
    return Xtrain, ytrain, Xtest, ytest

# Note: the score printed below is the accuracy on the 40% held-out split
print("Linear kernel")
for col in classes_list:
    y = df[col]
    Xtrain, ytrain, Xtest, ytest = shuffle_and_split(X, y, int(0.6*n))
    yeast_svc = svm.SVC(kernel='linear', C=1)
    yeast_svc.fit(Xtrain, ytrain)
    print("for the ",col,"the cross validation score is",  yeast_svc.score(Xtest,ytest))
    print('*', end='')
print(" done!")
print("rbf kernel")
for col in classes_list:
    y = df[col]
    Xtrain, ytrain, Xtest, ytest = shuffle_and_split(X, y, int(0.6*n))
    yeast_svc = svm.SVC(kernel='rbf', C=1)
    yeast_svc.fit(Xtrain, ytrain)
    print("for the ",col,"the cross validation score is",  yeast_svc.score(Xtest,ytest))
    print('*', end='')


Linear kernel
for the  Class1 the cross validation score is 0.7973112719751809
*for the  Class2 the cross validation score is 0.6318510858324715
*for the  Class3 the cross validation score is 0.7114788004136504
*for the  Class4 the cross validation score is 0.7549120992761117
*for the  Class5 the cross validation score is 0.7724922440537746
*for the  Class6 the cross validation score is 0.7476732161323681
*for the  Class7 the cross validation score is 0.8210961737331954
*for the  Class8 the cross validation score is 0.7993795243019648
*for the  Class9 the cross validation score is 0.9255429162357808
*for the  Class10 the cross validation score is 0.8872802481902792
*for the  Class11 the cross validation score is 0.8728024819027922
*for the  Class12 the cross validation score is 0.7621509824198552
*for the  Class13 the cross validation score is 0.7549120992761117
*for the  Class14 the cross validation score is 0.9824198552223371
* done!
rbf kernel
for the  Class1 the cross validation sc

**Congratulations**, you reached the end of the practice session! 