In [1]:
import scipy
import pandas as pd
from scipy.io import arff
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split


Data: this example uses a real-world data set from the repository provided with the MULAN package. The data sets in that repository are distributed in ARFF format. Here I have downloaded the yeast data set from it.

In [2]:
data, meta = scipy.io.arff.loadarff('datasets/multi-label/yeast-train.arff')
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Att1,Att2,Att3,Att4,Att5,Att6,Att7,Att8,Att9,Att10,...,Class5,Class6,Class7,Class8,Class9,Class10,Class11,Class12,Class13,Class14
0,0.0937,0.139771,0.062774,0.007698,0.083873,-0.119156,0.073305,0.00551,0.027523,0.043477,...,b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0'
1,-0.022711,-0.050504,-0.035691,-0.065434,-0.084316,-0.37856,0.038212,0.08577,0.182613,-0.055544,...,b'0',b'0',b'1',b'1',b'0',b'0',b'0',b'1',b'1',b'0'
2,-0.090407,0.021198,0.208712,0.102752,0.119315,0.041729,-0.021728,0.019603,-0.063853,-0.053756,...,b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'1',b'1',b'0'
3,-0.085235,0.00954,-0.013228,0.094063,-0.013592,-0.030719,-0.116062,-0.131674,-0.165448,-0.123053,...,b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'1',b'1',b'1'
4,-0.088765,-0.026743,0.002075,-0.043819,-0.005465,0.004306,-0.055865,-0.071484,-0.159025,-0.111348,...,b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0'
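The label columns in the preview are byte strings (`b'0'`, `b'1'`), so they usually need converting to integers before training. The following is a minimal sketch of that step. It assumes the label columns are exactly those whose names start with `Class`, as in the preview, and it uses a tiny hand-built frame (`df_demo`, a hypothetical stand-in) instead of the real file.

```python
import pandas as pd

# Tiny stand-in for the frame built from loadarff output:
# numeric attributes plus label columns stored as byte strings.
df_demo = pd.DataFrame({
    "Att1": [0.0937, -0.022711, -0.090407],
    "Att2": [0.139771, -0.050504, 0.021198],
    "Class1": [b"0", b"1", b"0"],
    "Class2": [b"1", b"1", b"0"],
})

# Assumption: label columns are exactly those named "Class...".
label_cols = [c for c in df_demo.columns if c.startswith("Class")]
feature_cols = [c for c in df_demo.columns if c not in label_cols]

# Features as a float matrix.
X_demo = df_demo[feature_cols].to_numpy(dtype=float)

# Decode each b'0'/b'1' entry to the integer 0/1.
y_demo = (
    df_demo[label_cols]
    .apply(lambda col: col.map(lambda b: int(b.decode("utf-8"))))
    .to_numpy()
)

print(X_demo.shape, y_demo.shape)
print(y_demo)
```

The same two steps should carry over to the full yeast frame, provided its label columns follow the `Class…` naming seen above.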


In [3]:
# generate a random multi-label dataset
X, y = make_multilabel_classification(sparse=True, n_labels=20,
                                      return_indicator='sparse', allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Let us go through the parameters used above.

sparse: If True, the feature matrix X is returned as a sparse matrix, that is, a matrix in which most elements are zero and only the non-zero ones are stored.

n_labels: The average number of labels per instance.

return_indicator: If ‘sparse’, return Y in the sparse binary indicator format.

allow_unlabeled: If True, some instances might not belong to any class.
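To see what these parameters produce, the sketch below generates a small data set and inspects it. The `random_state=0` seed and the smaller `n_labels=2` are assumptions made here for a quick, repeatable demo; they are not the values used in the cell above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import make_multilabel_classification

X_demo, y_demo = make_multilabel_classification(
    sparse=True,                # sparse feature matrix
    n_labels=2,                 # assumed value for the demo
    return_indicator="sparse",  # sparse binary label matrix
    allow_unlabeled=False,      # every instance gets at least one label
    random_state=0,             # assumed seed for repeatability
)

# Both outputs are scipy sparse matrices.
print(sp.issparse(X_demo), sp.issparse(y_demo))

# Defaults give 100 samples, 20 features and 5 classes.
print(X_demo.shape, y_demo.shape)

# Number of labels assigned to each instance.
labels_per_row = np.asarray(y_demo.sum(axis=1)).ravel()
print(labels_per_row.min(), labels_per_row.mean())
```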

I've used sparse matrices everywhere, and scikit-multilearn also recommends working with data in sparse form, because real-world multi-label data is rarely dense: each instance is usually assigned only a few of the available labels.

Broadly, there are three approaches to solving a multi-label classification problem:
Problem Transformation
Adapted Algorithm
Ensemble approaches

### Problem Transformation

This approach transforms the multi-label problem into one or more single-label problems. It can be carried out in three different ways:

Binary Relevance
Classifier Chains
Label Powerset

#### Binary Relevance
Binary Relevance treats each label as a separate single-label (binary) classification problem and trains one independent classifier per label.
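The idea can be sketched by hand with plain scikit-learn. This is an illustrative sketch only, not how scikit-multilearn implements it; the demo data, its parameters, and the names `models` and `br_pred` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import GaussianNB

# Small dense demo data (assumed parameters, chosen for illustration).
X_demo, y_demo = make_multilabel_classification(
    n_samples=60, n_features=10, n_classes=4,
    allow_unlabeled=False, random_state=0,
)

# Fit one independent binary classifier per label column.
models = [GaussianNB().fit(X_demo, y_demo[:, j])
          for j in range(y_demo.shape[1])]

# Predict each label separately and stack the columns back together.
br_pred = np.column_stack([m.predict(X_demo) for m in models])
print(br_pred.shape)
```

Because every classifier is trained in isolation, this approach cannot use correlations between labels.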

In [8]:
# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(GaussianNB())

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

# For multi-label input, accuracy_score computes subset accuracy: the predicted
# set of labels must exactly match the true set of labels.
accuracy_score(y_test, predictions)

0.75757575757575757
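Subset accuracy is a strict metric: a row with a single wrong label counts as a complete miss. The small sketch below makes that concrete on hand-written arrays. `hamming_loss` is not used in the original cells and appears here only for contrast, since it scores each label slot individually.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 1],   # exact match
                   [0, 1, 1],   # one label wrong, so the whole row is a miss
                   [1, 1, 0]])  # exact match

# Subset accuracy: 2 of the 3 rows match exactly.
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: 1 of the 9 individual label slots is wrong.
ham = hamming_loss(y_true, y_pred)

print(subset_acc, ham)
```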

#### Classifier Chains

Here the first classifier is trained on the input data alone, and each subsequent classifier is trained on the input data plus the labels handled by all the previous classifiers in the chain. This lets later classifiers exploit correlations between labels.
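A hand-rolled sketch of the chain, using plain scikit-learn, may help. It is an illustration under assumptions (fixed label order 0, 1, 2, ..., true labels used as extra features during training, predicted labels at prediction time), not the scikit-multilearn implementation. The demo data and the names `chain` and `cc_pred` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import GaussianNB

# Small dense demo data (assumed parameters, chosen for illustration).
X_demo, y_demo = make_multilabel_classification(
    n_samples=60, n_features=10, n_classes=4,
    allow_unlabeled=False, random_state=0,
)
n_out = y_demo.shape[1]

# Training: classifier j sees the features plus the true labels 0..j-1.
chain = []
for j in range(n_out):
    X_aug = np.hstack([X_demo, y_demo[:, :j]])
    chain.append(GaussianNB().fit(X_aug, y_demo[:, j]))

# Prediction: classifier j sees the features plus the labels
# already predicted by classifiers 0..j-1.
cc_pred = np.zeros((X_demo.shape[0], n_out), dtype=int)
for j, model in enumerate(chain):
    X_aug = np.hstack([X_demo, cc_pred[:, :j]])
    cc_pred[:, j] = model.predict(X_aug)

print(cc_pred.shape)
```

The first model is fitted on 10 columns and the last on 13, which is the "input space plus previous labels" growth described above.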

In [9]:
# using classifier chains
from skmultilearn.problem_transform import ClassifierChain
from sklearn.naive_bayes import GaussianNB

# initialize classifier chains multi-label classifier
# with a gaussian naive bayes base classifier
classifier = ClassifierChain(GaussianNB())

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

accuracy_score(y_test,predictions)

0.84848484848484851

#### Label Powerset

Here we transform the problem into a multi-class problem: every unique label combination found in the training data becomes one class, and a single multi-class classifier is trained on those classes.
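The transformation itself is short enough to sketch by hand. The code below is an illustrative sketch with plain NumPy and scikit-learn, not the scikit-multilearn implementation; the demo data and the names `combos`, `class_ids` and `lp_pred` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import GaussianNB

# Small dense demo data (assumed parameters, chosen for illustration).
X_demo, y_demo = make_multilabel_classification(
    n_samples=60, n_features=10, n_classes=4,
    allow_unlabeled=False, random_state=0,
)

# Map every distinct label row to one class id.
combos, class_ids = np.unique(y_demo, axis=0, return_inverse=True)
class_ids = class_ids.ravel()  # guard: some NumPy versions return a 2-D inverse

# Train a single multi-class classifier on those ids.
clf = GaussianNB().fit(X_demo, class_ids)

# Predict a class id, then look up the label row it stands for.
lp_pred = combos[clf.predict(X_demo)]

print(len(combos), lp_pred.shape)
```

Note that such a model can only ever predict label combinations it has seen during training.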

In [10]:
# using Label Powerset
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# initialize Label Powerset multi-label classifier
# with a gaussian naive bayes base classifier
classifier = LabelPowerset(GaussianNB())

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

accuracy_score(y_test,predictions)

0.81818181818181823

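In this run, Label Powerset scores higher than Binary Relevance (0.82 vs. 0.76) but slightly lower than Classifier Chains (0.85). Its main disadvantage is that the number of classes equals the number of distinct label combinations in the training data, which can grow quickly as the training set grows. That increases model complexity and leaves many classes with very few examples, which can lower accuracy.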

Now, let us look at the second approach to solving a multi-label classification problem.



### Adapted Algorithm

An adapted algorithm, as the name suggests, adapts a learning algorithm to perform multi-label classification directly, rather than transforming the problem into a set of simpler problems. For example, MLkNN is the multi-label version of kNN. So, let us quickly try it on our randomly generated data set.



In [11]:
from skmultilearn.adapt import MLkNN

classifier = MLkNN(k=20)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

accuracy_score(y_test,predictions)

0.81818181818181823

scikit-learn provides built-in support for multi-label classification in some of its estimators, such as random forests and the ridge classifier. So, you can fit them directly on a label indicator matrix and predict the output.
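As a minimal sketch of that native support, the example below fits a `RandomForestClassifier` directly on a dense label indicator matrix. The demo data, the seeds and `n_estimators=50` are assumptions chosen for illustration, so the score it prints is not comparable with the results above.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Dense demo data (assumed parameters, chosen for illustration).
X_demo, y_demo = make_multilabel_classification(
    n_samples=100, n_classes=5, allow_unlabeled=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42,
)

# The forest accepts the 2-D label indicator matrix as-is.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Predictions come back as one row of 0/1 labels per test instance.
rf_pred = forest.predict(X_te)
rf_acc = accuracy_score(y_te, rf_pred)  # subset accuracy
print(rf_pred.shape, rf_acc)
```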