# Yeast Data Set - A Multi-Label Classification Problem

A real-world data set from the repository provided by the MULAN package. The data sets in that repository are distributed in ARFF format; this notebook loads CSV versions of the train and test splits.
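Since the MULAN files are in ARFF format, a minimal sketch of reading ARFF into pandas with scipy.io.arff may be useful. The tiny ARFF text below is made up for illustration (it is not the real yeast file), and the assumption that the class attributes are declared as nominal {0,1} should be checked against the actual file.

```python
import io

import pandas as pd
from scipy.io import arff

# A tiny, made-up ARFF document (hypothetical data, not the real yeast file).
ARFF_TEXT = """@relation toy
@attribute Att1 numeric
@attribute Att2 numeric
@attribute Class1 {0,1}
@attribute Class2 {0,1}
@data
0.1,0.2,0,1
-0.3,0.4,1,1
"""

# loadarff accepts a filename or a file-like object and returns (records, metadata).
data, meta = arff.loadarff(io.StringIO(ARFF_TEXT))
df = pd.DataFrame(data)

# Nominal attributes come back as byte strings such as b'0'; convert them to integers.
for col in ["Class1", "Class2"]:
    df[col] = df[col].str.decode("utf-8").astype(int)

print(df)
```

With a real file, the path to the .arff file would be passed to loadarff instead of the in-memory text.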

In [1]:
#importing python packages

import pandas as pd
import scipy
from scipy.io import arff
from sklearn.model_selection import train_test_split

In [2]:
train = pd.read_csv('yeast-train.csv')

test = pd.read_csv('yeast-test.csv')

In [3]:
train.head(5)

Unnamed: 0,id,Att1,Att2,Att3,Att4,Att5,Att6,Att7,Att8,Att9,...,Class5,Class6,Class7,Class8,Class9,Class10,Class11,Class12,Class13,Class14
0,1,0.0937,0.139771,0.062774,0.007698,0.083873,-0.119156,0.073305,0.00551,0.027523,...,0,0,0,0,0,0,0,0,0,0
1,2,-0.022711,-0.050504,-0.035691,-0.065434,-0.084316,-0.37856,0.038212,0.08577,0.182613,...,0,0,1,1,0,0,0,1,1,0
2,3,-0.090407,0.021198,0.208712,0.102752,0.119315,0.041729,-0.021728,0.019603,-0.063853,...,0,0,0,0,0,0,0,1,1,0
3,4,-0.085235,0.00954,-0.013228,0.094063,-0.013592,-0.030719,-0.116062,-0.131674,-0.165448,...,0,0,0,0,0,0,0,1,1,1
4,5,-0.088765,-0.026743,0.002075,-0.043819,-0.005465,0.004306,-0.055865,-0.071484,-0.159025,...,0,0,0,0,0,0,0,0,0,0


Here, Att represents the attributes or the independent variables and Class represents the target variables.

For practice, we can also generate an artificial multi-label dataset.

from sklearn.datasets import make_multilabel_classification

The following call generates a random multi-label dataset:

**X, y = make_multilabel_classification(sparse = True, n_labels = 20, return_indicator = 'sparse', allow_unlabeled = False)**

Let us understand the parameters used above.

sparse: If True, returns a sparse matrix, where sparse matrix means a matrix having a large number of zero elements.

n_labels:  The average number of labels for each instance.

return_indicator: If ‘sparse’ return Y in the sparse binary indicator format.

allow_unlabeled: If True, some instances might not belong to any class.

You may have noticed that the generated data is requested in sparse form. scikit-multilearn also recommends using sparse data, because real-world multi-label data sets are rarely dense: the number of labels assigned to each instance is usually very small.
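To see what these parameters produce, here is a small sketch. The sizes (100 samples, 103 features, 14 classes, about 4 labels per instance) and the random_state are assumptions chosen to loosely resemble the yeast data; they are not taken from the original call.

```python
from scipy import sparse
from sklearn.datasets import make_multilabel_classification

# Sizes below are illustrative assumptions, not values from the yeast data set.
X, y = make_multilabel_classification(
    n_samples=100,
    n_features=103,
    n_classes=14,
    n_labels=4,                 # average number of labels per instance
    sparse=True,                # X is returned as a sparse matrix
    return_indicator="sparse",  # y is returned as a sparse binary indicator matrix
    allow_unlabeled=False,      # every instance gets at least one label
    random_state=0,
)

print(X.shape, y.shape)
print(sparse.issparse(X), sparse.issparse(y))

# With allow_unlabeled=False, no row of y is empty.
labels_per_instance = y.sum(axis=1)
print(int(labels_per_instance.min()))
```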

In [4]:
#printing out column names
for col in train.columns:
    print(col)

id
Att1
Att2
Att3
Att4
Att5
Att6
Att7
Att8
Att9
Att10
Att11
Att12
Att13
Att14
Att15
Att16
Att17
Att18
Att19
Att20
Att21
Att22
Att23
Att24
Att25
Att26
Att27
Att28
Att29
Att30
Att31
Att32
Att33
Att34
Att35
Att36
Att37
Att38
Att39
Att40
Att41
Att42
Att43
Att44
Att45
Att46
Att47
Att48
Att49
Att50
Att51
Att52
Att53
Att54
Att55
Att56
Att57
Att58
Att59
Att60
Att61
Att62
Att63
Att64
Att65
Att66
Att67
Att68
Att69
Att70
Att71
Att72
Att73
Att74
Att75
Att76
Att77
Att78
Att79
Att80
Att81
Att82
Att83
Att84
Att85
Att86
Att87
Att88
Att89
Att90
Att91
Att92
Att93
Att94
Att95
Att96
Att97
Att98
Att99
Att100
Att101
Att102
Att103
Class1
Class2
Class3
Class4
Class5
Class6
Class7
Class8
Class9
Class10
Class11
Class12
Class13
Class14


From the output above, we can see that the data has 103 attributes (features) and 14 classes, that is, 14 labels.

In [5]:
#splitting the features and labels into separate variables
feature_cols = ['Att{}'.format(i) for i in range(1, 104)]  # Att1 ... Att103
label_cols = ['Class{}'.format(i) for i in range(1, 15)]   # Class1 ... Class14
train_x = train[feature_cols]
train_y = train[label_cols]

In [6]:
#splitting the features and labels into separate variables
test_x = test[['Att{}'.format(i) for i in range(1, 104)]]   # Att1 ... Att103
test_y = test[['Class{}'.format(i) for i in range(1, 15)]]  # Class1 ... Class14

# Methods to solve Multi Label Classification

1. Problem transformation
2. Adapted Algorithm
3. Ensemble Approach

## Problem Transformation
**We transform the multi-label problem into one or more single-label problems.**
This can be done in three different ways:<br/><br/>

Binary Relevance<br/>
Classifier Chains<br/>
Label Powerset<br/>

**Binary Relevance**

This is the simplest technique: it treats each label as a separate binary (single-label) classification problem.
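The idea can be sketched by hand with plain scikit-learn: fit one binary classifier per label column and stack the predictions. This is only an illustration on made-up toy data (the sample sizes and random_state are assumptions), not the scikit-multilearn implementation.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import GaussianNB

# Toy data (an assumption for illustration): 200 samples, 10 features, 3 labels.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# Binary relevance: fit one independent binary classifier per label column.
models = [GaussianNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Predict each label separately and stack the columns into an indicator matrix.
Y_pred = np.column_stack([model.predict(X) for model in models])
print(Y_pred.shape)
```

Because every label gets its own model, nothing learned about one label can influence the prediction of another.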

In [7]:
# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB

# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(classifier=GaussianNB())

# train
classifier.fit(train_x, train_y)

# predict
predictions = classifier.predict(test_x)

In [8]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_y,predictions)

print("The accuracy for binary relevance model is {}%".format(acc*100))

The accuracy for binary relevance model is 10.359869138495092%


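We can see that this model is not performing very well. Note that for multi-label targets accuracy_score reports subset accuracy, so a prediction counts as correct only if all 14 labels match. Binary relevance also ignores any correlation between labels, because each label is predicted independently.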
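When reading the accuracy figures in this notebook, it helps to know that accuracy_score on a multi-label indicator matrix computes subset accuracy (the whole label row must match). The small hand-made example below, with made-up labels, contrasts it with the Hamming loss, which counts individual label errors.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up labels: 3 samples, 3 labels each.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 1],   # all labels correct
                   [0, 1, 1],   # one label wrong
                   [1, 0, 0]])  # one label wrong

# Subset accuracy: fraction of rows where every label matches (1 of 3 here).
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: fraction of individual label entries that are wrong (2 of 9 here).
ham = hamming_loss(y_true, y_pred)

print(subset_acc, ham)
```

Seven of the nine individual labels are right, yet only one of the three rows counts as correct under subset accuracy.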

**Classifier Chains**

In this method, the first classifier is trained on the input data alone, and each subsequent classifier is trained on the input space plus the labels of all the previous classifiers in the chain.<br/>
This is quite similar to binary relevance; the only difference is that it forms a chain in order to preserve label correlation. So, let’s try to implement this using the scikit-multilearn library.
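A hand-rolled sketch of the chaining idea is shown here with plain scikit-learn. It assumes a fixed label order, uses the true earlier labels during training and the predicted ones at prediction time, and runs on made-up toy data; the scikit-multilearn class may differ in its details.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import GaussianNB

# Toy data (an assumption for illustration): 200 samples, 10 features, 3 labels.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# Training: classifier j sees the features plus the true values of labels 0..j-1.
chain = []
for j in range(Y.shape[1]):
    X_aug = np.hstack([X, Y[:, :j]])
    chain.append(GaussianNB().fit(X_aug, Y[:, j]))

# Prediction: labels already predicted are appended to the features of later links.
Y_pred = np.zeros(Y.shape, dtype=int)
for j, model in enumerate(chain):
    X_aug = np.hstack([X, Y_pred[:, :j]])
    Y_pred[:, j] = model.predict(X_aug)

print(Y_pred.shape)
```

The last classifier in this sketch works with 12 input columns: the 10 original features and the 2 earlier labels.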

In [9]:
# using classifier chains
from skmultilearn.problem_transform import ClassifierChain

# initialize classifier chains multi-label classifier
# with a gaussian naive bayes base classifier
classifier = ClassifierChain(classifier = GaussianNB())

# train
classifier.fit(train_x, train_y)

# predict
predictions = classifier.predict(test_x)


acc = accuracy_score(test_y,predictions)

print("The accuracy for classifier chain model is {}%".format(acc*100))

The accuracy for classifier chain model is 9.269356597600872%


The accuracy is still very low, which suggests that there may be little correlation between the labels for the chain to exploit.

**Label Powerset**

In this method, we transform the problem into a multi-class problem: one multi-class classifier is trained on all the unique label combinations found in the training data.
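The transformation itself can be sketched in a few lines of plain Python: every distinct label row becomes one class id, and the mapping is inverted to turn predicted classes back into label rows. The label matrix below is made up for illustration.

```python
# Toy label matrix (hypothetical): 4 samples, 3 labels.
Y = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1],
     [0, 0, 0]]

# Assign a class id to each distinct label combination, in order of first appearance.
combo_to_class = {}
class_ids = [combo_to_class.setdefault(tuple(row), len(combo_to_class)) for row in Y]

# A multi-class classifier would now be trained on class_ids instead of Y.
print(class_ids)

# Invert the mapping to recover label rows from (predicted) class ids.
class_to_combo = {cls: combo for combo, cls in combo_to_class.items()}
decoded = [list(class_to_combo[cls]) for cls in class_ids]
print(decoded == Y)
```

A consequence of this encoding is that the classifier can only ever predict label combinations that appeared in the training data.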

In [10]:
# using Label Powerset
from skmultilearn.problem_transform import LabelPowerset

# initialize Label powerset multi-label classifier
# with a gaussian naive bayes base classifier
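classifier = LabelPowerset(classifier = GaussianNB())

# train
classifier.fit(train_x, train_y)

# predict
predictions = classifier.predict(test_x)


acc = accuracy_score(test_y,predictions)

print("The accuracy for label powerset model is {}%".format(acc*100))

The output recorded for this cell, 9.269356597600872%, is identical to the classifier chain result because the cell was originally run with ClassifierChain instead of LabelPowerset; with LabelPowerset the value may differ.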


# MLP classifier 

In [11]:
from sklearn.neural_network import MLPClassifier

classifier = MLPClassifier(alpha=0.1)

# train
classifier.fit(train_x, train_y)

# predict
predictions = classifier.predict(test_x)


acc = accuracy_score(test_y,predictions)

print("The accuracy for adapted algorithm model is {}%".format(acc*100))

The accuracy for adapted algorithm model is 18.974918211559434%




In [12]:
predictions[0]

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0])