<a href="https://colab.research.google.com/github/kochlisGit/Advanced-ML/blob/main/Multi_Label_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Title**

Assignment 2 - Multi-Label Learning

**Course**

Advanced Machine Learning Topics - Master in Artificial Intelligence

**Authors**:

1.   Anastasia Papadopoulou
2.   Vasileios Kochliaridis

In [1]:
import warnings
warnings.filterwarnings("ignore")

Let's download the DeliciousMIL dataset!

DeliciousMIL consists of a subset of tagged web pages from the social bookmarking site delicious.com. The original web pages were obtained from DeliciousT140 dataset. Users of the website delicious.com bookmarked each page with word tags. The class labels of this dataset are: reference, design, programming, internet, computer, web, java, writing, English, grammar, style, language, books, education, philosophy, politics, religion, science, history, and culture.

This dataset provides ground-truth class labels to evaluate performance of multi-instance learning models on both instance-level and bag-level label predictions. Each text document is a bag within a multi-instance learning framework consisting of multiple sentences (instances). The goal is to predict document-level and sentence-level class labels on the test set using a model which is trained given only the document-level class labels in the training set. To evaluate performance of such a model, we have manually labeled 1468 randomly selected sentences from the test documents.

In [2]:
import requests
from zipfile import ZipFile

dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00418/DeliciousMIL.zip'
download_zipfile = 'deliciousmil.zip'
train_inputs_fp = 'Data/train-data.dat'
train_labels_fp = 'Data/train-label.dat'
test_inputs_fp = 'Data/test-data.dat'
test_labels_fp = 'Data/test-label.dat'
label_names_fp = 'Data/labels.txt'
token_names_fp = 'Data/vocabs.txt'

req = requests.get(dataset_url)
with open(download_zipfile, 'wb') as output_file:
  output_file.write(req.content)
print('Download completed!\n')

zf = ZipFile(download_zipfile, 'r')

Download completed!



Reading the labels & tokens (vocals).

In [3]:
def read_dict_from_textfile(fp):
  d = {}

  with zf.open(fp, 'r') as txtfile:
    lines = txtfile.read().splitlines()
    for line in lines:
      name, i = line.decode('utf-8').split(', ')
      d[int(i)] = name
  return d


labels_dict = read_dict_from_textfile(label_names_fp)
print('Labels\n' + str(labels_dict))

tokens_dict = read_dict_from_textfile(token_names_fp)

print('\nFirst 5 Tokens')
for i in range(10):
  print('{}:\t{}'.format(i, tokens_dict[i]))

print('Extracted {} tokens'.format(len(tokens_dict)))

Labels
{0: 'programming', 1: 'style', 2: 'reference', 3: 'java', 4: 'web', 5: 'internet', 6: 'culture', 7: 'design', 8: 'education', 9: 'language', 10: 'books', 11: 'writing', 12: 'computer', 13: 'english', 14: 'politics', 15: 'history', 16: 'philosophy', 17: 'science', 18: 'religion', 19: 'grammar'}

First 5 Tokens
0:	defer
1:	wood
2:	spider
3:	hang
4:	woodi
5:	comic
6:	local
7:	seven
8:	orthographi
9:	origin
Extracted 8520 tokens


Reading training & test documents.

In [4]:
def read_inputs_from_file(fp, tokens):
  inputs = []
  
  with zf.open(fp, 'r') as txtfile:
    lines = txtfile.read().decode('utf-8').splitlines()
    for line in lines:
      data = line.split(' ')
      num_doc_sentences = int(data[0][1: -1])
      doc_index = 1
      doc = ''

      for i in range(num_doc_sentences):
        num_sentence_tokens = int(data[doc_index][1: -1])
        doc_index += 1

        for i in range(doc_index, num_sentence_tokens + doc_index):
          doc += tokens[int(data[i])] + ' '
        doc_index += num_sentence_tokens
      inputs.append(doc)
    return inputs

train_docs = read_inputs_from_file(train_inputs_fp, tokens_dict)
test_docs = read_inputs_from_file(test_inputs_fp, tokens_dict)

print('\nFirst 5 train docs')
for i in range(5):
  print('{}:\t{}'.format(i, train_docs[i]))
print('Extracted {} documents'.format(len(train_docs)))


First 5 train docs
0:	rubi rail helper demo more info auto complet see new helper action 
1:	null length substr locat exec messag messag pleas edit remov follow word content roll stone com news song previou next page good vibrat beach boy smell teen spirit nirvana want hold hand beatl hound dog elvi god know beach boy walk line johnni cash heaven led zeppelin sympathi devil roll stone river deep mountain high turner woman cri bob day buddi holli georgia mind ray charl heartbreak hotel elvi bridg over troubl water simon track tear robinson miracl messag flash five man love woman long tall salli littl richard whole jerri lee lewi california girl beach boy brand new bag jame brown whole love led zeppelin strawberri field forev beatl mysteri train elvi feel good jame brown 
2:	substr remov addit valu return function result fals progress through version final present adopt wide rang industri project spring focus around provid way manag busi object spring both comprehens modular maximum val

Vectorizing documents by Term Frequency - Inverse Document Frequency (TF-IDF). TF-IDF is considered better than Count Vectorizers because it not only focuses on the frequency of words present in the document but also provides the importance of the words. We can then remove the words that are less important for analysis, hence making the model building less complex by reducing the input dimensions.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(vocabulary=None)
train_inputs = tfidf_vectorizer.fit_transform(train_docs)
test_inputs = TfidfVectorizer(vocabulary=tfidf_vectorizer.get_feature_names_out()).fit_transform(test_docs)

train_inputs.shape, test_inputs.shape

((8251, 8520), (3983, 8520))

Reading traing & test labels



In [6]:
import numpy as np


def read_labels_from_file(fp):
  labels = []

  with zf.open(fp, 'r') as txtfile:
    lines = txtfile.read().decode('utf-8').splitlines()
    for line in lines:
      doc_label = []

      for output in line.split(' '):
        doc_label.append(int(output))
      labels.append(doc_label)
  return np.int8(labels)


train_labels = read_labels_from_file(train_labels_fp)
test_labels = read_labels_from_file(test_labels_fp)

print('Labels of the first 5 docs')
print(train_labels[0:5])
train_labels.shape, test_labels.shape

Labels of the first 5 docs
[[1 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0]
 [1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


((8251, 20), (3983, 20))

The main thing we notice about each document is that they may have multiple tags. For example, the 1st document which contains the words:

`rubi rail helper demo more info auto complet see new helper action`

has the following tags:


```
1. Programming
2. Reference
3. Java
4. Web
5. Computer
```

In the following example, we are going to use 3 training methods for multi-label data:

1.   Binary Relevance
2.   Regressor Chain
3.   Multi-Target Models

Then, we are going to evaluate each method using the following multi-label evaluation metrics:

1.   Subset Accuracy
2.   Coverage Error
3.   Ranking Loss
4.   Average Precision





In [13]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import zero_one_loss, coverage_error, label_ranking_loss, label_ranking_average_precision_score

RANDOM_STATE = 0

classifiers = {
    'Multi Target Classifier': MultiOutputClassifier(LogisticRegression(class_weight='balanced', random_state=RANDOM_STATE)),
    'Chain Classifier': ClassifierChain(LogisticRegression(random_state=RANDOM_STATE)),
    'Random Forest Classifier': RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'K-nearest Neighbors Classifier': KNeighborsClassifier(n_neighbors=5),
    'Multi-layer Perceptron Classifier': MLPClassifier(hidden_layer_sizes=(100,), random_state=RANDOM_STATE)
}

for classifier_name, clf in classifiers.items():
  clf.fit(train_inputs, train_labels)
  pred_labels = clf.predict(test_inputs)
  print('\nEvaluating {}'.format(classifier_name))
  print(classification_report(test_labels,pred_labels,zero_division='warn'))

  if classifier_name == 'Chain Classifier' or classifier_name == 'Multi-layer Perceptron Classifier':
    proba_labels = clf.predict_proba(test_inputs)
  else:
    proba_labels = np.array([[k[1] for k in i] for i in clf.predict_proba(test_inputs)]).T
  
  print('Subset accuracy = {}'.format((1-zero_one_loss(test_labels, pred_labels))))
  print('Coverage error = {}'.format((coverage_error(test_labels, pred_labels))))
  print('Ranking loss = {}'.format((label_ranking_loss(test_labels, pred_labels))))
  print('Average precision = {}'.format((label_ranking_average_precision_score(test_labels, proba_labels))))


Evaluating Multi Target Classifier
              precision    recall  f1-score   support

           0       0.70      0.80      0.75       977
           1       0.39      0.62      0.48       228
           2       0.57      0.61      0.59      1558
           3       0.52      0.72      0.61       372
           4       0.56      0.68      0.61      1050
           5       0.34      0.58      0.43       537
           6       0.38      0.68      0.49       702
           7       0.54      0.62      0.58      1079
           8       0.49      0.65      0.56       803
           9       0.50      0.64      0.56       483
          10       0.44      0.62      0.52       507
          11       0.42      0.58      0.49       478
          12       0.36      0.60      0.45       509
          13       0.37      0.59      0.45       355
          14       0.42      0.66      0.51       392
          15       0.37      0.63      0.47       441
          16       0.33      0.61      0.43  

Now, we are going to try to convert our multi-label problem to a `binary classification` problem. Firstly, We need to count the frequencies of all 20 labels, and pick the label with the highest frequency.

In [24]:
label_counts = np.sum(train_labels, axis=0)

for label, counts in enumerate(label_counts):
  print('Label {}:\t{}'.format(label, counts))

most_frequent_label = np.argmax(label_counts)
print('\nMost Frequent Label in the dataset: {}'.format(most_frequent_label))

Label 0:	2050
Label 1:	479
Label 2:	3181
Label 3:	799
Label 4:	2203
Label 5:	1211
Label 6:	1471
Label 7:	2221
Label 8:	1559
Label 9:	1004
Label 10:	1034
Label 11:	939
Label 12:	1049
Label 13:	725
Label 14:	830
Label 15:	898
Label 16:	598
Label 17:	1001
Label 18:	411
Label 19:	224

Most Frequent Label in the dataset: 2


In [26]:
most_frequent_label = 2

def create_binary_labels(multi_labels):
  binary_labels = np.zeros(shape=multi_labels.shape[0], dtype=int)
  binary_labels_index = 0
  for x in multi_labels:
    if (x[most_frequent_label]):
      binary_labels[binary_labels_index] = 1
    binary_labels_index += 1
  return binary_labels

train_binary_labels = create_binary_labels(train_labels)
test_binary_labels = create_binary_labels(test_labels)

(8251,) (3983,)
[1 1 0 0 0 1 0 0 1 0]


In [29]:
def convert_multilabels_to_binary(labels, most_frequent_label):
  binary_labels = np.zeros(shape=labels.shape[0], dtype=np.int8)
  ones_indices = np.where(labels[:, most_frequent_label] == 1)
  binary_labels[ones_indices] = 1
  return binary_labels

train_binary_labels = convert_multilabels_to_binary(train_labels, most_frequent_label)
test_binary_labels = convert_multilabels_to_binary(test_labels, most_frequent_label)

for i in range(10):
  print('Original Multi Target Label: {}\tvs Binary Converted Label: {}'.format(train_labels[i], train_binary_labels[i]))

Original Multi Target Label: [1 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]	vs Binary Converted Label: 1
Original Multi Target Label: [0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0]	vs Binary Converted Label: 1
Original Multi Target Label: [1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]	vs Binary Converted Label: 0
Original Multi Target Label: [1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0]	vs Binary Converted Label: 0
Original Multi Target Label: [1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]	vs Binary Converted Label: 0
Original Multi Target Label: [0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0]	vs Binary Converted Label: 1
Original Multi Target Label: [1 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]	vs Binary Converted Label: 0
Original Multi Target Label: [0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]	vs Binary Converted Label: 0
Original Multi Target Label: [0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0]	vs Binary Converted Label: 1
Original Multi Target Label: [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0]	vs Binary Converted Label: 0


Train and evaluate binary classification

In [30]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import zero_one_loss

classifiers = {
    'Gaussian Naive Bayes Classifier': GaussianNB(),
    'Logistic Regression Classifier': LogisticRegression(random_state=RANDOM_STATE),
    'Support Vector Machine Classifier': LinearSVC(random_state=RANDOM_STATE),
    'K-nearest Neighbors Classifier': KNeighborsClassifier(n_neighbors=5),
    'Random Forest Classifier': RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
}

for classifier_name, clf in classifiers.items():
  clf.fit(train_inputs.toarray(), train_binary_labels)
  pred_labels = clf.predict(test_inputs.toarray())
  print('\nEvaluating {}'.format(classifier_name))
  print(classification_report(test_binary_labels,pred_labels,zero_division='warn'))
  
  print('Subset accuracy = {}'.format((1-zero_one_loss(test_binary_labels, pred_labels))))


Evaluating Gaussian Naive Bayes Classifier
              precision    recall  f1-score   support

           0       0.69      0.40      0.51      2425
           1       0.44      0.71      0.54      1558

    accuracy                           0.53      3983
   macro avg       0.56      0.56      0.52      3983
weighted avg       0.59      0.53      0.52      3983

Subset accuracy = 0.5254833040421792

Evaluating Logistic Regression Classifier
              precision    recall  f1-score   support

           0       0.69      0.87      0.77      2425
           1       0.66      0.40      0.49      1558

    accuracy                           0.68      3983
   macro avg       0.67      0.63      0.63      3983
weighted avg       0.68      0.68      0.66      3983

Subset accuracy = 0.6834044689932212

Evaluating Support Vector Machine Classifier
              precision    recall  f1-score   support

           0       0.70      0.76      0.73      2425
           1       0.57      0