# Experimento multi-label - corpus Reuters
Link útil: https://www.nltk.org/book/ch02.html

## Obtenção dos dados

Vamos via `nltk`!

In [1]:
import nltk
from nltk.corpus import reuters

In [2]:
nltk.download('punkt')
nltk.download('reuters')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package reuters to /root/nltk_data...


True

In [3]:
!unzip '/root/nltk_data/corpora/reuters.zip' -d '/root/nltk_data/corpora/'

[1;30;43mA saída de streaming foi truncada nas últimas 5000 linhas.[0m
  inflating: /root/nltk_data/corpora/reuters/training/2231  
  inflating: /root/nltk_data/corpora/reuters/training/2232  
  inflating: /root/nltk_data/corpora/reuters/training/2234  
  inflating: /root/nltk_data/corpora/reuters/training/2236  
  inflating: /root/nltk_data/corpora/reuters/training/2237  
  inflating: /root/nltk_data/corpora/reuters/training/2238  
  inflating: /root/nltk_data/corpora/reuters/training/2239  
  inflating: /root/nltk_data/corpora/reuters/training/2240  
  inflating: /root/nltk_data/corpora/reuters/training/2244  
  inflating: /root/nltk_data/corpora/reuters/training/2246  
  inflating: /root/nltk_data/corpora/reuters/training/2247  
  inflating: /root/nltk_data/corpora/reuters/training/2249  
  inflating: /root/nltk_data/corpora/reuters/training/225  
  inflating: /root/nltk_data/corpora/reuters/training/2251  
  inflating: /root/nltk_data/corpora/reuters/training/2252  
  inflating: 

Dentro de `nltk.corpus`, temos os seguintes comandos básicos:

| Example	| Description |
|----------|:-------------:|
|fileids()	| the files of the corpus |
|fileids([categories]) |the files of the corpus corresponding to these categories|
|categories()	| the categories of the corpus|
|categories([fileids])	|the categories of the corpus corresponding to these files|
|raw()	|the raw content of the corpus|
|raw(fileids=[f1,f2,f3])	| the raw content of the specified files|
|raw(categories=[c1,c2])|	the raw content of the specified categories|
|words()	|the words of the whole corpus|
|words(fileids=[f1,f2,f3])|	the words of the specified fileids|
|words(categories=[c1,c2])|	the words of the specified categories|
|sents()|	the sentences of the whole corpus|
|sents(fileids=[f1,f2,f3])	|the sentences of the specified fileids|
|sents(categories=[c1,c2])	|the sentences of the specified categories|
|abspath(fileid)|	the location of the given file on disk|
|encoding(fileid)|	the encoding of the file (if known)|
|open(fileid)|	open a stream for reading the given corpus file|
|root|	if the path to the root of locally installed corpus|
|readme()|	the contents of the README file of the corpus|

In [4]:
documents = reuters.fileids()

In [5]:
raw_documents = [reuters.raw(documento) for documento in documents]

Quando carregamos o corpus, a lib já inclui também uma lista de stopwords. Basta abrir o arquivo e incluir em um formato que seja útil pra nós.

In [6]:
stopwords = []
with open('/root/nltk_data/corpora/reuters/stopwords', 'r') as file:
  for line in file:
    line = line.strip()
    stopwords.append(line)

In [7]:
len(stopwords)

571

Aqui definimos nossa função de pré-processamento. Vamos retirar \n, por em lower case e retirar stopwords.

In [8]:
from tqdm.notebook import tqdm
import re

In [32]:
def preprocessing(doc_list):
  processed_docs = []
  for document in tqdm(doc_list):
    handle_document = re.sub("[^\w]", " ", document).lower().split()
    for idx, word in enumerate(handle_document):
      if word in stopwords:
        handle_document.pop(idx)
      hey = ' '.join(handle_document)
    processed_docs.append(hey)
  return processed_docs

In [11]:
labels = reuters.categories()

In [12]:
len(labels)

90

In [15]:
labels[0]

'acq'

Pegando ids dos docs para indexar corretamente treino e teste.

In [13]:
train_docs_ids = list(filter(lambda doc: doc.startswith('train'), documents))
test_docs_ids = list(filter(lambda doc: doc.startswith('test'), documents))

In [14]:
len(train_docs_ids), len(test_docs_ids)

(7769, 3019)

YAY! Boa prof Vinicius. Merecia um cafézinho com pão de queijo no Biscoito Mineiro.

In [29]:
train_docs = [reuters.raw(documento) for documento in train_docs_ids]
y_train = [reuters.categories(documento) for documento in train_docs_ids]
test_docs = [reuters.raw(documento) for documento in test_docs_ids]
y_test = [reuters.categories(documento) for documento in test_docs_ids]

In [30]:
train_docs[0]

'BAHIA COCOA REVIEW\n  Showers continued throughout the week in\n  the Bahia cocoa zone, alleviating the drought since early\n  January and improving prospects for the coming temporao,\n  although normal humidity levels have not been restored,\n  Comissaria Smith said in its weekly review.\n      The dry period means the temporao will be late this year.\n      Arrivals for the week ended February 22 were 155,221 bags\n  of 60 kilos making a cumulative total for the season of 5.93\n  mln against 5.81 at the same stage last year. Again it seems\n  that cocoa delivered earlier on consignment was included in the\n  arrivals figures.\n      Comissaria Smith said there is still some doubt as to how\n  much old crop cocoa is still available as harvesting has\n  practically come to an end. With total Bahia crop estimates\n  around 6.4 mln bags and sales standing at almost 6.2 mln there\n  are a few hundred thousand bags still in the hands of farmers,\n  middlemen, exporters and processors.\n  

In [31]:
y_train[0]

['cocoa']

Aplicando nossa função de pré-processamento nos docs:

In [33]:
train_docs_processed = preprocessing(train_docs)
test_docs_processed = preprocessing(test_docs)

HBox(children=(FloatProgress(value=0.0, max=7769.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=3019.0), HTML(value='')))




In [34]:
train_docs_processed[0]

'bahia cocoa review showers continued the week the bahia cocoa zone alleviating drought early january improving prospects the coming temporao normal humidity levels not restored comissaria smith in weekly review dry period means temporao be late year arrivals the week ended february 22 155 221 bags 60 kilos making cumulative total the season 5 93 mln 5 81 the stage year it that cocoa delivered earlier consignment included the arrivals figures comissaria smith there still doubt to much crop cocoa still as harvesting practically to end total bahia crop estimates 6 4 mln bags sales standing almost 6 2 mln are few hundred thousand bags in hands farmers middlemen exporters processors are doubts to much this cocoa be fit export shippers now experiencing dificulties obtaining bahia superior certificates view the lower quality recent weeks farmers sold good part their cocoa held consignment comissaria smith spot bean prices rose 340 350 cruzados arroba 15 kilos bean shippers reluctant offer ne

## Caracterização dos dados de treino e teste

Vamos usar o tradicional `tfidf`, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

tfidf_train = vectorizer.fit_transform(train_docs_processed)
tfidf_test = vectorizer.transform(test_docs_processed)

In [36]:
tfidf_train, tfidf_test

(<7769x26250 sparse matrix of type '<class 'numpy.float64'>'
 	with 470718 stored elements in Compressed Sparse Row format>,
 <3019x26250 sparse matrix of type '<class 'numpy.float64'>'
 	with 168362 stored elements in Compressed Sparse Row format>)

Como nosso problema é multilabel, i.e., cada instância pode ser associada com mais de um rótulo, precisamos codificar os labels de acordo. Para tanto, usamos o módulo `MultiLabelBinarizer` do sklearn.

In [37]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

In [41]:
y_train_mlb = mlb.fit_transform(y_train)
y_test_mlb = mlb.fit_transform(y_test)

In [43]:
y_train_mlb, y_test_mlb

(array([[0, 0, 0, ..., 0, 0, 0],
        [1, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]))

In [42]:
mlb.classes_

array(['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',
       'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn',
       'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk',
       'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut',
       'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt',
       'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead',
       'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx',
       'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr',
       'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel',
       'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil',
       'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship',
       'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean',
       'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed',
       'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', '

## Las classificassiones

Essa primeira parte foi feita em aula, e vamos comparar os resultados com outras técnicas.

In [44]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

As duas células abaixo demoram uns 5 minutinhos cada pra rodar, então _do not desperate_.

In [45]:
clf = OneVsRestClassifier(SVC()).fit(tfidf_train, y_train_mlb)

In [47]:
preds_prof = clf.predict(tfidf_test)

In [51]:
from sklearn.metrics import classification_report

print(classification_report(y_test_mlb, preds_prof, target_names=mlb.classes_))

                 precision    recall  f1-score   support

            acq       0.98      0.95      0.96       719
           alum       1.00      0.13      0.23        23
         barley       1.00      0.57      0.73        14
            bop       1.00      0.43      0.60        30
        carcass       1.00      0.11      0.20        18
     castor-oil       0.00      0.00      0.00         1
          cocoa       1.00      0.67      0.80        18
        coconut       0.00      0.00      0.00         2
    coconut-oil       0.00      0.00      0.00         3
         coffee       0.96      0.86      0.91        28
         copper       1.00      0.61      0.76        18
     copra-cake       0.00      0.00      0.00         1
           corn       0.94      0.61      0.74        56
         cotton       1.00      0.35      0.52        20
     cotton-oil       0.00      0.00      0.00         2
            cpi       1.00      0.32      0.49        28
            cpu       0.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## scikit-multilearn

Uma lib bem interessante é a `scikit-multilearn`, que fornece vários wrappers para algoritmos que lidam com multilabel, o que facilita bastante nossa vida. O link da bendita é o seguinte, http://scikit.ml/index.html.

Vamos testar alguns dos modelos usando os códigos fornecidos na documentação.

In [18]:
!pip install scikit-multilearn

Collecting scikit-multilearn
[?25l  Downloading https://files.pythonhosted.org/packages/bb/1f/e6ff649c72a1cdf2c7a1d31eb21705110ce1c5d3e7e26b2cc300e1637272/scikit_multilearn-0.2.0-py3-none-any.whl (89kB)
[K     |███▊                            | 10kB 14.9MB/s eta 0:00:01[K     |███████▍                        | 20kB 19.4MB/s eta 0:00:01[K     |███████████                     | 30kB 11.8MB/s eta 0:00:01[K     |██████████████▊                 | 40kB 9.1MB/s eta 0:00:01[K     |██████████████████▍             | 51kB 4.4MB/s eta 0:00:01[K     |██████████████████████          | 61kB 5.0MB/s eta 0:00:01[K     |█████████████████████████▊      | 71kB 5.2MB/s eta 0:00:01[K     |█████████████████████████████▍  | 81kB 5.6MB/s eta 0:00:01[K     |████████████████████████████████| 92kB 3.6MB/s 
[?25hInstalling collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0


### Binary Relevance kNN

Na descrição da lib, temos:
- Binary Relevance multi-label classifier based on k-Nearest Neighbors method. This version of the classifier assigns the labels that are assigned to at least half of the neighbors.

http://scikit.ml/api/skmultilearn.adapt.brknn.html#skmultilearn.adapt.BRkNNaClassifier

In [58]:
from skmultilearn.adapt import BRkNNaClassifier

classifier = BRkNNaClassifier(k=3)

# train
classifier.fit(tfidf_train, y_train_mlb)

# predict
brknn_preds = classifier.predict(tfidf_test)

In [60]:
print(classification_report(y_test_mlb, brknn_preds, target_names=mlb.classes_))

                 precision    recall  f1-score   support

            acq       0.93      0.67      0.78       719
           alum       1.00      0.43      0.61        23
         barley       0.70      0.50      0.58        14
            bop       0.59      0.67      0.62        30
        carcass       0.69      0.61      0.65        18
     castor-oil       0.00      0.00      0.00         1
          cocoa       1.00      0.83      0.91        18
        coconut       1.00      1.00      1.00         2
    coconut-oil       0.00      0.00      0.00         3
         coffee       0.83      0.86      0.84        28
         copper       0.59      0.72      0.65        18
     copra-cake       0.00      0.00      0.00         1
           corn       0.70      0.62      0.66        56
         cotton       0.77      0.50      0.61        20
     cotton-oil       0.00      0.00      0.00         2
            cpi       0.71      0.54      0.61        28
            cpu       1.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Multilabel k Nearest Neighbours

Na documentação da lib, temos a seguinte descrição: 

- MLkNN builds uses k-NearestNeighbors find nearest examples to a test class and uses Bayesian inference to select assigned labels.

http://scikit.ml/api/skmultilearn.adapt.mlknn.html#skmultilearn.adapt.MLkNN

In [56]:
from skmultilearn.adapt import MLkNN

classifier = MLkNN(k=3)

# train
classifier.fit(tfidf_train, y_train_mlb)

# predict
mlknn_preds = classifier.predict(tfidf_test)


In [57]:
print(classification_report(y_test_mlb, mlknn_preds, target_names=mlb.classes_))

                 precision    recall  f1-score   support

            acq       0.93      0.67      0.78       719
           alum       1.00      0.43      0.61        23
         barley       0.70      0.50      0.58        14
            bop       0.59      0.67      0.62        30
        carcass       0.69      0.61      0.65        18
     castor-oil       0.00      0.00      0.00         1
          cocoa       1.00      0.83      0.91        18
        coconut       1.00      1.00      1.00         2
    coconut-oil       0.00      0.00      0.00         3
         coffee       0.83      0.86      0.84        28
         copper       0.75      0.67      0.71        18
     copra-cake       0.00      0.00      0.00         1
           corn       0.70      0.62      0.66        56
         cotton       0.77      0.50      0.61        20
     cotton-oil       0.00      0.00      0.00         2
            cpi       0.71      0.54      0.61        28
            cpu       1.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Problem Transformation approaches

A ideia aqui é converter problemas multi-label em problemas single-label. 


#### BinaryRelevance

Aqui, o cenário é que o classificador "treats each label as a separate single-class classification problem".

http://scikit.ml/api/skmultilearn.problem_transform.br.html#skmultilearn.problem_transform.BinaryRelevance

In [65]:
from skmultilearn.problem_transform import BinaryRelevance

# initialize Binary Relevance multi-label classifier
# with an SVM classifier
# SVM in scikit only supports the X matrix in sparse representation

br_classifier = BinaryRelevance(
    classifier = SVC(),
    require_dense = [False, True]
)

# train
br_classifier.fit(tfidf_train, y_train_mlb)

# predict
br_preds = br_classifier.predict(tfidf_test)

In [66]:
print(classification_report(y_test_mlb, br_preds, target_names=mlb.classes_))

                 precision    recall  f1-score   support

            acq       0.98      0.95      0.96       719
           alum       1.00      0.13      0.23        23
         barley       1.00      0.57      0.73        14
            bop       1.00      0.43      0.60        30
        carcass       1.00      0.11      0.20        18
     castor-oil       0.00      0.00      0.00         1
          cocoa       1.00      0.67      0.80        18
        coconut       0.00      0.00      0.00         2
    coconut-oil       0.00      0.00      0.00         3
         coffee       0.96      0.86      0.91        28
         copper       1.00      0.61      0.76        18
     copra-cake       0.00      0.00      0.00         1
           corn       0.94      0.61      0.74        56
         cotton       1.00      0.35      0.52        20
     cotton-oil       0.00      0.00      0.00         2
            cpi       1.00      0.32      0.49        28
            cpu       0.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


#### Classifier Chains

Aqui, o cenário é que o classificador "treats each label as a part of a conditioned chain of single-class classification problems". 

http://scikit.ml/api/skmultilearn.problem_transform.cc.html#skmultilearn.problem_transform.ClassifierChain

In [63]:
from skmultilearn.problem_transform import ClassifierChain
from sklearn.naive_bayes import MultinomialNB

clf_chain = ClassifierChain(
    classifier = MultinomialNB(alpha=0.8)
)

clf_chain.fit(tfidf_train, y_train_mlb)

chain_preds = clf_chain.predict(tfidf_test)

In [64]:
print(classification_report(y_test_mlb, chain_preds, target_names=mlb.classes_))

                 precision    recall  f1-score   support

            acq       0.99      0.58      0.73       719
           alum       0.00      0.00      0.00        23
         barley       0.00      0.00      0.00        14
            bop       0.00      0.00      0.00        30
        carcass       0.00      0.00      0.00        18
     castor-oil       0.00      0.00      0.00         1
          cocoa       0.00      0.00      0.00        18
        coconut       0.00      0.00      0.00         2
    coconut-oil       0.00      0.00      0.00         3
         coffee       0.00      0.00      0.00        28
         copper       0.00      0.00      0.00        18
     copra-cake       0.00      0.00      0.00         1
           corn       0.00      0.00      0.00        56
         cotton       0.00      0.00      0.00        20
     cotton-oil       0.00      0.00      0.00         2
            cpi       0.00      0.00      0.00        28
            cpu       0.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Resumo dos resultados

In [67]:
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd

In [70]:
results = pd.DataFrame(columns = ['Precision', 'Recall', 'F1 score', 'support']
          )
results.loc['OneVsRestClassifier + SVC'] = precision_recall_fscore_support(
          y_test_mlb, 
          preds_prof ,
          average = 'weighted'
          )
results.loc['Binary Relevance KNN'] = precision_recall_fscore_support(
          y_test_mlb, 
          brknn_preds, 
          average = 'weighted'
          )
results.loc['Multilabel KNN'] = precision_recall_fscore_support(
          y_test_mlb, 
          mlknn_preds, 
          average = 'weighted'
          )
results.loc['Binary Relevance + SVM'] = precision_recall_fscore_support(
          y_test_mlb, 
          br_preds, 
          average = 'weighted'
          )
results.loc['Classifier Chain + MultinomialNB'] = precision_recall_fscore_support(
          y_test_mlb, 
          chain_preds, 
          average = 'weighted'
          )

  _warn_prf(average, modifier, msg_start, len(result))


In [71]:
results

Unnamed: 0,Precision,Recall,F1 score,support
OneVsRestClassifier + SVC,0.90997,0.721421,0.783086,
Binary Relevance KNN,0.794611,0.736111,0.752041,
Multilabel KNN,0.795732,0.736645,0.752728,
Binary Relevance + SVM,0.90997,0.721421,0.783086,
Classifier Chain + MultinomialNB,0.666005,0.391026,0.435059,
