***
# <font color=red>Multi-label classification with nltk and scikit-learn</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Team </font></p>

***

<font color=gray>ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

# Overview
This notebook shows you how to develop a multi-label text classification system on the Reuters Corpus. The skills taught in this notebook are applicable to a wide variety of tasks. Multi-label classification is not significantly more difficult than single-label classification. It does require some slightly different techniques, which are shown in this notebook. 

We use `scikit-learn` and `nltk` to build an effective multi-label classifier in minutes. We use the [Reuters Corpus](https://martin-thoma.com/nlp-reuters/) as our training dataset.

@Misc{Thom2017-reuters,
  Title                    = {The Reuters Dataset},
  Author                   = {Martin Thoma},
  Month                    = jul,
  Year                     = {2017},
  Url                      = {https://martin-thoma.com/nlp-reuters}
}

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

## Prerequisites:
 - Experience with the topic: Novice
 - Professional experience: None
 
This notebook is intended for data scientists who want to learn about Natural Language Processing tasks and for experienced data scientists who want to add another tool to their toolbox.

---

### First, import the necessary libraries:

In [None]:
import nltk
from nltk.corpus import reuters
from nltk.corpus import stopwords
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

Next, download the dataset and the list of stopwords:

In [None]:
nltk.download('reuters')
nltk.download('stopwords')

The distributor of the Reuters dataset also released their data-loading code to the public. We use it with slight modifications.

Reuters is a benchmark dataset for document classification. To be more precise, it is a multi-label (each document can belong to many classes) dataset. It has 90 classes, 7769 training documents, and 3019 testing documents. 


During the dataset loading process, we use the `MultiLabelBinarizer` transformer to convert the original labels into the format that scikit-learn expects for classification. This transformer converts between a list of sets or tuples and the supported multi-label format, which is a (samples x classes) binary matrix indicating the presence of a class label. Further details about how this works can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html).
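As a minimal sketch of that conversion, the example below runs `MultiLabelBinarizer` on three made-up label lists. The names `toy_labels`, `toy_mlb`, `toy_matrix`, and `recovered` are illustrative only and are not part of the loading code used in this notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets for three documents (not taken from the corpus).
toy_labels = [["grain", "wheat"], ["crude"], ["grain"]]

toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(toy_labels)

# Columns follow the sorted class names learned during fit:
# crude, grain, wheat
print(toy_mlb.classes_)

# One row per document, one column per class, 1 where the label is present.
print(toy_matrix)

# inverse_transform maps indicator rows back to tuples of label names.
recovered = toy_mlb.inverse_transform(toy_matrix)
print(recovered)
```

The same fitted binarizer can later turn a classifier's predicted indicator rows back into readable category names.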


Once you figure out how to encode the labels, you have to vectorize the text so that supervised machine learning systems can learn from it. One of the most popular and effective strategies for this is called `tf-idf`, a vectorization technique that weighs a term’s frequency (tf) by its inverse document frequency (idf). Each term has a tf score in every document where it occurs and a single idf score across the corpus. Multiplying them together gives the `tf-idf` score. Intuitively, a higher score corresponds to a token being more "important". Words like "the" have a high term frequency, but a low inverse document frequency because they appear everywhere in the corpus. The word "the" would therefore get a low `tf-idf` score. A specific word like "whale" may appear only rarely in the corpus, giving it a high inverse document frequency and a high term frequency in the few documents that are about it. As a result, it would get a high `tf-idf` score in those documents. More details about `tf-idf` can be found [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

We limit the `TfidfVectorizer` to the 10,000 most frequent terms for performance reasons.
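To make the "the" versus "whale" intuition concrete, here is a small sketch on a made-up three-sentence corpus. It assumes scikit-learn's default `TfidfVectorizer` settings (smoothed idf, L2-normalized rows) and deliberately skips stop-word removal so that "the" stays in the vocabulary. All variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "the" occurs in every document, "whale" in one.
toy_docs = [
    "the ship left the harbor",
    "the market closed higher",
    "a whale followed the ship",
]

# No stop_words argument here, so "the" is kept as a feature.
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

vocab = toy_vectorizer.vocabulary_  # term -> column index
idf = toy_vectorizer.idf_           # one idf value per column

idf_the = idf[vocab["the"]]
idf_whale = idf[vocab["whale"]]
print(f"idf('the')   = {idf_the:.3f}")
print(f"idf('whale') = {idf_whale:.3f}")

# tf-idf weights inside the third document, where each term occurs once.
row = toy_tfidf[2].toarray().ravel()
weight_the = row[vocab["the"]]
weight_whale = row[vocab["whale"]]
print(f"tf-idf('the')   in doc 3 = {weight_the:.3f}")
print(f"tf-idf('whale') in doc 3 = {weight_whale:.3f}")
```

Because both terms occur once in the third document, the difference in their weights there comes entirely from the idf factor.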

In [None]:
labels = reuters.categories()
def load_data(config=None):
    """
    Load the Reuters dataset.

    Returns
    -------
    dict with keys "x_train", "y_train", "x_test", "y_test" (NumPy arrays)
    and "labels" (list of category names). `config` is currently unused.
    """
    labels = reuters.categories()
    stop_words = stopwords.words("english")  # NLTK's English stop-word list
    vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=10000)
    mlb = MultiLabelBinarizer()

    documents = reuters.fileids()
    test = [d for d in documents if d.startswith("test/")]  # IDs of the testing documents
    train = [d for d in documents if d.startswith("training/")]  # IDs of the training documents

    docs = {}
    docs["train"] = [reuters.raw(doc_id) for doc_id in train]  # Raw text of each document
    docs["test"] = [reuters.raw(doc_id) for doc_id in test]
    xs = {"train": [], "test": []}
    # Fit tf-idf on the training text only, then reuse that vocabulary for the test text
    xs["train"] = vectorizer.fit_transform(docs["train"]).toarray()
    xs["test"] = vectorizer.transform(docs["test"]).toarray()
    ys = {"train": [], "test": []}
    # Binarize the labels into a (samples x classes) indicator matrix
    ys["train"] = mlb.fit_transform([reuters.categories(doc_id) for doc_id in train])
    ys["test"] = mlb.transform([reuters.categories(doc_id) for doc_id in test])
    data = {
        "x_train": xs["train"],
        "y_train": ys["train"],
        "x_test": xs["test"],
        "y_test": ys["test"],
        "labels": labels,
    }
    return data

You can now load the data easily into a format ready for scikit-learn:

In [None]:
reuters_data = load_data()

In [None]:
X = reuters_data['x_train']
y = reuters_data['y_train']

`LinearSVC` does not handle multi-label targets on its own, so you wrap it in a `OneVsRestClassifier`, which fits one binary classifier per label. More details about the reasoning for this can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier).

We choose the `LinearSVC` model because it is very fast to train and empirically effective on NLP problems.

In [None]:
clf = OneVsRestClassifier(LinearSVC(class_weight = "balanced"), n_jobs = -1)
clf.fit(X, y)
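Once a classifier such as `clf` is fitted, its predictions come back as rows of the same binary indicator matrix, which can be decoded into label names. The sketch below shows that round trip on a made-up miniature corpus rather than on Reuters. The texts, tags, and every `toy_*` name are hypothetical, and the predictions on such a tiny dataset are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical miniature corpus with one or two tags per document.
toy_texts = [
    "wheat harvest rises as grain exports grow",
    "crude oil prices fall on weak demand",
    "grain shipments delayed at the port",
    "oil refinery output and crude stocks climb",
    "wheat and crude oil both traded higher",
    "grain and wheat futures ease",
]
toy_tags = [
    ["grain", "wheat"],
    ["crude"],
    ["grain"],
    ["crude"],
    ["crude", "wheat"],
    ["grain", "wheat"],
]

toy_vec = TfidfVectorizer()
toy_mlb = MultiLabelBinarizer()
X_toy = toy_vec.fit_transform(toy_texts)  # sparse tf-idf features
y_toy = toy_mlb.fit_transform(toy_tags)   # (samples x classes) indicator matrix

# One LinearSVC is trained per label column.
toy_clf = OneVsRestClassifier(LinearSVC(class_weight="balanced"))
toy_clf.fit(X_toy, y_toy)

# New text must go through the already-fitted vectorizer.
new_docs = ["crude oil exports rise", "wheat harvest delayed"]
pred = toy_clf.predict(toy_vec.transform(new_docs))

# Decode each indicator row back into a tuple of label names.
pred_labels = toy_mlb.inverse_transform(pred)
for doc, tags in zip(new_docs, pred_labels):
    print(doc, "->", tags)
```

A document may receive zero, one, or several labels, because each per-label classifier decides independently.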

Let's see how the model did! You use cross-validation, a common statistical technique, to estimate how well the model generalizes to unseen data. K-fold cross-validation partitions the dataset into K folds. In each of K rounds, the model is trained on K-1 folds and validated on the remaining fold, so every sample is used for validation exactly once. For more details about this process, look [here](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) and specifically at this image [here](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1920px-K-fold_cross_validation_EN.svg.png).

By performing 5-fold cross-validation, you get 5 separate scores, each from a model trained and validated on a different split of the dataset. If you average these scores, you get a pretty good representation of how a model may perform "in the wild" on unseen data. As always, it's not a guarantee of good performance in the future, but it's considered by many to be the gold standard of model evaluation.
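The fold mechanics can be sketched with scikit-learn's `KFold` splitter on ten stand-in sample indices. This is only an illustration of how the splits are formed. The `samples` array is hypothetical, and the splitter is shown without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # stand-in for 10 documents
kfold = KFold(n_splits=5)

folds = []
for fold_number, (train_idx, val_idx) in enumerate(kfold.split(samples), start=1):
    folds.append((train_idx, val_idx))
    print(f"fold {fold_number}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Each index lands in the validation set of exactly one fold and in the training set of the other four.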

In [None]:
cross_val_score(clf, X, y, cv=5)

A pretty robust performance! You have made an effective multi-label text classifier over the Reuters Corpus.
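When reading these scores, it helps to know what is being measured. With no `scoring` argument, `cross_val_score` falls back on the estimator's own `score` method, which for multi-label classifiers in scikit-learn is subset accuracy: a document counts as correct only if every one of its labels is predicted exactly. The sketch below, on hand-made indicator matrices (hypothetical values), contrasts that with micro-averaged F1, which gives credit for partially correct label sets.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions for 3 documents and 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],   # one of two labels missed
                   [0, 1, 0],   # exact match
                   [1, 1, 0]])  # exact match

# Subset accuracy: 2 of 3 rows match exactly.
subset_acc = accuracy_score(y_true, y_pred)

# Micro F1: pools true/false positives and negatives over all labels.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"subset accuracy = {subset_acc:.3f}")  # 0.667
print(f"micro F1        = {micro_f1:.3f}")    # 0.889
```

If a more forgiving metric suits your use case, one option is to pass `scoring="f1_micro"` to `cross_val_score`.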