## Multi-Label classification using Naive-Bayes
This notebook uses the Reuters-21578 dataset.


In [0]:
## Importing necessary packages
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import ComplementNB
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.problem_transform import LabelPowerset

#### Preprocessing Data
We use *Beautiful Soup* from the *bs4* package to read the SGML files and extract the necessary attributes from their tags.
<br> The function below takes the list of *Reuters* tags from a single SGML file and returns a DataFrame with all the necessary attributes.

In [0]:
def extract_to_df(reuters):
    topics = []
    titles = []
    body = []

    for i in range(len(reuters)):
        tp = reuters[i]('topics')
        tt = reuters[i]('title')
        bd = reuters[i]('text')
    
        ## Titles
        if tt == []:
            titles.append('')
        else:
            titles.append(str(tt[0].contents[0].string))
    
        ## Topics
        if len(tp[0]) == 0:
            temp = ''
        else:
            temp = []
            for j in range(len(tp[0])):
                temp.append(str(tp[0].contents[j].string))
        topics.append(temp)
    
        ## Body
        if len(bd) == 0:
            body.append('')
        else:
            body.append(str(bd[0].contents[-1]))
        
    df = pd.DataFrame({'topics' : topics, 'title' : titles, 'body' : body})
    
    return df
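
To make the tag layout that the function relies on concrete, here is a minimal sketch that pulls the same three attributes out of one record. It deliberately swaps Beautiful Soup for the standard library's `html.parser`, so it is not the extraction used in this notebook; the sample record and the `RecordParser` class are hypothetical and only mimic the shape of an article in a `.sgm` file.

```python
from html.parser import HTMLParser

# Hypothetical record, shaped like one article in a Reuters .sgm file.
SAMPLE = """<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT>
<TITLE>SAMPLE TITLE</TITLE>
<BODY>Sample body text.</BODY>
</TEXT>
</REUTERS>"""


class RecordParser(HTMLParser):
    """Collects topics, title and body for each <REUTERS> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._open = []       # currently open tags (html.parser lowercases names)
        self._current = None  # record being filled, or None outside <REUTERS>

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)
        if tag == "reuters":
            self._current = {"topics": [], "title": "", "body": ""}

    def handle_endtag(self, tag):
        # Close the tag together with anything still open inside it.
        if tag in self._open:
            while self._open.pop() != tag:
                pass
        if tag == "reuters" and self._current is not None:
            self.records.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is None or not self._open:
            return
        innermost = self._open[-1]
        if innermost == "d" and "topics" in self._open:
            # Each <D> inside <TOPICS> holds one topic name.
            self._current["topics"].append(data.strip())
        elif innermost in ("title", "body"):
            self._current[innermost] += data.strip()


parser = RecordParser()
parser.feed(SAMPLE)
parser.close()
print(parser.records)
```

Each topic sits in its own `<D>` element inside `<TOPICS>`, which is why an article can carry zero, one, or several topics.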

#### Loading Data and constructing a DataFrame
The code below iterates over the 22 SGML files in the dataset, builds a DataFrame for each file, and appends it to the DataFrames built so far, resulting in a single DataFrame of 21578 rows (one per article) and 3 columns: *Topics, Title, Body*.
<br>Each .sgm file is passed to *BeautifulSoup()* as an argument, and the resulting object is used to extract all tags named *reuters*, one per article. This list is then passed to the function defined above to get a DataFrame.

In [0]:
frames = []

for i in range(0,22):
    file_name = "reut2-" + '%03d' % i + '.sgm'
    with open(file_name) as file:
        soup = BeautifulSoup(file)
        reuters = soup('reuters')
        frames.append(extract_to_df(reuters))

## DataFrame.append was removed in pandas 2.0, so concatenate all per-file frames once
data = pd.concat(frames, ignore_index = True)

#### Cleaning Data
We want to predict the *Topics* of an article, and many articles in the dataset don't have any topic associated with them. <br> We choose to remove such observations because the Naive Bayes classifier is a probabilistic generative model that estimates word probabilities per class. A row with no *Topic* gives it no class to learn from and creates a problem in fitting the model.<br>
<br> We also add a column *text* that is meant to hold the merged *Title* and *Body* of the article, because the *Tfidf vectorizer* used later expects its input as a sequence of strings, one per article. Note that the cell below, as written, assigns only *Title* to *text*, which is why *title* and *text* have identical summary statistics in its output.

In [0]:
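## Keep only the articles that have at least one topic; copy to avoid modifying a slice
data = data[data['topics'] != ''].copy()
## As run here, 'text' holds the title only (title and text have identical counts in the output below).
## To merge both attributes as described above: data['text'] = data['title'] + ' ' + data['body']
data['text'] = data['title']
data.describe()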

| | topics | title | body | text |
|---|---|---|---|---|
| count | 11367 | 11367 | 11367 | 11367 |
| unique | 655 | 10875 | 10316 | 10875 |
| top | [earn] | | Blah blah blah.\n\n\n | |
| freq | 3945 | 62 | 818 | 62 |


#### Attribute transformation
The target attribute *Topics* is such that each article can have one or more *Topics*. This makes it a *Multi-Label* attribute, not just a Multi-class attribute. <br>
We use *MultiLabelBinarizer*, which collects all the distinct *Topics* in the attribute and builds an array of 0's and 1's for each row, indicating whether the *i'th* *Topic* appears in that row's *Topics* column or not. <br>
There are 120 distinct *Topics*, so each row's array has length 120 and very few of its entries are 1's. <br>
This transforms the *Topics* attribute into a 2-D NumPy array of 0's and 1's with one row per article. With the default settings used here the array is dense; a sparse matrix is returned only when *sparse_output=True* is passed.

In [0]:
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(data['topics'])
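
The binarization step can be illustrated without sklearn. The following is a minimal pure-Python sketch of the idea, assuming a tiny hypothetical list of topic lists; `binarize` is an illustrative helper, not part of any library.

```python
def binarize(label_lists):
    """Return (classes, rows): the sorted distinct labels and one 0/1 row per item."""
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1  # mark every topic the article carries
        rows.append(row)
    return classes, rows


# Hypothetical topics for four articles
topics = [["earn"], ["grain", "wheat"], ["acq"], ["grain"]]
classes, rows = binarize(topics)
print(classes)  # ['acq', 'earn', 'grain', 'wheat']
for row in rows:
    print(row)
```

The second article has two 1's in its row, which is exactly what distinguishes a multi-label target from a multi-class one.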

#### TFIDF Vectorizer
The independent attribute of the data is the *text* string built for each article in the cleaning step. <br><br>
TFIDF - Term Frequency-Inverse Document Frequency <br>
Tfidf vectorization transforms text into a numeric format based on the frequency (and then the inverse document frequency) of the distinct words occurring in the text. It first counts the frequency of each word that occurs in the text. However, some words are common to almost all articles and occur almost everywhere, like *a, the, of, from, for, etc.* <br>
With raw counts alone, such common but unimportant words get at least as much weight as words that are quite important but less common. <br><br>
To correct for this, Tfidf calculates a term called the Inverse Document Frequency, which is the log of the ratio of the # of docs to the # of docs containing the particular word. The TFIDF is then calculated as the product *Term Freq * Inverse Doc Freq*. <br>
Thus, each row is represented by the words of the vocabulary and a TFIDF weight associated with each of them, i.e. the text is transformed into a vector of weights, one per word, which are called the features of the text. <br> <br>
The larger the corpus, the more distinct words it contains, so the number of features can become very large. This is not desirable, since with too many features the fitted model might not generalise well. So we put a maximum on the # of features: *max_features* keeps that many terms with the highest term frequency across the corpus and ignores the others.

In [0]:
vect = TfidfVectorizer(max_features = 5500)
X = vect.fit_transform(data['text'])
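
As a worked example of the weighting described above, here is a small sketch that applies the textbook formula (term frequency times log of # of docs over # of docs containing the word) to three hypothetical sentences. It is an illustration only: sklearn's *TfidfVectorizer* by default uses a smoothed variant of the IDF and L2-normalises each row, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; "the" occurs in every document
docs = [
    "the wheat prices rise",
    "the wheat harvest",
    "the bank raises the rate",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs
doc_freq = Counter(word for words in tokenized for word in set(words))


def tfidf(words):
    """Textbook TF-IDF: raw term count times log(n_docs / document frequency)."""
    term_freq = Counter(words)
    return {w: term_freq[w] * math.log(n_docs / doc_freq[w]) for w in term_freq}


weights = [tfidf(words) for words in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {k: round(v, 3) for k, v in w.items()})
```

The word "the" appears twice in the third sentence but still gets weight 0, because it occurs in every document, while "bank" occurs in only one document and gets the largest weight.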

In [0]:
## Splitting the data X and y into training and testing sets with the testing size of our choice.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

#### Fitting the model
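We fit the *ComplementNB* model from the sklearn.naive_bayes package using the above Training Data. <br>
But since an article can have one or more *Topics* assigned to it, we need a method that can predict several labels at once. <br>The sklearn Naive Bayes classifier on its own predicts a single label per article, so we wrap it in the *LabelPowerset* method from the *scikit-multilearn* library, which treats every distinct combination of *Topics* seen in the training data as one class and so turns the problem into a multi-class one.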

In [0]:
classifier = LabelPowerset(ComplementNB())
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
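
The label powerset transformation itself is simple to sketch. The code below is a pure-Python illustration of the idea, assuming a hypothetical 0/1 target matrix; `powerset_encode` and `powerset_decode` are illustrative helpers and not the scikit-multilearn implementation.

```python
def powerset_encode(rows):
    """Map each distinct 0/1 row (a combination of labels) to an integer class id."""
    combo_to_class = {}
    encoded = []
    for row in rows:
        key = tuple(row)
        if key not in combo_to_class:
            combo_to_class[key] = len(combo_to_class)  # next unused class id
        encoded.append(combo_to_class[key])
    return encoded, combo_to_class


def powerset_decode(class_ids, combo_to_class):
    """Turn predicted class ids back into 0/1 label rows."""
    class_to_combo = {cls: list(combo) for combo, cls in combo_to_class.items()}
    return [class_to_combo[cls] for cls in class_ids]


# Hypothetical binarized targets for four articles
y_rows = [[0, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0], [1, 0, 0, 0]]
encoded, mapping = powerset_encode(y_rows)
print(encoded)  # [0, 1, 0, 2]
print(powerset_decode(encoded, mapping))
```

A single-label classifier is then trained on the class ids. One consequence of this design is that a combination of topics that never occurs in the training data can never be predicted.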

#### Cross-validation on Training Data
We perform k-fold (here, 10-fold) cross-validation on the Training Data to check whether the accuracies differ widely from fold to fold, which would indicate that the model is not a good fit.

In [0]:
cross_val_score(classifier, X=X_train, y=y_train, cv=10)

array([0.77142857, 0.77802198, 0.77802198, 0.78767877, 0.779978  ,
       0.78107811, 0.77777778, 0.76677668, 0.78437844, 0.78657866])

Thus the cross-validation scores, which range from about 0.77 to 0.79, indicate consistent behaviour of the model on the training data.
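
One way to put a number on this consistency is to summarise the ten fold scores printed above. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

# The ten fold scores reported by cross_val_score above
scores = [0.77142857, 0.77802198, 0.77802198, 0.78767877, 0.779978,
          0.78107811, 0.77777778, 0.76677668, 0.78437844, 0.78657866]

fold_mean = mean(scores)
fold_std = stdev(scores)            # sample standard deviation across folds
fold_range = max(scores) - min(scores)

print(f"mean = {fold_mean:.4f}, std = {fold_std:.4f}, range = {fold_range:.4f}")
```

A spread that is small compared with the mean supports the reading that no single fold behaves very differently from the others.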

#### Predicting topics and finding out accuracy
We use the validated model to predict *Topics* on the Test dataset and check its accuracy score. <br>
The accuracy here is defined as the ratio of correctly classified observations to the total no. of observations.
A *correctly classified* observation here means that if an article has 3 topics, the model must predict all 3 topics and no others, so that the predicted topics match the actual topics exactly.

In [0]:
predictions = classifier.predict(X_test)
y_pred = mlb.inverse_transform(predictions)
accuracy_score(y_test,predictions)

0.7893579595426561

Thus, we get an accuracy of about 79% on the Test Dataset.
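
The exact-match rule can be sketched in a few lines. The example below assumes hypothetical 0/1 label rows and an illustrative helper `subset_accuracy`; it mirrors the definition given above rather than reproducing sklearn's code.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label row equals the true row exactly."""
    matches = sum(1 for true, pred in zip(y_true, y_pred) if list(true) == list(pred))
    return matches / len(y_true)


# Hypothetical targets and predictions for four articles over three topics
y_true_rows = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred_rows = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

print(subset_accuracy(y_true_rows, y_pred_rows))  # 0.75
```

The third article gets one of its two topics right but still counts as a miss, which makes this a strict measure for multi-label data.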

### Precision and Recall
Precision and Recall are defined slightly differently for Multi-Label Classification. We do not report a confusion matrix, as we could not work out how to interpret one when there are multiple classes involved and also multiple labels assigned to each instance.

#### Precision Score
Definition in the sklearn documentation: *Calculate metrics for each instance, and find their average.* <br><br>
Here, in Multi-Label classification, a score is calculated for each observation: the ratio of the # of topics common to the predicted and true sets to the # of predicted topics.<br>For n observations, n such scores are calculated and their average is the Precision Score. <br>
e.g. For an article, the model predicted "egg", "milk" and "yogurt", and the true value was "milk" and "yogurt". Then the score for this observation will be 2/3.

In [0]:
precision_score(y_test,predictions, average = 'samples')

0.8626397788666917
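
The per-observation definition can be checked by hand on the example above. This is a minimal sketch with hypothetical data; `samples_precision` is an illustrative helper, and scoring an observation with no predicted topics as 0 is an assumption made here.

```python
def samples_precision(true_sets, pred_sets):
    """Average over observations of |true & predicted| / |predicted|."""
    scores = []
    for true, pred in zip(true_sets, pred_sets):
        # Assumption: an empty prediction scores 0 for that observation
        scores.append(len(true & pred) / len(pred) if pred else 0.0)
    return sum(scores) / len(scores)


# First article is the example from the text; the second is predicted exactly
true_sets = [{"milk", "yogurt"}, {"earn"}]
pred_sets = [{"egg", "milk", "yogurt"}, {"earn"}]

print(samples_precision(true_sets, pred_sets))  # (2/3 + 1) / 2
```

The extra predicted topic "egg" lowers the first article's score to 2/3, so over-predicting topics is what this metric penalises.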

#### Recall Score
The setting is similar to that of Precision for Multi-label classification.<br><br>
A score is calculated for each observation: the ratio of the # of topics common to the predicted and true sets to the # of true topics. The average of the n scores is then taken to be the Recall Score (n is the # of observations). <br>
e.g. For an article, the model predicted "egg", "milk" and "yogurt", and the true value was "milk" and "yogurt". Then the score for this observation will be 2/2 = 1.

In [0]:
recall_score(y_test, predictions, average = "samples")

0.8372726352607619
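
For recall the denominator changes to the # of true topics. The sketch below uses hypothetical data and an illustrative helper `samples_recall`; treating an observation with no true topics as 0 is an assumption (such rows were removed from this dataset anyway).

```python
def samples_recall(true_sets, pred_sets):
    """Average over observations of |true & predicted| / |true|."""
    scores = []
    for true, pred in zip(true_sets, pred_sets):
        # Assumption: an observation without true topics scores 0
        scores.append(len(true & pred) / len(true) if true else 0.0)
    return sum(scores) / len(scores)


# First article is the example from the text; the second misses one true topic
true_topics = [{"milk", "yogurt"}, {"grain", "wheat"}]
pred_topics = [{"egg", "milk", "yogurt"}, {"grain"}]

print(samples_recall(true_topics, pred_topics))  # (1 + 0.5) / 2 = 0.75
```

Here the extra "egg" costs nothing, while the missed "wheat" halves the second article's score, so recall penalises under-predicting topics.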

In [0]:
y_pred[:20]

[('acq',),
 ('earn',),
 ('crude',),
 ('crude',),
 ('trade',),
 ('earn',),
 ('earn',),
 ('acq',),
 ('earn',),
 ('acq',),
 ('grain', 'wheat'),
 ('acq',),
 ('earn',),
 ('acq',),
 ('earn',),
 ('trade',),
 ('carcass', 'livestock'),
 ('acq',),
 ('interest', 'money-fx'),
 ('earn',)]