# Multi-label classification on Reuters dataset

In this notebook, we are going to deal with a multi-label classification problem. The dataset we are using is `Reuters` which contains newswire articles with 90 categories. The dataset has 7769 records for training and 3019 for test, and it is available on `nltk.corpus`.

## Some Visualization

In [1]:
from nltk.corpus import reuters ##import the corpus from nltk

In [2]:
documents_ids = reuters.fileids() #in the format type/num_doc(e.g. training/1000)
print("Number of documents: {}".format(len(documents_ids)))

#Number of training examples
training_examples = list(filter(lambda doc: doc.startswith('training'),documents_ids))
print("Number of training examples: {}".format(len(training_examples)))

#Number of test examples
test_examples = list(filter(lambda doc: doc.startswith('test'),documents_ids))
print("Number of test examples: {}".format(len(test_examples)))

#Number of categories
print("Number of categories: {}".format(len(reuters.categories())))

Number of documents: 10788
Number of training examples: 7769
Number of test examples: 3019
Number of categories: 90


Every document is in the form `content - categories`. Let us visualize one of them.

In [9]:
oneDoc = 'training/100'
print('Content: ',reuters.raw(oneDoc)) #It outputs the content of the document in one string
print('Categorie(s): ',reuters.categories(oneDoc)) #It outputs the categorie(s) of one document in a list

Content:  N.Z. TRADING BANK DEPOSIT GROWTH RISES SLIGHTLY
  New Zealand's trading bank seasonally
  adjusted deposit growth rose 2.6 pct in January compared with a
  rise of 9.4 pct in December, the Reserve Bank said.
      Year-on-year total deposits rose 30.6 pct compared with a
  26.3 pct increase in the December year and 34.5 pct rise a year
  ago period, it said in its weekly statistical release.
      Total deposits rose to 17.18 billion N.Z. Dlrs in January
  compared with 16.74 billion in December and 13.16 billion in
  January 1986.
  


Categorie(s):  ['money-supply']


In [4]:
from collections import Counter
categories =[]
for doc in documents_ids:
    categories.append(reuters.categories(doc)[0])

dict_categories = Counter(categories)
print('The 5 most common categories are :',dict_categories.most_common(5))

The 5 most common categories are : [('earn', 3926), ('acq', 2369), ('crude', 552), ('interest', 453), ('money-fx', 362)]


## Preprocess the documents

In [11]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = stopwords.words('english')         # set of stopwords that we will remove in the documents
tf_vectorizer = TfidfVectorizer(stop_words, max_features=20000, use_idf=True)     # define the vectorizer

train_docs = [reuters.raw(doc) for doc in training_examples] # get the training documents
test_docs = [reuters.raw(doc) for doc in test_examples]      # get the test documents

vectorized_train_docs = tf_vectorizer.fit_transform(train_docs) # embedding of the training documents
vectorized_test_docs = tf_vectorizer.transform(test_docs)    


In [6]:
# from nltk.corpus import stopwords
# from sklearn.feature_extraction.text import CountVectorizer

# stop_words = stopwords.words('english')         # set of stopwords that we will remove in the documents
# vectorizer = CountVectorizer(stop_words)     # define the vectorizer

# train_docs = [reuters.raw(doc) for doc in training_examples] # get the training documents
# test_docs = [reuters.raw(doc) for doc in test_examples]      # get the test documents

# vectorized_train_docs = vectorizer.fit_transform(train_docs) # embedding of the training documents
# vectorized_test_docs = vectorizer.transform(test_docs)    


## Define a classifier

In [12]:
from sklearn.preprocessing import MultiLabelBinarizer
#from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform([reuters.categories(doc) for doc in training_examples])
test_labels = mlb.transform([reuters.categories(doc) for doc in test_examples])

#OvR = OneVsRestClassifier(LogisticRegression(C=1, solver='sag', max_iter = 10000))
OvR = OneVsRestClassifier(LinearSVC(random_state=123))
OvR.fit(vectorized_train_docs, train_labels)
pred_labels = OvR.predict(vectorized_test_docs)

## Check the metrics

In [13]:
%matplotlib inline
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score, roc_auc_score
print('Accuracy Score: ', accuracy_score(test_labels,pred_labels))
print('Precision Score: ', precision_score(test_labels, pred_labels, average='micro'))
print('F1 Score: ', f1_score(test_labels, pred_labels,average='micro'))
print('Recall Score: ',recall_score(test_labels, pred_labels,average='micro'))
print('ROC_AUC Score: ',roc_auc_score(test_labels, pred_labels, average='micro'))

Accuracy Score:  0.8068896985756873
Precision Score:  0.9476205685084638
F1 Score:  0.8631272727272727
Recall Score:  0.7924679487179487
ROC_AUC Score:  0.8959279653876869
