In [1]:
import warnings
warnings.filterwarnings('ignore')

# Classification and Topic Modeling 

## Classification - Hot News

In this example I perform document classification to predict the categories of news articles from the Reuters Corpus, using a **bag-of-words** model for the article text and **one-hot encoding** for the category labels. I then perform topic modeling with **LDA** to see whether the categories of news articles can be recovered without any labelled data.


## The Reuters Corpus

The Reuters Corpus is a collection of news documents with category tags that is commonly used to test document classification. A document may carry more than one category tag. The corpus is split into two sets: the *training* documents used to train a classification algorithm, and the *test* documents used to evaluate the classifier's performance.

The Reuters Corpus is accessible through NLTK; for more information see the [NLTK Corpus HOWTO](http://www.nltk.org/howto/corpus.html#categorized-corpora).

In [2]:
from nltk.corpus import reuters
import nltk
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /home/jrock/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /home/jrock/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
ids_train = list(filter(lambda article: article.startswith("train"), reuters.fileids()))
ids_test = list(filter(lambda article: article.startswith("test"), reuters.fileids()))
print('How many documents are in the Reuters Corpus?')
print(f'There are {len(reuters.fileids())} articles, {len(ids_train)} for training and {len(ids_test)} for testing.')

How many documents are in the Reuters Corpus?
There are 10788 articles, 7769 for training and 3019 for testing.


In [4]:
# Count all tokens (words and punctuation) in each split
words_train = len(reuters.words(ids_train))
words_test = len(reuters.words(ids_test))
print(f'There are {words_train} words in the train data set, and {words_test} in the test data set.')

There are 1253696 words in the train data set, and 467205 in the test data set.


In [5]:
cats = reuters.categories(ids_train)
counts = {}
for c in cats:
    counts[c] = len(list(filter(lambda article: article.startswith('train'), reuters.fileids(categories=c))))
print(f'The five most common categories in the training documents are:\n{sorted(counts, key=counts.get, reverse=True)[:5]}')

The five most common categories in the training documents are:
['earn', 'acq', 'money-fx', 'grain', 'crude']


## Classifying Reuters (supervised) using LinearSVC (SVM)

Now I will put these together in order to build a classifier for Reuters articles.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

text_train = [reuters.raw(i) for i in ids_train]
text_test = [reuters.raw(i) for i in ids_test]

# Converting the training and testing documents into matrices X and X2 of feature vectors using CountVectorizer()
X = vectorizer.fit_transform(text_train)
X2 = vectorizer.transform(text_test)

In [7]:
from sklearn.preprocessing import MultiLabelBinarizer

# Collect the list of categories for each document (a document may have several)
cat_train = [reuters.categories(fileids=[doc_id]) for doc_id in ids_train]
cat_test = [reuters.categories(fileids=[doc_id]) for doc_id in ids_test]

# Convert the category labels into binary indicator matrices y and y2 using scikit-learn's MultiLabelBinarizer
labeler = MultiLabelBinarizer()
y = labeler.fit_transform(cat_train)
y2 = labeler.transform(cat_test)
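
To make the label encoding concrete, here is a minimal sketch on made-up label lists (the `toy_` names and the data are illustrative assumptions, not part of the Reuters run). It shows that `MultiLabelBinarizer` produces one 0/1 column per category, in the order given by `classes_`, and that a document with two categories gets two 1s in its row.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists (hypothetical); each document may carry several categories.
toy_labels = [["earn"], ["grain", "wheat"], ["crude"], ["earn", "acq"]]

toy_labeler = MultiLabelBinarizer()
toy_y = toy_labeler.fit_transform(toy_labels)

# Column order: the categories sorted alphabetically
print(toy_labeler.classes_)
# One row per document, one 0/1 column per category
print(toy_y)
# Map a row of indicators back to category names
print(toy_labeler.inverse_transform(toy_y[1:2]))
```

The same `classes_` attribute on the notebook's `labeler` gives the category name behind each column index of `y` and `y2`.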

In [8]:
# Fit a multi-label classifier on the training data: one linear SVM per category (one-vs-rest)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)
y_pred = clf.predict(X2)

In [9]:
print(classification_report(y2, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96       719
           1       1.00      0.39      0.56        23
           2       1.00      0.64      0.78        14
           3       0.78      0.70      0.74        30
           4       0.92      0.67      0.77        18
           5       0.00      0.00      0.00         1
           6       1.00      0.83      0.91        18
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         3
           9       0.93      0.93      0.93        28
          10       1.00      0.78      0.88        18
          11       0.00      0.00      0.00         1
          12       0.91      0.86      0.88        56
          13       1.00      0.50      0.67        20
          14       0.00      0.00      0.00         2
          15       0.70      0.50      0.58        28
          16       0.00      0.00      0.00         1
          17       0.84    
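
The row labels 0, 1, 2, ... in the report are column indices of the label matrix rather than category names. One possible way to make such a report easier to read, sketched here on toy data and not on the Reuters predictions, is to pass the binarizer's `classes_` as `target_names`. The `toy_` names and values are assumptions for illustration, and `zero_division=0` assumes a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

# Toy ground truth (hypothetical); columns will be ordered acq, crude, earn.
toy_labeler = MultiLabelBinarizer()
toy_true = toy_labeler.fit_transform([["earn"], ["acq"], ["earn", "acq"], ["crude"]])

# Toy predictions: the third document misses 'acq', the fourth predicts nothing.
toy_pred = np.array([[0, 0, 1],
                     [1, 0, 0],
                     [0, 0, 1],
                     [0, 0, 0]])

toy_report = classification_report(
    toy_true,
    toy_pred,
    target_names=list(toy_labeler.classes_),  # show names instead of 0, 1, 2
    zero_division=0,  # report 0.0 for categories that were never predicted
)
print(toy_report)
```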

## Topic Modeling with LatentDirichletAllocation (LDA)

Now we will see if we can use topic modeling to discover the topics in the Reuters news articles without using the labels provided in the corpus.


In [10]:
from nltk.corpus import stopwords
nltk.download('stopwords')

vectorizer = CountVectorizer(stop_words=stopwords.words('english'))  # Exclude English stopwords
X = vectorizer.fit_transform(text_train + text_test)  # Encode all articles as a sparse matrix of word counts (bag-of-words)

[nltk_data] Downloading package stopwords to /home/jrock/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
X 

<10788x30778 sparse matrix of type '<class 'numpy.int64'>'
	with 629849 stored elements in Compressed Sparse Row format>

Each row corresponds to a document and each column to a word in the vocabulary. A cell contains the number of times that word appears in that document, so the matrix above describes 10788 documents over a vocabulary of 30778 words.
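
The layout of the count matrix is easier to see on a small example. The sketch below uses three made-up sentences (the `toy_` names and texts are illustrative assumptions) and prints the vocabulary in column order next to the dense matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny hypothetical "articles"
toy_docs = ["oil prices rise", "oil oil exports fall", "wheat exports rise"]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)        # one column per word
print(toy_X.toarray())  # one row per document, cells are word counts
```

The second row contains a 2 in the `oil` column because that word occurs twice in the second document.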

In [12]:
from sklearn.decomposition import LatentDirichletAllocation

# Create an LDA model with one topic per Reuters category
lda = LatentDirichletAllocation(n_components=len(cats))
# Fit the model on the count matrix X and return the document-topic matrix
res = lda.fit_transform(X)

In [13]:
res.shape

(10788, 90)

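What does the output of this function represent?  
Each row is the topic distribution of one document: 90 non-negative weights, one per latent topic, that sum to 1. These are topics learned by LDA, not the Reuters categories, so topic *k* has no built-in correspondence to category *k*.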
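
As a small check of this interpretation, the sketch below fits LDA on four made-up sentences and inspects the result. The `toy_` names, the texts, and `random_state=0` are assumptions for illustration, and the row-sum check assumes a scikit-learn version in which `fit_transform` returns normalized topic distributions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus with two rough themes (oil and grain)
toy_docs = [
    "oil prices rise as crude exports fall",
    "crude oil output and oil prices",
    "wheat and corn harvest grows",
    "grain exports wheat corn harvest",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_res = toy_lda.fit_transform(toy_counts)

print(toy_res.shape)        # (number of documents, number of topics)
print(toy_res.sum(axis=1))  # each row is a distribution over topics
```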

In [14]:
res[0] # Example

array([3.30687831e-05, 4.85713811e-02, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.40484451e-02,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 2.23323268e-02, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       1.15231963e-01, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 2.30192269e-02,
       3.30687831e-05, 3.30687831e-05, 3.30687831e-05, 3.30687831e-05,
       3.30687831e-05, 3.30687831e-05, 2.73691399e-01, 3.42676719e-02,
      

In [15]:
len(res[0])  # The number of topics

90

The most prominent topic for document 0 is topic number:

In [16]:
import numpy as np
np.argmax(res[0])

54
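
A topic index on its own says little about what the topic covers. One common way to interpret it, sketched here on toy data rather than on the fitted Reuters model, is to list the words with the largest weights in the corresponding row of `components_`. The helper `top_words` and all `toy_` names are hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def top_words(model, vocab, topic_idx, n=5):
    """Return the n highest-weighted vocabulary terms for one topic."""
    weights = model.components_[topic_idx]  # one weight per vocabulary word
    best = np.argsort(weights)[::-1][:n]    # indices of the largest weights
    return [vocab[i] for i in best]


# Hypothetical mini-corpus
toy_docs = [
    "oil prices rise as crude exports fall",
    "crude oil output and oil prices",
    "wheat and corn harvest grows",
    "grain exports wheat corn harvest",
]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_res = toy_lda.fit_transform(toy_counts)

# Most prominent topic for the first toy document, and its top words
toy_topic = int(np.argmax(toy_res[0]))
toy_top = top_words(toy_lda, toy_vocab, toy_topic, n=3)
print(toy_topic, toy_top)
```

Applied to the notebook's objects, the same idea would use `lda`, the fitted `vectorizer`, and topic 54.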