## Latent Dirichlet Allocation

LDA is an unsupervised classification methods used to classify different text documents into a particular topics. Classifications are based on definining word-topic relationships using Dirichlet distributions.

In [102]:
import random
import gensim
import re
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import *
import numpy as np
import spacy
import string
import nltk
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_sci_lg  # model downloaded in previous step
nlp = spacy.load("en_core_web_sm")    ### Load spacy NLP processor
#nltk.download('wordnet')
np.random.seed(400)

import warnings
warnings.filterwarnings('ignore')

### Dataset

The dataset used for this example is the 20newsgroup dataset available from sklearn. This dataset has news articles grouped into 20 categories.

In [39]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

### Text preprocessing

The following steps will be performed in order to clean raw text and reduce the sparsity of the dataset:
- Tokenization: split text into words, convert to lowercase and remove punctuation
- Remove stopwords
- Lemmatize words: removes inflection endings of words to obtain base/dictionary form of the word (e.g. am, are, is $\Rightarrow$ be)

In [76]:
def clean_txt(txt):
    """
    Function accepts raw text string and performs tokenization, removes punctuation and lemmitizers words. Only tokens
    with length greater than 3 are kept.
    
    Returns list of tokens
    """
    processed_text = []
    for token in gensim.utils.simple_preprocess(txt):
        #Remove stopwords or small tokens
        if token not in STOP_WORDS and len(token) > 3:
            processed_text.append(WordNetLemmatizer().lemmatize(token, pos = 'v'))
            
    return processed_text

In [78]:
'''
Example of preprocessing workflow
'''
document_num = 50
doc_sample = newsgroups_train.data[document_num]

print("Original document: ")
print(doc_sample.split(' '))
print("\n\nTokenized and lemmatized document: ")
print(clean_txt(doc_sample))

Original document: 
['From:', 'johnc@crsa.bu.edu', '(John', 'Collins)\nSubject:', 'Problem', 'with', 'MIT-SHM\nOrganization:', 'Boston', 'University\nLines:', '27\n\nI', 'am', 'trying', 'to', 'write', 'an', 'image', 'display', 'program', 'that', 'uses\nthe', 'MIT', 'shared', 'memory', 'extension.', '', 'The', 'shared', 'memory', 'segment\ngets', 'allocated', 'and', 'attached', 'to', 'the', 'process', 'with', 'no', 'problem.\nBut', 'the', 'program', 'crashes', 'at', 'the', 'first', 'call', 'to', 'XShmPutImage,\nwith', 'the', 'following', 'message:\n\nX', 'Error', 'of', 'failed', 'request:', '', 'BadShmSeg', '(invalid', 'shared', 'segment', 'parameter)\n', '', 'Major', 'opcode', 'of', 'failed', 'request:', '', '133', '(MIT-SHM)\n', '', 'Minor', 'opcode', 'of', 'failed', 'request:', '', '3', '(X_ShmPutImage)\n', '', 'Segment', 'id', 'in', 'failed', 'request', '0x0\n', '', 'Serial', 'number', 'of', 'failed', 'request:', '', '741\n', '', 'Current', 'serial', 'number', 'in', 'output', 'strea

In [114]:
#preprocess all documents in data
processed_documents_train = []
processed_documents_test = []

for train_doc, test_doc in zip(newsgroups_train.data, newsgroups_test.data):
    processed_documents_train.append(clean_txt(train_doc))
    processed_documents_test.append(clean_txt(test_doc))

### Convert documents to Bag of Words representation

BoW representation results in displaying each document as a count of the number of times each unique word appears in the document

In [115]:
bow_dictionary = gensim.corpora.Dictionary(processed_documents)

"""
Filter out
- Words occurring infrequently (less than or equal to 10)
- Words occurring in more than 80% of all documents
"""

bow_dictionary.filter_extremes(no_below=10, no_above=0.8)

#Convert to BoW
train_bow_corpus = [bow_dictionary.doc2bow(doc) for doc in processed_documents_train]
test_bow_corpus = [bow_dictionary.doc2bow(doc) for doc in processed_documents_test]

In [144]:
# Example document
example = train_bow_corpus[2]

#Show 10 words
for i in range(10):
    print("Word {} (\"{}\") has a score of {} ".format(example[i][0],\
         bow_dictionary[example[i][0]], example[i][1]))

Word 13 ("info") has a score of 2 
Word 14 ("know") has a score of 1 
Word 16 ("look") has a score of 2 
Word 23 ("post") has a score of 1 
Word 30 ("thank") has a score of 1 
Word 32 ("university") has a score of 1 
Word 33 ("wonder") has a score of 1 
Word 37 ("answer") has a score of 1 
Word 48 ("disk") has a score of 2 
Word 56 ("haven") has a score of 1 


### Run the LDA

The parameters that will be adjusted in this model will be:

- <b>num_topics</b>: The number of latent topics to be extracted across all documents. We will define this as 10 upfront
- <b>alpha</b> and <b>eta</b>: define the sparsity of the document-topic (eta) and topic-word (lambda) distributions. Higher values indicate that each document/word comprises a mixture of a broader set of topics/words versus lower values which indicate that each document/word is defined by a smaller subset

In [130]:
#Train LDA model

lda_model =  gensim.models.LdaMulticore(train_bow_corpus, 
                                   num_topics = 8, 
                                   id2word = bow_dictionary,                                    
                                   passes = 10,
                                   workers = 2)

In [131]:
# Show words and weightings by topic
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.011*"file" + 0.009*"windows" + 0.008*"program" + 0.007*"write" + 0.007*"drive" + 0.007*"image" + 0.006*"post" + 0.006*"work" + 0.006*"system" + 0.005*"card"


Topic: 1 
Words: 0.010*"space" + 0.008*"write" + 0.007*"nasa" + 0.007*"article" + 0.006*"post" + 0.005*"nntp" + 0.005*"host" + 0.004*"work" + 0.004*"university" + 0.004*"program"


Topic: 2 
Words: 0.011*"people" + 0.008*"think" + 0.008*"write" + 0.006*"know" + 0.006*"say" + 0.006*"believe" + 0.005*"article" + 0.005*"time" + 0.005*"mean" + 0.004*"right"


Topic: 3 
Words: 0.010*"file" + 0.007*"information" + 0.007*"encryption" + 0.005*"program" + 0.005*"post" + 0.005*"chip" + 0.005*"privacy" + 0.005*"number" + 0.004*"output" + 0.004*"security"


Topic: 4 
Words: 0.013*"post" + 0.012*"write" + 0.011*"article" + 0.010*"nntp" + 0.010*"host" + 0.009*"university" + 0.007*"like" + 0.006*"distribution" + 0.006*"know" + 0.005*"think"


Topic: 5 
Words: 0.008*"people" + 0.008*"israel" + 0.008*"scsi" + 0.007*"armenian" +

#### Topic inference

Based on the topic groupings and relative weightings of component words. The 8 topics can be infered:

- 0: Technology (graphics)
- 1: Space
- 2: Religion
- 3: Encryption
- 4: Education
- 5: Middle East
- 6: Sport
- 7: Opinion

### Classify unseen document from test set

In [132]:
sample_index = random.randint(0, len(test_bow_corpus) - 1)
bow_doc = test_bow_corpus[sample_index]
print(newsgroups_test.data[sample_index])

From: jussi@tor.abo.fi (Jussi Laaksonen DC)
Subject: Lasergraphics Language ?
Organization: ]bo Akademi University, Finland
Distribution: comp.graphics
Lines: 25

Hi!

We have an old Montage FR-1 35mm film recorder. When connected to a PC with
its processor card it can directly take HPGL, Targa and Lasergraphics Language
files. 24 bit Targa is quite OK for raster images, but conversion from 
whatever one happens to have can be quite slow. This Lasergraphics Language
seems to be (got the source file for one test image) a vector-based language
that can handle one million colors. It does some polygons too, and perhaps
something else ?

The question is, where can I find some information about this language ?
A FTP site, a book, a company address,.... ?

(OK, it would be nice to have a Windows driver for it, but I'm not THAT
optimistic...)

Thanks in advance for any help!

	jussi


--
	Jussi Laaksonen
        Computing Centre / ]bo Akademi University,  Finland




In [133]:
# Data preprocessing step for the unseen document
for index, score in sorted(lda_model[bow_doc], key=lambda tup: -1*tup[1]):
    print("Score: {:.2f}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.98	 Topic: 0.011*"file" + 0.009*"windows" + 0.008*"program" + 0.007*"write" + 0.007*"drive"


In [141]:
#list of topics in test set
possible_outcomes=list(newsgroups_test.target_names)

In [142]:
#Actual label of test case
print(possible_outcomes[newsgroups_test.target[sample_index]])

comp.graphics


The model correctly classifies the document as technology (graphics) with 98% probability