# TOPIC MODELING

Topic modeling is the task of using unsupervised learning to extract the main topics (each represented as a set of words) that occur in a collection of documents.

## Working Problem

Today, I will show how to use regular LDA and GuidedLDA for topic modeling. I tested LDA on the 20 Newsgroups dataset available through sklearn, which has thousands of newsgroup posts spread across 20 categories.

## Latent Dirichlet Allocation ##

LDA is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.

* Each document is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related, so choosing the right corpus of data is crucial.
* It also assumes that documents are produced from a mixture of topics. Those topics then generate words according to their probability distributions.
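The generative story in the bullets above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`sample_dirichlet`, `generate_document`), the two hand-made topics, and the toy vocabulary are all hypothetical and do not come from the notebook or from any LDA library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's mixture over topics.
    doc_topic = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(topic_word)), weights=doc_topic)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return doc_topic, words

rng = random.Random(0)
toy_vocab = ["game", "team", "score", "orbit", "launch", "moon"]
# Two hand-made topics (hypothetical): roughly "sports" and "space".
toy_topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]
doc_topic, words = generate_document(toy_topic_word, toy_vocab, [0.5, 0.5], 8, rng)
print(doc_topic)  # the document's topic mixture, sums to 1
print(words)      # eight words drawn from the two topics
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.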

## Step 1: Load the dataset

The dataset we'll use is the 20 Newsgroups dataset, which is available from sklearn. It contains newsgroup posts grouped into 20 categories.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

In [2]:
import lda

In [3]:
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)

(11314,) (11314,)


In [4]:
# Let's look at a sample post and its target group
print(newsgroups_train.data[:1], newsgroups_train.target[:1])

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"] [7]


In [5]:
print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

#### As you can see, there are some distinct themes in the news categories, such as:

* sports
* religion
* science
* technology
* politics

## Step 2: Simple Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.


In [6]:
'''
Loading Gensim and nltk libraries
'''
# pip install gensim nltk
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(400)

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lubasisikwibele/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Setting up the Stemmer
Before preprocessing the whole dataset, we create the Snowball stemmer for English. The preprocessing function below uses it to reduce each word to its root form:

In [68]:
import pandas as pd
stemmer = SnowballStemmer("english")


In [11]:
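'''
Write a function to perform the pre processing steps on the entire dataset
'''
# Create the lemmatizer once instead of once per token
lemmatizer = WordNetLemmatizer()

def stemming_and_lemmatizing(text):
    # Lemmatize the token as a verb first, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocessing(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Keep only non-stopwords that are longer than 3 characters
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(stemming_and_lemmatizing(token))
    return result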

In [12]:
'''
Preview a document after preprocessing
'''
doc_sample = 'This disk has failed many times. I would like to get it replaced.'

print("Original document: ")
print(doc_sample.split(' '))
print("\n\nTokenized and lemmatized document: ")
print(preprocessing(doc_sample))

Original document: 
['This', 'disk', 'has', 'failed', 'many', 'times.', 'I', 'would', 'like', 'to', 'get', 'it', 'replaced.']


Tokenized and lemmatized document: 
['disk', 'fail', 'time', 'like', 'replac']


Let's now preprocess all the newsgroup posts we have. To do that, we iterate over the list of documents in our training sample:

In [13]:
docs = []

for doc in newsgroups_train.data:
    docs.append(preprocessing(doc))

In [14]:
'''
Preview 'docs' to see which words are being captured
'''
print(docs[:2])

[['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['guykuo', 'carson', 'washington', 'subject', 'clock', 'poll', 'final', 'summari', 'final', 'clock', 'report', 'keyword', 'acceler', 'clock', 'upgrad', 'articl', 'shelley', 'qvfo', 'innc', 'organ', 'univers', 'washington', 'line', 'nntp', 'post', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'soul', 'upgrad', 'clock', 'oscil', 'share', 'experi', 'poll', 'send', 'brief', 'messag', 'detail', 'experi', 'procedur', 'speed', 'attain', 'rat', 'speed', 'card', 'adapt', 'heat', 'sink', 'hour', 'usag', 'floppi', 'disk', 'function', 'floppi', 'especi', 'request', 'summar', 'day',

## Step 3: Bag of words on the dataset

In [15]:
'''
capturing all the different words present in 'docs'
'''
all_word_corpus = []
for doc in docs:
    all_word_corpus += doc

**Length of docs vs. various percentiles**

Filter out docs that have

* fewer than a certain number of tokens or
* more than a certain number of tokens.

Both bounds are absolute token counts, chosen from the percentiles below.

In [16]:
doc_lengths = pd.Series([len(sent) for sent in docs])

In [17]:
doc_lengths.describe(percentiles=[x/10 for x in range(11)])

count    11314.000000
mean       126.780272
std        234.656018
min          9.000000
0%           9.000000
10%         37.000000
20%         48.000000
30%         58.000000
40%         69.000000
50%         81.000000
60%         94.000000
70%        114.000000
80%        144.000000
90%        213.000000
100%      5494.000000
max       5494.000000
dtype: float64

In [18]:
# printing the 0% to 9% percentiles
doc_lengths.describe(percentiles=[x/100 for x in range(0,10)])

count    11314.000000
mean       126.780272
std        234.656018
min          9.000000
0%           9.000000
1%          20.000000
2%          23.000000
3%          27.000000
4%          28.000000
5%          30.000000
6%          32.000000
7%          33.000000
8%          34.000000
9%          36.000000
50%         81.000000
max       5494.000000
dtype: float64

In [20]:
# printing the 90% to 99% percentiles
doc_lengths.describe(percentiles=[x/100 for x in range(90,100)])

count    11314.000000
mean       126.780272
std        234.656018
min          9.000000
50%         81.000000
90%        213.000000
91%        224.000000
92%        241.000000
93%        262.090000
94%        288.000000
95%        319.000000
96%        367.000000
97%        426.610000
98%        558.220000
99%        885.000000
max       5494.000000
dtype: float64

In [21]:
# choosing upper_bound = 300 and lower_bound = 30 (both exclusive)
chosen_docs = []
for doc in docs:
    if 30 < len(doc) < 300:
        chosen_docs.append(doc)

In [22]:
print(len(docs), len(chosen_docs))

11314 10063


In [23]:
docs = chosen_docs

In [19]:
'''
recapturing all the different words, this time from the filtered 'docs'
'''
all_word_corpus = []
for doc in docs:
    all_word_corpus += doc

## Step 4: Creating Corpus Before LDA ##

Some of the initialisations necessary for LDA are:
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **word2id** is a mapping from words (strings) to word ids (integers). It is used to convert each document into word ids and to look up the ids of seed words.
* **vocab** is a list of words (strings) ordered by word id. It is used to turn word ids back into words when printing topics.

In [20]:
word2id = {}
id2word = {}
vocab = []
currentWordId = 0

In [21]:
from collections import Counter

for word, count in Counter(all_word_corpus).most_common():
    word2id[word] = currentWordId
    id2word[currentWordId] = word
    currentWordId += 1
    vocab.append(word)

In [22]:
len(docs), len(word2id), len(id2word)

(11314, 61411, 61411)

In [23]:
final_docs = []

for doc in docs:
    currentDoc = []
    for word in doc: 
        if word in word2id:
            currentDoc.append(word2id[word])
    final_docs.append(currentDoc)

In [27]:
docs = final_docs

## Step 5: Running LDA using Bag of Words ##

We are going for 8 topics in the document corpus.

The only parameter we will be tweaking is:

* **n_topics**, the number of requested latent topics to be extracted from the training corpus.

In [28]:
X = np.zeros((len(docs), len(word2id)),dtype=int)

In [29]:
X.shape

(11314, 61411)

In [30]:
for idx, doc in enumerate(docs):
    for idy in doc:
        X[idx,idy] += 1
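To make the shape of `X` concrete, here is a toy version of the same steps in plain Python, using nested lists instead of a NumPy array. The `toy_` names and the three tiny documents are made up for illustration.

```python
from collections import Counter

# Three tiny, already-preprocessed documents (hypothetical data).
toy_docs = [["disk", "fail", "disk"], ["game", "team"], ["disk", "game"]]

# Vocabulary ordered by descending frequency, as in the cells above.
toy_counts = Counter(word for doc in toy_docs for word in doc)
toy_vocab = [word for word, _ in toy_counts.most_common()]
toy_word2id = {word: idx for idx, word in enumerate(toy_vocab)}

# One row per document, one column per vocabulary word.
toy_X = [[0] * len(toy_vocab) for _ in toy_docs]
for row, doc in enumerate(toy_docs):
    for word in doc:
        toy_X[row][toy_word2id[word]] += 1

print(toy_vocab)
for row in toy_X:
    print(row)
```

Each row sums to the length of its document. At the scale shown above (11314 documents by 61411 words) the dense matrix has roughly 695 million cells, so a sparse representation may be worth considering if memory is tight.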

### Case 1 : Regular LDA ###

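In [31]:
# pip install guidedlda
# In this environment the guidedlda module is importable from inside the lda package.
# With a standard guidedlda install, `import guidedlda as glda` would be the usual form.
from lda import guidedlda as glda

In [112]:
# Run this only after `model` has been fit (see Case 2 below):
# word ids sorted by descending weight for topic 1
top_words = list(reversed(model.word_topic_.T[1].argsort()))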
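For the regular (unseeded) case, the idea behind fitting LDA can be sketched with a toy collapsed Gibbs sampler, the inference approach the `lda` package documents itself as using. The code below is a minimal pure-Python illustration and not that package's implementation: the function name `toy_gibbs_lda`, the toy corpus of word ids, and the hyperparameter values are all assumptions made for the example.

```python
import random

def toy_gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, eta=0.01, seed=1):
    """Fit LDA on lists of word ids with a plain collapsed Gibbs sampler."""
    rng = random.Random(seed)
    ndz = [[0] * n_topics for _ in docs]               # topic counts per document
    nzw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    nz = [0] * n_topics                                # total words per topic
    z_assign = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            ndz[d][z] += 1
            nzw[z][w] += 1
            nz[z] += 1
        z_assign.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = z_assign[d][i]
                ndz[d][z] -= 1
                nzw[z][w] -= 1
                nz[z] -= 1
                # Resample the topic given all other assignments.
                weights = [
                    (ndz[d][k] + alpha) * (nzw[k][w] + eta) / (nz[k] + vocab_size * eta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                z_assign[d][i] = z
                ndz[d][z] += 1
                nzw[z][w] += 1
                nz[z] += 1
    # Smoothed point estimates of the two distributions.
    topic_word = [
        [(nzw[k][w] + eta) / (nz[k] + vocab_size * eta) for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    doc_topic = [
        [(ndz[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return topic_word, doc_topic

# Toy corpus: word ids 0-2 and 3-5 never appear in the same document.
toy_corpus = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [4, 5, 3, 5]]
topic_word, doc_topic = toy_gibbs_lda(toy_corpus, n_topics=2, vocab_size=6)
for row in doc_topic:
    print([round(p, 2) for p in row])
```

Each row of `topic_word` is a distribution over the vocabulary and each row of `doc_topic` is a distribution over topics, the same two quantities a fitted LDA model exposes. On real data a library implementation is the practical choice; this loop is far too slow for a corpus of this size.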

### Case 2 : Guided LDA ###

In [35]:
seed_dict = {'Graphics Cards' : 0, 'Space' : 1, 'Religion' : 2 , 'Politics' : 3, 'Gun Violence' : 4,
             'Technology' : 5, 'Sports' : 6, 'Encryption' : 7 }

In [36]:
seed_topics = {}

In [37]:
# Topic 0: Possibly Graphics Cards
seed_topics.update({
    word2id["drive"] : seed_dict["Graphics Cards"], word2id["sale"] : seed_dict["Graphics Cards"], 
    word2id["driver"] : seed_dict["Graphics Cards"], word2id["wire"] : seed_dict["Graphics Cards"], 
    word2id["card"] : seed_dict["Graphics Cards"], word2id["graphic"] : seed_dict["Graphics Cards"], 
    word2id["price"] : seed_dict["Graphics Cards"], word2id["appl"] : seed_dict["Graphics Cards"],
    word2id["softwar"] : seed_dict["Graphics Cards"], word2id["monitor"] : seed_dict["Graphics Cards"]
})

In [38]:
# Topic 1: Possibly Space
# Note: "drive" was already seeded to Graphics Cards above; this update overwrites it with Space
seed_topics.update({
    word2id["space"] : seed_dict["Space"], word2id["nasa"] : seed_dict["Space"], 
    word2id["drive"] : seed_dict["Space"], word2id["scsi"] : seed_dict["Space"], 
    word2id["orbit"] : seed_dict["Space"], word2id["launch"] : seed_dict["Space"],
    word2id["data"] : seed_dict["Space"], word2id["control"] : seed_dict["Space"], 
    word2id["earth"] : seed_dict["Space"],word2id["moon"] : seed_dict["Space"]
})

In [39]:
# Topic 6: Possibly Sports
seed_topics.update({
    word2id["game"] : seed_dict["Sports"], word2id["team"] : seed_dict["Sports"], 
    word2id["play"] : seed_dict["Sports"], word2id["player"] : seed_dict["Sports"], 
    word2id["hockey"] : seed_dict["Sports"], word2id["season"] : seed_dict["Sports"], 
    word2id["pitt"] : seed_dict["Sports"], word2id["score"] : seed_dict["Sports"], 
    word2id["leagu"] : seed_dict["Sports"], word2id["pittsburgh"] : seed_dict["Sports"]
})

In [40]:
# Topic 3: Possibly Politics
seed_topics.update({
    word2id["armenian"] : seed_dict["Politics"], word2id["public"] : seed_dict["Politics"], 
    word2id["govern"] : seed_dict["Politics"], word2id["turkish"] : seed_dict["Politics"], 
    word2id["columbia"] : seed_dict["Politics"], word2id["nation"] : seed_dict["Politics"], 
    word2id["presid"] : seed_dict["Politics"], word2id["turk"] : seed_dict["Politics"], 
    word2id["american"] : seed_dict["Politics"], word2id["group"] : seed_dict["Politics"]
})

In [41]:
# Topic 4: Possibly Gun Violence
seed_topics.update({
    word2id["kill"] : seed_dict["Gun Violence"], word2id["bike"] : seed_dict["Gun Violence"], 
    word2id["live"] : seed_dict["Gun Violence"], word2id["leav"] : seed_dict["Gun Violence"], 
    word2id["weapon"] : seed_dict["Gun Violence"], word2id["happen"] : seed_dict["Gun Violence"], 
    word2id["gun"] : seed_dict["Gun Violence"], word2id["crime"] : seed_dict["Gun Violence"],
    word2id["car"] : seed_dict["Gun Violence"], word2id["hand"] : seed_dict["Gun Violence"]
})
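The seed cells above look up every word with `word2id[...]`, which raises a `KeyError` if a seed word is missing from the vocabulary, and a word listed under two topics silently keeps only the last assignment (as happens with "drive"). One way to make both cases visible is a small helper like the sketch below. The function `build_seed_topics` and the `toy_` data are hypothetical and are not part of GuidedLDA.

```python
def build_seed_topics(seed_words, topic_ids, word2id):
    """Map seed word ids to topic ids, reporting skipped and reassigned words."""
    seed_topics = {}
    missing, reassigned = [], []
    for topic_name, words in seed_words.items():
        for word in words:
            if word not in word2id:
                missing.append(word)      # not in the vocabulary, cannot be seeded
                continue
            word_id = word2id[word]
            if word_id in seed_topics and seed_topics[word_id] != topic_ids[topic_name]:
                reassigned.append(word)   # a later topic overwrites an earlier one
            seed_topics[word_id] = topic_ids[topic_name]
    return seed_topics, missing, reassigned

# Toy stand-ins for the notebook's word2id and seed_dict.
toy_word2id = {"drive": 0, "card": 1, "space": 2, "nasa": 3}
toy_seed_dict = {"Graphics Cards": 0, "Space": 1}
toy_seed_words = {
    "Graphics Cards": ["drive", "card", "gpu"],
    "Space": ["space", "nasa", "drive"],
}
seeds, missing, reassigned = build_seed_topics(toy_seed_words, toy_seed_dict, toy_word2id)
print(seeds)       # word id -> topic id
print(missing)     # seed words absent from the vocabulary
print(reassigned)  # seed words claimed by more than one topic
```

The returned dictionary has the same shape as `seed_topics` in the cells above, so it could be passed to `model.fit` in the same way.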

In [42]:
model = glda.GuidedLDA(n_topics= 8, n_iter=1000, random_state=1, refresh=50)
model.fit(X, seed_topics = seed_topics, seed_confidence = 0.3)
topic_word = model.topic_word_
n_top_words = 25
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    print('\n')
    print('Topic {} : {}'.format(i, ', '.join(topic_words)))

INFO:lda:n_documents: 11314
INFO:lda:vocab_size: 61411
INFO:lda:n_words: 1434392
INFO:lda:n_topics: 8
INFO:lda:n_iter: 1000
INFO:lda:<0> log likelihood: -16188091
INFO:lda:<50> log likelihood: -12449256
INFO:lda:<100> log likelihood: -12346263
INFO:lda:<150> log likelihood: -12304848
INFO:lda:<200> log likelihood: -12284159
INFO:lda:<250> log likelihood: -12268061
INFO:lda:<300> log likelihood: -12259401
INFO:lda:<350> log likelihood: -12255148
INFO:lda:<400> log likelihood: -12251819
INFO:lda:<450> log likelihood: -12247948
INFO:lda:<500> log likelihood: -12245794
INFO:lda:<550> log likelihood: -12246700
INFO:lda:<600> log likelihood: -12244817
INFO:lda:<650> log likelihood: -12238984
INFO:lda:<700> log likelihood: -12235704
INFO:lda:<750> log likelihood: -12234234
INFO:lda:<800> log likelihood: -12233578
INFO:lda:<850> log likelihood: -12232271
INFO:lda:<900> log likelihood: -12230158
INFO:lda:<950> log likelihood: -12228875
INFO:lda:<999> log likelihood: -12229782




Topic 0 : window, file, line, subject, organ, program, write, post, problem, work, version, host, like, know, card, imag, drive, need, graphic, mail, nntp, univers, scsi, avail, help


Topic 1 : space, nasa, organ, line, subject, research, orbit, year, post, univers, center, launch, write, program, data, articl, develop, inform, scienc, access, work, includ, earth, satellit, time


Topic 2 : informatik, hamburg, appear, intercon, bontchev, nrhj, amanda, dresden, wwiz, organ, fbihh, repli, navi, vote, subject, gizw, vesselin, cover, germani, comic, copi, bhjn, wolverin, walker, line


Topic 3 : peopl, govern, right, write, state, armenian, organ, subject, line, articl, encrypt, say, know, think, israel, secur, public, like, time, post, isra, presid, chip, american, clipper


Topic 4 : write, line, articl, subject, organ, like, peopl, think, know, go, post, time, say, thing, good, host, nntp, come, right, year, want, look, tell, bike, univers


Topic 5 : write, peopl, christian, think,
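The slicing idiom that picks the top words in the cell above, `[:-(n_top_words + 1):-1]`, is compact but easy to misread. Here is the same logic on a toy example in plain Python, where `sorted` over indices stands in for `np.argsort`. The vocabulary and weights are made up.

```python
toy_vocab = ["game", "team", "orbit", "moon"]
toy_topic_dist = [0.1, 0.5, 0.3, 0.1]  # hypothetical word weights for one topic
n_top = 2

# Indices sorted by ascending weight: a pure-Python stand-in for np.argsort.
ascending = sorted(range(len(toy_topic_dist)), key=lambda i: toy_topic_dist[i])
# [:-(n_top + 1):-1] walks backwards from the end and stops after n_top items,
# so it yields the n_top highest-weighted indices in descending order.
top_ids = ascending[:-(n_top + 1):-1]
top_words = [toy_vocab[i] for i in top_ids]
print(top_words)
```

Here the two highest weights belong to "team" and "orbit", so those are the words returned, highest first.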

## Step 6: Testing model on unseen document ##

In [54]:
num = 100
unseen_document = newsgroups_test.data[num]
print(unseen_document)

Subject: help
From: C..Doelle@p26.f3333.n106.z1.fidonet.org (C. Doelle)
Lines: 13

Hello All!

    It is my understanding that all True-Type fonts in Windows are loaded in
prior to starting Windows - this makes getting into Windows quite slow if you
have hundreds of them as I do.  First off, am I correct in this thinking -
secondly, if that is the case - can you get Windows to ignore them on boot and
maybe make something like a PIF file to load them only when you enter the
applications that need fonts?  Any ideas?


Chris

 * Origin: chris.doelle.@f3333.n106.z1.fidonet.org (1:106/3333.26)



In [69]:
document = preprocessing(unseen_document)

In [70]:
type(document)

list

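In [74]:
# Build a bag-of-words count vector for the unseen document using the training vocabulary
bow_vector = np.zeros((1, len(word2id)), dtype=int)
for token in document:
    if token in word2id:  # words never seen during training are skipped
        bow_vector[0, word2id[token]] += 1

In [63]:
# Infer the topic distribution of the unseen document with the fitted model.
# This assumes GuidedLDA exposes `transform` in the same way the lda package does.
doc_topic = model.transform(bow_vector)[0]
for index in np.argsort(doc_topic)[::-1]:
    top_words = np.array(vocab)[np.argsort(topic_word[index])][:-6:-1]
    print("Score: {}\t Topic {}: {}".format(doc_topic[index], index, ', '.join(top_words)))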

In [58]:
print(newsgroups_test.target[num])

2


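The true label, 2, corresponds to `comp.os.ms-windows.misc` in `target_names`. The topics are learned without labels, so they do not map one-to-one onto the 20 newsgroups. A sensible result for this document is for the highest-scoring topic to be the Windows and software one (Topic 0 above).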