# LDA - Scratch implementation with token filtering

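This notebook fits an LDA topic model with a from-scratch implementation (`vbLDA`) after filtering the vocabulary with Gensim's `filter_extremes`, then trains a Gensim `LdaMulticore` model on the same corpus for comparison.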

# Importing libraries

In [1]:
from pprint import pprint
import pandas as pd
from nltk import word_tokenize

import gensim
from gensim import corpora, models

# Global Variables

If you are running on a local machine, run the cell below:

In [2]:
BASE_PATH = "./"

If you are running in Google Colab, uncomment the cell below and set your Drive path instead:

In [3]:
# BASE_PATH = "<ENTER YOUR DRIVE PATH>"

# Load Data
The preprocessed training and testing data come from [20-news-dataset-pre-processing](https://github.com/nimmitahsin1727/20-news-dataset-pre-processing).

Read the training set from CSV:

In [4]:
training_df = pd.read_csv(f'{BASE_PATH}training_df.csv') 

Read the testing set from CSV:

In [5]:
testing_df = pd.read_csv(f'{BASE_PATH}testing_df.csv') 

Tokenize each training document to build `data_words`, a list of token lists:

In [6]:
data_words = training_df['data'].map(word_tokenize).tolist()

# Corpus Creation

**Bag of Words on the Data set**

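Create a dictionary from `data_words` that maps each unique token in the training set to an integer id. Gensim also tracks how many documents each token appears in, which the filtering step below relies on.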

In [7]:
dictionary = corpora.Dictionary(data_words)

In [8]:
print("Total words: ", len(dictionary))

Total words:  22094


Print a few sample entries (token id, token) from the dictionary:

In [9]:
from itertools import islice

# Show the first 11 (token id, token) pairs.
for token_id, token in islice(dictionary.iteritems(), 11):
    print(token_id, token)

0 actively
1 ama
2 apr
3 assistance
4 away
5 big
6 bike
7 board
8 camp
9 childish
10 count


**Gensim filter_extremes**

***Filter out tokens that appear in***

fewer than 15 documents (`no_below=15`, an absolute count), or
more than 50% of the documents (`no_above=0.5`, a fraction of the corpus size, not an absolute count).

In [10]:
dictionary.filter_extremes(no_below=15, no_above=0.5)

In [11]:
print("Total words after filter: ", len(dictionary))

Total words after filter:  1832
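
To make the two thresholds concrete, here is a minimal pure-Python sketch of document-frequency filtering on a made-up toy corpus. The function name `filter_by_document_frequency` and the toy data are illustrative assumptions, not part of Gensim, and the sketch ignores `filter_extremes`' other options (such as `keep_n`).

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Return the set of tokens kept under the two document-frequency thresholds.

    Hypothetical helper for illustration only; not Gensim's implementation.
    """
    # Document frequency: number of documents that contain the token at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus, so convert it to a document count.
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus (made up): "gun" is in every document, most other tokens in only one.
toy_docs = [
    ["gun", "law", "state"],
    ["gun", "bike", "ride"],
    ["gun", "image", "file"],
    ["gun", "law", "file"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['file', 'law']
```

Here `gun` is dropped for being too common (4 of 4 documents), and tokens seen in a single document are dropped for being too rare.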


**Gensim doc2bow**

`doc2bow` converts each document into a list of `(token_id, count)` pairs, recording which dictionary tokens appear in it and how many times. Save the result to `bow_corpus`, then inspect the first document.

In [12]:
bow_corpus = [dictionary.doc2bow(doc) for doc in data_words]
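
As a rough illustration of the bag-of-words conversion, the sketch below counts tokens against a small hand-written token-to-id mapping. The helper `to_bow`, the mapping, and the document are assumptions made up for this example; tokens missing from the mapping are simply skipped, which is also why filtered-out words no longer appear in `bow_corpus`.

```python
from collections import Counter


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs.

    Hypothetical helper for illustration; tokens not in token2id are ignored.
    """
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Made-up vocabulary and document.
toy_token2id = {"bike": 0, "dog": 1, "ride": 2}
toy_doc = ["dog", "bike", "dog", "helmet", "ride", "dog"]

bow = to_bow(toy_doc, toy_token2id)
print(bow)  # [(0, 1), (1, 3), (2, 1)]
```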

Preview the bag-of-words representation of the first training document.

In [13]:
bow_doc_0 = bow_corpus[0]
bow_doc_0

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 2),
 (4, 2),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 9),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 1),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 1),
 (34, 1)]

In [14]:
for token_id, count in bow_doc_0:
    print(f'Word {token_id} ("{dictionary[token_id]}") appears {count} time.')

Word 0 ("ama") appears 1 time.
Word 1 ("apr") appears 1 time.
Word 2 ("away") appears 1 time.
Word 3 ("big") appears 2 time.
Word 4 ("bike") appears 2 time.
Word 5 ("board") appears 1 time.
Word 6 ("count") appears 1 time.
Word 7 ("dod") appears 1 time.
Word 8 ("dog") appears 9 time.
Word 9 ("eat") appears 1 time.
Word 10 ("face") appears 1 time.
Word 11 ("great") appears 1 time.
Word 12 ("hate") appears 1 time.
Word 13 ("hear") appears 1 time.
Word 14 ("heard") appears 1 time.
Word 15 ("large") appears 1 time.
Word 16 ("leg") appears 1 time.
Word 17 ("lift") appears 1 time.
Word 18 ("love") appears 1 time.
Word 19 ("make") appears 1 time.
Word 20 ("need") appears 1 time.
Word 21 ("owner") appears 1 time.
Word 22 ("party") appears 1 time.
Word 23 ("personal") appears 1 time.
Word 24 ("ride") appears 1 time.
Word 25 ("rider") appears 1 time.
Word 26 ("say") appears 1 time.
Word 27 ("seat") appears 1 time.
Word 28 ("seek") appears 1 time.
Word 29 ("shit") appears 1 time.
Word 30 ("should

# LDA

**Running LDA - Scratch**

In [15]:
import numpy as np
from lda_vb import vbLDA

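In [16]:
n_topic = 10
max_iter = 100

# Vocabulary list aligned with token ids: voca[i] is the token with id i.
voca = [dictionary[i] for i in range(len(dictionary))]

n_doc = len(bow_corpus)   # number of documents, not the vocabulary size
n_voca = len(dictionary)  # vocabulary size (1832 after filtering)

print(n_doc, n_voca)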


In [17]:
doc_ids = [list(map(lambda bow: bow[0], bow_doc)) for bow_doc in bow_corpus]
doc_cnt = [list(map(lambda bow: bow[1], bow_doc)) for bow_doc in bow_corpus]
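
Judging from the `fit(doc_ids, doc_cnt, ...)` call used in this notebook, `vbLDA` takes two parallel lists: token ids per document and the matching counts per document. The sketch below shows that split on a made-up toy corpus; `split_bow_corpus` is a hypothetical helper name, not part of `lda_vb`.

```python
def split_bow_corpus(bow_corpus):
    """Split a list of [(token_id, count), ...] documents into parallel lists."""
    doc_ids = [[token_id for token_id, _ in doc] for doc in bow_corpus]
    doc_cnt = [[count for _, count in doc] for doc in bow_corpus]
    return doc_ids, doc_cnt


# Made-up corpus of three documents; the last one is empty after filtering.
toy_bow_corpus = [[(0, 1), (3, 2)], [(1, 4)], []]

toy_ids, toy_cnt = split_bow_corpus(toy_bow_corpus)
print(toy_ids)  # [[0, 3], [1], []]
print(toy_cnt)  # [[1, 2], [4], []]

# Both lists have one entry per document, in the same order.
assert len(toy_ids) == len(toy_cnt) == len(toy_bow_corpus)
```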

LDA model initialization

In [18]:
lda_vb_model = vbLDA(n_doc, n_voca, n_topic)

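Model fitting. `n_doc` must equal the number of documents in `doc_ids`; when the vocabulary size (1832) was passed as `n_doc` instead, `fit` raised `IndexError: list index out of range`.

In [20]:
lda_vb_model.fit(doc_ids, doc_cnt, max_iter=max_iter)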

Print the top 10 keywords for each of the 10 topics

In [None]:
def get_top_words(topic_word_matrix, vocab, topic, n_words=20):
    if not isinstance(vocab, np.ndarray):
        vocab = np.array(vocab)
    top_words = vocab[topic_word_matrix[topic].argsort()[::-1][:n_words]]
    return top_words

In [None]:
for ti in range(n_topic):
    top_words = get_top_words(lda_vb_model._lambda, voca, ti, n_words=10)
    print('Topic', ti, ': ', ','.join(top_words))

Topic 0 :  gun,crime,use,firearm,criminal,control,time,like,rate,kill
Topic 1 :  file,firearm,state,weapon,law,gun,use,control,handgun,united
Topic 2 :  gun,like,cop,police,carry,just,revolver,state,say,behanna
Topic 3 :  right,militia,state,government,weapon,law,arm,make,group,just
Topic 4 :  image,file,use,jpeg,graphic,format,data,program,software,package
Topic 5 :  dod,bike,like,dog,ride,just,say,motorcycle,rid,make
Topic 6 :  bike,make,just,good,dod,like,apr,helmet,know,look
Topic 7 :  say,fbi,child,atf,compound,day,waco,start,make,come
Topic 8 :  file,use,polygon,know,computer,program,look,color,point,need
Topic 9 :  know,point,graphic,use,card,need,video,bit,mode,driver


**Running LDA - GENSIM**

Train an LDA model with `gensim.models.LdaMulticore` and save it to `lda_model`.

In [47]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=n_topic, id2word=dictionary, random_state=100)

# lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=n_topic, id2word=dictionary, random_state=100,  passes=2, workers=2)

Print the top keywords and their weights for each topic

In [48]:
pprint(lda_model.print_topics())

[(0,
  '0.014*"gun" + 0.012*"use" + 0.008*"point" + 0.008*"law" + 0.007*"weapon" + '
  '0.007*"make" + 0.007*"like" + 0.006*"firearm" + 0.006*"right" + '
  '0.005*"file"'),
 (1,
  '0.009*"use" + 0.009*"good" + 0.007*"time" + 0.007*"know" + 0.007*"image" + '
  '0.007*"like" + 0.006*"make" + 0.006*"bike" + 0.006*"graphic" + '
  '0.006*"just"'),
 (2,
  '0.008*"use" + 0.008*"gun" + 0.008*"like" + 0.007*"just" + 0.006*"say" + '
  '0.006*"know" + 0.006*"file" + 0.005*"time" + 0.005*"bike" + 0.005*"case"'),
 (3,
  '0.011*"use" + 0.010*"image" + 0.006*"know" + 0.006*"say" + 0.005*"dod" + '
  '0.005*"file" + 0.005*"data" + 0.005*"just" + 0.005*"dog" + 0.005*"like"'),
 (4,
  '0.011*"like" + 0.010*"know" + 0.009*"make" + 0.009*"thing" + 0.009*"say" + '
  '0.008*"gun" + 0.007*"good" + 0.007*"bike" + 0.007*"just" + 0.007*"apr"'),
 (5,
  '0.011*"use" + 0.007*"just" + 0.007*"image" + 0.007*"time" + 0.006*"gun" + '
  '0.006*"bike" + 0.006*"like" + 0.005*"graphic" + 0.005*"good" + '
  '0.005*"know"'),
