# LDA - Gensim with token filtering

This notebook trains an LDA topic model with gensim, after filtering the vocabulary by document frequency.

# Importing libraries

In [1]:
from pprint import pprint
import pandas as pd
from nltk import word_tokenize

import gensim
from gensim import corpora, models

# Global Variables

If you're working on a local machine, run the cell below:

In [5]:
BASE_PATH = "./"

If you're working in Google Colab, run the cell below instead:

In [3]:
BASE_PATH = "<ENTER YOUR DRIVE PATH>"
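The two cells above can also be folded into one that picks the path automatically. This is a minimal sketch, assuming Colab can be recognized by the presence of the `google.colab` module. `running_in_colab` and `resolve_base_path` are hypothetical helper names, and the Drive placeholder is left for you to fill in.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True if the google.colab module can be found (assumed Colab marker)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # The parent "google" package is missing or has no usable spec.
        return False


def resolve_base_path(in_colab: bool) -> str:
    """Pick the data directory for the current environment."""
    # Replace the placeholder with your own Drive folder when using Colab.
    return "<ENTER YOUR DRIVE PATH>" if in_colab else "./"


BASE_PATH = resolve_base_path(running_in_colab())
```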

# Load Data
The preprocessed training and testing data come from [20-news-dataset-pre-processing](https://github.com/nimmitahsin1727/20-news-dataset-pre-processing).

Reading TRAINING from CSV:

In [6]:
training_df = pd.read_csv(f'{BASE_PATH}training_df.csv') 

Reading TESTING from CSV:

In [7]:
testing_df = pd.read_csv(f'{BASE_PATH}testing_df.csv') 

Create `data_words` by tokenizing each training document:

In [8]:
data_words = training_df["data"].map(word_tokenize).tolist()

**Bag of Words on the Data set**

Create a dictionary from `data_words`. It maps each distinct token to an integer id and keeps track of how often each token appears in the training set.

In [9]:
dictionary = corpora.Dictionary(data_words)
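To make the mapping concrete, here is a plain-Python sketch of the kind of bookkeeping such a dictionary performs: each distinct token gets an integer id, and a document-frequency count records how many documents contain it. It only illustrates the idea on made-up documents and is not gensim's implementation. `build_vocabulary` and `toy_docs` are hypothetical names.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to each distinct token and count document frequencies."""
    token2id = {}
    doc_freq = Counter()
    for doc in docs:
        # set() so that a token repeated inside one document is counted once
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
            doc_freq[token] += 1
    return token2id, doc_freq


# Made-up documents, already tokenized
toy_docs = [["bike", "honda", "bike"], ["gun", "bike"], ["file", "image"]]
token2id, doc_freq = build_vocabulary(toy_docs)
print(token2id)
print(doc_freq["bike"])
```

Here `bike` ends up with a document frequency of 2 even though it occurs three times in total, because document frequency counts documents, not occurrences.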


Print a few sample entries from the dictionary:

In [10]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 actively
1 ama
2 apr
3 assistance
4 away
5 big
6 bike
7 board
8 camp
9 childish
10 count


**Gensim filter_extremes**

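Filter out tokens that appear in:

- fewer than 15 documents (an absolute number), or
- more than 50% of the documents (`no_above=0.5` is a fraction of the total corpus size, not an absolute number).

After these two steps, only the most frequent remaining tokens are kept, up to `keep_n` tokens (the call below leaves `keep_n` at its default).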

In [11]:
dictionary.filter_extremes(no_below=15, no_above=0.5)
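The filtering rule described above can be sketched in plain Python on a toy corpus. This is an illustration of the rule, not gensim's code. `filter_vocabulary` is a hypothetical name, and keeping tokens that sit exactly on a threshold is an assumption of this sketch.

```python
from collections import Counter


def filter_vocabulary(docs, no_below, no_above, keep_n=None):
    """Return the set of tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    # Document frequency: number of documents that contain each token
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * num_docs  # fraction converted to a document count
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent tokens first (ties broken alphabetically), then cap the size
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept if keep_n is None else kept[:keep_n])


# Made-up documents: "the" is too common, "honda" and "image" are too rare
toy_docs = [
    ["bike", "honda", "the"],
    ["bike", "gun", "the"],
    ["gun", "file", "the"],
    ["image", "file"],
]
print(filter_vocabulary(toy_docs, no_below=2, no_above=0.5))
```

With `no_below=2` and `no_above=0.5` on four documents, only tokens found in exactly two documents survive: `bike`, `gun`, and `file`.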

**Gensim doc2bow**

For each document, build a bag-of-words representation: a list of `(token_id, count)` pairs recording which dictionary tokens occur in the document and how many times. Save the result to `bow_corpus`.

In [12]:
bow_corpus = [dictionary.doc2bow(doc) for doc in data_words]
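As a rough illustration of what a bag-of-words conversion produces, the plain-Python sketch below turns a token list into sorted `(token_id, count)` pairs and skips tokens that are not in the vocabulary. It is not gensim's implementation. `to_bag_of_words`, `toy_token2id`, and `toy_doc` are hypothetical names with made-up data.

```python
from collections import Counter


def to_bag_of_words(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    # Tokens missing from the vocabulary (for example, filtered out) are skipped
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_token2id = {"bike": 0, "file": 1, "gun": 2}
toy_doc = ["gun", "bike", "the", "bike"]
print(to_bag_of_words(toy_doc, toy_token2id))  # [(0, 2), (2, 1)]
```

The token `the` is dropped because it has no id, which is why filtering the dictionary first also shrinks every document's representation.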

Preview the bag of words for a sample preprocessed document (index 123):


In [13]:
bow_doc_123 = bow_corpus[123]

In [14]:
# Each entry is a (token_id, count) pair; look the token up by its id
for token_id, count in bow_doc_123:
    print(f'Word {token_id} ("{dictionary[token_id]}") appears {count} time.')

Word 4 ("bike") appears 1 time.
Word 7 ("dod") appears 2 time.
Word 19 ("make") appears 1 time.
Word 26 ("say") appears 2 time.
Word 33 ("work") appears 1 time.
Word 45 ("good") appears 1 time.
Word 48 ("just") appears 1 time.
Word 51 ("like") appears 1 time.
Word 82 ("old") appears 1 time.
Word 114 ("thing") appears 1 time.
Word 135 ("replyto") appears 1 time.
Word 150 ("check") appears 1 time.
Word 156 ("damage") appears 1 time.
Word 184 ("right") appears 1 time.
Word 187 ("spot") appears 1 time.
Word 193 ("today") appears 1 time.
Word 211 ("difference") appears 1 time.
Word 228 ("piece") appears 1 time.
Word 243 ("wish") appears 1 time.
Word 274 ("year") appears 1 time.
Word 276 ("corner") appears 2 time.
Word 309 ("confuse") appears 1 time.
Word 311 ("delete") appears 1 time.
Word 321 ("mike") appears 1 time.
Word 350 ("hit") appears 2 time.
Word 351 ("home") appears 1 time.
Word 381 ("honda") appears 1 time.
Word 404 ("day") appears 1 time.
Word 448 ("list") appears 1 time.
Word 4

**Running LDA using Bag of Words**

Train the LDA model with `gensim.models.LdaMulticore` (10 topics, 2 passes over the corpus, 2 worker processes) and save it to `lda_model`:

In [15]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

Print the top keywords for each of the 10 topics:

In [16]:
pprint(lda_model.print_topics())

[(0,
  '0.019*"gun" + 0.015*"use" + 0.012*"image" + 0.009*"like" + 0.008*"make" + '
  '0.006*"file" + 0.006*"right" + 0.006*"just" + 0.006*"know" + '
  '0.005*"program"'),
 (1,
  '0.009*"say" + 0.008*"make" + 0.008*"fbi" + 0.007*"right" + 0.007*"gun" + '
  '0.006*"just" + 0.006*"time" + 0.006*"start" + 0.006*"like" + 0.005*"good"'),
 (2,
  '0.016*"bike" + 0.010*"like" + 0.008*"right" + 0.008*"time" + 0.008*"just" + '
  '0.008*"say" + 0.007*"make" + 0.006*"dod" + 0.006*"gun" + 0.006*"use"'),
 (3,
  '0.025*"file" + 0.008*"good" + 0.006*"use" + 0.006*"know" + 0.005*"make" + '
  '0.005*"image" + 0.005*"world" + 0.005*"need" + 0.005*"like" + '
  '0.005*"graphic"'),
 (4,
  '0.018*"image" + 0.009*"graphic" + 0.009*"file" + 0.009*"point" + '
  '0.008*"use" + 0.007*"package" + 0.007*"know" + 0.006*"software" + '
  '0.006*"data" + 0.006*"look"'),
 (5,
  '0.009*"jpeg" + 0.009*"bike" + 0.008*"dod" + 0.008*"behanna" + 0.008*"use" + '
  '0.007*"right" + 0.007*"know" + 0.006*"state" + 0.006*"image" +