# LDA - Gensim with TF-IDF

This notebook trains an LDA topic model from gensim on a TF-IDF-weighted corpus.

# Importing libraries

In [1]:
from pprint import pprint
import pandas as pd
from nltk import word_tokenize

import gensim
from gensim import corpora, models

# Global Variables

If you're running on a local machine, run the cell below:

In [2]:
BASE_PATH = "./"

If you're running in Google Colab, run the cell below instead:

In [3]:
BASE_PATH = "<ENTER YOUR DRIVE PATH>"
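As an alternative to running one cell or the other by hand, the choice can be made automatically. The following is a minimal sketch, not part of the original notebook. It assumes that the `google.colab` package being loaded is a good enough signal for running in Colab, and the name `IN_COLAB` is introduced here only for illustration. The Drive path stays a placeholder you must fill in yourself.

```python
import sys

# Assumption: Colab preloads the `google.colab` package, so finding it in
# sys.modules is used here as a heuristic for "running in Colab".
IN_COLAB = "google.colab" in sys.modules

# The Drive path is left as the same placeholder used in the cell above.
BASE_PATH = "<ENTER YOUR DRIVE PATH>" if IN_COLAB else "./"

print(f"IN_COLAB={IN_COLAB}, BASE_PATH={BASE_PATH!r}")
```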

# Load Data
The preprocessed training and testing data come from
[20-news-dataset-pre-processing](https://github.com/nimmitahsin1727/20-news-dataset-pre-processing).

Read the training data from CSV:

In [3]:
training_df = pd.read_csv(f'{BASE_PATH}training_df.csv') 

Read the testing data from CSV:

In [4]:
testing_df = pd.read_csv(f'{BASE_PATH}testing_df.csv') 

# Corpus Creation

Create `data_words` by tokenizing each training document:

In [5]:
data_words = training_df.data.map(lambda doc: word_tokenize(doc)).values.tolist()

**Bag of Words on the Data set**

Create a dictionary from `data_words`. It maps each unique token to an integer ID and tracks how often each token occurs across the training set.

In [6]:
dictionary = corpora.Dictionary(data_words)
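To make the idea concrete, here is a minimal pure-Python sketch of the bookkeeping a dictionary like this performs. It is not gensim's implementation. The toy documents and the names `token2id` and `doc_freq` are illustrative assumptions, and the ID assignment order (sorted within each document, in order of first appearance) is chosen to resemble the alphabetical IDs seen in the sample output of this notebook.

```python
from collections import Counter

# Hypothetical, already-tokenized toy documents.
toy_docs = [
    ["bike", "ride", "bike"],
    ["gun", "law"],
    ["bike", "law", "helmet"],
]

token2id = {}         # token -> integer ID
doc_freq = Counter()  # token ID -> number of documents containing the token

for doc in toy_docs:
    # Each token is counted once per document, hence the set().
    for token in sorted(set(doc)):
        if token not in token2id:
            token2id[token] = len(token2id)
        doc_freq[token2id[token]] += 1

print(token2id)
print(dict(doc_freq))
```

In this toy corpus, "bike" gets ID 0 and a document frequency of 2, because it appears in two of the three documents, regardless of how many times it occurs inside each one.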

Print a few sample entries from the dictionary:

In [7]:
count = 0
# items() also works on gensim 4.x, where iteritems() is no longer available
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 actively
1 ama
2 apr
3 assistance
4 away
5 big
6 bike
7 board
8 camp
9 childish
10 count


**Gensim filter_extremes**

***Filter out tokens that appear in***

fewer than 15 documents (`no_below`, an absolute number), or
more than 50% of documents (`no_above=0.5`, a fraction of the total corpus size, not an absolute number).

In [8]:
dictionary.filter_extremes(no_below=15, no_above=0.5)
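The two thresholds can be illustrated with a small pure-Python sketch. This is a simplified illustration of the filtering rule, not gensim's code. The function name `filter_extremes_sketch` and the toy document frequencies are hypothetical.

```python
def filter_extremes_sketch(doc_freq, num_docs, no_below, no_above):
    """Return the tokens whose document frequency passes both thresholds.

    doc_freq maps token -> number of documents containing it.
    no_below is an absolute document count; no_above is a fraction of num_docs.
    """
    max_docs = no_above * num_docs
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Hypothetical document frequencies in a corpus of 100 documents.
toy_doc_freq = {"the": 90, "bike": 40, "honda": 15, "bnrca": 3}

kept = filter_extremes_sketch(toy_doc_freq, num_docs=100, no_below=15, no_above=0.5)
print(sorted(kept))
```

Here "the" is dropped for being too common and "bnrca" for being too rare. The real `filter_extremes` additionally caps the vocabulary size through its `keep_n` parameter (default 100000) and reassigns token IDs afterwards, which is why IDs printed before and after this step can differ.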

**Gensim doc2bow**

For each document, build a list of `(token_id, count)` pairs recording which dictionary words appear in it and how many times. Save the result to `bow_corpus`.

In [9]:
bow_corpus = [dictionary.doc2bow(doc) for doc in data_words]
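As a rough illustration of what a bag-of-words conversion produces, the sketch below counts known tokens and returns sorted `(token_id, count)` pairs. It is a simplified stand-in, not gensim's `doc2bow`. The function name `doc2bow_sketch` and the toy vocabulary are assumptions made for the example.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs.

    Tokens missing from token2id (for example, ones removed by filtering)
    are silently ignored.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical vocabulary and document.
toy_token2id = {"bike": 0, "corner": 1, "hit": 2}
toy_doc = ["hit", "corner", "bike", "corner", "hit", "unknownword"]

toy_bow = doc2bow_sketch(toy_doc, toy_token2id)
print(toy_bow)
```

Word order is discarded and only counts survive, which is the defining property of the bag-of-words representation.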

Preview the bag of words for a sample preprocessed document (index 123).



In [10]:
bow_doc_123 = bow_corpus[123]

In [11]:
for word_id, count in bow_doc_123:
    print(f'Word {word_id} ("{dictionary[word_id]}") appears {count} time.')

Word 4 ("bike") appears 1 time.
Word 7 ("dod") appears 2 time.
Word 19 ("make") appears 1 time.
Word 26 ("say") appears 2 time.
Word 33 ("work") appears 1 time.
Word 45 ("good") appears 1 time.
Word 48 ("just") appears 1 time.
Word 51 ("like") appears 1 time.
Word 82 ("old") appears 1 time.
Word 114 ("thing") appears 1 time.
Word 135 ("replyto") appears 1 time.
Word 150 ("check") appears 1 time.
Word 156 ("damage") appears 1 time.
Word 184 ("right") appears 1 time.
Word 187 ("spot") appears 1 time.
Word 193 ("today") appears 1 time.
Word 211 ("difference") appears 1 time.
Word 228 ("piece") appears 1 time.
Word 243 ("wish") appears 1 time.
Word 274 ("year") appears 1 time.
Word 276 ("corner") appears 2 time.
Word 309 ("confuse") appears 1 time.
Word 311 ("delete") appears 1 time.
Word 321 ("mike") appears 1 time.
Word 350 ("hit") appears 2 time.
Word 351 ("home") appears 1 time.
Word 381 ("honda") appears 1 time.
Word 404 ("day") appears 1 time.
Word 448 ("list") appears 1 time.
...

**TF-IDF**

Create tf-idf model object using models.TfidfModel on `bow_corpus` and save it to `tfidf`.

In [12]:
tfidf = models.TfidfModel(bow_corpus)

Then apply transformation to the entire corpus and call it `corpus_tfidf`. 

In [13]:
corpus_tfidf = tfidf[bow_corpus]
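The sketch below shows one common way to compute such weights by hand: raw term frequency times `log2(N / df)`, followed by scaling each document vector to unit length. To my understanding this matches the default settings of `models.TfidfModel`, but treat it as an illustration under that assumption rather than a reimplementation. The function name `tfidf_sketch` and the toy corpus are hypothetical.

```python
import math


def tfidf_sketch(bow_corpus):
    """Weight each (token_id, tf) pair by tf * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)

    # Document frequency: in how many documents does each token ID occur?
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    result = []
    for doc in bow_corpus:
        weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens present in every document get weight 0 and are dropped.
        result.append([(tid, w / norm) for tid, w in weights if norm > 0 and w > 0])
    return result


# Hypothetical bag-of-words corpus with three documents.
toy_bow_corpus = [[(0, 1), (1, 2)], [(0, 1), (1, 1), (2, 1)], [(0, 3)]]

toy_tfidf = tfidf_sketch(toy_bow_corpus)
for doc in toy_tfidf:
    print(doc)
```

Token 0 occurs in all three toy documents, so its weight is zero everywhere and the third document ends up empty. Rarer tokens receive the larger weights.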

Finally, preview the TF-IDF scores for the first document.

In [14]:
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.09489126715998976),
 (1, 0.04122880275854039),
 (2, 0.07715428997063789),
 (3, 0.14758920597550684),
 (4, 0.10070770263942229),
 (5, 0.11385022401381809),
 (6, 0.10716872118878243),
 (7, 0.047726607296539166),
 (8, 0.820901229577695),
 (9, 0.1164666362552121),
 (10, 0.09961072060371479),
 (11, 0.06929385519173295),
 (12, 0.1316869776947954),
 (13, 0.07155540336586026),
 (14, 0.10587086007883405),
 (15, 0.07668335931854188),
 (16, 0.1225290967934336),
 (17, 0.13858883679180878),
 (18, 0.08726454060961343),
 (19, 0.035006405260232965),
 (20, 0.04539626019819227),
 (21, 0.0889926820797483),
 (22, 0.12036536133205145),
 (23, 0.10173508037814769),
 (24, 0.07255198705502106),
 (25, 0.09361235903184789),
 (26, 0.03758256619598643),
 (27, 0.11556819406164225),
 (28, 0.12876483420491122),
 (29, 0.12486610912807189),
 (30, 0.1367045012371823),
 (31, 0.05324109095911733),
 (32, 0.11934079770069406),
 (33, 0.05054406090088698),
 (34, 0.07621990407260527)]


# LDA

**Running LDA using TF-IDF**

Train the LDA model with `gensim.models.LdaMulticore` and save it to `lda_model`.

In [15]:
lda_model = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

Print the top keywords for each of the 10 topics:

In [16]:
pprint(lda_model.print_topics())

[(0,
  '0.005*"bike" + 0.004*"bnrca" + 0.004*"infante" + 0.004*"dog" + 0.003*"dod" '
  '+ 0.003*"san" + 0.003*"motorcycle" + 0.003*"mile" + 0.003*"bmw" + '
  '0.003*"computer"'),
 (1,
  '0.004*"gun" + 0.004*"curve" + 0.003*"bike" + 0.003*"lock" + 0.003*"helmet" '
  '+ 0.003*"version" + 0.003*"firearm" + 0.003*"make" + 0.003*"know" + '
  '0.003*"point"'),
 (2,
  '0.005*"bike" + 0.004*"graphic" + 0.004*"tank" + 0.004*"chuck" + 0.003*"gun" '
  '+ 0.003*"like" + 0.003*"look" + 0.003*"buy" + 0.003*"value" + '
  '0.003*"email"'),
 (3,
  '0.004*"gun" + 0.003*"mode" + 0.003*"cdt" + 0.003*"child" + '
  '0.003*"motorcycle" + 0.003*"know" + 0.003*"say" + 0.003*"make" + '
  '0.003*"course" + 0.003*"rider"'),
 (4,
  '0.007*"file" + 0.006*"image" + 0.005*"format" + 0.004*"graphic" + '
  '0.004*"use" + 0.004*"bit" + 0.004*"behanna" + 0.004*"need" + 0.003*"know" + '
  '0.003*"thanks"'),
 (5,
  '0.003*"bike" + 0.003*"dave" + 0.003*"image" + 0.003*"gun" + 0.003*"use" + '
  '0.003*"routine" + 0.003*"dog"