# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a popular approach for topic modeling. It works by identifying the key topics within a set of text documents, and the key words that make up each topic.

Under LDA, each document is assumed to have a mix of underlying (latent) topics, each topic with a certain probability of occurring in the document. Individual text documents can therefore be represented by the topics that make them up.

In this way, LDA topic modeling can be used to categorize or classify documents based on their topic content.
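The idea of a document as a mix of topics can be sketched in a few lines of plain Python. This is a toy illustration of the generative view behind LDA, not how gensim trains a model: the topic names, word probabilities, and the `generate_document` helper are all hypothetical.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "action": {"fight": 0.5, "chase": 0.3, "hero": 0.2},
    "romance": {"love": 0.6, "kiss": 0.3, "hero": 0.1},
}

def generate_document(topic_mixture, topics, n_words, rng):
    """Sample each word by first picking a topic, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words

rng = random.Random(2018)
# A document that is assumed to be 70% "action" and 30% "romance".
doc = generate_document({"action": 0.7, "romance": 0.3}, topics, 8, rng)
print(doc)
```

Training an LDA model runs this reasoning in reverse: given only the words, it estimates the topic mixtures and the word distributions that could plausibly have produced them.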

Each LDA topic model requires:

1. A set of documents for training the model—the training corpus
2. A dictionary of words to form the vocabulary used in the model—this can be derived from the training corpus


Once a model has been trained, it can be applied to a new set of documents to identify the topics in those new documents.

## Dataset

In [2]:
import pandas as pd

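# Skip malformed rows. On pandas 1.3+ use on_bad_lines='skip' instead of error_bad_lines=False.
data = pd.read_csv('C:/Users/manna/OneDrive/Documents/movie_review_final.csv', error_bad_lines=False)
data_text = data[['text']].copy()  # copy so adding a column does not trigger SettingWithCopyWarning
data_text['index'] = data_text.index
documents = data_text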

In [3]:
len(documents)

64720

In [4]:
documents.head()

| | text | index |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 0 |
| 1 | for starters , it was created by alan moore ( ... | 1 |
| 2 | to say moore and campbell thoroughly researche... | 2 |
| 3 | the book ( or " graphic novel , " if you will ... | 3 |
| 4 | in other words , don't dismiss this film becau... | 4 |


## Pre-processing

In [7]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.0.1-cp37-cp37m-win_amd64.whl (23.9 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.0.0-py3-none-any.whl (56 kB)
Collecting Cython==0.29.21
  Downloading Cython-0.29.21-cp37-cp37m-win_amd64.whl (1.6 MB)
Installing collected packages: smart-open, Cython, gensim
  Attempting uninstall: Cython
    Found existing installation: Cython 0.29.15
    Uninstalling Cython-0.29.15:
      Successfully uninstalled Cython-0.29.15
Successfully installed Cython-0.29.21 gensim-4.0.1 smart-open-5.0.0


In [8]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(2018)



In [9]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\manna\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [10]:
print(WordNetLemmatizer().lemmatize('went', pos='v'))

go


In [13]:
stemmer = SnowballStemmer('english')
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]
pd.DataFrame(data = {'original word': original_words, 'stemmed': singles})

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [14]:
def lemmatize_stemming(text):
    """Lemmatize a token as a verb, then reduce it to its Snowball stem."""
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    """Tokenize text, drop stop words and tokens of 3 or fewer characters, then lemmatize and stem."""
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
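The filtering part of this step can be illustrated without gensim or NLTK. The sketch below is a simplified stand-in, not the function used in this notebook: it uses a regular expression for tokenizing, a deliberately tiny hypothetical stop-word set (`TOY_STOPWORDS`), and it skips lemmatizing and stemming entirely.

```python
import re

# Hypothetical, deliberately tiny stop-word list; gensim's STOPWORDS is much larger.
TOY_STOPWORDS = {"this", "that", "will", "when", "such", "have", "from"}

def toy_preprocess(text, min_len=4):
    """Lowercase the text, keep alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

print(toy_preprocess("This film uses color and light in such a fantastic way."))
# ['film', 'uses', 'color', 'light', 'fantastic']
```

Because no stemming is applied here, "uses" and "fantastic" stay intact, whereas the full pipeline reduces them to "use" and "fantast".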

## Reviewing a Pre-Processed Document

In [15]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['this', 'film', 'uses', 'color', 'and', 'light', 'in', 'such', 'a', 'fantastic', 'way', ',', 'that', 'it', 'will', 'be', 'sad', 'to', 'see', 'it', 'degraded', 'when', 'it', 'has', 'to', 'bee', 'transferred', 'to', 'video', '.']


 tokenized and lemmatized document: 
['film', 'use', 'color', 'light', 'fantast', 'degrad', 'transfer', 'video']


In [17]:
processed_docs = documents['text'].map(preprocess)

In [18]:
processed_docs[:10]

0    [film, adapt, comic, book, plenti, success, su...
1    [starter, creat, alan, moor, eddi, campbel, br...
2    [moor, campbel, thorough, research, subject, j...
3    [book, graphic, novel, page, long, includ, nea...
4                         [word, dismiss, film, sourc]
5    [past, comic, book, thing, stumbl, block, hell...
6    [get, hugh, brother, direct, ludicr, cast, car...
7    [ghetto, question, cours, whitechapel, london,...
8    [filthi, sooti, place, whore, call, unfortun, ...
9    [stiff, turn, copper, peter, godley, robbi, co...
Name: text, dtype: object

## Bag of Words on the Dataset

Create a dictionary from `processed_docs` that maps each unique word in the training set to an integer id and records how many documents contain it.

In [20]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [21]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 adapt
1 arthous
2 batman
3 book
4 casper
5 comic
6 crowd
7 film
8 gear
9 ghost
10 hell
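To make the mapping concrete, here is a minimal pure-Python sketch of what such a dictionary holds. The `build_vocabulary` helper and the toy documents are hypothetical, and ids are assigned in order of first appearance, which need not match the order gensim uses.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each token to an integer id and count how many documents contain it."""
    token2id = {}
    doc_freq = Counter()
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
        # set() so a token is counted once per document, however often it repeats.
        doc_freq.update(set(doc))
    return token2id, doc_freq

toy_docs = [["film", "comic", "book"], ["film", "book", "book"], ["comic", "hero"]]
token2id, doc_freq = build_vocabulary(toy_docs)
print(token2id)          # {'film': 0, 'comic': 1, 'book': 2, 'hero': 3}
print(doc_freq["book"])  # 2
```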


## Gensim doc2bow

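Before building the corpus, filter the dictionary: drop tokens that appear in fewer than 15 documents or in more than 50% of documents, then keep at most the 100,000 most frequent of those that remain. Next, convert each document into a bag of words, a list of (token id, count) pairs recording which dictionary words it contains and how many times each appears. Save the result to `bow_corpus`, then check the sample document selected earlier.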

In [22]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
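The effect of the document-frequency thresholds can be sketched in plain Python. The `filter_tokens` helper below is hypothetical and only mirrors the idea behind `no_below`, `no_above`, and `keep_n`; it is not gensim's implementation and may differ in edge cases such as rounding.

```python
from collections import Counter

def filter_tokens(docs, no_below, no_above, keep_n=None):
    """Return the tokens whose document frequency falls inside the thresholds."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    kept = [
        (token, df) for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    ]
    # Most frequent first, so a cap keeps the commonest surviving tokens.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if keep_n is not None:
        kept = kept[:keep_n]
    return {token for token, _ in kept}

toy_docs = [["film", "plot"], ["film", "actor"], ["film", "plot", "rare"], ["film", "actor"]]
print(sorted(filter_tokens(toy_docs, no_below=2, no_above=0.5)))
# ['actor', 'plot']
```

In this toy corpus "film" is dropped because it appears in every document, and "rare" is dropped because it appears in only one.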

In [23]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(6, 1), (167, 1), (336, 1), (353, 1), (1155, 1), (1725, 1), (1962, 1)]

In [24]:
bow_doc_4310 = bow_corpus[4310]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 6 ("film") appears 1 time.
Word 167 ("color") appears 1 time.
Word 336 ("fantast") appears 1 time.
Word 353 ("light") appears 1 time.
Word 1155 ("use") appears 1 time.
Word 1725 ("transfer") appears 1 time.
Word 1962 ("video") appears 1 time.
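The bag-of-words conversion itself is simple enough to sketch by hand. The `to_bow` helper and the small `token2id` mapping below are hypothetical; the sketch assumes that tokens missing from the dictionary are ignored, as happens here with the filtered-out token "degrad".

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"film": 0, "color": 1, "light": 2}
print(to_bow(["film", "color", "film", "degrad"], token2id))
# [(0, 2), (1, 1)]
```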


In [25]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [28]:
corpus_tfidf = tfidf[bow_corpus]

In [27]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.21246634848742238),
 (1, 0.20980660150162397),
 (2, 0.3624850812950978),
 (3, 0.2834258265657344),
 (4, 0.35835645432120256),
 (5, 0.23281017415313882),
 (6, 0.06588996776146727),
 (7, 0.26458149213053767),
 (8, 0.2238836624823618),
 (9, 0.20064171335504224),
 (10, 0.1883740644116179),
 (11, 0.10009772720609417),
 (12, 0.2182159681867206),
 (13, 0.23281017415313882),
 (14, 0.17485457201825594),
 (15, 0.269477578635442),
 (16, 0.28190706672243887),
 (17, 0.1475832865099792)]
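The scores above are TF-IDF weights: words that are frequent in a document but rare across the corpus score highest. A minimal sketch of one common weighting scheme follows, assuming raw counts as term frequency, a base-2 logarithm for the inverse document frequency, and L2 normalization of each document vector. The `tfidf` function is hypothetical, and `TfidfModel` may differ in details such as dropping near-zero weights.

```python
import math
from collections import Counter

def tfidf(bow_docs):
    """Weight each (token id, count) pair by log2(N / df), then L2-normalize per document."""
    n_docs = len(bow_docs)
    doc_freq = Counter(token_id for doc in bow_docs for token_id, _ in doc)
    weighted = []
    for doc in bow_docs:
        vec = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in vec))
        # A document made only of tokens found in every document has zero norm.
        weighted.append([(tid, w / norm) for tid, w in vec] if norm > 0 else [])
    return weighted

toy_bow = [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
for doc in tfidf(toy_bow):
    print(doc)
```

Token 0 appears in both toy documents, so its weight is zero; the remaining token in each document carries all of the weight.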


In [31]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [32]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.040*"like" + 0.029*"movi" + 0.020*"know" + 0.011*"go" + 0.009*"want" + 0.009*"film" + 0.007*"look" + 0.007*"video" + 0.007*"review" + 0.007*"watch"
Topic: 1 
Words: 0.046*"film" + 0.015*"movi" + 0.011*"comedi" + 0.008*"littl" + 0.007*"take" + 0.007*"mayb" + 0.007*"high" + 0.006*"year" + 0.006*"star" + 0.006*"rat"
Topic: 2 
Words: 0.035*"film" + 0.021*"scene" + 0.012*"minut" + 0.012*"work" + 0.011*"movi" + 0.011*"final" + 0.010*"obvious" + 0.008*"thing" + 0.007*"hour" + 0.007*"laugh"
Topic: 3 
Words: 0.010*"play" + 0.008*"charact" + 0.008*"film" + 0.008*"come" + 0.007*"king" + 0.007*"want" + 0.006*"peopl" + 0.006*"get" + 0.006*"fall" + 0.006*"wait"
Topic: 4 
Words: 0.015*"film" + 0.009*"like" + 0.009*"love" + 0.008*"play" + 0.008*"kill" + 0.007*"movi" + 0.007*"charact" + 0.007*"plot" + 0.006*"tri" + 0.005*"friend"
Topic: 5 
Words: 0.010*"like" + 0.010*"life" + 0.008*"movi" + 0.008*"long" + 0.007*"live" + 0.007*"feel" + 0.006*"shoot" + 0.006*"real" + 0.006*"world" + 0.
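Each topic is printed as a weighted sum of its top words, where a weight is the probability of that word under the topic. If the strings need to be processed further, they can be split into (word, weight) pairs. The `parse_topic` helper below is a hypothetical convenience that assumes the `weight*"word"` format shown above; gensim's `show_topic` method returns such pairs directly.

```python
import re

def parse_topic(topic_str):
    """Split a printed topic such as '0.040*"like" + 0.029*"movi"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

print(parse_topic('0.040*"like" + 0.029*"movi" + 0.020*"know"'))
# [('like', 0.04), ('movi', 0.029), ('know', 0.02)]
```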

In [33]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [35]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.012*"film" + 0.008*"movi" + 0.007*"work" + 0.005*"say" + 0.005*"care" + 0.005*"charact" + 0.005*"better" + 0.005*"scene" + 0.004*"time" + 0.004*"like"
Topic: 1 Word: 0.007*"film" + 0.007*"think" + 0.006*"movi" + 0.004*"laugh" + 0.004*"like" + 0.004*"charact" + 0.004*"play" + 0.003*"come" + 0.003*"okay" + 0.003*"scene"
Topic: 2 Word: 0.009*"film" + 0.006*"year" + 0.006*"movi" + 0.005*"bore" + 0.004*"wait" + 0.004*"like" + 0.004*"begin" + 0.004*"reason" + 0.003*"suppos" + 0.003*"godzilla"
Topic: 3 Word: 0.008*"film" + 0.008*"movi" + 0.007*"effect" + 0.006*"special" + 0.005*"like" + 0.005*"star" + 0.005*"look" + 0.004*"miss" + 0.004*"rat" + 0.004*"charact"
Topic: 4 Word: 0.008*"like" + 0.008*"film" + 0.007*"movi" + 0.005*"act" + 0.005*"thing" + 0.005*"go" + 0.005*"stori" + 0.005*"charact" + 0.005*"entertain" + 0.004*"tell"
Topic: 5 Word: 0.009*"know" + 0.008*"film" + 0.006*"movi" + 0.004*"wors" + 0.004*"charact" + 0.004*"like" + 0.004*"scene" + 0.004*"kill" + 0.004*"come"

## Classification of the Topics

In [36]:
processed_docs[4310]

['film', 'use', 'color', 'light', 'fantast', 'degrad', 'transfer', 'video']

In [37]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6273005604743958	 
Topic: 0.046*"film" + 0.015*"movi" + 0.011*"comedi" + 0.008*"littl" + 0.007*"take" + 0.007*"mayb" + 0.007*"high" + 0.006*"year" + 0.006*"star" + 0.006*"rat"

Score: 0.27261605858802795	 
Topic: 0.040*"like" + 0.029*"movi" + 0.020*"know" + 0.011*"go" + 0.009*"want" + 0.009*"film" + 0.007*"look" + 0.007*"video" + 0.007*"review" + 0.007*"watch"

Score: 0.01251349039375782	 
Topic: 0.010*"play" + 0.008*"charact" + 0.008*"film" + 0.008*"come" + 0.007*"king" + 0.007*"want" + 0.006*"peopl" + 0.006*"get" + 0.006*"fall" + 0.006*"wait"

Score: 0.01251315139234066	 
Topic: 0.042*"film" + 0.017*"movi" + 0.015*"effect" + 0.015*"like" + 0.014*"look" + 0.013*"action" + 0.011*"pictur" + 0.009*"special" + 0.009*"director" + 0.008*"good"

Score: 0.012510421685874462	 
Topic: 0.010*"like" + 0.010*"life" + 0.008*"movi" + 0.008*"long" + 0.007*"live" + 0.007*"feel" + 0.006*"shoot" + 0.006*"real" + 0.006*"world" + 0.005*"stori"

Score: 0.012510191649198532	 
Topic: 0.035*"film" +

In [38]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.8874301910400391	 
Topic: 0.009*"film" + 0.006*"year" + 0.006*"movi" + 0.005*"bore" + 0.004*"wait" + 0.004*"like" + 0.004*"begin" + 0.004*"reason" + 0.003*"suppos" + 0.003*"godzilla"

Score: 0.012510078027844429	 
Topic: 0.007*"film" + 0.007*"think" + 0.006*"movi" + 0.004*"laugh" + 0.004*"like" + 0.004*"charact" + 0.004*"play" + 0.003*"come" + 0.003*"okay" + 0.003*"scene"

Score: 0.012508516199886799	 
Topic: 0.011*"good" + 0.009*"film" + 0.008*"movi" + 0.006*"funni" + 0.005*"pretti" + 0.005*"mean" + 0.004*"think" + 0.004*"charact" + 0.004*"love" + 0.004*"time"

Score: 0.012508366256952286	 
Topic: 0.008*"film" + 0.008*"movi" + 0.007*"effect" + 0.006*"special" + 0.005*"like" + 0.005*"star" + 0.005*"look" + 0.004*"miss" + 0.004*"rat" + 0.004*"charact"

Score: 0.012508119456470013	 
Topic: 0.012*"film" + 0.008*"movi" + 0.007*"work" + 0.005*"say" + 0.005*"care" + 0.005*"charact" + 0.005*"better" + 0.005*"scene" + 0.004*"time" + 0.004*"like"

Score: 0.012507270090281963	 
Topic: 
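To classify a document by topic, the usual choice is the topic with the highest score. The sketch below shows that step on its own, assuming the model output is a list of (topic id, score) pairs like the ones sorted above. The `rank_topics` helper and the rounded scores are illustrative only.

```python
def rank_topics(topic_scores):
    """Sort (topic id, score) pairs from most to least probable."""
    return sorted(topic_scores, key=lambda pair: pair[1], reverse=True)

# Illustrative scores, rounded; not actual model output.
scores = [(0, 0.27), (1, 0.63), (3, 0.01)]
ranked = rank_topics(scores)
best_topic, best_score = ranked[0]
print(best_topic, best_score)  # 1 0.63
```

When the top score is only slightly ahead of the runner-up, it may be more informative to keep the whole ranked list than to assign a single label.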

In [39]:
import numpy
numpy.version.version

'1.18.1'