## Latent Dirichlet Allocation (LDA) for Topic Modeling

LDA is an unsupervised method for extracting topics from unstructured text. It is particularly useful for estimating the mixture of topics within each document of a collection.

It fits two sets of distributions: a distribution over topics for each document and a distribution over words for each topic. Both are given Dirichlet priors.
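
To make the two kinds of distribution concrete, here is a minimal sketch of the generative story that LDA assumes, using only the standard library. The toy vocabulary, the number of topics, and the concentration value 0.5 are made-up assumptions for illustration; the code samples a synthetic document and does not fit a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, topic_word, alpha, n_words, rng):
    """Generate one document following the LDA generative process."""
    # Per-document topic mixture (the topics-per-document distribution)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["war", "battle", "family", "love", "murder", "detective"]  # toy vocabulary
num_topics = 3  # hypothetical number of topics
# Per-topic word distributions (the words-per-topic distribution)
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(vocab, topic_word, [0.5] * num_topics, 8, rng)
print([round(p, 3) for p in theta], words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that could plausibly have produced them.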

In this notebook, we apply LDA to movie descriptions and inspect the resulting topic representations.

Tutorial references used in this notebook:

https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

https://www.linkedin.com/pulse/nlp-a-complete-guide-topic-modeling-latent-dirichlet-sahil-m

https://medium.datadriveninvestor.com/trump-tweets-topic-modeling-using-latent-dirichlet-allocation-e4f93b90b6fe

In [7]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:65% !important; }</style>"))

In [8]:
# Imports
import nltk

import numpy as np
import pandas as pd

# Visualize (pyLDAvis 3.3 and later rename pyLDAvis.gensim to pyLDAvis.gensim_models)
import pyLDAvis
import pyLDAvis.gensim

# For text pre-processing
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


# Gensim
import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models.coherencemodel import CoherenceModel


In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/heena.otia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/heena.otia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
## Read the movie genres dataset 
# https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
df = pd.read_csv('movie_genres.csv',encoding = "ISO-8859-1", engine='python')
df.head()

,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [12]:
df.shape

(1000, 16)

In [13]:
df = df[['Series_Title','Genre','Overview']]
df.columns = ['title','genre','overview']
df.head()

,title,genre,overview
0,The Shawshank Redemption,Drama,Two imprisoned men bond over a number of years...
1,The Godfather,"Crime, Drama",An organized crime dynasty's aging patriarch t...
2,The Dark Knight,"Action, Crime, Drama",When the menace known as the Joker wreaks havo...
3,The Godfather: Part II,"Crime, Drama",The early life and career of Vito Corleone in ...
4,12 Angry Men,"Crime, Drama",A jury holdout attempts to prevent a miscarria...


In [14]:
# nltk.download('omw-1.4')

### Clean text overview

In [15]:
## Clean the movie overview text
def text_preprocess(sentence):
    # Convert the input to a lowercase string
    sentence = str(sentence).lower()

    # Tokenize on runs of word characters, which drops punctuation
    tokenizer = nltk.RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)

    # Remove stopwords and tokens shorter than 3 characters
    stop_words = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]

    # Stemming the words (e.g. studies => studi)
    stemmer = PorterStemmer()
    stem_words = [stemmer.stem(w) for w in filtered_words]

    # Lemmatize (e.g. studies => study)
    lemmatizer = WordNetLemmatizer()
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]

    # NOTE: the stemmed and lemmatized tokens are computed but not returned;
    # the outputs in this notebook are based on filtered_words only.
    return " ".join(filtered_words)


In [16]:
df['clean_overview'] = df['overview'].apply(text_preprocess)
df.head()

,title,genre,overview,clean_overview
0,The Shawshank Redemption,Drama,Two imprisoned men bond over a number of years...,two imprisoned men bond number years finding s...
1,The Godfather,"Crime, Drama",An organized crime dynasty's aging patriarch t...,organized crime dynasty aging patriarch transf...
2,The Dark Knight,"Action, Crime, Drama",When the menace known as the Joker wreaks havo...,menace known joker wreaks havoc chaos people g...
3,The Godfather: Part II,"Crime, Drama",The early life and career of Vito Corleone in ...,early life career vito corleone 1920s new york...
4,12 Angry Men,"Crime, Drama",A jury holdout attempts to prevent a miscarria...,jury holdout attempts prevent miscarriage just...


In [17]:
df.overview[0]

'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'

In [18]:
df.clean_overview[0]

'two imprisoned men bond number years finding solace eventual redemption acts common decency'

In [19]:
# Copy the column so that adding 'index' does not trigger a SettingWithCopyWarning
data_text = df[['clean_overview']].copy()
data_text['index'] = data_text.index
documents = data_text
documents.head(5)


,clean_overview,index
0,two imprisoned men bond number years finding s...,0
1,organized crime dynasty aging patriarch transf...,1
2,menace known joker wreaks havoc chaos people g...,2
3,early life career vito corleone 1920s new york...,3
4,jury holdout attempts prevent miscarriage just...,4


In [20]:
df['clean_overview_list'] = df['clean_overview'].str.split(' ')
df.head(5)

,title,genre,overview,clean_overview,clean_overview_list
0,The Shawshank Redemption,Drama,Two imprisoned men bond over a number of years...,two imprisoned men bond number years finding s...,"[two, imprisoned, men, bond, number, years, fi..."
1,The Godfather,"Crime, Drama",An organized crime dynasty's aging patriarch t...,organized crime dynasty aging patriarch transf...,"[organized, crime, dynasty, aging, patriarch, ..."
2,The Dark Knight,"Action, Crime, Drama",When the menace known as the Joker wreaks havo...,menace known joker wreaks havoc chaos people g...,"[menace, known, joker, wreaks, havoc, chaos, p..."
3,The Godfather: Part II,"Crime, Drama",The early life and career of Vito Corleone in ...,early life career vito corleone 1920s new york...,"[early, life, career, vito, corleone, 1920s, n..."
4,12 Angry Men,"Crime, Drama",A jury holdout attempts to prevent a miscarria...,jury holdout attempts prevent miscarriage just...,"[jury, holdout, attempts, prevent, miscarriage..."


- Dictionary() traverses the tokenized texts, assigning a unique integer id to each unique token while also collecting word counts and related statistics. To see each token’s integer id, try print(dictionary.token2id).
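
As a rough illustration of the id assignment, here is a pure-Python sketch. It is not gensim's implementation: the helper name build_token2id is hypothetical, and sorting each document's new tokens alphabetically is an assumption chosen because it reproduces the ids printed in the next cell. The second document is made up.

```python
def build_token2id(documents):
    """Assign consecutive integer ids to tokens, document by document,
    taking the new tokens of each document in alphabetical order."""
    token2id = {}
    for doc in documents:
        new_tokens = [t for t in sorted(set(doc)) if t not in token2id]
        for token in new_tokens:
            token2id[token] = len(token2id)
    return token2id

docs = [
    # clean_overview[0] from this notebook
    "two imprisoned men bond number years finding solace eventual "
    "redemption acts common decency".split(),
    # hypothetical second document
    "two men war".split(),
]
token2id = build_token2id(docs)
for token, token_id in sorted(token2id.items(), key=lambda kv: kv[1]):
    print(token_id, token)
```

Tokens already seen in an earlier document keep their id, so only "war" receives a new id from the second document.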

In [21]:
dictionary = gensim.corpora.Dictionary(df.clean_overview_list)
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break


0 acts
1 bond
2 common
3 decency
4 eventual
5 finding
6 imprisoned
7 men
8 number
9 redemption
10 solace


- filter_extremes() removes tokens that appear in
  - fewer than 15 documents (no_below, an absolute number), or
  - more than 50% of documents (no_above=0.5, a fraction of the total corpus size, not an absolute number).
- After these two filters, it keeps only the 100000 most frequent remaining tokens (keep_n).
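
The filtering rule can be sketched in plain Python as follows. This is an illustrative approximation, not gensim's code: the function name filter_by_document_frequency is hypothetical, the toy documents and thresholds are made up, and ranking the survivors by document frequency for keep_n is an assumption.

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below, no_above, keep_n):
    """Return the set of tokens that survive a document-frequency filter."""
    num_docs = len(documents)
    # Count in how many documents each token occurs (not total occurrences)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    kept = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Keep at most keep_n tokens, preferring higher document frequency
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return set(kept[:keep_n])

docs = [
    ["war", "world", "army"],
    ["war", "family", "love"],
    ["war", "world", "love"],
    ["war", "murder", "detective"],
]
# Toy thresholds: at least 2 documents, at most 50% of documents
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5, keep_n=10)
print(sorted(kept))  # ['love', 'world']
```

Here "war" is dropped because it occurs in every document, and the tokens that occur in only one document are dropped as too rare.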

In [22]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

- doc2bow() converts each document into a bag of words, a list of (token id, count) pairs. The result is stored in bow_corpus.

In [23]:
bow_corpus = [dictionary.doc2bow(doc) for doc in df.clean_overview_list]
bow_corpus[0]

[(0, 1), (1, 1), (2, 1)]
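
A minimal sketch of the bag-of-words conversion, written with the standard library rather than gensim. The helper name to_bag_of_words and the three-word dictionary are hypothetical. Tokens that are not in the dictionary are skipped, which is presumably why bow_corpus[0] above has only three entries after filtering.

```python
from collections import Counter

def to_bag_of_words(tokens, token2id):
    """Map tokens to sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"army": 0, "war": 1, "world": 2}  # hypothetical dictionary
bow = to_bag_of_words(["world", "war", "world", "officer"], token2id)
print(bow)  # [(1, 1), (2, 2)]
```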

- Previewing bow_corpus

In [24]:
bow_doc_150 = bow_corpus[150]
for token_id, token_count in bow_doc_150:
    print('Word {} ("{}") appears {} time.'.format(token_id,
                                                   dictionary[token_id],
                                                   token_count))

Word 14 ("army") appears 1 time.
Word 15 ("world") appears 1 time.
Word 51 ("young") appears 1 time.
Word 73 ("officer") appears 1 time.


- TF-IDF: reweight the bag-of-words counts so that tokens occurring in many documents count for less.

In [25]:
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

for doc in corpus_tfidf:
    print(doc)
    break

[(0, 0.676919229672093), (1, 0.42623776353687154), (2, 0.6000847652084048)]
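
The reweighting can be sketched by hand. The version below assumes the weight is the term count times log2(N / df), followed by L2 normalization, which to our understanding matches gensim's default settings; treat that as an assumption. The document frequencies and corpus size are made up. The vector printed above has unit length (its squared weights sum to roughly 1), which is consistent with that normalization.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalize."""
    weighted = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in bow
    ]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(token_id, w / norm) for token_id, w in weighted]

doc_freq = {0: 2, 1: 4, 2: 8}  # hypothetical document frequencies
vec = tfidf_vector([(0, 1), (1, 1), (2, 1)], doc_freq, num_docs=16)
print(vec)
```

All three tokens occur once in the document, yet the rarest one (token 0) receives the largest weight.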


#### Running LDA using Bag of Words (bow_corpus)

Parameters:

num_topics: the number of topics to generate. LDA does not infer this value, so the user has to choose it (gensim falls back to 100 if it is omitted).

id2word: the dictionary built earlier, which maps token ids back to strings. Without it, topics are reported by token id rather than by word.

passes: optional. The number of full passes the model makes through the corpus during training.
More passes generally give the model more opportunity to converge, but many passes can be slow on a very large corpus.
    

In [26]:
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10,
                                       id2word=dictionary,
                                       passes=2,
                                       workers=2)

In [27]:
for i,topic in lda_model.show_topics(formatted=True, num_topics=10, num_words=10):
    print(str(i)+": "+ topic)
    print()

0: 0.120*"war" + 0.086*"world" + 0.059*"three" + 0.042*"german" + 0.034*"two" + 0.031*"men" + 0.028*"battle" + 0.026*"school" + 0.025*"high" + 0.025*"french"

1: 0.076*"young" + 0.070*"search" + 0.059*"lives" + 0.058*"new" + 0.042*"school" + 0.040*"wife" + 0.040*"woman" + 0.035*"high" + 0.032*"future" + 0.024*"forced"

2: 0.079*"young" + 0.054*"man" + 0.045*"family" + 0.043*"love" + 0.042*"year" + 0.036*"two" + 0.034*"old" + 0.033*"life" + 0.027*"takes" + 0.023*"must"

3: 0.054*"world" + 0.046*"old" + 0.044*"young" + 0.043*"american" + 0.041*"daughter" + 0.041*"help" + 0.035*"man" + 0.031*"girl" + 0.029*"army" + 0.029*"mysterious"

4: 0.109*"two" + 0.087*"life" + 0.060*"story" + 0.045*"man" + 0.032*"live" + 0.023*"lawyer" + 0.022*"family" + 0.022*"young" + 0.022*"school" + 0.021*"years"

5: 0.050*"friends" + 0.048*"murder" + 0.047*"woman" + 0.046*"man" + 0.043*"years" + 0.042*"time" + 0.033*"set" + 0.033*"former" + 0.032*"world" + 0.028*"wife"

6: 0.099*"life" + 0.068*"family" + 0.065*

- Each weight is the probability the topic assigns to that keyword, so larger weights mark the words most characteristic of the topic.
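
If the weights are needed as numbers, for example to plot them, the formatted topic strings can be parsed with a regular expression. This is a small convenience sketch; the helper name parse_topic is hypothetical, and the sample string is the start of topic 0 above.

```python
import re

def parse_topic(topic_string):
    """Parse '0.120*"war" + 0.086*"world"' into [(word, weight), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic = '0.120*"war" + 0.086*"world" + 0.059*"three"'
terms = parse_topic(topic)
print(terms)  # [('war', 0.12), ('world', 0.086), ('three', 0.059)]
```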

#### Running LDA using TF-IDF

In [28]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                               num_topics=10,
                                               id2word=dictionary,
                                               passes=2,
                                               workers=4)

for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.079*"mysterious" + 0.075*"father" + 0.038*"finds" + 0.030*"life" + 0.029*"two" + 0.027*"friend" + 0.027*"living" + 0.024*"people" + 0.024*"battle" + 0.024*"sent"
Topic: 1 Word: 0.057*"team" + 0.052*"daughter" + 0.041*"years" + 0.033*"help" + 0.032*"search" + 0.031*"man" + 0.031*"young" + 0.029*"love" + 0.028*"family" + 0.027*"world"
Topic: 2 Word: 0.074*"war" + 0.052*"battle" + 0.051*"world" + 0.050*"day" + 0.047*"forced" + 0.040*"find" + 0.034*"german" + 0.029*"family" + 0.028*"young" + 0.028*"set"
Topic: 3 Word: 0.123*"life" + 0.066*"become" + 0.044*"son" + 0.036*"death" + 0.033*"mother" + 0.033*"must" + 0.032*"town" + 0.028*"small" + 0.027*"two" + 0.027*"detective"
Topic: 4 Word: 0.063*"one" + 0.046*"former" + 0.040*"story" + 0.040*"love" + 0.038*"struggles" + 0.038*"men" + 0.038*"violent" + 0.037*"boy" + 0.034*"life" + 0.031*"three"
Topic: 5 Word: 0.089*"two" + 0.062*"year" + 0.059*"old" + 0.058*"friends" + 0.047*"new" + 0.035*"leads" + 0.032*"city" + 0.030*"crime"

### Visualize the topic-keywords

In [51]:
vis = pyLDAvis.gensim.prepare(topic_model=lda_model_tfidf,
                              corpus=bow_corpus,
                              dictionary=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

#### Compute model coherence

In [29]:
coherence_model_lda = CoherenceModel(model = lda_model_tfidf,
                                     texts = df.clean_overview_list,
                                     dictionary = dictionary,
                                     coherence = 'c_v'
                                    )

coherence_test = coherence_model_lda.get_coherence()
print("Coherence score:", coherence_test)

Coherence score: 0.33919666307203766


### Testing model

In [30]:
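# NOTE: unlike text_preprocess above, this pipeline stems and lemmatizes tokens
# and drops tokens of 3 characters or fewer, so some tokens may not match the
# unstemmed entries in the dictionary.
def lemmatize_stemming(text):
    # Lemmatize as a verb, then reduce to the Porter stem
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop gensim stopwords and short tokens
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result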

In [31]:
unseen_document = ('Three years after the demise of Jurassic World a volcanic eruption threatens the remaining dinosaurs on '
                   'the isla Nublar so Claire Dearing the former park manager recruits Owen Grady to help prevent the extinction '
                   'of the dinosaurs once again.')

bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: tup[1], reverse=True):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.7749472856521606	 Topic: 0.054*"world" + 0.046*"old" + 0.044*"young" + 0.043*"american" + 0.041*"daughter"
Score: 0.025018205866217613	 Topic: 0.079*"young" + 0.054*"man" + 0.045*"family" + 0.043*"love" + 0.042*"year"
Score: 0.02500728704035282	 Topic: 0.050*"friends" + 0.048*"murder" + 0.047*"woman" + 0.046*"man" + 0.043*"years"
Score: 0.025007063522934914	 Topic: 0.076*"young" + 0.070*"search" + 0.059*"lives" + 0.058*"new" + 0.042*"school"
Score: 0.025005577132105827	 Topic: 0.120*"war" + 0.086*"world" + 0.059*"three" + 0.042*"german" + 0.034*"two"
Score: 0.0250051561743021	 Topic: 0.097*"man" + 0.079*"find" + 0.058*"war" + 0.046*"one" + 0.043*"journey"
Score: 0.025003090500831604	 Topic: 0.081*"young" + 0.071*"woman" + 0.066*"team" + 0.049*"must" + 0.043*"story"
Score: 0.025002775713801384	 Topic: 0.099*"life" + 0.068*"family" + 0.065*"one" + 0.056*"new" + 0.056*"son"
Score: 0.025002039968967438	 Topic: 0.079*"town" + 0.059*"finds" + 0.055*"small" + 0.041*"man" + 0.039*"bec