<a href="https://colab.research.google.com/github/hawc2/Text-Analysis-with-Python/blob/master/notebooks/topic-modeling/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Topic Modeling with Gensim and pyLDAvis

This Colab Notebook guides you through using Python to create an interactive topic modeling visualization. It walks you through the steps of importing data and the necessary packages, cleaning and processing text data, creating a topic model, and visualizing the topics in an interactive, web-based application.

Note for Colab: some install/setup steps are commented out in the code. Uncomment those cells when running in Colab.

If you would like to do more advanced topic modeling, such as integrating Mallet, testing the coherence of the model, visualizing metrics, and examining topic distributions over a set of documents, Gensim provides a wide array of resources. I will separately upload a tutorial of advanced topic modeling strategies.

# Mount Drive

In [1]:
try:
    from google.colab import drive
    drive.mount('/gdrive')
    %cd /gdrive
except Exception:
    print("Colab drive mount skipped (not in Colab).")

Colab drive mount skipped (not in Colab).


# Upload Files

In [2]:
#from google.colab import files

#uploaded = files.upload()

#for fn in uploaded.keys():
#  print('User uploaded file "{name}" with length {length} bytes'.format(
#      name=fn, length=len(uploaded[fn])))

# Import CSV Data from Github

In [3]:
import os

local_rt = "../../Data/RottenTomatoes.csv"
RTdata = local_rt if os.path.exists(local_rt) else "https://raw.githubusercontent.com/hawc2/Text-Analysis-with-Python/master/Data/RottenTomatoes.csv"

In [4]:
#SFdata = 'https://raw.githubusercontent.com/hawc2/Text-Analysis-with-Python/master/Scifi.csv'

# Convert RottenTomatoes.csv to Data Frame

In [5]:
import numpy as np
import pandas as pd

In [6]:
df = pd.read_csv(RTdata, usecols=['Username', 'content'], encoding = 'utf-8')

In [7]:
data = df.content.values.tolist()

### View Dataframe

In [8]:
print(df)

                                                content     Username
0     I totally misheard and thought this was going ...       Matt D
1     Great movie that shares a very wide range of e...      Marks V
2     Despite a minimal narrative arc, the film does...      Jared D
3     Kathryn Bigelow's The Hurt Locker is something...      Brett C
4                           Worth the best picture win.  Christian H
...                                                 ...          ...
1007  Intense look into the lives of an American ord...  Westleigh Q
1008  if you've heard anything about this movie you'...      Tyson P
1009  O.V.R 8....can't believe I've waited this long...       Bryn D
1010  Explosive, dominating, an emotional firestorm,...      Aaron J
1011  The Hurt Locker is easily the best "dramatizat...       Joel D

[1012 rows x 2 columns]


In [9]:
try:
    from google.colab import data_table
    get_ipython().run_line_magic("load_ext", "google.colab.data_table")
except Exception:
    pass
df

Unnamed: 0,content,Username
0,I totally misheard and thought this was going ...,Matt D
1,Great movie that shares a very wide range of e...,Marks V
2,"Despite a minimal narrative arc, the film does...",Jared D
3,Kathryn Bigelow's The Hurt Locker is something...,Brett C
4,Worth the best picture win.,Christian H
...,...,...
1007,Intense look into the lives of an American ord...,Westleigh Q
1008,if you've heard anything about this movie you'...,Tyson P
1009,O.V.R 8....can't believe I've waited this long...,Bryn D
1010,"Explosive, dominating, an emotional firestorm,...",Aaron J


# Convert Scifi.CSV to Data Frame

In [10]:
#dfSF = pd.read_csv(SFdata, usecols=['BookChapter', 'text'], encoding = 'utf-8')

In [11]:
#dfSF['text']=dfSF['text'].apply(str)

In [12]:
#dataSF = dfSF.text.values.tolist()

# Clean Texts

In [13]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to /Users/alexwermer-
[nltk_data]     colan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
# A simple way to add further stop words
#stop_words.append('movie')

In [15]:
!pip3 install spacy
!python -m spacy download en_core_web_lg

Defaulting to user installation because normal site-packages is not writeable


zsh:1: command not found: python


In [16]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()



In [17]:
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

In [18]:
import re

In [19]:
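# Remove email-like tokens, collapse runs of whitespace, and strip apostrophes.
# Raw strings keep Python from treating \S and \s as string escape sequences.
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub(r'\s+', ' ', sent) for sent in data]
data = [re.sub(r"'", "", sent) for sent in data]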
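
To see what these three substitutions do, here is a minimal sketch that applies them to a single made-up string. The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the notebook's pipeline.

```python
import re


def clean_text(text):
    """Apply the same three substitutions as the cleaning cell to one string."""
    text = re.sub(r"\S*@\S*\s?", "", text)  # drop any token containing "@"
    text = re.sub(r"\s+", " ", text)        # collapse whitespace runs to one space
    text = re.sub(r"'", "", text)           # remove apostrophes
    return text


# Hypothetical review text, invented for this example
sample = "Email me at fan@example.com!\n\nIt's   the director's best film."
print(clean_text(sample))
# → Email me at Its the directors best film.
```

Note that the first pattern removes the whole whitespace-delimited token around the `@`, including the trailing `!`.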

In [20]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, which drops punctuation;
        # deacc=True additionally strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
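
For intuition about what this tokenization step produces, the following is a rough standard-library approximation, not Gensim's own code. It assumes the behavior of `simple_preprocess` is roughly: lowercase, optionally strip accents, keep runs of non-digit word characters, and discard tokens shorter than 2 or longer than 15 characters. The names `strip_accents` and `simple_tokenize` are hypothetical.

```python
import re
import unicodedata

# Runs of word characters that are not digits (an assumption about the tokenizer)
TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def strip_accents(text):
    """Remove combining accent marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def simple_tokenize(text, deacc=True, min_len=2, max_len=15):
    """Lowercase, optionally de-accent, and keep tokens within the length bounds."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    return [tok for tok in TOKEN_PATTERN.findall(text)
            if min_len <= len(tok) <= max_len]


print(simple_tokenize("Amélie is a 10/10 film, I loved it!"))
# → ['amelie', 'is', 'film', 'loved', 'it']
```

Numbers, punctuation, and one-letter words disappear, which is why short reviews can end up with very few tokens.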

In [21]:
print(data_words)



In [22]:
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
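
The `threshold` argument controls how strongly two words must co-occur before they are merged into a phrase: higher values produce fewer phrases. The sketch below is a simplified, plain-Python illustration of that kind of score, assuming the form `(pair_count - min_count) * vocab_size / (count_a * count_b)`. It counts only single words in `vocab_size`, so its numbers will not match Gensim's; the function name `bigram_scores` and the toy documents are made up for illustration.

```python
from collections import Counter


def bigram_scores(docs, min_count=1):
    """Score each adjacent word pair by how much more often it occurs than chance."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: single words only
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


# Toy tokenized documents, invented for this example
toy_docs = [
    ["hurt", "locker", "great"],
    ["hurt", "locker", "tense"],
    ["great", "film"],
    ["hurt", "locker", "film"],
]
scores = bigram_scores(toy_docs)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
# → ('hurt', 'locker') 1.111
```

Pairs seen only `min_count` times score zero, so with a small corpus and `threshold=100` very few pairs, if any, will be joined.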

In [23]:
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc))
            if word not in stop_words] for doc in texts]

def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]

#def make_trigrams(texts):
#   return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
     doc = nlp(" ".join(sent))
     texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out

In [24]:
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
   'NOUN', 'ADJ', 'VERB', 'ADV'
])

In [25]:
#print(data_lemmatized[:4])

# Building Dictionary and Corpus

In [26]:
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)], [(14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)], [(0, 1), (1, 1), (8, 1), (9, 5), (14, 2), (17, 23), (20, 3), (21, 1), (22, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 5), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 2), (53, 1), (54, 1), (55, 1), (56, 1), (57, 2), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 2), (72, 3), (73, 1), (74, 1), (75, 1), (76, 1), (77, 2), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 2), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 2), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102
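
Each document in the corpus above is a list of `(token_id, count)` pairs. The following plain-Python sketch shows the idea on two toy documents. It is an illustration of the bag-of-words format, not Gensim's implementation: the helper names `build_vocabulary` and `to_bow` are hypothetical, and ids here are assigned in first-seen order, which may differ from the ids `corpora.Dictionary` assigns.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in text if tok in token2id)
    return sorted(counts.items())


# Toy lemmatized documents, invented for this example
toy_texts = [["film", "great", "film"], ["great", "war", "story"]]
token2id = build_vocabulary(toy_texts)
toy_corpus = [to_bow(text, token2id) for text in toy_texts]
print(toy_corpus)
# → [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

Word order is discarded at this point: the topic model only sees which words occur in each document and how often.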

# Create Topic Model - 20 Topics

In [27]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20,
                                           random_state=100,
                                           update_every=2,
                                           chunksize=100,
                                           passes=20,
                                           alpha='auto',
                                           per_word_topics=True)

# Create Visualization (Save HTML)

The easiest way to create the visualization is to display it in the Google Colab notebook and save it as an HTML file that you can open in your browser.

In [28]:
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models

Defaulting to user installation because normal site-packages is not writeable


In [29]:
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
#vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds='mmds')

In [30]:
pyLDAvis.save_html(vis, 'LDAviz.html')

In [31]:
pyLDAvis.display(vis)

# Create Topic Model - 60 Topics

In [32]:
lda_model60 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=60,
                                           random_state=100,
                                           update_every=2,
                                           chunksize=100,
                                           passes=20,
                                           iterations=200,
                                           alpha='auto',
                                           per_word_topics=True)

# Create Visualization (Save HTML)

As with the 20-topic model, display the visualization in the notebook and save it as an HTML file that you can open in your browser.

In [33]:
vis60 = pyLDAvis.gensim_models.prepare(lda_model60, corpus, id2word)
#vis60 = pyLDAvis.gensim_models.prepare(lda_model60, corpus, id2word, mds='mmds')

In [34]:
import sys

output_path = "/content/LDAviz60.html" if "google.colab" in sys.modules else "LDAviz60.html"
pyLDAvis.save_html(vis60, output_path)


In [35]:
pyLDAvis.display(vis60)

# Serve Visualization in Browser

You can also serve the visualization locally in the browser using the code below. Browser caching and extensions such as ad blockers can interfere, so some debugging may be needed to get this working on your machine.

In [36]:
#pyLDAvis.enable_notebook()
#pyLDAvis.show(vis)
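
If `pyLDAvis.show` gives you trouble, one alternative is to serve the already saved HTML file with Python's built-in web server. This is a different technique from the cell above, sketched under the assumption that `LDAviz.html` was saved in the current directory; the helper name `make_viz_server` and the port number are illustrative.

```python
import functools
import http.server


def make_viz_server(directory=".", port=0):
    """Create a local-only HTTP server for the files in `directory`.

    port=0 asks the operating system to pick any free port.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until you interrupt the kernel or press Ctrl+C):
# server = make_viz_server(".", port=8000)
# print("Open http://127.0.0.1:8000/LDAviz.html")
# server.serve_forever()
```

Binding to `127.0.0.1` keeps the server reachable only from your own machine. This works for a local Jupyter session; in Colab, downloading the saved HTML file is likely simpler.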