<a href="https://colab.research.google.com/github/hawc2/Text-Analysis-with-Python/blob/master/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Topic Modeling with Gensim and pyLDAvis

This Colab Notebook guides you through using Python to create an interactive topic modeling visualization. It walks you through the steps of importing data and the necessary packages, cleaning and processing text data, creating a topic model, and visualizing the topics in an interactive, web-based application.


If you would like to do more advanced topic modeling, Gensim provides a wide array of resources for integrating Mallet, testing the coherence of a model, visualizing metrics, and examining topic distributions over a set of documents. I will separately upload a tutorial of advanced topic modeling strategies.

# Mount Drive

In [1]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


# Upload Files

In [2]:
#from google.colab import files

#uploaded = files.upload()

#for fn in uploaded.keys():
#  print('User uploaded file "{name}" with length {length} bytes'.format(
#      name=fn, length=len(uploaded[fn])))

# Import CSV Data from Github

In [3]:
RTdata = 'https://raw.githubusercontent.com/hawc2/Text-Analysis-with-Python/master/RottenTomatoes.csv'

In [4]:
#SFdata = 'https://raw.githubusercontent.com/hawc2/Text-Analysis-with-Python/master/Scifi.csv'

# Convert RottenTomatoes.csv to Data Frame

In [4]:
import numpy as np
import pandas as pd

In [5]:
df = pd.read_csv(RTdata, usecols=['Username', 'content'], encoding = 'utf-8')

In [6]:
data = df.content.values.tolist()

### View Dataframe

In [None]:
df

In [None]:
%load_ext google.colab.data_table 
df

# Convert Scifi.csv to Data Frame

In [5]:
#dfSF = pd.read_csv(SFdata, usecols=['BookChapter', 'text'], encoding = 'utf-8')

In [20]:
#dfSF['text']=dfSF['text'].apply(str)

In [21]:
#dataSF = dfSF.text.values.tolist()

# Clean Texts

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
!pip3 install spacy
!python -m spacy download en_core_web_lg

In [11]:
import spacy
# The large model (en_core_web_lg) is loaded in the next cell, so the small model is not imported here
#from spacy.lang.en import English
#parser = English()
#nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

In [12]:
import en_core_web_lg
nlp = en_core_web_lg.load()

In [13]:
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

In [14]:
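import re
# Remove email addresses (any token containing '@')
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Collapse runs of whitespace, including newlines, into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove apostrophes so contractions become single tokens (e.g. "don't" -> "dont")
data = [re.sub(r"'", "", sent) for sent in data]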
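To see what these three substitutions do, here is a minimal sketch that applies the same patterns to a single made-up review. The `clean_text` helper and the sample string are illustrative assumptions and not part of the notebook's pipeline.

```python
import re

def clean_text(text):
    """Apply the notebook's three regex substitutions to one string."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop tokens containing '@'
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace and newlines
    text = re.sub(r"'", '', text)           # drop apostrophes
    return text

# Hypothetical review text, invented for illustration
sample = "Email me at critic@example.com  about\nthis film, it's great"
print(clean_text(sample))
```

Running the helper on a few of your own rows is a quick way to confirm that the patterns remove what you expect and nothing more.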

In [15]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, dropping punctuation; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
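If you are curious what this tokenization step amounts to, the sketch below is a rough standard-library approximation. It assumes the behavior described in Gensim's documentation for `simple_preprocess` (lowercasing, keeping alphabetic tokens between 2 and 15 characters long, and optionally stripping accents). The function name `simple_tokenize` is hypothetical, and the result may differ from Gensim's in edge cases.

```python
import re
import unicodedata

def simple_tokenize(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess(text, deacc=True)."""
    # Decompose accented characters, then drop the combining accent marks
    decomposed = unicodedata.normalize('NFD', text)
    no_accents = ''.join(ch for ch in decomposed
                         if unicodedata.category(ch) != 'Mn')
    # Keep runs of letters only, so digits and punctuation act as separators
    tokens = re.findall(r'[^\W\d_]+', no_accents.lower())
    # Discard tokens that are too short or too long
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

# Hypothetical sentence, invented for illustration
print(simple_tokenize("Amélie is a 5-star café film!"))
```

Note that the one-letter word "a" and the digit "5" disappear, while "Amélie" and "café" lose their accents.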

In [16]:
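# Learn which adjacent words co-occur often enough to be treated as phrases; a higher threshold yields fewer phrases
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Wrap each trained model in a lighter object that is faster to apply
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)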
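To build some intuition for `min_count` and `threshold`, here is a simplified sketch of the default collocation score described in Gensim's `Phrases` documentation: `(bigram_count - min_count) * vocab_size / (count_a * count_b)`. The `phrase_scores` function and the toy documents are my own illustrative assumptions. Gensim's internal bookkeeping differs in its details, so do not expect the numbers to match its output exactly.

```python
from collections import Counter

def phrase_scores(docs, min_count=1):
    """Score every adjacent word pair in a list of tokenized documents."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

# Toy corpus, invented for illustration
toy_docs = [
    ["science", "fiction", "film"],
    ["science", "fiction", "novel"],
    ["great", "film"],
]
scores = phrase_scores(toy_docs)
print(scores[("science", "fiction")])
```

A pair is merged into a single token only when its score exceeds `threshold`, which is why a pair that appears together repeatedly scores higher than one seen only once.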



In [17]:
def remove_stopwords(texts):
    # str(doc) turns each token list back into text, which simple_preprocess re-tokenizes
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    # Keep only the lemmas of tokens whose part of speech is in allowed_postags
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [18]:
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
   'NOUN', 'ADJ', 'VERB', 'ADV'
])

In [None]:
#print(data_lemmatized[:4])
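Besides printing a few documents, it can help to check how many tokens survive each cleaning stage. The helper below is a small sketch for that purpose. The name `stage_summary` and the toy data are assumptions made for illustration.

```python
def stage_summary(stages):
    """Return (stage name, total token count) pairs in the order given."""
    return [(name, sum(len(doc) for doc in docs))
            for name, docs in stages.items()]

# Toy stand-ins for data_words, data_words_nostops, and data_lemmatized
toy_stages = {
    "tokenized": [["the", "film", "was", "great"], ["a", "dull", "plot"]],
    "no stopwords": [["film", "great"], ["dull", "plot"]],
    "lemmatized": [["film", "great"], ["dull", "plot"]],
}
summary = stage_summary(toy_stages)
for name, count in summary:
    print(name, count)
```

In the notebook you could pass `{"tokenized": data_words, "no stopwords": data_words_nostops, "lemmatized": data_lemmatized}` instead. A sharp drop between stages may indicate that a filter is removing more than you intended.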

# Building Dictionary and Corpus

In [None]:
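# Map each unique token to an integer id
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
# Convert each document to a bag of words: a list of (token_id, count) pairs
corpus = [id2word.doc2bow(text) for text in texts]
# Preview the first document only; printing the whole corpus produces very long output
print(corpus[:1])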
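The printed corpus is a list of `(token_id, count)` pairs per document. The sketch below shows the same idea in plain Python so you can read that output more easily. `build_vocab` and `to_bow` are hypothetical helpers, and they assign ids in order of first appearance, which may not match the ids Gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(texts):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in texts:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy documents, invented for illustration
toy_texts = [["film", "great", "film"], ["great", "plot"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(doc, vocab) for doc in toy_texts]
print(toy_corpus)
```

Word order is discarded in this representation: only which tokens occur, and how often, is passed on to the topic model.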

# Create Topic Model

In [20]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20,
                                           random_state=100,
                                           update_every=2,
                                           chunksize=100,
                                           passes=20,
                                           alpha='auto',
                                           per_word_topics=True)
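Once the model is trained, `lda_model.get_document_topics(bow)` returns a document's topic mixture as a list of `(topic_id, probability)` pairs. The sketch below shows one simple way to pick the strongest topic from such a list. The `dominant_topic` helper and the numbers in `toy_mixture` are made up for illustration and are not output from this model.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Invented topic mixture for a single document
toy_mixture = [(0, 0.05), (3, 0.72), (11, 0.23)]
print(dominant_topic(toy_mixture))
```

In the notebook, `dominant_topic(lda_model.get_document_topics(corpus[0]))` would apply the same idea to the first review.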

# Create Visualization (Save HTML)

The easiest way to create the visualization is to display it in the Google Colab notebook and save it as an HTML file that you can open in your browser.

In [None]:
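!pip install pyLDAvis
import pyLDAvis
# Recent pyLDAvis releases name this submodule `gensim_models`; older ones call it `gensim`
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis

In [22]:
vis = gensimvis.prepare(lda_model, corpus, id2word)
#vis = gensimvis.prepare(lda_model, corpus, id2word, mds='mmds')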

In [None]:
pyLDAvis.save_html(vis, '/content/LDAviz.html')

In [23]:
pyLDAvis.display(vis)

# Serve Visualization in Browser

You can also serve the visualization locally in your browser using the code below. Browser caching and extensions such as ad blockers can interfere, so you may need to do some debugging to get this working on your machine.

In [None]:
#pyLDAvis.enable_notebook()
#pyLDAvis.show(vis)
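If `pyLDAvis.show` gives you trouble, one alternative on your own machine (not inside Colab, where the server would run on a remote VM) is to serve the saved HTML file with Python's built-in `http.server`. This is a sketch using only the standard library, not part of pyLDAvis. The `make_static_server` helper is a hypothetical name, and it assumes you have downloaded the saved `LDAviz.html` into the directory you pass in.

```python
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

def make_static_server(directory, port=0):
    """Create (but do not start) a local server for the files in `directory`.

    port=0 asks the operating system to pick any free port.
    """
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return HTTPServer(("127.0.0.1", port), handler)

# To use it, create the server, note the chosen port, and start serving:
#   server = make_static_server(".")
#   print("Open http://127.0.0.1:%d/LDAviz.html" % server.server_address[1])
#   server.serve_forever()   # stop with Ctrl+C or by interrupting the kernel
```

Binding to `127.0.0.1` keeps the page visible only to your own computer.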