<a href="https://colab.research.google.com/github/mkane968/Extracted-Features/blob/master/Topic_Modeling_with_SciFi_Corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Topic Modeling with Gensim and pyLDAvis

This Colab Notebook guides you through using Python to create an interactive topic modeling visualization. It walks you through the steps of importing data and the necessary packages, cleaning and processing text data, creating a topic model, and visualizing the topics in an interactive, web-based application.


If you would like to do more advanced topic modeling, such as integrating Mallet, testing the coherence of the model, visualizing metrics, or examining topic distributions over a set of documents, Gensim provides a wide array of resources. I will separately upload a tutorial on advanced topic modeling strategies.

# Mount Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Upload Files

In [3]:
from google.colab import files

uploaded = files.upload()

Saving output (2).csv to output (2).csv


# Convert CSV to Data Frame

In [4]:
import numpy as np
import pandas as pd
import io

In [8]:
df = pd.read_csv(io.StringIO(uploaded['output (2).csv'].decode('utf-8')))
df

Unnamed: 0.1,Unnamed: 0,Book + Chapter,Text
0,0,1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_0,we d211 d249 d261 d340 d421 d457 1963 a a ...
1,1,1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_1,okay they 1 25mg a a a a a a a a a a...
2,2,1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_2,2 a a a a a a a a a a a a a ...
3,3,1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_3,2143 3 4 a a a a a a a a a a a a a a ...
4,4,1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_4,watch 4 a a a a a a a a a a...
...,...,...,...
78,78,1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_0,a a a a after and and and and and another a...
79,79,1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_1,come dad fine for i its im ive the they wev...
80,80,1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_2,then and called captives city dead flames last...
81,81,1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_3,a amateur and and characters city conducted c...


In [10]:
data = df.Text.values.tolist()
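
If any row of the `Text` column is empty, pandas reads it as `NaN` (a float), and the regular-expression cleaning later in the notebook will raise an error on it. The sketch below shows one way to guard against that by replacing missing values with empty strings before building the list. The tiny CSV and the names `sample_bytes`, `sample_df`, and `sample_data` are made up for illustration and stand in for the uploaded file.

```python
import io
import pandas as pd

# Hypothetical stand-in for uploaded['output (2).csv']: raw bytes of a tiny CSV.
sample_bytes = b"Book + Chapter,Text\nBOOK_Chapter_0,we play a game\nBOOK_Chapter_1,\n"

sample_df = pd.read_csv(io.BytesIO(sample_bytes), encoding="utf-8")

# Empty cells arrive as NaN; replace them so every entry is a string.
sample_data = sample_df["Text"].fillna("").astype(str).tolist()
print(sample_data)
```

Using `fillna("")` rather than converting everything with `str` avoids turning missing chapters into the literal token `nan`.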

### View Dataframe

In [11]:
print(df)

    Unnamed: 0                              Book + Chapter  \
0            0  1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_0   
1            1  1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_1   
2            2  1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_2   
3            3  1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_3   
4            4  1963_DICK_THEGAMEPLAYERSOFTITAN _Chapter_4   
..         ...                                         ...   
78          78  1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_0   
79          79  1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_1   
80          80  1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_2   
81          81  1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_3   
82          82  1965_DELANY_CITYOFATHOUSANDSUNS _Chapter_4   

                                                 Text  
0    we d211 d249 d261 d340 d421 d457    1963 a a ...  
1   okay  they          1 25mg a a a a a a a a a a...  
2                     2 a a a a a a a a a a a a a ...  
3            2143 3 4 a a a a a

In [None]:
%load_ext google.colab.data_table 
df

# Convert Scifi.CSV to Data Frame

In [None]:
#dfSF = pd.read_csv(SFdata, usecols=['BookChapter', 'text'], encoding = 'utf-8')

In [None]:
#dfSF['text']=dfSF['text'].apply(str)

In [None]:
#dataSF = dfSF.text.values.tolist()

# Clean Texts

In [34]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# A simple way to add further stop words
#stop_words.append('movie')
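
To add several stop words at once, `extend` is a convenient alternative to repeated `append` calls. The sketch below uses a short hard-coded list as a stand-in for NLTK's English stop words, and the extra words (`said`, `chapter`) are hypothetical choices for a fiction corpus. Converting the list to a set is optional, but it makes the membership test faster on long texts.

```python
# Stand-in for stopwords.words('english'); the real NLTK list is much longer.
stop_words = ["the", "a", "and", "of"]

# Hypothetical corpus-specific additions.
custom_stop_words = ["said", "chapter"]
stop_words.extend(custom_stop_words)

# A set gives constant-time lookups when filtering many tokens.
stop_set = set(stop_words)

tokens = ["the", "captain", "said", "a", "word"]
filtered = [token for token in tokens if token not in stop_set]
print(filtered)
```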

In [15]:
!pip3 install spacy
!python -m spacy download en_core_web_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
2022-08-01 19:52:37.344077: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl (587.7 MB)
     |████████████████████████████████| 587.7 MB 8.9 kB/s
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [16]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

In [17]:
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

In [18]:
import re

In [None]:
# Remove e-mail-like tokens, collapse whitespace, and drop apostrophes
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub(r'\s+', ' ', sent) for sent in data]
data = [re.sub(r"'", "", sent) for sent in data]
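
To see what the three substitutions do, it can help to run them on a single string. The sample sentence below is made up for illustration; the patterns match the ones applied to `data`.

```python
import re

# Made-up sample text containing an e-mail address, extra whitespace, and an apostrophe.
sample = "Write to  bob@example.com\n about the   ship's log"

step1 = re.sub(r"\S*@\S*\s?", "", sample)  # drop e-mail-like tokens
step2 = re.sub(r"\s+", " ", step1)         # collapse runs of whitespace
step3 = re.sub(r"'", "", step2)            # remove apostrophes
print(step3)
```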

In [31]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes; deacc=True strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

In [21]:
print(data_words[:10])



In [22]:
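# Learn which adjacent word pairs (and triples) co-occur often enough to count as phrases;
# a higher threshold yields fewer phrases
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Phraser wraps the trained models in a lighter object for applying them to documents
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)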



In [32]:
def remove_stopwords(texts):
    # Re-tokenize each document and drop any token found in stop_words
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    # Merge detected word pairs into single tokens joined by an underscore
    return [bigram_mod[doc] for doc in texts]

#def make_trigrams(texts):
#    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    # Keep only the lemmas of tokens whose part of speech is allowed
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [35]:
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
   'NOUN', 'ADJ', 'VERB', 'ADV'
])

In [36]:
print(data_lemmatized[:4])

[['absolute', 'ace', 'ace', 'back', 'good', 'book_book', 'books_book', 'come', 'case', 'characterize', 'chill', 'clear', 'collection', 'convention', 'cosmic', 'cryptic', 'day', 'divide', 'edition', 'eighteenth', 'enough', 'enthusiasm', 'entirely', 'eye', 'feeling', 'fertile', 'fiction', 'follow', 'frighteningly', 'futurity', 'game', 'gameplaye', 'garden', 'hammer', 'handwriting', 'hed_he', 'high', 'hostile', 'include', 'know', 'know', 'lead', 'life', 'lottery', 'magazine', 'man', 'matchbook', 'message', 'mind', 'nightmare', 'northern', 'novellength', 'novel', 'observer', 'peaceful', 'planet', 'play', 'puppet', 'quite', 'recognition', 'record', 'relation', 'resident', 'science', 'sciencefiction', 'seem', 'short', 'show', 'sky', 'solar', 'speculative', 'stake', 'start', 'story', 'sure', 'surround', 'symbolism', 'talents_talent', 'talk', 'time', 'titanian', 'unusual', 'usual', 'variable', 'volume', 'vugs_vug', 'want', 'wife', 'work'], ['mg', 'abandon', 'absence', 'absorb', 'accident', 'ac

# Building Dictionary and Corpus

In [37]:
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus)

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 2), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1)], [(2, 2), (10, 1), (21, 1), (22, 1), (25, 1), (36, 1), (38, 1), (46, 1), (54, 1), (64, 1), (70, 2), (85, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1),
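
Each document in `corpus` is a list of `(token_id, count)` pairs: the id comes from the dictionary, and the count is how often that token occurs in the document. The plain-Python sketch below mimics that representation on two toy documents to make the output above easier to read. It only illustrates the idea and is not gensim's implementation; the ids it assigns are not guaranteed to match what `corpora.Dictionary` would produce, and the names `toy_docs`, `token2id`, and `toy_corpus` are made up.

```python
from collections import Counter

toy_docs = [["game", "planet", "game"], ["planet", "city"]]

# Give every distinct token an integer id (ids here are illustrative only).
token2id = {}
for doc in toy_docs:
    for token in sorted(set(doc)):
        if token not in token2id:
            token2id[token] = len(token2id)

# Represent each document as sorted (token_id, count) pairs.
toy_corpus = [
    sorted((token2id[token], count) for token, count in Counter(doc).items())
    for doc in toy_docs
]
print(token2id)
print(toy_corpus)
```

In the first toy document, `game` appears twice, so its pair carries a count of 2, just as `(1, 2)` does in the real output.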

# Create Topic Model - 20 Topics

In [38]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20,
                                           random_state=100,
                                           update_every=2,
                                           chunksize=100,
                                           passes=20,
                                           alpha='auto',
                                           per_word_topics=True)

# Create Visualization (Save HTML)

The easiest way to create the visualization is to display it in the Google Colab notebook and save it as an HTML file that you can open in your browser.

In [39]:
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
     |████████████████████████████████| 1.7 MB 4.4 MB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (PEP 517) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136898 sha256=30b9b8d5106813d49a9a1abba04b5a9ac8f2de9ef663bcda5bf3408dcf9473c2
  Stored in directory: /root/.cache/pip/wheels/c9/21/f6/17bcf2667e8a68532ba2fbf6d5c72fdf4c7f7d9abfa4852d2f
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.



In [40]:
vis = gensimvis.prepare(lda_model, corpus, id2word)

#vis = gensimvis.prepare(lda_model, corpus, id2word, mds='mmds')



In [41]:
pyLDAvis.save_html(vis, '/content/LDAviz.html')

In [42]:
pyLDAvis.display(vis)

# Create Topic Model - 60 Topics

In [None]:
lda_model60 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=60,
                                           random_state=100,
                                           update_every=2,
                                           chunksize=100,
                                           passes=20,
                                           iterations=200,
                                           alpha='auto',
                                           per_word_topics=True)

# Create Visualization (Save HTML)

As with the 20-topic model, the easiest way to view the result is to display it in the Google Colab notebook and save it as an HTML file that you can open in your browser.

In [None]:
# pyLDAvis 3.x exposes the gensim helper as pyLDAvis.gensim_models (imported above as gensimvis)
vis60 = gensimvis.prepare(lda_model60, corpus, id2word)
#vis60 = gensimvis.prepare(lda_model60, corpus, id2word, mds='mmds')

In [None]:
pyLDAvis.save_html(vis60, '/content/LDAviz60.html')

In [None]:
pyLDAvis.display(vis60)

# Serve Visualization in Browser

You can also serve the visualization locally in your browser using the code below. Be aware that browser caching and other issues, such as ad blockers, may require some debugging before this works on your machine.

In [None]:
#pyLDAvis.enable_notebook()
#pyLDAvis.show(vis)
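
If `pyLDAvis.show` proves difficult to get working, another option on a local machine is to serve the saved HTML file with Python's built-in `http.server` module. This is a different technique from `pyLDAvis.show`, offered only as a sketch: `make_viz_server` is a hypothetical helper name, and the directory is assumed to contain an HTML file saved with `pyLDAvis.save_html`. Inside Colab the server would run on the remote machine, so this is intended for a local Python session.

```python
import functools
import http.server


def make_viz_server(directory, port=8000):
    """Build (but do not start) an HTTP server rooted at `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to localhost only so the files are not exposed on the network.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling `make_viz_server('.').serve_forever()` from the folder that holds the file, and then opening `http://127.0.0.1:8000/LDAviz.html`, should display the visualization; pressing Ctrl+C stops the server.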