# HW: Topic Modeling with Gensim
Topic modeling a corpus of poems from Dorothy Parker's _Enough Rope_, found here: https://www.gutenberg.org/cache/epub/68353/pg68353.txt

## Set up

In [379]:
! pip install funcy



In [380]:
! pip install tzdata



In [381]:
! pip install --no-dependencies pyLDAvis



In [382]:
! pip install wget



In [383]:
from collections import defaultdict
import wget
from gensim import corpora, models
import pandas as pd
import pyLDAvis.gensim
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import requests
import spacy

# set up nlp pipeline
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')

['ner', 'parser']

## Upload data

In [384]:
# use requests library to load text
url = 'https://www.gutenberg.org/cache/epub/68353/pg68353.txt'
response = requests.get(url)
text = response.text

In [385]:
start = text.find('_Threnody_')
end = text.find('*** END OF THE PROJECT GUTENBERG EBOOK ENOUGH ROPE: POEMS ***') - 1

In [386]:
tale = text[start:end]
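`str.find` returns `-1` when a marker is missing, in which case the slice above silently yields the wrong span instead of failing. A minimal sketch of a guarded version follows. The helper name `slice_between` and the sample string are hypothetical and not part of the original notebook, and unlike the cell above it does not subtract 1 from the end index.

```python
def slice_between(text, start_marker, end_marker):
    """Return `text` from `start_marker` up to (not including) `end_marker`."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


sample = "header _Threnody_ poem body *** END *** footer"
print(slice_between(sample, "_Threnody_", "*** END ***"))
```

With this shape, a changed Project Gutenberg header or footer would raise an error at this step rather than produce a truncated corpus later on.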

In [387]:
tale_poems = tale.split('_')

In [388]:
for i in range(len(tale_poems)):
    tale_poems[i] = tale_poems[i].replace('\n', '')
    tale_poems[i] = tale_poems[i].replace('\r', '')
    tale_poems[i] = tale_poems[i].strip()

In [389]:
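# build parallel lists of authors, titles, and poem texts for the dataframe
# note: `text` now shadows the raw download from above, which is no longer needed
author = []
title = []
text = []

for poem in tale_poems:
    # one author entry per fragment; zip() below truncates to the shortest list
    author.append('Dorothy Parker')
    # short fragments that start with a capital letter are treated as titles
    if 2 < len(poem) < 50 and poem[0].isupper():
        title.append(poem)
    # long fragments are treated as poem bodies
    elif len(poem) > 50:
        text.append(poem)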
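Because titles and bodies are collected into separate lists, a single misclassified fragment (for example, an italicized word inside a poem, which also sits between underscores) would shift every later title against its text once the lists are zipped. One possible alternative, sketched below with the same length heuristics, is to pair each title with the next long fragment as it is encountered. The function name `pair_titles_with_poems` and the toy fragments are illustrative assumptions, not part of the original notebook.

```python
def pair_titles_with_poems(fragments, max_title_len=50):
    """Pair each title-like fragment with the next long fragment that follows it."""
    pairs = []
    pending_title = None
    for fragment in fragments:
        fragment = fragment.strip()
        # same heuristic as the notebook: short and starting with a capital letter
        is_title = 2 < len(fragment) < max_title_len and fragment[0].isupper()
        if is_title:
            # remember the most recent title; a later title replaces an unused one
            pending_title = fragment
        elif len(fragment) > max_title_len and pending_title is not None:
            pairs.append((pending_title, fragment))
            pending_title = None
    return pairs


# toy fragments standing in for the result of tale.split('_')
toy_fragments = [
    "",
    "Threnody",
    "first poem body " * 5,
    "so",
    "The Small Hours",
    "second poem body " * 5,
]
pairs = pair_titles_with_poems(toy_fragments)
print([title for title, _ in pairs])
```

Long fragments that appear before any title are dropped in this sketch, which is a design choice rather than a requirement.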

In [390]:
# create dataframe
df = pd.DataFrame(list(zip(author, title, text)), columns=['author', 'title', 'text'])
df.head()

| | author | title | text |
|---|---|---|---|
| 0 | Dorothy Parker | Threnody | Lilacs blossom just as sweet Now my heart i... |
| 1 | Dorothy Parker | The Small Hours | No more my little song comes back; And no... |
| 2 | Dorothy Parker | The False Friends | They laid their hands upon my head, They st... |
| 3 | Dorothy Parker | The Trifler | Death's the lover that I'd be taking; Wil... |
| 4 | Dorothy Parker | A Very Short Song | Once when I was young and true, Someone l... |


In [391]:
# extract lemmas
def process_text(text):
    """Replace newlines with spaces, drop stop words and non-alphabetic
    tokens, and lemmatize. Returns a string of lowercase lemmas."""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    lemmas = [token.lemma_ for token in no_punct]
    lemmas_lower = [lemma.lower() for lemma in lemmas]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string

In [392]:
# apply process_text to text column
df['lemmas'] = df['text'].apply(process_text)

In [393]:
# sanity check
df.head()

| | author | title | text | lemmas |
|---|---|---|---|---|
| 0 | Dorothy Parker | Threnody | Lilacs blossom just as sweet Now my heart i... | lilacs blossom sweet heart shatter bowl street... |
| 1 | Dorothy Parker | The Small Hours | No more my little song comes back; And no... | little song come night lie head watch black wa... |
| 2 | Dorothy Parker | The False Friends | They laid their hands upon my head, They st... | lay hand head stroke cheek brow time heal hurt... |
| 3 | Dorothy Parker | The Trifler | Death's the lover that I'd be taking; Wil... | death lover take wild fickle fierce small care... |
| 4 | Dorothy Parker | A Very Short Song | Once when I was young and true, Someone l... | young true leave break brittle heart bad love ... |


## Prepare data for topic model


In [394]:
# extract the data out of the DataFrame
documents = df['lemmas'].to_list()
df.head()



In [395]:
df.to_csv('dorothy_parker_poems.csv', index=False)

`Gensim` needs each document to be tokenized, that is, split into a list of individual words. We can use a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to do this quickly. When complete, our data will look like this:

`[
  ['This', 'is', 'document', '1'],
  ['This', 'is', 'document', '2'],
  ['This', 'is', 'document', '3'],
]`

In [396]:
# tokenize - the syntax below will create a list of lists
texts = [document.lower().split() for document in documents]
len(texts)

91

Building a useful topic model takes a lot of preparation. An important part of that preparation is eliminating "noise" from your model, and one way to do this is to remove tokens that carry little information. Here we remove tokens that occur only once in the whole corpus. **You may want to adjust this threshold as you refine your topic model.**

In [397]:
# create a count of each token
frequency = defaultdict(int)
for text in texts:
  for token in text:
    frequency[token] += 1

In [398]:
# remove words that appear only 1 time
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
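To make the threshold easy to adjust, the two cells above could be folded into one function with a `min_count` parameter. This is a sketch only: the function name `drop_rare_tokens` and the toy documents are hypothetical, and `collections.Counter` is swapped in for the `defaultdict` used above.

```python
from collections import Counter


def drop_rare_tokens(texts, min_count=2):
    """Keep only tokens that occur at least `min_count` times across all documents."""
    frequency = Counter(token for text in texts for token in text)
    return [
        [token for token in text if frequency[token] >= min_count]
        for text in texts
    ]


toy_texts = [["love", "heart", "lilac"], ["love", "heart", "song"], ["love", "night"]]
print(drop_rare_tokens(toy_texts, min_count=2))
# [['love', 'heart'], ['love', 'heart'], ['love']]
print(drop_rare_tokens(toy_texts, min_count=3))
# [['love'], ['love'], ['love']]
```

The default `min_count=2` reproduces the behavior above (tokens that appear only once are dropped). Raising it removes more vocabulary, which may matter in a corpus as small as this one.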

## Build topic model

In [405]:
# create a dictionary based off our texts
# The dictionary maps each token to a unique integer id
dictionary = corpora.Dictionary(texts)

In [406]:
# create a corpus based off our dictionary and our texts
corpus = [dictionary.doc2bow(text) for text in texts]
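To make the two steps above concrete, here is a pure-Python sketch of the idea behind them: assign each distinct token an integer id, then represent each document as `(token_id, count)` pairs. It illustrates the bag-of-words format only. It is not Gensim's implementation, the ids it assigns may differ from those of `corpora.Dictionary`, and the names `build_token_ids` and `to_bow` are hypothetical.

```python
from collections import Counter


def build_token_ids(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    token_ids = {}
    for text in texts:
        for token in text:
            if token not in token_ids:
                token_ids[token] = len(token_ids)
    return token_ids


def to_bow(text, token_ids):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token_ids[token] for token in text if token in token_ids)
    return sorted(counts.items())


toy_texts = [["love", "heart", "love"], ["heart", "song"]]
token_ids = build_token_ids(toy_texts)
print(token_ids)                        # {'love': 0, 'heart': 1, 'song': 2}
print(to_bow(toy_texts[0], token_ids))  # [(0, 2), (1, 1)]
```

Word order is discarded in this representation: the model only sees which tokens occur in a document and how often.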

In [407]:
# build LDA model
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=75)

In [408]:
# explore topics
lda_model.print_topics()

[(0,
  '0.040*"heart" + 0.024*"man" + 0.024*"word" + 0.024*"woman" + 0.024*"love" + 0.016*"eye" + 0.016*"find" + 0.016*"whistle" + 0.016*"away" + 0.016*"lady"'),
 (1,
  '0.028*"little" + 0.024*"like" + 0.024*"oh" + 0.020*"heart" + 0.020*"death" + 0.012*"let" + 0.012*"wait" + 0.012*"till" + 0.012*"young" + 0.012*"know"'),
 (2,
  '0.029*"fair" + 0.024*"lad" + 0.024*"pass" + 0.014*"little" + 0.014*"life" + 0.014*"young" + 0.014*"good" + 0.014*"way" + 0.010*"play" + 0.010*"affluent"'),
 (3,
  '0.067*"love" + 0.034*"shall" + 0.021*"heart" + 0.018*"lady" + 0.018*"way" + 0.017*"die" + 0.017*"old" + 0.017*"day" + 0.017*"run" + 0.013*"year"'),
 (4,
  '0.022*"look" + 0.022*"time" + 0.017*"lie" + 0.011*"pretty" + 0.011*"way" + 0.011*"leave" + 0.011*"word" + 0.011*"warm" + 0.011*"hair" + 0.011*"head"'),
 (5,
  '0.033*"shall" + 0.018*"know" + 0.015*"oh" + 0.015*"heart" + 0.015*"laugh" + 0.015*"tell" + 0.015*"thing" + 0.015*"little" + 0.015*"rain" + 0.015*"hand"'),
 (6,
  '0.002*"sepulchral" + 0.002
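Each entry in this output is a `(topic_id, string)` pair, where the string lists the topic's top words with their weights. If the pairs are needed as data, for example to build a table, the string can be parsed with a regular expression. The sketch below assumes the `weight*"word" + ...` format shown above; `parse_topic_string` is a hypothetical helper. Gensim's `lda_model.show_topic(topic_id)` should return `(word, probability)` pairs directly, which avoids parsing altogether.

```python
import re

# matches fragments such as 0.040*"heart"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic_string(topic_string):
    """Turn a string like '0.040*"heart" + 0.024*"man"' into [(word, weight), ...]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


example = '0.040*"heart" + 0.024*"man" + 0.024*"word"'
print(parse_topic_string(example))
# [('heart', 0.04), ('man', 0.024), ('word', 0.024)]
```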

In [409]:
# Find topics in each document
lda_model.get_document_topics(corpus[0])

[(11, 0.9813721)]
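The result is a list of `(topic_id, probability)` pairs. Topics whose probability falls below the model's minimum threshold are omitted, which is why only topic 11 appears for this poem. A small sketch for picking the most probable topic from such a list follows; `dominant_topic` is a hypothetical helper and the second input is made-up data for illustration.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None if empty."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


print(dominant_topic([(11, 0.9813721)]))                  # (11, 0.9813721)
print(dominant_topic([(3, 0.25), (7, 0.6), (11, 0.15)]))  # (7, 0.6)
```

Applied to every document in `corpus`, a helper like this would give one topic label per poem, which could then be compared against the poem titles.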

In [410]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis