**ASSIGNMENT 3: TOPIC MODELING**

This notebook introduces you to topic modeling with Latent Dirichlet Allocation (LDA). A topic model takes a collection of texts and attempts to identify different "topics" based on words that tend to appear together. For this assignment, we'll be working with two corpora: **associatedPress.txt** and **poetryFoundation.csv**. The first is a collection of articles from the Associated Press and the second is a collection of poems scraped from the Poetry Foundation website.

**PART 1: Associated Press Articles**

First, let's import the libraries we'll be using for the activity. LDA topic modeling has been implemented in many different libraries, but we'll use the sklearn version, which is the same library we used in our previous assignment.

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation

One library used for visualization does not come pre-installed in Google Colab, so we install it here.

In [None]:
!pip install pyLDAvis==2.1.2

Once it's installed, we can import that as well.

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

Now, go ahead and upload **associatedPress.txt**.

In [None]:
from google.colab import files
files.upload()

Saving associatedPress.txt to associatedPress.txt




The next cell reads the uploaded text file into a format Python can use.

In [None]:
with open('associatedPress.txt', 'r') as txtfile:
    text = txtfile.read()

The Associated Press data comes wrapped in HTML tags, so in this next cell we extract the plain text from those tags.

In [None]:
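# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')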
ap_texts = [elem.text for elem in soup.find_all('text')]
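
If you want to see what this extraction step does without the full file, here is a minimal sketch on a made-up snippet. The tag layout (`<DOC>`, `<DOCNO>`, `<TEXT>`) is an assumption about how the file is structured, and the sketch swaps BeautifulSoup for a standard-library regular expression so that it runs on its own.

```python
import re

# A made-up two-document snippet; the real file's exact layout is assumed, not verified.
sample = """<DOC>
<DOCNO> doc-1 </DOCNO>
<TEXT>
 First toy article.
 </TEXT>
</DOC>
<DOC>
<DOCNO> doc-2 </DOCNO>
<TEXT>
 Second toy article.
 </TEXT>
</DOC>"""

# Capture everything between each <TEXT> ... </TEXT> pair (DOTALL lets '.' span newlines).
toy_texts = [m.strip() for m in re.findall(r"<TEXT>(.*?)</TEXT>", sample, flags=re.DOTALL)]
print(toy_texts)
```

A regular expression is fine for a toy like this, but a real parser such as BeautifulSoup is the safer choice for messy markup.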

This cell converts the texts into a bag-of-words representation by counting how often each word appears in each document. It also removes standard English stop words plus a few very common tokens that would otherwise swamp the results, such as "said," which appears constantly in news articles.

In [None]:
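# Extra tokens to drop on top of sklearn's built-in English stop-word list
add_stopwords = ['000', 'said']
# Pass a list: recent sklearn versions validate stop_words and reject a frozenset
stopwords = list(ENGLISH_STOP_WORDS.union(add_stopwords))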
ap_vectorizer = CountVectorizer(
    ngram_range=(1, 1),
    stop_words=stopwords,
    max_features=10000
)
ap_features = ap_vectorizer.fit_transform(ap_texts)
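
To make "bag of words" concrete, here is a small sketch that builds the same kind of count matrix by hand. The two documents and the tiny stop-word set are made up, and the tokenizing rule (lowercase, tokens of two or more characters) is meant to imitate CountVectorizer's defaults rather than reproduce it exactly.

```python
import re
from collections import Counter

# Toy documents and a tiny stop-word list, both made up for illustration.
toy_docs = [
    "The senate said the budget vote passed.",
    "Police said the protest ended.",
]
toy_stopwords = {"the", "said"}

def bag_of_words(doc):
    # Lowercase, keep runs of two or more word characters, then drop stop words.
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return Counter(t for t in tokens if t not in toy_stopwords)

toy_bags = [bag_of_words(d) for d in toy_docs]
# The vocabulary is every word that survives; each document becomes one row of counts.
toy_vocab = sorted(set().union(*toy_bags))
toy_matrix = [[bag[w] for w in toy_vocab] for bag in toy_bags]

print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, which is the shape the topic model expects. Word order is thrown away entirely.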

In the following cell, you can declare your target number of topics and then train the topic model. By default, it is set to 25 topics, but you can experiment with different numbers if you like.

Note that topic models take a while to train. On this corpus with 25 topics, it might take a full minute to run. Increasing the number of topics will make it take longer.

In [None]:
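number_of_topics = 25  # the default described above; change it to experiment
seed_number = 100      # fixed seed so repeated runs give the same topics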

ap_lda = LatentDirichletAllocation(n_components=number_of_topics, random_state=seed_number)
ap_lda_docs = ap_lda.fit_transform(ap_features)
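
After fitting, sklearn stores the topic-word weights in the model's `components_` attribute, with one row per topic and one column per vocabulary word. Those weights are pseudo-counts, so dividing a row by its sum turns it into a probability distribution over words. The sketch below shows that step and the "top words" ranking on made-up numbers; `toy_components`, `toy_vocab`, and `top_words` are illustrative names, not part of the assignment code.

```python
# Made-up topic-word weights: 2 topics (rows) over a 5-word vocabulary (columns).
toy_vocab = ["budget", "police", "protest", "senate", "vote"]
toy_components = [
    [8.0, 0.5, 0.5, 6.0, 5.0],
    [0.5, 9.0, 7.0, 0.5, 3.0],
]

def top_words(weights, vocab, n=3):
    # Normalize the row so it sums to 1, then keep the n most probable words.
    total = sum(weights)
    probs = [w / total for w in weights]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

for topic_number, row in enumerate(toy_components, start=1):
    words = [word for word, _ in top_words(row, toy_vocab)]
    print(topic_number, words)
```

Notice that "vote" shows up in both toy topics. A word can carry weight in several topics at once, which is worth remembering when you interpret the real output.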

Now, we can use pyLDAvis, a very handy Python library for LDA visualization, to produce a nice interactive graphic that lets us explore our topic model.

There are two things to note about this visualization. The right side gives the top words for a selected topic. Blue bars represent the frequency of the word in the entire corpus and red bars represent its frequency within the specific topic. The left side shows a 2-dimensional projection of the topics. Topics that are similar to each other will appear closer to each other on that visualization.

In [None]:
ap_viz = pyLDAvis.sklearn.prepare(ap_lda, ap_features, ap_vectorizer, sort_topics=False)
pyLDAvis.display(ap_viz)
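
What does it mean for two topics to be "similar"? As far as I know, pyLDAvis by default compares topics using the Jensen-Shannon divergence between their word distributions and then squeezes those distances into two dimensions. The sketch below computes that divergence for made-up distributions, just to show that topics with similar word probabilities get a smaller value. It is an illustration of the measure, not a copy of the library's code.

```python
import math

def kl_divergence(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: average KL divergence of p and q from their midpoint.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up word distributions for three topics over the same 4-word vocabulary.
topic_a = [0.40, 0.40, 0.10, 0.10]
topic_b = [0.35, 0.45, 0.10, 0.10]   # close to topic_a
topic_c = [0.05, 0.05, 0.45, 0.45]   # very different from topic_a

print(js_divergence(topic_a, topic_b))
print(js_divergence(topic_a, topic_c))
```

The first number should be much smaller than the second, so topic_a and topic_b would sit near each other on the left-hand map while topic_c would sit far away.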

Once the model has been trained, each document is represented as a distribution of topics: 25% topic 1, 19% topic 2, 15% topic 3, and so on. In this cell, we can choose a topic and print out the documents that have the highest topic proportion for that topic, as a simple way of seeing some of the most "characteristic" documents for that topic. 

You can select the topic of your choice with the "topic" variable, as well as alter the number of documents to view (it is set to 5 by default). 

Since it is just plain text, the formatting is kind of annoying to read, unfortunately.

In [None]:
topic = 10
num_docs_to_view = 5

# Column (topic - 1) holds every document's proportion for the chosen topic; rank it in descending order
topic_top_docs = ap_lda_docs[:, topic - 1].argsort()[::-1][:num_docs_to_view]
for docid in topic_top_docs:
    print(ap_texts[docid])


 Soldiers shot at anti-government demonstrators in central Bucharest tonight after the protesters occupied state-run television and stormed and burned police headquarters, witnesses said. One witness reported seeing at least two bodies after the shooting, but this could not be immediately confirmed. The most serious violence in the Romanian capital since December's revolution that toppled the Ceausescu regime was caused by a pre-dawn police raid that ended a 53-day anti-Communist protest in University Square. Demonstrators clashed with police during the afternoon in the square, then fanned out to police headquarters, secret police headquarters, the TV station and Victory Square, headquarters of the governing National Salvation Front. The shooting occurred outside the secret police headquarters, where the late dictator Nicolae Ceausescu's hated Securitate force operated. Before the buildings were attacked, President-elect Ion Iliescu issued a communique calling on all ``aware and respo
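
If the ranking step in the cell above feels opaque, here is the same idea on a made-up table of topic proportions, written in plain Python. The numbers and the `toy_` names are illustrative only.

```python
# Made-up document-topic proportions: 4 documents (rows) over 3 topics (columns).
toy_doc_topics = [
    [0.70, 0.20, 0.10],
    [0.10, 0.85, 0.05],
    [0.30, 0.60, 0.10],
    [0.25, 0.25, 0.50],
]

topic = 2             # 1-indexed, as in the cell above (which subtracts 1)
num_docs_to_view = 2

# Pull out the column for the chosen topic, then rank document indices by that value.
column = [row[topic - 1] for row in toy_doc_topics]
toy_top_docs = sorted(range(len(column)), key=lambda i: column[i], reverse=True)[:num_docs_to_view]
print(toy_top_docs)
```

Documents 1 and 2 (counting from 0) come out on top because they devote the largest share of their words to topic 2.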

**Part 1 Reflection**

Spend a few moments browsing the topic output for this corpus. Can you assign a logical theme to any of the topics? Try to come up with names for five of them and list a few words that support your interpretation. Then, try to see if you can figure out roughly how old these articles are based on the topic output. Just give it your best guess (though some of the topics might be dead giveaways).

**PART 2: Poetry**

In our reading by Lisa Rhody, we saw that the output of a topic model trained on poetry won't be "topics" per se, since poetry is extremely varied in the language it uses to discuss specific topics or themes. Instead it produces what might better be described as "discourses." In this section, we'll see that phenomenon in action by looking at 4215 poems scraped from the Poetry Foundation website. These are all the poems on the site written by the 200 authors that appear most frequently on the site. A list of these authors ordered by frequency is available as "authors.txt" on Canvas (William Shakespeare is, unsurprisingly, #1).

Go ahead and upload **poetryFoundation.csv**. This time, it's a .csv (comma-separated values) file, which is a spreadsheet format.

In [None]:
files.upload()

The spreadsheet includes information on genre and author-name that we won't be using for the topic model itself, so the next cell extracts just the text of the "Poem" column.

In [None]:
df = pd.read_csv('poetryFoundation.csv')
pf_texts = df['Poem']

Now, we make a bag-of-words representation of the text by counting up each word.

In [None]:
pf_vectorizer = CountVectorizer(
    ngram_range=(1, 1),
    stop_words=stopwords,
    max_features=10000
)
pf_features = pf_vectorizer.fit_transform(pf_texts)

Now, train your topic model. It uses 20 topics by default, though you are welcome to tinker with this number.

This topic model will take a little longer to train than the previous one, since we're working with more than 4,000 poems. Give it time to run - it might take a couple of minutes.

In [None]:
number_of_topics = 20
seed_number = 100

pf_lda = LatentDirichletAllocation(n_components=number_of_topics, random_state=seed_number)
pf_lda_docs = pf_lda.fit_transform(pf_features)

Finally, we make a visualization of our topic model.

In [None]:
pf_viz = pyLDAvis.sklearn.prepare(pf_lda, pf_features, pf_vectorizer, sort_topics=False)
pyLDAvis.display(pf_viz)

Again, we can print out the top poems for each topic. This cell also prints out the title and author of each poem. The formatting will likely be a little bit wacky, so you may want to just scroll to the title and poet and find the poem by Googling it. They should theoretically all be available on the Poetry Foundation website.

In [None]:
topic = 1
num_docs_to_view = 5

# Column (topic - 1) holds every poem's proportion for the chosen topic; rank it in descending order
topic_top_docs = pf_lda_docs[:, topic - 1].argsort()[::-1][:num_docs_to_view]
for docid in topic_top_docs:
    print(df['Title'][docid])
    print(df['Poet'][docid])
    print(pf_texts[docid])

**Part 2 Reflection**

Take a few moments to browse the output of the topic model and write a brief reflection on the results. How do these topics compare to the topics produced by the first model? Do you find them equally coherent or more difficult to interpret?

Then, try to come up with a coherent theme for three of the topics and cite a few of the top words to support your interpretation. Keep in mind the question of figurative language. Do you see any topics in which one of the words appears to be a metaphor, image, or trope rather than a word on a specific "topic"? I'll be especially impressed if you can find an interpretation of a topic that is "semantically opaque," as Lisa Rhody puts it. But the topics will be tricky, so it's also fine to just pick the three most self-evident topics you find.

Note that since the corpus includes poetry composed over more than 400 years, your topic model might be picking up incidental, non-topical stuff, like historical linguistic variation. It also might pick up the personal style of a particular author if that author appears very frequently in the corpus (read: Shakespeare). That's ok - it's a limitation of the data with which we are working rather than the method itself.
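
One rough way to check whether a topic is really one author's style is to count who wrote the poems that load most heavily on it. The sketch below does this on made-up (poet, topic) pairs. The data, the poet names, and the `author_share` helper are all hypothetical, and with the real corpus you would build the pairs from the model output and the "Poet" column.

```python
from collections import Counter

# Made-up (poet, dominant topic) pairs standing in for the real poems.
toy_poems = [
    ("Poet A", 1), ("Poet A", 1), ("Poet A", 1), ("Poet B", 1),
    ("Poet B", 2), ("Poet C", 2), ("Poet D", 2), ("Poet A", 2),
]

def author_share(poems, topic):
    # Count poets among the poems assigned to this topic, then report the largest share.
    poets = Counter(poet for poet, t in poems if t == topic)
    top_poet, count = poets.most_common(1)[0]
    return top_poet, count / sum(poets.values())

print(author_share(toy_poems, 1))
print(author_share(toy_poems, 2))
```

A topic where a single poet accounts for most of the poems, like toy topic 1, is a candidate for "author style" rather than theme. A topic spread evenly across poets, like toy topic 2, is less likely to be.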