# Discussion: Topic Modeling

## Group Names and Roles

- Eirik (Driver)
- Bryan (Reviewer)
- Celine (Proposer)

## Intro

In this Discussion activity, we'll continue with topic modeling. Recall that topic modeling can often be used to infer themes (or "topics") from sets of text data. Today, we will work through an example in which we download some data, prepare it appropriately, and deploy a topic model to obtain insights about the general themes present in the data.

Our data set for this activity consists of the texts of a number of Associated Press articles. It was originally collected by David Blei. I retrieved this data set [here](https://github.com/tdhopper/topic-modeling-datasets/tree/master/data/lda-c/blei-ap). 

Run the following code chunk to create a large string `s` containing the entire data set. 

In [1]:
import urllib.request
def retrieve_text(url):
    """
    Retrieve text from the specified url and return 
    it as a string
    """
    
    # grab the data and parse it
    filedata = urllib.request.urlopen(url) 
    data = filedata.read()
    
    return data.decode()

url = 'https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/blei-ap.txt'
s = retrieve_text(url)

## Part A

Inspect `s`. Don't print out the entire string; just take a look at the first 5,000 characters or so. Write a function which, when `s` is provided as input, will return a list of document texts. It should exclude the excess tags and other metadata. Call this function to create a new list variable called `texts`, where each element of `texts` is the complete text of one news story. 

- ***Hint***: *First, split `s` on the newline character `"\n"`. Then, return a list of elements of the result with length longer than 100. This can be done with a for-loop, but a conditional list comprehension will be more compact*

The resulting list of news stories should have length 2226. 

Comments and docstrings are not necessary for this function. 

In [2]:
s[:5000]

"<DOC>\n<DOCNO> AP881218-0003 </DOCNO>\n<TEXT>\n A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family 

In [11]:
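def modify_text(st):
    # split on newlines, then keep only the long lines (the article bodies);
    # tag and metadata lines are shorter than 100 characters
    lines = st.split('\n')
    return [line for line in lines if len(line) >= 100]

news_stories = modify_text(s)
# check the length of the result
print(len(news_stories))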

2226
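
The length filter works because the tag lines are short. As an alternative sketch, the article bodies could also be pulled out by matching the `<TEXT>` tags directly with a regular expression. This assumes every article body is closed by a `</TEXT>` tag, which the excerpt above does not show. The helper name `extract_texts` and the toy string `sample` are made up for illustration.

```python
import re

def extract_texts(st):
    # capture everything between <TEXT> and </TEXT>; re.DOTALL lets "." match newlines
    bodies = re.findall(r"<TEXT>(.*?)</TEXT>", st, flags=re.DOTALL)
    # trim the whitespace that surrounds each article body
    return [body.strip() for body in bodies]

# toy input laid out like the start of `s` (document numbers and closing tags are assumed)
sample = (
    "<DOC>\n<DOCNO> AP000000-0001 </DOCNO>\n<TEXT>\n First story. \n </TEXT>\n</DOC>\n"
    "<DOC>\n<DOCNO> AP000000-0002 </DOCNO>\n<TEXT>\n Second story. \n </TEXT>\n</DOC>\n"
)
sample_texts = extract_texts(sample)
print(sample_texts)
```

If the assumption holds for the real data, `extract_texts(s)` should give the same number of stories as the length filter, which is a quick way to cross-check both approaches.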


## Part B

Create a `pandas` data frame called `df` with a single column called `text`, whose rows are the entries of `texts`. This data frame should have 2226 rows. Show your data frame to check that it looks ok. 

In [14]:
import pandas as pd

df = pd.DataFrame({"text" : news_stories})
df

| | text |
|---|---|
| 0 | A 16-year-old student at a private Baptist sc... |
| 1 | The Bechtel Group Inc. offered in 1985 to sel... |
| 2 | A gunman took a 74-year-old woman hostage aft... |
| 3 | Today is Saturday, Oct. 29, the 303rd day of ... |
| 4 | Cupid has a new message for lovers this Valen... |
| ... | ... |
| 2221 | The dollar rose in quiet European trading thi... |
| 2222 | Here are the companies known to be conducting... |
| 2223 | Bloodstains on a pillowcase and exercise bar ... |
| 2224 | When Ron Thompson sat down for lunch on New Y... |


## Part C

Create the term-document matrix. The group's **Reviewer** might want to check the [lecture notes](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/NLP/NLP_2.ipynb) on topic modeling for some code to do this. Add the term-document matrix to `df`. Make sure that the columns are labeled with the relevant word. 

I found that, for the purposes of the later parts of this exercise, the following arguments to `CountVectorizer` worked well: 

> `max_df = 100, min_df=0.01, stop_words='english'`

However, please feel free to experiment. 

Call the new data frame with counts `big_df`. 
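
To see how `max_df` and `min_df` behave, here is a small sketch on a made-up corpus. In `CountVectorizer`, an integer value is read as an absolute number of documents, while a float between 0 and 1 is read as a proportion of documents. So `max_df = 100, min_df = 0.01` keeps words that appear in at least 1% of the documents but in no more than 100 of them. The corpus `toy_docs` and the other names below are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical corpus: "apple" is in all 4 documents, "durian" in only 1
toy_docs = [
    "apple banana",
    "apple cherry",
    "apple banana durian",
    "apple cherry",
]

# integer thresholds: keep words found in at least 2 and at most 3 documents
vec_counts = CountVectorizer(min_df = 2, max_df = 3)
vec_counts.fit(toy_docs)

# float thresholds: the same cutoffs written as proportions of the 4 documents
vec_props = CountVectorizer(min_df = 0.5, max_df = 0.75)
vec_props.fit(toy_docs)

kept_counts = sorted(vec_counts.vocabulary_)
kept_props = sorted(vec_props.vocabulary_)
print(kept_counts)
print(kept_props)
```

Both vectorizers drop "apple" (too common) and "durian" (too rare), which is the same kind of pruning the suggested arguments apply to the news stories.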

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_df = 100, min_df = 0.01, stop_words = "english")
counts = vec.fit_transform(df['text'])
counts = counts.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count_df = pd.DataFrame(counts, columns = vec.get_feature_names_out())
big_df = pd.concat((df, count_df), axis = 1)

## Part D

Create an input matrix `X` which is identical to `big_df` but drops the `text` column. Then, create a Nonnegative Matrix Factorization (NMF) model and fit it to `X`. Start with 10 components. 

In [18]:
X = big_df.drop(['text'], axis = 1)

from sklearn.decomposition import NMF
model = NMF(n_components = 10, init = "random", random_state = 0)
model.fit(X)

NMF(init='random', n_components=10, random_state=0)
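
As a reminder of what the fitted model contains, NMF approximates the nonnegative matrix `X` by a product `W @ H`, where `W` holds one row of topic weights per document and `H` (stored in `model.components_`) holds one row of word weights per topic. Below is a minimal sketch on a made-up count matrix; the names `toy_X`, `toy_model`, `W`, and `H` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF

# hypothetical term-document counts: 6 documents (rows) by 5 words (columns)
toy_X = np.array([
    [3, 2, 0, 0, 1],
    [4, 1, 0, 0, 0],
    [2, 3, 0, 1, 0],
    [0, 0, 5, 2, 1],
    [0, 1, 3, 4, 0],
    [0, 0, 4, 3, 1],
])

toy_model = NMF(n_components = 2, init = "random", random_state = 0, max_iter = 1000)
W = toy_model.fit_transform(toy_X)   # document-topic weights
H = toy_model.components_            # topic-word weights

print(W.shape, H.shape)
# size of the approximation error; smaller means W @ H is closer to toy_X
print(np.linalg.norm(toy_X - W @ H))
```

In the exercise, `model.components_` has 10 rows, one per topic, and one column per word in `X`.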

## Part E

The following code (from lecture) will extract the top words within each topic. Run this code.

In [19]:
import numpy as np
def top_words(X, model, component, num_words):
    """
    Extract the top words from the specified component 
    for a topic model trained on data. 
    X: a term-document matrix, assumed to be a pd.DataFrame
    model: a sklearn model with a components_ attribute, e.g. NMF
    component: the desired component, specified as an integer. 
        Must be less than the total number of components in model
    num_words: the number of words to return.
    """
    orders = np.argsort(model.components_, axis = 1)
    important_words = np.array(X.columns)[orders]
    return important_words[component][-num_words:]

Use this code to investigate the topics constructed by the model. Can you interpret any of them? Keep in mind that many of these news articles are from the 1980s and 1990s. That's before many of you were born, you whippersnappers! 

I was able to find U.S. political party conventions; fluctuations in the price of oil; U.S. / Soviet tensions; and international finance news, among other things. Show the top words for a few different topics, and see whether any of them look interpretable to you.  
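
One detail worth knowing when reading the output: `np.argsort` sorts in ascending order, so `top_words` lists the most heavily weighted word last. The sketch below reproduces that logic on a made-up weight matrix and reverses the result so that the strongest word comes first. The names `toy_columns`, `toy_components`, and `top_words_desc` are hypothetical.

```python
import numpy as np

# hypothetical vocabulary and topic-word weights (2 topics by 4 words)
toy_columns = np.array(["oil", "gold", "vote", "poll"])
toy_components = np.array([
    [0.9, 0.5, 0.0, 0.1],
    [0.0, 0.2, 0.3, 0.8],
])

def top_words_desc(columns, components, component, num_words):
    # argsort is ascending, so the largest weights sit at the end of each row
    orders = np.argsort(components, axis = 1)
    ascending = columns[orders][component][-num_words:]
    # reverse so the most heavily weighted word is listed first
    return ascending[::-1]

first_topic = top_words_desc(toy_columns, toy_components, 0, 2)
second_topic = top_words_desc(toy_columns, toy_components, 1, 2)
print(first_topic)
print(second_topic)
```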

In [22]:
# show the top words for a topic
for i in range(1,10):
    top10 = top_words(X, model, i, 10)
    print("Topic " + str(i) + ":")
    print(top10)
    print('\n')

Topic 1:
['berlin' 'currency' 'tokyo' 'dealers' 'troy' 'ounce' 'bid' 'gold' 'yen'
 'german']


Topic 2:
['polls' 'primary' 'voters' 'republicans' 'democrats' 'convention' 'poll'
 'governor' 'massachusetts' 'dukakis']


Topic 3:
['jews' 'strip' 'jerusalem' 'occupied' 'palestinians' 'palestinian'
 'jewish' 'arab' 'israeli' 'israel']


Topic 4:
['contract' 'gallon' 'settled' 'crude' 'gold' 'corn' 'pound' 'futures'
 'cent' 'cents']


Topic 5:
['persian' 'arab' 'iran' 'arabia' 'saddam' 'gulf' 'saudi' 'kuwait' 'iraqi'
 'iraq']


Topic 6:
['ohio' 'river' 'inches' 'northern' 'farmers' 'rain' 'environmental'
 'plant' 'fair' 'water']


Topic 7:
['soviets' 'nuclear' 'treaty' 'republic' 'reform' 'mikhail' 'summit'
 'republics' 'moscow' 'gorbachev']


Topic 8:
['bond' 'listed' 'dow' 'deficit' 'volume' 'inflation' 'shares' 'quarter'
 'stocks' 'index']


Topic 9:
['fbi' 'flight' 'african' 'parents' 'meese' 'contra' 'testimony' 'jury'
 'aids' 'africa']


