# Discussion: Topic Modeling

## Group Names and Roles

- Group Member 1 (Role)
- Group Member 2 (Role)
- Group Member 3 (Role)

## Intro

In this Discussion activity, we'll continue with topic modeling. Recall that topic modeling can often be used to infer themes (or "topics") from sets of text data. Today, we will work through an example in which we download some data, prepare it appropriately, and fit a topic model to obtain insights about the general themes present in the data.

Our data set for this activity consists of the texts of a number of Associated Press articles. It was originally collected by David Blei. I retrieved this data set [here](https://github.com/tdhopper/topic-modeling-datasets/tree/master/data/lda-c/blei-ap). 

Run the following code chunk to create a large string `s` containing the entire data set. 

In [None]:
import urllib.request
def retrieve_text(url):
    """
    Retrieve text from the specified url and return 
    it as a string
    """
    
    # download the raw bytes; the connection is closed when the block exits
    with urllib.request.urlopen(url) as filedata:
        data = filedata.read()

    # decode the bytes to a string (UTF-8 by default)
    return data.decode()

url = 'https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/blei-ap.txt'
s = retrieve_text(url)

## Part A

Inspect `s`. Don't print out the entire string; just take a look at the first 5,000 characters or so. Write a function which, when `s` is provided as input, will return a list of document texts. It should exclude the excess tags and other metadata. Call this function to create a new list variable called `texts`, where each element of `texts` is the complete text of one news story. 

- ***Hint***: *First, split `s` on the newline character `"\n"`. Then, return a list of those elements of the result whose length is greater than 100 characters. This can be done with a for-loop, but a conditional list comprehension will be more compact.*

The resulting list of news stories should have length 2226. 

Comments and docstrings are not necessary for this function. 
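As a minimal sketch of the split-and-filter pattern from the hint, here is the same idea applied to a small made-up string. The tags, the sample sentences, the name `keep_long_lines`, and the threshold of 20 are all assumptions chosen for illustration; the real data and the 100-character threshold are the ones described above.

```python
# Toy stand-in for `s`: short tag lines interleaved with longer text lines.
# The tags and sentences are invented for this illustration.
toy = "\n".join([
    "<DOC>",
    "<TEXT>",
    "A first made-up story that is long enough to keep.",
    "</TEXT>",
    "</DOC>",
    "<DOC>",
    "<TEXT>",
    "A second made-up story, also long enough to keep.",
    "</TEXT>",
    "</DOC>",
])

def keep_long_lines(s, min_length):
    # split on newlines, then keep only the lines longer than min_length
    return [line for line in s.split("\n") if len(line) > min_length]

toy_texts = keep_long_lines(toy, 20)
print(len(toy_texts))
```

The length filter works here only because the tag lines are much shorter than the story lines, so it is worth inspecting `s` before trusting the same trick on the real data.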

In [None]:
# check the length of the result


## Part B

Create a `pandas` data frame called `df` with a single column called `text`, whose rows are the entries of `texts`. This data frame should have 2226 rows. Show your data frame to check that it looks ok. 
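One way to build a single-column data frame from a list is to pass a dictionary that maps the column name to the list. The sketch below uses a short hypothetical list, `toy_texts`, in place of the real `texts`.

```python
import pandas as pd

# hypothetical stand-in for the `texts` list built in Part A
toy_texts = ["oil prices rose", "the convention opened", "markets fell"]

# the dictionary key becomes the column name; the list supplies the rows
toy_df = pd.DataFrame({"text": toy_texts})
print(toy_df.shape)
```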

## Part C

Create the term-document matrix. The group's **Reviewer** might want to check the [lecture notes](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/NLP/NLP_2.ipynb) on topic modeling for some code to do this. Add the term-document matrix to `df`. Make sure that each column is labeled with the word it counts.

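I found that, for the purposes of the later parts of this exercise, the following arguments to `CountVectorizer` worked well. Note that an integer value is read as a number of documents and a float as a fraction of documents, so these settings keep words that appear in at least 1% of the documents but in no more than 100 documents:

> `max_df=100, min_df=0.01, stop_words='english'`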

However, please feel free to experiment. 

Call the new data frame with counts `big_df`. 

## Part D

Create an input matrix `X` which is identical to `big_df` but drops the `text` column. Then, create a Nonnegative Matrix Factorization (NMF) model and fit it to `X`. Start with 10 components. 
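As a rough sketch of this step, the code below drops the `text` column from a tiny hypothetical data frame and fits an NMF model with 2 components. The counts are invented, and the `init`, `random_state`, and `max_iter` values are assumptions made only so that the toy example is repeatable; they are not settings recommended for the real data.

```python
import pandas as pd
from sklearn.decomposition import NMF

# hypothetical stand-in for `big_df`: a text column plus word counts
toy_big_df = pd.DataFrame({
    "text":  ["doc a", "doc b", "doc c", "doc d"],
    "oil":   [3, 2, 0, 0],
    "party": [0, 0, 4, 3],
    "bank":  [1, 0, 1, 2],
})

# the model needs numbers only, so remove the raw text
toy_X = toy_big_df.drop(columns=["text"])

# fixed init and random_state just to make the sketch repeatable
toy_model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
toy_model.fit(toy_X)

# one row of nonnegative weights per topic, one column per word
print(toy_model.components_.shape)
```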

## Part E

The following code (from lecture) will extract the top words within each topic. Run this code.

In [None]:
import numpy as np
def top_words(X, model, component, num_words):
    """
    Extract the top words from the specified component 
    for a topic model trained on data. 
    X: a term-document matrix, assumed to be a pd.DataFrame
    model: a sklearn model with a components_ attribute, e.g. NMF
    component: the desired component, specified as an integer. 
        Must be less than the total number of components in model
    num_words: the number of words to return.
    """
    orders = np.argsort(model.components_, axis = 1)
    important_words = np.array(X.columns)[orders]
    return important_words[component][-num_words:]

Use this code to investigate the topics constructed by the model. Can you interpret any of them? Keep in mind that many of these news articles are from the 1980s and 1990s. That's before many of you were born, you whippersnappers! 

I was able to find U.S. political party conventions; fluctuations in the price of oil; U.S. / Soviet tensions; and international finance news, among other things. Show the top words for a few different topics, and see whether any of them look interpretable to you.  
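When reading the output of `top_words`, it may help to remember that `np.argsort` sorts in ascending order, so the most heavily weighted word comes last. The sketch below reproduces that ordering on a small invented weight matrix; the names `components` and `vocab` and the numbers in them are made up for illustration.

```python
import numpy as np

# invented weights: 2 topics (rows) x 3 words (columns)
components = np.array([[0.1, 2.0, 0.5],
                       [3.0, 0.0, 1.0]])
vocab = np.array(["oil", "soviet", "bank"])

# for each topic, column indices ordered from smallest to largest weight
orders = np.argsort(components, axis=1)
ranked = vocab[orders]

# the last entries are the heaviest words, still in ascending order
top_two = ranked[0][-2:]

# reverse the slice to list the most important word first
most_important_first = top_two[::-1]
print(top_two, most_important_first)
```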

In [None]:
# show the top words for a topic


In [None]:
# show the top words for a topic


In [None]:
# show the top words for a topic


In [None]:
# show the top words for a topic
