# HW5: Topic Models and LDA


**STATS271/371: Applied Bayesian Statistics**

_Stanford University. Spring, 2021._

---

**Name:** _Your name here_

**Names of any collaborators:** _Names here_

*Due: 11:59pm Monday, May 10, 2021 via Gradescope*

---

Recall the following generative model for LDA. Suppose we have $K$ topics and $N$ documents.

For each topic $k \leq K$, draw a topic 
$$\eta_k \sim \text{Dir}(\phi)$$

Then, for each document $n \leq N$, draw topic proportions 
$$\pi_n \sim \text{Dir}(\alpha)$$

Finally, for each word $l$ in document $n$, first draw a topic assignment 
$$
z_{n,l} \mid \pi_n \sim \text{Cat}(\pi_n)
$$
and draw a word
$$
x_{n,l} \mid z_{n,l}, \{\eta_k\}_{k=1}^K \sim \text{Cat}(\eta_{z_{n,l}})
$$
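The generative process above can be sketched in a few lines of plain Python. This is only an illustration: it assumes symmetric scalar hyperparameters `phi` and `alpha` and a fixed document length, and the function names `sample_dirichlet` and `sample_lda_corpus` are made up for this example. Each word is drawn from the topic selected by its assignment.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_lda_corpus(num_topics, num_docs, vocab_size, doc_length,
                      phi=0.5, alpha=1.0, seed=0):
    """Sample topics, topic proportions, assignments, and words from LDA."""
    rng = random.Random(seed)
    # eta_k ~ Dir(phi): one distribution over the vocabulary per topic.
    topics = [sample_dirichlet([phi] * vocab_size, rng)
              for _ in range(num_topics)]
    proportions, assignments, words = [], [], []
    for _ in range(num_docs):
        # pi_n ~ Dir(alpha): topic proportions for this document.
        pi_n = sample_dirichlet([alpha] * num_topics, rng)
        # z_{n,l} ~ Cat(pi_n): one topic assignment per word position.
        z_n = rng.choices(range(num_topics), weights=pi_n, k=doc_length)
        # x_{n,l} ~ Cat(eta_{z_{n,l}}): draw each word from its topic.
        x_n = [rng.choices(range(vocab_size), weights=topics[z])[0]
               for z in z_n]
        proportions.append(pi_n)
        assignments.append(z_n)
        words.append(x_n)
    return topics, proportions, assignments, words


topics, proportions, assignments, words = sample_lda_corpus(
    num_topics=3, num_docs=4, vocab_size=8, doc_length=20)
```

Sampling from the model like this is a handy way to build a small synthetic corpus for checking an inference routine before running it on the real essays.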

As mentioned in class, while this formulation is easier to present, it's more efficient to represent the documents as sparse vectors of _word counts_, $\mathbf{y}_n \in \mathbb{N}^V$, where $V$ is the vocabulary size, $L_n$ is the number of words in document $n$, and $y_{n,v} = \sum_{l=1}^{L_n} \mathbb{I}[x_{n,l} = v]$.
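As a small illustration of the count representation, the sketch below converts one document, given as a list of word indices, into its count vector. The helper name `word_counts` and the toy document are assumptions made for this example.

```python
def word_counts(word_ids, vocab_size):
    """Count how often each vocabulary index appears in one document."""
    counts = [0] * vocab_size
    for v in word_ids:
        counts[v] += 1
    return counts


# A toy document of five words over a vocabulary of size 5.
document = [2, 0, 2, 3, 2]
print(word_counts(document, vocab_size=5))  # [1, 0, 3, 1, 0]
```

The order of the words is discarded, which is exactly the information the count representation gives up.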

In this assignment, we will be re-exploring the Federalist papers in their entirety. We've provided an $N \times V$ data frame of the essays represented as word counts. The rows of the data frame correspond to the 85 individual essays and the columns correspond to the 5320 words in the vocabulary. We have already preprocessed the raw essays to remove very common and very infrequent words.

Using this data, we will fit a topic model and do some analysis.

```python
import pandas as pd

# Load the data
df = pd.read_csv('tokenized_fed.csv', index_col=0)
df
```

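| Unnamed: 0 | unequivocal | experience | inefficacy | subsisting | federal | called | deliberate | new | constitution | united | ... | chancery | jurisprudence | reexamination | writ | commonlaw | intent | refutation | habeas | corpus | clerks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5.0 | 7.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 2.0 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 | 1.0 | 8.0 | 12.0 | 8.0 | ... | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 81 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 2.0 | 4.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 82 | 0.0 | 1.0 | 0.0 | 0.0 | 7.0 | 2.0 | 2.0 | 9.0 | 13.0 | 6.0 | ... | 7.0 | 1.0 | 1.0 | 0.0 | 5.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 |
| 83 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 9.0 | 26.0 | 12.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 1.0 |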


## Problem 1: Fit LDA on this data set.

Fit a 10-topic LDA model to the data using coordinate ascent variational inference (CAVI). For each topic, output the top 5 words. You might find the structure in the [Poisson matrix factorization notebook](https://github.com/slinderman/stats271sp2021/blob/main/notebooks/Lap_5_Poisson_MF.ipynb) helpful. (Note that that notebook used the JAX backend available in the `tfp-nightly` package, but you could use the regular [TensorFlow Probability](https://www.tensorflow.org/probability/api_docs/python/tfp) package instead.) The nice thing about TFP is that its functions broadcast nicely, which is helpful when we have lots of factors in the mean-field variational posterior.
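For orientation, here is one possible pure-Python sketch of mean-field CAVI for LDA on word counts, run on a tiny made-up corpus. It is not the course's reference solution and does not use TFP. It assumes the usual factorization with Dirichlet factors for the topics (`lam`) and the topic proportions (`gam`), symmetric scalar hyperparameters, and a hand-rolled `digamma` approximation. All function names, the toy counts, and the toy vocabulary are hypothetical.

```python
import math
import random


def digamma(x):
    """Approximate digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_log_dirichlet(params):
    """E[log p_i] under Dir(params): digamma(a_i) - digamma(sum(a))."""
    total = digamma(sum(params))
    return [digamma(a) - total for a in params]


def cavi_lda(counts, num_topics, alpha=1.0, phi=0.1, num_iters=50, seed=0):
    """Return Dirichlet parameters (lam: K x V, gam: N x K) after CAVI."""
    rng = random.Random(seed)
    num_docs, vocab_size = len(counts), len(counts[0])
    # Random positive initialization breaks the symmetry between topics.
    lam = [[phi + rng.random() for _ in range(vocab_size)]
           for _ in range(num_topics)]
    gam = [[alpha + rng.random() for _ in range(num_topics)]
           for _ in range(num_docs)]
    for _ in range(num_iters):
        e_log_eta = [expected_log_dirichlet(row) for row in lam]
        new_lam = [[phi] * vocab_size for _ in range(num_topics)]
        for n in range(num_docs):
            e_log_pi = expected_log_dirichlet(gam[n])
            new_gam = [alpha] * num_topics
            for v in range(vocab_size):
                if counts[n][v] == 0:
                    continue
                # q(z) update: softmax of the expected log parameters.
                logits = [e_log_pi[k] + e_log_eta[k][v]
                          for k in range(num_topics)]
                top = max(logits)
                weights = [math.exp(logit - top) for logit in logits]
                norm = sum(weights)
                for k in range(num_topics):
                    expected = counts[n][v] * weights[k] / norm
                    new_gam[k] += expected      # q(pi_n) update
                    new_lam[k][v] += expected   # q(eta_k) update
            gam[n] = new_gam
        lam = new_lam
    return lam, gam


def top_words(topic_params, vocab, num_words=5):
    """Words with the largest variational Dirichlet parameters."""
    order = sorted(range(len(vocab)), key=lambda v: -topic_params[v])
    return [vocab[v] for v in order[:num_words]]


# Toy corpus: 4 documents over a 6-word vocabulary (made-up values).
vocab = ["union", "states", "war", "court", "judges", "law"]
counts = [[4, 3, 2, 0, 0, 0], [3, 5, 1, 0, 0, 0],
          [0, 0, 0, 2, 4, 3], [0, 0, 1, 3, 2, 5]]
lam, gam = cavi_lda(counts, num_topics=2)
for k, row in enumerate(lam):
    print(k, top_words(row, vocab, num_words=3))
```

Nested Python loops like these will be slow on the full $85 \times 5320$ matrix. A vectorized version of the same updates, for example with array broadcasting, is the more practical route for the assignment.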

## Problem 2: Analysis/Exploration

Using the fitted model, assign each essay its most likely topic. For the undisputed papers, plot a histogram of topic usage by author.

```python
# Load authorship
authorship = pd.read_csv('authorship.csv', index_col=0)
```
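One way to organize the bookkeeping for this problem is sketched below. It assumes you already have one vector of topic weights per essay (for example, the variational parameters of its topic proportions) and one author label per essay. The author labels, the `"DISPUTED"` marker, and the toy weights are hypothetical, since the layout of `authorship.csv` is not shown here.

```python
from collections import Counter


def most_likely_topic(topic_weights):
    """Index of the largest entry in one essay's topic weights."""
    return max(range(len(topic_weights)), key=topic_weights.__getitem__)


def topic_usage_by_author(doc_topic_weights, authors, disputed_label):
    """Count (author, most likely topic) pairs over undisputed essays."""
    usage = Counter()
    for weights, author in zip(doc_topic_weights, authors):
        if author == disputed_label:
            continue
        usage[(author, most_likely_topic(weights))] += 1
    return usage


# Toy values; the real labels and weights come from your data and fit.
doc_topic_weights = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
authors = ["HAMILTON", "MADISON", "HAMILTON", "DISPUTED"]
usage = topic_usage_by_author(doc_topic_weights, authors, "DISPUTED")
print(usage)
```

The resulting counts can then be passed to whatever plotting library you prefer to draw one bar group per author.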

## Problem 3: Short Answer questions

### Part a)

Explain what approach you would take if you wanted to use LDA to help settle disputed authorship. How would you incorporate the different authors into your model?

### Part b)

A shortcoming of LDA discussed in this class is that the model treats the words within a document as exchangeable, which is not a very reasonable assumption for essays. What would you do to address this shortcoming? In essence, how could you account for dependencies between words that are near each other in the essay?

# Submission Instructions


**Formatting:** check that your code does not exceed 80 characters in line width. If you're working in Colab, you can set _Tools &rarr; Settings &rarr; Editor &rarr; Vertical ruler column_ to 80 to see when you've exceeded the limit. 
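If you prefer to check the line width programmatically, the sketch below scans the code cells of a notebook that has been loaded as JSON. It assumes the standard `.ipynb` layout (a top-level `cells` list whose entries have `cell_type` and `source`). The helper name `long_code_lines` and the example notebook are made up for illustration.

```python
import json


def long_code_lines(notebook, limit=80):
    """Return (cell_index, line_number, length) for over-long code lines."""
    offenders = []
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The source is stored as a string or as a list of line strings.
        text = source if isinstance(source, str) else "".join(source)
        for line_number, line in enumerate(text.splitlines(), start=1):
            if len(line) > limit:
                offenders.append((cell_index, line_number, len(line)))
    return offenders


# A tiny in-memory notebook: only the second line of the code cell is long.
example = {"cells": [
    {"cell_type": "markdown", "source": ["x" * 120]},
    {"cell_type": "code",
     "source": ["import pandas as pd\n", "y = '" + "a" * 90 + "'\n"]},
]}
print(long_code_lines(example))  # [(1, 2, 96)]

# For a real file (placeholder filename):
# with open("hw5_yourname.ipynb") as f:
#     print(long_code_lines(json.load(f)))
```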

Download your notebook in .ipynb format and use the following command to convert it to PDF:
```
jupyter nbconvert --to pdf hw5_yourname.ipynb
```

**Dependencies:**

- `nbconvert`: If you're using Anaconda for package management, you can install it with:
```
conda install -c anaconda nbconvert
```

**Upload** your .ipynb and .pdf files to Gradescope. 
