# Topic models and LDA

## Dataset for the exercise

* [New York Times Comments](https://www.kaggle.com/aashita/nyt-comments/data): a set of readers' comments on articles published in the New York Times.

## Overarching research question

The comments offer a perspective on the kinds of concerns people raise when discussing online articles.
What kinds of meaningful themes, if any, emerge from this data?

In [None]:
## Data collection from files.
## To keep the dataset fairly small, we conduct random data selection here.
## This is *ONLY* for teaching purposes, to ensure that the model runs relatively fast.

path <- 'data/nyt-comments/'
files <- list.files( path ) ## Get all files from directory path
files <- files[ grepl("Comments", files) ]
files <- paste( path, files, sep = '') ## Add path to file names

set.seed(1) # Set random seed for reproducible results.

## Read every file, then bind the rows together once instead of growing a data frame inside a loop.
data <- do.call( rbind, lapply( files, read.csv ) )

documents <- data[ runif( nrow(data) ) > .99, ] ## Keep each comment with probability 0.01 (about 1% of the data)

print( "Data sample size" )
print( nrow(documents) )

In [None]:
# Format comments as character strings
documents$commentBody <- as.character( documents$commentBody )

## From text data to document-term matrix

To analyse textual data, we transform it into a document-term matrix, where rows correspond to documents (= reader comments) and columns correspond to the words in the dataset. Each cell counts how often a word occurs in a document.
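
To make the idea concrete, here is a minimal base-R sketch that builds a document-term matrix by hand. The three toy comments and all variable names (`toy_docs`, `toy_tokens`, `vocabulary`, `toy_dtm`) are made up for illustration and are not part of the NYT data; the actual analysis below uses `quanteda` instead.

```r
# Three toy "comments" (hypothetical, not taken from the NYT data)
toy_docs <- c(
    doc1 = "taxes are too high",
    doc2 = "high taxes hurt growth",
    doc3 = "growth is too slow"
)

# Split each comment into lowercase words
toy_tokens <- strsplit( tolower( toy_docs ), " ", fixed = TRUE )

# Vocabulary: every distinct word in the toy data, sorted alphabetically
vocabulary <- sort( unique( unlist( toy_tokens ) ) )

# Count how often each vocabulary word occurs in each document.
# sapply() returns words in rows, so we transpose to get documents in rows.
toy_dtm <- t( sapply( toy_tokens, function( words ) {
    table( factor( words, levels = vocabulary ) )
} ) )

print( toy_dtm )
```

Even in this tiny example most cells are zero, which is why real document-term matrices are stored in a sparse format.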

Note how we **preprocess** the texts below. We remove stopwords (using a list of common English stopwords; we could also create our own list), stem the words so that inflected forms such as plurals and verb endings are counted as the same term, and lowercase everything. Even after this, the resulting `document_terms` is a huge sparse matrix, because each comment uses only a small fraction of the vocabulary. Preprocessing is its own kind of art, as it can [influence results](https://www.cambridge.org/core/product/identifier/S1047198717000444/type/journal_article).

In [None]:
library(quanteda)

# Add to or replace this list to use custom stopwords
# (named stopword_list so that it does not shadow quanteda's stopwords() function)
stopword_list <- stopwords('en')

corp <- corpus( documents, text_field = "commentBody" ) # Create corpus

token <- tokens( corp, remove_punct=TRUE ) # Tokenise and remove punctuation
token <- tokens_select( token, pattern = stopword_list, selection = 'remove') # Remove stopwords
token <- tokens_wordstem( token ) # Stem words

document_terms <- dfm( token ) # Build the document-term matrix; dfm() lowercases terms by default

## From document-term matrix to analysis

Finally, we apply Latent Dirichlet Allocation (LDA) to the document-term matrix to create topics.
As with k-means, we need to choose the number of topics in advance; there are also other parameters which can be used to _fine tune_ topic models, see the [documentation](https://www.rdocumentation.org/packages/topicmodels/) for details.
Note that [topic models work on a different abstraction level than humans](http://doi.wiley.com/10.1002/asi.23786), so the results always need interpretation and validation before they are used.

In [None]:
library(topicmodels)

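# Drop comments left empty by preprocessing; LDA() needs at least one term in every document
document_terms <- dfm_subset( document_terms, ntoken( document_terms ) > 0 )
document_terms <- convert( document_terms, to = "topicmodels")
model <- LDA( document_terms, k = 5, control = list( seed = 1 ) ) # Fix the seed so the topics are reproducible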
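
As a rough illustration of the fine-tuning parameters mentioned earlier, the sketch below fits a model with explicit control settings. To stay self-contained it uses a slice of the `AssociatedPress` example data shipped with `topicmodels` rather than the NYT comments. The names `toy_dtm` and `toy_model` and all parameter values are arbitrary assumptions for illustration, not recommended settings.

```r
library(topicmodels)

# Small public example corpus shipped with topicmodels
# (an assumption for illustration; the exercise itself uses the NYT comments)
data( "AssociatedPress", package = "topicmodels" )
toy_dtm <- AssociatedPress[ 1:50, ]

# Same model family as above, but with explicit control settings
toy_model <- LDA(
    toy_dtm,
    k = 3,                  # number of topics
    method = "Gibbs",       # Gibbs sampling instead of the default VEM
    control = list(
        seed = 1,           # reproducible sampling
        alpha = 0.1,        # small alpha favours documents dominated by few topics
        burnin = 100,       # warm-up iterations that are discarded
        iter = 300          # sampling iterations run after the burn-in
    )
)

terms( toy_model, 5 )
```

Changing `k`, `alpha`, or the estimation method can shift the topics noticeably, which is one more reason to validate the output by reading it.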

In [None]:
# Check model terms
terms( model, 5 )

In [None]:
## Check the distribution of topics in a single document
posterior( model )$topics[1,]

## Tasks

* If the model terms seem to contain unwanted words or characters, rerun preprocessing to remove these.
* Compute the distribution of each topic for each document. Where could you use this?
* Modify the code and examine a few potential topic numbers. What differences can you detect?
* Modify the preprocessing to remove all words which are shorter than four characters. What do you learn now?

## Model evaluation

There are many different approaches to evaluating topic models (see [1](http://doi.acm.org/10.1145/1553374.1553515) and [2](https://journal.fi/politiikka/article/view/79629) for examples).
We can evaluate how well a topic model fits using statistical measures such as log-likelihood, but [some say](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf) that this might be bad practice, while [others](https://journal.fi/politiikka/article/view/79629) recommend it.
You can get the log-likelihood of a model by running the following code.

In [None]:
logLik( model )
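
A closely related likelihood-based measure is perplexity on documents the model has not seen. The sketch below is only an illustration under stated assumptions: it uses the `AssociatedPress` example data shipped with `topicmodels` instead of the NYT comments, and the split sizes, `k`, and the names `train`, `held_out`, `fit` and `held_out_perplexity` are arbitrary choices.

```r
library(topicmodels)

# Example data shipped with topicmodels (an assumption for illustration)
data( "AssociatedPress", package = "topicmodels" )
train    <- AssociatedPress[ 1:80, ]    # documents used to fit the model
held_out <- AssociatedPress[ 81:100, ]  # documents kept aside for evaluation

fit <- LDA( train, k = 3, control = list( seed = 1 ) )

# Log-likelihood on the training documents
print( logLik( fit ) )

# Perplexity on the held-out documents; lower values indicate a better fit
held_out_perplexity <- perplexity( fit, newdata = held_out )
print( held_out_perplexity )
```

Evaluating on held-out documents helps avoid rewarding a model simply for memorising its training data, but it still says nothing about whether humans find the topics meaningful.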

## Tasks

* Evaluate a set of different models based on log-likelihood. Which one would you choose?