# Topic models and LDA

## Dataset for the exercise

* [New York Times Comments](https://www.kaggle.com/aashita/nyt-comments/data): a set of readers' comments on articles published in the New York Times.

## Overarching research question

The comments provide a perspective on the kinds of concerns people raise in discussions of online articles.
What kinds of meaningful themes, if any, emerge from this data?

In [1]:
## Data collection from files.
## To keep the dataset fairly small, we conduct random data selection here.
## This is *ONLY* for teaching purposes, to ensure that the model runs relatively fast.

import os
import csv
import random

random.seed(1) # Set random seed for reproducible results

path = 'data/nyt-comments/'
files = sorted( os.listdir( path ) ) ## Get all files from directory path; sorted so that the seeded random sample is reproducible
files = filter( lambda file_name: file_name.startswith("Comments"), files )
files = map( lambda file_name: path + file_name, files ) ## Add path to file names

documents = []

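for file in files:
    ## Open each file with a context manager so that it is closed after reading
    with open( file, newline='' ) as csv_file:
        for entry in csv.DictReader( csv_file ):

            if random.random() > .99: ## Keep roughly 1% of the comments at random
                comment = entry['commentBody']

                documents.append( comment )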
            
            
print("Data sample size", len(documents) )

Data sample size 21825


## From text data to document-term matrix

To analyse textual data, we transform it into a document-term matrix, where rows correspond to documents (= reader comments) and columns correspond to words in the dataset.

Note how we **preprocess** the texts below. We remove stopwords (using a list of common English stopwords; we could also create our own list), stem the words in each comment so that inflected forms of a word (e.g. *voting* and *votes*) are counted as the same term, and lowercase everything. We also drop terms that occur in more than 90% of the documents or in fewer than 10 documents (`max_df` and `min_df`). The `document_terms` that this preprocessing produces is a large sparse matrix. Preprocessing is an art of its own, as it can [influence results](https://www.cambridge.org/core/product/identifier/S1047198717000444/type/journal_article).
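
To make the idea of a document-term matrix concrete, here is a minimal pure-Python sketch on three made-up sentences. The toy corpus and the names `toy_documents`, `vocabulary` and `toy_matrix` are invented for this illustration; the notebook itself relies on `CountVectorizer`, which additionally handles stopwords, frequency thresholds and sparse storage.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the NYT data
toy_documents = [
    "the president signed the bill",
    "voters discussed the bill",
    "the president met voters",
]

# Tokenise by lowercasing and splitting on whitespace (a deliberate simplification)
tokenised = [ doc.lower().split() for doc in toy_documents ]

# The vocabulary defines the columns: one column per distinct word, in sorted order
vocabulary = sorted( { word for tokens in tokenised for word in tokens } )

# Each row counts how often every vocabulary word occurs in one document
toy_matrix = []
for tokens in tokenised:
    counts = Counter( tokens )
    toy_matrix.append( [ counts[word] for word in vocabulary ] )

print( vocabulary )
for row in toy_matrix:
    print( row )
```

Each row is one document and each column one word, which is the same layout as `document_terms`, only dense and much smaller.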

In [47]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer

# Let's use nltk's built-in stopword list and stemmer
# (nltk.word_tokenize also needs nltk's tokenizer models; download them once with nltk.download if they are missing)
nltk.download('stopwords')
stemmer = EnglishStemmer()

# Add to or replace this list to use custom stopwords
# (named stopword_list so that the imported stopwords module is not overwritten when the cell is rerun)
stopword_list = stopwords.words('english')

# Function for stemming texts
def stem( text ):
    words = nltk.word_tokenize(text)
    return [ stemmer.stem(w) for w in words ]

# Stem both documents and stopwords
documents_stemmed = [' '.join( stem(d) ) for d in documents]
stopwords_stemmed = stem( ' '.join( stopword_list ) )

tf_vectorizer = CountVectorizer(
    max_df=0.90, min_df=10, 
    stop_words=stopwords_stemmed, analyzer = "word", lowercase = True
)

document_terms = tf_vectorizer.fit_transform(documents_stemmed)
document_terms_names = tf_vectorizer.get_feature_names_out()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juhopaak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## From document-term matrix to analysis

Finally, we run Latent Dirichlet Allocation (LDA) on the document-term matrix to create topics.
As with k-means, we need to choose the number of topics in advance; there are also other parameters which can be used to _fine-tune_ topic models, see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for details.
However, [topic models work on a different abstraction level than humans](http://doi.wiley.com/10.1002/asi.23786), so the results always need to be interpreted and validated before they are used.

In [51]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation( n_components = 5, random_state = 1 ) # Fix the seed so that the topics are reproducible
model = lda.fit( document_terms )

In [52]:
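for topic_number, words in enumerate( model.components_ ):
    print( "Topic", topic_number+1 )
    # argsort is ascending, so [:-6:-1] walks backwards from the end
    # and yields the indices of the five highest-weighted words
    for word in words.argsort()[:-6:-1]:
        print( "\t", document_terms_names[word] )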

Topic 1
	 br
	 peopl
	 year
	 get
	 work
Topic 2
	 trump
	 br
	 republican
	 presid
	 vote
Topic 3
	 br
	 peopl
	 us
	 state
	 use
Topic 4
	 br
	 com
	 war
	 www
	 gun
Topic 5
	 br
	 trump
	 like
	 one
	 peopl


In [53]:
## Check the distribution of topics in a single document
model.transform( document_terms[0] )

array([[0.0052428 , 0.00526227, 0.16520212, 0.00519906, 0.81909375]])
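
The array can be read as the share of each topic in the first comment. As a small sketch, using the values from this output copied by hand (the variable names are made up for this example), the dominant topic can be picked out like this:

```python
# Topic shares for the first document, copied from the output of model.transform
topic_shares = [0.0052428, 0.00526227, 0.16520212, 0.00519906, 0.81909375]

# The shares of one document sum to (approximately) one
total = sum( topic_shares )

# Index of the largest share; add 1 to match the "Topic 1..5" numbering used when printing the topics
dominant_topic = max( range( len(topic_shares) ), key = lambda i: topic_shares[i] ) + 1

print( "Sum of shares:", round( total, 4 ) )
print( "Dominant topic:", dominant_topic )
```

Here the first comment is mostly assigned to Topic 5, with a smaller share for Topic 3.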

## Tasks

* If the model terms seem to contain unwanted words or characters, rerun preprocessing to remove these.
* Compute the distribution of each topic for each document. Where could you use this?
* Modify the code and examine a few potential topic numbers. What differences can you detect?
* Modify the preprocessing to remove all words which are shorter than four characters. What do you learn now?
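
For the first and last task, one possible starting point is a small cleaning function applied to each comment before stemming. The sketch below assumes that the `br` entries in the topics come from HTML line-break tags such as `<br/>` in the comment bodies, and that `www` and `com` come from URLs; check a few raw comments to confirm this. The function name `clean_comment` and the sample string are made up for illustration.

```python
import re

def clean_comment( text, min_length = 4 ):
    # Drop HTML tags such as <br/>
    text = re.sub( r'<[^>]+>', ' ', text )
    # Drop URLs
    text = re.sub( r'https?://\S+|www\.\S+', ' ', text )
    # Keep lowercase a-z tokens only (a simplification: this also splits
    # contractions and discards accented letters and digits)
    words = re.findall( r'[a-z]+', text.lower() )
    # Keep only words with at least min_length characters
    return ' '.join( w for w in words if len(w) >= min_length )

sample = "Great article!<br/><br/>See https://www.example.com for the full story."
print( clean_comment( sample ) )
```

It could be applied with `documents = [ clean_comment(d) for d in documents ]` before the stemming step. Whether a length threshold removes too much (for example short but meaningful words) is part of what the task asks you to examine.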

## Model evaluation

There are many different approaches to evaluating topic models (see [1](http://doi.acm.org/10.1145/1553374.1553515) and [2](https://journal.fi/politiikka/article/view/79629) for examples).
We can evaluate the fit of a topic model with statistical measures such as log-likelihood, but [some say](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf) that this might be bad practice, while [others](https://journal.fi/politiikka/article/view/79629) recommend it.
You can get the (approximate) log-likelihood of a model by running the following code; values closer to zero indicate a better fit to the data.

In [54]:
model.score( document_terms )

-5872652.345350089

## Tasks

* Evaluate a set of different models based on log-likelihood. Which one would you choose?
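
As a rough sketch of how such a comparison could be organised, the loop below fits a few models with different numbers of topics and stores the score of each. To keep it self-contained, it uses a small random count matrix (`toy_counts`, a hypothetical stand-in for `document_terms`), and the candidate topic numbers are arbitrary, so the printed scores say nothing about the NYT data.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in for document_terms: 40 documents, 12 terms, random counts
rng = np.random.default_rng( 0 )
toy_counts = rng.poisson( 1.0, size = (40, 12) )

scores = {}
for n_topics in [2, 3, 4]:
    candidate = LatentDirichletAllocation( n_components = n_topics, random_state = 0 )
    candidate.fit( toy_counts )
    # score() returns an approximate log-likelihood; values closer to zero indicate a better fit
    scores[n_topics] = candidate.score( toy_counts )

for n_topics, score in scores.items():
    print( n_topics, "topics:", score )
```

With the real data, replace `toy_counts` with `document_terms`. Scoring on the same documents used for fitting tends to favour larger models, so computing the score on held-out comments is a common alternative.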