# Topic models

## Dataset for the exercise

* [New York Times Comments](https://www.kaggle.com/aashita/nyt-comments/data), a set of readers' comments on articles published in the New York Times

## Overarching research question

The comments offer a way to study what kinds of concerns people raise when commenting on online articles.
Examine whether meaningful themes emerge from the data set.

In [2]:
## data collection from files.
## to keep the dataset fairly small, we conduct random data selection here.
## this is *ONLY* to ensure that the model is suitable for teaching purposes

import os
import csv
import random

random.seed(1)

path = 'data/nyt/'
files = os.listdir( path ) ## see all files in directory
files = filter( lambda file_name: file_name.startswith("CommentsApril"), files ) ## choose only data files for April
files = map( lambda file_name: path + file_name, files ) ## add path to file names

documents = []

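for file in files:
    ## open each file in a context manager so it is closed after reading
    with open( file, encoding='utf-8', newline='' ) as handle:
        for entry in csv.DictReader( handle ):

            if random.random() > .99: ## keep roughly 1 % of the comments
                comment = entry['commentBody']

                documents.append( comment )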
            
print("Data sample size", len(documents) )

Data sample size 5032


## From text data to document-term matrix

To analyse textual data, we transform it into a document-term matrix: each row is a document (here, a comment) and each column is a word in the dataset.

Note how we **preprocess** the text during this quantification. We remove stopwords (using a list of common English stopwords; we could also create our own list), stem the words so that different inflected forms of a word are counted as one term, and lowercase everything. The resulting `document_terms` is a huge sparse matrix. Preprocessing is its own kind of art, as it can [influence results](https://www.cambridge.org/core/product/identifier/S1047198717000444/type/journal_article).
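
To make the shape of a document-term matrix concrete, here is a minimal sketch in plain Python. The three toy comments and the tiny stop word list are made up for illustration; the notebook itself uses `CountVectorizer` for this step, and stemming is left out here.

```python
from collections import Counter

# Toy comments and stop words, invented for illustration only
toy_documents = [
    "The president spoke about the election",
    "Schools need money and schools need teachers",
    "The election was about money",
]
toy_stop_words = {"the", "about", "and", "was"}

# Lower-case, split on whitespace and drop stop words
tokenized = [
    [word for word in doc.lower().split() if word not in toy_stop_words]
    for doc in toy_documents
]

# Columns: the sorted vocabulary; rows: one count vector per document
vocabulary = sorted({word for doc in tokenized for word in doc})
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are zero even in this tiny example, which is why the real matrix is stored in a sparse format.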

In [19]:
## For some reason the original code did not remove the stop words. Hence, I had to modify the code.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
import nltk
from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import RegexpTokenizer

## These extra stop words appeared in the topics but did not seem to carry any meaning.
## The set is converted to a list, since newer scikit-learn versions only accept a list here.
added_stop_words = list(text.ENGLISH_STOP_WORDS.union(['http', 'www', 'https', 'html', 'href',
                                                      'target', 'title', 'blank']))
            
stemmer = EnglishStemmer()
regex_tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
processed_documents = []

for document in documents:
    stemmed_document = ''
    
    tokenized_document = regex_tokenizer.tokenize(document)
    
    for word in tokenized_document:
        if len(word) >= 4:
            stemmed_document += stemmer.stem(word) + ' '
    
    processed_documents.append(stemmed_document)
    
tf_vectorizer = CountVectorizer(max_df=0.90, min_df=10, stop_words=added_stop_words, lowercase = True)

document_terms = tf_vectorizer.fit_transform(processed_documents)
## get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead
document_terms_names = tf_vectorizer.get_feature_names_out()

## From document-term matrix to analysis

Finally, we run Latent Dirichlet Allocation on our matrix to create topics.
As with k-means, we choose the number of topics; there are also other parameters that can be used to _fine-tune_ topic models, see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for details.
I use them with their default values, as none of them solves the challenge that [topic models work on a different abstraction level than humans](http://doi.wiley.com/10.1002/asi.23786).
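
One parameter that is worth knowing about even when keeping the defaults is `random_state`. The models in this notebook are fitted without a fixed seed, so rerunning a cell can give somewhat different topics and scores. The following is a minimal sketch on a made-up count matrix (not the NYT data) of how fixing the seed makes two fits comparable.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up count matrix: 6 documents, 4 terms
toy_counts = np.array([
    [3, 2, 0, 0],
    [4, 1, 0, 1],
    [2, 3, 1, 0],
    [0, 0, 3, 4],
    [0, 1, 2, 3],
    [1, 0, 4, 2],
])

def fit_topics(seed):
    # fit a two-topic model with a fixed seed and return the topic-word weights
    lda = LatentDirichletAllocation(n_components=2, random_state=seed)
    return lda.fit(toy_counts).components_

first_run = fit_topics(seed=1)
second_run = fit_topics(seed=1)

# Same seed, same data: the topic-word weights should match
print(np.allclose(first_run, second_run))
```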

In [5]:
from sklearn.decomposition import LatentDirichletAllocation

lda_five = LatentDirichletAllocation( n_components = 5 )
model_five = lda_five.fit( document_terms )

lda_ten = LatentDirichletAllocation( n_components = 10 )
model_ten = lda_ten.fit( document_terms )

In [8]:
for topic_number, words in enumerate( model_five.components_ ):
        print( "Topic", topic_number+1 )
        for word in words.argsort()[:-6:-1]:
            print( document_terms_names[word], end=' | ' )
        print()

print()
        
for topic_number, words in enumerate( model_ten.components_ ):
        print( "Topic", topic_number+1 )
        for word in words.argsort()[:-6:-1]:
            print( document_terms_names[word], end=' | ' )
        print()

Topic 1
like | peopl | time | just | think | 
Topic 2
peopl | mani | live | onli | need | 
Topic 3
trump | presid | elect | vote | republican | 
Topic 4
work | peopl | school | year | need | 
Topic 5
right | peopl | make | white | like | 

Topic 1
trump | republican | presid | vote | democrat | 
Topic 2
peopl | comey | person | white | news | 
Topic 3
state | trump | korea | north | countri | 
Topic 4
peopl | make | mani | care | right | 
Topic 5
trump | like | presid | know | believ | 
Topic 6
year | love | time | work | alway | 
Topic 7
just | think | watch | word | right | 
Topic 8
nytim | target | titl | blank | href | 
Topic 9
russia | syria | trump | weapon | assad | 
Topic 10
peopl | school | work | like | year | 
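
The slice `words.argsort()[:-6:-1]` in the loops above can look cryptic. A small sketch with made-up term weights (the names `toy_terms` and `toy_weights` are only for illustration) shows what it does for the top three terms.

```python
import numpy as np

# Made-up weights of six terms in one topic
toy_terms = ["trump", "school", "work", "vote", "year", "care"]
toy_weights = np.array([9.0, 1.5, 3.0, 7.0, 0.5, 2.0])

# argsort() orders indices from smallest to largest weight;
# [:-4:-1] walks backwards from the end and keeps three indices
top_three = toy_weights.argsort()[:-4:-1]

print([toy_terms[i] for i in top_three])
```

In the same way, `[:-6:-1]` keeps the five highest-weighted terms of a topic, from highest to lowest.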


In [6]:
## see the distribution of a document to different topics
model_five.transform( document_terms[0] )

array([[0.34768398, 0.0066205 , 0.38051711, 0.17378746, 0.09139096]])
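
A short sketch of how to read this output: the five numbers are the weights of the five topics in the first comment. The values below are copied from the array above; the variable names are only for illustration.

```python
# Topic weights of the first comment, copied from the output above
weights = [0.34768398, 0.0066205, 0.38051711, 0.17378746, 0.09139096]

# The weights form a probability distribution over the five topics
total = sum(weights)

# Position of the largest weight, reported with the 1-based topic numbering
dominant_topic = max(range(len(weights)), key=lambda i: weights[i]) + 1

print(round(total, 6), dominant_topic)
```

Note that the dominant topic here is only slightly ahead of the runner-up, so a single label hides a fair amount of information.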

## Tasks

* Compute the distribution of all documents to each topic. Where could you use this?

In [61]:
import numpy as np

## compute the document-topic distribution once (one row per document)
doc_topics = model_five.transform(document_terms)
n_of_documents, n_of_topics = doc_topics.shape

## First let's see the number of documents for each topic (meaning that if Topic 1 is prevalent in a document
## it is classified to belong to the class of Topic 1)

documents_on_topic = {'Topic {}'.format(i+1): 0 for i in range(n_of_topics)}

for line in doc_topics:
    index_of_max = int(np.argmax(line))
    add_to_key = 'Topic {}'.format(index_of_max+1)
    documents_on_topic[add_to_key] += 1

print(documents_on_topic, '\n')

## Then how the topics are distributed in all the documents
## (the average weight of each topic over all documents):

freq_on_topic = {}

for i in range(n_of_topics):
    key = 'Topic {}'.format(i+1)
    freq_on_topic[key] = float(doc_topics[:, i].sum() / n_of_documents)

print(freq_on_topic)

{'Topic 1': 1256, 'Topic 2': 473, 'Topic 3': 839, 'Topic 4': 1444, 'Topic 5': 1020} 

{'Topic 1': 0.23836363966915844, 'Topic 2': 0.11719937345645313, 'Topic 3': 0.17815369397478192, 'Topic 4': 0.2616059700449289, 'Topic 5': 0.20467732285467805}


This tells us which topic is the most common in the data. We could use it, for example, to get a sense of what was discussed most at a certain point in time.
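
As a rough sketch of the "point in time" idea: if the year of each comment were kept alongside its topic weights, the average weights could be compared between years. The rows below are invented; the notebook as written does not store the year, so that part is an assumption.

```python
from collections import defaultdict

# Hypothetical input: one (year, topic weights) pair per comment.
# In the notebook the year would have to be kept when reading the CSV files.
toy_rows = [
    (2017, [0.7, 0.2, 0.1]),
    (2017, [0.5, 0.4, 0.1]),
    (2018, [0.1, 0.3, 0.6]),
    (2018, [0.2, 0.2, 0.6]),
]

sums = defaultdict(lambda: [0.0, 0.0, 0.0])
counts = defaultdict(int)

for year, weights in toy_rows:
    counts[year] += 1
    for i, weight in enumerate(weights):
        sums[year][i] += weight

# Average topic weight per year
average_by_year = {
    year: [total / counts[year] for total in sums[year]]
    for year in sums
}

for year, averages in average_by_year.items():
    print(year, [round(value, 2) for value in averages])
```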

* Modify the code and examine a few potential topic numbers. What differences can you detect?

Although the topics were relatively difficult to interpret, with ten topics it became somewhat clearer what the topics could be about (although they are still quite messy).

* Modify the preprocessing and remove all words which are shorter than four characters. What do you learn now?

Removing the words that are shorter than four characters makes the topics more clearly defined. For example, with five topics there are three topics about Trump, something about school and something about money.


## Model evaluation

There are many different approaches to evaluating topic models (see [1](http://doi.acm.org/10.1145/1553374.1553515), [2](https://journal.fi/politiikka/article/view/79629) for examples).
We can evaluate the suitability of topic models using statistical measures such as the log-likelihood, but [some say](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf) that this might be bad practice, while [others](https://journal.fi/politiikka/article/view/79629) recommend it.
Here we show how to do it.

In [62]:
model.score( document_terms )

-947498.2873321398

In [None]:
With five topics the score was -949756.89

## Tasks


* Evaluate a set of different topics based on this score. Which one would you choose?

In [22]:
# Try it with different number of topics:

model_results = {}

topic_n = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50]

for i in topic_n:
    lda = LatentDirichletAllocation( n_components = i )
    model = lda.fit( document_terms )
    model_results[i] = model.score( document_terms )


In [23]:
# The ten best numbers of topics with their log-likelihoods:

print(sorted(model_results.items(), key=lambda x: x[1], reverse=True)[0:10])

[(3, -935891.0939410102), (2, -938158.3617629377), (4, -938746.3039306123), (5, -943997.9996749549), (6, -945350.2084943934), (7, -946434.3938784814), (8, -949231.9917409667), (9, -950210.0164142104), (10, -950772.5202008706), (16, -960606.8646228923)]


There does not seem to be much difference between the models; they just get gradually worse as more topics are added. Interestingly, a model with 16 topics is the 10th best model according to log-likelihood. However, if we have to choose one of these models, I do not think these results are that helpful. What follows is an attempt to evaluate different numbers of topics manually. I will compare the models with 2, 3, 4, 5, 7, 10 and 16 topics.
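
The scores above are computed on the same documents the models were fitted on. A common alternative is to score documents the model has not seen. The following is a minimal sketch of that idea, assuming a made-up mini corpus and an arbitrary 75/25 split; it cannot show whether the ranking on the NYT data would change.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up mini corpus, repeated to get enough documents for a split
toy_corpus = [
    "trump president election vote",
    "election vote republican democrat",
    "school work children family",
    "work family money care",
] * 10

toy_matrix = CountVectorizer().fit_transform(toy_corpus)
train, held_out = train_test_split(toy_matrix, test_size=0.25, random_state=1)

held_out_scores = {}
for n_topics in [2, 3, 4]:
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=1)
    lda.fit(train)
    # score() returns an approximate log-likelihood; higher is better
    held_out_scores[n_topics] = lda.score(held_out)

print(sorted(held_out_scores.items(), key=lambda x: x[1], reverse=True))
```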


In [24]:
topic_n = [2, 3, 4, 5, 7, 10, 16]

for i in topic_n:
    print('\nTopics:', i)
    lda = LatentDirichletAllocation( n_components = i )
    model = lda.fit( document_terms )
    for topic_number, words in enumerate( model.components_ ):
        print( "Topic", topic_number+1 )
        for word in words.argsort()[:-11:-1]:
            print( document_terms_names[word], end=' | ')
        print()


Topics: 2
Topic 1
peopl | work | like | year | just | time | mani | think | need | make | 
Topic 2
trump | presid | like | peopl | republican | vote | elect | onli | just | democrat | 

Topics: 3
Topic 1
work | year | like | peopl | just | time | make | becaus | live | need | 
Topic 2
trump | presid | like | vote | elect | republican | peopl | just | democrat | know | 
Topic 3
peopl | like | think | right | state | mani | make | world | want | just | 

Topics: 4
Topic 1
peopl | need | make | work | money | care | becaus | year | mani | like | 
Topic 2
like | time | just | nytim | world | read | love | thing | year | think | 
Topic 3
trump | presid | republican | vote | elect | democrat | like | parti | peopl | polit | 
Topic 4
trump | peopl | work | just | year | like | know | onli | think | time | 

Topics: 5
Topic 1
trump | presid | republican | like | elect | vote | democrat | peopl | just | support | 
Topic 2
peopl | work | like | time | year | just | mani | need | becaus | women 

At least for me, it was quite difficult to make much sense of most of these topics (more "domain knowledge" would probably have helped in their interpretation). However, Trump pops up in many of the topics. Hence, we can say with confidence that Trump was a central topic in the comments on the NYT website. Of these models, the easiest to interpret was the one with seven topics. According to it, the concerns raised by the commenters could be interpreted as follows:

Topic 1
trump | democrat | republican | like | presid | right | polit | parti | vote | elect | 

* Topic 1: Trump and elections

Topic 2
need | peopl | like | life | work | just | school | women | children | becaus | 

* Topic 2: Probably about some kind of work/life/family balance, especially concerning women

Topic 3
trump | presid | russia | syria | elect | russian | korea | militari | attack | time | 

* Topic 3: Trump and foreign policy

Topic 4
peopl | work | just | like | time | mani | famili | veri | onli | live | 

* Topic 4: Probably also about some kind of work/life/family balance 

Topic 5
nytim | thank | time | read | love | titl | world | year | watch | hope | 

* Topic 5: About the New York Times itself (the newspaper)

Topic 6
state | year | govern | money | peopl | cost | need | unit | care | make | 

* Topic 6: Could be about state, money and healthcare

Topic 7
peopl | think | trump | like | becaus | white | thing | just | onli | american | 

* Topic 7: Could be about the alt-right, Trump and the internet

It is quite probable that there was a lot of discussion about Trump in the media (as there has been for the last 3.5 years). I am not sure whether the other topics (work/life/family, the newspaper, or state, money and healthcare) were somehow relevant when the data was created. Another aspect of the data that makes the interpretation difficult is that it contains comments from both April 2017 and April 2018. Hence, comments that are a year apart are treated alike. This would not be worth mentioning if we had more data, but when we have (only part of the) comments from two months, it can harm the topic modeling. The discussion on "work/life/family" balance could emerge at that time of the year since the summer holidays are approaching and there _could_ be articles that discuss these themes.

## Some reflections

Topic modeling was the method that I was most excited about, since it was the method most clearly associated with something that is familiar to me, that is, analysing texts. However, similarly to other exercises done with textual data, this exercise once again illustrated how difficult it is to use textual data and how preprocessing is a kind of handicraft (or magic) that involves a lot of tacit knowledge. In addition, topic models raised the question of how well we should know the data we are using to build the topic model. It is quite obvious that we should know something about the context (e.g. when, where and how it was created). However, how much should we know about the data itself? If we know the data too well, we probably end up selecting the number of topics that mirrors our own classification scheme (which is probably based on a relatively small sample of the data). This then sidelines one of the interesting aspects of topic modeling, namely that it can produce surprising results (or at least I think that this is one interesting possibility). I think that there is some kind of tension between "knowing the data you use" and "giving the algorithm a chance to give a new perspective on the textual data".

Obviously, the hardest part of this exercise was interpreting the topics that the algorithm found (in my case the model with seven topics was the easiest to interpret). This aspect of topic modeling is also quite difficult (especially since I do not have much understanding of what to expect from the comments section of the NYT). For example, I kept wondering how many words we should use to define a topic. The original code showed only five words. I modified it to show ten words (since with five words it was almost impossible to interpret the topics). Should we just look for more and more words that characterise the topic, or should we give up at some point and admit that perhaps we found topics that are not actually that easy to interpret? I wondered if there were any rules of thumb for this.

The exercise also raised a question about the relationship between the researcher and the tool (topic modeling). Should we think about the tool as something that is made to follow our intuition (i.e. if we get a hunch that certain topics could arise from the data, we are allowed to follow this intuition and process the data to remove the "noise" and make the topics we found clearer and clearer), or should we treat topic modeling as a tool that summarises the data in a way that we could never have done ourselves? In other words, is it a tool for clarifying our own ideas or a tool whose guidance we should simply follow?

Similarly to support vector machines, this exercise raised questions about the computational resources needed to run topic modeling in real research.