# Topic modelling

Here is an introductory video: https://www.youtube.com/watch?v=3mHy4OSyRf0

Topic modelling is used to split the words in a corpus into thematically organized clusters (topics). The topics are inferred automatically (without supervision) from the given texts. These topics can be used to analyze large collections of textual data or to label documents.

As a result of topic modelling we often get document embeddings that can be used to find similar documents (this may work better than simple TF-IDF cosine similarity, because thematically similar documents don't need to contain exactly the same words).
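
To make the point about similar documents concrete, here is a minimal sketch in plain Python. The two toy documents share no words, so their bag-of-words cosine similarity is zero, while their topic vectors are close. The topic vectors are made up for illustration; they are not produced by any model in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Vocabulary order: goal, striker, match, keeper
doc_a_words = [1, 1, 0, 0]   # "goal striker"
doc_b_words = [0, 0, 1, 1]   # "match keeper"

# Hypothetical document-topic vectors (football, politics) for the same documents
doc_a_topics = [0.9, 0.1]
doc_b_topics = [0.8, 0.2]

word_similarity = cosine(doc_a_words, doc_b_words)     # no shared words
topic_similarity = cosine(doc_a_topics, doc_b_topics)  # same dominant topic
print(word_similarity, topic_similarity)
```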


Topic modelling can be applied not just to texts but also to search queries, purchases (grouped by customers), bank transactions (grouped by client), songs (grouped by listener)  and even proteins (grouped by DNAs).

Commonly used topic modelling methods rely on the following assumptions:

1) bag-of-words model (word order is not important)  
2) documents in the corpus are independent (word W in a document D_1 doesn't influence words in D_2)  
3) distributional hypothesis (words that are used in similar contexts are related in meaning)
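
The bag-of-words assumption can be illustrated with a tiny sketch (the toy sentences are invented for this example): two sentences with opposite meanings but the same words get exactly the same representation.

```python
from collections import Counter

sentence_a = "the dog bit the man".split()
sentence_b = "the man bit the dog".split()

# A bag of words only keeps the counts, not the positions
bag_a = Counter(sentence_a)
bag_b = Counter(sentence_b)

same_bag = bag_a == bag_b
print(same_bag)
```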



In this notebook you'll learn how to use gensim's LDA and sklearn's NMF.

In [2]:
import gensim
import json
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pyLDAvis.gensim  # in newer pyLDAvis releases this module is called pyLDAvis.gensim_models
import string
from collections import Counter
import warnings
warnings.filterwarnings("ignore")


## Data

We will work with this dataset - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GMFCTR

It consists of 3791 news articles in English. That is not a big corpus, but it is enough for educational purposes.

In [25]:
data = pd.read_csv('NewsArticles.csv', engine='python')

In [26]:
texts = data.text.dropna().tolist()

In [27]:
len(texts)

3791

No fancy preprocessing is needed: we just remove the stopwords and strip punctuation.

In [31]:
stops = set(stopwords.words('english'))

def tokenize(text):
    words = [word.strip(string.punctuation).lower() for word in text.split()]
    words = [word for word in words if word and word not in stops]
    
    return words

In [32]:
norm_texts = [tokenize(text) for text in texts]

In [33]:
norm_texts[0][:10]

['michigan',
 'billionaire',
 'education',
 'activist',
 'betsy',
 'devos',
 'confirmed',
 'today',
 'serve',
 'secretary']
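
To see what this kind of tokenization does with tricky tokens, here is a self-contained variant. It uses a small hand-picked stop list instead of NLTK's (an assumption made only so that the snippet runs on its own), and `toy_tokenize` is a hypothetical name. Note that `strip` only removes punctuation at the edges of a token, which is why tokens such as `pakistan's` and `u.s` can survive preprocessing.

```python
import string

# A tiny hand-picked stop list for illustration; the notebook uses NLTK's list
toy_stops = {"the", "is", "a", "in", "of"}

def toy_tokenize(text, stops=toy_stops):
    # Split on whitespace, strip punctuation from both ends, lowercase
    words = [word.strip(string.punctuation).lower() for word in text.split()]
    # Drop empty strings (tokens that were pure punctuation) and stopwords
    return [word for word in words if word and word not in stops]

tokens = toy_tokenize("Pakistan's budget is approved - the U.S. reacts!")
print(tokens)
```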

In [34]:
# for n-grams
# ph = gensim.models.Phrases(norm_texts, scoring='npmi', threshold=0.4) # the threshold can be tuned
# p = gensim.models.phrases.Phraser(ph)
# ngrammed_texts = p[norm_texts]
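
The commented cell above scores candidate bigrams with NPMI (normalised pointwise mutual information). Below is a rough plain-Python sketch of that idea on an invented toy corpus. It is a simplification: gensim's `Phrases` has extra parameters (such as `min_count`) that are ignored here, and the `npmi` helper is a hypothetical name.

```python
import math
from collections import Counter

toy_texts = [
    ["new", "york", "police", "said"],
    ["police", "in", "new", "york"],
    ["new", "york", "mayor", "said"],
    ["mayor", "said", "police", "acted"],
]

unigrams = Counter(word for text in toy_texts for word in text)
bigrams = Counter(pair for text in toy_texts for pair in zip(text, text[1:]))
total_words = sum(unigrams.values())

def npmi(word_a, word_b):
    """Normalised PMI of a bigram; ranges from -1 to 1."""
    pair_count = bigrams[(word_a, word_b)]
    if pair_count == 0:
        return -1.0  # the pair never occurs together
    p_a = unigrams[word_a] / total_words
    p_b = unigrams[word_b] / total_words
    p_ab = pair_count / total_words
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

score_new_york = npmi("new", "york")        # always adjacent in the toy corpus
score_said_police = npmi("said", "police")  # adjacent only once
print(score_new_york, score_said_police)
```

With a threshold of 0.4, as in the commented cell, only the first pair would be merged into a single token in this toy corpus.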

### Topic modelling in gensim

First we need to build a vocabulary.

In [51]:
dictionary = gensim.corpora.Dictionary(norm_texts)

And optionally remove some of the most common and least common words

In [52]:
# we filter out words that occur in more than 10 % of the documents and 
# words that occur in fewer than 20 documents
dictionary.filter_extremes(no_above=0.1, no_below=20)
dictionary.compactify()

So we have 6014 unique words as a result

In [53]:
print(dictionary)

Dictionary(6014 unique tokens: ['30', '48', 'abc', 'accounts', 'activist']...)
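
The filtering step works on document frequencies (in how many documents a word appears), not on raw counts. Here is a simplified plain-Python sketch of that logic on a toy corpus; `filter_vocab` is a hypothetical helper, and gensim's `filter_extremes` has additional options (for example `keep_n`) that are not modelled here.

```python
from collections import Counter

toy_texts = [
    ["police", "said", "court"],
    ["police", "said", "election"],
    ["court", "said", "judge"],
    ["election", "said", "vote"],
]

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for text in toy_texts for word in set(text))

def filter_vocab(doc_freq, num_docs, no_below, no_above):
    """Keep words that appear in at least no_below documents
    and in at most no_above (a fraction) of all documents."""
    return sorted(
        word for word, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    )

kept = filter_vocab(doc_freq, len(toy_texts), no_below=2, no_above=0.5)
print(kept)
```

Here "said" is dropped because it is in every document, and "judge" and "vote" are dropped because they are too rare.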


Now we need to convert our texts into bag-of-words vectors using our dictionary

In [54]:
# bow here stands for bag-of-words
corpus = [dictionary.doc2bow(text) for text in norm_texts]

In [55]:
# gensim uses a memory-efficient representation of document vectors
# every document is a list of tuples where
# the first element is the index of a word in the dictionary and
# the second element is the number of times it occurred in this document
corpus[10][:10]

[(6, 1),
 (112, 1),
 (113, 1),
 (201, 1),
 (205, 1),
 (221, 1),
 (260, 1),
 (273, 1),
 (283, 6),
 (307, 1)]

We can always translate these indices back to strings

In [61]:
[(dictionary[index], freq) for index, freq in corpus[10][:10]]

[('additional', 1),
 ('advised', 1),
 ('affected', 1),
 ('millions', 1),
 ('noted', 1),
 ('previous', 1),
 ('substantial', 1),
 ('unusual', 1),
 ('york', 6),
 ('center', 1)]
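
For intuition, the sparse bag-of-words conversion and the reverse lookup can be sketched in a few lines of plain Python. The vocabulary and the `toy_doc2bow` helper are invented for this example; this is not gensim's implementation.

```python
from collections import Counter

toy_vocab = {"court": 0, "election": 1, "police": 2}            # word -> id
id_to_word = {index: word for word, index in toy_vocab.items()}  # id -> word

def toy_doc2bow(tokens, vocab):
    """Return sorted (word_id, count) pairs, skipping out-of-vocabulary words."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

bow = toy_doc2bow(["police", "said", "police", "court"], toy_vocab)
decoded = [(id_to_word[index], freq) for index, freq in bow]
print(bow, decoded)
```

Words that are not in the vocabulary ("said" here) simply disappear from the vector, which is why the vocabulary filtering step matters so much.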

Now let's get to topic modelling!

The main topic modelling class in gensim is gensim.models.LdaModel (or gensim.models.LdaMulticore, which is faster but doesn't always work).

Its main parameters are num_topics, alpha, eta and passes.

**num_topics** is the number of topics. This is the most important parameter and also the most intuitive one. Its value depends on the task, but you can set it to 200 if you are not sure. Try smaller values if you think that the corpus is not very diverse or if you want faster convergence.

**alpha** and **eta** are sparsity parameters: alpha is the Dirichlet prior on the topic distribution of each document, and eta is the Dirichlet prior on the word distribution of each topic. These parameters are not intuitive! Read more about the Dirichlet distribution if you want to know how to set them. There are three built-in strategies for alpha to choose from (symmetric, asymmetric and auto).

**passes** is the number of iterations through the entire dataset. The tradeoff here is simple: the longer you train, the better the model. However, the quality gain gets smaller with each pass, while the time cost of a pass stays the same.
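
To get a feeling for what a Dirichlet prior does, the sketch below samples topic mixtures from a symmetric Dirichlet distribution with a small and a large concentration value, using only the standard library. The helper names and the chosen values (0.1 and 10.0) are assumptions for illustration and are not gensim defaults.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of the given size."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely numerical underflow; fall back to uniform
        return [1.0 / size] * size
    return [d / total for d in draws]

def average_max_share(alpha, size=10, n_samples=500, seed=0):
    """Average weight of the largest component over many samples."""
    rng = random.Random(seed)
    maxima = [max(sample_dirichlet(alpha, size, rng)) for _ in range(n_samples)]
    return sum(maxima) / n_samples

sparse_share = average_max_share(alpha=0.1)   # small alpha: one topic tends to dominate
dense_share = average_max_share(alpha=10.0)   # large alpha: topics are mixed more evenly
print(sparse_share, dense_share)
```

A small value pushes each document towards a few topics (sparse mixtures), while a large value spreads the weight over many topics.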

In [66]:
lda = gensim.models.LdaMulticore(corpus, 200, id2word=dictionary, passes=10) # if it doesn't work use the second line
# lda = gensim.models.LdaModel(corpus, 200, id2word=dictionary, passes=5)

Let's have a look at the topics.

In [156]:
# in each tuple we have:
# 1) number of the topic
# 2) list of most probable words for this topic and their probabilities
lda.print_topics()

[(110,
  '0.010*"dortmund" + 0.009*"companies" + 0.008*"side" + 0.006*"queen" + 0.005*"credit" + 0.005*"parents" + 0.005*"break" + 0.005*"munich" + 0.005*"pen" + 0.005*"game"'),
 (15,
  '0.017*"ireland" + 0.017*"brexit" + 0.012*"uk" + 0.011*"eu" + 0.010*"taoiseach" + 0.010*"kenny" + 0.009*"border" + 0.009*"irish" + 0.008*"fine" + 0.007*"customs"'),
 (17,
  '0.011*"pakistan" + 0.009*"billion" + 0.008*"budget" + 0.007*"pakistan\'s" + 0.006*"afghan" + 0.005*"body" + 0.005*"name" + 0.005*"cities" + 0.005*"population" + 0.005*"u.s"'),
 (157,
  '0.052*"africa" + 0.029*"african" + 0.011*"german" + 0.010*"africa\'s" + 0.009*"warrant" + 0.009*"abuse" + 0.008*"continent" + 0.006*"arrest" + 0.005*"elected" + 0.005*"child"'),
 (138,
  '0.011*"bbc" + 0.007*"marine" + 0.006*"spending" + 0.006*"conservative" + 0.005*"language" + 0.005*"drug" + 0.005*"john" + 0.005*"season" + 0.004*"reporting" + 0.004*"sent"'),
 (85,
  '0.018*"culture" + 0.013*"hate" + 0.013*"christmas" + 0.012*"groups" + 0.011*"artis

There are definitely some good ones (49 is about football, 130 is about the Netherlands).

There's also a visualization tool for topic models.

In [157]:
pyLDAvis.enable_notebook()

In [158]:
pyLDAvis.gensim.prepare(lda, corpus, dictionary)

Look for medium-sized bubbles that do not intersect. Large bubbles mean that the topics are too broad and could be split into smaller ones. Intersecting or nested bubbles mean that two or more topics are too similar and would be better merged. However, you can't choose which topics to merge or split; you can only control the overall number of topics. So don't expect too much from this tool.

There are also two metrics for evaluating the quality of topic models

In [73]:
import numpy as np

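Perplexity shows how well our model fits the data. Note that gensim's log_perplexity returns a per-word likelihood bound rather than the perplexity itself. This value is negative, and the closer it is to 0 the better.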

In [75]:
lda.log_perplexity(corpus[:10000])

-9.232168664442439
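
The number above is a per-word bound on a log scale. As far as I know, gensim's own log message converts it to a perplexity estimate as 2 to the power of minus the bound, so a rough conversion looks like this (treat the base as an assumption if your gensim version differs):

```python
per_word_bound = -9.232168664442439   # value returned by lda.log_perplexity above
perplexity = 2 ** (-per_word_bound)   # lower perplexity means a better fit
print(round(perplexity, 1))
```

Loosely speaking, a perplexity of several hundred means the model is about as uncertain as if it had to pick each word uniformly from several hundred candidates.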

Coherence measures the quality of the topics. It checks whether the top words of each topic tend to occur together in the texts. It usually correlates well with human judgement, because topics whose top words belong together are usually easier to interpret.

In [80]:
coherence_model_lda = gensim.models.CoherenceModel(model=lda, 
                                                   texts=norm_texts, 
                                                   dictionary=dictionary, coherence='c_v')

The larger the coherence the better.

In [81]:
topics = []
for topic_id, topic in lda.show_topics(num_topics=100, formatted=False):
    topic = [word for word, _ in topic]
    topics.append(topic)

In [82]:
coherence_model_lda = gensim.models.CoherenceModel(topics=topics, 
                                                   texts=norm_texts, 
                                                   dictionary=dictionary, coherence='c_v')

In [83]:
coherence_model_lda.get_coherence()

0.4830282241143934
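
The `c_v` measure used above is fairly involved, so here is a sketch of a simpler relative, a UMass-style coherence based on document co-occurrence counts. It is a different measure on a different scale (values are at most around 0, higher is better), shown only to illustrate the idea that coherent topics consist of words that appear together. The toy corpus and helper names are invented.

```python
import math
from itertools import combinations

toy_texts = [
    ["goal", "match", "striker"],
    ["match", "striker", "keeper"],
    ["goal", "match", "keeper"],
    ["vote", "senate", "election"],
    ["senate", "election", "budget"],
]
doc_sets = [set(text) for text in toy_texts]

def doc_count(*words):
    """Number of documents that contain all the given words."""
    return sum(1 for doc in doc_sets if all(word in doc for word in words))

def umass_coherence(topic_words):
    """Average of log((D(later, earlier) + 1) / D(earlier)) over word pairs.
    Assumes every topic word occurs in at least one document."""
    scores = [
        math.log((doc_count(later, earlier) + 1) / doc_count(earlier))
        for earlier, later in combinations(topic_words, 2)
    ]
    return sum(scores) / len(scores)

good_topic = umass_coherence(["match", "goal", "striker"])   # words co-occur
mixed_topic = umass_coherence(["match", "senate", "budget"]) # words rarely co-occur
print(good_topic, mixed_topic)
```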

But these metrics should not replace human judgement (how easily we can interpret a topic, for example by giving it a name) or extrinsic evaluation (how well we can solve another task using the topic model).

### NMF 

In [160]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://miro.medium.com/max/1768/1*j-Gx8v5otnhBiCIr1f1qvw.png")

Non-negative matrix factorization is a decomposition algorithm where an input matrix A of shape WxD is decomposed into two matrices of shapes WxT and TxD, all values of which are non-negative (zeros or positive).

If we apply this algorithm to documents-words matrix (the output of TfidfVectorizer, for example) we will get a documents-topics matrix and topics-words matrix. And that is a topic model!

The main difference between LDA and NMF is that in NMF we don't have probabilities. However, that's rarely a problem, because we can still sort the values and get top words for a topic (or top topics for a document).
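
To show what the decomposition looks like in practice, here is a toy NMF in plain Python that uses the classic multiplicative update rules for the Frobenius loss. This is a swapped-in textbook algorithm for illustration only: sklearn's default solver is coordinate descent, and all helper names and the toy matrix are invented.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scale(m, num, den, eps=1e-9):
    """Element-wise m * num / den (eps avoids division by zero)."""
    return [
        [v * n / (d + eps) for v, n, d in zip(m_row, n_row, d_row)]
        for m_row, n_row, d_row in zip(m, num, den)
    ]

def frobenius_error(x, w, h):
    """Frobenius norm of x - w @ h."""
    approx = matmul(w, h)
    return sum(
        (xv - av) ** 2
        for x_row, a_row in zip(x, approx)
        for xv, av in zip(x_row, a_row)
    ) ** 0.5

def toy_nmf(x, n_topics, n_iter=200, seed=0):
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in x]      # documents x topics
    h = [[rng.random() + 0.1 for _ in x[0]] for _ in range(n_topics)]   # topics x words
    for _ in range(n_iter):
        wt = transpose(w)
        h = scale(h, matmul(wt, x), matmul(matmul(wt, w), h))  # H <- H * (W^T X) / (W^T W H)
        ht = transpose(h)
        w = scale(w, matmul(x, ht), matmul(w, matmul(h, ht)))  # W <- W * (X H^T) / (W H H^T)
    return w, h

# Rows are documents, columns are words: goal, match, vote, senate
x = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
doc_topics, topic_words = toy_nmf(x, n_topics=2)
error = frobenius_error(x, doc_topics, topic_words)
print(error)
```

Because the updates only multiply non-negative numbers, both factor matrices stay non-negative, and the values in each row of `topic_words` can be sorted to get the top words of a topic.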


There's an implementation of NMF in sklearn. 

In [85]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

First we need to build a documents-words matrix 

In [86]:
vectorizer = TfidfVectorizer(max_features=1000, min_df=10, max_df=0.3, ngram_range=(1,3))
X = vectorizer.fit_transform(texts)

In [96]:
X.shape

(3791, 1000)

And then decompose it

In [140]:
# n_components - is the main parameter, it is equivalent to num_topics in LDA
# Try smaller values if it takes too long
model = NMF(n_components=100)

In [141]:
model.fit(X)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=100, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [142]:
model.components_.T.shape # words-topics matrix (we need to transpose the matrix, because it is stored in different format in sklearn)

(1000, 100)

In [143]:
model.transform(X).T.shape # topics-documents matrix

(100, 3791)

In [144]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

Let's look at the topics

In [145]:
feat_names = vectorizer.get_feature_names()  # in scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2

It is easier to work with the original sklearn format (topics, words) without transposition.

We sort within each row (topic), so taking the first N elements of every row would give the top words for every topic. But numpy sorts in increasing order, so we need to take the last N elements of every row instead.

In [146]:
top_words = model.components_.argsort()[:,:-6:-1] #slice last 5 elements in every row

for i in range(top_words.shape[0]):
    words = [feat_names[j] for j in top_words[i]]
    print(i, "  ".join(words))

0 my  me  love  like  don
1 china  in china  china daily  daily  beijing
2 russian  the russian  ambassador  tass  anti
3 trump  obama  the president  donald  donald trump
4 her  she  she was  she said  woman
5 korea  north korea  north  tillerson  missile
6 nuclear  power  energy  missile  sea
7 man  found  arrested  in his  body
8 court  the court  case  judge  the case
9 police  officers  officer  investigation  authorities
10 mr  mr trump  said he  he said  he had
11 senate  democrats  vote  republicans  republican
12 tass  moscow  february  march  said on
13 election  campaign  presidential  former  allegations
14 garda  road  contact  line  co
15 israel  israeli  palestinian  land  west
16 pic  pic twitter com  pic twitter  twitter com  twitter
17 health  care  insurance  act  plan
18 white house  white  house  the white house  the white
19 syrian  syria  the syrian  in syria  al
20 company  the company  business  technology  services
21 abc  abc news  officials  department  morn
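
The slice `[:-6:-1]` used above can look cryptic. Here is the same trick on a single toy row in plain Python, where `sorted(range(...))` plays the role of numpy's `argsort` (the weights are invented):

```python
row = [0.1, 0.7, 0.0, 0.4, 0.9, 0.2, 0.3]   # toy weights of 7 words in one topic

# Indices of the words ordered by increasing weight, like numpy's argsort
ascending = sorted(range(len(row)), key=row.__getitem__)

# Walk backwards from the end and stop before position -6:
# this yields the 5 largest weights, largest first
top5 = ascending[:-6:-1]
print(top5)
```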

There's an intrinsic metric for NMF. It is similar to perplexity for LDA. It shows how well the product of our two matrices approximates the original one.

In [149]:
# the smaller the better
model.reconstruction_err_

44.891895511081124

But human judgement and extrinsic evaluation should be the main criteria when measuring the quality of a topic model

## Homework

Task in general - **build a good topic model using LDA in gensim and NMF in sklearn**. 

Detailed task:

1) improve preprocessing (try other tokenization methods, add normalization)

2) Use n-grams (there is a commented cell in this notebook that shows how to do this in gensim)

3) Build a good vocabulary (try other values for no_above and no_below, manually inspect the dictionary and remove bad words); 

4) Build a couple of LDA models (try different num_topics, you can try changing alpha,eta and passes too, but it's optional). Adjust preprocessing (steps 1,2,3) if you get bad topics

5) Choose the best model, choose three good topics in this model and describe them (try to give them a name);

6) Try adding a tfidf layer between vocabulary building and model training (`tfidf = gensim.models.TfidfModel(corpus, id2word=dictionary); corpus = tfidf[corpus]`);

7) Repeat the steps 4 and 5 with TfidfModel

8) Analyze the difference in topics with and without TfidfModel (what's better/worse, what's the same) 

9) Build a topic model using NMF. Try different vectorizers  (Count or Tfidf Vectorizer), try different vectorization parameters (max_features, min_df, max_df, ngram_range), try different n_components in NMF.

10) Choose the best NMF model, choose three good topics and describe them.


Answer the question: which is better in your opinion, NMF or LDA, and why?
