# Topic Modeling

Topic modeling is a technique for extracting hidden **themes** or **topics** from unstructured data (texts). It is a form of **unsupervised document clustering**.

- a topic (hospital) may be a collection of words such as hospital, doctor, medicine, cure, and patient.
- used for recommendation systems (e.g., New York Times, Job search engines)
- used to organize customer reviews, social media posts, ...

> A topic is a **probability distribution over different words or tokens**. These words often co-occur in the same documents or articles (e.g., health, doctor, patient, medicine).

> For example, if you were to create three topics from restaurant reviews, you would have the following topics (**Mexican**, **Food Taste**, and **Japanese**) and words with **different probability distribution**.

> A customer review (**"The burrito was terrible. I will never try that again. Instead, the sushi was great! I love soy sauce and gyoza."**) may contain three topics: **Mexican (40%)**, **Food Taste (20%)**, and **Japanese (40%)**.

Here is another example. If you were to find topics from a journal article, you would have the following topics and words with different probability distribution.

<img src = "http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/blei_lda_illustration.png">

# Latent Dirichlet Allocation = LDA

There are different algorithms used for topic modeling. **Latent Dirichlet allocation (LDA)** is considered the most popular algorithm for topic modeling. LDA automatically discovers hidden topics from texts. 

<img src= "images/lda.png">

> * For each document in a corpus (box D):
    - Determine a topic mixture (**θ**), i.e., the topic proportions of the document
    - For each word in the document (box N):
        * Assign a topic (**Z**) from the topic mixture (**θ**)
        * Choose the word from the distribution of words (**β**) of the topic (box K)
        
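> **α** is the Dirichlet parameter for the per-document topic distributions; **η** is the Dirichlet parameter for the per-topic word distributions. The **Dirichlet** distribution is a distribution over **multinomial distributions**: each draw from it is itself a probability vector.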
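
The generative story above can be simulated in a few lines. The following is a minimal sketch using only the Python standard library, not Gensim's implementation. The two topics, their word probabilities (`beta`), and the value of `alpha` are made-up illustrations, and the per-topic word distributions are fixed by hand here rather than drawn from a Dirichlet with parameter η.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical per-topic word distributions (beta); values are illustrative only
beta = {
    'river':   {'water': 0.4, 'river': 0.3, 'bank': 0.3},
    'finance': {'bank': 0.4, 'money': 0.3, 'finance': 0.3},
}
alpha = [0.5, 0.5]  # hypothetical Dirichlet parameter, one entry per topic


def sample_dirichlet(alpha):
    """Draw a probability vector from a Dirichlet by normalizing Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(items, probs):
    """Pick one item according to the given probabilities."""
    r = random.random()
    cumulative = 0.0
    for item, p in zip(items, probs):
        cumulative += p
        if r < cumulative:
            return item
    return items[-1]  # guard against floating-point round-off


def generate_document(n_words):
    topics = sorted(beta)
    theta = sample_dirichlet(alpha)            # topic mixture of the document
    words = []
    for _ in range(n_words):
        z = sample_categorical(topics, theta)  # assign a topic to this word slot
        vocab = sorted(beta[z])
        probs = [beta[z][v] for v in vocab]
        words.append(sample_categorical(vocab, probs))  # choose the word
    return theta, words


theta, words = generate_document(6)
print(theta)
print(words)
```

LDA works in the opposite direction: given only the words, it infers the topic mixtures and the word distributions that most plausibly generated them.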


**I strongly recommend the following articles and video for more information about LDA**. 

 - An introduction of topic modeling to a layman https://www.quora.com/What-is-the-best-way-to-explain-topic-modeling-to-a-layman
 - Beginners Guide to Topic Modeling in Python https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
 - "Probabilistic Topic Models" By David M. Blei, Communications of the ACM, Vol. 55 No. 4, Pages 77-84. You can download this article from our course website
 - https://www.youtube.com/watch?v=3mHy4OSyRf0 (A YouTube video explaining how LDA works)

References

- http://www.kdnuggets.com/2016/07/americas-next-topic-model.html
- http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
- https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

# Tools

- Gensim: a Python package for topic modeling
- R & topicmodels (R package for topic modeling)
- MALLET
- More ...

# Topic Modeling with Gensim

- https://pypi.python.org/pypi/gensim/0.13.1
- download **gensim-0.13.1.win-amd64-py2.7.exe (md5)** and install **gensim** by double-clicking the file

In [2]:
!pip install gensim

Collecting gensim
  Downloading gensim-3.0.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (13.1MB)
    100% |████████████████████████████████| 13.1MB 45kB/s  eta 0:00:01
Collecting smart-open>=1.2.1 (from gensim)
  Downloading smart_open-1.5.3.tar.gz
Collecting bz2file (from smart-open>=1.2.1->gensim)
  Downloading bz2file-0.98.tar.gz
Building wheels for collected packages: smart-open, bz2file
  Running setup.py bdist_wheel for smart-open ... done
  Stored in directory: /Users/linlyn/Library/Caches/pip/wheels/b0/81/ad/856aade935fceaab491a800ec4de58edb8642afa4c4ba91a00
  Running setup.py bdist_wheel for bz2file ... done
  Stored in directory: /Users/linlyn/Library/Caches/pip/wheels/31/9c/20/996d65ca104cbca940b1b053299b68459391c01c774d073126
Successfully built smart-open bz2file
Installing collected packages: bz2file, smart-open, gensim
Successfully installed bz2file-0.98 gensim-3.0.0 smart-open-1.

In [3]:
# Below is the version of Gensim (Python package for Topic Modeling) I am using in this example
import gensim
print gensim.__version__

3.0.0


In [4]:
import pandas as pd

from gensim.corpora import Dictionary
from gensim.models import ldamodel
import numpy
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity

We're setting up our corpus now. In the toy corpus presented, there are 11 documents.

In [5]:
# data is already clean
texts = [['bank','river','shore','water'],
        ['river','water','flow','fast','tree'],
        ['bank','water','fall','flow'],
        ['bank','bank','water','rain','river'],
        ['river','water','mud','tree'],
        ['money','transaction','bank','finance'],
        ['bank','borrow','money'], 
        ['bank','finance'],
        ['finance','money','sell','bank'],
        ['borrow','sell', 'finance'],
        ['bank','loan','sell']]

In [6]:
# this is text processing required for topic modeling with Gensim
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [7]:
# iterating over the dictionary yields the integer IDs that Gensim assigned to the tokens (words)
for i in dictionary:
    print i

14
11
10
12
4
15
5
13
6
0
9
1
8
7
2
3


In [9]:
# view dictionary
dictionary.token2id

{u'bank': 3,
 u'borrow': 13,
 u'fall': 7,
 u'fast': 6,
 u'finance': 12,
 u'flow': 4,
 u'loan': 15,
 u'money': 10,
 u'mud': 9,
 u'rain': 8,
 u'river': 2,
 u'sell': 14,
 u'shore': 1,
 u'transaction': 11,
 u'tree': 5,
 u'water': 0}

The word "bank" is now represented by the integer ID 3.

In [10]:
# view corpus
for i in corpus:
    print i

[(0, 1), (1, 1), (2, 1), (3, 1)]
[(0, 1), (2, 1), (4, 1), (5, 1), (6, 1)]
[(0, 1), (3, 1), (4, 1), (7, 1)]
[(0, 1), (2, 1), (3, 2), (8, 1)]
[(0, 1), (2, 1), (5, 1), (9, 1)]
[(3, 1), (10, 1), (11, 1), (12, 1)]
[(3, 1), (10, 1), (13, 1)]
[(3, 1), (12, 1)]
[(3, 1), (10, 1), (12, 1), (14, 1)]
[(12, 1), (13, 1), (14, 1)]
[(3, 1), (14, 1), (15, 1)]


[(0, 1), (2, 1), **(3, 2)**, (8, 1)]: the word "bank" (integer ID 3) appears twice in document #4.

> ### Document-term matrix (vector-space representation of data)

In [11]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "http://facweb.cs.depaul.edu/mobasher/classes/csc575/assignments/a4-q1q2.jpg")
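
To connect the sparse corpus printed earlier to a document-term matrix, here is a minimal sketch that expands each bag-of-words row into a dense row of counts. The `corpus` literal is copied from the output above, `num_terms = 16` is the number of entries in `dictionary.token2id`, and the helper name `to_dense` is made up for this illustration.

```python
# Sparse bag-of-words corpus, copied from the notebook output above
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(0, 1), (2, 1), (4, 1), (5, 1), (6, 1)],
    [(0, 1), (3, 1), (4, 1), (7, 1)],
    [(0, 1), (2, 1), (3, 2), (8, 1)],
    [(0, 1), (2, 1), (5, 1), (9, 1)],
    [(3, 1), (10, 1), (11, 1), (12, 1)],
    [(3, 1), (10, 1), (13, 1)],
    [(3, 1), (12, 1)],
    [(3, 1), (10, 1), (12, 1), (14, 1)],
    [(12, 1), (13, 1), (14, 1)],
    [(3, 1), (14, 1), (15, 1)],
]
num_terms = 16  # size of the vocabulary (IDs 0 through 15)


def to_dense(bow, num_terms):
    """Turn one sparse (token_id, count) list into a dense row of counts."""
    row = [0] * num_terms
    for token_id, count in bow:
        row[token_id] = count
    return row


# One row per document, one column per token ID
matrix = [to_dense(bow, num_terms) for bow in corpus]
for row in matrix:
    print(row)
```

Each row is a document and each column a token ID, so the 2 in the fourth row, column 3, is the repeated word "bank".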

# LDA Model Building

We now fit the LDA model on the corpus. We set the number of topics to 2 and expect one topic about river banks and one about financial banks.

In [13]:
numpy.random.seed(1) # setting random seed to get the same results each time. 
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)

# Print the topics (topic-term distributions)

In [14]:
model.show_topics()

[(0,
  u'0.184*"water" + 0.150*"river" + 0.148*"bank" + 0.083*"tree" + 0.083*"flow" + 0.050*"fast" + 0.050*"mud" + 0.050*"fall" + 0.050*"rain" + 0.050*"shore"'),
 (1,
  u'0.206*"bank" + 0.166*"finance" + 0.129*"money" + 0.129*"sell" + 0.092*"borrow" + 0.055*"transaction" + 0.055*"loan" + 0.019*"shore" + 0.019*"rain" + 0.019*"fall"')]

Topic_0 is about "river", characterized by words such as water, river, bank, tree, flow, fast, and mud.

Topic_1 is about "finance" (bank, finance, etc.)

In [None]:
# print words without probability
for i in range(0,2):
    topics = model.show_topic(i, 10)
    print ','.join([str(word[0]) for word in topics])

This exercise has shown you how to perform topic modeling with Gensim. The results show two hidden topics in the data. One is about **river** and the other about **finance**.

# Assign topics to the documents in the corpus

In [15]:
lda_corpus = model[corpus]

results = []
for i in lda_corpus:
    print i
    results.append(i)
print 

[(0, 0.88427368618480195), (1, 0.11572631381519809)]
[(0, 0.91476633640841631), (1, 0.085233663591583644)]
[(0, 0.88369469699756775), (1, 0.11630530300243222)]
[(0, 0.89131469242835448), (1, 0.10868530757164548)]
[(0, 0.89772096004832613), (1, 0.10227903995167383)]
[(0, 0.10742112643066244), (1, 0.89257887356933752)]
[(0, 0.13644008705346489), (1, 0.863559912946535)]
[(0, 0.18890335729688856), (1, 0.81109664270311144)]
[(0, 0.10661051310024225), (1, 0.89338948689975783)]
[(0, 0.12671990160814622), (1, 0.87328009839185383)]
[(0, 0.13757016927994173), (1, 0.8624298307200583)]



Document #1 is about 88% topic_0 and 12% topic_1.

In [16]:
# finding highest value from each row
toptopic = [max(collection, key=lambda x: x[1])[0] for collection in results]
toptopic

[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

As expected, the LDA model gives us near-perfect results: the first five documents are assigned to the "river" topic and the other six to the "finance" topic.

# Appendix 1

In [17]:
print dictionary.token2id

{u'sell': 14, u'transaction': 11, u'money': 10, u'finance': 12, u'flow': 4, u'loan': 15, u'tree': 5, u'borrow': 13, u'fast': 6, u'water': 0, u'mud': 9, u'shore': 1, u'rain': 8, u'fall': 7, u'river': 2, u'bank': 3}


In [18]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(0, 1), (2, 1), (4, 1), (5, 1), (6, 1)],
 [(0, 1), (3, 1), (4, 1), (7, 1)],
 [(0, 1), (2, 1), (3, 2), (8, 1)],
 [(0, 1), (2, 1), (5, 1), (9, 1)],
 [(3, 1), (10, 1), (11, 1), (12, 1)],
 [(3, 1), (10, 1), (13, 1)],
 [(3, 1), (12, 1)],
 [(3, 1), (10, 1), (12, 1), (14, 1)],
 [(12, 1), (13, 1), (14, 1)],
 [(3, 1), (14, 1), (15, 1)]]

Let's understand the matrix above. For example, the last line

[(3, 1), (14, 1), (15, 1)]

- The word ('bank') whose integer ID is 3 appears once in document 11
- The word ('sell') whose integer ID is 14 appears once in document 11
- The word ('loan') whose integer ID is 15 appears once in document 11

The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word ID, and returns the result as a sparse vector (see https://radimrehurek.com/gensim/tut1.html).
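
The counting that doc2bow() performs can be sketched with the standard library alone. This is an illustration of the idea, not Gensim's code: the helper names `build_token2id` and `doc2bow` below are hypothetical stand-ins, and IDs are assigned in first-seen order, so they will not match the IDs Gensim produced above.

```python
from collections import Counter

# Two documents taken from the toy corpus
texts = [['bank', 'river', 'shore', 'water'],
         ['bank', 'bank', 'water', 'rain', 'river']]


def build_token2id(texts):
    """Assign each distinct token an integer ID in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(text, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


token2id = build_token2id(texts)
print(token2id)
print(doc2bow(texts[1], token2id))
```

The second document contains "bank" twice, so its pair carries a count of 2, just like `(3, 2)` in the Gensim output.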

# Appendix 2

We want to show off the new `get_term_topics` and `get_document_topics` functionalities, and a good way to do so is to play around with words that have different meanings in different contexts.

The word `bank` is a good candidate here, where it can mean either the financial institution or a river bank.
In the toy corpus presented, there are 11 documents, 5 `river` related and 6 `finance` related. 

### get_term_topics

The function `get_term_topics` returns the probability of a particular word belonging to each topic; topics whose value is too small are left out of the result.
A few examples:

In [19]:
model.get_term_topics('water')

[(0, 0.17002494151715156)]

This makes sense: `water` is reported only for `topic_0`, the river topic.

In [20]:
model.get_term_topics('finance')

[(1, 0.15048474280196972)]

In [21]:
model.get_term_topics('bank')

[(0, 0.13386514906929059), (1, 0.19109000946118354)]

The `finance` result also works out well: the word is reported only for topic_1, the topic to do with financial banks.

The `bank` result is particularly interesting. Since the word bank is likely to be in both topics, the two values returned are similar (about 0.13 and 0.19).

### get_document_topics 

`get_document_topics` is an already existing gensim functionality which uses the `inference` function to get the sufficient statistics and figure out the topic distribution of the document.

The addition to this is the ability for us to now know the topic distribution for each word in the document. 
Let us test this with two different documents which have the word bank in it, one in the finance context and one in the river context.

When `per_word_topics` is set to `True`, the `get_document_topics` method returns, along with the standard document topic proportions, each word_type followed by a list of topic ids sorted from most to least likely.

In [23]:
#let's declare two new documents and find out the likely topic for each document
bow_water = ['bank','water','bank']
bow_finance = ['bank','finance','bank']


In [24]:
bow = model.id2word.doc2bow(bow_water) # convert to bag of words format first
print bow

[(0, 1), (3, 2)]


0 is "water", 3 is "bank"

In [25]:
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
word_topics

[(0, [0]), (3, [0, 1])]

In [26]:
doc_topics

[(0, 0.75955178281199542), (1, 0.24044821718800463)]

In [27]:
bow = model.id2word.doc2bow(bow_finance) # convert to bag of words format first
print bow

[(3, 2), (12, 1)]


What did the earlier `word_topics` output for `bow_water`, `[(0, [0]), (3, [0, 1])]`, mean? It means that, like `word_type` `0` (the word `water`), our `word_type` `3`, which is the word `bank`, is more likely to be in `topic_0` than `topic_1` in that document.

In [28]:
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
word_topics

[(3, [1, 0]), (12, [1])]

You must have noticed that while we unpacked into `doc_topics` and `word_topics`, there is another variable - `phi_values`. Like the name suggests, phi_values contains the phi values for each topic for that particular word, scaled by feature length. **Phi** is essentially **the probability of that word in that document belonging to a particular topic**. The next few lines should illustrate this. 

In [29]:
doc_topics

[(0, 0.1502293463768338), (1, 0.84977065362316628)]

In [30]:
phi_values

[(3, [(0, 0.098302410617005551), (1, 1.9016975893829946)]),
 (12, [(1, 0.99756820057045625)])]

This output lists, for each `word_type` in the document, its phi value for each of the topics.
What is interesting to note is `word_type` 3: because it has 2 occurrences (i.e., the word `bank` appears twice in the bow), the scaling by feature length is very evident. The sum of its phi_values is 2, and not 1.
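
The scaling can be checked by hand from the printed values. The short sketch below copies the `phi_values` and `bow` outputs shown above, sums the phi values per word, and divides by the word's count to recover a per-occurrence share. The variable names `totals` and `per_occurrence` are introduced only for this illustration.

```python
# Values copied from the notebook outputs above (document bow_finance)
phi_values = [(3, [(0, 0.098302410617005551), (1, 1.9016975893829946)]),
              (12, [(1, 0.99756820057045625)])]
bow = [(3, 2), (12, 1)]

counts = dict(bow)  # token ID -> number of occurrences in the document
totals = {}
for word_id, topic_phis in phi_values:
    # Sum of phi over the topics that were reported for this word
    totals[word_id] = sum(phi for _, phi in topic_phis)
    # Dividing by the count undoes the scaling by feature length
    per_occurrence = [(topic, phi / counts[word_id]) for topic, phi in topic_phis]
    print('word_type {}: total phi = {:.3f}, per occurrence = {}'.format(
        word_id, totals[word_id], per_occurrence))
```

For `word_type` 12 the total is slightly below 1 because only `topic_1` is listed for it; presumably the `topic_0` share was too small to be reported.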

Now that we know exactly what `get_document_topics` does, compare the outputs above for our second document, `bow_finance`.

Because the word bank is now used in the financial context, it immediately swaps to being more likely associated with `topic_1`: `word_topics` lists it as `(3, [1, 0])`.

We've seen quite clearly that based on the context, the most likely topic associated with a word can change. 
This differs from our previous method, `get_term_topics`, where it is a 'static' topic distribution. 

It must also be noted that because the gensim implementation of LDA uses Variational Bayes inference (rather than sampling), a `word_type` in a document is given only one topic distribution. For example, the sentence 'the bank by the river bank' is likely to be assigned to `topic_0`, and each instance of the word bank has the same distribution.

> In conclusion, 
- bow_water = ['bank','water','bank'] is close to topic_0
- bow_finance = ['bank','finance','bank'] is close to topic_1

# Appendix 3

**pyldavis** is a Python package for visualizing the results of topic modeling (https://github.com/bmabey/pyLDAvis)

To install, **pip install pyldavis**

In [37]:
!pip install pyldavis

Collecting pyldavis
  Downloading pyLDAvis-2.1.1.tar.gz (1.6MB)
    100% |████████████████████████████████| 1.6MB 310kB/s eta 0:00:01
Collecting future (from pyldavis)
  Downloading future-0.16.0.tar.gz (824kB)
    100% |████████████████████████████████| 829kB 586kB/s eta 0:00:01
Collecting funcy (from pyldavis)
  Downloading funcy-1.9.1.tar.gz
Building wheels for collected packages: pyldavis, future, funcy
  Running setup.py bdist_wheel for pyldavis ... done
  Stored in directory: /Users/linlyn/Library/Caches/pip/wheels/de/41/af/cba16e4c15ff942728f3345c8f165831b03ad7f4d87cff8b6e
  Running setup.py bdist_wheel for future ... done
  Stored in directory: /Users/linlyn/Library/Caches/pip/wheels/c2/50/7c/0d83b4baac4f63ff7a765bd16390d2ab43c93587fac9d6017a
  Running setup.py bdist_wheel for funcy ... done
  Stored in directory: /Users/linlyn/Library/Caches/pip/wheels/0a/7e/0f/7e6946d9291cf9963926645a1dee5dd5a9c522daa5a659e0a2
Successfully built

In [38]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, corpus, dictionary)



# Appendix 4 How to Determine Optimal K Value

- Human judgement
- Semantic coherence

In [34]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.wrappers import LdaVowpalWabbit, LdaMallet

Two LDA topic models are initialized below: a "good" one (k = 2) and a "bad" one (k = 6).

The method below will determine which k value delivers **highly coherent topics**. For example, a topic with words such as "river", "water", and "bank" is considered a coherent topic. On the other hand, a topic with words such as "river", "finance", and "biology" may not be coherent.

This measure is called **semantic coherence**.
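
To make the idea concrete, here is a minimal sketch of the UMass-style score described in the Mimno et al. paper listed in this appendix's references: for each pair of top words it adds log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents containing the given words. It is computed on the toy corpus with the standard library only. Gensim's `u_mass` implementation differs in smoothing and aggregation details, so the numbers will not match its output. The word lists and function names are chosen for illustration, with `tree` standing in for an unrelated word because every word must occur in the corpus.

```python
import math

# Toy corpus from this notebook
texts = [['bank', 'river', 'shore', 'water'],
         ['river', 'water', 'flow', 'fast', 'tree'],
         ['bank', 'water', 'fall', 'flow'],
         ['bank', 'bank', 'water', 'rain', 'river'],
         ['river', 'water', 'mud', 'tree'],
         ['money', 'transaction', 'bank', 'finance'],
         ['bank', 'borrow', 'money'],
         ['bank', 'finance'],
         ['finance', 'money', 'sell', 'bank'],
         ['borrow', 'sell', 'finance'],
         ['bank', 'loan', 'sell']]


def doc_frequency(words, doc_sets):
    """Number of documents that contain every word in `words`."""
    return sum(1 for doc in doc_sets if all(w in doc for w in words))


def umass_coherence(top_words, docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered word pairs."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    # top_words is assumed to be ordered from most to least probable
    for m in range(1, len(top_words)):
        for l in range(m):
            co_count = doc_frequency([top_words[m], top_words[l]], doc_sets)
            single_count = doc_frequency([top_words[l]], doc_sets)
            score += math.log((co_count + 1.0) / single_count)
    return score


coherent_score = umass_coherence(['water', 'river', 'bank'], texts)
mixed_score = umass_coherence(['river', 'finance', 'tree'], texts)
print(coherent_score)
print(mixed_score)
```

Words that rarely share a document pull the score down, which is why the mixed word list scores lower than the coherent one.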

References 

- http://dirichlet.net/pdf/mimno11optimizing.pdf
- https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_coherence_tutorial.ipynb

In [35]:
numpy.random.seed(1) # setting random seed to get the same results each time. We normally do not include this line.
goodLdaModel = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)
badLdaModel = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=6, passes=20)

### u_mass

In [36]:
goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')
badcm = CoherenceModel(model=badLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print goodcm.get_coherence()
print badcm.get_coherence()

-17.5796583466
-17.648462136


The good model (2 topics) scores slightly higher than the bad model (6 topics), -17.58 versus -17.65, so it is better in terms of semantic coherence.
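
With `u_mass`, scores are typically negative and values closer to zero indicate more coherent topics, so comparing candidate k values amounts to taking the maximum. A minimal sketch using the two scores printed above (the dictionary name `scores` is introduced only for this illustration):

```python
# Coherence scores copied from the output above, keyed by number of topics
scores = {2: -17.5796583466, 6: -17.648462136}

# The k with the highest (least negative) coherence is preferred
best_k = max(scores, key=scores.get)
print(best_k)
```

The same selection works unchanged when the dictionary holds scores for more candidate values of k.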

Now, let's develop several models and find out their coherence scores.

In [None]:
numpy.random.seed(1) # setting random seed to get the same results each time. We normally do not include this line.
# we are considering k = 2 through 6; for a large dataset (or corpus), we would consider 2 through 100, and this process would take a very long time
for k in range(2,7):
    goodLdaModel = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=20)
    goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')
    print goodcm.get_coherence()

This shows 2 as the optimal number of topics.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline



# Appendix 5 Parameters of LDA

Number of Topics – Number of topics to be extracted from the corpus. 

Number of Iterations / passes – Maximum number of iterations the LDA algorithm is allowed to run in order to converge.

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/