## Topic Modeling through Latent Dirichlet Allocation (LDA)

**Outline**

* [Introduction](#data)
* [Example](#exp)
* [Airbnb use case](#use case)
* [References](#ref)

In [20]:
import os, glob
import pandas as pd
import numpy as np
import re, string
# construct the dictionary without loading all texts into memory
from six import iteritems

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models.ldamodel import LdaModel
import pickle
from gensim.test.utils import datapath

## <a id="data">Introduction and Dataset</a>

**What is the LDA model?**
LDA represents each document as a **mixture of topics**, where each topic emits words with certain probabilities. It is a bag-of-words model, so word order is ignored.

**What does LDA do?**
It automatically discovers the **topics** that a collection of documents contains. It is an unsupervised learning method, so no labeled data is required.

**LDA model assumption of how documents are written**
It assumes that when writing each document, you:

* decide on the number of words N the document will have (say, according to a Poisson distribution)
* choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics), e.g. a document that is 1/3 about food and 2/3 about animals
* generate each word w_i in the document by:
    + first picking a topic (according to the document's topic mixture chosen above)
    + then using that topic to generate the word itself (according to the topic's multinomial distribution over words)

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
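The writing process above can be sketched in a few lines of standard-library Python. This is only an illustration: the two topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are made up for the example, and the document length is fixed rather than drawn from a Poisson distribution.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the writing process LDA assumes."""
    topic_names = list(topics)
    # choose the topic mixture for this document
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # pick a topic according to the document's topic mixture
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        # pick a word according to that topic's word distribution
        vocab = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words

# hypothetical topics with made-up word probabilities
topics = {
    "food": {"banana": 0.5, "fish": 0.3, "rice": 0.2},
    "animals": {"monkey": 0.4, "cat": 0.4, "pig": 0.2},
}

rng = random.Random(0)  # fixed seed so the run is repeatable
theta, words = generate_document(topics, alpha=[1.0, 1.0], n_words=8, rng=rng)
print("topic mixture:", theta)
print("document:", words)
```

LDA inference works in the opposite direction: it only sees the generated words and tries to recover plausible values for `theta` and the per-topic word probabilities.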

**How does LDA learn?**

Suppose we have a set of documents (5 in the example below) and have chosen a number of topics to discover (2 in the example). We want LDA to learn the topic representation of each document and the words associated with each topic. How is this done?

One method is using collapsed Gibbs sampling:

* Go through each document and randomly assign each word in the document to one of the K topics.
    * This random assignment already gives you topic representations of all the documents and word distributions of all the topics (although not very good ones).
* To improve on the random guesses, for each document d_i:
    * go through each word w_i in d_i:
        * for each topic t, compute two things:
            * p(topic t | document d_i) = proportion of words in d_i that are currently assigned to topic t
            * p(word w_i | topic t) = proportion of assignments to topic t, over all documents, that come from the word w_i
        * reassign w_i to a new topic, choosing topic t with probability proportional to p(topic t | document d_i) * p(word w_i | topic t)
        * in other words, we assume that all topic assignments except the one for the current word are correct, and then update the assignment of the current word using our model of how documents are generated.
* Repeat the previous step a large number of times and you will eventually reach a roughly steady state where the assignments are pretty good. These assignments are then used to estimate the topic mixture of each document (the proportion of its words assigned to each topic) and the words associated with each topic.
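The sampling loop described above can be sketched in plain Python as follows. This is a minimal illustration, not gensim's training algorithm. The function name `gibbs_lda` and the toy documents are made up, and the smoothing constants `alpha` and `beta` are an added assumption: they play the role of the Dirichlet priors and keep every probability above zero.

```python
import random

def gibbs_lda(docs, num_topics, num_sweeps=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d assigned to topic t
    topic_word = [dict() for _ in range(num_topics)]  # times word w is assigned to topic t
    topic_total = [0] * num_topics                    # total words assigned to topic t

    # randomly assign every word to one of the topics
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
        assignments.append(z_doc)

    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # take the current word out of the counts
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1
                weights = []
                for t in range(num_topics):
                    # unnormalized p(topic t | document d)
                    p_topic = doc_topic[d][t] + alpha
                    # p(word w | topic t)
                    p_word = (topic_word[t].get(w, 0) + beta) / (topic_total[t] + beta * vocab_size)
                    weights.append(p_topic * p_word)
                # reassign the word with probability proportional to the product
                t_new = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] = topic_word[t_new].get(w, 0) + 1
                topic_total[t_new] += 1

    # estimate each document's topic mixture from the final counts
    mixtures = []
    for d, doc in enumerate(docs):
        denom = len(doc) + alpha * num_topics
        mixtures.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return mixtures, topic_word

# made-up toy documents, already tokenized
docs = [
    ["monkey", "fish", "banana", "cat", "fish"],
    ["northwestern", "university", "program", "university"],
    ["monkey", "banana", "cat"],
    ["university", "program", "msia", "northwestern"],
]
mixtures, topic_word = gibbs_lda(docs, num_topics=2)
for mix in mixtures:
    print([round(p, 2) for p in mix])
```

Topic ids are arbitrary, so which topic ends up as the "animal" topic and which as the "school" topic depends on the random seed.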

Reference: [Edwin Chen's blog on intro to LDA](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)

![LDA model representation](img/LDA_viz.png)

Source: [Topic Models by David M. Blei and John D. Lafferty](http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf)

## <a id="exp">Example</a>

Here is a quick example to show how LDA works to discover topics in given documents.

Steps:

1. Compile sample documents into a list
2. Clean the corpus: remove stop words and punctuation, lemmatize verbs and nouns, and return the clean corpus
3. Store the clean corpus in a dictionary that assigns a unique id to each unique term
    {term: id}
4. Apply doc2bow to convert the tokenized documents into a document-term matrix, optionally filtering out infrequent words to reduce the size of the matrix
    + Each document is now a bag of words (bow)
    + Since the corpus is already small, we do not filter out infrequent words in this demonstration
5. Run the LDA model
6. Tune parameters
7. Evaluate the model on new texts and save the model

**Steps 1 & 2. Compile sample docs and clean corpus**

In [29]:
# create sample documents
doc_1 = "Don't feed monkeys fish. Monkeys love bananas. Cats love fish."
doc_2 = "Northwestern University is one of the most beautiful universities."
doc_3 = "2019 is the year of pig."
doc_4 = "Animals are our friends."
doc_5 = "There are many master programs at Northwestern University. MSIA is a strong program at Northwestern University."

# compile sample documents into a list
doc_list = [doc_1, doc_2, doc_3, doc_4, doc_5]

In [30]:
# create English stop words list (you can always define your own stopwords)
stop_words = set(stopwords.words('english'))
# stop_words.add('.')
#print(stop_words)

# Create a WordNetLemmatizer object for lemmatization as needed
lemmatizer = WordNetLemmatizer()

In [31]:
# Function to remove stop words from sentences & lemmatize verbs and nouns. 
def clean(doc):
    tokenized = word_tokenize(doc.lower())
    stop_free = [x for x in tokenized if not re.fullmatch('[' + string.punctuation + ']+', x) and x not in stop_words]
    lemma_verb = [lemmatizer.lemmatize(word,'v') for word in stop_free]
    lemma_noun = [lemmatizer.lemmatize(word,'n') for word in lemma_verb]
    y = [s for s in lemma_noun if len(s) > 2]
    return y

In [32]:
corpus_clean = [clean(doc.strip()) for doc in doc_list]
corpus_clean

[["n't",
  'fee',
  'monkey',
  'fish',
  'monkey',
  'love',
  'banana',
  'cat',
  'love',
  'fish'],
 ['northwestern', 'university', 'one', 'beautiful', 'university'],
 ['2019', 'year', 'pig'],
 ['animal', 'friend'],
 ['many',
  'master',
  'program',
  'northwestern',
  'university',
  'msia',
  'strong',
  'program',
  'northwestern',
  'university']]

**Steps 3 & 4. Store clean corpus and apply doc2bow**

In [36]:
# find a unique id for each unique term {term : id}
dictionary = corpora.Dictionary(corpus_clean)
# term : id
print(dictionary.token2id)

{'banana': 0, 'cat': 1, 'fee': 2, 'fish': 3, 'love': 4, 'monkey': 5, "n't": 6, 'beautiful': 7, 'northwestern': 8, 'one': 9, 'university': 10, '2019': 11, 'pig': 12, 'year': 13, 'animal': 14, 'friend': 15, 'many': 16, 'master': 17, 'msia': 18, 'program': 19, 'strong': 20}


In [34]:
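# optional: filter out infrequent words (tokens that appear in only one document)
# the lines are commented out here; the outputs shown are from a run with them enabled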
# low_freq_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq ==1]
# low_freq_ids

[6, 2, 5, 3, 4, 0, 1, 9, 7, 11, 13, 12, 14, 15, 16, 17, 19, 18, 20]

In [35]:
# dictionary.filter_tokens(low_freq_ids)
# dictionary.compactify() # remove gaps in id sequence after words that were removed
# print(dictionary)

Dictionary(2 unique tokens: ['northwestern', 'university'])


In [37]:
# convert tokenized documents into a document-term matrix
# (for a larger corpus, filter out infrequent words first)
corpus = [dictionary.doc2bow(doc_clean) for doc_clean in corpus_clean]
# each entry is (token id, token count)
corpus

[[(0, 1), (1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 1)],
 [(7, 1), (8, 1), (9, 1), (10, 2)],
 [(11, 1), (12, 1), (13, 1)],
 [(14, 1), (15, 1)],
 [(8, 2), (10, 2), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1)]]
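To make the (token id, token count) representation concrete, here is a plain-Python sketch of the idea behind the dictionary and bag-of-words steps. It is not gensim's implementation. The helper names `build_token2id` and `to_bow` and the two tiny documents are made up for illustration.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to each unique token, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [
    ["animal", "friend"],
    ["pig", "year", "animal", "animal"],
]
token2id = build_token2id(docs)
print(token2id)                               # {'animal': 0, 'friend': 1, 'pig': 2, 'year': 3}
print(to_bow(docs[1], token2id))              # [(0, 2), (2, 1), (3, 1)]
print(to_bow(["chicken", "duck"], token2id))  # [] because neither token is in the vocabulary
```

The last line shows why new documents need to share vocabulary with the training corpus: tokens that are not in the dictionary contribute nothing to the bag-of-words vector.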

**Step 5. Run LDA model**

In [46]:
# LDA needs many iterations/passes and a large corpus to work well
# the number of topics to extract from the corpus must be specified up front
ldamodel = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, iterations=2000)

**Step 6. Model parameters**

* passes: Number of passes through the corpus during training.
The *passes* parameter is specific to gensim. It lets LDA see your corpus multiple times, which is very handy for smaller corpora. LDA splits the corpus into mini-batches (smaller subsets are easier to work with and help convergence), so passes refers to how many times each mini-batch is given to LDA.

* iterations: Maximum number of iterations through the corpus when inferring the topic distribution of a corpus. The iterations parameter limits how many times LDA executes the E-step for each document, which means that some documents may not converge in time. You can set it as high as you like (or have time for).

*Hyperparameters such as alpha and eta tune the prior distributions for the document-topic multinomial distribution and the topic-word multinomial distribution.*

* alpha: document-topic density. The higher the value of alpha, the more topics each document is composed of.

* eta: a-priori belief on word probability (topic-word density). The higher the value of eta, the more of the corpus's words each topic is composed of.

* decay (float, optional): a number in (0.5, 1] that weights what percentage of the previous lambda value is forgotten when each new document is examined.

Reference: https://radimrehurek.com/gensim/models/ldamodel.html
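The effect of alpha can be illustrated without training a model by sampling topic mixtures directly from a symmetric Dirichlet prior. The sketch below uses only the standard library and does not call gensim. The helper names, the choice of 5 topics, and the alpha values are assumptions made for the illustration.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_max_share(alpha, num_topics=5, num_docs=2000, seed=0):
    """Average weight of the single largest topic across simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(num_docs))
    return sum(shares) / num_docs

for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: largest topic holds about {mean_max_share(alpha):.2f} of a document")
```

With a small alpha, a single topic tends to dominate each simulated document. With a large alpha, the weight spreads more evenly across topics. The same intuition carries over to eta and the word distribution of each topic.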

**Step 7. Evaluate model and save**

In [47]:
# print out the top words in each topic
print(ldamodel.print_topics(num_topics=2, num_words=6))

[(0, '0.163*"university" + 0.127*"northwestern" + 0.091*"program" + 0.055*"master" + 0.055*"msia" + 0.055*"strong"'), (1, '0.106*"love" + 0.106*"monkey" + 0.106*"fish" + 0.063*"fee" + 0.063*"n\'t" + 0.063*"banana"')]


In [57]:
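# tokens must match the dictionary's (lowercased) vocabulary;
# doc2bow ignores unknown tokens such as 'chicken', 'college' or the uppercase 'MSIA'
new_texts = [
    ['chicken', 'duck', 'farm', 'rice'],
    ['northwestern', 'college', 'MSIA']
]
other_corpus = [dictionary.doc2bow(text) for text in new_texts]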

unseen_doc1 = other_corpus[0]

# get topic probability distribution for an unseen_doc
vector = ldamodel[unseen_doc1]
print(vector)

[(0, 0.5), (1, 0.5)]


In [58]:
unseen_doc2 = other_corpus[1]

# get topic probability distribution for an unseen_doc
vector = ldamodel[unseen_doc2]
print(vector)

[(0, 0.7458668), (1, 0.25413322)]


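**The model didn't learn from enough texts to classify the animals vector: none of its words are in the dictionary, so its bag-of-words vector is empty and the model returns an uninformative even split between the two topics. It did correctly assign the school vector to topic 0, through its one known token, 'northwestern'.**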
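When using such topic vectors downstream, it can help to turn them into a single label and to treat near-even splits as "no clear topic". A small sketch follows; the helper `dominant_topic` and the 0.6 cutoff are hypothetical choices made for the illustration, not part of gensim.

```python
def dominant_topic(topic_vector, min_prob=0.6):
    """Return the id of the most probable topic, or None if no topic reaches min_prob."""
    topic_id, prob = max(topic_vector, key=lambda pair: pair[1])
    return topic_id if prob >= min_prob else None

# the two topic vectors printed above
print(dominant_topic([(0, 0.7458668), (1, 0.25413322)]))  # 0
print(dominant_topic([(0, 0.5), (1, 0.5)]))               # None
```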

In [None]:
# dump the model to disk
with open('lda_model.pkl', 'wb') as lda_model_file:
    pickle.dump(ldamodel, lda_model_file)

# load the model back from the same file
with open('lda_model.pkl', 'rb') as lda_model_file:
    ldamodel = pickle.load(lda_model_file)

## <a id="use case">Airbnb use case</a>

Source: [Discovering and Classifying In-app Message Intent at Airbnb](https://medium.com/airbnb-engineering/discovering-and-classifying-in-app-message-intent-at-airbnb-6a55f5400a0c)

**Background**
Emergency scenarios during booking can cause anxiety and confusion, so answering questions in real time is especially desirable. Airbnb developed an ML framework to mitigate the problem and applied it to its in-app messaging platform.

**Solution**

**Identify message intent:** automatically classify message intent to help shorten the response time for guests and reduce the overall workload required of hosts.

* Phase 1: LDA was used to discover potential topics (intents) in the large message corpus.
* Phase 2: Airbnb moved to supervised learning techniques, using the topics derived from Phase 1 as intent labels for each message. A multi-class classification model using a CNN was built.
    * Some messages carry multiple intents, so Airbnb treated each sentence assigned a single intent as an independent training sample when building the intent classification model.
    * A CNN was used for its implementation simplicity, reported high accuracy, and fast speed at both training and inference time.

The two phases worked together to accurately understand text data on Airbnb's messaging platform.

**Model Productionization Flow Chart**
![Airbnb offline training & online serving workflow of Phase 2](img/airbnb_model_flow.png)

## <a id="ref">References</a>
* [Edwin Chen's blog on intro to LDA](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)
* [Topic Models by David M. Blei and John D. Lafferty](http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf)
* [Discovering and Classifying In-app Message Intent at Airbnb](https://medium.com/airbnb-engineering/discovering-and-classifying-in-app-message-intent-at-airbnb-6a55f5400a0c)