## Latent Dirichlet Allocation (LDA)

### Overview

Starting with 30 sentences on various subjects (science, geography, art), we wish to output 10 topics of length 10 using LDA. A few of the sentences I used are:

- Europe's longest river in terms of discharge and drainage basin is the Volga.
- Mathematical structures are good models of real phenomena.
- The Western tradition of sculpture began in ancient Greece.

The first step is preprocessing the data (removing punctuation, stemming the words, changing uppercase characters into lowercase and removing stopwords). Afterwards, I created a dictionary to keep track of each unique word, and replaced each word with its corresponding number from the dictionary. The model for LDA was built as specified in the project's description, using the preprocessed documents as data. 

### Extras

1. Can the topic model be used to define a topic-based similarity measure between documents? 

The Jensen Shannon Divergence is a method of measuring the similarity between two probability distributions. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen-Shannon distance (JSD). It is derived from another measure of statistical distance called the Kullback-Leiber Divergence (KLD), but it is symmetric and it always has a finite value. 
[Source](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)

To calculate the similarity between the topics of the two documents i and j, we can calculate:

$$JSD(\theta(i), \theta(j))$$

The value will be in the [0,1] interval, 0 meaning no similarity, and 1 meaning identical topic distribution.

In [157]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import scipy
import numpy as np
import pymc as pm
import string
import nltk
import re
nltk.download('stopwords')

#read 30 documents from input file
documents = np.zeros([30,1], dtype=object)
index = 0
f = open ("docs.txt","r")
f1 = f.readlines()
for x in f1:
    documents[index] = x
    index = index+1
    
#Preprocessing 
#remove \n from documents
for doc in documents:
    doc[-1] = doc[-1].strip()
#remove punctuation 
for index in range(len(documents)):
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    documents[index] = str(documents[index]).translate(translator) 
#stem the words and make lowercase
ps = PorterStemmer()
dataset = [[ps.stem(word.lower()) for word in document[0].split(" ")] for document in documents]
dataset = [' '.join(fragment) for fragment in dataset]
#remove stopwords
stop_words = set(stopwords.words('english')) 
filtered_documents = []
for document in dataset:
    doc = []
    for w in document.split(" "):
        if not w in stop_words:
            doc.append(w)
    filtered_documents.append(' '.join(doc)) 

#create dictionary
count = 0
dictionary = {}
for document in filtered_documents: 
    split_document = document.split(' ')
    for word in split_document: 
        if (word not in dictionary.keys() and (not word.isspace()) and word):
            dictionary[word] = count
            count = count+1
            
#replace each word with the number from the dictionary
data = [] 
for document in filtered_documents:
    document_data = []
    split_document = document.split(' ')
    for word in split_document:
        if ((not word.isspace()) and word):
            document_data.append(dictionary[word])
    data.append(document_data) 
#Show data     
#for index in range(len(data)):
    #print (data[index])

K, N, D = 10, len(dictionary), len(data) # number of topics, words, documents

"""the model trains to output psi(the distribution of words for each topic K) 
and phi (the distribution of topics for each document i)"""
alpha = np.ones(K)
#alpha = prior concentration parameter of the per-document topic distribution
beta = np.ones(N)
#beta = prior concentration parameter of the per-topic word distribution

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(N)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])

#theta(i) = topic distribution for document i
#phi(k) = word distribution for topic k
#theta, phi are Dirichlet distributions       
Wd = [len(doc) for doc in data]
Z = pm.Container([pm.Categorical("z_%i" % d,
                                 p=theta[d],
                                 size=Wd[d],
                                 value=np.random.randint(K,size=Wd[d]))
                               for d in range(D)])
#z(i,j) is the topic assignment for w(i,j)
W = pm.Container([pm.Categorical("w_%i,%i" % (d,i),
                                p=pm.Lambda("phi_z_%i_%i" % (d,i), 
                                        lambda z=Z[d][i], phi=phi: phi[z]),
                                value=data[d][i],
                                observed=True)
                               for d in range(D) for i in range(Wd[d])])
#w(i,j) is the j-th word of the i-th document
        
model = pm.Model([theta, phi, Z, W])
mcmc = pm.MCMC(model)
mcmc.sample(10000, 1000)

#show the topic assignment for each word, using the last trace  
for d in range(D):  
    print(mcmc.trace('z_%i'%d)[8999])  
    
def sortFirst(val):
    return val[0]


for j in range(K):
    t = (mcmc.trace('phi_%s'%j)[8999])
    li = list(zip(t[0], (list(dictionary.keys()))))
    li.sort(key=sortFirst, reverse=True)
    print ("Topic %i:" %j, li[:10])
    print ('\n')

def JSD(p, q):
    """
    method to compute the Jenson-Shannon Distance 
    between two probability distributions
    """
    # calculate m
    m = (p + q) / 2
    # entropy measure in scipy is implemented using the KLD
    # compute Jensen Shannon Divergence
    divergence = (scipy.stats.entropy(p, m) + scipy.stats.entropy(q, m)) / 2

    # compute the Jensen Shannon Distance
    distance = np.sqrt(divergence)

    return distance 

print('JSD:')
print (JSD(theta[1], theta[1]))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


 [-----------------100%-----------------] 10000 of 10000 complete in 132.6 sec[8 9 9 7 9 2 9 9]
[6 2 6 7 3 1 3]
[4 3 8 5 3 7 3 3 7 1]
[0 4 1 1 3 1 0]
[7 3 7 6 7 9]
[7 1 9 7 3]
[0 5 1 0 6 0]
[1 2 4 1 8 4 3 5]
[8 5 4 6 2 5 6 2 4 5 2 5 1]
[7 8 7 2 2 7 8]
[9 9 1 7]
[9 9 5 6 9 5]
[6 4 6 9 1 8 7]
[3 5 5 8 8 4]
[2 8 4 5 5 7]
[9 3 6 0 6]
[5 6 5 3 9 5 6 8]
[7 2 4 7 8 1]
[4 5 2 2 7 7 4 3 2 4 7 7]
[9 1 2 9 3 2 5 9 5 1 3]
[7 1 4 9 1 6]
[9 2 9 7 4 4 9 5 1]
[7 7 0 1 4 7]
[9 5 1 7 5 7 7]
[5 0 9 5 7 8 8 8 8 3 8 9]
[1 1 8 4 6 3]
[8 4 2 8 9 7 2 8 0]
[8 8 8 0]
[3 7 9 5 2 9]
[9 0 4 6 0 5]
Topic 0: [(0.03249283914526559, 'program'), (0.02896571696280665, 'peopl'), (0.02636600157357746, 'print'), (0.02636418368754826, 'visual'), (0.022840816825061835, 'empir'), (0.021603370556204676, 'interpret'), (0.020770102839029327, 'scienc'), (0.02000519702070068, 'refer'), (0.019107024359427262, 'made'), (0.0177296909183632, 'moscow')]


Topic 1: [(0.03618141643975489, 'phenomena'), (0.03363939113314517, 'econom'), (0

TypeError: len() of unsized object

2. What about a new document? How can topics be assigned to it?

We take a new document, add it to the corpus, and then run Gibbs sampling just on the words in that new document, keeping the topic assignments of the old documents the same.
I used a news headline dataset from Kaggle.
[Source](https://www.kaggle.com/ibadia/dawn-news-headlines/data)

New document examples:
- As US foreign policy remains shrouded in uncertainty, its best Pakistan stays at a distance
- Editorial: Inability of the Buzdar setup is commonly listed as a significant failure of Imran Khan's govt
- Minister opens dam in Balochistan

In [150]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
documents = pd.read_csv(r"C:\Users\Asus\Desktop\pp\categories_data.csv")

# Ignore columns that we won't use, we only keep the news headlines
documents = documents.drop(columns=['PageType', 'Link_paper', 'Date','Link_news'], axis=1)
# See a few headlines from the document
print(documents.head())
# Preprocess the text, remove stopwords and short words
def preprocess(text):
    result = []
    for i in gensim.utils.simple_preprocess(text):
        if i not in gensim.parsing.preprocessing.STOPWORDS and len(i) > 3:
            result.append(i)
    return result

processed_docs = documents['Headline'].map(preprocess)
#processed_docs[:10]

#Bag of Words on the dataset
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
#print(bow_corpus[2700])

# Run LDA using Bag of Words
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=10, id2word=dictionary)
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \n {}'.format(idx, topic))
    print('\n')
    
# Classifying a sample document 
#print(processed_docs[1563])
#for index, score in sorted(lda_model[bow_corpus[1563]], key=lambda tup: -1*tup[1]):
    #print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))
    
# Testing model on a new document
unseen_document = 'As US foreign policy remains shrouded in uncertainty, its best Pakistan stays at a distance'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
print('Testing the model on a new document:')
print (unseen_document)
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 10)))
    print ('\n')

                                            Headline
0          Unprecedented briefing planned by Taliban
1        Economic decisions taken by cabinet ignored
2  Crisis called Pakistan`s Katrina Kerry urges w...
3       The unbearable lightness of Pakistan cricket
4           Pakistan cricket dealt another hard blow
Topic: 0 
 0.029*"kashmir" + 0.028*"fata" + 0.027*"amid" + 0.024*"plea" + 0.021*"musharraf" + 0.018*"gets" + 0.017*"protest" + 0.016*"census" + 0.016*"year" + 0.015*"case"


Topic: 1 
 0.107*"killed" + 0.067*"attack" + 0.038*"militants" + 0.026*"blast" + 0.022*"courts" + 0.021*"quetta" + 0.020*"suicide" + 0.020*"indian" + 0.017*"kills" + 0.017*"drone"


Topic: 2 
 0.062*"sharif" + 0.056*"says" + 0.049*"army" + 0.033*"balochistan" + 0.028*"chief" + 0.025*"security" + 0.022*"cabinet" + 0.020*"cpec" + 0.020*"meeting" + 0.020*"issue"


Topic: 3 
 0.100*"govt" + 0.037*"sindh" + 0.032*"power" + 0.026*"cases" + 0.022*"crisis" + 0.020*"body" + 0.019*"parties" + 0.018*"budget" + 0.0

Sorting by score, it is most likely that the new sentence belongs to the first topic:

0.055*"told" + 0.046*"nawaz" + 0.044*"asks" + 0.039*"sharifs" + 0.033*"govt" + 0.021*"support" + 0.018*"plans" + 0.018*"waziristan" + 0.017*"media" + 0.016*"peshawar"
