## TOPIC COHERENCE

As I mentioned in `LDA_perplexity_demo.ipynb`, [Chang et al., 2009](https://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf) found that perplexity is not always correlated with topic interpretability. In this context, [Röder et al., 2015](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf) proposed a pipeline to estimate the so-called topic coherence, which proved to be better correlated with human-generated topic rankings. The coherence metric is implemented in gensim through the `CoherenceModel` class.

Just as a reminder, the way I understand the process described in their paper is as follows: 

1. Segmentation of a word set into smaller sets (e.g. word pairs)
2. Calculation of the Confirmation Measures that score the agreement of a given pair (e.g. the pointwise mutual information (PMI) or the normalized PMI (NPMI)). Confirmation Measures use word and word co-occurrence probabilities (see, for example, expressions (1) and (2) in [Röder et al., 2015](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf))
3. Aggregation of the Confirmation Measures
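
To make the three steps concrete, here is a toy sketch of my own. It is not what gensim does internally: as far as I understand it, gensim's `c_v` estimates probabilities over a sliding window and uses an indirect confirmation measure. Here I assume the simplest possible choices: word pairs as segmentation, probabilities estimated as the fraction of documents containing a word (or both words), NPMI as the confirmation measure, and the arithmetic mean as aggregation. The function name `npmi_coherence` and the `toy_docs` corpus are made up for illustration.

```python
from __future__ import division, print_function
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of `top_words` (toy illustration).

    P(w) is the fraction of documents that contain w, and
    P(w_i, w_j) the fraction that contain both words.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    # Step 1, segmentation: every unordered pair of top words
    for w_i, w_j in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(w_i), prob(w_j), prob(w_i, w_j)
        # Step 2, confirmation measure: NPMI = PMI / -log P(w_i, w_j)
        if p_ij == 0:
            scores.append(-1.0)  # never co-occur: lower bound of NPMI
        elif p_ij == 1:
            scores.append(0.0)   # 0/0 case; 0 is a convention of this sketch
        else:
            pmi = math.log(p_ij / (p_i * p_j))
            scores.append(pmi / -math.log(p_ij))
    if not scores:
        raise ValueError("need at least two top words")
    # Step 3, aggregation: arithmetic mean of the pair scores
    return sum(scores) / len(scores)


# Hypothetical mini-corpus: two "religion" and two "hockey" documents
toy_docs = [
    ["god", "jesus", "christ"],
    ["god", "jesus"],
    ["team", "hockey", "game"],
    ["team", "game"],
]

print(npmi_coherence(["god", "jesus"], toy_docs))           # words that always co-occur
print(npmi_coherence(["god", "team"], toy_docs))            # words that never co-occur
print(npmi_coherence(["team", "hockey", "game"], toy_docs))
```

A topic whose top words tend to appear in the same documents scores close to 1, while a topic mixing unrelated words scores close to -1.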

Remember that I ran `lda_perplexity.py` with 10, 20 and 50 topics for the 3 experiments that I described in `LDA_perplexity_demo.ipynb`. I saved the results in pickle files in a directory called `data_processed`:

In [12]:
from __future__ import print_function
import pandas as pd
import cPickle as pickle

# dictionaries with the perplexity values per set-up
pp_10 = pickle.load(open("data_processed/perplexity_10.p", "rb"))
pp_20 = pickle.load(open("data_processed/perplexity_20.p", "rb"))
pp_50 = pickle.load(open("data_processed/perplexity_50.p", "rb"))

# dictionaries with dataframes with words per topic per set-up
tw_10 = pickle.load(open("data_processed/topic_words_df_10.p", "rb"))
tw_20 = pickle.load(open("data_processed/topic_words_df_20.p", "rb"))
tw_50 = pickle.load(open("data_processed/topic_words_df_50.p", "rb"))

# Let's have a look
pp_10

{'basic_exp1': 0.31471618171353233,
 'basic_exp2': 0.31429573959863427,
 'glove': 0.32085378021600947,
 'lemma_exp1': 0.26659287034652801,
 'lemma_exp2': 0.25899912673071496,
 'stem_exp1': 0.26658473872097244,
 'stem_exp2': 0.25414921201102364,
 'words': 0.34226687877222595}

Note that the keys are the different set-ups we used. These are:

`basic_exp1`: `SimpleTokenizer` without bigrams

`basic_exp2`: `SimpleTokenizer` with bigrams

`stem_exp1`: `StemTokenizer` without bigrams

`stem_exp2`: `StemTokenizer` with bigrams

`lemma_exp1`: `LemmaTokenizer` without bigrams

`lemma_exp2`: `LemmaTokenizer` with bigrams

`words`: filtering words using the `nltk.words` vocabulary

`glove`: filtering words using the 400k tokens corresponding to the well-known GloVe vectors

Let's have a look at some of the dataframes with words per topic:

In [13]:
tw_20['lemma_exp2']

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,edu,state,power,pay,team,edu,people,pit,god,max,drug,edu,drive,appear,think,edu,israel,edu,key,file
1,article,government,water,insurance,san,article,gun,det,jesus,game,medical,com,edu,party,people,article,jews,article,value,program
2,apr,people,light,edu,hockey,blue,edu,pts,christ,max_max,cause,article,windows,art,god,uiuc,israeli,black,chip,use
3,harvard,right,grind,abortion,game,org,say,chi,say,team,know,car,work,men,know,henry,article,com,encryption,edu
4,turkey,president,like,health,edu,com,article,period,lord,play,article,like,know,new,believe,umd,jewish,apr,objective,image
5,adam,think,heat,tax,new,pitt,kill,bos,sin,year,doctor,apr,use,wolverine,mean,uiuc_edu,com,hole,com,space
6,com,work,earth,berkeley,league,pitt_edu,think,van,come,think,use,new,com,university,say,toronto,arab,muslims,article,mail
7,bnr,new,energy,coverage,nhl,duke,right,tor,life,edu,disease,think,card,political,edu,cso,edu,stanford,clipper,available
8,arromdee,law,radar,cost,season,like,know,buf,know,win,time,bike,like,man,like,space,right,know,frank,include
9,article_apr,time,star,private,win,tek,fbi,min,love,time,effect,know,run,annual,article,cobb,people,time,science,data


In [14]:
tw_10['stem_exp2']

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
0,file,edu,edu,game,edu,govern,articl,max,space,god
1,window,articl,drive,team,com,peopl,com,max_max,nasa,christian
2,program,pay,work,play,articl,right,edu,edu,year,peopl
3,use,insur,know,edu,mail,state,like,year,launch,believ
4,edu,com,like,win,post,law,car,game,orbit,know
5,imag,abort,card,hockey,univers,key,think,hit,medic,jesu
6,com,uiuc,use,player,group,gun,peopl,articl,use,think
7,mail,tax,com,year,inform,think,thing,bhj,research,edu
8,run,health,problem,articl,new,armenian,know,basebal,time,like
9,avail,want,thank,season,address,articl,good,pitch,effect,time


Let's select the top 10% of set-ups based on their perplexity (i.e. those at or below the 10th percentile, since lower is better) and see how coherent they are.

In [15]:
import numpy as np
all_perplexities = pp_10.values() + pp_20.values() + pp_50.values()
top10_cut = round(np.percentile(all_perplexities, 10), 2)
top10_models = {}
for n,p in zip([10,20,50], [pp_10,pp_20,pp_50]):
    for model, perplexity in p.iteritems():
        if perplexity <= top10_cut:
            top10_models["_".join([model,str(n)])] = perplexity
top10_models

{'lemma_exp2_10': 0.25899912673071496, 'stem_exp2_10': 0.25414921201102364}
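
To make explicit what the percentile cut above does, here is a dependency-free sketch. It assumes linear interpolation between order statistics, which is the default behaviour of `np.percentile`. The helper names `percentile` and `select_low_perplexity` are mine, and I only use the 10-topic perplexities printed earlier, since the 20- and 50-topic values are not shown here. With all three dictionaries the cut could in principle differ.

```python
from __future__ import division, print_function
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = int(math.floor(pos)), int(math.ceil(pos))
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_low_perplexity(perplexities_by_k, q=10):
    """Keep the set-ups whose perplexity is at or below the rounded cut."""
    all_values = [p for pp in perplexities_by_k.values() for p in pp.values()]
    cut = round(percentile(all_values, q), 2)
    selected = {}
    for n_topics, pp in perplexities_by_k.items():
        for setup, perplexity in pp.items():
            if perplexity <= cut:
                selected["{}_{}".format(setup, n_topics)] = perplexity
    return cut, selected


# Perplexities of the 10-topic runs, copied from the output of pp_10 above
pp_10 = {
    'basic_exp1': 0.31471618171353233,
    'basic_exp2': 0.31429573959863427,
    'glove': 0.32085378021600947,
    'lemma_exp1': 0.26659287034652801,
    'lemma_exp2': 0.25899912673071496,
    'stem_exp1': 0.26658473872097244,
    'stem_exp2': 0.25414921201102364,
    'words': 0.34226687877222595,
}

cut, selected = select_low_perplexity({10: pp_10})
print(cut)
print(sorted(selected))
```

Note that rounding the cut to two decimals makes the selection slightly more permissive than the raw 10th percentile would be.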

These 2 set-ups lead to similar perplexity. In my (limited) experience, lemmatization plus bigrams normally leads to the best models (here, with 10 topics).

Let's use `gensim` to get the coherence:

In [17]:
from gensim.models import CoherenceModel, LdaMulticore
from gensim.matutils import Sparse2Corpus
from gensim.corpora import Dictionary
from nlp_utils import SimpleTokenizer, StemTokenizer, LemmaTokenizer
from nlp_utils import Bigram, read_docs

MAX_NB_WORDS = 20000
TEXT_DATA_DIR = '/home/ubuntu/working/text_classification/20_newsgroup/'
docs, doc_classes = read_docs(TEXT_DATA_DIR)

# If you run this as a script (python lda_coherence.py)
# you might want to turn logging on (this needs `import logging`).

# logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.DEBUG)
# logging.root.level = logging.DEBUG
def model_builder(docs, tokenizer_, phaser_, nb_topics):
    """Simple helper so I don't have to repeat code
    """
    doc_tokens  = [tokenizer_(doc) for doc in docs]
    doc_tokens  = phaser_(doc_tokens)

    id2word = Dictionary(doc_tokens)
    id2word.filter_extremes(no_below=10, no_above=0.5, keep_n=MAX_NB_WORDS)
    corpus = [id2word.doc2bow(doc) for doc in doc_tokens]

    model = LdaMulticore(
       corpus=corpus,
       id2word=id2word,
       decay=0.7,
       offset=10.0,
       num_topics=nb_topics,
       passes=5,
       batch=False,
       chunksize=2000,
       iterations=50)

    return doc_tokens, id2word, model
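
The `filter_extremes` call in `model_builder` is what controls the vocabulary size. As a rough pure-Python sketch of my understanding of it: keep tokens that appear in at least `no_below` documents and in at most a fraction `no_above` of them, then keep the `keep_n` tokens with the highest document frequency. The function `filter_vocabulary` below is a hypothetical stand-in, not gensim's code, and details such as rounding and tie-breaking may differ from the library.

```python
from collections import Counter


def filter_vocabulary(doc_tokens, no_below=10, no_above=0.5, keep_n=20000):
    """Return the set of tokens that survive the frequency filter."""
    n_docs = len(doc_tokens)
    # Document frequency: number of documents each token appears in
    doc_freq = Counter(tok for doc in doc_tokens for tok in set(doc))
    kept = [(tok, df) for tok, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs]
    # Most frequent first; ties broken alphabetically (a choice of this sketch)
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return set(tok for tok, _ in kept[:keep_n])


# Hypothetical mini-corpus
toy_tokens = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "cat", "dog"],
    ["the", "bird"],
]

# "the" is too frequent (4 of 4 documents), "bird" too rare (1 document)
print(sorted(filter_vocabulary(toy_tokens, no_below=2, no_above=0.75, keep_n=10)))
```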

In [18]:
# Warning! this will take some time (even on an AWS p2 instance)
stem_10_texts, stem_10_dict, stem_10_model = model_builder(docs, StemTokenizer(), Bigram(), 10)
lemma_10_texts, lemma_10_dict, lemma_10_model = model_builder(docs, LemmaTokenizer(), Bigram(), 10)

In [19]:
stem_10_CM = CoherenceModel(model=stem_10_model, texts=stem_10_texts, dictionary=stem_10_dict, coherence='c_v')
stem_10_coherence = stem_10_CM.get_coherence()

lemma_10_CM = CoherenceModel(model=lemma_10_model, texts=lemma_10_texts, dictionary=lemma_10_dict, coherence='c_v')
lemma_10_coherence = lemma_10_CM.get_coherence()

print("coherence with stemming and 10 topics: {}".format(stem_10_coherence))
print("coherence with lemmatization and 10 topics: {}".format(lemma_10_coherence))

coherence with stemming and 10 topics: 0.496838746061
coherence with lemmatization and 10 topics: 0.526520456048


So, the most coherent model is indeed the one built after preprocessing the data with lemmatization and bigrams (a coherence of 0.527 vs. 0.497 with stemming).

The final LDA model was built by running:

```
python lda_build_model.py
```