# Topics and Transformations

We will be using gensim for data pre-processing, computational linguistics and topic modelling.

Based on [Gensim Tutorials](https://radimrehurek.com/gensim/tutorial.html):
* [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html)
* [Pre-processing and stop-words](https://radimrehurek.com/gensim/parsing/preprocessing.html)
* [Stemming Algorithms](https://radimrehurek.com/gensim/parsing/porter.html)

In [44]:
import os
import sys
import gensim

from gensim import corpora
from gensim import models
from gensim.models.coherencemodel import CoherenceModel

print('Python Version: %s' % (sys.version))

Python Version: 2.7.15 | packaged by conda-forge | (default, Feb 28 2019, 04:00:11) 
[GCC 7.3.0]


In [7]:
dictionary = corpora.Dictionary.load('documents.dict')
corpus = corpora.MmCorpus('documents.mm')
print(dictionary)
print(corpus)

Dictionary(7714 unique tokens: [u'francesco', u'csuci', u'univesidad', u'sation', u'efimenko']...)
MmCorpus(4 documents, 7714 features, 10760 non-zero entries)
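
To make the two objects above concrete, here is a minimal pure-Python sketch of a bag-of-words representation. The token lists and the names `toy_texts`, `token2id` and `to_bow` are hypothetical and only illustrate the idea; gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

# Hypothetical tokenized documents, for illustration only.
toy_texts = [
    ["learn", "teach", "student", "learn"],
    ["student", "univers", "innov"],
]

# Map each distinct token to an integer id (the role a dictionary plays).
token2id = {}
for text in toy_texts:
    for token in text:
        if token not in token2id:
            token2id[token] = len(token2id)


def to_bow(text):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())


# One sparse vector per document; only non-zero counts are stored.
toy_corpus = [to_bow(text) for text in toy_texts]
print(toy_corpus)
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the kind of vector the "non-zero entries" figure above is counting.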


In [11]:
import pickle
#with open('documents', 'wb') as f: #save
#    pickle.dump(mylist, f)

with open('documents', 'rb') as f: #load
    documents = pickle.load(f)

## Creating a transformation

The transformations are standard Python objects, typically initialized by means of a *training* corpus:

In [10]:
tfidf = models.TfidfModel(corpus)  # step 1 -- initialize a model

## Transforming vectors

From now on, `tfidf` is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):

In [15]:
corpus_tfidf = tfidf[corpus] # step 2 -- use the model to transform vectors
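
The conversion from integer counts to real-valued weights can be sketched without gensim. The example below assumes the weighting that, to my understanding, `TfidfModel` applies by default: raw count times `log2(N / df)`, followed by scaling each vector to unit length. The corpus and the names `toy_corpus`, `doc_freq` and `tfidf_vector` are hypothetical.

```python
import math

# Hypothetical bag-of-words corpus: each document is a list of (token_id, count).
toy_corpus = [
    [(0, 2), (1, 1), (2, 1)],
    [(2, 1), (3, 1), (4, 1)],
]

num_docs = len(toy_corpus)

# Document frequency: the number of documents each token id appears in.
doc_freq = {}
for doc in toy_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_vector(doc):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weighted = [(tid, count * math.log(float(num_docs) / doc_freq[tid], 2))
                for tid, count in doc]
    # Tokens that occur in every document get weight zero and are dropped.
    weighted = [(tid, w) for tid, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []


toy_tfidf = [tfidf_vector(doc) for doc in toy_corpus]
print(toy_tfidf)
```

Token id 2 appears in both toy documents, so it carries no discriminating information and disappears from both transformed vectors.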

## LDA
`gensim` uses a fast implementation of online LDA parameter estimation based on Hoffman, Blei and Bach (2010), *Online Learning for Latent Dirichlet Allocation*, modified to run in [distributed mode](https://radimrehurek.com/gensim/distributed.html) on a cluster of computers.

In [18]:
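lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,          # bag-of-words vectors
    id2word=dictionary,     # maps token ids back to tokens
    num_topics=20,          # number of topics to fit
    random_state=100,       # fixed seed for reproducibility
    update_every=1,         # update the model after every chunk (online training)
    chunksize=100,          # documents per training chunk
    passes=10,              # full passes over the corpus
    alpha='auto',           # learn an asymmetric document-topic prior from the data
    per_word_topics=True,   # also compute the most likely topics for each word
)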

## View the topics in LDA model
The model above is built with 20 topics. Each topic is a weighted combination of keywords, and each weight indicates how much that keyword contributes to the topic.

`lda_model.print_topics()` lists the keywords and their weights per topic, and `lda_model.print_topic(1)` does the same for a single topic:

In [32]:
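# Each topic is rendered as a weighted sum of tokens.
# Only the last expression in a cell is echoed, so the output below is topic 1.
lda_model.print_topics()  # all topics (return value not displayed)
lda_model.print_topic(1)  # selected topic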

u'0.003*"learn" + 0.003*"educ" + 0.002*"teach" + 0.001*"innov" + 0.001*"student" + 0.001*"higher" + 0.001*"institut" + 0.001*"chang" + 0.001*"http" + 0.001*"univers"'
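
If the weights are needed as numbers rather than as a formatted string, the output above can be split into `(token, weight)` pairs. This is only a sketch using the standard library; the regular expression and the name `pairs` are my own, and gensim's `show_topic` method should return such pairs directly.

```python
import re

# Topic string copied from the output above.
topic = ('0.003*"learn" + 0.003*"educ" + 0.002*"teach" + 0.001*"innov" + '
         '0.001*"student" + 0.001*"higher" + 0.001*"institut" + 0.001*"chang" + '
         '0.001*"http" + 0.001*"univers"')

# Each term has the form weight*"token"; capture both parts.
pairs = [(token, float(weight))
         for weight, token in re.findall(r'([0-9.]+)\*"([^"]+)"', topic)]

for token, weight in pairs:
    print('%-10s %.3f' % (token, weight))
```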

## Compute Perplexity and Coherence Score
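A lower perplexity score indicates better generalization performance. Note that gensim's `log_perplexity` returns the per-word likelihood bound rather than the perplexity itself. Perplexity is `2^(-bound)`, so a bound closer to zero corresponds to a lower perplexity.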

*BLEI, D. M.; NG, A. Y.; JORDAN, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, [s. l.], v. 3, n. Jan, p. 993–1022, 2003.*

In [52]:
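# log_perplexity returns the per-word likelihood bound, not the perplexity itself
print('Perplexity:', lda_model.log_perplexity(corpus))
# c_v coherence expects tokenized documents (lists of strings) in `texts`;
# passing the bag-of-words `corpus` here is the likely cause of the nan below
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)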

('Perplexity:', -6.647293692955271)
('Coherence Score: ', nan)
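
The value printed as "Perplexity" above is negative because it is a log-scale bound. A minimal sketch of the conversion, assuming the relation `perplexity = 2^(-bound)` that gensim documents for `log_perplexity` (the variable names are my own):

```python
# Per-word likelihood bound, copied from the output above.
bound = -6.647293692955271

# Convert the log-scale bound into a perplexity estimate.
perplexity = 2 ** (-bound)
print('Perplexity estimate: %.1f' % perplexity)
```

Because the model is scored on the same corpus it was trained on, this figure describes fit to the training data, not performance on held-out documents.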


In [56]:
lda_model.save('lda_model')