# Dynamic Topic Models

As we observed before, topic models are used to understand the themes across dataset. In many scenarios, corpus is collected over a period of time and we might want to analyse the themes across this time period, for ex. to see how the particular theme/topic evolve over time. Dynamic topic model is precisely aimed at analysing these time based datasets.

The following image illustrates an example from the original paper on [DTM by Blei and Lafferty](https://mimno.infosci.cornell.edu/info6150/readings/dynamic_topic_models.pdf). It clearly illustrates how the same broad classified topic starts looking more 'mature' as time goes on.

<img src="img/dtm_topics.png"/>

By having a time-based element to topics, context is preserved while key-words may change.

While most traditional topic mining algorithms do not expect time-tagged data or take into account any prior ordering, Dynamic Topic Models (DTM) leverages the knowledge of different documents belonging to a different time-slice in an attempt to map how the words in a topic change over time.

Another useful analysis in a corpus spanning over a couple of years could be to find semantically similar documents; one from the very beginning of the time-line, and one in the very end. The author of the DTM paper, David Blei, gave a good example of this in his [talk](https://www.youtube.com/watch?v=7BMsuyBPx90) a while back. After running DTM on a dataset of the Science Journal from 1880 onwards, he picks up this paper - [The Brain of the Orang (1880)](http://science.sciencemag.org/content/os-1/28/326). It's topics were concentrated on context related to Monkeys and Neuroscience or brains.

<img src="img/orang_brain.png"/>

He goes ahead to pick up another paper with likely very less common words, but in the same context - analysing monkey brains. In fact, this one is called - ["Representation of the visual field on the medial wall of occipital-parietal cortex in the owl monkey"](http://allmanlab.caltech.edu/PDFs/AllmanKaas1976.pdf). Like mentioned before, you wouldn't imagine too many common words in these two papers, about a 100 years apart.

<img src="img/owl_monkey.png"/>

But a Hellinger Distance based Document-Topic distribution gives a very high similarity value! The same topics evolved smoothly over time and the context remains. A document similarity match using other traditional techniques might not work very well on this! Blei defines this technique as - "Time corrected Document Similarity".

## Data

We'll use the same fake news data as earlier but as we understand from the above description, Dynamic topic models are used for the datasets which have a time element. So we'll now consider the "published" column also to get the time stamps of news articles.

In [None]:
import pandas as pd

df_fake = pd.read_csv('fake.csv')
df_fake = df_fake[['title', 'text', 'language', 'published']]
df_fake = df_fake.loc[(pd.notnull(df_fake.title)) & (pd.notnull(df_fake.text)) & \
                      (pd.notnull(df_fake.published)) & (df_fake.language == 'english')]
df_fake

## Preprocessing

In [None]:
import re
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation

import nltk
nltk.download('wordnet') # download wordnet to be used in lemmatization
from nltk.stem import WordNetLemmatizer

def preprocess(texts):
    # tokenization
    texts = [re.findall(r'\w+', line.lower()) for line in texts]
    # remove stopwords
    texts = [remove_stopwords(' '.join(line)).split() for line in texts]
    # remove punctuation
    texts = [strip_punctuation(' '.join(line)).split() for line in texts]
    # remove words that are only 1-2 character
    texts = [[token for token in line if len(token) > 2] for line in texts]
    # remove numbers
    texts = [[token for token in line if not token.isnumeric()] for line in texts]
    # lemmatization 
    lemmatizer = WordNetLemmatizer()
    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]
    
    return texts

# pre-processing
processed_texts = preprocess(df_fake.text)

In [None]:
from gensim.models.phrases import Phrases, Phraser

# training for bigram collocation detection
phrases = Phrases(processed_texts, min_count=1, threshold=0.8, scoring='npmi')
bigram = Phraser(phrases)
# merging detected collocations with data
processed_texts = list(bigram[processed_texts])

In [None]:
from gensim import corpora

class DTMcorpus(corpora.textcorpus.TextCorpus):
    def get_texts(self):
        return self.input

    def __len__(self):
        return len(self.input)

corpus = DTMcorpus(processed_texts)
corpus

A very important input for DTM to work is the time_slice input. It should be a list which contains the number of documents in each time-slice. Our dataset consists of data collected across 2 months, so let's first count the no. of document in each month.

We'll need to convert the "published" column in our dataset to pandas datetime series.

In [None]:
df_fake["published"] = pd.to_datetime(df_fake["published"])
t1 = df_fake.loc[(df_fake["published"].dt.month == 10)].shape[0]
t2 = df_fake.loc[(df_fake["published"].dt.month == 11)].shape[0]

In [None]:
print(t1, t2)

As we can see the first month has 5710 articles and second month has 5967 articles. This means we'd need an input which looks like this: time_slice = [5710, 5967]. Technically, a time-slice can be a month, year, or any way you wish to split up the number of documents in your corpus, time-based.

In [None]:
time_slices = [t1, t2]

## Train

Now we generate the path to the DTM excecutable and use it through the DTM model wrapper available in gensim.

In [None]:
from gensim.models.wrappers.dtmmodel import DtmModel

dtm_path = "/Users/parul/Desktop/Information-extraction-tutorial/dtm/dtm/main"
model = DtmModel(dtm_path, corpus, time_slices, num_topics=10, id2word=corpus.dictionary)

In [None]:
model.save('models/dtm_model')

In [11]:
model = DtmModel.load('models/dtm_model')

## Results

Much like LDA, the points of interest in DTM would be to see what the topics are and how the documents are made up of these topics. In DTM we have the added interest of seeing how these topics evolve over time.

Let's go through some of the functions to print topics and analyse documents.

In [12]:
model.show_topic(topicid=1, time=0, topn=10)

[(0.040499197594614164, 'obama'),
 (0.008650059062554342, 'text'),
 (0.008003198096176644, 'new'),
 (0.007383433673202659, 'president'),
 (0.006768889725549842, 'like'),
 (0.006451716198963009, 'com'),
 (0.006345130287772089, 'facebook'),
 (0.006196045154164384, 'white_house'),
 (0.004807154702749049, 'news'),
 (0.004794995761013065, 'comment')]

### Topic Evolution

To see a topic evolve, let's print the topic distribution from all the time slices for each topic.

The current dataset doesn't provide us the diversity for this to be an effective example; but we will nevertheless illustrate how to do the same.

In [13]:
num_topics = 3
for topic_no in range(num_topics):
    print("\nTopic", str(topic_no))
    for time in range(len(time_slices)):
        print("Time slice", str(time))
        print(model.show_topic(topic_no, time, topn=10))


Topic 0
Time slice 0
[(0.022090493669678763, 'russia'), (0.013283764115503797, 'russian'), (0.009643676190337538, 'war'), (0.008508286791404808, 'military'), (0.007297172784891062, 'nuclear'), (0.00572334209767633, 'president'), (0.005413978016967515, 'china'), (0.005093736448535589, 'putin'), (0.004843615389800822, 'nato'), (0.004736446484506321, 'world')]
Time slice 1
[(0.022517878968858727, 'russia'), (0.013564577782603829, 'russian'), (0.009657683934341412, 'war'), (0.008557989640700375, 'military'), (0.00728074922182181, 'nuclear'), (0.005742327333796194, 'president'), (0.005401115832611818, 'china'), (0.005124790579795278, 'putin'), (0.004886095017586734, 'nato'), (0.004768617659198458, 'world')]

Topic 1
Time slice 0
[(0.040499197594614164, 'obama'), (0.008650059062554342, 'text'), (0.008003198096176644, 'new'), (0.007383433673202659, 'president'), (0.006768889725549842, 'like'), (0.006451716198963009, 'com'), (0.006345130287772089, 'facebook'), (0.006196045154164384, 'white_ho

### Distance between documents

One of the more handy use of Dynamic topic models is that we can compare the documents across different time-frames and see how similar they are topic-wise. When words may not necessarily overlap over these time-periods, this is very useful.

We can get the topic distribution of documents using the gamma property of the model.

In [None]:
doc = 0
model.gamma_[doc]

The distance between documents based on their topic distribution can be calculated using hellinger metric available in gensim.

In [None]:
from gensim.matutils import hellinger

doc1 = 4
doc2 = 5
hellinger(model.gamma_[doc1], model.gamma_[doc2])

## Visualization

Similar to LDA, Dynamic topic models can also be visualized using pyLDAvis. Each time slice can be explored separately using this visualization.

In [16]:
import pyLDAvis

doc_topic, topic_term, doc_lengths, term_frequency, vocab = model.dtm_vis(time=0, corpus=corpus)
vis_wrapper = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic, doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)
pyLDAvis.display(vis_wrapper)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  return pd.concat([default_term_info] + list(topic_dfs))


In [17]:
doc_topic, topic_term, doc_lengths, term_frequency, vocab = model.dtm_vis(time=1, corpus=corpus)
vis_wrapper = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic, doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)
pyLDAvis.display(vis_wrapper)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  return pd.concat([default_term_info] + list(topic_dfs))


## References 

1. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/dtm_example.ipynb
2. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb
3. http://repository.cmu.edu/cgi/viewcontent.cgi?article=2036&context=compsci