In [None]:
import gensim

import numpy as np
import pyLDAvis.gensim

import nltk
nltk.download('wordnet') # download wordnet to be used in lemmatization
from nltk.stem import WordNetLemmatizer

from gensim.models.word2vec import Text8Corpus

from gensim.corpora import Dictionary
from gensim.models.callbacks import CoherenceMetric, DiffMetric, PerplexityMetric, ConvergenceMetric
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel

import matplotlib.pyplot as plt
import plotly.offline as py
from plotly.graph_objs import *
import plotly.figure_factory as ff

py.init_notebook_mode()

## Data

We'll be using the [fake news dataset](https://www.kaggle.com/mrisdal/fake-news) from kaggle for this notebook. The dataset contains text and metadata scraped from 244 websites tagged as "bullshit" by the [BS Detector](https://github.com/selfagency/bs-detector) Chrome Extension by [Daniel Sieradski](https://github.com/selfagency).

In [1]:
import pandas as pd

df_fake = pd.read_csv('fake.csv')
df_fake = df_fake[['title', 'text', 'language']].head()
df_fake = df_fake.loc[(pd.notnull(df_fake.text)) & (df_fake.language == 'english')]
df_fake

Unnamed: 0,title,text,language
0,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,english
1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,english
2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,english
3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,english
4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,english


## Pre-process

This is one of the most important step in analyzing the text data. If the preprocessing is not good, the algorithm can't do much since we are feeding it a lot of noise, in other words, **Garbage In Garbage Out**. So let's first clean our data using the following techniques:

1. Tokenization
2. Stopword removal
3. Strip punctuation
4. Lemmatization
5. Bigram collocation detection. (Bigrams are sets of two adjacent tokens. Collocations are frequently co-occurring tokens. Using bigrams, phrases like "machine_learning" can be discovered in our output which otherwise would have been treated as two separate words "machine" and "learning". Spaces are replaced with underscores in corpus)

Note: Stemming and lemmatization are two different processes for reducing the morphological variation of words. 
- Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

- Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

    The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

    The word "walk" is the base form for word "walking", and hence this is matched in both stemming and lemmatisation.

    The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.

In [45]:
import os, re
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation

import nltk
nltk.download('wordnet') # download wordnet to be used in lemmatization
from nltk.stem import WordNetLemmatizer

def preprocess(texts):
    # tokenization
    texts = [re.findall(r'\w+', line.lower()) for line in texts]
    # remove stopwords
    texts = [remove_stopwords(' '.join(line)).split() for line in texts]
    # remove punctuation
    texts = [strip_punctuation(' '.join(line)).split() for line in texts]
    # remove words that are only 1-2 character
    texts = [[token for token in line if len(token) > 1] for line in texts]
    # remove numbers
    texts = [[token for token in line if not token.isnumeric()] for line in texts]
    # lemmatization 
    lemmatizer = WordNetLemmatizer()
    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]
    
    return texts

# pre-processing
processed_texts = preprocess(df_fake.text)

[nltk_data] Downloading package wordnet to /Users/parul/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [33]:
processed_texts[0]

['print',
 'pay',
 'money',
 'plus',
 'entire',
 'family',
 'came',
 'need',
 'deported',
 'asap',
 'years',
 'bust',
 'group',
 'stealing',
 'government',
 'taxpayers',
 'group',
 'somalis',
 'stole',
 'million',
 'government',
 'benefits',
 'months',
 've',
 'reported',
 'numerous',
 'cases',
 'like',
 'muslim',
 'refugees',
 'immigrants',
 'commit',
 'fraud',
 'scamming',
 'way',
 'control',
 'related']

In [46]:
from gensim.models.phrases import Phrases, Phraser

# training for bigram collocation detection
phrases = Phrases(processed_texts, min_count=1, threshold=0.8, scoring='npmi')

Then we create a performant Phraser object to transform any tokenized sentence.

In [47]:
bigram = Phraser(phrases)

In [52]:
bigram[['hillary', 'clinton', 'tree', 'money']]

['hillary_clinton', 'tree', 'money']

In [49]:
# merging detected collocations with data
processed_texts = list(bigram[processed_texts])

In [None]:
# split into training, holdout and test data
training_texts = texts[:5000]
holdout_texts = texts[5000:7500]
test_texts = texts[7500:]

In [None]:
# create dictionary mappings for training data
dictionary = Dictionary(training_texts)

We now remove rare words and common words based on their document frequency to further prevent noisy results. Below we remove words that appear in less than 10 documents or in more than 60% of the documents.

In [None]:
# Filter out words that occur less than 20 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.6)

Finally, we transform the documents to a vectorized form. We simply compute the frequency of each word, including the bigrams.

In [None]:
# make a function if use all 3 types

training_corpus = [dictionary.doc2bow(text) for text in training_texts]
holdout_corpus = [dictionary.doc2bow(text) for text in holdout_texts]
test_corpus = [dictionary.doc2bow(text) for text in test_texts]

Let's see how our final corpus looks like in vectorized form.

In [None]:
training_corpus[0]

A document is represented as a list of tuples of (vocab ID, frequency) for each word.

## Train

Let's train our topic model which is just a matter of single line with gensim.

In [None]:
# training LDA model
lda_model = LdaModel(corpus=training_corpus, id2word=dictionary, num_topics=35, passes=50 , chunksize=1500, iterations=200, alpha='auto')

Now we'll see what the typical output of a topic model looks like. Below, we print top 5 topics. As we can see, each topic is associated with a set of words, and each word has a probability of being expressed under that topic.

In [None]:
model.show_topics(num_topics=5)

There are other APIs that can be used to explore the trained topic model, for getting further information about our data.

The function `get_term_topics` returns the odds of a particular word belonging to some particular topic.

In [None]:
model.get_term_topics('money')

The `get_document_topics` method returns topic distribution of the document along with topic distribution for each word in that document.

In [None]:
doc_topic, word_topic, phi_value = model.get_document_topics(training_corpus[0], per_word_topics=True)

In [None]:
doc_topic

The output gives the topic distribution of the document.

In [None]:
word_topic

The output gives a ranked list of topics for every word. The lower the rank, more is the probability of the word  belonging to that topic in the document.

In [None]:
phi_value

Phi values are essentially the probability of that word in that document belonging to a particular topic.

## Evaluation

### Manual:

We would like to know if the correct thing has been learned, does the topics inferred make sense as per our text data.
Thanks to pyLDAvis, we can visualise our topic models in a really handy way and inspect what words the topics consist of or how similar the topics are.

In [None]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, training_corpus, dictionary, sort_topics=False)

We can see what topics does the fake news primarily focuses on, and according to our dataset its mostly about politics. 

#### Left Panel:

1. The area of the circles is proportional to the prevalence of the topics in corpus. So, we can visually determine about the most important topics in our corpus.
2. The positioning of the topics is done according to their inter-topic distances, which to some exent preserves the semantic similarity allowing some related topics to form clusters, it is however a little difficult to determine exactly how similar the topics are. For this, we can directly visualize the matrix of inter-topic distances and know the exact distance (with intersecting/different words) between any pair of topics.

In [55]:
def plot_difference(mdiff, title="", annotation=None):
    """
    Helper function to plot difference between models
    """
    annotation_html = None
    if annotation is not None:
        annotation_html = [["+++ {}<br>--- {}".format(", ".join(int_tokens), ", ".join(diff_tokens))
                            for (int_tokens, diff_tokens) in row]
                           for row in annotation]
        
    data = Heatmap(z=mdiff, colorscale='RdBu', text=annotation_html)
    layout = Layout(width=950, height=950, title=title,
                       xaxis=dict(title="topic"), yaxis=dict(title="topic"))
    py.iplot(dict(data=[data], layout=layout))

In [None]:
difference_matrix, annotation = model.diff(model, distance='jensen_shannon', num_words=50)
plot_difference(difference_matrix, title="Topic difference [jensen shannon distance]", annotation=annotation)

The 2D coordinates which are inferred for each topic in the left panel of pyldavis are based on above distance matrix. [Principal coordinate analysis (PCoA)](https://mb3is.megx.net/gustame/dissimilarity-based-methods/principal-coordinates-analysis) is used for infering the 2d coordinates which seeks to preserve the original topic distances of higher dimensions. You can refer to [this](http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/) blog for getting a graphical reasoning behind this technique.


#### Right Panel:

1. The terms are initially ranked according to their saliency, when no topic is selected. Saliency basically depends on frequency of the term in corpus, and how informative the specific term w is for determining the individual topics.  For example, if a word w occurs in all topics, observing the word tells us little about the document’s topical mixture <sup>[1]</sup>.
2. The tuning parameter, 0≤λ≤1, controls how the terms are ranked for each selected topic, with terms listed in decreasing order of relevance. The relevance of term w to topic t is defined as λ*p(w∣t)+(1−λ)*p(w∣t)/p(w). Values of λ near 1 give high relevance rankings to frequent terms within a given topic, whereas values of λ near zero give high relevance rankings to exclusive terms within a topic i.e. it factors in the general probability of that word in other topics.

But how does this ranking relate to the ranking of terms in gensim's `show_topic()` method that we saw above. In gensim, the float value given, next to every term as shown in the list output of `show_topic()` is the value of p(w∣t) i.e. probability of term w for topic t, and gensim simply ranks the topic terms according to this probability value. So, in order to have similar results as gensim’s `show_topic()` in pyLDAvis, λ can be set to 1 which will then result in the relevance being directly proportional to only p(w∣t). 

Now if we choose λ very close to 1, common terms of the corpus appear near the top for multiple topics, making it hard to differentiate between the meanings of different topics. And if choose λ very close to 0, it can still remain noisy, by giving high rankings to very rare terms that occur in only a single topic, for instance. While such terms may contain useful topic content, but if they are very rare the topic may remain difficult to interpret.

The optimal value suggested for λ is 0.6<sup>[2]</sup>.



### Automatic:

[Coherence](http://qpleple.com/topic-coherence-to-evaluate-topic-models/) is often used to get past the manual inspection and objectively compare the topic models. By returning a score, we can compare between different topic models of the same corpus.

In [None]:
CoherenceModel(model, texts=training_texts, dictionary=dictionary, window_size=10).get_coherence()

Following is a really nice explanation for understanding the intution behind coherence that I came across in Matti Lyra's [Pydata Berlin talk](https://github.com/mattilyra/pydataberlin-2017/blob/master/notebook/EvaluatingUnsupervisedModels.ipynb):

Take the following two documents that talk about ice hockey. I've highlighted terms that I think are related to the subject matter, you may disagree with my judgement. Notice that among the terms that I've highlighted as being part of the topic of Ice Hockey are words such as Penguin, opposing and shots. None of these on the face of it would appear to "belong" to Ice Hockey, but seeing them in context makes it clear that Penguin refers to the ice hockey team, shots refers to disk shaped pieces of vulcanised rubber being launched at the goal at various different speeds and opposing refers to the opposing team although it might more commonly be thought to belong politics or the debate club.

> Rinne stopped 27 of 28 **shots** from the **Penguins** in **Game** 6 at home Sunday, but that lone **goal** allowed was enough for the **opposition** to break out the **Stanley Cup trophy** for the second straight **season**.

Given the terms that I've determined to be a partial description of Ice Hockey (the concept), one could conceivably measure the coherence of that concept by counting how many times those terms occur with each other i.e. co-occur in some sufficiently large reference corpus.

One of course encounters a problem should the reference corpus never refer to ice hockey. A poorly selected reference corpus could for instance be patent applications from the 1800s, it would be unlikely to find those word pairs in that text.

This is precisely what several research papers have aimed to do. To take the top words from the topics in a topic model and measure the support for those words forming a coherent concept / topic by looking at the co-occurrences of those terms in a reference corpus. The research up to now was finally wrapped up into a single paper where the authors develop a coherence pipeline, which allows plugging in all the different methods into a single framework. This coherence pipeline is partially implemented in gensim, below is a few examples on how to use it.

## Application

In this section, we will go through some of the applications or ways in which we can use topic models and get the most out of them for our NLP tasks.


### Document clustering

Now apart from getting the topic distribution of a corpus as a whole (as we did using `show_topics`) we can also get the topic distribution of individual documents using `get_document_topics`. Let's see an example for doing it in gensim and then we will visualize the documents based on their topic distribution. That topic distribution representation of each document can be used for clustering the semantically similar documents together. 

In [None]:
# Get document topics
doc_topic, word_topic, phi_value = model.get_document_topics(training_corpus, minimum_probability=0)
doc_topic[0]

The above output shows the topic distribution of first document in the corpus as a list of (topic_id, topic_probability).

Now, using the topic distribution of a document as it's vector representation, we will plot all the documents in our corpus using Tensorboard.

#### Prepare the Input files for Tensorboard

Tensorboard takes two input files, one containing the embedding vectors and the other containing relevant metadata. As described above, we will use the topic distribution of documents as their representative vector and the metadata file will consist of Document titles.

In [None]:
# create file for tensors(vectors)
with open('doc_lda_tensor.tsv','w') as w:
    for topics in doc_topics:
        w.write(str(topics[1])+ "\t")
    w.write("\n")

In [None]:
# create file for metadata(documet titles)
with open('doc_lda_metadata.tsv','w') as w:
    for doc_id in range(len(doc_topic)):
        w.write("doc_" + df_fake.title[doc_id] + "\n")

Now we can go to http://projector.tensorflow.org/ and upload these two files by clicking on Load data in the left panel.

Next, we will append the topics with highest probability (topic_id, topic_probability) to the document's title, in order to explore what topics do the cluster corners or edges dominantly belong to. For this, we just need to overwrite the metadata file as below:

In [None]:
tensors = []
for doc_topics in all_topics:
    doc_tensor = []
    for topic in doc_topics:
        if round(topic[1], 3) > 0:
            doc_tensor.append((topic[0], float(round(topic[1], 3))))
    # sort topics according to highest probabilities
    doc_tensor = sorted(doc_tensor, key=lambda x: x[1], reverse=True)
    # store vectors to add in metadata file
    tensors.append(doc_tensor[:5])

# overwrite metadata file
with open('doc_lda_metadata.tsv','w') as w:
    w.write("Doc_Title\tTopic_dist\n")
    for doc_id in range(0, len(all_topics)):
        w.write("doc_%s\t%s\n" % (str(doc_id), str(tensors[doc_id])))

Upload the previous tensor file "doc_lda_tensor.tsv" and this new metadata file again to see the updated results.

### Topic Connections

Topic connections can be extremely useful especially for the datasets with interdisciplinary interests. One wide application for topic connections can be seen in semanticscholar.org where they use interrelated topics to index scientific papers.

#### Topic Network

Networks can be a great way to explore topic models. We can use it to navigate that how topics belonging to one context may relate to some topics in other context and discover common factors between them. We can use them to find communities of similar topics and pinpoint the most influential topic that has large no. of connections or perform any number of other workflows designed for network analysis.

In [None]:
# get topic distributions
topic_dist = model.state.get_lambda()

# get topic terms
num_words = 50
topic_terms = [{w for (w, _) in model.show_topic(topic, topn=num_words)} for topic in range(topic_dist.shape[0])]

Firstly, a distance matrix is calculated to store distance between every topic pair. The nodes of the network graph will represent topics and the edges between them will be created based on the distance between two connecting nodes/topics.

To draw the edges, we can use different types of distance metrics available in gensim for calculating the distance between every topic pair. Next, we'd have to define a threshold of distance value such that the topic-pairs with distance above that does not get connected.

In [None]:
from scipy.spatial.distance import pdist, squareform
from gensim.matutils import jensen_shannon
import networkx as nx
import itertools as itt

# calculate distance matrix using the input distance metric
def distance(X, dist_metric):
    return squareform(pdist(X, lambda u, v: dist_metric(u, v)))

topic_distance = distance(topic_dist, jensen_shannon)
topic_distance

In [None]:
# store edges b/w every topic pair along with their distance
edges = [(i, j, {'weight': topic_distance[i, j]})
         for i, j in itt.combinations(range(topic_dist.shape[0]), 2)]

# keep edges with distance below the threshold value
k = np.percentile(np.array([e[2]['weight'] for e in edges]), 20)
edges = [e for e in edges if e[2]['weight'] < k]

Now that we have our edges, let's plot the annotated network graph. On hovering over the nodes, we'll see the topic_id along with it's top words and on hovering over the edges, we'll see the intersecting/different words of the two topics that the edge connects.

In [None]:
# add nodes and edges to graph layout
G = nx.Graph()
G.add_nodes_from(range(topic_dist.shape[0]))
G.add_edges_from(edges)

graph_pos = nx.spring_layout(G)

# initialize traces for drawing nodes and edges 
node_trace = Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=Marker(
        showscale=True,
        colorscale='YIGnBu',
        reversescale=True,
        color=[],
        size=10,
        colorbar=dict(
            thickness=15,
            xanchor='left'
        ),
        line=dict(width=2)))

edge_trace = Scatter(
    x=[],
    y=[],
    text=[],
    line=Line(width=0.5, color='#888'),
    hoverinfo='text',
    mode='lines')


# no. of terms to display in annotation
n_ann_terms = 10

# add edge trace with annotations
for edge in G.edges():
    x0, y0 = graph_pos[edge[0]]
    x1, y1 = graph_pos[edge[1]]
    
    pos_tokens = topic_terms[edge[0]] & topic_terms[edge[1]]
    neg_tokens = topic_terms[edge[0]].symmetric_difference(topic_terms[edge[1]])
    pos_tokens = list(pos_tokens)[:min(len(pos_tokens), n_ann_terms)]
    neg_tokens = list(neg_tokens)[:min(len(neg_tokens), n_ann_terms)]
    annotation = "<br>".join((": ".join(("+++", str(pos_tokens))), ": ".join(("---", str(neg_tokens)))))
    
    x_trace = list(np.linspace(x0, x1, 10))
    y_trace = list(np.linspace(y0, y1, 10))
    text_annotation = [annotation] * 10
    x_trace.append(None)
    y_trace.append(None)
    text_annotation.append(None)
    
    edge_trace['x'] += x_trace
    edge_trace['y'] += y_trace
    edge_trace['text'] += text_annotation

# add node trace with annotations
for node in G.nodes():
    x, y = graph_pos[node]
    node_trace['x'].append(x)
    node_trace['y'].append(y)
    node_info = ''.join((str(node+1), ': ', str(list(topic_terms[node])[:n_ann_terms])))
    node_trace['text'].append(node_info)
    
# color node according to no. of connections
for node, adjacencies in enumerate(nx.generate_adjlist(G)):
    node_trace['marker']['color'].append(len(adjacencies))
    
fig = Figure(data=Data([edge_trace, node_trace]),
             layout=Layout(showlegend=False,
                hovermode='closest',
                xaxis=XAxis(showgrid=True, zeroline=False, showticklabels=True),
                yaxis=YAxis(showgrid=True, zeroline=False, showticklabels=True)))

py.iplot(fig)

For the above graph, we used the 20th percentile of all the distance values as our threshold. But we can experiment with few different values also such that the graph doesn’t become too crowded or too sparse and we could get an optimum amount of information about similar topics or any interesting relations between different topics.

### Topic dendrogram

The topics can be related to each other in a hierarchical form also. For ex. in case of a research paper corpora, we could have papers belonging to maths, physics, biology which can further be categorised into sub-groups. Maths papers can be sub-divided into topics such as Calculus, Algebra, Geometry; Physics papers into mechanics, electronics, astronomy; Biology into Genetics, Anatomy, Molecular etc.

Dendrogram is a tree-structured graph which can be used to visualize the result of a hierarchical clustering. We can use it to explore the topic models and see how the topics are connected to each other in a sequence of successive fusions or divisions that occur in the clustering process.

In [None]:
from scipy import spatial as scs
from scipy.cluster import hierarchy as sch

# get topic distributions
topic_dist = model.state.get_lambda()

# get topic terms
num_words = 300
topic_terms = [{w for (w, _) in model.show_topic(topic, topn=num_words)} for topic in range(topic_dist.shape[0])]

# no. of terms to display in annotation
n_ann_terms = 10

# use Jensen-Shannon distance metric in dendrogram
def js_dist(X):
    return pdist(X, lambda u, v: jensen_shannon(u, v))

# define method for distance calculation in clusters
linkagefun=lambda x: sch.linkage(x, 'single')

# calculate text annotations
def text_annotation(topic_dist, topic_terms, n_ann_terms, linkagefun):
    # get dendrogram hierarchy data
    linkagefun = lambda x: sch.linkage(x, 'single')
    d = js_dist(topic_dist)
    Z = linkagefun(d)
    P = sch.dendrogram(Z, orientation="bottom", no_plot=True)

    # store topic no.(leaves) corresponding to the x-ticks in dendrogram
    x_ticks = np.arange(5, len(P['leaves']) * 10 + 5, 10)
    x_topic = dict(zip(P['leaves'], x_ticks))

    # store {topic no.:topic terms}
    topic_vals = dict()
    for key, val in x_topic.items():
        topic_vals[val] = (topic_terms[key], topic_terms[key])

    text_annotations = []
    # loop through every trace (scatter plot) in dendrogram
    for trace in P['icoord']:
        fst_topic = topic_vals[trace[0]]
        scnd_topic = topic_vals[trace[2]]
        
        # annotation for two ends of current trace
        pos_tokens_t1 = list(fst_topic[0])[:min(len(fst_topic[0]), n_ann_terms)]
        neg_tokens_t1 = list(fst_topic[1])[:min(len(fst_topic[1]), n_ann_terms)]

        pos_tokens_t4 = list(scnd_topic[0])[:min(len(scnd_topic[0]), n_ann_terms)]
        neg_tokens_t4 = list(scnd_topic[1])[:min(len(scnd_topic[1]), n_ann_terms)]

        t1 = "<br>".join((": ".join(("+++", str(pos_tokens_t1))), ": ".join(("---", str(neg_tokens_t1)))))
        t2 = t3 = ()
        t4 = "<br>".join((": ".join(("+++", str(pos_tokens_t4))), ": ".join(("---", str(neg_tokens_t4)))))

        # show topic terms in leaves
        if trace[0] in x_ticks:
            t1 = str(list(topic_vals[trace[0]][0])[:n_ann_terms])
        if trace[2] in x_ticks:
            t4 = str(list(topic_vals[trace[2]][0])[:n_ann_terms])

        text_annotations.append([t1, t2, t3, t4])

        # calculate intersecting/diff for upper level
        intersecting = fst_topic[0] & scnd_topic[0]
        different = fst_topic[0].symmetric_difference(scnd_topic[0])

        center = (trace[0] + trace[2]) / 2
        topic_vals[center] = (intersecting, different)

        # remove trace value after it is annotated
        topic_vals.pop(trace[0], None)
        topic_vals.pop(trace[2], None)  
        
    return text_annotations

# get text annotations
annotation = text_annotation(topic_dist, topic_terms, n_ann_terms, linkagefun)

# Plot dendrogram
dendro = ff.create_dendrogram(topic_dist, distfun=js_dist, labels=range(1, 36), linkagefun=linkagefun, hovertext=annotation)
dendro['layout'].update({'width': 1000, 'height': 600})
py.iplot(dendro)

The x-axis or the leaves of hierarchy represent the topics of our LDA model, y-axis is a measure of closeness of either individual topics or their cluster. Essentially, the y-axis level at which the branches merge (relative to the "root" of the tree) is proportional to their similarity.

Text annotations visible on hovering over the merging nodes show the intersecting/different terms of it's two child nodes. Merging node on first hierarchy level uses the topics on leaves directly, to calculate intersecting/different terms, and the upper nodes assume the intersection(+++) as the topic terms of it's child node.

This type of tree graph could help us see the high level cluster theme that might exist in our data, as we can see the common/different terms of combined topics in a cluster head annotation.

### Document Coloring

We can explore the topic distribution of documents in a visually handy way by coloring its words according to topics.

In [None]:
# this is a sample method to color words.

def color_words(model, doc):
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    
    # make into bag of words
    doc = model.id2word.doc2bow(doc)
    # get word_topics
    doc_topics, word_topics, phi_values = model.get_document_topics(doc, per_word_topics=True)

    # color-topic matching
    topic_colors = cm.rainbow(np.linspace(0, 1, len(ys)))
    
    # set up fig to plot
    fig = plt.figure()
    ax = fig.add_axes([0,0,1,1])

    # a sort of hack to make sure the words are well spaced out.
    word_pos = 1/len(doc)
    
    # use matplotlib to plot words
    for word, topics in word_topics:
        ax.text(word_pos, 0.8, model.id2word[word],
                horizontalalignment='center',
                verticalalignment='center',
                fontsize=20, color=topic_colors[topics[0]],  # choose just the most likely topic
                transform=ax.transAxes)
        word_pos += 0.2 # to move the word for the next iter

    ax.set_axis_off()
    plt.show()

In [None]:
color_words(model, df_fake.text.iloc[[0]])

## Author Topic Model

The author-topic model is an extension of Latent Dirichlet Allocation (LDA), that allows us to learn topic representations of authors in a corpus. The model can be applied to any kinds of labels on documents, such as tags on posts on the web. The model can be used as a novel way of data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata.

### Data

The data we'll be using consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). We crawl the folders and files in the dataset, and read the files into memory.

Construct a mapping from author names to document IDs.

In [None]:
import os, re
from smart_open import smart_open

# Folder containing all NIPS papers.
data_dir = 'nipstxt/'  # Set this path to the data on your machine.

# Folders containin individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]

# Get all document texts and their corresponding IDs.
docs = []
doc_ids = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)  # List of filenames.
    for filen in files:
        # Get document ID.
        (idx1, idx2) = re.search('[0-9]+', filen).span()  # Matches the indexes of the start and end of the ID.
        doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))
        
        # Read document text
        with smart_open(data_dir + yr_dir + '/' + filen, 'rb', encoding='latin-1') as fid:
            txt = fid.read()
            
        # Replace any whitespace (newline, tabs, etc.) by a single space.
        txt = re.sub(r'\s', ' ', txt)
        
        docs.append(txt)
        
docs[:2]

In [None]:
from smart_open import smart_open
filenames = [data_dir + 'idx/a' + yr + '.txt' for yr in yrs]  # Using the years defined in previous cell.

# Get all author names and their corresponding document IDs.
author2doc = dict()
i = 0
for yr in yrs:
    # The files "a00.txt" and so on contain the author-document mappings.
    filename = data_dir + 'idx/a' + yr + '.txt'
    for line in smart_open(filename, 'rb', errors='ignore', encoding='latin-1'):
        # Each line corresponds to one author.
        contents = re.split(',', line)
        author_name = (contents[1] + contents[0]).strip()
        # Remove any whitespace to reduce redundant author names.
        author_name = re.sub(r'\s', '', author_name)
        # Get document IDs for author.
        ids = [c.strip() for c in contents[2:]]
        if not author2doc.get(author_name):
            # This is a new author.
            author2doc[author_name] = []
            i += 1
        
        # Add document IDs to author.
        author2doc[author_name].extend([yr + '_' + id for id in ids])

# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.
# Mapping from ID of document in NIPS datast, to an integer ID.
doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
# Replace NIPS IDs by integer IDs.
for a, a_doc_ids in author2doc.items():
    for i, doc_id in enumerate(a_doc_ids):
        author2doc[a][i] = doc_id_dict[doc_id]

Now we will preprocess this dataset using same process and functions we used earlier for the Fake news dataset. 

In [None]:
processed_docs = preprocess(docs)

In [None]:
# Add bigrams and trigrams to docs (only ones that appear 20 times or more).
bigram = gensim.models.Phrases(docs, min_count=20)

In [None]:
# collocation detection
processed_docs = [bigram[line] for line in processed_docs]
processed_docs[0]

In [None]:
dictionary = Dictionary(processed_docs)

# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)

_ = dictionary[0]

In [None]:
# Vectorize data.

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [None]:
print('Number of authors: %d' % len(author2doc))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

### Training

The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, dictionary (id2word) and number of topics (num_topics), the author-topic model requires either an author to document ID mapping (author2doc), or the reverse (doc2author).

In [None]:
from gensim.models import AuthorTopicModel

author_model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \
                iterations=1, random_state=1)

### Exploring author-topic representation

Now that we have trained a model, we can start exploring the authors and the topics.

First, let's simply print the top 10 relevant words in the topics. Below we print topic 0. 

In [None]:
author_model.show_topic(0)

Below, we give each topic a label based on what each topic seems to be about intuitively.

In [None]:
topic_labels = ['Circuits', 'Neuroscience', 'Numerical optimization', 'Object recognition', \
               'Math/general', 'Robotics', 'Character recognition', \
                'Reinforcement learning', 'Speech recognition', 'Bayesian modelling']

Rather than just calling `model.show_topics(num_topics=10)`, we format the output a bit so it is easier to get an overview.

In [None]:
for topic in author_model.show_topics(num_topics=10):
    print('Label: ' + topic_labels[topic[0]])
    words = ''
    for word, prob in author_model.show_topic(topic[0]):
        words += word + ' '
    print('Words: ' + words)
    print()

These topics are by no means perfect. They have problems such as chained topics, intruded words, random topics, and unbalanced topics ([see Mimno and co-authors 2011](https://people.cs.umass.edu/~wallach/publications/mimno11optimizing.pdf). They will do for the purpose of this tutorial, however.

Now let's retrieve the topic distribution for an author. Each topic has a probability of being expressed given the particular author.

In [None]:
author_model['YannLeCun']

Let's print the top topics of some authors. First, we make a function to view it more easily and replacing the topic no. by the labels we gave for each topic above.

In [None]:
from pprint import pprint

def show_author(name):
    print('\n%s' % name)
    print('Docs:', author_model.author2doc[name])
    print('Topics:')
    pprint([(topic_labels[topic[0]], topic[1]) for topic in author_model[name]])

Below, we print some high profile researchers and inspect them. Three of these, Yann LeCun, Geoffrey E. Hinton and Christof Koch, are spot on.

Terrence J. Sejnowski's results are surprising, however. He is a neuroscientist, so we would expect him to get the "neuroscience" label. This may indicate that Sejnowski works with the neuroscience aspects of visual perception, or perhaps that we have labeled the topic incorrectly, or perhaps that this topic simply is not very informative.

In [None]:
show_author('YannLeCun')

In [None]:
show_author('GeoffreyE.Hinton')

In [None]:
show_author('TerrenceJ.Sejnowski')

In [None]:
show_author('ChristofKoch')

## Visualization

Now let's explore our author-topic model using interactive visualizations.

We take all the author-topic distributions and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE.

t-SNE is a method that attempts to reduce the dimensionality of a dataset, while maintaining the distances between the points. That means that if two authors are close together in the plot below, then their topic distributions are similar.

In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the `smallest_author` value if you do not want to view the authors with few documents only.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0)
smallest_author = 0  # Ignore authors with documents less than this.
authors = [author_model.author2id[a] for a in author_model.author2id.keys() if len(author_model.author2doc[a]) >= smallest_author]
_ = tsne.fit_transform(author_model.state.gamma[authors, :])  # Result stored in tsne.embedding_

We are now ready to make the plot.

Note that if you run this notebook yourself, you will see a different graph. The random initialization of the model will be different, and the result will thus be different to some degree. You may find an entirely different representation of the data, or it may show the same interpretation slightly differently.

If you can't see the plot, you are probably viewing this tutorial in a Jupyter Notebook. View it in an nbviewer instead at http://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb.

In [None]:
# Tell Bokeh to display plots inside the notebook.
from bokeh.io import output_notebook
output_notebook()

In [None]:
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource

x = tsne.embedding_[:, 0]
y = tsne.embedding_[:, 1]
author_names = [author_model.id2author[a] for a in authors]

# Radius of each point corresponds to the number of documents attributed to that author.
scale = 0.1
author_sizes = [len(author_model.author2doc[a]) for a in author_names]
radii = [size * scale for size in author_sizes]

source = ColumnDataSource(
        data=dict(
            x=x,
            y=y,
            author_names=author_names,
            author_sizes=author_sizes,
            radii=radii,
        )
    )

# Add author names and sizes to mouse-over info.
hover = HoverTool(
        tooltips=[
        ("author", "@author_names"),
        ("size", "@author_sizes"),
        ]
    )

p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])
p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)
show(p)

The circles in the plot above are individual authors, and their sizes represent the number of documents attributed to the corresponding author. Hovering your mouse over the circles will tell you the name of the authors and their sizes. Large clusters of authors tend to reflect some overlap in interest.

We see that the model tends to put duplicate authors close together. For example, Terrence J. Sejnowki and T. J. Sejnowski are the same person, and their vectors end up in the same place (see about (−10,−10) in the plot).

At about (−15,−10) we have a cluster of neuroscientists like Christof Koch and James M. Bower.

As discussed earlier, the "object recognition" topic was assigned to Sejnowski. If we get the topics of the other authors in Sejnoski's neighborhood, like Peter Dayan, we also get this same topic. Furthermore, we see that this cluster is close to the "neuroscience" cluster discussed above, which is further indication that this topic is about visual perception in the brain.

Other clusters include a reinforcement learning cluster at about (−5,8), and a Bayesian modelling cluster at about (8,−12).

## References

1. http://vis.stanford.edu/files/2012-Termite-AVI.pdf
2. https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf