# The author-topic model: LDA with metadata

In this tutorial, you will learn how to use the author-topic model in Gensim. First, we will apply it to a corpus of scientific papers, to gain insight into the authors of the papers. After that, we will apply the model to StackExchange posts with tags, and implement a simple automatic tagging system.

The author-topic model is an extension of Latent Dirichlet Allocation (LDA). Each document is associated with a set of authors, and the model learns a topic distribution for each of these authors. Each author may in turn be associated with multiple documents. To learn about the theoretical side of the author-topic model, see [Rosen-Zvi and co-authors](https://mimno.infosci.cornell.edu/info6150/readings/398.pdf), for example.

Familiarity with topic modelling, LDA and Gensim is assumed in this tutorial. If you are not familiar with LDA or its Gensim implementation, consider some of these resources:
* Gentle introduction to the LDA model: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
* Gensim's LDA API documentation: https://radimrehurek.com/gensim/models/ldamodel.html
* Topic modelling in Gensim: http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
* Pre-processing and training LDA: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb


In part 1 of this tutorial, we will illustrate basic usage of the model, and explore the resulting representation. How to load and pre-process the dataset used is also covered.

In part 2, we will develop a simple automatic tagging system, and some more of the model's functionality will be shown.

## Part 1: analyzing scientific papers

The data used in part 1 consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). It is the same dataset used in the [Pre-processing and training LDA](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb) tutorial, mentioned earlier.

You can download the data from Sam Roweis' website (http://www.cs.nyu.edu/~roweis/data.html).

In the following sections we will load the data, pre-process it, train the model, and explore the results using some of the implementation's functionality. Feel free to skip the loading and pre-processing for now, if you are familiar with the process.

### Loading the data

In the cell below, we crawl the folders and files in the dataset, and read the files into memory.

In [1]:
import os, re

# Folder containing all NIPS papers.
data_dir = '../../../../data/nipstxt/'  # Set this path to the data on your machine.

# Folders containing individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]

# Get all document texts and their corresponding IDs.
docs = []
doc_ids = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)  # List of filenames.
    for filen in files:
        # Get document ID.
        (idx1, idx2) = re.search('[0-9]+', filen).span()  # Indexes of the start and end of the ID.
        doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))
        
        # Read document text.
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen, errors='ignore', encoding='utf-8') as fid:
            txt = fid.read()
            
        # Replace each whitespace character (newline, tab, etc.) with a space.
        txt = re.sub(r'\s', ' ', txt)
        
        docs.append(txt)
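
To make the ID construction above concrete, here is a minimal standalone sketch of the same regular-expression logic. The folder and file names used (`nips00`, `0123.txt`) are hypothetical examples chosen for illustration, not entries taken from the dataset, and `make_doc_id` is a helper name introduced only for this sketch.

```python
import re

def make_doc_id(yr_dir, filen):
    """Build an ID of the form '<year>_<number>' from a folder and file name."""
    # Find the first run of digits in the filename.
    idx1, idx2 = re.search('[0-9]+', filen).span()
    # yr_dir[4:] strips the 'nips' prefix; int() drops leading zeros.
    return yr_dir[4:] + '_' + str(int(filen[idx1:idx2]))

# Hypothetical folder and file names, for illustration only.
example_id = make_doc_id('nips00', '0123.txt')
print(example_id)
```

Dropping the leading zeros matters because these IDs are later used as dictionary keys, so both sides of the lookup have to be formatted the same way.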

Construct a mapping from author names to document IDs.

In [2]:
# Get all author names and their corresponding document IDs.
author2doc = dict()
for yr in yrs:  # Using the years defined in the previous cell.
    # The files "a00.txt" and so on contain the author-document mappings.
    filename = data_dir + 'idx/a' + yr + '.txt'
    with open(filename, errors='ignore', encoding='utf-8') as fid:
        for line in fid:
            # Each line corresponds to one author.
            contents = re.split(',', line)
            author_name = (contents[1] + contents[0]).strip()
            # Remove any whitespace to reduce redundant author names.
            author_name = re.sub(r'\s', '', author_name)
            # Get document IDs for author.
            ids = [c.strip() for c in contents[2:]]
            if not author2doc.get(author_name):
                # This is a new author.
                author2doc[author_name] = []

            # Add document IDs to author.
            author2doc[author_name].extend([yr + '_' + doc_id for doc_id in ids])

# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.
# Mapping from ID of document in NIPS dataset, to an integer ID.
doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
# Replace NIPS IDs by integer IDs.
for a, a_doc_ids in author2doc.items():
    for i, doc_id in enumerate(a_doc_ids):
        author2doc[a][i] = doc_id_dict[doc_id]
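
The parsing of a single line of the author index can be sketched in isolation as follows. The sample line is made up and merely assumes the comma-separated layout that the code above implies (last name, first name, then document numbers); `parse_author_line` is a hypothetical helper, not part of the tutorial's code.

```python
import re

def parse_author_line(line, yr):
    """Return (author_name, doc_ids) for one line of an author index file."""
    contents = re.split(',', line)
    # First name followed by last name, with all whitespace removed.
    author_name = re.sub(r'\s', '', contents[1] + contents[0])
    # Prefix each document number with the year, matching the format of doc_ids.
    ids = [yr + '_' + c.strip() for c in contents[2:]]
    return author_name, ids

# Hypothetical line, assuming the layout "Last, First, doc, doc, ...".
name, ids = parse_author_line('Doe, Jane A., 12, 345\n', '07')
print(name, ids)
```

Removing all whitespace from the name is what produces keys such as `'YannLeCun'`, which are used to query the model later on.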

### Pre-processing text

The text will be pre-processed using the following steps:
* Tokenize text.
* Replace all whitespace by single spaces.
* Remove all punctuation and numbers.
* Remove stopwords.
* Lemmatize words.
* Add multi-word named entities.
* Add frequent bigrams.
* Remove frequent and rare words.

Part 2 will use the same pre-processing, for the most part, so we shall explain it here.

A lot of the heavy lifting will be done by the great package, Spacy. Spacy markets itself as "industrial-strength natural language processing", is fast, enables multiprocessing, and is easy to use. First, let's import it and load the English NLP pipeline.

In [3]:
import spacy
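# Note: in spaCy 3 and later the 'en' shortcut is no longer available;
# load a named pipeline instead, e.g. spacy.load('en_core_web_sm').
nlp = spacy.load('en')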

In the code below, Spacy takes care of tokenization, removing non-alphabetic characters, removal of stopwords, lemmatization and named entity recognition.

Note that we only keep named entities that consist of more than one word, as single-word named entities are already present as ordinary tokens.

In [4]:
%%time
processed_docs = []    
for doc in nlp.pipe(docs, n_threads=4, batch_size=100):
    # Process document using Spacy NLP pipeline.
    
    ents = doc.ents  # Named entities.

    # Keep only alphabetic tokens (no numbers, no punctuation),
    # remove stopwords, and lemmatize what is left.
    doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

    # Remove common words from a stopword list.
    #doc = [token for token in doc if token not in STOPWORDS]

    # Add named entities, but only if they are a compound of more than one word.
    doc.extend([str(entity) for entity in ents if len(entity) > 1])
    
    processed_docs.append(doc)

CPU times: user 10min 13s, sys: 780 ms, total: 10min 14s
Wall time: 3min 27s
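
To illustrate the kind of filtering performed in the loop above without requiring Spacy, here is a simplified stand-in that uses only the standard library. It is not equivalent to the Spacy pipeline: there is no lemmatization or named entity recognition, tokenization is a plain regular expression, and `TOY_STOPWORDS` is a tiny hypothetical list introduced for this sketch.

```python
import re

# Tiny hypothetical stopword list; Spacy's own list is much larger.
TOY_STOPWORDS = {'the', 'a', 'of', 'is', 'in', 'and'}

def simple_preprocess(text):
    """Lowercase, keep runs of letters only, and drop stopwords."""
    tokens = re.findall(r'[A-Za-z]+', text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS]

tokens = simple_preprocess('The 3 layers of a neural network, in 1989.')
print(tokens)
```

Numbers and punctuation disappear because the pattern only matches letters, which loosely mirrors the `token.is_alpha` check in the real code.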


In [5]:
docs = processed_docs
del processed_docs

Below, we use a Gensim model to add bigrams. Note that this achieves the same goal as named entity recognition, that is, finding adjacent words that have some particular significance.

In [6]:
# Compute bigrams.
from gensim.models import Phrases
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
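
The effect of the loop above can be mimicked with a simplified, standard-library sketch. Note that this is not how `Phrases` selects bigrams: Gensim scores candidate pairs against a threshold in addition to `min_count`, and it does not emit overlapping bigrams. The sketch below only counts adjacent pairs, and `toy_docs`, the `min_count` value and the function name are made up for illustration.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count):
    """Append 'w1_w2' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter(
        pair for doc in docs for pair in zip(doc, doc[1:])
    )
    result = []
    for doc in docs:
        bigrams = ['_'.join(pair) for pair in zip(doc, doc[1:])
                   if pair_counts[pair] >= min_count]
        # Keep the original unigrams and add the bigrams at the end.
        result.append(doc + bigrams)
    return result

toy_docs = [
    ['neural', 'network', 'model'],
    ['deep', 'neural', 'network'],
    ['network', 'model'],
]
toy_with_bigrams = add_frequent_bigrams(toy_docs, min_count=2)
print(toy_with_bigrams[0])
```

As in the tutorial code, the bigrams are added alongside the original words rather than replacing them, so both `neural` and `neural_network` remain available to the topic model.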



Now we are ready to construct a dictionary, as our vocabulary is finalized. We then remove common words (appearing in more than $50\%$ of the documents), and rare words (appearing in fewer than $20$ documents).

In [10]:
# Create a dictionary representation of the documents, and filter out frequent and rare words.

from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)

_ = dictionary[0]  # This sort of "initializes" dictionary.id2token.
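
As a rough illustration of what `filter_extremes` does, the following sketch applies the same two document-frequency criteria by hand: `no_below` is an absolute number of documents, while `no_above` is a fraction of the corpus. It is a simplified stand-in, the toy documents and thresholds are made up, and the additional `keep_n` parameter of the real method is ignored.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ['model', 'image', 'pixel'],
    ['model', 'image', 'policy'],
    ['model', 'policy', 'reward'],
    ['model', 'neuron'],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here `model` is dropped for being in every document, and `pixel`, `reward` and `neuron` are dropped for appearing in only one.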

Next, we compute the bag-of-words representation of each document. This vectorized form is what we supply to the author-topic model.

In [11]:
# Vectorize data.

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
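
For readers unfamiliar with the bag-of-words format, the sketch below reproduces its general shape with the standard library: each document becomes a list of `(token_id, count)` pairs, and tokens outside the vocabulary are skipped. The vocabulary, the document and the helper name `toy_doc2bow` are hypothetical.

```python
from collections import Counter

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary and document.
token2id = {'image': 0, 'model': 1, 'policy': 2}
bow = toy_doc2bow(['model', 'image', 'model', 'unknown'], token2id)
print(bow)
```

Word order is discarded in this representation; only the counts per token survive.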

Let's inspect the dimensionality of our data.

In [12]:
print('Number of authors: %d' % len(author2doc))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of authors: 2479
Number of unique tokens: 6996
Number of documents: 1740


### Train and use model

We train the author-topic model on the data prepared in the previous sections. 

The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, an ID-to-word mapping (`id2word`) and the number of topics (`num_topics`), the author-topic model requires either an author-to-document-ID mapping (`author2doc`), or the reverse (`doc2author`).
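
The two mappings carry the same information, which a small pure-Python sketch can make explicit. This is only an illustration with made-up authors; in practice Gensim's own `construct_doc2author` helper, used later in this tutorial, does this job.

```python
def invert_author2doc(author2doc):
    """Build a document -> authors mapping from an author -> documents mapping."""
    doc2author = {}
    for author, doc_ids in author2doc.items():
        for doc_id in doc_ids:
            doc2author.setdefault(doc_id, []).append(author)
    return doc2author

# Hypothetical authors and integer document IDs.
toy_author2doc = {'alice': [0, 1], 'bob': [1, 2]}
toy_doc2author = invert_author2doc(toy_author2doc)
print(toy_doc2author)
```

Document `1` ends up with two authors, which is exactly the situation the author-topic model is designed for.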

Below, we have also (this can be skipped for now):
* Increased the number of `passes` over the dataset (to improve the convergence of the optimization problem).
* Decreased the number of `iterations` over each document (related to the above).
* Specified the mini-batch size (`chunksize`) (primarily to speed up training).
* Turned off bound evaluation (`eval_every`) (as it takes a long time to compute).
* Turned on automatic learning of the `alpha` and `eta` priors (to improve the convergence of the optimization problem).
* Set the random state (`random_state`) of the random number generator (to make these experiments reproducible).

We load the model, and train it.

**FIXME:** why is autotuning turned on below?

In [16]:
from gensim.models import AuthorTopicModel
%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                author2doc=author2doc, chunksize=2000, passes=100, alpha='auto', eta='auto', \
                eval_every=0, iterations=1, random_state=1)

CPU times: user 14min 10s, sys: 1min 12s, total: 15min 22s
Wall time: 14min 6s


Before we explore the model, let's try to improve upon it. To do this, we will train several models with different random initializations, by giving different seeds for the random number generator (`random_state`). We evaluate the topic coherence of the model using the [top_topics](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics) method, and pick the model with the highest topic coherence.



In [73]:
%%time
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                    author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \
                    eval_every=0, iterations=1, random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    model_list.append((model, tc))

CPU times: user 14min 30s, sys: 1min 18s, total: 15min 49s
Wall time: 14min 43s


Choose the model with the highest topic coherence.

In [66]:
model, tc = max(model_list, key=lambda x: x[1])
print('Topic coherence: %.3e' %tc)

Topic coherence: -1.766e+03
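
The selection logic can be tried out without training anything. In the sketch below, each run is represented by a list of `(topic, coherence)` pairs, which is the shape the code above assumes for the result of `top_topics`. The coherence values and topic placeholders are invented for illustration.

```python
def total_coherence(top_topics):
    """Sum the coherence scores over all topics of one model."""
    return sum(coherence for _, coherence in top_topics)

# Hypothetical results for three seeds; topics are replaced by placeholders.
runs = {
    0: [('topic_a', -1.5), ('topic_b', -2.0)],
    1: [('topic_a', -1.0), ('topic_b', -1.25)],
    2: [('topic_a', -3.0), ('topic_b', -0.5)],
}
scored = [(seed, total_coherence(tt)) for seed, tt in runs.items()]
best_seed, best_tc = max(scored, key=lambda x: x[1])
print(best_seed, best_tc)
```

Since the scores are negative, the "highest" coherence is the one closest to zero.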


Let's print the most important words in the topics.

In [67]:
model.show_topics(num_topics=10)

[(0,
  '0.009*"control" + 0.007*"memory" + 0.006*"prediction" + 0.006*"table" + 0.006*"signal" + 0.005*"search" + 0.005*"controller" + 0.005*"system" + 0.004*"user" + 0.004*"run"'),
 (1,
  '0.013*"neuron" + 0.010*"threshold" + 0.009*"f" + 0.008*"let" + 0.008*"theorem" + 0.007*"bound" + 0.007*"class" + 0.007*"node" + 0.007*"p" + 0.006*"layer"'),
 (2,
  '0.009*"w" + 0.008*"matrix" + 0.007*"noise" + 0.007*"approximation" + 0.007*"gaussian" + 0.006*"density" + 0.005*"optimal" + 0.005*"generalization" + 0.005*"sample" + 0.005*"y"'),
 (3,
  '0.013*"image" + 0.009*"distance" + 0.008*"cluster" + 0.006*"trajectory" + 0.005*"transformation" + 0.005*"object" + 0.005*"solution" + 0.005*"matrix" + 0.005*"dynamic" + 0.004*"inverse"'),
 (4,
  '0.014*"action" + 0.011*"control" + 0.010*"policy" + 0.009*"q" + 0.009*"reinforcement" + 0.008*"optimal" + 0.006*"dynamic" + 0.005*"robot" + 0.005*"environment" + 0.005*"reward"'),
 (5,
  '0.015*"representation" + 0.012*"layer" + 0.011*"image" + 0.009*"object" +

These topics are by no means perfect. They have problems such as *chained topics*, *intruded words*, *random topics*, and *unbalanced topics* (see [Mimno and co-authors 2011](https://people.cs.umass.edu/~wallach/publications/mimno11optimizing.pdf)). They will do for the purposes of this tutorial, however.

**TODO:** re-write the interpretation of the topics below, if necessary.

Below, we use the `model[name]` syntax to retrieve the topic distribution for some authors. Comparing the authors' topics with the topics above, we observe that the model has correctly identified that Yann LeCun and Geoffrey E. Hinton both have something to do with neural networks (topic 5), speech recognition (topic 1 and 5) and statistical machine learning (topic 9). We also observe that Yann LeCun has been particularly occupied with image processing, and perhaps that Geoffrey E. Hinton has worked with visual perception in neuroscience (this is less clear).

Similarly, Terrence J. Sejnowski and James M. Bower are both neuroscientists, first and foremost, and their topic distributions seem to reflect that.

In [68]:
from pprint import pprint

names = ['YannLeCun', 'GeoffreyE.Hinton', 'TerrenceJ.Sejnowski', 'JamesM.Bower']
for name in names:
    # Print the author's document IDs and topic distribution.
    print('\n%s' % name)
    print('Docs:', author2doc[name])
    print('Topics:')
    pprint(model[name])


YannLeCun
Docs: [143, 406, 370, 495, 456, 449, 595, 616, 760, 752, 1532]
Topics:
[(3, 0.29943682408405564), (9, 0.70035037360056807)]

GeoffreyE.Hinton
Docs: [56, 143, 284, 230, 197, 462, 463, 430, 688, 784, 826, 848, 869, 1387, 1684, 1728]
Topics:
[(4, 0.07225384180855414), (5, 0.92764230357402855)]

TerrenceJ.Sejnowski
Docs: [513, 530, 539, 468, 611, 581, 600, 594, 703, 711, 849, 981, 944, 865, 850, 883, 881, 1221, 1137, 1224, 1146, 1282, 1248, 1179, 1424, 1359, 1528, 1484, 1571, 1727, 1732]
Topics:
[(5, 0.86190832291064989), (7, 0.13802575466031855)]

JamesM.Bower
Docs: [17, 48, 58, 131, 101, 126, 127, 281, 208, 225]
Topics:
[(7, 0.99980671969007273)]
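
Each result above is a list of `(topic_id, probability)` pairs, so simple post-processing needs no Gensim at all. The sketch below picks the dominant topic for two of the authors, using the values copied from the output above; `dominant_topic` is a helper introduced here for illustration. The probabilities do not sum exactly to one, presumably because topics with very small probability are left out of the result.

```python
def dominant_topic(author_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(author_topics, key=lambda pair: pair[1])

# Topic distributions copied from the output above.
yann = [(3, 0.29943682408405564), (9, 0.70035037360056807)]
geoffrey = [(4, 0.07225384180855414), (5, 0.92764230357402855)]

print(dominant_topic(yann)[0])
print(dominant_topic(geoffrey)[0])
```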


We can construct the `doc2author` dictionary ourselves.

In [43]:
from gensim.models import atmodel
doc2author = atmodel.construct_doc2author(author2doc=author2doc, corpus=corpus)

We can also compute the (per-word) bound.

In [44]:
# Compute the per-word bound.
# Number of words in corpus.
corpus_words = sum(cnt for document in corpus for _, cnt in document)

# Compute bound and divide by number of words.
perwordbound = model.bound(corpus, author2doc=author2doc, doc2author=doc2author) / corpus_words
print(perwordbound)

-7.6914582241156673
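
The normalization above is easy to check on a toy corpus. The sketch below counts word occurrences in the same way and divides a made-up total bound by that count. The value of `toy_bound` is hypothetical and merely stands in for what `model.bound` would return.

```python
# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1), (3, 1)],
]

# Total number of word occurrences (not the number of unique tokens).
toy_corpus_words = sum(cnt for document in toy_corpus for _, cnt in document)

# Hypothetical total bound, standing in for the value returned by model.bound.
toy_bound = -60.0
toy_perwordbound = toy_bound / toy_corpus_words
print(toy_corpus_words, toy_perwordbound)
```

Dividing by the number of words makes the bound comparable across corpora of different sizes.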

We can evaluate the quality of the topics by computing the topic coherence, as in the LDA class. This can be used, for example, to find out which of the topics are of poor quality, or as a metric for model selection.

In [65]:
%time top_topics = model.top_topics(corpus)

CPU times: user 17.1 s, sys: 0 ns, total: 17.1 s
Wall time: 17.1 s


### Explore author-topic representation

In [70]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
_ = tsne.fit_transform(model.state.gamma)  # Result stored in tsne.embedding_

In [71]:
from bokeh.io import output_notebook
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource

output_notebook()

If you are unable to view or interact with the plot below, it is available [here]() (**TODO:** make a page for the plot, and include the link), or view the entire notebook [here]() (**TODO:** make nbviewer page for the notebook or something).

In [72]:
x = tsne.embedding_[:, 0]
y = tsne.embedding_[:, 1]
author_names = list(model.id2author.values())

# Radius of each point corresponds to the number of documents attributed to that author.
scale = 0.1
radii = [len(author2doc[a]) * scale for a in author_names]

source = ColumnDataSource(
        data=dict(
            x=x,
            y=y,
            author_names=author_names,
            radii=radii,
        )
    )

hover = HoverTool(
        tooltips=[
        ("author", "@author_names"),
        ("radius", "@radii"),
        ]
    )

p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])
p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)
show(p)