# Visualizing Doc2Vec with TensorBoard



<img src="Tensorboard.png">


In this tutorial, I will explain how to visualize Doc2Vec embeddings, aka [Paragraph Vectors](), via TensorBoard. TensorBoard is a data visualization framework for visualizing and inspecting TensorFlow runs and graphs. We will use a built-in TensorBoard visualizer called *Embedding Projector*, which lets you interactively visualize and analyze high-dimensional data like embeddings.

For this tutorial, a transformed MovieLens dataset<sup>[1]</sup> from this [repository](https://github.com/RaRe-Technologies/movie-plots-by-genre) was used, with the movie titles added afterwards. You can download the prepared csv from [here](https://github.com/parulsethi/DocViz/blob/master/movie_plots.csv). The input documents for training are the movie synopses, on which the Doc2Vec model is trained.

The visualization will be a scatterplot as seen in the image above, where each datapoint is labelled by the movie title and colored by its corresponding genre. You can also visit this [Projector link](http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/parulsethi/DocViz/master/movie_plot_config.json), which is configured with my embeddings for the above-mentioned dataset.

# Define a Function to Read and Preprocess Text

In [2]:
import gensim
import pandas as pd
import smart_open
import random

# read data
dataframe = pd.read_csv('movie_plots.csv')
dataframe

| Unnamed: 0 | MovieID | Titles | Plots | Genres |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | A little boy named Andy loves to be in his roo... | animation |
| 1 | 2 | Jumanji (1995) | When two kids find and play a magical board ga... | fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Things don't seem to change much in Wabasha Co... | comedy |
| 3 | 6 | Heat (1995) | Hunters and their prey--Neil and his professio... | action |
| 4 | 7 | Sabrina (1995) | An ugly duckling having undergone a remarkable... | romance |
| 5 | 9 | Sudden Death (1995) | Some terrorists kidnap the Vice President of t... | action |
| 6 | 10 | GoldenEye (1995) | James Bond teams up with the lone survivor of ... | action |
| 7 | 15 | Cutthroat Island (1995) | Morgan Adams and her slave, William Shaw, are ... | action |
| 8 | 17 | Sense and Sensibility (1995) | When Mr. Dashwood dies, he must leave the bulk... | romance |
| 9 | 18 | Four Rooms (1995) | This movie features the collaborative director... | comedy |


Below, we define a function that reads the training documents, pre-processes each document using a simple gensim pre-processing tool (i.e., tokenizes the text into individual words, removes punctuation, lowercases it, etc.), and yields the resulting list of words wrapped in a `TaggedDocument`. To train the model, we also need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based row index of the document.

In [3]:
def read_corpus(documents):
    for i, plot in enumerate(documents):
        yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(plot, max_len=30), [i])

In [4]:
train_corpus = list(read_corpus(dataframe.Plots))

Let's take a look at the training corpus.

In [5]:
train_corpus[:2]

[TaggedDocument(words=[u'little', u'boy', u'named', u'andy', u'loves', u'to', u'be', u'in', u'his', u'room', u'playing', u'with', u'his', u'toys', u'especially', u'his', u'doll', u'named', u'woody', u'but', u'what', u'do', u'the', u'toys', u'do', u'when', u'andy', u'is', u'not', u'with', u'them', u'they', u'come', u'to', u'life', u'woody', u'believes', u'that', u'he', u'has', u'life', u'as', u'toy', u'good', u'however', u'he', u'must', u'worry', u'about', u'andy', u'family', u'moving', u'and', u'what', u'woody', u'does', u'not', u'know', u'is', u'about', u'andy', u'birthday', u'party', u'woody', u'does', u'not', u'realize', u'that', u'andy', u'mother', u'gave', u'him', u'an', u'action', u'figure', u'known', u'as', u'buzz', u'lightyear', u'who', u'does', u'not', u'believe', u'that', u'he', u'is', u'toy', u'and', u'quickly', u'becomes', u'andy', u'new', u'favorite', u'toy', u'woody', u'who', u'is', u'now', u'consumed', u'with', u'jealousy', u'tries', u'to', u'get', u'rid', u'of', u'buzz'
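
To make the token list above less mysterious, here is a rough pure-Python stand-in for the pre-processing step. It is only an approximation of what `gensim.utils.simple_preprocess` does, not its real implementation: `rough_preprocess` is a hypothetical helper, and the sample sentence is an illustrative line loosely based on the first plot.

```python
import re

def rough_preprocess(text, min_len=2, max_len=30):
    """Approximate stand-in for gensim.utils.simple_preprocess (assumption, not the real code)."""
    # Lowercase the text and keep only runs of alphabetic characters,
    # which drops punctuation and digits.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Discard tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Illustrative sentence, not the exact text from the dataset.
sample = "A little boy named Andy loves to be in his room, playing with his toys."
tokens = rough_preprocess(sample)
print(tokens)
```

Note how the one-letter word "A" disappears because of the minimum length, which is why the real output above starts with `little`.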

# Training the Doc2Vec Model
We'll instantiate a Doc2Vec model with a vector size of 50 dimensions and iterate over the training corpus 55 times. We set the minimum word count to 2 so that words appearing only once in the corpus are discarded. Model accuracy can often be improved by increasing the number of iterations, but this generally increases the training time. Small datasets with short documents, like this one, can benefit from more training passes.

In [8]:
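# Note: gensim 4.0 renamed `size` to `vector_size` and `iter` to `epochs`;
# older releases (as used for the outputs shown here) expect the old names.
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)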

92031
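
As a side note on `min_count`, the following is a minimal pure-Python sketch of the kind of vocabulary pruning it implies: words whose total count in the corpus is below the threshold are ignored. This is not gensim's implementation; `filter_rare_words` and the toy documents are made up for illustration.

```python
from collections import Counter

def filter_rare_words(tokenized_docs, min_count=2):
    """Drop words whose total corpus frequency is below min_count (illustrative helper)."""
    # Count every occurrence of every word across all documents.
    counts = Counter(word for doc in tokenized_docs for word in doc)
    # Keep only words that occur at least min_count times in total.
    vocab = {word for word, count in counts.items() if count >= min_count}
    filtered = [[word for word in doc if word in vocab] for doc in tokenized_docs]
    return filtered, vocab

# Toy corpus, invented for this example.
docs = [
    ["toy", "story", "andy", "woody"],
    ["andy", "toy", "buzz"],
    ["bond", "spy"],
]
filtered, vocab = filter_rare_words(docs, min_count=2)
print(sorted(vocab))
print(filtered)
```

With very short documents, an aggressive threshold can leave some documents with few or no known words, which is worth keeping in mind when choosing `min_count`.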

Now, we'll save the document embedding vectors per doctag.

In [9]:
model.save_word2vec_format('doc_tensor.w2v', doctag_vec=True, word_vec=False)  
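
The word2vec text format written here is simple: a header line with the number of vectors and their dimensionality, followed by one line per vector (a label and then the values, separated by spaces). As a rough picture of what turning such a file into tab-separated data involves, here is a minimal sketch that works on an in-memory example. It is not the gensim conversion script; `w2v_text_to_tsv` is a hypothetical helper, and the `*dt_0`-style labels and values are invented sample data.

```python
import io

def w2v_text_to_tsv(w2v_file):
    """Split word2vec text format into TSV vector rows and a list of labels (sketch)."""
    # Header: "<number of vectors> <dimensionality>".
    count, dim = (int(x) for x in w2v_file.readline().split())
    tensor_rows, labels = [], []
    for line in w2v_file:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        label, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError("unexpected vector length for %r" % label)
        tensor_rows.append("\t".join(values))
        labels.append(label)
    if len(labels) != count:
        raise ValueError("header promised %d vectors, found %d" % (count, len(labels)))
    return tensor_rows, labels

# Invented two-vector example in word2vec text format.
example = io.StringIO("2 3\n*dt_0 0.1 0.2 0.3\n*dt_1 0.4 0.5 0.6\n")
tensor_rows, labels = w2v_text_to_tsv(example)
print(tensor_rows)
print(labels)
```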

# Prepare the Input Files for TensorBoard

TensorBoard takes two input files: one containing the embedding vectors and the other containing the relevant metadata. We'll use a gensim script to directly convert the embedding file saved in word2vec format above into the tsv format required by TensorBoard.

In [11]:
%run ../../gensim/scripts/word2vec2tensor.py -i doc_tensor.w2v -o movie_plot

2017-04-20 02:23:05,284 : MainThread : INFO : running ../../gensim/scripts/word2vec2tensor.py -i doc_tensor.w2v -o movie_plot
2017-04-20 02:23:05,286 : MainThread : INFO : loading projection weights from doc_tensor.w2v
2017-04-20 02:23:05,464 : MainThread : INFO : loaded (1843, 50) matrix from doc_tensor.w2v
2017-04-20 02:23:05,578 : MainThread : INFO : 2D tensor file saved to movie_plot_tensor.tsv
2017-04-20 02:23:05,579 : MainThread : INFO : Tensor metadata file saved to movie_plot_metadata.tsv
2017-04-20 02:23:05,581 : MainThread : INFO : finished running word2vec2tensor.py


The script above generates two files: `movie_plot_tensor.tsv`, which contains the embedding vectors, and `movie_plot_metadata.tsv`, which contains the doctags. These doctags are simply the unique index values, so they do not help in identifying a document while visualizing. We will therefore overwrite `movie_plot_metadata.tsv` with a custom metadata file that has two columns: the first for the movie titles and the second for their corresponding genres.

In [12]:
with open('movie_plot_metadata.tsv','w') as w:
    w.write('Titles\tGenres\n')
    for i,j in zip(dataframe.Titles, dataframe.Genres):
        w.write("%s\t%s\n" % (i,j))
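
Before uploading, it can be worth checking that the two files line up: a multi-column metadata file needs a header row, and it should have exactly one data row per embedding vector, in the same order. The sketch below shows one way to check this on lists of lines. Both helpers (`clean_field`, `check_projector_inputs`) are hypothetical names of my own, and the sample rows are invented.

```python
def clean_field(value):
    """Collapse tabs, newlines and repeated spaces so a value fits in one TSV cell."""
    return " ".join(str(value).split())

def check_projector_inputs(tensor_lines, metadata_lines, has_header=True):
    """Verify that metadata rows match the vectors one-to-one (illustrative check)."""
    rows = metadata_lines[1:] if has_header else metadata_lines
    if len(rows) != len(tensor_lines):
        raise ValueError("%d metadata rows for %d vectors" % (len(rows), len(tensor_lines)))
    # Every row should have the same number of tab-separated columns as the first line.
    n_cols = metadata_lines[0].count("\t") + 1
    for i, row in enumerate(rows):
        if row.count("\t") + 1 != n_cols:
            raise ValueError("metadata row %d has the wrong number of columns" % i)
    return len(rows)

# Invented sample data: two vectors and a matching two-column metadata file.
tensor_lines = ["0.1\t0.2\t0.3", "0.4\t0.5\t0.6"]
metadata_lines = [
    "Titles\tGenres",
    "%s\t%s" % (clean_field("Toy Story\t(1995)"), "animation"),
    "%s\t%s" % (clean_field("Jumanji (1995)"), "fantasy"),
]
n_rows = check_projector_inputs(tensor_lines, metadata_lines)
print(n_rows)
```

A stray tab or newline inside a title would shift the columns, which is why the sketch cleans each field before writing it.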

Now you can go to http://projector.tensorflow.org/ and upload the two files by clicking on *Load data* in the left panel.

For demo purposes I have uploaded the Doc2Vec embeddings generated from the model trained above [here](https://github.com/parulsethi/DocViz). You can access the Embedding projector configured with these uploaded embeddings at this [link](http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/parulsethi/DocViz/master/movie_plot_config.json).
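
A shareable Projector link like the one above points at a small JSON config file that tells the Projector where to find the tensor and metadata files. Here is a minimal sketch of generating such a file, assuming the config layout shown in the Projector's publish instructions as I understand it. The `example.com` URLs are placeholders, not the actual locations of my files; the shape `[1843, 50]` is taken from the conversion log above.

```python
import json

# Assumed config layout; verify the field names against the Projector's own instructions.
config = {
    "embeddings": [
        {
            "tensorName": "Doc2Vec movie plots",
            "tensorShape": [1843, 50],  # number of documents, vector size
            # Placeholder URLs: replace with the raw URLs of your hosted files.
            "tensorPath": "https://example.com/movie_plot_tensor.tsv",
            "metadataPath": "https://example.com/movie_plot_metadata.tsv",
        }
    ]
}

config_json = json.dumps(config, indent=2)
print(config_json)
```

The resulting JSON would be hosted next to the two tsv files, and its URL passed to the Projector through the `config` query parameter.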

# Using TensorBoard

For visualization, the multi-dimensional embeddings that we get from the Doc2Vec model above need to be reduced to 2 or 3 dimensions. We end up with a new 2D or 3D embedding that tries to preserve information from the original multi-dimensional embedding. Because the vectors are reduced to a much smaller dimension, the exact cosine/Euclidean distances between them are not preserved, only (approximately) their relative ordering, and hence, as you'll see below, the nearest-neighbour results may change.
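
A tiny example shows why nearest neighbours can change after reducing dimensions. The sketch below uses three invented 3D points and simply drops the last coordinate, which is a crude stand-in for a projection and not what PCA or t-SNE actually compute. The point names and the `nearest` helper are hypothetical.

```python
import math

def nearest(query, points):
    """Return the name of the point closest to `query` by Euclidean distance."""
    candidates = (name for name in points if name != query)
    return min(candidates, key=lambda name: math.dist(points[query], points[name]))

# Invented points: "b" is far from "a" only along the third axis.
points_3d = {
    "a": (0.0, 0.0, 0.0),
    "b": (0.1, 0.0, 5.0),
    "c": (1.0, 0.0, 0.0),
}
# Crude "projection": keep only the first two coordinates.
points_2d = {name: vec[:2] for name, vec in points_3d.items()}

print(nearest("a", points_3d))  # neighbour in the original space
print(nearest("a", points_2d))  # neighbour after dropping a dimension
```

In the original space the nearest neighbour of `a` is `c`, but once the third axis is discarded `b` looks closer, because the information that separated `a` and `b` was lost.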

TensorBoard has two popular dimensionality reduction methods for visualizing the embeddings and also provides a custom method based on text searches:

- **Principal Component Analysis**: PCA aims at exploring the global structure in the data, and can end up losing the local similarities between neighbours. It maximizes the total variance in the lower-dimensional subspace and hence often preserves the larger pairwise distances better than the smaller ones. For the intuition behind this, see this nicely explained [answer](https://stats.stackexchange.com/questions/176672/what-is-meant-by-pca-preserving-only-large-pairwise-distances) on Stack Exchange.


- **t-SNE**: The idea of t-SNE is to place local neighbours close to each other while almost completely ignoring the global structure. It is useful for exploring local neighborhoods and finding local clusters, but global trends are not represented accurately and the separation between different groups is often not preserved (the t-SNE plots of our data below show this).


- **Custom Projections**: This is a custom method based on the text searches you define for different directions. It can be useful for finding meaningful directions in the vector space, for example, female to male, currency to country, etc.
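
To give a feel for the custom projection idea, here is a minimal sketch under the assumption that an axis is defined by two text searches: the points whose labels match each search are averaged into a centroid, and every point is then projected onto the line between the two centroids. The labels reuse titles from the dataset, but the 2D vectors are invented, and `custom_axis_positions` is a hypothetical helper, not the Projector's code.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def custom_axis_positions(embeddings, left_query, right_query):
    """Project every vector onto the axis running from the 'left' to the 'right' centroid."""
    left = [v for label, v in embeddings.items() if left_query in label.lower()]
    right = [v for label, v in embeddings.items() if right_query in label.lower()]
    if not left or not right:
        raise ValueError("both searches must match at least one label")
    axis = [r - l for l, r in zip(centroid(left), centroid(right))]
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("the two centroids coincide, so no direction is defined")
    # Scalar position of each point along the unit-length axis.
    return {
        label: sum(x * a for x, a in zip(vec, axis)) / norm
        for label, vec in embeddings.items()
    }

# Invented toy vectors keyed by "title | genre" labels.
embeddings = {
    "Heat | action": [1.0, 0.0],
    "GoldenEye | action": [0.8, 0.2],
    "Sabrina | romance": [0.0, 1.0],
    "Sense and Sensibility | romance": [0.2, 0.9],
}
positions = custom_axis_positions(embeddings, "action", "romance")
for label, pos in sorted(positions.items(), key=lambda item: item[1]):
    print(label, round(pos, 3))
```

Points matching the left search end up at one end of the axis and points matching the right search at the other, so the axis can be read as an "action to romance" direction in this toy example.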

You can refer to this [doc](https://www.tensorflow.org/get_started/embedding_viz) for instructions on how to use and navigate through different panels available in TensorBoard.

## Visualize using PCA

The Embedding Projector computes the top 10 principal components. The menu at the left panel lets you project those components onto any combination of two or three. 
<img src="pca.png">
The above plot was made using the first two principal components, which together cover 36.5% of the total variance.
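
Assuming the variance figure reported is the usual explained-variance ratio (the variance along the chosen components divided by the total variance), the idea can be sketched on a toy 2D dataset, where the eigenvalues of the 2x2 covariance matrix have a closed form. The data points and the helper name are invented for this example, and this is not how the Projector itself computes PCA.

```python
import math

def explained_variance_ratio_2d(points):
    """Explained-variance ratios of the two principal components of 2D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    gap = math.sqrt(max(trace ** 2 - 4 * det, 0.0))
    first, second = (trace + gap) / 2, (trace - gap) / 2
    return first / trace, second / trace

# Invented points lying close to a straight line.
points = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.5), (8.0, 4.0)]
ratios = explained_variance_ratio_2d(points)
print(ratios)
```

Because these toy points are almost collinear, the first component captures nearly all of the variance. With 50-dimensional document vectors the variance is spread over many more directions, which is why two components cover a much smaller share.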


## Visualize using t-SNE

Data is visualized by animating through every iteration of the t-SNE algorithm. The t-SNE menu at the left lets you adjust the values of its two hyperparameters. The first one is **perplexity**, which is basically a measure of information. It may be viewed as a knob that sets the number of effective nearest neighbors<sup>[2]</sup>. The second one is the **learning rate**, which controls the step size of the optimization, i.e. how far the points are moved at each iteration.
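
The "effective number of neighbors" reading of perplexity follows from its definition as 2 raised to the Shannon entropy of a probability distribution. The short sketch below computes it for two invented neighbour distributions; it only illustrates the definition and is not part of any t-SNE implementation.

```python
import math

def perplexity(probs):
    """Perplexity of a discrete distribution: 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# Probability mass spread evenly over four neighbours.
uniform = [0.25, 0.25, 0.25, 0.25]
# Almost all of the mass on a single neighbour.
peaked = [0.97, 0.01, 0.01, 0.01]

print(perplexity(uniform))
print(perplexity(peaked))
```

A distribution spread evenly over four neighbours has perplexity 4, while one concentrated on a single neighbour has perplexity close to 1, so a higher perplexity setting asks t-SNE to take more neighbours into account for each point.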

<img src="tsne.png">

The above plot was generated with perplexity 8, learning rate 10 and 500 iterations. The results can vary between successive runs, so you may not get exactly the same plot with the same hyperparameter settings, but some small clusters should start forming as above, in different orientations.

# Conclusion

We learned how to visualize document embeddings with TensorBoard's Embedding Projector. It is a useful tool for visualizing different types of data, for example word embeddings, document embeddings, or gene expressions and biological sequences. It just needs 2D tensors as input, and then you can explore your data using the provided algorithms. You can also perform a nearest-neighbour search to find the data points most similar to your query point.

# References
 1. https://grouplens.org/datasets/movielens/
 2. https://lvdmaaten.github.io/tsne/
 

