# Evaluating a doc2vec model

What we're going to do in this exercise:
* load a pre-trained doc2vec model
* use it to infer document embeddings for our test set
* cluster the documents based on the cosine distances between their embeddings
* use t-SNE to visualize the data

In [24]:
import os
import numpy as np
from gensim.models import doc2vec
from gensim.utils import simple_preprocess
from nltk.cluster import kmeans
from nltk.cluster import util
import collections

In [25]:
# generic settings
HOMEDIR = './'

In [26]:
CORPUS_FILE = os.path.join(HOMEDIR, "data/corpus_train.txt")
MODEL_FILE_DM = os.path.join(HOMEDIR, "models/doc2vec_DM_v20171229.bin")
MODEL_FILE_DBOW = os.path.join(HOMEDIR, "models/doc2vec_DBOW_v20171229.bin")

NUM_CLUSTERS = 20  # yes, you can change this

### Read corpus file and parse into token lists

In [27]:
with open(CORPUS_FILE, 'r', encoding='utf-8') as f:
    lines = f.readlines()
    docs = [simple_preprocess(line, deacc=False, min_len=1) for line in lines]

### Read existing model and use it to derive document embeddings

In [58]:
# load pre-trained model
# model = doc2vec.Doc2Vec.load(MODEL_FILE_DM)  # DM model chosen by default
# model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)  # only keep what we need

In [6]:
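# *DO NOT RUN*: this import fails (see the ImportError below) because
# gensim.utils does not provide delete_temporary_training_data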
from gensim.utils import delete_temporary_training_data

model = doc2vec.Doc2Vec.load(MODEL_FILE_DM)  # Load pre-trained model

# Delete temporary training data
delete_temporary_training_data(model, keep_doctags_vectors=True, keep_inference=True)


ImportError: cannot import name 'delete_temporary_training_data' from 'gensim.utils' (c:\Users\uSER\anaconda3\lib\site-packages\gensim\utils.py)

In [28]:
from gensim.models import KeyedVectors

model = doc2vec.Doc2Vec.load(MODEL_FILE_DM)

# The calls above raise errors, so replace the stored document vectors manually instead
# del model.wv
# del model.dv
# del model.docvecs 
model.docvecs = KeyedVectors(vector_size=model.vector_size)



**_Exercise 2: Combine DM and DBOW models_**

**Note: don't start this exercise yet! First complete the rest of the notebook, then return here to do this exercise!**

The authors of the paper suggest that combining the DM and DBOW models works better than either model alone. Do this by concatenating the embeddings from both models (you could also try averaging or summing them).

In [14]:
# your code here...

In [22]:
# solution
# del model
model_dm = doc2vec.Doc2Vec.load(MODEL_FILE_DM)
model_dbow = doc2vec.Doc2Vec.load(MODEL_FILE_DBOW)

infer_epochs = 1000

docvecs_dm = [model_dm.infer_vector(d, alpha=0.01, epochs=infer_epochs) for d in docs]
docvecs_dbow = [model_dbow.infer_vector(d, alpha=0.01, epochs=infer_epochs) for d in docs]

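# note: `+` on NumPy vectors is an element-wise sum; use np.concatenate([dm, dbow]) to concatenate instead
docvecs = [dm + dbow for dm, dbow in zip(docvecs_dm, docvecs_dbow)]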
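
The exercise mentions three ways to combine the two embeddings. Because `infer_vector` returns NumPy arrays, the `+` in the solution adds the vectors element-wise rather than joining them. The sketch below contrasts the three options on toy vectors; the values and variable names are made up for illustration and do not come from the models.

```python
import numpy as np

# Toy stand-ins for one document's DM and DBOW vectors (hypothetical values).
vec_dm = np.array([0.2, -0.1, 0.4])
vec_dbow = np.array([0.6, 0.3, -0.2])

combined_concat = np.concatenate([vec_dm, vec_dbow])  # joins the vectors: 6 dimensions
combined_sum = vec_dm + vec_dbow                      # element-wise sum: still 3 dimensions
combined_mean = (vec_dm + vec_dbow) / 2               # element-wise average: still 3 dimensions

print(combined_concat.shape, combined_sum.shape, combined_mean.shape)
```

Concatenation keeps the information from both models separate at the cost of doubling the dimensionality; summing and averaging keep the original size but require both models to use the same vector size.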


=========== end of exercise ======================

In [29]:
# infer document vectors
infer_epochs = 1000
docvecs = [model.infer_vector(d, alpha=0.01, epochs=infer_epochs) for d in docs]


## Now that we have document vectors, start clustering

In [30]:
clusterer = kmeans.KMeansClusterer(NUM_CLUSTERS, distance=util.cosine_distance, repeats=3)

In [31]:
cluster_assignments = clusterer.cluster(docvecs, assign_clusters=True)
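
The clusterer above is configured with a cosine distance, which compares the direction of two vectors and ignores their length. As a rough illustration, here is a minimal NumPy sketch of that measure; it is written independently for this notebook and is not NLTK's own code.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 2.0])  # orthogonal to a

print(cosine_distance(a, b))  # vectors pointing the same way
print(cosine_distance(a, c))  # orthogonal vectors
```

Vectors that point the same way have distance 0 regardless of their length, and orthogonal vectors have distance 1.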

In [32]:
# how many documents per cluster?
collections.Counter(cluster_assignments)

Counter({9: 976,
         7: 5515,
         5: 884,
         18: 1849,
         13: 6582,
         12: 1555,
         3: 1877,
         15: 1278,
         11: 1896,
         14: 2085,
         0: 1189,
         16: 946,
         8: 1860,
         17: 1757,
         6: 812,
         1: 669,
         2: 1680,
         10: 1549,
         19: 1152,
         4: 851})

In [33]:
def get_documents_in_cluster(cluster_idx):
    return [doc for i, doc in enumerate(docs) if cluster_assignments[i] == cluster_idx]

In [34]:
def get_document_topics(doc_vec, topic_vecs):
    """
    For a given document, return the topic distribution
    (softmax probabilities over all topics).
    """
    similarities = np.array([np.dot(doc_vec, topic_vec) for topic_vec in topic_vecs])
    # subtract the maximum before exponentiating to avoid overflow; the result is unchanged
    exp_sims = np.exp(similarities - np.max(similarities))
    return exp_sims / np.sum(exp_sims)
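
To make the softmax step concrete, here is a small worked example on toy vectors. The vectors and the helper name `softmax` are assumptions for illustration only. The sketch subtracts the largest score before exponentiating, a common trick that prevents overflow without changing the probabilities.

```python
import numpy as np

def softmax(scores):
    """Turn a list of similarity scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    return exp_scores / exp_scores.sum()

# Hypothetical document vector and three topic centroids.
doc_vec = np.array([1.0, 2.0])
topic_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]

scores = [np.dot(doc_vec, t) for t in topic_vecs]  # dot products: 1.0, 2.0, 3.0
probs = softmax(scores)

print(probs, probs.sum())
print(int(np.argmax(probs)))  # index of the most likely topic
```

The topic with the largest dot product receives the highest probability, so the argmax of the distribution picks the same topic as the raw similarity scores would.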

You can define the topics as the cluster centroids. Then find the nearest-neighbor words to describe the topic.

In [35]:
topic_vecs = clusterer.means()
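
One way to describe a topic is to rank words by the cosine similarity between their vectors and the cluster centroid. The sketch below does this with plain NumPy on a toy vocabulary; the words, vectors, and the helper `nearest_words` are hypothetical. With the loaded model, gensim's `model.wv.similar_by_vector(topic_vec, topn=10)` should give a comparable ranking, assuming the centroids live in the same space as the word vectors.

```python
import numpy as np

# Hypothetical toy vocabulary and word vectors; in the notebook these would come from the model.
vocab = ["goal", "match", "election", "vote"]
word_vecs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

def nearest_words(topic_vec, word_vecs, vocab, topn=2):
    """Return the topn (word, similarity) pairs with the highest cosine similarity to topic_vec."""
    word_norms = np.linalg.norm(word_vecs, axis=1)
    sims = word_vecs @ topic_vec / (word_norms * np.linalg.norm(topic_vec))
    best = np.argsort(-sims)[:topn]  # indices sorted by descending similarity
    return [(vocab[i], float(sims[i])) for i in best]

topic_vec = np.array([1.0, 0.0])  # stand-in for one cluster centroid
top_words = nearest_words(topic_vec, word_vecs, vocab)
print(top_words)
```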

## Visualize topics using t-SNE

What we're going to do now:
* reduce the 100-dimensional vector space to 2 dimensions
* plot all documents in this 2D space
* use color to show the clustering
* inspect how close together or far apart certain documents are

In [36]:
from sklearn.manifold import TSNE
import bokeh.plotting as bp
from bokeh.models import HoverTool
from bokeh.io import push_notebook, output_notebook, show

In [37]:
docs_tsne = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(docvecs)
docs_tsne.shape



(36962, 2)

In [17]:
# create matrix with topic proportion per doc per topic
doc_topic_matrix = [get_document_topics(docvec, topic_vecs) for docvec in docvecs]
# select highest topic prob
prob_max_topic = np.max(doc_topic_matrix, axis=1)

In [18]:
# 20 colors, one per cluster (extend this list if NUM_CLUSTERS > 20)
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

In [19]:
sourcedata = {
    'x': docs_tsne[:, 0],
    'y': docs_tsne[:, 1],
    'color': colormap[cluster_assignments],
    'alpha': prob_max_topic * 50,  # despite the name, used as the marker size in the plot
    'content': lines,
    'topic_key': cluster_assignments
}

### Make and show the plot

In [20]:
tsne_plot = bp.figure(plot_width=1600, plot_height=900,
                      title="Topics",
                      tools="pan,wheel_zoom,box_zoom,reset,hover",
                      x_axis_type=None, y_axis_type=None, min_border=1)

tsne_plot.scatter(x='x', 
                  y='y',
                  color='color',
                  size='alpha',
                  #size=10,
                  source=bp.ColumnDataSource(sourcedata)
                 )

# add hover tooltips
hover = tsne_plot.select(dict(type=HoverTool))
hover.tooltips = {"content": "@content - topic: @topic_key"}

show(tsne_plot)

**Ok. Now go back up and start exercise 2 and see if it's an improvement!**