# doc2vec with gensim

This notebook works through a toy example to demonstrate the basic commands needed to train a doc2vec model with gensim. It uses Python 2.7 and gensim version 2.2.0.

doc2vec is an extension of word2vec which converts documents into vectors. A "document" here can be any sequence of words: a paragraph, a comment, or a whole document. This notebook only demonstrates how to use the technique; it does not explain how it works. For details consult the paper by Le and Mikolov: https://arxiv.org/abs/1405.4053.

## Documents to vectors
In this first section we will show how to produce a vector for each document. First load the required packages.

In [1]:
import re
import numpy as np
import pandas as pd

from gensim.models.doc2vec import TaggedDocument
from gensim.models.doc2vec import Doc2Vec

### Define helper functions
Next we need to define a couple of helper functions: one which splits ('tokenises') a string into a list of words, and one which puts our data into the data structure gensim is expecting.

In [2]:
# Tokenise on whitespace
def tokeniser(text):
    return re.findall(r"\S+", text)

# Create labelled sentence data structure
def labeller_id(ID, text):
    return TaggedDocument(tokeniser(text), [ID])

Let's see what the tokeniser does:

In [3]:
tokeniser('an example string of text')

['an', 'example', 'string', 'of', 'text']

This is a very simple tokeniser; much more advanced ones are available.
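As one illustration of a slightly more careful tokeniser, the sketch below lowercases the text and keeps only runs of letters, digits and apostrophes, so that 'text?' and 'text!' map to the same token. The name `tokeniser_words` and the regular expression are assumptions made for this example; they are not part of gensim and are not used in the rest of the notebook.

```python
import re

# Hypothetical alternative to the whitespace tokeniser:
# lowercase the text, then keep runs of letters, digits and apostrophes.
def tokeniser_words(text):
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokeniser_words('Even more text!')
```

Whether to strip punctuation depends on the task: a question mark, for instance, can carry signal.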

Doc2vec expects data in the 'TaggedDocument' format, which pairs a list of the words in a document with a list of IDs (tags). In this case we are creating one vector per document, so each document needs a single, unique ID. For example:

In [4]:
ex_document = 'an example document'
ex_id = 'doc1'
labeller_id(ex_id, ex_document)

TaggedDocument(words=['an', 'example', 'document'], tags=['doc1'])
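To make the structure concrete: gensim's `TaggedDocument` is a named tuple with two fields, `words` and `tags`. The sketch below uses a stand-in built with `collections.namedtuple` so that it runs without gensim. The name `TaggedDocumentSketch` is hypothetical; in real code, import `TaggedDocument` from gensim as shown earlier.

```python
import re
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, for illustration only.
TaggedDocumentSketch = namedtuple('TaggedDocumentSketch', ['words', 'tags'])

def tokeniser(text):
    # Same whitespace tokeniser as in the notebook.
    return re.findall(r"\S+", text)

doc = TaggedDocumentSketch(tokeniser('an example document'), ['doc1'])

# Fields can be read by name or unpacked like an ordinary tuple.
words, tags = doc
```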

### Getting data into the right format
Let's consider the example where we have text within a pandas dataframe.

In [5]:
# Create example data
df = pd.DataFrame()
df['text'] = [
                'this is text', 
                'any more text?', 
                'even more text!', 
                'words, all words'
            ]

We can use the dataframe index as the document IDs and convert all the documents into the required TaggedDocument format.

In [6]:
# Get data into correct format
documents = df.apply(lambda x: labeller_id(str(x.name), x['text']), axis=1).values
documents

array([TaggedDocument(words=['this', 'is', 'text'], tags=['0']),
       TaggedDocument(words=['any', 'more', 'text?'], tags=['1']),
       TaggedDocument(words=['even', 'more', 'text!'], tags=['2']),
       TaggedDocument(words=['words,', 'all', 'words'], tags=['3'])], dtype=object)

### Build the model
Now we can build the doc2vec model and then delete some of the items created by gensim during the model build in order to free up memory.

In [7]:
# Build model
model = Doc2Vec(documents = documents,
        size = 2,
        seed = 42,
        min_count = 1,
        max_vocab_size = None,
        window = 5,
        iter = 5)

# Free up memory
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

We can retrieve the vectors produced for each of the documents:

In [8]:
model.docvecs.doctag_syn0

array([[-0.03483918, -0.16661185],
       [ 0.18773492,  0.22185083],
       [-0.12609929,  0.06258722],
       [ 0.11312576,  0.16608955]], dtype=float32)

And the IDs associated with the vectors:

In [9]:
model.docvecs.doctags.keys()

['1', '0', '3', '2']

Note that `doctags` is a dictionary, so its keys come back in arbitrary order rather than in the order of the original dataframe. The text for each ID can still be looked up in the dataframe by converting the keys back to integers:

In [10]:
df.loc[map(int, model.docvecs.doctags.keys()), 'text']

1      any more text?
0        this is text
3    words, all words
2     even more text!
Name: text, dtype: object

### Using the model
Vectors for new documents can be inferred:

In [11]:
new_sentence = ['one', 'more', 'sentence']
model.infer_vector(new_sentence)

array([-0.24704833,  0.23743619], dtype=float32)

As can the vectors for multiple sentences:

In [12]:
new_sentences = [ ['first', 'text'], ['second', 'test']]
np.matrix(map(model.infer_vector, new_sentences))

matrix([[ 0.07819276, -0.06925946],
        [ 0.01494676, -0.2055566 ]], dtype=float32)
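One common thing to do with such vectors is to compare them. The sketch below computes the cosine similarity between the two inferred vectors printed above, using plain Python. The helper `cosine_similarity` is written for this example and is not a gensim function; the numbers are copied from the output above.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Inferred vectors for ['first', 'text'] and ['second', 'test'].
first = [0.07819276, -0.06925946]
second = [0.01494676, -0.2055566]

similarity = cosine_similarity(first, second)
```

With only two dimensions and a handful of training documents the value itself carries no real meaning; the point is the mechanics.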

## Topics to vectors

Another way in which doc2vec can be used is to learn vectors for topics rather than for individual documents. To do this, pass the model a list of topics for each document in place of a unique document ID. Each document can have multiple topics.

Let's demonstrate. First we need to tweak our labeller function so that it takes a list of topics as input rather than a single ID. This reuses the tokeniser function defined previously.

In [13]:
# Create labelled sentence data structure when given topic list
def labeller_topics(topic_list, text):
    return TaggedDocument(tokeniser(text), topic_list)

Next define some example data where there are two topics: 'question' and 'cats'.

In [14]:
# Create example data
df2 = pd.DataFrame()
df2['text'] = [
    'is there any more text?',
    'where are the words of the documents',
    'cats like to chase mice',
    'cats do not like dogs',
    'do you like cats?'
    ]

# Each document can have multiple topics associated with it (list of topics per document)
df2['topic'] = [
    ['question'],
    ['question'],
    ['cats'],
    ['cats'],
    ['question', 'cats']
    ]

df2

|   | text | topic |
|---|------|-------|
| 0 | is there any more text? | [question] |
| 1 | where are the words of the documents | [question] |
| 2 | cats like to chase mice | [cats] |
| 3 | cats do not like dogs | [cats] |
| 4 | do you like cats? | [question, cats] |


This can be converted into TaggedDocument form using the new labeller_topics function.

In [15]:
documents2 = df2.apply(lambda x: labeller_topics(x['topic'], x['text']), axis=1).values

Then a doc2vec model can be built as before.

In [16]:
# Build model
model = Doc2Vec(documents = documents2,
        size = 2,
        seed = 42,
        min_count = 1,
        max_vocab_size = None,
        window = 5,
        iter = 5)

# Free up memory
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

Let's look at the vectors the model has learned:

In [17]:
model.docvecs.doctag_syn0

array([[-0.03462205, -0.05810256],
       [ 0.10018398,  0.24293993]], dtype=float32)

Note that only two vectors have been produced. This is encouraging, as we only had two topics. Checking the tags of these vectors gives the two topics we were expecting.

In [18]:
model.docvecs.doctags.keys()

['cats', 'question']
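Two vectors are produced because doc2vec learns one vector per distinct tag rather than one per document. A quick way to anticipate how many vectors to expect is to count the distinct tags in the data. This is sketched below in plain Python, using the topic lists from the example (the variable names are chosen for this illustration).

```python
# Topic lists copied from the example dataframe.
topic_lists = [
    ['question'],
    ['question'],
    ['cats'],
    ['cats'],
    ['question', 'cats'],
]

# Collect every distinct tag across all documents.
unique_topics = sorted({topic for topics in topic_lists for topic in topics})
n_vectors_expected = len(unique_topics)
```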