### Background

As discussed in this [PR](https://github.com/RaRe-Technologies/gensim/pull/1434), the Translation Matrix can be used not only to translate words from a source language to a target language, but also to map new document vectors back into an old model's vector space.

For example, suppose we have trained a doc2vec model on 15k documents (call it model1) and now want to train a second doc2vec model on 35k new documents (call it model2). We include the original 15k documents as reference documents alongside the 35k new ones when training model2. This gives 15k document vectors from model1 and 50k document vectors from model2, and both models have vectors for the 15k reference documents. We use those shared vectors to learn a linear mapping between the two models' vector spaces. With this mapping, we can map model2's vectors for the 35k new documents back into model1's space, so vectors for those 35k documents are obtained in model1's space without retraining model1.
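
The idea can be sketched with NumPy on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the toy sizes, the random vectors, and the names `new_space`, `old_space`, and `true_map` are all assumptions made up for the example, and the two spaces are assumed to be related by a linear map plus a little noise.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy data: 200 "documents" with 10-dimensional vectors in the new space.
n_docs, n_ref, dim = 200, 50, 10
new_space = rng.randn(n_docs, dim)

# Assumption: the old space is a fixed linear transform of the new one, plus small noise.
true_map = rng.randn(dim, dim)
old_space = new_space.dot(true_map) + 0.01 * rng.randn(n_docs, dim)

# Only the first n_ref documents are known in both spaces (the reference documents).
ref_new, ref_old = new_space[:n_ref], old_space[:n_ref]

# Least-squares fit: find tm that minimises ||ref_new . tm - ref_old||.
tm = np.linalg.lstsq(ref_new, ref_old, rcond=None)[0]

# Back-map the remaining documents from the new space into the old space.
mapped = new_space[n_ref:].dot(tm)

# Mean absolute error against the old-space vectors (unknown in a real setting).
error = np.abs(mapped - old_space[n_ref:]).mean()
print(error)
```

With real doc2vec models the relation between the two spaces is only approximately linear, so the mapped vectors should be treated as estimates.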

In this notebook, we use the IMDB dataset as an example. For more information about this dataset, see [this page](http://ai.stanford.edu/~amaas/data/sentiment/). Some of the code is borrowed from this [notebook](http://localhost:8888/notebooks/docs/notebooks/doc2vec-IMDB.ipynb).

In [2]:
import gensim
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from collections import namedtuple
from gensim import utils

def read_sentimentDocs():
    SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

    alldocs = []  # will hold all docs in original order
    with utils.smart_open('aclImdb/alldata-id.txt', encoding='utf-8') as alldata:
        for line_no, line in enumerate(alldata):
            tokens = gensim.utils.to_unicode(line).split()
            words = tokens[1:]
            tags = [line_no] # `tags = [tokens[0]]` would also work at extra memory cost
            split = ['train','test','extra','extra'][line_no//25000]  # 25k train, 25k test, 50k extra
            sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no//12500] # [12.5K pos, 12.5K neg]*2 then unknown
            alldocs.append(SentimentDocument(words, tags, split, sentiment))

    train_docs = [doc for doc in alldocs if doc.split == 'train']
    test_docs = [doc for doc in alldocs if doc.split == 'test']
    doc_list = alldocs[:]  # for reshuffling per pass

    print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))

    return train_docs, test_docs, doc_list

train_docs, test_docs, doc_list = read_sentimentDocs()

print('%d %d %d' % (len(train_docs), len(test_docs), len(doc_list)))

100000 docs: 25000 train-sentiment, 25000 test-sentiment
25000 25000 100000


In [None]:
# Not run in this notebook because of limited compute; the saved models are loaded in a later cell
import multiprocessing
from random import shuffle

cores = multiprocessing.cpu_count()
model1 = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores)
model2 = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores)

small_train_docs = train_docs[:15000]
# train for small corpus
model1.build_vocab(small_train_docs)
for epoch in range(50):
    shuffle(small_train_docs)
    model1.train(small_train_docs, total_examples=len(small_train_docs), epochs=1)
# save the model trained on the small corpus
model1.save("small_doc_15000_iter50.bin")

train_docs.extend(test_docs)
# train for large corpus
model2.build_vocab(train_docs)
for epoch in range(50):
    shuffle(train_docs)
    model2.train(train_docs, total_examples=len(train_docs), epochs=1)
# save the model
model2.save("large_doc_50000_iter50.bin")

To evaluate the document vectors, we split the 50k labelled documents into two parts: one for training a classifier and the other for testing it.
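
The split relies on the document order in `alldata-id.txt`. As a compact sketch, assuming the layout implied by the reader function above (12.5k positive and 12.5k negative training documents, followed by 12.5k positive and 12.5k negative test documents), the label arrays could be built like this. The names `block`, `labels`, `train_idx`, and `test_idx` are illustrative only.

```python
import numpy as np

# Assumed layout of the first 50k documents:
# 12.5k train-positive, 12.5k train-negative, 12.5k test-positive, 12.5k test-negative.
block = 12500
labels = np.repeat([1, 0, 1, 0], block)      # 1-D array of shape (50000,)

train_idx = np.arange(0, 2 * block)          # documents 0..24999
test_idx = np.arange(2 * block, 4 * block)   # documents 25000..49999

train_label, test_label = labels[train_idx], labels[test_idx]
print(train_label.sum(), test_label.sum())
```

Keeping the labels one-dimensional matches the shape scikit-learn classifiers expect for `y`.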

In [9]:
import os
import numpy as np
from sklearn.linear_model import LogisticRegression

# you can change the data folder
basedir = "/home/robotcator/doc2vec"

def test_classifier_error(train, train_label, test, test_label):
    classifier = LogisticRegression()
    classifier.fit(train, train_label)
    score = classifier.score(test, test_label)
    print("the classifier score : %s" % score)
    return score

model1 = Doc2Vec.load(os.path.join(basedir, "small_doc_15000_iter50.bin"))
model2 = Doc2Vec.load(os.path.join(basedir, "large_doc_50000_iter50.bin"))

l = model1.docvecs.count
l2 = model2.docvecs.count

m1 = np.array([model1.docvecs[i] for i in range(l)])
m2 = np.array([model2.docvecs[i] for i in range(l)])

# learn the least-squares mapping from model2's space to model1's space
tm = np.linalg.lstsq(m2, m1, -1)[0]

# back-map the remaining model2 doc vectors into model1's space in one matrix product
new_vecs = np.array([model2.docvecs[i] for i in range(l, l2)])
m1 = np.vstack((m1, np.dot(new_vecs, tm)))

train_array = np.zeros((25000, 100))
train_label = np.zeros((25000, 1))
test_array = np.zeros((25000, 100))
test_label = np.zeros((25000, 1))

# documents are ordered by label: 0-12499 positive (train), 12500-24999 negative (train),
# 25000-37499 positive (test), 37500-49999 negative (test)
for i in range(12500):
    train_array[i] = m1[i]
    train_label[i] = 1

    train_array[i+12500] = m1[i+12500]
    train_label[i+12500] = 0

    test_array[i] = m1[i+25000]
    test_label[i] = 1

    test_array[i+12500] = m1[i+37500]
    test_label[i+12500] = 0

# ravel() turns the (n, 1) label columns into the 1-D arrays scikit-learn expects
test_classifier_error(train_array, train_label.ravel(), test_array, test_label.ravel())


the classifier score : 0.79768


0.79767999999999994