# Gensim `Doc2Vec` Tutorial on the Hindi Wikipedia Dataset
This tutorial is based on the original <a href="https://github.com/RaRe-Technologies/gensim/blob/ca0dcaa1eca8b1764f6456adac5719309e0d8e6d/docs/notebooks/doc2vec-IMDB.ipynb">Gensim tutorial for Doc2Vec</a>.


## Introduction

In this tutorial, we will learn how to apply `Doc2Vec` with gensim to the Hindi Wikipedia dataset, following the Paragraph Vector approach of <a href="https://arxiv.org/pdf/1405.4053.pdf">Le and Mikolov 2014</a>.

### Outline: 
* Download the Data.
* Clean the Data.
* Tokenize the Data.
* Build the Vocab.
* Train the Doc2Vec Model to generate the Word Embeddings.
* Explore the Word and Doc Embeddings
* Most Similar Words and Docs

### `Word2Vec`
`Word2Vec` is a model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant from each other have differing meanings. For example, `strong` and `powerful` would be close together, while `strong` and `Paris` would be relatively far apart. There are two versions of this model, based on skip-grams (SG) and continuous-bag-of-words (CBOW), both implemented by the gensim `Word2Vec` class.
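
To make "close together in vector space" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and their values are invented for illustration; real `Word2Vec` vectors are learned from data and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any trained model
toy_vectors = {
    'strong':   [0.9, 0.8, 0.1],
    'powerful': [0.8, 0.9, 0.2],
    'Paris':    [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors['strong'], toy_vectors['powerful'])
sim_far = cosine_similarity(toy_vectors['strong'], toy_vectors['Paris'])
print(sim_close > sim_far)
```

Cosine similarity is also the measure gensim reports in its `most_similar` results later in this tutorial.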


But, Word2Vec doesn't yet get us fixed-size vectors for longer texts.


### Paragraph Vector, aka gensim `Doc2Vec`
The straightforward approach of averaging each of a text's words' word-vectors creates a quick and crude document-vector that can often be useful. However, Le and Mikolov in 2014 introduced the <i>Paragraph Vector</i>, which usually outperforms such simple-averaging.
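
The averaging baseline can be sketched in a few lines. The two-dimensional word vectors below are hypothetical placeholders; with a trained model they would come from its word-vector table instead.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 2-d word vectors for a three-word document
toy_word_vectors = {
    'मंगल': [1.0, 0.0],
    'लाल': [0.0, 1.0],
    'ग्रह': [1.0, 1.0],
}
doc_words = ['मंगल', 'लाल', 'ग्रह']
doc_vector = average_vectors([toy_word_vectors[w] for w in doc_words])
print(doc_vector)
```

Note that the average is the same for any ordering of the words, which is one reason this document-vector is only a crude summary.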

The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. Gensim's `Doc2Vec` class implements this algorithm. 

#### Paragraph Vector - Distributed Memory (PV-DM)
This is the Paragraph Vector model analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector.

#### Paragraph Vector - Distributed Bag of Words (PV-DBOW)
This is the Paragraph Vector model analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)
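
The difference between the two modes is easiest to see in the training examples they generate. The sketch below only lists which inputs predict which target word; the function names are hypothetical, and it leaves out everything else gensim does (random window shrinking, negative sampling, the network itself).

```python
def pv_dm_examples(doc_tag, words, window=2):
    """PV-DM: the doc tag plus the surrounding context words predict the center word."""
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        yield ([doc_tag] + context, target)

def pv_dbow_examples(doc_tag, words):
    """PV-DBOW: the doc tag alone predicts each word of the document."""
    for target in words:
        yield ([doc_tag], target)

words = ['a', 'b', 'c', 'd']
dm = list(pv_dm_examples('DOC_0', words, window=1))
dbow = list(pv_dbow_examples('DOC_0', words))

for inputs, target in dm:
    print('PV-DM  ', inputs, '->', target)
for inputs, target in dbow:
    print('PV-DBOW', inputs, '->', target)
```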

### Requirements
The following python modules are dependencies for this tutorial:
* testfixtures ( `pip install testfixtures` )
* statsmodels ( `pip install statsmodels` )

## Load corpus

We use the Hindi Wikipedia data (47 MB download) as the text corpus for this tutorial.
The data can be found here: https://www.dropbox.com/s/p8bx1k3rn0b964r/hindi-wiki-data.7z?dl=0

The cells below do not download anything; they only perform the later steps, such as cleaning the data. Download the archive and extract it into the directory of this notebook (so that a `hindi-wiki-data` folder exists there) before running them for the first time.
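
Before running the cleaning step, it can help to confirm that the archive was extracted where the notebook expects it. This is a small sketch; the helper name `missing_folders` is hypothetical, while the folder names `train`, `test` and `valid` are the ones the cleaning cell iterates over.

```python
import os

def missing_folders(dirname, folders=('train', 'test', 'valid')):
    """Return the expected sub-folders that are not present under dirname."""
    return [f for f in folders if not os.path.isdir(os.path.join(dirname, f))]

missing = missing_folders('hindi-wiki-data')
if missing:
    print("Missing folders, extract the archive first: %s" % ", ".join(missing))
else:
    print("Dataset layout looks complete.")
```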

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Clean the Data

In [2]:
%%time 

import locale
import glob
import os.path
import requests
import tarfile
import sys
import codecs
from smart_open import smart_open
import re

dirname = 'hindi-wiki-data'
locale.setlocale(locale.LC_ALL, 'C')
all_lines = []

# U+0085 (NEXT LINE) is a control character that we replace with a space
if sys.version_info[0] >= 3:
    control_chars = [chr(0x85)]
else:
    control_chars = [unichr(0x85)]

# Convert text to lower-case and pad punctuation with spaces so it splits into separate tokens
def normalize_text(text):
    norm_text = text.lower()
    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')
    # Pad punctuation with spaces on both sides
    norm_text = re.sub(r"([\.\",\(\)!\?;:])", " \\1 ", norm_text)
    return norm_text

if not os.path.isfile('hindi-wiki-data/alldata-id.txt'):
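    if not os.path.isdir(dirname):
        # Nothing to clean without the extracted archive, so stop with a clear message
        raise IOError("dataset directory '%s' not found; download and extract the archive first" % dirname)
    print("dataset directory found, no download needed.")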

    # Collect & normalize test/train data
    print("Cleaning up dataset...")
    folders = ['train','test', 'valid']
    for fol in folders:
        temp = u''
        newline = "\n".encode("utf-8")
        output = fol.replace('/', '-') + '.txt'
        # Is there a better pattern to use?
        txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
        print(" %s: %i files" % (fol, len(txt_files)))
        with smart_open(os.path.join(dirname, output), "wb") as n:
            for i, txt in enumerate(txt_files):
                with smart_open(txt, "rb") as t:
                    one_text = t.read().decode("utf-8")
                    for c in control_chars:
                        one_text = one_text.replace(c, ' ')
                    one_text = normalize_text(one_text)
                    all_lines.append(one_text)
                    n.write(one_text.encode("utf-8"))
                    n.write(newline)

    # Save to disk for instant re-use on any future runs
    with smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:
        for idx, line in enumerate(all_lines):
            num_line = u"_*{0} {1}\n".format(idx, line)
            f.write(num_line.encode("utf-8"))

assert os.path.isfile("hindi-wiki-data/alldata-id.txt"), "alldata-id.txt unavailable"
print("Success, alldata-id.txt is available for next steps.")

Success, alldata-id.txt is available for next steps.
Wall time: 810 ms
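
To see what the cleaning step does to a sentence, the sketch below repeats `normalize_text` from the cell above and applies it to a short sample. The punctuation class does not include the danda (U+0964), which is why tokens such as `'है।'` keep it attached in the documents shown later. The second function, `normalize_text_with_danda`, is a hypothetical variant and not part of the original pipeline.

```python
import re

def normalize_text(text):
    # Same logic as the cleaning cell above
    norm_text = text.lower()
    norm_text = norm_text.replace('<br />', ' ')
    norm_text = re.sub(r"([\.\",\(\)!\?;:])", " \\1 ", norm_text)
    return norm_text

def normalize_text_with_danda(text):
    # Hypothetical variant: also pad the danda (U+0964) and double danda (U+0965)
    return re.sub(u"([\u0964\u0965])", u" \\1 ", normalize_text(text))

sample = u'मंगल, एक ग्रह है\u0964'
tokens_original = normalize_text(sample).split()
tokens_variant = normalize_text_with_danda(sample).split()
print(tokens_original)
print(tokens_variant)
```

Whether splitting off the danda helps will depend on the downstream task; the results in this tutorial were produced without it.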


## Tokenize the Documents 

In [3]:
%%time

import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple

# This lightweight namedtuple suffices as a `TaggedDocument` (it has `words` and `tags`).
# The name `SentimentDocument` is kept from the original IMDB tutorial.
SentimentDocument = namedtuple('Document', 'words tags')

alldocs = []
with smart_open('hindi-wiki-data/alldata-id.txt', 'rb', encoding='utf-8') as alldata:
    for line_no, line in enumerate(alldata):
        tokens = gensim.utils.to_unicode(line).split()
        words = tokens[1:]
        tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
        alldocs.append(SentimentDocument(words, tags))

print('%d docs: ' % (len(alldocs)))

601230 docs: 
Wall time: 24.1 s
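
The parsing done in the loop above can be shown on a single line. This sketch uses a hypothetical helper `parse_line` and a made-up input line in the `_*<id> <text>` format written by the cleaning step.

```python
from collections import namedtuple

Document = namedtuple('Document', 'words tags')

def parse_line(line_no, line):
    """Split one 'alldata-id.txt' line into words and an integer tag."""
    tokens = line.split()
    # tokens[0] is the '_*<id>' marker written by the cleaning step, so it is dropped
    return Document(words=tokens[1:], tags=[line_no])

doc = parse_line(0, u'_*0 मंगल एक ग्रह है\n')
print(doc)
```

Using the integer line number as the tag lets gensim store doc-vectors in a plain array instead of a string-keyed lookup, which is the memory saving mentioned in the code comment.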


### Quick Check of how the Document looks now.

In [8]:
from random import shuffle
doc_list = alldocs[:]  
shuffle(doc_list)
doc_list[50000]

Document(words=['क्षेत्र', 'की', 'कमी', 'और', 'मंगल', 'का', 'अत्यंत', 'पतला', 'वायुमंडल', 'एक', 'चुनौती', 'है', ':', 'इस', 'ग्रह', 'के', 'पास', ',', 'अपनी', 'सतह', 'के', 'आरपार', 'मामूली', 'ताप', 'संचरण', ',', 'सौर', 'वायु', 'के', 'हमले', 'के', 'खिलाफ', 'कमजोर', 'अवरोधक', 'और', 'पानी', 'को', 'तरल', 'रूप', 'में', 'बनाए', 'रखने', 'के', 'लिए', 'अपर्याप्त', 'वायु', 'मंडलीय', 'दाब', 'है।', 'मंगल', 'भी', 'करीब', 'करीब', ',', 'या', 'शायद', 'पूरी', 'तरह', 'से', ',', 'भूवैज्ञानिक', 'रूप', 'से', 'मृत', 'है', ';', 'ज्वालामुखी', 'गतिविधि', 'के', 'अंत', 'ने', 'उपरी', 'तौर', 'पर', 'ग्रह', 'के', 'भीतर', 'और', 'सतह', 'के', 'बीच', 'में', 'रसायनों', 'और', 'खनिजों', 'के', 'पुनर्चक्रण', '(', 'रीसाइक्लिंग', ')', 'को', 'बंद', 'कर', 'दिया', 'है।'], tags=[34993])

## Set-up Doc2Vec Training & Build Vocab


We vary the following parameter choices:
* 100-dimensional vectors, as the 400-d vectors of the paper take a lot of memory and, in our tests of this task, don't seem to offer much benefit
* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`
* Added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)
* A `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)
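
The effect of `min_count` can be illustrated without gensim. This is a sketch of the thresholding idea only, with a hypothetical helper name; gensim applies the equivalent filter internally when building the vocabulary.

```python
from collections import Counter

def surviving_vocab(docs, min_count=2):
    """Keep words whose total corpus frequency is at least min_count."""
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, count in counts.items() if count >= min_count}

toy_docs = [['a', 'b', 'c'], ['a', 'b'], ['a', 'd']]
vocab = surviving_vocab(toy_docs, min_count=2)
print(sorted(vocab))
```

Here `c` and `d` each occur once in the whole corpus, so they are dropped, while `a` and `b` survive.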

In [9]:
%%time
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0, 
            epochs=20, workers=cores),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0, 
            epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05'),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=5, negative=5, hs=0, min_count=2, sample=0, 
            epochs=20, workers=cores),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print("%s vocabulary scanned & state initialized" % model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

Doc2Vec(dbow,d100,n5,mc2,t8) vocabulary scanned & state initialized
Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8) vocabulary scanned & state initialized
Doc2Vec(dm/c,d100,n5,w5,mc2,t8) vocabulary scanned & state initialized
Wall time: 2min 2s


Le and Mikolov note that combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance. To follow them, the models can be paired together for evaluation by concatenating the paragraph vectors obtained from each model, with the help of a thin wrapper class included in a gensim test module. (Note that this is a separate, later concatenation of output-vectors, distinct from the input-window concatenation enabled by the `dm_concat=1` mode above.)
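
Output-vector concatenation itself is simple: the two doc-vectors for the same document are joined end to end. The sketch below uses hypothetical three-dimensional vectors in place of the 100-d vectors of the real models. The wrapper class mentioned above is, as far as we know, `ConcatenatedDoc2Vec` in `gensim.test.test_doc2vec`; treat that location as an assumption and check it against your gensim version.

```python
def concatenate_docvecs(dbow_vec, dm_vec):
    """Join two doc-vectors for the same document end to end."""
    return list(dbow_vec) + list(dm_vec)

# Hypothetical 3-d vectors standing in for the 100-d vectors of the real models
dbow_vec = [0.1, 0.2, 0.3]
dm_vec = [0.4, 0.5, 0.6]
combined = concatenate_docvecs(dbow_vec, dm_vec)
print(len(combined), combined)
```

With two 100-d models, the combined vector for each document would have 200 dimensions.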

## Training & Evaluation

Note that doc-vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.

(On a 4-core 1.6 GHz Intel Core i5, training the 3 main models for 20 passes each takes about 3 hours.)

In [10]:
for model in simple_models: 
    print("Training %s" % model)
    %time model.train(doc_list, total_examples=len(doc_list), epochs=model.epochs)    
 

Training Doc2Vec(dbow,d100,n5,mc2,t8)
Wall time: 38min 36s
Training Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)
Wall time: 1h 19min 44s
Training Doc2Vec(dm/c,d100,n5,w5,mc2,t8)
Wall time: 42min 25s


## Examining Results

### Are inferred vectors close to the precalculated ones?

In [19]:
import numpy as np
doc_id = np.random.randint(simple_models[0].docvecs.count)  # Pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    #inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    inferred_docvec = model.infer_vector(['अमेरिका', 'ईरान', 'जवाब','ट्रंप', 'बताया', 'सूट-बूट', 'आतंकी'])
    print(inferred_docvec)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=1)))

for doc 288084...
[-0.00299281  0.06446436 -0.087025   -0.00240485 -0.01394253 -0.20658462
 -0.11662332  0.03342903  0.08690358  0.08469858 -0.10076504 -0.07633572
  0.05571734 -0.1240387  -0.08273178 -0.0545119  -0.2659948  -0.09168988
  0.11924595  0.28239778  0.21871158  0.08158211  0.01639976  0.12588635
  0.09813615  0.08310021  0.285711   -0.15112688  0.04318526  0.1822056
  0.10976867 -0.00807587 -0.1847904  -0.05084115 -0.0724686   0.1298497
 -0.26680115 -0.05710117  0.02228931 -0.37422612  0.09797944 -0.02640666
  0.08247092 -0.17209062 -0.3326172  -0.02629808  0.4261826   0.1485871
  0.22895584 -0.2637845   0.1261971   0.05094935 -0.2904289   0.13632144
 -0.16419807 -0.1945721  -0.23614773 -0.02044909  0.1173032   0.35706726
  0.13505965 -0.10583614 -0.01803126 -0.1905576   0.37404984  0.07108288
 -0.09450726 -0.14552376 -0.35908136  0.14968678  0.09081653  0.00218842
 -0.14549948 -0.0023069  -0.01853857  0.1493949  -0.24878217  0.14314504
  0.42944664  0.09150154 -0.21470235

(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words. Defaults for inference may benefit from tuning for each dataset or model parameters.)
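
One way to quantify "one of the closest" is to compute the rank of a document's stored vector among all stored vectors, ordered by similarity to the freshly inferred vector. The sketch below does this on hypothetical two-dimensional vectors; `rank_of_doc` is an illustrative helper, not a gensim function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_of_doc(stored_vectors, inferred, doc_id):
    """Rank (0 = closest) of doc_id among stored vectors by similarity to inferred."""
    order = sorted(stored_vectors,
                   key=lambda tag: cosine(stored_vectors[tag], inferred),
                   reverse=True)
    return order.index(doc_id)

# Hypothetical stored doc-vectors keyed by tag, and a vector "inferred" for doc 0
stored = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
inferred_for_doc_0 = [0.9, 0.1]
rank = rank_of_doc(stored, inferred_for_doc_0, 0)
print(rank)
```

A rank of 0 means the document is its own nearest neighbour. Over many documents, a distribution of ranks concentrated near 0 suggests that inference and training agree.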

Each model's `docvecs` attribute contains the paragraph vectors learned from the training data. There is one such vector for each unique document tag supplied during training, and each can be accessed using its tag as an index key, for example the tag 245220 if one of the training documents used it. The cell below also saves each trained model to disk:

In [29]:
# Look up a single stored doc-vector by its (integer) tag, e.g.:
# simple_models[0].docvecs[245220]

# Save each trained model to its own file
for i, model in enumerate(simple_models):
    model.save('hindi-wiki-data_model' + str(i) + '_docvec.d2v')

### Do close documents seem more related than distant ones?

In [47]:
import random
import numpy as np

doc_id = np.random.randint(simple_models[1].docvecs.count)  # pick random doc, re-run cell for more examples #155234
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))

TARGET (443028): «प्रौद्योगिकी संस्थान , ताड़ेपल्लीगुड़म आंध्र प्रदेश में स्थित राष्ट्रीय महत्व का संस्थान है। यह भारत के प्रशिद्ध राष्ट्रीय प्रौद्योगिकी संस्थानों में से एक है। इसे 'एनआईटी ताड़ेपल्लीगुड़म' या 'एनआईटी आंध्र प्रदेश' के नाम से भी जाना जाता है। इस संस्थान में शिक्षण कार्य सन २०१५ से प्रारम्भ हुआ था। पूर्णकालिक कैंपस का निर्माण ४०० एकड़ में शुरू हो चुका है। इसके लिए पैसा मानव संसाधन विकास मंत्रालय देता है।»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8):

MOST (443034, 0.8336281776428223): «प्रोद्योगिकी संस्थान , गोवा गोवा में स्थित राष्ट्रीय महत्व का संस्थान है। यह भारत के प्रशिद्ध राष्ट्रीय प्रौद्योगिकी संस्थानों में से एक है। इसे 'एनआईटी गोवा' के नाम से भी जाना जाता है। इस संस्थान में शिक्षण कार्य सन २०१० से प्रारम्भ हुआ था। पूर्णकालिक कैंपस का निर्माण ३०० एकड़ में शुरू हो चुका है। इसके लिए पैसा मानव संसाधन विकास मंत्रालय देता है।»

MEDIAN (429652, 0.34647613763809204): «2014 में , एक अंतरराष्ट्रीय महिला टी-20 प्रतियोगिता के गठन , इंडिय

Somewhat. In terms of vocabulary and topic, the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST do, especially if the MOST has a cosine similarity > 0.5. Re-run the cell to try another random target document.

### Do the word vectors show useful similarities?

In [48]:
word_models = simple_models[:]

In [65]:
import random
from IPython.display import HTML
# pick a random word with a suitable number of occurrences
while True:
    word = random.choice(word_models[0].wv.index2word)
    if word_models[0].wv.vocab[word].count > 10:
        break
# or uncomment below line, to just pick a word from the relevant domain:
#word = 'साम्यवाद'
similars_per_model = [str(model.wv.most_similar(word, topn=20)).replace('), ','),<br>\n') for model in word_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join([str(model) for model in word_models]) + 
    "</th></tr><tr><td>" +
    "</td><td>".join(similars_per_model) +
    "</td></tr></table>")
print("most similar words for '%s' (%d occurrences)" % (word, simple_models[0].wv.vocab[word].count))
HTML(similar_table)

most similar words for 'संचालन' (2235 occurrences)

| Doc2Vec(dbow,d100,n5,mc2,t8) | Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8) | Doc2Vec(dm/c,d100,n5,w5,mc2,t8) |
| --- | --- | --- |
| ('437–438', 0.42513883113861084) | ('निर्माण', 0.8500988483428955) | ('क्रियान्वयन', 0.7410501837730408) |
| ('endured', 0.4212109446525574) | ('अभ्यास', 0.8122480511665344) | ('परिचालन', 0.7080305814743042) |
| ('इंडियारेलइन्फो', 0.41686201095581055) | ('पालन', 0.8114192485809326) | ('अनुरक्षण', 0.6725230813026428) |
| ('‘अब', 0.4159364700317383) | ('उत्पादन', 0.8039084076881409) | ('रखरखाव', 0.6673260927200317) |
| ('vtp', 0.405132532119751) | ('विस्तार', 0.7956109046936035) | ('प्रबंधन', 0.6670118570327759) |
| ('फ़्रागार्या', 0.39857217669487) | ('उपयोग', 0.7877824306488037) | ('वित्तपोषण', 0.651642918586731) |
| ('होलिवुड', 0.3927476108074188) | ('आयोजन', 0.783574640750885) | ('पर्यवेक्षण', 0.6502020359039307) |
| ('//mod', 0.3907133936882019) | ('अनुसरण', 0.783011257648468) | ('प्रबन्धन', 0.6321255564689636) |
| ('संगटन', 0.38695940375328064) | ('समर्थन', 0.7800379991531372) | ('निष्पादन', 0.6200909614562988) |
| ('अप्रिल', 0.3860411047935486) | ('प्रचार-प्रसार', 0.7774072885513306) | ('रख-रखाव', 0.614891767501831) |
| ("ओवल'", 0.3858442008495331) | ('प्रसारण', 0.7745814323425293) | ('नवीनीकरण', 0.6146354675292969) |
| ('निम्नबल', 0.3831627368927002) | ('प्रचार', 0.7743381857872009) | ('समन्\u200dवयन', 0.6104453802108765) |
| ('हालार', 0.38268977403640747) | ('प्रसार', 0.7741221785545349) | ('निकास', 0.606842041015625) |
| ('سنڌي\u200e', 0.38066890835762024) | ('गठन', 0.7739957571029663) | ('प्रशासन', 0.6063936352729797) |
| ('डेफ़िनिशन', 0.380027174949646) | ('अध्ययन', 0.773598313331604) | ('पुनरोद्धार', 0.6056331396102905) |
| ('सेरामपोर', 0.3793221116065979) | ('विकास', 0.7735896110534668) | ('संपादन', 0.5962674617767334) |
| ('rreq', 0.37876227498054504) | ('सृजन', 0.7734421491622925) | ('मूल्\u200dयांकन', 0.5926483869552612) |
| ('फुफ्फुसावरणशोथ', 0.37512099742889404) | ('प्रतिनिधित्व', 0.7727755904197693) | ('कार्यान्वयन', 0.5885398387908936) |
| ('भागा।', 0.3748989701271057) | ('चयन', 0.7726584672927856) | ('निर्माण', 0.5872750282287598) |
| ('बंजरपन', 0.37415167689323425) | ('निर्धारण', 0.7721700668334961) | ('निस्तारण', 0.5830670595169067) |


Do the DBOW words look meaningless? That's because the gensim DBOW model doesn't train word vectors (they remain at their randomly initialized values) unless you ask for it with the `dbow_words=1` initialization parameter.
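
As a sketch, the settings for a DBOW model that also trains word vectors could look like the following. The values mirror the plain PV-DBOW model defined earlier, except for `dbow_words=1` and the `window` value, which is an assumption chosen for illustration. The model construction is left commented out so the snippet runs without gensim installed.

```python
# Keyword arguments mirroring the plain PV-DBOW model above, plus dbow_words=1
dbow_words_kwargs = dict(
    dm=0,            # PV-DBOW mode
    dbow_words=1,    # also train word vectors (skip-gram) alongside doc-vectors
    vector_size=100,
    window=10,       # assumed value; the window only matters once word vectors are trained
    negative=5,
    hs=0,
    min_count=2,
    sample=0,
    epochs=20,
)

# With gensim installed, the model would be created like this:
# from gensim.models import Doc2Vec
# model = Doc2Vec(workers=cores, **dbow_words_kwargs)
print(dbow_words_kwargs)
```

Expect training to take longer than plain DBOW, since every word in the window now contributes its own skip-gram updates.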

Words from DM models tend to show meaningfully similar words when there are many examples in the training data (as with 'यूएसएसआर' or 'रूस'). (All DM modes inherently involve word-vector training concurrent with doc-vector training.)

### Visualize Words in Vector Space
source: https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

def tsne_plot(model):
    "Creates a t-SNE model and plots it"
    labels = []
    tokens = []
    count = 0
    # Only the first 500 vocabulary words are plotted to keep t-SNE tractable
    for word in model.wv.vocab:
        if count < 500:
            tokens.append(model.wv[word])
            labels.append(word)
            count += 1
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
word_models = simple_models[:]
tsne_plot(word_models[1])
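
The plot above is limited to 500 words because t-SNE becomes slow and unreadable on a full vocabulary. A compact way to take the first `n` keys of a vocabulary mapping is `itertools.islice`, sketched here on a toy dictionary; `first_n_words` is a hypothetical helper, not part of gensim.

```python
from itertools import islice

def first_n_words(vocab, n=500):
    """Return the first n keys of a vocabulary mapping, in iteration order."""
    return list(islice(vocab, n))

# Toy stand-in for model.wv.vocab
toy_vocab = {'w%d' % i: i for i in range(1000)}
subset = first_n_words(toy_vocab, n=500)
print(len(subset))
```

Iteration order is not necessarily frequency order, so a plot of the most frequent words may be more informative than a plot of the first 500 keys.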

Other Doc2Vec APIs: https://radimrehurek.com/gensim/models/doc2vec.html

In [None]:
%load_ext autoreload
%autoreload 2