## doc2vec training exercise

In this exercise, you will train a Paragraph Vectors / doc2vec model using gensim. The gensim doc2vec API is documented here: https://radimrehurek.com/gensim/models/doc2vec.html

N.B. You should be using Python 3 for this.

The data folder contains a training set and a test set, each a small sample of documents from the "20 newsgroups" dataset.

What we're going to do is the following:
* Read a dataset with documents
* Transform each document into a list of tokens (words)
* Train a doc2vec model (DM)
* Train a second model (DBOW)
* Inspect the outcomes a bit

In [19]:
import os
from gensim.models import doc2vec
from gensim.utils import simple_preprocess

In [20]:
# generic settings
HOMEDIR = './'
CORPUS_FILE = os.path.join(HOMEDIR, "data/corpus_train.txt")

# file names for the models we'll be creating
MODEL_FILE_DM = os.path.join(HOMEDIR, "models/doc2vec_DM_v20171229.bin")
MODEL_FILE_DBOW = os.path.join(HOMEDIR, "models/doc2vec_DBOW_v20171229.bin")

**Read the corpus. Each line is a document / paragraph. Optionally preprocess it first.**

In [21]:
flg_preprocess = True

if flg_preprocess:
    # with pre-processing: lowercase and tokenize each line, then tag it with its line number
    with open(CORPUS_FILE, 'r', encoding='utf-8') as f:
        docs = [simple_preprocess(line, deacc=False, min_len=1) for line in f]
    docs = [doc2vec.TaggedDocument(doc, tags=[i]) for i, doc in enumerate(docs)]
else:
    # quick & simple approach: whitespace-split each line; wrapped in list() so that docs[0] works
    docs = list(doc2vec.TaggedLineDocument(CORPUS_FILE))
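To make the data structure concrete, here is a minimal standard-library sketch of what the cell above produces. `TaggedDoc`, `tokenize`, and `tag_lines` are hypothetical stand-ins, not gensim names: gensim's `TaggedDocument` is likewise a namedtuple with `words` and `tags` fields, and the regex tokenizer only roughly approximates `simple_preprocess`.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (also a namedtuple of words and tags).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tokenize(line):
    """Lowercase a line and keep runs of word characters (rough approximation only)."""
    return re.findall(r"\w+", line.lower())


def tag_lines(lines):
    """Turn each line into a tagged document whose single tag is its line number."""
    return [TaggedDoc(words=tokenize(line), tags=[i]) for i, line in enumerate(lines)]


sample_lines = ["The quick brown fox.", "Jumps over the lazy dog!"]
tagged = tag_lines(sample_lines)
print(tagged[1])
```

The integer tag is what later lets you look up a document's trained vector by its position in the corpus.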

In [22]:
# have a look at the data
docs[0]

TaggedDocument(words=['どうかを見極めましょう', 'なるべく新しい情報を集める', '情報は日々更新されています', '特にイ'], tags=[0])

## Training a DM (Distributed Memory) model

In [23]:
# train DM model
model_dm = doc2vec.Doc2Vec(docs, 
                           vector_size=200, # vector size, should be the same size as pre-trained embedding size when not using dm_concat
                           window=10, # window size for word context, on each side
                           min_count=1, # minimum nr. of occurrences of a word
                           sample=1e-5, # threshold for undersampling high-frequency words
                           workers=4, # for multicore processing
                           hs=0, # if 1, use hierarchical softmax; if 0, use negative sampling
                           dm=1, # if 1 use PV-DM, if 0 use PV-DBOW
                           negative=5, # number of noise words drawn per negative-sampling update
                           dbow_words=1, # train word vectors alongside doc vectors in DBOW mode; has no effect when dm=1
                           dm_concat=1, # concatenate vectors or sum/average them?
                           epochs=100 # nr of epochs to train
                          )
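The `dm_concat` setting changes how large the model's input layer becomes. The sketch below works through the arithmetic for the settings used above, assuming one tag per document and a full window of `window` words on each side. `dm_input_size` is a hypothetical helper for illustration, not a gensim function.

```python
def dm_input_size(vector_size, window, dm_concat, tags_per_doc=1):
    """Width of the PV-DM input layer under the stated assumptions."""
    if dm_concat:
        # doc vector(s) and all 2 * window context word vectors are laid side by side
        return (tags_per_doc + 2 * window) * vector_size
    # summing or averaging keeps the width at a single vector
    return vector_size


concat_size = dm_input_size(vector_size=200, window=10, dm_concat=1)
average_size = dm_input_size(vector_size=200, window=10, dm_concat=0)
print(concat_size, average_size)
```

With concatenation the input layer is 21 times wider than with averaging, which is why the concatenated DM model is slower to train and larger on disk.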

In [24]:
# save it for later use
model_dm.save(MODEL_FILE_DM)

## Training a DBOW (Distributed Bag Of Words) model

**_Exercise 1: Train a DBOW model_**

It's very similar to the previous command. What should you change?

In [25]:
# train DBOW model
# ...enter your code here...
model_dbow = doc2vec.Doc2Vec(docs, 
                            vector_size=200, # vector size, should be the same size as pre-trained embedding size when not using dm_concat
                            window=10, # window size for word context, on each side
                            min_count=1, # minimum nr. of occurrences of a word
                            sample=1e-5, # threshold for undersampling high-frequency words
                            workers=4, # for multicore processing
                            hs=0, # if 1, use hierarchical softmax; if 0, use negative sampling
                            dm=0, # if 1 use PV-DM, if 0 use PV-DBOW
                            negative=5, # how many words to use for negative sampling
                            dbow_words=1, # train word vectors
                            epochs=100 # nr of epochs to train
                            )

In [8]:
# solution for an older gensim API (`size` and `iter` instead of `vector_size` and `epochs`); do not run
model_dbow = doc2vec.Doc2Vec(docs, 
                            size=200, # vector size, should be the same size as pre-trained embedding size when not using dm_concat
                            window=10, # window size for word context, on each side
                            min_count=1, # minimum nr. of occurrences of a word
                            sample=1e-5, # threshold for undersampling high-frequency words
                            workers=4, # for multicore processing
                            hs=0, # if 1, use hierarchical softmax; if 0, use negative sampling
                            dm=0, # if 1 use PV-DM, if 0 use PV-DBOW
                            negative=5, # how many words to use for negative sampling
                            dbow_words=1, # train word vectors
                            iter=100 # nr of epochs to train
                            )

========= END OF EXERCISE ============
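To see what the `dm` switch changes conceptually, the following sketch lists the training examples each mode builds from one tagged document. It is a simplification under stated assumptions (a fixed window, no subsampling, no negative sampling), and `dm_examples` and `dbow_examples` are hypothetical helpers rather than gensim functions.

```python
def dm_examples(tag, words, window):
    """PV-DM: predict each word from the doc tag plus its surrounding words."""
    examples = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        examples.append(((tag, context), target))
    return examples


def dbow_examples(tag, words):
    """PV-DBOW: predict each word from the doc tag alone, ignoring word order."""
    return [(tag, target) for target in words]


doc_words = ["a", "b", "c"]
dm_pairs = dm_examples(0, doc_words, window=1)
dbow_pairs = dbow_examples(0, doc_words)
print(dm_pairs)
print(dbow_pairs)
```

In DM the context words take part in every prediction, whereas in DBOW the document vector does all the work unless word training is switched on separately.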

In [26]:
# also save this one
model_dbow.save(MODEL_FILE_DBOW)

## **Question: Look at the model files that have now been created in the models directory. Can you explain why there are 2 files for the DM model, but only 1 for the DBOW model?**

In [27]:
def show_most_similar(model, docs, ref_doc_id):
    """
    For a given document, display the most similar ones in the corpus
    """
    def print_doc(doc_id):
        doc_txt = ' '.join(docs[doc_id].words)
        print("[Doc {}]: {}".format(doc_id, doc_txt))
        
    print("[Original document]")
    print_doc(ref_doc_id)
    print("\n[Most similar documents]")
    for doc_id, similarity in model.dv.most_similar(ref_doc_id, topn=3):  # `dv` replaces the deprecated `docvecs` in gensim 4
        print("-----------------")
        print("similarity: {}".format(similarity))
        print_doc(doc_id)
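The similarity scores reported by `most_similar` are cosine similarities between document vectors. As a rough illustration of that ranking, here is a standard-library sketch on made-up 2-dimensional vectors. `cosine_similarity`, `rank_most_similar`, and `toy_vectors` are hypothetical names, and this is not how gensim implements the lookup internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_most_similar(vectors, ref_id, topn=3):
    """Return (doc_id, similarity) pairs for the topn vectors closest to vectors[ref_id]."""
    scores = [
        (i, cosine_similarity(vectors[ref_id], vec))
        for i, vec in enumerate(vectors)
        if i != ref_id  # a document is not compared with itself
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
ranking = rank_most_similar(toy_vectors, ref_id=0, topn=2)
print(ranking)
```

Because cosine similarity ignores vector length, two documents score highly when their vectors point in the same direction, regardless of magnitude.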


In [28]:
show_most_similar(model_dbow, list(docs), 200)

[Original document]
[Doc 200]: 体的な対策案を詳しく解説します html を定期的に変更する サイトの html を定期的に変更す

[Most similar documents]
-----------------
similarity: 0.976435661315918
[Doc 7319]: サイトの html が書き換えられても web スクレイピングできる web スクレイピングで取得
-----------------
similarity: 0.9718865156173706
[Doc 15348]: 穴が空いてしまった 残念ながら サイトの html 構造の
-----------------
similarity: 0.9671600461006165
[Doc 7341]: web サイトの html が書き換えられ



## Prediction phase

In [29]:
test_data_file = os.path.join(HOMEDIR, "data/corpus_total.txt")

In [30]:
# read test data: each line into a list of tokens (plain whitespace split, no further pre-processing)
with open(test_data_file, "r", encoding="utf-8") as f:
    test_docs = [x.strip().split() for x in f]

In [31]:
# inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

Create the embeddings for the test documents. Remember: inference is itself a short training run. The trained model's weights stay frozen while gradient descent fits a new document vector to each test document.
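The idea of "training at inference time" can be sketched in a few lines of plain Python. This is a toy PV-DBOW-style update with one uniformly drawn noise word per target, written under simplifying assumptions. It is not gensim's implementation, and `infer_doc_vector` and `frozen` are hypothetical names.

```python
import math
import random


def infer_doc_vector(word_ids, output_vectors, epochs=50, alpha=0.1, seed=0):
    """Fit a fresh document vector against frozen output vectors (toy example)."""
    rng = random.Random(seed)
    dim = len(output_vectors[0])
    # start from a small random vector
    doc_vec = [(rng.random() - 0.5) / dim for _ in range(dim)]
    vocab_ids = list(range(len(output_vectors)))
    for _ in range(epochs):
        for target in word_ids:
            # simplification: one noise word per target, drawn uniformly
            noise = rng.choice([w for w in vocab_ids if w != target])
            for word, label in ((target, 1.0), (noise, 0.0)):
                out = output_vectors[word]
                score = sum(d * o for d, o in zip(doc_vec, out))
                pred = 1.0 / (1.0 + math.exp(-score))
                step = alpha * (label - pred)
                # only the document vector moves; output_vectors is never modified
                doc_vec = [d + step * o for d, o in zip(doc_vec, out)]
    return doc_vec


frozen = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # made-up "trained" output vectors
snapshot = [row[:] for row in frozen]
vec = infer_doc_vector([0], frozen)
print(vec)
```

The document vector drifts toward the output vector of the word it contains, while the model weights in `frozen` are left exactly as they were.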

In [32]:
# older gensim versions took `steps` instead of `epochs`:
# test_docvecs = [model_dm.infer_vector(d, alpha=start_alpha, steps=infer_epoch) for d in test_docs]

infer_epoch = 20  # number of inference epochs; overrides the value set above

test_docvecs = [model_dm.infer_vector(d, alpha=start_alpha, epochs=infer_epoch) for d in test_docs]
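The `alpha` passed to `infer_vector` is only the starting learning rate: gensim documents that it drops linearly to `min_alpha` over the inference epochs. The sketch below computes such a schedule for the settings above. `linear_alpha_schedule` is a hypothetical helper, and the `min_alpha` of 0.0001 is an assumption, since the notebook does not set it.

```python
def linear_alpha_schedule(start_alpha, min_alpha, epochs):
    """Learning rate for each epoch, falling linearly from start_alpha to min_alpha."""
    step = (start_alpha - min_alpha) / max(epochs - 1, 1)
    return [start_alpha - i * step for i in range(epochs)]


# min_alpha=0.0001 is assumed here; the notebook leaves it unspecified
schedule = linear_alpha_schedule(start_alpha=0.01, min_alpha=0.0001, epochs=20)
print(schedule[0], schedule[-1])
```

Fewer epochs mean larger jumps between learning rates, so the number of inference epochs affects both the speed and the stability of the resulting vectors.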


In [33]:
# see what one document embedding looks like
test_docvecs[0]

array([ 2.1745462e-03, -1.3884333e-03, -2.1389707e-03,  2.7547378e-04,
       -1.6884271e-03,  9.8828354e-04,  2.6889006e-03,  1.8476255e-03,
        9.7175839e-04, -1.4412728e-03,  1.1199429e-03,  5.7260576e-04,
        2.0723555e-03,  1.4076020e-03,  1.6640181e-03,  3.1390295e-03,
       -1.1036033e-03, -1.6768518e-03, -2.2916360e-04, -1.3597443e-04,
        1.2440752e-04, -1.3818053e-03,  1.0196362e-03,  1.6079628e-03,
        3.5036361e-04,  8.6599769e-04,  2.1145046e-03, -1.8057314e-03,
       -1.4154599e-03,  2.8959394e-04, -3.2196488e-04,  3.0237322e-03,
        1.6618476e-03,  9.0015819e-05,  1.7056705e-03, -1.3774577e-04,
        1.0094722e-03, -1.5144229e-04,  2.0610006e-03, -3.4465315e-03,
        1.3685275e-03, -2.2041015e-03,  1.0888297e-03, -7.2553717e-05,
       -1.0431758e-03, -2.8383040e-03,  8.6026359e-04, -3.1328239e-03,
        6.2263738e-05,  1.2314832e-03, -4.3906824e-04, -2.2127018e-03,
        2.0221604e-03,  1.3907196e-03,  1.8366361e-03, -4.3923350e-04,
      

### The exercise continues in the next notebook!