# Import libs

In [63]:
%matplotlib inline

In [64]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [65]:
import os
import gensim
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

# Create the dataset

In [66]:
import pandas as pd

In [67]:
balanced_dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ressources/keywi/landing_title_dataset.csv')
balanced_dataset['title'] = balanced_dataset.title.astype(str)

In [68]:
train_df = balanced_dataset.sample(frac=0.5)
test_df = balanced_dataset[~balanced_dataset.index.isin(train_df.index)]

In [69]:
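# Write one title per line; state the encoding explicitly instead of relying on the platform default
with open("train_file.txt", 'w', encoding='utf-8') as file:
    file.write('\n'.join(train_df.title))
with open("test_file.txt", 'w', encoding='utf-8') as file:
    file.write('\n'.join(test_df.title))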

In [70]:
# from google.colab import files
# files.download(lee_test_file)

In [71]:
lee_train_file = '/content/train_file.txt'
lee_test_file = '/content/test_file.txt'

Define a Function to Read and Preprocess Text
---------------------------------------------

Below, we define a function to:

- open the train/test file (as UTF-8, matching how the files were written above)
- read the file line-by-line
- pre-process each line (tokenize text into individual words, remove punctuation, set to lowercase, etc.)

The file we're reading is a **corpus**.
Each line of the file is a **document**.

.. Important::
  To train the model, we'll need to associate a tag/number with each document
  of the training corpus. In our case, the tag is simply the zero-based line
  number.




In [72]:
import smart_open

def read_corpus(fname, tokens_only=False):
    # Read as UTF-8 to match how train_file.txt/test_file.txt were written above
    # (Python's open() defaults to UTF-8 on Colab); decoding them as iso-8859-1
    # would garble accented characters.
    with smart_open.open(fname, encoding="utf-8") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

Let's take a look at the training corpus:




In [73]:
print(train_corpus[:2])

[TaggedDocument(words=['gratis', 'hoortestdagen'], tags=[0]), TaggedDocument(words=['westerop', 'ko', 'kapsalon', 'santpoortzuid'], tags=[1])]


And the testing corpus looks like this:




In [74]:
print(test_corpus[:2])

[['contact', 'gemeente', 'groningen'], ['sign', 'in', 'linkedin']]


Notice that the testing corpus is just a list of lists and does not contain
any tags.
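
As a rough illustration of the preprocessing and tagging steps above, here is
a standard-library sketch. It only approximates what
``gensim.utils.simple_preprocess`` does (lowercasing and keeping alphabetic
tokens of 2 to 15 characters), and the names ``approx_preprocess``,
``TaggedDoc`` and ``tag_lines`` are hypothetical stand-ins, not gensim APIs.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (a word list plus a tag list).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Runs of letters only: digits, punctuation and underscores act as separators.
_TOKEN = re.compile(r"[^\W\d_]+")


def approx_preprocess(line, min_len=2, max_len=15):
    """Lowercase a line and keep alphabetic tokens of moderate length.

    This is an approximation of simple_preprocess, not its implementation.
    """
    return [t for t in _TOKEN.findall(line.lower()) if min_len <= len(t) <= max_len]


def tag_lines(lines, tokens_only=False):
    """Yield token lists, or tagged documents keyed by zero-based line number."""
    for i, line in enumerate(lines):
        tokens = approx_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            yield TaggedDoc(tokens, [i])


titles = ["Gratis hoortestdagen!", "Sign in | LinkedIn"]
train_docs = list(tag_lines(titles))                    # tagged, as for training
test_docs = list(tag_lines(titles, tokens_only=True))   # plain token lists
print(train_docs)
print(test_docs)
```

The tagged form is only needed for training; at inference time a plain list
of tokens is enough, which is why the test corpus carries no tags.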




Training the Model
------------------

Now, we'll instantiate a Doc2Vec model with a vector size of 50 dimensions
that iterates over the training corpus 40 times. We set the minimum word count to
2 in order to discard words with very few occurrences. (Without a variety of
representative examples, retaining such infrequent words can often make a
model worse!) Typical iteration counts in the published `Paragraph Vector paper <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>`__
results, using tens of thousands to millions of docs, are 10-20. More
iterations take more time and eventually reach a point of diminishing
returns.

However, this is a very small dataset (1,027 training documents, per the
vocabulary log below) of very short documents (page titles averaging about
five words). Adding training passes can sometimes help with such small
datasets.




In [75]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)



Build a vocabulary



In [76]:
model.build_vocab(train_corpus)

2022-03-11 02:50:57,335 : INFO : collecting all words and their counts
2022-03-11 02:50:57,338 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2022-03-11 02:50:57,347 : INFO : collected 2420 word types and 1027 unique tags from a corpus of 1027 examples and 5206 words
2022-03-11 02:50:57,350 : INFO : Loading a fresh vocabulary
2022-03-11 02:50:57,363 : INFO : effective_min_count=2 retains 650 unique words (26% of original 2420, drops 1770)
2022-03-11 02:50:57,364 : INFO : effective_min_count=2 leaves 3436 word corpus (66% of original 5206, drops 1770)
2022-03-11 02:50:57,373 : INFO : deleting the raw counts dictionary of 2420 items
2022-03-11 02:50:57,388 : INFO : sample=0.001 downsamples 57 most-common words
2022-03-11 02:50:57,390 : INFO : downsampling leaves estimated 2435 word corpus (70.9% of prior 3436)
2022-03-11 02:50:57,397 : INFO : estimated required memory for 650 words and 50 dimensions: 790400 bytes
2022-03-11 02:50:57,404 : INFO : resetting

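Essentially, the vocabulary is a dictionary (accessible via
``model.wv.vocab``\ ) of all of the unique words extracted from the training
corpus along with their counts (e.g., ``model.wv.vocab['penalty'].count`` gives
the count for the word ``penalty``, provided that word survived the
``min_count`` filter). Note that ``model.wv.vocab`` exists in gensim 3.x only;
gensim 4 replaced it with ``model.wv.key_to_index`` and
``model.wv.get_vecattr(word, 'count')``.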
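
To make the effect of ``min_count`` concrete (the log above shows it keeping
650 of 2420 word types), here is a minimal standard-library sketch of counting
words and dropping the rare ones. The function ``build_vocab_counts`` and the
toy corpus are hypothetical; gensim's ``build_vocab`` does considerably more
(downsampling of frequent words, index assignment, and so on).

```python
from collections import Counter


def build_vocab_counts(corpus, min_count=2):
    """Count every token in the corpus and keep those seen at least min_count times."""
    raw_counts = Counter(token for doc in corpus for token in doc)
    return {word: n for word, n in raw_counts.items() if n >= min_count}


# Toy corpus of tokenized titles (illustrative data only).
toy_corpus = [
    ["sign", "in", "linkedin"],
    ["sign", "in", "facebook"],
    ["contact", "gemeente", "groningen"],
]

vocab = build_vocab_counts(toy_corpus, min_count=2)
print(vocab)  # only the words that occur at least twice remain
```

With titles this short, many words occur only once, so a large share of the
word types is discarded before training starts.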




Next, train the model on the corpus.
If the BLAS library is being used, this should take no more than 3 seconds.
If the BLAS library is not being used, this should take no more than 2
minutes, so use BLAS if you value your time.




In [None]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Now, we can use the trained model to infer a vector for any piece of text
by passing a list of words to the ``model.infer_vector`` function. This
vector can then be compared with other vectors via cosine similarity.
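
For reference, cosine similarity is the dot product of two vectors divided by
the product of their lengths. The sketch below is a plain-Python version for
illustration; ``cosine_similarity`` is a hypothetical helper, not the routine
gensim uses internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Because only the angle matters, two vectors of very different magnitude can
still have a similarity close to 1.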




In [78]:
vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

[ 0.11821076  0.03322035 -0.06197895  0.12135652  0.20852715  0.04502743
  0.05179525 -0.0193969   0.07544646 -0.02888737  0.18293221  0.13676941
 -0.06969482 -0.03838854 -0.02144772 -0.10653731 -0.02389352 -0.07344662
 -0.12902184  0.08786379 -0.07799368  0.11776935 -0.18422349  0.04785805
  0.0651237   0.00181668  0.06122094 -0.21121283  0.1931972   0.00314464
 -0.07126746  0.16450453  0.02620476  0.21824433 -0.08971546 -0.13908242
  0.01268764 -0.2485366  -0.09876403  0.10125385  0.11163259  0.0685856
 -0.04178153 -0.07361675 -0.0867291   0.06208353  0.01660148 -0.06469194
  0.10408166 -0.02004464]


Note that ``infer_vector()`` does *not* take a string, but rather a list of
string tokens, which should have already been tokenized the same way as the
``words`` property of original training document objects.

Also note that because the underlying training/inference algorithms are an
iterative approximation problem that makes use of internal randomization,
repeated inferences of the same text will return slightly different vectors.




Assessing the Model
-------------------

To assess our new model, we'll first infer new vectors for each document of
the training corpus, compare the inferred vectors with the training corpus,
and then return the rank of each document based on self-similarity.
Basically, we're pretending that the training corpus is new, unseen data
and then seeing how it compares with the trained model. The expectation is
that we've likely overfit our model (i.e., all of the ranks will be less than
2) and so we should be able to find similar documents very easily.
Additionally, we'll keep track of the second-ranked document in each case,
for a comparison of less similar documents.
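
The ranking idea can be sketched without a model. In the toy example below,
the ``trained`` and ``inferred`` vectors are made up for illustration, and
``cosine`` and ``self_rank`` are hypothetical helpers: for each document we
sort all trained vectors by similarity to its re-inferred vector and record
the position at which the document itself appears.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def self_rank(doc_id, inferred_vector, trained_vectors):
    """Position of doc_id when trained vectors are sorted by similarity (0 = best)."""
    order = sorted(
        range(len(trained_vectors)),
        key=lambda j: cosine(inferred_vector, trained_vectors[j]),
        reverse=True,
    )
    return order.index(doc_id)


# Toy stand-ins for the trained document vectors and their re-inferred versions.
trained = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
inferred = [[0.9, 0.1], [0.8, 0.6], [-1.0, 0.1]]

ranks = [self_rank(i, inferred[i], trained) for i in range(len(trained))]
counter = Counter(ranks)
top1_share = counter[0] / len(ranks)  # fraction of documents most similar to themselves
print(ranks, counter, top1_share)
```

A rank of 0 means the document found itself; the share of rank-0 documents is
the quantity to read off the counter in the cells that follow.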




In [79]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

2022-03-11 02:51:00,504 : INFO : precomputing L2-norms of doc weight vectors


Let's count how each document ranks with respect to the training corpus.

NB. Results vary between runs due to random seeding and the very small corpus.



In [80]:
import collections

counter = collections.Counter(ranks)
print(counter)

Counter({0: 44, 1: 18, 2: 15, 3: 11, 4: 9, 5: 9, 9: 6, 95: 5, 6: 5, 822: 5, 8: 5, 224: 4, 21: 4, 7: 4, 56: 4, 701: 4, 83: 4, 10: 4, 17: 4, 48: 4, 14: 4, 367: 4, 471: 4, 145: 4, 351: 4, 740: 4, 466: 3, 735: 3, 604: 3, 134: 3, 41: 3, 336: 3, 27: 3, 1012: 3, 660: 3, 503: 3, 902: 3, 11: 3, 809: 3, 72: 3, 681: 3, 31: 3, 1013: 3, 730: 3, 258: 3, 146: 3, 1014: 3, 461: 3, 60: 3, 620: 3, 685: 3, 42: 3, 57: 3, 453: 3, 375: 3, 712: 3, 13: 3, 344: 3, 286: 3, 19: 3, 12: 3, 661: 3, 229: 3, 409: 3, 1026: 3, 878: 3, 311: 3, 188: 3, 770: 3, 433: 3, 16: 3, 55: 2, 240: 2, 1015: 2, 964: 2, 698: 2, 86: 2, 284: 2, 30: 2, 524: 2, 931: 2, 481: 2, 493: 2, 241: 2, 154: 2, 811: 2, 662: 2, 668: 2, 631: 2, 554: 2, 602: 2, 721: 2, 774: 2, 322: 2, 255: 2, 285: 2, 211: 2, 777: 2, 169: 2, 857: 2, 695: 2, 525: 2, 825: 2, 429: 2, 366: 2, 459: 2, 324: 2, 842: 2, 821: 2, 617: 2, 297: 2, 232: 2, 611: 2, 838: 2, 776: 2, 577: 2, 62: 2, 550: 2, 312: 2, 861: 2, 793: 2, 882: 2, 54: 2, 915: 2, 450: 2, 201: 2, 627: 2, 316: 2, 356

In the run shown above, only 44 of the 1,027 training documents (about 4%)
are most similar to themselves; the rest are ranked closest to some other
document, often far down the list. A result this low suggests that these very
short titles give the model too little signal to produce stable inferred
vectors. Checking the inferred vector against a
training vector is a sort of 'sanity check' as to whether the model is
behaving in a usefully consistent manner, though not a real 'accuracy' value.

We can take a look at an example:




In [87]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (500): «trim en trainingsgroep weesp»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (831, 0.9356752038002014): «about hba»

SECOND-MOST (937, 0.9353445768356323): «about us betway group»

MEDIAN (30, 0.9156643748283386): «alles is energie mayawijsheid door fokje ijkema»

LEAST (868, -0.8325261473655701): «sign in linkedin»



Ideally, the most similar document (usually the same text) has a
similarity score approaching 1.0, while the similarity score for the
second-ranked document is significantly lower (assuming the documents
are in fact different), for reasons that become obvious when we examine the
text itself. In the run shown above, the top two scores are nearly identical
(about 0.936 and 0.935), so that separation is not visible here.

We can run the next cell repeatedly to see a sampling of other target-document
comparisons.




In [82]:
# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (750): «dc online gambling your gambling guide to the nation capital»

Similar Document (997, 0.9765671491622925): «wordpress fout»



Testing the Model
-----------------

Using the same approach as above, we'll infer the vector for a randomly chosen
test document, and compare the document to our model by eye.




In [83]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (500): «home capability»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (831, 0.9356752038002014): «about hba»

MEDIAN (30, 0.9156643748283386): «alles is energie mayawijsheid door fokje ijkema»

LEAST (868, -0.8325261473655701): «sign in linkedin»



Conclusion
----------

Let's review what we've seen in this tutorial:

0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec
1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)
2. Train a Doc2Vec `core_concepts_model` model using the training corpus
3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`
4. Assess the model
5. Test the model on the test corpus

That's it! Doc2Vec is a great way to explore relationships between documents.

Additional Resources
--------------------

If you'd like to know more about the subject matter of this tutorial, check out the links below.

* `Word2Vec Paper <https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf>`_
* `Doc2Vec Paper <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>`_
* `Dr. Michael D. Lee's Website <http://faculty.sites.uci.edu/mdlee>`_
* `Lee Corpus <http://faculty.sites.uci.edu/mdlee/similarity-data/>`__
* `IMDB Doc2Vec Tutorial <doc2vec-IMDB.ipynb>`_


