#### **D. Document Embeddings**

Document embedding is a technique for producing vector representations of larger units of text, **from sentences to books**.

A **document** refers to **any sequence of words**, ranging from sentences and paragraphs through social media posts all the way to articles, books, and more complexly structured text documents such as forms.

Applications:

1. Text classification and sentiment analysis tasks (Le & Mikolov, 2014)
2. Document similarity tasks (Dai et al, 2015)
3. Forum question duplication task and the Semantic Textual Similarity SemEval shared task (Lau & Baldwin, 2016)
4. Semantic relatedness, paraphrase detection, image-sentence ranking using *skip-thought* vectors (Kiros et al, 2015)
5. Sentence pair similarity tasks, as shown by *BioSentVec* (Chen et al, 2018)
6. Information retrieval, web search ranking, ad relevance, contextual entity search and interestingness tasks, question answering, knowledge inference, image captioning, and machine translation tasks showcased by the ***Deep Semantic Similarity Model***.

Document Embedding Approaches:

1. Summarizing Word Vectors
    - The classic approach of *summing* or *averaging* word vectors to represent documents.

2. Topic Modelling
    - Topic models *inherently generate a document embedding space* meant to model and explain the word distribution in the corpus. Its dimensions can be seen as latent semantic structures hidden in the data, which makes them useful in our context.

3. Encoder-decoder models
    - Unsupervised methods such as *doc2vec* and *skip-thought*. This family of models has been around since the early 2000s under the name *neural probabilistic language models*. It gains more than the other approaches from the increasing availability of large unlabeled corpora.

4. Supervised representation learning
    - By simply inputting a bag-of-words representation into a neural network learning to solve some supervised text-related problem, you get a model where hidden layers hold rich representations of the input texts. Unfortunately, the limited availability of very large labeled corpora will inhibit the growth of this approach in the coming years.
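
As a minimal sketch of the first approach in the list above (averaging word vectors), the snippet below builds a document vector from a tiny lookup table. The three-dimensional vectors and the helper name `average_embedding` are invented for illustration; in practice the vectors would come from a trained model such as word2vec.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for this example only
word_vectors = {
    "good": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.1, 0.8, 0.2]),
    "bad": np.array([-0.9, 0.1, 0.0]),
}


def average_embedding(tokens, word_vectors, dim=3):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        # No known token: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


doc_vec = average_embedding(["good", "movie", "unknownword"], word_vectors)
print(doc_vec)
```

Averaging (rather than summing) keeps the scale of the document vector independent of the document length, at the cost of discarding word order entirely.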

#### **A. Paragraph Vectors (Doc2Vec)**

Paragraph vectors were the first attempt to generalize *word2vec* to work with word sequences. The authors (Le & Mikolov, 2014) introduced two variants:

a. Paragraph Vectors: Distributed Memory (PV-DM)
- The training task is similar to that of continuous bag of words (CBOW): a single word is predicted from its context. Here the context consists of the preceding words (rather than the surrounding words) together with the paragraph itself. PV-DM augments the standard encoder-decoder model with a memory vector that aims to capture the topic (or context) of the paragraph from the input. To achieve this, every paragraph is mapped to a unique vector, represented by a column in a matrix (denoted by D), just as every word in the vocabulary is mapped to a unique vector.

b. Paragraph Vectors: Distributed Bag of Words (PV-DBOW)

- It parallels word2vec's skip-gram architecture: the classification task is to predict a single context word using only the paragraph vector.

- In addition, input subsampling is removed and the entire sentence is treated as the context. This means that:
1. frequent-word subsampling is discarded, so as not to prevent the generation of n-gram features, and
2. the dynamic context windows used by word2vec are done away with: the entire sentence is the context window, instead of a window size sampled for each subsampled word uniformly between 1 and the length of the current sentence.
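
To make the difference between the two variants concrete, here is a rough NumPy sketch of their forward passes. It is only an illustration under several assumptions: the toy vocabulary, the matrix names `W` and `U`, and the sizes are invented; paragraph and word vectors are stored as rows rather than columns; PV-DM averages its inputs (concatenation is the other common choice); and a full softmax is used for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (illustrative)
word_to_id = {w: i for i, w in enumerate(vocab)}
n_paragraphs, dim = 3, 4

D = rng.normal(scale=0.1, size=(n_paragraphs, dim))  # one row per paragraph
W = rng.normal(scale=0.1, size=(len(vocab), dim))    # one row per input word
U = rng.normal(scale=0.1, size=(dim, len(vocab)))    # softmax output weights


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def pv_dm_predict(paragraph_id, preceding_words):
    """PV-DM: combine the paragraph vector with the preceding word vectors."""
    ids = [word_to_id[w] for w in preceding_words]
    stacked = np.vstack([D[paragraph_id][None, :], W[ids]])
    hidden = stacked.mean(axis=0)  # averaging variant
    return softmax(hidden @ U)     # distribution over the next word


def pv_dbow_predict(paragraph_id):
    """PV-DBOW: predict a word of the paragraph from the paragraph vector alone."""
    return softmax(D[paragraph_id] @ U)


p_dm = pv_dm_predict(0, ["the", "cat", "sat"])
p_dbow = pv_dbow_predict(0)
print(p_dm.shape, p_dbow.shape)
```

Training would adjust `D`, `W`, and `U` so that the observed word receives high probability; that step is omitted here.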

In gensim's `Doc2Vec`, set `dbow_words=1` to train word vectors alongside PV-DBOW.

Note: In its gensim implementation, PV-DBOW uses randomly initialized word embeddings by default; if `dbow_words` is set to 1, word vectors are also trained in skip-gram fashion alongside the DBOW document vectors. Lau & Baldwin (2016) argue that even though DBOW can in theory work with randomized word embeddings, this severely degrades performance on the tasks they examined.

An intuitive explanation can be traced back to the model’s objective function, which is to maximize the dot product between the document embedding and its constituent word embeddings: if word embeddings are randomly distributed, it becomes more difficult to optimize the document embedding to be close to its more critical content words.
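
The objective described above can be sketched in a few lines of NumPy. This is a simplified illustration, not gensim's actual training loop: it uses a full softmax (implementations typically rely on negative sampling or hierarchical softmax), keeps the word embeddings fixed, and all names and sizes are hypothetical. Gradient ascent on the log-probability of one target word pulls the document vector towards that word's embedding.

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, dim = 6, 5
word_vecs = rng.normal(size=(n_words, dim))  # fixed word embeddings (random here)
doc_vec = rng.normal(scale=0.01, size=dim)   # document embedding to optimize
target = 2                                   # index of a word in the document


def log_prob(doc_vec, target):
    """log softmax probability of the target word given the document vector."""
    scores = word_vecs @ doc_vec             # dot products with every word
    scores = scores - scores.max()
    return scores[target] - np.log(np.exp(scores).sum())


def grad_wrt_doc(doc_vec, target):
    """Gradient of log_prob with respect to the document vector."""
    scores = word_vecs @ doc_vec
    p = np.exp(scores - scores.max())
    p /= p.sum()
    # target embedding minus the expected embedding under the model
    return word_vecs[target] - p @ word_vecs


before = log_prob(doc_vec, target)
for _ in range(100):
    doc_vec = doc_vec + 0.05 * grad_wrt_doc(doc_vec, target)
after = log_prob(doc_vec, target)

print(before, after)
```

Because the update direction is determined entirely by the word embeddings, poorly placed (for example random) word vectors give the document vector a less meaningful target to move towards.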

Coded Implementation Using Gensim

**Loading the tokens**

In [1]:
import pandas as pd
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from nltk import word_tokenize
import joblib

In [2]:
tweets_train_tokenized = pd.read_csv('csvs/tweets_train_tokens.csv', index_col=False)
tweets_train_tokenized_message = pd.Series(tweets_train_tokenized.message)
# Convert the pandas Series to a NumPy array of Unicode strings, as required by vectorizers
tweets = tweets_train_tokenized_message.astype('U').values
tweets

array(['arirang simply kpop kim hyung jun cross ha yeong playback',
       'read politico article donald trump running mate tom brady list likely choice',
       'type bazura project google image image photo dad glenn moustache whatthe',
       ..., 'bring dunkin iced coffee tomorrow hero',
       'currently holiday portugal come home tomorrow poland tuesday holocaust memorial trip',
       'ladykiller saturday aternoon'], dtype=object)

In [3]:
# Tokenize each tweet into a list of words (word_tokenize relies on NLTK's Punkt tokenizer data)

words=[word_tokenize(sentence) for sentence in tweets]
words

[['arirang',
  'simply',
  'kpop',
  'kim',
  'hyung',
  'jun',
  'cross',
  'ha',
  'yeong',
  'playback'],
 ['read',
  'politico',
  'article',
  'donald',
  'trump',
  'running',
  'mate',
  'tom',
  'brady',
  'list',
  'likely',
  'choice'],
 ['type',
  'bazura',
  'project',
  'google',
  'image',
  'image',
  'photo',
  'dad',
  'glenn',
  'moustache',
  'whatthe'],
 ['fast',
  'lerner',
  'subpoena',
  'tech',
  'guy',
  'work',
  'hillary',
  'private',
  'server',
  'plead',
  'sound',
  'familiar'],
 ['sony',
  'reward',
  'app',
  'like',
  'lot',
  'female',
  'singer',
  'non',
  'retro',
  'sale',
  'no',
  'info'],
 ['watch',
  'brooklyn',
  'nets',
  'new',
  'york',
  'knick',
  'tonight',
  'postpone',
  'knick',
  'butt',
  'fuck',
  'miami',
  'tomorrow'],
 ['guy', 'open', 'gate', 'naruto', 'save', 'ass', 'goat'],
 ['triple',
  'h',
  'never',
  'ric',
  'flair',
  'bitch',
  'sunday',
  'no',
  'pressure',
  'rollin',
  'look',
  'look',
  'hhh',
  'raw'],
 ['join

In [4]:
# Tag each document with a unique integer id (its paragraph id)
def tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield TaggedDocument(list_of_words, [i])

data_for_training = list(tagged_document(words))
print(data_for_training[0])

TaggedDocument<['arirang', 'simply', 'kpop', 'kim', 'hyung', 'jun', 'cross', 'ha', 'yeong', 'playback'], [0]>


In [5]:
# dm is left at its default (dm=1), so this trains the PV-DM variant
model = Doc2Vec(vector_size=10, min_count=2, epochs=10)
model.build_vocab(data_for_training)

In [6]:
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

In [7]:
import numpy as np
d2v_tweets = []

for i in range(model.corpus_count):
    d2v_tweets.append(model.dv[i])

doc2vec_tweets = np.array(d2v_tweets)
doc2vec_tweets.shape

(49675, 10)

In [8]:
doc2vec_tweets

array([[-0.09706153,  0.00836273, -0.10036916, ..., -0.05951937,
        -0.12229338, -0.02943969],
       [ 0.0768654 ,  0.11051423, -0.14390431, ...,  0.15714288,
        -0.1072313 , -0.10160755],
       [-0.03059013,  0.0757471 ,  0.00896006, ..., -0.16501954,
         0.01974987, -0.09750597],
       ...,
       [ 0.03058317, -0.04542535,  0.02273304, ...,  0.00468193,
        -0.12863246, -0.12644576],
       [ 0.08006796, -0.18050203,  0.09373045, ...,  0.01869539,
        -0.16327702, -0.09766775],
       [ 0.03441153,  0.00589671, -0.02517486, ...,  0.08582754,
        -0.00996797, -0.01485144]], dtype=float32)
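
One common use of such a matrix of document vectors is nearest-neighbour search by cosine similarity. The sketch below assumes an array shaped like `doc2vec_tweets` (one row per document) but runs on a tiny made-up array so that it is self-contained; the helper name `most_similar_rows` is hypothetical.

```python
import numpy as np


def most_similar_rows(vectors, query_index, top_n=3):
    """Return (row index, cosine similarity) for the rows closest to the query row."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf                   # exclude the query itself
    top = np.argsort(-sims)[:top_n]
    return [(int(i), float(sims[i])) for i in top]


# Toy stand-in for a (n_documents, vector_size) embedding matrix
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

neighbours = most_similar_rows(toy_vectors, query_index=0, top_n=2)
print(neighbours)
```

With a trained gensim model, `model.dv.most_similar` offers comparable functionality directly on the stored document vectors.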

In [9]:
# Save the doc2vec document vectors to disk
doc2vec_tweets_file = 'vectors/doc2vec_tweets.sav'
joblib.dump(doc2vec_tweets,doc2vec_tweets_file)

['vectors/doc2vec_tweets.sav']

#### **End. Thank you!**