# 1. Vector Space Models
Throughout this chapter, we’ll represent text
units (characters, phonemes, words, phrases, sentences, paragraphs,
and documents) with vectors of numbers. This is known as the vector
space model (VSM): a mathematical model that represents text units as vectors.
In this setting, the most common way to
calculate similarity between two text blobs is cosine similarity:

![](images/similarity.png)
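For reference, the standard definition of cosine similarity between two vectors $a$ and $b$ is $\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$. The sketch below computes it in plain Python. The function name `cosine_similarity` and the example vectors are illustrative assumptions, and returning 0.0 for an all-zero vector is just one possible convention.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
print(same_direction, orthogonal)
```

Vectors that point in the same direction score close to 1 regardless of their length, while vectors with no shared non-zero dimensions score 0.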

# 2. Basic Vectorization 
Let’s start with a basic idea of text representation: map each word in
the vocabulary (V) of the text corpus to a unique ID (integer value),
then represent each sentence or document in the corpus as a |V|-dimensional vector.

Table 3-1. Our toy corpus

| ID | Document       |
|----|----------------|
| D1 | Dog bites man. |
| D2 | Man bites dog. |
| D3 | Dog eats meat. |
| D4 | Man eats food. |

Lowercasing the text and ignoring punctuation, the vocabulary of this
corpus comprises six words: [dog, bites, man, eats, meat, food].
Every document in this corpus can now be represented with a vector of size
six.

## One-Hot Encoding
Let’s understand this via our toy corpus. We first map each of the six
words to unique IDs: dog = 1, bites = 2, man = 3, meat = 4, food = 5,
eats = 6. Consider the document D1: “dog bites man”. Under this
scheme, each word is a six-dimensional vector. Dog is represented as
[1 0 0 0 0 0], as the word “dog” is mapped to ID 1. Bites is
represented as [0 1 0 0 0 0], and so on. Thus, D1 is
represented as [[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]. D4 (“man eats food”) is
represented as [[0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]]. Other
documents in the corpus can be represented similarly. (The code below
assigns IDs in order of first appearance, so its numbering differs
slightly from the one used here.)


In [2]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [4]:
# build the vocabulary
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count = count + 1
            vocab[word] = count
print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


In [5]:
#Get one hot representation for any string based on this vocabulary. 
#If the word exists in the vocabulary, its representation is returned. 
#If not, a list of zeroes is returned for that word. 
def get_onehot_vector(somestring):
    onehot_encoded = []
    for word in somestring.split():
        temp = [0]*len(vocab)
        if word in vocab:
            temp[vocab[word]-1] = 1
        onehot_encoded.append(temp)
    return onehot_encoded

In [6]:
print(processed_docs[1])
get_onehot_vector(processed_docs[1]) #one hot representation for a text from our corpus

man bites dog


[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

In [7]:
get_onehot_vector("man and dog are good") 
#one hot representation for a random text, using the above vocabulary

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]

In [8]:
get_onehot_vector("man and man are good") 

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]

### One-hot encoding using scikit-learn
We will demonstrate:
- One-Hot Encoding: In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID w_id between 1 and |V|, where V is the set of words in the corpus vocabulary. Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.

- Label Encoding: In label encoding, each word w in our corpus is converted into a numeric value between 0 and n-1 (where n is the number of unique words in our corpus).

In [3]:
S1 = 'dog bites man'
S2 = 'man bites dog'
S3 = 'dog eats meat'
S4 = 'man eats food'

In [4]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = [S1.split(), S2.split(), S3.split(), S4.split()]
values = data[0] + data[1]+ data[2]+ data[3]
print("The data: ", data)

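# Label Encoding
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print("Label Encoder: ", integer_encoded)

# One hot encoder
# Note: OneHotEncoder treats each column (token position) as a separate categorical
# feature, so the matrix has 2 + 2 + 4 = 8 columns rather than one column per word.
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(data).toarray()
print("One hot matrix: \n", onehot_encoded)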

The data:  [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]
Label Encoder:  [1 0 4 4 0 1 1 2 5 4 2 3]
One hot matrix: 
 [[1. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 1. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 1. 0. 0.]]


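**A few shortcomings of one-hot encoding**
- The size of a one-hot vector is directly proportional to the size of
the vocabulary, and most real-world corpora have large
vocabularies.
- This representation does not give a fixed-length
representation for text: a text with ten words gets a longer
representation than a text with three words.
- It treats words as atomic units and has no notion of
(dis)similarity between words.
- It cannot handle the out-of-vocabulary (OOV) problem: a word that was
not seen when the vocabulary was built has no representation (the code
above falls back to an all-zero vector).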
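
The last two points can be checked directly. In the sketch below, which reuses the word order from the prose above (zero-based here), any two distinct one-hot vectors have a dot product of zero, so “dog” is exactly as dissimilar to “man” as it is to “food”. The helper names `one_hot` and `dot` are illustrative, not part of any library.

```python
vocab = ["dog", "bites", "man", "meat", "food", "eats"]

def one_hot(word):
    """One-hot vector for a word; all zeros if the word is out of vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

print(dot(one_hot("dog"), one_hot("man")))   # two distinct words
print(dot(one_hot("dog"), one_hot("food")))  # two other distinct words
print(dot(one_hot("dog"), one_hot("dog")))   # the same word
print(one_hot("cat"))                        # OOV word: no position is set
```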

## Bag of Words
The key idea behind bag of words (BoW) is to
represent the text under consideration as a bag (collection) of words
while ignoring the order and context.

Thus, for our toy corpus (Table 3-1), where the word IDs are dog = 1,
bites = 2, man = 3, meat = 4 , food = 5, eats = 6, D1 becomes [1 1 1 0
0 0]. This is because the first three words in the vocabulary appeared
exactly once in D1, and the last three did not appear at all. D4
becomes [0 0 1 0 1 1].
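
As a minimal sketch, the two vectors above can be reproduced by hand with the word IDs given in the text. The function name `bow_vector` is an illustrative assumption; scikit-learn's `CountVectorizer`, used in the cells that follow, assigns its own (alphabetical) column order.

```python
from collections import Counter

# Word IDs as given in the text (1-based)
word_ids = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().replace(".", "").split())
    vec = [0] * len(word_ids)
    for word, n in counts.items():
        if word in word_ids:  # words outside the vocabulary are ignored
            vec[word_ids[word] - 1] = n
    return vec

d1 = bow_vector("Dog bites man.")
d4 = bow_vector("Man eats food.")
print(d1)
print(d4)
```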

In [5]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."] #Same as the earlier notebook
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# look at document list
print("Our corpus: ", processed_docs)

count_vect = CountVectorizer()
#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

# Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

#see the BOW rep for first 2 documents
print("Bow representation for 'dog bites man': ", bow_rep[0].toarray())
print("Bow representation for 'man bites dog': ", bow_rep[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])
print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
Bow representation for 'dog bites man':  [[1 1 0 0 1 0]]
Bow representation for 'man bites dog':  [[1 1 0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 2 0 0 0 0]]


In [7]:
#BoW with binary vectors
count_vect = CountVectorizer(binary=True)
count_vect.fit(processed_docs)
temp = count_vect.transform(["dog and dog are friends"])
print("Bow representation for 'dog and dog are friends':", temp.toarray())

Bow representation for 'dog and dog are friends': [[0 1 0 0 0 0]]


Let’s look at some of the advantages of this encoding:
- Like one-hot encoding, BoW is fairly simple to understand
and implement.
- With this representation, documents having the same words
will have their vector representations closer to each other in
Euclidean space as compared to documents with completely
different words.
- We have a fixed-length encoding for any sentence of arbitrary
length.

However, it has its share of disadvantages, too:
- The size of the vector increases with the size of the
vocabulary.
- It does not capture the similarity between different words that
mean the same thing.
- This representation does not have any way to handle out of
vocabulary words (i.e., new words that were not seen in the
corpus that was used to build the vectorizer).
- As the name indicates, it is a “bag” of words—word order
information is lost in this representation.
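
Both the Euclidean-distance advantage and the word-order disadvantage show up on the toy corpus. The sketch below uses the BoW vectors in `CountVectorizer`'s column order from the output above; the vector for “man eats food” is derived from that same vocabulary mapping, and the helper `euclidean` is illustrative.

```python
import math

# Column order from the vocabulary above: [bites, dog, eats, food, man, meat]
d1 = [1, 1, 0, 0, 1, 0]  # dog bites man
d2 = [1, 1, 0, 0, 1, 0]  # man bites dog
d4 = [0, 0, 1, 1, 1, 0]  # man eats food

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(d1, d2))  # same words, different order
print(euclidean(d1, d4))  # only "man" in common
```

D1 and D2 are at distance zero even though they describe opposite events, which is exactly the information a bag of words throws away.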

## Bag of N-Grams
The bag-of-n-grams (BoN) approach tries to remedy the loss of word
order. It does so by
breaking text into chunks of n contiguous words (or tokens). This can
help us capture some context, which earlier approaches could not do.
Each chunk is called an n-gram.
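
To make the chunking concrete, here is a minimal sketch of n-gram extraction over whitespace tokens. The function name `ngrams` is an illustrative assumption; real tokenizers handle punctuation and casing as well.

```python
def ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized text as space-joined strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "dog bites man"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```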

In [8]:
#our corpus
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

CountVectorizer, which we used for BoW, can be used for getting a Bag of N-grams representation as well, using its ngram_range argument. The code snippet below shows how:

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

#Ngram vectorization example with count vectorizer and uni, bi, trigrams
count_vect = CountVectorizer(ngram_range=(1,3))

#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog': ",bow_rep[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])

print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our vocabulary:  {'dog': 3, 'bites': 0, 'man': 12, 'dog bites': 4, 'bites man': 2, 'dog bites man': 5, 'man bites': 13, 'bites dog': 1, 'man bites dog': 14, 'eats': 8, 'meat': 17, 'dog eats': 6, 'eats meat': 10, 'dog eats meat': 7, 'food': 11, 'man eats': 15, 'eats food': 9, 'man eats food': 16}
BoW representation for 'dog bites man':  [[1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0]]
BoW representation for 'man bites dog':  [[1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0]]
Bow representation for 'dog and dog are friends': [[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


Here are the main pros and cons of BoN:
- It captures some context and word-order information in the
form of n-grams.
- Thus, the resulting vector space is able to capture some semantic
similarity. Documents having the same n-grams will have
their vectors closer to each other in Euclidean space as
compared to documents with completely different n-grams.
- As n increases, dimensionality (and therefore sparsity)
increases rapidly.
- It still provides no way to address the OOV problem.
## TF-IDF
In all the approaches we have seen so far, all the words in the text are treated as equally important. There is no notion of some words in the document being more important than others. TF-IDF addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus. It is a commonly used representation scheme for information retrieval systems, for extracting relevant documents from a corpus for a given text query.

*TF (term frequency)* measures how often a term or word occurs in a
given document.

$TF(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}$

*IDF (inverse document frequency)* measures the importance of the
term across a corpus. 

$IDF(t) = \log_{e}\frac{\text{Total number of documents in the corpus}}{\text{Number of documents with term } t \text{ in them}}$

$\text{TF-IDF score} = TF \times IDF$
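
The formulas above can be applied to the toy corpus directly. This is a minimal sketch with illustrative function names (`tf`, `idf`, `tf_idf`); it assumes every queried term occurs in at least one document. scikit-learn's `TfidfVectorizer`, used in the cells that follow, applies a smoothed IDF and row normalization by default, so its numbers differ from these.

```python
import math

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term):
    """Natural log of (number of documents / number of documents containing the term)."""
    df = sum(1 for doc in docs if term in doc)  # assumed to be non-zero
    return math.log(len(docs) / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf("dog", docs[0]))       # "dog" is one of three terms in D1
print(idf("dog"))               # "dog" appears in three of four documents
print(idf("meat"))              # "meat" appears in only one document
print(tf_idf("meat", docs[2]))  # so it scores higher than "dog" would
```

A word such as “meat”, which occurs in a single document, receives a higher IDF than “dog”, which occurs in three of the four.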

In [10]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

#IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)
print("-"*10)
#All words in the vocabulary.
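#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("All words in the vocabulary",tfidf.get_feature_names_out().tolist())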
print("-"*10)

#TFIDF representation for all documents in our corpus 
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray()) 
print("-"*10)

temp = tfidf.transform(["dog and man are friends"])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

IDF for all words in the vocabulary [1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
----------
All words in the vocabulary ['bites', 'dog', 'eats', 'food', 'man', 'meat']
----------
TFIDF representation for all documents in our corpus
 [[0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.         0.44809973 0.55349232 0.         0.         0.70203482]
 [0.         0.         0.55349232 0.70203482 0.44809973 0.        ]]
----------
Tfidf representation for 'dog and man are friends':
 [[0.         0.70710678 0.         0.         0.70710678 0.        ]]
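
The IDF values printed above do not equal $\log_e(N/df)$ because, with its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes $IDF(t) = \log_e\frac{1 + N}{1 + df(t)} + 1$, uses raw counts as TF, and L2-normalizes each row. The sketch below reproduces the printed numbers by hand under those defaults; the document frequencies are read off the toy corpus.

```python
import math

N = 4  # number of documents in the toy corpus
# Document frequency of each word, in the column order printed above
doc_freq = {"bites": 2, "dog": 3, "eats": 2, "food": 1, "man": 3, "meat": 1}

# Smoothed IDF, as used by TfidfVectorizer's defaults
smoothed_idf = {t: math.log((1 + N) / (1 + df)) + 1 for t, df in doc_freq.items()}
print(smoothed_idf)

# Row for 'dog bites man': each word occurs once, so tf * idf is just idf,
# followed by L2 normalization of the row.
row = [smoothed_idf[t] for t in ("bites", "dog", "man")]
norm = math.sqrt(sum(v * v for v in row))
normalized = [v / norm for v in row]
print(normalized)
```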




# 3. Distributed Representations
- **Distributional similarity**: This is the idea that the meaning of a word can be understood from
the context in which the word appears.
- **Distributional hypothesis**: In linguistics, this hypothesizes that words that occur in similar
contexts have similar meanings.
- **Distributional representation**: This refers to representation schemes that are obtained based on
the distribution of words from the context in which the words appear.
These schemes are based on the distributional hypothesis.
- **Distributed representation**: This is a related concept. It, too, is based on the distributional
hypothesis, but the resulting vectors are compact (low dimensional) and dense
rather than high dimensional and sparse.
- **Embedding**: For the set of words in a corpus, an embedding is a mapping from the
vector space coming from the distributional representation to the vector
space coming from the distributed representation.
- **Vector semantics**: This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a
large corpus.


## Word Embeddings
Word2vec ensures
that the learned word representations are low dimensional (vectors of
dimensions 50–500, instead of several thousands) and dense (that is, most values
in these vectors are non-zero). Word2vec builds on distributional similarity and the distributional
hypothesis: it takes a large corpus of text as input and
“learns” to represent the words in a common vector space based on the
contexts in which they appear in the corpus.

### PRE-TRAINED WORD EMBEDDINGS
Some of the most popular pre-trained embeddings are
**Word2vec** by Google, **GloVe** by Stanford, and **fastText**
embeddings by Facebook, to name a few. Further, they’re
available in various dimensions, such as $d = 25, 50, 100, 200, 300, 600$.

In [2]:
import os
import warnings #This module ignores the various types of warnings generated
warnings.filterwarnings("ignore") 

import psutil #This module helps in retrieving information on running processes and system resource utilization
process = psutil.Process(os.getpid())
from psutil import virtual_memory
mem = virtual_memory()

import time #This module is used to calculate the time 

In [3]:
from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = "GoogleNews-vectors-negative300.bin"

#Load W2V model. This will take some time, but it is a one time effort! 
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
print('-'*10)

start_time = time.time() #Start the timer
ttl = mem.total #Total memory available

w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) #load the model
print("%0.2f seconds taken to load"%float(time.time() - start_time)) #Calculate the total time elapsed since starting the timer
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) #Calculate the memory used after loading the model
print('-'*10)

print("Memory after loading as a percentage of memory before: {:.2f}% ".format(float((post/pre)*100))) #Ratio of memory used after vs. before loading, as a percentage
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.vocab)) #Number of words in the vocabulary (gensim < 4.0; in gensim 4+ use len(w2v_model.key_to_index))

Memory used in GB before Loading the Model: 0.13
----------
39.68 seconds taken to load
----------
Finished loading Word2Vec
----------
Memory used in GB after Loading the Model: 4.82
----------
Memory after loading as a percentage of memory before: 3666.09% 
----------
Number of words in vocabulary:  3000000


In [4]:
#Let us examine the model by knowing what the most similar words are, for a given word!
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353004455566406),
 ('lovely', 0.810693621635437),
 ('stunningly_beautiful', 0.7329413890838623),
 ('breathtakingly_beautiful', 0.7231341004371643),
 ('wonderful', 0.6854087114334106),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402292251587)]

In [5]:
#What is the vector representation for a word? 
w2v_model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

**Two things to note while using pre-trained models:**
1. Tokens must match the casing used when the model was trained (many pre-trained models lowercase their tokens).
2. If a word is not in the vocabulary, the model throws an exception, so it is always a good idea to encapsulate lookups in try/except blocks.
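
A minimal sketch of that pattern follows. The dictionary `embeddings` is a hypothetical stand-in for a loaded model such as `w2v_model` (a plain dict also raises `KeyError` on a missing key), and `get_vector` is an illustrative helper, not a gensim function.

```python
# Hypothetical stand-in for a loaded embedding model; the values are made up.
embeddings = {"computer": [0.107, -0.201, 0.123]}

def get_vector(model, word, default=None):
    """Return the vector for `word`, or `default` if the word is out of vocabulary."""
    try:
        return model[word]
    except KeyError:
        return default

print(get_vector(embeddings, "computer"))
print(get_vector(embeddings, "covid"))  # not in the vocabulary, so the default is returned
```

Returning a default keeps a long preprocessing loop from crashing on the first unseen word; whether to skip such words or substitute a zero vector depends on the downstream task.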

### TRAINING OUR OWN EMBEDDINGS
### CBOW (Continuous bag of words)
CBOW learns a language model that tries to predict the “center”
word from the words in its context.
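
To see what the training examples look like, the sketch below generates (context, center) pairs from a tokenized sentence. The function name `cbow_pairs` and the window size of 1 are illustrative assumptions; gensim builds its training data internally and differs in detail.

```python
def cbow_pairs(tokens, window=1):
    """Yield (context_words, center_word) pairs for a CBOW-style objective."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip tokens that have no neighbors
            yield context, center

pairs = list(cbow_pairs(["dog", "bites", "man"], window=1))
for context, center in pairs:
    print(context, "->", center)
```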

![](images/cbow.png)
<center>CBOW: given the context words, predict the center word</center>

In [2]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')



In [3]:
# define training data
#Gensim's Word2Vec expects a 'list of lists' for training: one list per document,
#each containing that document's tokens.
corpus = [['dog','bites','man'], ["man", "bites" ,"dog"],["dog","eats","meat"],["man", "eats","food"]]

#Training the model
model_cbow = Word2Vec(corpus, min_count=1,sg=0) #using CBOW architecture for training
model_skipgram = Word2Vec(corpus, min_count=1,sg=1) #using SkipGram architecture for training

In [4]:
#Summarize the trained model
print(model_cbow)

#Summarize vocabulary (gensim < 4.0; in gensim 4+ use model_cbow.wv.key_to_index)
words = list(model_cbow.wv.vocab)
print(words)

#Access vector for one word
print(model_cbow.wv['dog'])

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-2.2637864e-04  1.1613251e-03 -1.0871805e-03  2.7705024e-03
 -3.1460954e-03 -1.5414666e-03  1.0732063e-04  2.1121099e-03
  1.3804922e-03 -8.9474044e-05 -1.9507032e-04 -3.7300987e-03
  1.2606074e-03 -4.3903719e-04  9.7034819e-05 -2.3361982e-03
  3.0791026e-03 -7.3196675e-04 -1.5462874e-03  4.1705086e-03
 -1.2412993e-03  1.2338976e-03 -4.3189838e-03 -1.8837935e-03
 -4.1378485e-03 -4.2353724e-03 -4.7259713e-03  2.3909356e-03
 -4.4260966e-03 -6.0296076e-04  5.2839715e-04 -4.4406243e-03
 -4.7190804e-03  4.7737407e-03  4.9759797e-03 -3.8089291e-03
  2.2341565e-03 -2.6227571e-03  1.8058733e-04 -4.9039810e-03
  2.0937852e-03  1.7333912e-03 -1.9353403e-03  3.7302757e-03
 -3.3402010e-03  2.8572769e-03  3.3832456e-03 -4.9053812e-03
 -2.1320502e-03  1.7676475e-03 -4.2440277e-03 -2.0944851e-03
 -4.5314971e-03 -3.3794966e-04  3.1579842e-03  3.4176917e-03
  3.3303302e-06  7.3370477e-04  3.4401997e-03 -3.7586696e

In [5]:
#Compute similarity (lookups go through the model's .wv attribute)
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.0658826
Similarity between eats and man: 0.0586519


In [6]:
#Most similar words
model_cbow.wv.most_similar('meat')

[('food', 0.1552993208169937),
 ('eats', 0.06202014908194542),
 ('bites', 0.03326358646154404),
 ('dog', -0.15304674208164215),
 ('man', -0.15878522396087646)]

### SkipGram
SkipGram is very similar to CBOW, with some minor changes. In
SkipGram, the task is to predict the context words from the center
word.
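
The reversed direction is easy to see in the training pairs. The sketch below generates (center, context) pairs from a tokenized sentence; the function name `skipgram_pairs` and the window size of 1 are illustrative assumptions rather than gensim internals.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center_word, context_word) pairs for a skip-gram-style objective."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sg_pairs = skipgram_pairs(["dog", "bites", "man"], window=1)
print(sg_pairs)
```

Each center word produces one training pair per context word, whereas CBOW produces one pair per center word with the whole context as input.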

![](images/skipgram.png)
<center>SkipGram: given the center word, predict every word in context</center>

![](images/cbow-skipgram.png)
<center>cbow-skipgram model</center>

Using packages like gensim, it’s pretty straightforward from a code
point of view to implement Word2vec.

In [7]:
#Summarize the trained model
print(model_skipgram)

#Summarize vocabulary (gensim < 4.0; in gensim 4+ use model_skipgram.wv.key_to_index)
words = list(model_skipgram.wv.vocab)
print(words)

#Access vector for one word
print(model_skipgram.wv['dog'])

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-2.2637864e-04  1.1613251e-03 -1.0871805e-03  2.7705024e-03
 -3.1460954e-03 -1.5414666e-03  1.0732063e-04  2.1121099e-03
  1.3804922e-03 -8.9474044e-05 -1.9507032e-04 -3.7300987e-03
  1.2606074e-03 -4.3903719e-04  9.7034819e-05 -2.3361982e-03
  3.0791026e-03 -7.3196675e-04 -1.5462874e-03  4.1705086e-03
 -1.2412993e-03  1.2338976e-03 -4.3189838e-03 -1.8837935e-03
 -4.1378485e-03 -4.2353724e-03 -4.7259713e-03  2.3909356e-03
 -4.4260966e-03 -6.0296076e-04  5.2839715e-04 -4.4406243e-03
 -4.7190804e-03  4.7737407e-03  4.9759797e-03 -3.8089291e-03
  2.2341565e-03 -2.6227571e-03  1.8058733e-04 -4.9039810e-03
  2.0937852e-03  1.7333912e-03 -1.9353403e-03  3.7302757e-03
 -3.3402010e-03  2.8572769e-03  3.3832456e-03 -4.9053812e-03
 -2.1320502e-03  1.7676475e-03 -4.2440277e-03 -2.0944851e-03
 -4.5314971e-03 -3.3794966e-04  3.1579842e-03  3.4176917e-03
  3.3303302e-06  7.3370477e-04  3.4401997e-03 -3.7586696e

# 4. Distributed Representations Beyond Words and Characters
Let’s look at another approach, Doc2vec, which allows us to directly
learn the representations for texts of arbitrary lengths (phrases,
sentences, paragraphs, and documents) by taking the context of words
in the text into account.

The
two architectures are called *distributed memory (DM)* and *distributed
bag of words (DBOW)*.

![](images/dm.png)
<center>Doc2vec architectures: DM (left) and DBOW (right)</center>

In [9]:
#Import spacy and load the model
import spacy
nlp = spacy.load("en_core_web_sm") #here nlp object refers to the 'en_core_web_sm' 

In [10]:
#Assume each sentence in documents corresponds to a separate document.
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

print("Document After Pre-Processing:",processed_docs)


#Iterate over each document and initiate an nlp instance.
for doc in processed_docs:
    doc_nlp = nlp(doc) #creating a spacy "Doc" object which is a container for accessing linguistic annotations. 
    
    print("-"*30)
    print("Average Vector of '{}'\n".format(doc),doc_nlp.vector)#this gives the average vector of each document
    for token in doc_nlp:
        print()
        print(token.text,token.vector)#this gives the text of each word in the doc and their respective vectors.

Document After Pre-Processing: ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
------------------------------
Average Vector of 'dog bites man'
 [ 0.35036066  0.10273071  0.33009622 -0.2030462   0.397346   -0.05984474
 -0.2201394   0.25512496 -0.3575406   0.39600918 -0.75429106 -0.5888314
  0.29484284 -0.63774514 -0.23350935  0.5900816  -0.2566284  -0.71845955
  0.20040572  0.7679166  -0.26526675 -0.6816276  -0.0701522   0.04820635
  0.1266749   0.2589217  -0.6932214  -0.3419633   1.0904325  -0.32465276
  1.4362421  -0.5931116   0.32251295 -0.341225   -0.12486354 -0.7798831
 -0.29717746  0.4014299  -0.1318171   0.910722   -0.41182932  0.04191664
  0.59365046 -0.04422406 -0.18440922 -0.05003772  0.59136873 -0.6386824
  1.8019737  -0.04936111  0.27116123  0.21994926 -0.2368415  -0.23461938
  0.22323321  1.0983983  -0.39096567  0.10752393 -0.06386908  0.14312072
  0.37180772 -0.34773377 -0.42992604 -0.4652144  -0.58004665  0.37198398
  0.04235339 -0.4719428   0.281242

In this notebook we demonstrate how to train a doc2vec model on a custom corpus.

In [11]:
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\minhh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

#Tag each document with its index so Doc2Vec can learn one vector per document
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(data)]
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [14]:
#dbow
model_dbow = Doc2Vec(tagged_data,vector_size=20, min_count=1, epochs=2,dm=0)

In [15]:
print(model_dbow.infer_vector(['man','eats','food']))#feature vector of man eats food

[ 0.00420315  0.00653283 -0.00940372  0.01937548 -0.02182949 -0.01730926
  0.01996748  0.00888949 -0.02249881 -0.0033143   0.014049    0.01591324
 -0.00174675  0.02163521  0.01903928 -0.0141784   0.0058233  -0.01405842
  0.01193869  0.01563022]


In [16]:
model_dbow.wv.most_similar("man",topn=5)#top 5 most similar words.

[('bites', 0.10207571089267731),
 ('eats', 0.10173594951629639),
 ('food', -0.06450934708118439),
 ('dog', -0.19907894730567932),
 ('meat', -0.2696749269962311)]

In [17]:
model_dbow.wv.n_similarity(["dog"],["man"])

-0.19907895

In [18]:
#dm
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2,dm=1)

print("Inference Vector of man eats food\n ",model_dm.infer_vector(['man','eats','food']))

print("Most similar words to man in our corpus\n",model_dm.wv.most_similar("man",topn=5))
print("Similarity between man and dog: ",model_dm.wv.n_similarity(["dog"],["man"]))

Inference Vector of man eats food
  [ 0.00417791  0.00643509 -0.00935274  0.01932913 -0.02187857 -0.01716197
  0.02004039  0.00903782 -0.02242476 -0.00328896  0.01409931  0.01591488
 -0.00172501  0.02172272  0.01902425 -0.01420487  0.00586274 -0.01404743
  0.01185267  0.01552101]
Most similar words to man in our corpus
 [('bites', 0.10207571089267731), ('eats', 0.10173594951629639), ('food', -0.06450934708118439), ('dog', -0.19907894730567932), ('meat', -0.2696749269962311)]
Similarity between man and dog:  -0.19907895


What happens when we compare words that are not in the vocabulary?

In [19]:
model_dm.wv.n_similarity(['covid'],['man'])

KeyError: "word 'covid' not in vocabulary"

# 5. Universal Text Representations
These representations are very useful and popular in modern-day NLP. However, based on our experience, here are a few important
aspects to keep in mind while using them in your project:

- All text representations are inherently biased based on what
they saw in training data.
- Unlike the basic vectorization approaches, pre-trained
embeddings are generally large-sized files (several
gigabytes), which may pose problems in certain deployment
scenarios. This is something we need to address while using
them, otherwise it can become an engineering bottleneck in
performance.
- Modeling language for a real-world application is more than
capturing the information via word and sentence embeddings.
- As we speak, neural text representation is an evolving area in
NLP, with rapidly changing state of the art.

# 6. Visualizing Embeddings
Enter t-SNE, or *t-distributed Stochastic Neighbor
Embedding*. It’s a technique used for visualizing high-dimensional data
like embeddings by reducing them to two- or three-dimensional data.
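
As a minimal sketch of the reduction step, the code below projects a set of vectors to two dimensions with scikit-learn's `TSNE`. The random vectors are a hypothetical stand-in for real embeddings, and the parameter values (`perplexity=5`, `init="random"`, `random_state=0`) are arbitrary choices for a tiny example rather than recommended settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical stand-in for real embeddings: 20 random 50-dimensional vectors
vectors = rng.normal(size=(20, 50))

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
points_2d = tsne.fit_transform(vectors)

print(points_2d.shape)  # one 2-D point per input vector
```

The resulting points can be passed to any scatter-plot routine, with each point labeled by the word or document it came from.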

![](images/mnist.png)
<center>Visualizing MNIST data using t-SNE</center>

The figure below shows not only the
position of the vectors of these words, but also an interesting
observation between the vectors: the arrows capture the
“relationship” between words. t-SNE visualization helps greatly in
coming up with such observations.

![](images/t-sne.png)
<center>t-SNE visualization shows some interesting relationships</center>

# 7. Handcrafted Feature Representations
Clearly, measures such as “syntactic
complexity,” “concreteness,” etc., cannot be calculated by only
converting text into BoW or embedding representations. They have to
be designed manually, keeping in mind both the domain knowledge and
the ML algorithms to train the NLP models. This is why we call these
*handcrafted feature representations*.
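
As a rough illustration of the idea, the sketch below computes a few simple surface features by hand. These particular features (sentence count, average sentence length, average word length, type-token ratio) and the function name `handcrafted_features` are illustrative assumptions; they are much cruder than measures such as syntactic complexity or concreteness, which need linguistic resources and domain knowledge.

```python
import re

def handcrafted_features(text):
    """Compute a few simple, manually designed surface features of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "num_sentences": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }

features = handcrafted_features("Dog bites man. Man bites dog.")
print(features)
```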