[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ignaziogallo/data-mining/blob/aa20-21/tutorials/data/Word2Vec.ipynb)

http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code  
https://mc.ai/word2vec-for-phrases%E2%80%8A-%E2%80%8Alearning-embeddings-for-more-than-one-word/  
https://becominghuman.ai/how-does-word2vecs-skip-gram-work-f92e0525def4

# Word Embedding

* Word embeddings are **methods to represent words in a numerical way**.
* Example: **one-hot encoding** maps each word to a one-hot vector.
<img src="figures/one-hot-word-embedding-vectors.png" width="60%">
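
As a minimal sketch of one-hot encoding, the snippet below maps each word of a small, made-up vocabulary to a vector with a single 1. The vocabulary and the helper name `one_hot` are assumptions chosen for illustration.

```python
# Toy vocabulary; the words are illustrative, not taken from any dataset.
vocabulary = ["hotel", "room", "clean", "dirty"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("clean"))  # [0, 0, 1, 0]
```

Note that the dot product of two different one-hot vectors is always 0, so this encoding carries no information about how similar two words are.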

# Idea behind Word2Vec
* The idea behind Word2Vec is pretty simple:  
  `the meaning of a word can be inferred by the company it keeps`. 
* This is analogous to the saying,  
  “`show me your friends, and I’ll tell you who you are`”. 

* If we have two words that have **very similar neighbors**,   
  then these words are probably quite **similar in meaning** or are at least highly related.

**Example** 

* The words `shocked`, `appalled` and `astonished` are typically used in similar contexts. 

Conversely,
* the word “`play`” has a different meaning in the sentences  
  * “`The boy loves to play outside`”
  * “`The play was fantastic`”.

**Learning a word from its context**

* `Word2Vec` starts from a **large unlabeled corpus** 
  * **for each word** in the corpus, we try to **predict it** from its **context** (CBOW), or 
  * we try to **predict the context** from a given **word** (Skip-Gram)

<img src="figures/w2v-models.png" width="60%">
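
To make the two training setups concrete, here is a small sketch that builds the training examples each one would see for a single sentence. The function names and the fixed `window=2` are assumptions for illustration. Real implementations typically add refinements (such as varying the effective window or subsampling frequent words) that are ignored here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: predict each neighbor from the center word."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window=2):
    """Yield (context_words, center) examples: predict the center from its neighbors."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, center

tokens = "the boy loves to play outside".split()
sg = list(skipgram_pairs(tokens, window=2))
cb = list(cbow_examples(tokens, window=2))

print(sg[:3])  # [('the', 'boy'), ('the', 'loves'), ('boy', 'the')]
print(cb[2])   # (['the', 'boy', 'to', 'play'], 'loves')
```

Skip-Gram produces one training example per (word, neighbor) pair, while CBOW produces one example per word with all of its neighbors grouped together.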

**Neural network structure**

Word2Vec is a neural network structure **that generates word embeddings** by training the model on a supervised classification problem whose labels come from the text itself (the neighboring words). 

* Introduced by **Mikolov** et al. (2013) in the paper *Efficient Estimation of Word Representations in Vector Space*  
* Used to measure syntactic and **semantic similarities** between words.

**Word2Vec with Continuous Bag of Words (CBOW) Learning technique**

<img src="figures/Architecture-of-Word2Vec-with-CBOW-technique.png" width="40%">

**Word2vec SKIP-GRAM architecture**

Given a word $w(t)$, the model **predicts the words** that precede and follow it within a **window** of 4 words (two on each side), 
$w(t-2)$, $w(t-1)$, $w(t+1)$, $w(t+2)$ 
<img src="figures/WORD2VEC-SKIP-GRAM.png" width="70%">

**Training SKIP-GRAM**

<img src="figures/skip-gram-training.png" width="70%">

### The Neural model for Skip-Gram
* [Use this](https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca) to understand better how to turn each input word into a vector using an embedding algorithm
* *Xin Rong* has created a [visual demo](https://ronxin.github.io/wevi/) that shows how word embeddings are trained
<img src="figures/skip-gram-neural-model.png" width="70%">
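
The forward pass of such a network can be sketched in a few lines of NumPy. This is a simplified illustration with random weights and a full softmax. The matrix names `W_in` and `W_out` and the toy sizes are assumptions, and practical implementations usually replace the full softmax with cheaper approximations such as negative sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["the", "boy", "loves", "to", "play", "outside"]
V, d = len(vocabulary), 3            # vocabulary size, embedding dimension (toy values)

W_in = rng.normal(size=(V, d))       # input (embedding) matrix: one row per word
W_out = rng.normal(size=(d, V))      # output matrix: one column per word

def forward(word_index):
    """Return the hidden vector and the softmax distribution over context words."""
    one_hot = np.zeros(V)
    one_hot[word_index] = 1.0
    hidden = one_hot @ W_in          # equals row `word_index` of W_in
    scores = hidden @ W_out
    exp_scores = np.exp(scores - scores.max())   # shift for numerical stability
    probs = exp_scores / exp_scores.sum()
    return hidden, probs

hidden, probs = forward(vocabulary.index("play"))
print(hidden.shape, probs.shape)     # (3,) (6,)
```

Multiplying a one-hot vector by `W_in` simply selects one row, which is why the rows of the input matrix are the word vectors kept after training.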

**Embedding matrix after Word2Vec training**

* Each **word** is represented by a **d-dimensional continuous vector**. 
* The **meaning of each word** is captured by its **relation to other words**. 

**The reason**
* During training, `if two target words share the same contexts, the network weights for these two words are pushed close to each other`. 

**Similarity tasks**

Looking at the vectors as a whole, one can perform many similarity tasks.   
For example, 
* we get that $$V(\text{King}) - V(\text{Man}) + V(\text{Woman}) \approx V(\text{Queen})$$ and $$V(\text{Paris}) - V(\text{France}) + V(\text{Spain}) \approx V(\text{Madrid})$$ 
* We can compute similarity measures, such as the **cosine similarity between vectors**, and find that   
  “`president`” is close to “`Obama`”, “`Trump`”, “`CEO`”, “`Chairman`”, etc.
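
The analogy arithmetic can be illustrated with hand-made vectors. The 2-d vectors below are invented so that the example works exactly; they are not trained embeddings, and the helper `analogy` is a hypothetical name.

```python
import numpy as np

# Hand-made 2-d vectors (axes: "royalty", "gender"); purely illustrative, not trained.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word whose vector is closest to V(a) - V(b) + V(c)."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

With trained embeddings the match is only approximate, which is why the nearest neighbor by cosine similarity is returned rather than an exact equality check.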

# Word2Vec in Gensim 

**Performance** is really a combination of two things, 
1. your **input data** 
2. your **parameter** settings. 

**Note** that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

### Imports and logging

First, we start with our imports and get logging established:

In [22]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


### Dataset 
**The secret** to getting Word2Vec really working for you is to **have lots and lots of text data**. 
* We are going to use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset. 
* This dataset has full user **reviews of cars and hotels**. 

**Download** dataset from https://github.com/kavgan/nlp-in-practice/blob/master/word2vec/reviews_data.txt.gz (86 MB)

In [4]:
#!wget https://github.com/kavgan/nlp-in-practice/raw/master/word2vec/reviews_data.txt.gz --output-document=data/reviews_data.txt.gz

In [23]:
data_file="data/reviews_data.txt.gz"
# let's take a closer look at this data below by printing the first line.
with gzip.open (data_file, 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break


b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

### Read files into a list
We can read it into a list so that we can pass this on to the Word2Vec model. 

**Notice** 
* we are directly reading the compressed file. 
* We are doing basic **pre-processing** of the reviews using `gensim.utils.simple_preprocess(line)`,   
  * which performs **tokenization**, **lowercasing**, etc., and 
  * returns a **list of tokens** (words).

In [7]:

def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    
    logging.info("reading file {0}...this may take a while".format(input_file))
    
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info ("read {0} reviews".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess (line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list (read_input (data_file))
logging.info ("Done reading data file")    

2021-03-14 13:56:06,650 : INFO : reading file data/reviews_data.txt.gz...this may take a while
2021-03-14 13:56:06,652 : INFO : read 0 reviews
2021-03-14 13:56:08,524 : INFO : read 10000 reviews
2021-03-14 13:56:10,409 : INFO : read 20000 reviews
2021-03-14 13:56:12,584 : INFO : read 30000 reviews
2021-03-14 13:56:14,736 : INFO : read 40000 reviews
2021-03-14 13:56:16,997 : INFO : read 50000 reviews
2021-03-14 13:56:19,172 : INFO : read 60000 reviews
2021-03-14 13:56:21,016 : INFO : read 70000 reviews
2021-03-14 13:56:22,709 : INFO : read 80000 reviews
2021-03-14 13:56:24,489 : INFO : read 90000 reviews
2021-03-14 13:56:26,215 : INFO : read 100000 reviews
2021-03-14 13:56:27,909 : INFO : read 110000 reviews
2021-03-14 13:56:29,625 : INFO : read 120000 reviews
2021-03-14 13:56:31,778 : INFO : read 130000 reviews
2021-03-14 13:56:33,662 : INFO : read 140000 reviews
2021-03-14 13:56:35,423 : INFO : read 150000 reviews
2021-03-14 13:56:37,212 : INFO : read 160000 reviews
2021-03-14 13:56:3

## Training the Word2Vec model

* Just instantiate Word2Vec and pass the reviews that we read in the previous step (the `documents`). 
* Word2Vec uses all these tokens to internally **create a vocabulary**.
* Because `documents` is passed to the constructor, gensim already trains the model once at that point (5 epochs by default).
* Calling `train(...)` then continues training for 10 more epochs (the whole cell takes about 10 minutes).
* We are **not** going to **use the neural network** after training. 
* **The goal** is to learn the weights of the hidden layer, i.e. the **word vectors**. 

In [8]:
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10)
model.train(documents,total_examples=len(documents),epochs=10)

2021-03-14 13:58:15,286 : INFO : collecting all words and their counts
2021-03-14 13:58:15,287 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-14 13:58:15,571 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2021-03-14 13:58:15,839 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2021-03-14 13:58:16,159 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2021-03-14 13:58:16,478 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2021-03-14 13:58:16,812 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2021-03-14 13:58:17,133 : INFO : PROGRESS: at sentence #60000, processed 11013723 words, keeping 76780 word types
2021-03-14 13:58:17,407 : INFO : PROGRESS: at sentence #70000, processed 12637525 words, keeping 83193 word types
2021-03-14 13:58:17,666 : INFO : PROG

2021-03-14 13:59:15,899 : INFO : EPOCH 1 - PROGRESS: at 94.62% examples, 768178 words/s, in_qsize 16, out_qsize 3
2021-03-14 13:59:16,900 : INFO : EPOCH 1 - PROGRESS: at 97.26% examples, 768039 words/s, in_qsize 20, out_qsize 0
2021-03-14 13:59:17,846 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-03-14 13:59:17,866 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-03-14 13:59:17,874 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-03-14 13:59:17,880 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-03-14 13:59:17,892 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-03-14 13:59:17,896 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-03-14 13:59:17,897 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-03-14 13:59:17,900 : INFO : EPOCH 1 - PROGRESS: at 99.96% examples, 768211 words/s, in_qsize 2, out_qsize 1
2021-03-14 13:59:17,90

2021-03-14 14:00:11,140 : INFO : EPOCH 3 - PROGRESS: at 28.55% examples, 725095 words/s, in_qsize 20, out_qsize 2
2021-03-14 14:00:12,142 : INFO : EPOCH 3 - PROGRESS: at 31.47% examples, 730301 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:00:13,169 : INFO : EPOCH 3 - PROGRESS: at 34.12% examples, 732256 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:00:14,170 : INFO : EPOCH 3 - PROGRESS: at 36.81% examples, 734232 words/s, in_qsize 16, out_qsize 3
2021-03-14 14:00:15,174 : INFO : EPOCH 3 - PROGRESS: at 39.51% examples, 735809 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:00:16,189 : INFO : EPOCH 3 - PROGRESS: at 42.12% examples, 733130 words/s, in_qsize 18, out_qsize 1
2021-03-14 14:00:17,191 : INFO : EPOCH 3 - PROGRESS: at 44.81% examples, 733569 words/s, in_qsize 18, out_qsize 1
2021-03-14 14:00:18,214 : INFO : EPOCH 3 - PROGRESS: at 47.45% examples, 734622 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:00:19,235 : INFO : EPOCH 3 - PROGRESS: at 50.06% examples, 733915 words/s,

2021-03-14 14:01:15,066 : INFO : EPOCH 4 - PROGRESS: at 92.52% examples, 772906 words/s, in_qsize 20, out_qsize 2
2021-03-14 14:01:16,078 : INFO : EPOCH 4 - PROGRESS: at 95.24% examples, 773235 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:01:17,082 : INFO : EPOCH 4 - PROGRESS: at 97.89% examples, 773263 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:01:17,756 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-03-14 14:01:17,769 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-03-14 14:01:17,784 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-03-14 14:01:17,789 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-03-14 14:01:17,805 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-03-14 14:01:17,824 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-03-14 14:01:17,827 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-03-14 14:01:17,8

2021-03-14 14:02:08,436 : INFO : EPOCH 1 - PROGRESS: at 24.45% examples, 758774 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:02:09,437 : INFO : EPOCH 1 - PROGRESS: at 27.29% examples, 760468 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:02:10,440 : INFO : EPOCH 1 - PROGRESS: at 29.99% examples, 759973 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:02:11,458 : INFO : EPOCH 1 - PROGRESS: at 33.01% examples, 763425 words/s, in_qsize 20, out_qsize 1
2021-03-14 14:02:12,465 : INFO : EPOCH 1 - PROGRESS: at 35.74% examples, 766144 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:02:13,472 : INFO : EPOCH 1 - PROGRESS: at 38.57% examples, 767792 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:02:14,478 : INFO : EPOCH 1 - PROGRESS: at 41.65% examples, 771555 words/s, in_qsize 18, out_qsize 1
2021-03-14 14:02:15,483 : INFO : EPOCH 1 - PROGRESS: at 44.65% examples, 774488 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:02:16,487 : INFO : EPOCH 1 - PROGRESS: at 47.36% examples, 775215 words/s,

2021-03-14 14:03:12,016 : INFO : EPOCH 2 - PROGRESS: at 91.82% examples, 770359 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:03:13,020 : INFO : EPOCH 2 - PROGRESS: at 94.41% examples, 769925 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:03:14,027 : INFO : EPOCH 2 - PROGRESS: at 97.13% examples, 770400 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:03:14,971 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-03-14 14:03:14,982 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-03-14 14:03:14,989 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-03-14 14:03:15,007 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-03-14 14:03:15,009 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-03-14 14:03:15,017 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-03-14 14:03:15,023 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-03-14 14:03:15,0

2021-03-14 14:04:07,375 : INFO : EPOCH 4 - PROGRESS: at 28.78% examples, 793760 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:04:08,410 : INFO : EPOCH 4 - PROGRESS: at 31.94% examples, 796509 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:04:09,426 : INFO : EPOCH 4 - PROGRESS: at 34.71% examples, 797997 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:04:10,448 : INFO : EPOCH 4 - PROGRESS: at 37.50% examples, 796987 words/s, in_qsize 18, out_qsize 1
2021-03-14 14:04:11,454 : INFO : EPOCH 4 - PROGRESS: at 40.52% examples, 799934 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:04:12,459 : INFO : EPOCH 4 - PROGRESS: at 43.45% examples, 799686 words/s, in_qsize 17, out_qsize 2
2021-03-14 14:04:13,471 : INFO : EPOCH 4 - PROGRESS: at 46.46% examples, 800815 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:04:14,483 : INFO : EPOCH 4 - PROGRESS: at 49.36% examples, 801728 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:04:15,496 : INFO : EPOCH 4 - PROGRESS: at 52.17% examples, 802797 words/s,

2021-03-14 14:05:11,205 : INFO : EPOCH 5 - PROGRESS: at 92.99% examples, 756055 words/s, in_qsize 18, out_qsize 1
2021-03-14 14:05:12,217 : INFO : EPOCH 5 - PROGRESS: at 95.91% examples, 758321 words/s, in_qsize 17, out_qsize 2
2021-03-14 14:05:13,246 : INFO : EPOCH 5 - PROGRESS: at 98.81% examples, 759687 words/s, in_qsize 17, out_qsize 2
2021-03-14 14:05:13,574 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-03-14 14:05:13,592 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-03-14 14:05:13,615 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-03-14 14:05:13,620 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-03-14 14:05:13,621 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-03-14 14:05:13,627 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-03-14 14:05:13,632 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-03-14 14:05:13,6

2021-03-14 14:06:06,557 : INFO : EPOCH 7 - PROGRESS: at 33.66% examples, 772540 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:06:07,590 : INFO : EPOCH 7 - PROGRESS: at 36.51% examples, 773736 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:06:08,596 : INFO : EPOCH 7 - PROGRESS: at 39.40% examples, 775930 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:06:09,602 : INFO : EPOCH 7 - PROGRESS: at 42.24% examples, 775751 words/s, in_qsize 20, out_qsize 7
2021-03-14 14:06:10,606 : INFO : EPOCH 7 - PROGRESS: at 45.27% examples, 778928 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:06:11,623 : INFO : EPOCH 7 - PROGRESS: at 47.99% examples, 779279 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:06:12,633 : INFO : EPOCH 7 - PROGRESS: at 50.90% examples, 781009 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:06:13,639 : INFO : EPOCH 7 - PROGRESS: at 53.53% examples, 782561 words/s, in_qsize 18, out_qsize 1
2021-03-14 14:06:14,641 : INFO : EPOCH 7 - PROGRESS: at 56.38% examples, 783687 words/s,

2021-03-14 14:07:10,578 : INFO : EPOCH 8 - PROGRESS: at 96.27% examples, 738287 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:07:11,587 : INFO : EPOCH 8 - PROGRESS: at 99.08% examples, 739790 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:07:11,844 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-03-14 14:07:11,873 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-03-14 14:07:11,879 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-03-14 14:07:11,889 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-03-14 14:07:11,892 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-03-14 14:07:11,896 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-03-14 14:07:11,905 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-03-14 14:07:11,909 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-03-14 14:07:11,912 : INFO : worker thre

2021-03-14 14:08:06,672 : INFO : EPOCH 10 - PROGRESS: at 25.98% examples, 729227 words/s, in_qsize 16, out_qsize 3
2021-03-14 14:08:07,673 : INFO : EPOCH 10 - PROGRESS: at 28.67% examples, 729258 words/s, in_qsize 18, out_qsize 1
2021-03-14 14:08:08,673 : INFO : EPOCH 10 - PROGRESS: at 31.68% examples, 735345 words/s, in_qsize 20, out_qsize 0
2021-03-14 14:08:09,681 : INFO : EPOCH 10 - PROGRESS: at 34.05% examples, 733122 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:08:10,692 : INFO : EPOCH 10 - PROGRESS: at 36.91% examples, 737696 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:08:11,699 : INFO : EPOCH 10 - PROGRESS: at 39.82% examples, 742690 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:08:12,707 : INFO : EPOCH 10 - PROGRESS: at 42.72% examples, 745088 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:08:13,718 : INFO : EPOCH 10 - PROGRESS: at 45.64% examples, 746839 words/s, in_qsize 19, out_qsize 0
2021-03-14 14:08:14,733 : INFO : EPOCH 10 - PROGRESS: at 48.41% examples, 749651

(303493583, 415193550)

## skip-gram and CBOW models

* The word2vec algorithms implemented in `gensim` include **skip-gram** and **CBOW** models

``class gensim.models.word2vec.Word2Vec(... sg=0 ...)``
* `sg` ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW  (default).

## How to obtain word vectors?

* Access the model's `wv` property, which holds the standalone keyed vectors.

In [24]:
word_vectors = model.wv

* Persist the word vectors to disk with

In [25]:
from gensim.models import KeyedVectors

In [26]:
word_vectors.save('data/vectors.kv')

2021-03-14 22:36:03,265 : INFO : saving Word2VecKeyedVectors object under data/vectors.kv, separately None
2021-03-14 22:36:03,325 : INFO : storing np array 'vectors' to data/vectors.kv.vectors.npy
2021-03-14 22:36:03,556 : INFO : not storing attribute vectors_norm
2021-03-14 22:36:08,240 : INFO : saved data/vectors.kv


In [14]:
reloaded_word_vectors = KeyedVectors.load('data/vectors.kv')

2021-03-14 14:12:16,282 : INFO : loading Word2VecKeyedVectors object from data/vectors.kv
2021-03-14 14:12:16,433 : INFO : loading vectors from data/vectors.kv.vectors.npy with mmap=None
2021-03-14 14:12:16,517 : INFO : setting ignored attribute vectors_norm to None
2021-03-14 14:12:16,518 : INFO : loaded data/vectors.kv


## Get the key’s vector, as a 1D numpy array

* `get_vector(key, norm=False)` returns the **vector** for the specified key.

In [20]:
vector = word_vectors['computer']  # numpy vector of a word
print(vector.shape)
print(vector)

(150,)
[ 2.5194955   3.98519     0.13134773  3.4767156   2.2527318   2.0601292
 -0.210349    3.2553098   1.1068059  -2.2040725  -2.3085032  -0.9839303
  0.29481873 -0.8248244  -2.3532808   1.5908425  -0.13116792  0.09025211
 -4.3067923   1.8525307  -0.73047614  2.0734007  -1.7344201   2.9230053
 -1.024135    2.9512484  -2.1600788  -3.7378323   1.5650132  -3.299354
 -0.17427887  1.0837468  -2.8641825   0.02351804  1.1271863   0.90298986
  0.43268943  2.5099602   0.9015156  -1.3839414  -0.7269674   0.58083653
  1.4954923  -1.045981   -3.6593657  -0.09399015 -6.0203776  -1.752114
  2.6161103   0.54835    -0.62417734  3.080784    1.0163897   0.09957929
  0.1295551  -1.0796468   1.1874051  -0.8069812  -0.5943146   1.0086601
 -1.3204163   0.21888547  2.7572293  -2.295222    0.5467379  -3.0543237
  3.6519177  -0.27496272 -5.3813467   2.1258466   1.5814056   5.18143
 -2.9958913  -0.69019824  2.2060552   3.7228682   0.880785   -1.5059057
  4.9941382  -0.57638025 -0.5742147   0.762392   -0.68570

## Now, let's look at some output 
* A simple case: looking up **words similar to** the word `italy`. 
* All we need to do here is call the `most_similar` function and provide the word `italy` as the positive example. 
* By default this returns the top 10 similar words. 

In [27]:

w1 = "italy"
model.wv.most_similar (positive=w1)


2021-03-14 22:55:06,457 : INFO : precomputing L2-norms of word weight vectors


[('venice', 0.6811449527740479),
 ('chinatown', 0.5627354383468628),
 ('korea', 0.5476475954055786),
 ('village', 0.538396418094635),
 ('nolita', 0.5254664421081543),
 ('brazil', 0.51933354139328),
 ('greenwich', 0.471917062997818),
 ('mott', 0.46979326009750366),
 ('soho', 0.46181896328926086),
 ('jewelbox', 0.4516487419605255)]

**Another example**

Let's look at similarity for `polite`, `france` and `shocked`. 

In [50]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar (positive=w1,topn=6)


[('courteous', 0.9174547791481018),
 ('friendly', 0.8309274911880493),
 ('cordial', 0.7990915179252625),
 ('professional', 0.7945970892906189),
 ('attentive', 0.7732747197151184),
 ('gracious', 0.7469891309738159)]

In [53]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar (positive=w1,topn=6)


[('canada', 0.6603403091430664),
 ('germany', 0.6510637998580933),
 ('spain', 0.6431018114089966),
 ('barcelona', 0.61174076795578),
 ('mexico', 0.6070996522903442),
 ('rome', 0.6065913438796997)]

In [54]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar (positive=w1,topn=6)


[('horrified', 0.80775386095047),
 ('amazed', 0.7797470092773438),
 ('astonished', 0.7748459577560425),
 ('dismayed', 0.7680633068084717),
 ('stunned', 0.7603034973144531),
 ('appalled', 0.7466776371002197)]

You can even 
* specify **several positive examples** to get things that are related in the provided context 
* provide **negative examples** to say what should not be considered as related. 

In the example below we are asking for all items that **relate to bed** only:

In [55]:
# get everything related to stuff on the bed
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar (positive=w1,negative=w2,topn=10)


[('duvet', 0.7086508274078369),
 ('blanket', 0.7016597390174866),
 ('mattress', 0.7002605199813843),
 ('quilt', 0.6868821978569031),
 ('matress', 0.6777950525283813),
 ('pillowcase', 0.6413239240646362),
 ('sheets', 0.6382123827934265),
 ('foam', 0.6322235465049744),
 ('pillows', 0.6320573687553406),
 ('comforter', 0.5972476601600647)]

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the **similarity between two words** that are present in the vocabulary. 

In [57]:
# similarity between two different words
model.wv.similarity(w1="dirty",w2="smelly")

0.76181122646029453

In [58]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

1.0000000000000002

In [59]:
# similarity between two unrelated words
model.wv.similarity(w1="dirty",w2="clean")

0.25355593501920781

**Under the hood**

The three snippets above compute the **cosine similarity** between the two specified words using the word vector of each. 
* From the scores, it makes sense that `dirty` is highly similar to `smelly` but 
* `dirty` is much less similar to `clean`. 
* If you compute the **similarity between two identical words**, the score is 1.0 (up to floating-point rounding, hence `1.0000000000000002`), which is the maximum possible value: cosine similarity always lies in the range [-1.0, 1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
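
Cosine similarity itself is easy to compute by hand. The following sketch uses only the standard library and made-up vectors to show the extreme cases of its range, from -1 (opposite direction) to 1 (same direction). The function name `cosine_similarity` is chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
same = cosine_similarity(u, u)                          # same direction: close to 1.0
opposite = cosine_similarity(u, [-1.0, -2.0, -3.0])     # opposite direction: close to -1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal: 0.0

print(same, opposite, orthogonal)
```

Because of floating-point rounding, the "identical" case may print a value that differs from 1.0 in the last digit, as in the output above.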

### Find odd items
You can even use Word2Vec to find odd items *given a list of items*.

In [63]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

'france'

In [77]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])


'shower'
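
One plausible way to find the odd item is to average the (length-normalized) vectors of all words in the list and return the word that is least similar to that average. The sketch below shows this idea with hand-made vectors; it is an illustration of the approach, not a copy of gensim's code, and `odd_one_out` is a hypothetical helper.

```python
import numpy as np

# Hand-made vectors for illustration only; a trained model would supply real ones.
vectors = {
    "bed":    np.array([0.9, 0.1]),
    "pillow": np.array([0.8, 0.2]),
    "duvet":  np.array([0.85, 0.15]),
    "shower": np.array([0.1, 0.9]),
}

def odd_one_out(words):
    """Return the word least similar (by cosine) to the mean of the unit vectors."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    scores = unit @ mean          # proportional to cosine similarity with the mean
    return words[int(np.argmin(scores))]

print(odd_one_out(["bed", "pillow", "duvet", "shower"]))  # shower
```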

## Understanding some of the parameters

```
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10)
```


### `size`
The size of the **dense vector** used to `represent each token` or word. 
* If you have **very limited data**, then size should be a much **smaller** value. 
* If you have **lots of data**, it's good to experiment with **various sizes**. 
* In gensim 4.0 and later, this parameter is named `vector_size`.

### `window`
The maximum distance between the target word and a neighboring word. 
* Words farther from the target than the window width, to the left or to the right, are not treated as its context. 
* In theory, a smaller window should give you terms that are more closely related. 
* If you have **lots of data**, then the window size should not matter too much, as long as it's a decent-sized window. 

### `min_count`
Minimum frequency count of words. 
* The model **ignores words that do not satisfy the** `min_count`. 
* Extremely **infrequent words are** usually **unimportant**, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.
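
The effect of `min_count` on the vocabulary can be sketched with a simple frequency count. The toy corpus below is made up for illustration, and the threshold of 2 matches the `min_count=2` used in this tutorial.

```python
from collections import Counter

# Toy tokenized corpus (illustrative only).
documents = [
    ["nice", "hotel", "clean", "room"],
    ["dirty", "room", "nice", "staff"],
    ["hotel", "staff", "friendly"],
]
min_count = 2

# Count every token across all documents, then keep only the frequent ones.
counts = Counter(token for doc in documents for token in doc)
vocabulary = sorted(word for word, n in counts.items() if n >= min_count)

print(vocabulary)  # ['hotel', 'nice', 'room', 'staff']
```

Words such as `clean`, `dirty` and `friendly` appear only once in this corpus, so they would get no vector at all.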

### `workers`
The number of worker threads used for training.


## When should you use Word2Vec?

There are many application scenarios for Word2Vec. 
* Imagine you need to **build a sentiment lexicon**. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You get a lexicon not just for sentiment, but for most words in the vocabulary. 
* You could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could **find tags that are related to a given tag** and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work. 


## Context Problem

* Word2Vec models generate embeddings that are **context-independent**, i.e. there is just `one vector` (numeric) representation for each word.
* **Different senses** of the word (if any) are combined into **one** single vector.

* **Example**: the Word2Vec embedding for the word "`bank`" is a blurred representation, because it collapses different contexts into a single vector.
<img src="figures/different-contexts.png" width="60%">

* Both sentences below map `bank` to the same vector:
  * > We went to the river `bank`
  * > I need to go to `bank` to make a deposit
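
The limitation can be shown with a toy lookup table. The vectors below are made up and the `embed` helper is hypothetical; the point is only that a static embedding ignores the sentence in which the word occurs.

```python
# Made-up 3-d vectors standing in for a trained embedding table.
embedding_table = {
    "bank":  [0.2, -0.7, 0.5],
    "river": [0.9,  0.1, 0.0],
    "money": [-0.3, 0.8, 0.4],
}

def embed(sentence, word):
    """Static lookup: the sentence argument is ignored entirely."""
    return embedding_table[word]

v_river = embed("We went to the river bank", "bank")
v_money = embed("I need to go to bank to make a deposit", "bank")

print(v_river == v_money)  # True
```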


In [21]:
vector = word_vectors['bank']  # numpy vector of a word
print(vector.shape)
print(vector)

(150,)
[ 2.5194955   3.98519     0.13134773  3.4767156   2.2527318   2.0601292
 -0.210349    3.2553098   1.1068059  -2.2040725  -2.3085032  -0.9839303
  0.29481873 -0.8248244  -2.3532808   1.5908425  -0.13116792  0.09025211
 -4.3067923   1.8525307  -0.73047614  2.0734007  -1.7344201   2.9230053
 -1.024135    2.9512484  -2.1600788  -3.7378323   1.5650132  -3.299354
 -0.17427887  1.0837468  -2.8641825   0.02351804  1.1271863   0.90298986
  0.43268943  2.5099602   0.9015156  -1.3839414  -0.7269674   0.58083653
  1.4954923  -1.045981   -3.6593657  -0.09399015 -6.0203776  -1.752114
  2.6161103   0.54835    -0.62417734  3.080784    1.0163897   0.09957929
  0.1295551  -1.0796468   1.1874051  -0.8069812  -0.5943146   1.0086601
 -1.3204163   0.21888547  2.7572293  -2.295222    0.5467379  -3.0543237
  3.6519177  -0.27496272 -5.3813467   2.1258466   1.5814056   5.18143
 -2.9958913  -0.69019824  2.2060552   3.7228682   0.880785   -1.5059057
  4.9941382  -0.57638025 -0.5742147   0.762392   -0.68570

## Contextual embeddings using BERT

* BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based machine learning technique for natural language **processing** pre-training, developed by Google.
* The BERT model explicitly takes the **position** (index) of each word in the sentence as **input** before calculating its embedding.

* **Example**: The BERT embedding will be able to distinguish and capture the two different semantic meanings by producing two different vectors for the same word "`bank`".
<img src="figures/different-contexts.png" width="60%">