In [1]:
from fastbook import *

# Word Embeddings in NLP

## Resources

**From Ruder**

 - [Ruder - On Word Embeddings - Part 1](https://www.ruder.io/word-embeddings-1/),  [📃PDF](./pdfs/On%20word%20embeddings%20-%20Part%201.pdf)
 - [Ruder - On Word Embeddings - Part 2]( ), [📃PDF](./pdfs/On%20word%20embeddings%20-%20Part%202%20-%20Approximating%20the%20Softmax%20for%20Learning%20WordEmbeddings.pdf)
 - [Ruder - On Word embeddings - Part 3 - The secret ingredients of word2vec](https://www.ruder.io/secret-word2vec/). [📃PDF](./pdfs/On%20word%20embeddings%20-%20Part%203%20-%20The%20secret%20ingredients%20of%20word2vec.pdf) 

**Other**
 - [📽️ Embeddings for everything](https://www.youtube.com/watch?v=JGHVJXP9NHw)
 - [📽️ A complete overview of word embeddings](https://www.youtube.com/watch?v=5MaWmXwxFNQ)

**Even older history**
  - [Brown clustering](https://en.wikipedia.org/wiki/Brown_clustering)
  - [Latent semantic analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis)

# Motivations

Why do we even need to convert text to numbers and deal with embeddings?

## Text to numbers

Machine learning models all use numerical representations. The underlying `regression` and `loss` function logic operates entirely on numbers. To put such ML models to use for NLP, the text has to be converted to a numerical representation. This representational need is what eventually motivated embeddings.

**Per Ruder**: _Word embeddings refer to dense representation of words in a low-dimensional vector space_. He particularly focuses on _neural word embeddings_, i.e., word embeddings learned by a neural network.

## Search space

This came onto the scene later but is a crucial motivator of advanced techniques.
 - In the target embedding space, we can target semantic similarity, i.e., vectors close to each other are semantically similar.
 - Current techniques allow embedding words, sentences, or entire documents into a semantically meaningful embedding space.
 - Research techniques also allow embedding mixed media (text, images, videos, etc.) into the same target space, so you can locate the text, images, and videos that are semantically similar.

This video by a Google researcher at Berkeley, [Embeddings for everything: Search in the neural network era](https://www.youtube.com/watch?v=JGHVJXP9NHw), shows how they approach training embeddings for such a search space. There is no magic here, simply training that combines multiple encoding schemes to achieve such a mixed-media search. The generalization of this to a search problem is pretty powerful.
 - Dual and Multi encoder models
 - [Recipe2vec - How word2vec helps us discover related tasty recipes](https://www.youtube.com/watch?v=RTyHP_PiX9M)
   - Uses word2vec with the skip-gram technique. They experimented and ended up with a 70-dimensional starting space, reduced to 2 dimensions?
   - t-SNE for dimensionality reduction (PCA is another option). Hyperparameters matter a lot.
   - Building a recipe-vec from the word-vecs of its ingredients. _Lots of scope here for stages, cooking temps, styles etc. as well_.
   - Testing and prototyping steps
   - Productionizing things
   - Using cosine similarity to get similar recipes from the database
   - They retrain their model every 12 hours.
   - Lots of good questions about the actual modeling (_this always needs careful thought_) and about how they measured success rates in the product.

## Neural embeddings

 If you look back at encoding in general, even Huffman coding, JPEG blocks, and maybe encryption schemes are all ways of using optimized numerical representations. What `neural embedding`, or more appropriately `neural embedding search`, aims to do is search for a good embedding. This is done via the standard back-propagation approach: an optimization problem. The design of the cost function uses measures such as:

 - Cross entropy
 - Cosine similarity
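
As a rough illustration of the first measure, here is a minimal sketch of a softmax followed by a cross-entropy loss, the kind of cost an embedding model could minimize when it predicts a nearby word. The scores and the three-word setup are made up for illustration and are not taken from any of the resources above.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Negative log-probability assigned to the correct word."""
    probs = softmax(scores)
    return -math.log(probs[target_index])

# Hypothetical scores over a 3-word vocabulary; the correct word is index 0.
confident_and_right = cross_entropy([4.0, 0.5, 0.1], target_index=0)
confident_and_wrong = cross_entropy([0.1, 0.5, 4.0], target_index=0)

print(confident_and_right)  # small loss
print(confident_and_wrong)  # much larger loss
```

Backprop nudges the parameters (including the embedding weights) in whatever direction lowers this loss.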

## What high dimensional vector space ?

Quickly found a resource that explains what this _high dimensional_ space is in the first place:
 
 - [Youtube - A brief history of word embeddings](https://www.youtube.com/watch?v=5MaWmXwxFNQ)
 
## History - one hot encoding

> Each word has a representation

This is where it all starts: a simple, naive mapping from text to a numerical representation. The idea is to establish a vocabulary of the text we are training on, i.e., the set of words used in the text. Given that vocabulary, each word is represented as a one-hot vector (_i.e., just one bit is hot: 1, and the rest are 0_).

$
\begin{align}
Vocab &= \begin{vmatrix} I & You & Bag & Apple & Cat & Dog \end{vmatrix} \\
Rep_{bag} &= \begin{vmatrix}0 & 0 & 1 & 0 & 0 & 0 \end{vmatrix}
\end{align}
$  

 - The vector is as long as the vocabulary. $\begin{vmatrix}V\end{vmatrix} = n$, say.
 - Each dimension of the vector is one unique word in the vocabulary. _You can see how a large vocabulary creates the `high dimensional` space, a problem which we will later have to solve._
 - One word is encoded per vector: each word representation is a vector `n` long with a `1` on its own word axis and `0` everywhere else.

The location of the hot bit is an index into the vocabulary. _Not sure why an approach using a simple numerical index into the vocabulary was discarded. Some constraint related to all inputs having to be the same sized tensor?_

If the size of this vocabulary is `n`, then we can think of each one-hot representation as belonging to an `n dimensional` space with each word defining an axis. Hence the _high dimensional vector space_ terminology.
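
A minimal sketch of one-hot encoding in plain Python, using the toy vocabulary from the example above. The helper name `one_hot` is made up for illustration.

```python
# Toy vocabulary from the example; list order defines the axes of the space.
vocab = ["I", "You", "Bag", "Apple", "Cat", "Dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of len(vocab) zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

bag = one_hot("Bag")
print(bag)           # [0, 0, 1, 0, 0, 0]
print(bag.index(1))  # 2, i.e. the position of "Bag" in the vocabulary
```

Every vector has the same length `n = len(vocab)`, no matter which word it encodes.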

This introduces some complications because of the high dimensionality, i.e., the size of the vocabulary.
   - Large memory use, for one.
   - Inefficient computation, as you are mostly multiplying zeros.
   - Sparse vectors (_just the one bit in a sea of zeros, an inefficient use of memory space_)
 
## History - Word counting approaches 

> Try to pack an entire sentence into one representation

### Bag of words

 - Each word in the vocabulary is assigned a dimension in the vector space of size `n`
 - Ignores the order in which words occur; just counts the number of times each word occurs in a sentence
 - A sentence representation is a vector of size `n` with a non-zero entry for each word that is in the sentence.
 
For instance, the representation of _He arrives precisely when he means to_ in a contrived vocabulary is as follows. This is a multi-hot vector where the actual count of each word is listed.
 
 $
 \begin{align}
 Vocab &= \begin{vmatrix} A & He & Is & When & Never & to & Means ...\end{vmatrix} \\
 rep &= \begin{vmatrix} 0 & 2 & 0 & 1 & 0 & 1 & 1 & ... \end{vmatrix}
 \end{align}
 $
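
The same count vector can be reproduced with a short sketch. The lower-cased vocabulary and the whitespace tokenizer are simplifying assumptions; words outside the vocabulary are simply dropped.

```python
from collections import Counter

# Contrived vocabulary from the example, lower-cased for matching.
vocab = ["a", "he", "is", "when", "never", "to", "means"]

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

rep = bag_of_words("He arrives precisely when he means to", vocab)
print(rep)  # [0, 2, 0, 1, 0, 1, 1]
```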
 
 ### n-gram 
 
 > 2-gram in this case.
 > Similar to the bag-of-words approach, but takes two words at a time with one word overlapping.

For instance, the representation of _He arrives precisely when he means to_ in a contrived vocabulary is as follows. 
 - _Note that `never late` ends the previous sentence, hence there is no overlap into the next 2-gram_. 
 - Only _he arrives_ has an encoding in the 2-gram vocab.
 
 $
 \begin{align}
 Vocab &= \begin{vmatrix} A \ wizard & wizard\ is & is\ never& never\ late &  he\ arrives & ...\end{vmatrix} \\
 rep &= \begin{vmatrix} 0 & 0 & 0 & 0 & 1 & ... \end{vmatrix}
 \end{align}
 $
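
A sketch of the 2-gram version under the same assumptions (lower-cased text, whitespace tokens). The helper name `ngrams` is made up for illustration.

```python
def ngrams(tokens, n=2):
    """Slide a window of n tokens over the list, moving one token at a time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Contrived 2-gram vocabulary from the example, lower-cased.
vocab = ["a wizard", "wizard is", "is never", "never late", "he arrives"]

tokens = "he arrives precisely when he means to".split()
bigrams = ngrams(tokens)
rep = [bigrams.count(gram) for gram in vocab]

print(bigrams)  # ['he arrives', 'arrives precisely', 'precisely when', 'when he', 'he means', 'means to']
print(rep)      # [0, 0, 0, 0, 1]
```

Note how quickly the vocabulary grows: every distinct pair of adjacent words needs its own dimension.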

# What are we embedding into ?

Given the above approaches, ranging from _one hot_ to _n-grams_, we are starting out with a large vocabulary and hence a large vector space for the representation.

We want to **embed** these vectors from such a large space into a *lower dimensional vector space*. This embedding has multiple goals:
 - We achieve a dense representation which is space/memory efficient, i.e., vector components are not mostly zero.
 - We find a lower-dimensional space to embed in, i.e., the dimensionality of the embedding space is less than the size of the vocabulary.
 - Vector distance in the embedded space reflects the _similarity_ of words.
 

## What does similarity mean ?

 Similarity in the NLP context means
  - `Contextual similarity` - $w_1$ is closer to $w_2$ if they occur close together in multiple texts.
  - `Semantically similar` - $w_1$ is closer to $w_2$ if they mean similar things. E.g., `king` and `queen` are as close as `man` and `woman`.
  
Similarity in a vector space is reflected in vector distances: the closer two vectors are, the more similar the words.
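
Cosine similarity is one common way to measure that closeness. Below is a minimal sketch; the three-dimensional vectors are invented for illustration and are not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance derived from cosine similarity: smaller means more similar."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical toy embeddings.
tea = [0.9, 0.8, 0.1]
coffee = [0.8, 0.9, 0.2]
car = [0.1, 0.0, 0.9]

print(cosine_similarity(tea, coffee))  # close to 1
print(cosine_similarity(tea, car))     # much lower
```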
  
This looks like a design feature as well: certain word embeddings can be designed for contextual similarity of a certain kind. For instance, given word embeddings $V_{queen}$, $V_{king}$, $V_{man}$ & $V_{woman}$, we could expect that

$ V_{queen} \approx V_{king} - V_{man} + V_{woman}$
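
The analogy arithmetic $V_{king} - V_{man} + V_{woman}$ can be tried on hand-made vectors. In this sketch the two axes (roughly "royalty" and "femininity") and all the numbers are invented so that the analogy works by construction, and Euclidean distance is used for simplicity. Real embeddings only satisfy such relations approximately.

```python
import math

# Hypothetical 2-d embeddings: (royalty, femininity).
vectors = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.5, 0.5],
}

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(analogy(positive=["king", "woman"], negative=["man"]))  # queen
```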

# How are the embeddings created

A high level description that doesn't make much sense to me right now is that this is done via neural network training:
 - Words from the vocabulary are the input
 - The first layer is the embedding layer, which embeds these into a lower-dimensional vector space
   - $E_{}
 - The remaining layers are the model itself
 - Through many cycles of backprop on the model itself, the embedding layer is also fine-tuned, and it finally yields the embedding as the weights/parameters of the embedding layer.
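
One way to picture the embedding layer: its weights form a matrix with one row per vocabulary word, and multiplying a one-hot vector by that matrix just selects a row. The sketch below uses a made-up 6 x 2 matrix `E` (in a real model these numbers would be learned by backprop) to show that the matrix product and a plain index lookup give the same result.

```python
vocab = ["I", "You", "Bag", "Apple", "Cat", "Dog"]

# Hypothetical embedding weights: one 2-d row per vocabulary word.
E = [
    [0.1, 0.3],
    [0.2, -0.4],
    [0.7, 0.5],
    [-0.6, 0.9],
    [0.8, -0.1],
    [0.75, -0.2],
]

def embed_by_matmul(one_hot_vec, weights):
    """Multiply a 1 x n one-hot vector by the n x d weight matrix."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[i] * weights[i][j] for i in range(len(weights)))
        for j in range(dim)
    ]

def embed_by_lookup(index, weights):
    """Select the row directly; no multiplication by zeros needed."""
    return weights[index]

idx = vocab.index("Bag")
one_hot_vec = [1 if i == idx else 0 for i in range(len(vocab))]

print(embed_by_matmul(one_hot_vec, E))  # [0.7, 0.5]
print(embed_by_lookup(idx, E))          # [0.7, 0.5]
```

This is also a plausible reason why a bare integer index is enough as model input in practice: the lookup plays the role of the one-hot multiplication.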

There are some details related to how things are trained
 - loss functions
 - words as tokens, but then there is trouble dealing with new unseen words
 - word fragments as tokens (fastText), better at dealing with new unseen words
 - global contexts (GloVe)
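
To make the word-fragment idea concrete, here is a sketch of fastText-style character n-grams: the word is wrapped in `<` and `>` boundary markers and cut into overlapping pieces. The function name and defaults are illustrative assumptions, and the real library does more (for example hashing n-grams into buckets), so treat this as a simplified picture only.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the boundary-marked word."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares fragments with known words.
shared = set(char_ngrams("teabag")) & set(char_ngrams("teabags"))
print(sorted(shared))
```

Because a word vector can be built from the vectors of its fragments, a word never seen in training still gets a reasonable representation.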

There are some videos that talk about how to actually code the embedding layer in PyTorch, update the embedding in the backprop step, and so on. Right now, I'm not sure that adds any value to my hunt for a solution to my problem.
 - TODO [How to train word embeddings using the WikiText2 dataset in pytorch](https://www.youtube.com/watch?v=N-WfUrdgdFw)


# Using word embeddings in 2023

This video, [A complete overview of word embeddings](https://www.youtube.com/watch?v=5MaWmXwxFNQ) has a very valuable section which shows how some pre-trained word embeddings can be used.
 - You can train your own.
   - Needs a lot of data
   - Needs a lot of time and compute resources
 - Use released embeddings as-is
 - Use released embeddings and fine-tune on your own data

> Needs `conda install gensim`. I don't feel like `!pip install --upgrade gensim` as using `pip` blindly has broken jupyter for me in the past.

In [1]:
import gensim.downloader as api
info=api.info()
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + "...",
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai

## Gensim embedding models

The block above shows me a bunch of embedding models:

 - `conceptnet-numberbatch-17-06-300` (1917247 records): ConceptNet Numberbatch consists of state...
 - `fasttext-wiki-news-subwords-300` (999999 records): 1 million word vectors trained on Wikipe...
 - `glove-twitter-100` (1193514 records): Pre-trained vectors based on 2B tweets,...
 - `glove-twitter-200` (1193514 records): Pre-trained vectors based on 2B tweets,...
 - `glove-twitter-25` (1193514 records): Pre-trained vectors based on 2B tweets,...
 - `glove-twitter-50` (1193514 records): Pre-trained vectors based on 2B tweets,...
 - `glove-wiki-gigaword-100` (400000 records): Pre-trained vectors based on Wikipedia 2...
 - `glove-wiki-gigaword-200` (400000 records): Pre-trained vectors based on Wikipedia 2...
 - `glove-wiki-gigaword-300` (400000 records): Pre-trained vectors based on Wikipedia 2...
 - `glove-wiki-gigaword-50` (400000 records): Pre-trained vectors based on Wikipedia 2...
 - `word2vec-google-news-300` (3000000 records): Pre-trained vectors trained on a part of...
 - `word2vec-ruscorpora-300` (184973 records): Word2vec Continuous Skipgram vec

Three groups
 - glove
 - fasttext
 - word2vec

## Using word2vec, glove and fasttext

Using embeddings in general: if you look at TODO, naive embeddings are simply a large table which maps each word to an embedding vector of some fixed length. So I'd think the API is a simple layer on top of a lookup table. The table is the trained magic here.

The three models are downloaded in the cell below. Note that they total roughly 2.8GB, so only download them if you want to play with them. Maybe download the smallest one first to test out the API.


In [9]:
# Very beefy downloads.

# word2vec-google-news-300
# Download size = 1.6GB
# Load time     = 
wv = api.load('word2vec-google-news-300')

# glove-twitter-50
# Download size = 200MB
# Load time     = 25s
glove = api.load('glove-twitter-50')

# fasttext-wiki-news-subwords-300
# Download size = 1GB
# Load time     = 125s
fasttext = api.load('fasttext-wiki-news-subwords-300')

Functions for *glove*. The others are similar. There are a lot of these!

 - `add_lifecycle_event`
 - `add_vector`
 - `add_vectors`
 - `allocate_vecattrs`
 - `closer_than`
 - `cosine_similarities`
 - `distance`
 - `distances`
 - `doesnt_match`
 - `evaluate_word_analogies`
 - `evaluate_word_pairs`
 - `expandos`
 - `fill_norms`
 - `get_index`
 - `get_mean_vector`
 - `get_normed_vectors`
 - `get_vecattr`
 - `get_vector`
 - `has_index_for`
 - `index2entity`
 - `index2word`
 - `index_to_key`
 - `init_sims`
 - `intersect_word2vec_format`
 - `key_to_index`
 - `lifecycle_events`
 - `load`
 - `load_word2vec_format`
 - `log_accuracy`
 - `log_evaluate_word_pairs`
 - `mapfile_path`
 - `most_similar`
 - `most_similar_cosmul`
 - `most_similar_to_given`
 - `n_similarity`
 - `next_index`
 - `norms`
 - `rank`
 - `rank_by_centrality`
 - `relative_cosine_similarity`
 - `resize_vectors`
 - `save`
 - `save_word2vec_format`
 - `set_vecattr`
 - `similar_by_key`
 - `similar_by_vector`
 - `similar_by_word`
 - `similarity`
 - `similarity_unseen_docs`
 - `sort_by_descending_frequency`
 - `unit_normalize_all`
 - `vector_size`
 - `vectors`
 - `vectors_for_all`
 - `vectors_norm`
 - `vocab`
 - `wmdistance`
 - `word_vec`
 - `words_closer_than`

### Most similar to a given word

Returns the words that are semantically most similar (closest in the embedding space) to the given word, sorted by similarity. For `tea`, fastText seems to give the best results, followed by GloVe, with word2vec looking the least precise.

In [58]:
from IPython.display import Markdown, display

# Print results out as markdown.
from string import Template
mdT = Template("""
 Similar Word | Similarity Metric
  --          | --
  $lines
""")

def printSimilarityTable(ret):    
    display(Markdown(
        mdT.substitute(
            lines = "\n".join(map( lambda x: f"{x[0]} | {x[1]}", ret))
      )
    ))        

display(Markdown("---"))
display(Markdown("#### Similarity from word2vec"))
printSimilarityTable(wv.most_similar("tea"))

display(Markdown("---"))
display(Markdown("#### Similarity from fasttext"))
printSimilarityTable(fasttext.most_similar("tea"))

display(Markdown("---"))
display(Markdown("#### Similarity from glove"))
printSimilarityTable(glove.most_similar("tea"))

---

#### Similarity from word2vec


 Similar Word | Similarity Metric
  --          | --
  Tea | 0.7009037137031555
teas | 0.6727380156517029
shape_Angius | 0.6323482394218445
activist_Jamie_Radtke | 0.5863860845565796
decaffeinated_brew | 0.5839535593986511
planter_bungalow | 0.575829029083252
herbal_tea | 0.5731174349784851
coffee | 0.5635291337966919
jasmine_tea | 0.548339307308197
Tea_NASDAQ_PEET | 0.5402543544769287


---

#### Similarity from fasttext


 Similar Word | Similarity Metric
  --          | --
  tea- | 0.7728265523910522
coffee | 0.7583760619163513
teas | 0.731768786907196
cuppa | 0.7301388382911682
teabags | 0.6973742246627808
Tea | 0.6826096773147583
tea-drinking | 0.6748528480529785
teabag | 0.6707128882408142
tea-making | 0.6683591604232788
tea-bags | 0.6638833284378052


---

#### Similarity from glove


 Similar Word | Similarity Metric
  --          | --
  coffee | 0.8929038643836975
milk | 0.8667818903923035
wine | 0.8507667183876038
cream | 0.8502466678619385
ice | 0.8362609148025513
juice | 0.8177550435066223
beer | 0.8157102465629578
sugar | 0.8099128007888794
cake | 0.8080540895462036
drink | 0.8000376224517822


In [68]:
display(Markdown("Distance between `tea` and `coffee`"))
display(Markdown(f' - Word2vec : {wv.distance("tea", "coffee")} '))
display(Markdown(f' - Glove : {glove.distance("tea", "coffee")} '))
display(Markdown(f' - Fasttext : {fasttext.distance("tea", "coffee")} '))

Distance between `tea` and `coffee`

 - Word2vec : 0.43647074699401855 

 - Glove : 0.10709607601165771 

 - Fasttext : 0.2416238784790039 
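
These distances line up with the similarity tables above: gensim's `distance` is one minus the cosine similarity. A quick check using only the numbers already printed in this notebook (the similarity of `coffee` in each `most_similar("tea")` table and the distances just above), with a small tolerance for float32 rounding:

```python
import math

# (similarity of "coffee" to "tea", distance between "tea" and "coffee"),
# copied from the outputs shown in this notebook.
reported = {
    "word2vec": (0.5635291337966919, 0.43647074699401855),
    "glove": (0.8929038643836975, 0.10709607601165771),
    "fasttext": (0.7583760619163513, 0.2416238784790039),
}

results = {}
for name, (similarity, distance) in reported.items():
    # Allow for float32 rounding in the stored vectors.
    results[name] = math.isclose(1.0 - similarity, distance, abs_tol=1e-6)
    print(f"{name}: 1 - similarity = {1.0 - similarity:.6f}, distance = {distance:.6f}")
```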

### Vector math in embedded space (analogy making)

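This is essentially doing $King - Man + Woman$ and looking up the nearest words. Note that `most_similar_cosmul` combines the similarities multiplicatively rather than literally adding and subtracting vectors; `most_similar` accepts the same `positive`/`negative` arguments for the additive version.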

In [70]:
wv.most_similar_cosmul(positive=['king', 'woman'], negative=['man'])

[('queen', 0.9314123392105103),
 ('monarch', 0.858533501625061),
 ('princess', 0.8476566672325134),
 ('Queen_Consort', 0.8150269389152527),
 ('queens', 0.8099815249443054),
 ('crown_prince', 0.8089976906776428),
 ('royal_palace', 0.8027306795120239),
 ('monarchy', 0.8019613027572632),
 ('prince', 0.800979733467102),
 ('empress', 0.7958388328552246)]