## Pretrained embeddings

Word embeddings are generally computed from word-occurrence statistics (observations about which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.
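
To make "word co-occurrence statistics" concrete, here is a minimal sketch that counts how often two words appear near each other in a toy corpus. The function name `cooccurrence_counts`, the window size, and the two sentences are illustrative assumptions; this shows only the counting step, not GloVe's actual training procedure.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only to the right, then record the pair in both directions
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

# Hypothetical toy corpus, not taken from the GloVe training data
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("sat", "on")])
```

Words that share many neighbours ("cat" and "dog" here) end up with similar rows in such a table, which is the signal an embedding method compresses into dense vectors.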

One of the most widely used sets of pretrained word embeddings is GloVe, which can be downloaded from https://nlp.stanford.edu/projects/glove/.

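The file used here, glove.6B.zip, is an 822 MB archive of embeddings precomputed in 2014 from English Wikipedia (together with the Gigaword corpus). It covers 400,000 words (or non-word tokens) and ships the vectors in several sizes (50, 100, 200, and 300 dimensions). The cells below use the 50-dimensional file, glove.6B.50d.txt.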

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!mkdir glove
!unzip glove.6B.zip -d glove/

In [None]:
!head -20 /content/glove/glove.6B.50d.txt
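
Each line of the file printed above holds a token followed by its space-separated vector components, with no header line. As a rough sketch of how such a file could be read without any extra library, the helpers below (`parse_glove_line` and `load_glove`, both hypothetical names) turn lines into a token-to-vector dictionary. The sample lines are made up and only 3-dimensional, so the snippet runs without the download.

```python
def parse_glove_line(line):
    """Split one line of a GloVe text file into (token, vector)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]

def load_glove(lines):
    """Build a token -> vector dict from an iterable of lines."""
    embeddings = {}
    for line in lines:
        if line.strip():  # skip blank lines
            token, vector = parse_glove_line(line)
            embeddings[token] = vector
    return embeddings

# Made-up 3-dimensional sample in the same layout (the real file has 50 values per line)
sample_lines = [
    "the 0.418 0.24968 -0.41242\n",
    "cat 0.45281 -0.50108 -0.53714\n",
]
embeddings = load_glove(sample_lines)
print(len(embeddings), len(embeddings["cat"]))

# With the real file (path taken from the cell above):
# with open("/content/glove/glove.6B.50d.txt", encoding="utf-8") as f:
#     embeddings = load_glove(f)
```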

### Exploring the embeddings

In [None]:
!pip install -U pandas

In [None]:
!pip install numpy==1.26.4

In [None]:
!pip install gensim

## Note

Restart the session after the installs above, then continue from the next cell. The cells above do not need to be run again.

In [None]:
import pandas as pd
import numpy as np
import gensim

In [None]:
pd.__version__

In [None]:
np.__version__

In [None]:
gensim.__version__

In [None]:
from gensim.models import KeyedVectors

# GloVe text file: it has no header line, so it is loaded with no_header=True
word2vec_output_file = "/content/glove/glove.6B.50d.txt"

In [None]:
pretrained_w2v_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False, no_header=True)

In [None]:
len(pretrained_w2v_model)

In [None]:
pretrained_w2v_model.most_similar('bangalore')

In [None]:
pretrained_w2v_model.most_similar('dhoni')

In [None]:
pretrained_w2v_model.most_similar('google')

In [None]:
pretrained_w2v_model.most_similar('hp')

In [None]:
pretrained_w2v_model.most_similar('wikipedia')
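
The `most_similar` calls above rank words by cosine similarity to the query vector. The sketch below reproduces that idea in plain Python on a tiny hand-made vocabulary. The 3-dimensional vectors and the helper names `cosine_similarity` and `nearest_neighbors` are assumptions for illustration, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made toy vectors, not real GloVe values
toy_vectors = {
    "bangalore": [0.9, 0.1, 0.0],
    "mumbai": [0.8, 0.2, 0.1],
    "google": [0.1, 0.9, 0.2],
    "cricket": [0.0, 0.2, 0.9],
}
print(nearest_neighbors("bangalore", toy_vectors, topn=2))
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing embeddings.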

In [None]:
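def analogy(a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # most_similar ranks words by closeness to (c + b - a)
    result = pretrained_w2v_model.most_similar(positive=[c, b], negative=[a])
    return result[0][0]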

In [None]:
analogy('india', 'indian', 'japan')

In [None]:
analogy('india', 'delhi', 'france')

In [None]:
analogy('india', 'dhoni', 'england')
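
The analogy queries above rely on vector arithmetic: the offset from `a` to `b` is added to `c`, and the nearest remaining word is returned. The sketch below spells this out on hand-made vectors. `toy_analogy`, `normalize`, and the toy values are assumptions for illustration, and the details approximate what gensim does rather than copying it.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_analogy(a, b, c, vectors):
    """Return the word d such that a : b :: c : d, by vector arithmetic."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    # Target direction: b - a + c
    target = normalize([y - x + z for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        score = sum(p * q for p, q in zip(target, normalize(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hand-made toy vectors, not real GloVe values
toy_vectors = {
    "india": [1.0, 0.0, 0.1],
    "delhi": [1.0, 1.0, 0.1],
    "france": [0.0, 0.1, 1.0],
    "paris": [0.0, 1.0, 1.0],
    "banana": [-1.0, -0.5, 0.2],
}
print(toy_analogy("india", "delhi", "france", toy_vectors))
```

Skipping the query words matters: without it, the nearest vector to `b - a + c` is often `b` or `c` itself.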

## Excellent References

For further exploration and better understanding, you can use the following references.

- Glossary of Deep Learning: Word Embedding

    https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca


- wevi: word embedding visual inspector

    https://ronxin.github.io/wevi/


- Learning Word Embedding

    https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html


- On the contribution of neural networks and word embeddings in Natural Language Processing

    https://medium.com/@josecamachocollados/on-the-contribution-of-neural-networks-and-word-embeddings-in-natural-language-processing-c8bb1b85c61c