<a href="https://colab.research.google.com/github/larpig/nlp/blob/main/static_word_embedding_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

**Word Embedding History**

Word embeddings are a type of representation for natural language processing tasks in which words are represented by numeric vectors. These vectors capture the meaning of the words and the relationships between them.

The idea of representing words as numeric vectors dates back to at least the 1960s. The release of word2vec in 2013 was a significant milestone in the history of word embeddings.

Word2vec is a shallow neural network that is trained either to predict a target word from its surrounding context words (CBOW) or to predict the context words from the target word (skip-gram). During training, the model learns the relationships between words and encodes them as numeric vectors, or "word embeddings". These embeddings can then be used as input to other natural language processing tasks, such as text classification or machine translation.

Since the development of word2vec, there have been many other approaches to generating word embeddings. GloVe and fastText, like word2vec, produce static embeddings (one fixed vector per word), while models such as BERT produce contextual embeddings that depend on the surrounding sentence. These approaches have made large, high-quality pretrained embedding models widely available for natural language processing tasks.


**Table of Contents**

* INTRODUCTION
    * Word Embedding History
    * Table of Contents

1. Word2Vec
    1. Training Your Own word2vec Model
    2. Using a Pretrained "word2vec" Model
    3. Word2vec for Recommendation

2. Other Approaches
    1. GloVe
    2. fastText
        1.  Subword Tokenization

* References

* Acknowledgment


# 1. Word2Vec

**Original Paper**: Mikolov et al. (2013)

*Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that*
$$\overrightarrow{\text{king}} \ - \ \overrightarrow{\text{man}} \ + \ \overrightarrow{\text{woman}} \ \approx \ \overrightarrow{\text{queen}}$$
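
The offset technique can be illustrated with a toy example. The two-dimensional vectors below are hand-made for illustration (one axis loosely standing for royalty, the other for gender); they are not taken from a trained model, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made 2-D toy vectors: (royalty, gender). These are NOT learned embeddings.
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1,  1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With trained vectors, gensim exposes the same operation as `most_similar(positive=['king', 'woman'], negative=['man'])` on a `KeyedVectors` object. Whether "queen" comes out on top depends on the corpus and the model.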


**Model Architectures**

1. Continuous Bag-of-Words Model (CBOW)
2. Continuous Skip-gram Model

<p align="center">
  <img src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2018/01/18/sagemaker-word2vec-1.gif" alt="" width="500">
</p>

<center>Image source: Mikolov et al. (2013) </center>

The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.
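
The difference between the two architectures shows up in how training examples are cut from a sentence. The following is a minimal sketch with hypothetical helper names and a fixed window size. The actual word2vec implementation adds refinements such as subsampling of frequent words and negative sampling, which are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center word."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window=2):
    """Yield (context_words, target) examples: CBOW predicts the target from its whole context."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, target

sentence = "the quick brown fox jumps".split()
print(list(skipgram_pairs(sentence, window=1)))
print(list(cbow_examples(sentence, window=1)))
```

Both functions read the same window around each position; only the direction of the prediction changes.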


**Self-supervised: The Fake Task**

<p align="center">
  <img src="http://mccormickml.com/assets/word2vec/training_data.png" alt="" width="600">
</p>

<center>Image source: McCormick, Chris (2016) </center>

**Under the Hood: Continuous Skip-gram Model**

<p align="center">
  <img src="http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png" alt="" width="600">
</p>

<center>Image source: McCormick, Chris (2016) </center>
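
The network in the figure can be sketched as a forward pass in plain Python: the center word selects a row of the input weight matrix (the hidden layer), and a softmax over the output layer gives a probability for every vocabulary word. The vocabulary and weights below are small, hand-picked assumptions for illustration; a trained model would learn them.

```python
import math

vocab = ["the", "quick", "brown", "fox"]
embedding_dim = 2

# Hypothetical, hand-picked weights (a trained model would learn these).
W_in = [            # vocab_size x embedding_dim, one row per word
    [0.1, 0.3],
    [0.4, -0.2],
    [-0.5, 0.2],
    [0.3, 0.1],
]
W_out = [           # embedding_dim x vocab_size
    [0.2, -0.1, 0.4, 0.0],
    [-0.3, 0.5, 0.1, 0.2],
]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_probabilities(center_word):
    """Forward pass: center word -> hidden vector -> probability of each context word."""
    hidden = W_in[vocab.index(center_word)]   # the hidden layer has no activation function
    scores = [
        sum(hidden[k] * W_out[k][j] for k in range(embedding_dim))
        for j in range(len(vocab))
    ]
    return softmax(scores)

probs = context_probabilities("quick")
print(dict(zip(vocab, (round(p, 3) for p in probs))))
```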

**From the neural network to the word embeddings**

<p align="center">
  <img src="https://lilianweng.github.io/posts/2017-10-15-word-embedding/word2vec-skip-gram.png" alt="" width="600">
</p>

<center>Image source: Weng, Lilian (2017) </center>
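
The reason the input weight matrix can be read as an embedding table is that multiplying a one-hot vector by a matrix simply selects one row. A small sketch, with a hypothetical four-word vocabulary and made-up weights:

```python
vocab = ["the", "quick", "brown", "fox"]

# Hypothetical input weight matrix (vocab_size x embedding_dim).
W_in = [
    [0.1, 0.3],
    [0.4, -0.2],
    [-0.5, 0.2],
    [0.3, 0.1],
]

def one_hot(word):
    """Vector with a 1 at the word's vocabulary index and 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def vector_times_matrix(vec, matrix):
    """Row vector times matrix, written out explicitly."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(n_cols)]

via_matmul = vector_times_matrix(one_hot("brown"), W_in)
via_lookup = W_in[vocab.index("brown")]
print(via_matmul, via_lookup)  # both are the row for "brown"
```

After training, each row of this matrix is kept as the embedding of the corresponding word, and the output layer is discarded.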

## 1.1. Training Your Own word2vec Model

Code source: Patel, Dhaval [codebasics] (2021a)

In [None]:
!pip install gensim==3.6.0 --quiet
!pip install nltk==3.7 --quiet

In [None]:
from gensim.models import Word2Vec
import gensim.downloader
import nltk
from nltk.corpus import brown
nltk.download("brown")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
# Use the Brown Corpus as the training corpus
# https://en.wikipedia.org/wiki/Brown_Corpus
corpus = brown.sents()
corpus

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [None]:
# Initialize the model
model = Word2Vec(size=100, window=5, min_count=1, workers=4)

# Build Vocabulary
model.build_vocab(corpus, progress_per=1000)

# Train the Word2Vec Model
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

(4270788, 5805960)

In [None]:
# Get the embedding for a word
vector = model.wv['Government']
vector

array([-1.3682545 ,  0.8700504 ,  0.84637284, -0.28152472, -0.36479753,
        0.62864673, -0.39947757, -0.19395827, -0.04070418, -0.12854913,
       -0.08353873,  0.37000635,  0.3129534 , -0.4955536 ,  0.16299991,
        0.20593488,  0.15088503,  0.51814723,  0.27549544, -0.02539549,
        0.3392709 ,  0.0812735 , -0.8843315 ,  0.8175514 , -0.36355928,
        0.47324288, -0.3648843 ,  0.04472586, -0.5131664 , -0.04934174,
        0.4114774 ,  0.8081736 , -0.4333348 ,  0.33533475,  0.12678465,
        0.26695752,  0.42117417,  0.37199467,  0.44636178, -0.4233144 ,
       -0.2165184 ,  0.73775876, -0.7551475 , -0.82751644, -0.06093928,
        0.12336696, -0.30301565, -0.2139607 , -0.5781787 , -0.4914802 ,
        0.75109476,  0.36604956,  0.35860997, -0.595864  , -0.15172918,
        0.12596309,  0.57722074, -0.29941458,  0.02710133, -0.6827911 ,
        0.28633878,  0.5538602 , -0.34357   , -0.80099106, -0.5399351 ,
        0.43680775, -0.6101016 , -1.0071228 , -0.38077193,  0.38

In [None]:
# Compute the similarity between two words
pairs = [
    ('Government', 'rich'),
    ('Government', 'communism'),
    ('Government', 'math'),
    ('Government', 'people'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, model.wv.similarity(w1, w2)))

'Government'	'rich'	0.89
'Government'	'communism'	0.82
'Government'	'math'	0.56
'Government'	'people'	0.49
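
The `similarity` score printed above is the cosine similarity between the two word vectors. A minimal plain-Python sketch of that measure, using made-up two-dimensional vectors rather than the trained ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because only the direction of the vectors matters, their lengths do not affect the score.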


In [None]:
# Get the most similar words to a given word
model.wv.most_similar("Government")

[('Federal', 0.9730518460273743),
 ('board', 0.9717510342597961),
 ('press', 0.9701008796691895),
 ('strengthening', 0.9700838327407837),
 ('reaction', 0.9695568084716797),
 ('operation', 0.9675626754760742),
 ('Union', 0.9674526453018188),
 ('Constitution', 0.9671463370323181),
 ('Soviet', 0.9661285877227783),
 ('link', 0.9656205177307129)]

## 1.2. Using a Pretrained "word2vec" Model

Code source: Řehůřek, Radim (2022)

In [None]:
# Show all available models in gensim-data
list(gensim.downloader.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [None]:
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')



In [None]:
# Use the downloaded vectors as usual, e.g.:
glove_vectors.most_similar('twitter')

[('facebook', 0.9480051398277283),
 ('tweet', 0.9403422474861145),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104823470115662),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885936141014099),
 ('tweets', 0.8878157734870911),
 ('tl', 0.8778461813926697),
 ('link', 0.877821147441864),
 ('internet', 0.8753897547721863)]

## 1.3. Word2vec for Recommendation

<p align="center">
  <img src="https://cdn-images-1.medium.com/max/1400/1*xbNM_CnEIWQtGbsLmZtE-A.gif" alt="" width="600">
</p>

<center>Image source: Karam, Ramzi (2017) </center>

Suggested resources:
* McCormick, Chris (2018)
* Karam, Ramzi (2017)

# 2. Other Approaches

## 2.1. GloVe

**Original Paper**: Pennington et al. (2014)

GloVe (Global Vectors) is a word embedding model that represents words as numeric vectors in a high-dimensional space. These vectors capture the meaning of the words and the relationships between them, and they can be used as input to a variety of natural language processing tasks.

GloVe was developed by Stanford University researchers in 2014 as an alternative to word2vec. Like word2vec, GloVe represents words as dense vectors, but it uses a different training objective and algorithm: instead of predicting words within a local context window, it fits the word vectors to global word-word co-occurrence counts collected over the whole corpus.
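
For reference, the training objective from Pennington et al. (2014) is a weighted least-squares loss. Here $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2, \qquad f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur. The paper uses $x_{\max} = 100$ and $\alpha = 3/4$.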

One of the main advantages of GloVe is that it learns from corpus-wide co-occurrence statistics rather than from one local context window at a time. The word-word co-occurrence matrix of a large corpus is very sparse (most word pairs never appear together), and GloVe trains only on its non-zero entries, so the training cost depends on the number of observed pairs rather than on the square of the vocabulary size. This lets it scale to very large text corpora.
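
The co-occurrence statistics that GloVe is trained on can be gathered in a single pass over the corpus. The sketch below uses a hypothetical helper and a two-sentence toy corpus; following the paper, a pair of words that are `d` tokens apart contributes `1/d` to the count. Storing only observed pairs in a dictionary is one simple way to keep the matrix sparse.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count symmetric, distance-weighted co-occurrences within a context window."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` tokens; update both directions of the pair.
            for j in range(max(0, i - window), i):
                distance = i - j
                counts[(word, tokens[j])] += 1.0 / distance
                counts[(tokens[j], word)] += 1.0 / distance
    return dict(counts)

# Toy corpus, made up for illustration.
toy_corpus = [
    ["ice", "is", "cold"],
    ["steam", "is", "hot"],
]
counts = cooccurrence_counts(toy_corpus, window=2)
vocab = {w for sentence in toy_corpus for w in sentence}

print(counts[("ice", "is")], counts[("ice", "cold")])
print(f"{len(counts)} non-zero entries out of {len(vocab) ** 2} possible pairs")
```

Pairs such as `("ice", "hot")` never occur and are simply absent from the dictionary.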

GloVe has been widely used in natural language processing tasks; the original paper reported state-of-the-art results at the time on word analogy, word similarity, and named entity recognition benchmarks. Pretrained GloVe vectors are also available through open-source libraries (for example via gensim-data, as used in section 1.2), making them easy to use in a variety of applications. However, like all machine learning models, it has some limitations and disadvantages:

1. GloVe requires a large amount of training data to learn effective word embeddings. This can be a disadvantage if you do not have access to a large enough corpus of text to train the model.

2. GloVe is computationally intensive to train. The co-occurrence counts must be collected over the whole corpus before optimization starts, and this can be a disadvantage if you do not have access to sufficient memory and computational resources.

3. **GloVe is a "static" word embedding model, which means that the word embeddings are fixed after the model is trained and do not change based on the context in which the words appear.** For example, "bank" receives the same vector in "river bank" and in "bank account". This can be a disadvantage in tasks that require context-sensitive word embeddings, such as language modeling or machine translation.

4. *GloVe is a log-bilinear model, which means that it fits the dot product of a word vector and a context vector to the logarithm of the pair's co-occurrence count.* This can be a disadvantage when the relationships between words are too complex to be captured by dot products of single fixed vectors; deeper, non-linear models such as BERT can represent them more flexibly.

## 2.2. fastText

**Original Paper**: Bojanowski et al. (2016)

"As the name suggests, fastText is a fast-to-train word representation based on the Word2Vec skip-gram model, that can be trained on more than one billion words in less than ten minutes using a standard multicore CPU.

fastText can address limitations 3 [Word2Vec cannot understand out-of-vocabulary (OOV) words, i.e. words not present in training data. You could assign a UNK token which is used for all OOV words or you could use other models that are robust to OOV words.] and 4 [By assigning a distinct vector to each word, Word2Vec ignores the morphology of words. For example, eat, eats, and eaten are considered independently different words by Word2Vec, but they come from the same root: eat, which might contain useful information.] (...) 

The model learns word representations while also taking into account morphology, which is captured by considering subword units (character n-grams)." (Uzila, 2012)
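
The character n-grams mentioned in the quote can be sketched as follows. The helper name is hypothetical; the boundary markers `<` and `>` and the default n-gram range of 3 to 6 follow Bojanowski et al. (2016). In fastText, a word's vector is built from the vectors of its n-grams (together with a vector for the whole word), which is why an unseen word can still receive a vector.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Morphologically related words share n-grams, and therefore part of their representation.
shared = set(char_ngrams("eats")) & set(char_ngrams("eaten"))
print(sorted(shared))
```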

### 2.2.1. Subword Tokenization

Two of the best-known techniques:
* Byte Pair Encoding (used in GPT, for example)
* WordPiece (used in BERT, for example)
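
As an illustration of the first technique, the sketch below learns Byte Pair Encoding merges on a toy word-frequency table: it repeatedly finds the most frequent pair of adjacent symbols and merges it into a new symbol. The function names and the word counts are made up for illustration, and production tokenizers add details (such as pre-tokenization and end-of-word or byte-level handling) that are omitted here.

```python
from collections import Counter

def pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    word_freqs = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Hypothetical word frequencies.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(list(segmented))
```

On this toy table the frequent suffix "est" becomes a single symbol after two merges, so "newest" and "widest" end up sharing a subword unit.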

Suggested resources:
* Krohn, Jon (2022)
* Huggingface (2021a)
* Huggingface (2021b)


# References


---
## Papers

* Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey. 2013. Efficient Estimation of Word Representations in Vector Space, arXiv, <https://arxiv.org/abs/1301.3781>


* Pennington, Jeffrey and Socher, Richard and Manning, Christopher. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. <https://aclanthology.org/D14-1162/>



* Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas. 2016. Enriching Word Vectors with Subword Information, arXiv, <https://arxiv.org/abs/1607.04606>

---
## Articles in Magazines/Sites

* Karam, Ramzi (2017). Using Word2vec for Music Recommendations. Towards Data Science, <https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484>


* Uzila, Albers (2012). GloVe and fastText Clearly Explained: Extracting Features from Text Data. Level Up Coding, <https://levelup.gitconnected.com/glove-and-fasttext-clearly-explained-extracting-features-from-text-data-1d227ab017b2>

---
## Web Pages

* McCormick, Chris (2016). Word2Vec Tutorial - The Skip-Gram Model, accessed December 2022, <http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/>

* McCormick, Chris (2018). Applying word2vec to Recommenders and Advertising, accessed January 2023, <http://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/>


* Řehůřek, Radim (2022). Word2vec embeddings, accessed December 2022, <https://radimrehurek.com/gensim/models/word2vec.html>


* Weng, Lilian (2017). Learning Word Embedding, accessed January 2023, <https://lilianweng.github.io/posts/2017-10-15-word-embedding/>


---
## Videos and Podcasts

* Patel, Dhaval [codebasics] (2021a). Word2Vec Part 2 | Implement word2vec in gensim | | Deep Learning Tutorial 42 with Python. YouTube, accessed December 2022, <https://www.youtube.com/watch?v=Q2NtCcqmIww&t=0s>

* Krohn, Jon (2022). SDS 626: Subword Tokenization with Byte-Pair Encoding. Super Data Science Podcast, accessed December 2022, <https://www.superdatascience.com/podcast/subword-tokenization-with-byte-pair-encoding>

* Huggingface (2021a). Byte Pair Encoding Tokenization. YouTube, accessed December 2022, <https://www.youtube.com/watch?v=HEikzVL-lZU&t=1s> (and <https://huggingface.co/course/chapter6/5>)

* Huggingface (2021b). WordPiece Tokenization. YouTube, accessed December 2022, <https://www.youtube.com/watch?v=qpv6ms_t_1A> (and <https://huggingface.co/course/chapter6/6>)

# Acknowledgment

Parts of the notebook text were generated with OpenAI's ChatGPT.