# Developing Word Embeddings

Rather than use pre-trained embeddings (as we did in the baseline_deep_dive notebook), we can train word embeddings using our own dataset. In this notebook, we demonstrate the training process for producing word embeddings using the word2vec, GloVe, and fastText models. We'll utilize the STS Benchmark dataset for this task. 

In [31]:
%load_ext blackcellmagic

In [3]:
import gensim
import sys

## Load and Preprocess Data

In [None]:
sys.path.append("../../../")  # make the utils_nlp package importable
BASE_DATA_PATH = "../../../data"

from utils_nlp.dataset.stsbenchmark import STSBenchmark
from utils_nlp.dataset.preprocess import (
    to_lowercase,
    to_spacy_tokens,
    rm_spacy_stopwords,
)

In [None]:
# Initializing this instance runs the downloader and extractor behind the scenes; as_dataframe() then converts the result to a dataframe
stsTrain = STSBenchmark("train", base_data_path=BASE_DATA_PATH).as_dataframe()

In [None]:
# train preprocessing
df_low = to_lowercase(stsTrain)  # convert all text to lowercase
sts_tokenize = to_spacy_tokens(df_low)  # tokenize normally
sts_train = rm_spacy_stopwords(sts_tokenize)  # tokenize with removal of stopwords

In [8]:
# Combine the token lists of both sentence columns into one training corpus
sentences = sts_train["sentence1_tokens"].append(sts_train["sentence2_tokens"])

In [9]:
sentences[:10]

0                       [a, plane, is, taking, off, .]
1            [a, man, is, playing, a, large, flute, .]
2    [a, man, is, spreading, shreded, cheese, on, a...
3                 [three, men, are, playing, chess, .]
4                 [a, man, is, playing, the, cello, .]
5                        [some, men, are, fighting, .]
6                             [a, man, is, smoking, .]
7               [the, man, is, playing, the, piano, .]
8    [a, man, is, playing, on, a, guitar, and, sing...
9    [a, person, is, throwing, a, cat, on, to, the,...
dtype: object

## Word2Vec

Word2vec is a predictive model for learning word embeddings from text. Word embeddings are learned such that words that share common contexts in the corpus will be close together in the vector space. There are two different model architectures that can be used to produce word2vec embeddings: continuous bag-of-words (CBOW) or continuous skip-gram. The former uses a window of surrounding words (the "context") to predict the current word and the latter uses the current word to predict the surrounding context words. See this [tutorial](https://www.guru99.com/word-embedding-word2vec.html#3) on word2vec for more detailed background on the model.
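
To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one derives from a tokenized sentence. The helper names `cbow_pairs` and `skipgram_pairs` are hypothetical and not part of gensim. The sketch leaves out refinements used by real implementations, such as sampling.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word): CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word): skip-gram predicts each context word from the target."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word


tokens = ["three", "men", "are", "playing", "chess"]
print(list(cbow_pairs(tokens, window=1)))
print(list(skipgram_pairs(tokens, window=1)))
```

With `window=1`, CBOW produces one example per word (its neighbours as input), while skip-gram produces one example per (word, neighbour) pair.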

The gensim Word2Vec model has many different parameters (see [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) but the ones that are useful to know about are:  
- size: length of the word embedding/vector (defaults to 100)
- window: maximum distance between word being predicted and the current word (defaults to 5)
- min_count: ignores all words that have a frequency lower than this value (defaults to 5)
- workers: number of worker threads used to train the model (defaults to 3)
- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)
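
The effect of `min_count` is easy to underestimate on a small corpus, so here is a rough sketch of the pruning it implies. The helper `prune_vocab` is hypothetical and only imitates the frequency cutoff; it is not how gensim builds its vocabulary internally.

```python
from collections import Counter


def prune_vocab(sentences, min_count=5):
    """Keep only the words that occur at least min_count times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["a", "man", "is", "playing", "a", "flute"],
    ["a", "man", "is", "playing", "the", "cello"],
    ["the", "man", "is", "playing", "the", "piano"],
]
vocab = prune_vocab(toy_sentences, min_count=3)
print(sorted(vocab))  # ['a', 'is', 'man', 'playing', 'the']
```

Words that fall below the cutoff ("flute", "cello", "piano" here) get no embedding, so querying a Word2Vec model for them raises a `KeyError`.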

In [None]:
from gensim.models import Word2Vec

word2vec_model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=3, sg=0)

Now that the model is trained we can:

1. Query for the word embeddings of a given word. 
2. Inspect the model vocabulary
3. Save the word embeddings

In [None]:
# 1. Let's see the word embedding for "apple" by accessing the "wv" attribute of our model and passing in "apple" as the key.
print("Embedding for apple:", word2vec_model.wv["apple"])

# 2. Inspect the model vocabulary by accessing keys of the "wv.vocab" attribute. We'll print the first 20 words
print("\nFirst 20 vocabulary words:", list(word2vec_model.wv.vocab)[:20])

# 3. Save the word embeddings. We can save as binary format (to save space) or ASCII format
word2vec_model.wv.save_word2vec_format("model.bin", binary=True)  # binary format
word2vec_model.wv.save_word2vec_format("model.txt", binary=False)  # ASCII format, in a separate file so the binary one is not overwritten
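
The ASCII format is simple enough to read by hand: a header line with the vocabulary size and the vector length, followed by one line per word with its space-separated values. The sketch below parses that layout from an in-memory sample. The function `parse_word2vec_text` and the sample values are made up for illustration. In practice, `gensim.models.KeyedVectors.load_word2vec_format` is the supported way to load these files.

```python
def parse_word2vec_text(lines):
    """Parse word2vec ASCII lines into a {word: [float, ...]} dictionary."""
    lines = iter(lines)
    # Header: "<vocabulary size> <vector length>"
    vocab_size, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of vectors read")
    return vectors


# Made-up sample with two words and three dimensions
sample = """2 3
plane 0.1 -0.2 0.3
flute 0.0 0.5 -0.5
""".splitlines()

vectors = parse_word2vec_text(sample)
print(vectors["plane"])
```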

## GloVe

GloVe is an unsupervised algorithm for obtaining word embeddings. It is trained on word-word co-occurrence statistics, with the objective of learning word embeddings such that the dot product of two words' embeddings equals the logarithm of the words' probability of co-occurrence. See this [tutorial](https://nlp.stanford.edu/projects/glove/) on GloVe for more detailed background on the model.
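
The word-word co-occurrence statistics that GloVe trains on can be illustrated with a small counting sketch. The function `cooccurrence_counts` is hypothetical and uses plain, unweighted counts within a symmetric window, which is a simplification of what a full GloVe pipeline computes.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbour) pair appears within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


toy_sentences = [
    ["a", "man", "is", "playing", "chess"],
    ["a", "man", "is", "smoking"],
]
counts = cooccurrence_counts(toy_sentences, window=1)
print(counts[("man", "is")], counts[("is", "playing")])
```

Because the window is symmetric, the count for ("man", "is") equals the count for ("is", "man"). Embeddings are then fit to a matrix of such counts rather than to individual training windows.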

## fastText

fastText is an unsupervised algorithm created by Facebook Research for efficiently learning word embeddings. It differs significantly from word2vec and GloVe: those two algorithms treat each word as the smallest unit to find an embedding for, whereas fastText represents a word as a set of character n-grams (e.g. the 2-grams of the word "language" are {la, an, ng, gu, ua, ag, ge}). The embedding for a word is then the sum of the embeddings of its character n-grams. This is an advantage for rare words and for words not present in the vocabulary, because these words can still be broken down into character n-grams. For smaller datasets, fastText typically performs better than word2vec or GloVe. See this [tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html) on fastText for more detail.
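
The n-gram decomposition described above can be sketched in a few lines. Both helpers, `char_ngrams` and `compose_vector`, are hypothetical, and the n-gram vectors are made-up toy values. The `mark_boundaries` option imitates the `<` and `>` word-boundary markers that fastText adds before extracting n-grams.

```python
def char_ngrams(word, n, mark_boundaries=False):
    """Return the character n-grams of a word, in order of appearance."""
    if mark_boundaries:
        word = f"<{word}>"
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def compose_vector(word, ngram_vectors, n, dim):
    """Sum the vectors of the word's known n-grams (unknown n-grams are skipped)."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("language", 2))
# ['la', 'an', 'ng', 'gu', 'ua', 'ag', 'ge']

# Toy n-gram vectors: "cat" never has to appear as a whole word to get an embedding
toy_ngram_vectors = {"ca": [1.0, 0.0], "at": [0.0, 2.0]}
print(compose_vector("cat", toy_ngram_vectors, n=2, dim=2))  # [1.0, 2.0]
```

This is why a fastText model can return an embedding for a word it never saw during training, as long as some of the word's n-grams were seen.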

The gensim fastText model has many different parameters (see [here](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText)) but the ones that are useful to know about are:  
- size: length of the word embedding/vector (defaults to 100)
- window: maximum distance between word being predicted and the current word (defaults to 5)
- min_count: ignores all words that have a frequency lower than this value (defaults to 5)
- workers: number of worker threads used to train the model (defaults to 3)
- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)
- iter: number of epochs (defaults to 5)


In [None]:
from gensim.models.fasttext import FastText

fastText_model = FastText(size=100, window=5, min_count=5, sentences=sentences, iter=5)

Because both models come from the gensim package, we can use the same attributes that we saw above for word2vec.

In [None]:
# 1. Let's see the word embedding for "apple" by accessing the "wv" attribute of our model and passing in "apple" as the key.
print("Embedding for apple:", fastText_model.wv["apple"])

# 2. Inspect the model vocabulary by accessing keys of the "wv.vocab" attribute. We'll print the first 20 words
print("\nFirst 20 vocabulary words:", list(fastText_model.wv.vocab)[:20])

# 3. Save the word embeddings. We can save as binary format (to save space) or ASCII format
fastText_model.wv.save_word2vec_format("FTmodel.bin", binary=True)  # binary format
fastText_model.wv.save_word2vec_format("FTmodel.txt", binary=False)  # ASCII format, in a separate file so the binary one is not overwritten