# Developing Word Embeddings

Rather than use pre-trained embeddings (as we did in the baseline_deep_dive notebook), we can train word embeddings using our own dataset. In this notebook, we demonstrate the training process for producing word embeddings using the word2vec, GloVe, and fastText models. We'll utilize the STS Benchmark dataset for this task. 

# Table of Contents
* [Data Loading and Preprocessing](#Load-and-Preprocess-Data)
* [Word2Vec](#Word2Vec)
* [fastText](#fastText)
* [GloVe](#GloVe)
* [Concluding Remarks](#Concluding-Remarks)

In [1]:
import gensim
import sys
# Set the environment path
sys.path.append("../../") 
import os
from utils_nlp.dataset.preprocess import (
    to_lowercase,
    to_spacy_tokens,
    rm_spacy_stopwords,
)
from utils_nlp.dataset import stsbenchmark
from gensim.models import Word2Vec
from gensim.models.fasttext import FastText

In [2]:
# Set the path for where your datasets are located
BASE_DATA_PATH = "../../data" 
# Location to save embeddings
SAVE_FILES_PATH = BASE_DATA_PATH + "/trained_word_embeddings/"

In [3]:
if not os.path.exists(SAVE_FILES_PATH):
    os.makedirs(SAVE_FILES_PATH)

## Load and Preprocess Data

In [4]:
# Produce a pandas dataframe for the training set
sts_train = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split="train")

#### Training set preprocessing

In [5]:
# Convert all text to lowercase
df_low = to_lowercase(sts_train)  
# Tokenize text
sts_tokenize = to_spacy_tokens(df_low) 
# Tokenize with removal of stopwords
sts_train_stop = rm_spacy_stopwords(sts_tokenize) 

In [6]:
# Append together the two sentence columns to get a list of all tokenized sentences.
all_sentences =  sts_train_stop[["sentence1_tokens_rm_stopwords", "sentence2_tokens_rm_stopwords"]]
sentences = all_sentences.values.flatten().tolist()

In [7]:
sentences[:10]

[['plane', 'taking', '.'],
 ['air', 'plane', 'taking', '.'],
 ['man', 'playing', 'large', 'flute', '.'],
 ['man', 'playing', 'flute', '.'],
 ['man', 'spreading', 'shreded', 'cheese', 'pizza', '.'],
 ['man', 'spreading', 'shredded', 'cheese', 'uncooked', 'pizza', '.'],
 ['men', 'playing', 'chess', '.'],
 ['men', 'playing', 'chess', '.'],
 ['man', 'playing', 'cello', '.'],
 ['man', 'seated', 'playing', 'cello', '.']]

## Word2Vec

Word2vec is a predictive model for learning word embeddings from text. Word embeddings are learned such that words that share common contexts in the corpus will be close together in the vector space. There are two different model architectures that can be used to produce word2vec embeddings: continuous bag-of-words (CBOW) or continuous skip-gram. The former uses a window of surrounding words (the "context") to predict the current word and the latter uses the current word to predict the surrounding context words. See this [tutorial](https://www.guru99.com/word-embedding-word2vec.html#3) on word2vec for more detailed background on the model.
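To make the difference between the two architectures concrete, the following minimal sketch builds the training examples each one would see for a single tokenized sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) are hypothetical and written only for illustration; they are not part of gensim, and the sketch ignores details of real training such as negative sampling.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words are the input, the current word is the target."""
    return [(context_window(tokens, i, window), word) for i, word in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word is the input, each context word is a separate target."""
    return [
        (word, context_word)
        for i, word in enumerate(tokens)
        for context_word in context_window(tokens, i, window)
    ]


sentence = ["man", "playing", "large", "flute"]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

CBOW produces one example per word, while skip-gram produces one example per (word, context word) pair, which is why skip-gram sees more training examples from the same text.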

The gensim Word2Vec model has many different parameters (see [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) but the ones that are useful to know about are:  
- size: length of the word embedding/vector (defaults to 100)
- window: maximum distance between the word being predicted and the current word (defaults to 5)
- min_count: ignores all words that have a frequency lower than this value (defaults to 5)
- workers: number of worker threads used to train the model (defaults to 3)
- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)

In [8]:
word2vec_model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=3, sg=0)
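The parameter names used in this notebook are those of gensim 3.x. In gensim 4.0 and later, `size` was renamed to `vector_size` and `iter` to `epochs`, and the `wv.vocab` attribute was replaced by `wv.key_to_index`. The sketch below shows one way to translate the keyword arguments; `to_gensim4_kwargs` is a hypothetical helper written for this example, not a gensim function.

```python
# Keyword names used in this notebook (gensim 3.x) and their gensim 4.x replacements.
GENSIM4_RENAMES = {"size": "vector_size", "iter": "epochs"}


def to_gensim4_kwargs(params):
    """Return a copy of `params` with gensim 3.x keyword names mapped to gensim 4.x names."""
    return {GENSIM4_RENAMES.get(name, name): value for name, value in params.items()}


params = {"size": 100, "window": 5, "min_count": 5, "workers": 3, "sg": 0}
print(to_gensim4_kwargs(params))
# With gensim 4.x installed, the training call would then be:
#   Word2Vec(sentences, **to_gensim4_kwargs(params))
```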

Now that the model is trained we can:

1. Query for the word embeddings of a given word. 
2. Inspect the model vocabulary
3. Save the word embeddings

In [9]:
# 1. Let's see the word embedding for "apple" by accessing the "wv" attribute and passing in "apple" as the key.
print("Embedding for apple:", word2vec_model.wv["apple"])

# 2. Inspect the model vocabulary by accessing keys of the "wv.vocab" attribute. We'll print the first 20 words.
print("\nFirst 20 vocabulary words:", list(word2vec_model.wv.vocab)[:20])

# 3. Save the word embeddings. We can save as binary format (to save space) or ASCII format.
# Use distinct file names so that the ASCII file does not overwrite the binary one.
word2vec_model.wv.save_word2vec_format(SAVE_FILES_PATH+"word2vec_model.bin", binary=True)  # binary format
word2vec_model.wv.save_word2vec_format(SAVE_FILES_PATH+"word2vec_model.txt", binary=False)  # ASCII format

Embedding for apple: [-1.45744249e-01 -4.23673481e-01  3.25938733e-03  2.16118336e-01
 -3.38810831e-02  4.36680727e-02 -9.32965130e-02 -1.30193695e-01
  1.22479253e-01 -5.57641592e-03  1.15231648e-01 -1.90644771e-01
  1.36980727e-01  1.54277921e-01  3.02371562e-01 -1.50472924e-01
  1.22283474e-01  1.66438103e-01 -4.53472175e-02 -7.50410557e-02
  9.35729593e-02  2.84669697e-02 -1.56790614e-01  2.13695884e-01
  3.59700769e-02 -2.28801727e-01 -6.65102080e-02 -1.00866072e-01
  1.73690766e-02 -6.05353080e-02  6.91714063e-02 -1.40613124e-01
 -1.55061241e-02  3.15357596e-02  1.05547914e-02  8.11263919e-02
  8.59601721e-02 -1.93002746e-01  5.66715710e-02  4.07571755e-02
 -5.99270537e-02  8.00862685e-02 -7.79552236e-02 -1.84332090e-03
 -3.64068709e-02  9.83870868e-03  2.55715717e-02 -6.73236232e-03
 -1.00992797e-02  2.55039651e-02  7.85838366e-02  1.66635379e-01
  5.47658615e-02  1.30082041e-01  1.10447399e-01  3.11631355e-02
 -7.80537724e-02  5.18930964e-02  1.29589550e-02  4.16627787e-02
 -1.
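The ASCII word2vec format is plain text: a header line with the vocabulary size and the vector dimension, followed by one line per word containing the word and its vector components. In practice gensim's `KeyedVectors.load_word2vec_format` reads these files; the sketch below only illustrates the layout. The function `read_word2vec_text` and the toy file are assumptions made for this example.

```python
import os
import tempfile


def read_word2vec_text(path):
    """Parse a word2vec text-format file into a {word: [float, ...]} dictionary."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        # Header line: "<vocabulary size> <vector dimension>"
        vocab_size, dim = (int(n) for n in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Sanity-check the contents against the header.
    assert len(vectors) == vocab_size
    assert all(len(v) == dim for v in vectors.values())
    return vectors


# Toy file standing in for the saved embeddings (two words, three dimensions).
toy_path = os.path.join(tempfile.mkdtemp(), "toy_vectors.txt")
with open(toy_path, "w", encoding="utf8") as f:
    f.write("2 3\nman 0.1 0.2 0.3\nflute -0.4 0.5 0.6\n")

toy_vectors = read_word2vec_text(toy_path)
print(toy_vectors["flute"])
```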

## fastText

fastText is an unsupervised algorithm created by Facebook Research for efficiently learning word embeddings. It differs from word2vec and GloVe, which treat each word as the smallest unit to learn an embedding for. fastText instead represents each word as a bag of character n-grams (e.g., the 2-grams of the word "language" are {la, an, ng, gu, ua, ag, ge}). The embedding for a word is then the sum of the embeddings of its character n-grams. This is an advantage for rare words and for words not present in the vocabulary, since these words can still be broken down into character n-grams. On smaller datasets, fastText often performs better than word2vec or GloVe. See this [tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html) on fastText for more detail.
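As a small illustration of the character n-gram idea, the sketch below extracts the n-grams of a word. The function `char_ngrams` is a hypothetical helper, not the fastText or gensim implementation. The `boundaries` option mimics how fastText wraps each word in `<` and `>` so that prefixes and suffixes are distinguishable; fastText also combines several n-gram lengths (3 to 6 by default), which this sketch does not do.

```python
def char_ngrams(word, n, boundaries=False):
    """Return the character n-grams of `word`, in order of appearance."""
    if boundaries:
        # Mark the start and end of the word, as fastText does.
        word = "<" + word + ">"
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("language", 2))
print(char_ngrams("language", 3, boundaries=True))
```

Because an unseen word such as a misspelling still shares most of its n-grams with known words, a vector can be assembled for it from the n-gram embeddings learned during training.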

The gensim fastText model has many different parameters (see [here](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText)) but the ones that are useful to know about are:  
- size: length of the word embedding/vector (defaults to 100)
- window: maximum distance between the word being predicted and the current word (defaults to 5)
- min_count: ignores all words that have a frequency lower than this value (defaults to 5)
- workers: number of worker threads used to train the model (defaults to 3)
- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)
- iter: number of epochs (defaults to 5)


In [10]:
fastText_model = FastText(size=100, window=5, min_count=5, sentences=sentences, iter=5)

Because both models come from the gensim package, we can use the same attributes that we saw above for word2vec.

In [11]:
# 1. Let's see the word embedding for "apple" by accessing the "wv" attribute and passing in "apple" as the key.
print("Embedding for apple:", fastText_model.wv["apple"])

# 2. Inspect the model vocabulary by accessing keys of the "wv.vocab" attribute. We'll print the first 20 words.
print("\nFirst 20 vocabulary words:", list(fastText_model.wv.vocab)[:20])

# 3. Save the word embeddings. We can save as binary format (to save space) or ASCII format.
# Use distinct file names so that the ASCII file does not overwrite the binary one.
fastText_model.wv.save_word2vec_format(SAVE_FILES_PATH+"fastText_model.bin", binary=True)  # binary format
fastText_model.wv.save_word2vec_format(SAVE_FILES_PATH+"fastText_model.txt", binary=False)  # ASCII format

Embedding for apple: [ 0.28720722  0.09591733  0.54433066 -0.05831062 -0.16943751  0.3623266
 -0.15930647 -0.11854805 -0.51492     0.34766328  0.14267923  0.05289214
  0.50301915  0.36670431 -0.0856811   0.40855268 -0.42051238 -0.42935574
 -0.00590014  0.1055185   0.19841222  0.28653464  0.02694756 -0.2998636
  0.2037611  -0.09531905 -0.18712749 -0.0126017   0.37071258 -0.10778084
  0.05781018 -0.28327507  0.36217982  0.19254316  0.39235938 -0.25031894
 -0.05957209  0.88860816  0.07133473  0.08105652 -0.2374883  -0.00684688
 -0.2799134   0.03850232 -0.03038069  0.09097645 -0.21751858 -0.34727782
 -0.2102714   0.06907686 -0.3250014   0.14360848  0.06191867  0.30928102
  0.00916505 -0.2980404   0.08639123  0.43495005  0.2583049  -0.05802987
  0.25165558 -0.05238454  0.21494739  0.16128843  0.23630676 -0.60110664
  0.41299316  0.09353159 -0.03104531 -0.17996807  0.11146976 -0.0121649
  0.06300905  0.30792916  0.06998671 -0.1877395   0.02814171  0.10962378
 -0.05561887  0.01427535  0.49409

## GloVe

GloVe is an unsupervised algorithm for obtaining word embeddings created by the Stanford NLP group. Training occurs on word-word co-occurrence statistics with the objective of learning word embeddings such that the dot product of two words' embeddings equals the logarithm of the words' probability of co-occurrence. See this [tutorial](https://nlp.stanford.edu/projects/glove/) on GloVe for more detailed background on the model.
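The objective can be sketched for a single word pair as follows. In the GloVe paper, each pair with co-occurrence count X_ij contributes a weighted squared error f(X_ij) * (w_i · w_j + b_i + b_j - log X_ij)^2, where f down-weights rare pairs and is capped at 1 for counts above `x_max`. The function names below are hypothetical, and the sketch omits constant factors and the optimizer used by the Stanford implementation.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): down-weights rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_cost(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one word pair with co-occurrence count x_ij > 0."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij, x_max, alpha) * (dot + b_i + b_j - math.log(x_ij)) ** 2


print(glove_weight(10.0), glove_weight(250.0))
# The cost is zero when the dot product plus biases matches log(x_ij) exactly.
print(glove_pair_cost([0.5, 1.0], [1.0, 0.5], 0.0, 0.0, x_ij=math.e))
```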

Gensim doesn't have an implementation of the GloVe model, so we suggest getting the code directly from the Stanford NLP [repo](https://github.com/stanfordnlp/GloVe). Run the following commands to clone the repo and then make. Clone the repo in the same location as this notebook! Otherwise, the paths below will need to be modified.  

    git clone http://github.com/stanfordnlp/glove    
    cd glove && make  

### Train GloVe vectors

Training GloVe embeddings requires some data prep and then 4 steps (also documented in the original Stanford NLP repo [here](https://github.com/stanfordnlp/GloVe/tree/master/src)).

**Step 0: Prepare Data**
   
To train our GloVe vectors, we first need to save our corpus as a text file with all words separated by one or more spaces or tabs. Each document/sentence is separated by a newline character.

In [12]:
# Save our corpus as tokens delimited by spaces with new line characters in between sentences.
with open(BASE_DATA_PATH+'/clean/stsbenchmark/training-corpus-cleaned.txt', 'w', encoding='utf8') as file:
    for sent in sentences:
        file.write(" ".join(sent) + "\n")

**Step 1: Build Vocabulary**

Run the vocab_count executable. There are 3 optional parameters:
1. min-count: lower limit on how many times a word must appear in dataset. Otherwise the word is discarded from our vocabulary.
2. max-vocab: upper bound on the number of vocabulary words to keep
3. verbose: 0, 1, or 2 (default)

Then provide the path to the text file we created in Step 0, followed by the file path where the vocabulary will be saved.
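Conceptually, this step counts how often each token occurs, discards tokens below the minimum count, and keeps the rest sorted by frequency. The sketch below reproduces that logic in Python on a toy corpus; `build_vocab` is a hypothetical helper and only approximates the behavior of the C executable.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count=5, max_vocab=None):
    """Count tokens, drop those seen fewer than `min_count` times, sort by frequency."""
    counts = Counter(token for sent in tokenized_sentences for token in sent)
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically so the order is deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:max_vocab] if max_vocab else kept


toy_corpus = [["man", "playing", "flute", "."], ["man", "playing", "cello", "."]]
for word, n in build_vocab(toy_corpus, min_count=2):
    print(word, n)
```

With `min_count=2`, the words "flute" and "cello" are dropped because each appears only once in the toy corpus.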

In [13]:
!"glove/build/vocab_count" -min-count 5 -verbose 2 <"../../data/clean/stsbenchmark/training-corpus-cleaned.txt"> "../../data/trained_word_embeddings/vocab.txt"

BUILDING VOCABULARY
Processed 0 tokens.Processed 84997 tokens.
Counted 11716 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 2943.



**Step 2: Construct Word Co-occurrence Statistics**

Run the cooccur executable. There are many optional parameters, but we list the top ones here:
1. symmetric: 0 for only looking at left context, 1 (default) for looking at both left and right context
2. window-size: number of context words to use (default 15)
3. verbose: 0, 1, or 2 (default)
4. vocab-file: path/name of the vocabulary file created in Step 1
5. memory: soft limit for memory consumption in GB (default 4)
6. max-product: limit the size of dense co-occurrence array by specifying the max product (integer) of the frequency counts of the two co-occurring words

Then provide the path to the text file we created in Step 0, followed by the file path where the co-occurrences will be saved.
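To illustrate what the co-occurrence statistics contain, here is a minimal Python sketch that counts word pairs within a window, weighting each pair by the inverse of the distance between the two words as described in the GloVe paper. The function `count_cooccurrences` is hypothetical; the actual executable works on vocabulary indices and writes a binary file, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict


def count_cooccurrences(tokenized_sentences, vocab, window_size=15, symmetric=True):
    """Accumulate distance-weighted co-occurrence counts for in-vocabulary word pairs."""
    counts = defaultdict(float)
    for sent in tokenized_sentences:
        # Out-of-vocabulary tokens are dropped before the window is applied.
        tokens = [t for t in sent if t in vocab]
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window_size), i):
                weight = 1.0 / (i - j)  # closer context words contribute more
                counts[(word, tokens[j])] += weight  # left context of `word`
                if symmetric:
                    counts[(tokens[j], word)] += weight  # right context of tokens[j]
    return dict(counts)


toy_counts = count_cooccurrences([["man", "playing", "flute"]],
                                 vocab={"man", "playing", "flute"}, window_size=2)
print(toy_counts[("flute", "playing")], toy_counts[("flute", "man")])
```

Setting `symmetric=False` keeps only the left-context entries, mirroring the `symmetric` option listed above.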

In [14]:
!"glove/build/cooccur" -memory 4 -vocab-file "../../data/trained_word_embeddings/vocab.txt" -verbose 2 -window-size 15 <"../../data/clean/stsbenchmark/training-corpus-cleaned.txt"> "../../data/trained_word_embeddings/cooccurrence.bin"

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "../../data/trained_word_embeddings/vocab.txt"...loaded 2943 words.
Building lookup table...table contains 8661250 elements.
Processing token: 0Processed 84997 tokens.
Writing cooccurrences to disk......2 files in total.
Merging cooccurrence files: processed 0 lines.0 lines.100000 lines.Merging cooccurrence files: processed 187717 lines.



**Step 3: Shuffle the Co-occurrences**

Run the shuffle executable. The parameters are as follows:
1. verbose: 0, 1, or 2 (default)
2. memory: soft limit for memory consumption in GB (default 4)
3. array-size: limit to the length of the buffer which stores chunks of data to shuffle before writing to disk

Then provide the path to the co-occurrence file we created in Step 2, followed by the file path where the shuffled co-occurrences will be saved.

In [15]:
!"glove/build/shuffle" -memory 4 -verbose 2 <"../../data/trained_word_embeddings/cooccurrence.bin"> "../../data/trained_word_embeddings/cooccurrence.shuf.bin"

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 0 lines.processed 187717 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 0 lines.187717 lines.Merging temp files: processed 187717 lines.



**Step 4: Train GloVe model**

Run the glove executable. There are many parameter options, but the top ones are listed below:
1. verbose: 0, 1, or 2 (default)
2. vector-size: dimension of word embeddings (50 is default)
3. threads: number of threads, default 8
4. iter: number of iterations, default 25
5. eta: learning rate, default 0.05
6. binary: whether to save binary format (0: text = default, 1: binary, 2: both)
7. x-max: cutoff for weighting function, default is 100
8. vocab-file: file containing vocabulary as produced in Step 1
9. save-file: filename to save vectors to 
10. input-file: filename with co-occurrences as returned from Step 3

In [16]:
!"glove/build/glove" -save-file "../../data/trained_word_embeddings/GloVe_vectors" -threads 8 -input-file \
"../../data/trained_word_embeddings/cooccurrence.shuf.bin" -x-max 10 -iter 15 -vector-size 50 -binary 2 \
-vocab-file "../../data/trained_word_embeddings/vocab.txt" -verbose 2

TRAINING MODEL
Read 187717 lines.
Initializing parameters...done.
vector size: 50
vocab size: 2943
x_max: 10.000000
alpha: 0.750000
05/06/19 - 08:43.06PM, iter: 001, cost: 0.078361
05/06/19 - 08:43.06PM, iter: 002, cost: 0.072087
05/06/19 - 08:43.06PM, iter: 003, cost: 0.070096
05/06/19 - 08:43.06PM, iter: 004, cost: 0.067155
05/06/19 - 08:43.06PM, iter: 005, cost: 0.063527
05/06/19 - 08:43.06PM, iter: 006, cost: 0.060744
05/06/19 - 08:43.06PM, iter: 007, cost: 0.058127
05/06/19 - 08:43.06PM, iter: 008, cost: 0.056086
05/06/19 - 08:43.06PM, iter: 009, cost: 0.054027
05/06/19 - 08:43.06PM, iter: 010, cost: 0.051815
05/06/19 - 08:43.06PM, iter: 011, cost: 0.049561
05/06/19 - 08:43.06PM, iter: 012, cost: 0.047386
05/06/19 - 08:43.06PM, iter: 013, cost: 0.045234
05/06/19 - 08:43.06PM, iter: 014, cost: 0.043166
05/06/19 - 08:43.06PM, iter: 015, cost: 0.041163


### Inspect Word Vectors

As we did above for the word2vec and fastText models, let's now inspect our word embeddings.

In [17]:
# Load the saved word vectors into a dictionary keyed by word.
glove_wv = {}
with open("../../data/trained_word_embeddings/GloVe_vectors.txt", encoding='utf-8') as f:
    for line in f:
        split_line = line.split(" ")
        glove_wv[split_line[0]] = [float(i) for i in split_line[1:]]

In [18]:
# 1. Let's see the word embedding for "apple" by passing in "apple" as the key.
print("Embedding for apple:", glove_wv["apple"])

# 2. Inspect the vocabulary by accessing the keys of the "glove_wv" dictionary. We'll print the first 20 words.
print("\nFirst 20 vocabulary words:", list(glove_wv.keys())[:20])

Embedding for apple: [0.106778, -0.054364, 0.071933, 0.107653, 0.029326, -0.112638, 0.026381, 0.022774, -0.05421, 0.074924, -0.043991, 0.140131, 0.034497, 0.072747, 0.051646, -0.082894, 0.064107, -0.063781, 0.028841, -0.001721, 0.132145, 0.133201, 0.066409, 0.121694, -0.232013, 0.016408, 0.020115, 0.126312, -0.136232, -0.084801, 0.048155, -0.072803, -0.100928, 0.066932, 0.002329, -0.022856, 0.042825, -0.039268, -0.030836, -0.071207, 0.060238, -0.032366, 0.021732, -0.120279, 0.132316, 0.069412, -0.067842, 0.04007, -0.177785, -0.001198]

First 20 vocabulary words: ['.', ',', 'man', '-', 'woman', "'", 'said', 'dog', '"', 'playing', ':', 'white', 'black', '$', 'killed', 'percent', 'new', 'syria', 'people', 'china']
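Since the GloVe vectors are held in a plain dictionary rather than a gensim model, one way to inspect them further is to compare words by cosine similarity. The sketch below is a minimal pure-Python version; `cosine_similarity` and `most_similar` are hypothetical helpers, and the toy vectors are made up so that the example runs on its own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank the other words in `vectors` by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Toy vectors standing in for glove_wv; the values are made up for illustration.
toy_wv = {"man": [1.0, 0.0], "woman": [0.9, 0.1], "dog": [0.0, 1.0]}
print(most_similar("man", toy_wv, topn=2))
```

To apply this to the trained embeddings, pass the dictionary loaded above, for example `most_similar("man", glove_wv)`.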


## Concluding Remarks

In this notebook we have shown how to train word2vec, GloVe, and fastText word embeddings on the STS Benchmark dataset. fastText is often regarded as a strong baseline for word embeddings (see [blog](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)) and is a good place to start when generating word embeddings. Now that we have generated word embeddings on our dataset, we could also repeat the baseline_deep_dive notebook using these embeddings instead of the pre-trained ones.