# Developing Word Embeddings

Rather than using pre-trained embeddings (as we did in the sentence similarity baseline_deep_dive [notebook](../sentence_similarity/baseline_deep_dive.ipynb)), we can train word embeddings on our own dataset. This notebook demonstrates how to train word embeddings with the word2vec, fastText, and GloVe models, using the STS Benchmark dataset.

# Table of Contents
* [Data Loading and Preprocessing](#Load-and-Preprocess-Data)
* [Word2Vec](#Word2Vec)
* [fastText](#fastText)
* [GloVe](#GloVe)
* [Concluding Remarks](#Concluding-Remarks)

In [2]:
import gensim
import sys
import os

# Set the environment path
NLP_PATH = os.path.abspath('../../')
if NLP_PATH not in sys.path:
    sys.path.insert(0, NLP_PATH)

import numpy as np
from utils_nlp.dataset.preprocess import (
    to_lowercase,
    to_spacy_tokens,
    rm_spacy_stopwords,
)
from utils_nlp.dataset import stsbenchmark
from utils_nlp.common.timer import Timer
from gensim.models import Word2Vec
from gensim.models.fasttext import FastText

In [3]:
# Set the path for where your datasets are located
BASE_DATA_PATH = os.path.join(NLP_PATH, "data/")
# Location to save embeddings
SAVE_FILES_PATH = os.path.join(BASE_DATA_PATH, "trained_word_embeddings/")
if not os.path.exists(SAVE_FILES_PATH):
    os.makedirs(SAVE_FILES_PATH)

## Load and Preprocess Data

In [5]:
# Produce a pandas dataframe for the training set
train_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split="train")

# Clean the sts dataset
sts_train = stsbenchmark.clean_sts(train_raw)

100%|██████████| 401/401 [00:01<00:00, 205KB/s]  

Data downloaded to /data/home/yijichen/notebooks/nlp_repo/nlp/data/raw/stsbenchmark





In [6]:
sts_train.head(5)

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |


In [7]:
# Check the size of our dataframe
sts_train.shape

(5749, 3)

### Training set preprocessing

In [8]:
# Convert all text to lowercase
df_low = to_lowercase(sts_train)  
# Tokenize text
sts_tokenize = to_spacy_tokens(df_low) 
# Tokenize with removal of stopwords
sts_train_stop = rm_spacy_stopwords(sts_tokenize) 

In [9]:
# Append together the two sentence columns to get a list of all tokenized sentences.
all_sentences =  sts_train_stop[["sentence1_tokens_rm_stopwords", "sentence2_tokens_rm_stopwords"]]
# Flatten two columns into one list and remove all sentences that are size 0 after tokenization and stop word removal.
sentences = [i for i in all_sentences.values.flatten().tolist() if len(i) > 0]

In [10]:
len(sentences)

11498

In [11]:
sentence_lengths = [len(i) for i in sentences]
print("Minimum sentence length is {} tokens".format(min(sentence_lengths)))
print("Maximum sentence length is {} tokens".format(max(sentence_lengths)))
print("Median sentence length is {} tokens".format(np.median(sentence_lengths)))

Minimum sentence length is 1 tokens
Maximum sentence length is 43 tokens
Median sentence length is 6.0 tokens


In [12]:
sentences[:10]

[['plane', 'taking', '.'],
 ['air', 'plane', 'taking', '.'],
 ['man', 'playing', 'large', 'flute', '.'],
 ['man', 'playing', 'flute', '.'],
 ['man', 'spreading', 'shreded', 'cheese', 'pizza', '.'],
 ['man', 'spreading', 'shredded', 'cheese', 'uncooked', 'pizza', '.'],
 ['men', 'playing', 'chess', '.'],
 ['men', 'playing', 'chess', '.'],
 ['man', 'playing', 'cello', '.'],
 ['man', 'seated', 'playing', 'cello', '.']]

## Word2Vec

Word2vec is a predictive model for learning word embeddings from text (see [original research paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)). Word embeddings are learned such that words that share common contexts in the corpus will be close together in the vector space. There are two different model architectures that can be used to produce word2vec embeddings: continuous bag-of-words (CBOW) or continuous skip-gram. The former uses a window of surrounding words (the "context") to predict the current word and the latter uses the current word to predict the surrounding context words. See this [tutorial](https://www.guru99.com/word-embedding-word2vec.html#3) on word2vec for more detailed background on the model.
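
To make the difference between the two architectures concrete, here is a minimal pure-Python sketch of the training examples each one would draw from a single tokenized sentence. The helper names `cbow_examples` and `skipgram_examples` are hypothetical (they are not part of gensim), and real implementations add further details, such as negative sampling, that are omitted here.

```python
def cbow_examples(tokens, window):
    """Return (context_words, target_word) pairs: the context predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after
        context = left + right
        if context:                                # skip one-word sentences
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window):
    """Return (current_word, context_word) pairs: the word predicts each neighbor."""
    return [
        (target, context_word)
        for context, target in cbow_examples(tokens, window)
        for context_word in context
    ]


sentence = ["man", "playing", "flute"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
print(cbow)
print(skipgram)
```

CBOW produces one example per word (many context words in, one word out), while skip-gram produces one example per word-neighbor pair.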

The gensim Word2Vec model has many parameters (see [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)), but the most useful ones to know are:  
- size: length of the word embedding/vector (defaults to 100; renamed `vector_size` in gensim 4.0 and later)
- window: maximum distance between the current word and the word being predicted (defaults to 5)
- min_count: ignores all words that have a frequency lower than this value (defaults to 5)
- workers: number of worker threads used to train the model (defaults to 3)
- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)

In [13]:
# Set up a Timer to see how long the model takes to train
t = Timer()

In [14]:
t.start()

# Train the Word2vec model
word2vec_model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=3, sg=0)

t.stop()

In [15]:
print("Time elapsed: {}".format(t))

Time elapsed: 0.6343


Now that the model is trained we can:

1. Query for the word embeddings of a given word. 
2. Inspect the model vocabulary
3. Save the word embeddings
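
For reference on item 3, the non-binary word2vec format is plain text: a header line with the vocabulary size and vector length, followed by one line per word containing the word and its values separated by spaces. The sketch below parses that layout from an in-memory sample; `read_word2vec_text` is a hypothetical helper written for illustration, not a gensim function.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: 'vocab_size dim' header, then 'word v1 v2 ...'."""
    header = handle.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, vector_size, vectors


# A tiny made-up file with two words and three dimensions.
sample = io.StringIO("2 3\nman 0.1 0.2 0.3\nplaying -0.4 0.5 0.6\n")
vocab_size, vector_size, vectors = read_word2vec_text(sample)
print(vocab_size, vector_size, vectors["playing"])
```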

In [16]:
# 1. Let's see the word embedding for "apple" by accessing the "wv" attribute and passing in "apple" as the key.
print("Embedding for apple:", word2vec_model.wv["apple"])

# 2. Inspect the model vocabulary by accessing keys of the "wv.vocab" attribute. We'll print the first 20 words.
print("\nFirst 20 vocabulary words:", list(word2vec_model.wv.vocab)[:20])

# 3. Save the word embeddings. We can save as binary format (to save space) or ASCII format.
# Use distinct file names so the second save does not overwrite the first.
word2vec_model.wv.save_word2vec_format(SAVE_FILES_PATH+"word2vec_model.bin", binary=True)  # binary format
word2vec_model.wv.save_word2vec_format(SAVE_FILES_PATH+"word2vec_model.txt", binary=False)  # ASCII format

Embedding for apple: [ 8.94748047e-02  2.41466127e-02  1.40242130e-01 -1.01290472e-01
 -3.72909606e-02 -8.42960998e-02  1.04651630e-01 -1.51196137e-01
 -9.76017863e-02  5.58857433e-02 -1.05379172e-01  1.96037576e-01
  3.54142562e-02 -1.19922802e-01 -3.23171727e-02  1.97936416e-01
 -1.08599395e-01 -2.02625617e-02 -1.81590347e-03  1.17398715e-02
 -1.33693948e-01  1.20712914e-01 -7.43050948e-02  2.78903022e-02
  2.78636813e-02  6.78229854e-02 -2.38574557e-02 -7.83610195e-02
  8.14385060e-03 -8.23858902e-02 -1.06596418e-01  5.22979647e-02
  1.03389891e-02  9.60147753e-02  6.47476837e-02  2.17621550e-01
 -8.09327960e-02  6.91598132e-02  6.26451075e-02 -1.32119164e-01
 -8.17936435e-02 -1.01129502e-01  3.28128450e-02  1.44652754e-01
  4.50415276e-02  4.17685788e-03  2.75705159e-02 -1.73147812e-01
 -2.11286023e-02 -5.13567813e-02  1.62356552e-02  4.48382348e-02
 -4.29275855e-02 -6.81729009e-03 -5.25982417e-02  3.98872010e-02
  1.32774189e-01  9.31772217e-02  8.80175233e-02 -1.74944147e-01
  4.

## fastText

fastText is an unsupervised algorithm created by Facebook Research for efficiently learning word embeddings (see [original research paper](https://arxiv.org/pdf/1607.04606.pdf)). fastText differs significantly from word2vec and GloVe, which both treat each word as the smallest unit to learn an embedding for. fastText instead represents each word as a collection of character n-grams (e.g., the 2-grams of the word "language" are {la, an, ng, gu, ua, ag, ge}). The embedding for a word is then the sum of the embeddings of its character n-grams. This is an advantage for rare words and for words not present in the vocabulary, because these words can still be broken down into character n-grams. Typically, for smaller datasets, fastText performs better than word2vec or GloVe. See this [tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html) on fastText for more detail.
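
The character n-gram idea can be sketched in a few lines of pure Python. The helpers `char_ngrams` and `compose_vector` and the toy n-gram vectors are hypothetical, for illustration only. The fastText paper additionally wraps each word in the boundary symbols `<` and `>` and uses n-grams of length 3 to 6 together with the whole word; the `boundaries` flag below only hints at that.

```python
def char_ngrams(word, n, boundaries=False):
    """Return all character n-grams of `word`, optionally adding < and > markers."""
    if boundaries:
        word = "<" + word + ">"
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def compose_vector(ngrams, ngram_vectors, dim):
    """Sum the vectors of the n-grams we have embeddings for."""
    total = [0.0] * dim
    for gram in ngrams:
        vec = ngram_vectors.get(gram)
        if vec is None:
            continue  # n-gram never seen in training
        total = [a + b for a, b in zip(total, vec)]
    return total


bigrams = char_ngrams("language", 2)
trigrams = char_ngrams("language", 3, boundaries=True)

# Made-up 2-dimensional n-gram embeddings, just to show the summation.
toy_vectors = {"la": [1.0, 0.0], "an": [0.0, 2.0], "ge": [0.5, 0.5]}
word_vector = compose_vector(bigrams, toy_vectors, dim=2)
print(bigrams)
print(trigrams)
print(word_vector)
```

Because the word vector is built from n-grams, a word that never appeared in the training corpus can still receive a vector as long as some of its n-grams did.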

The gensim fastText model has many parameters (see [here](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText)), but the most useful ones to know are:  
- size: length of the word embedding/vector (defaults to 100; renamed `vector_size` in gensim 4.0 and later)
- window: maximum distance between the current word and the word being predicted (defaults to 5)
- min_count: ignores all words that have a frequency lower than this value (defaults to 5)
- workers: number of worker threads used to train the model (defaults to 3)
- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)
- iter: number of epochs (defaults to 5; renamed `epochs` in gensim 4.0 and later)


In [17]:
# Set up a Timer to see how long the model takes to train
t = Timer()

In [18]:
t.start()

# Train the FastText model
fastText_model = FastText(size=100, window=5, min_count=5, sentences=sentences, iter=5)

t.stop()

In [19]:
print("Time elapsed: {}".format(t))

Time elapsed: 11.1728


Because both models come from the gensim package, we can use the same attributes that we saw above for word2vec.

In [20]:
# 1. Let's see the word embedding for "apple" by accessing the "wv" attribute and passing in "apple" as the key.
print("Embedding for apple:", fastText_model.wv["apple"])

# 2. Inspect the model vocabulary by accessing keys of the "wv.vocab" attribute. We'll print the first 20 words.
print("\nFirst 20 vocabulary words:", list(fastText_model.wv.vocab)[:20])

# 3. Save the word embeddings. We can save as binary format (to save space) or ASCII format.
# Use distinct file names so the second save does not overwrite the first.
fastText_model.wv.save_word2vec_format(SAVE_FILES_PATH+"fastText_model.bin", binary=True)  # binary format
fastText_model.wv.save_word2vec_format(SAVE_FILES_PATH+"fastText_model.txt", binary=False)  # ASCII format

Embedding for apple: [-0.23510154 -0.15507501  0.0056565   0.45181453  0.5000084  -0.2648049
 -0.25791287 -0.4212533  -0.06907137  0.0013695   0.3194571   0.01570429
 -0.03375538 -0.07636142  0.15745506  0.2511224   0.04350953  0.08955397
 -0.11049644 -0.5870106  -0.14050071 -0.03914891 -0.05926621 -0.48968792
 -0.15853383 -0.07664221  0.11611713  0.13797617 -0.43066472  0.2673129
  0.06168905 -0.04650382 -0.01283566  0.09944137  0.14161733  0.15692197
  0.0488883  -0.17440423 -0.42009622 -0.25779897  0.29067218  0.4241775
  0.28518778 -0.17275187  0.10912739 -0.092472   -0.42640597  0.30356327
  0.03260724  0.14312139 -0.09600725 -0.233319   -0.71152973 -0.4668092
 -0.15484177  0.083478    0.14034158 -0.32355824 -0.45780435  0.2399303
 -0.3201641  -0.34011903  0.09115782  0.25974855 -0.08718303  0.05970525
 -0.10188221  0.13411698  0.32321262 -0.1038212   0.32776234 -0.0280938
 -0.181011   -0.20158029  0.15832287  0.20536025  0.0343249  -0.2551625
  0.00600469 -0.3237772   0.0900947  

## GloVe

GloVe is an unsupervised algorithm for obtaining word embeddings created by the Stanford NLP group (see [original research paper](https://nlp.stanford.edu/pubs/glove.pdf)). Training occurs on word-word co-occurrence statistics with the objective of learning word embeddings such that the dot product of two words' embeddings equals the logarithm of the words' probability of co-occurrence. See this [tutorial](https://nlp.stanford.edu/projects/glove/) on GloVe for more detailed background on the model.
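
For readers who want the objective in symbols, the weighted least-squares cost from the GloVe paper can be written as follows. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that limits the influence of very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

Minimizing $J$ pushes each dot product (plus biases) toward the logarithm of the corresponding co-occurrence count.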

Gensim doesn't have an implementation of the GloVe model, and the other Python packages that implement GloVe are unstable, so we use the code directly from the Stanford NLP [repo](https://github.com/stanfordnlp/GloVe).

In [None]:
glove_model_path = os.path.join(NLP_PATH, "utils_nlp/models/glove/")
!cd $glove_model_path && make 

### Train GloVe vectors

Training GloVe embeddings requires some data preparation followed by four steps (also documented in the original Stanford NLP repo [here](https://github.com/stanfordnlp/GloVe/tree/master/src)).

**Step 0: Prepare Data**
   
To train our GloVe vectors, we first need to save our corpus as a text file in which words are separated by one or more spaces or tabs and each document/sentence is on its own line.

In [22]:
# Save our corpus as tokens delimited by spaces with new line characters in between sentences.
training_corpus_file_path = os.path.join(SAVE_FILES_PATH, "training-corpus-cleaned.txt")
with open(training_corpus_file_path, 'w', encoding='utf8') as file:
    for sent in sentences:
        file.write(" ".join(sent) + "\n")

In [23]:
# Set up a Timer to see how long the model takes to train
t = Timer()
t.start()

**Step 1: Build Vocabulary**

Run the vocab_count executable. There are 3 optional parameters:
1. min-count: minimum number of times a word must appear in the dataset; words that appear less often are discarded from the vocabulary
2. max-vocab: upper bound on the number of vocabulary words to keep
3. verbose: 0, 1, or 2 (default)

Then provide the path to the text file we created in Step 0, followed by the file path where the vocabulary will be saved.
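
Conceptually, this step counts word frequencies and keeps the frequent words. The following pure-Python sketch mimics that behavior on a toy corpus; `build_vocab` is a hypothetical helper, and its tie-breaking and truncation order are assumptions rather than a description of the C tool's exact behavior. The resulting lines follow the `word count` layout of the vocabulary file.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count=1, max_vocab=None):
    """Count tokens, drop rare ones, and sort by descending frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically (an assumption).
    kept.sort(key=lambda item: (-item[1], item[0]))
    if max_vocab is not None:
        kept = kept[:max_vocab]
    return kept


toy_corpus = [
    ["man", "playing", "flute", "."],
    ["man", "playing", "cello", "."],
    ["men", "playing", "chess", "."],
]
vocab = build_vocab(toy_corpus, min_count=2)
vocab_lines = ["{} {}".format(word, n) for word, n in vocab]
print("\n".join(vocab_lines))
```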

In [25]:
vocab_count_exe = os.path.join(glove_model_path, 'build/vocab_count')
vocab_file_path = os.path.join(SAVE_FILES_PATH, "vocab.txt")

In [26]:
!$vocab_count_exe -min-count 5 -verbose 2 < $training_corpus_file_path > $vocab_file_path

BUILDING VOCABULARY
Processed 85334 tokens.
Counted 11716 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 2943.



**Step 2: Construct Word Co-occurrence Statistics**

Run the cooccur executable. There are many optional parameters, but we list the main ones here:
1. symmetric: 0 for only looking at left context, 1 (default) for looking at both left and right context
2. window-size: number of context words to the left (and to the right, if symmetric is 1) to use (default 15)
3. verbose: 0, 1, or 2 (default)
4. vocab-file: path/name of the vocabulary file created in Step 1
5. memory: soft limit for memory consumption in GB (default 4)
6. max-product: limit the size of the dense co-occurrence array by specifying the max product (integer) of the frequency counts of the two co-occurring words

Then provide the path to the text file we created in Step 0, followed by the file path where the co-occurrences will be saved.
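
As a rough illustration of what this step computes, the sketch below counts windowed co-occurrences in pure Python, weighting each pair by the inverse of the distance between the two words and dropping out-of-vocabulary words first. `count_cooccurrences` is a hypothetical helper; it is a simplified model of the tool under those assumptions and does not reproduce its binary output or memory management.

```python
from collections import defaultdict


def count_cooccurrences(tokenized_sentences, vocab, window_size=15, symmetric=True):
    """Return {(word, context_word): weighted count} using 1/distance weights."""
    counts = defaultdict(float)
    for sent in tokenized_sentences:
        words = [w for w in sent if w in vocab]   # ignore out-of-vocabulary words
        for j, word in enumerate(words):
            # Look back at up to `window_size` preceding words.
            for k in range(max(0, j - window_size), j):
                weight = 1.0 / (j - k)
                counts[(word, words[k])] += weight        # left context
                if symmetric:
                    counts[(words[k], word)] += weight    # mirrored right context
    return dict(counts)


toy_sentences = [["man", "playing", "flute"]]
toy_vocab = {"man", "playing", "flute"}
cooc = count_cooccurrences(toy_sentences, toy_vocab, window_size=2)
print(cooc)
```

Adjacent words contribute a weight of 1, words two positions apart contribute 0.5, and so on, so nearby context matters more than distant context.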

In [27]:
cooccur_exe = os.path.join(glove_model_path, 'build/cooccur')
cooccurrence_file_path = os.path.join(SAVE_FILES_PATH, "cooccurrence.bin")

In [28]:
!$cooccur_exe -memory 4 -vocab-file $vocab_file_path -verbose 2 -window-size 15 < $training_corpus_file_path > $cooccurrence_file_path

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "/data/home/yijichen/notebooks/nlp_repo/nlp/data/trained_word_embeddings/vocab.txt"...loaded 2943 words.
Building lookup table...table contains 8661250 elements.
Processed 85334 tokens.
Writing cooccurrences to disk......2 files in total.
Merging cooccurrence files: processed 188154 lines.



**Step 3: Shuffle the Co-occurrences**

Run the shuffle executable. The parameters are as follows:
1. verbose: 0, 1, or 2 (default)
2. memory: soft limit for memory consumption in GB (default 4)
3. array-size: limit on the length of the buffer that stores chunks of data to shuffle before writing to disk

Then provide the path to the co-occurrence file we created in Step 2, followed by the file path where the shuffled co-occurrences will be saved.

In [29]:
shuffle_exe = os.path.join(glove_model_path, 'build/shuffle')
cooccurrence_shuf_file_path = os.path.join(SAVE_FILES_PATH, "cooccurrence.shuf.bin")

In [30]:
!$shuffle_exe -memory 4 -verbose 2 < $cooccurrence_file_path > $cooccurrence_shuf_file_path

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 188154 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 188154 lines.



**Step 4: Train GloVe model**

Run the glove executable. There are many parameter options, but the top ones are listed below:
1. verbose: 0, 1, or 2 (default)
2. vector-size: dimension of word embeddings (50 is default)
3. threads: number of threads, default 8
4. iter: number of iterations, default 25
5. eta: learning rate, default 0.05
6. binary: whether to save binary format (0: text = default, 1: binary, 2: both)
7. x-max: cutoff for weighting function, default is 100
8. vocab-file: file containing vocabulary as produced in Step 1
9. save-file: filename to save vectors to 
10. input-file: filename with co-occurrences as returned from Step 3
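
To give a feel for the `x-max` parameter, the sketch below implements the weighting function described in the GloVe paper: co-occurrence counts below `x_max` are down-weighted by `(x / x_max) ** alpha`, and counts at or above it all receive weight 1. `glove_weight` is a hypothetical helper; the value `alpha=0.75` matches the `alpha` reported in the training log of this notebook.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight applied to a co-occurrence count x in the GloVe cost function."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


# With x_max=10 (the value passed to the glove executable in this notebook),
# counts of 10 or more are treated alike, while rarer pairs count for less.
weights = [glove_weight(x, x_max=10.0) for x in (1.0, 5.0, 10.0, 50.0)]
print(weights)
```

A smaller `x-max` means that more word pairs reach the full weight of 1, which can be reasonable for a small corpus where large co-occurrence counts are rare.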

In [31]:
glove_exe = os.path.join(glove_model_path, 'build/glove')
glove_vector_file_path = os.path.join(SAVE_FILES_PATH, "GloVe_vectors")

In [32]:
!$glove_exe -save-file $glove_vector_file_path -threads 8 -input-file \
$cooccurrence_shuf_file_path -x-max 10 -iter 15 -vector-size 50 -binary 2 \
-vocab-file $vocab_file_path -verbose 2

TRAINING MODEL
Read 188154 lines.
Initializing parameters...done.
vector size: 50
vocab size: 2943
x_max: 10.000000
alpha: 0.750000
08/01/19 - 08:43.48PM, iter: 001, cost: 0.078576
08/01/19 - 08:43.48PM, iter: 002, cost: 0.072297
08/01/19 - 08:43.48PM, iter: 003, cost: 0.070183
08/01/19 - 08:43.48PM, iter: 004, cost: 0.066722
08/01/19 - 08:43.49PM, iter: 005, cost: 0.063421
08/01/19 - 08:43.49PM, iter: 006, cost: 0.060710
08/01/19 - 08:43.49PM, iter: 007, cost: 0.058076
08/01/19 - 08:43.49PM, iter: 008, cost: 0.056020
08/01/19 - 08:43.49PM, iter: 009, cost: 0.053900
08/01/19 - 08:43.49PM, iter: 010, cost: 0.051753
08/01/19 - 08:43.49PM, iter: 011, cost: 0.049556
08/01/19 - 08:43.49PM, iter: 012, cost: 0.047366
08/01/19 - 08:43.49PM, iter: 013, cost: 0.045190
08/01/19 - 08:43.49PM, iter: 014, cost: 0.043074
08/01/19 - 08:43.49PM, iter: 015, cost: 0.041052


In [33]:
t.stop()

In [34]:
print("Time elapsed: {}".format(t))

Time elapsed: 76.7739


### Inspect Word Vectors

As we did above for the word2vec and fastText models, let's now inspect our word embeddings.

In [35]:
# Load the saved word vectors into a dictionary that maps each word to its vector.
glove_wv = {}
glove_vector_txt_file_path = os.path.join(SAVE_FILES_PATH, "GloVe_vectors.txt")
with open(glove_vector_txt_file_path, encoding='utf-8') as f:
    for line in f:
        split_line = line.split(" ")
        glove_wv[split_line[0]] = [float(i) for i in split_line[1:]]

In [36]:
# 1. Let's see the word embedding for "apple" by passing in "apple" as the key.
print("Embedding for apple:", glove_wv["apple"])

# 2. Inspect the model vocabulary by listing the keys of the "glove_wv" dictionary. We'll print the first 20 words.
print("\nFirst 20 vocabulary words:", list(glove_wv.keys())[:20])

Embedding for apple: [-0.031563, -0.002223, -0.02678, 0.032331, -0.052551, 0.033806, 0.027273, -0.040919, -0.032278, 0.144711, -0.056508, -0.006664, 0.18226, 0.087466, -0.072589, -0.003345, 0.058924, 0.054427, 0.012867, 0.013986, -0.083175, -0.02865, 0.044466, -0.095792, -0.042537, 0.019642, -0.032161, -0.038906, 0.253484, 0.090387, -0.03093, 0.081777, -0.085152, -0.113663, 0.10768, 0.068018, -0.05191, -0.177092, -0.06608, -0.223371, -0.016508, 0.232133, 0.002664, -0.132106, 0.078042, 0.111132, 0.052315, 0.010395, -0.031505, 0.128816]

First 20 vocabulary words: ['.', ',', 'man', '-', '"', 'woman', "'", 'said', 'dog', 'playing', ':', 'white', 'black', '$', 'killed', 'percent', 'new', 'syria', 'people', 'china']
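
One further way to inspect embeddings stored in a plain dictionary is to compare two vectors by cosine similarity, since words that share contexts should end up close together in the vector space. The sketch below uses made-up two-dimensional vectors rather than the trained ones; `cosine_similarity` and `toy_wv` are hypothetical names for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors: "plane" and "airplane" point the same way, "cheese" is orthogonal.
toy_wv = {"plane": [1.0, 0.0], "airplane": [2.0, 0.0], "cheese": [0.0, 3.0]}
sim_close = cosine_similarity(toy_wv["plane"], toy_wv["airplane"])
sim_far = cosine_similarity(toy_wv["plane"], toy_wv["cheese"])
print(sim_close, sim_far)
```

The same function could be applied to any two entries of a word-to-vector dictionary, provided both words are in the vocabulary.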


# Concluding Remarks

In this notebook we have shown how to train word2vec, fastText, and GloVe word embeddings on the STS Benchmark dataset. We also timed each model on our dataset: in the run shown above, word2vec took about 0.63 seconds and fastText took about 11.17 seconds. The GloVe timer, which spans all four steps across several notebook cells, reported about 76.77 seconds.

fastText is typically regarded as the best baseline for word embeddings (see [blog](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)) and is a good place to start when generating word embeddings. Now that we have generated word embeddings on our dataset, we could also repeat the baseline_deep_dive notebook using these embeddings instead of the pre-trained ones.