# Study Notes on Word Embedding and Word2Vec

**Outline**

* [What is word embedding?](#intro)
* [Learn word embedding](#learn)
    * [Train neural network models from Scratch](#example0)
    * [Word2Vec](#word2vec)
        * Skip Gram & CBOW
        * [Example 1: training word2vec on dummy sentences](#example1)
        * [Example 2: word2vec using pretrain weights on dummy sentences](#example2)
            * Using the pre-train weights from Google news data
            * Using the pre-train weights from GloVe
    * [Text Classification by using pre-trained weights](#example3)
* [Reference](#refer)
---

## <a id='intro'>What is word embedding?</a>

The goal of word embedding is to represent each word as a vector in a space where distance reflects meaning, so that we can tell which words are semantically similar to each other. For example, with good embeddings we can solve analogies such as *man* is to *woman* as *king* is to *??*.

If we have a document with a vocabulary of 10,000 words, the simplest way to represent a word as a vector is one-hot encoding. However, this does not help with our goal, because every pair of distinct words is equally far apart: *Orange* is as different from *Apple* as it is from *Table*.
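
The point about one-hot distances can be checked numerically. The following is a minimal sketch with a made-up three-word vocabulary (the names `vocab` and `one_hot_vectors` are illustrative, not from any library):

```python
import numpy as np

# Hypothetical three-word vocabulary, for illustration only.
vocab = ["orange", "apple", "table"]
one_hot_vectors = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

d_orange_apple = euclidean(one_hot_vectors[0], one_hot_vectors[1])
d_orange_table = euclidean(one_hot_vectors[0], one_hot_vectors[2])

# Both distances are sqrt(2): one-hot vectors carry no notion of similarity.
print(d_orange_apple, d_orange_table)
```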

If we have several documents, it may seem that we could use TF-IDF to obtain word embeddings. However, TF-IDF produces a weight for each word in each document, so the same word can have a different value in different documents. Hence, it does not give us the single, document-independent vector per word that we are looking for.
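
As a rough illustration, the sketch below uses one common unsmoothed TF-IDF variant (term frequency times `log(N / df)`); other variants exist, and the toy documents are invented. It shows the same word receiving different weights in different documents:

```python
import math

# Toy corpus, invented for illustration; each document is a list of tokens.
docs = [["good", "good", "work"], ["good", "effort"], ["poor", "effort"]]

def tf_idf(word, doc, docs):
    """Unsmoothed TF-IDF: (count in doc / doc length) * log(N / document frequency)."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    return tf * math.log(len(docs) / df)

score_doc0 = tf_idf("good", docs[0], docs)
score_doc1 = tf_idf("good", docs[1], docs)

# The same word "good" gets a different weight in each document.
print(score_doc0, score_doc1)
```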

The dominant way to obtain good embeddings is to use neural networks. There are several options:

1. Train a neural network model from scratch
2. Train a Word2Vec model
3. Use pre-trained weights

The word "embedding" comes from the idea that when each word is represented as a point in a high-dimensional space, it is "embedded" inside that space (drawn as a cube in the following picture):

<img src="https://www.tensorflow.org/images/linear-relationships.png" style="width: 500px;height: 200px;"/>

## <a id='learn'>Learn Word Embedding</a>

### <a id='example0'>Train neural network models from scratch</a>

If we have the following document

```
['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
```

with label 
[1,1,1,1,1,0,0,0,0,0]

The input of the model is a one-hot vector for each word, and the model uses those vectors to predict the label, which in this case is 1 (positive sentiment) or 0 (negative sentiment).

The goal here, however, is not to predict the label of new text. The goal is to obtain the embedding matrix learned by the model. This "fake" task lets us find out which words are close to each other in the vector space.
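
Why the embedding matrix acts as a look-up table can be seen in a few lines: multiplying a one-hot vector by a matrix simply selects one row. This is a sketch with random numbers standing in for trained weights; the sizes and the name `E` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3                   # toy sizes, chosen for illustration
E = rng.normal(size=(vocab_size, embed_dim))   # stand-in for a trained embedding matrix

word_index = 2
x = np.zeros(vocab_size)
x[word_index] = 1.0                            # one-hot vector for the word

via_matmul = x @ E                             # what a dense layer on one-hot input computes
via_lookup = E[word_index]                     # what an embedding layer does: pick a row

print(np.allclose(via_matmul, via_lookup))     # True
```

This is also why an embedding layer takes integer word indices as input rather than explicit one-hot vectors.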

In [255]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']

# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])

# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# summarize the model
print(model.summary())

# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

[[19, 38], [39, 25], [44, 34], [12, 25], [8], [49], [36, 34], [17, 39], [36, 25], [31, 48, 38, 36]]
[[19 38  0  0]
 [39 25  0  0]
 [44 34  0  0]
 [12 25  0  0]
 [ 8  0  0  0]
 [49  0  0  0]
 [36 34  0  0]
 [17 39  0  0]
 [36 25  0  0]
 [31 48 38 36]]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 4, 8)              400       
_________________________________________________________________
flatten_10 (Flatten)         (None, 32)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 89.999998


In [256]:
encoded_docs

[[19, 38],
 [39, 25],
 [44, 34],
 [12, 25],
 [8],
 [49],
 [36, 34],
 [17, 39],
 [36, 25],
 [31, 48, 38, 36]]

In [257]:
# access the embedding layer through the constructed model
# the first `0` is the position of the embedding layer in `model.layers`
# the embedding matrix has shape (50, 8): 50 hashed word indices, each represented in an 8-dimensional space
embeddings = model.layers[0].get_weights()[0]
print(embeddings.shape)

(50, 8)


In [259]:
from scipy import spatial
# well (index 19) & good (index 39) have a high cosine similarity
print(1 - spatial.distance.cosine(embeddings[19], embeddings[39]))

# well (index 19) & poor (index 36) have a negative cosine similarity
# note: `one_hot` hashes words, so "better" also maps to index 36 here
print(1 - spatial.distance.cosine(embeddings[19], embeddings[36]))

0.751288831234
-0.362344622612
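
The expression `1 - spatial.distance.cosine(u, v)` used above is the cosine similarity of `u` and `v`. A minimal NumPy version (the function name `cosine_similarity` is our own) makes the definition explicit:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_similarity([1, 2], [2, 4])    # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])        # perpendicular vectors
opposite = cosine_similarity([1, 2], [-1, -2])        # opposite directions

print(same_direction, orthogonal, opposite)
```

The value ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), and it ignores vector length.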


## <a id='word2vec'>Word2Vec</a>

Word2Vec is a famous and widely used algorithm for learning word embeddings. It was developed by [Tomas Mikolov](https://www.linkedin.com/in/tomas-mikolov-59831188/) while he worked at Google.

Word2Vec proposes two methods: Skip-Gram and CBOW. Both rely on the concept of a target word and its context words. Let's see what that means and what the input and output of the model are. The following example is borrowed from the [blog post by Lilian Weng](https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html).

The following example demonstrates multiple pairs of target and context words as training samples, generated by a 5-word window sliding along the sentence.

**“The man who passes the sentence should swing the sword.” – Ned Stark**

| Sliding window (size = 5)| 	Target word |	Context|
| --- | --- | ---| 
|[The man who]|	the	|man, who|
|[The man who passes]|	man|	the, who, passes|
|[The man who passes the]|	who|	the, man, passes, the|
|[man who passes the sentence]|	passes|	man, who, the, sentence|
|…	|…	|…|
|[sentence should swing the sword]|	swing|	sentence, should, the, sword|
|[should swing the sword]|	the|	should, swing, sword|
|[swing the sword]|	sword|	swing, the|

Each context-target pair is treated as a new observation in the data. For example, the target word “swing” in the above case produces four training samples: (“swing”, “sentence”), (“swing”, “should”), (“swing”, “the”), and (“swing”, “sword”).
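
The pair generation described above can be sketched as follows. This is an illustrative helper (the name `target_context_pairs` and its `half_window` parameter are our own), with the sentence lowercased and punctuation dropped:

```python
def target_context_pairs(tokens, half_window=2):
    """Return (target, context) pairs from a window of up to 2*half_window + 1 words."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the man who passes the sentence should swing the sword".split()
pairs = target_context_pairs(sentence)

# Pairs whose target word is "swing"
swing_pairs = [p for p in pairs if p[0] == "swing"]
print(swing_pairs)
# [('swing', 'sentence'), ('swing', 'should'), ('swing', 'the'), ('swing', 'sword')]
```

Near the sentence boundaries the window is truncated, which is why the first and last rows of the table have fewer context words.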

When training these neural network models, one side of each pair is the input and the other is the output: Skip-Gram takes the target word as input and predicts a context word, while CBOW takes the context words as input and predicts the target word. Both the input and output vectors have a length equal to the number of unique words in the corpus. The goal, however, is not to make these predictions, but to obtain the embedding matrix (for example, of size 300*10k) and use it as a look-up table for each word. Each column of this matrix is a word embedding that we can use to see which words are semantically similar to each other.

The relationship between a "word" and its "document/sentence" generalizes to a "song" and its "playlist". We can use the same approach to find out which songs are similar to each other. The assumption is that songs people listen to close together are similar to each other (or are songs the same listener would enjoy). These "song embeddings" can also be rolled up to the user level by averaging the vectors of all the songs a user has listened to during a certain period of time and using the result to represent their "taste".
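
The roll-up to a user level is just an average of vectors. A small sketch, assuming hypothetical 3-dimensional song embeddings (real ones would come from a trained model):

```python
import numpy as np

# Hypothetical 3-dimensional song embeddings, invented for illustration.
song_vectors = {
    "song_a": np.array([1.0, 0.0, 0.0]),
    "song_b": np.array([0.0, 1.0, 0.0]),
    "song_c": np.array([1.0, 1.0, 0.0]),
}

def user_taste(listened, vectors):
    """Average the embeddings of the songs a user listened to."""
    return np.mean([vectors[s] for s in listened], axis=0)

taste = user_taste(["song_a", "song_c"], song_vectors)
print(taste)  # [1.  0.5 0. ]
```

The resulting vector lives in the same space as the songs, so it can be compared against any song with cosine similarity.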

> **Skip Gram**

<img src="https://lilianweng.github.io/lil-log/assets/images/word2vec-skip-gram.png" style="width: 500px;height: 240px;"/>

This is the actual architecture of the skip-gram model. Both the input vector x and the output y are one-hot encoded word representations. The hidden layer is the word embedding of size N.

The hidden layer neurons just copy the weighted sum of inputs to the next layer. There is no activation like sigmoid, tanh or ReLU. The only non-linearity is the softmax calculations in the output layer.

Notes on the details of the Skip-Gram method:
* If we have 10k words, the input is a one-hot vector of shape `1*10k`. The number of nodes in the hidden layer is the dimension we want for the embedding; for example, 300 is the number used in the original [paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). The output is also a `1*10k` vector in which each value is a probability (the value for the word being predicted should be close to 1).
* Model structure: `1*10k` -> #params (10k, 300) -> hidden (300 nodes) -> #params (300, 10k) -> `1*10k` -> softmax activation to obtain probabilities -> `1*10k`
* Each training sample is a (context, target) word pair that we randomly select.
* There are heuristics for selecting context words so that most of them are not common words such as *the*, *or*, etc.
* The assumption of Skip-Gram is that words that are similar to each other tend to appear close to each other.
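
The model structure in the notes above can be traced with a forward pass in NumPy. This is only a sketch: the weights are random rather than trained, and the toy sizes `V = 10` and `N = 4` stand in for the 10k and 300 used in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4   # toy vocabulary size and embedding size

W_in = rng.normal(scale=0.1, size=(V, N))    # input-to-hidden weights: the embedding matrix
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output weights

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.zeros(V)
x[3] = 1.0                   # one-hot input for word index 3

hidden = x @ W_in            # no activation: this is just row 3 of W_in
probs = softmax(hidden @ W_out)

print(hidden.shape, probs.shape, probs.sum())
```

The only non-linearity is the softmax at the end, and the hidden vector is exactly the embedding of the input word.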

Notice that with this architecture the number of weights is very large (two matrices of 3M weights each in the example above), and all of them would be updated slightly by every one of our billions of training samples.

That is why there are methods to reduce the burden of training a word2vec model.
The approaches proposed by the authors include:
* Treating common word pairs or phrases as single “words” in their model.
* Subsampling frequent words to decrease the number of training examples.
* Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights.
    * Some more notes, copied from [Chris McCormick's post](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/):
    * With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for. (In this context, a “negative” word is one for which we want the network to output a 0 for). We will also still update the weights for our “positive” word (which is the word “quick” in our current example).
    * The paper says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.
    * Recall that the output layer of our model has a weight matrix that’s 300 x 10,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 1,800 weight values total. That’s only 0.06% of the 3M weights in the output layer!
    * In the hidden layer, only the weights for the input word are updated (this is true whether you’re using Negative Sampling or not).
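
The arithmetic in the quoted notes can be reproduced directly. The numbers below are the ones used in the text (300 dimensions, 10,000 words, 5 negative samples):

```python
embedding_dim = 300
vocab_size = 10_000
num_negative = 5

output_weights = embedding_dim * vocab_size          # 3,000,000 weights in the output layer
updated_neurons = 1 + num_negative                   # 1 positive word + 5 negative words
updated_weights = updated_neurons * embedding_dim    # weights touched per training sample
fraction = updated_weights / output_weights

print(updated_weights, f"{fraction:.2%}")  # 1800 0.06%
```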

> **Continuous Bag of Words (CBOW)**

<img src="https://lilianweng.github.io/lil-log/assets/images/word2vec-cbow.png" style="width: 500px;height: 280px;"/>

The Continuous Bag-of-Words (CBOW) is another similar model for learning word vectors. It predicts the target word (e.g., “swing”) from the source context words (e.g., “sentence should the sword”).

As we can see from the picture above, because there are multiple context words, we average their corresponding word vectors, each obtained by multiplying the input vector by the matrix W. Because the averaging stage smooths over a lot of the distributional information, some people believe the CBOW model is better suited to small datasets.
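
The averaging step can be sketched in NumPy. The matrix `W` below holds random numbers in place of trained weights, and the five-word vocabulary is taken from the running example:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["sentence", "should", "swing", "the", "sword"]
word_to_id = {w: i for i, w in enumerate(vocab)}
N = 4
W = rng.normal(size=(len(vocab), N))   # stand-in for the input weight matrix W

context = ["sentence", "should", "the", "sword"]   # context words for target "swing"
context_ids = [word_to_id[w] for w in context]

# Averaging the one-hot inputs and multiplying by W ...
x_mean = np.zeros(len(vocab))
for i in context_ids:
    x_mean[i] += 1.0 / len(context_ids)
hidden_from_inputs = x_mean @ W

# ... equals averaging the corresponding rows of W.
hidden_from_rows = W[context_ids].mean(axis=0)

print(np.allclose(hidden_from_inputs, hidden_from_rows))  # True
```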

### <a id='example1'>Example 1: training word2vec on dummy sentences</a>

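The [Gensim](https://radimrehurek.com/gensim/models/word2vec.html) package makes training a word2vec model very simple. Below are a few parameters of the `Word2Vec` class provided by the package. The names follow Gensim 3.x, which the examples here use; Gensim 4 renamed `size` to `vector_size` and `iter` to `epochs`.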

* **size** (int, optional) – Dimensionality of the word vectors.
* **window** (int, optional) – Maximum distance between the current and predicted word within a sentence.
* **min_count** (int, optional) – Ignores all words with total frequency lower than this.
* **sg** ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW. Default as 0 (CBOW)
* **hs** ({0, 1}, optional) – If 1, hierarchical softmax (a method to make the training faster) will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
* **negative** (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
* **cbow_mean** ({0, 1}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
* **iter** (int, optional) – Number of iterations (epochs) over the corpus. Default as 5
* **compute_loss** (bool, optional) – If True, computes and stores the loss value, which can be retrieved using get_latest_training_loss(). Set it to True if we want to plot the loss against the iteration to see how training progresses.

In [173]:
from multiprocessing import cpu_count
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
			['this', 'is', 'the', 'second', 'sentence'],
			['yet', 'another', 'sentence'],
			['one', 'more', 'sentence'],
			['and', 'the', 'final', 'sentence']]

# train model
# use all available CPU cores to train the model
workers = cpu_count()
model = Word2Vec(sentences, min_count=1, workers = workers)

# summarize the loaded model
print(model)

# summarize vocabulary
words = list(model.wv.vocab)
print(words)
print('\n')

# access vector for one word
print('Embedding vector for the word "sentence"')
print(model.wv['sentence'])

Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']


Embedding vector for the word "sentence"
[ 0.00048939  0.00408542 -0.00023339 -0.00396113 -0.00327537 -0.00412497
 -0.00178292 -0.00437832  0.00489555 -0.00091878 -0.00305983 -0.00026808
 -0.00035443 -0.00381484 -0.00484873  0.00288246  0.00254953  0.00249098
  0.00020342 -0.00273202  0.00166556  0.00313711 -0.0017676   0.00158689
  0.00158976  0.00427394 -0.00336939  0.003442    0.0021848  -0.00069696
  0.00439488  0.00206062 -0.00347071 -0.00197479  0.00442823  0.00494515
  0.00415632  0.00094811  0.00350606  0.0021377  -0.0013946  -0.00313342
 -0.00186622 -0.00114939 -0.00458247  0.00200281 -0.00364599  0.00444268
  0.00164236 -0.00240062 -0.00029585  0.00401085 -0.00405918 -0.00022826
 -0.00112923 -0.0015245   0.00293028  0.00094623  0.00427847  0.00032669
 -0.0021502  -0.0049019   0.00012381 -0.00447032  0.00367364 -0.

In [175]:
import pandas as pd
# obtain the learned word vectors (`.wv.vectors`; older Gensim releases called this `.wv.syn0`)
# and the vocabulary word that corresponds to each word vector
word_vectors = pd.DataFrame(model.wv.vectors, index = model.wv.index2word)
print('word vector dimension: ', word_vectors.shape)
word_vectors.head()

word vector dimension:  (14, 100)


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentence | 0.000489 | 0.004085 | -0.000233 | -0.003961 | -0.003275 | -0.004125 | -0.001783 | -0.004378 | 0.004896 | -0.000919 | ... | 0.003318 | -0.00482 | 0.00441 | -0.00069 | 0.002779 | 0.001763 | -0.00392 | 0.003755 | -0.004193 | -0.00231 |
| the | 0.002589 | -0.004835 | -0.000137 | 0.002902 | -0.001457 | -0.0016 | -0.001007 | 0.001201 | 0.001009 | 0.001874 | ... | -0.001787 | -0.002109 | -0.003923 | -0.002521 | -0.001706 | -0.004513 | 0.001444 | -0.000483 | 0.00137 | -0.002054 |
| this | 0.001551 | 0.003661 | -0.003675 | -4.3e-05 | 0.003793 | 0.000991 | -0.004843 | -0.00362 | 0.004297 | 0.000726 | ... | 0.00248 | 3e-05 | 0.003841 | -0.003818 | -0.002045 | 0.000897 | 0.003276 | 0.001593 | 0.004817 | -0.003775 |
| is | -0.001222 | 0.002017 | -0.000802 | -0.004712 | -0.004576 | -0.003952 | -0.001933 | 0.002939 | -0.002747 | -0.001481 | ... | 0.002226 | -0.000846 | -0.00147 | 0.002327 | -0.000778 | -0.003111 | 0.003591 | 0.002167 | -0.004706 | 0.0045 |
| first | -0.000299 | -0.001017 | 0.003187 | 0.000733 | 0.004506 | -0.003052 | -0.000668 | -0.00084 | -0.001082 | 0.00442 | ... | 0.003569 | -0.004474 | -0.004653 | -0.000127 | 0.0027 | 0.000683 | 0.001394 | -0.00327 | -0.003773 | 0.003823 |


In [176]:
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Word2Vec(vocab=14, size=100, alpha=0.025)


### <a id='example2'>Example 2: word2vec using pretrain weights on dummy sentences</a>

The following content is from the post [How to Develop Word Embeddings in Python with Gensim](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) by Jason Brownlee. Big thanks!

> **Using the pre-train weights from Google news data**

Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the Word2Vec Google Code Project.

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google news data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors.

The download is a 1.53 GB file. You can get it from [here, GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/uc?export=download&confirm=0-86&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM). Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 GB.

The Gensim library provides tools to load this file. Specifically, you can call the KeyedVectors.load_word2vec_format() function to load this model into memory, for example:

In [189]:
from gensim.models import KeyedVectors
filename = './data/GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

In [190]:
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

[('queen', 0.7118192911148071)]
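
To see what the analogy query does conceptually, here is a sketch on hand-made 2-D vectors. The vectors are invented for illustration (they are not real embeddings), and Gensim's `most_similar` differs in details such as normalizing the vectors first.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Not real embeddings.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2,  1.0]),
    "woman": np.array([0.2, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = analogy(["woman", "king"], ["man"], vectors)
print(answer)  # queen
```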


> **Using the pre-train weights from GloVe**

Stanford researchers also have their own word embedding algorithm like word2vec called Global Vectors for Word Representation, or GloVe for short.

Generally, NLP practitioners seem to prefer GloVe at the moment based on results.

Like word2vec, the GloVe researchers also provide pre-trained word vectors, in this case, a great selection to choose from.

You can download the GloVe pre-trained word vectors ([glove.6B.zip](https://nlp.stanford.edu/projects/glove/)) and load them easily with gensim.

The first step is to convert the GloVe file format to the word2vec file format. The only difference is the addition of a small header line. This can be done by calling the glove2word2vec() function. For example:

In [187]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = './data/glove.6B/glove.6B.100d.txt'
word2vec_output_file = './data/glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 100)
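
To make the "small header line" concrete, the sketch below does the conversion by hand on a tiny made-up file. The helper `add_word2vec_header` is our own illustration of the idea, not Gensim's implementation.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file, prepending a '<num_vectors> <dimensions>' line."""
    with open(glove_path, encoding="utf8") as f:
        lines = f.readlines()
    num_vectors = len(lines)
    dimensions = len(lines[0].split()) - 1   # the first token is the word itself
    with open(out_path, "w", encoding="utf8") as f:
        f.write(f"{num_vectors} {dimensions}\n")
        f.writelines(lines)
    return num_vectors, dimensions

# Tiny made-up file in GloVe's "word v1 v2 ..." layout, for demonstration only.
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "toy_glove.txt")
out_path = os.path.join(tmp_dir, "toy_glove.word2vec.txt")
with open(glove_path, "w", encoding="utf8") as f:
    f.write("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")

shape = add_word2vec_header(glove_path, out_path)
with open(out_path, encoding="utf8") as f:
    header = f.readline().strip()
print(shape, header)  # (2, 3) 2 3
```

The returned tuple plays the same role as the `(400000, 100)` shown above: the number of vectors and their dimensionality.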

In [188]:
from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = './data/glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

[('queen', 0.7698541283607483)]


### <a id='example3'>Text Classification by using pre-trained weights from GloVe</a>

We can leverage the pre-trained weights to help with a text classification problem. The concept is the same as transfer learning: we freeze the weights of the embedding matrix and train only the final dense layer. Notice that the goal here is to perform text classification.

If we only want the word embeddings, we don't need to train any model. All we need to do is load the weights directly, as in the previous examples.

In the following example, we can see that the training accuracy increases from ~90% to 100% by using the pre-trained weights from GloVe.

In [191]:
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]
[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]


In [192]:
# load the whole embedding into memory
embeddings_index = dict()
# the context manager closes the file for us; the encoding is set explicitly
with open('./data/glove.6B/glove.6B.100d.txt', encoding='utf8') as f:
	for line in f:
		values = line.split()
		word = values[0]
		coefs = asarray(values[1:], dtype='float32')
		embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.
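
One consequence of the loop above is that any word missing from the pre-trained vocabulary keeps an all-zero row. A small sketch with invented 2-D vectors and a made-up out-of-vocabulary token shows how to spot such words:

```python
import numpy as np

# Hypothetical pre-trained vectors (2-D for brevity) and a Keras-style word index starting at 1.
pretrained = {"good": np.array([0.5, 0.1]), "work": np.array([0.2, 0.9])}
word_index = {"work": 1, "good": 2, "zzzunknown": 3}

matrix = np.zeros((len(word_index) + 1, 2))   # +1 because index 0 is reserved for padding
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        matrix[i] = vector

# Words whose row is still all zeros were not found in the pre-trained vectors.
missing = [w for w, i in word_index.items() if not matrix[i].any()]
print(missing)  # ['zzzunknown']
```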


In [197]:
# define model
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# summarize the model
print(model.summary())

# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 4, 100)            1500      
_________________________________________________________________
flatten_9 (Flatten)          (None, 400)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 401       
Total params: 1,901
Trainable params: 401
Non-trainable params: 1,500
_________________________________________________________________
None
Accuracy: 100.000000


In [206]:
model.layers[0].get_weights()[0].shape

(15, 100)

**The weights of the embedding matrix are frozen**
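
The parameter counts in the model summary above follow from the layer shapes, which the short calculation below reproduces (15 vocabulary entries, 100 dimensions, sequences of length 4):

```python
vocab_size, embed_dim, max_length = 15, 100, 4

embedding_params = vocab_size * embed_dim      # frozen embedding matrix: 1,500
dense_params = max_length * embed_dim + 1      # 400 weights + 1 bias: 401

trainable = dense_params
non_trainable = embedding_params
print(trainable, non_trainable, trainable + non_trainable)  # 401 1500 1901
```

Only the 401 dense-layer parameters are updated during training, which is why the embedding rows are unchanged afterwards.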

In [253]:
from scipy import spatial
# well & good have a high cosine similarity
print(1 - spatial.distance.cosine(model.get_weights()[0][6], model.get_weights()[0][3]))

# excellent & weak have a lower cosine similarity
print(1 - spatial.distance.cosine(model.get_weights()[0][9], model.get_weights()[0][10]))

0.80795609951
0.432923734188


In [254]:
# the cosine similarities of these words are the same before and after training
print(1 - spatial.distance.cosine(embeddings_index.get('well'), embeddings_index.get('good')))
print(1 - spatial.distance.cosine(embeddings_index.get('excellent'), embeddings_index.get('weak')))

0.80795609951
0.432923734188


## <a id='refer'>Reference</a>

* machinelearningmastery, Jason Brownlee
    * [How to Use Word Embedding Layers for Deep Learning with Keras](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)
    * [How to Develop Word Embeddings in Python with Gensim](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/)
* [Chris McCormick](http://mccormickml.com/)
    * [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
    * [Word2Vec Tutorial Part 2 - Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)
    * [Applying word2vec to Recommenders and Advertising](http://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/)
* [Deep Learning with Python: 6.1-using-word-embeddings.ipynb](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.1-using-word-embeddings.ipynb)
* [Andrew Ng, deeplearning.ai video: word2vec](https://www.coursera.org/learn/nlp-sequence-models/lecture/8CZiw/word2vec)
* [Cross Validated: How does Keras 'Embedding' layer work?](https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work)
    * Useful explanation of the shape of the embedding layer's output.
* [Word2vec (Skipgram) post by Ethen Liu](https://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/deep_learning/word2vec/word2vec_detailed.ipynb#Word2vec-(Skipgram))
* [Machine learning @ Spotify - Madison Big Data Meetup](https://www.slideshare.net/AndySloane/machine-learning-spotify-madison-big-data-meetup)
    * See page 34 for how Spotify leverages word2vec in building its recommender system
* Useful resources containing the actual architecture and explaining how Skip-Gram and CBOW work.
    * [Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
    * [Learning Word Embedding](https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html)
* [Gensim word2vec documentation](https://radimrehurek.com/gensim/models/word2vec.html)    