![](images/book.png)

---
By The End Of This Session You Should Be Able To:
---

- Explain why word2vec is powerful and popular
- Describe how word2vec is a neural network
- Identify the common architectures of word2vec
- Apply word2vec to a dataset

---
Pop Quiz
---

Do computers prefer numbers or words?



__Numbers__
<br>
<br>
word2vec is a series of algorithms to map words (strings) to numbers (lists of floats).

word2vec Result
-----
![](images/family.png)

---
Check for understanding
---

How many dimensions is the data represented in? How many dimensions would typical word vectors need?

It is represented in 2 dimensions.

Typically, a vocabulary of n words would need n-1 dimensions (a baseline word can be represented as all zeros).

Why is word2vec so popular?
----

1. Creates a word "cloud", organized by semantic meaning.

2. Converts text into a numerical form that machine learning algorithms and Deep Learning Neural Nets can then use as input.

3. word2vec creates dense representations
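To make "dense" concrete, here is a minimal sketch contrasting a sparse one-hot encoding with a dense vector. The vocabulary, the 3-dimensional size, and the numbers are made up for illustration; real word2vec vectors are learned and usually have a few hundred dimensions.

```python
# Toy vocabulary (hypothetical, for illustration only)
vocab = ["king", "queen", "man", "woman", "banking"]

def one_hot(word):
    """Sparse representation: one slot per vocabulary word, a single 1."""
    return [1 if w == word else 0 for w in vocab]

# Dense representation: a short list of floats per word.
# These values are invented, not learned by word2vec.
dense = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.1],
}

print(one_hot("queen"))  # length grows with the vocabulary
print(dense["queen"])    # length stays fixed as the vocabulary grows
```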

----
<img src="images/firth.png" style="width: 300px;"/>

>“You shall know a word
>by the company it keeps”

> \- J. R. Firth 1957

__Distributional Hypothesis__: Words that occur in the same contexts tend to have similar meanings

__Example:__  
> ... government debt problems are turning into __banking__ crises...  

> ... European governments need unified __banking__ regulation to replace the hodgepodge of debt regulations...

The words _government_, _regulation_, and _debt_ probably represent some aspect of _banking_ since they frequently appear nearby.
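A rough way to see "the company a word keeps" is to count the neighbors of a target word. The sketch below is not word2vec itself, just a window-based count; the function name `context_counts` and the window size of 5 are assumptions chosen for this example.

```python
from collections import Counter

def context_counts(sentences, target, window=5):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            # Words to the left and to the right of the target
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

sentences = [
    "government debt problems are turning into banking crises",
    "european governments need unified banking regulation to replace "
    "the hodgepodge of debt regulations",
]
counts = context_counts(sentences, "banking")
print(counts.most_common())
```

Words that keep showing up in these counts across a large corpus are the "company" that word2vec learns from.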

How does word2vec model the Distributional Hypothesis?
---

word2vec is a very simple neural network:
![](images/w2v_neural_net.png)

[Source](http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)

---
Updating the weights
----

![](images/basic_neural_net.png)

1. Initialize random weights.  
2. Calculate loss function.  
3. Update weights via back propagation.
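The three steps above can be sketched on the smallest possible "network": a single weight fit with gradient descent. The data, learning rate, and number of epochs are made up for illustration; with only one weight, back propagation reduces to a single derivative.

```python
import random

random.seed(0)

# Toy data generated from y = 3x (made up for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

# 1. Initialize a random weight.
w = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for epoch in range(200):
    # 2. Calculate the loss (mean squared error) ...
    predictions = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)
    # ... and its gradient with respect to w.
    grad = sum(2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / len(xs)
    # 3. Update the weight by stepping against the gradient.
    w -= learning_rate * grad

print(round(w, 3), loss)
```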

![](images/w2v_neural_net.png)

The bow-tie shape is an __autoencoder__. 

Autoencoders compress sparse representations into dense representations.

The neural network learns the mapping that best preserves the structure of the original space.

---
Check for understanding
---

What is a loss function? Give a couple of examples.

A loss function is how you weight your errors.

For example, the sum of squared residuals heavily penalizes large misses, while hinge loss ignores some errors altogether.
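A small sketch of the two losses mentioned above, written for a single prediction. The function names are chosen for this example; the hinge loss assumes labels of +1 or -1 and a raw model score.

```python
def squared_loss(y_true, y_pred):
    """Squared error: the penalty grows quadratically with the miss."""
    return (y_true - y_pred) ** 2

def hinge_loss(y_true, score):
    """Hinge loss: zero once the score is on the right side with margin 1."""
    return max(0.0, 1.0 - y_true * score)

# A small miss vs. a large miss under squared loss
print(squared_loss(1.0, 0.5), squared_loss(1.0, -3.0))

# A confident correct score costs nothing; a weak correct score still costs a bit
print(hinge_loss(1, 2.5), hinge_loss(1, 0.3))
```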

How does this look during training?
---

In [1]:
from IPython.display import display, VimeoVideo

display(VimeoVideo(112168934))

---
The 2 architectures of word2vec
----

1. “Continuous bag of words”: Predict a missing word in a sentence based on the surrounding context

2. “Skip-gram”: Use each current word as the input to a log-linear classifier that predicts words within a certain range before and after that current word

Continuous bag of words (CBOW) architecture
----

<img src="images/cbow.png" style="width: 400px;"/>
Given the context (surrounding words), predict the current word.

[Detailed explanation](http://alexminnaar.com/word2vec-tutorial-part-ii-the-continuous-bag-of-words-model.html)
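As a sketch of what CBOW training examples look like, the helper below pairs each word with its surrounding context. The name `cbow_pairs` and the window size are assumptions for illustration; a real implementation would also map words to indices and average the context vectors.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) training examples."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "insurgents killed in ongoing fighting".split()
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
```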

Skip-gram architecture
----

<img src="images/skip-gram.png" style="width: 300px;"/>
Given the current word, predict the context (surrounding words).
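Skip-gram flips the direction: each current word is paired with every word in its window. A minimal sketch, with the hypothetical helper `skip_gram_pairs` and a made-up window size:

```python
def skip_gram_pairs(tokens, window=2):
    """Return (current_word, context_word) training examples."""
    pairs = []
    for i, current in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((current, tokens[j]))
    return pairs

tokens = "insurgents killed in ongoing fighting".split()
for current, context in skip_gram_pairs(tokens, window=1):
    print(current, "->", context)
```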

---
Skip-gram example
---

>“Insurgents killed in ongoing fighting”


In [6]:
bigrams = ["insurgents killed", "killed in", "in ongoing", "ongoing fighting"]

skip_2_bigrams = ["insurgents killed", "insurgents in", "insurgents ongoing", 
                  "killed in", "killed ongoing", "killed fighting", 
                  "in ongoing", "in fighting", 
                  "ongoing fighting"] 

>“Insurgents killed in ongoing fighting”

In [7]:
tri_grams = ["insurgents killed in", "killed in ongoing", "in ongoing fighting"]

skip_2_tri_grams = ["insurgents killed in", "insurgents killed ongoing", "insurgents killed fighting", "insurgents in ongoing", "insurgents in fighting", "insurgents ongoing fighting",
                    "killed in ongoing", "killed in fighting", "killed ongoing fighting", 
                    "in ongoing fighting"] 
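The lists above can be generated rather than typed by hand. This sketch assumes the definition in which a k-skip-n-gram keeps n words in their original order and skips at most k words in total; the function name `skip_grams` is made up for this example. It enumerates every index combination, which is fine for a short sentence but not for long documents.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """Return all k-skip-n-grams of `tokens`, in order of appearance."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        # Words skipped = span covered by the n-gram minus the n words kept
        skipped = (idx[-1] - idx[0]) - (n - 1)
        if skipped <= k:
            grams.append(" ".join(tokens[i] for i in idx))
    return grams

tokens = "insurgents killed in ongoing fighting".split()
print(skip_grams(tokens, n=2, k=0))  # plain bigrams
print(skip_grams(tokens, n=2, k=2))  # 2-skip-bigrams
print(skip_grams(tokens, n=3, k=2))  # 2-skip-tri-grams
```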

Skip-Gram architecture, deep dive
----

![](images/skip_gram_detailed.png)

The target word is now at the input layer, and the context words are on the output layer.

On the output layer, instead of outputting one multinomial distribution, we are outputting C multinomial distributions. Each output is computed using the same hidden-to-output matrix.

Objective function:
![](images/multinomial_distributions.png)

where $u_j$ is the computed score for each word in the vocabulary. Using these weights, we can compute the score:

<img src="images/u.png" style="width: 150px;"/>

Because the output layer panels share the same weights, every panel computes the same score $u_j$ for word $j$.

Loss function:
![](images/skip_gram_loss.png)
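A minimal NumPy sketch of the skip-gram forward pass and loss. It assumes the loss takes the form used in the source linked earlier, E = -sum_c u_{j_c} + C * log(sum_j exp(u_j)), where j_c is the index of the c-th actual context word. The sizes, random weights, and word indices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                       # vocabulary size and hidden size (toy values)
W_in = rng.normal(size=(V, N))    # input-to-hidden weights, one row per word
W_out = rng.normal(size=(N, V))   # hidden-to-output weights, shared by all panels

def skip_gram_loss(input_idx, context_idxs):
    h = W_in[input_idx]           # hidden layer: the input word's vector
    u = h @ W_out                 # scores u_j, identical for every output panel
    log_norm = np.log(np.sum(np.exp(u)))
    # E = -sum_c u_{j_c} + C * log(sum_j exp(u_j))
    return -np.sum(u[context_idxs]) + len(context_idxs) * log_norm

loss = skip_gram_loss(input_idx=2, context_idxs=[1, 3])
print(loss)
```

This is the same as summing the negative log softmax probability of each context word, which is why sharing the output weights keeps the computation to a single score vector.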

CBOW vs. Skip-gram
---
CBOW is several times faster to train than skip-gram and has slightly better accuracy for frequent words.  

Skip-gram works well with a small amount of training data and represents rare words well.

Skip-gram tends to be the most common architecture.

Now that we have word vectors, what can we do?
----



Math with words!



<img src="images/math.jpg" style="width: 300px;"/>

Types of Word Math
----

1. Distance
2. Arithmetic
3. Clustering

1. Distance
---
<br>
<img src="images/family.png" style="width: 300px;"/>

Words that are related will be closer than unrelated words, thus the relationships between words can be encoded as distance through the space.

----
Ways to measure distance
----

<img src="http://i1.wp.com/dataaspirant.com/wp-content/uploads/2015/04/euclidean.png?w=600" style="width: 400px;"/>

<img src="http://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/manhattan.png?w=600" style="width: 400px;"/>

---
Check for Understanding
----

Can Manhattan Distance be extended to more than 2 dimensions?
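One way to check: write both distances for vectors of any length. The sketch below uses plain Python lists and made-up 4-dimensional points.

```python
def euclidean(v1, v2):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def manhattan(v1, v2):
    """City-block distance: sum of the absolute differences per dimension."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

p, q = [1, 2, 3, 4], [4, 6, 3, 4]
print(euclidean(p, q))
print(manhattan(p, q))
```

Nothing in either function depends on the number of dimensions, only on both vectors having the same length.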

<img src="http://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/cosine.png?resize=610%2C468" style="width: 400px;"/>

[Read more here](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)

Cosine similarity is most often used in NLP because it is automatically normalized: it is bounded between -1 and 1, similar to a correlation.

Words closest to “Sweden”
----

<img src="images/sweden_cosine_distance.png" style="width: 300px;"/>
 
[Source](http://deeplearning4j.org/)

![cosine_sim](https://upload.wikimedia.org/math/f/3/6/f369863aa2814d6e283f859986a1574d.png)

1 meaning exactly the same  
0 indicating orthogonality (decorrelation)  
−1 meaning exactly opposite  

In [8]:
import numpy as np

def cos_sim(v1, v2):
    "Calculate cosine similarity between vector 1 and 2"
    pass # TODO: Finish function to make tests pass

def test_cos_sim():
    v1 = np.array([1, 2, 3])
    v2 = np.array([-1, -2, -3])
    v3 = np.array([0, 3])
    v4 = np.array([4, 0])
    v5 = np.array([3, 45, 7, 2])
    v6 = np.array([2, 54, 13, 15])
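    # Fails with an AssertionError until cos_sim is implemented
    assert cos_sim(v1, v1) is not None
    # Compare floats with a tolerance; exact equality is brittle after sqrt and division
    assert np.isclose(cos_sim(v1, v1), 1.0)
    assert np.isclose(cos_sim(v1, v2), -1.0)
    assert np.isclose(cos_sim(v3, v4), 0.0)
    assert np.isclose(cos_sim(v5, v6), 0.97228425171235)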
    return "tests pass :)"
    
print(test_cos_sim())

AssertionError: 

In [None]:
# A solution - using norms
def cos_sim(v1, v2):
    "Calculate cosine similarity between vector 1 and 2"
    return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(test_cos_sim())

In [None]:
import math

# A solution - pythonic but slow
def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))

def cos_sim(v1, v2):
    "Calculate cosine similarity between vector 1 and 2"
    v1_len = math.sqrt(dot_product(v1, v1))
    v2_len = math.sqrt(dot_product(v2, v2))
    return dot_product(v1, v2) / (v1_len * v2_len)

print(test_cos_sim())

In [None]:
# A solution - unpythonic but faster
def cos_sim(v1, v2):
    "Calculate cosine similarity between vector 1 and 2"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

print(test_cos_sim())

2. Arithmetic: Word analogies
---

![](http://multithreaded.stitchfix.com/assets/images/blog/vectors.gif)

[Demo](http://rare-technologies.com/word2vec-tutorial/#app)

The "Hello, world!" of word2vec:
> Man is to woman as king is to queen

$\arg\max_w \left[ \cos(w, king) - \cos(w, man) + \cos(w, woman) \right] = queen$
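A common way to answer an analogy is vector arithmetic followed by a nearest-neighbor search with cosine similarity. The sketch below uses hand-made 2-D vectors (roughly royalty and gender), not learned embeddings, and the helper name `analogy` is an assumption for this example.

```python
import numpy as np

# Hand-made 2-D vectors (roughly: [royalty, gender]); not learned embeddings.
vectors = {
    "king":     np.array([1.0, 1.0]),
    "queen":    np.array([1.0, -1.0]),
    "man":      np.array([0.1, 1.0]),
    "woman":    np.array([0.1, -1.0]),
    "princess": np.array([0.8, -0.9]),
    "apple":    np.array([-1.0, 0.2]),
}

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    target = vectors[b] - vectors[a] + vectors[c]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king"))
```

Excluding the query words matters in practice: without that filter the nearest neighbor of the target is often one of the inputs.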

Use word2vec to build data products
----

<img src="https://assets.toptal.io/uploads/blog/image/827/toptal-blog-image-1423052243609.jpg" style="width: 400px;"/>

When I worked at an employment website, I built a recommendation engine for job seekers. The job seeker would have a resume and we would suggest jobs for them. My goal was, given a current job title, to suggest a "better" job. This would increase platform engagement.

What improved job would you recommend to a Babysitter?


A Nanny. 

A Nanny is to a Babysitter as a Senior Engineer is to an Engineer.


### Plurals

![](images/plurals.png)  

Different paths through word2vec space encode different relationships. More on this next time with doc2vec.

Verb Tense
----

<img src="images/verb.png" style="width: 300px;"/>

Country-Capital
----

![](images/country.png)

3. Clustering
----

<img src="http://static1.squarespace.com/static/52165be2e4b046d1ac57778c/t/55f4a66de4b016fee4ec7595/1442096821668/left.gif?format=1500w" style="width: 400px;"/>

[Source](http://douglasduhaime.com/blog/clustering-semantic-vectors-with-python)

Use your favorite!

K-means is a good start.
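A bare-bones K-means sketch in NumPy. The six 2-D "word vectors" are invented for illustration, and starting from the first k points is a simplification (random or k-means++ initialization is more usual).

```python
import numpy as np

# Invented 2-D "word vectors" for illustration; real ones are learned.
words = ["king", "queen", "prince", "apple", "pear", "plum"]
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

def k_means(points, k, n_iter=10):
    # Simplification: start from the first k points instead of a random draw.
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels

labels = k_means(points, k=2)
for word, label in zip(words, labels):
    print(word, label)
```

In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice; the loop above only shows the two alternating steps.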

Word2vec implementation
---

1. Code
2. Data

Code
----

1. [Google’s TensorFlow](https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html)
2. [Python’s Gensim package](https://radimrehurek.com/gensim/)  
3. [Google’s word2vec](https://code.google.com/p/word2vec/)  

Corpus (aka, data in NLP)
----

> "Data is the world's best regularizer"

You need __a lot__ of data.

100 billion words is a good start 😉. 100 million will work. 10 million is a good minimum.

---
Check for understanding
---


How do we evaluate word2vec, especially if it is built on a custom corpus?


word2vec is an unsupervised learning algorithm, so there is no good way to objectively evaluate the result.

One possible method is to compare analogy performance against the pretrained Google vectors.

Summary
----

- word2vec: Create a dense vector representation of words that models semantic meaning based on context
- Word2Vec is a _relatively_ simple neural net with 1 input layer, 1 hidden layer, and 1 output layer.
- There are 2 common ways to represent context: 
    1. CBOW: given context, predict word
    2. skip-gram: given word, predict context
- Once trained, any vector operations can be applied to words. The most common operations are: arithmetic, distance, and clustering.
- Sets you up for machine learning and Deep Learning

<br>
<br> 
<br>


Speeding up skip-gram
---

Since skip-gram is slow (look at the architecture), the smart people at Google optimized the architecture.



> When in doubt, throw a binary tree at it.

This particular binary tree is called _Hierarchical Softmax_.



__What the hell is softmax?__

It is a normalized exponential.

![](images/softmax_function.png)

j is the index of the current class.  
K is the total number of classes.

Generalization of the logistic function to multi-class.

Used in various probabilistic multiclass classification methods:

- multinomial logistic regression
- multiclass linear discriminant analysis
- naive Bayes classifiers 
- artificial neural networks

[Source](https://en.wikipedia.org/wiki/Softmax_function)
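A minimal softmax sketch in plain Python. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result; the input scores are made up.

```python
import math

def softmax(scores):
    """Normalized exponential: turns K scores into K probabilities."""
    m = max(scores)  # shifting by the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With two classes and one score fixed at 0, softmax is the logistic function
print(softmax([1.5, 0.0])[0], 1.0 / (1.0 + math.exp(-1.5)))
```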



### Okay then ... What is Hierarchical Softmax?

![](images/binary_tree.png)

Uses a binary tree as a data structure to represent all words in the vocabulary. The V words must be leaf nodes of the tree. For each leaf node, there exists a unique path from the root to the node. This path is used to estimate the probability of the word represented by the leaf node.  
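A toy sketch of the idea: the probability of a word is the product of left/right decisions along its path, each decision being a sigmoid. The 4-word tree, the inner-node vectors, and the hidden vector `h` are all made up for illustration; in a trained model they are learned.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical tree over a 4-word vocabulary:
#           root
#          /    \
#        n1      n2
#       /  \    /  \
#     cat  dog car  bus
# Each path lists (inner_node, direction): +1 means "go left", -1 "go right".
paths = {
    "cat": [("root", +1), ("n1", +1)],
    "dog": [("root", +1), ("n1", -1)],
    "car": [("root", -1), ("n2", +1)],
    "bus": [("root", -1), ("n2", -1)],
}

# Made-up inner-node vectors and hidden-layer vector (normally learned).
node_vectors = {"root": [0.5, -0.2], "n1": [0.1, 0.9], "n2": [-0.7, 0.3]}
h = [1.0, 2.0]

def word_probability(word):
    prob = 1.0
    for node, direction in paths[word]:
        score = sum(a * b for a, b in zip(node_vectors[node], h))
        prob *= sigmoid(direction * score)  # sigmoid(-x) = 1 - sigmoid(x)
    return prob

total = sum(word_probability(w) for w in paths)
print({w: round(word_probability(w), 3) for w in paths}, total)
```

Each word needs only as many sigmoids as its path is long (about log2 of V in a balanced tree) instead of a sum over all V words, and the leaf probabilities still add up to 1.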



<br>
<br> 
<br>

----