everything2vec
----

<img src="images/all_the_things.jpg" style="width: 300px;"/>

- add retraining trick

<img src="images/emjois.png" style="width: 300px;"/>

emoji2vec
----
<img src="https://s3.amazonaws.com/instagram-static/engineering-blog/emoji-hashtags/tsne_map_tight.png" style="width: 400px;"/>

[Source](http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji)

----
_2vec
====

<img src="images/all_the_things.jpg" style="width: 400px;"/>

| Name | Embedding  | 
|:-------:|:------:| 
| [Char2Vec](http://arxiv.org/abs/1508.06615) | Character |
| [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) | Word | 
| [GloVe](http://www-nlp.stanford.edu/pubs/glove.pdf) | Word | 
| [Doc2Vec](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) | Sections of text |
| [Image2Vec](Image2Vec) | Image |
| [Video2Vec](https://www.dropbox.com/s/m99k5md8461xi0s/ICIP_Paper_Revised.pdf) | Video |

[Source](http://datascienceassn.org/content/table-xx2vec-algorithms)

---
By the end of this session, you should be able to:
---

- Describe the general goal and methods for sense2vec and like2vec
- Describe the details of lda2vec
- Describe the relationship between Bayesian prior and regularization

---
Sense2vec: Revenge of POS tagging
----

<img src="images/duck.png" style="width: 400px;"/>

word2vec will model polysemy (the ability of the same literal characters to carry different human meanings) because it has so many dimensions.

Let's help the model by introducing priors:

`(duck|NOUN)`

`(duck|VERB)`
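
A minimal sketch of that preprocessing idea: join each token with its part-of-speech tag so the two senses become separate vocabulary entries before any embedding is trained. The tagged sentences are hand-written for illustration (a real pipeline would get the tags from a POS tagger), and `to_sense_tokens` is a hypothetical helper name.

```python
from collections import Counter

# Hand-tagged toy corpus (an assumption for illustration; a real pipeline
# would get these tags from a POS tagger).
tagged_sentences = [
    [("the", "DET"), ("duck", "NOUN"), ("swims", "VERB")],
    [("you", "PRON"), ("should", "AUX"), ("duck", "VERB")],
]


def to_sense_tokens(tagged_sentence):
    """Join each word with its tag so each sense gets its own vocabulary entry."""
    return [f"{word}|{tag}" for word, tag in tagged_sentence]


sense_corpus = [to_sense_tokens(sentence) for sentence in tagged_sentences]
vocab = Counter(token for sentence in sense_corpus for token in sentence)

print(sense_corpus[0])
print(sorted(token for token in vocab if token.startswith("duck|")))
```

After this step, `duck|NOUN` and `duck|VERB` are unrelated strings as far as the embedding model is concerned, so each one gets its own vector.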

<img src="http://www.socher.org/uploads/Main/MultipleVectorWordEmbedding.png" style="width: 400px;"/>

In [2]:
from IPython.display import IFrame

IFrame("https://spacy.io/demos/sense2vec",
      width=800,
      height=600)


[Demo](https://spacy.io/demos/sense2vec)

[Source: Blog post](https://spacy.io/blog/sense2vec-with-spacy)  
[Source: Technical paper](http://arxiv.org/abs/1511.06388)

---
lda2vec Overview
----

<img src="images/catdog_word2vec_cropped.jpg" style="width: 400px;"/>

---
word2vec review
----

1. Set up an objective function 
2. Randomly initialize vectors
3. Do gradient descent

Given the input vector (typically a word), maximize the probability of the output vector (typically its context):

$$P(v_{OUT}|v_{IN})$$

Convert to probability:  
$$\mathrm{softmax}(v_{IN} \cdot v_{OUT})$$

Probability of choosing 1 of N discrete items.   
Mapping from vector space to a multinomial over words.
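
The softmax step can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the vocabulary size, the dimension, and the random vectors are made up, and it computes the full softmax over every output vector, whereas practical word2vec implementations usually approximate it (for example with negative sampling).

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 5, 3                      # toy sizes, chosen only for illustration
V_in = rng.normal(size=(vocab_size, dim))   # input (word) vectors
V_out = rng.normal(size=(vocab_size, dim))  # output (context) vectors


def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def p_out_given_in(in_index):
    """P(v_OUT | v_IN) for every candidate output word."""
    scores = V_out @ V_in[in_index]  # one dot product per output vector
    return softmax(scores)


probs = p_out_given_in(0)
print(probs, probs.sum())
```

Gradient descent then nudges `V_in` and `V_out` so that the observed context words receive higher probability.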

---
LDA review
---

<img src="http://salsahpc.indiana.edu/b649proj/images/proj3_LDA%20structure.png" style="width: 400px;"/>

<img src="https://i.ytimg.com/vi/Acs_esny-qQ/hqdefault.jpg" style="width: 400px;"/>

---
word2vec vs. LDA
----

<img src="images/compare_models.png" style="width: 400px;"/>


| algorithm | scope | prediction | numbers | visualization | density | metaphor| 
|:-------:|:------:| :------:| :------:| :------:| :------:|  :------:|
| word2vec | local | one word predicts nearby words | real numbers | bar chart | dense | location  |
| LDA | global | documents predict global words | percentages that sum to 100%  | pie chart | sparse | mixture| 


----
lda2vec
-----

<img src="images/lda2vec.png" style="width: 400px;"/>

$v_{doc}$ is a mixture:  
$v_{doc} = a \, v_{topic1} + b \, v_{topic2} + \dots$

![](images/doc_vec.png)
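
A small sketch of the mixture, assuming the weights $a, b, \dots$ come from a softmax over unconstrained per-document weights so that they are nonnegative and sum to one. The topic vectors and weights are random toy values, not learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, dim = 3, 4                      # toy sizes for illustration
topic_vectors = rng.normal(size=(n_topics, dim))
doc_logits = np.array([2.0, 0.5, -1.0])   # unconstrained per-document weights

# Softmax turns the weights into proportions (a, b, ...) that sum to 1.
exp_logits = np.exp(doc_logits - doc_logits.max())
proportions = exp_logits / exp_logits.sum()

# v_doc = a * v_topic1 + b * v_topic2 + ...
v_doc = proportions @ topic_vectors

print(proportions)
print(v_doc)
```

Because the proportions read like percentages, a document can be described as "mostly topic 1, some topic 2", which is the interpretability LDA offers.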

---
Bayesian prior as an instantiation of regularization
----

---
Review Question
---

<details><summary>
What is a plain English definition of regularization?
</summary>
It is a tax on complexity.
<br>
Small is beautiful, so reduce the less helpful parts of your model.
</details>

---
Example of using Bayesian prior
---

In Bayesian estimation, we come up with a distribution over possible parameters using Bayes' rule: 

$$ P(θ|D) = \frac{P(D|θ)P(θ)}{P(D)}$$

where $P(θ)$ is known as the prior. Then, to make predictions about future events, we need to integrate over this distribution of possible $θ$.

Let me give an example to make this concrete. Let's say we have a coin that comes up heads with some probability $θ$. We see two heads come up. Our likelihood then becomes $P(D|θ)=θ^2$, which is clearly maximized when $θ=1$. So our MLE is that the coin always comes up heads, and we predict that all future flips will come up heads. We see why MLE can be a bit silly: it often overfits the data and does not generalize well, but it is good for a first estimate.

Now, let us think about the Bayesian approach. Let's say we initially know (our prior) that our coin has a one-half chance of having $θ=\frac{1}{2}$ and a one-half chance of having $θ=1$. We observe the same two heads. Now, let us calculate:

$P(θ=1/2|D) ∝ P(D|θ=1/2)P(θ=1/2) = (1/4)(1/2) = 1/8$   
$P(θ=1|D) ∝ P(D|θ=1)P(θ=1) = (1)(1/2) = 1/2$

Normalizing, we see that we have a 1/5 chance of having a fair coin, and a 4/5 chance of having a coin that always comes up heads. So, we estimate that a new coin flip would come up heads 1/5(1/2) + 4/5(1) = 9/10 of the time. With different prior beliefs we would come up with different answers, but we see that in some sense, the Bayesian estimate is more "reasonable" than just using MLE.
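
The arithmetic above can be checked with exact fractions. This is a sketch of this one example only, using the two-point prior from the text; the variable names are illustrative.

```python
from fractions import Fraction

# Prior from the text: the coin is fair or always-heads, each with probability 1/2.
prior = {Fraction(1, 2): Fraction(1, 2), Fraction(1): Fraction(1, 2)}
n_heads = 2  # observed data D: two heads in a row

# Unnormalized posterior: P(D | theta) * P(theta), with P(D | theta) = theta ** n_heads.
unnormalized = {theta: theta ** n_heads * p for theta, p in prior.items()}

# Normalize by P(D), the sum over all candidate values of theta.
evidence = sum(unnormalized.values())
posterior = {theta: weight / evidence for theta, weight in unnormalized.items()}

# Posterior predictive: average P(heads | theta) = theta over the posterior.
p_next_heads = sum(theta * p for theta, p in posterior.items())

print(posterior)
print(p_next_heads)  # 9/10
```

Changing the values in `prior` shows directly how the prior pulls the prediction away from the MLE answer of 1.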

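A closely related point estimate is [Maximum a posteriori estimation (MAP)](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation), which keeps only the most probable $θ$ under the posterior (here $θ=1$) instead of averaging over all of them.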

[Source](https://www.quora.com/Intuitively-speaking-What-is-the-difference-between-Bayesian-Estimation-and-Maximum-Likelihood-Estimation)  
[Technical introduction to the concept](http://www.mit.edu/~9.520/spring09/Classes/class15-bayes.pdf)  
[Specific application](http://papers.nips.cc/paper/2160-on-the-dirichlet-prior-and-bayesian-regularization.pdf)

---
Connection to regularization
----

We weight the space of possible parameters to reduce the chance of overfitting to the data.
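
One standard way to make the connection explicit is to take logs of Bayes' rule and look for the most probable parameters. The Gaussian prior below is an assumption chosen for illustration; the text does not commit to a particular prior.

$$\hat{θ}_{MAP} = \arg\max_θ P(θ|D) = \arg\max_θ \left[ \log P(D|θ) + \log P(θ) \right]$$

With a zero-mean Gaussian prior, $P(θ) ∝ \exp\left(-\frac{\|θ\|_2^2}{2σ^2}\right)$, this becomes:

$$\hat{θ}_{MAP} = \arg\min_θ \left[ -\log P(D|θ) + \frac{1}{2σ^2} \|θ\|_2^2 \right]$$

The first term is the usual data-fitting loss and the second is an L2 penalty with strength $λ = \frac{1}{2σ^2}$: a narrower prior (smaller $σ$) means a heavier tax on complexity.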

---
Constrained lda2vec
----

![](images/topic_lda2vec.png)

---
Check for understanding
---

![](images/moody_lda.png)

<details><summary>
Which word2vec architecture does Chris Moody use in lda2vec?
</summary>
skip-gram
</details>

---
lda2vec algorithm
----

![](images/defination.png)

---
Simplex sidebar
----

Generalized notion of a triangle.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Pink_triangle_up.svg/2000px-Pink_triangle_up.svg.png" style="width: 200px;"/>
2-simplex is a triangle

<img src="https://upload.wikimedia.org/wikipedia/commons/8/83/Tetrahedron.jpg" style="width: 200px;"/>
3-simplex is a tetrahedron

__REDACTED: HARD TO VISUALIZE 4 DIMENSIONS__
4-simplex is a 5-cell

---

Every point in a standard simplex has coordinates that are nonnegative and sum to one. 

The standard 2-simplex is the triangle in  $ℝ^3$  whose vertices are at the coordinates (1, 0, 0), (0, 1, 0), (0, 0, 1).

Simplices are extremely constrained and well-behaved. That makes machine learning models easier to interpret.
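
A short sketch of the two constraints as code. `on_simplex` and `normalize` are hypothetical helper names, and the tolerance is an arbitrary choice to absorb floating-point error.

```python
import numpy as np


def on_simplex(point, tol=1e-9):
    """True if all coordinates are nonnegative and they sum to one."""
    point = np.asarray(point, dtype=float)
    return bool(np.all(point >= -tol) and abs(point.sum() - 1.0) <= tol)


def normalize(counts):
    """Map nonnegative counts onto the simplex by dividing by their total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


print(on_simplex([1, 0, 0]))               # a vertex of the 2-simplex
print(on_simplex([0.2, 0.3, 0.5]))         # a point inside the triangle
print(on_simplex([0.5, 0.7, -0.2]))        # has a negative coordinate
print(on_simplex(normalize([3, 1, 6])))    # counts turned into proportions
```

Topic proportions for a document live on exactly this kind of shape, which is why they can be read as percentages.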

[Source](https://en.wikipedia.org/wiki/Simplex)

---
Check for understanding
---

<details><summary>
What other statistical concept is typically constrained to sum to 1?
</summary>
probabilities
</details>

---
lda2vec Executive Summary
----

![](images/punchline.png)

lda2vec adds additional context to word2vec and defines that context as the document's topics.

----
Segmentation
----

<img src="http://www.omaticsoftware.com/Portals/0/EasyDNNnews/29/iStock_000019765925_Medium.jpg" style="width: 400px;"/>

---
Bonus: Take Home Message
---

<img src="images/stat sig.gif" style="width: 400px;"/>

Balance the tension between model power and interpretation.

Models want to be powerful (thus complex). Business wants to understand the model.

Machine readable vs. Human readable

---
lda2vec implementation
----

[GitHub repo](https://github.com/cemoody/lda2vec)

---
like2vec
----

<img src="images/like2vec.png" style="width: 400px;"/>

From Galvanize & DeepMind:

- Mike Tamir
- Adam Gibson
- Marvin Bertin
- Michael Ulin
- David Ott

[Prerequisite reading: Adjacency matrix](https://en.wikipedia.org/wiki/Adjacency_matrix)

[Slide deck](https://docs.google.com/presentation/d/19QDuPmxB9RzQWKXp_t3yqxCvMBSMaOQk19KNZqUUgYQ/edit#slide=id.g11bda3c091_34_0)
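
For readers new to the prerequisite, here is a minimal sketch of building an adjacency matrix from "like" data. The users, items, and likes are invented for illustration, and this shows only the graph representation, not the like2vec embedding step itself, which is covered in the slide deck.

```python
import numpy as np

# Hypothetical "likes" data: (user, item) pairs invented for illustration.
likes = [("ann", "jazz"), ("ann", "blues"), ("bob", "jazz"), ("cat", "rock")]

nodes = sorted({name for pair in likes for name in pair})
index = {name: i for i, name in enumerate(nodes)}

# Undirected graph: a like connects a user node and an item node.
adjacency = np.zeros((len(nodes), len(nodes)), dtype=int)
for user, item in likes:
    adjacency[index[user], index[item]] = 1
    adjacency[index[item], index[user]] = 1

print(nodes)
print(adjacency)
```

Each row describes one node's neighbors, so `ann` and `bob` end up with overlapping rows because both like `jazz`.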

---
Summary
----

- sense2vec adds POS tags for embedding
- like2vec adds adjacency matrix for embedding 
- lda2vec adds topics of document as additional context for embedding
- A Bayesian prior is a fast and easy way to regularize
- NLP now has the tools to move from academic navel gazing to adding business value

<br>
<br> 
<br>

----