# Udacity: Deep Learning

## L4: Text and Sequences

### Train a text embedding model

Words are really hard to model - the vocabulary is huge, and most words we rarely or never see in training. Yet it is usually the rarest words that carry the most information. This sparsity is a big problem in deep learning.

### Semantic Ambiguity

We would like to share parameters between things that are similar, e.g. "cat" and "kitty." But these words are not always similar!

Big problems:

1. we'd like to see the really rare words often enough that we can learn how to use them

2. we need a way to relate words that are similar in semantic meaning

We need lots of labeled data to do this - more labeled data than we can handle! What to do...

### Unsupervised Learning

Unsupervised learning means training without labels. There is lots and lots of text out there (e.g., the web) _if_ we can figure out what to learn from it. There is a powerful concept we may exploit - similar words appear in similar contexts.

### Embeddings

* "the **cat** purrs"
* "this **cat** hunts mice"

It is also reasonable to say:

* "the **kitty** purrs"
* "this **kitty** hunts mice"

<img src='Screen_Shot_2016-03-01_at_6.10.15_PM.png'>

We hope we can produce a model that predicts a word's context. A model that predicts context will have to treat "cat" and "kitty" similarly and tend to bring them closer together.

<img src='Screen_Shot_2016-03-01_at_6.12.13_PM.png'>

This helps us solve the sparsity and representation problem once we have encoded our words into vectors called _embeddings_. Now all cat-like things will be similar, etc. The model no longer has to learn new things for every way there is to use a word - we get _generalization_.

### Word2Vec

"The quick brown fox jumps over the lazy dog."

<img src='Screen_Shot_2016-03-01_at_6.16.21_PM.png'>

1. We map each word in the sentence into an embedding - initially a random one.
2. We will use the embedding to try to predict the context of the word.
3. In this model, the context is the words that are nearby (within a window). We choose a random word in a window around our original word, and that becomes our target.
4. Then, we train our model as though it were a supervised problem. We use _logistic classifiers_ to make our prediction.
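The target-selection step above can be sketched in a few lines. This is a minimal illustration only, not the course's implementation: the function name `skipgram_pairs`, the `window` size of 2, and the fixed `seed` are all assumptions made for the example.

```python
import random

def skipgram_pairs(tokens, window=2, seed=0):
    """Return one (input_word, target_word) pair per token.

    The target is drawn uniformly at random from the words that lie
    within `window` positions of the input word (the word itself excluded).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context_positions = [j for j in range(lo, hi) if j != i]
        if not context_positions:
            continue  # a one-word sentence has no context to predict
        j = rng.choice(context_positions)
        pairs.append((word, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)
for center, target in pairs:
    print(center, "->", target)
```

Each pair is then treated as a supervised example: the input word's embedding is fed to the classifier, and the sampled nearby word is the label.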

### t-SNE

We would like to check that training is working, i.e. that similar words end up clustered close together. One way to do this is a nearest-neighbors look-up. Another way is to reduce the dimensionality of the representation and project it into two dimensions. But we need a projection that preserves the neighborhood structure (things that are close in embedding space should still be close in 2D, and things that are far should remain far). PCA does not work so well for this, but a technique called "t-SNE" does.
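The nearest-neighbors look-up mentioned above can be sketched as follows. The embeddings here are hypothetical 3-d vectors picked by hand for illustration (real ones would come from a trained model), and the helper names `cosine_similarity` and `nearest_neighbors` are assumptions of this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, embeddings, k=2):
    """Return the k words whose embeddings are most similar to `word`."""
    query = embeddings[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

# Hypothetical embeddings, hand-picked for illustration only.
toy_embeddings = {
    "cat":   [0.9, 0.8, 0.1],
    "kitty": [0.8, 0.9, 0.2],
    "dog":   [0.7, 0.2, 0.6],
    "car":   [0.1, 0.1, 0.9],
}

neighbors = nearest_neighbors("cat", toy_embeddings, k=2)
print(neighbors)  # ['kitty', 'dog']
```

If training is working, the neighbors of a word should look like plausible substitutes for it, as "kitty" is for "cat" in this toy table.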

<img src='Screen_Shot_2016-03-01_at_6.24.07_PM.png'>

<img src='Screen_Shot_2016-03-01_at_6.25.40_PM.png'>

### Word2Vec Details

<img src='Screen_Shot_2016-03-01_at_6.26.33_PM.png'>

The cosine distance is generally a better measure than $L_2$ for comparing embeddings - this is because the length of the embedding vector is not relevant to the classification. In fact, it is often better to normalize all embedding vectors to have unit norm.
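A small sketch of why vector length does not matter for cosine distance: scaling a vector changes its $L_2$ distance to other vectors but leaves the cosine distance unchanged, and after normalizing to unit norm the scaled and unscaled vectors coincide. The vectors and helper names below are made up for the example.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def normalize(v):
    """Rescale v to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 3.0]
v_scaled = [10.0 * a for a in v]  # same direction as v, ten times longer

# L2 distance grows with the scaling; cosine distance does not.
print("L2:    ", l2_distance(u, v), l2_distance(u, v_scaled))
print("cosine:", cosine_distance(u, v), cosine_distance(u, v_scaled))

# After unit-normalization, v and v_scaled are at the same L2 distance from u.
print("L2 after normalizing:",
      l2_distance(normalize(u), normalize(v)),
      l2_distance(normalize(u), normalize(v_scaled)))
```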

We can also use _sampled softmax_ to make our computations easier:

<img src='Screen_Shot_2016-03-01_at_6.29.29_PM.png'>

The idea behind sampled softmax is this: the target vector gives the target word probability 1 and every other word probability 0, so a full softmax has to compute a score for every word in the vocabulary. Instead, we keep the target word, sample only a small random subset of the non-target words, and pretend the other words aren't there. This makes computations much more efficient and is almost as good as using the full vector.
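A minimal sketch of the idea, assuming random scores (logits) stand in for a real model's output. The function names, the vocabulary size of 10,000, and the choice of 64 sampled words are all illustrative assumptions. Production implementations typically also correct for the sampling probabilities, which this sketch omits.

```python
import math
import random

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy loss: -log softmax(logits)[target_index]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]

def sampled_softmax_loss(logits, target_index, num_sampled, rng):
    """Loss over the target word plus `num_sampled` random non-target words."""
    negatives = [i for i in range(len(logits)) if i != target_index]
    sampled = rng.sample(negatives, num_sampled)
    kept_logits = [logits[target_index]] + [logits[i] for i in sampled]
    return softmax_cross_entropy(kept_logits, 0)  # target sits at position 0

rng = random.Random(0)
vocab_size = 10000
logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]  # stand-in scores
target = 42

full_loss = softmax_cross_entropy(logits, target)       # sums 10,000 terms
sampled_loss = sampled_softmax_loss(logits, target, 64, rng)  # sums 65 terms
print(full_loss, sampled_loss)
```

The sampled loss only touches 65 scores instead of 10,000, which is where the savings come from when the vocabulary is large.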

### Word Analogy Game Quiz

<img src='Screen_Shot_2016-03-04_at_7.52.57_AM.png'>

* kitten
* shorter

### Analogies

Saying a puppy is to a dog as a kitten is to a cat is an example of a _semantic analogy_. Saying taller is to tall as shorter is to short is an example of a _syntactic analogy_.

<img src='Screen_Shot_2016-03-04_at_7.59.43_AM.png'>
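With embeddings, analogies like these are commonly answered with vector arithmetic: "a is to b as c is to ?" becomes a search for the word nearest to `b - a + c`. The sketch below uses hypothetical hand-built 5-d vectors, constructed so that the arithmetic works exactly; learned embeddings only satisfy it approximately. The `analogy` helper is an assumption of this example, and it uses Euclidean distance for brevity.

```python
import math

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    query = [vb - va + vc for va, vb, vc
             in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(query, embeddings[w]))

# Hypothetical embeddings; the dimensions loosely mean
# [canine, feline, young, height, comparative].
toy_embeddings = {
    "dog":     [1.0, 0.0, 0.0,  0.0, 0.0],
    "puppy":   [1.0, 0.0, 1.0,  0.0, 0.0],
    "cat":     [0.0, 1.0, 0.0,  0.0, 0.0],
    "kitten":  [0.0, 1.0, 1.0,  0.0, 0.0],
    "tall":    [0.0, 0.0, 0.0,  1.0, 0.0],
    "taller":  [0.0, 0.0, 0.0,  1.0, 1.0],
    "short":   [0.0, 0.0, 0.0, -1.0, 0.0],
    "shorter": [0.0, 0.0, 0.0, -1.0, 1.0],
}

print(analogy("dog", "puppy", "cat", toy_embeddings))      # kitten
print(analogy("tall", "taller", "short", toy_embeddings))  # shorter
```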

### Segue to Assignment 5: Word2Vec and CBOW

Run Word2Vec and use it as a reference to train a continuous bag of words (CBOW) model.
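CBOW reverses the direction of the prediction: instead of predicting a context word from the center word, it predicts the center word from its surrounding context, typically by averaging or summing the context embeddings. The sketch below only builds the training examples and shows the averaging step. It is not the assignment's solution, and the names `cbow_examples` and `average` are assumptions of this example.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) pairs, one per token."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

sentence = "the quick brown fox jumps over the lazy dog".split()
examples = cbow_examples(sentence, window=2)
for context, target in examples:
    print(context, "->", target)

# The model input for one example is the mean of the context embeddings.
# These two 2-d vectors are placeholders, not trained embeddings.
print(average([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```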