# Introduction to word embeddings

## 1. Why sequence models?

Sequence models such as recurrent neural networks (RNNs) have contributed significantly to areas like natural language processing and speech recognition. Some examples:

- Speech recognition: the input is an audio file and the output is a sequence of words.
- Music generation: the output y is a sequence; the input x can be nothing, the first few notes, ...
- Sentiment classification: is the review good or bad? Or how many stars does it get?
- Machine translation: e.g. Google Translate.
- ...

Consider the following example. Imagine we have a dataset where each observation is a sentence, and for each sentence we want a model to detect which words refer to locations. Take the sentence:

"While we were in school, mom went shopping in the square."

This is a sentence with 11 words, or put differently, a sequence with 11 elements. If this is the first sentence in our dataset, we denote it $x^{(1)} = (x^{(1)<1>},...,x^{(1)<11>})$.

$y^{(1)}$ would then be $(0,0,0,0,1,0,0,0,0,0,1)$: a sequence of 11 elements, $(y^{(1)<1>},...,y^{(1)<11>})$, with a 1 at the positions of the location words "school" and "square".

The sentence has eleven words, so $T_x^{(1)} = T_y^{(1)} = 11$. Because $T_x$ and $T_y$ differ from sentence to sentence, they need the superscript $(i)$ too.
These values describe only this particular sentence $x^{(1)}$; other sentences will have other lengths.
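The notation above can be made concrete with a minimal sketch. It assumes a tiny hand-made set of location words (`LOCATION_WORDS`, invented here for illustration) in place of a trained model, and a naive whitespace tokenizer:

```python
import string

sentence = "While we were in school, mom went shopping in the square."

# Assumption for illustration: a hand-made set of location words,
# standing in for what a trained model would detect.
LOCATION_WORDS = {"school", "square"}

def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    return [w.strip(string.punctuation).lower() for w in text.split()]

x = tokenize(sentence)                                   # x^{<1>}, ..., x^{<T_x>}
y = [1 if word in LOCATION_WORDS else 0 for word in x]   # y^{<1>}, ..., y^{<T_y>}

T_x, T_y = len(x), len(y)
print(T_x, T_y)
print(y)
```

Here `T_x` and `T_y` are equal because every input word gets exactly one label.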

There is a problem with representing words as one-hot vectors: each word stands entirely on its own, so the model cannot easily generalize across words. For example, the relationship between different fruits, or between different locations, is invisible in the representation.

## 2. Intro to word embeddings

### 2.1 One-hot representations

- As we've seen before, you take the 10,000 most frequent words in the training set, or use an online dictionary of commonly used words.
- Apply this dictionary to the text to create a one-hot representation per word.
- Each vector contains 9,999 zeros and a single 1.
- A sentence of 11 words becomes 11 one-hot vectors of 10,000 units each.
- If a word is not in your 10,000-word dictionary, you create a new vector for it (commonly a single extra "unknown word" entry such as `<UNK>`).
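The steps in this list can be sketched as follows. The 13-entry `vocab` is a toy stand-in for the 10,000-word dictionary, and the `<UNK>` entry for out-of-dictionary words is an assumed convention, not something fixed by these notes:

```python
import numpy as np

# Toy vocabulary standing in for the 10,000-word dictionary (assumption).
# The last entry is a catch-all for words that are not in the dictionary.
vocab = ["class", "in", "mom", "school", "shopping", "square", "street",
         "the", "we", "went", "were", "while", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    idx = word_to_index.get(word, word_to_index["<UNK>"])
    vec[idx] = 1.0
    return vec

sentence = ["while", "we", "were", "in", "school", "mom",
            "went", "shopping", "in", "the", "square"]
X = np.stack([one_hot(w) for w in sentence])   # one row per word
print(X.shape)
```

With the real dictionary, each row would have 10,000 entries instead of 13.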

Solution to the limitations of one-hot vectors: word embeddings! Example:

|        | Duke | Duchess | cow  | dog  |
| ------ | ---- | ------- | ---- | ---- |
| Noble  | 1    | 1       | 0.03 | 0.01 |
| Food   | 0.01 | -0.02   | 0.7  | 0.08 |
| Gender | -1   | 1       | 0.02 | 0.03 |

The words in the left column are "features". Now imagine that "duke", "duchess", "cow" and "dog" each have a position in our vocabulary of the 10,000 most frequent words. If "duke" is at position 1265, then instead of a 10,000-dimensional one-hot vector with a 1 at position 1265, we can represent it with an n-dimensional vector (with n the number of features). This way, our algorithm can tell that "cow" and "dog" are more similar to each other than "cow" and "duke".

Where one-hot vectors are sparse, high-dimensional and hardcoded, word embeddings are dense, lower-dimensional and learned from data.
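The claim that "cow" and "dog" are closer than "cow" and "duke" can be checked numerically. This sketch copies the values from the table above and uses cosine similarity as the measure of closeness (one common choice, assumed here):

```python
import numpy as np

# Feature order: (noble, food, gender), values copied from the table above.
embeddings = {
    "duke":    np.array([1.0, 0.01, -1.0]),
    "duchess": np.array([1.0, -0.02, 1.0]),
    "cow":     np.array([0.03, 0.7, 0.02]),
    "dog":     np.array([0.01, 0.08, 0.03]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_cow_dog = cosine_similarity(embeddings["cow"], embeddings["dog"])
sim_cow_duke = cosine_similarity(embeddings["cow"], embeddings["duke"])
print(sim_cow_dog, sim_cow_duke)
```

With one-hot vectors, both similarities would be exactly 0, since any two distinct one-hot vectors are orthogonal.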




Word embeddings can also be visualized by using the t-SNE algorithm to project the n-dimensional space down to 2D.

"While we were in school, mom went shopping in the square."

If we know that "school" and "square" are physical locations, then words with similar embeddings, like "street" or "class", would be recognized as locations more easily.

How do you do transfer learning with word embeddings?

1. Learn word embeddings from a large body of text (or download pre-trained embeddings online).
2. Transfer the embeddings to a task with a smaller training set.
3. If you want, you can continue fine-tuning the word embeddings with the new data.

Word embeddings are especially useful when you are working on tasks with small data sets.

### 2.2 Properties of word embeddings


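1. By taking differences between word embeddings, you can get a better understanding of how words relate to each other. For example, the vector difference between "boy" and "girl" is similar to the vector difference between "duke" and "duchess": in both pairs the only difference is gender.

To complete the analogy "duke is to duchess as boy is to ...", look for the word $w$ whose embedding has the highest similarity to $e_{boy}-e_{duke}+e_{duchess}$:

$\arg\max_w \, sim(e_{w}, e_{boy}-e_{duke}+e_{duchess})$

Usually cosine similarity is used as the similarity function.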
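The arg max search can be sketched on toy data. The duke/duchess/cow/dog vectors reuse the values from the earlier table; the "boy" and "girl" vectors are made up for illustration, and `complete_analogy` is a hypothetical helper name:

```python
import numpy as np

# Feature order: (noble, food, gender).
# boy/girl values are invented for this example.
embeddings = {
    "duke":    np.array([1.0, 0.01, -1.0]),
    "duchess": np.array([1.0, -0.02, 1.0]),
    "cow":     np.array([0.03, 0.7, 0.02]),
    "dog":     np.array([0.01, 0.08, 0.03]),
    "boy":     np.array([0.02, 0.01, -0.95]),
    "girl":    np.array([0.01, 0.02, 0.97]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to w': arg max over w of sim(e_w, e_c - e_a + e_b)."""
    query = embeddings[c] - embeddings[a] + embeddings[b]
    # The three input words are excluded from the candidates.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], query))

answer = complete_analogy("duke", "duchess", "boy", embeddings)
print(answer)
```

On real embeddings the search runs over the whole vocabulary rather than three candidates.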

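The embeddings of all words are collected in an embedding matrix, with one row per feature and one column per word in the vocabulary:

| n   | feature   | a   | ... | cow   | duke | ...  | zucchini |
| --- | --------- | --- | --- | ----- | ---- | ---- | -------- |
| 1   | gender    | ... | ... | 0.03  | -1   | 0.01 | -0.02    |
| 2   | food      | ... | ... | -0.02 | 0.7  | 0.08 | 0.99     |
| ... | ...       | ... | ... | ...   | ...  | ...  | ...      |
| 500 | furniture | ... | ... | 0.02  | 0.01 | ...  | -0.03    |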

When you multiply the embedding matrix by the one-hot vector representing "cow", you simply keep the column that holds the embedding for "cow". In practice, you'll use an embedding layer that looks up that column directly, because the full multiplication is much slower: almost all of the work is spent multiplying by the zeros of a huge vector that contains a single 1.
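The equivalence between the matrix product and a column lookup is easy to verify. In this sketch the embedding matrix `E` is filled with random numbers and `cow_index` is a hypothetical vocabulary position; only the shapes follow the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, vocab_size = 500, 10_000

# Random stand-in for a learned embedding matrix: one column per word.
E = rng.normal(size=(n_features, vocab_size))

cow_index = 1234                 # hypothetical position of "cow" in the vocabulary
o_cow = np.zeros(vocab_size)     # one-hot vector for "cow"
o_cow[cow_index] = 1.0

e_by_matmul = E @ o_cow          # full matrix-vector product
e_by_lookup = E[:, cow_index]    # what an embedding layer does: pick a column

print(np.allclose(e_by_matmul, e_by_lookup))
```

The product touches all 500 x 10,000 entries of `E`, while the lookup reads only 500 of them.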

### 2.3 Some examples of language models

#### 2.3.1 Learning word embeddings

One way to learn embeddings is to train a language model that predicts the next word in a sequence:

1. Take a sentence and get the word embedding vector for each word. Every embedding vector has the same fixed number of rows (= the number of features).

2. Feed all of them into a neural network. The usual practice is to fix the number of preceding words to, let's say, 5. If each embedding has 500 rows, this leads to a $5 \times 500 = 2500$-dimensional feature vector going into the first layer. The network ends in a softmax over the vocabulary.

Other possible choices of context: the last 3 words and the 3 words after, only the last word, or a single nearby word.
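The fixed-window setup with 5 preceding words can be sketched as a single forward pass. All weights are random, the hidden layer size of 64 and the tanh activation are arbitrary choices, and the context indices are hypothetical; the point is only how the 2500-dimensional input arises:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features, window = 10_000, 500, 5
hidden = 64   # arbitrary hidden layer size (assumption)

E = rng.normal(scale=0.01, size=(n_features, vocab_size))        # embedding matrix
W1 = rng.normal(scale=0.01, size=(hidden, window * n_features))  # first layer
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(vocab_size, hidden))           # softmax layer
b2 = np.zeros(vocab_size)

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def next_word_probabilities(context_indices):
    """Concatenate the embeddings of the preceding words and run the network."""
    features = np.concatenate([E[:, i] for i in context_indices])  # 5 * 500 values
    h = np.tanh(W1 @ features + b1)
    return softmax(W2 @ h + b2)

context = [12, 845, 3, 4071, 99]   # hypothetical indices of the 5 preceding words
probs = next_word_probabilities(context)
features_dim = window * n_features
print(features_dim, probs.shape)
```

During training, `E` is updated by backpropagation together with the other weights, which is how the embeddings are learned.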

#### 2.3.2 word2vec

A simpler way to learn embeddings is the **skip-gram model**, which is trained on context-target pairs. You randomly pick a word to be the context word, and then randomly pick a target word within a specified window around the context word, let's say, at most 5 words away.

"The teacher was writing on the blackboard while the principal walked in"

Context: blackboard

Target: could be principal, while, writing, ...

We want to learn a mapping from our context "c" (blackboard) to our target "t".
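Sampling such context-target pairs can be sketched as below. The function names `candidate_targets` and `sample_pair` are hypothetical, and the window is taken as 5 positions on either side of the context word, with the context word itself excluded:

```python
import random

sentence = "The teacher was writing on the blackboard while the principal walked in"
words = sentence.lower().split()

def candidate_targets(words, context_pos, window=5):
    """All words within `window` positions of the context word, excluding itself."""
    lo = max(0, context_pos - window)
    hi = min(len(words), context_pos + window + 1)
    return [words[p] for p in range(lo, hi) if p != context_pos]

def sample_pair(words, window=5, rng=random):
    """Pick a random context word, then a random target inside its window."""
    context_pos = rng.randrange(len(words))
    target = rng.choice(candidate_targets(words, context_pos, window))
    return words[context_pos], target

blackboard_targets = candidate_targets(words, words.index("blackboard"))
print(blackboard_targets)
print(sample_pair(words, rng=random.Random(0)))
```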

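The model is a softmax over the vocabulary, with $e_c$ the embedding of the context word (the base $e$ of the exponent is Euler's number):

$P(t|c) = \displaystyle \frac{e^{\Theta^T_t e_c}}{\sum^{10{,}000}_{j=1}e^{\Theta^T_j e_c}}$

Loss function, with $y$ the one-hot vector of the true target word and $\hat y$ the vector of predicted probabilities:

$\mathcal L (\hat y, y ) = -\sum^{10{,}000}_{i=1}y_i \log \hat y_i$

$\Theta_t$ is the parameter vector associated with output $t$.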
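The softmax and the loss above can be written out directly. This is a sketch with random, untrained parameters; `Theta` stores one row per output word, and the indices `c` and `t` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features = 10_000, 500

E = rng.normal(scale=0.01, size=(n_features, vocab_size))      # columns are e_c
Theta = rng.normal(scale=0.01, size=(vocab_size, n_features))  # rows are Theta_t

def skipgram_probabilities(c):
    """P(t | c) for every word t in the vocabulary."""
    logits = Theta @ E[:, c]          # Theta_j^T e_c for all j
    logits = logits - logits.max()    # numerical stability, does not change the result
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()   # the sum runs over all 10,000 words

def cross_entropy(y_hat, t):
    """y is one-hot with a 1 at index t, so the sum reduces to -log(y_hat[t])."""
    return float(-np.log(y_hat[t]))

c, t = 4071, 845    # hypothetical context and target indices
y_hat = skipgram_probabilities(c)
loss = cross_entropy(y_hat, t)
print(loss)
```

With untrained parameters the prediction is close to uniform, so the loss is close to $\log 10{,}000$.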

https://www.coursera.org/learn/nlp-sequence-models/lecture/8CZiw/word2vec

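Problem: this is slow, because the denominator of the softmax sums over all 10,000 words in the vocabulary for every prediction. A hierarchical softmax is suggested as a faster alternative.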

#### 2.3.3 Negative sampling

The softmax in the skip-gram model is slow! Try this instead:

- Given a pair of words, try to predict whether or not it is a context-target pair (1 or 0). You first generate labeled word pairs, which turns the task into a supervised learning problem.
- Logistic regression model: $P(y=1|c,t)$, with c and t the context-target pair.
- For each context word, train a small logistic regression problem on about 5-6 pairs: only one is a true context-target pair (label 1), and the rest are randomly sampled words with label 0, so you deliberately use your negative samples. Conceptually there are 10,000 binary classifiers, one per possible target word, but each step only updates the handful that belong to the sampled pairs. This is much faster than updating a full 10,000-way softmax classifier.
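One training example for negative sampling can be sketched as follows. It assumes $P(y=1|c,t) = \sigma(\Theta_t^T e_c)$ as the logistic model, uses random untrained parameters, and draws negatives uniformly from the vocabulary, which is a simplification (frequency-based sampling is common in practice). The indices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features = 10_000, 500
k = 4   # negatives per positive pair, so k + 1 = 5 pairs in total

E = rng.normal(scale=0.01, size=(n_features, vocab_size))      # columns are e_c
Theta = rng.normal(scale=0.01, size=(vocab_size, n_features))  # rows are Theta_t

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_examples(t, k):
    """One true target (label 1) plus k randomly drawn words (label 0)."""
    negatives = rng.integers(0, vocab_size, size=k)   # uniform draw: simplification
    targets = np.concatenate(([t], negatives))
    labels = np.concatenate(([1.0], np.zeros(k)))
    return targets, labels

def negative_sampling_loss(c, targets, labels):
    """Sum of k + 1 binary cross-entropy terms; only k + 1 rows of Theta are used."""
    p = sigmoid(Theta[targets] @ E[:, c])   # P(y=1 | c, t) for each pair
    return float(-np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

c, t = 4071, 845    # hypothetical context and true target indices
targets, labels = make_examples(t, k)
loss = negative_sampling_loss(c, targets, labels)
print(targets, labels, loss)
```

Each step involves only `k + 1` dot products instead of the 10,000 needed for the softmax denominator.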


#### 2.3.4 GloVe

GloVe is used less frequently, but has its enthusiasts because it is so simple!

Again we work with pairs c, t: $x_{ij}$ counts how often the two words $i$ and $j$ appear close to each other.
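The co-occurrence counts can be computed with a simple sliding window. This sketch covers only the counting step, not the GloVe training objective; the two-sentence corpus and the window of 2 words on either side are assumptions made for illustration:

```python
from collections import Counter

# Toy corpus (assumption); real counts come from a large body of text.
corpus = [
    "the teacher was writing on the blackboard",
    "the principal walked in while the teacher was writing",
]

def cooccurrence_counts(sentences, window=2):
    """counts[(i, j)] = number of times word j appears within `window` words of word i."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for pos, word in enumerate(words):
            lo = max(0, pos - window)
            hi = min(len(words), pos + window + 1)
            for other in range(lo, hi):
                if other != pos:
                    counts[(word, words[other])] += 1
    return counts

x = cooccurrence_counts(corpus)
print(x[("teacher", "was")], x[("teacher", "blackboard")])
```

Because the window is symmetric, `x[(i, j)]` equals `x[(j, i)]` in this sketch.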

## 3. Word embedding applications

- Debiasing embeddings
- Classifying movie reviews