# NLP

## Word Embeddings

### Word Representation

- one-hot representation
  - cons: treats each word as its own separate entity
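    - the inner product of any two different one-hot vectors is zero, so the representation gives a model no way to see that some words are more closely related than others
  - hard to generalize what is learned about one word to words with related meanings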
- featurized representation
  - have features like gender, age, etc.
  - these features could be hand-selected, or they could be learned via neural network
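
To make the contrast concrete, here is a minimal sketch comparing the two representations. The toy vocabulary, the three features (gender, royal, food), and their values are all made up for illustration.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list with a 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary (assumption, not from the notes).
vocab = ["man", "woman", "king", "queen", "apple", "orange"]
word_to_index = {w: i for i, w in enumerate(vocab)}

o_apple = one_hot(word_to_index["apple"], len(vocab))
o_orange = one_hot(word_to_index["orange"], len(vocab))

print(dot(o_apple, o_orange))  # 0.0: different words never overlap
print(dot(o_apple, o_apple))   # 1.0: a word only matches itself

# Hand-picked features [gender, royal, food]; the values are invented.
features = {
    "apple": [0.0, 0.0, 0.95],
    "orange": [0.0, 0.0, 0.97],
    "king": [-0.95, 0.93, 0.02],
}

# With features, related words get a larger inner product than unrelated ones.
apple_orange = dot(features["apple"], features["orange"])
apple_king = dot(features["apple"], features["king"])
print(apple_orange > apple_king)  # True
```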

visualizing word embeddings
- reduce the dimensionality of the learned features to 2D so the words can be plotted (t-SNE is a common choice)


### Using Word Embeddings

- learn word embeddings from large text corpora (1-100B words)
- transfer the embeddings to a new task with a smaller training set (e.g. 100k words)
- optionally continue to fine-tune the word embeddings on the new data

"face encoding" and "word embedding" have similar meanings

### Properties of Word Embeddings

- featurized word embeddings can use vector similarity to determine analogous relationships
  - cosine similarity: $\text{sim}(u, v) = \frac{u^T v}{\|u\|_2 \, \|v\|_2}$
  - squared Euclidean distance: $\|u - v\|_2^2$ (a dissimilarity, so smaller means more similar)
- e.g. $e_{man} - e_{woman} \approx e_{king} - e_{queen}$
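
The analogy idea can be sketched as follows: to complete "man is to woman as king is to ?", pick the word whose embedding is most similar to $e_{king} - e_{man} + e_{woman}$. The 3-feature embeddings below are invented for illustration, and `complete_analogy` is a hypothetical helper name, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """u^T v divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings with features [gender, royal, age] (assumption).
embeddings = {
    "man": [-1.0, 0.01, 0.03],
    "woman": [1.0, 0.02, 0.02],
    "king": [-0.95, 0.93, 0.70],
    "queen": [0.97, 0.95, 0.69],
    "apple": [0.0, -0.01, 0.03],
}


def complete_analogy(a, b, c, embeddings):
    """Return the word w that best fits 'a is to b as c is to w'."""
    target = [
        ec - ea + eb
        for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three input words so the answer is a new word.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))


print(complete_analogy("man", "woman", "king", embeddings))  # queen
```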


### Embedding Matrix

say we have a 10k-word vocabulary (a, aaron, ..., orange, ..., zulu, `unknown`)

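counting the `unknown` token, we have a 300 (n-features) by 10,001 matrix E, with one column per vocabulary entry.

multiplying E (300, 10,001) by the one-hot vector $o_{6257}$ (10,001, 1) gives the (300, 1) vector $e_{6257}$, the embedding for the word orange.

in general, $E \cdot o_j = e_j$ (the embedding for word j)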

What we do is initialize E randomly and then use gradient descent to learn word embeddings.

Actually doing the matrix multiplication is inefficient, because almost every term is a multiplication by zero. In practice we just look up column j of E.
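
A small sketch of why the lookup is equivalent to the multiplication. The sizes are shrunk to 4 features and 6 words so the example stays readable, and the function names are illustrative.

```python
import random

random.seed(0)
n_features, vocab_size = 4, 6  # toy stand-ins for 300 and 10,001

# E has one row per feature and one column per word, initialized randomly.
E = [
    [random.uniform(-1.0, 1.0) for _ in range(vocab_size)]
    for _ in range(n_features)
]


def embed_by_matmul(E, j):
    """Compute E * o_j explicitly: vocab_size multiplications per feature."""
    o_j = [1.0 if k == j else 0.0 for k in range(len(E[0]))]
    return [sum(row[k] * o_j[k] for k in range(len(o_j))) for row in E]


def embed_by_lookup(E, j):
    """Read column j of E directly: one access per feature."""
    return [row[j] for row in E]


j = 3
e_matmul = embed_by_matmul(E, j)
e_lookup = embed_by_lookup(E, j)
print(e_matmul == e_lookup)  # True
```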

## Learning Word Embeddings: Word2Vec and GloVe

### Learning Word Embeddings

I want a glass of orange `blank`.  
```[4343, 9665, 1, 3852, 6163, 6257, blank]```

- I: $e_{4343} = E \cdot o_{4343}$
- want: $e_{9665} = E \cdot o_{9665}$
- etc.

We then feed all the e vectors into a hidden layer, which feeds a softmax over the vocabulary that selects the most probable next word.
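
A rough sketch of the forward pass for such a model, assuming a fixed window of the last 4 words, a single tanh hidden layer, and no bias terms. All sizes, weights, and word indices are toy values chosen for illustration; nothing here is trained.

```python
import math
import random

random.seed(0)
vocab_size, n_features, window, hidden_size = 8, 3, 4, 5  # toy sizes


def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def softmax(z):
    shift = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


E = random_matrix(n_features, vocab_size)              # embedding matrix
W1 = random_matrix(hidden_size, window * n_features)   # hidden layer weights
W2 = random_matrix(vocab_size, hidden_size)            # softmax layer weights


def predict_next_word(context_indices):
    """Return a probability for every vocabulary word given the context."""
    # Look up each context word's column of E and concatenate the results.
    x = [E[f][j] for j in context_indices for f in range(n_features)]
    h = [math.tanh(v) for v in matvec(W1, x)]
    return softmax(matvec(W2, h))


probs = predict_next_word([2, 5, 1, 7])
most_probable = max(range(vocab_size), key=lambda j: probs[j])
```

Training would backpropagate the softmax loss into `W1`, `W2`, and `E`, which is how the embeddings get learned.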

We can also use other contexts, e.g. the last 4 words, 4 words on the left and right, or a single nearby word (skip-gram).

### Word2Vec

We pick a context word and then choose target words from a window around it:

I want a glass of orange juice to go along with my cereal.

|context|target|
|---|---|
|orange|juice|
|orange|along|
|orange|my|
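
One way to generate such pairs is sketched below: for each context word, sample a few targets at random from a window around it. The window size of 5, the 3 pairs per context, and the helper name `skipgram_pairs` are assumptions made for this example.

```python
import random


def skipgram_pairs(tokens, window, pairs_per_context, rng):
    """Return (context, target) pairs with targets sampled near each context."""
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Every position in the window except the context word itself.
        neighbours = [k for k in range(lo, hi) if k != i]
        n = min(pairs_per_context, len(neighbours))
        for k in rng.sample(neighbours, n):
            pairs.append((context, tokens[k]))
    return pairs


sentence = "I want a glass of orange juice to go along with my cereal".split()
rng = random.Random(0)
pairs = skipgram_pairs(sentence, window=5, pairs_per_context=3, rng=rng)

orange_pairs = [p for p in pairs if p[0] == "orange"]
print(orange_pairs)
```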

context c (orange, 6257) -> target t (juice, 4834)

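look up $e_c = e_{6257}$ for the context word; the target (4834) is the label to predict

feed $e_c$ to a softmax layer to output $\hat{y}$

softmax: $p(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{j \in \text{vocab}} e^{\theta_j^T e_c}}$, where $\theta_t$ is the softmax parameter vector for word t

main drawback: the softmax is very expensive to calculate, because the denominator sums over the entire vocabulary for every training example

can use hierarchical softmax, which replaces the flat sum with a tree of binary classifiers, to lessen computation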
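
The cost is easy to see in a direct implementation of the softmax, sketched here with a toy vocabulary of 10 words and 4 features. The parameters are random rather than trained, and storing one vector per word (rather than a matrix with one column per word) is just a convenience for the example.

```python
import math
import random

random.seed(0)
vocab_size, n_features = 10, 4  # toy stand-ins for 10k words and 300 features


def random_vector():
    return [random.uniform(-0.5, 0.5) for _ in range(n_features)]


embeddings = [random_vector() for _ in range(vocab_size)]  # e_j, one per word
thetas = [random_vector() for _ in range(vocab_size)]      # theta_j, one per word


def p_target_given_context(t, c):
    """Softmax probability of target word t given context word c."""
    e_c = embeddings[c]
    # One inner product per vocabulary word: this loop is the expensive part.
    logits = [sum(a * b for a, b in zip(theta_j, e_c)) for theta_j in thetas]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    return exps[t] / sum(exps)


p = p_target_given_context(t=7, c=3)
total = sum(p_target_given_context(t, c=3) for t in range(vocab_size))
```

Every single probability needs `vocab_size` inner products, so a 10k-word vocabulary means 10k inner products per training example.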

### Negative Sampling

Defining a new learning problem

|context|word|target|
|---|---|---|
|orange|juice|1|
|orange|king|0|
|orange|book|0|
|orange|the|0|

We create a supervised learning problem: given a pair of words, predict whether it is a true context-target pair (1) or a pair made with a randomly sampled word (0).

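$e_{6257}$ is fed into a layer of 10k binary logistic classifiers, one per vocabulary word, each predicting association. On each step we train only the classifier for the true target plus those for a small, selected number of negative examples.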
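
A minimal sketch of one training step on the four example rows from the table above. It assumes a plain logistic loss and a single gradient step with a learning rate of 0.1; the 6-word vocabulary, the 4 features, and the random initial values are toy choices for illustration.

```python
import math
import random

random.seed(0)
vocab = ["orange", "juice", "king", "book", "the", "of"]
index = {w: i for i, w in enumerate(vocab)}
n_features = 4


def random_vector():
    return [random.uniform(-0.5, 0.5) for _ in range(n_features)]


embeddings = [random_vector() for _ in vocab]  # e_w for each word
thetas = [random_vector() for _ in vocab]      # one logistic unit per word


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(context, examples):
    """Sum of logistic losses over the (word, label) examples."""
    e_c = embeddings[index[context]]
    total = 0.0
    for word, label in examples:
        p = sigmoid(dot(thetas[index[word]], e_c))
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total


def sgd_step(context, examples, lr=0.1):
    """Update only the units named in `examples`, plus the context embedding."""
    c = index[context]
    e_c = embeddings[c][:]  # copy so every gradient uses the old values
    grad_e = [0.0] * n_features
    for word, label in examples:
        t = index[word]
        error = sigmoid(dot(thetas[t], e_c)) - label
        for f in range(n_features):
            grad_e[f] += error * thetas[t][f]
            thetas[t][f] -= lr * error * e_c[f]
    for f in range(n_features):
        embeddings[c][f] -= lr * grad_e[f]


examples = [("juice", 1), ("king", 0), ("book", 0), ("the", 0)]
theta_of_before = thetas[index["of"]][:]

loss_before = loss("orange", examples)
sgd_step("orange", examples)
loss_after = loss("orange", examples)
```

Only 4 of the 6 logistic units are touched in this step; the unit for "of" is left unchanged, which is where the savings over the full softmax come from.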

how can we choose our negative samples?
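
One common heuristic, reported in the original Word2Vec work, is to sample negatives in proportion to word frequency raised to the 3/4 power: $P(w_i) = \frac{f(w_i)^{3/4}}{\sum_j f(w_j)^{3/4}}$. This sits between sampling by raw frequency (which over-picks words like "the") and sampling uniformly. The sketch below assumes a tiny made-up corpus; the helper names are hypothetical, and excluding the context and target words from the candidates is a simplification of this example.

```python
import random
from collections import Counter


def negative_sampling_probs(tokens, power=0.75):
    """Return P(w) proportional to count(w) ** power."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}


def sample_negatives(probs, context, target, k, rng):
    """Draw k negative words, skipping the context and target themselves."""
    words = [w for w in probs if w not in (context, target)]
    return rng.choices(words, weights=[probs[w] for w in words], k=k)


# Made-up corpus: "the" appears 16 times, "of" 8 times, the rest once each.
tokens = ["the"] * 16 + ["of"] * 8 + ["orange", "juice", "king", "book"]
probs = negative_sampling_probs(tokens)

rng = random.Random(0)
negatives = sample_negatives(probs, "orange", "juice", k=4, rng=rng)
print(negatives)
```

In this corpus "the" is 16 times as frequent as "orange" but only about 8 times as likely to be drawn, since $16^{3/4} = 8$.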



### GloVe

GloVe stands for global vectors for word representation

it measures how related two words are by counting how often they appear near each other across the whole corpus
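
The counting step can be sketched as below. This covers only the co-occurrence counts, not the GloVe training objective itself. The example sentence, the window of 1 word on each side, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=1):
    """Count how often each ordered pair of words appears within `window`."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for k in range(lo, hi):
            if k != i:
                counts[(word, tokens[k])] += 1
    return counts


tokens = "i want a glass of orange juice and a glass of apple juice".split()
X = cooccurrence_counts(tokens, window=1)

print(X[("glass", "of")])      # 2
print(X[("orange", "juice")])  # 1
print(X[("orange", "apple")])  # 0
```

With a symmetric window like this one, the counts are symmetric too: `X[("of", "glass")]` equals `X[("glass", "of")]`.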

