# Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 6 – Language Models and RNNs
- [Lecture 6](https://www.youtube.com/watch?v=iWea12EAu6U) – Language Models and RNNs
- [Lecture 7](https://www.youtube.com/watch?v=QEw0qEa0E50) – Vanishing Gradients, Fancy RNNs
- [details](http://web.stanford.edu/class/cs224n/index.html#schedule) - assignments

## Language Model

We use a language model to get the probability of the next word given the previous words.

### Motivation
- benchmark task that helps us measure our progress on understanding language
- subcomponent of many tasks, e.g. generating text or estimating the probability of text


### Applications
- sentence classification (e.g. who is speaking)
- encoder module (e.g. QA system)


### n-gram
#### trade-off between size of n-gram and sparsity
- sometimes the last n words aren't enough to know what should come next -> **solution:** increase n
- but a bigger $n$ makes the data sparsity problem worse
- if word `w` never occurs after the context $w_1,...,w_i$ in the data, it gets probability `0` -> **solution:** add a small delta to the count of every word (smoothing)
- if the context $w_1,...,w_i$ itself never occurs in the data set, we divide by `0` -> **solution:** back off to a smaller n, i.e. condition on $w_2,...,w_i$
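
The smoothing and backoff ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `next_word_prob`, the toy corpus, and `delta=0.1` are all assumptions, and the backoff here is the simplest possible kind (drop the oldest context word and retry) rather than a properly discounted scheme.

```python
from collections import Counter


def next_word_prob(word, context, tokens, vocab, delta=0.1):
    """Estimate P(word | context) with add-delta smoothing and simple backoff."""
    context = tuple(context)
    while True:
        n = len(context) + 1
        # Count every n-gram of the current order in the corpus.
        counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        numer = counts[context + (word,)]
        # Number of times the context is followed by any word.
        denom = sum(c for gram, c in counts.items() if gram[:-1] == context)
        if denom > 0 or not context:
            # Add delta to every word's count so unseen words get non-zero mass.
            return (numer + delta) / (denom + delta * len(vocab))
        # Context never seen: back off by dropping its oldest word.
        context = context[1:]


# Hypothetical toy corpus, used only for illustration.
tokens = "the students opened their books the students opened their minds".split()
vocab = sorted(set(tokens))

p_seen = next_word_prob("books", ["students", "opened", "their"], tokens, vocab)
p_unseen = next_word_prob("the", ["students", "opened", "their"], tokens, vocab)
p_backoff = next_word_prob("books", ["teachers", "opened", "their"], tokens, vocab)
```

Here `p_unseen` is small but non-zero thanks to smoothing, and `p_backoff` falls back to the shorter context `opened their` because `teachers opened their` never appears in the corpus.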

## a fixed-window neural Language Model
### Steps
1. get words (in window)
2. one-hot vector for each word (x)
3. map each one-hot vector to its embedding ($e_i$)
4. concatenate together ($e = [e_1,...e_i]$)
5. hidden layer: $h = f(We + b_1)$
6. output distribution: $\hat{y} = softmax(Uh + b_2) \in R^{|V|}$
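
The six steps above can be sketched as a single forward pass in NumPy. The sizes, the random initialization, and the choice of `tanh` for $f$ are assumptions made for illustration; the notes do not fix any of them.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()


def fixed_window_lm(word_ids, E, W, b1, U, b2):
    """Forward pass of a fixed-window neural LM for one window of word ids."""
    # Steps 2-4: indexing a row of E is equivalent to multiplying a one-hot
    # vector by E; the embeddings are then concatenated into one vector e.
    e = np.concatenate([E[i] for i in word_ids])
    h = np.tanh(W @ e + b1)      # step 5: hidden layer (f = tanh is an assumption)
    return softmax(U @ h + b2)   # step 6: distribution over the vocabulary


# Hypothetical sizes: vocabulary, embedding dim, window length, hidden dim.
V, d, window, hidden = 10, 4, 3, 8
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))
W = rng.normal(size=(hidden, window * d))
b1 = np.zeros(hidden)
U = rng.normal(size=(V, hidden))
b2 = np.zeros(V)

y_hat = fixed_window_lm([1, 5, 2], E, W, b1, U, b2)
```

Note that `W` has `window * d` columns, so each position in the window is multiplied by its own slice of `W`.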

### Improvements
- no sparsity problem
- don't need to store all observed n-grams

### Remaining problems
- the window can never be large enough
- enlarging the window enlarges $W$
- $x^{(1)}$ and $x^{(2)}$ are multiplied by completely different weights in $W$, so there is no symmetry in how the inputs are processed: what is learned in one section of the matrix isn't shared with the other sections, which makes learning inefficient

## RNN
1. get words (a sequence of any length)
2. one-hot vector for each word (x)
3. map each one-hot vector to its embedding ($e^{(t)}$)
4. hidden state ($h^{(0)}$ is the initial hidden state):
$$
h^{(t)} = \sigma{(W_{h}h^{(t-1)} + W_{e}e^{(t)} + b_1)}
$$
5. output distribution: $\hat{y}^{(t)} = softmax(Uh^{(t)} + b_2) \in R^{|V|}$
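
A minimal NumPy sketch of the recurrence above, assuming $\sigma$ is the logistic sigmoid and $h^{(0)}$ is a zero vector. The function name `rnn_lm_forward`, the sizes, and the random weights are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()


def rnn_lm_forward(word_ids, E, W_h, W_e, b1, U, b2):
    """Return the output distribution at every timestep and the final hidden state."""
    h = np.zeros(W_h.shape[0])  # h^(0), assumed to be all zeros here
    outputs = []
    for i in word_ids:
        # The same W_h and W_e are reused at every timestep.
        h = sigmoid(W_h @ h + W_e @ E[i] + b1)
        outputs.append(softmax(U @ h + b2))
    return np.array(outputs), h


# Hypothetical sizes: vocabulary, embedding dim, hidden dim.
V, d, hidden = 10, 4, 6
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))
W_h = rng.normal(size=(hidden, hidden))
W_e = rng.normal(size=(hidden, d))
b1 = np.zeros(hidden)
U = rng.normal(size=(V, hidden))
b2 = np.zeros(V)

short_out, _ = rnn_lm_forward([1, 5], E, W_h, W_e, b1, U, b2)
long_out, _ = rnn_lm_forward([1, 5, 2, 7, 3], E, W_h, W_e, b1, U, b2)
```

The same parameters handle both the 2-word and the 5-word input, and the first two rows of `long_out` match `short_out` because each step only depends on the words seen so far.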

### Steps
- big corpus of texts $x^{(1)},...,x^{(T)}$
- feed into the RNN-LM (Language Model); compute the output distribution $\hat{y}^{(t)}$ for every step $t$, i.e. predict the probability distribution of every word, given the words so far
- loss function on step $t$ is cross-entropy between predicted probability distribution $\hat{y}^{(t)}$, and the true next word $y^{(t)}$ (one-hot for $x^{(t+1)}$):
$$
J^{(t)}(\theta) = CE(y^{(t)}, \hat{y}^{(t)}) = - \sum_{w \in V} y_w^{(t)} \log \hat{y}^{(t)}_w = - \log \hat{y}^{(t)}_{x_{t+1}}
$$
- average this to get overall loss for entire training set:
$$
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta) = - \frac{1}{T} \sum_{t=1}^{T} \log \hat{y}^{(t)}_{x_{t+1}}
$$
- computing the loss over the entire corpus is too expensive, so in practice we compute it for one sentence or document at a time (Stochastic Gradient Descent)
- derivative of $J^{(t)}(\theta)$ w.r.t. (with respect to) the repeated weight matrix $W_h$:

$$
\frac{\partial{J^{(t)}}}{\partial{W_h}} = \sum_{i=1}^{t}\frac{\partial{J^{(t)}}}{\partial{W_h}}|_{i}
$$
backpropagate over timesteps $i=t,...,0$, summing gradients as you go. This algorithm is called "backpropagation through time"
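
The sum over timesteps can be made concrete with a small NumPy sketch that accumulates the gradient of $J^{(t)}$ with respect to $W_h$ while walking backwards through the hidden states. It assumes a sigmoid nonlinearity, a zero initial hidden state, and drops the biases for brevity; the function names and sizes are hypothetical. A finite-difference check on one entry of $W_h$ is included as a sanity test.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(W_h, W_e, U, embeds, target):
    """Run the RNN over `embeds`; return the loss at the last step, all h, and y_hat."""
    hs = [np.zeros(W_h.shape[0])]  # h^(0), assumed zero
    for e in embeds:
        hs.append(sigmoid(W_h @ hs[-1] + W_e @ e))
    logits = U @ hs[-1]
    y_hat = np.exp(logits - logits.max())
    y_hat /= y_hat.sum()
    return -np.log(y_hat[target]), hs, y_hat


def bptt_grad_W_h(W_h, W_e, U, embeds, target):
    """Gradient of the last-step loss w.r.t. W_h, summed over all timesteps."""
    _, hs, y_hat = forward(W_h, W_e, U, embeds, target)
    d_logits = y_hat.copy()
    d_logits[target] -= 1.0          # gradient of softmax + cross-entropy
    dh = U.T @ d_logits              # gradient w.r.t. the last hidden state
    grad = np.zeros_like(W_h)
    for i in range(len(embeds), 0, -1):
        dz = dh * hs[i] * (1.0 - hs[i])   # through the sigmoid at step i
        grad += np.outer(dz, hs[i - 1])   # contribution of W_h's use at step i
        dh = W_h.T @ dz                   # pass the gradient back to h^(i-1)
    return grad


# Hypothetical sizes and random data.
hidden, d, V = 5, 3, 7
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_e = rng.normal(scale=0.5, size=(hidden, d))
U = rng.normal(scale=0.5, size=(V, hidden))
embeds = rng.normal(size=(4, d))
target = 2

grad = bptt_grad_W_h(W_h, W_e, U, embeds, target)

# Central finite difference on a single entry of W_h.
eps = 1e-6
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(W_plus, W_e, U, embeds, target)[0]
           - forward(W_minus, W_e, U, embeds, target)[0]) / (2 * eps)
```

If the accumulation is right, `numeric` and `grad[0, 1]` should agree to several decimal places.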

### pros
- can process **any length input**
- computation for step `t` can (in theory) use information from **many steps back**
- **model size doesn't increase** for longer input
- the same weights are applied on every timestep, so there is symmetry in how inputs are processed

### cons
- can't process sequence in parallel (slow)
- in practice, it's difficult to access information many steps back

### Evaluating Language Models
The standard metric is perplexity; lower is better.
$$
perplexity = \prod_{t=1}^{T}\left(\frac{1}{P_{LM}(x^{(t+1)} | x^{(t)},...,x^{(1)})}\right)^{1/T}
$$

- inverse probability of the corpus, according to the Language Model
- $1/T$ normalizes by the number of words; we need it because the probability of the corpus gets smaller and smaller as the corpus gets bigger, so the unnormalized value would depend on corpus length

This is equal to the exponential of the cross-entropy loss $J(\theta)$:
$$
= \prod_{t=1}^{T}\left(\frac{1}{\hat{y}^{(t)}_{x_{t+1}}}\right)^{1/T} = \exp \left(\frac{1}{T} \sum_{t=1}^{T} - \log \hat{y}^{(t)}_{x_{t+1}} \right) = \exp(J(\theta))
$$
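
A quick numeric check of this identity, as a sketch: `probs` is a made-up list standing in for the probabilities $\hat{y}^{(t)}_{x_{t+1}}$ that a model assigned to the true next words, and both function names are hypothetical.

```python
import numpy as np


def perplexity_from_product(probs):
    """Geometric mean of the inverse probabilities of the true next words."""
    probs = np.asarray(probs, dtype=float)
    return float(np.prod(1.0 / probs) ** (1.0 / len(probs)))


def perplexity_from_cross_entropy(probs):
    """exp of the average negative log-probability, i.e. exp(J)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.exp(np.mean(-np.log(probs))))


# Made-up probabilities assigned to the correct next word at each step.
probs = [0.25, 0.5, 0.125, 0.5]
ppl_product = perplexity_from_product(probs)
ppl_exp = perplexity_from_cross_entropy(probs)

# A model that is uniform over 10 words has perplexity 10.
ppl_uniform = perplexity_from_cross_entropy([0.1] * 6)
```

In practice the log form is the one to use on real corpora, since multiplying many small probabilities directly underflows.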

Comparison of different models from Facebook: [Building an Efficient Neural Language Model Over a Billion Words](https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/)

http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture06-rnnlm.pdf

https://youtu.be/iWea12EAu6U?t=3251