### Recurrent Neural Network

* Hidden states
    * Each hidden state is computed from the previous hidden state and the current input.
    * The initial hidden state is $h^{(0)}$. It can be learned as a parameter or simply set to the zero vector.
    
    * To compute the new hidden state:
        * Apply a linear transformation to the previous hidden state and another to the current input, add a bias, then apply a nonlinearity $\sigma$.
        * $h^{(t)} = \sigma(W_h h^{(t - 1)} + W_e e^{(t)} + b_1)$
            * This produces the new hidden state.
            
* Then compute the next hidden state in the same way, and so on.
* To predict the next word, pass the current hidden state through a linear layer followed by a softmax to get an output distribution over the vocabulary.


* Both $W_e$ and $W_h$ have to be learned.
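
The forward pass described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the sizes, the random initialization, the embedding table `E`, the output layer `U` and `b_2`, and the choice of `tanh` for the nonlinearity $\sigma$ are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def rnn_step(h_prev, e_t, W_h, W_e, b_1):
    # h^(t) = sigma(W_h h^(t-1) + W_e e^(t) + b_1), with tanh assumed as sigma.
    return np.tanh(W_h @ h_prev + W_e @ e_t + b_1)

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 10, 6, 8   # toy sizes (assumed)

E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))     # word embeddings
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
W_e = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))   # input-to-hidden
b_1 = np.zeros(hidden_dim)
U = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))    # output layer (assumed name)
b_2 = np.zeros(vocab_size)

token_ids = [3, 1, 4, 1, 5]   # a made-up input sequence of word indices
h = np.zeros(hidden_dim)      # h^(0) set to the zero vector
for tok in token_ids:
    h = rnn_step(h, E[tok], W_h, W_e, b_1)

# Distribution over the next word, given everything seen so far.
y_hat = softmax(U @ h + b_2)
print(y_hat.argmax())
```

The same `rnn_step` and the same weights are reused at every position, which is why the loop works for a sequence of any length.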

1. Advantages
    * Can process any length input
    * Computation for step t can (in theory) use information from many steps back.
    * Model size does not increase for longer input. The size of the model is fixed ($W_h$, $W_e$ and the biases), because the same weights are simply applied repeatedly.
    * The same weights are applied at every time step (the same transformation for all inputs), so what is learned from one position carries over to the others.
    
2. Disadvantages
    * Recurrent computation is **slow**
        * Each hidden state depends on the previous hidden state, so the steps have to be computed one after another over a long sequence.
    * In practice, it is difficult to access information from many steps back (exploding/vanishing gradients)
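
To make the vanishing-gradient point concrete, here is a small NumPy sketch that tracks how strongly the hidden state at step $t$ depends on the initial hidden state. It assumes a `tanh` nonlinearity and deliberately rescales $W_h$ so that its largest singular value is 0.5; both choices, and all sizes, are assumptions for the illustration (a larger $W_h$ can instead lead to exploding gradients).

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 8, 4, 30   # hidden size, input size, sequence length (assumed)

W_h = rng.normal(size=(H, H))
W_h *= 0.5 / np.linalg.norm(W_h, 2)   # force the largest singular value to 0.5
W_e = rng.normal(scale=0.5, size=(H, D))
xs = rng.normal(size=(T, D))          # random stand-in inputs

h = np.zeros(H)
J = np.eye(H)    # Jacobian d h^(t) / d h^(0), starts as the identity
norms = []
for x in xs:
    h = np.tanh(W_h @ h + W_e @ x)
    # One-step Jacobian d h^(t) / d h^(t-1) = diag(1 - h^2) @ W_h (tanh derivative).
    J = np.diag(1.0 - h ** 2) @ W_h @ J
    norms.append(np.linalg.norm(J, 2))

for t in (1, 10, 20, 30):
    print(f"step {t:2d}: ||d h^(t) / d h^(0)|| = {norms[t - 1]:.3e}")
```

Each step multiplies the Jacobian by a matrix whose norm is at most 0.5 here, so the influence of early steps shrinks geometrically with distance.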
        

* How can you apply the same weight matrix to all of these different words?
    * The idea is to learn a general function of the words seen so far, not a rule for one specific word: how to deal with language given the context so far.
    * All of the words are vectors of the same length (500 here), so the same matrices fit at every step.

    

### Backpropagation for RNNs?

* What's the derivative of $J^{(t)}(\theta)$ wrt the repeated weight matrix $W_h$?

The gradient with respect to a repeated weight is the sum of the gradient wrt each time it appears.

<img src="./backprop.png" width="50%">

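$\frac{\partial J^{(t)}}{\partial W_h} = \sum_{i = 1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}$

Here $|_{(i)}$ means the gradient through the appearance of $W_h$ at time step $i$ only.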

This is backpropagation through time: the same backprop as before, just applied to the RNN unrolled over its time steps.
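
The "sum over every appearance" rule can be checked numerically. The sketch below is an illustration under assumptions: a `tanh` RNN, a toy loss equal to the sum of the final hidden state, and made-up sizes and inputs. It accumulates one contribution to the gradient of $W_h$ per time step and compares the total with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 4, 3, 5   # hidden size, input size, number of steps (assumed)
W_h = rng.normal(scale=0.5, size=(H, H))
W_e = rng.normal(scale=0.5, size=(H, D))
b = np.zeros(H)
xs = rng.normal(size=(T, D))

def forward(W_h):
    hs = [np.zeros(H)]                      # hs[0] is h^(0)
    for x in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_e @ x + b))
    return hs

def loss(W_h):
    return forward(W_h)[-1].sum()           # toy loss on the final hidden state

def bptt_grad(W_h):
    hs = forward(W_h)
    grad = np.zeros_like(W_h)
    dh = np.ones(H)                         # d loss / d h^(T)
    for i in range(T, 0, -1):
        dz = dh * (1.0 - hs[i] ** 2)        # back through tanh at step i
        grad += np.outer(dz, hs[i - 1])     # contribution of W_h as used at step i
        dh = W_h.T @ dz                     # pass the gradient on to h^(i-1)
    return grad

def numerical_grad(W_h, eps=1e-6):
    g = np.zeros_like(W_h)
    for r in range(H):
        for c in range(H):
            W_plus, W_minus = W_h.copy(), W_h.copy()
            W_plus[r, c] += eps
            W_minus[r, c] -= eps
            g[r, c] = (loss(W_plus) - loss(W_minus)) / (2 * eps)
    return g

max_diff = np.abs(bptt_grad(W_h) - numerical_grad(W_h)).max()
print("max abs difference:", max_diff)
```

The `grad +=` line is the sum in the formula: one term for each time step at which $W_h$ was applied.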

### Size of batch

* The size of the batch matters. Stochastic gradient descent is only an approximation of true gradient descent, because the gradient computed on a batch is just an estimate of the true gradient over the whole dataset (which is why shuffling the data is important).

* But over many steps it should still minimize your loss.
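
A small sketch of the batch-versus-full-gradient relationship, using least-squares regression on synthetic data as a stand-in model (the model, the data and all sizes are assumptions for the example, not part of the notes). A single batch gives a noisy estimate, while the batch gradients averaged over one full shuffled pass with equal-sized batches recover the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B = 1000, 5, 50   # dataset size, feature count, batch size (assumed)
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)
w = np.zeros(D)         # current parameters

def grad(X_batch, y_batch, w):
    # Gradient of the mean squared error over the given examples.
    return 2.0 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

full_grad = grad(X, y, w)

# Shuffle once, then split the indices into N // B equal batches.
perm = rng.permutation(N)
batch_grads = np.array([grad(X[idx], y[idx], w) for idx in perm.reshape(-1, B)])

single_batch_error = np.linalg.norm(batch_grads[0] - full_grad)
averaged_error = np.linalg.norm(batch_grads.mean(axis=0) - full_grad)
print("error of one batch gradient:     ", single_batch_error)
print("error of averaged batch gradients:", averaged_error)
```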

### Using RNN for sentence classification (eg sentiment)

* Encode the text using the RNN to get a sentence encoding, from which the label for the sentence can be predicted. It is useful to have a single vector representing the sentence rather than a lot of separate vectors.

* We can use the final hidden state as the sentence encoding. Remember that the final hidden state is what predicts the next word, so we presume it knows about everything that came so far.

* It usually works better to take an elementwise max or mean of all hidden states.
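
The three sentence encodings can be compared side by side. This is a sketch with random, untrained weights and made-up sizes; the helper name `rnn_hidden_states`, the `tanh` nonlinearity, and the logistic output layer (`w_out`, `c_out`) are assumptions for illustration.

```python
import numpy as np

def rnn_hidden_states(embeddings, W_h, W_e, b):
    # Run the RNN over the sentence and keep every hidden state.
    h = np.zeros(W_h.shape[0])
    states = []
    for e in embeddings:
        h = np.tanh(W_h @ h + W_e @ e + b)
        states.append(h)
    return np.stack(states)                  # shape (T, hidden_dim)

rng = np.random.default_rng(0)
T, embed_dim, hidden_dim = 7, 6, 8           # toy sizes (assumed)
W_h = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))
W_e = rng.normal(scale=0.3, size=(hidden_dim, embed_dim))
b = np.zeros(hidden_dim)
sentence = rng.normal(size=(T, embed_dim))   # stand-in word embeddings

hs = rnn_hidden_states(sentence, W_h, W_e, b)

final_enc = hs[-1]            # final hidden state
mean_enc = hs.mean(axis=0)    # elementwise mean over time steps
max_enc = hs.max(axis=0)      # elementwise max over time steps

# Hypothetical sentiment head: logistic regression on the chosen encoding.
w_out = rng.normal(scale=0.3, size=hidden_dim)
c_out = 0.0
p_positive = 1.0 / (1.0 + np.exp(-(w_out @ mean_enc + c_out)))
print("P(positive) with untrained weights:", p_positive)
```

All three encodings have the same size regardless of sentence length, so the classifier on top does not need to change.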

### RNNs as general encoder module

* Question answering:
    * What nationality was Beethoven?
    1.  Use an RNN to process the question.
        * The hidden states then serve as the representation of the question.

[RNN Medium Pytorch](https://towardsdatascience.com/sentiment-analysis-using-lstm-step-by-step-50d074f09948)

[multivariable chain rule khan academy](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version)

[Stanford RNN](https://www.youtube.com/watch?v=6niqTuYFZLQ)

[Stanford nlp rnn](https://youtu.be/iWea12EAu6U?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&t=953)

[Colab RNN](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb)

[Andrej Code](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

[Jeremy Howard RNN](https://www.youtube.com/watch?v=H3g26EVADgY)

[Sentiment Pytorch RNN Code](https://github.com/scoutbee/pytorch-nlp-notebooks/blob/develop/3_rnn_text_classification.ipynb)

[Pytorch NLP Videos](https://github.com/scoutbee/pytorch-nlp-notebooks)

[Olah LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

[Tabular dataset](https://averdones.github.io/reading-tabular-data-with-pytorch-and-training-a-multilayer-perceptron/)