Sure! Here’s a simplified, complete exam note for your Lecture 8 on Recurrent Neural Networks (RNNs) from Elena Raponi’s Neural Computing course. I added examples to make concepts clearer.

---

# Lecture 8: Recurrent Neural Networks (RNNs)

**Neural Computing Course – Leiden University**
**Lecturer: Elena Raponi**
Date: 14/04/2025

---

## 1. Motivation – Time Series Forecasting

* Much real-world data comes as sequences (time series), such as weather data, stock prices, or text.
* Each data point depends on earlier points (temporal dependency).
* **Question:** Can normal multilayer neural networks (NNs) handle sequences?
* **Problem:**

  * How much past data should we use?
  * A multilayer NN takes a fixed-size input vector and has no built-in notion of order, so it cannot carry information from one time step to the next.
  * Relationships in data may change over time.

*Example:* Predict tomorrow's temperature from the past week's data. A simple NN sees the seven values as unrelated features, but their order matters.
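One common way to feed a time series to an ordinary NN is to cut it into fixed-length windows, which makes the "how much past data?" question an explicit parameter. The sketch below is only an illustration: the helper name `make_windows` and the temperature values are made up.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (inputs, targets) pairs.

    Each input row holds `window` consecutive values; the target is the
    value that immediately follows them.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - window
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window:]
    return X, y

# Hypothetical daily temperatures (made-up numbers, for illustration only).
temps = [12.0, 13.5, 14.0, 13.0, 15.5, 16.0, 15.0, 14.5, 16.5, 17.0]
X, y = make_windows(temps, window=7)
print(X.shape, y.shape)  # (3, 7) (3,)
```

The window size (7 here) fixes how far back the model can look, which is exactly the limitation that recurrent networks try to remove.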

---

## 2. Introduction

* Traditional multilayer NNs have no cycles (no feedback).
* What if we add cycles?
* A cycle feeds earlier activations back into the network, so the output depends on previous states as well as the current input (memory).
* Even with the same input, the output can therefore change from step to step, producing a sequence.

---

## 3. Hopfield Networks (Memory example)

* A type of recurrent network used for associative memory.
* Continuous variants use activation functions such as tanh or sigmoid.
* The network “settles” into stable states called attractors — these represent stored memories.
* Given a noisy or incomplete input, it retrieves the closest stored pattern.

*Example:* Remember patterns like handwritten digits; if input is noisy ‘3’, it recalls a clean ‘3’.

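* **Discrete Hopfield Network:** outputs are restricted to {-1, 1} by the sign function.
* Weight rule (Hebbian): $w_{ij} \propto \sum_p x_i^{(p)} x_j^{(p)}$, summed over the stored patterns $p$, with symmetric weights ($w_{ij} = w_{ji}$) and no self-connections ($w_{ii} = 0$).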
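The storage rule and the "settling" behaviour can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation: the function names, the $1/N$ scaling, the asynchronous update order, and the two 8-unit patterns are all assumptions chosen to keep the example small.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian rule: W = (1/N) * sum over patterns of x x^T, with zero diagonal."""
    patterns = np.asarray(patterns, dtype=float)
    n_units = patterns.shape[1]
    W = patterns.T @ patterns / n_units
    np.fill_diagonal(W, 0.0)  # no self-connections
    return W

def recall(W, state, max_sweeps=10):
    """Update units one at a time with the sign function until nothing changes."""
    state = np.array(state, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            new_value = 1.0 if W[i] @ state >= 0 else -1.0
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:  # reached a stable state (attractor)
            break
    return state

# Two toy patterns with entries in {-1, 1} (illustrative, not real digits).
stored = np.array([
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
])
W = train_hopfield(stored)

noisy = stored[0].copy()
noisy[0] = -noisy[0]  # corrupt one unit
recovered = recall(W, noisy)
print(np.array_equal(recovered, stored[0]))  # True
```

Starting from the corrupted input, the network falls back into the stored pattern, which is the associative-memory behaviour described above.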

---

## 4. Vanilla Recurrent Neural Network

* Adds a **hidden state** $h(t)$ that acts as a short-term memory of past inputs.
* At each time step $t$, the hidden state is updated from the input $x(t)$ and the previous hidden state $h(t-1)$; the output $O(t)$ is then computed from $h(t)$.

*Example:* In language, to understand the current word, the RNN remembers previous words.
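As a rough sketch of the recurrence, the forward pass of a vanilla RNN can be written as a loop over time steps. The weight names (`W_xh`, `W_hh`, `W_hy`), the tanh activation, the zero initial state, and all sizes below are assumptions for illustration; the notes do not fix a particular parameterization.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence and return all hidden states and outputs."""
    h = np.zeros(W_hh.shape[0])  # assumed initial hidden state h(0) = 0
    hidden_states, outputs = [], []
    for x in xs:
        # New hidden state depends on the current input and the previous hidden state.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        # Output at this step is read out from the hidden state.
        o = W_hy @ h + b_y
        hidden_states.append(h)
        outputs.append(o)
    return np.array(hidden_states), np.array(outputs)

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 3, 4, 2, 5
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

xs = rng.normal(size=(seq_len, input_size))  # random toy sequence
hs, outs = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, outs.shape)  # (5, 4) (5, 2)
```

Note that the same three weight matrices are reused at every time step; only the hidden state changes.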

---

## 5. Sentiment Classification (Simple NLP task)

* Input: sequence of words (e.g., a movie review).
* Output: classify review as **positive** or **negative**.
* The RNN processes words one by one, updating hidden state, then outputs sentiment prediction.
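A many-to-one setup like this can be sketched as follows. Everything here is hypothetical: the five-word vocabulary, the embedding table `E`, and the random, untrained weights, so the resulting probability carries no meaning. The point is only the data flow from words to a single prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy vocabulary and sizes.
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3, "boring": 4}
embed_dim, hidden_size = 4, 6

# Untrained random parameters (a real model would learn these).
E = rng.normal(scale=0.5, size=(len(vocab), embed_dim))  # word embeddings
W_xh = rng.normal(scale=0.5, size=(hidden_size, embed_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
w_out = rng.normal(scale=0.5, size=hidden_size)
b_out = 0.0

def predict_positive_probability(words):
    """Read the review word by word, then classify from the final hidden state."""
    h = np.zeros(hidden_size)
    for word in words:
        x = E[vocab[word]]                      # look up the word's embedding
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # update the hidden state
    logit = w_out @ h + b_out
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid -> P(positive)

p = predict_positive_probability(["the", "movie", "was", "great"])
print(0.0 < p < 1.0)  # True
```

Only the last hidden state is used for the prediction, so it has to summarize the whole review.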

---

## 6. Training RNNs – Backpropagation Through Time (BPTT)

* Unfold the RNN over time steps into a long feedforward network.
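* Compute the gradient contribution of each time step and sum them (averaging only rescales the result), because the same weights are reused at every step.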
* Update weights accordingly.
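To see these steps in action, here is a minimal sketch of BPTT for a scalar RNN, $h_t = \tanh(w\,h_{t-1} + u\,x_t)$, with a squared-error loss on the final hidden state only. The model, the loss, and all numbers are assumptions chosen for brevity. The sketch sums the per-step contributions to the shared weight $w$ and checks the result against a finite-difference estimate.

```python
import numpy as np

def forward(w, u, xs):
    """Return the hidden states h_0..h_T of a scalar RNN with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grad_w(w, u, xs, target):
    """Gradient of the loss w.r.t. the shared recurrent weight w."""
    hs = forward(w, u, xs)
    grad_w = 0.0
    dL_dh = hs[-1] - target                    # dL/dh_T
    for t in range(len(xs), 0, -1):            # walk backwards through time
        dL_da = dL_dh * (1.0 - hs[t] ** 2)     # back through tanh
        grad_w += dL_da * hs[t - 1]            # contribution of step t
        dL_dh = dL_da * w                      # pass gradient on to h_{t-1}
    return grad_w

xs = [0.5, -0.3, 0.8, 0.1]
w, u, target = 0.9, 0.7, 0.2

analytic = bptt_grad_w(w, u, xs, target)
eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True
```

The repeated multiplication by `w` and by the tanh derivative in the backward loop is what makes gradients shrink or grow over long sequences.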

---

## 7. Problems with Vanilla RNNs

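* **Vanishing gradients:** gradients shrink exponentially as they are propagated back through time, so early inputs barely influence learning (the network "forgets" them).
* **Exploding gradients:** gradients grow exponentially, causing unstable training.
* Which of the two occurs depends largely on the spectral norm (largest singular value) of the recurrent weight matrix $W$: below 1, gradients tend to vanish; above 1, they can explode.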
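The effect of the spectral norm can be illustrated with a simplified experiment. The sketch below assumes a linear RNN (the activation derivative is ignored) and builds $W$ as a scaled orthogonal matrix so that every singular value equals `scale`; both are simplifications made for this example. Backpropagating through one step multiplies the gradient by $W^\top$.

```python
import numpy as np

def backprop_norm(scale, steps=50, size=8, seed=0):
    """Norm of a unit gradient after `steps` multiplications by W^T."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(size, size)))  # random orthogonal matrix
    W = scale * Q                          # all singular values equal `scale`
    grad = np.ones(size) / np.sqrt(size)   # unit-norm starting gradient
    for _ in range(steps):
        grad = W.T @ grad                  # one step back in time (linear RNN)
    return np.linalg.norm(grad)

small = backprop_norm(0.5)  # roughly 0.5**50: vanishes
large = backprop_norm(1.5)  # roughly 1.5**50: explodes
print(small < 1e-12, large > 1e6)  # True True
```

With a tanh activation the derivative is at most 1, which would shrink the gradients further, so the vanishing case is the more common one in practice.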

*Solution:* Use **LSTM** networks to handle long-term dependencies better.

---

## 8. Long Short-Term Memory Networks (LSTMs)

* Introduced by Hochreiter & Schmidhuber (1997).
* Key idea: use a **cell state** $C_t$ that carries long-term memory through the network.
* Use **gates** (sigmoid units) to control flow of information:

### Gates in LSTM:

* **Forget gate $f_t$:** Decides what information to throw away from the cell state.
* **Input gate $i_t$:** Decides what new information to add to the cell state.
* **Output gate $o_t$:** Decides what part of the cell state to output.

*Why sigmoid?* It outputs values between 0 and 1, which act as "filters" or "scales" for information flow.

*Why tanh for the candidate values?* It keeps the information added to the cell state in the range -1 to 1, so the cell state can be both increased and decreased.
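Put together, one LSTM time step can be sketched as below. This follows the commonly used formulation in which each gate sees the concatenation of the previous hidden state and the current input. The parameter names (`W_f`, `W_i`, `W_o`, `W_c` and their biases), the sizes, and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x])                        # gates see [h(t-1), x(t)]
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate values in (-1, 1)
    c = f * c_prev + i * c_tilde                           # keep some old, add some new
    h = o * np.tanh(c)                                     # expose part of the cell state
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5
params = {}
for name in ("f", "i", "o", "c"):
    params[f"W_{name}"] = rng.normal(scale=0.3, size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(4, input_size)):  # toy sequence of 4 inputs
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)  # (5,) (5,)
```

The sigmoid gates multiply element-wise, so a gate value near 0 blocks a component and a value near 1 lets it through.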

---

### Cell State – “Constant Error Carousel”

* The cell state is updated additively, so gradients can flow through many time steps with little attenuation, which mitigates the vanishing gradient problem.
* How long information survives depends on the forget gate: if $f_t$ stays close to 1 for many steps, old information is preserved; if it is close to 0, information is quickly forgotten.
* Very long sequences can still be difficult, which motivates the use of **attention mechanisms**.
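To make the role of the forget gate concrete, here is a rough worked calculation. It assumes the standard LSTM cell update (the candidate symbol $\tilde{C}_t$ and the update equation are not spelled out in these notes) and looks only at the direct path through the cell state, treating the gate values as constants:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad\Rightarrow\quad \frac{\partial C_T}{\partial C_0} \approx \prod_{t=1}^{T} f_t$$

With $f_t = 0.99$ at every step, the factor after 100 steps is $0.99^{100} \approx 0.37$, so the signal survives. With $f_t = 0.5$ it is $0.5^{100} \approx 8 \times 10^{-31}$, which is effectively zero.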

---

## 9. Sequence-to-Sequence (Seq2seq) Models

* Popular for tasks like machine translation and multi-step forecasting.
* Consist of two parts:

### Encoder

* Encodes input sequence into a hidden representation.
* Uses word embeddings + RNN/LSTM layers.
* The initial hidden state is commonly set to zeros (random initialization is also used).

### Decoder

* Generates output sequence from the encoder’s final hidden state.
* Uses previous outputs as inputs for next steps.

*Example:* Translate an English sentence into French, generating the output one word at a time.
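The encoder-decoder flow can be sketched with two small vanilla RNNs. This is a toy under several assumptions: the vocabulary sizes, the `SOS`/`EOS` token ids, greedy decoding, the zero initial encoder state, and the random untrained weights are all invented for illustration, so the "translation" it prints is meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab, tgt_vocab = 6, 5       # hypothetical vocabulary sizes
embed_dim, hidden_size = 4, 8
SOS, EOS = 0, 1                   # assumed start/end token ids (target side)

def init(*shape):
    return rng.normal(scale=0.3, size=shape)

# Untrained random parameters; a real model would learn these.
enc_E, enc_Wx, enc_Wh = init(src_vocab, embed_dim), init(hidden_size, embed_dim), init(hidden_size, hidden_size)
dec_E, dec_Wx, dec_Wh = init(tgt_vocab, embed_dim), init(hidden_size, embed_dim), init(hidden_size, hidden_size)
dec_Wout = init(tgt_vocab, hidden_size)

def encode(src_ids):
    """Compress the whole input sequence into the final hidden state."""
    h = np.zeros(hidden_size)  # initial state set to zeros in this sketch
    for idx in src_ids:
        h = np.tanh(enc_Wx @ enc_E[idx] + enc_Wh @ h)
    return h

def decode(h, max_len=10):
    """Greedy decoding: each predicted token becomes the next input."""
    token, output = SOS, []
    for _ in range(max_len):
        h = np.tanh(dec_Wx @ dec_E[token] + dec_Wh @ h)
        token = int(np.argmax(dec_Wout @ h))  # pick the highest-scoring token
        if token == EOS:
            break
        output.append(token)
    return output

translation = decode(encode([2, 4, 3, 5]))  # toy source sentence as token ids
print(translation)
```

The decoder starts from the encoder's final hidden state, so that single vector is the only channel between input and output in this basic design.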

---

# Summary of Key Points

| Topic                         | Main Idea                                          | Example                         |
| ----------------------------- | -------------------------------------------------- | ------------------------------- |
| Time Series Forecasting       | Data has order & time dependency                   | Predict stock price over time   |
| Hopfield Network              | Associative memory, stable states                  | Recall handwritten digits       |
| Vanilla RNN                   | Hidden state remembers short past                  | Sentiment classification        |
| BPTT                          | Training method unfolding RNN in time              | Backprop through time steps     |
| Vanishing/Exploding Gradients | Problems in training vanilla RNNs                  | Forgetting or unstable training |
| LSTM                          | Uses gates & cell state to remember long-term info | Language modeling, speech       |
| Seq2seq Model                 | Encoder-decoder for sequence translation           | Machine translation             |

---

If you want, I can also make a diagram summary or help with practice questions! Would that help?
