# Encoding, Decoding, and Learning: The Math of Seq2Seq Translation







In this post, you’ll learn the **mathematical foundations of sequence-to-sequence models** for machine translation 🌍. We’ll focus on a type of neural network called the **RNN Encoder–Decoder**, which is made up of two parts:

- 🧠 An **encoder RNN** that takes a variable-length input sequence and turns it into a fixed-length vector.
- 🧾 A **decoder RNN** that takes this vector and generates a variable-length output sequence.

The model is trained to **predict the target sequence given the input** 🎯 by learning the conditional probability of the output sequence given the input sequence.

As an example, we’ll see how this model can be trained to translate English phrases into French 🇬🇧➡️🇫🇷.

:::{tip} Preliminary: Recurrent Neural Networks

![](../images/RNN.png)
**Figure 1**: RNN (Image by the author).




## Input $x_t$
In a Recurrent Neural Network (RNN), the input at time step $t$, denoted as $x_t$, represents the data fed into the network at that point in the sequence. It is typically a vector (e.g., a word embedding) in $\mathbb{R}^d$, where $d$ is the embedding dimension.



##### Example: A 4-Word Sentence

Consider the sentence:

$$
\text{``Cryptocurrency is the future''}
$$

This sequence has length $m = 4$. The corresponding inputs are:

$$
\begin{aligned}
x_1 &= \text{embedding}(\text{``Cryptocurrency''}) \\
x_2 &= \text{embedding}(\text{``is''}) \\
x_3 &= \text{embedding}(\text{``the''}) \\
x_4 &= \text{embedding}(\text{``future''})
\end{aligned}
$$
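To make the lookup concrete, here is a minimal NumPy sketch of turning the four words into input vectors. The vocabulary indices, the embedding dimension `d = 5`, and the randomly initialized table `E` are all assumptions for illustration; in a real model the table would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding size (not from the post).
vocab = {"Cryptocurrency": 0, "is": 1, "the": 2, "future": 3}
d = 5  # embedding dimension

# Embedding table: one row of length d per vocabulary word.
# Random here; a trained model would learn these values.
E = rng.normal(size=(len(vocab), d))

sentence = ["Cryptocurrency", "is", "the", "future"]
token_ids = [vocab[word] for word in sentence]

# Row t-1 of xs plays the role of x_t, so xs has shape (m, d).
xs = E[token_ids]
m = xs.shape[0]
```

Each row of `xs` is one $x_t \in \mathbb{R}^d$, and `m` matches the sequence length of 4.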

## Hidden State $h_t$

The hidden state serves as the network's internal memory, encoding contextual information from the sequence up to time $t$, which is what lets the network capture temporal dependencies. At each time step, the hidden state $h_t$ is updated based on the current input $x_t$ and the previous hidden state $h_{t-1}$:

$$
h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
$$

Here:
- $W_{xh}$ and $W_{hh}$ are weight matrices,
- $b_h$ is a bias vector,
- $f$ is a nonlinear activation function such as $\tanh$ or ReLU.
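The recurrence above can be sketched in a few lines of NumPy. This is only an illustration under assumed shapes: `d` is the input dimension, `n` is the hidden dimension, the weights are small random values rather than trained ones, and `rnn_step` is a helper name made up for this example.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h, f=np.tanh):
    """One recurrence step: h_t = f(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return f(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
d, n, m = 5, 3, 4  # assumed sizes: input dim, hidden dim, sequence length

# Untrained parameters, chosen only so the shapes line up.
W_hh = rng.normal(scale=0.1, size=(n, n))
W_xh = rng.normal(scale=0.1, size=(n, d))
b_h = np.zeros(n)

xs = rng.normal(size=(m, d))  # stand-in for the inputs x_1, ..., x_m
h = np.zeros(n)               # starting state, before any input is seen

hidden_states = []
for x_t in xs:
    # The same weights are reused at every time step.
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)  # shape (m, n)
```

Note that $W_{hh}$ must be $n \times n$ and $W_{xh}$ must be $n \times d$ for the sum inside $f$ to be well defined.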

## Initial Hidden State $h_0$

The initial hidden state $h_0$ represents the starting memory of the RNN before any input is processed. It is typically initialized in one of the following ways:

- As a zero vector: $h_0 = \mathbf{0} \in \mathbb{R}^n$, where $n$ is the dimensionality of the hidden state.

- With small random values (e.g., sampled from a normal distribution): $h_0 \sim \mathcal{N}(0, \sigma^2 I)$.

- As a learned parameter:  
  $h_0$ can also be treated as a trainable vector that is learned during training, allowing the model to adapt its initial memory based on the data.

The choice depends on the specific task, model design, and desired behavior at the start of the sequence.
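As a rough sketch, the three options might look like this in NumPy. The hidden size `n = 3`, the value of `sigma`, the learning rate, and the gradient `grad_h0` are all made up for illustration; in practice the gradient would come from backpropagation through the network.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3  # assumed hidden size

# Option 1: zero vector.
h0_zero = np.zeros(n)

# Option 2: small random values drawn from N(0, sigma^2 I).
sigma = 0.01  # assumed standard deviation
h0_random = rng.normal(loc=0.0, scale=sigma, size=n)

# Option 3: learned parameter. h_0 is stored like any other weight
# and nudged by gradient descent. The gradient below is invented
# purely to show one update step.
h0_learned = np.zeros(n)
grad_h0 = np.array([0.5, -0.2, 0.1])  # hypothetical gradient of the loss w.r.t. h_0
learning_rate = 0.1
h0_learned = h0_learned - learning_rate * grad_h0
```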


## Output $y_t$

The output at time step $t$ is computed from the hidden state:

$$
y_t = g(W_{hy} h_t + b_y)
$$

Where:
- $W_{hy}$ is the output weight matrix,
- $b_y$ is a bias vector,
- $g$ is an activation function such as softmax or sigmoid, depending on the task.
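Here is a small NumPy sketch of the output step, taking $g$ to be softmax so that $y_t$ is a probability distribution over `k` classes (for translation, `k` would be the target vocabulary size). The sizes, the random weights, and the helper name `rnn_output` are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)  # shifting by the max avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

def rnn_output(h_t, W_hy, b_y, g=softmax):
    """Output at one time step: y_t = g(W_hy h_t + b_y)."""
    return g(W_hy @ h_t + b_y)

rng = np.random.default_rng(0)
n, k = 3, 6  # assumed sizes: hidden dim, number of output classes

# Untrained parameters, chosen only so the shapes line up.
W_hy = rng.normal(scale=0.1, size=(k, n))
b_y = np.zeros(k)

h_t = np.tanh(rng.normal(size=n))  # stand-in for a hidden state
y_t = rnn_output(h_t, W_hy, b_y)   # shape (k,), entries sum to 1
```

Swapping `g` for a sigmoid would instead give independent per-class scores in $(0, 1)$, which is the usual choice for binary or multi-label outputs.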



:::