# Encoding, Decoding, and Learning: The Math of Sequence-to-Sequence Translation





In this post, you'll explore the **mathematical foundations of sequence-to-sequence (seq2seq) models** for machine translation. We'll focus on the **RNN Encoder–Decoder architecture**, which consists of two main components: an **encoder RNN** and a **decoder RNN**. As an example, we'll see how this model can be trained to translate English phrases into French 🇬🇧➡️🇫🇷.


## Definition

#### $\textcolor{green}{\text{Sequence-to-Sequence (Seq2Seq) Translation }}$

Sequence-to-Sequence (Seq2Seq) is a deep learning framework designed to convert an input sequence (e.g., a sentence in English) into an output sequence (e.g., its French translation). It consists of two main components:

- **$\textcolor{green}{\text{ Encoder}}$** – Processes the input sequence and compresses it into a $\textcolor{green}{\text{context vector}}$, a fixed-length representation capturing the input's semantics.

- **$\textcolor{green}{\text{ Decoder}}$** – Generates the output sequence $\textcolor{green}{\text{step-by-step}}$, conditioned on the context vector from the encoder.



#### $\textcolor{green}{\text{Key Features }}$

- **Handles variable-length input and output** sequences (e.g., translating sentences of different lengths).
- **Supports various architectures**: RNNs, LSTMs, GRUs, and more recently, **Transformers** (state-of-the-art).
- **Trained end-to-end** using *teacher forcing*, i.e., during training the decoder is fed the true previous token rather than its own prediction.



## Preliminaries

:::{tip}  Recurrent Neural Networks

![](../images/RNN.png)
**Figure 1**: RNN (Image by the author).




### The input $\textcolor{green}{x_t}$
In a Recurrent Neural Network (RNN), the input at time step $t$, denoted as $\textcolor{green}{x_t}$, represents the data fed into the network at that point in the sequence. It is typically a vector (e.g., a word embedding) in $\mathbb{R}^d$, where $d$ is the embedding dimension.


As an example, let's consider the 4-word sentence: $\textcolor{green}{\text{``Cryptocurrency is the future''}}$. This sequence has length $m = 4$. The corresponding inputs are:

$$
\textcolor{green}{x_1} = \text{embedding}(\textcolor{green}{\text{``Cryptocurrency''}});\quad  
\textcolor{green}{x_2} = \text{embedding}(\textcolor{green}{\text{``is''}}) ; \quad  
\textcolor{green}{x_3} = \text{embedding}(\textcolor{green}{\text{``the''}}); \quad   
\textcolor{green}{x_4} = \text{embedding}(\textcolor{green}{\text{``future''}})
$$
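
The mapping from words to input vectors can be sketched in a few lines of plain Python. This is only an illustration: the embedding dimension `d = 3` and the randomly drawn vectors are assumptions made for readability, whereas in a trained model the embedding table is a learned parameter.

```python
import random

random.seed(0)

sentence = ["Cryptocurrency", "is", "the", "future"]
d = 3  # embedding dimension (hypothetical, kept tiny for readability)

# Toy embedding table: one random vector in R^d per word.
# In a real model these vectors are learned during training.
embedding = {word: [random.gauss(0.0, 1.0) for _ in range(d)] for word in sentence}

# x_1, ..., x_m: the inputs fed to the RNN, one vector per time step.
xs = [embedding[word] for word in sentence]
m = len(xs)  # sequence length
```

Here `xs[0]` plays the role of $x_1$, `xs[1]` of $x_2$, and so on.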

### Hidden State $\textcolor{red}{h_t}$

The hidden state serves as the network's internal memory: it encodes contextual information from the sequence up to time $t$ and is what lets the network capture temporal dependencies. At each time step, the hidden state $\textcolor{red}{h_t}$ is updated based on the current input $\textcolor{green}{x_t}$ and the previous hidden state $\textcolor{red}{h_{t-1}}$:

$$
\textcolor{red}{h_t} = f(W_{hh} \textcolor{red}{h_{t-1}} + W_{xh} \textcolor{green}{x_t} + b_h)
$$

Here $W_{xh}$ and $W_{hh}$ are weight matrices, $b_h$ is a bias vector, and $f$ is a nonlinear activation function such as $\tanh$ or ReLU.
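
As a minimal sketch of this recurrence, the function below applies the update once, using $f = \tanh$. The weight values, the hidden size $n = 2$, and the input size $d = 3$ are made-up toy numbers chosen only so the arithmetic is easy to follow.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        recurrent = sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        from_input = sum(W_xh[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(recurrent + from_input + b_h[i]))
    return h_t

# Hypothetical toy parameters: n = 2 hidden units, d = 3 input dimensions.
W_hh = [[0.5, -0.1], [0.2, 0.3]]
W_xh = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]
b_h = [0.0, 0.1]

# Unroll the recurrence over a short sequence, starting from h_0 = 0.
h = [0.0, 0.0]
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
```

The same weights are reused at every time step; only `h` changes as the sequence is consumed.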

### Initial Hidden State $\textcolor{red}{h_0}$

The initial hidden state $\textcolor{red}{h_0}$ represents the starting memory of the RNN before any input is processed. It is typically initialized in one of the following ways:

- As a zero vector:  
  $
  \textcolor{red}{h_0} = \mathbf{0} \in \mathbb{R}^n
  $
  where $n$ is the dimensionality of the hidden state.

- With small random values (e.g., sampled from a normal distribution):  
  $
  \textcolor{red}{h_0} \sim \mathcal{N}(0, \sigma^2 I)
  $

- As a learned parameter:  
  $\textcolor{red}{h_0}$ can also be treated as a trainable vector that is learned during training, allowing the model to adapt its initial memory based on the data.

The choice depends on the specific task, model design, and desired behavior at the start of the sequence.
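
The three options above could look roughly like this in code. The hidden size `n`, the standard deviation `sigma`, and the `params` dictionary are illustrative assumptions; in particular, the "learned" variant only shows where such a vector would live, since the optimizer that updates it is out of scope here.

```python
import random

random.seed(0)
n = 4         # hidden-state dimension (hypothetical)
sigma = 0.01  # standard deviation for random initialization (hypothetical)

# Option 1: zero vector.
h0_zero = [0.0] * n

# Option 2: small random values, h_0 ~ N(0, sigma^2 I).
h0_random = [random.gauss(0.0, sigma) for _ in range(n)]

# Option 3: learned parameter. h_0 is stored alongside the weights,
# starts from some initial value (zeros here), and would be updated
# by the optimizer during training like any other parameter.
params = {"h0": [0.0] * n}
```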


### Output $\textcolor{blue}{y_t}$

The output at time step $t$ is computed from the hidden state:

$$
\textcolor{blue}{y_t} = g(W_{hy} \textcolor{red}{h_t} + b_y)
$$

where $W_{hy}$ is the output weight matrix, $b_y$ is a bias vector and $g$ is an activation function such as softmax or sigmoid, depending on the task.
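
For a classification-style output, $g$ is usually the softmax. The sketch below computes $y_t$ from a hidden state under that assumption; the 3-word vocabulary and the values of `W_hy`, `b_y`, and `h_t` are hypothetical toy numbers.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_output(h_t, W_hy, b_y):
    """y_t = softmax(W_hy h_t + b_y)."""
    logits = [
        sum(W_hy[i][j] * h_t[j] for j in range(len(h_t))) + b_y[i]
        for i in range(len(b_y))
    ]
    return softmax(logits)

# Hypothetical toy setup: n = 2 hidden units, vocabulary of 3 tokens.
W_hy = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_y = [0.0, 0.0, 0.0]
y_t = rnn_output([0.2, -0.4], W_hy, b_y)
```

The entries of `y_t` are non-negative and sum to 1, so they can be read as a probability distribution over the vocabulary.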



:::

:::{tip}  Sequence Modeling vs. Sequence Prediction




- **Sequence modeling** refers to learning the joint probability distribution of a sequence:
  
  $$
  p(\textcolor{green}{x_1}, \textcolor{green}{x_2}, \dots, \textcolor{green}{x_T}) = \prod_{t=1}^T p(\textcolor{green}{x_t} \mid \textcolor{green}{x_1}, \dots, \textcolor{green}{x_{t-1}})
  $$

  In this context, the model learns to estimate each conditional distribution $p(\textcolor{green}{x_t} \mid  \textcolor{green}{x_{<t}} )$. The output $\textcolor{blue}{y_t}$ is interpreted as:

  $$
  \textcolor{blue}{y_t} \approx p(\textcolor{green}{x_t} \mid  \textcolor{green}{x_{<t}} )
  $$

  That is, $\textcolor{blue}{y_t}$ is a **model-generated approximation** of the next-token distribution, based on past inputs.

- **Sequence prediction**, on the other hand, is a downstream task that uses the output $\textcolor{blue}{y_t}$ to **predict** the next token:

  $$
  \textcolor{green}{\hat{x}_t} = \arg\max_{\textcolor{blue}{j}} \textcolor{blue}{y_{t,j}} 
  $$

  Here, $\textcolor{blue}{y_t}$ is used to choose the most likely next symbol (i.e., prediction), but it still represents a distribution over the vocabulary.
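
The difference between the two views can be made concrete with a small example. The distributions in `ys` are invented numbers standing in for model outputs $y_t$ over a 3-token vocabulary; nothing here comes from a trained model.

```python
import math

# Hypothetical model outputs for a 3-step sequence: ys[t] stands in for
# y_t, a probability distribution over token ids 0, 1, 2.
ys = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
]
tokens = [0, 1, 1]  # observed token ids x_1, x_2, x_3

# Sequence modeling: probability of the observed sequence via the chain
# rule, accumulated in log space to avoid underflow on long sequences.
log_p = sum(math.log(y_t[x_t]) for y_t, x_t in zip(ys, tokens))
p = math.exp(log_p)  # 0.7 * 0.6 * 0.3

# Sequence prediction: pick the most likely token at each step (argmax).
predictions = [y_t.index(max(y_t)) for y_t in ys]
```

Note that the observed third token (id 1) is not the argmax at that step (id 2): modeling scores the sequence that actually occurred, while prediction commits to a single most likely token.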


:::

## RNN Encoder–Decoder