# Attention-based text generation

### Introduction

**Why is Text Generation So Hard for AI?**

Text generation is one of the toughest challenges in AI. Unlike humans, who can easily recall information, pick up on context, and make inferences, machines struggle to stay coherent and consistent over long sequences. Keeping track of the bigger picture while generating meaningful text is a difficult task for AI.

**RNNs: A Game-Changer (But Not Perfect)**

One of the key breakthroughs in text generation was the introduction of Recurrent Neural Networks (RNNs). They revolutionized Natural Language Processing (NLP) by making it possible to handle sequence-based tasks like **machine translation, text summarization, and even creative writing**.

**Why are RNNs useful?**  

- They **process words sequentially**, meaning they understand how words relate to each other in order.  
- They **capture dependencies** between words, which helps maintain logical flow.  

Despite their advantages, RNNs face serious limitations. The biggest issue is the vanishing gradient problem, which makes it hard for the network to remember information from earlier in a long sequence. This means the longer the text, the harder it is for the model to stay relevant and coherent. Even improved versions like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) struggle with long-range dependencies.

**Attention Mechanisms: The Game-Changer**  

To solve this issue, attention mechanisms were introduced. These allow models to dynamically focus on the most relevant parts of the input sequence instead of treating all words equally. This breakthrough was a major milestone in text generation, leading to much better results.

Some key papers that introduced attention-based text generation include:  
- [Show, Attend and Tell](https://arxiv.org/pdf/1502.03044) – A model that applied attention to image captioning, allowing it to describe images in a more natural way.  
- Data-to-Text Generation with Attention Recurrent Units – A system that used attention to generate summaries of NBA games from game statistics.

Thanks to attention mechanisms, text generation models are now far more powerful, making them essential for applications like chatbots, summarization, and machine translation.

---

### What Will We Cover?
In this notebook, we will:
- Introduce attention mechanisms.
- Explore different types of input data for text generation:
  - Text: Machine translation and language modeling.
  - Images: Generating captions using vision-based models (e.g., ["Show, Attend and Tell"](https://arxiv.org/pdf/1502.03044)).

Let's get started!


In [42]:
!git clone https://github.com/javrit/notebook_tuto_captioning.git

fatal: destination path 'notebook_tuto_captioning' already exists and is not an empty directory.


## A Dive into Attention Mechanisms

### Introduction
Attention is a technique in machine learning that helps models figure out which parts of a sequence are the most important. In natural language processing (NLP), this means assigning different weights to words in a sentence based on their relevance. More broadly, attention helps map connections between different parts of a sequence—whether it’s a few words or millions of tokens.

This idea comes from how humans pay attention. Instead of treating all words equally, attention lets a model focus on the right parts at the right time. It was created to fix a big problem with Recurrent Neural Networks (RNNs)—they tend to focus too much on recent information while forgetting earlier details. Unlike RNNs, attention allows models to access any word in a sentence directly, making it much better at handling long-range dependencies.

In simple terms, attention helps a model zoom in on the most important parts of an input. Whether it's a sentence, an image, or even a game summary, attention figures out what matters most. For example, in machine translation, attention assigns different importance levels to words, making sure the right ones are considered at each step of translation.

![Attention mechanism](img/attention.png)
*https://datascience.stackexchange.com/questions/66913/how-does-attention-mechanism-learn*

Let's take a quick look at how it works mathematically.

### Mathematical Explanation of Attention 

Attention mechanisms compute a **weighted sum of values** based on the similarity between **queries** and **keys**, allowing a model to dynamically focus on the most relevant parts of an input sequence. This process consists of three main steps:

---

##### **1. Score Calculation (Alignment Scores)**
Each token in the input sequence is represented by three vectors:  
- **Query (Q)** – Represents the current token looking for relevant information.  
- **Key (K)** – Represents all tokens in the input sequence, serving as references.  
- **Value (V)** – Contains the actual token representations that will be combined to form the final output.  

The attention mechanism first computes a **score** for each token pair to determine its relevance. In the **encoder-decoder attention** setting, this score measures how well an input token $h_i$ aligns with a decoder state $s_t$:

$$
    e_{ti} = f(s_t, h_i)
$$

where $f$ is a scoring function, often defined as:  
- **Dot-product**: $e_{ti} = s_t^T h_i$  
- **Additive (Bahdanau)**: $e_{ti} = v^T \tanh(W_s s_t + W_h h_i)$  
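
The two scoring functions above can be sketched in a few lines of PyTorch. This is only an illustration: the sizes, the name `attn_dim`, and the randomly initialised `W_s`, `W_h`, and `v` are assumptions made for the example, not values taken from a trained model.

```python
import torch

torch.manual_seed(0)
hidden_dim, attn_dim, seq_len = 8, 6, 5   # illustrative sizes (assumptions)

s_t = torch.randn(hidden_dim)             # decoder state s_t
H = torch.randn(seq_len, hidden_dim)      # encoder states h_1..h_n, one per row

# Dot-product score: e_ti = s_t^T h_i, computed for every i at once
dot_scores = H @ s_t                      # (seq_len,)

# Additive (Bahdanau) score: e_ti = v^T tanh(W_s s_t + W_h h_i)
W_s = torch.randn(attn_dim, hidden_dim)
W_h = torch.randn(attn_dim, hidden_dim)
v = torch.randn(attn_dim)
additive_scores = torch.tanh(s_t @ W_s.T + H @ W_h.T) @ v   # (seq_len,)

print(dot_scores.shape, additive_scores.shape)
```

The dot-product variant has no parameters of its own but requires $s_t$ and $h_i$ to have the same dimension, while the additive variant learns `W_s`, `W_h`, and `v` and can compare vectors of different sizes.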

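In **self-attention (used in Transformers)**, the score is computed using **dot-product similarity** between queries and keys:

$$
    e_{ti} = q_t^T k_i, \qquad \text{or in matrix form} \qquad E = Q K^T
$$

where $Q = X W_Q$ and $K = X W_K$ are obtained from the input sequence $X$, and $q_t$ and $k_i$ denote the $t$-th row of $Q$ and the $i$-th row of $K$ (written as column vectors).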

---

##### **2. Weight Assignment (Softmax Normalization)**
The raw alignment scores are then normalized using the **softmax function**, ensuring that they sum to 1 and act as attention weights:

$$
    \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j} \exp(e_{tj})}
$$

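In **self-attention**, the scores are first divided by a scaling factor to stabilize training, and the softmax is applied to each row of the result, giving a matrix $\alpha$ whose entries are the weights $\alpha_{ti}$:

$$
    \alpha = \text{softmax} \left(\frac{Q K^T}{\sqrt{d_k}} \right)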
$$

where $d_k$ is the dimension of the key vectors.
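
To see why the scaling helps, here is a small experiment, assuming queries and keys with independent unit-variance entries (an idealised assumption; real learned vectors will differ). Under that assumption the raw dot products have a standard deviation close to $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1.

```python
import math
import torch

torch.manual_seed(0)
d_k = 512
Q = torch.randn(1000, d_k)   # random queries with unit-variance entries
K = torch.randn(1000, d_k)   # random keys with unit-variance entries

raw = Q @ K.T                      # unscaled scores, std close to sqrt(d_k)
scaled = raw / math.sqrt(d_k)      # scaled scores, std close to 1

print(f"std of raw scores:    {raw.std().item():.2f}")
print(f"std of scaled scores: {scaled.std().item():.2f}")

# Large scores push the softmax towards a one-hot distribution
print("max weight (raw):   ", torch.softmax(raw[0], dim=0).max().item())
print("max weight (scaled):", torch.softmax(scaled[0], dim=0).max().item())
```

When the softmax is nearly one-hot, its gradients become very small, which is the training instability the scaling factor is meant to avoid.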

---

##### **3. Context Vector Computation**
Finally, the attention mechanism computes a **weighted sum of values** to create a **context vector**:

$$
    c_t = \sum_{i} \alpha_{ti} h_i
$$

In self-attention, this is expressed as:

$$
    \text{Attention}(Q, K, V) = \alpha V
$$

where $V = X W_V$ represents the values that encode the actual input token information.
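
Putting the three steps together, a minimal sketch of single-head scaled dot-product self-attention could look like this. The function name `self_attention`, the sizes, and the random projection matrices are assumptions made for illustration; a real Transformer layer adds multiple heads, masking, and learned parameters.

```python
import math
import torch

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)          # step 1: scores, (seq_len, seq_len)
    alpha = torch.softmax(scores, dim=-1)      # step 2: weights, each row sums to 1
    return alpha @ V, alpha                    # step 3: weighted sum of values

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8                # illustrative sizes (assumptions)
X = torch.randn(seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))

out, alpha = self_attention(X, W_Q, W_K, W_V)
print(out.shape)          # one output vector per input token
print(alpha.sum(dim=-1))  # every row of weights sums to 1
```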

---

#### **Summary**
<table>
    <tr>
        <td><img src="img/attentionkey.png" width="500"></td>
        <td><img src="img/selfqttention.png" width="500"></td>
    </tr>
</table>

*https://www.linkedin.com/feed/update/urn:li:activity:7113403245703663616/*
- The **score function** determines how relevant each token is to the current step.
- **Softmax assigns attention weights**, controlling focus on different tokens.
- **A weighted sum of values produces the final attention output**, dynamically adapting to context.

This mechanism enables models to effectively capture long-range dependencies.

OK! Now that we have had a short reminder of how attention works, let's put it into practice!

## Generating text given a context sequence of text

In this section, we will implement a model that generates text from a context sequence of text. It is a deliberately basic example, meant to show how the pieces fit together. The goal is to generate replies to common French sentences (for example, "salut ça va ?").

In [55]:
import torch
import torch.nn as nn
import torch.optim as optim
import random

# Set the device (use GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)


Using device: cpu


### Data Preparation

We create a simple character-level vocabulary which includes:
- **Special tokens:** `<pad>`, `<sos>` (start-of-sequence), and `<eos>` (end-of-sequence)
- **Letters:** A space followed by the letters `a` to `z` (all in lowercase)

Next, we define two functions:
- `sentence_to_indices`: Converts a sentence into a list of indices by mapping each character to its index (adding `<sos>` at the beginning and `<eos>` at the end).
- `indices_to_sentence`: Reconstructs a sentence from a sequence of indices while ignoring the special tokens.

Finally, we create a small dataset consisting of pairs (context → target sentence).


In [56]:
# Define a simple character-level vocabulary
special_tokens = ["<pad>", "<sos>", "<eos>"]
letters = list(" abcdefghijklmnopqrstuvwxyz")  # Note: space is included
vocab = special_tokens + letters
vocab_size = len(vocab)
print("Vocabulary Size:", vocab_size)

# Create mapping dictionaries
word2index = {word: idx for idx, word in enumerate(vocab)}
index2word = {idx: word for idx, word in enumerate(vocab)}

# Function to convert a sentence to a list of indices
def sentence_to_indices(sentence):
    sentence = sentence.lower()
    indices = [word2index["<sos>"]]
    for ch in sentence:
        if ch in word2index:  # Keep only known characters
            indices.append(word2index[ch])
    indices.append(word2index["<eos>"])
    return indices

# Function to convert a sequence of indices back into a sentence
def indices_to_sentence(indices):
    words = [index2word[idx] for idx in indices if idx not in [word2index["<sos>"], word2index["<eos>"], word2index["<pad>"]]]
    return "".join(words)

# Example dataset (pairs: context -> target sentence)
data_pairs = [    
    ("Ça fait un bail","Je suis trop content(e) de te voir"),
    ( "Ça faisait longtemps !", "Ça fait plaisir de te voir"),
    ("Comment tu vas depuis la dernière fois ?", "Ça va."),
    ("Tu vas bien ?", "Je vais bien."),
    ("ça va ?", "Je pète la forme"),
    ("Comment tu vas ? ","Bof, on fait avec."),          
    ("Comment ça va ?", "C’est pas la grande forme."),   
    ("La forme ?","Je suis sur les rotules."),
    ("Quoi de neuf ?", "Rien de nouveau."),
    ("Qu’est-ce que tu racontes de beau ?", "La routine."),
    ("Comment s’est passée ta semaine ?", "J’étais débordé(e)."),
    ("Ça va la famille ?", "Tout le monde va bien"),
    ("Il fait beau chez toi ?", "C’est vraiment une belle journée."),
    ("T’as des plans ce weekend ?", "J’ai un repas avec des amis samedi soir."),
    ("T’as entendu la dernière ?", "Oui, c’est complètement fou !"),
    ("T’es allé(e) au cinéma dernièrement ?", "J’ai vu le nouveau Batman au ciné. C’était super long, mais j’ai bien aimé."),
    ("Tu vas prendre des vacances prochainement ?", "Oui, je vais prendre des vacances bientôt."),
    ("On se voit bientôt ?", "À bientôt !"),
    ("Salut !", "Salut !")
]


# Convert the sentences into sequences of indices
data = []
for inp, target in data_pairs:
    input_indices = sentence_to_indices(inp)
    target_indices = sentence_to_indices(target)
    data.append((input_indices, target_indices))


Vocabulary Size: 30
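
One consequence of this vocabulary is worth seeing concretely: characters outside it, such as accented letters and punctuation, are skipped by `sentence_to_indices`. The sketch below rebuilds the same vocabulary and shows the effect on `"Ça va ?"`. The helper `strip_accents` is hypothetical and not part of the notebook; it is one possible way to keep the base letters before encoding.

```python
import unicodedata

special_tokens = ["<pad>", "<sos>", "<eos>"]
vocab = special_tokens + list(" abcdefghijklmnopqrstuvwxyz")
word2index = {ch: idx for idx, ch in enumerate(vocab)}

def sentence_to_indices(sentence):
    # Same logic as the notebook: characters missing from the vocabulary are skipped
    body = [word2index[ch] for ch in sentence.lower() if ch in word2index]
    return [word2index["<sos>"]] + body + [word2index["<eos>"]]

def strip_accents(text):
    # Hypothetical helper: decompose characters (NFD) and drop the combining marks,
    # so "ç" becomes "c" and "é" becomes "e"
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(sentence_to_indices("Ça va ?"))                 # "ç" and "?" are dropped
print(sentence_to_indices(strip_accents("Ça va ?")))  # "c" is kept, "?" is still dropped
```

Without such a step, a target like "Ça va." is learned as "a va", which is why generated answers come out without accents or punctuation.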


### Encoder

The encoder is composed of an embedding layer followed by an LSTM. An embedding layer is a special type of neural network layer that transforms categorical data (like words or characters in text) into dense numerical vectors. As a reminder, LSTM stands for **Long Short-Term Memory**. It is a special type of recurrent neural network (RNN) designed to capture long-term dependencies in sequential data while mitigating the vanishing gradient problem common in traditional RNNs.

It receives a sequence of indices (the context sentence) as input and produces:
- The LSTM outputs for each time step
- The final hidden state and cell state

These states are later used by the decoder.



![lstm](img/LSTM.png) 

*Source : [Show, Attend and Tell](https://arxiv.org/pdf/1502.03044)*

Key characteristics of LSTMs include:

- **Memory Cells:** They maintain a cell state that acts as a memory, carrying relevant information throughout the processing of a sequence.
- **Gating Mechanisms:** LSTMs use gates (input, forget, and output gates) to control the flow of information. These gates decide which information is important to keep, update, or discard over time.
- **Handling Long-Term Dependencies:** By managing information over long sequences, LSTMs are well-suited for tasks such as language modeling, text generation, and time-series prediction.

In our model, the LSTM is used to process the sequence of embedded characters, learning the patterns and dependencies in the input text to generate coherent output.


In [45]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Encoder, self).__init__()
        # The embedding maps each (discrete) token index to a dense vector of fixed size
        # (embedding_dim here) that can capture semantic relationships between tokens
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
    
    def forward(self, input_seq):
        # input_seq shape: (batch, seq_len)
        embedded = self.embedding(input_seq)             # (batch, seq_len, embedding_dim)
        outputs, (hidden, cell) = self.lstm(embedded)        # outputs: (batch, seq_len, hidden_dim)
        return outputs, hidden, cell
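
To make the shapes returned by the encoder concrete, here is a small standalone check using the same two layers. The dummy input of 7 random token indices is an assumption for illustration; the sizes 30, 16, and 32 match the vocabulary size and hyperparameters used in this notebook.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim = 30, 16, 32
embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

input_seq = torch.randint(0, vocab_size, (1, 7))   # dummy batch: 1 sequence of 7 token indices
outputs, (hidden, cell) = lstm(embedding(input_seq))

print(outputs.shape)  # torch.Size([1, 7, 32]) -> one hidden vector per time step
print(hidden.shape)   # torch.Size([1, 1, 32]) -> (num_layers, batch, hidden_dim)
print(cell.shape)     # torch.Size([1, 1, 32])
```

For a single-layer, unidirectional LSTM, the last time step of `outputs` is the same vector as `hidden[0]`. The attention mechanism uses all of `outputs`, not just that final state.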

### Attention-based Decoder

The decoder generates the output sequence character-by-character.  
At each time step, it uses:
- The current input token (embedded)
- A context vector computed using an attention mechanism (dot-product between the decoder hidden state and the encoder outputs)

The context vector is computed by taking a weighted sum of the encoder outputs, where the weights are determined by the attention scores.  
This context is concatenated with the embedding and fed into the LSTM.  
Finally, the LSTM output and the context are combined to produce a distribution over the vocabulary.


In [46]:
class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(AttentionDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Concatenate the embedding and the context vector before feeding into the LSTM
        self.lstm = nn.LSTM(embedding_dim + hidden_dim, hidden_dim, batch_first=True)
        # Output layer combines the LSTM output and the context to produce vocabulary distribution
        self.out = nn.Linear(hidden_dim * 2, vocab_size)
    
    def forward(self, input_token, hidden, cell, encoder_outputs):
        # input_token shape: (batch) the current token index
        embedded = self.embedding(input_token).unsqueeze(1)  # (batch, 1, embedding_dim)
      
        # --- Dot-product Attention Mechanism ---
        # hidden shape: (num_layers, batch, hidden_dim); we use the first layer (shape: (batch, hidden_dim))
        decoder_hidden = hidden[0]           # (batch, hidden_dim)
        # encoder_outputs shape: (batch, seq_len, hidden_dim)
        # Compute attention scores (dot product)
        attn_scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)  # (batch, seq_len)
        attn_weights = torch.softmax(attn_scores, dim=1)       # (batch, seq_len)
      
        # Compute context vector as the weighted sum of encoder outputs
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)  # (batch, 1, hidden_dim)
      
        # Concatenate the embedding and the context
        lstm_input = torch.cat((embedded, context), dim=2)     # (batch, 1, embedding_dim + hidden_dim)
        output, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))  # output: (batch, 1, hidden_dim)
      
        # Combine LSTM output and context for the final prediction
        output = output.squeeze(1)    # (batch, hidden_dim)
        context = context.squeeze(1)  # (batch, hidden_dim)
        combined = torch.cat((output, context), dim=1)  # (batch, 2*hidden_dim)
        output = self.out(combined)   # (batch, vocab_size)
        output = torch.log_softmax(output, dim=1)
        return output, hidden, cell, attn_weights


### Training the Model

We train our model on a very small dataset consisting of a few (context → target) pairs.  


In [47]:
# Hyperparameters
embedding_dim = 16
hidden_dim = 32
num_epochs = 50    # Small dataset 
learning_rate = 0.01

# Instantiate the models
encoder = Encoder(vocab_size, embedding_dim, hidden_dim).to(device)
decoder = AttentionDecoder(vocab_size, embedding_dim, hidden_dim).to(device)

# Loss function and optimizers
criterion = nn.NLLLoss(ignore_index=word2index["<pad>"])
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    # For each data pair (input, target) in our dataset
    for input_indices, target_indices in data:
        # Convert lists of indices to tensors (batch size = 1)
        input_tensor = torch.tensor([input_indices], dtype=torch.long, device=device)
        target_tensor = torch.tensor(target_indices, dtype=torch.long, device=device)
        
        # Zero the gradients
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()
        
        # Pass through the encoder
        encoder_outputs, hidden, cell = encoder(input_tensor)
        
        # --- Pass through the decoder with teacher forcing ---
        loss = 0
        # The first input token for the decoder is always the <sos> token
        decoder_input = torch.tensor([target_indices[0]], device=device)
        # Loop over the target sequence (starting from the second token)
        for t in range(1, len(target_indices)):
            output, hidden, cell, attn_weights = decoder(decoder_input, hidden, cell, encoder_outputs)
            # The target token to predict
            target_token = target_tensor[t].unsqueeze(0)  # shape (1)
            loss += criterion(output, target_token)
            # Teacher forcing: use the actual target as the next input
            decoder_input = target_tensor[t].unsqueeze(0)
        
        loss.backward()
        encoder_optimizer.step()
        decoder_optimizer.step()
        
        # Average the loss over the sequence length
        total_loss += loss.item() / (len(target_indices) - 1)
    
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(data):.4f}")


Epoch 50, Loss: 0.0103
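
A note on the loss: the decoder ends with `log_softmax` and the criterion is `nn.NLLLoss`. Together these compute the same quantity as `nn.CrossEntropyLoss` applied to raw logits. The quick check below uses random logits and an arbitrary target index, both made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(1, 30)          # unnormalised scores over a 30-token vocabulary
target = torch.tensor([5])           # index of the correct next token (arbitrary)

nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), target)
ce = nn.CrossEntropyLoss()(logits, target)
print(nll.item(), ce.item())         # the two values agree up to floating-point error
```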


### Generating Text

The `generate_sentence` function takes a context string as input and generates a reply, up to a maximum number of characters (or until the `<eos>` token is generated).  
It works as follows:
1. The context is passed through the encoder.
2. The decoder is used to generate text character-by-character by selecting at each step the token with the highest probability (greedy decoding).


In [48]:
def generate_sentence(context_sentence, max_length=20):
    """
    Generates a sentence (target sentence) from a given context sentence.
    
    Parameters:
      - context_sentence (str): The input context sentence.
      - max_length (int): Maximum length of the generated sentence.
    
    Returns:
      - A generated sentence (str), built character-by-character.
    """
    with torch.no_grad():
        # Convert the context sentence to indices and create a tensor
        input_indices = sentence_to_indices(context_sentence)
        input_tensor = torch.tensor([input_indices], dtype=torch.long, device=device)
        
        # Pass the context through the encoder
        encoder_outputs, hidden, cell = encoder(input_tensor)
        
        # Initialize the decoder with the <sos> token
        decoder_input = torch.tensor([word2index["<sos>"]], device=device)
        generated_sentence = ""
        
        # Generate characters up to max_length or until <eos> is produced
        for i in range(max_length):
            output, hidden, cell, attn_weights = decoder(decoder_input, hidden, cell, encoder_outputs)
            # Select the token with the highest probability (greedy decoding)
            topv, topi = output.topk(1)
            next_token = topi.item()
            if next_token == word2index["<eos>"]:
                break
            generated_sentence += index2word[next_token]
            # Use the predicted token as the next decoder input
            decoder_input = torch.tensor([next_token], device=device)
    
    return generated_sentence

# --- Test the generation function ---
test_context = "tu vas bien?"  # Example context sentence; feel free to try others
print("Context:", test_context)
print("Generated:", generate_sentence(test_context, max_length=20))

Context: tu vas bien?
Generated: je vais bien


Feel free to try other examples, but the results may be poor because the training dataset is so small. The point here was only to show how an attention mechanism fits into the model.

Now that you know how it works, let's try a more concrete example. Because training can be very expensive, we will use the pre-trained model from this [GitHub repository](https://github.com/ApurbaSengupta/Text-Generation.git). It generates text from a context sentence and was trained on four English novels.

In [49]:
!git clone https://github.com/ApurbaSengupta/Text-Generation.git
!pip install unidecode

fatal: destination path 'Text-Generation' already exists and is not an empty directory.
Defaulting to user installation because normal site-packages is not writeable

The next cell defines the model, which again uses attention. Reading through it and trying to understand each step is a good exercise.

In [50]:
import os
import unidecode
import string
import random
import re
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F
import time, math
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

all_characters = string.printable
n_characters = len(all_characters)
use_cuda = False 

class TextGenerate(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers=1, bi=True):
        super(TextGenerate, self).__init__()
        
        # Define key hyperparameters
        self.input_size = input_size  # Vocabulary size
        self.hidden_size = hidden_size  # LSTM hidden layer size
        self.output_size = output_size  # Output vocabulary size (same as input in text generation)
        self.n_layers = n_layers  # Number of LSTM layers
        self.bi = bi  # Boolean flag for bidirectional LSTM

        # Word embedding layer: maps input indices to dense vector representations
        self.encoder = nn.Embedding(input_size, hidden_size)
        
        # LSTM layer: processes input embeddings into a hidden state
        self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, bidirectional=self.bi)
        
        # Fully connected layer: maps LSTM outputs to the output vocabulary
        if self.bi:
            self.decoder = nn.Linear(hidden_size * 2, output_size)  # For bidirectional LSTM
        else:
            self.decoder = nn.Linear(hidden_size, output_size)  # For unidirectional LSTM

        # Another linear layer to refine the final output
        self.out = nn.Linear(output_size, output_size)

        # Dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.1)

    def forward(self, input, hidden, cell):
        """
        Forward pass through the model.
        
        - Embedding input words
        - Passing through LSTM to generate context-aware representation
        - Applying an attention mechanism
        - Decoding the attention-weighted representation to predict the next word
        """
        
        # ---- Encoder ----
        # Convert input token indices into dense embeddings
        input = self.encoder(input.view(1, -1))  # Shape: (1, hidden_size)
        input = self.dropout(input)  # Apply dropout for regularization

        # Pass embeddings through LSTM
        output, states = self.lstm(input.view(1, 1, -1), (hidden, cell))  # Shape: (1, 1, hidden_size)
        output = output.permute(1, 0, 2)  # Permute to shape: (batch=1, seq_len=1, hidden_size)

        # ---- Attention Mechanism ----
        if self.bi:  # If using bidirectional LSTM
            # Split the output into two parts (one for each direction)
            out1, out2 = output[:, :, :self.hidden_size], output[:, :, self.hidden_size:]

            # Get the last hidden states for both directions
            h1, h2 = states[0][-2, :, :], states[0][-1, :, :]

            # Compute attention weights for both directions
            attn_wts_1 = F.softmax(torch.bmm(out1, h1.unsqueeze(2)).squeeze(2), dim=1)  # Shape: (batch, seq_len)
            attn_wts_2 = F.softmax(torch.bmm(out2, h2.unsqueeze(2)).squeeze(2), dim=1)

            # Compute attention-weighted context vectors
            attn_1 = torch.bmm(out1.transpose(1, 2), attn_wts_1.unsqueeze(2)).squeeze(2)  # (batch, hidden_size)
            attn_2 = torch.bmm(out2.transpose(1, 2), attn_wts_2.unsqueeze(2)).squeeze(2)

            # Concatenate attention vectors from both directions
            attn = torch.cat((attn_1, attn_2), dim=1)  # Final attention output: (batch, hidden_size * 2)

        else:  # If using unidirectional LSTM
            h = states[0].squeeze(0)  # Last hidden state

            # Compute attention weights
            attn_wts = F.softmax(torch.bmm(output, h.unsqueeze(2)).squeeze(2), dim=1)  # Shape: (batch, seq_len)

            # Compute attention-weighted context vector
            attn = torch.bmm(output.transpose(1, 2), attn_wts.unsqueeze(2)).squeeze(2)  # Shape: (batch, hidden_size)

        # ---- Decoder ----
        # Apply the decoder to transform attention vector into output logits
        output = self.decoder(attn)  # Shape: (batch, output_size)
        output = self.dropout(output)  # Apply dropout
        output = self.out(output)  # Final output layer

        return output, states  # Return the predicted token logits and updated hidden state

    def init_hidden(self):
        if self.bi:
          return Variable(torch.zeros(self.n_layers*2, 1, self.hidden_size))
        else:
          return Variable(torch.zeros(self.n_layers, 1, self.hidden_size))

    def init_cell(self):
        if self.bi:
          return Variable(torch.zeros(self.n_layers*2, 1, self.hidden_size))
        else:
          return Variable(torch.zeros(self.n_layers, 1, self.hidden_size))

# turn string into list of longs
def char_tensor(string):
    tensor = torch.zeros(len(string)).long()
    for c in range(len(string)):
        tensor[c] = all_characters.index(string[c])
    if use_cuda:
      tensor = tensor.cuda()
    return Variable(tensor)

# generate text given context
def generate(prime_str='A', predict_len=100, temperature=0.8):
    model.load_state_dict(torch.load('Text-Generation/model_generate.pt', map_location='cpu'))
    model.eval()

    hidden = model.init_hidden()
    cell = model.init_cell()

    if use_cuda:
      hidden = hidden.cuda()
      cell = cell.cuda()

    prime_input = char_tensor(prime_str)
    predicted = prime_str + "\n--------->\n"

    # use priming string to "build up" hidden state
    for p in range(len(prime_str) - 1):
        _, states = model(prime_input[p], hidden, cell)

        if use_cuda:
          hidden, cell = states[0].cuda(), states[1].cuda()
        else:
          hidden, cell = states[0], states[1]

    inp = prime_input[-1]

    for p in range(predict_len):
        output, states = model(inp, hidden, cell)

        if use_cuda:
          output = output.cuda()
          hidden, cell = states[0].cuda(), states[1].cuda()
        else:
          hidden, cell = states[0], states[1]

        # sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = torch.multinomial(output_dist, 1)[0]

        # add predicted character to string and use as next input
        predicted_char = all_characters[top_i]
        predicted += predicted_char
        inp = char_tensor(predicted_char)

    return predicted
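
Unlike the greedy decoding used earlier, `generate` samples the next character, and `temperature` controls how adventurous that sampling is. The sketch below isolates that step. The function name `temperature_probs` and the four example logits are assumptions made for illustration.

```python
import torch

def temperature_probs(logits, temperature):
    # Same transformation as in generate(): divide by the temperature, then exponentiate.
    # Normalising the weights gives the distribution torch.multinomial samples from.
    weights = logits.div(temperature).exp()
    return weights / weights.sum()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # illustrative scores for 4 characters
for temperature in (0.2, 0.8, 2.0):
    probs = temperature_probs(logits, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Drawing one character index, as generate() does (unnormalised weights are accepted)
torch.manual_seed(0)
sample = torch.multinomial(logits.div(0.8).exp(), 1)[0].item()
print("sampled index:", sample)
```

A low temperature concentrates the probability on the highest-scoring character, approaching greedy decoding, while a high temperature flattens the distribution and produces more varied but less reliable text.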


#### Let's try some examples

In [54]:
# main
if __name__ == "__main__":

  hidden_size = 100
  n_layers = 2
  bi = True
  # define model
  model = TextGenerate(n_characters, hidden_size, n_characters, n_layers, bi)
  print("Example 1 :\n")
  # Pride and Prejudice - Jane Austen
  print(generate("\nThe tumult of her mind, was now painfully great. She knew not how \
  to support herself, and from actual weakness sat down and cried for \
  half-an-hour. ", 300, temperature=0.8),"\n")
  print("Example 2 :\n")

  # Dracula - Bram Stoker
  print(generate("\nTo believe in things that you cannot. Let me illustrate. I heard once \
  of an American who so defined faith: 'that faculty which enables us to \
  believe things which we know to be untrue.' For one, I follow that man. ", 300, temperature=0.8),"\n")

  # outside evaluation
  print("Example 3 : outside evaluation :\n")

  # Emma - Jane Austen
  print(generate("\nDuring his present short stay, Emma had barely seen him; but just enough \
  to feel that the first meeting was over, and to give her the impression \
  of his not being improved by the mixture of pique and pretension, now \
  spread over his air.  ", 300, temperature=0.8),"\n")
  
  print("Example 4 : outside evaluation :\n")

  # The Strange Case Of Dr. Jekyll And Mr. Hyde - Robert Louis Stevenson
  print(generate("\nPoole swung the axe over his shoulder; the blow shook the building, and \
  the red baize door leaped against the lock and hinges. A dismal \
  screech, as of mere animal terror, rang from the cabinet. ", 300, temperature=0.8),"\n")

Example 1 :


The tumult of her mind, was now painfully great. She knew not how   to support herself, and from actual weakness sat down and cried for   half-an-hour. 
--------->
Such a sun was visible things to see him to the struck as the
end from plain Flask, however me that the crossess was and raising who calmed but the cratterly was soul so figure to have admires to
be also. There did noble hungest down called the other side of thing all believe. And object of them whi 

Example 2 :


To believe in things that you cannot. Let me illustrate. I heard once   of an American who so defined faith: 'that faculty which enables us to   believe things which we know to be untrue.' For one, I follow that man. 
--------->
Not that his of
think out it as you make these minutes, and it Is all this did, not medre bluption to be the remetion-most beat as if one there will feeling on the coarness of a past in the plandop of face to describe he would have you-ship's dear stong, don't pampied purchai

As you can see, this model was trained on only a few novels, a tiny dataset compared to those behind text generators such as ChatGPT (which, moreover, uses a Transformer architecture built on self-attention), yet it produces sentences that are almost understandable :). The author reports that the model reaches a perplexity of 93. Perplexity measures how surprised the model is by the text it has to predict: it is the exponential of the average negative log-likelihood per token. A perplexity of 1 means the model predicted every token with certainty, while higher values indicate greater uncertainty and weaker performance.
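
As a small worked example of the perplexity definition, the function below takes the probabilities a model assigned to the true tokens of a sequence. The name `perplexity` and the input lists are made up for illustration and are unrelated to the figure reported by the author.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability assigned to the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([1.0, 1.0, 1.0]))   # 1.0: every true token was predicted with certainty
print(perplexity([0.01] * 5))        # about 100: as uncertain as a uniform choice among 100 tokens
```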

## Image captioning

Describing an image with text is a real challenge for a machine. Unlike us, who can effortlessly say what we see by relying on our experience, common sense, and context, an AI model struggles to connect pixels to words. The tricky part is that understanding an image isn’t just about recognizing objects—it also involves grasping their relationships, interpreting the scene, and putting everything into a meaningful sentence.

For example, if we see a person walking their dog on the beach, we can instantly describe the scene, mentioning the weather, what the person is doing, or even the overall atmosphere. But for a model, it’s a whole different story—it first has to identify the visual elements (a person, a dog, sand, the ocean), then figure out how they interact, and finally generate a fluent and coherent description. And that’s no easy task.

#### Early Approaches: CNN + RNN-Based Architectures
Before attention mechanisms came along, image captioning models worked kind of like early machine translation systems, using an encoder-decoder setup. The idea was pretty straightforward:

 - A CNN (Convolutional Neural Network) scanned the image and turned it into a fixed-size set of features.

 - An RNN (Recurrent Neural Network) (or an LSTM/GRU) then took those features and generated a caption, word by word.

Most models used a pre-trained CNN like ResNet or VGG to process the image and squash all the important details into a single vector. That vector was then handed over to an RNN, which tried to turn it into a meaningful sentence.

The problem? Shrinking an entire image into one fixed vector meant losing a ton of important details. The model couldn’t always capture everything that mattered in the scene, making its descriptions a bit hit-or-miss. Plus, as the sentence got longer, you guessed it, the model struggled to keep track of context, often leading to vague or incomplete captions.

#### A breakthrough : [Show, Attend and Tell](https://arxiv.org/pdf/1502.03044), use of attention in image captioning

The authors have published their code in an old Python version (2.6) on this [github from kelvinxu](https://github.com/kelvinxu/arctic-captions.git) if you want to have a look.
I invite you to read this really good [github from sgrvinod](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git), which summarizes and implements the paper very well (in Python 3.6). This section reuses explanations available on that github.

The core idea behind attention in image captioning is that the model learns **where to look**. Indeed, while generating the caption, the model focuses on different parts of the picture, picking the most relevant one at each step. Let's have a look at these pictures:

![exemple](img/sat.png)
![exemple2](img/sat2.png)
*Source : https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git*

As you can see above, every picture has a whitened part, which corresponds to where the model focused its attention to generate the corresponding word.
Let's have a deeper look at the model's structure.

#### Encoder

The Encoder takes an input image with three color channels and converts it into a smaller, more compact representation with learned feature channels. This compressed version captures the essential details of the original image. Because the model deals with images, CNNs are used to encode the image. Indeed, over the years, they have gotten really good at recognizing objects across thousands of categories, so they naturally learn to capture the most important visual features. Moreover, there are several options to choose from, depending on the user's needs (e.g., VGG19 or ResNet-101).

![exemple](img/cnn.png)
*Source : https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git*

#### Decoder 

The decoder has one goal: take the encoded image as input and generate the caption. Without attention, the approach is to average the encoded image features across all pixels to get a single representation. This vector is then fed to an RNN (such as an LSTM), which generates each word based on the previous one.

![no-attention](img/without-attention.png)
*Source : https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git*
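
As a rough illustration of this no-attention baseline, the sketch below averages a toy encoder output over all spatial positions. The 14 x 14 grid and the 2048 feature channels match the ResNet-101 encoder shown later in this document; the random values are only a stand-in for real image features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder output: a 14 x 14 grid of locations, each with a 2048-dim feature vector
encoder_out = rng.normal(size=(14, 14, 2048))

# Flatten the grid into a list of 196 "pixels"
pixels = encoder_out.reshape(-1, 2048)     # (196, 2048)

# Without attention, every pixel counts equally, for every word of the caption
image_vector = pixels.mean(axis=0)         # (2048,)

print(pixels.shape, image_vector.shape)
```

The same `image_vector` is used for the whole caption, which is exactly the limitation attention removes.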

However, this is where the authors made the difference. With attention, the model generates each word by focusing on the part of the picture that is most relevant at that step, rather than on one global summary of the image. For example, in the sentence 'A man holding a football', to generate the word 'man', the model focuses on the man. The approach is no longer to average across all pixels but to use a weighted average. At each step, this weighted image representation is combined with the previously generated word to help generate the next word in the sequence.

![attention](img/decode.png)
*Source : https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git*

#### Attention

How would you figure out which part of an image is most important to focus on? You’d naturally consider what you've already described so far. This way, you can look at the image again and decide what needs to be mentioned next. For example, if you've just mentioned a man, the next logical step might be to describe that he’s holding a football.

This is exactly what the Attention mechanism does—it keeps track of the words generated so far and dynamically focuses on the most relevant part of the image for the next word.

![attention](img/attention11.png)
*Source : https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git*

In the paper, two types of attention are described and implemented: a deterministic 'soft' attention and a stochastic 'hard' attention. Both are detailed below. In short, in the **soft attention** model everything remains differentiable, so the model can be trained end-to-end with standard backpropagation. The **hard attention** model is not differentiable, because it samples a single region at each step. This lets it fixate on a single, highly relevant region, but makes training more unstable.

### **Description of the two attention models**

**Soft Attention**  

At time step $ t $, we compute a score $ e_{t,i} $ for each image patch $ i $.  
In general, this is given by:  

$$
e_{t,i} = w^{\top} \tanh(W_a \, a_i + U_a \, h_{t-1} + b_a)
$$


Intuitively, this score measures how relevant region $ i $ is based on the previous hidden state $ h_{t-1} $ and its own characteristics $ a_i $.  

We then apply a *softmax* to these scores to obtain attention weights $ \alpha_{t,i} $:  

$$
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})}
$$
 
We compute a **context vector** $ z_t $ as a **weighted sum** of the patch feature vectors:  

$$
z_t = \sum_{i=1}^{L} \alpha_{t,i} \, a_i
$$

The LSTM integrates $ z_t $ (along with the previously generated word) to produce the new hidden state $ h_t $.  
Based on $ h_t $, the model predicts the next word.  

> **Advantage:** Everything remains **differentiable**, so we can train the model end-to-end using standard backpropagation.  
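
The three equations above can be sketched in a few lines of NumPy. This is a minimal illustration with random weights and toy sizes (the names `L`, `D`, `H`, `A` and their values are assumptions made for the example), not the authors' implementation. Note that the PyTorch code shown later in this document uses ReLU in place of tanh.

```python
import numpy as np

def soft_attention(a, h_prev, W_a, U_a, b_a, w):
    """a: (L, D) patch features, h_prev: (H,) previous hidden state."""
    # e_{t,i} = w^T tanh(W_a a_i + U_a h_{t-1} + b_a), for all patches at once
    e = np.tanh(a @ W_a.T + U_a @ h_prev + b_a) @ w     # (L,)
    # Softmax over patches (subtracting the max for numerical stability)
    exp_e = np.exp(e - e.max())
    alpha = exp_e / exp_e.sum()                          # (L,)
    # Context vector z_t: weighted sum of the patch features
    z = alpha @ a                                        # (D,)
    return z, alpha

rng = np.random.default_rng(0)
L, D, H, A = 196, 8, 5, 6   # patches, feature size, hidden size, attention size
a = rng.normal(size=(L, D))
h_prev = rng.normal(size=H)
W_a, U_a = rng.normal(size=(A, D)), rng.normal(size=(A, H))
b_a, w = rng.normal(size=A), rng.normal(size=A)

z, alpha = soft_attention(a, h_prev, W_a, U_a, b_a, w)
print(z.shape, alpha.shape)   # (8,) (196,)
```

Every operation here is a smooth function of the weights, which is why gradients can flow through the attention weights during training.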

---
**Hard Attention**  

Instead of computing a **weighted sum** of patch features, **hard attention** **samples** a single patch $ i $ at each step $ t $, with probability $ \alpha_{t,i} $.  

The context vector $ z_t $ is then simply the feature vector $ a_i $ of the selected region.  

Since this mechanism involves discrete sampling, it is **not differentiable**.  
To train it, the authors use a reinforcement learning approach like **REINFORCE** (gradient estimation via sampling).  

> **Advantage:** The model learns to **fixate** on a single, highly relevant region at each step.  
> **Disadvantage:** Training is more unstable and requires specialized optimization heuristics.  
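
The sampling step can be sketched as follows. The attention weights and patch features are made-up toy values, and only the forward sampling is shown: the REINFORCE-style training mentioned above is not implemented here.

```python
import numpy as np

def hard_attention(a, alpha, rng):
    """Sample one patch index from alpha and use its features as the context."""
    i = rng.choice(len(alpha), p=alpha)   # discrete sample: no gradient flows through it
    return a[i], i

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))               # 4 patches, 3 features each (toy sizes)
alpha = np.array([0.1, 0.6, 0.2, 0.1])    # attention weights from the softmax

z, i = hard_attention(a, alpha, rng)

# Sampling many times recovers the attention distribution
samples = [hard_attention(a, alpha, rng)[1] for _ in range(10_000)]
counts = np.bincount(samples, minlength=len(alpha))
print(counts / counts.sum())
```

On average, the sampled context equals the soft-attention context $ \sum_i \alpha_{t,i} \, a_i $, but any single step sees only one patch.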


#### Model

This is what the model looks like once everything is put together.

![model](img/model.png)

Once the **Encoder** processes the image, its encoded representation is used to initialize the **hidden state** $ h $ and **cell state** $ C $ of the LSTM Decoder.  

At each decoding step:  

1. The **encoded image** and the **previous hidden state** are used by the **Attention mechanism** to assign weights to different pixels.  
2. The **previously generated word** and the **weighted sum** of the encoded features are then fed into the **LSTM Decoder** to generate the next word in the sequence.  

Mathematically:  

- The **attention scores** for each pixel $ i $ at time step $ t $ are computed as:  

  $$
  e_{t,i} = w^{\top} \tanh(W_a \, a_i + U_a \, h_{t-1} + b_a)
  $$

- The **attention weights** $  \alpha_{t,i}  $ are obtained via a softmax function:  

  $$
  \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})}
  $$

- The **context vector** $ z_t $ is computed as the weighted sum of image features:  

  $$
  z_t = \sum_{i=1}^{L} \alpha_{t,i} \, a_i
  $$

- Finally, the **LSTM Decoder** takes as input the **previously generated word** $ y_{t-1} $ and the **context vector** $ z_t $ to produce the next hidden state $ h_t $ and predict the next word $ y_t $.  

  $$
  h_t, C_t = \text{LSTM}(y_{t-1}, z_t, h_{t-1}, C_{t-1})
  $$

  $$
  y_t = \text{Softmax}(W_o h_t + b_o)
  $$

This process repeats until the model generates an end-of-sequence token.
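
To tie the equations together, here is a toy decoding loop in NumPy. It is only a sketch under several assumptions: the weights are random rather than learned, the sizes and the `START`/`END` token ids are hypothetical, and the LSTM state starts at zero instead of being initialized from the image. It shows the order of operations at each step: attention, LSTM update, then word prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, H, E, A, V = 9, 4, 5, 3, 6, 7   # patches, feature, hidden, embedding, attention, vocab sizes
START, END = 0, 1                      # hypothetical token ids

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = np.exp(x - x.max())
    return x / x.sum()

# Randomly initialized parameters; a real model would learn them
p = {
    "embed": rng.normal(size=(V, E)),
    "W_a": rng.normal(size=(A, D)), "U_a": rng.normal(size=(A, H)),
    "b_a": np.zeros(A), "w": rng.normal(size=A),
    "W_lstm": rng.normal(size=(4 * H, E + D + H)) * 0.1, "b_lstm": np.zeros(4 * H),
    "W_o": rng.normal(size=(V, H)), "b_o": np.zeros(V),
}

def decode(a, max_len=10):
    h, c = np.zeros(H), np.zeros(H)    # simplification: zero initial state
    word, caption, alphas = START, [], []
    for _ in range(max_len):
        # Attention: scores -> weights -> context vector z_t
        alpha = softmax(np.tanh(a @ p["W_a"].T + p["U_a"] @ h + p["b_a"]) @ p["w"])
        z = alpha @ a
        # LSTM step on the concatenation [y_{t-1}; z_t; h_{t-1}]
        gates = p["W_lstm"] @ np.concatenate([p["embed"][word], z, h]) + p["b_lstm"]
        i, f, o = (sigmoid(gates[k * H:(k + 1) * H]) for k in range(3))
        g = np.tanh(gates[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        # Greedy choice of the next word from the softmax over the vocabulary
        word = int(np.argmax(softmax(p["W_o"] @ h + p["b_o"])))
        alphas.append(alpha)
        if word == END:
            break
        caption.append(word)
    return caption, np.array(alphas)

caption, alphas = decode(rng.normal(size=(L, D)))
print(caption, alphas.shape)
```

The `max_len` cap is needed because an untrained model may never emit the end token. The returned `alphas` hold one attention distribution per decoding step.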


#### Beam Search

When generating text, a linear layer is used at the end of the decoder to assign a score to each word in the vocabulary. The easiest way to generate a sentence would be to **always pick the word with the highest score** at each step. But this approach, called greedy decoding, isn't great because each word you choose affects the rest of the sequence.  

If you pick the wrong word early on, it can throw off the entire sentence, even if that first word had the highest individual score at the time.  

Imagine you're writing a sentence, and the best possible sequence actually involves choosing the third-best word at the first step, then maybe the second-best word at the second step, and so on. A greedy approach would never find this optimal sequence.

So, instead of locking in choices too soon, we'd like a way to consider multiple possible sentences and only commit once we've generated and evaluated full sequences.  

This is where **Beam Search** comes in.

---

### **How Beam Search Works**
Instead of picking just one best word at each step, Beam Search keeps track of **multiple possible sequences** and only selects the best overall one at the end.

1. At the first step, we choose the top k words (instead of just one).
2. For each of these k words, we generate k possible second words.
3. From all these k × k combinations, we pick the top k based on their total score.
4. For each of these k second words, we generate k third words and repeat the process.
5. We keep expanding the sequences until we reach the end of the sentence.
6. Once we have k completed sentences, we pick the one with the highest overall score.

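This way, Beam Search usually finds a **better sequence** than greedy decoding, instead of locking in a word too early and getting stuck with a bad choice. It is still a heuristic: it explores only k candidates at a time, so it is not guaranteed to find the single best sequence.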

![beamsearch](img/beam_search.png)
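
The steps above can be sketched on a tiny hand-made "model". The probability table `MODEL` is entirely hypothetical and chosen so that greedy decoding (k = 1) misses the best sentence. Scores are summed log-probabilities, without any length normalization.

```python
import math

# Hypothetical next-word probabilities, keyed by the words generated so far
MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"dog": 0.9, "cat": 0.1},
}

def next_probs(prefix):
    # Any prefix not listed above can only be followed by the end token
    return MODEL.get(prefix, {"<end>": 1.0})

def beam_search(k, max_len=5):
    beams = [((), 0.0)]     # (sequence, total log-probability)
    completed = []
    for _ in range(max_len):
        # Expand every live sequence with every possible next word
        candidates = [
            (seq + (word,), score + math.log(prob))
            for seq, score in beams
            for word, prob in next_probs(seq).items()
        ]
        # Keep only the k best candidates overall
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            if seq[-1] == "<end>":
                completed.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Pick the best sequence overall
    return max(completed + beams, key=lambda cand: cand[1])

for k in (1, 2):
    seq, score = beam_search(k)
    print(k, " ".join(seq), round(math.exp(score), 2))
# 1 a dog <end> 0.3
# 2 the dog <end> 0.36
```

With k = 1 the search commits to "a" because it has the highest first-step score. With k = 2 it keeps "the" alive long enough to discover that "the dog" is more probable overall.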

Now that you know how the model works, let's test it. We won't implement it here, because it would take a long time and training requires a lot of resources. Indeed, the training uses the [COCO dataset](https://cocodataset.org/#home), containing tens of thousands of pictures with their descriptions. To test the model, we will again use [sgrvinod's github](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git), where we can find a trained model. However, because he used Python 3.6 to develop the model, I had to make several modifications to the code, so it might behave a bit "differently". I copied his code to my github and modified it there so that you can use it. I recommend using Google Colab if you don't have a GPU on your computer, as it might not work on a CPU. This is just for you to test with your own pictures if you want to do so. 

I managed to generate captions like these:

![plane](img/planesky.png)


![skater](img/skater.png)

As you can see, the results are very good, even though the model sometimes fails:

![echec](img/captionchien.png)
*Source : https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git*



The next section is the script defining the model, written by [sgrvinod](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.git). It is quite long, but you can have a look at it, especially at the attention part.

In [52]:
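import torch
import torchvision
from torch import nn

# Used by the decoder to allocate tensors on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Encoder(nn.Module):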
    """
    Encoder: Uses a pre-trained ResNet-101 model to extract visual features from an input image.
    """

    def __init__(self, encoded_image_size=14):
        super(Encoder, self).__init__()
        self.enc_image_size = encoded_image_size

        resnet = torchvision.models.resnet101(pretrained=True)  # pretrained ImageNet ResNet-101

        # Remove linear and pool layers (since we're not doing classification)
        modules = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*modules)

        # Resize image to fixed size to allow input images of variable size
        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

        self.fine_tune()

    def forward(self, images):
        """
        Forward propagation.

        :param images: images, a tensor of dimensions (batch_size, 3, image_size, image_size)
        :return: encoded images
        """
        out = self.resnet(images)  # (batch_size, 2048, image_size/32, image_size/32)
        out = self.adaptive_pool(out)  # (batch_size, 2048, encoded_image_size, encoded_image_size)
        out = out.permute(0, 2, 3, 1)  # (batch_size, encoded_image_size, encoded_image_size, 2048)
        return out

    def fine_tune(self, fine_tune=True):
        """
        Allow or prevent the computation of gradients for convolutional blocks 2 through 4 of the encoder.

        :param fine_tune: Allow?
        """
        for p in self.resnet.parameters():
            p.requires_grad = False
        # If fine-tuning, only fine-tune convolutional blocks 2 through 4
        for c in list(self.resnet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune


class Attention(nn.Module):
    """
    Attention Network.
    """

    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        """
        :param encoder_dim: feature size of encoded images
        :param decoder_dim: size of decoder's RNN
        :param attention_dim: size of the attention network
        """
        super(Attention, self).__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # linear layer to transform encoded image
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # linear layer to transform decoder's output
        self.full_att = nn.Linear(attention_dim, 1)  # linear layer to calculate values to be softmax-ed
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)  # softmax layer to calculate weights

    def forward(self, encoder_out, decoder_hidden):
        """
        Forward propagation.

        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :param decoder_hidden: previous decoder output, a tensor of dimension (batch_size, decoder_dim)
        :return: attention weighted encoding, weights
        """
        att1 = self.encoder_att(encoder_out)  # (batch_size, num_pixels, attention_dim)
        att2 = self.decoder_att(decoder_hidden)  # (batch_size, attention_dim)
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1))).squeeze(2)  # (batch_size, num_pixels)
        alpha = self.softmax(att)  # (batch_size, num_pixels)
        attention_weighted_encoding = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (batch_size, encoder_dim)

        return attention_weighted_encoding, alpha


class DecoderWithAttention(nn.Module):
    """
    Decoder.
    """

    def __init__(self, attention_dim, embed_dim, decoder_dim, vocab_size, encoder_dim=2048, dropout=0.5):
        """
        :param attention_dim: size of attention network
        :param embed_dim: embedding size
        :param decoder_dim: size of decoder's RNN
        :param vocab_size: size of vocabulary
        :param encoder_dim: feature size of encoded images
        :param dropout: dropout
        """
        super(DecoderWithAttention, self).__init__()

        self.encoder_dim = encoder_dim
        self.attention_dim = attention_dim
        self.embed_dim = embed_dim
        self.decoder_dim = decoder_dim
        self.vocab_size = vocab_size
        self.dropout = dropout

        self.attention = Attention(encoder_dim, decoder_dim, attention_dim)  # attention network

        self.embedding = nn.Embedding(vocab_size, embed_dim)  # embedding layer
        self.dropout = nn.Dropout(p=self.dropout)
        self.decode_step = nn.LSTMCell(embed_dim + encoder_dim, decoder_dim, bias=True)  # decoding LSTMCell
        self.init_h = nn.Linear(encoder_dim, decoder_dim)  # linear layer to find initial hidden state of LSTMCell
        self.init_c = nn.Linear(encoder_dim, decoder_dim)  # linear layer to find initial cell state of LSTMCell
        self.f_beta = nn.Linear(decoder_dim, encoder_dim)  # linear layer to create a sigmoid-activated gate
        self.sigmoid = nn.Sigmoid()
        self.fc = nn.Linear(decoder_dim, vocab_size)  # linear layer to find scores over vocabulary
        self.init_weights()  # initialize some layers with the uniform distribution

    def init_weights(self):
        """
        Initializes some parameters with values from the uniform distribution, for easier convergence.
        """
        self.embedding.weight.data.uniform_(-0.1, 0.1)
        self.fc.bias.data.fill_(0)
        self.fc.weight.data.uniform_(-0.1, 0.1)

    def load_pretrained_embeddings(self, embeddings):
        """
        Loads embedding layer with pre-trained embeddings.

        :param embeddings: pre-trained embeddings
        """
        self.embedding.weight = nn.Parameter(embeddings)

    def fine_tune_embeddings(self, fine_tune=True):
        """
        Allow fine-tuning of embedding layer? (Only makes sense to not-allow if using pre-trained embeddings).

        :param fine_tune: Allow?
        """
        for p in self.embedding.parameters():
            p.requires_grad = fine_tune

    def init_hidden_state(self, encoder_out):
        """
        Creates the initial hidden and cell states for the decoder's LSTM based on the encoded images.

        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :return: hidden state, cell state
        """
        mean_encoder_out = encoder_out.mean(dim=1)
        h = self.init_h(mean_encoder_out)  # (batch_size, decoder_dim)
        c = self.init_c(mean_encoder_out)
        return h, c

    def forward(self, encoder_out, encoded_captions, caption_lengths):
        """
        Forward propagation.

        :param encoder_out: encoded images, a tensor of dimension (batch_size, enc_image_size, enc_image_size, encoder_dim)
        :param encoded_captions: encoded captions, a tensor of dimension (batch_size, max_caption_length)
        :param caption_lengths: caption lengths, a tensor of dimension (batch_size, 1)
        :return: scores for vocabulary, sorted encoded captions, decode lengths, weights, sort indices
        """

        batch_size = encoder_out.size(0)
        encoder_dim = encoder_out.size(-1)
        vocab_size = self.vocab_size

        # Flatten image
        encoder_out = encoder_out.view(batch_size, -1, encoder_dim)  # (batch_size, num_pixels, encoder_dim)
        num_pixels = encoder_out.size(1)

        # Sort input data by decreasing lengths; why? apparent below
        caption_lengths, sort_ind = caption_lengths.squeeze(1).sort(dim=0, descending=True)
        encoder_out = encoder_out[sort_ind]
        encoded_captions = encoded_captions[sort_ind]

        # Embedding
        embeddings = self.embedding(encoded_captions)  # (batch_size, max_caption_length, embed_dim)

        # Initialize LSTM state
        h, c = self.init_hidden_state(encoder_out)  # (batch_size, decoder_dim)

        # We won't decode at the <end> position, since we've finished generating as soon as we generate <end>
        # So, decoding lengths are actual lengths - 1
        decode_lengths = (caption_lengths - 1).tolist()

        # Create tensors to hold word prediction scores and alphas
        predictions = torch.zeros(batch_size, max(decode_lengths), vocab_size).to(device)
        alphas = torch.zeros(batch_size, max(decode_lengths), num_pixels).to(device)

        # At each time-step, decode by
        # attention-weighing the encoder's output based on the decoder's previous hidden state output
        # then generate a new word in the decoder with the previous word and the attention weighted encoding
        for t in range(max(decode_lengths)):
            batch_size_t = sum([l > t for l in decode_lengths])
            attention_weighted_encoding, alpha = self.attention(encoder_out[:batch_size_t],
                                                                h[:batch_size_t])
            gate = self.sigmoid(self.f_beta(h[:batch_size_t]))  # gating scalar, (batch_size_t, encoder_dim)
            attention_weighted_encoding = gate * attention_weighted_encoding
            h, c = self.decode_step(
                torch.cat([embeddings[:batch_size_t, t, :], attention_weighted_encoding], dim=1),
                (h[:batch_size_t], c[:batch_size_t]))  # (batch_size_t, decoder_dim)
            preds = self.fc(self.dropout(h))  # (batch_size_t, vocab_size)
            predictions[:batch_size_t, t, :] = preds
            alphas[:batch_size_t, t, :] = alpha

        return predictions, encoded_captions, decode_lengths, alphas, sort_ind


## **OK, now it is your turn to try generating captions!**

First, download the two files [here](https://drive.google.com/drive/u/0/folders/189VY65I_n4RTpQnmLGj7IzVnOF6dmePC). Second, put them in the "model" folder. If you are using Google Colab, just download the files and drag them here.
![folderlocation](img/folderloc.png)
![modeldansfolder](img/in_model.png)
You can use the pictures I have used for this example or add your own images in the "image_to_caption" folder and have fun! You just have to change the name of the image in the next command (change *--img='image_to_caption/plane.jpg'*)

In [57]:
!python3 notebook_tuto_captioning/caption.py \
      --img='image_to_caption/plane.jpg' \
      --model='model/BEST_checkpoint_coco_5_cap_per_img_5_min_word_freq.pth.tar' \
          --word_map='model/WORDMAP_coco_5_cap_per_img_5_min_word_freq.json' --beam_size=5

Figure(640x480)
Caption has been generated : check 'caption.png' to have a look at it


### **Conclusion : Why Attention Mechanism is a Game-Changer**

The attention mechanism in image captioning and text generation is a big step forward because it lets models go beyond just recognizing objects—it actually helps them **grasp more abstract concepts and relationships within a scene, a sentence, or more**.  

#### **More flexibility than object detectors**  
Older methods relied on detecting specific objects, words, etc., first, but attention learns **what to focus on directly from data**. That means the model can highlight not just objects, but also patterns, textures, or even abstract ideas in an image.  

#### **We can actually see what the model "sees"**
Unlike most deep learning models that are black boxes, attention can be visualized through heatmaps. This helps us understand why the model generated a certain caption, making AI decisions more transparent.  
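
As a rough sketch of how such a heatmap can be built, the attention weights of one decoding step can be reshaped to the encoder grid and upsampled to the image size. The 14 x 14 grid matches the encoder shown earlier; the upscale factor of 16 and the random weights are assumptions made for the example, and the nearest-neighbour upsampling used here is just the simplest option.

```python
import numpy as np

def attention_heatmap(alpha, grid=14, upscale=16):
    """Turn a flat vector of attention weights into an image-sized heatmap."""
    alpha_grid = np.asarray(alpha).reshape(grid, grid)
    # Nearest-neighbour upsampling: each weight becomes an upscale x upscale block
    heatmap = np.kron(alpha_grid, np.ones((upscale, upscale)))
    # Rescale to [0, 1] so it can be overlaid on the image as transparency
    return heatmap / heatmap.max()

# Stand-in for the attention weights of one decoding step (positive, sums to 1)
alpha = np.random.default_rng(0).dirichlet(np.ones(196))
heatmap = attention_heatmap(alpha)
print(heatmap.shape)   # (224, 224)
```

Overlaying one such heatmap per generated word on the input image produces the kind of whitened regions shown in the figures earlier in this section.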

#### **It could be useful in many fields**
For instance, in their paper, [**(H.Wang, *et al.*)**](https://ieeexplore.ieee.org/abstract/document/8852343?casa_token=msDOIuCblPIAAAAA:wVxucaHRLnVSHNZB-J1PGMqK0U_tI6tgKnqsi2H_saLJay4wHfJkS6RFLm6gOWAlzos9tVo9Nm0) managed to summarize NBA games from statistical tables. For this table:

![table](img/tables.png) 
*Source : https://ieeexplore.ieee.org/abstract/document/8852343?casa_token=msDOIuCblPIAAAAA:wVxucaHRLnVSHNZB-J1PGMqK0U_tI6tgKnqsi2H_saLJay4wHfJkS6RFLm6gOWAlzos9tVo9Nm0*

The model managed to generate this caption : *The Atlanta Hawks (46 - 12) defeated the Orlando Magic (19 - 41) 95 - 88 on Wednesday at Philips Arena in Atlanta. The Hawks got off to a quick start in this one, out - scoring the Magic 28 - 16 in the first quarter alone. Along with the quick start, the Hawks were able to out - score the Magic 28 - 21 in the third quarter, while the Hawks were able to coast to a victory in front of their home crowd. The Hawks were also able to out - rebound the Magic 42 - 40, giving them enough of an advantage to secure the victory in front of their home crowd. The Hawks were led by the duo of Victor Oladipo and Nikola Vucevic. Oladipo went 8 - for - 18 from the field and 2 - for - 5 from the three - point line to score a game - high of 19 points, while also adding six assists and two steals. He’s now averaging 17 points and 6 rebounds on the year. Jeff Teague also had a solid showing, finishing with 17 points (6 - 15 FG, 1 - 4 3Pt, 4 - 4 FT), seven assists and two steals. It was his second double - double in a row, a stretch where he’s averaging 24 points and 12 rebounds over that span. Coming off the bench, Kyle Korver had a solid showing as well, finishing with 19 points (2 - 5 FG, 2 - 5 3Pt, 2 - 2 FT), three rebounds and two blocked shots. It was his second double - double in a row, a stretch where he’s averaging 17 points and 12 rebounds. The only other Magic player to reach double figures in points was Victor Oladipo, who chipped in with 19 points (8 - 18 FG, 2 - 5 3Pt, 1 - 2 FT) and six assists. The Magic’s next game will be at home against Detroit Pistons on Friday, while the Magic will be at home against the Detroit Pistons on Friday.*

It is quite impressive, but there are many more fields. Indeed, this **encoder-decoder + attention** approach is modular, meaning it could be adapted for other AI tasks like:  
- **Medical imaging**  
- **Video description**  
- **Advanced AI storytelling**  

#### **Bottom line?**
Attention **helps models "look" at images more like humans do**, leading to **smarter and more detailed descriptions**. Plus, it’s way more transparent than traditional AI methods, making it **easier to trust and improve**.  


# References 

[[1] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 2048–2057.](https://dl.acm.org/doi/10.5555/3045118.3045336)

GitHub implementation: [arctic-captions](https://github.com/kelvinxu/arctic-captions)

[[2] H. Wang, W. Zhang, Y. Zhu and Z. Bai, "Data-to-Text Generation with Attention Recurrent Unit," 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 2019, pp. 1-8, doi: 10.1109/IJCNN.2019.8852343.](https://ieeexplore.ieee.org/abstract/document/8852343?casa_token=msDOIuCblPIAAAAA:wVxucaHRLnVSHNZB-J1PGMqK0U_tI6tgKnqsi2H_saLJay4wHfJkS6RFLm6gOWAlzos9tVo9Nm0)

sgrvinod's GitHub repository: [a-PyTorch-Tutorial-to-Image-Captioning](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning)

