## 0. Meta: parameters, to-do list


Questions:
- [ ] Text tokens and image tokens are identified by the same integers. Is this a problem? What alternatives do we have?
- [ ] "bart-large" is limited to sequences of 1024 tokens; check whether this also holds for bart-large-cnn (the code has a bug with 1025 tokens = 1 "beginning of sentence" token + 1024 image tokens)
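
One possible alternative for the first question, sketched under assumptions: keep a single joint vocabulary but shift image-token ids by the size of the text vocabulary, so that the two id ranges never overlap. The helper names `to_joint_id` and `from_joint_id` are hypothetical; the sizes are the ones printed later in this notebook (50265 for the tokenizer, 8192 for the dVAE).

```python
# Hypothetical helpers: map image tokens into an id range disjoint from text tokens.
TEXT_VOCAB_SIZE = 50265   # tokenizer.vocab_size for Bart
IMAGE_VOCAB_SIZE = 8192   # encoder.vocab_size for the Dall-E dVAE


def to_joint_id(image_token: int) -> int:
    """Shift a dVAE token id past the text-token range."""
    if not 0 <= image_token < IMAGE_VOCAB_SIZE:
        raise ValueError(f"image token {image_token} out of range")
    return image_token + TEXT_VOCAB_SIZE


def from_joint_id(joint_id: int) -> int:
    """Recover the dVAE token id from a joint-vocabulary id."""
    if not TEXT_VOCAB_SIZE <= joint_id < TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE:
        raise ValueError(f"{joint_id} is not an image token in the joint vocabulary")
    return joint_id - TEXT_VOCAB_SIZE


joint = to_joint_id(7522)
print(joint, from_joint_id(joint))
```

With such a layout, the embedding table and the output layer would both need `TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE` rows.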


Must-have:
- [X] Encode text with Bart
- [X] Encode images with the Dall-E dVAE
- [X] One training step with Bart
- [ ] A training run on a mini-batch of examples from COCO
- [ ] Print training progress
- [ ] Predict an image with the Dall-E dVAE decoder

Nice-to-have:
- [ ] Move all parameters to the top of the notebook
- [X] Save models locally to avoid downloading them every time
- [ ] See whether GPT-3 can be used instead of Bart
- [ ] Use TPUs on Google Colab by activating the offer sent to Arthur
- [ ] Standardize variable names
- [ ] Use a smaller model than Bart CNN
- [ ] Handle the transformer's attention

In [11]:
verbose = 2
text_token_length = 255
target_image_size = 256
image_token_side = 32
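
These parameters fix the length of the image-token sequence. The sketch below works through the arithmetic, assuming that Bart accepts at most 1024 positions, as noted in the to-do list. The names `downsampling_factor` and `bart_max_positions` are introduced here for illustration only.

```python
target_image_size = 256    # pixels per side after preprocessing
image_token_side = 32      # image tokens per side produced by the dVAE
bart_max_positions = 1024  # assumed limit for bart-large (to be checked for bart-large-cnn)

downsampling_factor = target_image_size // image_token_side
image_token_count = image_token_side ** 2
decoder_length = 1 + image_token_count  # 1 BOS token + all image tokens

print(downsampling_factor)                  # 8
print(image_token_count)                    # 1024
print(decoder_length > bart_max_positions)  # True: one token too many
```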

## 1. Training

Training consists of:

1. Encoding the text into text tokens
2. Encoding the images into image tokens
3. Modeling the whole sequence auto-regressively

### 1.1. Text encoding

We use `BartTokenizer` to encode the text, as mini Dall-E does. According to the paper, the original Dall-E uses "_BPE encoding_" (byte-pair encoding, which iteratively merges the most frequent pairs of adjacent bytes or characters). This could be interpreted as using the GPT-3 model, which also relies on an encoding close to _BPE encoding_. Unfortunately, GPT-3 is not available to the general public.
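
To make the idea of byte-pair encoding concrete, here is a toy sketch of the merge step on a three-word corpus. It is only an illustration of the principle: it is not the implementation used by `BartTokenizer`, and the helper names `most_frequent_pair` and `merge_pair` are made up for this example.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word is a tuple of symbols with an occurrence count.
corpus = {tuple("penguin"): 3, tuple("pen"): 2, tuple("ice"): 1}
for _ in range(2):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))

print(corpus)
```

After two merges the frequent prefix "pen" has become a single symbol, while the rarer letters remain separate.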

In [12]:
# ! git clone https://huggingface.co/facebook/bart-large-cnn ../../models/facebook/bart-large-cnn

In [70]:
from transformers import BartTokenizer
import torch

# https://huggingface.co/transformers/v2.11.0/model_doc/bart.html

tokenizer = BartTokenizer.from_pretrained(
    '../../models/facebook/bart-large-cnn'
)

caption = "A Emperor penguin standing on the ice"

# First version, that does not generalize to a list of captions
# It returns an object: {input_ids:..., attention_mask: ...}
# caption_as_tokens = tokenizer(caption)

caption_as_tokens = tokenizer.encode(
    caption,
    max_length = text_token_length,
    padding = 'max_length',
    truncation = True,  # cut captions longer than max_length
    return_tensors = 'pt'
)

# for more than one caption
# caption_as_tokens = tokenizer.batch_encode_plus([caption])
    
if verbose >= 2:
    print("Caption is rendered as tokens by:")
    print(caption_as_tokens)

Caption is rendered as tokens by:
tensor([[    0,   250, 31918, 31526,   179,  2934,    15,     5,  2480,     2,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,   
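
The output above shows that most of the 255 positions hold the padding id 1. A minimal sketch, assuming we want the encoder to ignore those positions: build an attention mask from the padding id. The short tensor below is a hand-made stand-in for `caption_as_tokens`.

```python
import torch

pad_token_id = 1  # Bart's padding id, visible in the output above

# Stand-in for `caption_as_tokens`: the caption ids followed by three pads.
caption_as_tokens = torch.tensor(
    [[0, 250, 31918, 31526, 179, 2934, 15, 5, 2480, 2, 1, 1, 1]]
)

# 1 where there is a real token, 0 where there is padding.
attention_mask = (caption_as_tokens != pad_token_id).long()
print(attention_mask)
```

With the real tensors, the mask would be passed to the model through its `attention_mask` argument. Calling `tokenizer(caption, ...)` instead of `tokenizer.encode(...)` returns `input_ids` together with this mask.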

### 1.2. Image encoding

To encode images, we use the dVAE model released by OpenAI for Dall-E. Encoding images requires a few preprocessing steps, for example to bring them all to the same (square) format and to the same size (256x256 pixels).

In [14]:
# download encoder.pkl

In [15]:
import torch
from dall_e import load_model

dev = torch.device('cpu')
encoder = load_model("../../models/openai/dall-e/encoder.pkl", dev)

if verbose>=1:
    print("Dall-E dVAE encoder has a meta-pixel look-up table of size:")
    print(encoder.vocab_size)

Dall-E dVAE encoder has a meta-pixel look-up table of size:
8192


In [16]:
from dall_e import map_pixels
import PIL.Image
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

# scale images down to 256x256 (cropping the uneven dimension)
# we might get problems with some images from the COCO datasets
# ignore these images as a first approximation
# or reduce the image resolution ?

def preprocess(img):
    s = min(img.size)
    
    if s < target_image_size:
        raise ValueError(f'min dim for image {s} < {target_image_size}')
        
    r = target_image_size / s
    s = (round(r * img.size[1]), round(r * img.size[0]))
    img = TF.resize(img, s, interpolation=PIL.Image.LANCZOS)
    img = TF.center_crop(img, output_size=2 * [target_image_size])
    img = torch.unsqueeze(T.ToTensor()(img), 0)
    return map_pixels(img)

In [17]:
import requests
import PIL.Image
import io

# replace by direct reading from disk
# persist images to disk in the first place
def download_image(url):
    resp = requests.get(url)
    resp.raise_for_status()
    return PIL.Image.open(io.BytesIO(resp.content))


In [53]:
import torch

image = preprocess(download_image(
    'https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iKIWgaiJUtss/v2/1000x-1.jpg'
))

# the encoder returns logits of shape (batch, 8192, 32, 32):
# dimension 1 indexes the 8192 possible meta-pixel values from the look-up table,
# dimensions 2 and 3 index the 32*32 image-tokens or meta-pixels
image_as_token_logits = encoder(image)

# now we retain the most probable meta-pixel value at each position
image_as_tokens = torch.argmax(image_as_token_logits, dim=1)
# flatten the 32x32 grid into a sequence of 1024 tokens
image_as_tokens = image_as_tokens.flatten(start_dim=1)
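
To make the shapes explicit, here is a small sketch that uses random logits in place of the real encoder output. It assumes the encoder returns a tensor of shape (batch, vocabulary, height, width), which is what the `argmax` over dimension 1 relies on; a reduced vocabulary of 16 keeps the example light.

```python
import torch

torch.manual_seed(0)
vocab_size, side = 16, 32  # real dVAE: vocab_size = 8192

# Stand-in for encoder(image): one logit per codebook entry and grid position.
fake_logits = torch.randn(1, vocab_size, side, side)

tokens_grid = torch.argmax(fake_logits, dim=1)   # shape (1, 32, 32)
tokens_flat = tokens_grid.flatten(start_dim=1)   # shape (1, 1024), row-major order

print(tokens_grid.shape, tokens_flat.shape)
```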



### 1.3. Auto-regressive modeling with Bart

In [71]:
torch.hstack((
        # prepend beginning-of-sentence (BOS) token
        torch.tensor(model.config.bos_token_id).repeat(1).unsqueeze(1),
        image_as_tokens
    ))

tensor([[   0, 7522,  741,  ..., 5016, 1144, 1005]])

In [84]:
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained(
    '../../models/facebook/bart-large-cnn'
)

In [89]:
import torch

predict = model(
    input_ids = caption_as_tokens,
    decoder_input_ids = torch.hstack((
        # prepend beginning-of-sentence (BOS) token
        # and keep only the first 100 image tokens:
        # BOS + all 1024 image tokens would be 1025
        # positions, one more than the 1024-token limit
        torch.tensor(model.config.bos_token_id)
          .repeat(image_as_tokens.shape[0])
          .unsqueeze(1),
        image_as_tokens[:,:100]
    ))
)

# here we have a problem: the predictions
# have the size of the text-token vocabulary,
# not of the image-token vocabulary
if verbose >= 1:
    print("Prediction size:")
    print(predict.logits.shape)
    print("vs. size of the text-token vocabulary:")
    print(tokenizer.vocab_size)
    print("vs. size of the image-token vocabulary:")
    print(encoder.vocab_size)

Prediction size:
torch.Size([1, 101, 50264])
vs. size of the text-token vocabulary:
50265
vs. size of the image-token vocabulary:
8192


One option is to replace the last layer of the model so that it predicts values from the image-token vocabulary:

In [90]:
model.lm_head

Linear(in_features=1024, out_features=50264, bias=False)

In [91]:
model.lm_head = torch.nn.Linear(
    in_features=1024, out_features = encoder.vocab_size,
    bias=False
)
# Bart keeps the output bias outside `lm_head`, in the
# `final_logits_bias` buffer, which must be resized as well:
model.final_logits_bias = torch.rand(encoder.vocab_size)
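
The same idea can be tried on a toy module before touching Bart. This is a sketch under assumptions: `TinyDecoder` is a hypothetical stand-in, not Bart's architecture, and only illustrates that swapping the final `Linear` layer changes the size of the last logits dimension.

```python
import torch
from torch import nn


class TinyDecoder(nn.Module):
    """Hypothetical stand-in: embedding followed by a linear head over a vocabulary."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, token_ids):
        return self.lm_head(self.embed(token_ids))


text_vocab, image_vocab, hidden = 50, 8, 16
toy = TinyDecoder(text_vocab, hidden)
token_ids = torch.tensor([[0, 3, 7, 2]])

shape_before = toy(token_ids).shape  # last dimension = text_vocab

# Replace the head so that logits range over the image vocabulary instead.
toy.lm_head = nn.Linear(hidden, image_vocab, bias=False)
shape_after = toy(token_ids).shape   # last dimension = image_vocab

print(shape_before, shape_after)
```

The new head starts from random weights, and the input embeddings are untouched: in this toy module, as in the cells of this notebook, image tokens fed to the decoder are still looked up in the text embedding table.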

In [92]:
predict = model(
    input_ids = caption_as_tokens,
    decoder_input_ids = torch.hstack((
        # prepend beginning-of-sentence (BOS) token
        # and keep only the first 100 image tokens:
        # BOS + all 1024 image tokens would be 1025
        # positions, one more than the 1024-token limit
        torch.tensor(model.config.bos_token_id)
          .repeat(image_as_tokens.shape[0])
          .unsqueeze(1),
        image_as_tokens[:,:100]
    ))
)

In [95]:
predict.loss

In [93]:
predict.logits.shape # yes !

torch.Size([1, 101, 8192])

Now, how do we train this model using a single image?

In [122]:
# predict image

# 1) same as above
predictions = model(
    input_ids = caption_as_tokens,
    decoder_input_ids = torch.hstack((
        torch.tensor(model.config.bos_token_id)
          .repeat(image_as_tokens.shape[0])
          .unsqueeze(1),
        image_as_tokens[:,:100]
    ))
)
# 2) get the predicted next image tokens
image_prediction_as_tokens = predictions.logits.argmax(axis=2) # best response
# we actually do not need this for the next step and can directly
# use logits in the loss function

# 3) compare to original
loss_fn = torch.nn.CrossEntropyLoss()
# the loss is computed for each pair of tokens (true; predicted)
# and then averaged over the whole vector
loss = loss_fn(
  input  = predictions.logits[0,:,:],
  target = image_as_tokens[0,:101]
)

loss

#image_as_tokens[:,:100]

tensor(9.2402, grad_fn=<NllLossBackward>)
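
The slicing in the cell above pairs each decoder input with the token it should predict. Here is a plain-Python sketch of that alignment, using the first image-token ids of this notebook and a hypothetical helper `teacher_forcing_pairs`:

```python
BOS = 0  # model.config.bos_token_id for Bart


def teacher_forcing_pairs(image_tokens, n_context):
    """Return (decoder_inputs, targets) for the first n_context image tokens.

    decoder_inputs = [BOS, t0, ..., t_{n-1}]
    targets        = [t0,  t1, ..., t_n]
    so position i of the decoder is trained to predict image token i.
    """
    decoder_inputs = [BOS] + image_tokens[:n_context]
    targets = image_tokens[:n_context + 1]
    return decoder_inputs, targets


image_tokens = [7522, 741, 5973, 7663, 708]  # first ids of the encoded image
decoder_inputs, targets = teacher_forcing_pairs(image_tokens, n_context=3)
print(decoder_inputs)  # [0, 7522, 741, 5973]
print(targets)         # [7522, 741, 5973, 7663]
```

With `n_context=100` on the full sequence, both lists have 101 elements, which matches the shapes of the logits and targets used in the loss.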

In [120]:
if verbose >=2 :
    print("Tensors for true image tokens and predicted image tokens are of size:")
    print(image_as_tokens[0,:101].shape)
    print(image_prediction_as_tokens[0,:].shape)

Tensors for true image tokens and predicted image tokens are of size:
torch.Size([101])
torch.Size([101])


In [114]:
import pandas as pd
pd.DataFrame(data = {
    "original_tokens" : image_as_tokens[0,:101],
    "predicted_tokens" : image_prediction_as_tokens[0,:],
})

| | original_tokens | predicted_tokens |
|---|---|---|
| 0 | 7522 | 498 |
| 1 | 741 | 4968 |
| 2 | 5973 | 3921 |
| 3 | 7663 | 1932 |
| 4 | 708 | 1039 |
| ... | ... | ... |
| 96 | 1861 | 1413 |
| 97 | 563 | 4572 |
| 98 | 435 | 4947 |
| 99 | 5677 | 3663 |


In [103]:
image_as_tokens.shape

torch.Size([1, 1024])

Now we can optimize the weights using the loss function:

In [123]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=0.01)

optimizer.zero_grad()

loss.backward()

optimizer.step()
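
To follow progress over several steps (one of the open must-have items), the single step above can be wrapped in a loop. The sketch below uses a tiny hypothetical linear model and random data instead of Bart so that it runs quickly; the loop structure, not the model, is the point.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
vocab_size, hidden, seq_len = 8, 16, 20

# Hypothetical stand-ins for decoder states and the image tokens to predict.
features = torch.randn(seq_len, hidden)
targets = torch.randint(0, vocab_size, (seq_len,))

toy_head = nn.Linear(hidden, vocab_size)
optimizer = optim.Adam(toy_head.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(50):
    optimizer.zero_grad()                        # reset gradients from the previous step
    loss = loss_fn(toy_head(features), targets)  # forward pass and loss
    loss.backward()                              # back-propagate
    optimizer.step()                             # update the weights
    losses.append(loss.item())
    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")
```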

Did the loss decrease? (Spoiler: yes.)

In [124]:
predictions = model(
    input_ids = caption_as_tokens,
    decoder_input_ids = torch.hstack((
        torch.tensor(model.config.bos_token_id)
          .repeat(image_as_tokens.shape[0])
          .unsqueeze(1),
        image_as_tokens[:,:100]
    ))
)

loss_fn(
  input  = predictions.logits[0,:,:],
  target = image_as_tokens[0,:101]
)

tensor(8.7003, grad_fn=<NllLossBackward>)
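
A useful reference point for these values: a model that assigns equal probability to each of the 8192 image tokens has a cross-entropy of ln(8192). The quick check below compares the two losses reported in this notebook against that baseline.

```python
import math

image_vocab_size = 8192
uniform_loss = math.log(image_vocab_size)  # cross-entropy of a uniform guess

loss_before, loss_after = 9.2402, 8.7003   # values printed in this notebook

print(f"uniform baseline: {uniform_loss:.4f}")
print(f"before the step:  {loss_before - uniform_loss:+.4f} vs. baseline")
print(f"after the step:   {loss_after - uniform_loss:+.4f} vs. baseline")
```

The freshly initialized head starts slightly above the uniform baseline and drops below it after a single step on the same example. This shows that the update works, but says nothing yet about generalization.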

In [None]:


#image_as_tokens_shifted = image_as_tokens.clone()
#image_as_tokens_shifted[1:] = image_as_tokens[:-1]
#image_as_tokens_shifted[0]  = model.config.decoder_start_token_id


#z = F.one_hot(z, num_classes=encoder.vocab_size).permute(0, 3, 1, 2).float()

# pad text to fixed length with an additional id and bind text and image tokens together

# model the sequence with a transformer model


# def shift_tokens_right(input_ids: np.array, decoder_start_token_id: int):
#     """
#     Shift input ids one token to the right.
#     """
#     shifted_input_ids = np.zeros(input_ids.shape)
#     shifted_input_ids[:, 1:] = input_ids[:, :-1]
#     shifted_input_ids[:, 0] = decoder_start_token_id
#     return shifted_input_ids


    # dataset.preprocess(
    #     tokenizer=tokenizer,
    #     decoder_start_token_id=model.config.decoder_start_token_id,
    #     normalize_text=model.config.normalize_text,
    #     max_length=model.config.max_text_length,
    # )


# all_tokens =  torch.cat( (text_tokens,image_tokens) )

# if verbose > 2:
#     print(all_tokens.shape)

In [None]:


# loss = cross-entropy
# torch.nn.CrossEntropyLoss()
# predict vs. image_tokens

# alternatively, we can replace the last layer of Bart
# nn.Linear(size_embedding, num_vocab_img)

# import torch
# import torch.nn as nn
# class RNN(nn.Module):
#     def __init__(self, input_size, hidden_size, output_size):
#         super(RNN, self).__init__()
#         self.hidden_size = hidden_size
#         self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
#         self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
#         self.o2o = nn.Linear(hidden_size + output_size, output_size)
#         self.dropout = nn.Dropout(0.1)
#         self.softmax = nn.LogSoftmax(dim=1)
#     def forward(self, category, input, hidden):
#         input_combined = torch.cat((category, input, hidden), 1)
#         hidden = self.i2h(input_combined)
#         output = self.i2o(input_combined)
#         output_combined = torch.cat((hidden, output), 1)
#         output = self.o2o(output_combined)
#         output = self.dropout(output)
#         output = self.softmax(output)
#         return output, hidden


In [None]:
image_tokens[:-1]

tensor([7522,  741, 5973,  ..., 6231, 5016, 1144])

In [None]:
## INFERENCE

# get token ids for texts (= encode)

# generate the next terms in the sequence with a random seed

# get image from token ids (= decode)
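
The last step of this plan needs the generated tokens in the format that the dVAE decoder consumes. A hedged sketch, assuming the decoder expects a one-hot tensor of shape (batch, vocabulary, 32, 32), as the commented `F.one_hot(...).permute(0, 3, 1, 2)` line earlier in this notebook suggests. Random tokens stand in for the model's predictions, the vocabulary is reduced to 16 to keep the example light, and the decoder itself is not called.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, side = 16, 32  # real dVAE: vocab_size = 8192

# Stand-in for 1024 generated image tokens (batch of 1).
generated_tokens = torch.randint(0, vocab_size, (1, side * side))

# Back to a 32x32 grid, then one-hot over the vocabulary dimension.
token_grid = generated_tokens.view(1, side, side)
decoder_input = F.one_hot(token_grid, num_classes=vocab_size)  # (1, 32, 32, 16)
decoder_input = decoder_input.permute(0, 3, 1, 2).float()      # (1, 16, 32, 32)

print(decoder_input.shape)
```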

In [None]:
# https://colab.research.google.com/drive/14oChMr8KZVS7DzcbsuJix0JQKUTGO64j

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
# see ``examples/summarization/bart/evaluate_cnn.py`` for a longer example
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
