# Project Overview

This project trains a small decoder-based transformer on the fables of Jean de La Fontaine and uses it to generate new text in the same style.

The goal is to understand how transformers work and what they can achieve with limited resources: a small machine and a modest dataset.

By the end, we will have a trained text-generation model and a clearer picture of what it means to work under these constraints in machine learning.

# Imports

In [1]:
!pip install sentencepiece
import nltk
import re
from transformers import CamembertModel, CamembertTokenizer
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
import torch.optim as optim
import textwrap
import sentencepiece
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pickle



PyTorch was selected because it integrates well with NLP libraries such as Hugging Face's Transformers, which simplifies the implementation. It also runs acceptably on a small machine, which fits the project's goal of understanding transformers and getting results with limited data.

# Data pre-processing

The project relies on a corpus of approximately 250 fables taken from Wikipedia, stored in a text file of about 500 KB. During preprocessing, the annotations and indentation inherited from Wikipedia are removed.

In [32]:
with open('fables.txt', 'r', encoding='ISO-8859-1') as f:
    text = f.read()

text = textwrap.dedent(text)
text = text.replace('N ', '')
text = text.replace('(', '')
text = text.replace(')', '')
text = text.replace('ſ', '')
text = re.sub(r'\d+', '', text)

In [33]:
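# Write with the same encoding used for reading, so the file can be re-read consistently
with open('fables.txt', 'w', encoding='ISO-8859-1') as f:
    f.write(text)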

text = text.replace('\n', '@')

As the CamembertTokenizer ignores the '\n' character, each newline is replaced by '@' and will be converted back to '\n' after text generation.
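A minimal sketch of this round trip, using a short made-up excerpt rather than the real corpus. It assumes that '@' never appears in the fables themselves; otherwise the reverse substitution would insert spurious line breaks.

```python
# Hypothetical two-line excerpt standing in for the corpus
sample = "Maître Corbeau, sur un arbre perché,\nTenait en son bec un fromage."

# Before tokenization: make line breaks visible to the tokenizer
encoded = sample.replace('\n', '@')

# After generation: restore the line breaks
restored = encoded.replace('@', '\n')

print(encoded)
print(restored == sample)
```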

In [34]:
print(text[:1000])

L'Aigle donnait la chasse à Maître Jean Lapin,@Qui droit à son terrier s'enfuyait au plus vite.@Le trou de l'Escarbot se rencontre en chemin :@Je laisse à penser si ce gîte@Etait sûr ; mais où mieux ? Jean Lapin s'y blottit.@L'Aigle fondant sur lui nonobstant cet asile,@L'Escarbot intercède et dit :@Princesse des Oiseaux, il vous est fort facile@D'enlever malgré moi ce pauvre malheureux ;@Mais ne me faites pas cet affront, je vous prie ;@Et,  puisque Jean Lapin vous demande la vie,@C'est mon voisin, c'est mon compère.@L'Oiseau de Jupiter, sans répondre un seul mot,@Choque de l'aile l'Escarbot,@L'étourdit, l'oblige à se taire,@Enlève Jean Lapin. L'Escarbot indigné@Vole au nid de l'Oiseau, fracasse en son absence@Ses ?ufs, ses tendres ?ufs, sa plus douce espérance :@Pas un seul ne fut épargné.@L'Aigle étant de retour et voyant ce ménage,@Remplit le ciel de cris, et, pour comble de rage,@Ne sait sur qui venger le tort qu'elle a souffert.@Elle gémit en vain, sa plainte au vent se perd.@Il 

# Creation of the database

In [7]:
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

def encode_text_to_ids(text):
    tokenized_text = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
    return token_ids

def decode_ids_to_text(token_ids):
    # tokenizer.decode maps the ids straight back to a string
    return tokenizer.decode(token_ids)

def create_input_output_pairs(data, sequence_length):
    pairs = []
    for i in range(len(data) - sequence_length):
        x = data[i:i+sequence_length]
        y = data[i+1:i+sequence_length+1]
        pairs.append((x, y))
    return pairs

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


'encode_text_to_ids' converts a string into a list of token ids. The tokenizer used is the CamembertTokenizer, which is designed for the French language. 'decode_ids_to_text' converts a list of token ids back into a string. 'create_input_output_pairs' builds the dataset: each input is a window of 'sequence_length' tokens, and its target is the same window shifted one token to the right.
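To make the windowing concrete, here is a small sketch that applies the same logic as 'create_input_output_pairs' to a toy list of integers in place of real token ids. The function is repeated so the example runs on its own; the toy data and the window length of 3 are illustrative choices.

```python
def create_input_output_pairs(data, sequence_length):
    pairs = []
    for i in range(len(data) - sequence_length):
        x = data[i:i + sequence_length]          # input window
        y = data[i + 1:i + sequence_length + 1]  # same window, shifted by one token
        pairs.append((x, y))
    return pairs

# Toy "token ids" standing in for the encoded corpus
toy_ids = [10, 11, 12, 13, 14, 15]
toy_pairs = create_input_output_pairs(toy_ids, 3)
for x, y in toy_pairs:
    print(x, "->", y)
```

Each target is the input shifted by one position, so every position in the window contributes a next-token prediction.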

In [38]:
data = torch.tensor(encode_text_to_ids(text), dtype=torch.long)

sequence_length = 32

pairs = create_input_output_pairs(data, sequence_length)

train_pairs, val_pairs = train_test_split(pairs, test_size=0.01, random_state=42)

train_x, train_y = zip(*train_pairs)
val_x, val_y = zip(*val_pairs)

train_x = torch.stack(train_x)
train_y = torch.stack(train_y)
val_x = torch.stack(val_x)
val_y = torch.stack(val_y)

batch_size = 64
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = TensorDataset(val_x, val_y)
val_loader = DataLoader(dataset=val_dataset, batch_size=batch_size, shuffle=False)

# Definition of the model

The model I used for this task is a transformer decoder, the component of the transformer architecture that generates output sequences. Through self-attention, each position attends to the relevant earlier tokens of the sequence, which lets the model capture dependencies between words. This makes decoders well suited to text generation: the model reads the context and predicts the token that should come next.

![Transformer architecture](transformer.png)
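For text generation, the prediction at each position should depend only on the tokens before it. A common way to enforce this is a causal mask over the attention scores. The sketch below is a plain-Python illustration with made-up scores, not the code of this notebook: masked positions are set to minus infinity before the softmax, so they receive zero weight.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Mask future positions (j > i) with -inf so that exp() gives 0
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up attention scores for a 3-token sequence
scores = [[0.5, 1.0, 2.0],
          [1.0, 1.0, 0.3],
          [0.2, 0.4, 0.6]]
attn = causal_softmax(scores)
for row in attn:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, and each later token spreads its attention over itself and the tokens before it.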

In [8]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        encoding = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        encoding[:, 0::2] = torch.sin(position * div_term)
        encoding[:, 1::2] = torch.cos(position * div_term)
        # Non-persistent buffer: moves with the module across devices
        # without being added to the state_dict
        self.register_buffer("encoding", encoding.unsqueeze(0), persistent=False)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding of the first seq_len positions
        return x + self.encoding[:, :x.size(1)]

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        # batch_first=True: inputs are (batch, seq_len, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.linear1 = nn.Linear(d_model, 2048)
        self.dropout = nn.Dropout(0.2)
        self.linear2 = nn.Linear(2048, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: True above the diagonal blocks attention to future tokens
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_output, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        # Residual connection followed by layer normalization
        x = self.norm1(x + self.dropout(attn_output))

        ff_output = self.linear2(self.dropout(torch.relu(self.linear1(x))))
        x = self.norm2(x + self.dropout(ff_output))

        return x

class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, max_len=512):
        super().__init__()
        self.max_len = max_len
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_len)
        self.transformer_decoder_layers = nn.ModuleList(
            [TransformerDecoderLayer(d_model, nhead) for _ in range(num_layers)]
        )
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Accept a single unbatched sequence of token ids
        if x.dim() == 1:
            x = x.unsqueeze(0)
        # Keep only the last max_len tokens so the positional encoding covers the input
        x = x[:, -self.max_len:]

        x = self.embedding(x)
        x = self.positional_encoding(x)

        for layer in self.transformer_decoder_layers:
            x = layer(x)

        x = self.fc(x)
        return x

**PositionalEncoding:**
Adds positional information to input sequences, allowing the model to understand the order of elements in the sequence.

**TransformerDecoderLayer:**
A single layer of the transformer decoder, consisting of multi-head self-attention, feedforward neural network, and layer normalization. It helps the model capture context and relationships within the input sequence.

**TransformerDecoder:**
Assembles multiple transformer decoder layers to form a complete decoder. It incorporates embedding, positional encoding, and a final linear layer to output predictions based on the learned representations. This module is the heart of the transformer decoder architecture for language-related tasks.
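The sinusoidal encoding computed in PositionalEncoding can be written out in plain Python. This sketch follows the same formula as the class above (sine on even dimensions, cosine on odd ones, with frequencies decreasing as the dimension index grows); the function name and the tiny sizes are chosen for illustration only.

```python
import math

def sinusoidal_encoding(max_len, d_model):
    """Build a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term in PositionalEncoding
            angle = pos * math.exp(-math.log(10000.0) * i / d_model)
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table

pe = sinusoidal_encoding(max_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Every position gets a distinct vector, which is what lets the model tell apart identical tokens that appear at different places in the sequence.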

# Creation of the instance

In [10]:
vocab_size = tokenizer.vocab_size + 5  # tokenizer vocabulary plus its extra special tokens
d_model = 256   # embedding dimension
nhead = 4       # number of attention heads
num_layers = 4  # number of decoder blocks
max_len = 32    # context length: number of tokens taken into account to predict the next one

device = 'cuda' if torch.cuda.is_available() else 'cpu'

decoder = TransformerDecoder(vocab_size, d_model, nhead, num_layers, max_len).to(device)

pretrained_camembert = CamembertModel.from_pretrained('camembert-base')
pretrained_embedding_weights = pretrained_camembert.embeddings.word_embeddings.weight

# Keep the first vocab_size rows and the first d_model columns of the pretrained matrix,
# copied onto the same device as the decoder
reshaped_weights = pretrained_embedding_weights[:vocab_size, :d_model].detach().clone().to(device)

# Frozen embedding: these weights are not updated during training
decoder.embedding.weight = nn.Parameter(reshaped_weights, requires_grad=False)

I chose these parameters empirically.
I load part of the embedding layer of a pretrained CamemBERT model, which is compatible with my tokenizer. Loading these weights slightly shortens the training time.

In [104]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

num_parameters = count_parameters(decoder)
print(f"Number of parameters in the model: {num_parameters}")

Number of parameters in the model: 13485573
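The reported total can be checked by hand. The sketch below counts the trainable parameters from the layer shapes defined earlier, assuming vocab_size = 32005 (the value consistent with the printed total) and excluding the embedding, which is frozen and therefore skipped by count_parameters.

```python
vocab_size, d_model, d_ff, num_layers = 32005, 256, 2048, 4

# nn.MultiheadAttention: fused q/k/v projection plus output projection, with biases
attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
# Two linear layers of the feed-forward block, with biases
feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
# Two LayerNorms, each with a weight vector and a bias vector
layer_norms = 2 * (2 * d_model)

per_layer = attention + feed_forward + layer_norms
output_head = d_model * vocab_size + vocab_size  # final nn.Linear

total = num_layers * per_layer + output_head
print(per_layer, output_head, total)
```

Most of the trainable parameters sit in the output projection onto the vocabulary rather than in the decoder blocks.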


In [41]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(decoder.parameters(), lr=0.0001)

print_interval = 100
num_epochs = 10

decoder.eval()
avg_val_loss_list = []
loss_list = []

In [None]:
for epoch in range(num_epochs):
    decoder.train()
    running_loss = 0.0

    for i, (inputs, targets) in enumerate(train_loader):
        # Move the batch to the same device as the model
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = decoder(inputs)

        loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))
        running_loss += loss.item()

        loss.backward()
        optimizer.step()

        if i % print_interval == 0:
            print(i, f'Loss: {loss.item():.4f}')
        # Store a plain float so the computation graph can be freed
        loss_list.append(loss.item())

    avg_val_loss = 0.0
    decoder.eval()

    with torch.no_grad():
        # Evaluate batch by batch on the validation loader
        for val_inputs, val_targets in val_loader:
            val_inputs, val_targets = val_inputs.to(device), val_targets.to(device)
            val_outputs = decoder(val_inputs)
            val_loss = criterion(val_outputs.view(-1, vocab_size), val_targets.view(-1))
            avg_val_loss += val_loss.item()

    avg_val_loss /= len(val_loader)
    avg_val_loss_list.append(avg_val_loss)

    print(epoch, f'Avg Validation Loss: {avg_val_loss:.4f}')

In [61]:
torch.save(decoder.state_dict(), "model")

In [12]:
# Run this cell to restore the trained model without retraining
decoder_state_dict = torch.load("model", map_location=device)
decoder = TransformerDecoder(vocab_size, d_model, nhead, num_layers, max_len).to(device)
decoder.load_state_dict(decoder_state_dict)

<All keys matched successfully>

Training for 10 epochs took about 7 hours on my machine; the training time will depend on yours. Towards the end, the validation loss starts increasing again: the model is overfitting.

![Example Image](graph.png)
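One simple way to act on this observation is to keep the epoch with the lowest validation loss, for instance by inspecting avg_val_loss_list after training. The sketch below uses made-up loss values, not the ones from this run, and a hypothetical helper name.

```python
def best_epoch(val_losses):
    """Return (epoch index, loss) for the lowest validation loss."""
    epoch = min(range(len(val_losses)), key=lambda e: val_losses[e])
    return epoch, val_losses[epoch]

# Hypothetical values: the loss falls, then rises again as the model overfits
example_val_losses = [5.9, 5.2, 4.8, 4.6, 4.7, 4.9]
epoch, loss = best_epoch(example_val_losses)
print(f"Best epoch: {epoch}, validation loss: {loss}")
```

In practice this would mean saving a checkpoint whenever the validation loss improves and restoring the best one at the end, rather than keeping the weights of the final epoch.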

# Text generation

In [19]:
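def generate_tokens(model, tokenizer, seed_text, n_tokens):
    model.eval()
    # Run on whichever device the model's parameters live on
    model_device = next(model.parameters()).device

    # Encode the seed and drop the trailing end-of-sequence token
    seed_tokens = tokenizer.encode(seed_text, return_tensors="pt")[:, :-1].to(model_device)

    with torch.no_grad():
        for _ in range(n_tokens):
            # Use the model passed as argument rather than the global decoder
            logits = model(seed_tokens)
            logits = logits[:, -1, :]  # scores for the next token only
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # sample one token id
            seed_tokens = torch.cat([seed_tokens, idx_next], dim=-1)

    # Decode the generated tokens
    generated_text = tokenizer.decode(seed_tokens.squeeze().tolist())

    return generated_text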

To generate text, we sample the next token from the distribution predicted at the last position, append it to the tokens generated so far, and feed the extended sequence back into the transformer.
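The loop can be illustrated without a neural network. In the sketch below, a hypothetical toy_logits function stands in for the decoder and returns fixed scores over a four-token vocabulary; the sampling steps (softmax, random draw, append) mirror those in generate_tokens.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(tokens):
    # Stand-in for the model: favours the token that follows the last one (mod 4)
    scores = [0.0, 0.0, 0.0, 0.0]
    scores[(tokens[-1] + 1) % 4] = 2.0
    return scores

def generate(seed_tokens, n_tokens, rng):
    tokens = list(seed_tokens)
    for _ in range(n_tokens):
        probs = softmax(toy_logits(tokens))
        # Draw one token id according to the predicted probabilities
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

generated = generate([0], n_tokens=10, rng=random.Random(0))
print(generated)
```

Because the next token is sampled rather than chosen greedily, two runs with different random seeds can produce different continuations from the same seed text.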

In [30]:
model = decoder

seed_text = "La fourmie n'est pas préteuse, c'est là son moindre défault"
n_tokens = 500

generated_text = generate_tokens(model, tokenizer, seed_text, n_tokens)
print(generated_text.replace('@','\n'))

<s> La fourmie n'est pas préteuse, c'est là son moindre défault, que faisiez-vous au temps chaud? ma bonne est bonne.
En en sera parlé :
Agré l’aucun rien, sans profondeur ;
J’ac bien.
Prétendras ;
Ne pénétr tous les Roi
Si content, dit l'y en peu votre cou fait un marrons : on ne nos bien ; l’homme est un si bien sans plus. É lors plus des sourd autant les proposer l'une leçon ;
Sen nous l'y mettraiaient de leurs choux.
Fidèle en fit autour de toutes ité de pour l’enfin ne de sens.
Indi,
Des malheur nos ressorts
Et si n cause l'en sans bruit un pesant un moment.
Me firent passer homme, pour l'elle ;
Mon cousin acco avec des ordres mettre le Chat assez,
Jon vous ce récit de ce n'effort.
V si haut ;
Même grenier dans plus chapitres en souffrant un cent si bien comme la plus tant chéri de changer d'un certain se dés plus
La main en pèlerinage d leurs mœurs des Maxime. Souhait, n me reste : je n quoi?
Et ne vint en mal.
Qu’est-hu
Lors lui prête un arbre, gare, l'yon.
Ven ce n parent de co

The generated output has limited coherence, which is unsurprising given the modest size of the model, the small dataset, and the short training time. However, the model seems to have learned that a line starts with a capital letter, and some sequences of 3 to 4 words are correct. These findings are preliminary, but they give an idea of what the model could do with further refinement.