# Deep Learning project - 14/06/2024

- Irene Burri
- ID 0001120380

The goal of this project is to take as input a sequence of words that is a random permutation of a given English sentence and to reconstruct the original sentence.

The output can be either produced in a single shot, or through an iterative (autoregressive) loop generating a single token at a time.


CONSTRAINTS:
* No pretrained model can be used.
* The neural network models should have fewer than 20M parameters.
* No postprocessing should be done (e.g. no beam search).
* You cannot use additional training data.


BONUS PARAMETERS:

A bonus of 0-2 points will be awarded to encourage the adoption of models with a low number of parameters.

## Import the libraries

In [2]:
#!pip install --upgrade keras

from datasets import load_dataset
import tensorflow as tf
import numpy as np
import keras
import keras.layers as layers
import keras.ops as ops
from keras import backend as K
from difflib import SequenceMatcher
from keras.layers import TextVectorization, MultiHeadAttention, LayerNormalization
from tensorflow.keras.utils import Sequence
import os
import datetime
import random 
import matplotlib.pyplot as plt

# Setting the seed
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)
random.seed(seed)



## Dataset

The dataset is composed of sentences taken from the generics_kb dataset on Hugging Face. We restricted the vocabulary to the 10K most frequent words and only kept sentences that use this vocabulary.

In [1]:
!pip install datasets



### Download and filter the dataset

In [3]:
# Download the dataset
ds = load_dataset('generics_kb',trust_remote_code=True)['train']

# Keep only the sentences with more than 8 words.
ds = ds.filter(lambda row: len(row["generic_sentence"].split(" ")) > 8)
corpus = ['<start> ' + row['generic_sentence'].replace(",", " <comma>") + ' <end>' for row in ds]
corpus = np.array(corpus)


### Create a tokenizer and detokenizer

In [4]:
# max_tokens keeps only the most frequent tokens; the vocabulary is sorted from most to least frequent
tokenizer = TextVectorization(max_tokens=10000, standardize="lower_and_strip_punctuation", encoding="utf-8")
tokenizer.adapt(corpus)

class TextDetokenizer:
    def __init__(self, vectorize_layer):
        self.vectorize_layer = vectorize_layer
        vocab = self.vectorize_layer.get_vocabulary()
        self.index_to_word = {index: word for index, word in enumerate(vocab)}

    def __detokenize_tokens(self, tokens):
        def check_token(t):
          if t == 3:
            s="<start>"
          elif t == 2:
            s="<end>"
          elif t == 7:
            s="<comma>"
          else:
            s=self.index_to_word.get(t, '[UNK]')
          return s

        return ' '.join([ check_token(token) for token in tokens if token != 0])

    def __call__(self, batch_tokens):
       return [self.__detokenize_tokens(tokens) for tokens in batch_tokens]


detokenizer = TextDetokenizer( tokenizer )
sentences = tokenizer( corpus ).numpy()

In [5]:
# Remove from the corpus the sentences in which any unknown word appears
mask = np.sum( (sentences==1), axis=1) >= 1
original_data = np.delete( sentences, mask , axis=0)

original_data.shape

(241236, 28)

### Data preprocessing

In [6]:
class DataGenerator(Sequence):
    def __init__(self, data, batch_size=256, shuffle=True, seed=42):

        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        data_batch = np.array([self.data[k] for k in indexes])
        #copy of ordered sequences
        result = np.copy(data_batch)
        #shuffle only the relevant positions for each batch
        for i in range(data_batch.shape[0]):
          np.random.shuffle(data_batch[i,1:data_batch[i].argmin() - 1])

        return data_batch , result

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
              np.random.seed(self.seed)
            np.random.shuffle(self.indexes)
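
To make the slicing in `__getitem__` concrete, here is a minimal pure-Python sketch of the same idea on a single toy row. The helper `shuffle_words` and the word ids 11-14 are hypothetical and only for illustration; the special ids (3 for `<start>`, 2 for `<end>`, 0 for padding) are the ones hard-coded in the detokenizer above. Only the tokens strictly between `<start>` and `<end>` are permuted.

```python
import random

START, END, PAD = 3, 2, 0  # special ids as mapped by the detokenizer above


def shuffle_words(row, rng):
    """Return a copy of `row` in which only the word tokens are permuted.

    Hypothetical helper: it mirrors the slice 1:first_pad - 1 used in
    DataGenerator.__getitem__, but works on a plain Python list.
    """
    row = list(row)
    # Index of the first padding slot (or the row length if there is none).
    first_pad = row.index(PAD) if PAD in row else len(row)
    words = row[1:first_pad - 1]          # everything between <start> and <end>
    rng.shuffle(words)                    # in-place permutation of the words
    return row[:1] + words + row[first_pad - 1:]


toy_row = [START, 11, 12, 13, 14, END, PAD, PAD]
shuffled = shuffle_words(toy_row, random.Random(0))
print(shuffled)
```

One detail worth keeping in mind: the generator locates the padding with `argmin()`, which returns the first `0` only when the row actually contains padding. For a row that fills all 28 positions, `argmin()` instead points at the smallest token id, so the slice boundary may differ slightly from what is described here.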

In [7]:
# Shuffle all the data
shuffled_indices = np.random.permutation(len(original_data))
original_data = original_data[shuffled_indices]


In [8]:
# Create trainset
train_generator = DataGenerator(original_data[:220000], 220000)
x_train,labels = train_generator.__getitem__(0)

## Metrics

Let s be the source string and p your prediction. The quality of the results will be measured according to the following metric:

1.  look for the longest common substring w of s and p
2.  compute |w|/max(|s|,|p|)

If the match is exact, the score is 1.

When computing the score, you should NOT consider the start and end tokens.



In [9]:
def score(s,p):
  match = SequenceMatcher(None, s, p).find_longest_match()
  #print(match.size)
  return (match.size/max(len(p),len(s)))
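
As a small worked example of the metric, the sketch below re-declares `score` with explicit bounds passed to `find_longest_match` (an assumption made only so that the snippet also runs on Python versions older than 3.9) and applies it to two toy pairs. Since `s` and `p` are strings, the match is computed character by character, spaces included.

```python
from difflib import SequenceMatcher


def score(s, p):
    # Longest common contiguous block of characters between s and p.
    match = SequenceMatcher(None, s, p).find_longest_match(0, len(s), 0, len(p))
    return match.size / max(len(s), len(p))


# Exact reconstruction: the whole string matches.
exact = score("abcde", "abcde")

# One wrong character in the middle: the longest common block has 2 characters
# out of 5 ("ab" or "de"), so the score is 2 / 5.
partial = score("abcde", "abxde")

# Word-level example: moving one word breaks the sentence into shorter blocks.
sentence = score("the cat sat on the mat", "the cat on the mat sat")

print(exact, partial, sentence)
```

A single misplaced word can therefore cost a large fraction of the score, because only one contiguous block counts.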

## Model architecture

#### The implementation is inspired by the classic Transformer architecture. I started from the Keras documentation and then removed the positional embedding from the Encoder block.


In [10]:
dropout_rate = 0.2

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        # The embedding layer turns positive integers into dense vectors,
        # (Words with similar meaning are close to each other)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        # get the number of tokens
        maxlen = tf.shape(x)[-1]
        # get all positions in order
        positions = tf.range(start=0, limit=maxlen, delta=1)
        # then get the embedded positions
        positions = self.pos_emb(positions)
        # compute the token embeddings
        x = self.token_emb(x)
        # finally return the embedded tokens + the positions
        return x + positions

class TokenEmbedding(layers.Layer):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # The embedding layer turns positive integers into dense vectors,
        # (Words with similar meaning are close to each other)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

    def call(self, x):
        # compute the token embeddings
        x = self.token_emb(x)
        # return the token embeddings only (no positional information is added)
        return x  

In [11]:
class TransformerEncoderBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate = dropout_rate):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=True):
        attn_output = self.att(inputs, inputs, inputs) # Multi head attention where Key, Value and Query are all the same
        attn_output = self.dropout1(attn_output, training=training) # We add a dropout to reduce overfitting
        out1 = self.layernorm1(inputs + attn_output) # We add a residual connection and layernorm the result
        ffn_output = self.ffn(out1) # Feedforward network
        ffn_output = self.dropout2(ffn_output, training=training) # a second dropout
        return self.layernorm2(out1 + ffn_output) # a second residual connection

class TransformerEncoder(layers.Layer):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, input_vocab_size, maximum_position_encoding, rate = dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.embed_dim = embed_dim
        self.token_emb = TokenEmbedding(vocab_size=input_vocab_size, embed_dim=embed_dim)
        self.enc_layers = [TransformerEncoderBlock(embed_dim, num_heads, ff_dim, rate) for _ in range(num_layers)]
        self.dropout = layers.Dropout(rate)

    def call(self, inputs, training=True):
        x = self.token_emb(inputs)
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training=training)
        return self.dropout(x, training=training)

In [12]:
class TransformerDecoderBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate = dropout_rate):
        super().__init__()
        self.att1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.att2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, enc_output, training=True):
        attn_output = self.att2(inputs, inputs, inputs, use_causal_mask=True) # Causal self-attention: Key, Value and Query are all the decoder inputs
        attn_output = self.dropout1(attn_output, training=training) # We add a dropout to reduce overfitting
        out1 = self.layernorm1(inputs + attn_output) # We add a residual connection and layernorm the result
        attn_output_2 = self.att1(out1,enc_output, enc_output) # Cross-attention: Query from the decoder, Key and Value from the encoder output
        out2= self.layernorm2(attn_output_2 + out1) # a second residual connection
        ffn_output = self.ffn(out2) # Feedforward network
        ffn_output = self.dropout2(ffn_output, training=training) # a second dropout
        return self.layernorm3(out2 + ffn_output) # a third residual connection

class TransformerDecoder(layers.Layer):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, target_vocab_size, maximum_position_encoding, rate = dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.embed_dim = embed_dim
        self.token_emb = TokenAndPositionEmbedding(maxlen=maximum_position_encoding, vocab_size=target_vocab_size, embed_dim=embed_dim)
        self.dec_layers = [TransformerDecoderBlock(embed_dim, num_heads, ff_dim, rate) for _ in range(num_layers)]
        self.dropout = layers.Dropout(rate)

    def call(self, inputs, enc_output, training=True):
        attention_weights = {}
        x = self.token_emb(inputs)
        for i in range(self.num_layers):
            x = self.dec_layers[i](x, enc_output, training=training)
        return self.dropout(x, training=training)

In [13]:
class Transformer(keras.Model):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, input_vocab_size, target_vocab_size, pe_input, pe_target, rate = dropout_rate):
        super().__init__()
        self.encoder = TransformerEncoder(num_layers, embed_dim, num_heads, ff_dim, input_vocab_size, pe_input, rate)
        self.decoder = TransformerDecoder(num_layers, embed_dim, num_heads, ff_dim, target_vocab_size, pe_target, rate)
        self.final_layer = layers.Dense(target_vocab_size)

    def call(self, inputs,training=True):
        x,y=inputs
        enc_output = self.encoder(x, training=training)
        dec_output = self.decoder(y, enc_output, training=training)
        final_output = self.final_layer(dec_output)
        return final_output

### Setting the model's parameters and creating the model

In [14]:
# Number of transformer layers in the model
num_layers = 4

# Dimensions of the embedding 
embed_dim = 200

# Number of attention heads in the multi-head attention mechanism
num_heads = 3

# Dimensionality of the feed-forward layers
ff_dim = 64

# Size of the vocabulary for the input  and target sequence
input_vocab_size = 10000
target_vocab_size = 10000

# Maximum positional encoding value for the input and target sequence
pe_input = 28
pe_target = 28

In [15]:
# Instantiate the Transformer model 
transformer = Transformer(num_layers, embed_dim, num_heads, ff_dim, input_vocab_size, target_vocab_size, pe_input, pe_target)

### Custom utilities 
- Custom masked loss function
- Custom masked accuracy function 
- Custom Scheduler 

In [16]:
K_VALUE = 1.00
max_sequence_len = 28

# Definition of a custom masked loss that works directly on tokens
def custom_masked_loss(label, pred):
    mask = label != 0
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    loss = loss_object(label, pred)

    a = tf.cast(tf.range(1, max_sequence_len + 1), tf.float32)
    constant_val = tf.constant(K_VALUE)
    final_array = tf.pow(constant_val, a)

    mask = tf.cast(mask, dtype=loss.dtype)
    mask *= final_array

    loss *= mask

    loss = tf.reduce_sum(loss) / tf.reduce_sum(mask)
    return loss
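
The weighting used by `custom_masked_loss` can be illustrated without TensorFlow. The sketch below is a simplified, single-sequence version (the real function sums over the whole batch); the name `masked_weighted_mean` and the toy numbers are hypothetical. Each non-padding position `i` (1-based) gets weight `k ** i`, padding positions get weight 0, and the result is the weighted mean of the per-token losses. With `K_VALUE = 1.00`, as set above, every weight is 1 and the loss reduces to a plain mean over the non-padding tokens.

```python
def masked_weighted_mean(token_losses, labels, k=1.0):
    """Weighted mean of per-token losses, ignoring padding (label 0)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for position, (loss, label) in enumerate(zip(token_losses, labels), start=1):
        if label == 0:          # padding: contributes neither loss nor weight
            continue
        weight = k ** position  # positional weight, equal to 1 when k == 1
        weighted_sum += weight * loss
        weight_total += weight
    return weighted_sum / weight_total


toy_losses = [2.0, 4.0, 9.0]
toy_labels = [5, 7, 0]          # the last position is padding

plain = masked_weighted_mean(toy_losses, toy_labels, k=1.0)
decayed = masked_weighted_mean(toy_losses, toy_labels, k=0.5)
print(plain, decayed)
```

A value of `k` below 1 would give more weight to the first tokens of the sentence, while a value above 1 would favour the last ones.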

In [17]:
# Defining a custom metric that works directly on tokens
def masked_accuracy(label, pred):
    pred = tf.argmax(pred, axis=2)
    label = tf.cast(label, pred.dtype)
    match = label == pred

    mask = label != 0

    match = match & mask

    match = tf.cast(match, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    return tf.reduce_sum(match) / tf.reduce_sum(mask)

In [18]:
# Definition of a custom scheduler 
class CustomScheduler(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
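
To see the shape of this schedule, the following pure-Python sketch evaluates the same formula, `d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)`, for a few steps. The function name `lr_at` is hypothetical; the defaults `d_model=200` and `warmup_steps=4000` are the values used in this notebook. The learning rate grows linearly during warmup, peaks at `step == warmup_steps`, then decays as the inverse square root of the step.

```python
def lr_at(step, d_model=200, warmup_steps=4000):
    """Learning rate at a given optimizer step (step must be >= 1)."""
    warmup_branch = step * warmup_steps ** -1.5   # linear increase
    decay_branch = step ** -0.5                   # inverse square-root decay
    return d_model ** -0.5 * min(decay_branch, warmup_branch)


for step in (1, 1000, 2000, 4000, 8000, 16000):
    print(f"step {step:>6}: lr = {lr_at(step):.6f}")
```

With 817 steps per epoch, as in the training log of this notebook, the 4000 warmup steps correspond to a little under five epochs.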

### Compile the model 

* Adam optimizer
* Batch size 256
* Early stopping on accuracy


In [19]:
# Initialize the Adam optimizer with the custom learning rate scheduler
learning_rate = CustomScheduler(embed_dim)
opt = keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

In [20]:
# Define the batch size for training
BATCH_SIZE = 256
vocab_size = 10000

# Verify transformer setup with random source and target sequences
src = tf.random.uniform((BATCH_SIZE, max_sequence_len), dtype=tf.int64, minval=0, maxval=vocab_size)
trg = tf.random.uniform((BATCH_SIZE, max_sequence_len), dtype=tf.int64, minval=0, maxval=vocab_size)
transformer((src,trg))

# Compile the transformer model with the custom optimizer, loss function, and accuracy metric and check the architecture
transformer.compile(optimizer=opt, loss=[custom_masked_loss], metrics=[masked_accuracy])
transformer.summary()

# Remove the first element of each row and append a column of zeros to maintain the sequence length after the shift
sliced_array = labels[:, 1:]
ordered_sentences_shifted = np.hstack((sliced_array, np.zeros((sliced_array.shape[0], 1),dtype=int)))

# Set up early stopping to monitor validation accuracy
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_masked_accuracy",
    min_delta=0,
    patience=10,
    verbose=1,
    mode="max",
    restore_best_weights=False,
    start_from_epoch=0,
)

# Train the transformer model with the training data, using early stopping and a 5% validation split
history = transformer.fit(x=(x_train,labels), y=ordered_sentences_shifted, batch_size=BATCH_SIZE, epochs=50, callbacks=[early_stopping], validation_split=0.05)

Epoch 1/50
817/817 ━━━━━━━━━━━━━━━━━━━━ 206s 168ms/step - loss: 7.9287 - masked_accuracy: 0.0974 - val_loss: 4.7158 - val_masked_accuracy: 0.3768
Epoch 2/50
817/817 ━━━━━━━━━━━━━━━━━━━━ 87s 106ms/step - loss: 4.2387 - masked_accuracy: 0.4240 - val_loss: 2.5961 - val_masked_accuracy: 0.5959
Epoch 3/50
817/817 ━━━━━━━━━━━━━━━━━━━━ 86s 106ms/step - loss: 2.4364 - masked_accuracy: 0.6020 - val_loss: 1.5531 - val_masked_accuracy: 0.6940
Epoch 4/50
817/817 ━━━━━━━━━━━━━━━━━━━━ 86s 106ms/step - loss: 1.5813 - masked_accuracy: 0.6844 - val_loss: 1.2047 - val_masked_accuracy: 0.7351
Epoch 5/50
817/817 ━━━━━━━━━━━━━━━━━━━━ 86s 106ms/step - loss: 1.2612 - masked_accuracy: 0.7213 - val_loss: 1.0686 - val_masked_accuracy: 0.7527
Epoch 6/50
817/817 ━━━━━━━━━━━━━━━━━━━━ 86s 106ms/step - loss: 1.0972 - masked_ac

## Testing the model

* Translating the outputs of the model into sequences of tokens
* Detokenizing the sequences into human-readable sentences
* Computing the score on 3000 elements of the test set


In [21]:
# Translate input sentences using the transformer model and predict token-by-token up to max_length
def translate(input_sentences, max_length=28):
    batch_size = tf.shape(input_sentences)[0]
    encoder_input = input_sentences #tf.expand_dims(input_sentences, 0)
    decoded_indexes = [[3] for _ in range(batch_size)]

    for i in range(1, max_length):
        decoder_input = tf.convert_to_tensor(decoded_indexes)#tf.expand_dims(decoded_indexes, 0)
        predictions = np.array(transformer((np.array(encoder_input), np.array(decoder_input)), training = False))
        predictions = predictions[:, -1, :]
        for j in range(batch_size):
            best_index = np.argmax(predictions[j])
            decoded_indexes[j].append(best_index)
    return decoded_indexes
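
The loop in `translate` is plain greedy decoding: at every step the most likely next token is appended and the extended prefix is fed back to the decoder. The sketch below shows the same idea for one sentence with a stand-in model. Everything in it is hypothetical: `toy_scores` replaces the transformer and simply replays a fixed target sequence, and the early stop on `<end>` is an assumption that is not part of the original loop, which always runs up to `max_length`.

```python
START, END = 3, 2  # special ids as mapped by the detokenizer above


def toy_scores(prefix, target=(3, 5, 6, 4, 2), vocab_size=8):
    """Stand-in for the model: one score per vocabulary id for the next token."""
    scores = [0.0] * vocab_size
    next_id = target[len(prefix)] if len(prefix) < len(target) else END
    scores[next_id] = 1.0
    return scores


def greedy_decode(next_token_scores, max_length=28):
    decoded = [START]
    for _ in range(1, max_length):
        scores = next_token_scores(decoded)
        # Greedy choice: index of the highest score (argmax), no beam search.
        best = max(range(len(scores)), key=scores.__getitem__)
        decoded.append(best)
        if best == END:  # assumption: stop once <end> has been produced
            break
    return decoded


result = greedy_decode(toy_scores)
print(result)
```

Because each step keeps only one candidate, an early mistake cannot be corrected later; this is the price of respecting the "no beam search" constraint.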

In [22]:
detokenizer = TextDetokenizer(tokenizer)

# Set batch size and total number of sentences to compute the score on
batch_size = 500
total = 3000

# Initialize DataGenerator objects for training and testing sets
trainset=DataGenerator(original_data[:220000], batch_size=total)
testset=DataGenerator(original_data[220000:], batch_size=total)

# Although only the testset will be considered, I computed the score on both the train set and the test set over 3000 samples to see the trend.
# I iterate over each batch in the dataset, translating, detokenizing, and computing scores for each batch.
for dataset in [trainset, testset]:
    print("##### DATASET ##### \n") 
    shuffled_sentences_test,original_sentences_test = dataset.__getitem__(0)
    all_scores = []
    for i in range(total//batch_size):
        shuffled_sentences = shuffled_sentences_test[i*batch_size:(i+1)*batch_size]
        original_sentences = original_sentences_test[i*batch_size:(i+1)*batch_size]
        translated_sentences = translate(shuffled_sentences)

        detokenized_predictions= detokenizer(translated_sentences)
        detokenized_labels=detokenizer(original_sentences)

#         print(detokenized_predictions[0].replace("<start>", "").replace("<end>", "").replace(" <comma>", ",").strip())
#         print(detokenized_labels[0].replace("<start>", "").replace("<end>", "").replace(" <comma>", ",").strip())

        all_scores += [score(single_original.replace("<start>", "").replace("<end>", "").replace(" <comma>", ",").strip(), single_translated.replace("<start>", "").replace("<end>", "").replace(" <comma>", ",").strip()) for single_original, single_translated in zip(detokenized_labels, detokenized_predictions)]
        print(f"Computed: {len(all_scores)}; Score: {np.mean(all_scores)}")

##### DATASET ##### 





Computed: 500; Score: 0.8983231436298127
Computed: 1000; Score: 0.89943775764734
Computed: 1500; Score: 0.9017073316857626
Computed: 2000; Score: 0.8954527473103515
Computed: 2500; Score: 0.8950228223456183
Computed: 3000; Score: 0.8980481871110777
##### DATASET ##### 

Computed: 500; Score: 0.5036069817702377
Computed: 1000; Score: 0.511865962980707
Computed: 1500; Score: 0.5081348188609218
Computed: 2000; Score: 0.5072448989158479
Computed: 2500; Score: 0.5045335404919101
Computed: 3000; Score: 0.5016748135891118


In [23]:
transformer.save_weights('transformer.weights.h5')

## Last Considerations:
I started with an Encoder-only Transformer, but its performance never reached a satisfactory level, so I switched to an Encoder-Decoder Transformer. In the Encoder block I removed the positional embedding, since the positions of the tokens in the shuffled sentences are meaningless.

With this architecture I ran several trials. I soon found that some hyperparameters, such as the embedding dimension and the number of heads, greatly influence the number of parameters of the network, whereas the feed-forward layer dimension has less influence.
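
The effect of each hyperparameter on the model size can be estimated by hand. The sketch below is an approximate count under stated assumptions: it assumes the default Keras layers with biases, `key_dim=embed_dim` in every `MultiHeadAttention` (as in the blocks above), a vocabulary of 10000 tokens and a maximum length of 28. The function `count_params` is a hypothetical helper, not part of the notebook; `transformer.summary()` remains the reference.

```python
def count_params(num_layers, embed_dim, num_heads, ff_dim,
                 vocab_size=10000, max_len=28):
    d, h = embed_dim, num_heads
    # Attention: query, key and value projections (d -> h*d each, with bias)
    # plus the output projection (h*d -> d, with bias).
    mha = 3 * (d * h * d + h * d) + (h * d * d + d)
    # Feed-forward network: Dense(ff_dim) followed by Dense(embed_dim).
    ffn = (d * ff_dim + ff_dim) + (ff_dim * d + d)
    norm = 2 * d                                   # gamma and beta
    encoder_block = mha + ffn + 2 * norm           # 1 attention, 2 layer norms
    decoder_block = 2 * mha + ffn + 3 * norm       # 2 attentions, 3 layer norms
    # Encoder token embedding, decoder token + position embeddings.
    embeddings = vocab_size * d + (vocab_size * d + max_len * d)
    final_dense = d * vocab_size + vocab_size
    return num_layers * (encoder_block + decoder_block) + embeddings + final_dense


base = count_params(num_layers=4, embed_dim=200, num_heads=3, ff_dim=64)
one_more_head = count_params(4, 200, 4, 64) - base
double_ff_dim = count_params(4, 200, 3, 128) - base
print(base, one_more_head, double_ff_dim)
```

With the values set in the hyperparameter cell above (4 layers, embedding dimension 200, 3 heads, feed-forward dimension 64) this estimate gives about 12.0M parameters. Because `key_dim` equals the embedding dimension, every extra head adds full-size projections to three attention layers per encoder-decoder pair, which is why adding one head costs far more than doubling the feed-forward dimension.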

The configurations below came out as the best ones:

- Number of heads: 2, embedding dimension: 300, number of layers: 4, feed-forward layer dimension: 64, dropout rate: 0.4, which produces a network with 18M parameters
- Number of heads: 2, embedding dimension: 256, number of layers: 6, feed-forward layer dimension: 64, dropout rate: 0.4, which produces a network with 17.5M parameters
- Number of heads: 2, embedding dimension: 240, number of layers: 6, feed-forward layer dimension: 64, dropout rate: 0.4, which produces a network with 16M parameters
- Number of heads: 3, embedding dimension: 200, number of layers: 6, feed-forward layer dimension: 64, dropout rate: 0.4, which produces a network with 12M parameters

With these configurations the final scores were always between 0.49 and 0.51, so I decided to use the last configuration, which has the lowest number of parameters.