## <center> Interpreting SPARQL queries into understandable English questions using natural language processing </center>

### By:

- Marco Mudenge
- Andy Chen
- Grover-Brando Tovar Oblitas

## Description

In this notebook, we will build an automatic translator using the Transformer architecture. The idea is to use an automatic translation system to translate queries in SPARQL language into questions in English.

#### What's SPARQL?
SPARQL is a query language for knowledge bases, similar in spirit to SQL. Knowledge bases are sources of structured data that follow the standards, models and languages of the Semantic Web, and they give efficient access to a large quantity of information in a wide variety of fields. However, access to them is limited by the complexity of the queries, which prevents the general public from using them directly. It is also difficult for an uninformed user to understand the meaning of a query. We therefore want to build a Transformer model that interprets a SPARQL query on the DBpedia knowledge base by associating it with a question in English.

Thus, our machine translation system will take a SPARQL query as input and produce as output an English sentence corresponding to the question posed by the query. For example:

__Input__: _select distinct count ( ?uri ) where { dbr:Apocalypto dbo:language ?x . ?x dbp:region ?uri }_

__Expected output__: _In how many other dbp:region do people live, whose dbo:language are spoken in dbr:Apocalypto?_

You may have noticed that we reuse elements with the prefix dbr/dbo/dbp which are associated with data in DBpedia and the knowledge base schema. dbr:Apocalypto is simply a URI that describes a resource (or data) in DBpedia. Here is the URI in question: https://dbpedia.org/describe/?url=http%3A%2F%2Fdbpedia.org%2Fresource%2FApocalypto&sid=35407

In this notebook, we will reproduce the Transformer architecture using Keras layers. Some methods are inspired by the implementation in the [TensorFlow](https://www.tensorflow.org/text/tutorials/transformer) tutorial.

## DESCRIPTION OF DATA AND EVALUATION METRICS

The corpus contains 5,000 question-query pairs on DBpedia covering a wide variety of more or less specific themes. Three data sets are provided:

- 4,000 question-query pairs for training, in a `train.csv` file.
- 500 question-query pairs for validation, in a `validation.csv` file.
- 500 question-query pairs for testing, in a `test.csv` file.

The BLEU metric will be used to compare the model's translations to the reference questions.
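
To give an intuition of what BLEU measures, here is a minimal pure-Python sketch of a simplified sentence-level score. It is an illustration only, not the `sentence_bleu` function from NLTK that the notebook imports: it assumes a single reference, n-grams up to size 2, uniform weights and no smoothing. The function names `ngram_counts` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of size n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    if not candidate:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))

    # Penalize candidates that are shorter than the reference
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "how many movies are there whose dbo_director is dbr_Stanley_Kubrick".split()
perfect = simple_bleu(reference, reference)
partial = simple_bleu(reference, "how many movies have dbo_director dbr_Stanley_Kubrick".split())
print(perfect, partial)
```

A candidate identical to the reference scores 1.0, while a shorter candidate that shares only some n-grams scores strictly between 0 and 1.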

## LET'S BEGIN!

In [1]:
!pip install tensorflow_text

Collecting tensorflow_text
  Downloading tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
Installing collected packages: tensorflow_text
Successfully installed tensorflow_text-2.14.0


In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_text
import pathlib
import re
from nltk.translate.bleu_score import sentence_bleu
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

In [3]:
import os

root = os.getcwd() + "/" # To change if needed

### 1 Data preparation

We must first prepare the data before sending it to the translation system. For this, two classes will be used:
- The `DataLoader` class simply reads the data from the training and validation files.
- The `Preprocessor` class pre-processes the data into the expected format.

In [4]:
class DataLoader:
    """
     Class used to load data into DataFrame
    """

    def __init__(self, training_path: str, validation_path: str) -> None:

        self.train = pd.read_csv(training_path, sep=',', header=0).drop(columns=['id'])
        self.val = pd.read_csv(validation_path, sep=',', header=0).drop(columns=['id'])

    def get_train(self) -> pd.DataFrame:
        return self.train

    def get_val(self) -> pd.DataFrame:
        return self.val

#### 1.1 Pre-processing

The `Preprocessor` class will perform the following transformations on SPARQL queries:
- Replace all keywords (prefixes) of the form `dbx:` with `dbx_` (for example, `dbr:` becomes `dbr_` and `dbo:` becomes `dbo_`). Keywords that should be considered are: `dbr`, `dbo`, `dbp` and `rdf`
- Replace all the following punctuation marks with words:
   - `?` will become `var_`
   - `{` will become `brack_open`
   - `}` will become `brack_close`
   - `(` will become `parent_open`
   - `)` will become `parent_close`
   - `.` will become `sep_dot`

For English questions, the `Preprocessor` class will perform the following transformations:
- Remove `?` at the end of sentences
- Replace all keywords of the form `dbx:` with `dbx_` (for example, `dbr:` becomes `dbr_` and `dbo:` becomes `dbo_`). Keywords that should be considered are: `dbr`, `dbo`, `dbp` and `rdf`
- Remove all unnecessary spaces at the start and at the end of the question

This class also takes care of undoing the pre-processing once the Transformer has generated a sequence. This includes reverting the transformations indicated above and removing the start and end tokens of sentences, which are added by the tokenizer a little further down.
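
One quick way to check that the substitutions listed above lose no information is to apply them and then invert them. The sketch below does this for a SPARQL query in plain Python. It is a standalone illustration, and the function names `encode_sparql` and `decode_sparql` are hypothetical, not part of the notebook. It also assumes that no resource name already contains one of the replacement words (such as `var_`), otherwise the round trip would not be exact.

```python
PREFIXES = {"dbr:": "dbr_", "dbo:": "dbo_", "dbp:": "dbp_", "rdf:": "rdf_"}
SYMBOLS = {
    "?": "var_",
    "{": "brack_open",
    "}": "brack_close",
    "(": "parent_open",
    ")": "parent_close",
    ".": "sep_dot",
}


def encode_sparql(query):
    """Apply the prefix and punctuation substitutions described above."""
    for old, new in {**PREFIXES, **SYMBOLS}.items():
        query = query.replace(old, new)
    return query


def decode_sparql(query):
    """Revert the substitutions: words back to symbols, then prefixes."""
    for symbol, word in SYMBOLS.items():
        query = query.replace(word, symbol)
    for prefix, replacement in PREFIXES.items():
        query = query.replace(replacement, prefix)
    return query


original = "select distinct count ( ?uri ) where { ?uri dbo:director dbr:Stanley_Kubrick . }"
encoded = encode_sparql(original)
decoded = decode_sparql(encoded)
print(encoded)
print(decoded == original)
```

Note that the substitutions are plain string replacements, so symbols that appear inside a resource name (for example parentheses) are replaced as well.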

In [5]:
class Preprocessor:
    """
    Transforms and cleans data to improve model performance
    """

    SPARQL_TRANSLATE_OBJECTS = {
        "dbr:": "dbr_",
        "dbo:": "dbo_",
        "dbp:": "dbp_",
        "rdf:": "rdf_"
    }

    SPARQL_TRANSLATE_SYMBOLS = {
        "?": "var_",
        "{": "brack_open",
        "}": "brack_close",
        "(": "parent_open",
        ")": "parent_close",
        ".": "sep_dot"
    }

    def transform_dataframe(self, data: pd.DataFrame):
        """
        Transforms the data of a DataFrame containing the 'english'
        and 'sparql' columns. Calls the functions `transform_sparql` and
        `transform_english` on the corresponding columns

        Args:
            - data: Data to transform

        Returns:
            Transformed data
        """

        data['sparql'] = data['sparql'].apply(self.transform_sparql)

        data['english'] = data['english'].apply(self.transform_english)

        return data

    def transform_sparql(self, sparql: str):
        """
        Transform a sparql query by replacing the "dbx:" tokens with "dbx_"
        and replacing the punctuation marks with their equivalent in words
        as indicated above

        Args:
            sparql: sparql query

        Returns:
            Sparql query transformed with the modifications mentioned above
        """

        for key, value in self.SPARQL_TRANSLATE_OBJECTS.items():
            sparql = sparql.replace(key, value)

        for key, value in self.SPARQL_TRANSLATE_SYMBOLS.items():
            sparql = sparql.replace(key, value)

        return sparql

    def transform_english(self, english: str):
        """
        Transforms an English question by replacing the "dbx:" tokens
        with "dbx_" and removing the question marks as well as
        unnecessary spaces at the beginning and end of the sentence

        Args:
            - english: English sentence to which the transformations
            are applied

        Returns:
            Sentence transformed with the modifications mentioned above

        """

        for key, value in self.SPARQL_TRANSLATE_OBJECTS.items():
            english = english.replace(key, value)

        english = english.replace('?', '')

        english = english.strip()

        return english

    def transform_back_english(self, english):
        """
        Performs the reverse transformations on the English sentence
        (replaces dbx_ with dbx:).
        Note: the start and end tokens that are added to a sentence
        during segmentation (tokenization) must also be removed

        Args:
            - english: Sentence generated by a model, containing the
            start and end tokens

        Returns:
            - English sentence whose transformations have been undone
        """
        english = bytes(tf.squeeze(english).numpy()).decode()

        english = english.replace(' _ ', '_')

        for key, value in self.SPARQL_TRANSLATE_OBJECTS.items():
            english = english.replace(value, key)

        return english



Testing the `Preprocessor` class below

In [6]:
def test_preprocessor():

    test_queries = [
        'select distinct count ( ?uri ) where { ?uri dbo:director dbr:Stanley_Kubrick . }',
        'select distinct ?uri where { ?uri dbo:founder dbr:John_Forbes_(British_Army_officer) . ?uri rdf:type dbo:City }'
    ]

    test_english = [
        'how many movies are there whose dbo:director is dbr:Stanley_Kubrick ?',
        'what dbo:City\'s dbo:founder is dbr:John_Forbes_(British_Army_officer) ?'
    ]

    preprocessor = Preprocessor()
    print('Transformed sparql : ')
    for query in test_queries:
        print(preprocessor.transform_sparql(query))

    print()
    print('Transformed english : ')
    for english in test_english:
        print(preprocessor.transform_english(english))


test_preprocessor()

Transformed sparql : 
select distinct count parent_open var_uri parent_close where brack_open var_uri dbo_director dbr_Stanley_Kubrick sep_dot brack_close
select distinct var_uri where brack_open var_uri dbo_founder dbr_John_Forbes_parent_openBritish_Army_officerparent_close sep_dot var_uri rdf_type dbo_City brack_close

Transformed english : 
how many movies are there whose dbo_director is dbr_Stanley_Kubrick
what dbo_City's dbo_founder is dbr_John_Forbes_(British_Army_officer)


Expected output :

```
Transformed sparql :
select distinct count parent_open var_uri parent_close where brack_open var_uri dbo_director dbr_Stanley_Kubrick sep_dot brack_close

select distinct var_uri where brack_open var_uri dbo_founder dbr_John_Forbes_parent_openBritish_Army_officerparent_close sep_dot var_uri rdf_type dbo_City brack_close

Transformed english :
how many movies are there whose dbo_director is dbr_Stanley_Kubrick

what dbo_City's dbo_founder is dbr_John_Forbes_(British_Army_officer)
```

We can now instantiate an object of the `DataLoader` class to load the training and validation data from the `train.csv` and `validation.csv` files.

In [7]:
data_loader = DataLoader(
    training_path=root + 'train.csv',
    validation_path=root +'validation.csv'
)

We then apply the pre-processing to the previously loaded data.

In [8]:
pre_processor = Preprocessor()
processed_train = pre_processor.transform_dataframe(data_loader.train)
processed_val = pre_processor.transform_dataframe(data_loader.val)

### 2. Segmentation (tokenization)

Once the data is imported and modified, the sentences must be adapted into a format that the model can understand.

First of all, we need to segment the sentences into tokens. For this, a dictionary of words (vocabulary) will be necessary.

#### 2.0 LanguageTokenizer

The `LanguageTokenizer` class will take care of creating this vocabulary and transforming sentences of a specific language into tokens. In our case, there will be 2 instances of this class: one for English and the other for sparql. This class has several functions that will be very useful to us including `create_vocab` to create the vocabulary of the model, `tokenize` to transform sentences into tokens and `detokenize` to transform tokens into sentences.

We will use the BERT tokenizer (segmenter) to build the vocabulary and find the tokens; its parameters are provided. This tokenizer splits each word into word pieces. For example, "characteristically" is segmented into 'characteristic' and '##ally'.

Then, once each sentence has been transformed into tokens, the start (`[START]`) and end (`[END]`) tokens must be added. This operation is carried out in the `add_start_end` function.
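
The splitting into word pieces can be pictured as a greedy longest-match-first search in the vocabulary. The following pure-Python sketch illustrates that idea together with the addition of the start and end tokens. It is a simplified illustration, not the `BertTokenizer` from `tensorflow_text`: the vocabulary `toy_vocab` is made up for the example, punctuation splitting is ignored, and the function names are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into word pieces with greedy longest-match-first search."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize_sentence(sentence, vocab):
    """Lower-case, split on spaces, segment each word, add [START] and [END]."""
    tokens = ["[START]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[END]")
    return tokens


toy_vocab = {"[PAD]", "[UNK]", "[START]", "[END]", "characteristic", "##ally", "how", "many"}
print(wordpiece_tokenize("characteristically", toy_vocab))
print(tokenize_sentence("How many", toy_vocab))
```

In the notebook itself, the tokenizer returns token indices rather than strings, and the start and end markers are added as the indices of `[START]` and `[END]` in the reserved tokens.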

In [9]:
class LanguageTokenizer(tf.Module):
    """
    Class representing a tokenizer for a specific language.
    In our case, there will be one for sparql and one for English
    """

    reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]
    START = tf.argmax(tf.constant(reserved_tokens) == "[START]")
    END = tf.argmax(tf.constant(reserved_tokens) == "[END]")

    tokenizer_params = dict(lower_case=True)

    vocab_args = dict(
        vocab_size = 8000,
        reserved_tokens=reserved_tokens,
        bert_tokenizer_params=tokenizer_params,
        learn_params=None,
    )

    def __init__(self, reserved_tokens, vocab_path):
        """
        Initializes the BertTokenizer using the `vocab_path` parameter
        and putting the tokenizer in “lower case” mode.

        Args:
            - reserved_tokens: Reserved tokens from the BertTokenizer
            - vocab_path: Path to the file containing the tokenizer vocabulary
        """

        super().__init__(name="LanguageTokenizer")

        # Build a lookup table from the saved vocabulary file needed for tokenization.
        # TextFileInitializer reads the file itself, so only the path is required.
        init = tf.lookup.TextFileInitializer(
            vocab_path,
            key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER)

        lookup_table = tf.lookup.StaticVocabularyTable(
            init,
            num_oov_buckets=1
        )

        self.tokenizer = tensorflow_text.BertTokenizer(lookup_table, **self.tokenizer_params)

        self.reserved_tokens = reserved_tokens

    @staticmethod
    def create_vocab(language_sentences: pd.DataFrame, path: str):
        """
        Creates a vocabulary from the input sentences
        (language_sentences). For this we use the bert_vocab_from_dataset()
        function.

        Once the vocabulary has been created, it will need to be saved in a specified file
        by the `path` attribute.

        Args:
            - language_sentences: DataFrame containing language sentences
            - path: Path where the vocabulary will be saved
        """

        # Convert the sentence DataFrame into a Dataset
        vocab_tf_dataset = tf.data.Dataset.from_tensor_slices((language_sentences))

        # Create the vocabulary
        vocab = bert_vocab.bert_vocab_from_dataset(vocab_tf_dataset, **LanguageTokenizer.vocab_args)

        # Save the vocabulary to the specified file
        with open(path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(vocab))

    @tf.function
    def tokenize(self, inputs):
        """
        Transforms sentences into token indexes and adds the
        start and end tokens.

        Args:
            - inputs: Input sentences

        Returns:
            Tokens matching the sentence with start and end tokens
        """

        # Tokenize the inputs into a ragged tensor
        tokens = self.tokenizer.tokenize(inputs).merge_dims(1,2)

        # Add the start and end token to the tokenized inputs
        tokens = LanguageTokenizer.add_start_end(tokens)

        return tokens

    @tf.function
    def detokenize(self, tokenized):
        """
        Transforms a list of indexes into tokens. Then apply
        the `cleanup_text` method to clean
        the data.

        Args:
            - tokenized: List of tokens

        Returns:
            Sentence corresponding to tokens
        """

        # Detokenize: turn the token IDs back into text returned as a ragged tensor
        detokenized = self.tokenizer.detokenize(tokenized)

        # Cleaning the detokenized text tensor back into a regular string
        detokenized = LanguageTokenizer.cleanup_text(self.reserved_tokens, detokenized)

        return detokenized

    @staticmethod
    def add_start_end(tokenized_sentences):
        """
        Function that adds the representation of the [START] and [END] tokens to the input sentence

        Args:
            - tokenized_sentences: Tensor containing the token indices of the sentences

        Returns:
            Initial tensor with token indices [START] and [END] at start and end
        """

        # Create the start and end tokens tensors. Dimensions should match the number of sentences
        start = tf.fill([tokenized_sentences.bounding_shape()[0], 1], LanguageTokenizer.START)
        end = tf.fill([tokenized_sentences.bounding_shape()[0], 1], LanguageTokenizer.END)

        # Make the data type compatible with the tokenized_sentences tensor
        start = tf.cast(start, tf.int64)
        end = tf.cast(end, tf.int64)

        # Concatenate the start tokens, sentences and end tokens
        tokenized_sentences = tf.concat([start, tokenized_sentences, end], axis=1)

        return tokenized_sentences


    @staticmethod
    def cleanup_text(reserved_tokens, token_txt):
        """
        Function that cleans up text generated by the BertTokenizer detokenize() function.

        Args:
            - reserved_tokens: Reserved tokens from the BertTokenizer
            - token_txt: Ragged tensor of string tokens generated by the detokenize() function

        Returns:
            Cleaned text
        """

        # Determine which tokens are reserved based on the reserved_tokens
        bad_tokens = [re.escape(token) for token in reserved_tokens if token != "[UNK]"] # Tokens to remove are all non-[UNK] tokens

        # Create a regular expression pattern from the bad_tokens
        bad_token_regex = tf.strings.join(bad_tokens, "|")

        # Determine which tokens match the bad_token_regex
        bad_cells = tf.strings.regex_full_match(token_txt, bad_token_regex)

        # Drop the bad tokens by keeping only the cells that did not match
        cleaned_cells = tf.ragged.boolean_mask(token_txt, ~bad_cells)

        # Joining the good cells back into text
        cleaned_cells = tf.strings.reduce_join(cleaned_cells, separator=' ', axis=-1)

        return cleaned_cells

Testing the `LanguageTokenizer` class below

In [10]:
def test_add_start_end():

    tokenized_sentence = tf.ragged.constant([[320, 24, 500, 23, 21], [43, 45, 102, 30]], dtype=tf.int64)
    tf.print(LanguageTokenizer.add_start_end(tokenized_sentence))

test_add_start_end()

[[2, 320, 24, ..., 23, 21, 3], [2, 43, 45, 102, 30, 3]]


Expected output:
```
[[2, 320, 24, ..., 23, 21, 3], [2, 43, 45, 102, 30, 3]]
```

In [11]:
def test_tokenizer():

    sentence = ['how many U.S Presidents were born in New York ?']
    vocab_path = root + 'test_language_vocab.txt'
    LanguageTokenizer.create_vocab(sentence, vocab_path)

    with open(vocab_path) as f:
        vocab = f.read()

    print('Vocabulary : ', vocab.replace('\n', ' '))
    test_tokenizer_obj = LanguageTokenizer(LanguageTokenizer.reserved_tokens, vocab_path)
    tokenized_sentence = test_tokenizer_obj.tokenize(sentence)
    tf.print(f'Tokenized sentence : {tokenized_sentence}')

    detokenized_sentence = test_tokenizer_obj.detokenize(tokenized_sentence)
    tf.print(f'Detokenized sentence : {bytes(tf.squeeze(detokenized_sentence).numpy()).decode()}') # TODO: Would fail for batched inputs

test_tokenizer()

Vocabulary :  [PAD] [UNK] [START] [END] . ? a b d e h i k m n o p r s t u w y ##. ##? ##a ##b ##d ##e ##h ##i ##k ##m ##n ##o ##p ##r ##s ##t ##u ##w ##y
Tokenized sentence : <tf.RaggedTensor [[2, 10, 34, 40, 13, 25, 33, 41, 20, 4, 18, 16, 36, 28, 37, 30, 27, 28,
  33, 38, 37, 21, 28, 36, 28, 7, 34, 36, 33, 11, 33, 14, 28, 40, 22, 34,
  36, 31, 5, 3]]>
Detokenized sentence : how many u . s presidents were born in new york ?


Expected output:
```
Vocabulary :  [PAD] [UNK] [START] [END] . ? a b d e h i k m n o p r s t u w y ##. ##? ##a ##b ##d ##e ##h ##i ##k ##m ##n ##o ##p ##r ##s ##t ##u ##w ##y
Tokenized sentence : <tf.RaggedTensor [[2, 10, 34, 40, 13, 25, 33, 41, 20, 4, 18, 16, 36, 28, 37, 30, 27, 28,
  33, 38, 37, 21, 28, 36, 28, 7, 34, 36, 33, 11, 33, 14, 28, 40, 22, 34,
  36, 31, 5, 3]]>
Detokenized sentence : how many u . s presidents were born in new york ?
```

#### 2.1 Vocabulary

We can now create the vocabulary for each language using the `create_vocab` function. We store English vocabulary in a file called `language_vocab_english.txt` and sparql vocabulary in a file called `language_vocab_sparql.txt`

In [12]:
sparql_vocab_path = root + 'language_vocab_sparql.txt'
english_vocab_path = root + 'language_vocab_english.txt'

# Create the vocabulary for the sparql and english sentences
#LanguageTokenizer.create_vocab(processed_train['sparql'], sparql_vocab_path)
#LanguageTokenizer.create_vocab(processed_train['english'], english_vocab_path)


To work with a single object, we group the two tokenizers into one class called `GroupedTokenizers`. Its constructor initializes the `english` attribute, corresponding to the English tokenizer, and the `sparql` attribute, corresponding to the SPARQL tokenizer.

In [13]:
class GroupedTokenizers(tf.Module):
    """
    This class brings together the two segmenters (tokenizers) which will be
    used (one for each language)
    """

    def __init__(self, reserved_tokens, vocab_english_path: str, vocab_sparql_path: str):
        """
        Initializes the two tokenizers (english and sparql)
        Args:
            - reserved_tokens: Reserved tokens from the BertTokenizer
            - vocab_english_path: Path to the file containing
            the English vocabulary of the tokenizer
            - vocab_sparql_path: Path to the file containing
            the sparql vocabulary of the segmenter (tokenizer)
        """
        self.english = LanguageTokenizer(reserved_tokens, vocab_english_path)
        self.sparql = LanguageTokenizer(reserved_tokens, vocab_sparql_path)

The following test verifies that the pre-processing and tokenizer are working correctly

In [14]:
tokenizers = GroupedTokenizers(
    LanguageTokenizer.reserved_tokens,
    root + 'language_vocab_english.txt',
    root + 'language_vocab_sparql.txt'
)

def test_tokenizer_preprocessor(tokenizers: GroupedTokenizers):
    """
    Checks that the tokenizer and preprocessor functions work correctly.
    If they do, the detokenized English sentence should match the input
    sentence (apart from lower-casing and the removed question mark)
    """
    english = 'how many movies are there whose dbo:director is dbr:Stanley_Kubrick ?'
    sparql = 'select distinct count ( ?uri ) where { ?uri dbo:director dbr:Stanley_Kubrick . }'
    print('English : \n', english, '\n')

    # processed_train = pre_processor.transform_dataframe
    pre_processor = Preprocessor()
    processed_english = pre_processor.transform_english(english)
    processed_sparql = pre_processor.transform_sparql(sparql)

    print('Processed english : \n', processed_english, '\n')
    tokenized_english = tokenizers.english.tokenize(processed_english)
    print('Tokenized english : \n', tokenized_english, '\n')
    detokenized_english = pd.Series(tokenizers.english.detokenize(tokenized_english).numpy())
    print('Detokenized english : \n', detokenized_english.apply(pre_processor.transform_back_english), '\n')
    print()
    print('------------------------------------------------')
    print()

    print('Sparql : \n', sparql, '\n')

    print('Processed sparql : \n', processed_sparql, '\n')
    tokenized_sparql = tokenizers.sparql.tokenize(processed_sparql)
    print('Tokenized sparql : \n', tokenized_sparql, '\n')

test_tokenizer_preprocessor(tokenizers)

English : 
 how many movies are there whose dbo:director is dbr:Stanley_Kubrick ? 

Processed english : 
 how many movies are there whose dbo_director is dbr_Stanley_Kubrick 

Tokenized english : 
 <tf.RaggedTensor [[2, 74, 75, 495, 67, 73, 65, 61, 25, 228, 59, 60, 25, 896, 95, 261, 25,
  36, 116, 329, 757, 114, 3]]> 

Detokenized english : 
 0    how many movies are there whose dbo:director i...
dtype: object 


------------------------------------------------

Sparql : 
 select distinct count ( ?uri ) where { ?uri dbo:director dbr:Stanley_Kubrick . } 

Processed sparql : 
 select distinct count parent_open var_uri parent_close where brack_open var_uri dbo_director dbr_Stanley_Kubrick sep_dot brack_close 

Tokenized sparql : 
 <tf.RaggedTensor [[2, 66, 65, 71, 68, 22, 63, 55, 22, 56, 68, 22, 62, 64, 57, 22, 63, 55,
  22, 56, 61, 22, 208, 59, 22, 41, 182, 80, 233, 22, 33, 879, 703, 111,
  60, 22, 58, 57, 22, 62, 3]]> 



### 3. Batching

Given the large amount of data involved in training a model, it is important to send the data as efficiently as possible. To do this, the data is grouped into small groups called “batches”. This makes it possible to process several elements in parallel and considerably reduces training time.

For this, the `Batcher` class will be used. This class takes care of grouping the data into small batches and preparing them to send to the model. This class has several functions:
- `make_batches`: Receives as a parameter an instance of the `tf.data.Dataset` class. It then divides the dataset into small batches and sends them to the `prepare_batch` function
- `prepare_batch`: Receives a batch and prepares it by performing the following transformations:
   - Segments input sentences using the correct tokenizers passed as parameters in the constructor
   - Ensures sentence size does not exceed `max_tokens`


<img src="https://github.com/marcomudenge/NLP_3_A_Transformer_For_Sparql_Translation/blob/main/Batcher.png?raw=1" alt="Batcher" width="100%" height="700"/>
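
The steps described above (grouping, truncating, padding and shifting the target by one token) can be sketched on plain Python lists. This is an illustrative sketch under simplifying assumptions, not the `Batcher` class itself: the sentences are already lists of token indices, the padding index is 0 (the position of `[PAD]` in the reserved tokens), and the function names are hypothetical.

```python
PAD, START, END = 0, 2, 3  # positions of [PAD], [START], [END] in the reserved tokens


def make_batches(examples, batch_size):
    """Group the examples into consecutive batches of at most batch_size items."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def pad(sequences, pad_id=PAD):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def prepare_targets(batch, max_tokens):
    """Truncate, pad, then build decoder inputs and labels shifted by one token."""
    truncated = pad([seq[:max_tokens + 1] for seq in batch])
    decoder_inputs = [seq[:-1] for seq in truncated]  # everything except the last position
    labels = [seq[1:] for seq in truncated]           # everything except the first position
    return decoder_inputs, labels


batch = [[START, 10, 11, 12, END], [START, 20, END]]
decoder_inputs, labels = prepare_targets(batch, max_tokens=8)
print(decoder_inputs)
print(labels)
```

At each position, the label is the token that follows the corresponding decoder input, which is what lets the model learn to predict the next token.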

In [15]:
class Batcher():
    """
    This class takes care of grouping the data into small groups (batches) and
    of preparing the data to send it to the model.
    """

    def __init__(self, tokenizers: GroupedTokenizers, train, max_tokens, batch_size, buffer_size):
        """
        Initializes the input parameters

        Args:
            - tokenizers: tokenizers used to transform the inputs into tokens
            - train: Boolean value indicating whether the batches will be used
            for training or not
            - max_tokens: Maximum number of tokens for an input
            - batch_size: Size of the groups (batches)
            - buffer_size: Size of the buffer used to shuffle the data in the
            case of training
        """

        self.tokenizers = tokenizers
        self.train = train
        self.max_tokens = max_tokens
        self.buffer_size = buffer_size
        self.batch_size = batch_size

    def prepare_batch(self, input_language, output_language=None):
        """
        Prepares batches to send to the model. This function is
        called for each element of a Tensorflow Dataset.

        Performs the following transformations:
            - Tokenize the input sentences using the correct tokenizers passed
            as a parameter in the constructor
            - Ensures sentence size does not exceed `max_tokens` (max_tokens
            is included)

        Args:
            - input_language: Input in the input language (sparql in our case)
            of size (self.batch_size, x)
            - output_language: Output in the output language (english in our case)
            of the size (self.batch_size, x). None in the case of test batches

        Returns:
            - If self.train == True:
                Returns a tuple of the form
                ((input_language, output_language_inputs), output_language_labels),
                i.e. the respective inputs of the encoder and decoder, and the
                expected decoder output.

                Here is what each return value represents:
                - input_language: tensor containing the tokens of the
                `input_language` parameter, truncated according to `max_tokens`
                - output_language_inputs: tensor containing the tokens of the
                `output_language` parameter, truncated according to `max_tokens`
                and without its last position (decoder input)
                - output_language_labels: tensor containing the tokens of the
                `output_language` parameter shifted by one position, so that
                each position holds the next token to predict

            - If self.train == False:
                Returns a tuple of the form (input_language, output_language):
                the encoder input and an output tensor of shape (batch, 1)
                that contains only the start token of each sentence
        """

        # Tokenize the input sentences
        input_language = self.tokenizers.sparql.tokenize(input_language)
        input_language = input_language[:, :(self.max_tokens+1)].to_tensor()

        # If we're in training mode, tokenize the output sentences
        if self.train:
            # Tokenize the output sentences
            output_language = self.tokenizers.english.tokenize(output_language)
            output_language = output_language[:, :(self.max_tokens+1)].to_tensor()
            output_language_inputs = output_language[:, :-1]
            output_language_labels = output_language[:, 1:]

            return (input_language, output_language_inputs), output_language_labels
        else:
            # If we're not in training mode, return the input tokens and output tensor initialized with the start token
            #output_language = self.tokenizers.english.tokenize(tf.fill((self.batch_size,), ''))
            output_language = self.tokenizers.english.tokenize(tf.fill((tf.shape(input_language)[0],), ''))
            output_language = output_language[:, :(self.max_tokens+1)]
            output_language_labels = output_language[:, :-1].to_tensor()
            return input_language, output_language_labels

    def make_batches(self, ds):
        """
        Args:
            - ds: Dataset containing examples of the form (sparql, english)

        Returns:
            The initial dataset (shuffled if self.train == True), grouped into
            batches of size self.batch_size, on each of which the
            self.prepare_batch function has been called, and whose elements
            are prefetched. The elements have the form
            ((sparql, english_in), english_label) if self.train == True
            and the form (sparql, english) if self.train == False
        """

        # Shuffle the dataset if it is for training
        if self.train:
            ds = ds.shuffle(self.buffer_size)

        # Group the examples into batches and prepare each batch
        ds = ds.batch(self.batch_size)
        ds = ds.map(self.prepare_batch)

        # Prefetch so that the next batch is prepared while the current one is used
        ds = ds.prefetch(tf.data.AUTOTUNE)

        return ds

You can now test the batcher using the following function. In training mode, check that the decoder output is the decoder input shifted by one token and that what enters the encoder is indeed SPARQL.

In [16]:
def test_batcher(tokenizers):


    english = pd.Series([
        'how many movies are there whose dbo_director is dbr_Stanley_Kubrick',
        'what is the dbo_River whose dbo_riverMouth is dbr_Dead_Sea',
    ])

    sparql = pd.Series([
        'select distinct count parent_open var_uri parent_close where brack_open var_uri dbo_director dbr_Stanley_Kubrick sep_dot brack_close',
        'select distinct var_uri where brack_open var_uri dbo_riverMouth dbr_Dead_Sea sep_dot var_uri rdf_type dbo_River brack_close',
    ])

    train = False
    batcher = Batcher(tokenizers, train, 8, 64, 20000)

    val_english = tf.data.Dataset.from_tensor_slices(english)
    val_sparql = tf.data.Dataset.from_tensor_slices(sparql)
    val_examples = tf.data.Dataset.zip((val_sparql, val_english))

    batches = batcher.make_batches(val_examples)
    for x in batches:
      if train:
        tf.print('Detokenized inputs encoder : ', tokenizers.sparql.detokenize(x[0][0]))
        tf.print('Detokenized inputs decoder  : ', tokenizers.english.detokenize(x[0][1]))
        tf.print('Detokenized outputs decoder : ', tokenizers.english.detokenize(x[1]))

        concat = tf.concat([x[0][0], x[0][1], x[1]], axis=1)
        print('Concatenated values : ', concat)
      else:
        tf.print(x[0])
        tf.print(x[1])

test_batcher(tokenizers)

[[2 66 65 ... 63 55 22]
 [2 66 65 ... 64 57 22]]
[[2]
 [2]]


### 4. Transformer

<img style="float: right;" src="https://github.com/marcomudenge/NLP_3_A_Transformer_For_Sparql_Translation/blob/main/Transformer.png?raw=1" alt="Transformer" width="500" height="700"/>

Now that the data is ready to be sent to the model, all that remains is to build its architecture with Keras. Keras is a library built on top of TensorFlow that makes it easier to develop models in an object-oriented style; since TensorFlow 2.0, it ships as part of TensorFlow itself. For more details, see the [Keras API documentation](https://keras.io/api/).

The architecture followed in this notebook is shown in the image on the right. The following layers will be implemented:
- `Positional Embedding`: Generates the position embeddings
- `Global-Self Attention`: Takes care of the encoder's attention mechanism
- `Feed Forward`: Applies a small fully connected network to the output of the attention layers
- `Decoder Attention`: Takes care of the first attention mechanism of the decoder
- `Cross Attention`: Takes care of the second attention mechanism of the decoder (connects the encoder to the decoder)

The addition and normalization layers will be included in the previous layers. For example, the `Add & Norm` layer that follows the `Global-Self Attention` layer in the graph will be included in the `Global-Self Attention` layer.

Additional layers will group these building blocks together to simplify the Transformer pipeline. Here is the list of layers that will be added to those on the graph:
- `Encoder Layer`: Represents a single encoder containing the `Global-Self Attention` and `Feed Forward` layers
- `Decoder Layer`: Represents a single decoder containing the `Decoder Attention`, `Cross Attention` and `Feed Forward` layers
- `Encoder`: Represents a stack of encoder layers applied one after the other
- `Decoder`: Represents a stack of decoder layers applied one after the other
- `Transformer`: Represents the entire Transformer and includes all encoders, decoders and embedding layers

Each layer will be created manually and implemented as a Keras layer. If you're not familiar with Keras, here are some tutorials that might help:
- https://keras.io/api/models/model/
- https://www.tensorflow.org/text/tutorials/transformer
- https://machinelearningmastery.com/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras/


#### 4.1 Positional Embedding

To let the model take the order of the tokens into account, we must give it information about the position of each token in a sentence. The `PositionalEmbedding` layer takes care of this. Position embeddings are generated with the following formulas and added to the token embeddings:
$$PE_{(pos, 2i)} = \sin \Big( \frac{pos}{10000^{2i/d_{model}}} \Big)$$
$$PE_{(pos, 2i+1)} = \cos \Big( \frac{pos}{10000^{2i/d_{model}}} \Big)$$

where $pos$ is the position of the token in the sentence, $d_{model}$ is the dimension of the output embeddings and $i$ indexes the sine/cosine pairs of the embedding vector ($0 \le i < d_{model}/2$).
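As a quick sanity check of the formulas, the NumPy sketch below builds the encoding table in the interleaved layout they describe (sine at even indices, cosine at odd indices). The helper name `sinusoidal_table` is ours, for illustration only, and it assumes an even `d_model`. Note that the `PositionalEmbedding` layer in this notebook instead concatenates all sines followed by all cosines, which is a permutation of the same columns.

```python
import numpy as np

def sinusoidal_table(length, d_model):
    """Illustrative helper (not part of the notebook): table of shape (length, d_model)."""
    pos = np.arange(length)[:, np.newaxis]        # (length, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]    # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))   # (length, d_model / 2)

    table = np.zeros((length, d_model))
    table[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    table[:, 1::2] = np.cos(angles)   # PE(pos, 2i + 1)
    return table

pe = sinusoidal_table(length=4, d_model=6)
print(pe.shape)
print(pe[0])   # at position 0 every sine is 0 and every cosine is 1
```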

The `generate_positions_embedding` method generates the position embeddings. It takes as input:
- `length`: Maximum number of tokens for which position embeddings must be generated
- `depth`: Dimension of the model embeddings

The `call` function of this layer takes the following parameter:
- `x`: Layer inputs, of shape (batch_size, input_size), where batch_size is the number of examples sent in one training iteration and input_size is the maximum length of the input sentences. This tensor contains the indices of the tokens of each sentence.

It returns the embeddings of the inputs in the latent space, including the token positions, with shape (batch_size, input_size, dim_model).

The `call` function must perform the following operations:
1. Call the `embedding_layer` layer, which generates the embeddings of the inputs
2. Multiply each value by the square root of `dim_model`. This scaling brings the embeddings to an order of magnitude comparable to the position embeddings added afterwards. For more details, see the original Transformer paper, "Attention Is All You Need".
3. Add the position embeddings to the scaled embeddings

In [17]:
class PositionalEmbedding(tf.keras.layers.Layer):
    """
    Class representing the step which incorporates the positions of the tokens into the latent space
    """
    def __init__(self, input_size, dim_model):
        """
        Initializes a layer of embeddings and position embeddings

        Args:
            - input_size: Input size of the layer (vocabulary size)
            - dim_model: Size of model embeddings (size of layer output embedding)
        """
        super().__init__()
        self.embedding_layer = tf.keras.layers.Embedding(input_size, dim_model, mask_zero=True)
        self.position_embeddings = self.generate_positions_embedding(length=2048, depth=dim_model)
        self.dim_model = dim_model

    def compute_mask(self, *args, **kwargs):
        return self.embedding_layer.compute_mask(*args, **kwargs)

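    def generate_positions_embedding(self, length, depth):
        # Half of the dimensions hold sines, the other half cosines
        depth = depth/2

        positions = np.arange(length)[:, np.newaxis]      # (length, 1)
        depths = np.arange(depth)[np.newaxis, :]/depth    # (1, depth), equals 2i/d_model

        angle_rates = 1 / (10000**depths)
        angle_rads = positions * angle_rates              # (length, depth)

        # Sines and cosines are concatenated rather than interleaved
        pos_encoding = np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)

        return tf.cast(pos_encoding, dtype=tf.float32)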

    def call(self, x):
        """
        Embeds the input tokens, scales the embeddings by sqrt(dim_model) and adds the position embeddings
        """
        x = self.embedding_layer(x) * tf.math.sqrt(tf.cast(self.dim_model, tf.float32))
        x = x + self.position_embeddings[tf.newaxis, :tf.shape(x)[1], :]
        return x

#### 4.2 Attention

The attention layers all share the same foundation: a multi-head attention layer, an addition layer and a normalization layer. The only difference between them is the `Q` (query), `K` (key) and `V` (value) inputs sent to the formula:

$$Attention(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
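To make the formula concrete, here is a minimal single-head NumPy sketch. It is an illustration only: the function name `scaled_dot_product_attention` and the toy matrices are ours, and the Keras `MultiHeadAttention` layer used in this notebook additionally applies learned projections for each head, which are omitted here.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention on 2D arrays: Q is (n_q, d_k), K is (n_k, d_k), V is (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_q, n_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

# Two queries attending over three key/value pairs (toy values)
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))   # each row of weights sums to 1
print(out.shape)              # one output vector per query
```

The output has one row per query, which is why the cross-attention output keeps the length of the decoder sequence even though the keys and values come from the encoder.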

For this, the `DefaultAttention` class, from which all other attention layers inherit, was created to avoid repeating the same constructor three times. You must complete the `call()` method of each subclass, namely `CrossAttention`, `GlobalSelfAttention` and `DecoderAttention`. To determine the values of `K`, `V` and `Q` for each attention layer, refer to the architecture graph.


In [18]:
class DefaultAttention(tf.keras.layers.Layer):
    """
    Base attention layer containing attention heads followed by a normalization and addition layer
    """
    def __init__(self, **kwargs):
        super().__init__()
        self.multiHeadAttention = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.layerNormalization = tf.keras.layers.LayerNormalization()
        self.addLayer = tf.keras.layers.Add()


##### 4.2.1 CrossAttention
In the case of the `CrossAttention` layer, the `call` function takes the following inputs as parameters:
- `input`: The inputs of the layer, corresponding to the output of the `DecoderAttention` layer
- `context`: The output of the encoder
- `training`: Boolean value indicating whether the model is training or not.

This function should perform the following operations:
1. Apply the multi-head attention layer with the correct values of `K`, `V` and `Q` (don't forget to pass the `training` argument to the layer).
2. Add the output of the attention layer to the inputs using the `Add` layer
3. Normalize the result using the normalization layer

In [19]:
class CrossAttention(DefaultAttention):
    """
    Layer that connects the encoder to the decoder.
    """

    def __init__(self, **kwargs):
        """
        Initializes an attention heads layer followed by a normalization layer
        then addition
        """
        super().__init__(**kwargs)

    def call(self, input, context, training):
        """
        Runs the attention layer. Adds attention outputs to the input and
         normalizes everything
        """
        # Initialize the query, key and value tensors
        Q, K, V = input, context, context

        # Call the multi-head attention layer
        attention_output = self.multiHeadAttention(query=Q, key=K, value=V, training=training)

        # Add the attention output to the input and normalize it
        output = self.addLayer([input, attention_output])
        output = self.layerNormalization(output)

        return output


##### 4.2.2 GlobalSelfAttention
In the case of the `GlobalSelfAttention` layer, the `call` function takes the following inputs as parameters:
- `input`: The inputs of the layer, corresponding to the output of the encoder's `PositionalEmbedding` layer (or of the previous encoder layer)
- `training`: Boolean value indicating whether the model is training or not.

This function should perform the following operations:
1. Apply the multi-head attention layer with the correct values of `K`, `V` and `Q` (don't forget to pass the `training` argument to the layer).
2. Add the output of the attention layer to the inputs using the `Add` layer
3. Normalize the result using the normalization layer

In [20]:
class GlobalSelfAttention(DefaultAttention):
    """
    Self-attention layer allowing the model to look at other words in
    the input phrase when it encodes a specific word
    """

    def __init__(self, **kwargs):
        """
        Initializes a layer of attention heads followed by a layer of
        normalization then addition
        """
        super().__init__(**kwargs)

    def call(self, input, training):
        """
        Runs the attention layer. Add attention outputs to input
        and normalize everything
        """
        # Initialize the query, key and value tensors
        Q = input
        K = input
        V = input

        # Call the multi-head attention layer
        attention_output = self.multiHeadAttention(query=Q, key=K, value=V, training=training)

        # Add the attention output to the input and normalize it
        output = self.addLayer([input, attention_output])
        output = self.layerNormalization(output)

        return output


##### 4.2.3 DecoderAttention
In the case of the `DecoderAttention` layer, the `call` function takes the following inputs as parameters:
- `input`: The inputs of the layer, corresponding to the output of the decoder's `PositionalEmbedding` layer (or of the previous decoder layer)
- `training`: Boolean value indicating whether the model is training or not.

The implementation is very similar to the `call` function of the `GlobalSelfAttention` class, but differs in one key point: the causal mask. This mask prevents the attention mechanism from considering future tokens. Without it, the Transformer could train while seeing the very tokens it must predict (in other words, by “cheating”). This [article](https://medium.com/analytics-vidhya/masking-in-transformers-self-attention-mechanism-bad3c9ec235c) gives more information on the causal mask.
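The effect of a causal mask can be illustrated with a small NumPy sketch. This is not how Keras implements `use_causal_mask` internally; it only shows the idea, using uniform raw scores so the result is easy to read.

```python
import numpy as np

seq_len = 4

# Lower-triangular mask: position i may only attend to positions j <= i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((seq_len, seq_len))              # uniform raw scores, for illustration
masked = np.where(causal_mask, scores, -np.inf)    # future positions get -inf

# Softmax over each row: masked positions receive a weight of exactly 0
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights)
```

The first token can only attend to itself, while the last token spreads its attention over all four positions.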

This function should perform the following operations:
1. Apply the multi-head attention layer with the correct values of `K`, `V` and `Q` (don't forget to pass the `training` argument to the layer and to enable the causal mask by passing `use_causal_mask=True` when calling the attention layer).
2. Add the output of the attention layer to the inputs using the `Add` layer
3. Normalize the result using the normalization layer

In [21]:
class DecoderAttention(DefaultAttention):
    """
    Attention layer similar to the global self-attention layer, but masking
    the data that comes after
    """
    def __init__(self, **kwargs):
        """
        Initializes an attention heads layer followed by a normalization layer
        then addition
        """
        super().__init__(**kwargs)

    def call(self, input, training):
        """
        Runs the attention layer with a causal mask that hides future tokens.
        Adds the attention outputs to the input and normalizes everything
        """
        # Initialize the query, key and value tensors
        Q, K, V = input, input, input

        # Call the multi-head attention layer
        attention_output = self.multiHeadAttention(query=Q, key=K, value=V, training=training, use_causal_mask=True)

        # Add the attention output to the input and normalize it
        output = self.addLayer([input, attention_output])
        output = self.layerNormalization(output)

        return output

We can test our implementation of attention layers using the following function:

In [22]:
def test_attention():
    config = {
        'num_heads': 3,
        'key_dim': 3,
        'dropout': 0.1
    }
    cross_attention = CrossAttention(**config)
    global_self_attention = GlobalSelfAttention(**config)
    decoder_attention = DecoderAttention(**config)

    # Create deterministic inputs and context
    generator = tf.random.Generator.from_seed(1)
    input = generator.normal(shape=(3, 1, 3))
    context = generator.normal(shape=(3, 1, 3))

    # Make attention layer deterministic
    layer = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=4, dropout=0.1, kernel_initializer=tf.keras.initializers.ones())
    cross_attention.multiHeadAttention = layer
    global_self_attention.multiHeadAttention = layer
    decoder_attention.multiHeadAttention = layer

    outputs_cross_attention = tf.cast(cross_attention(input, context) * 100, tf.int32)
    outputs_global_self_attention = tf.cast(global_self_attention(input) * 100, tf.int32)
    outputs_decoder_attention = tf.cast(decoder_attention(input) * 100, tf.int32)

    print('Cross Attention result : ')
    print(outputs_cross_attention, '\n')

    print('Global-Self Attention result : ')
    print(outputs_global_self_attention, '\n')

    print('Decoder Attention result : ')
    print(outputs_decoder_attention, '\n')

test_attention()

Cross Attention result : 
tf.Tensor(
[[[ 124 -119   -4]]

 [[ 140  -59  -80]]

 [[  79   60 -140]]], shape=(3, 1, 3), dtype=int32) 

Global-Self Attention result : 
tf.Tensor(
[[[ 124 -119   -4]]

 [[ 140  -59  -80]]

 [[  79   60 -140]]], shape=(3, 1, 3), dtype=int32) 

Decoder Attention result : 
tf.Tensor(
[[[ 124 -119   -4]]

 [[ 140  -59  -80]]

 [[  79   60 -140]]], shape=(3, 1, 3), dtype=int32) 



#### 4.3 Feed Forward

The Feed Forward layer is, in our case, simply a sequence of two dense layers and a dropout layer, followed by an addition layer and a normalization layer. The dense and dropout layers are already initialized in the constructor inside a `Sequential` model, which groups several layers and applies them one after the other.

The `call` function takes the following input as parameter:
- `input`: Layer inputs (varies depending on where this layer is located in the architecture)

It returns the result once the transformations have been applied to the inputs.

It performs the following operations:
1. Run the sequential layer initialized in the constructor
2. Add the result of the sequential layer to the inputs
3. Normalize the result using the normalization layer
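The three steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the weights are random rather than learned, dropout is omitted (as at inference time), and the normalization has no learned scale or offset; the value of `eps` is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_model, feed_forward_size = 4, 8

# Random stand-ins for the weights of the two dense layers
W1 = rng.normal(size=(dim_model, feed_forward_size))
b1 = np.zeros(feed_forward_size)
W2 = rng.normal(size=(feed_forward_size, dim_model))
b2 = np.zeros(dim_model)

def feed_forward_block(x, eps=1e-3):
    hidden = np.maximum(0.0, x @ W1 + b1)   # first dense layer with ReLU
    projected = hidden @ W2 + b2            # second dense layer, back to dim_model
    y = x + projected                       # residual addition
    mean = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mean) / np.sqrt(var + eps)  # layer normalization over the last axis

x = rng.normal(size=(2, 3, dim_model))      # (batch_size, sequence_length, dim_model)
out = feed_forward_block(x)
print(out.shape)
```

The block is applied independently at each position, and its output keeps the shape of its input, which is what allows several such layers to be stacked.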

In [23]:
class FeedForward(tf.keras.layers.Layer):
    """
    Propagation layer at the output of attention layers
    """

    def __init__(self, dim_model, feed_forward_size, dropout_rate=0.1):
        """
        Initializes the dense layers (with dropout), the addition layer and the normalization layer
        Args:
            - dim_model: Model dimension (input and output size of the layer)
            - feed_forward_size: Number of units of the hidden dense layer
            - dropout_rate: Fraction of the dropout layer inputs that
            will be randomly set to zero
        """
        super().__init__()
        self.seq = tf.keras.Sequential([
            tf.keras.layers.Dense(feed_forward_size, activation='relu'),
            tf.keras.layers.Dense(dim_model),
            tf.keras.layers.Dropout(dropout_rate)
        ])
        self.add = tf.keras.layers.Add()
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, input):
        """
        Runs propagation layers on the input, adds everything together and normalizes
        """
        # Add the layers and normalize the output
        output = self.add([input, self.seq(input)])
        output = self.layer_norm(output)

        return output

#### 4.4 Encoder

Our Transformer's encoder is made up of several `EncoderLayer` layers. Each one represents a single pass of an encoder; the `Encoder` class stacks several of them to let the Transformer capture more complex relationships between words.

We will therefore have to complete the `call` method of the `EncoderLayer` class. This method takes the following parameters as input:
- `input`: Layer inputs (notably the output of the `PositionalEmbedding` class)
- `training`: Boolean value indicating whether the method is called during training or not

It returns the inputs once they have passed through both layers (`GlobalSelfAttention`, `FeedForward`).

This method should perform the following operations:
1. Call the attention layer with the inputs
2. Call the feed forward layer on the output of the attention layer

In [24]:
class EncoderLayer(tf.keras.layers.Layer):
    """
    Class representing an encoder layer
    """

    def __init__(self, *, dim_model, num_heads, feed_forward_size, dropout_rate=0.1):
        """
        Initializes a self-attention layer followed by a propagation layer

        Args:
            dim_model: Dimension of model embeddings
            num_heads: Number of encoder attention heads
            feed_forward_size: Number of feed forward neurons
            dropout_rate: Fraction of the attention layer entries that will be
            randomly set to zero
        """
        super().__init__()

        self.self_attention = GlobalSelfAttention(
            num_heads=num_heads,
            key_dim=dim_model,
            dropout=dropout_rate
        )

        self.ffn = FeedForward(dim_model, feed_forward_size)

    def call(self, input, training):
        """
        Runs the attention and propagation layer on the inputs.
        The training argument specifies whether the call is made during training
        or not (important for the attention layer)
        """
        # Call the global self-attention layer and the feed forward layer
        output = self.self_attention(input, training=training)
        output = self.ffn(output)

        return output

Now, the `Encoder` class takes care of grouping several `EncoderLayer`s to allow the Transformer to infer more complex contexts.

The `call` method of the `Encoder` class takes the following parameters as input:
- `input`: Layer inputs (corresponding to the token indices of the sentence)
- `training`: Boolean value indicating whether the method is called during training or not

It returns the inputs once they have passed through all encoder layers.

This method performs the following operations:
1. Call the position embeddings layer on the inputs
2. Apply the dropout layer to the result
3. Call every `EncoderLayer` in turn (the output of one encoder layer becomes the input of the next)
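The third step is a plain sequential composition. The toy sketch below uses simple functions as stand-ins for `EncoderLayer` instances (the lambdas and the helper name `run_stack` are placeholders, not real layers) to show that each layer consumes the output of the previous one, so the order of the layers matters.

```python
def run_stack(layers, x):
    """Feed x through the layers one after the other."""
    for layer in layers:
        x = layer(x)
    return x

# Placeholder "layers": add 1, then double, then subtract 3
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

print(run_stack(layers, 5))                   # ((5 + 1) * 2) - 3
print(run_stack(list(reversed(layers)), 5))   # a different order gives a different result
```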

In [25]:
class Encoder(tf.keras.layers.Layer):
    """
    Class representing all Transformer encoders
    """

    def __init__(self, *, num_layers, dim_model, num_heads, feed_forward_size, vocab_size, dropout_rate=0.1):
        """
        Initializes the position embeddings layer, a dropout layer, and the encoder layers
        Args:
            num_layers: Number of encoder layers
            dim_model: Dimension of model embeddings
            num_heads: Number of encoder attention heads
            feed_forward_size: Number of units of the feed forward hidden layer
            vocab_size: Size of the vocabulary (corresponding to the input size of the
            position embeddings layer)
            dropout_rate: Fraction of the dropout layer inputs that will be randomly
            set to zero
        """
        super().__init__()

        self.dim_model = dim_model
        self.num_layers = num_layers

        self.pos_embedding = PositionalEmbedding(input_size=vocab_size, dim_model=dim_model)

        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.enc_layers = [EncoderLayer(dim_model=dim_model, num_heads=num_heads, feed_forward_size=feed_forward_size, dropout_rate=dropout_rate) for _ in range(num_layers)]

    def call(self, input, training):
        """
        Runs the embedding and dropout layers, then all the encoder layers
        """
        input = self.dropout(self.pos_embedding(input), training=training)
        for layer in self.enc_layers:
            input = layer(input, training=training)

        return input


#### 4.5 Decoder

The decoder of our Transformer is made up of several `DecoderLayer` layers. Each one represents a single pass of a decoder; the `Decoder` class stacks several of them to let the Transformer capture more complex relationships between words.

We will therefore have to complete the `call` method of the `DecoderLayer` class. This method takes the following parameters as input:
- `input`: Layer inputs
- `context`: The context of the cross-attention layer (the encoder output)
- `training`: Boolean value indicating whether the method is called during training or not

It returns the inputs once they have passed through all layers (`DecoderAttention`, `CrossAttention`, `FeedForward`).

This method should perform the following operations:
1. Call the decoder attention layer with the inputs
2. Call the cross-attention layer with the result and the context
3. Call the feed forward layer (`FeedForward`)

In [26]:
class DecoderLayer(tf.keras.layers.Layer):
    """
    Class representing a decoder layer
    """

    def __init__(self, *, dim_model, num_heads, feed_forward_size, dropout_rate=0.1):
        """
        Args:
            dim_model: Dimension of model embeddings
            num_heads: Number of decoder attention heads
            feed_forward_size: Number of feed forward neurons
            dropout_rate: Dropout rate of the attention layers
        """
        super().__init__()

        self.encoder_decoder_attention = DecoderAttention(
            num_heads=num_heads,
            key_dim=dim_model,
            dropout=dropout_rate
        )

        self.cross_attention = CrossAttention(
            num_heads=num_heads,
            key_dim=dim_model,
            dropout=dropout_rate
        )

        self.ffn = FeedForward(dim_model, feed_forward_size)

    def call(self, input, context, training):
        """
        Runs attention layers followed by FFN propagation layers
        """
        # Call the decoder attention layer, the cross attention layer and the feed forward layer
        output = self.encoder_decoder_attention(input, training=training)
        output = self.cross_attention(output, context, training=training)
        output = self.ffn(output)

        return output

The `Decoder` class takes care of grouping several `DecoderLayer`s.

The `call` method of the `Decoder` class takes the following parameters as input:
- `input`: Layer inputs (corresponding to the token indices of the sentence)
- `context`: Context of the attention layers (corresponding to the encoder output)
- `training`: Boolean value indicating whether the method is called during training or not

It returns inputs once they have passed through all decoder layers

This method performs the following operations:
1. Call position embeddings layer on inputs
2. Apply the dropout layer to the result
3. Call the `DecoderLayer` layers successively


In [27]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, *, num_layers, dim_model, num_heads, feed_forward_size, vocab_size, dropout_rate=0.1):
        """
        Initializes the position embeddings layer, a dropout layer, and the decoder layers
        Args:
            num_layers: Number of decoder layers
            dim_model: Dimension of model embeddings
            num_heads: Number of decoder attention heads
            feed_forward_size: Number of units of the feed forward hidden layer
            vocab_size: Vocabulary size (corresponding to the input size
            of the position embeddings layer)
            dropout_rate: Fraction of the dropout layer inputs that will be
            randomly set to zero
        """
        super().__init__()

        self.dim_model = dim_model
        self.num_layers = num_layers

        self.pos_embedding = PositionalEmbedding(input_size=vocab_size, dim_model=dim_model)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.dec_layers = [DecoderLayer(dim_model=dim_model, num_heads=num_heads, feed_forward_size=feed_forward_size, dropout_rate=dropout_rate) for x in range(self.num_layers)]

        self.last_attn_scores = None

    def call(self, input, context, training):
        """
        Runs the embedding and dropout layers,
        then all the decoder layers
        """
        input = self.dropout(self.pos_embedding(input), training=training)
        for layer in self.dec_layers:
            input = layer(input, context, training=training)

        return input

#### 4.6 Transformer

The Transformer is now ready to be created. The constructor already takes care of initializing all the attributes necessary for its operation.

The `call` function takes the following arguments as inputs:
- `inputs`: The model inputs in the form of a tuple grouping the SPARQL input and the English input (`inputs = (sparql, english)`)
- `training`: Boolean value indicating whether the model is training or not

The `call` method must:
1. Separate the inputs received into SPARQL and English
2. Send the SPARQL sentences to the encoder
3. Send the English sentences to the decoder with the encoder output as context
4. Send the decoder output to the dense layer initialized in the constructor (`self.dense_layer`)
5. Call the `drop_mask` function with the scores generated by the dense layer as argument (removing the `_keras_mask` attribute from these scores prevents the model from using this mask when computing the metrics and the loss)
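Step 4 is worth a closer look: the final dense layer maps each decoder position to one score per token of the target vocabulary. Because `self.dense_layer` is created without an activation, these scores are unnormalized logits rather than probabilities. The NumPy sketch below illustrates the shapes involved; the sizes and the random weight matrix `W` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, target_len, dim_model, target_vocab_size = 2, 5, 4, 7

# Stand-in for the decoder output: one dim_model vector per target position
decoder_output = rng.normal(size=(batch_size, target_len, dim_model))

# Stand-in for the dense layer weights (learned in the real model)
W = rng.normal(size=(dim_model, target_vocab_size))

logits = decoder_output @ W               # (batch_size, target_len, target_vocab_size)
predicted_ids = logits.argmax(axis=-1)    # most likely token index at each position

print(logits.shape)
print(predicted_ids.shape)
```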

In [28]:
class Transformer(tf.keras.Model):
    """
    Class representing the Transformer
    """

    def __init__(self, *, num_layers, dim_model, num_heads, feed_forward_size,
                input_vocab_size, target_vocab_size, dropout_rate=0.1):
        """
        Initializes the encoder and decoder layers and the final dense layer
        Args:
            num_layers: Number of encoder layers and of decoder layers
            dim_model: Dimension of model embeddings
            num_heads: Number of encoder and decoder attention heads
            feed_forward_size: Number of units of the feed forward hidden layer
            input_vocab_size: Size of the input vocabulary
            target_vocab_size: Size of the output vocabulary
            dropout_rate: Fraction of the dropout layer inputs that will be
            randomly set to zero
        """
        super().__init__()
        self.encoder = Encoder(num_layers=num_layers, dim_model=dim_model,
                            num_heads=num_heads, feed_forward_size=feed_forward_size,
                            vocab_size=input_vocab_size,
                            dropout_rate=dropout_rate)

        self.decoder = Decoder(num_layers=num_layers, dim_model=dim_model,
                            num_heads=num_heads, feed_forward_size=feed_forward_size,
                            vocab_size=target_vocab_size,
                            dropout_rate=dropout_rate)

        self.dense_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inputs, training=True):
        """
        Call the encoder and decoder layers with the correct inputs and
        context as well as the final dense layer
        """
        # Call the encoder and decoder layers
        sparql = inputs[0]
        english = inputs[1]
        sparql = self.encoder(sparql, training=training)
        english = self.decoder(english, sparql, training=training)

        # Call the dense layer
        english = self.dense_layer(english)

        self.drop_mask(training, english)

        return english

    def drop_mask(self, training, probabilities):
        if not training:
            try:
                del probabilities._keras_mask
            except AttributeError:
                pass

We can test our final Transformer implementation with the following function. **Be careful: getting the right results does not guarantee that your implementation is free of bugs, but it is already a good sign.**

In [29]:
def test_transformer():

    config = {
        'num_layers': 2,
        'dim_model': 2,
        'num_heads': 2,
        'feed_forward_size': 2,
        'input_vocab_size': 2,
        'target_vocab_size': 2,
        'dropout_rate': 0.1
    }

    initializer = tf.keras.initializers.glorot_normal(42)

    feed_forward = FeedForward(2, 2, 0.1)
    feed_forward.seq = tf.keras.Sequential([
        tf.keras.layers.Dense(2, activation='relu', kernel_initializer=initializer, use_bias=False),
        tf.keras.layers.Dense(2, kernel_initializer=initializer, use_bias=False),
        tf.keras.layers.Dropout(0.1, seed=42)
    ])
    feed_forward.add = tf.keras.layers.Add()
    feed_forward.layer_norm = tf.keras.layers.LayerNormalization(beta_initializer=initializer, gamma_initializer=initializer)

    transformer = Transformer(**config)

    transformer.encoder.pos_embedding.embedding_layer = tf.keras.layers.Embedding(2, 2, embeddings_initializer=initializer, mask_zero=False)
    for l in transformer.encoder.enc_layers:
        l.self_attention = GlobalSelfAttention(num_heads=2, key_dim=2, dropout=0.1, kernel_initializer=initializer)
        l.ffn = feed_forward
    transformer.encoder.dropout = tf.keras.layers.Dropout(0.1, seed=42)

    transformer.decoder.pos_embedding.embedding_layer = tf.keras.layers.Embedding(2, 2, embeddings_initializer=initializer, mask_zero=True)
    for l in transformer.decoder.dec_layers:
        l.cross_attention = CrossAttention(num_heads=2, key_dim=2, dropout=0.1, kernel_initializer=initializer)
        l.encoder_decoder_attention = DecoderAttention(num_heads=2, key_dim=2, dropout=0.1, kernel_initializer=initializer)
        l.ffn = feed_forward

    transformer.dense_layer = tf.keras.layers.Dense(2, kernel_initializer=initializer, use_bias=False)
    transformer.decoder.dropout = tf.keras.layers.Dropout(0.1, seed=42)

    # Create deterministic inputs and context
    generator = tf.random.Generator.from_seed(1)
    input = generator.normal(shape=(2, 2))
    context = generator.normal(shape=(2, 1))

    print(input)
    print(context)

    input_transformer = (input, context)
    output = transformer(input_transformer, training=False)
    print(output)

test_transformer()

tf.Tensor(
[[ 0.43842277 -0.53439844]
 [-0.07710262  1.5658045 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-0.79253083]
 [ 0.37646857]], shape=(2, 1), dtype=float32)
tf.Tensor(
[[[ 0.00026987 -0.03291528]]

 [[ 0.00026987 -0.03291528]]], shape=(2, 1, 2), dtype=float32)


```
tf.Tensor(
[[[[ 0.00026983 -0.03291529]
   [ 0.00026987 -0.03291498]]

  [[ 0.00026987 -0.03291498]
   [ 0.00026987 -0.03291498]]]


 [[[ 0.00026983 -0.03291529]
   [ 0.00026987 -0.03291498]]

  [[ 0.00026987 -0.03291498]
   [ 0.00026987 -0.03291498]]]], shape=(2, 2, 2, 2), dtype=float32)
```

#### 4.7 Scheduler

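The `Scheduler` class updates the learning rate of the model during training: the rate increases during the first `warmup_steps` steps, then decays proportionally to the inverse square root of the step number.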

In [30]:
class Scheduler(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, dim_model, warmup_steps):
        super().__init__()
        self.dim_model = tf.cast(dim_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.dim_model) * tf.math.minimum(tf.math.rsqrt(step), step * (self.warmup_steps ** -1.5))

    def get_config(self):
        # Keys match the __init__ argument names so the schedule can be rebuilt from its config
        config = {
            'dim_model': float(self.dim_model),
            'warmup_steps': self.warmup_steps,
        }
        return config
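
To see how this schedule behaves, the same formula can be evaluated in plain Python. This is a minimal sketch, not the notebook's `Scheduler` class: `warmup_lr` is a hypothetical helper name, and the default values `dim_model=128` and `warmup_steps=4000` are the ones passed to `Scheduler` in the `Translator` class.

```python
import math

def warmup_lr(step: int, dim_model: int = 128, warmup_steps: int = 4000) -> float:
    """Learning rate: dim_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    if step <= 0:
        return 0.0  # the linear term is 0 at step 0; this also avoids 0 ** -0.5
    return dim_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate grows linearly during warm-up, peaks at step == warmup_steps,
# then decays as 1 / sqrt(step).
for step in (1, 1000, 4000, 16000):
    print(step, warmup_lr(step))
```

With these values the peak is reached at step 4000; at step 1000 the rate is a quarter of the peak, and at step 16000 it is half of the peak.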

### 5. Model training

Now it's time to create the translator that will transform SPARQL queries into English. To do this, you will need to complete 3 methods of the `Translator` class, namely the `prepare`, `fit` and `translate` methods.
___

The `prepare` function receives the training and validation data as pandas DataFrames and takes care of:
1. Calling the preprocessor on the training and validation data
2. Creating a `tf.data.Dataset` object containing tuples of SPARQL queries and English questions for the training set and for the validation set
3. Sending the 2 created datasets (training and validation) to the batchers

It returns a tuple containing the training and validation batches.

___

The `fit` function simply trains the model with the training and validation data passed as parameters.
___

The `translate` function translates a series of SPARQL queries into English. To do this, it must:
1. Apply the preprocessor on the given test set
2. Create batches using the test batcher
3. For each value in the created batches
   - Extract the contents of the tuple. Remember that what is output by the `prepare_batch` method in the case of a test batcher is a tuple of the form (SPARQL sentence, English sentence) where initially, the English sentence is initialized with the starting token
   - Send contexts and sentences to the Transformer so that it predicts the next token
   - Concatenate together all the tokens predicted by the Transformer to generate the translation
4. Reduce the size of the predictions to remove everything that comes after the end token generated by the Transformer (if no end token is generated, the translation does not need to be trimmed)
5. Transform predicted tokens into words using the right tokenizer
6. Undo initial transformations done using pre-processing
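
The decoding loop of step 3 and the trimming of step 4 can be sketched without TensorFlow. This is a toy illustration under stated assumptions: `toy_next_token` is a hypothetical stand-in for the Transformer (it simply replays a fixed sequence), and the token ids `START = 2` and `END = 3` are made up for the example rather than taken from `LanguageTokenizer`.

```python
from typing import Callable, List

START, END = 2, 3  # hypothetical token ids, for illustration only

def greedy_decode(next_token: Callable[[List[int], List[int]], int],
                  context: List[int], max_tokens: int) -> List[int]:
    """Generate tokens one at a time, feeding the growing output back in."""
    output = [START]
    for _ in range(max_tokens):
        token = next_token(context, output)  # the model sees the context and all tokens so far
        output.append(token)
        if token == END:
            break
    return output

def trim_after_end(tokens: List[int]) -> List[int]:
    """Drop everything after the first END token; unchanged if there is none."""
    return tokens[:tokens.index(END) + 1] if END in tokens else tokens

def toy_next_token(context: List[int], output: List[int]) -> int:
    """Stand-in for the Transformer: replays a fixed answer, ignoring the context."""
    answer = [10, 11, 12, END]
    return answer[min(len(output) - 1, len(answer) - 1)]

print(greedy_decode(toy_next_token, context=[5, 6, 7], max_tokens=8))  # [2, 10, 11, 12, 3]
print(trim_after_end([2, 10, 3, 0, 0]))                                # [2, 10, 3]
```

The loop always picks a single token per step, which is why it is called greedy decoding; `max_tokens` bounds the loop in case the end token is never produced.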


The `masked_loss` and `masked_accuracy` functions are provided to you. They compute the loss and the accuracy of the Transformer while ignoring padding positions (label `0`), so that padded tokens do not affect the scores.
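
As an illustration of what masking does, here is a small pure-Python sketch. It is a simplification and not the notebook's implementation: the helper names `masked_mean_loss` and `masked_accuracy_sketch` are hypothetical, and the loss takes probabilities directly, whereas the TensorFlow version works on logits.

```python
import math
from typing import List

def masked_mean_loss(labels: List[int], probs: List[List[float]], pad_id: int = 0) -> float:
    """Average cross-entropy over non-padding positions only."""
    losses = [-math.log(p[y]) for y, p in zip(labels, probs) if y != pad_id]
    return sum(losses) / len(losses)

def masked_accuracy_sketch(labels: List[int], preds: List[int], pad_id: int = 0) -> float:
    """Fraction of non-padding positions where the prediction equals the label."""
    kept = [(y, p) for y, p in zip(labels, preds) if y != pad_id]
    return sum(y == p for y, p in kept) / len(kept)

labels = [1, 2, 0]                      # the last position is padding
probs = [[0.1, 0.8, 0.1],               # predicted distribution per position
         [0.2, 0.2, 0.6],
         [0.9, 0.05, 0.05]]
print(masked_mean_loss(labels, probs))              # mean of -log(0.8) and -log(0.6)
print(masked_accuracy_sketch(labels, [1, 1, 0]))    # 0.5: one of two real tokens is correct
```

Without the mask, the padded position would count as a correct prediction and lower the average loss, which would overestimate the quality of the model.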

In [31]:
class Translator:

    num_layers = 4
    dim_model = 128
    feed_forward_size = 512
    num_heads = 6
    dropout_rate = 0.1
    input_vocab_size = 8000
    target_vocab_size = 8000
    batch_size = 64
    batch_size_test = 64
    buffer_size = 20000
    buffer_size_test = None

    def __init__(self):
        """
        Initializes the preprocessor, tokenizers, batchers and Transformer
        with the right settings
        """

        self.pre_processor = Preprocessor()

        self.tokenizers = GroupedTokenizers(
            LanguageTokenizer.reserved_tokens,
            root + 'language_vocab_english.txt',
            root + 'language_vocab_sparql.txt'
        )

        self.train_batcher = Batcher(tokenizers=self.tokenizers, train=True, max_tokens=Translator.dim_model, batch_size=Translator.batch_size, buffer_size=Translator.buffer_size)
        self.test_batcher = Batcher(tokenizers=self.tokenizers, train=False, max_tokens=Translator.dim_model, batch_size=Translator.batch_size_test, buffer_size=Translator.buffer_size_test)

        self.transformer = Transformer(
            num_layers=Translator.num_layers,
            dim_model=Translator.dim_model,
            num_heads=Translator.num_heads,
            feed_forward_size=Translator.feed_forward_size,
            input_vocab_size=Translator.input_vocab_size,
            target_vocab_size=Translator.target_vocab_size,
            dropout_rate=Translator.dropout_rate)

        self.scheduler = Scheduler(Translator.dim_model, 4000)
        self.optimizer = tf.keras.optimizers.Adam(self.scheduler, beta_1=0.9, beta_2=0.95, epsilon=1e-9)

        self.transformer.compile(
            loss=Translator.masked_loss,
            optimizer=self.optimizer,
            metrics=[Translator.masked_accuracy])

        self.end = self.tokenizers.sparql.tokenize([''])[0][1][tf.newaxis]

    def prepare(self, train: pd.DataFrame, val: pd.DataFrame):
        """
        Prepares validation and training sets for training
        by sending them to the preprocessor and batcher
        Args:
            - train: Training DataFrame with sparql (input) and English (output) columns
            - val: Validation DataFrame with sparql (input) and English (output) columns

        Returns:
            Tuple containing the training batches and the validation batches
        """
        # Preprocess the data
        train = self.pre_processor.transform_dataframe(train)
        val = self.pre_processor.transform_dataframe(val)

        # Change from dataframe to dataset
        train_dataset = tf.data.Dataset.from_tensor_slices((train['sparql'], train['english']))
        val_dataset = tf.data.Dataset.from_tensor_slices((val['sparql'], val['english']))

        # make the batches
        train_batches = self.train_batcher.make_batches(train_dataset)
        val_batches = self.test_batcher.make_batches(val_dataset)

        return train_batches, val_batches

    def fit(self, training, validation, epochs=50):
        """
        Train the model using the training set and validate the result
        """
        # Train the model
        self.transformer.fit(x=training,
                             batch_size=Translator.batch_size,
                             epochs=epochs,
                             #validation_data=validation,
                             #validation_batch_size=Translator.batch_size_test
                             )

    def translate(self, sparql: pd.Series):
        """
        Translates a series of sparql queries into English
        """
        # Pre-process the SPARQL queries
        sparql_processed = sparql.apply(self.pre_processor.transform_sparql)

        # Make a dataset
        sparql_dataset = tf.data.Dataset.from_tensor_slices(sparql_processed)

        # Create batches
        batches = self.test_batcher.make_batches(sparql_dataset)

        # For each batch
        output_list = []
        for batch in batches:
            for item1, item2 in zip(batch[0], batch[1]):

                item1 = tf.expand_dims(item1, axis=0)
                item2 = tf.expand_dims(item2, axis=0)
                # tf.print(item1)
                # tf.print(item2)

                output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
                output_array = output_array.write(0, LanguageTokenizer.START)

                # Predict one token at a time until the end-of-sentence token is produced
                for i in tf.range(Translator.dim_model):
                    output = tf.transpose(output_array.stack())
                    output = tf.expand_dims(output, axis=0)
                    predictions = self.transformer([item1, output], training=False)

                    # Select the last token from the seq_len dimension.
                    predictions = predictions[:, -1:, :]  # Shape (batch_size, 1, vocab_size).
                    predicted_id = tf.argmax(predictions, axis=-1)[0][0]

                    # Append the predicted_id to the output, which is fed back
                    # to the decoder as its input.
                    output_array = output_array.write(i+1, predicted_id)
                    if predicted_id == LanguageTokenizer.END:
                        break

                output = tf.transpose(output_array.stack())
                output_list.append(output)
                output_array = output_array.mark_used()

                text = self.tokenizers.english.detokenize(tf.expand_dims(output, axis=0))
                input_text = self.tokenizers.sparql.detokenize(tf.expand_dims(item1, axis=0))
        return output_list

    @staticmethod
    def masked_loss(label, pred):
        # Positions whose label is 0 (padding) are excluded from the loss
        mask = label != 0
        loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction='none')
        loss = loss_object(label, pred)

        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask

        loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
        return loss

    @staticmethod
    def masked_accuracy(label, pred):
        pred = tf.argmax(pred, axis=2)
        label = tf.cast(label, pred.dtype)
        match = label == pred

        mask = label != 0

        match = match & mask

        match = tf.cast(match, dtype=tf.float32)
        mask = tf.cast(mask, dtype=tf.float32)
        return tf.reduce_sum(match)/tf.reduce_sum(mask)

#### 5.1 Data preparation

Execute the cell below to create an instance of the `Translator` class, load the training and validation data, and prepare the data for training.

In [32]:
translator = Translator()

data_loader = DataLoader(
    training_path=root + 'train.csv',
    validation_path=root + 'validation.csv'
)

train_batch, val_batch = translator.prepare(data_loader.train, data_loader.val)

#### 5.2 Training

Train the model with the data

In [33]:
translator.fit(train_batch, val_batch)

Epoch 1/50
...
Epoch 50/50


#### 5.3 Translation

Translate the SPARQL queries of the validation set to check how well the model performs.

In [35]:
predictions = translator.translate(data_loader.val['sparql'])

In [None]:
a = pd.Series(predictions)
a_df = pd.DataFrame(a)
a_df

In [None]:
formatted_predictions = pd.concat([pd.DataFrame(predictions), data_loader.val], axis=1)
formatted_predictions.drop(['id', 'sparql'], inplace=True, axis=1)
formatted_predictions.rename(columns={0:'prediction', 'english':'target_text'}, inplace=True)
formatted_predictions.head(10)

### 6. Evaluation: BLEU Metric

To evaluate the quality of the translations, the BLEU metric will be used. The formula is given below:
$$BLEU = BP \cdot \exp \Big( \sum_{n=1}^{N} w_n \log p_n \Big)$$

where $p_n$ is the modified precision for n-grams of size $n$: the number of n-grams of the predicted sentence that also appear in the reference sentence (each counted at most as many times as it occurs in the reference), divided by the total number of n-grams in the predicted sentence. Let $r$ be the number of words in the target sentence and $c$ the number of words in the predicted sentence. If $c > r$, then $BP = 1$. Otherwise $BP = \exp(1 - \frac{r}{c})$.

The choice of the weights $w_n$ gives the different variants of the BLEU metric. In our case, the BLEU-3 metric will be used, with $N = 3$ and $w_n = 1/3$.
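
The formula can be worked through on a small example. The sketch below is a simplified re-implementation for illustration, assuming a single reference sentence, uniform weights and no smoothing; it is not NLTK's `sentence_bleu`, and the helper names (`ngrams`, `modified_precision`, `bleu`) and the two example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference: List[str], candidate: List[str], n: int) -> float:
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(reference: List[str], candidate: List[str], max_n: int = 3) -> float:
    """BLEU with uniform weights w_n = 1 / max_n and the brevity penalty BP."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; this sketch applies no smoothing
    r, c = len(reference), len(candidate)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

reference = "what is the capital of france".split()
candidate = "what is the capital of spain".split()
# p_1 = 5/6, p_2 = 4/5, p_3 = 3/4 and BP = 1, so BLEU-3 = (1/2) ** (1/3)
print(bleu(reference, candidate))
```

A single wrong word lowers all three precisions at once, because it breaks every n-gram that contains it. This is why BLEU scores drop quickly as n grows.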

In [None]:
def evaluate_model(data: pd.DataFrame):
    """
    Evaluates model accuracy using the BLEU metric
    Args:
        - data: DataFrame containing two columns (prediction and target_text)

    Returns:
        The average BLEU score
    """
    weights = (1/3, 1/3, 1/3) # Use BLEU-3
    # Rows with a missing prediction are skipped and count as a score of 0 in the average
    scores = np.zeros(data.shape[0])
    index = 0
    for _, row in data.iterrows():
        if pd.isnull(row['prediction']):
            continue
        # sentence_bleu expects token lists: a list of references and one hypothesis
        prediction = row['prediction'].split()
        target_text = row['target_text'].split()

        scores[index] = sentence_bleu([target_text], prediction, weights=weights)

        index += 1
    return np.mean(scores)


#### 6.1 Model evaluation

Call the `evaluate_model` function on your model's predictions to evaluate its performance.

In [None]:
evaluate_model(formatted_predictions)

#### 6.2. Error analysis
Analyze model translations and errors. Implement a statistical analysis (in the form of your choice) which displays categories of errors and their % occurrence among all possible errors. You can direct your function to describe specific dimensions. For example: are the errors more often on the elements of the "dbx" knowledge base or on the rest of the tokens? Are the errors due to elements that are not seen in training?


> TODO
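
One possible starting point for this analysis is sketched below. It is a rough illustration under simplifying assumptions, not a complete answer: an "error" is defined here as a target token that is absent from the prediction (word order is ignored), and `error_categories`, `KB_PREFIXES` and `train_vocab` are hypothetical names.

```python
from collections import Counter
from typing import Set

KB_PREFIXES = ("dbr:", "dbo:", "dbp:")  # the "dbx" knowledge base prefixes

def error_categories(target: str, prediction: str, train_vocab: Set[str]) -> Counter:
    """Count target tokens missing from the prediction, grouped by category."""
    predicted = set(prediction.split())
    counts = Counter()
    for token in target.split():
        if token in predicted:
            continue  # token was produced by the model, not an error
        kind = "kb_element" if token.startswith(KB_PREFIXES) else "other_token"
        seen = "seen_in_training" if token in train_vocab else "unseen_in_training"
        counts[(kind, seen)] += 1
    return counts

train_vocab = {"which", "is", "spoken", "in", "dbo:language"}
errors = error_categories(
    target="which dbo:language is spoken in dbr:Apocalypto",
    prediction="which dbo:language is spoken in dbr:Avatar",
    train_vocab=train_vocab,
)
total = sum(errors.values())
for category, count in errors.items():
    print(category, f"{100 * count / total:.1f}%")
```

Summing these counters over all rows of the predictions DataFrame would give the percentage of each error category among all errors.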

#### 6.3 Improvement
Give possible solutions to improve the BLEU score.

A larger training set could allow the model to better learn the trigrams of the target sentences, so that more of the predicted trigrams would match the expected ones. Since the overlap between predicted and expected trigrams is a factor in the BLEU-3 metric, the score would be higher. It would also be beneficial to adjust the BLEU variant depending on the size of the training set. For example, if the training set is small, it would be more beneficial to use a BLEU-2 or BLEU-3 metric, since smaller n-grams will be much more common.

## END
