# Data Preparation and Exploration (Tokenizer)

This notebook experiments with the T5 tokenizer to work out how to correctly transform input data (the original text) into processed data (the anonymized text).

- Load the Tokenizer
- Load Dataset
- Use the tokenizer to calculate tokens and attention masks
- Use the tokenizer to calculate tokens in target_mode for translation

In [None]:
try:
    import numpy as np
    import random
    import dotenv
    import json

    from transformers import (
        AutoTokenizer, # tokenizer model 
    )

    from libs.utility import detect_accelerator, downloadFromHuggingFace
    from libs.parameters import Properties
    from libs.dataset import CustomPIIDataset, DataPreprocessor
    from datasets import load_dataset, Dataset, DatasetDict
except ImportError as e:
    raise e

## 1. Load environment variables and settings 

Configure the notebook runtime with custom environment variables.

In [None]:
# load dotenv
config_env: dict = dotenv.dotenv_values("./localenv")

P_FILE: str = config_env.get("PARAMETER_FILE", "parameters.yaml")
M_REPO: str = config_env.get("MODEL_REPO_ID", "google/flan-t5-small")
DATASET_PATH: str = config_env.get("DATASET_FILE", "../dataset")
DATASET_SPLIT: float = 0.8 # 80% train, 20% validation
OUTPUT_DIR: str = config_env.get("OUTPUT_DIR", "flan-finetuned-ita")

# load parameters
params: Properties = Properties(P_FILE)
print(f"Loaded HF: Cache Dir: {params.config_parameters.huggingface.cache_dir}\nDownloading to {params.config_parameters.huggingface.local_dir}")
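
The cell above defines `DATASET_SPLIT` as an 80/20 train/validation ratio, but this notebook does not show how it is applied. The following is a minimal sketch of one way such a ratio could be used, assuming the datapoints fit in an in-memory list. The helper name `split_train_validation` and the seed value are illustrative assumptions and are not part of `libs`.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_validation(
    items: Sequence[T], train_fraction: float = 0.8, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `items` and split it into (train, validation)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # a seeded generator keeps the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# hypothetical usage with 10 dummy datapoints
train_items, validation_items = split_train_validation(list(range(10)), train_fraction=0.8)
print(len(train_items), len(validation_items))
```

With 10 datapoints and a fraction of 0.8 this yields 8 training items and 2 validation items.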

Download the model if it is not already present on disk. This uses the Hugging Face library to interface with remote repositories.

- MODEL_REPO_ID: repository to download, `google/flan-t5-small` by default
- cache_dir: where on local disk the checkpoint will be cached during download
- local_dir: local path where the HF library will download the checkpoint to
- apitoken: token used to authenticate and interact with the Hugging Face API

In [None]:
# download model from HF repository
try:
    model_name: str = downloadFromHuggingFace(
        M_REPO,
        cache_dir=params.config_parameters.huggingface.cache_dir,
        local_dir=params.config_parameters.huggingface.local_dir,
        apitoken=params.config_parameters.huggingface.apitoken,
    )
except Exception as e:
    # model_name is needed by the cells below, so report the error and stop here
    print(f"Caught exception: {e}")
    raise


Load the dataset using the custom `CustomPIIDataset` class:

- Every item contains a dictionary with `source` and `target` text input data points

In [None]:
# load dataset using custom class, from the path configured in DATASET_PATH
it_pii_dataset: CustomPIIDataset = CustomPIIDataset(DATASET_PATH)
print(f"Dataset Loaded! -> Processed {len(it_pii_dataset)} datapoints")

# print one datapoint
print(it_pii_dataset[0])

Load the tokenizer from the T5 checkpoint using the Hugging Face Transformers library.

In [None]:
# load the tokenizer from the downloaded checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Tokenizer setup

The anonymization task is essentially a text-to-text transformation in which the tokens that carry PII are replaced by placeholder tokens.

We may have to expand the tokenizer vocabulary so that each placeholder is handled as a single token.

The new tokens are:

- [NAME]
- [ADDRESS]
- [CITY]
- [EMAIL]
- [DATE]
- [TAXCODE]
- [PHONE]
- [CREDITCARD]

In [None]:
# add new tokens to the tokenizer vocabulary
new_tokens = [
    "[NAME]",
    "[ADDRESS]",
    "[CITY]",
    "[EMAIL]",
    "[DATE]",
    "[TAXCODE]",
    "[PHONE]",
    "[CREDITCARD]",
]

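# update tokenizer; add_tokens returns how many tokens were actually added
num_added: int = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} new tokens, vocabulary size is now {len(tokenizer)}")
# note: a model fine-tuned with this tokenizer usually needs model.resize_token_embeddings(len(tokenizer))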
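
To illustrate why the placeholders are added to the vocabulary, here is a toy sketch. It is not the SentencePiece algorithm used by T5. It is a simplified longest-match tokenizer over a made-up vocabulary, and the function name `greedy_tokenize` is an assumption introduced for this example. It only shows the general effect: without a dedicated entry a tag such as `[NAME]` is broken into several pieces, while with one it stays atomic.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Toy longest-match tokenizer.

    At each position take the longest vocabulary entry that matches,
    falling back to a single character when nothing matches.
    """
    tokens: List[str] = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# made-up vocabulary, for illustration only
vocab = {"Il", " ", "signor", "NAME", "[", "]"}
sentence = "Il signor [NAME]"

before = greedy_tokenize(sentence, vocab)
vocab.add("[NAME]")  # toy analogue of extending the vocabulary
after = greedy_tokenize(sentence, vocab)

print(before)  # ['Il', ' ', 'signor', ' ', '[', 'NAME', ']']
print(after)   # ['Il', ' ', 'signor', ' ', '[NAME]']
```

Keeping each placeholder as a single token gives the model one target ID per PII category instead of a sequence of sub-word pieces to reproduce.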

## 2. Explore the Dataset

In [None]:
# synthetic test data: the same sentences used to test the PyTorch model
# Italian examples containing PII
test_sentences: list = [
    "Il signor Alessandro Bianchi abita in Via Nazionale 45, Milano.",
    "Per contattare Giulia Rossi chiamare il 339-8765432 o scrivere a giulia.rossi@email.it",
    "Il paziente Marco Esposito, nato il 25/08/1982, codice fiscale SPSMRC82M25H501Z.",
    "Pagamento con carta 5123-4567-8901-2345 intestata a Francesca Lombardi.",
    "Contattare la dottoressa Elena Ricci al numero 02-12345678, ufficio in Corso Italia 88, Roma.",
]

# same sentences, with PII anonymized
anonymized_sentences: list = [
    "Il signor [NAME] abita in [ADDRESS], [CITY]",
    "Per contattare [NAME] chiamare il [PHONE] o scrivere a [EMAIL]",
    "Il paziente [NAME], nato il [DATE], codice fiscale [TAXCODE]",
    "Pagamento con carta [CREDITCARD] intestata a [NAME]",
    "Contattare la dottoressa [NAME] al numero [PHONE], ufficio in [ADDRESS], [CITY]"
]

Now use the tokenizer to encode some example sentences for further computation.

In [None]:
# use the tokenizer to calculate token IDs for input strings
tokenized_input_sentences: list = [tokenizer(sentence) for sentence in test_sentences]

# use the tokenizer to calculate token IDs for output strings
tokenized_output_sentences: list = [tokenizer(sentence) for sentence in anonymized_sentences]

# input and output must contain the same number of sentences
assert len(tokenized_input_sentences) == len(tokenized_output_sentences)

# pick a random datapoint to show a tokenization example
dataset_len: int = len(tokenized_input_sentences)
datapoint: int = random.randrange(dataset_len)

# display encoded and decoded form
print("INPUT")
print(f"Token ID: {tokenized_input_sentences[datapoint].get('input_ids')}")
print(f"Attention Mask: {tokenized_input_sentences[datapoint].get('attention_mask')}")
print(f"Decoded text: {tokenizer.decode(tokenized_input_sentences[datapoint].get('input_ids'))}")

# display encoded and decoded form
print("OUTPUT")
print(f"Token ID: {tokenized_output_sentences[datapoint].get('input_ids')}")
print(f"Attention Mask: {tokenized_output_sentences[datapoint].get('attention_mask')}")
print(f"Decoded text: {tokenizer.decode(tokenized_output_sentences[datapoint].get('input_ids'))}")
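
Because each sentence above is tokenized on its own, without padding, every attention mask printed by this cell consists only of ones. The mask becomes informative once sequences of different lengths are padded into a batch. The sketch below shows the idea in plain Python. The token IDs are made up, the helper name `pad_batch` is hypothetical, and the pad ID of 0 is an assumption matching the usual T5 convention.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token ID sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# two made-up sequences of different lengths
batch = pad_batch([[101, 7, 42, 1], [55, 1]])
print(batch["input_ids"])       # [[101, 7, 42, 1], [55, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

In practice the Hugging Face tokenizer can produce the same structure itself when called on a list of sentences with padding enabled.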

Instantiate a preprocessor to begin working on the training dataset.

In [None]:
preprocessor: DataPreprocessor = DataPreprocessor(tokenizer)

Now prepare the dataset for processing.

In [None]:
train_data = {
    "original": list(test_sentences),
    "anonymized": list(anonymized_sentences),
}

# the complete rebuilt dataset, used for training
dataset = DatasetDict({
    "train": Dataset.from_dict(train_data),
})

# have a look inside the dataset
print(dataset)
from pprint import pprint

for item in dataset.get('train'):
    pprint(item)

In [None]:
# apply tokenization and transformation to the train dataset
# INPUT DATAPOINT:
#   contains features: ORIGINAL TEXT, ANONYMIZED TEXT
# OUTPUT DATAPOINT (compatible with huggingface trainer):
#   contains features: INPUT_IDS, ATTENTION MASK, LABELS
consolidated_dataset = dataset.map(
    preprocessor.data_preprocess, # function to tokenize and prepare the datapoints to be consumed by the Trainer
    batched=True, # pass batches of datapoints to the function instead of one at a time
    remove_columns=dataset['train'].column_names # drop the raw text columns
)

# explore dataset
from pprint import pprint
print(consolidated_dataset)

for item in consolidated_dataset.get('train'):
    pprint(tokenizer.decode(item.get('input_ids')))
    pprint(tokenizer.decode(item.get('labels')))
    print("\n")
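
A common convention when preparing `labels` for sequence-to-sequence training is to replace padding positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. Whether `DataPreprocessor.data_preprocess` does this is not shown in this notebook. The sketch below is an assumption-laden illustration with hypothetical helper names and made-up token IDs. If labels are masked this way, they have to be mapped back to the pad ID before decoding, since -100 is not a valid token ID.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_label_padding(labels: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Replace padding IDs in the labels so the loss ignores those positions."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in labels]


def unmask_labels(labels: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Restore the pad ID so the labels can be passed to a tokenizer's decode."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in seq] for seq in labels]


# made-up label IDs: three real tokens followed by two pad tokens (pad ID 0 assumed)
padded_labels = [[8, 3, 1, 0, 0]]
masked = mask_label_padding(padded_labels)
restored = unmask_labels(masked)

print(masked)    # [[8, 3, 1, -100, -100]]
print(restored)  # [[8, 3, 1, 0, 0]]
```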