# Practical lecture 8

Today, we will work with the Flan T5 model (a version of the T5 model which has been trained on 1,800 different annotated NLP datasets from several languages). 

We'll fine-tune the Flan T5 model for a task which it has not been trained for, namely, Finnish person name recognition. Our goal is to build a model which takes input of the form:

```
find person names in: Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä .
```

and produces output

```
Ville Oksanen
```

If there are more than one person name, those should be separated by commas.

We'll do this on Colab, so let's start by installing the transfomers library, and importing the functions `load_dataset` and `load_metric` which we'll use to read some small Finnish NER datasets and for evluation during training. 

We'll also connect Colab with Google Drive because we'll save models on Google Drive.

In [1]:
!pip3 install transformers datasets
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate
import transformers
from datasets import load_dataset, load_metric

from google.colab import drive
drive.mount('/content/drive')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m54.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.6/474.6 kB[0m [31m44.6 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m23.5 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m72

We'll use Weights and Biases for dislaying logging information during training. You will need to sign up for the service at `wanb.ai`. This will give you an access token which you can paste when running `wanb login` 

In [43]:
!pip3 install wandb
import wandb
wandb.login()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/




True

We will then read annotated data for Finnish person name recognition. This data comes from the [Finer database](https://github.com/mpsilfve/finer-data). 

Both the training and development data are TSV files having two columns: text (Finnish sentences) and ner (Comma-separated lists of person names in the input sentence or NONE in case there are no names):

```
Mutta jos takana oli Pohjois-Korea , Sullivan on valmis kippaamaan syyn Yhdysvaltain niskaan .  Sullivan
Kuljetus , verot ja muut lisäerät nostanevat laitteen hintaa reippaasti , jos sitä tänne asti saadaan . NONE
Tyypillinen botti oli yhteydessä isäntäänsä joka kolmas minuutti , Check Pointin Pohjoismaiden aluejohtaja Örjan Westman toteaa tiedotteessa .  Örjan Westman
Eri työntekijät julkaisivat Twitterissä lausuntoja , joissa he vaativat Eichin eroa .   Eichin
Sténin mukaan yhtiössä on nyt kahdeksan työntekijää , ja muutama aiotaan palkata lisää .        Sténin
```

We'll use the `load_dataset` function from the HuggingFace [datasets](https://huggingface.co/docs/datasets/index) module.

In [3]:
miikka_train = load_dataset(path=".", 
                      #  data_files="/content/drive/MyDrive/finer/FINER.train.tsv".split(),
                      #  data_files="/Users/kmaurinjones/Desktop/School/UBC/UBC_Coursework/block_6/cl581/COLX_581_low-resource_students/lectures/FINER.train.tsv".split(),
                       data_files="/content/drive/MyDrive/COLAB_FILES/capstone/FINER.train.tsv".split(),
                       delimiter="\t",
                       column_names="text ner".split())["train"]

miikka_dev = load_dataset(path=".", 
                      #  data_files="/content/drive/MyDrive/finer/FINER.dev.tsv".split(),
                       data_files="/content/drive/MyDrive/COLAB_FILES/capstone/FINER.dev.tsv".split(),
                       delimiter="\t",
                       column_names="text ner".split())["train"]

Downloading and preparing dataset csv/. to /root/.cache/huggingface/datasets/csv/.-4c256866824ed18f/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/.-4c256866824ed18f/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset csv/. to /root/.cache/huggingface/datasets/csv/.-f52ffe7825dfc0da/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/.-f52ffe7825dfc0da/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [4]:
miikka_dev

Dataset({
    features: ['text', 'ner'],
    num_rows: 346
})

# Creating train/dev/test splits

In [5]:
import pandas as pd

master_data = pd.read_csv("/content/drive/MyDrive/COLAB_FILES/capstone/capstone_data/metadata_detection_all_data.csv")
master_data.head()

Unnamed: 0,raw_file_name,clean_str_metadata,clean_str_content,clean_str_full_file
0,NOL-10723-12.txt,Date: 2013-01-08 File number: NOL-10723-12 Cit...,Order under Section 69 Residential Tenancies A...,Date: 2013-01-08 File number: NOL-10723-12 Cit...
1,TNL-43964-13.txt,Date: 2013-05-02 File number: TNL-43964-13 Cit...,Order under section 69 Residential Tenancies A...,Date: 2013-05-02 File number: TNL-43964-13 Cit...
2,TNL-45470-13.txt,Date: 2013-06-17 File number: TNL-45470-13 Cit...,Order under Section 69 Residential Tenancies A...,Date: 2013-06-17 File number: TNL-45470-13 Cit...
3,TEL-33159-13; \n TET-33272-13.txt,Date: 2013-02-25 File number: TEL-33159-13; TE...,Order under sections 31 and 69 Residential Ten...,Date: 2013-02-25 File number: TEL-33159-13; TE...
4,TNL-39747-12.txt,Date: 2013-02-07 File number: TNL-39747-12 Cit...,Order under Section 68 Residential Tenancies A...,Date: 2013-02-07 File number: TNL-39747-12 Cit...


In [6]:
master_data['clean_str_metadata'][0]

'Date: 2013-01-08 File number: NOL-10723-12 Citation: NOL-10723-12 (Re), 2013 CanLII 5182 (ON LTB), <https://canlii.ca/t/fw1m8>, retrieved on 2023-05-17'

In [7]:
train_split = 0.75
dev_split = 0.10
test_split = 0.15

train_rows = int(len(master_data) * train_split)
dev_rows = int(len(master_data) * dev_split)
test_rows = int(len(master_data) * test_split)

train_df = master_data.iloc[:train_rows, :]
dev_df = master_data.iloc[train_rows:train_rows + dev_rows, :]
test_df = master_data.iloc[train_rows + dev_rows:, :]

# making sure all rows are accounted for across the 3 sets
assert len(train_df) + len(dev_df) + len(test_df) == len(master_data)

In [8]:
test_df.head()

Unnamed: 0,raw_file_name,clean_str_metadata,clean_str_content,clean_str_full_file
37142,SOL-12679-20-SA.txt,Date: 2020-10-19 File number: SOL-12679-20-SA ...,Order under Section 78(11) Residential Tenanci...,Date: 2020-10-19 File number: SOL-12679-20-SA ...
37143,TNL-23183-20.txt,Date: 2020-09-08 File number: TNL-23183-20 Cit...,Order under Section 69 Residential Tenancies A...,Date: 2020-09-08 File number: TNL-23183-20 Cit...
37144,TSL-07617-19-VO-AM.txt,Date: 2020-09-03 File number: TSL-07617-19-VO-...,AMENDED Order under Subsection 74(14) Resident...,Date: 2020-09-03 File number: TSL-07617-19-VO-...
37145,TSL-11876-19.txt,Date: 2020-09-25 File number: TSL-11876-19 Cit...,Order under Section 69 Residential Tenancies A...,Date: 2020-09-25 File number: TSL-11876-19 Cit...
37146,TEL-09891-20.txt,Date: 2020-12-14 File number: TEL-09891-20 Cit...,Order under Section 69 Residential Tenancies A...,Date: 2020-12-14 File number: TEL-09891-20 Cit...


In [9]:
from datasets import load_dataset
train_data = load_dataset(path = ".", data_files = "/content/drive/MyDrive/COLAB_FILES/capstone/capstone_data/metadata_detection_train.csv")['train']
                          # column_names = ["raw_file_name", "clean_str_metadata", "clean_str_content", "clean_str_full_file"]["train"])
dev_data = load_dataset(path = ".", data_files = "/content/drive/MyDrive/COLAB_FILES/capstone/capstone_data/metadata_detection_dev.csv")['train']
                          # column_names = ["raw_file_name", "clean_str_metadata", "clean_str_content", "clean_str_full_file"]["train"])
test_data = load_dataset(path = ".", data_files = "/content/drive/MyDrive/COLAB_FILES/capstone/capstone_data/metadata_detection_test.csv")['train']
                          # column_names = ["raw_file_name", "clean_str_metadata", "clean_str_content", "clean_str_full_file"]["train"])

Downloading and preparing dataset csv/. to /root/.cache/huggingface/datasets/csv/.-d399a2158d01d4c4/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/.-d399a2158d01d4c4/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset csv/. to /root/.cache/huggingface/datasets/csv/.-6ee60bba667dc303/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/.-6ee60bba667dc303/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset csv/. to /root/.cache/huggingface/datasets/csv/.-d7d74e4bf990a474/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/.-d7d74e4bf990a474/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [10]:
open("/content/drive/MyDrive/COLAB_FILES/capstone/FINER.dev.tsv", "r").read()

'Kaikilla tiedostoilla ei ole tarkkaa taloudellista arvoa .\tNONE\nAsukkaamme ovat tyytyväisiä sen jatkuvasti toimittamiin palveluihin .\tNONE\nSyynä on vain tunteja aikaisemmin Youtubessa julkaistu äänitallenne , jonka väitetään sisältävän Turkin korkea-arvoisten poliitikkojen keskustelun mahdollisista sotatoimista Syyriassa .\tNONE\nVaarallinen mainostemppu : Dotcom pelasti pelaajien joulun\tDotcom\nLisäksi jokainen , joka katsoo elokuvan , asettaa itsensä hengenvaaraan .\tNONE\nPohjois-Korean netti katkesi\tNONE\nDrummondin mukaan Google tarkastelee käytäntöään saatuaan neuvoa-antavan raportin tammikuun lopussa .\tDrummondin\nVaaralliselta kuulostava tempaus on parhaimmillaankin vain symbolinen , sillä elokuvaa tuskin kuitenkaan katsottaisiin Pohjois-Koreassa .\tNONE\nDotcom näki tässä tilaisuuden mainostaa Mega-palveluaan , joka tarjoaa tietojen salausta ja turvattua tallennusta .\tDotcom\n– Haluamme , että verotus on reilu ja samalla varmistaa tasaisen tulovirran , Osbourne sanoo 

Let's examine the training and development data:

In [11]:
print(train_data)
print(dev_data)
print(test_data)

Dataset({
    features: ['raw_file_name', 'clean_str_metadata', 'clean_str_content', 'clean_str_full_file'],
    num_rows: 32773
})
Dataset({
    features: ['raw_file_name', 'clean_str_metadata', 'clean_str_content', 'clean_str_full_file'],
    num_rows: 4369
})
Dataset({
    features: ['raw_file_name', 'clean_str_metadata', 'clean_str_content', 'clean_str_full_file'],
    num_rows: 6556
})


And, print the first training example:

In [12]:
print(test_data[0])

{'raw_file_name': 'SOL-12679-20-SA.txt', 'clean_str_metadata': 'Date: 2020-10-19 File number: SOL-12679-20-SA Citation: Effort Trust Company v Rudd, 2020 CanLII 120305 (ON LTB), <https://canlii.ca/t/jgw27>, retrieved on 2023-05-19', 'clean_str_content': "Order under Section 78(11) Residential Tenancies Act, 2006 File Number: SOL-12679-20-SA In the matter of: 504, 223 JACKSON STREET W HAMILTON ON L8P4R4 Between: The Effort Trust Company Landlord and Jo-Anne Rudd Tenant The Effort Trust Company (the 'Landlord') applied for an order to terminate the tenancy and evict Jo-Anne Rudd (the 'Tenant') and for an order to have the Tenant pay the rent the Tenant owes because the Tenant failed to meet a condition specified in the order issued by the Board on July 9, 2019 with respect to application SOL-04485-19. The Landlord's application was resolved by order SOL-12679-20, issued on February 5, 2020. The Tenant filed a motion to set aside order SOL-12679-20. This motion was heard in Passcode: 531 

We'll now start to process our data. We will need to use `nltk.word_tokenize`, so we need to load the `nltk` module. Additionally, we'll load the Flan T5 tokenizer.

In [13]:
import nltk
nltk.download('punkt')
import string
from transformers import AutoTokenizer

# model_checkpoint = "google/flan-t5-base"
model_checkpoint = "google/flan-t5-small" # trying small model to see if performance is as good as base size -- might be a better tradeoff
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

We'll then tokenize the training and development data. We'll also add the prompt `"find person names in:"` to all training and develpment examples. We get something like:

```find person names in: Anne saw Bill with Mary```

It's important to remember to set `truncation=True` when we exceed the maximum subword token count in the input our output. Flan T5 cannot handle arbitrarily long inputs and outputs.

In [14]:
print(test_data)

Dataset({
    features: ['raw_file_name', 'clean_str_metadata', 'clean_str_content', 'clean_str_full_file'],
    num_rows: 6556
})


In [15]:
PREFIX = "extract metadata boundary:"
MAX_INPUT_LENGTH = 256
MAX_TARGET_LENGTH = 256

def preprocess_data(examples):
    inputs = [PREFIX + " " + text for text in examples["clean_str_full_file"]] # prefix + full case file str as input
    
    model_inputs = tokenizer(inputs, 
                             max_length = MAX_INPUT_LENGTH, 
                             truncation = True) # truncating input prompt (case file) as necessary -- probably always necessary

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["clean_str_metadata"], # clean metadata as target "label"
                           max_length = MAX_TARGET_LENGTH, 
                           truncation = False) # truncating metadata length if necessary -- probably shouldn't do this,
                                               # since we're looking for the boundary of the metadata, which would truncated if True

    model_inputs["labels"] = labels["input_ids"]
    
    return model_inputs

We'll then apply the tokenization function to our datasets

In [16]:
tokenized_train = train_data.map(preprocess_data, batched = True)
tokenized_dev = dev_data.map(preprocess_data, batched = True)

Map:   0%|          | 0/32773 [00:00<?, ? examples/s]



Map:   0%|          | 0/4369 [00:00<?, ? examples/s]

Let's print the first tokenized training example:

In [17]:
tokenized_train[0]

{'raw_file_name': 'NOL-10723-12.txt',
 'clean_str_metadata': 'Date: 2013-01-08 File number: NOL-10723-12 Citation: NOL-10723-12 (Re), 2013 CanLII 5182 (ON LTB), <https://canlii.ca/t/fw1m8>, retrieved on 2023-05-17',
 'clean_str_content': "Order under Section 69 Residential Tenancies Act, 2006 File Number: NOL-10723-12 ZEL (the 'Landlord') applied for an order to terminate the tenancy and evict SP and MM (the 'Tenants') because the Tenants did not pay the rent that the Tenants owe. This application was heard in Sudbury on January 3, 2013. The Landlord’s agent, MS, and the Tenant, MM, attended the hearing. Determinations: 1. The Tenants have not paid the total rent they were required to pay for the period from December 1, 2012 to January 31, 2013. Because of the arrears, the Landlord served a Notice of Termination. 2. The Tenants are in possession of the rental unit. 3. The monthly rent is $861.00 effective January 1, 2013. 4. The Tenants have made no payment since the application was fi

Next we will import classes which are used for fine-tuning of seq2seq models like Flan-T5.

* `AutoModelForSeq2SeqLM` loads the model itself,
* `DataCollatorForSeq2Seq` takes care of data batching,
* `Seq2SeqTrainingArguments` sets all hyperparameters for training, and
* `Seq2SeqTrainer` trains the model

In [18]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

We'll then set the hyperparameters for model training:

```
model_dir -- where to save model checkpoints
evaluation_strategy="steps" -- evaluate every N steps
eval_steps=100 -- where N = 100
logging_strategy="steps" -- write information about training loss  every N steps
logging_steps=1 -- where N = 1
save_strategy="steps" -- save model every N steps
save_steps=100 -- where N = 100
learning_rate=4e-5 -- initial learning rate for training
per_device_train_batch_size=batch_size -- training batch size
per_device_eval_batch_size=batch_size -- evaluation batch size
weight_decay=0.01 -- hyperparameter for weight decay during training
save_total_limit=3 -- save a maximum of 3 models
num_train_epochs=1 -- number of training epochs
predict_with_generate=True -- generate the actual output during evaluation 
```

In [19]:
# !pip install --upgrade accelerate
# !pip uninstall -y transformers accelerate
# !pip install transformers accelerate
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [34]:
batch_size = 8
model_name = "flan-t5-metadata_extraction_small"
model_dir = f"/content/drive/MyDrive/COLAB_FILES/capstone/capstone_data/extraction_models/{model_name}"

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy = "steps",
    eval_steps = 1000,
    logging_strategy = "steps",
    logging_steps = 1,
    save_strategy = "steps",
    save_steps = 1000,
    learning_rate = 4e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 1,
    predict_with_generate = True)

Let's define a data collator which batches the training and development data for us:

In [35]:
data_collator = DataCollatorForSeq2Seq(tokenizer, padding=True)

We'll use BLEU score to evaluate progress during training because our outputs are token sequences and even partial matches can be valuable

In [36]:
# This is potentially required because of some obscure colab bug
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!pip3 install sacrebleu
metric = load_metric("sacrebleu")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


We'll now define a function to compute BLEU score based on model predictions and gold standard outputs.

In [37]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # The labels are padded to equal length using a special symbol -100.
    # This is not the regular <PAD> symbol for the tokenizer so
    # we'll have to replace -100 in the labels before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # For BLEU, we'll need to split the outputs into sequences
    # of word tokens. We'll just use NLTK word_tokenize.
    decoded_preds = [nltk.word_tokenize(pred.strip())
                      for pred in decoded_preds]
    decoded_labels = [[nltk.word_tokenize(label.strip())]
                      for label in decoded_labels]
    
    # Compute BLEU scores
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)

    # Return BLEU scores
    return result 

We'll then initalize a trainer for our model. It takes a function `model_init` which returns a plain non-fine-tuned Flan T5 model

In [38]:
# Function that returns an untrained model to be trained
def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We'll then test how the mode behaves on some input prior to fine-tuning. The expected output is:

```Ville Oksanen```

But, the model doesn't seem to really understand the task at all. Instead of finding the names, the model seems to more or less copy the input to the output. This is not a huge surprise. Fine-tuning is required to adapt the model to our task.

In [39]:
train_data[0]['clean_str_metadata']
train_data[0]['clean_str_full_file']

"Date: 2013-01-08 File number: NOL-10723-12 Citation: NOL-10723-12 (Re), 2013 CanLII 5182 (ON LTB), <https://canlii.ca/t/fw1m8>, retrieved on 2023-05-17 Order under Section 69 Residential Tenancies Act, 2006 File Number: NOL-10723-12 ZEL (the 'Landlord') applied for an order to terminate the tenancy and evict SP and MM (the 'Tenants') because the Tenants did not pay the rent that the Tenants owe. This application was heard in Sudbury on January 3, 2013. The Landlord’s agent, MS, and the Tenant, MM, attended the hearing. Determinations: 1. The Tenants have not paid the total rent they were required to pay for the period from December 1, 2012 to January 31, 2013. Because of the arrears, the Landlord served a Notice of Termination. 2. The Tenants are in possession of the rental unit. 3. The monthly rent is $861.00 effective January 1, 2013. 4. The Tenants have made no payment since the application was filed. 5. The Landlord collected a rent deposit of $840.00 from the Tenants and this dep

In [40]:
PREFIX

'extract metadata boundary:'

In [41]:
model = model_init()
text = train_data[0]['clean_str_full_file']
inputs = ["extract metadata boundary:" + text] # PREFIX = "extract metadata boundary:"

print("INPUT:", inputs)
inputs = tokenizer(inputs, max_length=128, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=1, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print("OUTPUT:", decoded_output)

INPUT: ["extract metadata boundary:Date: 2013-01-08 File number: NOL-10723-12 Citation: NOL-10723-12 (Re), 2013 CanLII 5182 (ON LTB), <https://canlii.ca/t/fw1m8>, retrieved on 2023-05-17 Order under Section 69 Residential Tenancies Act, 2006 File Number: NOL-10723-12 ZEL (the 'Landlord') applied for an order to terminate the tenancy and evict SP and MM (the 'Tenants') because the Tenants did not pay the rent that the Tenants owe. This application was heard in Sudbury on January 3, 2013. The Landlord’s agent, MS, and the Tenant, MM, attended the hearing. Determinations: 1. The Tenants have not paid the total rent they were required to pay for the period from December 1, 2012 to January 31, 2013. Because of the arrears, the Landlord served a Notice of Termination. 2. The Tenants are in possession of the rental unit. 3. The monthly rent is $861.00 effective January 1, 2013. 4. The Tenants have made no payment since the application was filed. 5. The Landlord collected a rent deposit of $84

Let's start training. The `wandb` module will give you a URL where you can monitor training progress (just click on the link in the browser). In Miikka's case, that URL was:

```
https://wandb.ai/mpsilfve/huggingface/runs/43aguwuf
```

Every time you run the code, you'll get a different URL and `wandb` will store the information from all your runs.

In [44]:
trainer.train()

Step,Training Loss,Validation Loss,Score,Counts,Totals,Precisions,Bp,Sys Len,Ref Len
1000,0.001,9e-05,10.450431,"[136614, 127869, 123492, 119121]","[138177, 133808, 129439, 125070]","[98.86884213725874, 95.56155087887122, 95.40555783032934, 95.2434636603502]",0.108567,138177,444984
2000,0.0003,8e-06,10.450431,"[136614, 127869, 123492, 119121]","[138177, 133808, 129439, 125070]","[98.86884213725874, 95.56155087887122, 95.40555783032934, 95.2434636603502]",0.108567,138177,444984
3000,0.0003,4e-06,10.450431,"[136614, 127869, 123492, 119121]","[138177, 133808, 129439, 125070]","[98.86884213725874, 95.56155087887122, 95.40555783032934, 95.2434636603502]",0.108567,138177,444984
4000,0.0002,4e-06,10.450431,"[136614, 127869, 123492, 119121]","[138177, 133808, 129439, 125070]","[98.86884213725874, 95.56155087887122, 95.40555783032934, 95.2434636603502]",0.108567,138177,444984


Trainer is attempting to log a value of "[136614, 127869, 123492, 119121]" of type <class 'list'> for key "eval/counts" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[138177, 133808, 129439, 125070]" of type <class 'list'> for key "eval/totals" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[98.86884213725874, 95.56155087887122, 95.40555783032934, 95.2434636603502]" of type <class 'list'> for key "eval/precisions" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[136614, 127869, 123492, 119121]" of type <class 'list'> for key "eval/counts" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value 

TrainOutput(global_step=4097, training_loss=0.003888585084985566, metrics={'train_runtime': 1615.1253, 'train_samples_per_second': 20.291, 'train_steps_per_second': 2.537, 'total_flos': 3046094755332096.0, 'train_loss': 0.003888585084985566, 'epoch': 1.0})

After training, we can load the best checkpoint model from Google Drive:

In [47]:
model_name = "flan-t5-metadata_extraction_small"
model_dir = f"/content/drive/MyDrive/COLAB_FILES/capstone/capstone_data/extraction_models/flan-t5-metadata_extraction_small/checkpoint-4000"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# max_input_length = 128

We can then test the model on the same input as before. This time, the output seems to be correct: `Ville Oksanen` 

In [49]:
# text = """Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä ."""
# inputs = ["find person names in: " + text]

# inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
# output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=1, max_length=64)
# decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
# print(decoded_output)
# model = model_init()

test_case = test_data[110]

text = test_case['clean_str_full_file']
print(f"GOAL: {test_case['clean_str_metadata']}")
inputs = ["extract metadata boundary:" + text] # PREFIX = "extract metadata boundary:"

print("INPUT:", inputs)
inputs = tokenizer(inputs, max_length = 256, truncation = True, return_tensors = "pt")
output = model.generate(**inputs, num_beams = 8, do_sample = True, min_length = 1, max_length = 128)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens = True)[0]
print("OUTPUT:", decoded_output)

GOAL: Date: 2020-11-16 File number: EAL-89282-20 Citation: Bajwa v Brunet, 2020 CanLII 119151 (ON LTB), <https://canlii.ca/t/jh011>, retrieved on 2023-05-19
INPUT: ["extract metadata boundary:Date: 2020-11-16 File number: EAL-89282-20 Citation: Bajwa v Brunet, 2020 CanLII 119151 (ON LTB), <https://canlii.ca/t/jh011>, retrieved on 2023-05-19 Order under Section 87(1) Residential Tenancies Act, 2006 File Number: EAL-89282-20 In the matter of: 1, 448 THIRD STREET W CORNWALL ON K6J2R2 Between: Aisha Ghaffar Imran Bajwa Landlords and Christopher Brunet Tyler Ruhl Tenants Aisha Ghaffar and Imran Bajwa (the 'Landlords') applied for an order to terminate the tenancy and evict Christopher Brunet and Tyler Ruhl (the 'Tenants') because the Tenants did not pay the rent that the Tenants owe. This application was heard by video conference on November 12, 2020. Only the Landlord Aisha Ghaffar attended the hearing. As of 3:00 p/m., the Tenants were not present or represented at the hearing although pr