<a href="https://colab.research.google.com/github/iyoussef1079/Travaux-pratiques-NLP/blob/master/tp3_2023/incident_analysis_collab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Joint Intent Classification and Slot filling with BERT
This notebook is based on the paper __BERT for Joint Intent Classification and Slot Filling__ by Chen et al. (2019), https://arxiv.org/abs/1902.10909 but on a different dataset made for a class project.

Ideas were also taken from https://github.com/monologg/JointBERT, which is a PyTorch implementation of the paper with the original dataset.


## Install transformers

In [None]:
!pip install transformers



## Download data

In [13]:
# Connection à drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [14]:
import json

train_data_path = '/content/drive/MyDrive/ULaval/dev_examples.json'
new_exemples_path = '/content/drive/MyDrive/ULaval/new_examples.json'
test_data_path = '/content/drive/MyDrive/ULaval/test_examples.json'


def load_incident_dataset(filename):
    with open(filename, 'r') as fp:
        incident_list = json.load(fp)

    return incident_list

# Load datasets
train_data = load_incident_dataset(train_data_path)
new_examples = load_incident_dataset(new_exemples_path)
test_data_path = load_incident_dataset(test_data_path)

In [15]:
# train_data += new_examples

example = train_data[10]
example

{'text': ' On August 24  2003  Jose Crespin Company  a stucco contractor  employed  Employee #1 and four coworkers. They were applying a stucco finish to the  exterior insulating finishing system on the Home Depot  Store Number 6555.  After completing the lumber canopy at the west end of the building  the  employees moved to the east end to finish the exterior insulating finishing  system on the spandrel panels at the garden center. While waiting for the  building surface to cool  the employees took a work break. During their break   the weather swiftly changed from clear and sunny to heavy rain and strong  winds. The employees then moved to the north side of the building at the  garden center where they hoped that the masonry piers and spandrel panels  would shelter them from the rain. The wind reached speeds in excess of 40 mph  and began collapsing the masonry piers (C.1-0.2 and C.1-0.3) where the  employees were standing. Realizing the imminent danger of the collapsing  piers  four

## Load Tokenizer from transformers

We will use a pretrained bert model `bert-base-cased` for both Tokenizer and our classifier.

In [None]:
import tensorflow as tf
from transformers import AutoTokenizer

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode texts from the dataset

We have to encode the texts using the tokenizer to create tensors for training the classifier.

In [None]:
# https://huggingface.co/transformers/preprocessing.html

def encode_texts(tokenizer, texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="tf", max_length=512)

texts = [d["text"] for d in train_data]
tds = encode_texts(tokenizer, texts)
tds.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [None]:
encoded_texts = tds

for t in encoded_texts["input_ids"]:
  if t.shape != (512,):
    print(t.shape)

## Encode labels

### Slots

To padd all the texts to the same length, the tokenizer will use special characters. To handle those we need to add <PAD> to slots_names. It can be some other symbol as well.

In [None]:
# encode slots
slot_names = set()
for td in train_data:
    slots = td["arguments"]
    for slot in slots:
        # slot_names.add(slot)

        slot_names.add("B-" + slot)
        slot_names.add("I-" + slot)
slot_names = list(slot_names)
slot_names.insert(0, "<PAD>")
print(len(slot_names))

25


In [None]:
slot_map = dict() # slot -> index
for idx, us in enumerate(slot_names):
    slot_map[us] = idx
slot_map

{'<PAD>': 0,
 'I-WHO': 1,
 'B-SUBSTANCE': 2,
 'I-EVENT': 3,
 'I-WHERE': 4,
 'I-SUBSTANCE': 5,
 'B-BODY-PARTS': 6,
 'B-ACTIVITY': 7,
 'B-WHO': 8,
 'I-EQUIPMENT': 9,
 'I-DEATH': 10,
 'B-INJURED': 11,
 'I-WHEN': 12,
 'B-WHERE': 13,
 'I-INJURED': 14,
 'B-EVENT': 15,
 'I-BODY-PARTS': 16,
 'B-WHEN': 17,
 'B-EQUIPMENT': 18,
 'B-CAUSE': 19,
 'I-ACTIVITY': 20,
 'I-CAUSE': 21,
 'I-INJURY': 22,
 'B-DEATH': 23,
 'B-INJURY': 24}

In [None]:
# gets slot name from its values
def get_slot_from_word(word, slot_dict):
    for slot_label, value in slot_dict.items():
        for slot_element in value:
          if word in slot_element.split():
              index = slot_element.index(word)
              # return slot_label
              return "B-" + slot_label if index == 0 else "I-" + slot_label
    return None

print(train_data[0]["text"])
print(train_data[0]["arguments"])
print("slot_name for grind is : ", get_slot_from_word("grind", train_data[0]["arguments"]))

{'EVENT': ['Employee #1  was struck and thrown'], 'ACTIVITY': ['checking the depth of the cut into the asphalt', 'grind out existing asphalt from an  interstate at a railroad bridge overpass'], 'WHO': ['Employee #1', 'Employee #1  with Villager  Construction Inc.'], 'WHERE': ['railroad bridge overpass'], 'WHEN': ['November 10  2013'], 'CAUSE': ['The driver of the Tahoe  continued traveling in the far inside lane of the work zone'], 'EQUIPMENT': ['asphalt milling machine', 'Wirtgen; Model Number: W2100', 'handheld  pendant', 'a Chevrolet Tahoe'], 'INJURY': ['severe trauma   lacerations  fractures  and contusions'], 'INJURED': ['Employee #1'], 'BODY-PARTS': ['body and head'], 'DEATH': ['Employee #1']}
slot_name for grind is :  B-ACTIVITY


In [None]:
import numpy as np

# find the max encoded test length
# tokenizer pads all texts to same length anyway so
# just get the length of the first one's input_ids
max_len = len(encoded_texts["input_ids"][0])

def encode_slots(all_slots, all_texts, tokenizer, slot_map, max_len=512):
    encoded_slots = np.zeros(shape=(len(all_texts), max_len), dtype=np.int32)

    for idx, text in enumerate(all_texts):
        enc = [] # for this idx, to be added at the end to encoded_slots
        bert_token_count = 0  # Track the number of BERT tokens

        raw_tokens = text.split()
        for rt in raw_tokens:
            bert_tokens = tokenizer.tokenize(rt)
            bert_token_count += len(bert_tokens)

            if bert_token_count > max_len - 2:  # Account for [CLS] and [SEP]
                break  # Stop processing if max length is reached

            rt_slot_name = get_slot_from_word(rt, all_slots[idx])
            if rt_slot_name is not None:
                enc.extend([slot_map[rt_slot_name]] * len(bert_tokens))
            else:
                enc.extend([0] * len(bert_tokens))

        # Truncate or pad the enc to fit into encoded_slots
        enc = enc[:max_len - 2]  # Truncate if necessary
        enc_length = len(enc)
        if enc_length < max_len - 2:
            enc.extend([0] * (max_len - 2 - enc_length))  # Pad with zeros if shorter

        encoded_slots[idx, 1:len(enc) + 1] = enc

    return encoded_slots



In [None]:
all_slots = [td["arguments"] for td in train_data]
all_texts = [td["text"] for td in train_data]

print(len(all_slots))
print(len(all_texts))
print(slot_map)

110
110
{'<PAD>': 0, 'I-WHO': 1, 'B-SUBSTANCE': 2, 'I-EVENT': 3, 'I-WHERE': 4, 'I-SUBSTANCE': 5, 'B-BODY-PARTS': 6, 'B-ACTIVITY': 7, 'B-WHO': 8, 'I-EQUIPMENT': 9, 'I-DEATH': 10, 'B-INJURED': 11, 'I-WHEN': 12, 'B-WHERE': 13, 'I-INJURED': 14, 'B-EVENT': 15, 'I-BODY-PARTS': 16, 'B-WHEN': 17, 'B-EQUIPMENT': 18, 'B-CAUSE': 19, 'I-ACTIVITY': 20, 'I-CAUSE': 21, 'I-INJURY': 22, 'B-DEATH': 23, 'B-INJURY': 24}


In [None]:
encoded_slots = encode_slots(all_slots, all_texts, tokenizer, slot_map)

In [None]:
encoded_slots[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 17, 12, 12, 15, 15, 15,
       15,  3,  3,  1,  1,  1,  1,  1,  1,  1, 20,  0,  0,  0,  0,  0, 20,
       20,  9,  9,  0,  0,  0,  0,  0,  9,  9,  9,  0,  0,  0,  0,  0,  7,
        7, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,  0,  0,  0, 15, 15, 15,
       15,  0,  0,  3,  0,  0, 20,  0,  7, 20, 20, 20, 20, 20, 20, 20, 20,
        0, 20, 18, 18,  9,  0,  0, 20,  0,  0, 19,  0,  0,  0,  0,  0, 20,
       20,  0,  0,  0,  0,  0,  0,  0, 15, 15, 15, 15,  3,  3,  0,  0,  0,
        3,  0,  7, 20,  0,  0, 19,  0,  3,  0, 20,  0, 20, 20,  9,  9,  0,
       20,  0, 20, 20,  9,  3,  0,  0, 20,  9,  9,  3,  0,  0,  0,  0,  0,
       20,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0, 20,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  0,  0,  3,
        0,  0, 20,  0,  0, 20, 20,  0,  0,  3,  0,  0,  0, 20, 21, 21, 20,
       20,  0, 21,  0, 20, 20, 20,  0,  0, 21,  0,  0,  0,  0,  0,  0,  0,
       19,  0, 21, 21,  3

## Classifier Model

### Definition

In [None]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, slot_num_labels=None,
                 model_name=model_name, dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        self.slot_classifier = Dense(slot_num_labels,
                                     name="slot_classifier")

    def call(self, inputs, **kwargs):
        # two outputs from BERT
        trained_bert = self.bert(inputs, **kwargs)
        sequence_output = trained_bert.last_hidden_state

        # sequence_output will be used for slot_filling / classification
        sequence_output = self.dropout(sequence_output,
                                       training=kwargs.get("training", False))
        slot_logits = self.slot_classifier(sequence_output)

        return slot_logits

In [None]:
joint_model = JointIntentAndSlotFillingModel(slot_num_labels=len(slot_map))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### Hyperparams, Optimizer and Loss function

In [None]:
# Configure the optimizer
opt = Adam(learning_rate=3e-5, epsilon=1e-08)

# Since the model only outputs slots, use one loss function and one metric
loss = SparseCategoricalCrossentropy(from_logits=True)
metric = SparseCategoricalAccuracy("accuracy")

# Compile the model
joint_model.compile(optimizer=opt, loss=loss, metrics=[metric])

### Train

In [None]:
x = {
    "input_ids": encoded_texts["input_ids"],
    "token_type_ids": encoded_texts["token_type_ids"],
    "attention_mask": encoded_texts["attention_mask"]
}

history = joint_model.fit(
    x,
    encoded_slots,  # Target slot labels
    epochs=40,
    batch_size=8,
    shuffle=True
)


Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


## Inference

In [None]:
def nlu(text, tokenizer, model, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits = outputs

    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, :]

    info = {"slots": {}}

    out_dict = {}
    # get all slot names and add to out_dict as keys
    predicted_slots = set([slot_names[s] for s in slot_ids if s != 0])
    for ps in predicted_slots:
      out_dict[ps] = []

    # check if the text starts with a small letter
    if text[0].islower():
      tokens = tokenizer.tokenize(text, add_special_tokens=True)
    else:
      tokens = tokenizer.tokenize(text)
    for token, slot_id in zip(tokens, slot_ids):
        # add all to out_dict
        slot_name = slot_names[slot_id]

        if slot_name == "<PAD>":
            continue

        # collect tokens
        collected_tokens = [token]
        idx = tokens.index(token)

        # see if it starts with ##
        # then it belongs to the previous token
        if token.startswith("##"):
          # check if the token already exists or not
          if tokens[idx - 1] not in out_dict[slot_name]:
            collected_tokens.insert(0, tokens[idx - 1])

        # add collected tokens to slots
        out_dict[slot_name].extend(collected_tokens)

    # process out_dict
    for slot_name in out_dict:
        tokens = out_dict[slot_name]
        slot_value = tokenizer.convert_tokens_to_string(tokens)

        info["slots"][slot_name] = slot_value.strip()

    return info


In [None]:
nlu("On April 5  2010  an employee and a coworker of a utility contractor were  involved with the replacement of natural gas line risers at single family  homes. A 3-ft deep hole was hand dug  approximately 18-in. in diameter  to  access the main 1-in. gas line. A footage squeeze tool was clamped onto the  1-in. main gas line and the old riser assembly was removed. During the process  of installing the new riser  the clamp was removed causing the flow of natural  gas to enter the excavated hole. The employee was found by the coworker face  down in the hole overcome by the gas. The employee was killed.", tokenizer, joint_model, slot_names)

{'slots': {'I-WHEN': '2010 an',
  'I-WHO': 'a contractor were the',
  'B-WHO': 'employee cow utility',
  'I-EQUIPMENT': 'squeeze tool was',
  'B-WHERE': 'family',
  'I-ACTIVITY': 'a natural gas risers at and installing natural gas',
  'I-EVENT': 'and coworker of replacement line was hand diameter main line clamped 1 line old removed process new c removed flow to excavated was found by the coworker face down in the hole overcome by the gas was killed',
  'I-WHERE': 'homes',
  'I-CAUSE': 'access clamp was causing the of enter the hole',
  'B-EVENT': 'employee employee',
  'B-WHEN': '5',
  'B-ACTIVITY': 'of',
  'B-EQUIPMENT': '3 footage'}}

## Usage of llama2

This section use llama2 model to do prompt engineering

In [6]:
import shutil
import os

folder_path = '/content/model'  # Replace with your folder path

# Check if the folder exists
if os.path.exists(folder_path):
    try:
        shutil.rmtree(folder_path)
        print(f"Folder '{folder_path}' and all its contents have been deleted.")
    except Exception as e:
        print(f"An error occurred: {e}")
else:
    print(f"Folder '{folder_path}' does not exist.")


Folder '/content/model' and all its contents have been deleted.


In [6]:
!pip install langchain einops accelerate transformers bitsandbytes scipy

Collecting langchain
  Downloading langchain-0.0.348-py3-none-any.whl (2.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting einops
  Downloading einops-0.7.0-py3-none-any.whl (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.6/44.6 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m265.7/265.7 kB[0m [31m19.3 MB/s[0m eta [36m0:00:00[0m
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.3-py3-none-any.whl (92.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.6/92.6 MB[0m [31m19.1 MB/s[0m eta [36m0:00:00[0m
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch

In [8]:
!pip install transformers[torch]



In [1]:
# Import transformer classes for generaiton
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
# Import torch for datatype attributes
import torch

In [7]:
# Define variable to hold llama2 weights naming
name = "meta-llama/Llama-2-7b-chat-hf"
# Set auth token variable from hugging face
auth_token = "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"

In [8]:
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained(name,
    cache_dir='./model/', use_auth_token=auth_token)



tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [22]:
import gc

gc.collect()
torch.cuda.empty_cache()

In [10]:
# Create model
model = AutoModelForCausalLM.from_pretrained(name,
    cache_dir='./model/', use_auth_token=auth_token, torch_dtype=torch.float16,
    rope_scaling={"type": "dynamic", "factor": 2}, load_in_8bit=True)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [11]:
# Setup a prompt
prompt = "### User:What is the fastest car in  \
          the world and how much does it cost? \
          ### Assistant:"
# Pass the prompt to the tokenizer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Setup the text streamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
# Actually run the thing
output = model.generate(**inputs, streamer=streamer,
                        use_cache=True, max_new_tokens=float('inf'))

## Testing on dataset exemple.

In [44]:
exemple = new_examples[0]
answers = new_examples[0]["arguments"]

In [20]:
context = train_data[1]["text"]
context_expected_response = train_data[1]["arguments"]



new_question = exemple["text"]

# Setup a prompt
prompt = f"###  \
          Q: Can you extract informations from this description of an incident ? You dont need to give the context in your answer.: {context}\
          A: {context_expected_response} \
          Q: Can you extract informations from this description of an incident ? You dont need to give the context in your answer.: {new_question}: \
          A: "
# Pass the prompt to the tokenizer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Setup the text streamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
# Actually run the thing
output = model.generate(**inputs, streamer=streamer,
                        use_cache=True, max_new_tokens=float('inf'))

{'EVENT': ['the mast that the upper block was attached to catastrophically failed', 'the three workers (Employees #1  #2  and #3) were plunged 20-30 ft to the ground'], 'ACTIVITY': ['installing a new antennae on a communication tower'], 'WHO': ['three male construction workers (ages ranging 31-51)'], 'WHERE': ['communication tower'], 'WHEN': ['October 25  2010'], 'CAUSE': ['catastrophic failure of the mast'], 'EQUIPMENT': ['hoist', 'new antennae'], 'INJURY': ['fractures', 'multiple cuts and lacerations'], 'INJURED': ['all'], 'BODY-PARTS': ['left leg', 'left foot', 'right leg', 'right foot'], 'DEATH': ['']} 


Evaluating

In [52]:
from __future__ import print_function
from collections import Counter
import string
import re
import argparse
import json
import sys


def normalize_answer(s):
    """Mettre en minuscule et retirer la ponctuation, des déterminants and les espaces."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
    """Normalise les 2 textes, trouve ce qu'il y a en comment et estime précision, rappel et F1."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if len(ground_truth_tokens) == 0 or len(prediction_tokens) == 0:
        return int(ground_truth_tokens == prediction_tokens)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    """Vérifie si les 2 textes sont quasi-identiques."""
    return (normalize_answer(prediction) == normalize_answer(ground_truth))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """La fonction princiaple. Important de noter que ground_truths est une liste
       parce qu'il peut y avoir plusieurs réponses possibles."""
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)

def evaluate_demo(prediction, ground_truths):
    """Fonction utilitaire pour illuster l'utilisation de metric_max_over_ground_truths.
       Vous pouvez créer votre propre fonction selon vos besoins. """
    exact_match = metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
    f1_value = metric_max_over_ground_truths(f1_score, prediction, ground_truths)

    # Log the prediction and the ground truths
    print("Prediction:", prediction)
    print("Ground Truths:", ground_truths)

    # Log the evaluation metrics
    print('Exact match:', exact_match, '\nF1:', f1_value)

# def evaluate_multiple(predictions, ground_truths):
#     total_score = 0
#     for prediction, ground_truth in zip(predictions, ground_truths):
#         total_score += evaluate_demo(prediction, ground_truth)

#     average_score = total_score / len(predictions)
#     return average_score


In [41]:
import json

# Decode the output to json
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True).split("A: ")[-1].strip().replace("'", '"')
output_json = json.loads(decoded_output)

['the mast that the upper block was attached to catastrophically failed', 'the three workers (Employees #1  #2  and #3) were plunged 20-30 ft to the ground']


In [53]:
for k, v in output_json.items():
  evaluate_demo(v[0], answers[k])

Prediction: the mast that the upper block was attached to catastrophically failed
Ground Truths: ['Mast failure causing a fall']
Exact match: False 
F1: 0.15384615384615383
Prediction: installing a new antennae on a communication tower
Ground Truths: ['Riding the load line of a hoist', 'installing a new antennae on a communication tower']
Exact match: True 
F1: 1.0
Prediction: three male construction workers (ages ranging 31-51)
Ground Truths: ['Three male construction workers (ages 31-51)']
Exact match: False 
F1: 0.923076923076923
Prediction: communication tower
Ground Truths: ['Communication tower']
Exact match: True 
F1: 1.0
Prediction: October 25  2010
Ground Truths: ['October 25, 2010']
Exact match: True 
F1: 1.0
Prediction: catastrophic failure of the mast
Ground Truths: ['Catastrophic failure of the mast']
Exact match: True 
F1: 1.0
Prediction: hoist
Ground Truths: ['Hoist']
Exact match: True 
F1: 1.0
Prediction: fractures
Ground Truths: ['Two with fractures, one with multiple 