# Joint Intent Classification and Slot filling with BERT
This notebook is based on the paper __BERT for Joint Intent Classification and Slot Filling__ by Chen et al. (2019), https://arxiv.org/abs/1902.10909 but on a different dataset made for a class project.

Ideas were also taken from https://github.com/monologg/JointBERT, which is a PyTorch implementation of the paper with the original dataset.


## Install transformers

In [77]:
!pip install transformers



## Download data

In [78]:
# Connection à drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## Read data from json files

Data is of the following format
````json5
{
  "text": "",
  "positions": [{}],
  "slots": [{}],
  "intent": ""
}
````

We will be using `text` as the input and `slots` and `intent` as lables

In [79]:
import json

train_data_path = '/content/drive/MyDrive/ULaval/dev_examples.json'
new_exemples_path = '/content/drive/MyDrive/ULaval/new_examples.json'
test_data_path = '/content/drive/MyDrive/ULaval/test_examples.json'


def load_incident_dataset(filename):
    with open(filename, 'r') as fp:
        incident_list = json.load(fp)

    return incident_list

# Load datasets
train_data = load_incident_dataset(train_data_path)
new_examples = load_incident_dataset(new_exemples_path)
test_data_path = load_incident_dataset(test_data_path)

In [80]:
train_data += new_examples

example = train_data[0]
example

 'arguments': {'EVENT': ['Employee #1  was struck and thrown'],
  'ACTIVITY': ['checking the depth of the cut into the asphalt',
   'grind out existing asphalt from an  interstate at a railroad bridge overpass'],
  'WHO': ['Employee #1', 'Employee #1  with Villager  Construction Inc.'],
  'WHERE': ['railroad bridge overpass'],
  'WHEN': ['November 10  2013'],
  'CAUSE': ['The driver of the Tahoe  continued traveling in the far inside lane of the work zone'],
  'EQUIPMENT': ['asphalt milling machine',
   'Wirtgen; Model Number: W2100',
   'handheld  pendant',
   'a Chevrolet Tahoe'],
  'INJURY': ['severe trauma   lacerations  fractures  and contusions'],
  'INJURED': ['Employee #1'],
  'BODY-PARTS': ['body and head'],
  'DEATH': ['Employee #1']}}

## Load Tokenizer from transformers

We will use a pretrained bert model `bert-base-cased` for both Tokenizer and our classifier.

In [81]:
import tensorflow as tf
from transformers import AutoTokenizer

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode texts from the dataset

We have to encode the texts using the tokenizer to create tensors for training the classifier.

In [82]:
# https://huggingface.co/transformers/preprocessing.html

def encode_texts(tokenizer, texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="tf", max_length=512)

texts = [d["text"] for d in train_data]
tds = encode_texts(tokenizer, texts)
tds.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [83]:
encoded_texts = tds

for t in encoded_texts["input_ids"]:
  if t.shape != (512,):
    print(t.shape)

## Encode labels
### Intents

### Slots

To padd all the texts to the same length, the tokenizer will use special characters. To handle those we need to add <PAD> to slots_names. It can be some other symbol as well.

In [84]:
# encode slots
slot_names = set()
for td in train_data:
    slots = td["arguments"]
    for slot in slots:
        # slot_names.add(slot)

        slot_names.add("B-" + slot)
        slot_names.add("I-" + slot)
slot_names = list(slot_names)
slot_names.insert(0, "<PAD>")
print(len(slot_names))

25


In [85]:
slot_map = dict() # slot -> index
for idx, us in enumerate(slot_names):
    slot_map[us] = idx
slot_map

{'<PAD>': 0,
 'I-WHO': 1,
 'B-SUBSTANCE': 2,
 'I-EVENT': 3,
 'I-WHERE': 4,
 'I-SUBSTANCE': 5,
 'B-BODY-PARTS': 6,
 'B-ACTIVITY': 7,
 'B-WHO': 8,
 'I-EQUIPMENT': 9,
 'I-DEATH': 10,
 'B-INJURED': 11,
 'I-WHEN': 12,
 'B-WHERE': 13,
 'I-INJURED': 14,
 'B-EVENT': 15,
 'I-BODY-PARTS': 16,
 'B-WHEN': 17,
 'B-EQUIPMENT': 18,
 'B-CAUSE': 19,
 'I-ACTIVITY': 20,
 'I-CAUSE': 21,
 'I-INJURY': 22,
 'B-DEATH': 23,
 'B-INJURY': 24}

In [86]:
# gets slot name from its values
def get_slot_from_word(word, slot_dict):
    for slot_label, value in slot_dict.items():
        for slot_element in value:
          if word in slot_element.split():
              index = slot_element.index(word)
              # return slot_label
              return "B-" + slot_label if index == 0 else "I-" + slot_label
    return None

print(train_data[0]["text"])
print(train_data[0]["arguments"])
print("slot_name for grind is : ", get_slot_from_word("grind", train_data[0]["arguments"]))

{'EVENT': ['Employee #1  was struck and thrown'], 'ACTIVITY': ['checking the depth of the cut into the asphalt', 'grind out existing asphalt from an  interstate at a railroad bridge overpass'], 'WHO': ['Employee #1', 'Employee #1  with Villager  Construction Inc.'], 'WHERE': ['railroad bridge overpass'], 'WHEN': ['November 10  2013'], 'CAUSE': ['The driver of the Tahoe  continued traveling in the far inside lane of the work zone'], 'EQUIPMENT': ['asphalt milling machine', 'Wirtgen; Model Number: W2100', 'handheld  pendant', 'a Chevrolet Tahoe'], 'INJURY': ['severe trauma   lacerations  fractures  and contusions'], 'INJURED': ['Employee #1'], 'BODY-PARTS': ['body and head'], 'DEATH': ['Employee #1']}
slot_name for grind is :  B-ACTIVITY


In [87]:
import numpy as np

# find the max encoded test length
# tokenizer pads all texts to same length anyway so
# just get the length of the first one's input_ids
max_len = len(encoded_texts["input_ids"][0])

def encode_slots(all_slots, all_texts, tokenizer, slot_map, max_len=512):
    encoded_slots = np.zeros(shape=(len(all_texts), max_len), dtype=np.int32)

    for idx, text in enumerate(all_texts):
        enc = [] # for this idx, to be added at the end to encoded_slots
        bert_token_count = 0  # Track the number of BERT tokens

        raw_tokens = text.split()
        for rt in raw_tokens:
            bert_tokens = tokenizer.tokenize(rt)
            bert_token_count += len(bert_tokens)

            if bert_token_count > max_len - 2:  # Account for [CLS] and [SEP]
                break  # Stop processing if max length is reached

            rt_slot_name = get_slot_from_word(rt, all_slots[idx])
            if rt_slot_name is not None:
                enc.extend([slot_map[rt_slot_name]] * len(bert_tokens))
            else:
                enc.extend([0] * len(bert_tokens))

        # Truncate or pad the enc to fit into encoded_slots
        enc = enc[:max_len - 2]  # Truncate if necessary
        enc_length = len(enc)
        if enc_length < max_len - 2:
            enc.extend([0] * (max_len - 2 - enc_length))  # Pad with zeros if shorter

        encoded_slots[idx, 1:len(enc) + 1] = enc

    return encoded_slots



In [88]:
all_slots = [td["arguments"] for td in train_data]
all_texts = [td["text"] for td in train_data]

print(len(all_slots))
print(len(all_texts))
print(slot_map)

110
110
{'<PAD>': 0, 'I-WHO': 1, 'B-SUBSTANCE': 2, 'I-EVENT': 3, 'I-WHERE': 4, 'I-SUBSTANCE': 5, 'B-BODY-PARTS': 6, 'B-ACTIVITY': 7, 'B-WHO': 8, 'I-EQUIPMENT': 9, 'I-DEATH': 10, 'B-INJURED': 11, 'I-WHEN': 12, 'B-WHERE': 13, 'I-INJURED': 14, 'B-EVENT': 15, 'I-BODY-PARTS': 16, 'B-WHEN': 17, 'B-EQUIPMENT': 18, 'B-CAUSE': 19, 'I-ACTIVITY': 20, 'I-CAUSE': 21, 'I-INJURY': 22, 'B-DEATH': 23, 'B-INJURY': 24}


In [89]:
encoded_slots = encode_slots(all_slots, all_texts, tokenizer, slot_map)

In [90]:
encoded_slots[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 17, 12, 12, 15, 15, 15,
       15,  3,  3,  1,  1,  1,  1,  1,  1,  1, 20,  0,  0,  0,  0,  0, 20,
       20,  9,  9,  0,  0,  0,  0,  0,  9,  9,  9,  0,  0,  0,  0,  0,  7,
        7, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,  0,  0,  0, 15, 15, 15,
       15,  0,  0,  3,  0,  0, 20,  0,  7, 20, 20, 20, 20, 20, 20, 20, 20,
        0, 20, 18, 18,  9,  0,  0, 20,  0,  0, 19,  0,  0,  0,  0,  0, 20,
       20,  0,  0,  0,  0,  0,  0,  0, 15, 15, 15, 15,  3,  3,  0,  0,  0,
        3,  0,  7, 20,  0,  0, 19,  0,  3,  0, 20,  0, 20, 20,  9,  9,  0,
       20,  0, 20, 20,  9,  3,  0,  0, 20,  9,  9,  3,  0,  0,  0,  0,  0,
       20,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0, 20,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  0,  0,  3,
        0,  0, 20,  0,  0, 20, 20,  0,  0,  3,  0,  0,  0, 20, 21, 21, 20,
       20,  0, 21,  0, 20, 20, 20,  0,  0, 21,  0,  0,  0,  0,  0,  0,  0,
       19,  0, 21, 21,  3

## Classifier Model

### Definition

In [91]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, slot_num_labels=None,
                 model_name=model_name, dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        self.slot_classifier = Dense(slot_num_labels,
                                     name="slot_classifier")

    def call(self, inputs, **kwargs):
        # two outputs from BERT
        trained_bert = self.bert(inputs, **kwargs)
        sequence_output = trained_bert.last_hidden_state

        # sequence_output will be used for slot_filling / classification
        sequence_output = self.dropout(sequence_output,
                                       training=kwargs.get("training", False))
        slot_logits = self.slot_classifier(sequence_output)

        return slot_logits

In [92]:
joint_model = JointIntentAndSlotFillingModel(slot_num_labels=len(slot_map))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### Hyperparams, Optimizer and Loss function

In [93]:
# Configure the optimizer
opt = Adam(learning_rate=3e-5, epsilon=1e-08)

# Since the model only outputs slots, use one loss function and one metric
loss = SparseCategoricalCrossentropy(from_logits=True)
metric = SparseCategoricalAccuracy("accuracy")

# Compile the model
joint_model.compile(optimizer=opt, loss=loss, metrics=[metric])

### Train

In [97]:
x = {
    "input_ids": encoded_texts["input_ids"],
    "token_type_ids": encoded_texts["token_type_ids"],
    "attention_mask": encoded_texts["attention_mask"]
}

history = joint_model.fit(
    x,
    encoded_slots,  # Target slot labels
    epochs=40,
    batch_size=8,
    shuffle=True
)


Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


## Inference

In [98]:
def nlu(text, tokenizer, model, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits = outputs

    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, :]

    info = {"slots": {}}

    out_dict = {}
    # get all slot names and add to out_dict as keys
    predicted_slots = set([slot_names[s] for s in slot_ids if s != 0])
    for ps in predicted_slots:
      out_dict[ps] = []

    # check if the text starts with a small letter
    if text[0].islower():
      tokens = tokenizer.tokenize(text, add_special_tokens=True)
    else:
      tokens = tokenizer.tokenize(text)
    for token, slot_id in zip(tokens, slot_ids):
        # add all to out_dict
        slot_name = slot_names[slot_id]

        if slot_name == "<PAD>":
            continue

        # collect tokens
        collected_tokens = [token]
        idx = tokens.index(token)

        # see if it starts with ##
        # then it belongs to the previous token
        if token.startswith("##"):
          # check if the token already exists or not
          if tokens[idx - 1] not in out_dict[slot_name]:
            collected_tokens.insert(0, tokens[idx - 1])

        # add collected tokens to slots
        out_dict[slot_name].extend(collected_tokens)

    # process out_dict
    for slot_name in out_dict:
        tokens = out_dict[slot_name]
        slot_value = tokenizer.convert_tokens_to_string(tokens)

        info["slots"][slot_name] = slot_value.strip()

    return info


In [99]:
nlu("On April 5  2010  an employee and a coworker of a utility contractor were  involved with the replacement of natural gas line risers at single family  homes. A 3-ft deep hole was hand dug  approximately 18-in. in diameter  to  access the main 1-in. gas line. A footage squeeze tool was clamped onto the  1-in. main gas line and the old riser assembly was removed. During the process  of installing the new riser  the clamp was removed causing the flow of natural  gas to enter the excavated hole. The employee was found by the coworker face  down in the hole overcome by the gas. The employee was killed.", tokenizer, joint_model, slot_names)

{'slots': {'I-WHEN': '2010 an',
  'I-WHO': 'a contractor were the',
  'B-WHO': 'employee cow utility',
  'I-EQUIPMENT': 'squeeze tool was',
  'B-WHERE': 'family',
  'I-ACTIVITY': 'a natural gas risers at and installing natural gas',
  'I-EVENT': 'and coworker of replacement line was hand diameter main line clamped 1 line old removed process new c removed flow to excavated was found by the coworker face down in the hole overcome by the gas was killed',
  'I-WHERE': 'homes',
  'I-CAUSE': 'access clamp was causing the of enter the hole',
  'B-EVENT': 'employee employee',
  'B-WHEN': '5',
  'B-ACTIVITY': 'of',
  'B-EQUIPMENT': '3 footage'}}

## Generate prediction.json

This section creates a file containing all the prediction results for inputs from dev.json