# Joint Intent Classification and Slot filling with BERT
This notebook is based on the paper __BERT for Joint Intent Classification and Slot Filling__ by Chen et al. (2019), https://arxiv.org/abs/1902.10909 but on a different dataset made for a class project.

Ideas were also taken from https://github.com/monologg/JointBERT, which is a PyTorch implementation of the paper with the original dataset.


## Install transformers

In [1]:
!pip install transformers



## Download data

In [2]:
!wget https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/train.json

--2023-12-05 16:11:27--  https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/train.json
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/train.json [following]
--2023-12-05 16:11:27--  https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/train.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5055766 (4.8M) [text/plain]
Saving to: ‘train.json.1’


2023-12-05 16:11:27 (145 MB/s) - ‘train.json.1’ saved [5055766/5055766]



In [3]:
!wget https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/dev.json

--2023-12-05 16:11:27--  https://github.com/ShawonAshraf/nlu-jointbert-dl2021/raw/main/data/nlu_traindev/dev.json
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/dev.json [following]
--2023-12-05 16:11:28--  https://raw.githubusercontent.com/ShawonAshraf/nlu-jointbert-dl2021/main/data/nlu_traindev/dev.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 248459 (243K) [text/plain]
Saving to: ‘dev.json.1’


2023-12-05 16:11:28 (18.7 MB/s) - ‘dev.json.1’ saved [248459/248459]



In [4]:
# Connection à drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## Read data from json files

Data is of the following format
````json5
{
  "text": "",
  "positions": [{}],
  "slots": [{}],
  "intent": ""
}
````

We will be using `text` as the input and `slots` and `intent` as lables

In [5]:
# import json
# import os

# class RawData(object):
#     def __init__(self, id, intent, positions, slots, text):
#         self.id = id
#         self.intent = intent
#         self.positions = positions
#         self.slots = slots
#         self.text = text

#     def __repr__(self):
#         return str(json.dumps(self.__dict__, indent=2))


# """
# reads json from data file
# returns a list containing DataInstance objects
# """


# def read_train_json_file(filename):
#     if os.path.exists(filename):
#         intents = []

#         with open(filename, "r", encoding="utf-8") as json_file:
#             data = json.load(json_file)

#             for k in data.keys():
#                 intent = data[k]["intent"]
#                 positions = data[k]["positions"]
#                 slots = data[k]["slots"]
#                 text = data[k]["text"]

#                 temp = RawData(k, intent, positions, slots, text)
#                 intents.append(temp)

#         return intents
#     else:
#         raise FileNotFoundError("No file found with that path!")

# # read from json file
# train_data = read_train_json_file("train.json")

In [6]:
import json

train_data_path = '/content/drive/MyDrive/ULaval/dev_examples.json'
new_exemples_path = '/content/drive/MyDrive/ULaval/new_examples.json'
test_data_path = '/content/drive/MyDrive/ULaval/test_examples.json'


def load_incident_dataset(filename):
    with open(filename, 'r') as fp:
        incident_list = json.load(fp)

    return incident_list

# Load datasets
train_data = load_incident_dataset(train_data_path)
new_examples = load_incident_dataset(new_exemples_path)
test_data_path = load_incident_dataset(test_data_path)

In [7]:
example = train_data[0]
example

 'arguments': {'EVENT': ['Employee #1  was struck and thrown'],
  'ACTIVITY': ['checking the depth of the cut into the asphalt',
   'grind out existing asphalt from an  interstate at a railroad bridge overpass'],
  'WHO': ['Employee #1', 'Employee #1  with Villager  Construction Inc.'],
  'WHERE': ['railroad bridge overpass'],
  'WHEN': ['November 10  2013'],
  'CAUSE': ['The driver of the Tahoe  continued traveling in the far inside lane of the work zone'],
  'EQUIPMENT': ['asphalt milling machine',
   'Wirtgen; Model Number: W2100',
   'handheld  pendant',
   'a Chevrolet Tahoe'],
  'INJURY': ['severe trauma   lacerations  fractures  and contusions'],
  'INJURED': ['Employee #1'],
  'BODY-PARTS': ['body and head'],
  'DEATH': ['Employee #1']}}

## Load Tokenizer from transformers

We will use a pretrained bert model `bert-base-cased` for both Tokenizer and our classifier.

In [8]:
import tensorflow as tf
from transformers import AutoTokenizer

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode texts from the dataset

We have to encode the texts using the tokenizer to create tensors for training the classifier.

In [9]:
# https://huggingface.co/transformers/preprocessing.html

def encode_texts(tokenizer, texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="tf", max_length=512)

texts = [d["text"] for d in train_data]
tds = encode_texts(tokenizer, texts)
tds.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [10]:
encoded_texts = tds

for t in encoded_texts["input_ids"]:
  if t.shape != (512,):
    print(t.shape)

## Encode labels
### Intents

In [11]:

# intents = [d.intent for d in train_data]
# intent_names = list(set(intents))
# intent_names

In [12]:
# intent_map = dict() # index -> intent
# for idx, ui in enumerate(intent_names):
#     intent_map[ui] = idx
# intent_map

In [13]:
# # map to train_data values
# def encode_intents(intents, intent_map):
#     encoded = []
#     for i in intents:
#         encoded.append(intent_map[i])
#     # convert to tf tensor
#     return tf.convert_to_tensor(encoded, dtype="int32")

# encoded_intents = encode_intents(intents, intent_map)

### Slots

To padd all the texts to the same length, the tokenizer will use special characters. To handle those we need to add <PAD> to slots_names. It can be some other symbol as well.

In [14]:
# encode slots
slot_names = set()
for td in train_data:
    slots = td["arguments"]
    for slot in slots:
        slot_names.add(slot)
slot_names = list(slot_names)
slot_names.insert(0, "<PAD>")
slot_names

['<PAD>',
 'WHEN',
 'DEATH',
 'EVENT',
 'ACTIVITY',
 'WHERE',
 'CAUSE',
 'SUBSTANCE',
 'INJURED',
 'EQUIPMENT',
 'BODY-PARTS',
 'WHO',
 'INJURY']

In [15]:
slot_map = dict() # slot -> index
for idx, us in enumerate(slot_names):
    slot_map[us] = idx
slot_map

{'<PAD>': 0,
 'WHEN': 1,
 'DEATH': 2,
 'EVENT': 3,
 'ACTIVITY': 4,
 'WHERE': 5,
 'CAUSE': 6,
 'SUBSTANCE': 7,
 'INJURED': 8,
 'EQUIPMENT': 9,
 'BODY-PARTS': 10,
 'WHO': 11,
 'INJURY': 12}

In [16]:
# gets slot name from its values
def get_slot_from_word(word, slot_dict):
    for slot_label, value in slot_dict.items():
        for slot_element in value:
          if word in slot_element.split():
              return slot_label
    return None

print(train_data[0]["text"])
print(train_data[0]["arguments"])
print("slot_name for struck is : ", get_slot_from_word("struck", train_data[0]["arguments"]))

{'EVENT': ['Employee #1  was struck and thrown'], 'ACTIVITY': ['checking the depth of the cut into the asphalt', 'grind out existing asphalt from an  interstate at a railroad bridge overpass'], 'WHO': ['Employee #1', 'Employee #1  with Villager  Construction Inc.'], 'WHERE': ['railroad bridge overpass'], 'WHEN': ['November 10  2013'], 'CAUSE': ['The driver of the Tahoe  continued traveling in the far inside lane of the work zone'], 'EQUIPMENT': ['asphalt milling machine', 'Wirtgen; Model Number: W2100', 'handheld  pendant', 'a Chevrolet Tahoe'], 'INJURY': ['severe trauma   lacerations  fractures  and contusions'], 'INJURED': ['Employee #1'], 'BODY-PARTS': ['body and head'], 'DEATH': ['Employee #1']}
slot_name for struck is :  EVENT


In [17]:
import numpy as np

# find the max encoded test length
# tokenizer pads all texts to same length anyway so
# just get the length of the first one's input_ids
max_len = len(encoded_texts["input_ids"][0])

def encode_slots(all_slots, all_texts, tokenizer, slot_map, max_len=512):
    encoded_slots = np.zeros(shape=(len(all_texts), max_len), dtype=np.int32)

    for idx, text in enumerate(all_texts):
        enc = [] # for this idx, to be added at the end to encoded_slots
        bert_token_count = 0  # Track the number of BERT tokens

        raw_tokens = text.split()
        for rt in raw_tokens:
            bert_tokens = tokenizer.tokenize(rt)
            bert_token_count += len(bert_tokens)

            if bert_token_count > max_len - 2:  # Account for [CLS] and [SEP]
                break  # Stop processing if max length is reached

            rt_slot_name = get_slot_from_word(rt, all_slots[idx])
            if rt_slot_name is not None:
                enc.extend([slot_map[rt_slot_name]] * len(bert_tokens))
            else:
                enc.extend([0] * len(bert_tokens))

        # Truncate or pad the enc to fit into encoded_slots
        enc = enc[:max_len - 2]  # Truncate if necessary
        enc_length = len(enc)
        if enc_length < max_len - 2:
            enc.extend([0] * (max_len - 2 - enc_length))  # Pad with zeros if shorter

        encoded_slots[idx, 1:len(enc) + 1] = enc

    return encoded_slots



In [18]:
all_slots = [td["arguments"] for td in train_data]
all_texts = [td["text"] for td in train_data]

print(len(all_slots))
print(len(all_texts))
print(slot_map)

100
100
{'<PAD>': 0, 'WHEN': 1, 'DEATH': 2, 'EVENT': 3, 'ACTIVITY': 4, 'WHERE': 5, 'CAUSE': 6, 'SUBSTANCE': 7, 'INJURED': 8, 'EQUIPMENT': 9, 'BODY-PARTS': 10, 'WHO': 11, 'INJURY': 12}


In [19]:
encoded_slots = encode_slots(all_slots, all_texts, tokenizer, slot_map)

In [20]:
encoded_slots[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  3,  3,  3,
        3,  3,  3, 11, 11, 11, 11, 11, 11, 11,  4,  0,  0,  0,  0,  0,  4,
        4,  9,  9,  0,  0,  0,  0,  0,  9,  9,  9,  0,  0,  0,  0,  0,  4,
        4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  0,  0,  0,  3,  3,  3,
        3,  0,  0,  3,  0,  0,  4,  0,  4,  4,  4,  4,  4,  4,  4,  4,  4,
        0,  4,  9,  9,  9,  0,  0,  4,  0,  0,  6,  0,  0,  0,  0,  0,  4,
        4,  0,  0,  0,  0,  0,  0,  0,  3,  3,  3,  3,  3,  3,  0,  0,  0,
        3,  0,  4,  4,  0,  0,  6,  0,  3,  0,  4,  0,  4,  4,  9,  9,  0,
        4,  0,  4,  4,  9,  3,  0,  0,  4,  9,  9,  3,  0,  0,  0,  0,  0,
        4,  0,  0,  0,  0,  0,  0,  0,  4,  0,  0,  0,  0,  0,  0,  4,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11, 11,  0,  0,  0,  0,  3,
        0,  0,  4,  0,  0,  4,  4,  0,  0,  3,  0,  0,  0,  4,  6,  6,  4,
        4,  0,  6,  0,  4,  4,  4,  0,  0,  6,  0,  0,  0,  0,  0,  0,  0,
        6,  0,  6,  6,  3

## Classifier Model

### Definition

In [21]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, slot_num_labels=None,
                 model_name=model_name, dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        self.slot_classifier = Dense(slot_num_labels,
                                     name="slot_classifier")

    def call(self, inputs, **kwargs):
        # two outputs from BERT
        trained_bert = self.bert(inputs, **kwargs)
        sequence_output = trained_bert.last_hidden_state

        # sequence_output will be used for slot_filling / classification
        sequence_output = self.dropout(sequence_output,
                                       training=kwargs.get("training", False))
        slot_logits = self.slot_classifier(sequence_output)

        return slot_logits

In [22]:
joint_model = JointIntentAndSlotFillingModel(slot_num_labels=len(slot_map))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### Hyperparams, Optimizer and Loss function

In [23]:
# Configure the optimizer
opt = Adam(learning_rate=3e-5, epsilon=1e-08)

# Since the model only outputs slots, use one loss function and one metric
loss = SparseCategoricalCrossentropy(from_logits=True)
metric = SparseCategoricalAccuracy("accuracy")

# Compile the model
joint_model.compile(optimizer=opt, loss=loss, metrics=[metric])

### Train

In [24]:
x = {
    "input_ids": encoded_texts["input_ids"],
    "token_type_ids": encoded_texts["token_type_ids"],
    "attention_mask": encoded_texts["attention_mask"]
}

history = joint_model.fit(
    x,
    encoded_slots,  # Target slot labels
    epochs=2,
    batch_size=8,
    shuffle=True
)


Epoch 1/2




Epoch 2/2


## Inference

In [29]:
def nlu(text, tokenizer, model, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits = outputs

    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, :]

    info = {"slots": {}}

    out_dict = {}
    # get all slot names and add to out_dict as keys
    predicted_slots = set([slot_names[s] for s in slot_ids if s != 0])
    for ps in predicted_slots:
      out_dict[ps] = []

    # check if the text starts with a small letter
    if text[0].islower():
      tokens = tokenizer.tokenize(text, add_special_tokens=True)
    else:
      tokens = tokenizer.tokenize(text)
    for token, slot_id in zip(tokens, slot_ids):
        # add all to out_dict
        slot_name = slot_names[slot_id]

        if slot_name == "<PAD>":
            continue

        # collect tokens
        collected_tokens = [token]
        idx = tokens.index(token)

        # see if it starts with ##
        # then it belongs to the previous token
        if token.startswith("##"):
          # check if the token already exists or not
          if tokens[idx - 1] not in out_dict[slot_name]:
            collected_tokens.insert(0, tokens[idx - 1])

        # add collected tokens to slots
        out_dict[slot_name].extend(collected_tokens)

    # process out_dict
    for slot_name in out_dict:
        tokens = out_dict[slot_name]
        slot_value = tokenizer.convert_tokens_to_string(tokens)

        info["slots"][slot_name] = slot_value.strip()

    return info


In [30]:
nlu("On April 5  2010  an employee and a coworker of a utility contractor were  involved with the replacement of natural gas line risers at single family  homes. A 3-ft deep hole was hand dug  approximately 18-in. in diameter  to  access the main 1-in. gas line. A footage squeeze tool was clamped onto the  1-in. main gas line and the old riser assembly was removed. During the process  of installing the new riser  the clamp was removed causing the flow of natural  gas to enter the excavated hole. The employee was found by the coworker face  down in the hole overcome by the gas. The employee was killed.", tokenizer, joint_model, slot_names)

{'slots': {'ACTIVITY': 'the replacement of natural risers approximately main old new c excavated hole',
  'WHO': 'employee coworker',
  'WHEN': '5 2010',
  'EVENT': '1'}}

In [31]:
nlu("add Brian May to my Reggae Infusions list", tokenizer, joint_model, slot_names)

{'slots': {'EVENT': 'Regga'}}

In [None]:
import calendar
import time

# to generate timestamps for prediction file
def get_time_stamp():
    ts = calendar.timegm(time.gmtime())
    return ts

get_time_stamp()

## Generate prediction.json

This section creates a file containing all the prediction results for inputs from dev.json