# BERT sequential classifier as simple QA system

This notebook contains the code needed to fine-tune a BERT sequence classifier as a simple QA system on our provided training and testing data.

The data consists of zoning ordinance questions and their respective answers.

This system was not designed to be a full-fledged QA system; it serves as a contrast to the more fully featured systems tested in other notebooks and implementations.

To run this notebook, run each cell in order.

In [2]:
import os

os.environ['KMP_DUPLICATE_LIB_OK']='True'

from sklearn.model_selection import train_test_split
from datasets import load_dataset
import pandas as pd
import numpy as np
import evaluate

file_path = f'{os.getcwd()}/data'

from transformers import AutoTokenizer
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification

Print the file path to confirm that relative paths do not break on non-local machines.

In [3]:
print(file_path)

/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data


### Simple Additional Preprocessing:

Load the training set as a data frame to verify its contents, then run simple replace operations to strip double spaces and commas from the questions and to rewrite True/False answers as Yes/No. The data has already been preprocessed once by either the corpus builder or the template generator.

In [4]:
hf_dataset_df = pd.read_csv(f'{file_path}/csv/questions_answers.csv', low_memory=False)

hf_dataset_df['question'] = hf_dataset_df['question'].str.replace('  ', ' ')
hf_dataset_df['question'] = hf_dataset_df['question'].str.replace(',', '')

hf_dataset_df.loc[hf_dataset_df['answer'] == 'True', 'answer'] = 'Yes'
hf_dataset_df.loc[hf_dataset_df['answer'] == 'False', 'answer'] = 'No'

hf_dataset_df.head()

Unnamed: 0,sparql,template_name,variables,answer,question
0,\nSELECT ?zoning_label\n\nWHERE {\n ?zo...,template_use_1var_m_answer,{'use': 'group care facilities'},"['C2', 'C3', 'C4']",Which zoning districts allow group care facili...
1,\nSELECT ?zoning_label\n\nWHERE {\n ?zo...,template_use_1var_m_answer,{'use': 'group care facilities'},"['C2', 'C3', 'C4']",Which zoning districts permit group care facil...
2,\nSELECT ?zoning_label\n\nWHERE {\n ?zo...,template_use_1var_m_answer,{'use': 'group care facilities'},"['C2', 'C3', 'C4']",I would like to build group care facilities. W...
3,\nSELECT ?zoning_label\n\nWHERE {\n ?zo...,template_use_1var_m_answer,{'use': 'dry cleaning plants'},"['FI1', 'FI2', 'FI3']",Which zoning districts allow dry cleaning plants?
4,\nSELECT ?zoning_label\n\nWHERE {\n ?zo...,template_use_1var_m_answer,{'use': 'dry cleaning plants'},"['FI1', 'FI2', 'FI3']",Which zoning districts permit dry cleaning pla...


### Creating classes dictionary:

The sequence classifier is a multiclass classifier and requires numerical class representation. This dictionary will also be used to convert the predicted answers and ground truth labels back to their natural language format for evaluation.

In [5]:
hf_dataset_df = hf_dataset_df.filter(['answer', 'question'], axis=1)

uni = np.unique(hf_dataset_df['answer'], return_counts=True)

d = dict(enumerate(uni[0].flatten(), 1))
inv_map = {v: k for k, v in d.items()}
print(inv_map)

{'No': 1, 'Yes': 2, "['0 [ft_i]']": 3, "['1 [du/acr_u]']": 4, "['10 [ft_i]']": 5, "['100 [ft_i]']": 6, "['10000 [sft_i]']": 7, "['12 [du/acr_u]']": 8, "['12 [u/acr_u]']": 9, "['125 [ft_i]']": 10, "['15 [ft_i]']": 11, "['150 [ft_i]']": 12, "['2 [du/acr_u]']": 13, "['20 [ft_i]']": 14, "['20000 [sft_i]']": 15, "['25 [ft_i]']": 16, "['30 [ft_i]']": 17, "['35 [ft_i]']": 18, "['35000 [sft_i]']": 19, "['4 [du/acr_u]']": 20, "['40 [ft_i]']": 21, "['5 [ft_i]']": 22, "['50 [ft_i]']": 23, "['6 [du/acr_u]']": 24, "['60 [ft_i]']": 25, "['6000 [sft_i]']": 26, "['70 [ft_i]']": 27, "['75 [ft_i]']": 28, "['8 [du/acr_u]']": 29, "['80 [ft_i]']": 30, "['90 [ft_i]']": 31, "['A1']": 32, "['A2']": 33, "['C1', 'C2', 'C3', 'C4', 'FI1', 'FI2', 'FI3']": 34, "['C1', 'C2', 'C3', 'C4']": 35, "['C2', 'C3', 'C4']": 36, "['C3', 'C4']": 37, "['C4']": 38, "['FI1', 'FI2', 'FI3']": 39, "['FI2', 'FI3']": 40, "['FI3']": 41, "['R1', 'R2', 'R3', 'C1', 'C2', 'C3', 'C4', 'FI1', 'FI2', 'FI3']": 42, "['R1', 'R2', 'R3', 'C1', 'C2'
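
As a small illustration of how this mapping behaves, the sketch below rebuilds it on a few made-up answers (the `toy_answers` values and the variable names are hypothetical, not taken from the dataset). Because `enumerate(..., 1)` starts at 1, class ID 0 is never assigned, so a classifier head sized for these labels needs one more output than there are distinct answers. This may be why `num_labels` is set by hand later in the notebook.

```python
import numpy as np

# Hypothetical stand-in for hf_dataset_df['answer'].
toy_answers = np.array(["Yes", "No", "['C4']", "Yes", "['A1']", "No"])

# np.unique returns the distinct answers in sorted order.
classes = np.unique(toy_answers)

# Same construction as in the cell above: IDs start at 1, not 0.
id_to_answer = dict(enumerate(classes.tolist(), 1))
answer_to_id = {answer: idx for idx, answer in id_to_answer.items()}

# Largest ID + 1 is the smallest head size that can index every label.
min_num_labels = max(answer_to_id.values()) + 1

print(answer_to_id)
print(min_num_labels)
```

Starting the enumeration at 0 instead would make the number of labels equal to the number of distinct answers, at the cost of changing every class ID shown in this notebook.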

Mapping answers to their class IDs and renaming the columns to label and text for the fine-tuning process of the BERT transformer.

In [6]:
hf_dataset_df = hf_dataset_df.replace({'answer': inv_map})

hf_dataset_df.rename(columns={'answer': 'label', 'question': 'text'}, inplace=True)

hf_dataset_df.head()

Unnamed: 0,label,text
0,36,Which zoning districts allow group care facili...
1,36,Which zoning districts permit group care facil...
2,36,I would like to build group care facilities. W...
3,39,Which zoning districts allow dry cleaning plants?
4,39,Which zoning districts permit dry cleaning pla...


Train/test split and conversion to the required JSON format.

In [7]:
train, test = train_test_split(hf_dataset_df, test_size=0.1, random_state=246341428)

train.to_json(f'{file_path}/json/QAZoningTrain.json', orient='records', lines=True)
test.to_json(f'{file_path}/json/QAZoningTest.json', orient='records', lines=True)
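
With `orient='records'` and `lines=True`, each row is written as one JSON object per line (the JSON Lines layout). The sketch below imitates that layout with the standard library only, using two rows whose questions and labels appear elsewhere in this notebook. It illustrates the file format rather than pandas' exact serializer.

```python
import json

# Two rows in the same label/text shape as the renamed data frame.
rows = [
    {"label": 39, "text": "Which zoning districts allow dry cleaning plants?"},
    {"label": 2, "text": "Are monument works allowed in a FI2 zoning district?"},
]

# One JSON object per line, with no enclosing list.
json_lines = "\n".join(json.dumps(row) for row in rows)

# Reading it back: every non-empty line parses independently.
parsed = [json.loads(line) for line in json_lines.splitlines() if line]

print(json_lines)
```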

### Tokenization and creation of Hugging Face Dataset class object

The Hugging Face training pipeline used below works with Dataset objects, so the train/test split is written out as JSON and loaded back with load_dataset.

In [8]:
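# do_lower_case is the tokenizer's lowercasing option (already the default for the uncased checkpoint)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)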

def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Load both splits at once; the two files were written by the train_test_split cell above
data_files = {
    "train": f'{file_path}/json/QAZoningTrain.json',
    "test": f'{file_path}/json/QAZoningTest.json',
}
print(data_files)
QA_dataset = load_dataset('json', data_files=data_files)
print(QA_dataset)

Using custom data configuration default-ffb13f45da879356


{'train': '/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/json/QAZoningTrain.json', 'test': '/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/json/QAZoningTest.json'}
Downloading and preparing dataset json/default to /home/jesusaur/.cache/huggingface/datasets/json/default-ffb13f45da879356/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab...



Dataset json downloaded and prepared to /home/jesusaur/.cache/huggingface/datasets/json/default-ffb13f45da879356/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab. Subsequent calls will reuse this data.



DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 955
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 107
    })
})
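
`padding="max_length"` together with `truncation=True` makes every tokenized question the same length (the model's maximum, 512 tokens for bert-base-uncased). The sketch below is a simplified stand-in for that behaviour on plain lists of token IDs. The helper name `pad_or_truncate`, the short `max_length` and the token IDs are made up for illustration, and the real tokenizer additionally takes care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the overflow
    n_pad = max_length - len(kept)                   # padding: fill the remainder
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded, a long one is cut off.
short_ids, short_mask = pad_or_truncate([101, 2029, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which matters here because short questions are mostly padding at this length.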


### Training and Evaluation Parameters

Initialization of the pretrained model and tokenization of dataset.

These are the parameters used for training and evaluation in the process of finetuning the model.

The selected metrics are accuracy and macro-averaged F1 from the Hugging Face Evaluate library.

The model is evaluated and saved at the end of each epoch.

There is room for additional hyperparameter tuning at this stage, but results were adequate with these initial parameters.

In [9]:
tokenized_data = QA_dataset.map(preprocess_function, batched=True)
    
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=48)

metric1 = evaluate.load('f1')
metric2 = evaluate.load('accuracy')

training_args = TrainingArguments(output_dir = "test_trainer",
                                  evaluation_strategy = "epoch",
                                  save_strategy = "epoch",
                                  do_train=True,
                                  do_eval=True,
                                  learning_rate=1e-5,
                                  logging_steps=50,
                                  eval_steps=50,  # only used with a "steps" evaluation strategy; ignored with "epoch"
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  num_train_epochs=25,
                                  weight_decay=0.001,)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    f1 = metric1.compute(predictions=predictions, references=labels, average='macro')
    accuracy = metric2.compute(predictions=predictions, references=labels)
    return {"accuracy": accuracy['accuracy'], "f1": f1['f1']}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    compute_metrics=compute_metrics
)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
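
Accuracy and macro-averaged F1 can diverge sharply when many classes are rare, since macro averaging gives every class equal weight regardless of how many examples it has. The sketch below recomputes both by hand with NumPy on made-up logits and labels. `accuracy_and_macro_f1` is a hypothetical helper, not part of the Evaluate library, and it is meant to illustrate the averaging rather than replace `compute_metrics`.

```python
import numpy as np

def accuracy_and_macro_f1(logits, labels):
    """Accuracy and unweighted mean of per-class F1 over classes seen in preds or labels."""
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    accuracy = float(np.mean(preds == labels))

    f1_scores = []
    for c in np.union1d(preds, labels):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        # F1 = 2TP / (2TP + FP + FN); the denominator is never 0 for a class in the union.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return accuracy, float(np.mean(f1_scores))

# Made-up example: the model always predicts class 0, which is also the majority class.
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [1.5, 0.2, 0.1],
    [3.0, 0.0, 0.5],
    [1.2, 1.0, 0.1],
    [0.9, 0.2, 0.8],
])
toy_labels = [0, 0, 0, 1, 2]

toy_accuracy, toy_macro_f1 = accuracy_and_macro_f1(toy_logits, toy_labels)
print(toy_accuracy, toy_macro_f1)
```

In this toy case three of five predictions are right, but two of the three classes score an F1 of zero, which pulls the macro average well below the accuracy.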

Fine-tuning of the pretrained model begins here.

In [10]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 955
  Num Epochs = 25
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3000
Trainer is attempting to log a value of "{0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2', 3: 'LABEL_3', 4: 'LABEL_4', 5: 'LABEL_5', 6: 'LABEL_6', 7: 'LABEL_7', 8: 'LABEL_8', 9: 'LABEL_9', 10: 'LABEL_10', 11: 'LABEL_11', 12: 'LABEL_12', 13: 'LABEL_13', 14: 'LABEL_14', 15: 'LABEL_15', 16: 'LABEL_16', 17: 'LABEL_17', 18: 'LABEL_18', 19: 'LABEL_19', 20: 'LABEL_20', 21: 'LABEL_21', 22: 'LABEL_22', 23: 'LABEL_23', 24: 'LABEL_24', 25: 'LABEL_25', 26: 'LABEL_26', 27: 'LABEL_27', 28: 'LABEL_28', 29: 'LABE

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,2.2009,1.643808,0.626168,0.082477
2,1.4269,1.308724,0.700935,0.146278
3,1.1696,1.182571,0.71028,0.164993
4,1.0229,1.114107,0.691589,0.145995
5,0.9492,1.039837,0.747664,0.213068
6,0.8805,0.968154,0.766355,0.27748
7,0.7503,0.903104,0.803738,0.328102
8,0.7447,0.812674,0.831776,0.383548
9,0.6686,0.755517,0.859813,0.441397
10,0.5893,0.704456,0.878505,0.47581


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-120
Configuration saved in test_trainer/checkpoint-120/config.json
Model weights saved in test_trainer/checkpoint-120/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-240
Configuration saved in test_trainer/checkpoint-240/config.json
Model weights saved in test_trainer/check

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-2040
Configuration saved in test_trainer/checkpoint-2040/config.json
Model weights saved in test_trainer/checkpoint-2040/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-2160
Configuration saved in test_trainer/checkpoint-2160/config.json
Model weights saved in test_trainer/

TrainOutput(global_step=3000, training_loss=0.6526463314692179, metrics={'train_runtime': 1432.0539, 'train_samples_per_second': 16.672, 'train_steps_per_second': 2.095, 'total_flos': 6284370917376000.0, 'train_loss': 0.6526463314692179, 'epoch': 25.0})
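
The counts in the logs can be cross-checked with a little arithmetic. The sketch below assumes that the split rounds the test share up (consistent with the 107 test rows reported earlier), that training runs on a single device, and that there is no gradient accumulation, as the log states. The variable names are illustrative.

```python
import math

n_rows = 955 + 107      # rows reported for the train and test splits
test_size = 0.1
batch_size = 8
epochs = 25

n_test = math.ceil(test_size * n_rows)              # fractional test share rounded up
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be partial
total_steps = steps_per_epoch * epochs

print(n_train, n_test, steps_per_epoch, total_steps)
```

The 120 steps per epoch also match the checkpoint names in the log, which advance in multiples of 120.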

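Evaluation already runs at the end of every epoch during training; this extra call to evaluate() prints the metrics of the model as it stands after the final epoch. Because load_best_model_at_end is not set in the training arguments, this is the last model and not necessarily the best checkpoint.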

In [11]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8


{'eval_loss': 0.4185863137245178,
 'eval_accuracy': 0.897196261682243,
 'eval_f1': 0.5132275132275131,
 'eval_runtime': 2.0396,
 'eval_samples_per_second': 52.461,
 'eval_steps_per_second': 6.864,
 'epoch': 25.0}
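
If the goal is for `evaluate()` to report the best checkpoint instead of the last one, `TrainingArguments` has options for reloading the best model when training ends. The fragment below sketches the extra settings as a plain dict so it can be inspected without loading a model. Choosing `f1` as the selection metric is an assumption, and newer transformers releases rename `evaluation_strategy` to `eval_strategy`.

```python
# Extra TrainingArguments settings (sketch); merge with the arguments used above.
best_model_kwargs = dict(
    evaluation_strategy="epoch",   # must match save_strategy for best-model reloading
    save_strategy="epoch",
    load_best_model_at_end=True,   # reload the best checkpoint when training finishes
    metric_for_best_model="f1",    # key returned by compute_metrics (assumed choice)
    greater_is_better=True,        # a higher F1 is better
)

# Usage (not executed here):
# training_args = TrainingArguments(output_dir="test_trainer", **best_model_kwargs)
print(best_model_kwargs)
```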

Simple function to map classes back to their natural language counterparts

In [12]:
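def get_key(d, value):
    # Return every key of d whose value equals `value`; with inv_map this turns a class ID back into its answer string
    return [k for k, v in d.items() if v == value]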

### Checking some results:

Below is a sanity check to see what kind of results are returned after the model is trained.

We have provided two correct results, but feel free to explore using the same format to find an incorrect example.

In [22]:
prediction = trainer.predict(tokenized_data["test"])
results = (tokenized_data["test"][19]['text'], get_key(inv_map, np.argmax(prediction[0][19], axis=-1)), 
          get_key(inv_map, prediction[1][19]))
results

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 107
  Batch size = 8


('I would like to build physical fitness centers. Which zoning districts permits this use?',
 ["['C2', 'C3', 'C4']"],
 ["['C2', 'C3', 'C4']"])

In [21]:
prediction = trainer.predict(tokenized_data["test"])
results = (tokenized_data["test"][7]['text'], get_key(inv_map, np.argmax(prediction[0][7], axis=-1)), 
          get_key(inv_map, prediction[1][7]))
results

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 107
  Batch size = 8


('Are monument works allowed in a FI2 zoning district?', ['Yes'], ['Yes'])
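
To look for misclassified questions without checking indices one at a time, something like the following could be used. It is a sketch on made-up logits, labels and texts. `find_mismatches` is a hypothetical helper, and `toy_id_to_answer` stands in for the inverse of `inv_map`.

```python
import numpy as np

def find_mismatches(logits, labels, texts, id_to_answer):
    """Return (index, text, predicted answer, true answer) for each wrong prediction."""
    predicted_ids = np.argmax(logits, axis=-1)
    mismatches = []
    for i, (pred, true) in enumerate(zip(predicted_ids, labels)):
        if pred != true:
            # .get() returns None for IDs with no answer string (e.g. the unused ID 0).
            mismatches.append(
                (i, texts[i], id_to_answer.get(int(pred)), id_to_answer.get(int(true)))
            )
    return mismatches

# Made-up stand-ins for prediction[0], prediction[1], the test texts and the inverted inv_map.
toy_logits = np.array([[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [0.0, 0.4, 0.9]])
toy_labels = np.array([1, 2, 1])
toy_texts = ["question a", "question b", "question c"]
toy_id_to_answer = {1: "No", 2: "Yes"}

wrong = find_mismatches(toy_logits, toy_labels, toy_texts, toy_id_to_answer)
print(wrong)
```

With the notebook's own objects, the call would presumably take `prediction[0]`, `prediction[1]`, the test split's `text` column, and `{v: k for k, v in inv_map.items()}`.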