# ROBERTA sequence classifier as a simple QA system

This notebook contains the code needed to fine-tune a ROBERTA sequence classifier as a QA system on our provided training and testing data.

The data consists of zoning ordinance questions and their respective answers.

This system was not designed to be a full-fledged QA system; it was created as a contrast to the more fully featured systems tested in other notebooks and implementations. Specifically, fine-tuning the SQuAD-based model in the next notebooks required a ROBERTA model, so we selected ROBERTA here to allow a direct comparison.

To run this notebook, simply run each cell in order.

In [1]:
import os

os.environ['KMP_DUPLICATE_LIB_OK']='True'

from sklearn.model_selection import train_test_split
from datasets import load_dataset  # metrics are loaded through the evaluate library below
import pandas as pd
import numpy as np
import ipywidgets
import evaluate
import torch

file_path = f'{os.getcwd()}/data'

from transformers import AutoTokenizer
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification
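
The training log further down prints a `huggingface/tokenizers` warning about the process being forked after parallelism has been used, and suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of doing so is shown here, assuming it is placed next to the existing `KMP_DUPLICATE_LIB_OK` line before any tokenizer is used. Choosing `'false'` is an assumption; it gives up parallel tokenization in exchange for a quieter log.

```python
import os

# Set this before the tokenizer is first used.
# 'false' disables tokenizer parallelism (an assumed choice, not the notebook's setting).
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```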

### Tokenization and creation of Hugging Face Dataset class object

The ROBERTA model and tokenizer from Hugging Face require the data to be converted to a dataset object, which is why the train/test split needs to exist as JSON.

Since we already created the JSON train/test split in the BERT notebook, we can skip data examination and additional preprocessing.
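
As a rough illustration of the split step referred to above, the sketch below shuffles labelled records and writes them as two JSON Lines files, a layout that `load_dataset('json', ...)` can read. It uses only the standard library rather than the `train_test_split` function imported earlier. The toy records, the 10% test fraction, the seed, and the output file names are all assumptions, not the values used to build `QAZoningTrain.json` and `QAZoningTest.json`.

```python
import json
import math
import random
import tempfile
from pathlib import Path


def split_records(records, test_fraction=0.1, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_json_lines(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


# Hypothetical records with the same two fields ('label', 'text') as the real dataset.
toy_records = [{"label": i % 3, "text": f"question {i}"} for i in range(20)]
train_records, test_records = split_records(toy_records)

out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "toy_train.json"
test_path = out_dir / "toy_test.json"
write_json_lines(train_records, train_path)
write_json_lines(test_records, test_path)
print(len(train_records), len(test_records))
```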

In [2]:
tokenizer = AutoTokenizer.from_pretrained('lexlms/roberta-base-uncased', lower=True)

def preprocess_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True)

# Load several JSON files at once by mapping split names to file paths.
# The train/test split must already have been written to these files (e.g. with sklearn's train_test_split).
data_files = {"train": f'{file_path}/json/QAZoningTrain.json',
              "test": f'{file_path}/json/QAZoningTest.json'}
print(data_files)
QA_dataset = load_dataset('json', data_files=data_files)
print(QA_dataset)

Using custom data configuration default-ffb13f45da879356


{'train': '/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/json/QAZoningTrain.json', 'test': '/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/json/QAZoningTest.json'}


Found cached dataset json (/home/jesusaur/.cache/huggingface/datasets/json/default-ffb13f45da879356/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)


DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 955
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 107
    })
})


### Training and Evaluation Parameters

Initialization of the pretrained model and tokenization of dataset.

These are the parameters used for training and evaluation in the process of finetuning the model.

The metrics selected were accuracy and macro-averaged F1, both from the Hugging Face Evaluate library.

The model is evaluated and saved at each epoch.

There is room for additional hyperparameter tuning at this stage, but results were adequate with these initial parameters.
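
Accuracy and macro-averaged F1 can diverge sharply when some classes are rare, because macro F1 gives every class the same weight regardless of how many examples it has. The pure-Python sketch below illustrates that effect on made-up labels. It is not the `evaluate` implementation, and it assumes the average is taken over every label that appears in either the predictions or the references.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(predictions) | set(references))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: class 0 dominates, classes 1 and 2 are rare and always missed.
references = [0, 0, 0, 0, 1, 2]
predictions = [0, 0, 0, 0, 0, 0]
print(accuracy(predictions, references))
print(macro_f1(predictions, references))
```

Here the majority class is always predicted, so accuracy is about 0.67, while the two missed classes each contribute an F1 of zero and pull the macro average down to about 0.27.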

In [3]:
tokenized_data = QA_dataset.map(preprocess_function, batched=True)
    
model = AutoModelForSequenceClassification.from_pretrained('lexlms/roberta-base-uncased', num_labels=48)

metric1 = evaluate.load('f1')
metric2 = evaluate.load('accuracy')

training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",  # evaluate at the end of every epoch
                                  save_strategy="epoch",        # save a checkpoint at the end of every epoch
                                  do_train=True,
                                  do_eval=True,
                                  learning_rate=1e-5,
                                  logging_steps=50,
                                  eval_steps=50,  # has no effect while evaluation_strategy is "epoch"
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  num_train_epochs=25,
                                  weight_decay=0.001)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    f1 = metric1.compute(predictions=predictions, references=labels, average='macro')
    accuracy = metric2.compute(predictions=predictions, references=labels)
    return {"accuracy": accuracy['accuracy'], "f1": f1['f1']}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at lexlms/roberta-base-uncased were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at lexlms/roberta-base-uncased and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias', 'classifier

Fine-tuning of the pretrained model begins here.

In [4]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 955
  Num Epochs = 25
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3000


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Trainer is attempting to log a value of "{0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2', 3: 'LABEL_3', 4: 'LABEL_4', 5: 'LABEL_5', 6: 'LABEL_6', 7: 'LABEL_7', 8: 'LABEL_8', 9: 'LABEL_9', 10: 'LABEL_10', 11: 'LABEL_11', 12: 'LABEL_12', 13: 'LABEL_13', 14: 'LABEL_14', 15: 'LABEL_15', 16: 'LABEL_16', 17: 'LABEL_17', 18: 'LABEL_18', 19: 'LABEL_19', 20: 'LABEL_20', 21: 'LABEL_21', 22: 'LABEL_22', 23: 'LABEL_23', 24: 'LABEL_24', 25: 'LABEL_25', 26: 'LABEL_26', 27: 'LABEL_27', 28: 'LABEL_28', 29: 'LABEL_29', 30: 'LABEL_30', 31: 'LABEL_31', 32: 'LABEL_32', 33: 'LABEL_33', 34: 'LABEL_34', 35: 'LABEL_35', 36: 'LABEL_36', 37: 'LABEL_37', 38: 'LABEL_38', 39: 'LABEL_39', 40: 'LABEL_40', 41: 'LABEL_41', 42: 'LABEL_42', 43: 'LABEL_43', 44: 'LABEL_44', 45: 'LABEL_45', 46: 'LABEL_46', 47: 'LABEL_47'}" for key "id2label" as a parameter. MLflow's log_param() only accepts values no longer than 250 characters so we dropped this attribute. You can use `MLFLOW_FLATTEN_PARAMS` environment variable to flatten the p



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 1.9572 | 1.550662 | 0.598131 | 0.052844 |
| 2 | 1.2588 | 1.228285 | 0.682243 | 0.130226 |
| 3 | 1.0951 | 1.14937 | 0.682243 | 0.132505 |
| 4 | 0.9766 | 1.06705 | 0.738318 | 0.21075 |
| 5 | 0.8607 | 0.936458 | 0.794393 | 0.310231 |
| 6 | 0.7622 | 0.783563 | 0.813084 | 0.325286 |
| 7 | 0.5975 | 0.719934 | 0.850467 | 0.412667 |
| 8 | 0.5709 | 0.633422 | 0.878505 | 0.459615 |
| 9 | 0.4981 | 0.564662 | 0.878505 | 0.484984 |
| 10 | 0.4083 | 0.513802 | 0.869159 | 0.440947 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-120
Configuration saved in test_trainer/checkpoint-120/config.json
Model weights saved in test_trainer/checkpoint-120/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-240
Configuration saved in test_trainer/checkpoint-240/config.json
Model weights saved in test_t

Configuration saved in test_trainer/checkpoint-1920/config.json
Model weights saved in test_trainer/checkpoint-1920/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-2040
Configuration saved in test_trainer/checkpoint-2040/config.json
Model weights saved in test_trainer/checkpoint-2040/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8
Saving mod

TrainOutput(global_step=3000, training_loss=0.5237573135693868, metrics={'train_runtime': 1443.5754, 'train_samples_per_second': 16.539, 'train_steps_per_second': 2.078, 'total_flos': 6284370917376000.0, 'train_loss': 0.5237573135693868, 'epoch': 25.0})
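
The step counts in the log can be checked by hand. With 955 training examples and a batch size of 8, each epoch takes ceil(955 / 8) = 120 optimization steps, which matches the `checkpoint-120` directory saved after the first epoch and the 3000 total steps over 25 epochs. A small sketch of that arithmetic, assuming a single device and no gradient accumulation as reported in the log:

```python
import math

num_examples = 955   # "Num examples" in the training log
batch_size = 8       # per_device_train_batch_size
num_epochs = 25      # num_train_epochs

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 120 3000
```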

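Evaluation already occurs during training, but this additional call to `evaluate()` lets us print the final metrics. Because `load_best_model_at_end` is not set in the training arguments, the metrics describe the model after the last epoch, which is not necessarily the best checkpoint.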

In [5]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 107
  Batch size = 8


{'eval_loss': 0.29893869161605835,
 'eval_accuracy': 0.9345794392523364,
 'eval_f1': 0.6476190476190476,
 'eval_runtime': 1.9873,
 'eval_samples_per_second': 53.841,
 'eval_steps_per_second': 7.045,
 'epoch': 25.0}
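
The result above reports `'epoch': 25.0`, so it describes the weights after the last epoch. One way to see which epoch scored best is to scan the per-epoch evaluation records. The sketch below does this on a hand-copied list that mimics the shape of `trainer.state.log_history` entries. Only the first ten epochs from the training table are included, and the names `best_epoch` and `toy_log_history` are hypothetical.

```python
def best_epoch(log_history, metric="eval_f1"):
    """Return the evaluation record with the highest value of `metric`."""
    eval_records = [entry for entry in log_history if metric in entry]
    return max(eval_records, key=lambda entry: entry[metric])


# Values copied from the first ten rows of the training table in this notebook.
toy_log_history = [
    {"epoch": 1.0, "eval_accuracy": 0.598131, "eval_f1": 0.052844},
    {"epoch": 2.0, "eval_accuracy": 0.682243, "eval_f1": 0.130226},
    {"epoch": 3.0, "eval_accuracy": 0.682243, "eval_f1": 0.132505},
    {"epoch": 4.0, "eval_accuracy": 0.738318, "eval_f1": 0.21075},
    {"epoch": 5.0, "eval_accuracy": 0.794393, "eval_f1": 0.310231},
    {"epoch": 6.0, "eval_accuracy": 0.813084, "eval_f1": 0.325286},
    {"epoch": 7.0, "eval_accuracy": 0.850467, "eval_f1": 0.412667},
    {"epoch": 8.0, "eval_accuracy": 0.878505, "eval_f1": 0.459615},
    {"epoch": 9.0, "eval_accuracy": 0.878505, "eval_f1": 0.484984},
    {"epoch": 10.0, "eval_accuracy": 0.869159, "eval_f1": 0.440947},
]

best = best_epoch(toy_log_history)
print(best["epoch"], best["eval_f1"])  # 9.0 0.484984
```

If the best checkpoint should be reloaded automatically at the end of training, `TrainingArguments` accepts `load_best_model_at_end=True` together with `metric_for_best_model`; that option is not used in this notebook.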

Here we recreate the dictionary from the BERT notebook so that the mapping function can convert class labels back into their natural-language answers.

In [6]:
conversion = {'No': 1, 'Yes': 2, "['0 [ft_i]']": 3, "['1 [du/acr_u]']": 4, "['10 [ft_i]']": 5, "['100 [ft_i]']": 6,
 "['10000 [sft_i]']": 7, "['12 [du/acr_u]']": 8, "['12 [u/acr_u]']": 9, "['125 [ft_i]']": 10, "['15 [ft_i]']": 11,
 "['150 [ft_i]']": 12, "['2 [du/acr_u]']": 13, "['20 [ft_i]']": 14, "['20000 [sft_i]']": 15, "['25 [ft_i]']": 16,
 "['30 [ft_i]']": 17, "['35 [ft_i]']": 18, "['35000 [sft_i]']": 19, "['4 [du/acr_u]']": 20, "['40 [ft_i]']": 21,
 "['5 [ft_i]']": 22, "['50 [ft_i]']": 23, "['6 [du/acr_u]']": 24, "['60 [ft_i]']": 25, "['6000 [sft_i]']": 26,
 "['70 [ft_i]']": 27, "['75 [ft_i]']": 28, "['8 [du/acr_u]']": 29, "['80 [ft_i]']": 30, "['90 [ft_i]']": 31,
 "['A1']": 32, "['A2']": 33, "['C1', 'C2', 'C3', 'C4', 'FI1', 'FI2', 'FI3']": 34, "['C1', 'C2', 'C3', 'C4']": 35,
 "['C2', 'C3', 'C4']": 36, "['C3', 'C4']": 37, "['C4']": 38, "['FI1', 'FI2', 'FI3']": 39, "['FI2', 'FI3']": 40,
 "['FI3']": 41, "['R1', 'R2', 'R3', 'C1', 'C2', 'C3', 'C4', 'FI1', 'FI2', 'FI3']": 42,
 "['R1', 'R2', 'R3', 'C1', 'C2', 'C3', 'C4']": 43, "['R1', 'R2', 'R3']": 44, "['R2', 'R3']": 45, "['R3']": 46,
 '[]': 47}

In [7]:
def get_key(d, value):
    """Return a list of every key in d whose value equals value (empty if none match)."""
    return [k for k, v in d.items() if v == value]
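
`get_key` scans the whole dictionary on every call and returns a list. Since each class id in `conversion` maps to exactly one answer string, an inverted dictionary gives the same lookup directly. The sketch below uses a three-entry excerpt of `conversion`; the names `conversion_excerpt`, `id_to_answer`, and `lookup` are introduced here for illustration only.

```python
# A short excerpt of the conversion dictionary defined above.
conversion_excerpt = {"No": 1, "Yes": 2, "['0 [ft_i]']": 3}

# Invert answer -> class id into class id -> answer.
id_to_answer = {class_id: answer for answer, class_id in conversion_excerpt.items()}


def lookup(class_id):
    """Return the answer string for a class id, or a placeholder if it is unmapped."""
    # int() also accepts NumPy integer scalars such as the result of np.argmax.
    return id_to_answer.get(int(class_id), "<unknown label>")


print(lookup(2))  # Yes
print(lookup(0))  # <unknown label>
```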

### Checking some results:

Below is a sanity check to see what kind of results the model returns after training.

We have provided one incorrect and one correct result, but feel free to explore further examples using the same format.
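
To look beyond two hand-picked indices, the predictions can be scanned for every mismatch. The sketch below works on plain lists standing in for the logits and label ids returned by `trainer.predict`; the toy values and the helper names `argmax` and `misclassified` are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def misclassified(logits, label_ids):
    """Return (index, predicted_id, true_id) for each wrong prediction."""
    errors = []
    for index, (row, true_id) in enumerate(zip(logits, label_ids)):
        predicted_id = argmax(row)
        if predicted_id != true_id:
            errors.append((index, predicted_id, true_id))
    return errors


# Toy stand-ins: three examples, three classes.
toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.4, 0.9]]
toy_label_ids = [1, 2, 2]
print(misclassified(toy_logits, toy_label_ids))  # [(1, 0, 2)]
```

With the real objects, the same function could presumably be fed `prediction[0].tolist()` and `prediction[1].tolist()`, and the returned ids passed through the label dictionary to read the answers.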

In [8]:
prediction = trainer.predict(tokenized_data["test"])
results = (tokenized_data["test"][4]['text'], get_key(conversion, np.argmax(prediction[0][4], axis=-1)), 
          get_key(conversion, prediction[1][4]))
results

The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 107
  Batch size = 8


('What is the minimum lot size in the R2a zoning district?',
 ["['6000 [sft_i]']"],
 ["['10000 [sft_i]']"])

In [9]:
prediction = trainer.predict(tokenized_data["test"])
results = (tokenized_data["test"][2]['text'], get_key(conversion, np.argmax(prediction[0][2], axis=-1)), 
          get_key(conversion, prediction[1][2]))
results

The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 107
  Batch size = 8


('Are research or testing laboratories allowed in a FI2 zoning district?',
 ['Yes'],
 ['Yes'])