# RoBERTa SQuAD QA System Using a Zoning Corpus with Fine-Tuning

This notebook contains the code needed to experiment with a pretrained RoBERTa QA system that was trained on SQuAD data and then fine-tuned on our annotated training data.

We expected mediocre results, but testing indicated that the system shows promise.

Note that the annotated data was built from a custom zoning corpus preprocessed by our custom corpus builder, and that there was too little of it to split into train and test sets. This problem will resolve itself as we continue annotating the corpus. Despite this shortcoming, the system often provided the expected answers.

To run this notebook, simply run each cell in order.

In [1]:
import os
import sys
import json
import pandas as pd
from pathlib import Path
from pprint import pprint
import ipywidgets as widgets

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'  # workaround only: suppresses the duplicate OpenMP runtime error triggered when Haystack's dependencies load

sys.path.append('..') # for cheaha '..' is all that is needed here

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

from nlp.model import create_reader

from haystack.nodes import FARMReader
from haystack.nodes import TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.utils import clean_wiki_text, convert_files_to_docs

Print the current working directory to confirm that relative file paths do not break on non-local machines.

In [2]:
print(f'{os.getcwd()}')

/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs


Set the name of the annotated SQuAD 2.0 dataset on zoning ordinances, which resides in this library's data store, and create the directory where the fine-tuned reader is saved.

In [3]:
data_name = 'zo_squad'

model_dir = f'../readers/{data_name}'
Path(model_dir).mkdir(parents=True, exist_ok=True)
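
For readers unfamiliar with the format, the sketch below shows the general shape of a SQuAD 2.0 style file, along with a small check that each answer's `answer_start` offset lines up with its context. The sample passage, the question, and the helper name `find_misaligned_answers` are illustrative assumptions and are not taken from `zo_squad.json`.

```python
# Hypothetical one-paragraph example in SQuAD 2.0 layout (not from zo_squad.json).
context = "Indoor theaters are permitted in the C2 zone."
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "example_ordinance",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Which zone permits indoor theaters?",
                            "is_impossible": False,
                            "answers": [
                                {"text": "C2", "answer_start": context.index("C2")}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def find_misaligned_answers(squad):
    """Return the ids of questions whose answer text is not found at answer_start."""
    bad_ids = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            text = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if text[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


print(find_misaligned_answers(sample))
```

In the notebook, the same check could be run on the annotated file after loading it with `json.load`; an empty list would mean every annotated offset matches its passage.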

### Custom Reader Class:

Fine-tuning required a custom reader class that automates much of the process of fine-tuning the pretrained Haystack reader.

Training options not shown here include the sequence length and a larger dev split, which would provide an evaluation split in future implementations. With the limited training data available, however, the defaults and only 3 epochs proved sufficient for a proof of concept.
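
As a rough illustration of those extra options, the sketch below assembles keyword arguments for Haystack 1.x's `FARMReader.train`. The internals of `create_reader` are not reproduced here, so the helper name `build_train_kwargs`, the `data/zo_squad` path, and the values `dev_split=0.1` and `max_seq_len=384` are all assumptions chosen for illustration.

```python
def build_train_kwargs(data_dir, data_name, save_dir,
                       epochs=3, dev_split=0.0, max_seq_len=None):
    """Collect training options in one place (hypothetical helper)."""
    if not 0.0 <= dev_split < 1.0:
        raise ValueError("dev_split must be in the range [0, 1)")
    kwargs = {
        "data_dir": data_dir,
        "train_filename": f"{data_name}.json",
        "use_gpu": True,
        "n_epochs": epochs,
        "dev_split": dev_split,  # share of the training data held out for evaluation
        "save_dir": save_dir,
    }
    if max_seq_len is not None:
        kwargs["max_seq_len"] = max_seq_len  # otherwise keep the reader's default
    return kwargs


# Illustrative values only; the notebook itself uses dev_split=0.0 and 3 epochs.
kwargs = build_train_kwargs("data/zo_squad", "zo_squad", "../readers/zo_squad",
                            dev_split=0.1, max_seq_len=384)
print(kwargs)

# With Haystack 1.x installed, the call would look roughly like:
# from haystack.nodes import FARMReader
# reader = FARMReader("deepset/roberta-base-squad2", use_gpu=True)
# reader.train(**kwargs)
```

Keeping the options in a plain dictionary makes it easy to log exactly which settings produced a saved model once more annotated data allows a real dev split.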

To view the custom readers, tokenizers, and corpus builders, look in the /nlp/ folder of this library.

Additionally, we found that for training to run on a single GPU, the corpus data had to be split into chunks of 300 sentences or fewer.
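
The chunking idea can be sketched as follows. This is not the corpus builder from the /nlp/ folder; the regex-based sentence splitter and the function names are simplifying assumptions, and a naive splitter like this one will mishandle abbreviations such as "Sec. 5".

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, max_sentences=300):
    """Group sentences into consecutive chunks of at most max_sentences each."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    return [
        sentences[i:i + max_sentences]
        for i in range(0, len(sentences), max_sentences)
    ]


# Synthetic text with 650 short sentences, used only to demonstrate the split.
text = " ".join(f"Section {i} applies." for i in range(650))
chunks = chunk_sentences(split_sentences(text))
print([len(chunk) for chunk in chunks])  # [300, 300, 50]
```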

**Do not run this section if you have previously trained the reader. For expediency, we saved the model at this step, and it can be loaded into the reader object in the following cells.**

In [4]:
reader = create_reader(model_dir, data_name, dev_split=0.0, gpu=True, epochs=3)

INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1


/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/../nlp/model/data/question_answering/zo_squad/zo_squad.json
/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/../nlp/model/data/question_answering/zo_squad/.ipynb_checkpoints
/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/../nlp/model/data/question_answering/zo_squad/__init__.py
/data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/../nlp/model/data/question_answering/zo_squad/__pycache__


INFO - haystack.modeling.model.language_model -   * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO - haystack.modeling.model.language_model -  Auto-detected model language: english
INFO - haystack.modeling.model.language_model -  Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.data_handler.data_silo -  
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
INFO - haystack.modeling.data_handler.data_silo -  LOADING TRAIN DATA
INFO - haystack.modeling.data_handler.data_silo -  Loading train set from: /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/../nlp/model/data/question_answering/zo_squad/zo_squad.json 
Preprocessing dataset: 100%|██████████

## Creation of in-memory document store:

Using Deepset's Haystack library, we create an in-memory store of the documents from which answers should be retrieved. Haystack also provides web-based stores, but a local store was necessary at this stage of development.

In [5]:
document_store = InMemoryDocumentStore()

doc_dir = f"{os.getcwd()}/data/text"

docs = convert_files_to_docs(dir_path=doc_dir)
document_store.write_documents(docs)

INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_2.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_3.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_14.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_11.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_12.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_18.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_13.txt
IN

Create the document retriever object, which tries to identify the correct documents for the "reader" object to use in the QA system's answering phase.

In [6]:
retriever = TfidfRetriever(document_store=document_store)

INFO - haystack.nodes.retriever.sparse -  Found 78 candidate paragraphs from 19 docs in DB
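
To give a feel for what a TF-IDF retriever does before the reader sees any text, here is a minimal, dependency-free sketch that ranks paragraphs by cosine similarity of TF-IDF vectors. It is not Haystack's implementation; the whitespace tokenizer, the smoothed IDF formula, and the three sample paragraphs are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def rank_paragraphs(query, paragraphs, top_k=2):
    """Return (paragraph_index, score) pairs for the top_k best matches."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    # Document frequency: in how many paragraphs each term occurs.
    doc_freq = Counter(term for doc in docs for term in doc)

    def weights(counts):
        # Smoothed IDF so terms unseen in the corpus still get a finite weight.
        return {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }

    def cosine(a, b):
        dot = sum(w * b.get(term, 0.0) for term, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = weights(Counter(tokenize(query)))
    scored = [(cosine(query_vec, weights(doc)), i) for i, doc in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [(i, score) for score, i in scored[:top_k]]


# Made-up paragraphs, loosely styled after zoning text.
paragraphs = [
    "quarries and salt works are permitted in the FI division 3 zone",
    "residential zones allow single family dwellings",
    "amusement centers are permitted in the C3 zone",
]
for index, score in rank_paragraphs("which zone permits quarries", paragraphs):
    print(index, round(score, 3))
```

Only the highest-scoring paragraphs are handed to the reader, which is why a question whose wording shares no terms with the relevant paragraph can be missed by this kind of retriever.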


### Load the newly fine-tuned model:

In [7]:
reader = FARMReader(model_name_or_path=model_dir, use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.model.language_model -   * LOADING MODEL: '../readers/zo_squad' (Roberta)
INFO - haystack.modeling.model.language_model -  Loaded '../readers/zo_squad' (Roberta model) from local file system.
INFO - haystack.modeling.model.adaptive_model -  Found files for loading 1 prediction heads
INFO - haystack.modeling.model.prediction_head -  Loading prediction head from ../readers/zo_squad/prediction_head_0.bin
INFO - haystack.modeling.data_handler.processor -  Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1


In [8]:
pipe = ExtractiveQAPipeline(reader, retriever)

We have added a text entry field here for testing the fine-tuned system. After running this cell, enter a zoning-related question into the text field, then run the next cell to see the QA system output.

In [9]:
txtsl = widgets.Text(  # 'Which zoning districts allow group care facilities?'
    placeholder='Enter your question.',
    description='Question:'
)
display(txtsl)

Text(value='', description='Question:', placeholder='Enter your question.')

In [17]:
print(f'Question: {txtsl.value}')
prediction = pipe.run(
    query=txtsl.value, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 2}}
)

Question: Which zones permit indoor theaters?


Inferencing Samples: 100%|██████████| 7/7 [00:02<00:00,  2.95 Batches/s]


The following cells show the model's output for the question entered into the text field above. Note that each answer comes with a confidence score, and that we provide the top two answers because the second answer was sometimes correct when the first was not.
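
Because the raw answer objects are verbose, a compact view of rank, confidence score, answer text, and source file can make the top two answers easier to compare. The helper below is a sketch with a hypothetical name, `summarize_answers`; it assumes each answer exposes `answer`, `score`, and `meta` attributes, and the demo objects are stand-ins rather than real pipeline output.

```python
from types import SimpleNamespace


def summarize_answers(answers, max_chars=60):
    """Format each answer as 'rank. [score] text (source file)'."""
    rows = []
    for rank, ans in enumerate(answers, start=1):
        text = ans.answer or ""
        if len(text) > max_chars:
            text = text[:max_chars] + "..."  # shorten long extractive spans
        source = (ans.meta or {}).get("name", "unknown")
        rows.append(f"{rank}. [{ans.score:.3f}] {text} ({source})")
    return rows


# Stand-in objects for illustration; not actual output from the pipeline.
demo = [
    SimpleNamespace(answer="C3, which includes amusement centers",
                    score=0.888, meta={"name": "text_1.txt"}),
    SimpleNamespace(answer="RESIDENTIAL ZONES", score=0.843, meta={}),
]
for row in summarize_answers(demo):
    print(row)
```

In the notebook, the equivalent call would be `summarize_answers(prediction['answers'])`.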

In [18]:
pprint(prediction['answers'])

[<Answer {'answer': 'C3, which includes amusement centers including bowling alleys, golf driving ranges, miniature golf courses, ice skating rinks, pool and billiard halls; and C4, which includes major automotive repair, manufacturing and commercial centers', 'type': 'extractive', 'score': 0.8879691958427429, 'context': 'C3, which includes amusement centers including bowling alleys, golf driving ranges, miniature golf courses, ice skating rinks, pool and billiard halls; and C4, which includes major automotive repair, manufacturing and commercial centers', 'offsets_in_document': [{'start': 4763, 'end': 4999}], 'offsets_in_context': [{'start': 0, 'end': 236}], 'document_id': '51d9574bffbcd3dd446782323416b218', 'meta': {'name': 'text_1.txt'}}>,
 <Answer {'answer': 'RESIDENTIAL ZONES', 'type': 'extractive', 'score': 0.8432620763778687, 'context': 'NTERNATIONAL ZONING CODE”  CHAPTER 5 RESIDENTIAL ZONES SECTION 501 RESIDENTIAL ZONES DEFINED 501.1 Residential zone. Allowable residential (R) z

In [14]:
from haystack.utils import print_answers

print_answers(prediction, details="all")


Query: Which zones allow quarries?
Answers:
[   <Answer {'answer': 'Division 3. Any use permitted in the FI, Division 2 zone and auto-dismantling yards, alcohol manufacturing, cotton gins, paper manufacturing, quarries, salt works, petroleum refining, and other similar uses', 'type': 'extractive', 'score': 0.9963598847389221, 'context': 'Division 3. Any use permitted in the FI, Division 2 zone and auto-dismantling yards, alcohol manufacturing, cotton gins, paper manufacturing, quarries, salt works, petroleum refining, and other similar uses', 'offsets_in_document': [{'start': 6278, 'end': 6484}], 'offsets_in_context': [{'start': 0, 'end': 206}], 'document_id': '43f2a4f50fd4bc8b4b8b8aae557ee365', 'meta': {'name': 'text_8.txt'}}>,
    <Answer {'answer': 'C1, which includes minor automotive repair and automotive fuel dispensing facilities; C2, which includes light commercial and group care facilities; C3, which includes amusement centers including bowling alleys, golf driving ranges, min

### Promising results:

Despite the small amount of training data, the system appears able to answer zoning questions correctly even when they are worded differently from the explicit training questions. Although a SQuAD-style model is limited in its ability to answer yes-or-no questions, it sometimes returns context related to the question's keywords that is equivalent in effect to an affirmative response. On occasion, the system has also answered questions about the provided corpus that were not at all similar to the training data.

For these reasons, we believe this method shows promise and are considering how we might ensemble it with the KG versions of our QA system. We also believe the system can only improve with larger pretrained SQuAD models and a more complete annotation set for fine-tuning.