# RoBERTa SQuAD QA System on the Zoning Corpus Without Fine-Tuning

This notebook contains the code needed to experiment with a pretrained RoBERTa QA model that was trained on SQuAD data, without the benefit of fine-tuning on our annotated training data.

We expected the results to be poor, and testing confirmed this expectation.

To run this notebook, run each cell in order.

In [14]:
import os

# Allow duplicate OpenMP runtimes to load; avoids a crash on some setups.
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

import logging
import ipywidgets as widgets
from pprint import pprint

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

from haystack.nodes import FARMReader
from haystack.nodes import TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.utils import clean_wiki_text, convert_files_to_docs

## Creating the in-memory document store

Using deepset's Haystack library, we create an in-memory store of the documents from which answers should be retrieved. Haystack also provides web-based stores, but a local store was necessary at this stage of development.

In [30]:
document_store = InMemoryDocumentStore()

doc_dir = f"{os.getcwd()}/data/text"

docs = convert_files_to_docs(dir_path=doc_dir)
document_store.write_documents(docs)

INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_2.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_3.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_11.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_12.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_5.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_7.txt
INFO - haystack.utils.preprocessing -  Converting /data/user/home/jesusaur/cs662-qa-land-dev-law-sys/programs/data/text/text_1.txt
INFO 

Next, we create the document retriever, which tries to identify the documents most relevant to the question so that the "reader" object can use them in the QA system's answering phase.

In [31]:
retriever = TfidfRetriever(document_store=document_store)

INFO - haystack.nodes.retriever.sparse -  Found 13 candidate paragraphs from 13 docs in DB
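
As a rough illustration of how a TF-IDF retriever ranks documents, the sketch below scores a few made-up sentences against a query. It is a simplified pure-Python stand-in, not Haystack's `TfidfRetriever` implementation: the `tfidf_rank` helper, the smoothed IDF formula, and the sample texts are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_rank(query, docs):
    """Return document indices ordered by a simple TF-IDF score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed IDF (an assumption): rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical sample texts, not taken from the zoning corpus.
sample_docs = [
    "group care facilities are permitted in the residential district",
    "parking requirements for commercial lots",
    "setback rules for accessory buildings",
]
ranking = tfidf_rank("which districts allow group care facilities", sample_docs)
print(ranking[0])  # index of the best-matching document
```

A retriever of this kind matches on shared vocabulary only, so a question phrased differently from the ordinance text can still pull in the wrong documents.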


Next, we create the reader object, which parses the question, converts it to embeddings, and runs it through the pretrained transformer model.

In [32]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.model.language_model -   * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO - haystack.modeling.model.language_model -  Auto-detected model language: english
INFO - haystack.modeling.model.language_model -  Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1


A pipeline is required so that each question passes through the retriever first and then the reader.

In [33]:
pipe = ExtractiveQAPipeline(reader, retriever)
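
To make the ordering concrete, here is a minimal sketch of a two-stage flow in which the reader only ever sees what the retriever passes along. Everything in it is a hypothetical stand-in: `run_two_stage`, `toy_retrieve`, and `toy_read` are illustrative names, and the word-overlap scoring is far cruder than what Haystack's nodes do.

```python
from typing import Callable, List, Tuple


def overlap(query: str, text: str) -> int:
    """Count the distinct query words that also occur in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def toy_retrieve(query: str, docs: List[str], top_k: int) -> List[str]:
    """Stage 1 (stand-in): keep the top_k documents by word overlap."""
    ranked = sorted(docs, key=lambda d: overlap(query, d), reverse=True)
    return ranked[:top_k]


def toy_read(query: str, candidates: List[str]) -> Tuple[str, float]:
    """Stage 2 (stand-in): pick the best candidate and report a score."""
    best = max(candidates, key=lambda d: overlap(query, d))
    return best, float(overlap(query, best))


def run_two_stage(
    query: str,
    docs: List[str],
    retrieve: Callable[[str, List[str], int], List[str]],
    read: Callable[[str, List[str]], Tuple[str, float]],
    retriever_top_k: int = 20,
) -> Tuple[str, float]:
    """Run the retriever first, then hand only its output to the reader."""
    candidates = retrieve(query, docs, retriever_top_k)
    return read(query, candidates)


# Hypothetical sample texts, not taken from the zoning corpus.
toy_docs = [
    "parking requirements for commercial lots",
    "group care facilities are permitted in the residential district",
]
answer, score = run_two_stage(
    "group care facilities", toy_docs, toy_retrieve, toy_read, retriever_top_k=1
)
print(answer, score)
```

Because the reader works only on the retrieved candidates, a document the retriever drops can never contribute an answer, which is why the order of the two stages matters.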

We have added a text entry field here for testing the system without fine-tuning. After running this cell, enter a zoning-related question into the text field, then run the cells that follow to see the QA system's output.

In [25]:
# Example question: 'Which zoning districts allow group care facilities?'
txtsl = widgets.Text(
    placeholder='Enter your question.',
    description='Question:'
)
display(txtsl)

Text(value='', description='Question:', placeholder='Enter your question.')

In [38]:
prediction = pipe.run(
    query=txtsl.value, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 1}}
)

Inferencing Samples: 100%|██████████| 526/526 [03:28<00:00,  2.52 Batches/s]
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-19175, -19128) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-3134, -3109) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-3393, -3322) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-24641, -24600) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-5241, -5208) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-15063, -14986) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-9213, -9191) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-8147, -8124) with a span answer. 
ERROR - haystack.modeling.model.predictions -  Invalid end offset: 
(-28818, -28636) 

In [39]:
pprint(prediction['answers'])

[<Answer {'answer': 'to implement further the following pol', 'type': 'extractive', 'score': 0.909953773021698, 'context': 'the rural landscape. The intent of the district also is to implement further the following policies of the Comprehensive Plan: To further identify sce', 'offsets_in_document': [{'start': 14827, 'end': 14865}], 'offsets_in_context': [{'start': 56, 'end': 94}], 'document_id': 'ba462978e0dc0fba8a7a151b82d64e78', 'meta': {'name': 'text_8.txt'}}>]


### Answers are almost always incorrect

Without fine-tuning, we observed that the answers are almost always incorrect and often incoherent: they start or stop mid-word and come from sentences that have no relation to the question asked.
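
One way to spot such cut-off spans automatically is to check the characters on either side of the answer offsets. The sketch below is an assumption-laden helper, not part of Haystack: `span_boundary_report` is a hypothetical name, and it works on the plain `context` string and the `offsets_in_context` values copied from the answer printed above.

```python
def span_boundary_report(context, start, end):
    """Report whether context[start:end] begins or ends inside a word."""
    # A span starts mid-word if the characters on both sides of `start` are alphanumeric.
    starts_mid_word = (
        0 < start < len(context)
        and context[start - 1].isalnum()
        and context[start].isalnum()
    )
    # A span ends mid-word if the characters on both sides of `end` are alphanumeric.
    ends_mid_word = (
        0 < end < len(context)
        and context[end - 1].isalnum()
        and context[end].isalnum()
    )
    return {
        "text": context[start:end],
        "starts_mid_word": starts_mid_word,
        "ends_mid_word": ends_mid_word,
    }


# Context and offsets copied from the answer shown in the output above.
context = (
    "the rural landscape. The intent of the district also is to implement "
    "further the following policies of the Comprehensive Plan: To further identify sce"
)
report = span_boundary_report(context, 56, 94)
print(report)
```

For the answer above, the check flags the end of the span, which stops at "pol" partway through "policies".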