# Demo QA session

This notebook demonstrates a question-answering session in which each query is answered using an article from a source directory.

In [1]:
# imports
REPO_ROOT = "..\\"  # enter the path to the rapidreview-copilot repository

import sys
sys.path.append(REPO_ROOT)
import os
from rrc.text_extraction import PDFExtractor
from rrc.run_session import RapidReviewSession



## Question-answering

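This section demonstrates question answering using `RapidReviewSession`. As used in this notebook, the class is initialized with a source directory of articles (`src_dir`), a pair of retriever models (`ret_models`: a question encoder and a context encoder), a generative QA model (`qa_model`), the length settings `max_ans_length`, `min_context_size`, and `seq_length_buffer`, and a `use_gpu` flag. Its `run_query` method takes a prompt template, a query, and a `params` dictionary for the retriever, and returns a dictionary containing the generated answer along with context such as the prompt or the retrieved documents.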

In [2]:
# init RapidReviewSession
sess = RapidReviewSession(src_dir="./articles/", 
                          ret_models=("facebook/dpr-question_encoder-single-nq-base", "facebook/dpr-ctx_encoder-single-nq-base"),
                          qa_model="MBZUAI/LaMini-GPT-1.5B",
                          max_ans_length=100,
                          min_context_size=200,
                          seq_length_buffer=50,
                          use_gpu=True
                          ) 

Retriever MAX SEQ LENGTH: 1000000000000000019884624838656
QA model MAX SEQ LENGTH (Input limit): 512
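
The retriever's reported limit is not a real sequence length. The number equals `int(1e30)`, which Hugging Face tokenizers are known to report as `model_max_length` when no limit is configured; that this is the source of the value printed by `RapidReviewSession` is an assumption. The sketch below checks the number and shows one way to substitute a usable limit. The `effective_max_len` helper and its fallback of 512 are hypothetical, not part of `rrc`.

```python
# The "Retriever MAX SEQ LENGTH" value printed above, copied verbatim.
reported = 1000000000000000019884624838656

# int(1e30) is the float 1e30 converted to an exact integer.
is_sentinel = reported == int(1e30)


def effective_max_len(reported_len: int, fallback: int = 512) -> int:
    """Return a usable limit, treating the int(1e30) value as "unset".

    The fallback of 512 is an assumption; adjust it to the encoder in use.
    """
    return fallback if reported_len >= int(1e30) else reported_len


print(is_sentinel)
print(effective_max_len(reported))
```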


Question 1

In [5]:
# set the prompt template, query, and retriever parameters
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "Where was the study conducted?"
# Tsai et al. - 2019 - Data science for extubation prediction and value o
params = {
    "retriever": {"filters":{"article_id": "a5049cdf-cea7-413d-82d8-446e4cbc8f1d"}, "top_k" : 3}
}
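
To see roughly what the QA model receives, the template can be filled by hand. The sketch below is a plain-Python approximation, not the renderer that `RapidReviewSession` uses: the `render_prompt` helper, the space delimiter, and the two sample paragraphs are assumptions made for illustration, and the template omits the indentation whitespace carried by the notebook's triple-quoted string.

```python
# Template text from the notebook, without the line-break indentation.
TEMPLATE = (
    "Generate an answer with the information provided in the paragraphs. "
    "Do not repeat text\n\n Paragraphs: {join(documents)} \n\n"
    " Question: {query} \n\n Answer:"
)


def render_prompt(template: str, documents: list[str], query: str,
                  delimiter: str = " ") -> str:
    """Hypothetical helper: substitute both placeholders by simple replacement."""
    joined = delimiter.join(documents)
    return template.replace("{join(documents)}", joined).replace("{query}", query)


# Made-up paragraphs standing in for retrieved documents.
sample_docs = ["First retrieved paragraph.", "Second retrieved paragraph."]
rendered = render_prompt(TEMPLATE, sample_docs, "Where was the study conducted?")
print(rendered)
```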

In [6]:
sess.run_query(prompt, query, params)

100%|██████████| 14/14 [00:00<00:00, 101.76it/s]
Writing Documents: 10000it [00:00, 22364.44it/s]          
Documents Processed: 10000 docs [00:46, 216.63 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
The prompt has been truncated from 415 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'results': [' \nThe study was conducted at hospitals.'],
 'invocation_context': {'query': 'Where was the study conducted?',
  'documents': [<Document: {'content': '362 ) withawaiverofinformedconsent. inourempiricalstudy, thedataiscollected from intellivue clinical information portfolio ( icip ) including patient data and several electronic health records from october 2015 to september 2016. the imbalanced panel datasets including severaltablesregardingbiochemistry, arterialbloodgas ( abg ), bloodcell, glasgowcomascale ( gcs ), apache, extubation, etc. are collected from different information systems, in our case hospitals. thereare23var', 'content_type': 'text', 'score': 0.6365285945908407, 'meta': {'article_id': 'a5049cdf-cea7-413d-82d8-446e4cbc8f1d', 'extracted_text': "Journal of\nClinical Medicine\nArticle\nData Science for Extubation Prediction and Value of\nInformation in Surgical Intensive Care Unit\nTsung-LunTsai1,† ,Min-HsinHuang2,† ,Chia-YenLee1,* andWu-WeiLai2,3\n1 Institute
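
The truncation warning in the log above follows from simple arithmetic: the QA model accepts 512 tokens, 100 of which are reserved for the answer, leaving 412 for the prompt. A minimal sketch of that bookkeeping, using hypothetical function names that are not part of `rrc`:

```python
def prompt_budget(model_limit: int, max_answer_tokens: int) -> int:
    """Tokens left for the prompt once the answer length is reserved."""
    return model_limit - max_answer_tokens


def tokens_over_budget(prompt_tokens: int, budget: int) -> int:
    """Number of prompt tokens that would be cut off (0 if the prompt fits)."""
    return max(0, prompt_tokens - budget)


# Values taken from the log: 512-token limit, 100-token answer, 415-token prompt.
budget = prompt_budget(model_limit=512, max_answer_tokens=100)
excess = tokens_over_budget(prompt_tokens=415, budget=budget)
print(budget, excess)  # 412 3
```

Here the prompt exceeds the budget by only three tokens, so a slightly shorter instruction text would likely avoid the truncation.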

Question 2

In [5]:
# init RapidReviewSession
sess = RapidReviewSession("./articles/", 
                          ("facebook/dpr-question_encoder-single-nq-base", "facebook/dpr-ctx_encoder-single-nq-base"),
                          "MBZUAI/LaMini-GPT-1.5B",
                          100,
                          200,
                          70,
                          use_gpu = True) 

Retriever MAX SEQ LENGTH: 1000000000000000019884624838656
QA model MAX SEQ LENGTH (Input limit): 512


In [6]:
# set the prompt template, query, and retriever parameters
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "Where was the study conducted?"
# D'Andrea et al. - 2020 - A bronchial-airway gene-expression classifier to i
params = {
    "retriever": {"filters":{"article_id": "1cdb31a4-64cc-4d23-b662-924d12bf93ca"},"top_k" : 3},
}

In [7]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 139.12it/s]
Writing Documents: 10000it [00:00, 34835.89it/s]          
Documents Processed: 10000 docs [00:37, 268.24 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': ' The study was conducted in the U.S.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['d1900a2a4e9da9652c409b651d524dd0', '19c7627cc8d635d8825c511c7b1d8e75', '8e75a47b7a923d6b9662f3449de39ace'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: location - based strategy and 7. 18 % ( 95 % ci, the proportion of cancers that went undetected at the end 4. 13 – 11. 4 % ) with the simplified strategy. in the actual clinical of the first screening round was 4. 65 % ( 95 % ci, 3. 23 – 6. 62 % ) setting, thesepatientsarereferredtofollow - upandwillundergo the second ldct. however, in our model, we assumed that theyprogressed tostage iito penalize b ##follow - up, exclud - rules ing background mortality, were 3. 25 % ( 95 % ci, 1. 90 – 4. 94 % ) of following the current guidelines, 3. 27
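
The log line `Query filters are not implemented for the FAISSDocumentStore.` suggests that the `article_id` filter in `params` may not be applied during retrieval. One possible workaround is to check the retrieved documents after the fact. The sketch below uses plain dictionaries shaped like the `meta` field visible in the first output; `filter_by_article` is a hypothetical helper, and the paragraph contents and the second ID are made up.

```python
def filter_by_article(documents: list[dict], article_id: str) -> list[dict]:
    """Keep only documents whose meta['article_id'] matches the requested ID."""
    return [
        doc for doc in documents
        if doc.get("meta", {}).get("article_id") == article_id
    ]


# Stand-ins for retrieved documents (not real retriever output).
retrieved = [
    {"content": "paragraph A",
     "meta": {"article_id": "1cdb31a4-64cc-4d23-b662-924d12bf93ca"}},
    {"content": "paragraph B",
     "meta": {"article_id": "some-other-article"}},
]

kept = filter_by_article(retrieved, "1cdb31a4-64cc-4d23-b662-924d12bf93ca")
print([doc["content"] for doc in kept])  # ['paragraph A']
```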

Question that is completely out of context

In [9]:
# init RapidReviewSession
sess = RapidReviewSession("./articles/", 
                          ("facebook/dpr-question_encoder-single-nq-base", "facebook/dpr-ctx_encoder-single-nq-base"),
                          "MBZUAI/LaMini-GPT-1.5B",
                          100,
                          200,
                          70,
                          use_gpu = False) 

Retriever MAX SEQ LENGTH: 1000000000000000019884624838656
QA model MAX SEQ LENGTH (Input limit): 512


In [10]:
# set the prompt template, query, and retriever parameters
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "What is the prices for first bidding in January 2020?"

params = {
    "retriever": {"filters":{"article_id": "6786b903-b18e-45d4-b202-d957b6b70431"},"top_k" : 3},
}

In [11]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 95.39it/s]
Writing Documents: 10000it [00:00, 25937.31it/s]          
Documents Processed: 10000 docs [00:49, 200.80 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': ' The prices for first bidding in January 2020 are not provided in the given paragraph.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['9b372a9b95f086ed56e41b832bd4f7fe', 'e6221fd636514db33a19b914eee73408', '9ceb658ec6c6c57abafc7dda480c17d4'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: sesilevel. usedinschedulingliteraturewhichiscalledunitpenaltyofjobj. ( 2 ) the greeter desk has an exponentially distributed service unitpenaltycanbedefinedas : time. 8 ( 3 ) the patients ’ arrival process is a non - stationary poisson > < 1 ; if c j > d j process. u ¼ 0 ; otherwise ð3þ j > : 3. 4. input / output inotherwords, whenthetardinessisgreaterthan0, we timetobed _ esi5 1. 99 111. 44 128. 57 144. 04 137. 03 2. 40 114. 04 128. 50 133. 54 135. 56 2. 35 114. 92 125. 37 132. 23 141. 49