# Demo QA session

This notebook demonstrates a question-answering (QA) session in which a set of queries is answered using articles from a source directory.

In [1]:
# imports
REPO_ROOT = "..\\" # enter the path to the rapidreview-copilot repository

import sys
sys.path.append(REPO_ROOT)
import os
from rrc.text_extraction import PDFExtractor
from rrc.run_session import RapidReviewSession



## Text Extraction

This section demonstrates the use of the `PDFExtractor` class, which extracts text from PDFs in a source directory and stores the output in a destination directory specified by the user.

In [None]:
# build the path to the directory that holds the articles
file_dir = os.path.join("tutorials", "articles")

final_file_dir = os.path.join(REPO_ROOT, file_dir)

In [None]:
final_file_dir

In [None]:
# Create an instance of PDFExtractor for the source directory
pdf_extractor = PDFExtractor(final_file_dir)

# Call mass_extract function to extract text from PDFs
extracted_dir = pdf_extractor.mass_extract(extractor='pdfplumber', dest_dir=final_file_dir)

# Print the directory where the extracted JSON files are stored
print(f"Extracted JSON files are stored in: {extracted_dir}")

In [None]:
pdf_extractor.get_extracted()
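
To check what the extractor wrote, the destination directory can be listed with the standard library. This is a minimal sketch and not part of `rrc`: the helper name `list_json_files` is hypothetical, and it assumes only that the output consists of `.json` files, as the print statement above suggests.

```python
from pathlib import Path


def list_json_files(directory):
    """Return the sorted names of all .json files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.json"))
```

Assuming `extracted_dir` holds a directory path, `list_json_files(extracted_dir)` would return one file name per extracted JSON file.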

## Question-answering

This section demonstrates question answering with `RapidReviewSession`. *Add more description here on what RapidReviewSession does, specifically inputs needed for the class and the `run_query` method of the class, as well as the outputs expected.*

In [2]:
# init RapidReviewSession
sess = RapidReviewSession("./articles/", 
                          ("facebook/dpr-question_encoder-single-nq-base", "facebook/dpr-ctx_encoder-single-nq-base"),
                          "MBZUAI/LaMini-GPT-1.5B",
                          100,
                          200,
                          50) 

Retriever MAX SEQ LENGTH: 1000000000000000019884624838656
QA model MAX SEQ LENGTH (Input limit): 512


Question 1

In [3]:
# define the prompt template, the query, and the retriever parameters
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "Where was the study conducted?"
# Tsai et al. - 2019 - Data science for extubation prediction and value o
params = {
    "Retriever": {"filters":{"article_id": "a5049cdf-cea7-413d-82d8-446e4cbc8f1d"},"top_k" : 3},
}
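
The placeholders `{join(documents)}` and `{query}` in the prompt are filled in at query time with the retrieved paragraphs and the question. The sketch below imitates that substitution with plain string replacement to show roughly what the final prompt looks like. It is an illustration only, not the templating code the session uses: `fill_prompt` is a hypothetical helper, and the single-space delimiter is an assumption.

```python
def fill_prompt(template, documents, query, delimiter=" "):
    """Imitate the template substitution with plain string replacement."""
    # Assumption: retrieved paragraphs are joined with a single space.
    joined = delimiter.join(documents)
    return template.replace("{join(documents)}", joined).replace("{query}", query)


template = "Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"
filled = fill_prompt(
    template,
    ["First paragraph.", "Second paragraph."],
    "Where was the study conducted?",
)
print(filled)
```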

In [4]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 99.82it/s]
  return self.fget.__get__(instance, owner)()
Writing Documents: 10000it [00:00, 16279.14it/s]          
Documents Processed: 10000 docs [00:41, 241.11 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
The prompt has been truncated from 415 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': '\nThe study was conducted at different information systems from which 21 hospitals were chosen for sampling purposes.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['302e4d402e39329e02abfb000cfffe01', '244a045f460b324a2c3d33f66184fec8', '1b490570b990733178d988c927ed054c'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: 362 ) withawaiverofinformedconsent. inourempiricalstudy, thedataiscollected from intellivue clinical information portfolio ( icip ) including patient data and several electronic health records from october 2015 to september 2016. the imbalanced panel datasets including severaltablesregardingbiochemistry, arterialbloodgas ( abg ), bloodcell, glasgowcomascale ( gcs ), apache, extubation, etc. are collected from different information systems, in our case hospita
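
The truncation warning in the log above follows from a simple budget: the prompt and the generated answer share the model's input limit. A minimal sketch of that arithmetic, using the numbers reported in the log (a 512-token limit and a 100-token answer length); the function name is hypothetical.

```python
def max_prompt_tokens(model_limit, answer_length):
    """Tokens left for the prompt once the answer length is reserved."""
    return model_limit - answer_length


budget = max_prompt_tokens(512, 100)
print(budget)        # 412
print(415 > budget)  # True: the 415-token prompt exceeds the budget and is truncated
```

Lowering `top_k` or shortening the instruction text are two ways to keep the prompt within this budget.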

Question 1, asked of a second article

In [5]:
# define the prompt template, the query, and the retriever parameters
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "Where was the study conducted?"
# D'Andrea et al. - 2020 - A bronchial-airway gene-expression classifier to i
params = {
    "Retriever": {"filters":{"article_id": "1cdb31a4-64cc-4d23-b662-924d12bf93ca"},"top_k" : 3},
}

In [6]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 112.02it/s]
Writing Documents: 10000it [00:00, 10334.14it/s]          
Documents Processed: 10000 docs [00:41, 239.07 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': ' The study was conducted in various locations including Greece and Brazil.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['e6e751721460588f0d50aa869f754c15', 'a4f7f2c7a3cc4cf19744cfeb6d72abe7', '2fb91b54a96fbfc56d078c836218188d'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: . 72 – 8. 00 % ) egy, thetotalestimatedroppedto58. 9 ( 95 % ci, 46. 2 – 73. 2 % ). with the location - based strategy and 7. 18 % ( 95 % ci, the proportion of cancers that went undetected at the end 4. 13 – 11. 4 % ) with the simplified strategy. in the actual clinical of the first screening round was 4. 65 % ( 95 % ci, 3. 23 – 6. 62 % ) setting, thesepatientsarereferredtofollow - under the simplified for strategy. thus, overalldeathswithinthe2 - yearfollow - up, exclud - rules ing background mortalit
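
The log line `Query filters are not implemented for the FAISSDocumentStore.` suggests that the `article_id` filter passed in `params` may not have been applied, so the retrieved paragraphs are not guaranteed to come from the intended article. One possible workaround is to filter the retrieved documents by their metadata after retrieval. The sketch below shows the idea on plain dictionaries. The `meta`/`article_id` layout and the helper name are assumptions for illustration, not the structures that `RapidReviewSession` returns.

```python
def filter_by_article(documents, article_id):
    """Keep only documents whose metadata carries the given article_id."""
    return [d for d in documents if d.get("meta", {}).get("article_id") == article_id]


# Hypothetical retrieved documents, represented as plain dictionaries.
retrieved = [
    {"content": "paragraph A", "meta": {"article_id": "article-1"}},
    {"content": "paragraph B", "meta": {"article_id": "article-2"}},
    {"content": "paragraph C", "meta": {"article_id": "article-1"}},
]
kept = filter_by_article(retrieved, "article-1")
print([d["content"] for d in kept])  # ['paragraph A', 'paragraph C']
```

Post-filtering can leave fewer than `top_k` paragraphs, so in practice more candidates would need to be retrieved before filtering.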

Question that is completely out of context

In [7]:
# define the prompt template, the query, and the retriever parameters
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "What is the prices for first bidding in January 2020?"
# Ashour and Kremer - 2013 - A simulation analysis of the impact of FAHP-MAUT t
params = {
    "Retriever": {"filters":{"article_id": "6786b903-b18e-45d4-b202-d957b6b70431"},"top_k" : 3},
}

In [8]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 102.13it/s]
Writing Documents: 10000it [00:00, 12896.57it/s]          
Documents Processed: 10000 docs [00:41, 241.76 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
The prompt has been truncated from 421 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': ' and second choice plans at the end of the simulated training?\nThe pricing data for first and second choice plans at the end of the simi sple is not provided in the given paragraphs so I am unable to answer that question.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['faf69c5267df3064af2139fec95e43da', 'dffbca05fd55caa1a094cbca3c4b2b9f', 'd1b78a640dca1078d000baa4a49c434d'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: < 1 ; if c j > d j process. u ¼ 0 ; otherwise ð3þ j > : 3. 4. input / output inotherwords, whenthetardinessisgreaterthan0, weassign1. otherwise, weassign0. inourcase, weareusingthesameanalogy. thesystemvariablesusedinthesimulationmodelincludein - the canadian triage and acuity scale ( ctas ) is used to setup ter - arrivaltimes, treatmenttimes, delaytimes (