# Demo QA session

This notebook demonstrates a question-answering session in which a set of queries is answered using an article from a source directory.

In [1]:
# imports
REPO_ROOT = "..\\" # enter the path to the rapidreview-copilot repository

import sys
sys.path.append(REPO_ROOT)
import os
from rrc.text_extraction import PDFExtractor
from rrc.run_session import RapidReviewSession



## Text Extraction

This section demonstrates the use of the `PDFExtractor` class, which extracts text from PDFs in a source directory and stores the output in a destination directory specified by the user.

In [None]:
# run PDF extractor here
file_dir = "tutorials\\articles"

final_file_dir = os.path.join(REPO_ROOT, file_dir)

In [None]:
final_file_dir
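The cells above hard-code Windows-style separators (`"..\\"`, `"tutorials\\articles"`). A minimal sketch, assuming the same relative layout, of building the directory path with `pathlib` so it resolves on any operating system. The helper name `articles_dir` is hypothetical and not part of `rrc`.

```python
from pathlib import Path


def articles_dir(repo_root: str, *parts: str) -> Path:
    """Join path components using the separator of the current OS."""
    return Path(repo_root).joinpath(*parts)


# Same location as REPO_ROOT + "tutorials\\articles", written portably
final_dir = articles_dir("..", "tutorials", "articles")
print(final_dir.as_posix())
```

A `Path` can be converted with `str(final_dir)` if the consuming function expects a plain string.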

In [None]:
# Create an instance of PDFExtractor
pdf_extractor = PDFExtractor("D:\\Semester_3.2\\HSRC\\HSRC_RapidReview_SDK\\rapidreview_copilot\\tutorials\\articles")

# Call mass_extract function to extract text from PDFs
extracted_dir = pdf_extractor.mass_extract(extractor='pdfplumber', dest_dir=final_file_dir)

# Print the directory where the extracted JSON files are stored
print(f"Extracted JSON files are stored in: {extracted_dir}")

In [None]:
pdf_extractor.get_extracted()

## Question-answering

This section demonstrates question answering with `RapidReviewSession`. *Add more description here on what RapidReviewSession does, specifically inputs needed for the class and the `run_query` method of the class, as well as the outputs expected.*

In [2]:
# init RapidReviewSession
sess = RapidReviewSession("./articles/", 
                          ("facebook/dpr-question_encoder-single-nq-base", "facebook/dpr-ctx_encoder-single-nq-base"),
                          2,
                          "MBZUAI/LaMini-GPT-1.5B",
                          100,
                          200,
                          50) # update

Retriever MAX SEQ LENGTH: 1000000000000000019884624838656
QA model MAX SEQ LENGTH (Input limit): 512


Question 1

In [3]:
# update
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "Where was the study conducted?"
params = {
    "Retriever": {"filters":{"article_id": "c6cf4ec8-dbd8-4538-a084-fbfbd0671286"},"top_k" : 3},
}
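The `params` dictionary has the same shape for every query in this notebook; only the `article_id` changes. As a sketch, a small helper can build it so the nesting is written once. The name `make_retriever_params` is hypothetical and not part of `rrc`; the keys mirror the dictionary used in the cell above, and whether `run_query` accepts other keys is not shown here.

```python
from typing import Any, Dict


def make_retriever_params(article_id: str, top_k: int = 3) -> Dict[str, Any]:
    """Build the nested params dict used in this notebook for one article."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return {
        "Retriever": {
            "filters": {"article_id": article_id},
            "top_k": top_k,
        }
    }


# Reproduces the dictionary defined by hand above
example_params = make_retriever_params("c6cf4ec8-dbd8-4538-a084-fbfbd0671286")
print(example_params)
```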

In [4]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 1307.72it/s]
  return self.fget.__get__(instance, owner)()
Writing Documents: 10000it [00:00, 25950.40it/s]          
Documents Processed: 10000 docs [00:34, 289.06 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': ' The study was conducted in India.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['e6e751721460588f0d50aa869f754c15', 'a4f7f2c7a3cc4cf19744cfeb6d72abe7', '2fb91b54a96fbfc56d078c836218188d'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: . 72 – 8. 00 % ) egy, thetotalestimatedroppedto58. 9 ( 95 % ci, 46. 2 – 73. 2 % ). with the location - based strategy and 7. 18 % ( 95 % ci, the proportion of cancers that went undetected at the end 4. 13 – 11. 4 % ) with the simplified strategy. in the actual clinical of the first screening round was 4. 65 % ( 95 % ci, 3. 23 – 6. 62 % ) setting, thesepatientsarereferredtofollow - under the simplified for strategy. thus, overalldeathswithinthe2 - yearfollow - up, exclud - rules ing background mortality, were 3. 25 % ( 95 % ci, 1. 90 – 4. 94
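The log line `Query filters are not implemented for the FAISSDocumentStore.` suggests that the `article_id` filter passed in `params` may not be applied at retrieval time, so the retrieved passages could come from any article. One possible workaround is to filter the retrieved documents afterwards. The sketch below assumes each document is a plain dictionary with an `article_id` entry under `meta`; the sample documents and the helper `filter_by_article` are made up for illustration and are not part of `rrc`.

```python
from typing import Any, Dict, List


def filter_by_article(docs: List[Dict[str, Any]], article_id: str) -> List[Dict[str, Any]]:
    """Keep only documents whose metadata carries the requested article_id."""
    return [d for d in docs if d.get("meta", {}).get("article_id") == article_id]


# Hypothetical retrieved documents, used only for illustration
retrieved = [
    {"content": "passage A", "meta": {"article_id": "article-1"}},
    {"content": "passage B", "meta": {"article_id": "article-2"}},
    {"content": "passage C", "meta": {"article_id": "article-1"}},
]

kept = filter_by_article(retrieved, "article-1")
print([d["content"] for d in kept])
```

Post-filtering can leave fewer than `top_k` passages, so a larger `top_k` may be needed when it is used.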

Question 1

In [5]:
# update
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "Where was the study conducted?"
# Third article
params = {
    "Retriever": {"filters":{"article_id": "f938a720-b8f7-4032-aa04-018e4aa52542"},"top_k" : 3},
}

In [6]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 806.72it/s]
Writing Documents: 10000it [00:00, 20160.00it/s]          
Documents Processed: 10000 docs [00:39, 253.87 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
The prompt has been truncated from 415 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': '\nThe study was conducted at two different hospitals, one in the UK and the other in the US. The countries in which the hospitals are located were not specified.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['302e4d402e39329e02abfb000cfffe01', '244a045f460b324a2c3d33f66184fec8', '1b490570b990733178d988c927ed054c'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: 362 ) withawaiverofinformedconsent. inourempiricalstudy, thedataiscollected from intellivue clinical information portfolio ( icip ) including patient data and several electronic health records from october 2015 to september 2016. the imbalanced panel datasets including severaltablesregardingbiochemistry, arterialbloodgas ( abg ), bloodcell, glasgowcomascale ( gcs ), apache, extubation, etc. are collected from differ
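The truncation warning in the log above follows from simple arithmetic: the model accepts 512 tokens in total, 100 of which are reserved for the answer, leaving 412 for the prompt. A 415-token prompt therefore loses 3 tokens. A minimal sketch of that calculation, with hypothetical helper names that are not part of `rrc`:

```python
def prompt_token_budget(max_tokens: int, answer_tokens: int) -> int:
    """Tokens left for the prompt once the answer length is reserved."""
    if answer_tokens >= max_tokens:
        raise ValueError("answer_tokens must be smaller than max_tokens")
    return max_tokens - answer_tokens


def tokens_truncated(prompt_tokens: int, max_tokens: int, answer_tokens: int) -> int:
    """Number of prompt tokens that would be cut off (0 if the prompt fits)."""
    return max(0, prompt_tokens - prompt_token_budget(max_tokens, answer_tokens))


budget = prompt_token_budget(max_tokens=512, answer_tokens=100)
cut = tokens_truncated(prompt_tokens=415, max_tokens=512, answer_tokens=100)
print(budget, cut)
```

Lowering `top_k` or shortening the instruction text are two ways to keep the prompt within the budget.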

Question 2

In [7]:
# update
prompt = """Generate an answer with the information provided in the paragraphs. Do not repeat text
                             \n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:"""
query = "What are the academic disciplines of the authors?"
# Third article
params = {
    "Retriever": {"filters":{"article_id": "44b18615-ab6a-49fa-b086-8b28c0928331"},"top_k" : 3},
}

In [8]:
sess.run_query(prompt, query, params)

100%|██████████| 17/17 [00:00<00:00, 659.43it/s]
Writing Documents: 10000it [00:00, 17280.75it/s]          
Documents Processed: 10000 docs [00:40, 245.14 docs/s]         
Query filters are not implemented for the FAISSDocumentStore.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'W': <Answer {'answer': ' The academic disciplines of the authors are: \n    - Engineering: Unk, Adepartment of Industry and Manufacturing Engineering \n- Chemistry/Physics: [UNK] adepartmentofindustrialandmanufacturingengineering, thepennsylvaniastateundetails,. \n- Medicine (Thesis): williams, m. 2006. \n- Environmental Science and Engineering: [UNK] amobileemergencytriagedecisionsupportsystemeval', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['afed20ec41e1087e92aa5e23af8fd866', '5312990723d0caaccefe2933319af79f', 'fb3e729b9d4e0883007b66740637116e'], 'meta': {'prompt': 'Generate an answer with the information provided in the paragraphs. Do not repeat text\n                             \n\n Paragraphs: [CLS] expertsystemswithapplications40 ( 2013 ) 177 – 187 contentslistsavailableatsciversesciencedirect expert systems with applications journal homepage : www. elsevier. com / locate / eswa a simulation 