# PubMed QA using LlamaIndex

## Introduction
This notebook presents a RAG workflow for the [PubMed QA](https://pubmedqa.github.io/) task using [LlamaIndex](https://www.llamaindex.ai/). The code is written in a configurable fashion, giving you the flexibility to edit the RAG configuration and observe the change in output/responses.

It covers a step-by-step procedure for building the RAG workflow (Stages 1-4) and later runs the pipeline on a sample from the dataset. The notebook also covers sparse, dense, and hybrid retrieval strategies along with the re-ranker. We have also added an optional component for RAG evaluation using the [Ragas](https://docs.ragas.io/en/stable/) library.

### <u>Requirements</u>
1. As you will be accessing the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:

      Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the `.env` configurations, make sure to create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following env variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
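
To check that the file was written as intended, the `export KEY=VALUE` lines can be parsed back in Python. The following is a minimal sketch, assuming the file only contains simple assignments like the ones above; `load_env_file` is a hypothetical helper (not part of the notebook's utilities), and the demo uses a temporary file with a placeholder key instead of the real `~/.kscope.env`.

```python
import shlex
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse `export KEY=VALUE` lines and return them as a dict (hypothetical helper)."""
    env = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines without an assignment
        # shlex.split removes surrounding quotes the way a shell would
        parts = shlex.split(value)
        env[key.strip()] = parts[0] if parts else ""
    return env


# Demo: a temporary file stands in for ~/.kscope.env; the key is a placeholder
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".kscope.env"
    env_path.write_text(
        'export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"\n'
        "export OPENAI_API_KEY=placeholder-key\n"
    )
    kscope_env = load_env_file(env_path)

print(sorted(kscope_env))
```

Passing the resulting dict to `os.environ.update(...)` would make the values visible to libraries that read `OPENAI_BASE_URL` and `OPENAI_API_KEY` from the environment.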

## STAGE 0 - Set up the RAG workflow environment

#### Import libraries, custom classes and functions

In [1]:
%pip install --quiet \
  llama-index \
  google-cloud-secret-manager \
  datasets \
  llama-index-readers-json \
  llama-index-readers-file \
  llama-index-readers-gcs \
  llama-index-embeddings-vertex \
  llama-index-embeddings-google-genai \
  llama-index-embeddings-huggingface \
  llama-index-embeddings-text-embeddings-inference \
  llama-index-embeddings-vertex-endpoint \
  llama-index-llms-huggingface \
  llama-index-llms-openai-like \
  llama-index-llms-vertex \
  llama-index-llms-google-genai \
  faiss-cpu \
  llama-index-vector-stores-faiss \
  llama-index-vector-stores-weaviate \
  llama-index-vector-stores-vertexaivectorsearch \
  llama-index-retrievers-bm25 \
  rapidfuzz \
  ragas \
  "pydantic>=2.10.4" \
  "google-cloud-aiplatform>=1.76" \
  langchain-core \
  langchain-cohere \
  langchain-huggingface \
  langchain-google-vertexai
# jkwng: restart the kernel after this

In [88]:
import warnings
warnings.filterwarnings('ignore')

In [89]:
import sys
import os
import random

from pathlib import Path
from pprint import pprint

from llama_index.core import ServiceContext, Settings, set_global_handler
from llama_index.core.node_parser import SentenceSplitter


# jkwng: in order to make this notebook self contained, i just cut and paste these into the notebook
# from task_dataset import PubMedQATaskDataset

# from utils.hosting_utils import RAGLLM
# from utils.rag_utils import (
#     DocumentReader, RAGEmbedding, RAGQueryEngine, RagasEval,
#     extract_yes_no, validate_rag_cfg
#     )
# from utils.storage_utils import RAGIndex

#### Load config files

*jkwng: however we need these variables specifically for deployment on Google Cloud*

In [90]:
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
REGION = os.environ.get("GOOGLE_CLOUD_REGION")
GCS_URI = "jkwng-vertex-experiments/rag_bootcamp/pubmed_qa"

#### Set RAG configuration

Below: using `bge-base-en-v1.5` for embeddings and Llama 3.1 8B Instruct for generation, both hosted on Vertex Endpoints, and Gemini 2.0 Flash for LLM-based evals.

In [138]:
rag_cfg = {
    # Node parser config
    "chunk_size": 256,
    "chunk_overlap": 0,

    # Embedding model config
    # "embed_model_type": "hf",
    # "embed_model_name": "BAAI/bge-base-en-v1.5",
    "embed_model_type": "vertex-endpoint",
    "embed_model_name": "BAAI/bge-base-en-v1.5",
    "embed_model_endpoint_id": "83814671873736704", # endpoint id
    "embed_model_use_dedicated_endpoint": True,
    "embed_model_dedicated_dns": "83814671873736704.us-central1-205512073711.prediction.vertexai.goog",

    # LLM config
    # "llm_type": "kscope",
    # "llm_name": "Meta-Llama-3.1-8B-Instruct",
    "llm_type": "vertex-endpoint",
    "llm_name": "meta-llama/Llama-3.1-8B-Instruct",
    "llm_endpoint_id": "133354267774812160",
    "llm_use_dedicated_endpoint": True,
    "llm_dedicated_dns": "133354267774812160.us-central1-205512073711.prediction.vertexai.goog",
    "max_new_tokens": 256,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": 50,
    "do_sample": False,

    # Vector DB config
    "vector_db_type": "weaviate", # "weaviate"
    #"vector_db_type": "vertex",
    "vector_db_name": "Pubmed_QA",
    # MODIFY THIS
    "weaviate_url": "https://ds4tx7ttr3ciaui5obmowg.c0.us-east1.gcp.weaviate.cloud",

    # Retriever and query config
    "retriever_type": "vector_index", # "vector_index"
    "retriever_similarity_top_k": 5,
    "query_mode": "default", # "default", "hybrid" - jkwng: changed to default
    "hybrid_search_alpha": 0.0, # float from 0.0 (sparse search - bm25) to 1.0 (vector search)
    "response_mode": "compact",
    "use_reranker": False,
    "rerank_top_k": 3,

    # Evaluation config
    # "eval_llm_type": "kscope",
    # "eval_llm_name": "Meta-Llama-3.1-8B-Instruct",
    "eval_llm_type": "vertex",
    "eval_llm_name": "gemini-2.0-flash-001"
}

#### Read Weaviate Key

*jkwng: load weaviate API key from secret manager*

In [139]:
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()

# Access the secret
name = f"projects/{PROJECT_ID}/secrets/weaviate_key/versions/latest"
response = client.access_secret_version(request={"name": name})

# Extract the secret value
weaviate_key = response.payload.data.decode("UTF-8")
WEAVIATE_API_KEY = weaviate_key

# try:
#     f = open(Path.home() / ".weaviate.key", "r")
#     f.close()
# except Exception as err:
#     print(f"Could not read your Weaviate key. Please make sure this is available in plain text under your home directory in ~/.weaviate.key: {err}")
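
Outside Google Cloud, the commented-out code above suggests keeping the key in plain text at `~/.weaviate.key`. A minimal sketch of a lookup with a fallback is shown below; `read_api_key` and the `WEAVIATE_API_KEY` environment variable name are assumptions made for illustration, not something the notebook or Weaviate defines.

```python
import os
import tempfile
from pathlib import Path


def read_api_key(env_var="WEAVIATE_API_KEY", key_file=Path.home() / ".weaviate.key"):
    """Return the key from the environment if set, otherwise from a plain-text file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    key_path = Path(key_file)
    if key_path.is_file():
        return key_path.read_text().strip()
    raise FileNotFoundError(f"No key found in ${env_var} or in {key_path}")


# Demo: a temporary file stands in for ~/.weaviate.key; the key is a placeholder
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / ".weaviate.key"
    demo_file.write_text("demo-key\n")
    demo_key = read_api_key(env_var="UNSET_DEMO_VARIABLE", key_file=demo_file)
```

Raising an error when neither source exists surfaces a missing key at setup time rather than later, when the Weaviate client tries to connect.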

#### Preliminary config checks

In [140]:
#@title *jkwng: validate_rag_cfg from utils.rag_utils.py*
def validate_rag_cfg(cfg):
    if cfg["query_mode"] == "hybrid":
        assert (
            cfg["hybrid_search_alpha"] is not None
        ), "hybrid_search_alpha cannot be None if query_mode is set to 'hybrid'"
    if cfg["vector_db_type"] == "weaviate":
        assert (
            cfg["weaviate_url"] is not None
        ), "weaviate_url cannot be None for weaviate vector db"
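
`validate_rag_cfg` only checks two fields. A few more sanity checks on the same dictionary can be sketched as follows. These rules are assumptions chosen for illustration (for example, treating an overlap equal to the chunk size as a mistake) and `check_rag_cfg` is a hypothetical helper, not part of `utils.rag_utils`.

```python
def check_rag_cfg(cfg):
    """Return a list of human-readable problems found in a RAG config dict."""
    problems = []
    # An overlap as large as the chunk itself leaves no room for new text
    if cfg["chunk_overlap"] >= cfg["chunk_size"]:
        problems.append("chunk_overlap should be smaller than chunk_size")
    # The config comment describes alpha as a float from 0.0 (sparse) to 1.0 (dense)
    if not 0.0 <= cfg["hybrid_search_alpha"] <= 1.0:
        problems.append("hybrid_search_alpha should lie in [0.0, 1.0]")
    # The reranker can only keep as many nodes as the retriever returned
    if cfg["use_reranker"] and cfg["rerank_top_k"] > cfg["retriever_similarity_top_k"]:
        problems.append("rerank_top_k should not exceed retriever_similarity_top_k")
    return problems


good_cfg = {
    "chunk_size": 256,
    "chunk_overlap": 0,
    "hybrid_search_alpha": 0.0,
    "use_reranker": False,
    "rerank_top_k": 3,
    "retriever_similarity_top_k": 5,
}
bad_cfg = dict(good_cfg, chunk_overlap=256, hybrid_search_alpha=1.5)

print(check_rag_cfg(good_cfg))
print(check_rag_cfg(bad_cfg))
```

Returning a list instead of asserting lets all problems be reported at once, which is convenient when several settings are edited together.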

In [141]:
validate_rag_cfg(rag_cfg)
pprint(rag_cfg)

{'chunk_overlap': 0,
 'chunk_size': 256,
 'do_sample': False,
 'embed_model_dedicated_dns': '83814671873736704.us-central1-205512073711.prediction.vertexai.goog',
 'embed_model_endpoint_id': '83814671873736704',
 'embed_model_name': 'BAAI/bge-base-en-v1.5',
 'embed_model_type': 'vertex-endpoint',
 'embed_model_use_dedicated_endpoint': True,
 'eval_llm_name': 'gemini-2.0-flash-001',
 'eval_llm_type': 'vertex',
 'hybrid_search_alpha': 0.0,
 'llm_dedicated_dns': '133354267774812160.us-central1-205512073711.prediction.vertexai.goog',
 'llm_endpoint_id': '133354267774812160',
 'llm_name': 'meta-llama/Llama-3.1-8B-Instruct',
 'llm_type': 'vertex-endpoint',
 'llm_use_dedicated_endpoint': True,
 'max_new_tokens': 256,
 'query_mode': 'default',
 'rerank_top_k': 3,
 'response_mode': 'compact',
 'retriever_similarity_top_k': 5,
 'retriever_type': 'vector_index',
 'temperature': 0.0,
 'top_k': 50,
 'top_p': 1.0,
 'use_reranker': False,
 'vector_db_name': 'Pubmed_QA',
 'vector_db_type': 'weaviate',

## STAGE 1 - Load dataset and documents

#### 1. Load PubMed QA dataset
PubMedQA ([GitHub](https://github.com/pubmedqa/pubmedqa)) is a biomedical question answering dataset. Each instance consists of a question, a context (extracted from PubMed abstracts), a long answer and a yes/no/maybe answer. This notebook loads the `train` split of all ten labeled folds of [this](https://huggingface.co/datasets/bigbio/pubmed_qa) Hugging Face dataset.

**The context for each instance is stored as a text file** (referred to as documents), to align the task as a standard RAG use-case.
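
The idea of turning contexts into documents can be sketched in a few lines before looking at the full dataset class. This is a simplified illustration with made-up placeholder records (only `id` and `context` fields), and `write_contexts` is a hypothetical helper; the real dataset items and file handling live in `PubMedQATaskDataset`.

```python
import tempfile
from pathlib import Path

# Placeholder records shaped like the dataset items used here (id + context)
records = [
    {"id": "doc_a", "context": "First placeholder context."},
    {"id": "doc_b", "context": "Second placeholder context."},
]


def write_contexts(records, output_dir):
    """Write each record's context to <output_dir>/pubmed_doc/<id>.txt."""
    doc_dir = Path(output_dir) / "pubmed_doc"
    doc_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for record in records:
        path = doc_dir / f"{record['id']}.txt"
        path.write_text(record["context"])
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_contexts(records, tmp_dir)
    file_names = sorted(p.name for p in written)
    first_text = written[0].read_text()

print(file_names)
```

Naming each file after the record id keeps a direct link between a retrieved document and the question it belongs to.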

In [142]:
#@title *jkwng: task_dataset.py*
import os
import json
import torch.utils.data as data
from tqdm import tqdm
from datasets import load_dataset, concatenate_datasets

class PubMedQATaskDataset(data.Dataset):
    def __init__(self, name, all_folds=False, split="test"):
        self.name = name
        subset_str = "pubmed_qa_labeled_fold{fold_id}"
        folds = [0] if not all_folds else list(range(10))

        bigbio_data = []
        source_data = []
        for fold_id in folds:
            bb_data = load_dataset(
                self.name,
                f"{subset_str.format(fold_id=fold_id)}_bigbio_qa",
                split=split,
                trust_remote_code=True,
            )
            s_data = load_dataset(
                self.name,
                f"{subset_str.format(fold_id=fold_id)}_source",
                split=split,
                trust_remote_code=True,
            )
            bigbio_data.append(bb_data)
            source_data.append(s_data)
        bigbio_data = concatenate_datasets(bigbio_data)
        source_data = concatenate_datasets(source_data)

        keys_to_keep = ["id", "question", "context", "answer", "LONG_ANSWER"]
        data_elms = []
        for elm_idx in tqdm(range(len(bigbio_data)), desc="Preparing data"):
            data_elms.append({k: bigbio_data[elm_idx][k] for k in keys_to_keep[:4]})
            data_elms[-1].update(
                {keys_to_keep[-1].lower(): source_data[elm_idx][keys_to_keep[-1]]}
            )

        self.data = data_elms

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

    def mock_knowledge_base(
        self,
        output_dir,
        one_file_per_sample=False,
        samples_per_file=500,
        sep="\n",
        jsonl=False,
    ):
        """
        Write PubMed contexts to a text file, newline separated
        """
        pubmed_kb_dir = os.path.join(output_dir, "pubmed_doc")
        os.makedirs(pubmed_kb_dir, exist_ok=True)

        file_ext = "jsonl" if jsonl else "txt"

        if not one_file_per_sample:
            # Group samples into files holding at most `samples_per_file` contexts each
            context_files = []
            context_elms = []
            for idx in range(len(self.data)):
                if jsonl:
                    context_elm_str = json.dumps(
                        {
                            "id": self.data[idx]["id"],
                            "context": self.data[idx]["context"],
                        }
                    )
                else:
                    context_elm_str = self.data[idx]["context"]
                context_elms.append(context_elm_str)
                # Close the current file once it is full and start a new one
                if len(context_elms) == samples_per_file:
                    context_files.append(sep.join(context_elms))
                    context_elms = []
            # Flush the final, partially filled file
            if context_elms:
                context_files.append(sep.join(context_elms))

            for file_idx in range(len(context_files)):
                filepath = os.path.join(pubmed_kb_dir, f"context{file_idx}.{file_ext}")
                with open(filepath, "w") as f:
                    f.write(context_files[file_idx])

        else:
            assert not jsonl, "Does not support jsonl if one_file_per_sample is True"
            for idx in range(len(self.data)):
                filepath = os.path.join(
                    pubmed_kb_dir, f'{self.data[idx]["id"]}.{file_ext}'
                )
                with open(filepath, "w") as f:
                    f.write(self.data[idx]["context"])

In [143]:
print('Loading PubMed QA data ...')
pubmed_data = PubMedQATaskDataset('bigbio/pubmed_qa', all_folds=True, split='train')
print(f"Loaded data size: {len(pubmed_data)}")
pubmed_data.mock_knowledge_base(output_dir='./data', one_file_per_sample=True)

Loading PubMed QA data ...


Preparing data: 100%|██████████| 4500/4500 [00:03<00:00, 1414.56it/s]


Loaded data size: 4500


*jkwng: TODO: write knowledge base to GCS - to simulate loading knowledge base from object storage*

#### 2. Load documents
All metadata is excluded by default. Set the *exclude_llm_metadata_keys* and *exclude_embed_metadata_keys* flags to *False* to include it. Please refer to [this](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents.html) and the *DocumentReader* class from *rag_utils.py* for further details.

In [144]:
#@title *jkwng: DocumentReader from utils.rag_utils.py*

from llama_index.core import (
    SimpleDirectoryReader
)

from llama_index.readers.json import JSONReader

class DocumentReader:
    def __init__(
        self,
        input_dir,
        exclude_llm_metadata_keys=True,
        exclude_embed_metadata_keys=True,
    ):
        self.input_dir = input_dir
        self._file_ext = os.path.splitext(os.listdir(input_dir)[0])[1]

        self.exclude_llm_metadata_keys = exclude_llm_metadata_keys
        self.exclude_embed_metadata_keys = exclude_embed_metadata_keys

    def load_data(self):
        docs = None
        # Use reader based on file extension of documents
        # Supports '.txt' and '.jsonl' files as of now
        if self._file_ext == ".txt":
            reader = SimpleDirectoryReader(input_dir=self.input_dir)
            docs = reader.load_data()
        elif self._file_ext == ".jsonl":
            reader = JSONReader()
            docs = []
            for file in os.listdir(self.input_dir):
                docs.extend(
                    reader.load_data(os.path.join(self.input_dir, file), is_jsonl=True)
                )
        else:
            raise NotImplementedError(
                f"Does not support {self._file_ext} file extension for document files."
            )

        # Can choose if metadata need to be included as input when passing the doc to LLM or embeddings:
        # https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents.html
        # Exclude metadata keys from embeddings or LLMs based on flag
        if docs is not None:
            all_metadata_keys = list(docs[0].metadata.keys())
            if self.exclude_llm_metadata_keys:
                for doc in docs:
                    doc.excluded_llm_metadata_keys = all_metadata_keys
            if self.exclude_embed_metadata_keys:
                for doc in docs:
                    doc.excluded_embed_metadata_keys = all_metadata_keys

        return docs

In [145]:
print('Loading documents ...')
reader = DocumentReader(input_dir="./data/pubmed_doc")
docs = reader.load_data()
print(f'No. of documents loaded: {len(docs)}')

Loading documents ...
No. of documents loaded: 1000


*jkwng: TODO: load the knowledge base from GCS*

## STAGE 2 - Load node parser, embedding, LLM and set service context

#### 1. Load node parser to split documents into smaller chunks

In [146]:
print('Loading node parser ...')
node_parser = SentenceSplitter(chunk_size=rag_cfg['chunk_size'], chunk_overlap=rag_cfg['chunk_overlap'])
nodes = node_parser.get_nodes_from_documents(docs)

Loading node parser ...
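
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sliding-window sketch over whitespace-separated words. It is a simplification: `SentenceSplitter` counts tokens with a tokenizer and prefers to break at sentence boundaries, which this toy `chunk_tokens` function (a hypothetical helper) does not attempt.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "a b c d e f g h i j".split()
no_overlap = chunk_tokens(words, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_tokens(words, chunk_size=4, chunk_overlap=2)

print(no_overlap)
print(with_overlap)
```

With an overlap of 0, as in `rag_cfg`, each word lands in exactly one chunk; a positive overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary at the cost of a larger index.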


#### 2. Load embedding model
LlamaIndex supports embedding models from OpenAI, Cohere, HuggingFace, etc. Please refer to [this](https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#custom-embedding-model) for building a custom embedding model.

In [147]:
#@title *jkwng: RAGEmbedding from utils.rag_utils.py - update to support using Vertex AI Gemini Embeddings models*
from llama_index.embeddings.vertex_endpoint import VertexEndpointEmbedding
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.text_embeddings_inference import TextEmbeddingsInference
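import google.auth
import google.auth.transport.requests

# Refresh the default credentials so a valid access token is available
credentials, project_id = google.auth.default()
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)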

class RAGEmbedding:
    """
    LlamaIndex supports embedding models from OpenAI, Cohere, HuggingFace, etc.
    https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
    We can also build out custom embedding model:
    https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#custom-embedding-model
    """

    def __init__(self, model_type, model_name):
        self.model_type = model_type
        self.model_name = model_name

    def load_model(self, **kwargs):
        print(f"Loading {self.model_type} embedding model ...")
        if self.model_type == "hf":
            # Using bge base HuggingFace embeddings, can choose others based on leaderboard:
            # https://huggingface.co/spaces/mteb/leaderboard
            model = HuggingFaceEmbedding(
                model_name=self.model_name,
                device="cuda",
                trust_remote_code=True,
            )  # max_length does not have any effect?
        elif self.model_type == "vertex":
            model = GoogleGenAIEmbedding(
                model_name=self.model_name,
                vertexai_config={
                  "project": PROJECT_ID,
                  "location": REGION,
                },
                embed_batch_size=100,
            )
        elif self.model_type == "vertex-endpoint":
            model = VertexEndpointEmbedding(
                endpoint_id=kwargs["embed_model_endpoint_id"],
                project_id=PROJECT_ID,
                location=REGION,
                endpoint_kwargs={
                    "use_dedicated_endpoint": kwargs["embed_model_use_dedicated_endpoint"],
                },
            )  # max_length does not have any effect?
        elif self.model_type == "openai":
            # TODO - Add OpenAI embedding model
            # embed_model = OpenAIEmbedding()
            raise NotImplementedError

        return model

In [148]:
embed_model = RAGEmbedding(model_type=rag_cfg['embed_model_type'], model_name=rag_cfg['embed_model_name']).load_model(**rag_cfg)

Loading vertex-endpoint embedding model ...
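
Once chunks are embedded, dense retrieval amounts to ranking them by vector similarity to the query embedding. The sketch below uses hand-written three-dimensional vectors and cosine similarity purely for illustration; real `bge-base-en-v1.5` embeddings are much larger, and the similarity measure actually used depends on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]
ranked = top_k(query_vec, doc_vecs, k=2)

print([doc_id for doc_id, _ in ranked])
```

The `k` argument plays the same role as `retriever_similarity_top_k` in `rag_cfg`.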


#### 3. Load LLM for generation
LlamaIndex supports LLMs from OpenAI, Cohere, HuggingFace, AI21, etc. Please refer to [this](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html#example-using-a-custom-llm-model-advanced) for loading a custom LLM model for generation.

In [149]:
#@title *jkwng: RAGLLM from utils.hosting_utils.py - updated for Gemini on Vertex*

from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.openai_like import OpenAILike
from llama_index.llms.vertex import Vertex
from llama_index.llms.google_genai import GoogleGenAI

from google.genai.types import HarmCategory, HarmBlockThreshold

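import google.auth
import google.auth.transport.requests
import openai

# Refresh the default credentials; creds.token is used as the API key below
creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)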

class RAGLLM:
    """
    LlamaIndex supports OpenAI, Cohere, AI21 and HuggingFace LLMs
    https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html
    """

    def __init__(self, llm_type, llm_name, api_base=None, api_key=None):
        self.llm_type = llm_type
        self.llm_name = llm_name

        self._api_base = api_base
        self._api_key = api_key

        self.local_model_path = "/model-weights"

    def load_model(self, **kwargs):
        print(f"Configuring {self.llm_type} LLM model ...")
        gen_arg_keys = ["temperature", "top_p", "top_k", "do_sample"]
        gen_kwargs = {k: v for k, v in kwargs.items() if k in gen_arg_keys}
        if self.llm_type == "local":
            # Using local HuggingFace LLM stored at /model-weights
            llm = HuggingFaceLLM(
                tokenizer_name=f"{self.local_model_path}/{self.llm_name}",
                model_name=f"{self.local_model_path}/{self.llm_name}",
                device_map="auto",
                context_window=4096,
                max_new_tokens=kwargs["max_new_tokens"],
                generate_kwargs=gen_kwargs,
                # model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
            )
        # jkwng: add vertex support
        elif self.llm_type in ["vertex"]:
            safety_settings = {
                HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE
            }
            llm = GoogleGenAI(
                model=self.llm_name,
                temperature=kwargs["temperature"],
                max_tokens=kwargs["max_new_tokens"],
                safety_settings=safety_settings,
            )
        elif self.llm_type in ["vertex-endpoint"]:
            ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
                PROJECT_ID, REGION, kwargs["llm_endpoint_id"]  # numeric Vertex endpoint id
            )
            BASE_URL = (
              f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
            )
            # Switch to the dedicated endpoint DNS when the optional flag is set
            if kwargs.get("llm_use_dedicated_endpoint"):
                BASE_URL = f"https://{kwargs['llm_dedicated_dns']}/v1/{ENDPOINT_RESOURCE_NAME}"
            llm = OpenAILike(
                model=self.llm_name,
                temperature=kwargs["temperature"],
                max_tokens=kwargs["max_new_tokens"],
                api_base=BASE_URL,
                api_key=creds.token,
                is_chat_model=True,
                top_p=kwargs["top_p"],
                top_k=kwargs["top_k"],
            )
        elif self.llm_type in ["openai", "kscope"]:
            llm = OpenAILike(
                model=self.llm_name,
                api_base=self._api_base,
                api_key=self._api_key,
                is_chat_model=True,
                temperature=kwargs["temperature"],
                max_tokens=kwargs["max_new_tokens"],
                top_p=kwargs["top_p"],
                top_k=kwargs["top_k"],
            )
        return llm
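
The `vertex-endpoint` branch above builds the OpenAI-compatible base URL in one of two ways, depending on whether the endpoint is dedicated. The same logic, pulled out as a small pure function for easier inspection, might look like the sketch below. `build_endpoint_base_url` is a hypothetical helper, and the project, endpoint id and DNS name are placeholders.

```python
def build_endpoint_base_url(project_id, region, endpoint_id, dedicated_dns=None):
    """Mirror the URL patterns used in the vertex-endpoint branch of RAGLLM."""
    resource_name = f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"
    if dedicated_dns:
        # Dedicated endpoints are reached through their own DNS name
        return f"https://{dedicated_dns}/v1/{resource_name}"
    # Shared endpoints go through the regional aiplatform host
    return f"https://{region}-aiplatform.googleapis.com/v1beta1/{resource_name}"


# Placeholder values, not real resources
shared_url = build_endpoint_base_url("my-project", "us-central1", "1234567890")
dedicated_url = build_endpoint_base_url(
    "my-project",
    "us-central1",
    "1234567890",
    dedicated_dns="1234567890.us-central1-000000000000.prediction.vertexai.goog",
)

print(shared_url)
print(dedicated_url)
```

Keeping the URL construction separate from the client setup makes it easy to print and verify the URL before any request is sent.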

In [150]:
llm = RAGLLM(
    llm_type=rag_cfg['llm_type'],
    llm_name=rag_cfg['llm_name'],
    # api_base=GENERATOR_BASE_URL,
    # api_key=OPENAI_API_KEY,
).load_model(**rag_cfg)

Configuring vertex-endpoint LLM model ...


#### 4. Use ```Settings``` to set the node parser, embedding model, LLM, etc.

In [151]:
Settings.text_splitter = node_parser
Settings.llm = llm
Settings.embed_model = embed_model

## STAGE 3 - Load index using the appropriate vector store
All vector stores supported by LlamaIndex along with their available features are listed [here](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html).

If you are using LangChain, the supported vector stores can be found [here](https://python.langchain.com/docs/modules/data_connection/vectorstores/).

*jkwng: The LlamaIndex + Vertex AI Vector Store integration seems to be broken; the following things do not work:*

- *Batch Updates to a staging bucket gives an error, only Streaming Index works*
- *Retrieval is broken - the API has changed but the library has not been updated*

*For the purposes of the notebook - we will use Weaviate*

In [105]:
#@title *jkwng: RAGIndex from utils.storage_utils.py - modified to use Vertex Vector Store*
import faiss
import os
import weaviate
from google.cloud import aiplatform
from google.cloud import storage

from pathlib import Path

from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore

# from .rag_utils import get_embed_model_dim
def get_embed_model_dim(embed_model):
    embed_out = embed_model.get_text_embedding("Dummy Text")
    return len(embed_out)

class RAGIndex:
    """
    Use storage context to set custom vector store
    Available options: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html
    Use Chroma: https://docs.llamaindex.ai/en/stable/examples/vector_stores/ChromaIndexDemo.html
    LangChain vector stores: https://python.langchain.com/docs/modules/data_connection/vectorstores/
    """

    def __init__(self, db_type, db_name):
        self.db_type = db_type
        self.db_name = db_name
        self._persist_dir = f"./.{db_type}_index_store/"

    def create_index(self, docs, save=True, **kwargs):
        vector_store = self._load_index(**kwargs)

        if os.path.isdir(self._persist_dir):
            # Load if index already saved
            print(f"Loading index from {self._persist_dir} ...")
            storage_context = StorageContext.from_defaults(
                vector_store=vector_store,
                persist_dir=self._persist_dir,
            )
            index = load_index_from_storage(storage_context)
        else:
            # Re-index
            print("Creating new index ...")
            storage_context = StorageContext.from_defaults(vector_store=vector_store)
            index = VectorStoreIndex.from_documents(
                docs, storage_context=storage_context
            )
            if save:
                os.makedirs(self._persist_dir, exist_ok=True)
                index.storage_context.persist(persist_dir=self._persist_dir)

        return index

    def load_index(self, **kwargs):
      print(f"Loading index from {self._persist_dir} ...")
      vector_store = self._load_index(**kwargs)

      return VectorStoreIndex.from_vector_store(vector_store)


    def _load_index(self, **kwargs):
        # Supports Weaviate, a local FAISS index, and Vertex AI Vector Search
        if self.db_type == "weaviate":
            # with open(Path.home() / ".weaviate.key", "r") as f:
            #     weaviate_api_key = f.read().rstrip("\n")
            weaviate_client = weaviate.connect_to_wcs(
                cluster_url=kwargs["weaviate_url"],
                auth_credentials=weaviate.auth.AuthApiKey(WEAVIATE_API_KEY),
            )
            vector_store = WeaviateVectorStore(
                weaviate_client=weaviate_client,
                index_name=self.db_name,
            )
        elif self.db_type == "local":
            # Use FAISS vector database for local index
            faiss_dim = get_embed_model_dim(kwargs["embed_model"])
            faiss_index = faiss.IndexFlatL2(faiss_dim)
            vector_store = FaissVectorStore(faiss_index=faiss_index)
        # jkwng: added Vertex AI Vector Search support here
        elif self.db_type == "vertex":
          # check if storage bucket exists
          bucket_names = [
              bucket.name for bucket in storage.Client().list_buckets()
          ]

          dst_bucket = f"jkwng-{self.db_name.replace('_', '-').lower()}"

          if dst_bucket not in bucket_names:
              print(f"Creating bucket {dst_bucket} ...")
              storage.Client().create_bucket(dst_bucket, location=REGION)
              print(f"Bucket {dst_bucket} created")
          else:
              print(f"Bucket {dst_bucket} exists")

          # check if index exists already in vertex
          index_names = [
              index.resource_name
              for index in aiplatform.MatchingEngineIndex.list(
                  filter=f"display_name={self.db_name}"
              )
          ]

          # create the index if it doesn't exist
          if len(index_names) == 0:
              print(f"Creating Vector Search index {self.db_name} ...")
              vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
                  display_name=self.db_name,
                  dimensions=768,
                  approximate_neighbors_count=100,
                  distance_measure_type="DOT_PRODUCT_DISTANCE",
                  shard_size="SHARD_SIZE_SMALL",
                  index_update_method="STREAM_UPDATE",  # allowed values BATCH_UPDATE , STREAM_UPDATE
              )
              print(
                  f"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}"
              )
          else:
              vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])
              print(
                  f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
              )

          # create an endpoint to serve the index
          endpoint_names = [
              endpoint.resource_name
              for endpoint in aiplatform.MatchingEngineIndexEndpoint.list(
                  filter=f"display_name={self.db_name}"
              )
          ]

          if len(endpoint_names) == 0:
              print(
                  f"Creating Vector Search index endpoint {self.db_name} ..."
              )
              vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
                  display_name=self.db_name, public_endpoint_enabled=True
              )
              print(
                  f"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}"
              )
          else:
              vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
                  index_endpoint_name=endpoint_names[0]
              )
              print(
                  f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
              )

          # check if the index is already deployed to an endpoint
          index_endpoints = [
              (deployed_index.index_endpoint, deployed_index.deployed_index_id)
              for deployed_index in vs_index.deployed_indexes
          ]

          if len(index_endpoints) == 0:
              print(
                  f"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ..."
              )
              vs_deployed_index = vs_endpoint.deploy_index(
                  index=vs_index,
                  deployed_index_id=self.db_name,
                  display_name=self.db_name,
                  machine_type="e2-standard-2",
                  min_replica_count=1,
                  max_replica_count=1,
              )
              print(
                  f"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}"
              )
          else:
              vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(
                  index_endpoint_name=index_endpoints[0][0]
              )
              print(
                  f"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}"
              )

          # setup storage
          vector_store = VertexAIVectorStore(
              project_id=PROJECT_ID,
              region=REGION,
              index_id=vs_index.resource_name,
              endpoint_id=vs_endpoint.resource_name,
              gcs_bucket_name=dst_bucket,
          )

        else:
            raise NotImplementedError(f"Incorrect vector db type - {self.db_type}")

        return vector_store
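
`create_index` follows a load-or-create pattern: if the persist directory exists, the index is loaded from it; otherwise the index is built and saved. The pattern can be sketched with a toy "index" stored as JSON. `load_or_create`, `build_toy_index` and the `index.json` file name are assumptions for illustration only and do not reflect how LlamaIndex stores its data.

```python
import json
import tempfile
from pathlib import Path


def load_or_create(persist_dir, build_fn):
    """Load a saved index if persist_dir exists, otherwise build and save one."""
    persist_path = Path(persist_dir)
    index_file = persist_path / "index.json"
    if persist_path.is_dir():
        # Same condition as create_index: an existing directory means "load"
        return json.loads(index_file.read_text()), "loaded"
    index = build_fn()
    persist_path.mkdir(parents=True, exist_ok=True)
    index_file.write_text(json.dumps(index))
    return index, "created"


def build_toy_index():
    # Stand-in for the expensive embedding and indexing step
    return {"doc_a": [0.1, 0.2], "doc_b": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp_dir:
    store_dir = Path(tmp_dir) / ".toy_index_store"
    _, first_status = load_or_create(store_dir, build_toy_index)
    second_index, second_status = load_or_create(store_dir, build_toy_index)

print(first_status, second_status)
```

One consequence of this design is that an existing directory always wins: after changing the documents, chunking or embedding model, the persist directory has to be removed to force a re-index.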

In [106]:
index = RAGIndex(
    db_type=rag_cfg['vector_db_type'],
    db_name=rag_cfg['vector_db_name'],
).load_index(**rag_cfg)

Loading index from ./.weaviate_index_store/ ...


## STAGE 4 - Build query engine

Now build a query engine using *retriever* and *response_synthesizer*. LlamaIndex also supports different types of [retrievers](https://docs.llamaindex.ai/en/stable/api_reference/query/retrievers.html) and [response modes](https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/root.html#configuring-the-response-mode) for various use-cases.

[Weaviate hybrid search](https://weaviate.io/blog/hybrid-search-explained) explains how dense and sparse search are combined.
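
To make the role of the `alpha` parameter concrete, here is a minimal sketch of weighted score fusion, assuming min-max normalised scores and a simple linear blend. It is an illustration only and not Weaviate's implementation; the function names and the toy scores are hypothetical.

```python
def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense, sparse, alpha):
    """Blend scores: alpha=1.0 is pure dense, alpha=0.0 is pure sparse."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def rank(scores):
    """Return doc ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy scores (hypothetical): "a" wins on keywords, "b" wins on embeddings.
dense = {"a": 0.20, "b": 0.90, "c": 0.50}
sparse = {"a": 7.0, "b": 1.0, "c": 3.0}

for alpha in (0.0, 0.25, 1.0):
    print(alpha, rank(hybrid_scores(dense, sparse, alpha)))
```

Moving `alpha` from 0.0 towards 1.0 shifts the ranking from the keyword-favoured document to the embedding-favoured one, which is the behaviour the *hybrid_search_alpha* setting controls.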

In [107]:
#@title *jkwng - RAGQueryEngine from utils.rag_utils.py - add support for Vertex AI*
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever

from llama_index.core import (
    PromptTemplate,
    get_response_synthesizer,
)

class RAGQueryEngine:
    """
    https://docs.llamaindex.ai/en/stable/understanding/querying/querying.html
    TODO - Check other args for RetrieverQueryEngine
    """

    def __init__(self, retriever_type, vector_index):
        self.retriever_type = retriever_type
        self.index = vector_index
        self.retriever = None
        self.node_postprocessor = None
        self.response_synthesizer = None

    def create(self, similarity_top_k, response_mode, **kwargs):
        self.set_retriever(similarity_top_k, **kwargs)
        self.set_response_synthesizer(response_mode=response_mode)
        if kwargs["use_reranker"]:
            self.set_node_postprocessors(rerank_top_k=kwargs["rerank_top_k"])
        query_engine = RetrieverQueryEngine(
            retriever=self.retriever,
            node_postprocessors=self.node_postprocessor,
            response_synthesizer=self.response_synthesizer,
        )
        return query_engine

    def set_retriever(self, similarity_top_k, **kwargs):
        # Other retrievers can be used based on the type of index: List, Tree, Knowledge Graph, etc.
        # https://docs.llamaindex.ai/en/stable/api_reference/query/retrievers.html
        # Find LlamaIndex equivalents for the following:
        # Check MultiQueryRetriever from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever
        # Check Contextual compression from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/
        # Check Ensemble Retriever from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble
        # Check self-query from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/self_query
        # Check WebSearchRetriever from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/web_research
        if self.retriever_type == "vector_index":
            self.retriever = VectorIndexRetriever(
                index=self.index,
                similarity_top_k=similarity_top_k,
                vector_store_query_mode=kwargs["query_mode"],
                alpha=kwargs["hybrid_search_alpha"],
            )
        elif self.retriever_type == "bm25":
            self.retriever = BM25Retriever(
                nodes=kwargs["nodes"],
                tokenizer=kwargs["tokenizer"],
                similarity_top_k=similarity_top_k,
            )
        else:
            raise NotImplementedError(
                f"Incorrect retriever type - {self.retriever_type}"
            )

    def set_node_postprocessors(self, rerank_top_k=2):
        # Node postprocessor: Processing nodes after retrieval, before passing them to the LLM for generation
        # Re-ranking step can be performed here!
        # Nodes can be re-ordered to include more relevant ones at the top: https://python.langchain.com/docs/modules/data_connection/document_transformers/post_retrieval/long_context_reorder
        # https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/node_postprocessors.html

        self.node_postprocessor = [LLMRerank(top_n=rerank_top_k)]

    def set_response_synthesizer(self, response_mode):
        # Other response modes: https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/root.html#configuring-the-response-mode
        qa_prompt_tmpl = (
            "Context information is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Given the context information and not prior knowledge, answer the query while providing an explanation. "
            "If your answer is in favour of the query, end your response with 'yes' otherwise end your response with 'no'.\n"
            "Query: {query_str}\n"
            "Answer: "
        )
        qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl)

        self.response_synthesizer = get_response_synthesizer(
            text_qa_template=qa_prompt_tmpl,
            response_mode=response_mode,
        )

In [108]:
def set_query_engine_args(rag_cfg, docs):
    query_engine_args = {
        "similarity_top_k": rag_cfg['retriever_similarity_top_k'],
        "response_mode": rag_cfg['response_mode'],
        "use_reranker": False,
    }

    # jkwng: add that retriever type vector_index could be "vertex" too
    # jkwng: note we don't actually use hybrid search for vertex ai vector search
    if (rag_cfg["retriever_type"] == "vector_index") and (rag_cfg["vector_db_type"] == "weaviate"):
        query_engine_args.update({
            "query_mode": rag_cfg["query_mode"],
            "hybrid_search_alpha": rag_cfg["hybrid_search_alpha"]
        })
    elif (rag_cfg["retriever_type"] == "vector_index") and (rag_cfg["vector_db_type"] == "vertex"):
        query_engine_args.update({
            # jkwng: only default mode works with VVS
            "query_mode": "default",
            "hybrid_search_alpha": 0.0,
        })
    elif rag_cfg["retriever_type"] == "bm25":
        nodes = Settings.text_splitter.get_nodes_from_documents(docs)
        tokenizer = Settings.embed_model._tokenizer
        query_engine_args.update({"nodes": nodes, "tokenizer": tokenizer})

    if rag_cfg["use_reranker"]:
        query_engine_args.update({"use_reranker": True, "rerank_top_k": rag_cfg["rerank_top_k"]})

    return query_engine_args
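
For reference, a configuration that exercises the Weaviate hybrid branch might look as follows. The key names are the ones read by `set_query_engine_args`; the values are illustrative assumptions rather than the notebook's defaults, and the full `rag_cfg` carries further keys that are not shown here.

```python
# Hypothetical configuration; only the key names are taken from the code above.
rag_cfg_example = {
    "retriever_type": "vector_index",   # or "bm25"
    "vector_db_type": "weaviate",       # or "vertex"
    "retriever_similarity_top_k": 5,
    "response_mode": "compact",
    "query_mode": "hybrid",             # combine dense and sparse retrieval
    "hybrid_search_alpha": 0.5,         # 0.0 = keyword only, 1.0 = vector only
    "use_reranker": True,
    "rerank_top_k": 3,
}

# Keys that the hybrid branch of set_query_engine_args reads.
required = {"retriever_type", "vector_db_type", "query_mode", "hybrid_search_alpha"}
missing = required - rag_cfg_example.keys()
print("missing keys:", sorted(missing))
```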

In [109]:
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)

{'hybrid_search_alpha': 0.0,
 'query_mode': 'default',
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': False}


In [110]:
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index,
).create(**query_engine_args)

## STAGE 5 - Finally query the model !
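**Note:** With the Weaviate vector store in hybrid query mode, *hybrid_search_alpha* = 0.0 corresponds to keyword-based (sparse) search and 1.0 to pure vector (dense) search. With *query_mode* set to `default`, as in the arguments printed above, retrieval is dense vector search and *hybrid_search_alpha* has no effect.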

#### [TODO] Change seed to experiment with a different sample

In [111]:
random.seed(237)

In [112]:
sample_idx = random.randint(0, len(pubmed_data)-1)
sample_elm = pubmed_data[sample_idx]
pprint(sample_elm)

{'answer': ['yes'],
 'context': 'Ascitis and undernutrition are frequent complications of '
            'cirrhosis, however ascitis volume and anthropometric assessment '
            'are not routinely documented or considered in prognostic '
            'evaluation. In a homogeneous cohort followed during two years '
            'these variables were scrutinized, aiming to ascertain relevance '
            'for longterm outcome. Population (N = 25, all males with '
            'alcoholic cirrhosis) was recruited among patients hospitalized '
            'for uncomplicated ascitis. Exclusion criteria were refractory or '
            'tense ascitis, cancer, spontaneous bacterial peritonitis, '
            'bleeding varices and critical illness. Measurements included '
            'ultrasonographically estimated ascitis volume, dry body mass '
            'index/BMI , upper arm anthropometrics, hematologic counts and '
            'liver function tests. Population (age 48.3 ± 11.3 years,

In [113]:
#@title *jkwng: extract_yes_no from utils.rag_utils.py*
import re

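def extract_yes_no(resp):
    # The prompt asks the model to end its response with 'yes' or 'no',
    # so take the last match rather than the first.
    match_pat = r"\b(?:yes|no)\b"
    matches = re.findall(match_pat, resp, re.IGNORECASE)
    if matches:
        return matches[-1].lower()
    return "none"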
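
Since the prompt asks the model to end its answer with the verdict, one way to score a batch of responses is to take the last yes/no token in each response and compare it with the ground-truth label. The helper names and the sample responses below are hypothetical; this is a sketch and not part of `utils.rag_utils`.

```python
import re

YES_NO_PATTERN = re.compile(r"\b(?:yes|no)\b", re.IGNORECASE)


def last_yes_no(resp):
    """Return the final 'yes'/'no' token in the response, lowercased, or 'none'."""
    matches = YES_NO_PATTERN.findall(resp)
    return matches[-1].lower() if matches else "none"


def accuracy(responses, labels):
    """Fraction of responses whose extracted verdict equals the label."""
    if not responses:
        return 0.0
    correct = sum(last_yes_no(r) == y for r, y in zip(responses, labels))
    return correct / len(responses)


# Made-up responses for illustration only.
responses = [
    "There is no contraindication in the cohort, so the answer is Yes.",
    "The evidence does not support the claim.\n\nno",
    "The context is inconclusive.",
]
labels = ["yes", "no", "maybe"]

print([last_yes_no(r) for r in responses])
print(accuracy(responses, labels))
```

The first response shows why the position of the match matters: it contains an incidental "no" before the final verdict. A response without any verdict maps to "none", so it never matches a "maybe" label in this simple scheme.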

In [114]:
query = sample_elm['question']

response = query_engine.query(query)

delim = "-" * 25
print(f'QUERY: {query}\n')
print(f'RESPONSE:\n{delim}\n{response.response}\n{delim}\n')
print(f'YES/NO: {extract_yes_no(response.response)}\n')
print(f'GT ANSWER: {sample_elm["answer"][0]}\n')
print(f'GT LONG ANSWER:\n{delim}\n{sample_elm["long_answer"]}\n{delim}')

QUERY: Should ascitis volume and anthropometric measurements be estimated in hospitalized alcoholic cirrotics?

RESPONSE:
-------------------------
Based on the provided context information, it appears that ascitis volume and anthropometric measurements are relevant for long-term outcome in hospitalized alcoholic cirrotics. The study found that admission ascitis volume corresponded to 7.1 ± 3.6 L and dry BMI to 18.3 ± 3.5 kg/m², and that these variables, along with arm circumference below the 5th percentile, were associated with rehospitalization and mortality.

In particular, the study found that:

* Ascitis volume was similar to matches for mortality as the Child-Pugh index.
* Dry BMI was similar to matches for mortality as the Child-Pugh index.
* Arm circumference below the 5th percentile was highly significantly associated with rehospitalization.

Therefore, considering the relevance of these variables for long-term outcome, it would be beneficial to estimate ascitis volume and ant

# Batch evaluation with Ragas

Generate answers for the first 10 questions to use in the batch evaluation.

#### [OPTIONAL] [Ragas](https://docs.ragas.io/en/latest/) evaluation
The following metrics are commonly used to evaluate a RAG workflow:
* [Faithfulness](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/faithfulness/): Measures the factual correctness of the generated answer based on the retrieved context. The value lies between 0 and 1. **Evaluated using an LLM.**
* [Answer Relevance](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/answer_relevance/): Measures how relevant the answer is to the given query. The value lies between 0 and 1. **Evaluated using an LLM.**
* [Context Precision](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/context_precision/): Precision of the retriever, measured using the retrieved and the ground-truth context. The value lies between 0 and 1. An LLM can be used for evaluation; this notebook uses the non-LLM variant.
* [Context Recall](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/context_recall/): Recall of the retriever, measured using the retrieved and the ground-truth context. The value lies between 0 and 1. An LLM can be used for evaluation; this notebook uses the non-LLM variant.
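
To give an intuition for the two retrieval metrics, the sketch below computes a simplified, string-similarity-based recall and precision using only the standard library. It is an assumption-laden illustration: Ragas' own metrics may differ in the similarity measure, the threshold, and in how ranking is taken into account. The function names and the 0.5 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a, b).ratio()


def simple_context_recall(retrieved, reference, threshold=0.5):
    """Share of reference contexts matched by at least one retrieved context."""
    if not reference:
        return 0.0
    hits = sum(
        any(similarity(ret, ref) >= threshold for ret in retrieved)
        for ref in reference
    )
    return hits / len(reference)


def simple_context_precision(retrieved, reference, threshold=0.5):
    """Share of retrieved contexts that match at least one reference context."""
    if not retrieved:
        return 0.0
    hits = sum(
        any(similarity(ret, ref) >= threshold for ref in reference)
        for ret in retrieved
    )
    return hits / len(retrieved)


# Made-up contexts for illustration only.
reference = ["Vitamin C lowered the prevalence of complex regional pain syndrome."]
retrieved = [
    "Vitamin C lowered the prevalence of complex regional pain syndrome.",
    "Mean hospital stay was about 19 days.",
]
print(simple_context_recall(retrieved, reference))
print(simple_context_precision(retrieved, reference))
```

Recall asks whether the ground-truth context was found at all, while precision penalises unrelated passages that were retrieved alongside it.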

In [115]:
#@title *jkwng RagasEval from utils.rag_utils.py - update to include support for Vertex AI Gemini*

from langchain_cohere import ChatCohere
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint
from langchain_openai import ChatOpenAI
from langchain_google_vertexai import VertexAIEmbeddings, ChatVertexAI

from ragas import EvaluationDataset, evaluate as ragas_evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper, LlamaIndexEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    Faithfulness,
    NonLLMContextPrecisionWithReference,
    NonLLMContextRecall,
    ResponseRelevancy,
)
from ragas.run_config import RunConfig

RAGAS_METRIC_MAP = {
    "faithfulness": Faithfulness(),
    "relevancy": ResponseRelevancy(),
    "recall": NonLLMContextRecall(),
    "precision": NonLLMContextPrecisionWithReference(),
}

class RagasEval:
    def __init__(
        self, metrics, eval_llm_type, eval_llm_name, embed_model_type, embed_model_name, **kwargs
    ):
        self.eval_llm_type = eval_llm_type  # "openai", "cohere", "local", "kscope", "vertex"
        self.eval_llm_name = eval_llm_name

        self.temperature = kwargs.get("temperature", 0.0)
        self.max_tokens = kwargs.get("max_tokens", 256)

        self.embed_model_type = embed_model_type # "openai", "vertex", "vertex-endpoint"
        self.embed_model_name = embed_model_name
        self.embed_model_endpoint_id = kwargs.get("embed_model_endpoint_id", None)
        self.embed_model_use_dedicated_endpoint = kwargs.get("embed_model_use_dedicated_endpoint", False)

        self._prepare_embedding()
        self._prepare_llm()

        self.metrics = [RAGAS_METRIC_MAP[elm] for elm in metrics]

    def _prepare_data(self, data):
        return EvaluationDataset.from_list(data)

    def _prepare_embedding(self):
        model_kwargs = {"device": "cuda", "trust_remote_code": True}
        encode_kwargs = {
            "normalize_embeddings": True
        }  # set True to compute cosine similarity

        if self.embed_model_type == "openai":
          self.eval_embedding = LangchainEmbeddingsWrapper(
              HuggingFaceEmbeddings(
                model_name=self.embed_model_name,
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs,
              )
          )
        elif self.embed_model_type == "vertex":
          self.eval_embedding = LangchainEmbeddingsWrapper(
              VertexAIEmbeddings(
                  model_name=self.embed_model_name,
                  credentials=credentials,
              )
          )
        elif self.embed_model_type == "vertex-endpoint":
          self.eval_embedding = LlamaIndexEmbeddingsWrapper(
              VertexEndpointEmbedding(
                endpoint_id=self.embed_model_endpoint_id,
                project_id=PROJECT_ID,
                location=REGION,
                endpoint_kwargs={
                    "use_dedicated_endpoint": self.embed_model_use_dedicated_endpoint,
                },
              )
          )

    def _prepare_llm(self):
        if self.eval_llm_type == "local":
            self.eval_llm = LangchainLLMWrapper(
                HuggingFaceEndpoint(
                    repo_id=f"meta-llama/{self.eval_llm_name}",
                    temperature=self.temperature,
                    max_new_tokens=self.max_tokens,
                    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
                )
            )
        elif self.eval_llm_type == "kscope":
            self.eval_llm = LangchainLLMWrapper(
                ChatOpenAI(
                    model=self.eval_llm_name,
                    temperature=self.temperature,
                    max_tokens=self.max_tokens,
                )
            )
        elif self.eval_llm_type == "openai":
            self.eval_llm = LangchainLLMWrapper(
                ChatOpenAI(
                    model=self.eval_llm_name,
                    temperature=self.temperature,
                    max_tokens=self.max_tokens,
                    base_url=os.environ["RAGAS_OPENAI_BASE_URL"],
                    api_key=os.environ["RAGAS_OPENAI_API_KEY"],
                )
            )
        elif self.eval_llm_type == "cohere":
            self.eval_llm = LangchainLLMWrapper(
                ChatCohere(
                    model=self.eval_llm_name,
                )
            )
        elif self.eval_llm_type == "vertex":
            self.eval_llm = LangchainLLMWrapper(
                ChatVertexAI(
                    model_name=self.eval_llm_name,
                    temperature=self.temperature,
                    max_tokens=self.max_tokens,
                )
            )

    def evaluate(self, data):
        eval_data = self._prepare_data(data)

        result = ragas_evaluate(
            dataset=eval_data,
            metrics=self.metrics,
            llm=self.eval_llm,
            embeddings=self.eval_embedding,
            run_config=RunConfig(max_workers=64)
        )
        return result

In [116]:
eval_obj = RagasEval(
    metrics=["faithfulness", "relevancy", "recall", "precision"],
    max_tokens=1024,
    **rag_cfg
)

In [117]:
all_eval_data = []
for i in range(10):
  sample_elm = pubmed_data[i]
  query = sample_elm['question']
  retrieved_nodes = query_engine.retriever.retrieve(query)
  response = query_engine.query(query)

  eval_data = {
      "user_input": query,
      "response": response.response,
      "retrieved_contexts": [node.text for node in retrieved_nodes],
      "reference": sample_elm['long_answer'],
      "reference_contexts": [sample_elm["context"]],
  }

  all_eval_data.append(eval_data)
  # pprint(eval_data)
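
Before handing the records to the evaluator, it can help to check that each one carries the fields built in the loop above. This is a minimal sketch; `validate_record` is a hypothetical helper, and the field list simply mirrors the keys used in this notebook.

```python
REQUIRED_FIELDS = {
    "user_input": str,
    "response": str,
    "retrieved_contexts": list,
    "reference": str,
    "reference_contexts": list,
}


def validate_record(record):
    """Return a list of problems found in one evaluation record (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
        elif not record[field]:
            problems.append(f"{field} is empty")
    return problems


# Made-up record for illustration only.
record = {
    "user_input": "Is X associated with Y?",
    "response": "The context suggests so. yes",
    "retrieved_contexts": ["X was significantly associated with Y."],
    "reference": "X is associated with Y.",
    "reference_contexts": [],
}
print(validate_record(record))
```

Running such a check over `all_eval_data` before evaluation surfaces empty retrievals or missing references early, rather than as low or undefined metric values later.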



Run the eval on the first sample


In [57]:
sample_elm = all_eval_data[0]

result = eval_obj.evaluate([sample_elm])
df = result.to_pandas()
df.head()

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

ERROR:asyncio:Exception in callback PollerCompletionQueue._handle_events(<_UnixSelecto...e debug=False>)()
handle: <Handle PollerCompletionQueue._handle_events(<_UnixSelecto...e debug=False>)()>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable


| | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | non_llm_context_recall | non_llm_context_precision_with_reference |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Does histologic chorioamnionitis correspond to... | [To evaluate the degree to which histologic ch... | [To evaluate the degree to which histologic ch... | Based on the provided context information, it ... | Histologic chorioamnionitis is a reliable indi... | 0.0 | 0.552486 | 1.0 | 1.0 |


Run the eval on the 10 samples

In [118]:
eval_result = eval_obj.evaluate(all_eval_data)
pprint(eval_result)

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

ERROR:asyncio:Exception in callback PollerCompletionQueue._handle_events(<_UnixSelecto...e debug=False>)()
handle: <Handle PollerCompletionQueue._handle_events(<_UnixSelecto...e debug=False>)()>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable

{'faithfulness': 0.8463, 'answer_relevancy': 0.9436, 'non_llm_context_recall': 1.0000, 'non_llm_context_precision_with_reference': 1.0000}


In [119]:
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 2000)
eval_result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,faithfulness,answer_relevancy,non_llm_context_recall,non_llm_context_precision_with_reference
0,Does histologic chorioamnionitis correspond to clinical chorioamnionitis?,"[To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother. A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnionitis (n = 52) served as controls. Cases and controls were selected sequentially. Maternal medical records were reviewed for indicators of maternal infection. Histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was significantly associated with the presence of histologic chorioamnionitis (p = 0.019)., Congenital cytomegalovirus infection is currently the leading cause of congenital infection in 0.2-2.2% of live births worldwide leading to variable serious sequalae. The aim of the study was to determine if low birth weight is an indicator of CMV congenital infection evidenced by detecting CMV-DNA in umbilical cord blood at the time of delivery. CMV-IgG and IgM antibodies and CMV-DNAemia were assessed in umbilical cord blood of two hundreds newborns, one hundred of whom had birth weight<or = 2700 gram and/or head circumference<or = 32 cm. CMV-IgM was not detected, while CMV-IgG was positive in 80-90% of the two hundreds tested newborns. CMV-DNA was detected in four out of the 200 newborns. One of them was over the adopted weight limit (>2700 gram)., To determine whether spectral Doppler measurements obtained from bilateral uterine, arcuate, radial, and spiral arteries in early gestation correlate with adverse pregnancy outcome. 
One hundred five pregnant women underwent transvaginal Doppler sonographic examination of uteroplacental circulation at 6-12 weeks' gestation. Resistance ind...","[To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother. A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnionitis (n = 52) served as controls. Cases and controls were selected sequentially. Maternal medical records were reviewed for indicators of maternal infection. Histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was significantly associated with the presence of histologic chorioamnionitis (p = 0.019).]","Based on the context information, histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was also significantly associated with the presence of histologic chorioamnionitis (p = 0.019). This suggests that histologic chorioamnionitis does correspond to clinical chorioamnionitis, as the presence of histologic chorioamnionitis is linked to clinical indicators of infection.\n\nyes",Histologic chorioamnionitis is a reliable indicator of infection whether or not it is clinically apparent.,1.0,0.875398,1.0,1.0
1,Can vitamin C prevent complex regional pain syndrome in patients with wrist fractures?,"[Complex regional pain syndrome type I is treated symptomatically. A protective effect of vitamin C (ascorbic acid) has been reported previously. A dose-response study was designed to evaluate its effect in patients with wrist fractures. In a double-blind, prospective, multicenter trial, 416 patients with 427 wrist fractures were randomly allocated to treatment with placebo or treatment with 200, 500, or 1500 mg of vitamin C daily for fifty days. The effect of gender, age, fracture type, and cast-related complaints on the occurrence of complex regional pain syndrome was analyzed. Three hundred and seventeen patients with 328 fractures were randomized to receive vitamin C, and ninety-nine patients with ninety-nine fractures were randomized to receive a placebo. The prevalence of complex regional pain syndrome was 2.4% (eight of 328) in the vitamin C group and 10.1% (ten of ninety-nine) in the placebo group (p=0.002); all of the affected patients were elderly women., Analysis of the different doses of vitamin C showed that the prevalence of complex regional pain syndrome was 4.2% (four of ninety-six) in the 200-mg group (relative risk, 0.41; 95% confidence interval, 0.13 to 1.27), 1.8% (two of 114) in the 500-mg group (relative risk, 0.17; 95% confidence interval, 0.04 to 0.77), and 1.7% (two of 118) in the 1500-mg group (relative risk, 0.17; 95% confidence interval, 0.04 to 0.75). Early cast-related complaints predicted the development of complex regional pain syndrome (relative risk, 5.35; 95% confidence interval, 2.13 to 13.42)., The aetiology of osteochondritis dissecans is still unclear. The aim of this prospective pilot study was to analyse whether vitamin D insufficiency, or deficiency, might be a contributing etiological factor in the development of an OCD lesion. 
The serum level of vitamin D3 in 23 consecutive patients (12 male and 11 female) suffering from a stage III, or stages III and IV, OCD lesion (mostly stage III) admitted for surgery was measured....","[Complex regional pain syndrome type I is treated symptomatically. A protective effect of vitamin C (ascorbic acid) has been reported previously. A dose-response study was designed to evaluate its effect in patients with wrist fractures. In a double-blind, prospective, multicenter trial, 416 patients with 427 wrist fractures were randomly allocated to treatment with placebo or treatment with 200, 500, or 1500 mg of vitamin C daily for fifty days. The effect of gender, age, fracture type, and cast-related complaints on the occurrence of complex regional pain syndrome was analyzed. Three hundred and seventeen patients with 328 fractures were randomized to receive vitamin C, and ninety-nine patients with ninety-nine fractures were randomized to receive a placebo. The prevalence of complex regional pain syndrome was 2.4% (eight of 328) in the vitamin C group and 10.1% (ten of ninety-nine) in the placebo group (p=0.002); all of the affected patients were elderly women. Analysis of the different doses of vitamin C showed that the prevalence of complex regional pain syndrome was 4.2% (four of ninety-six) in the 200-mg group (relative risk, 0.41; 95% confidence interval, 0.13 to 1.27), 1.8% (two of 114) in the 500-mg group (relative risk, 0.17; 95% confidence interval, 0.04 to 0.77), and 1.7% (two of 118) in the 1500-mg group (relative risk, 0.17; 95% confidence interval, 0.04 to 0.75). 
Early cast-related complaints predicted the development of complex regional pain syndrome (relative risk, 5.35; 95% confidence interval, 2.13 to 13.42).]","Based on the provided context information, the answer to the query is yes.\n\nThe context information states that in a double-blind, prospective, multicenter trial, 416 patients with 427 wrist fractures were randomly allocated to treatment with placebo or treatment with 200, 500, or 1500 mg of vitamin C daily for fifty days. The results showed that the prevalence of complex regional pain syndrome was 2.4% (eight of 328) in the vitamin C group and 10.1% (ten of ninety-nine) in the placebo group (p=0.002). This indicates a significant protective effect of vitamin C against complex regional pain syndrome in patients with wrist fractures.\n\nAdditionally, the analysis of the different doses of vitamin C showed that the prevalence of complex regional pain syndrome was lower in the vitamin C groups compared to the placebo group, with the 500-mg and 1500-mg groups having the lowest prevalence rates. However, the context information does not provide a clear conclusion on the optimal dose of vitamin C for preventing complex regional pain syndrome.\n\nOverall, the results suggest that vitamin C may have a protective effect against complex regional pain syndrome in patients with wrist fractures, making the answer to the query yes.",Vitamin C reduces the prevalence of complex regional pain syndrome after wrist fractures. A daily dose of 500 mg for fifty days is recommended.,0.777778,0.988141,1.0,1.0
2,It's Fournier's gangrene still dangerous?,"[Fournier's gangrene is known to have an impact in the morbidity and despite antibiotics and aggressive debridement, the mortality rate remains high. To assess the morbidity and mortality in the treatment of Fournier's gangrene in our experience. The medical records of 14 patients with Fournier's gangrene who presented at the University Hospital Center ""Mother Teresa"" from January 1997 to December 2006 were reviewed retrospectively to analyze the outcome and identify the risk factor and prognostic indicators of mortality. Of the 14 patients, 5 died and 9 survived. Mean age was 54 years (range from 41-61): it was 53 years in the group of survivors and 62 years in deceased group. There was a significant difference in leukocyte count between patients who survived (range 4900-17000/mm) and those died (range 20.300-31000/mm3). Mean hospital stay was about 19 days (range 2-57 days)., Blood stream infection (BSI) and the subsequent development of sepsis are among the most common infection complications occurring in severe burn patients. This study was designed to evaluate the relationship between the burn wound flora and BSI pathogens. Documentation of all bacterial and fungal wound and blood isolates from severe burn patients hospitalized in the burn unit and intensive care unit was obtained from medical records retrieved retrospectively from a computerized, hospital-wide database over a 13-year period. All data were recorded in relation to the Ryan score. Of 195 severe burn patients, 88 had at least 1 BSI episode. Transmission of the same pathogen from wound to blood was documented in 30% of the patients, with a rising BSI frequency as the Ryan score increased. 
There were a total of 263 bacteremic episodes in 88 study patients, 44% of blood isolates were documented previously in wound cultures, and transmission of the same pathogen from wound to blood was noted in 65% of bacteremic patients., The accepted treatment protocol for necrotizing fasciitis (NF) consists of ...","[Fournier's gangrene is known to have an impact in the morbidity and despite antibiotics and aggressive debridement, the mortality rate remains high. To assess the morbidity and mortality in the treatment of Fournier's gangrene in our experience. The medical records of 14 patients with Fournier's gangrene who presented at the University Hospital Center ""Mother Teresa"" from January 1997 to December 2006 were reviewed retrospectively to analyze the outcome and identify the risk factor and prognostic indicators of mortality. Of the 14 patients, 5 died and 9 survived. Mean age was 54 years (range from 41-61): it was 53 years in the group of survivors and 62 years in deceased group. There was a significant difference in leukocyte count between patients who survived (range 4900-17000/mm) and those died (range 20.300-31000/mm3). Mean hospital stay was about 19 days (range 2-57 days).]","Based on the provided context information, it appears that Fournier's gangrene is still a serious condition with a high mortality rate. The study mentioned that despite antibiotics and aggressive debridement, the mortality rate remains high, with 5 out of 14 patients dying from the condition. Additionally, the mean age of the deceased group was 62 years, which is higher than the mean age of the survivors (53 years). The significant difference in leukocyte count between the two groups also suggests that Fournier's gangrene is a severe condition that can lead to serious complications.\n\nFurthermore, the study found that the mean hospital stay was about 19 days, indicating that patients with Fournier's gangrene require prolonged hospitalization and treatment. 
The high mortality rate and prolonged hospital stay suggest that Fournier's gangrene is still a dangerous condition that requires prompt and aggressive treatment.\n\nTherefore, the answer to the query is: yes.","The interval from the onset of clinical symptoms to the initial surgical intervention seems to be the most important prognostic factor with a significant impact on outcome. Despite extensive therapeutic efforts, Fournier's gangrene remains a surgical emergency and early recognition with prompt radical debridement is the mainstays of management.",0.9,0.873308,1.0,1.0
3,Can the condition of the cell microenvironment of mediastinal lymph nodes help predict the risk of metastases in non-small cell lung cancer?,"[The aim of this study was to analyze the properties of the immune cell microenvironment of regional lymph nodes (LNs) positive for lung cancer. Twenty-four patients operated on for stages T1 and T2 of the NSCLC, were enrolled in the study. Peripheral blood and LN tissue were obtained from different lymph node sites and levels. As a control, LN tissue was taken from patients diagnosed with emphysema or pneumothorax. The cells from randomly chosen LN were tested by multi-color flow cytometry. Separate portions of LN were snap-frozen and examined for the presence of cytokeratin positive cells (CK). Propensity for apoptosis, level of TCR zeta chain expression of T cells and the number and maturation status of dendritic cells were confronted with the presence of CK-positive cells. The presence of metastases correlated with the downregulation of TCR zeta, especially CD8(+) T cells. The most striking feature was the reduction in the number of myeloid CD11c(+) dendritic cells in the LN of patients with LN metastases. This could be a reflection of the immunodeficient state observed in lung cancer patients., Mediastinal lymph node dissection is an essential component of lung cancer surgery; however, choosing mediastinal lymph nodes stations to be dissected is subjective. We carried out this research to investigate the need for dissection of station 9 lymph nodes during lung cancer surgery. Patients with primary lung cancer who underwent radical surgery between 2010 and 2014 were retrospectively reviewed. Clinical, pathologic, and prognosis data were obtained and analyzed. A total number of 1397 patients were included in this research. The metastasis rate of station 9 was 3.45%, which was significantly lower than other mediastinal stations. 
This metastasis rate was significantly correlated with pT stage, the lobe where the tumor was located, metastasis status of intrapulmonary lymph nodes, pTNM stage, and most of the other mediastinal lymph node stations. In males or grou...","[The aim of this study was to analyze the properties of the immune cell microenvironment of regional lymph nodes (LNs) positive for lung cancer. Twenty-four patients operated on for stages T1 and T2 of the NSCLC, were enrolled in the study. Peripheral blood and LN tissue were obtained from different lymph node sites and levels. As a control, LN tissue was taken from patients diagnosed with emphysema or pneumothorax. The cells from randomly chosen LN were tested by multi-color flow cytometry. Separate portions of LN were snap-frozen and examined for the presence of cytokeratin positive cells (CK). Propensity for apoptosis, level of TCR zeta chain expression of T cells and the number and maturation status of dendritic cells were confronted with the presence of CK-positive cells. The presence of metastases correlated with the downregulation of TCR zeta, especially CD8(+) T cells. The most striking feature was the reduction in the number of myeloid CD11c(+) dendritic cells in the LN of patients with LN metastases. This could be a reflection of the immunodeficient state observed in lung cancer patients. Even in the absence of metastases in the regional LN, the same type of changes in the LN microenvironment were observed in those LN located nearer the primary tumor.]","Based on the provided context information, the answer to the query is yes.\n\nThe study found that the presence of cytokeratin-positive cells (CK) in the mediastinal lymph nodes (LNs) correlated with the downregulation of TCR zeta chain expression, especially in CD8(+) T cells, and a reduction in the number of myeloid CD11c(+) dendritic cells. 
These changes were observed in patients with LN metastases, suggesting that the condition of the cell microenvironment of mediastinal lymph nodes can help predict the risk of metastases in non-small cell lung cancer.\n\nAdditionally, the study also found that even in the absence of metastases in the regional LN, the same type of changes in the LN microenvironment were observed in those LN located nearer the primary tumor, indicating that the condition of the cell microenvironment of mediastinal lymph nodes can be an early indicator of potential metastasis.\n\nTherefore, the condition of the cell microenvironment of mediastinal lymph nodes can be a useful predictor of the risk of metastases in non-small cell lung cancer.",The preliminary results of this study suggest that this approach may be helpful as an independent tumor staging factor. It is also worth noting that part of the staging process could also be based on features describing the immune cells in the peripheral blood.,0.857143,0.997821,1.0,1.0
4,Pancreas retransplantation: a second chance for diabetic patients?,"[If pancreas transplantation is a validated alternative for type 1 diabetic patients with end-stage renal disease, the management of patients who have lost their primary graft is poorly defined. This study aims at evaluating pancreas retransplantation outcome. Between 1976 and 2008, 569 pancreas transplantations were performed in Lyon and Geneva, including 37 second transplantations. Second graft survival was compared with primary graft survival of the same patients and the whole population. Predictive factors of second graft survival were sought. Patient survival and impact on kidney graft function and survival were evaluated. Second pancreas survival of the 17 patients transplanted from 1995 was close to primary graft survival of the whole population (71% vs. 79% at 1 year and 59% vs. 69% at 5 years; P=0.5075) and significantly better than their first pancreas survival (71% vs. 29% at 1 year and 59% vs. 7% at 5 years; P=0.0008) regardless of the cause of first pancreas loss. The same results were observed with all 37 retransplantations., Survival of second simultaneous pancreas and kidney transplantations was better than survival of second pancreas after kidney. Patient survival was excellent (89% at 5 years). Pancreas retransplantation had no impact on kidney graft function and survival (100% at 5 years)., The surgical treatment of diabetes had witnessed progressive development and success since the first case of pancreatic transplantation. Although this was a great step, wide clinical application was limited by several factors. Bariatric surgery such as gastric bypass is emerging as a promising option in obese patients with type 2 diabetes. The aim of this article is to explore the current application of gastric bypass in patients with type 2 diabetes and the theoretical bases of gastric bypass as a treatment option for type 1 diabetes. 
We performed a MEDLINE search for articles published from August 1955 to December 2008 using the words ""surgical treatment ...","[If pancreas transplantation is a validated alternative for type 1 diabetic patients with end-stage renal disease, the management of patients who have lost their primary graft is poorly defined. This study aims at evaluating pancreas retransplantation outcome. Between 1976 and 2008, 569 pancreas transplantations were performed in Lyon and Geneva, including 37 second transplantations. Second graft survival was compared with primary graft survival of the same patients and the whole population. Predictive factors of second graft survival were sought. Patient survival and impact on kidney graft function and survival were evaluated. Second pancreas survival of the 17 patients transplanted from 1995 was close to primary graft survival of the whole population (71% vs. 79% at 1 year and 59% vs. 69% at 5 years; P=0.5075) and significantly better than their first pancreas survival (71% vs. 29% at 1 year and 59% vs. 7% at 5 years; P=0.0008) regardless of the cause of first pancreas loss. The same results were observed with all 37 retransplantations. Survival of second simultaneous pancreas and kidney transplantations was better than survival of second pancreas after kidney. Patient survival was excellent (89% at 5 years). Pancreas retransplantation had no impact on kidney graft function and survival (100% at 5 years).]","Based on the provided context information, pancreas retransplantation appears to be a viable option for diabetic patients who have lost their primary graft. The study mentioned in the context shows that second pancreas survival rates are comparable to primary graft survival rates in the whole population, and significantly better than the first pancreas survival rates of the same patients. 
Additionally, the study found that pancreas retransplantation had no impact on kidney graft function and survival.\n\nThe context also mentions that patient survival was excellent (89% at 5 years) for patients who underwent second simultaneous pancreas and kidney transplantations. This suggests that pancreas retransplantation can provide a second chance for diabetic patients to manage their condition and improve their quality of life.\n\nTherefore, the answer to the query is: yes.",Pancreas retransplantation is a safe procedure with acceptable graft survival that should be proposed to diabetic patients who have lost their primary graft.,0.777778,0.922969,1.0,1.0
5,Gender difference in survival of resected non-small cell lung cancer: histology-related phenomenon?,"[It remains controversial whether there is a gender difference in survival of patients with resected non-small cell lung cancer. We retrospectively analyzed 2770 patients (1689 men and 1081 women) with non-small cell lung cancer who underwent pulmonary resection between 1995 and 2005 at the National Cancer Center Hospital, Tokyo. A gender difference in survival was studied in all patients, in those divided according to histology or pathologic stage, and in propensity-matched gender pairs. There were no differences in background, such as preoperative pulmonary function, operation procedures, or operative mortality. The proportions of adenocarcinoma and pathologic stage I in women were greater than those in men (93.6% vs 61.7% and 71.4% vs 58.6%, respectively) (P<.001). Overall 5-year survival of women was better than that of men (81% vs 70%, P<.001)., In adenocarcinoma, the overall 5-year survival for women was better than that for men in pathologic stage I (95% vs 87%, P<.001) and in pathologic stage II or higher (58% vs 51%, P = .017). In non-adenocarcinoma, there was no significant gender difference in survival in pathologic stage I (P = .313) or pathologic stage II or higher (P = .770). The variables such as age, smoking status, histology, and pathologic stage were used for propensity score matching, and survival analysis of propensity score-matched gender pairs did not show a significant difference (P = .69)., We review our results on surgical treatment of patients with stage I non-small cell lung carcinoma and we attempted to clarify the prognostic significance of some surgical--pathologic variables. From 1993 to 1999, 667 patients received curative lung resection and complete hilar and mediastinal lymphadenectomy for non-small cell lung cancer. Of these, there were 436 Stage I disease (65%), of whom 144 T1N0 and 292 T2N0. 
No patients had pre- or postoperative radio- or chemotherapy. Prognostic significance of the following independent variables was tested ...","[It remains controversial whether there is a gender difference in survival of patients with resected non-small cell lung cancer. We retrospectively analyzed 2770 patients (1689 men and 1081 women) with non-small cell lung cancer who underwent pulmonary resection between 1995 and 2005 at the National Cancer Center Hospital, Tokyo. A gender difference in survival was studied in all patients, in those divided according to histology or pathologic stage, and in propensity-matched gender pairs. There were no differences in background, such as preoperative pulmonary function, operation procedures, or operative mortality. The proportions of adenocarcinoma and pathologic stage I in women were greater than those in men (93.6% vs 61.7% and 71.4% vs 58.6%, respectively) (P<.001). Overall 5-year survival of women was better than that of men (81% vs 70%, P<.001). In adenocarcinoma, the overall 5-year survival for women was better than that for men in pathologic stage I (95% vs 87%, P<.001) and in pathologic stage II or higher (58% vs 51%, P = .017). In non-adenocarcinoma, there was no significant gender difference in survival in pathologic stage I (P = .313) or pathologic stage II or higher (P = .770). The variables such as age, smoking status, histology, and pathologic stage were used for propensity score matching, and survival analysis of propensity score-matched gender pairs did not show a significant difference (P = .69).]","Based on the provided context information, the answer to the query is 'yes'. \n\nThe context information shows that there is a gender difference in survival of patients with resected non-small cell lung cancer, with women having a better overall 5-year survival rate than men (81% vs 70%, P<.001). 
However, this difference is mainly observed in adenocarcinoma patients, where women have a better overall 5-year survival rate than men in both pathologic stage I (95% vs 87%, P<.001) and pathologic stage II or higher (58% vs 51%, P = .017). \n\nIn non-adenocarcinoma patients, there is no significant gender difference in survival in pathologic stage I (P = .313) or pathologic stage II or higher (P = .770). This suggests that the gender difference in survival of resected non-small cell lung cancer may be related to histology, specifically adenocarcinoma. \n\nAdditionally, the propensity score-matched gender pairs analysis did not show a significant difference in survival (P = .69), which further supports the idea that the gender difference in survival is related to histology rather than other factors. \n\nTherefore, the answer","Women had better survival than men; however, there was no survival advantage in propensity-matched gender pairs. A gender difference in survival was observed only in the adenocarcinoma subset, suggesting pathobiology in adenocarcinoma in women might be different from that of men.",0.7,0.940209,1.0,1.0
6,Is HIV/STD control in Jamaica making a difference?,"[To assess the impact of the comprehensive HIV/STD Control Program established in Jamaica since the late 1980s on the HIV/AIDS epidemic. AIDS case reports, HIV testing of blood donors, antenatal clinic attenders (ANC), food service workers, sexually transmitted disease (STD) clinic attenders, female prostitutes, homosexuals and other groups were used to monitor the HIV/AIDS epidemic. Primary and secondary syphilis and cases of congenital syphilis were also monitored. National knowledge, attitude and practice (KAP) surveys were conducted in 1988, 1989, 1992, 1994 and 1996. The annual AIDS incidence rate in Jamaica increased only marginally in the past three years from 18.5 per 100000 population to 21.4 in 1997. HIV prevalence in the general population groups tested has been about 1% or less. Among those at high risk, HIV prevalence rates have risen to 6.3% (95% confidence interval 5.0-8.0) in STD clinic attenders, around 10% and 21% in female prostitutes in Kingston and Montego Bay respectively and approximately 30% among homosexuals., The number of new diagnoses of HIV infection is rising in the northwestern hemisphere and it is becoming increasingly important to understand the mechanisms behind this trend. To evaluate whether reported unsafe sexual behaviour among HIV- infected individuals is changing over time. Participants in the Swiss HIV Cohort Study were asked about their sexual practices every 6 months for 3 years during regular follow-up of the cohort beginning on 1 April 2000. : Logistic regression models were fit using generalized estimating equations assuming a constant correlation between responses from the same individual. At least one sexual behaviour questionnaire was obtained for 6545 HIV-infected individuals and the median number of questionnaires completed per individual was five. 
There was no evidence of an increase in reported unsafe sex over time in this population [odds ratio (OR), 1.0; 95% confidence interval (CI), 0.96-1.05]., Recently, t...","[To assess the impact of the comprehensive HIV/STD Control Program established in Jamaica since the late 1980s on the HIV/AIDS epidemic. AIDS case reports, HIV testing of blood donors, antenatal clinic attenders (ANC), food service workers, sexually transmitted disease (STD) clinic attenders, female prostitutes, homosexuals and other groups were used to monitor the HIV/AIDS epidemic. Primary and secondary syphilis and cases of congenital syphilis were also monitored. National knowledge, attitude and practice (KAP) surveys were conducted in 1988, 1989, 1992, 1994 and 1996. The annual AIDS incidence rate in Jamaica increased only marginally in the past three years from 18.5 per 100000 population to 21.4 in 1997. HIV prevalence in the general population groups tested has been about 1% or less. Among those at high risk, HIV prevalence rates have risen to 6.3% (95% confidence interval 5.0-8.0) in STD clinic attenders, around 10% and 21% in female prostitutes in Kingston and Montego Bay respectively and approximately 30% among homosexuals. Syphilis rates and congenital syphilis cases have declined. The proportion of men aged 15-49 years reporting sex with a non-regular partner declined from 35% in 1994 to 26% in 1996 (P<0.001). Women ever having used condoms increased from 51% in 1988 to 62.5% in 1992 and 73% in 1994 and 1996 (P<0.001), while condom use with a non-regular partner increased from 37% in 1992 to 73% in 1996 (P= 0.006). Condom use among men was high over the period: around 81% had ever used condoms and 77% used them with non-regular partners. 
Gay men, inner-city adults and adolescents aged 12-14 years all reported increases in condom use while condom sales and distribution increased from around 2 million in 1985 to 10 million in 1995.]","Based on the context information provided, it appears that HIV/STD control in Jamaica is making a difference. The annual AIDS incidence rate in Jamaica increased only marginally from 18.5 per 100,000 population to 21.4 in 1997, indicating a relatively slow growth in the epidemic. Additionally, HIV prevalence in the general population groups tested has been about 1% or less, which suggests that the control efforts are having a positive impact.\n\nFurthermore, the study on the Swiss HIV Cohort Study found no evidence of an increase in reported unsafe sex over time, which is a crucial factor in controlling the spread of HIV. This suggests that the education and awareness efforts in Jamaica may be effective in reducing high-risk behaviors.\n\nWhile the study on the pre-HAART era patients in Buenos Aires, Argentina, does not directly relate to the HIV/STD control efforts in Jamaica, it does provide some insights into the challenges of achieving and maintaining viral load suppression over time, which is an important aspect of treatment as prevention (TasP) strategies.\n\nTherefore, based on the available information, it appears that HIV/STD control in Jamaica is making a difference. \n\nYes.","HIV/STD control measures appear to have slowed the HIV/AIDS epidemic in Jamaica, however a significant minority of persons continue to have unprotected sex in high risk situations.",0.916667,0.990716,1.0,1.0
7,Delayed peripheral facial palsy in the stapes surgery: can it be prevented?,"[The aim of this study was to evaluate poststapedectomy-delayed facial palsy etiopathogenesis, risk factors, evolution, and prevention. Seven hundred six stapedectomies performed in 580 patients were reviewed. In all patients who developed delayed facial palsy, the dates of onset and subside of facial palsy, the anatomic and pathologic predisposing factors, and a possible history for recurrent labial herpetic lesions were considered. The House-Brackmann (H-B) grading system was used to evaluate the facial function. Virus-specific immunoglobulin (Ig) G and IgM antibodies against herpes simplex virus type 1 (HSV-1) were determined by enzyme-linked immunosorbent assay (ELISA) 3 weeks after the onset of the paralysis. The results were compared with a control group without a history of recurrent herpes labialis. Poststapedectomy facial palsy developed in 7 out of 706 procedures. All 7 patients referred a history of recurrent labial herpetic lesions. One patient showed a facial palsy H-B grade II, 2 a grade III, and 3 a grade IV., This study evaluated the outcomes and complications of the surgical treatment of condylar fractures by the retromandibular transparotid approach. The authors hypothesized that such an approach would be safe and reliable for the treatment of most condylar fractures. A retrospective evaluation of patients who underwent surgical reduction of a condylar fracture from January 2012 to December 2014 at the Clinic of Dentistry and Maxillofacial Surgery of the University Hospital of Verona (Verona, Italy) was performed. Inclusion criteria were having undergone surgical treatment of condylar fractures with a retromandibular transparotid approach and the availability of computed tomograms of the preoperative and postoperative facial skeleton with a minimum follow-up of 1 year. 
Static and dynamic occlusal function, temporomandibular joint health status, presence of neurologic impairments, and esthetic outcomes were evaluated in all patients. The sample ...","[The aim of this study was to evaluate poststapedectomy-delayed facial palsy etiopathogenesis, risk factors, evolution, and prevention. Seven hundred six stapedectomies performed in 580 patients were reviewed. In all patients who developed delayed facial palsy, the dates of onset and subside of facial palsy, the anatomic and pathologic predisposing factors, and a possible history for recurrent labial herpetic lesions were considered. The House-Brackmann (H-B) grading system was used to evaluate the facial function. Virus-specific immunoglobulin (Ig) G and IgM antibodies against herpes simplex virus type 1 (HSV-1) were determined by enzyme-linked immunosorbent assay (ELISA) 3 weeks after the onset of the paralysis. The results were compared with a control group without a history of recurrent herpes labialis. Poststapedectomy facial palsy developed in 7 out of 706 procedures. All 7 patients referred a history of recurrent labial herpetic lesions. One patient showed a facial palsy H-B grade II, 2 a grade III, and 3 a grade IV. After acyclovir therapy, 6 subjects recovered completely, whereas 1 maintained an H-B grade II. An increased IgG antibody titer was found in 6 of the patients with delayed facial palsy and in 1 out of 7 controls. Mean IgG titer was 1:14,050 in the subjects with delayed facial palsy and 1:2,300 in controls (P<.001).]","Based on the context information provided, it appears that delayed peripheral facial palsy in stapes surgery is associated with a history of recurrent labial herpetic lesions. 
In the study evaluating poststapedectomy-delayed facial palsy, all 7 patients who developed delayed facial palsy referred a history of recurrent labial herpetic lesions.\n\nGiven this information, it can be inferred that preventing or managing recurrent labial herpetic lesions may be a potential strategy to prevent delayed peripheral facial palsy in stapes surgery.\n\nTherefore, the answer to the query is: yes.","Poststapedectomy-delayed facial palsy is likely caused by a reactivation of HSV-1, latent within the geniculate ganglion. The activation of the latent virus is more frequent in patients with a history of herpes labialis and can be prevented by an adequate acyclovir therapy.",0.833333,0.88329,1.0,1.0
8,Is external palliative radiotherapy for gallbladder carcinoma effective?,"[Gallbladder carcinoma is characterized by delayed diagnosis, ineffective treatment and poor prognosis. Surgical resection has been thought to be the treatment of choice, while the role of radiotherapy as adjuvant or palliative treatment has not been fully clarified in the literature. We present the case of a 45-year-old female, with unresectable gallbladder carcinoma, grade IV, histologically diagnosed during laparotomy. The patient was treated with palliative intent with percutaneous transhepatic biliary drainage. Furthermore, she received external radiotherapy by (60)Co, using a three-field technique (anterior-posterior and right lateral). The total dose was 3,000 cGy in 10 fractions, with 300 cGy per fraction, 5 days weekly. The patient showed clinico-laboratory improvement and was discharged with a permanent percutaneous transhepatic endoprosthesis. During follow-up (10 and 12 months postirradiation), abdominal CTs showed no local extension of the tumor, while the patient had a good performance status. So far, 1 year after the diagnosis of gallbladder cancer she is still alive., To analyze, retrospectively, the patterns and behavior of metastatic lesions in prostate cancer patients treated with external beam radiotherapy and to investigate whether patients with<or =5 lesions had an improved outcome relative to patients with>5 lesions. The treatment and outcome of 369 eligible patients with Stage T1-T3aN0-NXM0 prostate cancer were analyzed during a minimal 10-year follow-up period. All patients were treated with curative intent to a mean dose of 65 Gy. The full history of any metastatic disease was documented for each subject, including the initial site of involvement, any progression over time, and patient survival. The overall survival rate for the 369 patients was 75% at 5 years and 45% at 10 years. 
The overall survival rate of patients who never developed metastases was 90% and 81% at 5 and 10 years, respectively. However, among the 74 patients (20%) who...","[Gallbladder carcinoma is characterized by delayed diagnosis, ineffective treatment and poor prognosis. Surgical resection has been thought to be the treatment of choice, while the role of radiotherapy as adjuvant or palliative treatment has not been fully clarified in the literature. We present the case of a 45-year-old female, with unresectable gallbladder carcinoma, grade IV, histologically diagnosed during laparotomy. The patient was treated with palliative intent with percutaneous transhepatic biliary drainage. Furthermore, she received external radiotherapy by (60)Co, using a three-field technique (anterior-posterior and right lateral). The total dose was 3,000 cGy in 10 fractions, with 300 cGy per fraction, 5 days weekly. The patient showed clinico-laboratory improvement and was discharged with a permanent percutaneous transhepatic endoprosthesis. During follow-up (10 and 12 months postirradiation), abdominal CTs showed no local extension of the tumor, while the patient had a good performance status. So far, 1 year after the diagnosis of gallbladder cancer she is still alive.]","Based on the provided context information, the answer to the query is yes.\n\nThe context information presents a case study of a 45-year-old female patient with unresectable gallbladder carcinoma, grade IV, who was treated with palliative intent using percutaneous transhepatic biliary drainage and external radiotherapy with a total dose of 3,000 cGy in 10 fractions. The patient showed clinico-laboratory improvement and was discharged with a permanent percutaneous transhepatic endoprosthesis. During follow-up, abdominal CTs showed no local extension of the tumor, and the patient had a good performance status. 
The patient remained alive 1 year after the diagnosis of gallbladder cancer.\n\nThis case study suggests that external palliative radiotherapy can be effective in managing gallbladder carcinoma, particularly in patients with unresectable disease. The patient's improvement in clinico-laboratory status and lack of local tumor extension indicate a positive response to the treatment.\n\nAdditionally, the context information provides general information on the treatment and outcome of patients with prostate cancer and esophageal carcinoma, but it does not directly contradict the effectiveness of external palliative radiotherapy for gallbladder carcinoma. Therefore, based on the provided information, the answer","External radiotherapy seems to be a safe and effective method of palliative management of gallbladder carcinoma. However, further studies are necessary to determine the role of radiotherapy in palliative or adjuvant treatment of gallbladder carcinoma.",0.8,0.994196,1.0,1.0
9,Can elevated troponin I levels predict complicated clinical course and inhospital mortality in patients with acute pulmonary embolism?,"[The purpose of this study was to evaluate the value of elevated cardiac troponin I (cTnI) for prediction of complicated clinical course and in-hospital mortality in patients with confirmed acute pulmonary embolism (PE). This study was a retrospective chart review of patients diagnosed as having PE, in whom cTnI testing was obtained at emergency department (ED) presentation between January 2002 and April 2006. Clinical characteristics; echocardiographic right ventricular dysfunction; inhospital mortality; and adverse clinical events including need for inotropic support, mechanical ventilation, and thrombolysis were compared in patients with elevated cTnI levels vs patients with normal cTnI levels. One hundred sixteen patients with PE were identified, and 77 of them (66%) were included in the study. Thirty-three patients (42%) had elevated cTnI levels. Elevated cTnI levels were associated with inhospital mortality (P = .02), complicated clinical course (P<.001), and right ventricular dysfunction (P<.001)., In patients with elevated cTnI levels, inhospital mortality (odds ratio [OR], 3.31; 95% confidence interval [CI], 1.82-9.29), hypotension (OR, 7.37; 95% CI, 2.31-23.28), thrombolysis (OR, 5.71; 95% CI, 1.63-19.92), need for mechanical ventilation (OR, 5.00; 95% CI, 1.42-17.57), and need for inotropic support (OR, 3.02; 95% CI, 1.03-8.85) were more prevalent. The patients with elevated cTnI levels had more serious vital parameters (systolic blood pressure, pulse, and oxygen saturation) at ED presentation., Elevation of cardiac troponin (cTn) is considered specific for myocardial damage. Elevated cTn and echocardiogrpahic documentation of wall motion abnormalities (WMAs) that were recorded after extreme physical effort raise the question whether dobutamine stress echo (DSE), can also induce elevation of troponin. 
we prospective enrolled stable patients (age>18 years) referred to DSE. The exam was performed under standardized conditions. Blood samples for cTnI wer...","[The purpose of this study was to evaluate the value of elevated cardiac troponin I (cTnI) for prediction of complicated clinical course and in-hospital mortality in patients with confirmed acute pulmonary embolism (PE). This study was a retrospective chart review of patients diagnosed as having PE, in whom cTnI testing was obtained at emergency department (ED) presentation between January 2002 and April 2006. Clinical characteristics; echocardiographic right ventricular dysfunction; inhospital mortality; and adverse clinical events including need for inotropic support, mechanical ventilation, and thrombolysis were compared in patients with elevated cTnI levels vs patients with normal cTnI levels. One hundred sixteen patients with PE were identified, and 77 of them (66%) were included in the study. Thirty-three patients (42%) had elevated cTnI levels. Elevated cTnI levels were associated with inhospital mortality (P = .02), complicated clinical course (P<.001), and right ventricular dysfunction (P<.001). In patients with elevated cTnI levels, inhospital mortality (odds ratio [OR], 3.31; 95% confidence interval [CI], 1.82-9.29), hypotension (OR, 7.37; 95% CI, 2.31-23.28), thrombolysis (OR, 5.71; 95% CI, 1.63-19.92), need for mechanical ventilation (OR, 5.00; 95% CI, 1.42-17.57), and need for inotropic support (OR, 3.02; 95% CI, 1.03-8.85) were more prevalent. 
The patients with elevated cTnI levels had more serious vital parameters (systolic blood pressure, pulse, and oxygen saturation) at ED presentation.]","Based on the provided context information, the answer to the query is yes.\n\nThe context information states that the purpose of the study was to evaluate the value of elevated cardiac troponin I (cTnI) for prediction of complicated clinical course and in-hospital mortality in patients with confirmed acute pulmonary embolism (PE). The study found that elevated cTnI levels were associated with in-hospital mortality (P = .02), complicated clinical course (P<.001), and right ventricular dysfunction (P<.001). Additionally, in patients with elevated cTnI levels, inhospital mortality (odds ratio [OR], 3.31; 95% confidence interval [CI], 1.82-9.29), hypotension (OR, 7.37; 95% CI, 2.31-23.28), thrombolysis (OR, 5.71; 95% CI, 1.63-19.92), need for mechanical ventilation (OR, 5.00; 95% CI, 1.42-17.57), and need for inotropic support (OR, 3.02; 95% CI, 1.03-8.85",Our results indicate that elevated cTnI levels are associated with higher risk for inhospital mortality and complicated clinical course. Troponin I may play an important role for the risk assessment of patients with PE. The idea that an elevation in cTnI levels is a valuable parameter for the risk stratification of patients with PE needs to be examined in larger prospective studies.,0.9,0.969885,1.0,1.0


# Experiment tracking with RAG Metrics using Vertex AI Evaluation service

We can also use similar metrics in the Vertex AI Evaluation service and track the experiments there.






In [120]:
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PointwiseMetric

# See all the available metric examples
MetricPromptTemplateExamples.list_example_metric_names()

['coherence',
 'fluency',
 'safety',
 'groundedness',
 'instruction_following',
 'verbosity',
 'text_quality',
 'summarization_quality',
 'question_answering_quality',
 'multi_turn_chat_quality',
 'multi_turn_safety',
 'pairwise_coherence',
 'pairwise_fluency',
 'pairwise_safety',
 'pairwise_groundedness',
 'pairwise_instruction_following',
 'pairwise_verbosity',
 'pairwise_text_quality',
 'pairwise_summarization_quality',
 'pairwise_question_answering_quality',
 'pairwise_multi_turn_chat_quality',
 'pairwise_multi_turn_safety']

The Ragas *faithfulness* metric roughly maps to *groundedness*, and the Ragas *relevancy* metric roughly maps to *question_answering_quality*.

In [121]:
print(f"Groundedness:\n {MetricPromptTemplateExamples.get_prompt_template('groundedness')}")

Groundedness:
 
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Metric Definition
You will be assessing groundedness, which measures the ability to provide or reference information included only in the user prompt.

## Criteria
Groundedness: The response contains information included only in the user prompt. The response does not reference any outside information.

## Rating Rubric
1: (Fully grounded). All aspects of the response are attributable to the context.
0: (No

In [122]:
print(f"Question answering quality:\n {MetricPromptTemplateExamples.get_prompt_template('question_answering_quality')}")

Question answering quality:
 
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in user input. The instruction for performing a question-answering task is provided in the user prompt.

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruc

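Format the dataset. The two metrics used below score a `response` against a `prompt`, so each row combines the question and the retrieved contexts into a single prompt string.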

In [127]:
eval_dataset_vertex_arr = []
for d in all_eval_data:
  # print(d)
  eval_data2 = dict({
      # "user_input": query,
      "response": d['response'],
      # "retrieved_contexts": [node.text for node in retrieved_nodes],
      # "reference": sample_elm['long_answer'],
      # "reference_contexts": [sample_elm["context"]],
      # jkwng: construct the full prompt from query and retrieved contexts
      "contexts": d['retrieved_contexts'],
      "prompt": f"""
Answer the question based on the contexts provided.
Question: {d['user_input']}
Contexts:\n
{d['retrieved_contexts']}
"""
  })

  print(eval_data2)
  eval_dataset_vertex_arr.append(eval_data2)


eval_dataset_vertex = df.from_dict(eval_dataset_vertex_arr)

{'response': 'Based on the context information, histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was also significantly associated with the presence of histologic chorioamnionitis (p = 0.019). This suggests that histologic chorioamnionitis does correspond to clinical chorioamnionitis, as the presence of histologic chorioamnionitis is linked to clinical indicators of infection.\n\nyes', 'contexts': ['To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother. A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnioni
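The prompt built above interpolates the Python list of contexts directly, so the evaluator also sees the list brackets and quotes. Here is a minimal sketch of a helper that joins the contexts into numbered passages first. `build_eval_row` and its prompt layout are hypothetical: they are not part of the notebook or the Vertex AI SDK.

```python
from typing import Dict, List


def build_eval_row(question: str, contexts: List[str], response: str) -> Dict[str, object]:
    """Build one evaluation row with `prompt`, `response`, and `contexts` keys."""
    # Number each retrieved passage so the evaluator can tell them apart.
    numbered = "\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(contexts, start=1)
    )
    prompt = (
        "Answer the question based on the contexts provided.\n"
        f"Question: {question}\n"
        "Contexts:\n"
        f"{numbered}\n"
    )
    return {"prompt": prompt, "response": response, "contexts": contexts}


# Toy inputs for illustration only.
row = build_eval_row(
    question="Is A associated with B?",
    contexts=["Passage one. ", "Passage two."],
    response="yes",
)
print(row["prompt"])
```

Whether a cleaner prompt changes the evaluator's ratings is something to check empirically rather than assume.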

In [132]:
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate, MetricPromptTemplateExamples

EXPERIMENT = "pubmedqa-rag-experiment"

eval_task = EvalTask(
    dataset=eval_dataset_vertex,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY],
    experiment=EXPERIMENT,
    output_uri_prefix="gs://jkwng-vertex-experiments/pubmedqa-rag-experiment",
)

eval_result = eval_task.evaluate(experiment_run_name="baseline")

INFO:google.cloud.aiplatform.metadata.experiment_resources:Associating projects/205512073711/locations/us-central1/metadataStores/default/contexts/pubmedqa-rag-experiment-baseline to Experiment: pubmedqa-rag-experiment


INFO:vertexai.evaluation._evaluation:Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 20/20 [00:19<00:00,  1.00it/s]
INFO:vertexai.evaluation._evaluation:All 20 metric requests are successfully computed.
INFO:vertexai.evaluation._evaluation:Evaluation Took:20.005880799000806 seconds


### 5.1 - Dense Search
Set *hybrid_search_alpha* to 1.0 for dense vector search.

In [133]:
rag_cfg["hybrid_search_alpha"] = 1.0

In [152]:
# Recreate query engine
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index
).create(**query_engine_args)

# Get response
all_eval_data = []
eval_dataset_vertex_arr = []
for i in range(10):
  sample_elm = pubmed_data[i]
  query = sample_elm['question']
  retrieved_nodes = query_engine.retriever.retrieve(query)
  response = query_engine.query(query)

  eval_data = dict({
      "user_input": query,
      "response": response.response,
      "retrieved_contexts": [node.text for node in retrieved_nodes],
      "reference": sample_elm['long_answer'],
      "reference_contexts": [sample_elm["context"]],
  })
  all_eval_data.append(eval_data)

  eval_data2 = dict({
      "response": response.response,
      # jkwng: construct the full prompt from query and retrieved contexts
      # Use this sample's contexts; `d` is a stale variable from an earlier cell.
      "contexts": eval_data['retrieved_contexts'],
      "prompt": f"""
Answer the question based on the contexts provided.
Question: {query}
Contexts:\n
{eval_data['retrieved_contexts']}
"""
  })

  #print(eval_data2)
  eval_dataset_vertex_arr.append(eval_data2)

eval_dataset_vertex = df.from_dict(eval_dataset_vertex_arr)

eval_task = EvalTask(
    dataset=eval_dataset_vertex,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY],
    experiment=EXPERIMENT,
    output_uri_prefix="gs://jkwng-vertex-experiments/pubmedqa-rag-experiment",
)

eval_result = eval_task.evaluate(experiment_run_name="dense-search")

{'hybrid_search_alpha': 0.0,
 'query_mode': 'default',
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': False}


INFO:google.cloud.aiplatform.metadata.experiment_resources:Associating projects/205512073711/locations/us-central1/metadataStores/default/contexts/pubmedqa-rag-experiment-dense-search to Experiment: pubmedqa-rag-experiment


INFO:vertexai.evaluation._evaluation:Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 20/20 [00:30<00:00,  1.54s/it]
INFO:vertexai.evaluation._evaluation:All 20 metric requests are successfully computed.
INFO:vertexai.evaluation._evaluation:Evaluation Took:30.725962171998617 seconds
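The collect-and-evaluate loop is repeated for each retrieval configuration, so it may be worth factoring into a helper. The sketch below shows one possible shape. `collect_eval_rows` is a hypothetical name, and retrieval and generation are passed in as plain callables so the example runs without a query engine.

```python
from typing import Any, Callable, Dict, List, Sequence


def collect_eval_rows(
    samples: Sequence[Dict[str, Any]],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str], str],
) -> List[Dict[str, Any]]:
    """Run each question through retrieval and generation; return one row per sample."""
    rows = []
    for sample in samples:
        question = sample["question"]
        contexts = retrieve(question)  # list of retrieved passage texts
        response = answer(question)    # generated answer string
        rows.append({
            "user_input": question,
            "response": response,
            "retrieved_contexts": contexts,
            "reference": sample["long_answer"],
            "prompt": (
                "Answer the question based on the contexts provided.\n"
                f"Question: {question}\n"
                f"Contexts:\n{contexts}\n"
            ),
        })
    return rows


# Stand-in callables so the sketch is runnable on its own.
demo_samples = [{"question": "Q1?", "long_answer": "A1"}]
demo_rows = collect_eval_rows(
    demo_samples,
    retrieve=lambda q: [f"context for {q}"],
    answer=lambda q: "yes",
)
print(demo_rows[0]["prompt"])
```

In this notebook, `retrieve` could wrap `query_engine.retriever.retrieve` (mapping each node to `node.text`) and `answer` could wrap `query_engine.query(...).response`.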


### 5.2 - Hybrid Search
Set *hybrid_search_alpha* to 0.5 for hybrid search, which weights dense and sparse (keyword-based) search equally.

In [153]:
rag_cfg["hybrid_search_alpha"] = 0.5
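For intuition about `hybrid_search_alpha`, a common convention is a weighted sum of the dense and sparse scores, with alpha as the weight on the dense score. The sketch below illustrates only that convention. The actual fusion happens inside the vector store and may normalize or rank-fuse scores differently. `blend_scores` and the example scores are made up for illustration.

```python
def blend_scores(dense: float, sparse: float, alpha: float) -> float:
    """Convex combination: alpha=1.0 uses dense only, alpha=0.0 uses sparse only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * dense + (1.0 - alpha) * sparse


# Illustrative, already-normalized scores for a single document.
dense_score, sparse_score = 0.8, 0.2
for alpha in (0.0, 0.5, 1.0):
    print(alpha, blend_scores(dense_score, sparse_score, alpha))
```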

In [154]:
# Recreate query engine
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index
).create(**query_engine_args)

# Get response
response = query_engine.query(query)

# Print response
print(f'\n\nQUERY: {query}\n')
print(f'RESPONSE:\n{delim}\n{response.response}\n{delim}\n')
print(f'YES/NO: {extract_yes_no(response.response)}\n')
print(f'GT ANSWER: {sample_elm["answer"][0]}\n')
print(f'GT LONG ANSWER:\n{delim}\n{sample_elm["long_answer"]}\n{delim}')

{'hybrid_search_alpha': 0.5,
 'query_mode': 'default',
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': False}


QUERY: Can elevated troponin I levels predict complicated clinical course and inhospital mortality in patients with acute pulmonary embolism?

RESPONSE:
-------------------------
Based on the provided context information, the answer to the query is yes.

The context information states that the purpose of the study was to evaluate the value of elevated cardiac troponin I (cTnI) for prediction of complicated clinical course and in-hospital mortality in patients with confirmed acute pulmonary embolism (PE). The study found that elevated cTnI levels were associated with in-hospital mortality (P = .02), complicated clinical course (P<.001), and right ventricular dysfunction (P<.001). Additionally, in patients with elevated cTnI levels, inhospital mortality (odds ratio [OR], 3.31; 95% confidence interval [CI], 1.82-9.29), hypotension (OR, 7.37; 95% CI, 2.31-23

Collect the query responses and evaluate.

In [155]:
# Get response
all_eval_data = []
eval_dataset_vertex_arr = []
for i in range(10):
  sample_elm = pubmed_data[i]
  query = sample_elm['question']
  retrieved_nodes = query_engine.retriever.retrieve(query)
  response = query_engine.query(query)

  eval_data = dict({
      "user_input": query,
      "response": response.response,
      "retrieved_contexts": [node.text for node in retrieved_nodes],
      "reference": sample_elm['long_answer'],
      "reference_contexts": [sample_elm["context"]],
  })
  all_eval_data.append(eval_data)

  eval_data2 = dict({
      "response": response.response,
      # jkwng: construct the full prompt from query and retrieved contexts
      # Use this sample's contexts; `d` is a stale variable from an earlier cell.
      "contexts": eval_data['retrieved_contexts'],
      "prompt": f"""
Answer the question based on the contexts provided.
Question: {query}
Contexts:\n
{eval_data['retrieved_contexts']}
"""
  })

  #print(eval_data2)
  eval_dataset_vertex_arr.append(eval_data2)

eval_dataset_vertex = df.from_dict(eval_dataset_vertex_arr)

eval_task = EvalTask(
    dataset=eval_dataset_vertex,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY],
    experiment=EXPERIMENT,
    output_uri_prefix="gs://jkwng-vertex-experiments/pubmedqa-rag-experiment",
)

eval_result = eval_task.evaluate(experiment_run_name="hybrid-search")

INFO:google.cloud.aiplatform.metadata.experiment_resources:Associating projects/205512073711/locations/us-central1/metadataStores/default/contexts/pubmedqa-rag-experiment-hybrid-search to Experiment: pubmedqa-rag-experiment


INFO:vertexai.evaluation._evaluation:Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 20/20 [00:27<00:00,  1.37s/it]
INFO:vertexai.evaluation._evaluation:All 20 metric requests are successfully computed.
INFO:vertexai.evaluation._evaluation:Evaluation Took:27.486273589998746 seconds


### 5.3 - Using Re-ranker
Set *use_reranker* to *True* to re-rank the context after retrieving it from the vector database.

In [156]:
rag_cfg["use_reranker"] = True
rag_cfg["hybrid_search_alpha"] = 1.0 # Using dense search
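Conceptually, re-ranking re-scores the `similarity_top_k` retrieved passages against the query and keeps only the best `rerank_top_k`. The sketch below shows that two-step shape. It uses a toy word-overlap scorer as a stand-in for a real re-ranking model, and `overlap_score` and `rerank` are hypothetical names that do not come from the notebook.

```python
from typing import Callable, List


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


def rerank(
    query: str,
    passages: List[str],
    top_k: int,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Re-score the retrieved passages and keep the top_k highest-scoring ones."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:top_k]


# Toy passages standing in for the retriever's output.
retrieved = [
    "dobutamine stress echo and troponin",
    "troponin levels in pulmonary embolism patients",
    "placental histology findings",
]
top = rerank("troponin pulmonary embolism", retrieved, top_k=2)
print(top)
```

A real re-ranker would replace `score_fn` with a model that scores each query-passage pair jointly. The sort-and-truncate step stays the same.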

In [157]:
# Recreate query engine
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index
).create(**query_engine_args)

# Get response
response = query_engine.query(query)

# Print response
print(f'\n\nQUERY: {query}\n')
print(f'RESPONSE:\n{delim}\n{response.response}\n{delim}\n')
print(f'YES/NO: {extract_yes_no(response.response)}\n')
print(f'GT ANSWER: {sample_elm["answer"][0]}\n')
print(f'GT LONG ANSWER:\n{delim}\n{sample_elm["long_answer"]}\n{delim}')

{'hybrid_search_alpha': 1.0,
 'query_mode': 'default',
 'rerank_top_k': 3,
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': True}


QUERY: Can elevated troponin I levels predict complicated clinical course and inhospital mortality in patients with acute pulmonary embolism?

RESPONSE:
-------------------------
Based on the provided context information, the answer to the query is yes.

Explanation: The study found a significant association between elevated cardiac troponin I (cTnI) levels and complicated clinical course, as well as in-hospital mortality in patients with acute pulmonary embolism. The results showed that patients with elevated cTnI levels had a higher prevalence of inhospital mortality, hypotension, thrombolysis, need for mechanical ventilation, and need for inotropic support. Additionally, these patients had more serious vital parameters at emergency department presentation. The statistical analysis also supported these findings, with P-values indicat

Collect the query responses and evaluate.

In [158]:
# Get response
all_eval_data = []
eval_dataset_vertex_arr = []
for i in range(10):
  sample_elm = pubmed_data[i]
  query = sample_elm['question']
  retrieved_nodes = query_engine.retriever.retrieve(query)
  response = query_engine.query(query)

  eval_data = dict({
      "user_input": query,
      "response": response.response,
      "retrieved_contexts": [node.text for node in retrieved_nodes],
      "reference": sample_elm['long_answer'],
      "reference_contexts": [sample_elm["context"]],
  })
  all_eval_data.append(eval_data)

  eval_data2 = dict({
      "response": response.response,
      # jkwng: construct the full prompt from query and retrieved contexts
      # Use this sample's contexts; `d` is a stale variable from an earlier cell.
      "contexts": eval_data['retrieved_contexts'],
      "prompt": f"""
Answer the question based on the contexts provided.
Question: {query}
Contexts:\n
{eval_data['retrieved_contexts']}
"""
  })

  #print(eval_data2)
  eval_dataset_vertex_arr.append(eval_data2)

eval_dataset_vertex = df.from_dict(eval_dataset_vertex_arr)

eval_task = EvalTask(
    dataset=eval_dataset_vertex,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY],
    experiment=EXPERIMENT,
    output_uri_prefix="gs://jkwng-vertex-experiments/pubmedqa-rag-experiment",
)

eval_result = eval_task.evaluate(experiment_run_name="dense-search-rerank")

INFO:google.cloud.aiplatform.metadata.experiment_resources:Associating projects/205512073711/locations/us-central1/metadataStores/default/contexts/pubmedqa-rag-experiment-dense-search-rerank to Experiment: pubmedqa-rag-experiment


INFO:vertexai.evaluation._evaluation:Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 20/20 [00:31<00:00,  1.55s/it]
INFO:vertexai.evaluation._evaluation:All 20 metric requests are successfully computed.
INFO:vertexai.evaluation._evaluation:Evaluation Took:31.057819511999696 seconds
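Once several runs are logged, their summaries can be compared side by side. The sketch below assumes each run's summary is a dict keyed like `groundedness/mean`, which is the naming we believe `eval_result.summary_metrics` uses; check it against your SDK version. `format_run_comparison` is a hypothetical helper, and the values are placeholders, not results from this notebook.

```python
from typing import Dict, List


def format_run_comparison(summaries: Dict[str, Dict[str, float]], metrics: List[str]) -> str:
    """Render a plain-text table with one line per run and one column per metric."""
    header = "run".ljust(22) + "".join(m.ljust(36) for m in metrics)
    lines = [header]
    for run_name, summary in summaries.items():
        # Missing metrics are shown as nan rather than raising.
        cells = "".join(f"{summary.get(m, float('nan')):<36.3f}" for m in metrics)
        lines.append(run_name.ljust(22) + cells)
    return "\n".join(lines)


# Placeholder values for illustration only; substitute each run's real summary.
placeholder_summaries = {
    "run-a": {"groundedness/mean": 0.0, "question_answering_quality/mean": 0.0},
    "run-b": {"groundedness/mean": 0.0, "question_answering_quality/mean": 0.0},
}
table = format_run_comparison(
    placeholder_summaries,
    ["groundedness/mean", "question_answering_quality/mean"],
)
print(table)
```

With only 10 samples per run, small differences between configurations should be read cautiously.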
