# PubMed QA using LlamaIndex

## Introduction
This notebook presents a RAG workflow for the [PubMed QA](https://pubmedqa.github.io/) task using [LlamaIndex](https://www.llamaindex.ai/). The code is written in a configurable fashion, giving you the flexibility to edit the RAG configuration and observe the change in output/responses.

It covers a step-by-step procedure for building the RAG workflow (Stages 1-4) and later runs the pipeline on a sample from the dataset. The notebook also covers sparse, dense, and hybrid retrieval strategies, along with the re-ranker. We have also added an optional component for RAG evaluation using the [Ragas](https://docs.ragas.io/en/stable/) library.

### <u>Requirements</u>
1. As you will be accessing the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:

      Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the API key, create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following environment variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
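The original notebook loads this file with a `load_env_file` helper from `utils.load_secrets`, and that helper's code is not shown here. As a hedged stand-in, the sketch below parses `export KEY=value` lines into `os.environ`. The function name `load_export_file` is hypothetical, and it assumes simple one-line assignments like the ones above.

```python
import os
from pathlib import Path


def load_export_file(path):
    """Hypothetical helper: parse `export KEY=value` lines into os.environ.

    Assumes one assignment per line. Blank lines and `#` comments are skipped,
    and matching surrounding single or double quotes are stripped from values.
    """
    loaded = {}
    for raw_line in Path(path).expanduser().read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].strip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
            value = value[1:-1]
        loaded[key.strip()] = value
    os.environ.update(loaded)
    return loaded
```

For example, `load_export_file("~/.kscope.env")` would set `OPENAI_BASE_URL` and `OPENAI_API_KEY` for the current kernel.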

## STAGE 0 - Set up the RAG workflow environment

#### Import libraries, custom classes and functions

In [None]:
%pip install --quiet \
  llama-index \
  google-cloud-secret-manager \
  datasets \
  llama-index-readers-json \
  llama-index-readers-file \
  llama-index-readers-gcs \
  llama-index-embeddings-vertex \
  llama-index-embeddings-google-genai \
  llama-index-embeddings-huggingface \
  llama-index-embeddings-text-embeddings-inference \
  llama-index-embeddings-vertex-endpoint \
  llama-index-llms-huggingface \
  llama-index-llms-openai-like \
  llama-index-llms-vertex \
  faiss-cpu \
  llama-index-vector-stores-faiss \
  llama-index-vector-stores-weaviate \
  llama-index-vector-stores-vertexaivectorsearch \
  llama-index-retrievers-bm25 \
  rapidfuzz \
  ragas \
  "pydantic>=2.10.4" \
  "google-cloud-aiplatform>=1.76" \
  langchain-core \
  langchain-cohere \
  langchain-huggingface \
  langchain-google-vertexai
# jkwng: restart the kernel after this

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import sys
import os
import random

from pathlib import Path
from pprint import pprint

from llama_index.core import Settings, set_global_handler
from llama_index.core.node_parser import SentenceSplitter


# jkwng: in order to make this notebook self contained, i just cut and paste these into the notebook
# from task_dataset import PubMedQATaskDataset

# from utils.hosting_utils import RAGLLM
# from utils.rag_utils import (
#     DocumentReader, RAGEmbedding, RAGQueryEngine, RagasEval,
#     extract_yes_no, validate_rag_cfg
#     )
# from utils.storage_utils import RAGIndex

#### Load config files

*jkwng: we don't need this cell*

In [None]:
# Add root folder of the rag_bootcamp repo to PYTHONPATH
current_dir = Path().resolve()
parent_dir = current_dir.parent
sys.path.insert(0, str(parent_dir))


# from utils.load_secrets import load_env_file
# load_env_file()

*jkwng: or this cell*

In [None]:
# GENERATOR_BASE_URL = os.environ.get("OPENAI_BASE_URL")

# OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

*jkwng: however we need these variables specifically for deployment on Google Cloud*

In [None]:
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
REGION = os.environ.get("GOOGLE_CLOUD_REGION")
GCS_URI = "jkwng-vertex-experiments/rag_bootcamp/pubmed_qa"
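`GCS_URI` is written as a bucket name followed by an object prefix, without a `gs://` scheme. It is only used by the GCS TODOs further down. `google-cloud-storage` takes the bucket name and blob names separately, so a small splitter can help. The sketch below assumes that reading of `GCS_URI`, and `split_gcs_uri` is a hypothetical helper.

```python
def split_gcs_uri(uri):
    """Hypothetical helper: split 'bucket/prefix' (optionally 'gs://...') into (bucket, prefix)."""
    if uri.startswith("gs://"):
        uri = uri[len("gs://"):]
    bucket, _, prefix = uri.partition("/")
    if not bucket:
        raise ValueError(f"No bucket name in {uri!r}")
    # Drop leading/trailing slashes so the prefix can be joined with '/' later
    return bucket, prefix.strip("/")
```

For example, `bucket_name, prefix = split_gcs_uri(GCS_URI)`.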

#### Set RAG configuration

Below: `bge-base-en-v1.5` for embeddings and Llama 3.1 8B Instruct for generation, both hosted on Vertex AI endpoints, with Gemini 2.0 Flash for LLM-based evaluation.

In [None]:
rag_cfg = {
    # Node parser config
    "chunk_size": 256,
    "chunk_overlap": 0,

    # Embedding model config
    # "embed_model_type": "hf",
    # "embed_model_name": "BAAI/bge-base-en-v1.5",
    "embed_model_type": "vertex-endpoint",
    "embed_model_name": "BAAI/bge-base-en-v1.5",
    "embed_model_endpoint_id": "83814671873736704", # endpoint id
    "embed_model_use_dedicated_endpoint": True,
    "embed_model_dedicated_dns": "83814671873736704.us-central1-205512073711.prediction.vertexai.goog",

    # LLM config
    # "llm_type": "kscope",
    # "llm_name": "Meta-Llama-3.1-8B-Instruct",
    "llm_type": "vertex-endpoint",
    "llm_name": "meta-llama/Llama-3.1-8B-Instruct",
    "llm_endpoint_id": "133354267774812160",
    "llm_use_dedicated_endpoint": True,
    "llm_dedicated_dns": "133354267774812160.us-central1-205512073711.prediction.vertexai.goog",
    "max_new_tokens": 256,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": 50,
    "do_sample": False,

    # Vector DB config
    "vector_db_type": "weaviate", # "weaviate", "local" (FAISS) or "vertex"
    # "vector_db_type": "vertex",
    "vector_db_name": "Pubmed_QA",
    # MODIFY THIS
    "weaviate_url": "https://ds4tx7ttr3ciaui5obmowg.c0.us-east1.gcp.weaviate.cloud",

    # Retriever and query config
    "retriever_type": "vector_index", # "vector_index"
    "retriever_similarity_top_k": 5,
    "query_mode": "default", # "default", "hybrid" - jkwng: changed to default
    "hybrid_search_alpha": 0.0, # float from 0.0 (sparse search - bm25) to 1.0 (vector search)
    "response_mode": "compact",
    "use_reranker": False,
    "rerank_top_k": 3,

    # Evaluation config
    # "eval_llm_type": "kscope",
    # "eval_llm_name": "Meta-Llama-3.1-8B-Instruct",
    "eval_llm_type": "vertex",
    "eval_llm_name": "gemini-2.0-flash-001"
}

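Also provided: `text-embedding-005` for embeddings and Gemini 2.0 Flash for both generation and evaluation. This cell runs after the previous one, so it overrides `rag_cfg`. Run only the cell for the setup you want.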

In [None]:
rag_cfg = {
    # Node parser config
    "chunk_size": 256,
    "chunk_overlap": 0,

    # Embedding model config
    # "embed_model_type": "hf",
    # "embed_model_name": "BAAI/bge-base-en-v1.5",
    "embed_model_type": "vertex",
    "embed_model_name": "text-embedding-005",

    # LLM config
    # "llm_type": "kscope",
    # "llm_name": "Meta-Llama-3.1-8B-Instruct",
    "llm_type": "vertex",
    "llm_name": "gemini-2.0-flash-001",
    "max_new_tokens": 256,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": 50,
    "do_sample": False,

    # Vector DB config
    "vector_db_type": "weaviate", # "weaviate", "local" (FAISS) or "vertex"
    # "vector_db_type": "vertex",
    "vector_db_name": "Pubmed_QA",
    # MODIFY THIS
    "weaviate_url": "https://ds4tx7ttr3ciaui5obmowg.c0.us-east1.gcp.weaviate.cloud",

    # Retriever and query config
    "retriever_type": "vector_index", # "vector_index"
    "retriever_similarity_top_k": 5,
    "query_mode": "default", # "default", "hybrid" - jkwng: changed to default
    "hybrid_search_alpha": 0.0, # float from 0.0 (sparse search - bm25) to 1.0 (vector search)
    "response_mode": "compact",
    "use_reranker": False,
    "rerank_top_k": 3,

    # Evaluation config
    # "eval_llm_type": "kscope",
    # "eval_llm_name": "Meta-Llama-3.1-8B-Instruct",
    "eval_llm_type": "vertex",
    "eval_llm_name": "gemini-2.0-flash-001"
}

#### Read Weaviate Key

*jkwng: load weaviate API key from secret manager*

In [None]:
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()

# Access the secret
name = f"projects/{PROJECT_ID}/secrets/weaviate_key/versions/latest"
response = client.access_secret_version(request={"name": name})

# Extract the secret value (not printed, to keep the key out of the notebook output)
weaviate_key = response.payload.data.decode("UTF-8")
WEAVIATE_API_KEY = weaviate_key

# try:
#     f = open(Path.home() / ".weaviate.key", "r")
#     f.close()
# except Exception as err:
#     print(f"Could not read your Weaviate key. Please make sure this is available in plain text under your home directory in ~/.weaviate.key: {err}")

#### Preliminary config checks

In [None]:
#@title *jkwng: validate_rag_cfg from utils.rag_utils.py*
def validate_rag_cfg(cfg):
    if cfg["query_mode"] == "hybrid":
        assert (
            cfg["hybrid_search_alpha"] is not None
        ), "hybrid_search_alpha cannot be None if query_mode is set to 'hybrid'"
    if cfg["vector_db_type"] == "weaviate":
        assert (
            cfg["weaviate_url"] is not None
        ), "weaviate_url cannot be None for weaviate vector db"

In [None]:
validate_rag_cfg(rag_cfg)
pprint(rag_cfg)

{'chunk_overlap': 0,
 'chunk_size': 256,
 'do_sample': False,
 'embed_model_name': 'text-embedding-005',
 'embed_model_type': 'vertex',
 'eval_llm_name': 'gemini-2.0-flash-001',
 'eval_llm_type': 'vertex',
 'hybrid_search_alpha': 0.0,
 'llm_name': 'gemini-2.0-flash-001',
 'llm_type': 'vertex',
 'max_new_tokens': 256,
 'query_mode': 'default',
 'rerank_top_k': 3,
 'response_mode': 'compact',
 'retriever_similarity_top_k': 5,
 'retriever_type': 'vector_index',
 'temperature': 0.0,
 'top_k': 50,
 'top_p': 1.0,
 'use_reranker': False,
 'vector_db_name': 'Pubmed_QA',
 'vector_db_type': 'weaviate',
 'weaviate_url': 'https://ds4tx7ttr3ciaui5obmowg.c0.us-east1.gcp.weaviate.cloud'}
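`validate_rag_cfg` only checks the hybrid-search and Weaviate settings. The `vertex-endpoint` model types also read endpoint keys from `rag_cfg`: `embed_model_endpoint_id` and `embed_model_use_dedicated_endpoint` for embeddings, and `llm_endpoint_id` plus `llm_dedicated_dns` (when a dedicated endpoint is used) for the LLM. If one of these keys is missing, the error only shows up later as a `KeyError`. A possible extra check is sketched below. `validate_endpoint_cfg` is a hypothetical helper, not part of the repo's utils.

```python
def validate_endpoint_cfg(cfg):
    """Hypothetical extra check for the 'vertex-endpoint' model types in rag_cfg."""
    missing = []
    if cfg.get("embed_model_type") == "vertex-endpoint":
        # RAGEmbedding reads both of these keys for Vertex endpoint embeddings
        for key in ("embed_model_endpoint_id", "embed_model_use_dedicated_endpoint"):
            if cfg.get(key) is None:
                missing.append(key)
    if cfg.get("llm_type") == "vertex-endpoint":
        # RAGLLM builds the endpoint resource name from the endpoint id
        if cfg.get("llm_endpoint_id") is None:
            missing.append("llm_endpoint_id")
        # Dedicated endpoints are addressed through their own DNS name
        if cfg.get("llm_use_dedicated_endpoint") and cfg.get("llm_dedicated_dns") is None:
            missing.append("llm_dedicated_dns")
    assert not missing, f"Missing config keys for vertex-endpoint models: {missing}"
```

With the Gemini configuration above, `validate_endpoint_cfg(rag_cfg)` passes trivially because neither model type is `vertex-endpoint`.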


## STAGE 1 - Load dataset and documents

#### 1. Load PubMed QA dataset
PubMedQA ([github](https://github.com/pubmedqa/pubmedqa)) is a biomedical question answering dataset. Each instance consists of a question, a context (extracted from PubMed abstracts), a long answer and a yes/no/maybe answer. This notebook uses the test split of fold 0 of [this](https://huggingface.co/datasets/bigbio/pubmed_qa) Hugging Face dataset.

**The context for each instance is stored as a text file** (referred to as a document), which frames the task as a standard RAG use case.
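PubMedQA is scored on the final yes/no/maybe label, so the free-text RAG response has to be mapped to one of those labels. The original notebook imports an `extract_yes_no` helper from `utils.rag_utils` (commented out above), but its code is not shown. The sketch below is one hedged way to do this mapping. `extract_label` and `accuracy` are hypothetical names, and the sketch assumes gold labels are plain strings. If the dataset's `answer` field is stored as a list, compare against its first element instead.

```python
import re


def extract_label(response_text):
    """Hypothetical stand-in for `extract_yes_no`: first standalone yes/no/maybe, else None."""
    # Word boundaries avoid matching 'no' inside words such as 'not' or 'nothing'
    match = re.search(r"\b(yes|no|maybe)\b", response_text, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def accuracy(predictions, gold_labels):
    """Fraction of predictions that exactly match the gold yes/no/maybe labels."""
    assert len(predictions) == len(gold_labels)
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels) if gold_labels else 0.0
```

Taking the first label mentioned is a simple heuristic. Prompting the LLM to begin its answer with the label makes it more reliable.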

In [None]:
#@title *jkwng: task_dataset.py*
import os
import json
import torch.utils.data as data
from tqdm import tqdm
from datasets import load_dataset, concatenate_datasets

class PubMedQATaskDataset(data.Dataset):
    def __init__(self, name, all_folds=False, split="test"):
        self.name = name
        subset_str = "pubmed_qa_labeled_fold{fold_id}"
        folds = [0] if not all_folds else list(range(10))

        bigbio_data = []
        source_data = []
        for fold_id in folds:
            bb_data = load_dataset(
                self.name,
                f"{subset_str.format(fold_id=fold_id)}_bigbio_qa",
                split=split,
                trust_remote_code=True,
            )
            s_data = load_dataset(
                self.name,
                f"{subset_str.format(fold_id=fold_id)}_source",
                split=split,
                trust_remote_code=True,
            )
            bigbio_data.append(bb_data)
            source_data.append(s_data)
        bigbio_data = concatenate_datasets(bigbio_data)
        source_data = concatenate_datasets(source_data)

        keys_to_keep = ["id", "question", "context", "answer", "LONG_ANSWER"]
        data_elms = []
        for elm_idx in tqdm(range(len(bigbio_data)), desc="Preparing data"):
            data_elms.append({k: bigbio_data[elm_idx][k] for k in keys_to_keep[:4]})
            data_elms[-1].update(
                {keys_to_keep[-1].lower(): source_data[elm_idx][keys_to_keep[-1]]}
            )

        self.data = data_elms

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

    def mock_knowledge_base(
        self,
        output_dir,
        one_file_per_sample=False,
        samples_per_file=500,
        sep="\n",
        jsonl=False,
    ):
        """
        Write PubMed contexts to text files: one file per sample, or
        `samples_per_file` contexts per file separated by `sep`.
        """
        pubmed_kb_dir = os.path.join(output_dir, "pubmed_doc")
        os.makedirs(pubmed_kb_dir, exist_ok=True)

        file_ext = "jsonl" if jsonl else "txt"

        if not one_file_per_sample:
            context_str = ""
            context_files = []
            for idx in range(len(self.data)):
                if jsonl:
                    context_elm_str = json.dumps(
                        {
                            "id": self.data[idx]["id"],
                            "context": self.data[idx]["context"],
                        }
                    )
                else:
                    context_elm_str = self.data[idx]["context"]
                context_str += f"{context_elm_str}{sep}"
                # Start a new file every `samples_per_file` samples, and flush the
                # final partial file so that no sample is dropped
                if (idx + 1) % samples_per_file == 0 or idx == len(self.data) - 1:
                    context_files.append(context_str.rstrip(sep))
                    context_str = ""

            for file_idx in range(len(context_files)):
                filepath = os.path.join(pubmed_kb_dir, f"context{file_idx}.{file_ext}")
                with open(filepath, "w") as f:
                    f.write(context_files[file_idx])

        else:
            assert not jsonl, "Does not support jsonl if one_file_per_sample is True"
            for idx in range(len(self.data)):
                filepath = os.path.join(
                    pubmed_kb_dir, f'{self.data[idx]["id"]}.{file_ext}'
                )
                with open(filepath, "w") as f:
                    f.write(self.data[idx]["context"])

In [None]:
print('Loading PubMed QA data ...')
pubmed_data = PubMedQATaskDataset('bigbio/pubmed_qa')
print(f"Loaded data size: {len(pubmed_data)}")
pubmed_data.mock_knowledge_base(output_dir='./data', one_file_per_sample=True)

Loading PubMed QA data ...


Preparing data: 100%|██████████| 500/500 [00:00<00:00, 1133.98it/s]


Loaded data size: 500


*jkwng: TODO: write knowledge base to GCS - to simulate loading knowledge base from object storage*
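One way to approach this TODO is sketched below. It assumes the documents are uploaded under the bucket and prefix in `GCS_URI` (for example bucket `jkwng-vertex-experiments`, prefix `rag_bootcamp/pubmed_qa`). `plan_gcs_upload` and `upload_dir_to_gcs` are hypothetical helpers. The upload uses the standard `google-cloud-storage` calls `Client().bucket()` and `Blob.upload_from_filename()`, and relies on Application Default Credentials.

```python
import os


def plan_gcs_upload(local_dir, prefix):
    """Hypothetical helper: map every file under local_dir to a blob name under prefix."""
    plan = []
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            # Use '/' separators in blob names regardless of the local OS
            rel_path = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            blob_name = f"{prefix.strip('/')}/{rel_path}" if prefix else rel_path
            plan.append((local_path, blob_name))
    return sorted(plan)


def upload_dir_to_gcs(local_dir, bucket_name, prefix):
    """Upload every file from plan_gcs_upload to gs://bucket_name/prefix/..."""
    from google.cloud import storage  # imported lazily; needs GCP credentials

    bucket = storage.Client().bucket(bucket_name)
    for local_path, blob_name in plan_gcs_upload(local_dir, prefix):
        bucket.blob(blob_name).upload_from_filename(local_path)
```

For example: `upload_dir_to_gcs("./data", "jkwng-vertex-experiments", "rag_bootcamp/pubmed_qa")`.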

#### 2. Load documents
All metadata is excluded by default. Set the *exclude_llm_metadata_keys* and *exclude_embed_metadata_keys* flags to *False* to include it. Please refer to [this](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents.html) and the *DocumentReader* class from *rag_utils.py* for further details.

In [None]:
#@title *jkwng: DocumentReader from utils.rag_utils.py*

from llama_index.core import (
    SimpleDirectoryReader
)

from llama_index.readers.json import JSONReader

class DocumentReader:
    def __init__(
        self,
        input_dir,
        exclude_llm_metadata_keys=True,
        exclude_embed_metadata_keys=True,
    ):
        self.input_dir = input_dir
        self._file_ext = os.path.splitext(os.listdir(input_dir)[0])[1]

        self.exclude_llm_metadata_keys = exclude_llm_metadata_keys
        self.exclude_embed_metadata_keys = exclude_embed_metadata_keys

    def load_data(self):
        docs = None
        # Use a reader based on the file extension of the documents
        # Supports '.txt' and '.jsonl' files as of now
        if self._file_ext == ".txt":
            reader = SimpleDirectoryReader(input_dir=self.input_dir)
            docs = reader.load_data()
        elif self._file_ext == ".jsonl":
            reader = JSONReader()
            docs = []
            for file in os.listdir(self.input_dir):
                docs.extend(
                    reader.load_data(os.path.join(self.input_dir, file), is_jsonl=True)
                )
        else:
            raise NotImplementedError(
                f"Does not support {self._file_ext} file extension for document files."
            )

        # Choose whether metadata is included when passing a doc to the LLM or embedding model:
        # https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents.html
        # Exclude metadata keys from embeddings or LLMs based on the flags
        if docs:
            all_metadata_keys = list(docs[0].metadata.keys())
            if self.exclude_llm_metadata_keys:
                for doc in docs:
                    doc.excluded_llm_metadata_keys = all_metadata_keys
            if self.exclude_embed_metadata_keys:
                for doc in docs:
                    doc.excluded_embed_metadata_keys = all_metadata_keys

        return docs

In [None]:
print('Loading documents ...')
reader = DocumentReader(input_dir="./data/pubmed_doc")
docs = reader.load_data()
print(f'No. of documents loaded: {len(docs)}')

Loading documents ...
No. of documents loaded: 500


*jkwng: TODO: load the knowledge base from GCS*

## STAGE 2 - Load node parser, embedding model and LLM, and configure `Settings`

#### 1. Load node parser to split documents into smaller chunks

In [None]:
print('Loading node parser ...')
node_parser = SentenceSplitter(chunk_size=rag_cfg['chunk_size'], chunk_overlap=rag_cfg['chunk_overlap'])
nodes = node_parser.get_nodes_from_documents(docs)

Loading node parser ...
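To show what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch. It slides a fixed-size window over whitespace-separated words. This is not `SentenceSplitter`'s algorithm: `SentenceSplitter` counts tokenizer tokens and tries to keep sentences intact. `window_chunks` is a hypothetical name.

```python
def window_chunks(text, chunk_size, chunk_overlap=0):
    """Simplified illustration: fixed windows of `chunk_size` words sharing `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_overlap=0` as in `rag_cfg`, consecutive chunks share no text. A positive overlap repeats the tail of one chunk at the head of the next, which helps when an answer spans a chunk boundary. Checking `len(nodes)` shows how many chunks the real splitter produced.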


#### 2. Load embedding model
LlamaIndex supports embedding models from OpenAI, Cohere, HuggingFace, etc. Please refer to [this](https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#custom-embedding-model) for building a custom embedding model.

In [None]:
#@title *jkwng: RAGEmbedding from utils.rag_utils.py - update to support using Vertex AI Gemini Embeddings models*
from llama_index.embeddings.vertex_endpoint import VertexEndpointEmbedding
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.text_embeddings_inference import TextEmbeddingsInference
import google.auth
import google.auth.transport.requests

credentials, project_id = google.auth.default()
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)

class RAGEmbedding:
    """
    LlamaIndex supports embedding models from OpenAI, Cohere, HuggingFace, etc.
    https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
    We can also build out custom embedding model:
    https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#custom-embedding-model
    """

    def __init__(self, model_type, model_name):
        self.model_type = model_type
        self.model_name = model_name

    def load_model(self, **kwargs):
        print(f"Loading {self.model_type} embedding model ...")
        if self.model_type == "hf":
            # Using bge base HuggingFace embeddings, can choose others based on leaderboard:
            # https://huggingface.co/spaces/mteb/leaderboard
            model = HuggingFaceEmbedding(
                model_name=self.model_name,
                device="cuda",
                trust_remote_code=True,
            )  # max_length does not have any effect?
        elif self.model_type == "vertex":
            model = GoogleGenAIEmbedding(
                model_name=self.model_name,
                vertexai_config={
                  "project": PROJECT_ID,
                  "location": REGION,
                },
                embed_batch_size=100,
            )
        elif self.model_type == "vertex-endpoint":
            model = VertexEndpointEmbedding(
                endpoint_id=kwargs["embed_model_endpoint_id"],
                project_id=PROJECT_ID,
                location=REGION,
                endpoint_kwargs={
                    "use_dedicated_endpoint": kwargs["embed_model_use_dedicated_endpoint"],
                },
            )  # max_length does not have any effect?
        elif self.model_type == "openai":
            # TODO - Add OpenAI embedding model
            # model = OpenAIEmbedding()
            raise NotImplementedError
        else:
            raise NotImplementedError(f"Unsupported embedding model type - {self.model_type}")

        return model

In [None]:
embed_model = RAGEmbedding(model_type=rag_cfg['embed_model_type'], model_name=rag_cfg['embed_model_name']).load_model(**rag_cfg)

Loading vertex embedding model ...
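The `vertex` branch passes `embed_batch_size=100` to `GoogleGenAIEmbedding`. `embed_batch_size` is LlamaIndex's batching setting for embedding models: when many nodes are embedded, texts are grouped into batches of at most that size per call to the provider. The plain-Python sketch below illustrates that grouping and the resulting number of calls. It is not LlamaIndex's own code.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items; the last may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_embedding_requests(num_texts, batch_size):
    """Number of batched embedding calls needed for `num_texts` texts."""
    return math.ceil(num_texts / batch_size)
```

Larger batches mean fewer round trips, but each request must stay within the provider's per-request limits.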


#### 3. Load LLM for generation
LlamaIndex supports LLMs from OpenAI, Cohere, HuggingFace, AI21, etc. Please refer to [this](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html#example-using-a-custom-llm-model-advanced) for loading a custom LLM model for generation.

In [None]:
#@title *jkwng: RAGLLM from utils.hosting_utils.py - updated for Gemini on Vertex*

from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.openai_like import OpenAILike
from llama_index.llms.vertex import Vertex

import google.auth
import google.auth.transport.requests
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

class RAGLLM:
    """
    LlamaIndex supports OpenAI, Cohere, AI21 and HuggingFace LLMs
    https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html
    """

    def __init__(self, llm_type, llm_name, api_base=None, api_key=None):
        self.llm_type = llm_type
        self.llm_name = llm_name

        self._api_base = api_base
        self._api_key = api_key

        self.local_model_path = "/model-weights"

    def load_model(self, **kwargs):
        print(f"Configuring {self.llm_type} LLM model ...")
        gen_arg_keys = ["temperature", "top_p", "top_k", "do_sample"]
        gen_kwargs = {k: v for k, v in kwargs.items() if k in gen_arg_keys}
        if self.llm_type == "local":
            # Using local HuggingFace LLM stored at /model-weights
            llm = HuggingFaceLLM(
                tokenizer_name=f"{self.local_model_path}/{self.llm_name}",
                model_name=f"{self.local_model_path}/{self.llm_name}",
                device_map="auto",
                context_window=4096,
                max_new_tokens=kwargs["max_new_tokens"],
                generate_kwargs=gen_kwargs,
                # model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
            )
        # jkwng: add vertex support
        elif self.llm_type in ["vertex"]:
            llm = Vertex(
                model=self.llm_name,
                temperature=kwargs["temperature"],
                max_tokens=kwargs["max_new_tokens"],
            )
        elif self.llm_type in ["vertex-endpoint"]:
            ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
                PROJECT_ID, REGION, kwargs["llm_endpoint_id"]  # the Vertex AI endpoint id, not llm_name
            )
            BASE_URL = (
              f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
            )
            # Dedicated endpoints are reached through their own DNS name
            if kwargs.get("llm_use_dedicated_endpoint"):
                BASE_URL = f"https://{kwargs['llm_dedicated_dns']}/v1/{ENDPOINT_RESOURCE_NAME}"
            llm = OpenAILike(
                model=self.llm_name,
                temperature=kwargs["temperature"],
                max_tokens=kwargs["max_new_tokens"],
                api_base=BASE_URL,
                api_key=creds.token,
                is_chat_model=True,
                top_p=kwargs["top_p"],
                top_k=kwargs["top_k"],
            )
        elif self.llm_type in ["openai", "kscope"]:
            llm = OpenAILike(
                model=self.llm_name,
                api_base=self._api_base,
                api_key=self._api_key,
                is_chat_model=True,
                temperature=kwargs["temperature"],
                max_tokens=kwargs["max_new_tokens"],
                top_p=kwargs["top_p"],
                top_k=kwargs["top_k"],
            )
        else:
            raise NotImplementedError(f"Unsupported LLM type - {self.llm_type}")
        return llm

In [None]:
llm = RAGLLM(
    llm_type=rag_cfg['llm_type'],
    llm_name=rag_cfg['llm_name'],
    # api_base=GENERATOR_BASE_URL,
    # api_key=OPENAI_API_KEY,
).load_model(**rag_cfg)

Configuring vertex LLM model ...
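For the `vertex-endpoint` type, `RAGLLM` points `OpenAILike` at an OpenAI-compatible base URL, and the OpenAI client appends routes such as `/chat/completions` to it. The standalone function below reproduces the two URL patterns used in that branch: the shared regional API (`v1beta1`) and the endpoint's dedicated DNS name (`v1`). `vertex_endpoint_base_url` is a hypothetical name, and the example ids in the usage line are placeholders.

```python
def vertex_endpoint_base_url(project_id, region, endpoint_id, dedicated_dns=None):
    """Rebuild the base URL that RAGLLM passes to OpenAILike for a Vertex AI endpoint."""
    resource = f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"
    if dedicated_dns:
        # Dedicated endpoints are served from their own hostname
        return f"https://{dedicated_dns}/v1/{resource}"
    # Otherwise go through the shared regional Vertex AI API
    return f"https://{region}-aiplatform.googleapis.com/v1beta1/{resource}"
```

For example: `vertex_endpoint_base_url(PROJECT_ID, REGION, rag_cfg.get("llm_endpoint_id"), rag_cfg.get("llm_dedicated_dns"))`. The `api_key` passed alongside it is a short-lived OAuth access token (`creds.token`), so a long-running session may need to refresh the credentials and rebuild the LLM.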


#### 4. Use ```Settings``` to set the node parser, embedding model, LLM, etc.

In [None]:
Settings.text_splitter = node_parser
Settings.llm = llm
Settings.embed_model = embed_model

## STAGE 3 - Create index using the appropriate vector store
All vector stores supported by LlamaIndex along with their available features are listed [here](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html).

If you are using LangChain, the supported vector stores can be found [here](https://python.langchain.com/docs/modules/data_connection/vectorstores/).

*jkwng: Llama Index + Vertex AI Vector Store integration seems to be broken and has the following things that do not work:*

- *Batch Updates to a staging bucket gives an error, only Streaming Index works*
- *Retrieval is broken - the API has changed but the library has not been updated*

*For the purposes of the notebook - we will use Weaviate*
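`RAGIndex` below also has a `local` option that uses `faiss.IndexFlatL2`, an exact (brute-force) index ranked by squared L2 distance. The pure-Python sketch below shows what such a flat index does at query time. It is illustrative only, since FAISS performs the same comparison in optimized native code, and `flat_l2_search` is a hypothetical name.

```python
def flat_l2_search(index_vectors, query, k):
    """Exact k-nearest-neighbour search by squared L2 distance over every stored vector."""
    scored = []
    for idx, vec in enumerate(index_vectors):
        if len(vec) != len(query):
            raise ValueError("dimension mismatch between index and query vectors")
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return [(idx, dist) for dist, idx in scored[:k]]
```

A flat index compares the query against every vector, so its cost grows linearly with corpus size. That is fine for the 500 PubMedQA contexts, whereas managed stores such as Weaviate or Vertex AI Vector Search use approximate indexes for larger corpora.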

In [None]:
#@title *jkwng: RAGIndex from utils.storage_utils.py - modified to use Vertex Vector Store*
import faiss
import os
import weaviate
from google.cloud import aiplatform
from google.cloud import storage

from pathlib import Path

from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore

# from .rag_utils import get_embed_model_dim
def get_embed_model_dim(embed_model):
    embed_out = embed_model.get_text_embedding("Dummy Text")
    return len(embed_out)

class RAGIndex:
    """
    Use storage context to set custom vector store
    Available options: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html
    Use Chroma: https://docs.llamaindex.ai/en/stable/examples/vector_stores/ChromaIndexDemo.html
    LangChain vector stores: https://python.langchain.com/docs/modules/data_connection/vectorstores/
    """

    def __init__(self, db_type, db_name):
        self.db_type = db_type
        self.db_name = db_name
        self._persist_dir = f"./.{db_type}_index_store/"

    def create_index(self, docs, save=True, **kwargs):
        # Supports Weaviate, local FAISS and Vertex AI Vector Search
        if self.db_type == "weaviate":
            # with open(Path.home() / ".weaviate.key", "r") as f:
            #     weaviate_api_key = f.read().rstrip("\n")
            weaviate_client = weaviate.connect_to_weaviate_cloud(
                cluster_url=kwargs["weaviate_url"],
                auth_credentials=weaviate.auth.AuthApiKey(WEAVIATE_API_KEY),
            )
            vector_store = WeaviateVectorStore(
                weaviate_client=weaviate_client,
                index_name=self.db_name,
            )
        elif self.db_type == "local":
            # Use FAISS vector database for local index
            # Use the embedding model passed in, else the global Settings one
            faiss_dim = get_embed_model_dim(kwargs.get("embed_model", Settings.embed_model))
            faiss_index = faiss.IndexFlatL2(faiss_dim)
            vector_store = FaissVectorStore(faiss_index=faiss_index)
        # jkwng: added Vertex AI Vector Search support here
        elif self.db_type == "vertex":
          # check if storage bucket exists
          bucket_names = [
              bucket.name for bucket in storage.Client().list_buckets()
          ]

          dst_bucket = f"jkwng-{self.db_name.replace('_', '-').lower()}"

          if dst_bucket not in bucket_names:
              print(f"Creating bucket {dst_bucket} ...")
              storage.Client().create_bucket(dst_bucket, location=REGION)
              print(f"Bucket {dst_bucket} created")
          else:
              print(f"Bucket {dst_bucket} exists")

          # check if index exists already in vertex
          index_names = [
              index.resource_name
              for index in aiplatform.MatchingEngineIndex.list(
                  filter=f"display_name={self.db_name}"
              )
          ]

          # create the index if it doesn't exist
          if len(index_names) == 0:
              print(f"Creating Vector Search index {self.db_name} ...")
              vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
                  display_name=self.db_name,
                  dimensions=768,
                  approximate_neighbors_count=100,
                  distance_measure_type="DOT_PRODUCT_DISTANCE",
                  shard_size="SHARD_SIZE_SMALL",
                  index_update_method="STREAM_UPDATE",  # allowed values BATCH_UPDATE , STREAM_UPDATE
              )
              print(
                  f"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}"
              )
          else:
              vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])
              print(
                  f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
              )

          # create an endpoint to serve the index
          endpoint_names = [
              endpoint.resource_name
              for endpoint in aiplatform.MatchingEngineIndexEndpoint.list(
                  filter=f"display_name={self.db_name}"
              )
          ]

          if len(endpoint_names) == 0:
              print(
                  f"Creating Vector Search index endpoint {self.db_name} ..."
              )
              vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
                  display_name=self.db_name, public_endpoint_enabled=True
              )
              print(
                  f"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}"
              )
          else:
              vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
                  index_endpoint_name=endpoint_names[0]
              )
              print(
                  f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
              )

          # check if the index is already deployed to an endpoint
          index_endpoints = [
              (deployed_index.index_endpoint, deployed_index.deployed_index_id)
              for deployed_index in vs_index.deployed_indexes
          ]

          if len(index_endpoints) == 0:
              print(
                  f"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ..."
              )
              vs_deployed_index = vs_endpoint.deploy_index(
                  index=vs_index,
                  deployed_index_id=self.db_name,
                  display_name=self.db_name,
                  machine_type="e2-standard-2",
                  min_replica_count=1,
                  max_replica_count=1,
              )
              print(
                  f"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}"
              )
          else:
              vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(
                  index_endpoint_name=index_endpoints[0][0]
              )
              print(
                  f"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}"
              )

          # setup storage
          vector_store = VertexAIVectorStore(
              project_id=PROJECT_ID,
              region=REGION,
              index_id=vs_index.resource_name,
              endpoint_id=vs_endpoint.resource_name,
              gcs_bucket_name=dst_bucket,
          )

        else:
            raise NotImplementedError(f"Incorrect vector db type - {self.db_type}")

        if os.path.isdir(self._persist_dir):
            # Load if index already saved
            print(f"Loading index from {self._persist_dir} ...")
            storage_context = StorageContext.from_defaults(
                vector_store=vector_store,
                persist_dir=self._persist_dir,
            )
            index = load_index_from_storage(storage_context)
        else:
            # Re-index
            print("Creating new index ...")
            storage_context = StorageContext.from_defaults(vector_store=vector_store)
            index = VectorStoreIndex.from_documents(
                docs, storage_context=storage_context
            )
            if save:
                os.makedirs(self._persist_dir, exist_ok=True)
                index.storage_context.persist(persist_dir=self._persist_dir)

        return index

In [None]:
index = RAGIndex(
    db_type=rag_cfg['vector_db_type'],
    db_name=rag_cfg['vector_db_name'],
).create_index(docs, weaviate_url=rag_cfg["weaviate_url"])

Loading index from ./.weaviate_index_store/ ...


## STAGE 4 - Build query engine

Now build a query engine using *retriever* and *response_synthesizer*. LlamaIndex also supports different types of [retrievers](https://docs.llamaindex.ai/en/stable/api_reference/query/retrievers.html) and [response modes](https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/root.html#configuring-the-response-mode) for various use-cases.

[Weaviate hybrid search](https://weaviate.io/blog/hybrid-search-explained) explains how dense and sparse search results are combined.
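To make the role of *hybrid_search_alpha* concrete, here is a minimal sketch of alpha-weighted score fusion in the spirit of Weaviate's relative score fusion: each result list's scores are min-max normalized to [0, 1] and then combined as `alpha * dense + (1 - alpha) * sparse`. This is an illustration only, not Weaviate's implementation; the function names and toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one result list's scores to [0, 1] (all-equal scores map to 1.0)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_fuse(
    dense: Dict[str, float], sparse: Dict[str, float], alpha: float
) -> List[Tuple[str, float]]:
    """Combine scores as alpha * dense + (1 - alpha) * sparse.

    alpha = 1.0 -> pure dense (vector) ranking; alpha = 0.0 -> pure keyword ranking.
    A document missing from one list contributes 0 for that list.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1.0 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores: cosine similarities (dense) and BM25-style keyword scores (sparse).
dense_scores = {"a": 0.8, "b": 0.4, "c": 0.0}
sparse_scores = {"b": 12.0, "c": 3.0, "d": 6.0}
for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_fuse(dense_scores, sparse_scores, alpha))
```

Normalizing each list first matters because raw keyword scores (here up to 12.0) and cosine similarities live on very different scales; without it, the sparse side would dominate regardless of alpha.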

In [None]:
#@title *jkwng - RAGQueryEngine from utils.rag_utils.py - add support for Vertex AI*
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever

from llama_index.core import (
    PromptTemplate,
    get_response_synthesizer,
)

class RAGQueryEngine:
    """
    https://docs.llamaindex.ai/en/stable/understanding/querying/querying.html
    TODO - Check other args for RetrieverQueryEngine
    """

    def __init__(self, retriever_type, vector_index):
        self.retriever_type = retriever_type
        self.index = vector_index
        self.retriever = None
        self.node_postprocessor = None
        self.response_synthesizer = None

    def create(self, similarity_top_k, response_mode, **kwargs):
        self.set_retriever(similarity_top_k, **kwargs)
        self.set_response_synthesizer(response_mode=response_mode)
        if kwargs["use_reranker"]:
            self.set_node_postprocessors(rerank_top_k=kwargs["rerank_top_k"])
        query_engine = RetrieverQueryEngine(
            retriever=self.retriever,
            node_postprocessors=self.node_postprocessor,
            response_synthesizer=self.response_synthesizer,
        )
        return query_engine

    def set_retriever(self, similarity_top_k, **kwargs):
        # Other retrievers can be used based on the type of index: List, Tree, Knowledge Graph, etc.
        # https://docs.llamaindex.ai/en/stable/api_reference/query/retrievers.html
        # Find LlamaIndex equivalents for the following:
        # Check MultiQueryRetriever from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever
        # Check Contextual compression from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/
        # Check Ensemble Retriever from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble
        # Check self-query from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/self_query
        # Check WebSearchRetriever from LangChain: https://python.langchain.com/docs/modules/data_connection/retrievers/web_research
        if self.retriever_type == "vector_index":
            self.retriever = VectorIndexRetriever(
                index=self.index,
                similarity_top_k=similarity_top_k,
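                # .get() so that backends which don't set these (e.g. Vertex AI
                # Vector Search, which only supports "default") don't raise KeyError
                vector_store_query_mode=kwargs.get("query_mode", "default"),
                alpha=kwargs.get("hybrid_search_alpha"),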
            )
        elif self.retriever_type == "bm25":
            self.retriever = BM25Retriever(
                nodes=kwargs["nodes"],
                tokenizer=kwargs["tokenizer"],
                similarity_top_k=similarity_top_k,
            )
        else:
            raise NotImplementedError(
                f"Incorrect retriever type - {self.retriever_type}"
            )

    def set_node_postprocessors(self, rerank_top_k=2):
        # Node postprocessor: processes nodes after retrieval, before passing them to the LLM for generation
        # Re-ranking step can be performed here!
        # Nodes can be re-ordered to include more relevant ones at the top: https://python.langchain.com/docs/modules/data_connection/document_transformers/post_retrieval/long_context_reorder
        # https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/node_postprocessors.html

        self.node_postprocessor = [LLMRerank(top_n=rerank_top_k)]

    def set_response_synthesizer(self, response_mode):
        # Other response modes: https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/root.html#configuring-the-response-mode
        qa_prompt_tmpl = (
            "Context information is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Given the context information and not prior knowledge, answer the query while providing an explanation. "
            "If your answer is in favour of the query, end your response with 'yes' otherwise end your response with 'no'.\n"
            "Query: {query_str}\n"
            "Answer: "
        )
        qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl)

        self.response_synthesizer = get_response_synthesizer(
            text_qa_template=qa_prompt_tmpl,
            response_mode=response_mode,
        )
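The notebook uses the `compact` response mode (LlamaIndex's default). Per the LlamaIndex docs, `compact` behaves like `refine` but first concatenates the retrieved chunks so each prompt is filled as much as possible, which means fewer LLM calls: the first packed block is answered with `text_qa_template`, and any further blocks refine that answer. The sketch below shows only the packing step, under the simplifying assumption that the prompt budget is measured in characters rather than tokens; `compact_chunks` is a hypothetical helper, not a LlamaIndex API.

```python
from typing import List


def compact_chunks(chunks: List[str], max_chars: int, separator: str = "\n\n") -> List[str]:
    """Greedily concatenate chunks so each packed block stays within max_chars.

    A single chunk longer than max_chars is kept as its own block
    (a real implementation would split it further; omitted here).
    """
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + separator + chunk
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep packing
        else:
            if current:
                packed.append(current)  # close the full block
            current = chunk  # start a new block with this chunk
    if current:
        packed.append(current)
    return packed


retrieved = ["Chunk about HIV.", "Chunk about renal transplant.", "Chunk about surveys."]
blocks = compact_chunks(retrieved, max_chars=50)
print(f"{len(retrieved)} chunks -> {len(blocks)} LLM call(s)")
```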

In [None]:
def set_query_engine_args(rag_cfg, docs):
    query_engine_args = {
        "similarity_top_k": rag_cfg['retriever_similarity_top_k'],
        "response_mode": rag_cfg['response_mode'],
        "use_reranker": False,
    }

    # jkwng: add that retriever type vector_index could be "vertex" too
    # jkwng: note we don't actually use hybrid search for vertex ai vector search
    if (rag_cfg["retriever_type"] == "vector_index") and (rag_cfg["vector_db_type"] == "weaviate"):
        query_engine_args.update({
            "query_mode": rag_cfg["query_mode"],
            "hybrid_search_alpha": rag_cfg["hybrid_search_alpha"]
        })
    elif (rag_cfg["retriever_type"] == "vector_index") and (rag_cfg["vector_db_type"] == "vertex"):
        query_engine_args.update({
            # jkwng: only default mode works with VVS
            "query_mode": "default"
        })
    elif rag_cfg["retriever_type"] == "bm25":
        nodes = Settings.text_splitter.get_nodes_from_documents(docs)
        tokenizer = Settings.embed_model._tokenizer
        query_engine_args.update({"nodes": nodes, "tokenizer": tokenizer})

    if rag_cfg["use_reranker"]:
        query_engine_args.update({"use_reranker": True, "rerank_top_k": rag_cfg["rerank_top_k"]})

    return query_engine_args
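When `retriever_type` is `"bm25"`, `set_query_engine_args` hands the parsed nodes and a tokenizer to `BM25Retriever`. For intuition about what sparse (keyword) scoring rewards, here is a textbook Okapi BM25 sketch with a Lucene-style smoothed IDF. It is not the `llama-index-retrievers-bm25` implementation, which has its own tokenization and stemming; the regex tokenizer and the `k1`/`b` defaults below are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Lowercased alphanumeric tokens; a stand-in for a real tokenizer/stemmer.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))  # unique query terms (simplification)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        dl = len(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF: rare terms get more weight, never negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation (k1) and document-length normalization (b).
            denom = tf[term] + k1 * (1 - b + b * dl / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


corpus = ["HIV kidney transplant outcomes", "Renal transplant centers survey", "Influenza vaccine trial"]
print(bm25_scores("HIV transplant", corpus))
```

Documents that share no terms with the query score exactly zero, which is why pure keyword search can miss paraphrased evidence that dense search would catch.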

In [None]:
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)

{'hybrid_search_alpha': 0.0,
 'query_mode': 'default',
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': False}


In [None]:
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index,
).create(**query_engine_args)

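## STAGE 5 - Finally, query the model!
**Note:** With *hybrid_search_alpha* set to 0.0 (the default here), Weaviate hybrid search is purely keyword-based (sparse). The alpha value only takes effect when *query_mode* is `"hybrid"`; with `query_mode='default'`, as in the arguments printed above, the retriever runs a plain dense vector search.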

#### [TODO] Change seed to experiment with a different sample

In [None]:
random.seed(237)

In [None]:
sample_idx = random.randint(0, len(pubmed_data)-1)
sample_elm = pubmed_data[sample_idx]
pprint(sample_elm)

{'answer': ['no'],
 'context': 'Human immunodeficiency virus (HIV)-infected patients have '
            'generally been excluded from transplantation. Recent advances in '
            'the management and prognosis of these patients suggest that this '
            'policy should be reevaluated. To explore the current views of '
            'U.S. transplant centers toward transplanting asymptomatic '
            'HIV-infected patients with end-stage renal disease, a written '
            'survey was mailed to the directors of transplantation at all 248 '
            'renal transplant centers in the United States. All 148 responding '
            'centers said they require HIV testing of prospective kidney '
            'recipients, and 84% of these centers would not transplant an '
            'individual who refuses HIV testing. The vast majority of '
            'responding centers would not transplant a kidney from a cadaveric '
            '(88%) or a living donor (91%) into an asymp

In [None]:
#@title *jkwng: extract_yes_no from utils.rag_utils.py*
import re

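def extract_yes_no(resp):
    # The prompt asks the model to *end* its answer with yes/no, so take the
    # last standalone "yes"/"no" rather than the first one.
    matches = re.findall(r"\b(?:yes|no)\b", resp, re.IGNORECASE)
    if matches:
        return matches[-1].lower()
    return "none"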

In [None]:
query = sample_elm['question']

response = query_engine.query(query)

delim = "-" * 25
print(f'QUERY: {query}\n')
print(f'RESPONSE:\n{delim}\n{response.response}\n{delim}\n')
print(f'YES/NO: {extract_yes_no(response.response)}\n')
print(f'GT ANSWER: {sample_elm["answer"][0]}\n')
print(f'GT LONG ANSWER:\n{delim}\n{sample_elm["long_answer"]}\n{delim}')

QUERY: Should all human immunodeficiency virus-infected patients with end-stage renal disease be excluded from transplantation?

RESPONSE:
-------------------------
Based on the provided context, the majority of U.S. transplant centers surveyed would not transplant a kidney into an asymptomatic HIV-infected patient. The primary reasons cited are concerns about harm to the patient and the potential waste of organs. However, the initial statement in the first paragraph suggests that this policy should be reevaluated due to advances in HIV management. The information does not explicitly state that *all* HIV-infected patients with end-stage renal disease should be excluded, but it indicates a strong reluctance among transplant centers to perform such transplants. Therefore, the information does not support the query.

no

-------------------------

YES/NO: no

GT ANSWER: no

GT LONG ANSWER:
-------------------------
The great majority of U.S. renal transplant centers will not transplant ki
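The TODO above suggests changing the seed to inspect other samples. To check several samples at once, a small loop can compare the extracted yes/no label with the ground truth. This is a hypothetical helper, not part of the original utilities: `evaluate_samples` and `last_yes_no` are illustrative names, and the code assumes each record has `question` and `answer` fields like `sample_elm` above. It takes the *last* yes/no in the response because the prompt asks the model to end with one. If the dataset includes PubMedQA's "maybe" labels, those samples will always count as misses with this yes/no prompt.

```python
import random
import re
from typing import Callable, Mapping, Sequence


def last_yes_no(text: str) -> str:
    """Return the last standalone 'yes'/'no' in text (lowercased), else 'none'."""
    matches = re.findall(r"\b(?:yes|no)\b", text, re.IGNORECASE)
    return matches[-1].lower() if matches else "none"


def evaluate_samples(
    data: Sequence[Mapping],
    answer_fn: Callable[[str], str],
    n: int,
    seed: int,
) -> float:
    """Accuracy of yes/no labels on n randomly chosen samples (without replacement)."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    indices = rng.sample(range(len(data)), k=min(n, len(data)))
    correct = 0
    for idx in indices:
        record = data[idx]
        prediction = last_yes_no(answer_fn(record["question"]))
        correct += int(prediction == record["answer"][0])
    return correct / len(indices) if indices else 0.0


# In the notebook this could be called as (each sample costs one LLM query):
# evaluate_samples(pubmed_data, lambda q: query_engine.query(q).response, n=10, seed=237)
```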

#### [OPTIONAL] [Ragas](https://docs.ragas.io/en/latest/) evaluation
Following are the commonly used metrics for evaluating a RAG workflow:
* [Faithfulness](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/faithfulness/): Measures the factual consistency of the generated answer with the retrieved context. Value lies between 0 and 1. **Evaluated using an LLM.**
* [Answer Relevance](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/answer_relevance/): Measures how relevant the answer is to the given query. Value lies between 0 and 1. **Evaluated using an LLM and an embedding model.**
* [Context Precision](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/context_precision/): Precision of the retriever, measured by comparing the retrieved context with the ground-truth context. Value lies between 0 and 1. An LLM can be used for evaluation; this notebook uses the non-LLM variant.
* [Context Recall](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/context_recall/): Recall of the retriever, measured by comparing the retrieved context with the ground-truth context. Value lies between 0 and 1. An LLM can be used for evaluation; this notebook uses the non-LLM variant.
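As a rough illustration of how these scores are formed, the sketch below computes faithfulness as the fraction of claims judged supported (in Ragas an LLM extracts the claims and produces the verdicts; here they are passed in directly), and non-LLM context recall and precision by string-matching retrieved chunks against reference contexts. This is not the Ragas code: `difflib.SequenceMatcher` is swapped in as the string similarity and the 0.5 threshold is an assumption. Precision is computed as an average precision, so a relevant chunk ranked first scores higher than the same chunk ranked second.

```python
import math
from difflib import SequenceMatcher
from typing import List, Sequence


def string_similarity(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]; Ragas uses its own string-distance measure.
    return SequenceMatcher(None, a, b).ratio()


def _verdicts(candidates: Sequence[str], targets: Sequence[str], threshold: float) -> List[int]:
    """1 if a candidate closely matches any target string, else 0."""
    return [
        int(max((string_similarity(c, t) for t in targets), default=0.0) > threshold)
        for c in candidates
    ]


def context_recall(retrieved: Sequence[str], reference: Sequence[str], threshold: float = 0.5) -> float:
    # Fraction of reference contexts that were found among the retrieved chunks.
    verdicts = _verdicts(reference, retrieved, threshold)
    return sum(verdicts) / len(verdicts) if verdicts else math.nan


def context_precision(retrieved: Sequence[str], reference: Sequence[str], threshold: float = 0.5) -> float:
    # Average precision over the ranked retrieved chunks.
    verdicts = _verdicts(retrieved, reference, threshold)
    n_relevant = sum(verdicts)
    if n_relevant == 0:
        return 0.0
    weighted = sum((sum(verdicts[: k + 1]) / (k + 1)) * v for k, v in enumerate(verdicts))
    return weighted / n_relevant


def faithfulness(claim_supported: Sequence[bool]) -> float:
    # Supported claims / total claims extracted from the response.
    return sum(claim_supported) / len(claim_supported) if claim_supported else math.nan
```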

Note: If you are planning to use **OpenAI models as evaluation LLMs**, store your OpenAI API key in ```~/.ragas_openai.env``` using the following format:

```bash
   export RAGAS_OPENAI_BASE_URL="https://api.openai.com/v1"
   export RAGAS_OPENAI_API_KEY=<openai_api_key>
```

Once done, **uncomment the next cell** to load these environment variables.

In [None]:
# from utils.load_secrets import load_env_file_ragas
# load_env_file_ragas()
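`utils.load_secrets.load_env_file_ragas` is not shown in this notebook. If that helper is unavailable, the following is a minimal, hypothetical stand-in that parses `export KEY=value` lines (the format shown above) and copies them into `os.environ`. It assumes one assignment per line, skips blank lines and `#` comments, strips one pair of surrounding quotes, and does not perform shell expansion.

```python
import os
import re
from pathlib import Path
from typing import Dict

# Matches optional "export", a shell-style variable name, "=", and the value.
_EXPORT_LINE = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$")


def parse_export_lines(text: str) -> Dict[str, str]:
    """Parse `export KEY=value` lines into a dict."""
    parsed: Dict[str, str] = {}
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        match = _EXPORT_LINE.match(raw)
        if not match:
            continue
        key, value = match.groups()
        # Strip one pair of matching surrounding quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        parsed[key] = value
    return parsed


def load_export_file(path: str, override: bool = False) -> Dict[str, str]:
    """Load variables from path into os.environ; existing values are kept unless override=True."""
    parsed = parse_export_lines(Path(path).expanduser().read_text())
    for key, value in parsed.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return parsed


# Example: load_export_file("~/.ragas_openai.env")
```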

In [None]:
#@title *jkwng RagasEval from utils.rag_utils.py - update to include support for Vertex AI Gemini*

from langchain_cohere import ChatCohere
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint
from langchain_openai import ChatOpenAI
from langchain_google_vertexai import VertexAIEmbeddings, ChatVertexAI

from ragas import EvaluationDataset, evaluate as ragas_evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper, LlamaIndexEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    Faithfulness,
    NonLLMContextPrecisionWithReference,
    NonLLMContextRecall,
    ResponseRelevancy,
)

RAGAS_METRIC_MAP = {
    "faithfulness": Faithfulness(),
    "relevancy": ResponseRelevancy(),
    "recall": NonLLMContextRecall(),
    "precision": NonLLMContextPrecisionWithReference(),
}

class RagasEval:
    def __init__(
        self, metrics, eval_llm_type, eval_llm_name, embed_model_type, embed_model_name, **kwargs
    ):
        self.eval_llm_type = eval_llm_type  # "openai", "cohere", "local", "kscope", "vertex"
        self.eval_llm_name = eval_llm_name

        self.temperature = kwargs.get("temperature", 0.0)
        self.max_tokens = kwargs.get("max_tokens", 256)

        self.embed_model_type = embed_model_type # "openai", "vertex", "vertex-endpoint"
        self.embed_model_name = embed_model_name
        self.embed_model_endpoint_id = kwargs.get("embed_model_endpoint_id", None)
        self.embed_model_use_dedicated_endpoint = kwargs.get("embed_model_use_dedicated_endpoint", False)

        self._prepare_embedding()
        self._prepare_llm()

        self.metrics = [RAGAS_METRIC_MAP[elm] for elm in metrics]

    def _prepare_data(self, data):
        return EvaluationDataset.from_list(data)

    def _prepare_embedding(self):
        model_kwargs = {"device": "cuda", "trust_remote_code": True}
        encode_kwargs = {
            "normalize_embeddings": True
        }  # set True to compute cosine similarity

        if self.embed_model_type == "openai":
            # Note: despite the "openai" label, this branch loads a HuggingFace
            # sentence-embedding model locally (on CUDA, per model_kwargs).
            self.eval_embedding = LangchainEmbeddingsWrapper(
                HuggingFaceEmbeddings(
                    model_name=self.embed_model_name,
                    model_kwargs=model_kwargs,
                    encode_kwargs=encode_kwargs,
                )
            )
        elif self.embed_model_type == "vertex":
            self.eval_embedding = LangchainEmbeddingsWrapper(
                VertexAIEmbeddings(
                    model_name=self.embed_model_name,
                    credentials=credentials,
                )
            )
        elif self.embed_model_type == "vertex-endpoint":
            self.eval_embedding = LlamaIndexEmbeddingsWrapper(
                VertexEndpointEmbedding(
                    endpoint_id=self.embed_model_endpoint_id,
                    project_id=PROJECT_ID,
                    location=REGION,
                    endpoint_kwargs={
                        "use_dedicated_endpoint": self.embed_model_use_dedicated_endpoint,
                    },
                )
            )
        else:
            raise NotImplementedError(
                f"Incorrect embedding model type - {self.embed_model_type}"
            )

    def _prepare_llm(self):
        if self.eval_llm_type == "local":
            self.eval_llm = LangchainLLMWrapper(
                HuggingFaceEndpoint(
                    repo_id=f"meta-llama/{self.eval_llm_name}",
                    temperature=self.temperature,
                    max_new_tokens=self.max_tokens,
                    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
                )
            )
        elif self.eval_llm_type == "kscope":
            self.eval_llm = LangchainLLMWrapper(
                ChatOpenAI(
                    model=self.eval_llm_name,
                    temperature=self.temperature,
                    max_tokens=self.max_tokens,
                )
            )
        elif self.eval_llm_type == "openai":
            self.eval_llm = LangchainLLMWrapper(
                ChatOpenAI(
                    model=self.eval_llm_name,
                    temperature=self.temperature,
                    max_tokens=self.max_tokens,
                    base_url=os.environ["RAGAS_OPENAI_BASE_URL"],
                    api_key=os.environ["RAGAS_OPENAI_API_KEY"],
                )
            )
        elif self.eval_llm_type == "cohere":
            self.eval_llm = LangchainLLMWrapper(
                ChatCohere(
                    model=self.eval_llm_name,
                )
            )
        elif self.eval_llm_type == "vertex":
            self.eval_llm = LangchainLLMWrapper(
                ChatVertexAI(
                    model_name=self.eval_llm_name,
                    temperature=self.temperature,
                    max_tokens=self.max_tokens,
                )
            )
        else:
            raise NotImplementedError(f"Incorrect eval LLM type - {self.eval_llm_type}")

    def evaluate(self, data):
        eval_data = self._prepare_data(data)
        result = ragas_evaluate(
            dataset=eval_data,
            metrics=self.metrics,
            llm=self.eval_llm,
            embeddings=self.eval_embedding,
        )
        return result
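`_prepare_embedding` sets `normalize_embeddings=True` so that similarity can be computed as cosine similarity. According to the Ragas docs, the embedding model is used by `ResponseRelevancy`, which compares the original question with questions generated from the response. The reason normalization is enough: after scaling both vectors to unit length, their plain dot product equals their cosine similarity. A quick stdlib check with hypothetical toy vectors:

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine_similarity(a, b), dot(l2_normalize(a), l2_normalize(b)))
```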

In [None]:
retrieved_nodes = query_engine.retriever.retrieve(query)

eval_data = [dict({
    "user_input": query,
    "response": response.response,
    "retrieved_contexts": [node.text for node in retrieved_nodes],
    "reference": sample_elm['long_answer'],
    "reference_contexts": [sample_elm["context"]],
})]
pprint(eval_data)

[{'reference': 'The great majority of U.S. renal transplant centers will not '
               'transplant kidneys to HIV-infected patients with end-stage '
               'renal disease, even if their infection is asymptomatic. '
               'However, advances in the management of HIV infection and a '
               'review of relevant ethical issues suggest that this approach '
               'should be reconsidered.',
  'reference_contexts': ['Human immunodeficiency virus (HIV)-infected patients '
                         'have generally been excluded from transplantation. '
                         'Recent advances in the management and prognosis of '
                         'these patients suggest that this policy should be '
                         'reevaluated. To explore the current views of U.S. '
                         'transplant centers toward transplanting asymptomatic '
                         'HIV-infected patients with end-stage renal disease, '
                

In [None]:
eval_obj = RagasEval(
    metrics=["faithfulness", "relevancy", "recall", "precision"],
    max_tokens=1024,
    **rag_cfg
)

In [None]:
eval_result = eval_obj.evaluate(eval_data)
pprint(eval_result)

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

{'faithfulness': 0.8571, 'answer_relevancy': 0.0000, 'non_llm_context_recall': 1.0000, 'non_llm_context_precision_with_reference': 1.0000}


### 5.1 - Dense Search
Set *hybrid_search_alpha* to 1.0 for dense vector search.

In [None]:
rag_cfg["hybrid_search_alpha"] = 1.0

In [None]:
# Recreate query engine
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index
).create(**query_engine_args)

# Get response
response = query_engine.query(query)

# Print response
print(f'\n\nQUERY: {query}\n')
print(f'RESPONSE:\n{delim}\n{response.response}\n{delim}\n')
print(f'YES/NO: {extract_yes_no(response.response)}\n')
print(f'GT ANSWER: {sample_elm["answer"][0]}\n')
print(f'GT LONG ANSWER:\n{delim}\n{sample_elm["long_answer"]}\n{delim}')

{'hybrid_search_alpha': 1.0,
 'query_mode': 'default',
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': False}


QUERY: Should all human immunodeficiency virus-infected patients with end-stage renal disease be excluded from transplantation?

RESPONSE:
-------------------------
Based on the provided context, the majority of U.S. transplant centers surveyed would not transplant a kidney into an asymptomatic HIV-infected patient. The primary reasons cited are concerns about harm to the patient and the potential waste of organs. However, the initial statement in the first paragraph suggests that this policy should be reevaluated due to advances in HIV management. The information does not explicitly state that *all* HIV-infected patients with end-stage renal disease should be excluded, but it indicates a strong reluctance among transplant centers to perform such transplants. Therefore, the information does not support the query.

no

-------------------------

YES/NO: n

#### [OPTIONAL] Ragas evaluation

In [None]:
retrieved_nodes = query_engine.retriever.retrieve(query)

eval_data = [dict({
    "user_input": query,
    "response": response.response,
    "retrieved_contexts": [node.text for node in retrieved_nodes],
    "reference": sample_elm['long_answer'],
    "reference_contexts": [sample_elm["context"]],
})]

eval_result = eval_obj.evaluate(eval_data)
pprint(eval_result)

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

{'faithfulness': 0.8571, 'answer_relevancy': 0.0000, 'non_llm_context_recall': 1.0000, 'non_llm_context_precision_with_reference': 1.0000}


### 5.2 - Hybrid Search
Set *hybrid_search_alpha* to 0.5 for hybrid search with equal weighting of dense and sparse (keyword-based) search.

In [None]:
rag_cfg["hybrid_search_alpha"] = 0.5

In [None]:
# Recreate query engine
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index
).create(**query_engine_args)

# Get response
response = query_engine.query(query)

# Print response
print(f'\n\nQUERY: {query}\n')
print(f'RESPONSE:\n{delim}\n{response.response}\n{delim}\n')
print(f'YES/NO: {extract_yes_no(response.response)}\n')
print(f'GT ANSWER: {sample_elm["answer"][0]}\n')
print(f'GT LONG ANSWER:\n{delim}\n{sample_elm["long_answer"]}\n{delim}')

{'hybrid_search_alpha': 0.5,
 'query_mode': 'default',
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': False}


QUERY: Should all human immunodeficiency virus-infected patients with end-stage renal disease be excluded from transplantation?

RESPONSE:
-------------------------
Based on the provided context, the majority of U.S. transplant centers surveyed would not transplant a kidney into an asymptomatic HIV-infected patient. The primary reasons cited are concerns about harm to the patient and the potential waste of organs. However, the initial statement in the first paragraph suggests that this policy should be reevaluated due to advances in HIV management. The information does not explicitly state that *all* HIV-infected patients with end-stage renal disease should be excluded, but it indicates a strong reluctance among transplant centers to perform such transplants. Therefore, the information does not support the query.

no

-------------------------

YES/NO: n

#### [OPTIONAL] Ragas evaluation

In [None]:
retrieved_nodes = query_engine.retriever.retrieve(query)

eval_data = [dict({
    "user_input": query,
    "response": response.response,
    "retrieved_contexts": [node.text for node in retrieved_nodes],
    "reference": sample_elm['long_answer'],
    "reference_contexts": [sample_elm["context"]],
})]

eval_result = eval_obj.evaluate(eval_data)
pprint(eval_result)

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

ERROR:asyncio:Exception in callback PollerCompletionQueue._handle_events(<_UnixSelecto...e debug=False>)()
handle: <Handle PollerCompletionQueue._handle_events(<_UnixSelecto...e debug=False>)()>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable


{'faithfulness': 0.8571, 'answer_relevancy': 0.0000, 'non_llm_context_recall': 1.0000, 'non_llm_context_precision_with_reference': 1.0000}


### 5.3 - Using Re-ranker
Set *use_reranker* to *True* to re-rank the context after retrieving it from the vector database.
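Conceptually, re-ranking retrieves `similarity_top_k` candidates (5 here), re-scores each one against the query, and keeps only the best `top_n` (`rerank_top_k`, 3 here). In this notebook `LLMRerank` roughly does this by asking the LLM to judge relevance. The sketch below keeps the same shape but swaps in a hypothetical keyword-overlap scorer for the LLM, so it runs without any model; `rerank` and `keyword_overlap_score` are illustrative names, not LlamaIndex APIs.

```python
from typing import Callable, List, Sequence, Tuple


def keyword_overlap_score(query: str, text: str) -> float:
    """Hypothetical relevance scorer: fraction of query words present in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Re-score retrieved candidates and keep the top_n highest-scoring ones."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    # sorted() is stable (also with reverse=True), so ties keep their retrieval order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


retrieved = ["hiv kidney transplant", "influenza vaccine", "kidney transplant centers", "hiv testing", "weather report"]
for text, score in rerank("hiv kidney transplant", retrieved, keyword_overlap_score, top_n=3):
    print(f"{score:.2f}  {text}")
```

Swapping the scorer for an LLM call improves relevance judgments but adds latency and extra generations per query.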

In [None]:
rag_cfg["use_reranker"] = True
rag_cfg["hybrid_search_alpha"] = 1.0 # Using dense search

In [None]:
# Recreate query engine
query_engine_args = set_query_engine_args(rag_cfg, docs)
pprint(query_engine_args)
query_engine = RAGQueryEngine(
    retriever_type=rag_cfg['retriever_type'],
    vector_index=index
).create(**query_engine_args)

# Get response
response = query_engine.query(query)

# Print response
print(f'\n\nQUERY: {query}\n')
print(f'RESPONSE:\n{delim}\n{response.response}\n{delim}\n')
print(f'YES/NO: {extract_yes_no(response.response)}\n')
print(f'GT ANSWER: {sample_elm["answer"][0]}\n')
print(f'GT LONG ANSWER:\n{delim}\n{sample_elm["long_answer"]}\n{delim}')

{'hybrid_search_alpha': 1.0,
 'query_mode': 'default',
 'rerank_top_k': 3,
 'response_mode': 'compact',
 'similarity_top_k': 5,
 'use_reranker': True}


ResponseValidationError: The model response did not complete successfully.
Finish reason: 2.
Finish message: .
Safety ratings: [].
To protect the integrity of the chat session, the request and response were not added to chat history.
To skip the response validation, specify `model.start_chat(response_validation=False)`.
Note that letting blocked or otherwise incomplete responses into chat history might lead to future interactions being blocked by the service.

#### [OPTIONAL] Ragas evaluation

In [None]:
retrieved_nodes = query_engine.retriever.retrieve(query)

eval_data = [dict({
    "user_input": query,
    "response": response.response,
    "retrieved_contexts": [node.text for node in retrieved_nodes],
    "reference": sample_elm['long_answer'],
    "reference_contexts": [sample_elm["context"]],
})]

eval_result = eval_obj.evaluate(eval_data)
pprint(eval_result)