# Overview

This notebook demonstrates how the system works. It is not intended to be used in a production environment (not even in development), since there are better options, such as dockerizing the service and hosting it in a microservice-like architecture (using FastAPI, for example), or relying on cloud-based solutions such as AWS Bedrock to create a flow and expose it as an endpoint without the overhead of the deployment process.

Nevertheless, this notebook does essentially what would happen at user query time. The proposed architecture has only two main steps, retrieval and generation. It is by no means a production architecture, since several other components, such as *intent classifiers*, *routers*, and *guardrails*, could be added as well (always balancing the cost-latency tradeoff).

The reasons why this approach was selected for this take-home assessment are detailed in the README file at the root of this repository.

---
# Setup

Before we start, the environment needs some setup: installing the required libraries, loading the environment variables and config files, and downloading the model that was trained in the training pipeline (please refer to /model_training).

In [1]:
%pip install --quiet -r rag_system/requirements.txt

You should consider upgrading via the '/Users/nicolas.dominutti/Desktop/ml/medical-qa-system/.venv/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
from dotenv import load_dotenv
load_dotenv("rag_system/.env")
MODEL_PATH = 'rag_system/model'

In [3]:
from general_utils import load_config
CONFIG = load_config()

In [4]:
from general_utils import S3Manager
s3_client = S3Manager.get_client('s3')

Specifically, in the following cells I download the fine-tuned FlagEmbedding bge-base-en-v1.5 model and then load it with the FlagEmbedding library. Recall that the selected model is *instruction tuned*, so I pass it the very same query instruction that was used at fine-tuning time.

In [5]:
import os

# Download only if the model folder is missing or empty
if not os.path.isdir(MODEL_PATH) or not os.listdir(MODEL_PATH):
    S3Manager.download_folder(s3_client, CONFIG['S3_BUCKET'], 'finetuned_model/', MODEL_PATH)

In [6]:
from general_utils.flagembedding import FlagEmbeddingManager
embedding_service = FlagEmbeddingManager()
model = embedding_service.get_model(
    local_model_path=MODEL_PATH,
    query_instruction=CONFIG['QUERY_INSTRUCTION_AT_RETRIEVAL']
)

  from .autonotebook import tqdm as notebook_tqdm


In [7]:
print(f"Utilized instruction: '{CONFIG['QUERY_INSTRUCTION_AT_RETRIEVAL']}'")

Utilized instruction: 'Represent this sentence for searching relevant passages:'
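
To make the role of the instruction concrete: with instruction-tuned retrievers of this kind, the instruction is added to the query text before encoding, while corpus passages are encoded as they are. The sketch below only illustrates that idea. `build_query_text` is a hypothetical helper, and the single-space separator is an assumption, since the actual concatenation happens inside the FlagEmbedding library.

```python
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages:"

def build_query_text(instruction: str, query: str) -> str:
    """Return the text that would be encoded for a query (hypothetical helper)."""
    # Assumption: instruction and query are joined by a single space.
    return f"{instruction} {query}"

# Queries carry the instruction...
query_text = build_query_text(QUERY_INSTRUCTION, "What is high blood pressure?")
# ...while passages are encoded without it.
passage_text = "Example passage stored in the index."

print(query_text)
print(passage_text)
```

Using the same instruction at fine-tuning time and at retrieval time keeps the query representation consistent with what the model saw during training.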


---

# Services access

After the setup is done, I initialize the connections to the required services. As stated in the README file, I use:
* AWS OpenSearch: the vector DB that stores the corpus embeddings produced with my fine-tuned model (please refer to /training) and lets me perform semantic retrieval over them
* AWS Bedrock: the fully managed solution used to call the generator model API

The BedrockManager class (found in rag_system.utils.bedrock) encapsulates all the logic needed to query Bedrock, while the FlagEmbeddingManager class (found in general_utils.flagembedding) encapsulates the logic to connect to and search OpenSearch.

In [8]:
from rag_system.utils import BedrockManager
bedrock = BedrockManager()

---

# Some examples

Here I provide some examples of retrieved context and model responses. Note that the system prompt was not tuned due to lack of time, which makes it a possible area for improvement.

In [9]:
print(f"Utilized untuned system prompt: {CONFIG['SYSTEM_PROMPT']}")

Utilized untuned system prompt:  You are a knowledgeable and empathetic medical assistant. Your task is to answer patients' questions strictly based on the context provided below. Do not include any information outside of this context. If the answer is not contained within the context, respond politely that you don’t have enough information to answer.
Context {CONTEXT} 
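
The `{CONTEXT}` placeholder at the end of the prompt is where the retrieved passages go. How `BedrockManager` performs the substitution is not shown in this notebook, so the following is only a minimal sketch under that assumption. `fill_system_prompt`, the shortened template, and the example passages are hypothetical.

```python
from typing import List

def fill_system_prompt(template: str, passages: List[str]) -> str:
    """Insert the retrieved passages into the prompt's {CONTEXT} placeholder."""
    # str.replace is used instead of str.format so that any other braces
    # in the template or in the passages are left untouched.
    return template.replace("{CONTEXT}", "\n".join(passages))

# Shortened, hypothetical template and passages for illustration only
template = "Answer strictly based on the context provided below.\nContext {CONTEXT}"
passages = ["First retrieved passage.", "Second retrieved passage."]

system_prompt = fill_system_prompt(template, passages)
print(system_prompt)
```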


In [17]:
from typing import Any, Dict, List, Tuple

def chat(query: str) -> Tuple[Dict[str, Any], List[str]]:
    """
    Auxiliary function that chains the retrieval and generation steps.

    Args:
        query (str): user's query

    Returns:
        Tuple[Dict[str, Any], List[str]]: the dict returned by the Bedrock
        InvokeModel action (the answer is under the key 'generation') and
        the retrieved context passages
    """
    # Retrieval: fetch the top-k passages from OpenSearch
    context = embedding_service.search(
        endpoint_url=CONFIG['OPENSEARCH_INDEX_URL'],
        flag_embedding_model=model,
        query=query,
        top_k=CONFIG['ITEMS_TO_RETRIEVE']
    )
    # Generation: answer the query using only the retrieved context
    return bedrock.ask(
        query,
        context,
        CONFIG['MODEL'],
        CONFIG['SYSTEM_PROMPT'],
        CONFIG['TEMPERATURE'],
        CONFIG['TOP_P'],
        CONFIG['MAX_OUTPUT_TOKENS']
    ), context

In [42]:
CONFIG["OPENSEARCH_INDEX_URL"]

'https://search-medical-qa-system-wllp2yik3gws7durfruiomvfky.us-east-2.es.amazonaws.com/embedding-finetuned-v1'

In [44]:
query_vector = model.encode("What is high blood pressure?").tolist()
# k-NN query against the index's vector field
payload = {
    "query": {"knn": {"vector_field": {"vector": query_vector, "k": CONFIG["ITEMS_TO_RETRIEVE"]}}}
}


In [46]:
import requests

# Send the k-NN query directly, using the AWS auth held by the FlagEmbeddingManager instance
response = requests.post(
    url=f'{CONFIG["OPENSEARCH_INDEX_URL"]}/_search',
    json=payload,
    auth=embedding_service.awsauth,
    headers={"Content-Type": "application/json"},
)

In [47]:
response

<Response [502]>

In [18]:
response, context = chat("What is high blood pressure?")
print(response['generation'])
print("\n".join(context))

ERROR:root:Error in search with args=(<general_utils.flagembedding.FlagEmbeddingManager object at 0x11a786fe0>,), kwargs={'endpoint_url': 'https://search-medical-qa-system-wllp2yik3gws7durfruiomvfky.us-east-2.es.amazonaws.com/embedding-finetuned-v1', 'flag_embedding_model': <FlagEmbedding.inference.embedder.encoder_only.base.BaseEmbedder object at 0x11a786fb0>, 'query': 'What is high blood pressure?', 'top_k': 3} | Exception: Expecting value: line 1 column 1 (char 0)


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
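
The `JSONDecodeError` above is consistent with the `502` response seen in the manual request: a gateway error body is typically not JSON, so parsing it fails. A minimal sketch of a more explicit failure path follows, assuming the search helper has access to the HTTP status code and the raw body. `parse_search_response` and `SearchError` are hypothetical names, and the `text` field is made up. Only the `hits.hits[]._source` layout comes from the standard OpenSearch search response.

```python
import json
from typing import Any, Dict, List

class SearchError(RuntimeError):
    """Raised when the search endpoint does not return a usable result."""

def parse_search_response(status_code: int, body: str) -> List[Dict[str, Any]]:
    """Return the `_source` of each hit, or raise a descriptive SearchError."""
    # Check the status first so that gateway errors (e.g. 502) are reported
    # as such instead of surfacing as a JSON parsing failure.
    if status_code != 200:
        raise SearchError(f"Search failed with HTTP {status_code}: {body[:200]!r}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise SearchError(f"Search returned a non-JSON body: {body[:200]!r}") from exc
    return [hit["_source"] for hit in payload.get("hits", {}).get("hits", [])]

# Successful response with one hit ("text" is a hypothetical field name)
ok_body = json.dumps({"hits": {"hits": [{"_source": {"text": "example passage"}}]}})
print(parse_search_response(200, ok_body))

# Gateway error: reported with its status code
try:
    parse_search_response(502, "<html>Bad Gateway</html>")
except SearchError as err:
    print(err)
```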

> _Disclaimer_

Latency was not addressed at all:
* for this test, the embedding model ran at full precision (fp32) on a CPU
* the machine that hosts the AWS OpenSearch collection is the smallest one can get (t3.small)

One could host the embedding model in a SageMaker endpoint (or a Bedrock one), reduce its precision if needed, and increase the size of the machine behind OpenSearch.
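
To give a rough sense of what reducing precision means numerically, the sketch below rounds a toy 768-dimensional vector (the output size of bge-base-en-v1.5) to half precision using only the standard library. The vector values are made up, and this says nothing about inference latency, which would have to be measured on the actual model and hardware.

```python
import math
import struct
from typing import List, Tuple

def roundtrip(values: List[float], fmt: str) -> Tuple[List[float], int]:
    """Pack floats with a struct format ('f' = fp32, 'e' = fp16) and unpack them again."""
    layout = f"<{len(values)}{fmt}"
    packed = struct.pack(layout, *values)
    return list(struct.unpack(layout, packed)), len(packed)

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vector with made-up values, standing in for a real embedding
embedding = [math.sin(i) / 10 for i in range(768)]

fp32_values, fp32_bytes = roundtrip(embedding, "f")
fp16_values, fp16_bytes = roundtrip(embedding, "e")

print(fp32_bytes, fp16_bytes)            # 3072 vs 1536 bytes
print(cosine(fp32_values, fp16_values))  # how much the direction changed
```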

### Evaluating the generation

One could evaluate this generation in different ways:
* reference free: for example, with an LLM-as-a-judge that analyzes the helpfulness, toxicity and faithfulness of the answer
* reference based: metrics that leverage the fact that there is a golden dataset with labels for some answers
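
As an illustration of the reference-based option, the sketch below computes a token-overlap F1 score between a generated answer and a golden answer. This is just one common choice for QA-style evaluation, shown under the assumption of simple lowercase whitespace tokenization. It is not necessarily the metric used in this notebook.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a golden answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty side counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a token is matched at most as often as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy strings: three of four tokens overlap on each side
score = token_f1("the quick brown fox", "the quick red fox")
print(score)
```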


In my case, as a final validation, I use the dataset that was provided as part of the assignment and follow a simple approach:

1. select a sample question
2. generate the response to that question
3. compare the generated response to the golden answer by 

In [27]:
raw_model = embedding_service.get_model(
    local_model_path="BAAI/bge-large-en-v1.5",
    query_instruction=CONFIG['QUERY_INSTRUCTION_AT_RETRIEVAL'],
    use_fp16=False
)