1. Obtain Pubmed data
2. Index creation: vectorise data and create a searchable vector index (Qdrant vector DB, used fastembed to vectorsie the docs)
3. setup llama.cpp project locally. Download llama models from meta, and quantise the desired llama model
4. setup RAG pipeline
   - a. accept a user question as input
   - b. retrieve relevant documents by querying the vector DB using Qdrant's semantic search engine API
   - c. construct an LLM prompt with user question, and relevant documents - instructing LLM to answer from within the provided context
   - d. Send prompt to LLM to generate an answer
   - e.  Use llama-cpp-python, python bindings to the llama.cpp project.
   - f. Prompt LLM for answer
6. Evaluate performance of RAG-LLM pipeline

### imports

In [1]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from huggingface_hub import notebook_login
from typing import List
from langchain.prompts import PromptTemplate

from fastembed.embedding import DefaultEmbedding, Embedding

1. Data prep - Pubmedata
2. Qdrant Vector DB - create collection
3. Class to handle user q
   a.  Accept user q (and any query preprocessing?)
   b. search against qdran t, ans get back retrieved articles (langchain might already have this?)
   c. send to llama.cpp, and get back the result
  

In [3]:
notebook_login()
# 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

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## 1. Load data

In [6]:
dataset_name = "FedML/PubMedQA_instruction"
data = load_dataset(dataset_name, split="train[0:2000]")
data[0]

{'instruction': 'Are group 2 innate lymphoid cells ( ILC2s ) increased in chronic rhinosinusitis with nasal polyps or eosinophilia?',
 'context': 'Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated. The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compare ILC2s across characteristics of disease. A cross-sectional study of patients with CRS undergoing endoscopic sinus surgery was conducted. Sinus mucosal biopsies were obtained during surgery and control tissue from patients undergoing pituitary tumour resection through transphenoidal approach. ILC2s were identified as CD45(+) Lin(-) CD127(+) CD4(-) CD8(-) CRTH2(CD294)(+) CD161(+) cells in single cell suspensions thr

In [8]:
docs = [{'id': idx, 'text': d['context']} for idx, d in enumerate(data)]
doc_texts = [doc['text'] for doc in docs]
doc_ids = [doc['id'] for doc in docs]

In [9]:
WRITE_PUBMED_DOCS_TO_DISK=False
if WRITE_PUBMED_DOCS_TO_DISK:
    pubmedqa_docs_json_path = f"./data/pubmedqa_docs_{len(docs)}.json"
    with open(pubmedqa_docs_json_path, 'w+') as f:
        json.dump(docs, f)

## 2. Create Collection/ searchable vector index

#### 2.1. Define storage location, qdrant client

(You need to tell Qdrant where to store embeddings. This is a basic demo, so your local computer will use its memory as temporary storage.)



In [10]:
qdrant_client = QdrantClient(":memory:")

#### 2.2. Select embedding model

In [11]:
DefaultEmbedding().list_supported_models()

[{'model': 'BAAI/bge-small-en',
  'dim': 384,
  'description': 'Fast English model',
  'size_in_GB': 0.2},
 {'model': 'BAAI/bge-small-en-v1.5',
  'dim': 384,
  'description': 'Fast and Default English model',
  'size_in_GB': 0.13},
 {'model': 'BAAI/bge-base-en',
  'dim': 768,
  'description': 'Base English model',
  'size_in_GB': 0.5},
 {'model': 'BAAI/bge-base-en-v1.5',
  'dim': 768,
  'description': 'Base English model, v1.5',
  'size_in_GB': 0.44},
 {'model': 'sentence-transformers/all-MiniLM-L6-v2',
  'dim': 384,
  'description': 'Sentence Transformer model, MiniLM-L6-v2',
  'size_in_GB': 0.09},
 {'model': 'intfloat/multilingual-e5-large',
  'dim': 1024,
  'description': 'Multilingual model, e5-large. Recommend using this model for non-English languages',
  'size_in_GB': 2.24}]

In [12]:
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en"  # "sentence-transformers/all-MiniLM-L6-v2"
qdrant_client.set_model(EMBEDDING_MODEL_NAME)

#### 2.3. Create collection

In [None]:
COLLECTION_NAME = "pubmedqa"

In [14]:
# if required to repeatedly change config of already existing collection, then use `qdrant_client.recreate_collection()`
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=qdrant_client.get_fastembed_vector_params() 
)

True

In [15]:
# qdrant_client.add() is the new api when using fastembed - it combines two actions into one: encoding documents and adding them the index. 
# https://qdrant.tech/documentation/tutorials/neural-search-fastembed/#upload-data-to-qdrant
# Not available if vectorising without fastembed.the alternate way is to use qdrant_client.upload_records().
qdrant_client.add(
    collection_name=COLLECTION_NAME,
    documents=doc_texts,
    ids=doc_ids
)

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,


## 3. Search and retrieve relevant context (Using Qdrant)

In [17]:
# let's peek at a sample data point(query, answer, context) from pubmedQA. Then check if qdrant vector search is able to retrieve the correct context given 'query'   
data[4]

{'instruction': 'Do tumor-infiltrating immune cell profiles and their change after neoadjuvant chemotherapy predict response and prognosis of breast cancer?',
 'context': 'Tumor microenvironment immunity is associated with breast cancer outcome. A high lymphocytic infiltration has been associated with response to neoadjuvant chemotherapy, but the contribution to response and prognosis of immune cell subpopulations profiles in both pre-treated and post-treatment residual tumor is still unclear. We analyzed pre- and post-treatment tumor-infiltrating immune cells (CD3, CD4, CD8, CD20, CD68, Foxp3) by immunohistochemistry in a series of 121 breast cancer patients homogeneously treated with neoadjuvant chemotherapy. Immune cell profiles were analyzed and correlated with response and survival. We identified three tumor-infiltrating immune cell profiles, which were able to predict pathological complete response (pCR) to neoadjuvant chemotherapy (cluster B: 58%, versus clusters A and C: 7%). A

In [33]:
# query() is a valid method only when using qdrant fastembed (https://github.com/qdrant/qdrant-client/blob/d6100614fd2b8413781763d57013f0ee376741e1/qdrant_client/qdrant_fastembed.py#L300)
# Also, the following code requires you have set a model with the client, using qdrant_client.set_model()

sampling_idx = 4
query = data[sampling_idx]["instruction"]

search_result = qdrant_client.query(
    collection_name=COLLECTION_NAME,
    query_text=query, 
    limit=2
)
search_result

[QueryResponse(id=4, embedding=None, metadata={'document': 'Tumor microenvironment immunity is associated with breast cancer outcome. A high lymphocytic infiltration has been associated with response to neoadjuvant chemotherapy, but the contribution to response and prognosis of immune cell subpopulations profiles in both pre-treated and post-treatment residual tumor is still unclear. We analyzed pre- and post-treatment tumor-infiltrating immune cells (CD3, CD4, CD8, CD20, CD68, Foxp3) by immunohistochemistry in a series of 121 breast cancer patients homogeneously treated with neoadjuvant chemotherapy. Immune cell profiles were analyzed and correlated with response and survival. We identified three tumor-infiltrating immune cell profiles, which were able to predict pathological complete response (pCR) to neoadjuvant chemotherapy (cluster B: 58%, versus clusters A and C: 7%). A higher infiltration by CD4 lymphocytes was the main factor explaining the occurrence of pCR, and this associati

## 4. Inference time: Langchain pipeline for RAG 

Comprises of 4 steps
- a. accept a user question as input
- b. retrieve relevant documents by querying the vector DB using Qdrant's semantic search engine API
- c. construct a RAG prompt for LLM with user question, and relevant documents - instructing LLM to answer from within the provided context
- d. send constructed prompt to LLM to generate an answer (use llama-cpp-python, python bindings to the llama.cpp project)


##### utils

In [None]:
# define class retriever/ vectorstore_as_retriever
def retrieve_relevant_doc_context(query, qdrant_client, collection_name, top_k=3, verbose=False):
    rel_docs = []
    search_hits = qdrant_client.query(
        collection_name=collection_name,
        query_text=query, 
        limit=top_k)
    for hit in search_hits:
        hit_dict = {'text': hit.metadata['document'], 
                    'score': hit.score}
        rel_docs.append(hit_dict['text'])
        if verbose:
            print(hit_dict)
    return rel_docs  # hit_dict

In [7]:
#################
# prompt template
#################
RAG_PROMPT_string = ("""\


Human: Here is a question from a medical professional: \n\n<question> \n{user_query}\n</question>


Here are some search results from a medical encyclopedia that you must reference to answer the question: 

{extracts}


Once again, here is the question: 

<question>

{user_query}

</question>

Your objective is to write a high quality, concise answer 
for the medical professional within <answer> </answer> tags. Otherwise, write ANSWER NOT FOUND)

Assistant: <answer>\n\n """  # noqa: E501
)

rag_prompt_template = PromptTemplate.from_template(RAG_PROMPT_string)
print(f"Input variables to populate: {rag_prompt_template.input_variables}")
print(f"Instruction prompt template: {rag_prompt_template.template}")

Input variables to populate: ['extracts', 'user_query']
Instruction prompt template: 

Human: Here is a question from a medical professional: 

<question> 
{user_query}
</question>


Here are some search results from a medical encyclopedia that you must reference to answer the question: 

{extracts}


Once again, here is the question: 

<question>

{user_query}

</question>

Your objective is to write a high quality, concise answer 
for the medical professional within <answer> </answer> tags. Otherwise, write ANSWER NOT FOUND)

Assistant: <answer>

 


In [60]:
def prep_rag_prompt(query: str,
                    rel_search_extracts: List,
                    prompt_template,
                   ) -> str:
    prompt = prompt_template.format(extracts='\n\n'.join(rel_docs),
                                         user_query=query,
                                         )
    return prompt


#### deploying up llama.cpp locally as a service

Reference: https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa#llama2

In [64]:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [None]:
n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
LLAMA_CPP_Q8_PATH = "/Users/mitrap/PycharmProjects/llama.cpp/models/7B-chat/ggml-model-q8_0.gguf"

In [None]:
#llama_model_7bchat_q8
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=LLAMA_CPP_Q8_PATH,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

#### Prompt LLM without RAG prompt

In [9]:
# sample a query
sampling_idx = 5
query = data[sampling_idx]["instruction"]
query

NameError: name 'data' is not defined

In [67]:
llm(query)


llama_print_timings:        load time =    6298.15 ms


"\nHidradenitis suppurativa (HS) is a chronic skin condition characterized by recurrent, painful abscesses and nodules in the apocrine gland-rich areas of the body. While HS has been traditionally considered a localized skin disease, recent studies suggest that it may also have systemic manifestations and comorbidities. This study aimed to investigate the prevalence and characteristics of comorbidities in patients with HS compared to healthy controls using a chart-verified case-control analysis.\nMethods:\nWe conducted a retrospective, chart-verified case-control analysis of 100 patients with HS and 200 age- and sex-matched healthy controls from a tertiary care center. Cases were identified through clinical diagnosis, and controls were selected from the hospital's outpatient registry. Demographic data, medical history, and medication use were collected for each participant. Comorbidities were defined as any coexisting medical condition beyond HS.\nResults:\nCompared to controls, patien

llama_print_timings:      sample time =      25.39 ms /   256 runs   (    0.10 ms per token, 10081.12 tokens per second)
llama_print_timings: prompt eval time =    6297.92 ms /    32 tokens (  196.81 ms per token,     5.08 tokens per second)
llama_print_timings:        eval time =   12004.33 ms /   255 runs   (   47.08 ms per token,    21.24 tokens per second)
llama_print_timings:       total time =   18760.77 ms


#### Prompt LLM with RAG prompt

Comprises of 4 steps
- a. accept a user question as input
- b. retrieve relevant documents by querying the vector DB using Qdrant's semantic search engine API
- c. construct a RAG prompt for LLM with user question, and relevant documents - instructing LLM to answer from within the provided context
- d. send constructed prompt to LLM to generate an answer (use llama-cpp-python, python bindings to the llama.cpp project)


In [None]:
rel_docs = retrieve_relevant_doc_context(query=query,
                             qdrant_client=qdrant_client,
                             collection_name=COLLECTION_NAME,
                             top_k=3)

rag_prompt = prep_rag_prompt(query=query,
                             rel_search_extracts = rel_docs,
                             prompt_template=rag_prompt_template)


In [61]:
llm(rag_prompt)

Llama.generate: prefix-match hit



" Hidradenitis suppurativa (HS) is a chronic inflammatory disease involving intertriginous skin. The prevalence and comorbidities of HS have been studied in a large patient care database, including the presence of autoimmune disorders such as Hashimoto's thyroiditis (HT). In this study, 1776 patients with HS were matched with 2045 controls based on age, gender, and race. The prevalence of comorbidities was compared between the two groups using unadjusted and adjusted analyses.\n<quote>HS patients had a higher prevalence of autoimmune disorders including HT (13.6% vs 6.7%, P < .001), as well as other comorbidities such as smoking, arthropathies, dyslipidemia, polycystic ovarian syndrome, psychiatric disorders, obesity, drug dependence, hypertension, diabetes, thyroid disease, alcohol dependence, and lymphoma (all P < .01).</quote>\nThe study suggests that HS"

llama_print_timings:        load time =    7747.06 ms
llama_print_timings:      sample time =      22.58 ms /   256 runs   (    0.09 ms per token, 11335.46 tokens per second)
llama_print_timings: prompt eval time =   10753.12 ms /  1330 tokens (    8.09 ms per token,   123.69 tokens per second)
llama_print_timings:        eval time =   13141.50 ms /   255 runs   (   51.54 ms per token,    19.40 tokens per second)
llama_print_timings:       total time =   24295.46 ms
