<a href="https://colab.research.google.com/github/polyexplorer/open-llm/blob/main/RAG%2BEval(Llama_Index).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dependencies

In [None]:
! pip install transformers optimum accelerate langchain llama_index sentence_transformers peft trulens-eval
! pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
! pip install pypdf pymupdf chromadb InstructorEmbedding
! mkdir -p pdfs

# Huggingface LLM (Integrated with LlamaIndex)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline
import torch

from llama_index.llms.huggingface import HuggingFaceLLM

torch.cuda.empty_cache()
model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                            #  trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
# pipe = pipeline(
#             "text-generation",
#             model=model,
#             tokenizer=tokenizer,
#             max_new_tokens=200,
#             do_sample=True,
#             temperature=0.1,
#             top_k=40,
#             top_p=0.95,
#             repetition_penalty=1.15,
#             streamer=streamer,
#         )

# langchain_llm = HuggingFacePipeline(pipeline=pipe)

llm = HuggingFaceLLM(
    model = model,
    tokenizer = tokenizer,
    context_window = 4096,
    max_new_tokens = 128,
    query_wrapper_prompt = """<|user|>
{query_str}</s>
<|assistant|>
""",
    system_prompt = """
You are a good Q/A chatbot who always answers the question based on the context only.</s>""",

)
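
The `query_wrapper_prompt` above follows Zephyr's chat markup, while the `system_prompt` omits the `<|system|>` header that the same markup uses for system turns. As a rough illustration of what a fully tagged prompt looks like, here is a minimal sketch. The helper `build_zephyr_prompt` is hypothetical and is not how `HuggingFaceLLM` assembles prompts internally; it only shows the intended string layout.

```python
def build_zephyr_prompt(system_prompt: str, query_str: str) -> str:
    """Assemble a Zephyr-style chat prompt (hypothetical helper for illustration)."""
    return (
        f"<|system|>\n{system_prompt.strip()}</s>\n"  # system turn
        f"<|user|>\n{query_str.strip()}</s>\n"        # user turn
        "<|assistant|>\n"                             # the model continues from here
    )


prompt = build_zephyr_prompt(
    "You are a good Q/A chatbot who always answers the question based on the context only.",
    "How many sites are planned for the study?",
)
print(prompt)
```

Printing the assembled string before wiring it into the LLM is a cheap way to check that every turn is terminated with `</s>` and that the prompt ends on the assistant tag.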

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


# Basic RAG Pipeline

## Ingestion

### File Upload

In [44]:
import os
from google.colab import files


uploaded = files.upload()

for fn, content in uploaded.items():
  filename = os.path.join("pdfs",fn)
  with open(filename, 'wb') as f:
    f.write(content)
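
The loop above leaves `filename` pointing at the last uploaded file only. If several PDFs are uploaded at once, a small helper that returns every saved path may be more convenient. This is a sketch under assumptions: `save_uploads` is a hypothetical name, and it expects the same `{name: bytes}` mapping that `files.upload()` returns.

```python
import os


def save_uploads(uploaded: dict, target_dir: str = "pdfs") -> list:
    """Write each uploaded file into target_dir and return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)  # create the folder if it is missing
    paths = []
    for name, content in uploaded.items():
        path = os.path.join(target_dir, name)
        with open(path, "wb") as f:
            f.write(content)
        paths.append(path)
    return paths


# In Colab this could be used as:
#   paths = save_uploads(files.upload())
# and the resulting list passed to SimpleDirectoryReader(input_files=paths).
```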


Saving 263-102-00006_Protocol_Amendment_1_14Nov2019.pdf to 263-102-00006_Protocol_Amendment_1_14Nov2019 (2).pdf


### DocumentStore

In [46]:
from llama_index import Document, SimpleDirectoryReader

# `filename` is the path of the last file saved by the upload cell above.
# filename = "263-102-00006_Protocol_Amendment_1_14Nov2019 (1).pdf"

documents = SimpleDirectoryReader(
    input_files=[filename]
).load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

### Embeddings

In [47]:
# VectorStore Embeddings
from llama_index.embeddings import InstructorEmbedding
from llama_index import ServiceContext

# model_name = "AnnaWegmann/Style-Embedding"
model_name = "hkunlp/instructor-large"
text_instruction = "Represent the Medical document for retrieving important points where answer can be found:"
query_instruction = "Represent the Medical question for retrieving supporting documents:"

embed_model = InstructorEmbedding(
    model_name= model_name,
    text_instruction=text_instruction,
    query_instruction=query_instruction
    )

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)

load INSTRUCTOR_Transformer
max_seq_length  512


### Index

In [49]:
# Remove any previously persisted index so it is rebuilt from the current documents
! rm -rf sentence_index

In [50]:
# Sentence Window Index

from llama_index import ServiceContext, VectorStoreIndex, StorageContext
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index import load_index_from_storage
import os


# index = VectorStoreIndex.from_documents(documents,
#                                         service_context=service_context)

def build_sentence_window_index(
    documents, llm, embed_model="local:BAAI/bge-small-en-v1.5", save_dir="sentence_index"
):
    # create the sentence window node parser w/ default settings
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=2,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    sentence_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )
    if not os.path.exists(save_dir):
        sentence_index = VectorStoreIndex.from_documents(
            documents, service_context=sentence_context
        )
        sentence_index.storage_context.persist(persist_dir=save_dir)
    else:
        sentence_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=save_dir),
            service_context=sentence_context,
        )

    return sentence_index

def get_sentence_window_query_engine(
    sentence_index,
    similarity_top_k=6,
    rerank_top_n=2,
):
    # define postprocessors
    postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )

    sentence_window_engine = sentence_index.as_query_engine(
        similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
    )
    return sentence_window_engine


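# Note: embed_model is not passed here, so the function's default
# ("local:BAAI/bge-small-en-v1.5") is used rather than the InstructorEmbedding
# model configured in the Embeddings section. Pass embed_model=embed_model to
# index with the Instructor embeddings instead.
sentence_index = build_sentence_window_index(documents=documents, llm=llm, save_dir="sentence_index")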
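
To make the sentence-window idea concrete: each sentence becomes its own node, and its metadata carries a "window" made of the sentence plus its neighbours on each side. The sketch below imitates that with plain Python. It is an approximation for illustration, not `SentenceWindowNodeParser`'s actual implementation; the function name `sentence_windows` and the naive regex sentence splitter are assumptions.

```python
import re


def sentence_windows(text: str, window_size: int = 2) -> list:
    """Return one dict per sentence with its surrounding window of sentences."""
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)                   # clamp at document start
        hi = min(len(sentences), i + window_size + 1)  # clamp at document end
        nodes.append({
            "original_text": sentence,
            "window": " ".join(sentences[lo:hi]),
        })
    return nodes


nodes = sentence_windows("A one. B two. C three. D four. E five.", window_size=1)
for node in nodes:
    print(node)
```

The single sentence is what gets embedded and matched, while the wider window is what the LLM eventually reads, which is why the query engine swaps in the `window` metadata before synthesis.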


## Retrieval

### Initialize RAG Pipeline

In [59]:
rag_pipeline = get_sentence_window_query_engine(sentence_index)
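
The query engine applies three steps in order: take the `similarity_top_k` nearest sentence nodes, replace each node's text with its `window` metadata, then rerank and keep `rerank_top_n`. The toy sketch below mirrors that flow with plain dictionaries. Everything in it is an assumption made for illustration: `retrieve_then_rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer merely stands in for the cross-encoder reranker.

```python
def retrieve_then_rerank(candidates, rerank_score, similarity_top_k=6, rerank_top_n=2):
    """Toy version of: vector top-k -> window replacement -> rerank top-n."""
    # 1) keep the k candidates with the highest embedding similarity
    top_k = sorted(candidates, key=lambda c: c["similarity"], reverse=True)[:similarity_top_k]
    # 2) swap each sentence for its surrounding window
    expanded = [{**c, "text": c["window"]} for c in top_k]
    # 3) rescore the expanded text and keep the best n
    reranked = sorted(expanded, key=lambda c: rerank_score(c["text"]), reverse=True)
    return reranked[:rerank_top_n]


def overlap_score(query):
    """Stand-in scorer: number of query words that appear in the text."""
    terms = set(query.lower().split())
    return lambda text: len(terms & set(text.lower().split()))


candidates = [
    {"similarity": 0.9, "original_text": "Subjects are healthy.",
     "window": "Subjects are healthy. They are male."},
    {"similarity": 0.8, "original_text": "One site is planned.",
     "window": "The trial is single-site. One site is planned."},
    {"similarity": 0.1, "original_text": "Unrelated.", "window": "Unrelated."},
]
top = retrieve_then_rerank(
    candidates, overlap_score("how many site planned"),
    similarity_top_k=2, rerank_top_n=1,
)
print(top[0]["original_text"])
```

Note that the reranker sees the window text, not the single sentence, so a candidate with lower embedding similarity can still come out on top.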

### Q/A

In [52]:
response_1 = rag_pipeline.query("How many number of Site(s) will this trial take place on?")
print(str(response_1))

The context information provided indicates that this will be a single-site trial. Therefore, only one site will be involved in this trial.


In [53]:
response_2 = rag_pipeline.query("How many patients are planned to be taken for the trial?  ")
print(str(response_2))

The query asks how many patients are planned to be taken for the trial. Based on the context information provided, it can be seen that the trial population is planned to have at least 8 healthy male Japanese subjects who will receive the IV infusion, with a maximum of 10 subjects being dosed in total. Therefore, the answer to the query is that at least 8 and a maximum of 10 patients are planned to be taken for the trial.


In [54]:
response_2.source_nodes[0].node.metadata

{'window': '7) Positive alcohol breath test result or positive urine drug screen (confirmed by repeat) at Screening or Check -in. \n 8) Positive hepatitis panel and/or positive hum an immunodeficiency virus test. \n 9) Participation in a clinical trial involving administration of an investigational drug (new chemical entity  and/or OPC-61815) in the past  90 days prior to dosing or \n5 half- lives of the investigational drug . \n 10) Use or intend to use any medications/products known to alter drug absorption, metabolism, or elimination processes, including St. ',
 'original_text': '9) Participation in a clinical trial involving administration of an investigational drug (new chemical entity  and/or OPC-61815) in the past  90 days prior to dosing or \n5 half- lives of the investigational drug . \n',
 'page_label': '29',
 'file_name': '263-102-00006_Protocol_Amendment_1_14Nov2019 (2).pdf',
 'file_path': 'pdfs/263-102-00006_Protocol_Amendment_1_14Nov2019 (2).pdf',
 'file_type': 'applicati
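
Raw metadata dictionaries like the one above are hard to scan. A small helper can condense each source node to its page, score, and matched sentence. This is a sketch: `summarize_sources` is a hypothetical name, and it only assumes that each entry in `source_nodes` exposes `.score` and `.node.metadata`, as the cell above does. The demo uses a stand-in object so it runs without the index.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars: int = 80) -> list:
    """Condense each source node to page label, score, and matched sentence."""
    rows = []
    for item in response.source_nodes:
        meta = item.node.metadata
        text = meta.get("original_text", "")
        rows.append({
            "page_label": meta.get("page_label"),
            "score": item.score,
            # collapse line breaks and repeated spaces left over from PDF extraction
            "original_text": " ".join(text.split())[:max_chars],
        })
    return rows


# Stand-in response object for demonstration only
fake_response = SimpleNamespace(source_nodes=[
    SimpleNamespace(
        score=0.42,
        node=SimpleNamespace(metadata={
            "page_label": "29",
            "original_text": "9) Participation  in a\nclinical trial",
        }),
    )
])
rows = summarize_sources(fake_response)
print(rows)

# With the pipeline above: summarize_sources(response_2)
```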

In [55]:
response_3 = rag_pipeline.query("What is the age of participants mentioned in the text? ")
print(str(response_3))

The participants mentioned in the text are male subjects between 35 and 55 years of age, as stated in the inclusion criteria mentioned in section 5.2.1.


In [56]:
response_full = rag_pipeline.query("What are the demographics (age, gender, ethnicity etc) of the patients/subjects being taken for the trial?")
print(str(response_full))

According to the context information provided, the demographic information (collection date, year of birth, age, sex, race, ethnicity, and country) will be recorded for all subjects at the screening visit (page_label: 28). However, the query does not specify whether the demographic information is for all subjects or just for the patients/subjects being taken for the trial. If we assume that the query refers to the subjects being taken for the trial, then the answer would be that the demographic information will be collected for at least 8 healthy male Japanese subjects who will be dosed to ensure that 


In [57]:
response_4 = rag_pipeline.query("In an inpatient study, participants are admitted to a study site or are admitted to a clinic. In an inpatient study, the text also might mention subjects checking in and getting discharged. Is it an inpatient study? ")
print(str(response_4))

Based on the context information provided, it is unclear whether this is an inpatient study or an outpatient study. The text mentions "Final D discharge from trial will be day of discharge from residential treatment period or the day of last outpatient visit if required to attend them based on the discharge criteria." This suggests that some participants may be discharged from residential treatment, which could indicate an inpatient component, but it also mentions outpatient visits. Without further information, it is not possible to determine whether this is an inpatient or outpatient study.


In [58]:
response_5 = rag_pipeline.query("In an outpatient study, participants visit the study site or must visit a clinic or must visit a hospital. In an outpatient study, participants do not stay overnight. Is it an outpatient study?   ")
print(str(response_5))

Yes, based on the context provided, it is an outpatient study as participants do not stay overnight and visits to the study site or clinic are required during nonresidential collection intervals.


## Evaluation

In [None]:
eval_questions = ["""Are there sections mentioning Interim?""",
"""Are there sections mentioning IA? """,
"""How many sites are planned for the study? """,
"""How many countries are planned for the study? """,
"""Is a Non-USA country involved in the study? """]

from trulens_eval import Tru
tru = Tru()

tru.reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [None]:


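# NOTE: get_prebuilt_trulens_recorder is not defined or imported anywhere in this
# notebook; it must be supplied (for example, from a separate utilities module)
# before this cell will run. The query engine built above is `rag_pipeline`.
tru_recorder = get_prebuilt_trulens_recorder(rag_pipeline,
                                             app_id="Direct Query Engine")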
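
Once a recorder exists, the evaluation questions still need to be run through the query engine while recording is active. A minimal sketch of that loop follows. `run_eval` and `EchoEngine` are hypothetical names, and the sketch assumes the recorder can be used as a context manager; when no recorder is given it simply runs the queries.

```python
from contextlib import nullcontext


def run_eval(query_engine, questions, recorder=None):
    """Run every question through the engine, optionally inside a recorder context."""
    answers = {}
    with (recorder if recorder is not None else nullcontext()):
        for question in questions:
            q = question.strip()  # drop trailing spaces from the question strings
            answers[q] = str(query_engine.query(q))
    return answers


class EchoEngine:
    """Stand-in engine so the sketch runs without a model."""

    def query(self, question):
        return f"echo: {question}"


answers = run_eval(EchoEngine(), ["How many sites are planned for the study? "])
print(answers)

# With the objects above: run_eval(rag_pipeline, eval_questions, recorder=tru_recorder)
```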