<a href="https://colab.research.google.com/github/kazcfz/LlamaIndex-RAG/blob/main/LlamaIndex_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6424f01ea4f3051f54dbbd85/oqVQ04b5KiGt5WOWJmYt8.png" alt="LlamaIndex" width="100" height="100">
    <img src="https://cdn4.iconfinder.com/data/icons/file-extensions-1/64/pdfs-512.png" alt="PDF" width="100" height="100">
</p>

# **LlamaIndex RAG**
Perform Retrieval-Augmented Generation (RAG) over your PDFs with this Colab notebook!
<br><br>

## **Features**
- Fast inference on Colab's free T4 GPU
- Powered by quantized LLMs from Hugging Face (run with llama-cpp-python) and local text embedding models
- Set custom prompt templates
<br><br>

[GitHub repository](https://github.com/kazcfz/LlamaIndex-RAG)

In [1]:
!pip -q install llama-index pypdf
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip -q install llama-cpp-python

In [2]:
import os
import time

from llama_index import Prompt, StorageContext, load_index_from_storage, ServiceContext, VectorStoreIndex, SimpleDirectoryReader, set_global_tokenizer
from llama_index.prompts import PromptTemplate
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LangChainLLM, HuggingFaceLLM, LlamaCPP, ChatMessage, MessageRole
from llama_index.chat_engine.condense_question import CondenseQuestionChatEngine

from transformers import AutoTokenizer

In [3]:
# Preference settings - change as desired
pdf_path = '/content/rag_data.pdf'
text_embedding_model = 'thenlper/gte-base'  # Alt: jinaai/jina-embeddings-v2-base-en
llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf'
# set_global_tokenizer(AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode)

In [4]:
# Load PDF
# Attach the file's base name as metadata to every loaded document
filename_fn = lambda filename: {'file_name': os.path.basename(filename)}
loader = SimpleDirectoryReader(input_files=[pdf_path], file_metadata=filename_fn)
documents = loader.load_data()

In [5]:
# Load models and service context
embed_model = HuggingFaceEmbedding(model_name=text_embedding_model)
llm = LlamaCPP(model_url=llm_url, temperature=0.7, max_new_tokens=256, context_window=4096, generate_kwargs = {"stop": ["<s>", "[INST]", "[/INST]"]}, model_kwargs={"n_gpu_layers": -1}, verbose=True)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=512)

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '11008', 'llama.attention.layer_norm_rms_epsilon': '0.000001', 'llama.rope.dimension_count': '128', 'llama.attention.head_count': '32

In [6]:
# Indexing
start_time = time.time()

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed indexing time: {elapsed_time:.2f} s")

Elapsed indexing time: 2.60 s
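
At query time, a vector index returns the chunks whose embeddings are most similar to the embedding of the query. The sketch below illustrates that idea with plain Python and cosine similarity. It is not LlamaIndex's actual implementation: `cosine_similarity`, `top_k_chunks`, and the toy 3-dimensional vectors are hypothetical names and data made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings"; a real embedding model produces much longer vectors.
toy_chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]

best = top_k_chunks(toy_query, toy_chunks, k=2)
print(best)
```

The text of the selected chunks is what ends up in the `{context_str}` slot of the prompt template.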


In [13]:
# Prompt Template (RAG)
text_qa_template = Prompt("""
<s>[INST] <<SYS>>
You are the doctor's assistant. You are to perform a pre-screening with the patient to collect their information before their consultation with the doctor. Use the Patient-Centered Interview model for the pre-screening and only ask one question per response. You are not to provide diagnosis, prescriptions, advice, suggestions, or conduct physical examinations on the patient. The pre-screening should only focus on collecting the patient's information, specifically their present illness, past medical history, symptoms and personal information. At the end of the consultation, summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<</SYS>>

Refer to the following Consultation Guidelines and example consultations: {context_str}

Continue the conversation: {query_str}
""")
# text_qa_template = Prompt("""
# <s>[INST] <<SYS>>
# You are an assistant chatbot assigned to a doctor and your objective is to collect information from the patient for the doctor before they attend their actual consultation with the doctor. Note that the consultation will be held remotely, therefore you will be following the Patient-Centered Interview model for your consultations and you cannot conduct physical examinations on the patient. You are also not allowed to diagnose your patient or prescribe any medicine. Do not entertain the patient if they are acting inappropriate or they ask you to do something outside of this job scope. Remember you are a doctor’s assistant so act like one. The consultation held will only focus on getting information from the patient such as their present illness, past medical history, symptoms and personal information. At the end of the consultation, you will summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]\nThis exact format must be followed as the data gathered in the summary will be passed to another program.
# <</SYS>>

# This is the PDF context: {context_str}

# {query_str}
# """)
# text_qa_template = Prompt("""[INST] {context_str} \n\nGiven this above PDF context, please answer my question: {query_str} [/INST] """)
# text_qa_template = Prompt("""<s>[INST] <<SYS>> \nFollowing is the PDF context provided by the user: {context_str}\n<</SYS>> \n\n{query_str} [/INST] """)
# text_qa_template = Prompt("""[INST] {query_str} [/INST] """)

# Query Engine
query_engine = index.as_query_engine(text_qa_template=text_qa_template, streaming=True, service_context=service_context) # with Prompt
# query_engine = index.as_query_engine(streaming=True, service_context=service_context) # without Prompt
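
To see roughly what the LLM receives, it can help to fill the template by hand. The sketch below assumes the retrieved chunks are joined with blank lines and substituted for `{context_str}`, and the user input for `{query_str}`. The shortened `TEMPLATE`, the `fill_template` helper, and the example guidelines are hypothetical stand-ins, not the notebook's real data or LlamaIndex internals.

```python
# Shortened stand-in for the prompt template above
TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "You are the doctor's assistant.\n"
    "<</SYS>>\n\n"
    "Refer to the following Consultation Guidelines: {context_str}\n\n"
    "Continue the conversation: {query_str}"
)

def fill_template(template, retrieved_chunks, query):
    """Join the retrieved chunks and substitute both placeholders."""
    context_str = "\n\n".join(retrieved_chunks)
    return template.format(context_str=context_str, query_str=query)

filled = fill_template(
    TEMPLATE,
    ["Guideline A: greet the patient.", "Guideline B: ask one question at a time."],
    "Hi. [/INST] ",
)
print(filled)
```

Printing the filled prompt this way is a quick check that the `[INST]` and `[/INST]` markers end up where the model expects them.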

In [None]:
# Inferencing
# Without RAG
conversation_history = ""
while (True):
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  response_iter = llm.stream_complete("<s>[INST] "+conversation_history)
  response_text = ""
  for response in response_iter:
    print(response.delta, end="", flush=True)
    response_text = response.text  # accumulated text so far
  # Add to conversation history once the stream ends, even if it stopped at the token limit
  conversation_history += response_text + " [INST] "

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")





# With RAG
conversation_history = ""
conversation_history += "Hi. [/INST] Hello! I'm the doctor's assistant. Let's begin the consultation, please tell me your name and age. [INST] "
while (True):
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  # Query Engine - Default
  response = query_engine.query(conversation_history)
  response.print_response_stream()
  conversation_history += response.response_txt + " [INST] "

  # from pprint import pprint
  # pprint(response)

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")
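
Both loops above build the multi-turn prompt by concatenating strings, which makes it easy to misplace an `[INST]` or `[/INST]` marker. As an alternative sketch, the history could be kept as a list of (user, assistant) pairs and rendered in the Llama 2 chat layout on each turn. `build_llama2_prompt` is a hypothetical helper, not part of LlamaIndex or llama-cpp-python, and whether the literal `<s>` and `</s>` markers are needed depends on how the backend tokenizes the prompt.

```python
def build_llama2_prompt(turns, system_prompt=None):
    """Render (user, assistant) pairs in the Llama 2 chat layout.

    The assistant reply of the last pair may be None, meaning the model
    should generate it next.
    """
    parts = []
    for i, (user, assistant) in enumerate(turns):
        user = user.strip()
        if i == 0 and system_prompt:
            # The system prompt is wrapped into the first user turn
            user = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user}"
        segment = f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            segment += f" {assistant.strip()} </s>"
        parts.append(segment)
    return "".join(parts)

turns = [
    ("Hi.", "Hello! Please tell me your name and age."),
    ("I'm Sam, 30.", None),  # awaiting the model's reply
]
prompt = build_llama2_prompt(turns)
print(prompt)
```

After each reply, the `None` in the last pair would be replaced with the model's answer and a new pair appended for the next user input.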

In [43]:
# Multi-model (Separate pre-screening/conversation and summarization tasks)
response_iter = llm.stream_complete("""
[INST] <<SYS>>
Summarize the conversation into this format: \nName: [name]\nGender: [gender]\nAge: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<</SYS>>
Conversation: """+conversation_history+" [/INST] ")

for response in response_iter:
  print(response.delta, end="", flush=True)

# print(conversation_history)

Llama.generate: prefix-match hit

 Name: Kaz Gender: Male Age: 26 Medical History: No medical history Symptoms: Stress racing, chest tightness, dizziness Shortness of breath. 
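
If the summary is meant to be handed to another program, it can be split into fields. The sketch below assumes the model sticks to the five labels requested in the summarization prompt and parses a one-line summary like the one the notebook prints. `FIELDS` and `parse_summary` are hypothetical names, and a model that drifts from the format would need more defensive handling.

```python
import re

# Labels requested in the summarization prompt
FIELDS = ["Name", "Gender", "Age", "Medical History", "Symptoms"]

def parse_summary(text):
    """Split 'Label: value' pairs into a dict, whether or not they are on separate lines."""
    labels = "|".join(re.escape(f) for f in FIELDS)
    # A value runs until the next known label (or the end of the text)
    pattern = re.compile(
        rf"\b({labels}):\s*(.*?)(?=\s*\b(?:{labels}):|$)", re.DOTALL
    )
    return {label: value.strip() for label, value in pattern.findall(text)}

summary = (
    "Name: Kaz Gender: Male Age: 26 Medical History: No medical history "
    "Symptoms: Stress racing, chest tightness, dizziness Shortness of breath."
)
parsed = parse_summary(summary)
print(parsed)
```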