# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * recently released fiction books,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


# RAG + Llama 3 8B

## Knowledge Base: Recent Scientific Papers




papers_links = ["https://arxiv.org/pdf/2405.04078", "https://arxiv.org/pdf/2405.06628",
          "https://arxiv.org/pdf/2405.04684",
          "https://arxiv.org/pdf/2404.11221", "https://arxiv.org/pdf/2405.00702"]
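
The assignment requires the documents to be newer than the LLM's training data. New-style arXiv identifiers encode the submission year and month as `YYMM`, so a rough check is possible from the URLs alone. The sketch below is illustrative only: the helper name `arxiv_submission_month` and the value of `ASSUMED_TRAINING_CUTOFF` are assumptions, not part of the original notebook, and the real cutoff should be taken from the model card of the LLM that is actually loaded.

```python
import re
from datetime import date


def arxiv_submission_month(url: str) -> date:
    """Return the first day of the month encoded in a new-style arXiv ID (YYMM.NNNNN)."""
    match = re.search(r"(\d{2})(\d{2})\.\d{4,5}", url)
    if match is None:
        raise ValueError(f"No arXiv identifier found in {url!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    return date(year, month, 1)


# Assumption for illustration only: replace with the cutoff stated on the model card.
ASSUMED_TRAINING_CUTOFF = date(2023, 12, 31)

papers_links = ["https://arxiv.org/pdf/2405.04078", "https://arxiv.org/pdf/2405.06628",
                "https://arxiv.org/pdf/2405.04684",
                "https://arxiv.org/pdf/2404.11221", "https://arxiv.org/pdf/2405.00702"]

# Map each URL to True if its submission month is after the assumed cutoff.
newer_than_cutoff = {
    url: arxiv_submission_month(url) > ASSUMED_TRAINING_CUTOFF for url in papers_links
}
print(newer_than_cutoff)
```

All five links carry identifiers from April or May 2024, so they pass any cutoff earlier than that.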

## Define Prompts

relevant_prompt_1 = "What are the four experimental evaluation settings mentioned in the paper WISER: Weak supervISion and supErvised Representation learning to improve drug response prediction in cancer?"

relevant_prompt_2 = "How many authors are there for Optimal epidemic control by social distancing and vaccination of an infection structured by time since infection: the covid-19 case study?"

relevant_prompt_3 = "What is the objective of VLST: Virtual Lung Screening Trial for Lung Cancer Detection Using Virtual Imaging Trial?"

irrelevant_prompt = "What is the weather in Budapest?"

## Load the Llama 3 8B Instruct LLM

## Installing Dependencies

In [None]:
# Version specifiers are quoted so the shell does not treat ">=" as a redirection.
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.7.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imageio 2.31.6 requires pillow<10.1.0,>=8.3.2, but you have pillow 10.3.0 which is incompatible.

In [None]:
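import os

# Kill the current process to force a runtime restart, so the freshly installed packages are loaded.
os.kill(os.getpid(), 9)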

In [1]:
import locale
import os
from textwrap import fill

import torch
from huggingface_hub import login
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.vectorstores.utils import filter_complex_metadata
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig

locale.getpreferredencoding = lambda: "UTF-8"

# Read the Hugging Face access token from the environment instead of hardcoding it in the notebook.
# The variable name HF_TOKEN is an assumption; use whatever name holds your token.
HUGGINGFACE_UAT = os.environ["HF_TOKEN"]
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B instruction-tuned language model from Meta
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Step 4: Test The Model with Self-Designed Prompts

In [12]:
relevant_prompt_1 = "What are the four experimental evaluation settings mentioned in the paper WISER: Weak supervISion and supErvised Representation learning to improve drug response prediction in cancer?"

relevant_prompt_2 = "What is the LAR hypothesis mentioned in the paper Optimal epidemic control by social distancing and vaccination of an infection structured by time since infection: the covid-19 case study?"

relevant_prompt_3 = "What is the objective of VLST: Virtual Lung Screening Trial for Lung Cancer Detection Using Virtual Imaging Trial?"

irrelevant_prompt = "What is my height?"

In [3]:
template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)
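
The template above hand-writes the special tokens of the Llama 3 Instruct chat format. As an illustration of how that format is assembled from a list of messages, here is a minimal sketch; `build_llama3_prompt` is a hypothetical helper that is not part of the notebook, and the layout reflects the Llama 3 format as commonly documented. In practice, `tokenizer.apply_chat_template` from `transformers` can build the prompt from the tokenizer's own template, which avoids hand-written tokens altogether.

```python
def build_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Assemble a Llama 3 Instruct style prompt from role/content messages (illustrative)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # An open assistant header tells the model that it should answer next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_prompt = build_llama3_prompt(
    [{"role": "user", "content": "What is my height?"}]
)
print(example_prompt)
```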

In [5]:
result = llm(prompt.format(text=relevant_prompt_1))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What are the four experimental evaluation
settings mentioned in the paper WISER: Weak supervISion and supErvised Representation learning to
improve drug response prediction in cancer?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
According to the paper "WISER: Weak supervISion and supErvised Representation learning to improve
drug response prediction in cancer" (https://arxiv.org/abs/2010.14441), the authors mention four
experimental evaluation settings:  1. **Weakly Supervised Setting**: In this setting, the model is
trained on a small set of labeled data (e.g., 100-200 samples) and a large set of unlabeled data.
The goal is to learn a representation that can generalize well to new, unseen data. 2. **Supervised
Setting**: In this setting, the model is trained on a large set of fully labeled data (e.g.,
thousands of samples). This setting serves as a baseline for evaluating the performance of the
proposed method. 3. **

In [10]:
result = llm(prompt.format(text=relevant_prompt_2))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the LAR hypothesis mentioned in
the paper Optimal epidemic control by social distancing and vaccination of an infection structured
by time since infection: the covid-19 case
study?<|eot_id|><|start_header_id|>assistant<|end_header_id|> A very specific question!  The LAR
hypothesis refers to the "Latent-Acute-Recovery" hypothesis, which is a key concept in epidemiology.
In the context of the paper you mentioned, it's used to describe the dynamics of COVID-19
transmission.  The LAR hypothesis proposes that individuals infected with SARS-CoV-2 go through
three distinct stages:  1. **Latent**: The incubation period, during which the individual is
infectious but not yet showing symptoms. 2. **Acute**: The symptomatic phase, where the individual
is actively sick and contagious. 3. **Recovery**: The convalescent phase, where the individual has
recovered from the illness and is no longer infectious.  This hypothesis is importan

In [7]:
result = llm(prompt.format(text=relevant_prompt_3))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the objective of VLST: Virtual
Lung Screening Trial for Lung Cancer Detection Using Virtual Imaging
Trial?<|eot_id|><|start_header_id|>assistant<|end_header_id|> The Virtual Lung Screening Trial
(VLST) is a research study aimed at evaluating the effectiveness and safety of virtual lung
screening using low-dose computed tomography (LDCT) scans in detecting lung cancer. The primary
objectives of the trial are:  1. To determine whether LDCT can reduce lung cancer mortality rates
compared to traditional chest X-rays. 2. To assess the accuracy of LDCT in detecting lung cancer,
including early-stage disease, in high-risk individuals. 3. To evaluate the impact of LDCT on
patient outcomes, such as survival rates, quality of life, and healthcare utilization.  In
particular, the VLST aims to answer the following questions:  * Can LDCT detect lung cancer earlier
than traditional methods? * Is LDCT more effective in reducing lung c

In [13]:
result = llm(prompt.format(text=irrelevant_prompt))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is my
height?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I'm just an AI, I don't have the
ability to physically measure your height. However, if you'd like to share it with me, I can try to
help you with anything else!
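
The raw generations above repeat the whole prompt before the answer, which makes the outputs harder to read. One possible way to keep only the model's reply is to cut everything up to the last assistant header. The helper `extract_answer` below is a hypothetical addition for illustration, not something the notebook defines.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(generated_text: str) -> str:
    """Return only the text after the last assistant header, without end-of-turn tokens."""
    _, found, answer = generated_text.rpartition(ASSISTANT_HEADER)
    if not found:
        # No header present: assume the text is already just the answer.
        return generated_text.strip()
    return answer.replace("<|eot_id|>", "").strip()


raw_output = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is my height?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|> I'm just an AI, "
    "I don't have the ability to physically measure your height."
)
print(extract_answer(raw_output))
```

It could be applied to `result` before printing, for example `print(fill(extract_answer(result), width=100))`.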



## Step 5: Setting up a vector database from the documents

In [16]:
papers_links = ["https://arxiv.org/pdf/2405.04078", "https://arxiv.org/pdf/2405.06628",
          "https://arxiv.org/pdf/2405.04684",
          "https://arxiv.org/pdf/2404.11221", "https://arxiv.org/pdf/2405.00702"]

web_loader = UnstructuredURLLoader(
    urls=papers_links, mode="elements", strategy="fast",
)

web_docs = web_loader.load()
updated_web_docs = filter_complex_metadata(web_docs)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [17]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=512)
chunked_web_doc = text_splitter.split_documents(updated_web_docs)
len(chunked_web_doc)

2320
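
To make the roles of `chunk_size` and `chunk_overlap` concrete, the sketch below splits a string with a plain sliding window. This is a deliberate simplification and an assumption for illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its chunks will generally differ from these. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # The current window already reaches the end of the text.
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means that a sentence cut at a chunk boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.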

In [18]:
embeddings = HuggingFaceEmbeddings()


In [19]:
%%time

# Create the vectorized db with FAISS

db_web = FAISS.from_documents(chunked_web_doc, embeddings)

CPU times: user 7.1 s, sys: 41.9 ms, total: 7.14 s
Wall time: 7.49 s


In [20]:
prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

## Step 6: Create a LangChain-based RAG system

In [21]:
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # retriever=db.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8})
    # Similarity Search is the default way to retrieve documents relevant to a query, but we can use MMR by setting search_type = "mmr"
    # k defines how many documents are returned; defaults to 4.
    # score_threshold sets a minimum relevance for documents returned by the retriever when the "similarity_score_threshold" search type is used.
    # return_source_documents=True, # Optional parameter, returns the source documents used to answer the question
    retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 10, 'score_threshold': 0.1}),
    chain_type_kwargs={"prompt": prompt},
)
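
The retriever above combines two knobs: `k` caps how many chunks are returned, and `score_threshold` drops chunks whose relevance score is too low. The toy example below illustrates that idea with cosine similarity over hand-made two-dimensional vectors. It is only a sketch under assumptions: the function names and the toy index are hypothetical, and LangChain's FAISS store derives its relevance scores from its own distance metric, so real scores will not match these numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], doc_vecs: dict[str, list[float]],
             k: int, score_threshold: float) -> list[str]:
    """Return up to k document ids whose similarity to the query reaches score_threshold."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    kept = [(score, doc_id) for score, doc_id in scored if score >= score_threshold]
    kept.sort(reverse=True)  # Highest similarity first.
    return [doc_id for _, doc_id in kept[:k]]


# Toy "embeddings" chosen by hand for illustration.
toy_index = {"wiser": [1.0, 0.0], "vlst": [0.6, 0.8], "epidemic": [0.0, 1.0]}
query = [1.0, 0.0]

print(retrieve(query, toy_index, k=2, score_threshold=0.1))  # ['wiser', 'vlst']
print(retrieve(query, toy_index, k=2, score_threshold=0.8))  # ['wiser']
```

A low threshold such as 0.1 lets loosely related chunks through, so an off-topic question can still receive some context; a higher threshold is one way to make the retriever return nothing for such questions.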


## Test the Prompts Again

We now rerun the prompts proposed in Step 2 (with the expected answers given in the comment section) through the RAG chain. For the relevant prompts, the LLM should give the correct answer, retrieved from the vectorized arXiv papers. For the irrelevant prompt, the LLM should give no answer, since the knowledge base does not mention the requested information.

In [22]:
query = relevant_prompt_1
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  WISER: Weak supervISion and supErvised Representation learning to improve drug response
prediction in cancer  WISER: Weak supervISion and supErvised Representation learning to improve drug
response prediction in cancer  WISER: Weak supervISion and supErvised Representation learning to
improve drug response prediction in cancer  WISER: Weak supervISion and supErvised Representation
learning to improve drug response prediction in cancer  WISER: Weak supervISion and supErvised
Representation learning to improve drug response prediction in cancer  WISER: Weak supervISion and
supErvised Representation learning to improve drug response prediction in cancer  WISER: Weak
supervISion a

In [23]:
query = relevant_prompt_2
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  became dominant. However, we feel that the LAR hypothesis is much more than a useful
simplifying hypothesis. Indeed, containment and mitigation of infection incidence at manageably low
levels - what the LAR hypothesis describes in mathematical terms - is the primary form of
intervention that governments and public health institutions can use to mitigate the direct and
indirect (i.e., societal) impacts of serious (current and future) pandemic threats. Additionally, in
our opinion, the tools to keep most pandemics under LAR boundaries are available, namely by early
interventions and their boosting. Therefore, the LAR hypothesis should also represent - by itself -
a fundamental p

In [24]:
query = relevant_prompt_3
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  VLST: Virtual Lung Screening Trial for Lung Cancer Detection Using Virtual Imaging Trial
Importance: The efficacy of lung cancer screening can be significantly impacted by the imaging
modality used. This Virtual Lung Screening Trial (VLST) addresses the critical need for precision in
lung cancer diagnostics and the potential for reducing unnecessary radiation exposure in clinical
settings.  Figure 1. Flowchart summarizing the comprehensive workflow of the Virtual Lung Screening
Trial (VLST), from the creation of virtual patient models using clinical CT data, through lesion
simulation and insertion, to the virtual imaging using DukeSim, and concluding with the evaluation
by a v

In [25]:
query = irrelevant_prompt
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  0 "  260 (98.48)  27.97 ± 5.76 (18.37-41.65)  BMI  Weight (kg)  40  60 ±
13<|eot_id|><|start_header_id|>user<|end_header_id|>  What is my
height?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have enough information to
answer the question. The provided context only includes weight and BMI, but does not include height.
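
Step 6 of the assignment asks for an empty answer to the irrelevant prompt, whereas the chain returns a refusal sentence. One possible post-processing step is to map such refusals to an empty string. The sketch below assumes the refusal wording requested in the system prompt; `REFUSAL_MARKERS` and `answer_or_empty` are hypothetical names, and simple substring matching like this may miss refusals that the model phrases differently.

```python
# Phrases that signal a refusal; based on the wording requested in the system prompt.
REFUSAL_MARKERS = (
    "don't have enough information",
    "do not have enough information",
)


def answer_or_empty(answer: str) -> str:
    """Return an empty string if the answer is a refusal, else the stripped answer."""
    lowered = answer.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return ""
    return answer.strip()


refusal = ("I don't have enough information to answer the question. "
           "The provided context only includes weight and BMI, but does not include height.")
print(repr(answer_or_empty(refusal)))  # ''
```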
