
# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents or,
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


1. I will use statistics papers published on arxiv.org on May 3rd.

In [6]:
# urls of the papers
urls = [
    "https://arxiv.org/pdf/2405.00827", # Overcoming model uncertainty – how equivalence tests can benefit from model averaging
    "https://arxiv.org/pdf/2405.00835", # Individual-level models of disease transmission incorporating piecewise spatial risk functions
    "https://arxiv.org/pdf/2405.00842", # Quickest Change Detection with Confusing Change
    "https://arxiv.org/pdf/2405.00859", # WATCH: A Workflow to Assess Treatment Effect Heterogeneity in Drug Development for Clinical Trial Sponsors
    "https://arxiv.org/pdf/2405.00884", # What’s So Hard about the Monty Hall Problem?
]

2. Three relevant prompts and one irrelevant prompt

In [5]:
# Relevant prompts
prompt1 = "What is the role of model averaging in overcoming model uncertainty in equivalence tests, as discussed in recent arXiv publications?"
prompt2 = "How do the proposed non-parametric spatial disease transmission models improve upon traditional parametric forms in estimating infection risk?"
prompt3 = "What new procedures are proposed in quickest change detection to differentiate between 'bad' and 'confusing' changes, and what guarantees do these procedures offer?"

# Irrelevant prompt
i_prompt = "Which team won Laliga this season?"

3. Load an LLM with at least 5B parameters.

In [6]:
# Quote the version specifiers so the shell does not treat ">" as an output redirection
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null
!pip install langchain > null
!pip install chromadb > null
!pip install sentence_transformers > null # ==2.2.2
!pip install unstructured > null
!pip install pdf2image > null
!pip install pdfminer.six > null
!pip install unstructured-pytesseract > null
!pip install unstructured-inference > null
!pip install faiss-gpu > null
!pip install pikepdf > null
!pip install pypdf > null
!pip install accelerate > null
!pip install pillow_heif > null
!pip install -i https://pypi.org/simple/ bitsandbytes > null

In [None]:
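import os
# Kill the current kernel process to force a runtime restart, so the freshly installed packages are loaded
os.kill(os.getpid(), 9)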

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"


# Read the Hugging Face token from the environment instead of hard-coding it in the notebook
import os
HUGGINGFACE_UAT = os.environ["HF_TOKEN"]
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_enable_fp32_cpu_offload=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset.

In [3]:
template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)
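
The template above follows the Llama 3 Instruct chat layout, in which each turn starts with a header marker and ends with `<|eot_id|>`. As a rough sketch of how such a string is assembled, the hypothetical helper `build_llama3_prompt` below (not part of this notebook or of any library) builds the same layout in plain Python, with an optional system turn:

```python
def build_llama3_prompt(user_text, system_text=None):
    """Assemble a Llama 3 Instruct style prompt string (illustrative helper)."""
    parts = ["<|begin_of_text|>"]
    if system_text is not None:
        # Optional system turn, placed before the user turn
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_text}<|eot_id|>"
        )
    # User turn
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_text}<|eot_id|>")
    # Open assistant header: the model continues generating from here
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(build_llama3_prompt("Which papers discuss model averaging?"))
```

In practice, the tokenizer's `apply_chat_template` method can usually produce this string from a list of messages, which avoids typing the markers by hand.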


In [7]:
# Test with first prompt
result = llm(prompt.format(text=prompt1))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the role of model averaging in
overcoming model uncertainty in equivalence tests, as discussed in recent arXiv
publications?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Model averaging is a technique
used to address model uncertainty in equivalence testing by combining the results from multiple
models or specifications. In recent arXiv publications, model averaging has been proposed as a way
to overcome model uncertainty in equivalence tests.  In traditional equivalence testing, a single
model is specified and tested for equality between two groups or treatments. However, this approach
assumes that the true underlying relationship between the variables is accurately captured by the
chosen model. In reality, there may be multiple plausible models that can explain the data equally
well, leading to model uncertainty.  Model averaging addresses this issue by combining the results
from multiple models or specifica

In [8]:
# Test with second prompt
result = llm(prompt.format(text=prompt2))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  How do the proposed non-parametric
spatial disease transmission models improve upon traditional parametric forms in estimating
infection risk?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Non-parametric spatial
disease transmission models have several advantages over traditional parametric forms when it comes
to estimating infection risk. Here are some ways they improve:  1. **Flexibility**: Non-parametric
models don't rely on specific assumptions about the underlying distribution of the data, such as
normality or Poisson distributions. This flexibility allows them to capture complex patterns and
relationships that may not be captured by traditional parametric models. 2. **Robustness to
outliers**: Parametric models can be sensitive to outliers, which can lead to biased estimates of
infection risk. Non-parametric models are more robust to outliers, as they don't assume a specific
distribution for the data. 3. **Abilit

In [9]:
# Test with third prompt
result = llm(prompt.format(text=prompt3))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What new procedures are proposed in
quickest change detection to differentiate between 'bad' and 'confusing' changes, and what
guarantees do these procedures offer?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Quickest Change Detection (QCD) is a statistical method used to detect changes in data streams. To
differentiate between "bad" and "confusing" changes, several new procedures have been proposed:  1.
**Change Point Detection with Confidence Intervals**: This approach uses confidence intervals to
determine the significance of detected changes. It provides a guarantee that the probability of
false positives is bounded by the chosen significance level. 2. **Online Change Detection with
Adaptive Windowing**: This procedure adapts the window size based on the data distribution, allowing
it to better handle confusing changes. It offers a guarantee that the detection delay is minimized
while maintaining a low false posi
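
The baseline answers above read fluently but stay generic. One rough way to check whether an answer reflects a specific paper is to look for terms that only the paper uses. The sketch below assumes a hand-picked keyword list (here the procedure names S-CuSum and J-CuSum, which appear in the context retrieved for the third prompt later in this notebook). `mentions_any` is a hypothetical helper, and a keyword match is only a weak signal, not a measure of correctness.

```python
def mentions_any(answer, keywords):
    """Return the keywords that occur in the answer (case-insensitive)."""
    lowered = answer.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


# First sentence of the baseline answer to the third prompt, copied from the output above
baseline_answer = (
    "Quickest Change Detection (QCD) is a statistical method used to detect "
    "changes in data streams."
)
# Assumed paper-specific terms, chosen by hand
paper_terms = ["S-CuSum", "J-CuSum"]

print(mentions_any(baseline_answer, paper_terms))  # prints []
```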

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

In [10]:
# Download the content from the web
web_loader = UnstructuredURLLoader(
    urls=urls, mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

In [12]:
# Split the data into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=512)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)

3531
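
The splitter settings above mean that each chunk holds at most 2048 characters and that consecutive chunks may share up to 512 characters. The sketch below illustrates the overlap idea with a plain fixed-width sliding window. It is a simplification, since `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, and `sliding_chunks` is a hypothetical helper rather than a LangChain function.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap lowers the chance that a sentence needed for an answer is cut in half at a chunk boundary, at the cost of storing more duplicated text.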

In [13]:
# Load the embedding model used to build the vector database
embeddings = HuggingFaceEmbeddings()



In [14]:
%%time
# Store the chunk embeddings in a FAISS index

# Create the vectorized db with FAISS

db_web = FAISS.from_documents(chunked_web_doc, embeddings)

# Create the vectorized db with Chroma
# from langchain.vectorstores import Chroma
# db_web = Chroma.from_documents(chunked_web_doc, embeddings)

CPU times: user 7.95 s, sys: 89.5 ms, total: 8.04 s
Wall time: 8.38 s


In [15]:
# Generate prompt template

prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just return an empty answer. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

In [16]:
# Set up retrievalQA
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 10, 'score_threshold': 0.1}),
    chain_type_kwargs={"prompt": prompt},
)
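
The retriever above returns at most `k` chunks and drops any chunk whose relevance score is below `score_threshold`. The toy sketch below shows that filtering logic with cosine similarity on made-up two-dimensional vectors. It is an illustration under assumptions: `cosine` and `retrieve` are hypothetical helpers, and the actual score computation in LangChain depends on the vector store and its distance metric, so the numbers here are not comparable to the 0.1 used above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, docs, k, score_threshold):
    """Return up to k (score, text) pairs whose score is at least score_threshold."""
    scored = [(cosine(query_vec, vec), text) for text, vec in docs]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Made-up embeddings, for illustration only
docs = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]

print([text for _, text in retrieve([1.0, 0.0], docs, k=2, score_threshold=0.5)])
# prints ['chunk a', 'chunk c']
print(retrieve([-1.0, 0.0], docs, k=10, score_threshold=0.1))
# prints []
```

When no chunk clears the threshold, the context passed to the model is empty, which is the behaviour the irrelevant prompt is meant to exercise. A low threshold lets weakly related chunks through.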


6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)

In [17]:
%%time
# Check for hallucinations
result = Chain_web.invoke(prompt1)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.  3.2 Model-based
equivalence tests incorporating model averaging  Overcoming model uncertainty – how equivalence
tests can benefit  3 Model-based equivalence tests under model uncertainty  In this paper, we
introduced a new approach for model-based equivalence testing which can also be applied in the
presence of model uncertainty – a problem which is usually faced in practical applications. Our
approach is based on a flexible model averaging method which relies on information criteria and a
testing procedure which makes use of the duality of tests and confidence intervals rather than
simulating the distribution under the null hypothesis, providing a numerically stable procedure.
Moreover, our approach leads to addi

In [18]:
%%time
# Check for hallucinations

result = Chain_web.invoke(prompt2)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.  Parametric forms for
spatial risk functions, or kernels, are often used, but rely on strong assumptions about underlying
transmission mechanisms. Here, we propose a class of non- parametric spatial disease transmission
model, fitted within a Bayesian Markov chain Monte Carlo (MCMC) framework, allowing for more
flexible assumptions when estimating the effect on spatial distance and infection risk.  In real
life, the individual or group-level factors likely have an impact on infection transmission or
susceptibility. We could extend our model by incorporating covariates to consider more complex
dynamics. For example, in our foot-and-mouth disease models we ignored the number of cattle and
sheep on the farm, since we 

In [19]:
%%time
# Check for hallucinations

result = Chain_web.invoke(prompt3)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.  Quickest Change Detection
with Confusing Change Y.-Z. Janice Chen∗, Jinhang Zuo∗, Venugopal V. Veeravalli‡, Don Towsley∗  [2]
V. V. Veeravalli and T. Banerjee, “Quickest change detection,” in Elsevier, 2014, vol. 3,  While
S-CuSum effectively detects the bad change and ig- nores the confusing change, its detection delay
leaves room for improvement. Toward this, we propose Joint CuSum (J-CuSum), which incorporates two
tests w.r.t. W [t] and Λ[t] respectively in a more involved way. Specifically, J-CuSum utilizes
CuSumW (as defined in (17)) and let  Our contributions are as follows: In Section II, we formulate a
novel quickest change detection problem where the change can be either a bad change or a confusing
change

In [20]:
%%time
# Check for hallucinations

result = Chain_web.invoke(i_prompt)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.  nl 2  1
nl<|eot_id|><|start_header_id|>user<|end_header_id|>  Which team won Laliga this
season?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I'm sorry, but I couldn't find that
information in the given context. The provided text does not mention La Liga or its winner for a
specific season.
CPU times: user 8.96 s, sys: 32.7 ms, total: 8.99 s
Wall time: 9.38 s
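
Each result above echoes the full prompt, including the retrieved context, before the model's reply. A minimal sketch of a post-processing step, assuming the generated text keeps the Llama 3 header markers exactly as printed above, is to keep only what follows the last assistant header. `extract_answer` is a hypothetical helper, not a LangChain or Transformers function.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(generated):
    """Return the text after the last assistant header, without end-of-turn markers."""
    _, found, tail = generated.rpartition(ASSISTANT_HEADER)
    # If the header is missing, fall back to the whole string
    answer = tail if found else generated
    return answer.replace("<|eot_id|>", "").strip()


# Tail of the output for the irrelevant prompt, copied from the cell above
raw = (
    "<|eot_id|><|start_header_id|>user<|end_header_id|>  Which team won Laliga this "
    "season?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I'm sorry, but I "
    "couldn't find that information in the given context."
)

print(extract_answer(raw))
# prints: I'm sorry, but I couldn't find that information in the given context.
```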
