<a href="https://colab.research.google.com/github/oanders6/AIT-Budapest-Deep-Learning/blob/main/Anderson_Owen_AIT_RAG_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Assessment
Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * **latest scientific papers from arxiv.org,** ***selected***
   * fiction books released,
   * legal documents or,
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points) ***all were released May 1st, 2024***

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points) ***see prompts below/in testing sections***

3. Load an LLM with at least 5B parameters. (10 points) ***LLaMa 3 8B was loaded***

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points) ***see Testing section–LLM answers were either incorrect, or it could not come up with an answer***

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points) ***see "Using LangChain RetrievalQA and establishing RAG template" section***

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points) ***see Testing section at bottom of notebook***


# Prompts:

**Relevant**:
1.   Who authored the paper titled: Quantum algorithms for matrix geometric means that was released in 2024?
2.   What is the name of the university Runyi Yang worked at when they published the paper titled: SUNDAE: Spectrally Pruned Gaussian Fields with Neural Compensation? Do not provide any information other than the name. If you do not recognize the paper, then do not provide a name.
3.   Could you provide me with the table of contents for Charlotte Chan's paper 'The Scalar Product Formula for Parahoric Deligne–Lusztig Induction'?

**Irrelevant**:


1.   According to her paper, what is Charlotte Chan's favorite color?





## Installing dependencies



In [31]:
# Version specifiers are quoted so the shell does not treat ">=" as an output redirect.
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null


# Restarting kernel

In [None]:
import os
os.kill(os.getpid(), 9)

# Importing the model (I chose LLaMa 3 8B)
Next, we import the necessary Python libraries.

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

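# Define your private Hugging Face User Access Token
# to access models that require an accepted license.
# The token is read from the environment rather than hardcoded in the notebook;
# this assumes it is stored in an environment variable named HF_TOKEN.
import os
HUGGINGFACE_UAT = os.environ["HF_TOKEN"]
login(HUGGINGFACE_UAT)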

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Next, we download the 8B instruct model from Hugging Face and load it with 8-bit quantization.

In [2]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
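
The generation settings above use a very small temperature to make the answers close to deterministic. The following is a minimal sketch, in plain Python and with made-up logits, of why that works: dividing the scores by a tiny temperature pushes almost all probability onto the highest-scoring token. The function name `softmax_with_temperature` and the example values are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate tokens (illustration only).
logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.1, 1e-7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

In the `transformers` library, setting `do_sample=False` selects greedy decoding, which reaches the same goal without sampling at all.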


# Templates

In [3]:
template_llama3 = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

template=template_llama3

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)


# Testing (LLM)

In [4]:
text = "What is the name of the university Runyi Yang worked at when they published the paper titled: SUNDAE: Spectrally Pruned Gaussian Fields with Neural Compensation? If you do not recognize the paper, then do not provide a name."
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the name of the university Runyi
Yang worked at when they published the paper titled: SUNDAE: Spectrally Pruned Gaussian Fields with
Neural Compensation? If you do not recognize the paper, then do not provide a
name.<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I couldn't find any
information on a person named Runyi Yang publishing a paper called "SUNDAE: Spectrally Pruned
Gaussian Fields with Neural Compensation". It's possible that the paper doesn't exist or is not
well-known. Can you provide more context or details about the paper or author?


In [5]:
text = "Who authored the paper titled: Quantum algorithms for matrix geometric means that was released in 2024?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Who authored the paper titled: Quantum
algorithms for matrix geometric means that was released in
2024?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I couldn't find any
information on a paper titled "Quantum algorithms for matrix geometric means" published in 2024.
It's possible that the paper has not been published yet or is not widely known.  However, there are
several papers on quantum algorithms for computing matrix geometric means that have been published
in recent years. For example:  * "Quantum algorithms for matrix geometric means and their
applications" by Wang et al., published in 2020 in the Journal of Physics A. * "Fast quantum
algorithms for matrix geometric means" by Li et al., published in 2019 in the Journal of
Mathematical Physics. * "Quantum algorithms for matrix exponentials and logarithms" by Childs et
al., published in 2017 in the Journal of the ACM.  If you could provide more con

In [6]:
text = "Could you provide me with the table of contents for Charlotte Chan's paper 'The Scalar Product Formula for Parahoric Deligne–Lusztig Induction'?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Could you provide me with the table of
contents for Charlotte Chan's paper 'The Scalar Product Formula for Parahoric Deligne–Lusztig
Induction'?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I'm happy to help! However, I
have to inform you that I couldn't find any information on a paper called "The Scalar Product
Formula for Parahoric Deligne–Lusztig Induction" by Charlotte Chan.  It's possible that the paper
doesn't exist or is not widely known. Could you please provide more context or details about the
paper, such as where you heard about it or what journal it was published in? This might help me
assist you better.  Alternatively, if you're looking for information on parahoric Deligne-Lusztig
induction or scalar product formulas in representation theory, I'd be happy to try and provide some
general resources or references. Just let me know how I can help!
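
The printed results above contain the prompt as well as the answer, because the pipeline returns the full text. A minimal sketch of a post-processing step that keeps only the assistant's reply is shown below. The helper name `extract_answer` and the sample string are hypothetical; the only thing taken from the notebook is the Llama 3 assistant header used in the template.

```python
# Header that precedes the assistant's reply in the Llama 3 template used above.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"

def extract_answer(full_text):
    """Return the text after the last assistant header, or the whole text if no header is found."""
    _, found, tail = full_text.rpartition(ASSISTANT_HEADER)
    return tail.strip() if found else full_text.strip()

# Made-up example in the same shape as the outputs above.
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Example question?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|> Example answer."
)
print(extract_answer(sample))
```

Applied to a result string before printing, this would show only the model's reply instead of the echoed prompt.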


# Loading embeddings

In [7]:
embeddings = HuggingFaceEmbeddings() # default model_name="sentence-transformers/all-mpnet-base-v2"




# Using LangChain RetrievalQA and establishing RAG template



In [8]:
%%time

prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt_template=prompt_template_llama3


CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 9.78 µs
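
The template above has two placeholders, `{context}` and `{question}`. A "stuff"-style chain fills them by concatenating the retrieved chunks into one context string and substituting the user's query. The sketch below imitates that step in plain Python; the chunk texts, the shortened template, and the function name `stuff_prompt` are assumptions for illustration and are not LangChain's implementation.

```python
# Hypothetical stand-ins for retrieved chunks; in the notebook these come from the retriever.
retrieved_chunks = [
    "Chunk one: title and author block of a paper.",
    "Chunk two: a paragraph from the introduction.",
]

# A shortened template with the same two placeholders as the RAG template.
rag_template = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
)

def stuff_prompt(template, chunks, question):
    """Join all chunks into a single context string and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

filled = stuff_prompt(rag_template, retrieved_chunks, "Who wrote the paper?")
print(filled)
```

When no chunks are retrieved, the context is simply empty, and the system instruction tells the model to say it lacks the information.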


# Loading PDFs
----
As per the assignment details, I loaded 8 academic papers from arxiv.org. These papers were all published on May 1st, 2024, and thus were not included in LLaMa 3's training data (as LLaMa 3 was published on April 18th, 2024).



In [9]:
# All of these papers were released in May 2024
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00676"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00675"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00674"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00673"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00672"
!wget -O document6.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00671"
!wget -O document7.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00670"
!wget -O document8.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00669"

--2024-05-03 16:31:52--  https://arxiv.org/pdf/2405.00676
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.131.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13547625 (13M) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-03 16:31:52 (121 MB/s) - ‘document1.pdf’ saved [13547625/13547625]

--2024-05-03 16:31:52--  https://arxiv.org/pdf/2405.00675
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 754757 (737K) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-03 16:31:52 (28.2 MB/s) - ‘document2.pdf’ saved [754757/754757]

--2024-05-03 16:31:52--  https://arxiv.org/pdf/2405.00674
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.4

In [10]:
from langchain.document_loaders import UnstructuredPDFLoader
import os
# Build the paths for document1.pdf through document8.pdf and create one loader per file
pdf_paths = [f"/content/document{i}.pdf" for i in range(1, 9)]
loaders = [UnstructuredPDFLoader(fn) for fn in pdf_paths]


Next, we load each PDF and split its text into smaller, overlapping chunks.

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...
Loading raw document.../content/document6.pdf
Splitting text...
Loading raw document.../content/document7.pdf
Splitting text...
Loading raw document.../content/document8.pdf
Splitting text...


897
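
The splitter above uses `chunk_size=1024` and `chunk_overlap=256`, so consecutive chunks share some text and a sentence cut at a boundary still appears whole in a neighboring chunk. The following is a simplified sketch of overlapping chunking with a fixed character window. It is not the algorithm of `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; the function name `chunk_text` and the sample text are assumptions.

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=256):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

# Placeholder text of 3000 characters (illustration only).
sample = "".join(chr(97 + i % 26) for i in range(3000))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

With these settings the window advances 768 characters at a time, so each chunk repeats the last 256 characters of the previous one.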

Next, we embed each chunk and build a FAISS vector index from the resulting vectors.

In [12]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 19.6 s, sys: 48 ms, total: 19.6 s
Wall time: 20 s


# Testing (LLM + RAG)

In [13]:
%%time
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
query = "What is the name of the university Runyi Yang worked at when they published the paper titled: SUNDAE: Spectrally Pruned Gaussian Fields with Neural Compensation? Do not provide any information other than the name."
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  4 2 0 2  y a M 1  ]  V C . s c [  1 v 6 7 6 0 0 . 5 0 4 2 : v i X r a  SUNDAE:
Spectrally Pruned Gaussian Fields with Neural Compensation  Runyi Yang AIR, Tsinghua University
Imperial College London runyi.yang23@imperial.ac.uk  Zhenxin Zhu AIR, Tsinghua University Beihang
University  Zhou Jiang AIR, Tsinghua University Beijing Institute of Technology  Baijun Ye AIR,
Tsinghua University Beijing Institute of Technology  Xiaoxue Chen AIR, Tsinghua University  Yifei
Zhang AIR, Tsinghua University UCAS  Yuantao Chen AIR, Tsinghua University CUHK (SZ)  Jian Zhao†
EVOL Lab, Institute of AI (TeleAI), China Telecom  Hao Zhao† AIR, Tsinghua University
zhaohao@air.tsinghua.edu.cn  PSNR: 

In [14]:
%%time

query = "Who authored the paper titled: Quantum algorithms for matrix geometric means?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  4 2 0 2  y a M 1  ] h p - t n a u q [  1 v 3 7 6 0 0 . 5 0 4 2 : v i X r a  Quantum
algorithms for matrix geometric means  Nana Liu,1, 2, 3, ∗ Qisheng Wang,4, † Mark M. Wilde,5, ‡ and
Zhicheng Zhang6, § 1Institute of Natural Sciences, School of Mathematical Sciences, MOE-LSC,
Shanghai Jiao Tong University, Shanghai, 200240, P. R. China 2Shanghai Artificial Intelligence
Laboratory, Shanghai, China 3University of Michigan-Shanghai Jiao Tong University Joint Institute,
Shanghai 200240, China 4Graduate School of Mathematics, Nagoya University, Nagoya 464-8602, Japan
5School of Electrical and Computer Engineering, Cornell University, Ithaca, New York 14850, USA
6Centre for Quantum 

In [15]:
%%time

query = "Could you provide me with the table of contents for Charlotte Chan's paper 'The Scalar Product Formula for Parahoric Deligne–Lusztig Induction'?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  LrG+ r Tr,Br  in analogy with Deﬁnition 3.2.  20  Theorem 7.2. Let (θ, Tr, Br) be split-
generic. For any triple (θ′, T′  hR  LrG+ r Tr ,Br  (θ), R  LrG+ r T′ r ,B′ r  (θ′)i(Lr G0+:r+)σ =
r, B′ hθ, ad(w)∗θ′iTσ r .  r),  Xw∈WLr G+  r  (Tr ,T′  r)σ  Proof. The properties of parahoric
Deligne–Lusztig induction presented in Section 3 have and the proofs go through with only notational
changes. The direct analogues for R same is the case for Section 4 and especially Theorem 4.5, the
generic Mackey formula. Here, note that the generalized Bruhat decomposition in Section 4.2 should
be intersected with LrG+ r ⊂ Gr. The crux then is to see that the ﬁber calculations in Section 5.
But t

# Irrelevant prompt --> test for hallucinations

In [16]:
%%time

query = "Accoridng to her paper, what is Charlotte Chan's favorite color?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  Accoridng to her paper, what is
Charlotte Chan's favorite color?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have
enough information to answer the question. The provided context does not mention Charlotte Chan or
her favorite color.
CPU times: user 6.11 s, sys: 24.9 ms, total: 6.13 s
Wall time: 6.21 s
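
In the irrelevant-prompt run above, the context section of the prompt is empty, which suggests that no chunk passed the retriever's score threshold. The sketch below shows the general idea of top-k retrieval with a threshold in plain Python. It uses cosine similarity on tiny made-up vectors as a stand-in; the relevance score that the vector store actually computes may be defined differently, and the names `cosine_similarity` and `retrieve` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k=4, score_threshold=0.2):
    """Return up to k (index, score) pairs whose score meets the threshold, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    kept = [(i, s) for i, s in scored if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Made-up 2-D "embeddings" for three chunks (illustration only).
docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

print(retrieve([1.0, 0.1], docs))    # a query close to the first two chunks
print(retrieve([-1.0, -1.0], docs))  # a query unrelated to every chunk
```

A query that resembles none of the chunks returns an empty list, which leaves the context blank and lets the system instruction produce the "not enough information" answer.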
