# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


# Prompt

* What did Russia and Russia's president, Vladimir Putin, do as a response to comments from the French president, Emmanuel Macron, on western troops fighting in Ukraine and from the British foreign secretary, David Cameron, on using British-supplied weapons against Russia?
* What is the new developments of China’s DeepSeek?
* What was the misconception about the young people?
* What is special about the new Nvidia’s Optical Boogeyman?
* (Irrelevant) What is the overall analysis result of High Order Reasoning for Time Critical Recommendation in Evidence-based Medicine?

# Database

My model is NousResearch/Hermes-2-Pro-Llama-3-8B, which is based on the Llama 3 model. Llama 3 8B has a knowledge cutoff of March 2023, while the OpenHermes-2.5 dataset has a cutoff of 12 Nov 2023 (https://huggingface.co/datasets/teknium/OpenHermes-2.5/tree/main). The documents below were all published after both dates:
* https://www.theguardian.com/world/article/2024/may/06/russia-to-hold-battlefield-nuclear-drills-after-macron-and-cameron-comments (Mon 6 May 2024 17.59 CEST)
* https://www.semianalysis.com/p/openai-is-doomed-et-tu-microsoft?utm_source=post-email-title&publication_id=329241&post_id=144399864&utm_campaign=email-post-title&isFreemail=true&r=36rch9&triedRedirect=true&utm_medium=email (May 07, 2024)
* https://naomicfisher.substack.com/p/pointing-out-a-problem?utm_source=substack&publication_id=1062989&post_id=139279736&utm_medium=email&utm_content=share&utm_campaign=email-share&triggerShare=true&isFreemail=true&r=36rch9&triedRedirect=true (May 06, 2024)
* https://www.semianalysis.com/p/inference-race-to-the-bottom-make?utm_source=substack&utm_medium=email (Dec 18, 2023)
* https://www.semianalysis.com/p/nvidias-optical-boogeyman-nvl72-infiniband (Mar 25, 2024)
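
The requirement that every document be newer than the model's training data can be checked mechanically. The following is a minimal sketch, not part of the original notebook: the short document labels and the helper name `documents_not_after_cutoff` are illustrative assumptions, while the dates are the ones listed above.

```python
from datetime import date

# Cutoff of the fine-tuning data, as stated above (OpenHermes-2.5).
MODEL_CUTOFF = date(2023, 11, 12)

# Publication dates as listed above; the short labels are illustrative only.
DOCUMENTS = {
    "guardian-nuclear-drills": date(2024, 5, 6),
    "semianalysis-openai-is-doomed": date(2024, 5, 7),
    "naomicfisher-pointing-out-a-problem": date(2024, 5, 6),
    "semianalysis-inference-race": date(2023, 12, 18),
    "semianalysis-optical-boogeyman": date(2024, 3, 25),
}


def documents_not_after_cutoff(documents, cutoff):
    """Return labels of documents published on or before the cutoff date."""
    return sorted(
        label for label, published in documents.items() if published <= cutoff
    )


too_old = documents_not_after_cutoff(DOCUMENTS, MODEL_CUTOFF)
print("Documents to replace:", too_old)
```

An empty list means every document postdates the cutoff. The closest one is the Dec 18, 2023 article, roughly five weeks after the cutoff.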

In [1]:
# !pip install transformers>=4.32.0 optimum>=1.12.0
# !pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
# !pip install langchain
# !pip install chromadb
# !pip install sentence_transformers  # ==2.2.2
# !pip install unstructured
# !pip install pdf2image
# !pip install pdfminer.six
# !pip install unstructured-pytesseract
# !pip install unstructured-inference
# !pip install faiss-gpu
# !pip install pikepdf
# !pip install pypdf
# !pip install accelerate
# !pip install pillow_heif
# !pip install -i https://pypi.org/simple/ bitsandbytes

In [2]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import gc
import locale
from textwrap import fill

from huggingface_hub import login
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredPDFLoader, UnstructuredURLLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.vectorstores.utils import (
    filter_complex_metadata,  # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    pipeline,
)

locale.getpreferredencoding = lambda: "UTF-8"

# you need to define your private User Access Token from Huggingface
# to be able to access models with accepted licence
HUGGINGFACE_UAT = ""
login(HUGGINGFACE_UAT)



The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/lehoangchibach/.cache/huggingface/token
Login successful


In [3]:
model_name = "NousResearch/Hermes-2-Pro-Llama-3-8B"  # Data cutoff: 12 Nov 2023, https://huggingface.co/datasets/teknium/OpenHermes-2.5/tree/main

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512
gen_cfg.temperature = 1e-7  # near-zero temperature: for RAG we would like (nearly) deterministic answers
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11

pipe = pipeline(
    task="text-generation", model=model, tokenizer=tokenizer, generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.23s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
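
The generation config above keeps sampling enabled but sets the temperature to `1e-7`. The sketch below illustrates why this behaves almost like greedy decoding: dividing the logits by a tiny temperature pushes essentially all probability mass onto the highest-scoring token. The function name and the logits are made up for illustration and are not taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))   # probability spread over all tokens
print(softmax_with_temperature(logits, 1e-7))  # mass collapses onto the top token
```

In `transformers`, setting `do_sample=False` is the usual way to request greedy decoding directly. The near-zero temperature used here should give a very similar result.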


In [4]:
def custom_fill(text, w=100):
    return fill(text, width=w)


def print_dolphin_format_result(result):
    """Print the system, user, and assistant turns of a ChatML-formatted string."""
    # Drop everything before the system turn, then peel off one turn at a time.
    t = result.split("<|im_start|>system")[1]
    system, t = t.split("<|im_end|>\n<|im_start|>user")
    user, assistant = t.split("<|im_end|>\n<|im_start|>assistant")
    print(f"System: {custom_fill(system)}\n")
    print(f"User: {custom_fill(user)}\n")
    print(f"Assistant: {custom_fill(assistant)}\n")
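
The helper above relies on chained `split` calls, which raise an error if any of the three markers is missing from the text. As a hedged alternative, the sketch below parses the same ChatML layout with a regular expression and simply omits absent turns. `parse_chatml` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches one ChatML turn: "<|im_start|>role" followed by its content, up to
# the next "<|im_end|>", the next "<|im_start|>", or the end of the text
# (the final assistant turn is usually left open).
_TURN = re.compile(
    r"<\|im_start\|>(\w+)\n?(.*?)(?=<\|im_end\|>|<\|im_start\|>|\Z)",
    re.DOTALL,
)


def parse_chatml(text):
    """Split ChatML-formatted text into a {role: content} dict."""
    return {role: content.strip() for role, content in _TURN.findall(text)}


sample = (
    "\n<|im_start|>system\nBe brief.\n<|im_end|>\n"
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello!"
)
turns = parse_chatml(sample)
print(turns.get("assistant", ""))
```

Because the result is a dictionary, callers can use `turns.get("assistant", "")` and avoid an exception when the text contains no assistant turn.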

In [5]:
template = """
<|im_start|>system
If you can't find the relevant information in the context, just say you don't have enough information to answer the question and end the conversation. 
Don't try to make up an answer.
<|im_end|>
<|im_start|>user
{text}<|im_end|>
<|im_start|>assistant
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

In [6]:
text = "What did Russia and Russia's president, Vladimir Putin, do as a response to comments from the French president, Emmanuel Macron, on western troops fighting in Ukraine and from the British foreign secretary, David Cameron, on using British-supplied weapons against Russia?"
result = llm.invoke(prompt.format(text=text))
print_dolphin_format_result(result)


System:  If you can't find the relevant information in the context, just say you don't have enough
information to answer the question and end the conversation.  Don't try to make up an answer.

User:  What did Russia and Russia's president, Vladimir Putin, do as a response to comments from the
French president, Emmanuel Macron, on western troops fighting in Ukraine and from the British
foreign secretary, David Cameron, on using British-supplied weapons against Russia?

Assistant:  I don't have enough information to answer this specific question accurately. Please provide more
details or context about the time frame and any additional background information that would help me
better understand the situation and provide a proper response.



In [7]:
%%time
text = "What is the new developments of China’s DeepSeek?"
result = llm.invoke(prompt.format(text=text))
print_dolphin_format_result(result)

System:  If you can't find the relevant information in the context, just say you don't have enough
information to answer the question and end the conversation.  Don't try to make up an answer.

User:  What is the new developments of China’s DeepSeek?

Assistant:  I don't have enough information to provide updates on China's DeepSeek specifically. However, I can
tell you that DeepSeek was a project launched by researchers from Peking University and other
institutions in 2019 to develop AI-powered tools for detecting marine species using underwater
drones. The project aimed to identify and track marine life in the ocean, which could help with
conservation efforts.  As for any recent developments or news related to this project, I would
recommend searching online for more current information as my knowledge may be outdated.

CPU times: user 4.4 s, sys: 69.8 ms, total: 4.47 s
Wall time: 4.47 s


In [8]:
%%time
text = "What was the misconception about the young people?"
result = llm.invoke(prompt.format(text=text))
print_dolphin_format_result(result)

System:  If you can't find the relevant information in the context, just say you don't have enough
information to answer the question and end the conversation.  Don't try to make up an answer.

User:  What was the misconception about the young people?

Assistant:  I don't have enough information to answer that specific question. Please provide more details or
context so I can better understand what misconception you are referring to.

CPU times: user 1.41 s, sys: 95.8 ms, total: 1.5 s
Wall time: 1.51 s


In [9]:
%%time
text = "What is special about the new Nvidia’s Optical Boogeyman?"
result = llm.invoke(prompt.format(text=text))
print_dolphin_format_result(result)

System:  If you can't find the relevant information in the context, just say you don't have enough
information to answer the question and end the conversation.  Don't try to make up an answer.

User:  What is special about the new Nvidia’s Optical Boogeyman?

Assistant:  I am sorry but I do not have enough information to provide details on "Nvidia's Optical Boogeyman."
Could you please provide more context or background information so that I may better assist you with
your query?

CPU times: user 2.07 s, sys: 130 ms, total: 2.2 s
Wall time: 2.2 s


In [10]:
%%time
text = "What is the overall analysis result of High Order Reasoning for Time Critical Recommendation in Evidence-based Medicine?"
result = llm.invoke(prompt.format(text=text))
print_dolphin_format_result(result)

System:  If you can't find the relevant information in the context, just say you don't have enough
information to answer the question and end the conversation.  Don't try to make up an answer.

User:  What is the overall analysis result of High Order Reasoning for Time Critical Recommendation in
Evidence-based Medicine?

Assistant:  I am sorry but I do not have enough information to provide an overall analysis result of High Order
Reasoning for Time-Critical Recommendation in Evidence-Based Medicine. Could you please provide more
context or details about this topic so that I may better understand your request?

CPU times: user 2.33 s, sys: 127 ms, total: 2.46 s
Wall time: 2.46 s


In [11]:
%%time
web_loader = UnstructuredURLLoader(
    urls=[
        "https://www.theguardian.com/world/article/2024/may/06/russia-to-hold-battlefield-nuclear-drills-after-macron-and-cameron-comments",
        "https://www.semianalysis.com/p/openai-is-doomed-et-tu-microsoft?utm_source=post-email-title&publication_id=329241&post_id=144399864&utm_campaign=email-post-title&isFreemail=true&r=36rch9&triedRedirect=true&utm_medium=email",
        "https://naomicfisher.substack.com/p/pointing-out-a-problem?utm_source=substack&publication_id=1062989&post_id=139279736&utm_medium=email&utm_content=share&utm_campaign=email-share&triggerShare=true&isFreemail=true&r=36rch9&triedRedirect=true",
        "https://www.semianalysis.com/p/inference-race-to-the-bottom-make?utm_source=substack&utm_medium=email",
        "https://www.semianalysis.com/p/nvidias-optical-boogeyman-nvl72-infiniband",
    ],
    mode="elements",
    strategy="fast",
)
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=512)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
print("chunked_web_doc", len(chunked_web_doc))

# default model_name="sentence-transformers/all-mpnet-base-v2"
web_embeddings = HuggingFaceEmbeddings()
db_web = FAISS.from_documents(chunked_web_doc, web_embeddings)

prompt_template = """
<|im_start|>system
Use the following context to answer the question at the end. Do not use any other information. 
If you can't find the relevant information in the context, just say you don't have enough information to answer the question and end the conversation. 
Don't try to make up an answer. 
{context}<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""

context_prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_web.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 10, "score_threshold": 0.1},
    ),
    chain_type_kwargs={"prompt": context_prompt},
)

chunked_web_doc 359




CPU times: user 3.17 s, sys: 268 ms, total: 3.43 s
Wall time: 9.9 s
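
The retriever above is configured with `search_type="similarity_score_threshold"`, `k=10`, and `score_threshold=0.1`. The sketch below illustrates the general idea on toy data: score every chunk against the query, drop chunks below the threshold, and return at most `k` of the rest. It uses plain cosine similarity on made-up 3-dimensional vectors. How LangChain and FAISS actually compute and normalize relevance scores may differ, so treat this as an illustration of the concept only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, indexed_chunks, k=10, score_threshold=0.1):
    """Return up to k (score, text) pairs whose score meets the threshold."""
    scored = [
        (cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks
    ]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "embeddings"; real sentence embeddings have hundreds of dimensions.
chunks = [
    ("Russia announces nuclear drills", [0.9, 0.1, 0.0]),
    ("DeepSeek V2 training efficiency", [0.05, 0.95, 0.0]),
]
on_topic = retrieve([1.0, 0.0, 0.0], chunks)
off_topic = retrieve([0.0, 0.0, 1.0], chunks)
print([text for _, text in on_topic])
print(off_topic)
```

A query unrelated to every chunk yields an empty list, so the prompt reaches the LLM with an empty `{context}`. This is consistent with the output of the irrelevant prompt, where the system message contains no retrieved text.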


In [12]:
%%time
query = "What did Russia and Russia's president, Vladimir Putin, do as a response to comments from the French president, Emmanuel Macron, on western troops fighting in Ukraine and from the British foreign secretary, David Cameron, on using British-supplied weapons against Russia?"
result = Chain_web.invoke(query)
print_dolphin_format_result(result["result"])

System:  Use the following context to answer the question at the end. Do not use any other information.  If
you can't find the relevant information in the context, just say you don't have enough information
to answer the question and end the conversation.  Don't try to make up an answer.  Vladimir Putin
responds to recent statements from David Cameron and Emmanuel Macron over Ukraine war  The
announcement came days after Macron said he would “not rule out” the possibility of sending troops
to Ukraine and Cameron said it was up to Kyiv how it used British weapons, including against targets
inside of Russia.  The Russian foreign ministry issued a formal protest to Casey over Cameron’s
recent statements that Ukraine had the right to use British weapons to strike inside Russia.  “A
short time ago you and I witnessed another unprecedented stage in the escalation of tensions
initiated by the French president and the British foreign secretary,” said Dmitry Peskov, the
Kremlin spokesman. “That

In [13]:
%%time
query = "What is the new developments of China’s DeepSeek?"
result = Chain_web.invoke(query)
print_dolphin_format_result(result["result"])

System:  Use the following context to answer the question at the end. Do not use any other information.  If
you can't find the relevant information in the context, just say you don't have enough information
to answer the question and end the conversation.  Don't try to make up an answer.  These results
demonstrate Chinese companies are now competitive as well. Also the paper is probably the best one
this year in terms of information and details shared.  Even assuming servers are never perfectly
utilized, and batch sizes are lower than peak capability, there is plenty of room for DeepSeek to
make money while crushing everyone else’s inference economics. Mixtral, Claude 3 Sonnet, Llama 3,
and DBRX were already beating down OpenAI’s GPT-3.5 Turbo, but this is the nail in the coffin.  They
trained the model on 8.1 trillion tokens. DeepSeek V2 was able to achieve incredible training
efficiency with better model performance than other open models at 1/5th the compute of Meta’s Llama
3 70B. F

In [14]:
%%time
query = "What was the misconception about the young people?"
result = Chain_web.invoke(query)
print_dolphin_format_result(result["result"])



System:  Use the following context to answer the question at the end. Do not use any other information.  If
you can't find the relevant information in the context, just say you don't have enough information
to answer the question and end the conversation.  Don't try to make up an answer.  For our teenagers
are vulnerable. Bigger than children and so we expect more, but still very much in development. 
They are pointing out a problem. The question is, can we listen?  But instead of seeing this as them
pointing out a problem, the typical response is to say that it is the young people (and their
parents) who are the problem.  More discipline, stricter rules, higher expectations, a change in
attitude – that’s what’s needed. Change the young people so they stop showing us their distress,
which will enable us to ignore the reasons for their distress. Get them under control so they stop
complaining.  What happened was that he was called rebellious. The youth leaders told him that he
needed to

In [15]:
%%time
query = "What is special about the new Nvidia’s Optical Boogeyman?"
result = Chain_web.invoke(query)
print_dolphin_format_result(result["result"])

System:  Use the following context to answer the question at the end. Do not use any other information.  If
you can't find the relevant information in the context, just say you don't have enough information
to answer the question and end the conversation.  Don't try to make up an answer.  The NVL72 is not
the optical boogeyman people should worry about, there is another, much more real optical boogeyman.
Nvidia’s Optical Boogeyman – NVL72, Infiniband Scale Out, 800G & 1.6T Ramp  Nvidia’s Optical
Boogeyman – NVL72, Infiniband Scale Out, 800G & 1.6T Ramp  Nvidia’s Optical Boogeyman – NVL72,
Infiniband Scale Out, 800G & 1.6T Ramp  This optical boogeyman dramatically reduces the number of
transceivers by a significant volume, and it’s shipping in volume next year. Below we will explain
what, how, and by how much.  When Nvidia’s DGX GB200 NVL72 was announced in the keynote speech last
week – with the capability of linking 72 GPUs in the same rack with 900GB/s per GPU NVLink 5
connections – 

In [16]:
%%time
query = "What is the overall analysis result of High Order Reasoning for Time Critical Recommendation in Evidence-based Medicine?"
result = Chain_web.invoke(query)
print_dolphin_format_result(result["result"])



System:  Use the following context to answer the question at the end. Do not use any other information.  If
you can't find the relevant information in the context, just say you don't have enough information
to answer the question and end the conversation.  Don't try to make up an answer.

User:  What is the overall analysis result of High Order Reasoning for Time Critical Recommendation in
Evidence-based Medicine?

Assistant:  I am sorry but I do not have enough information to provide an overall analysis result of High Order
Reasoning for Time-Critical Recommendation in Evidence-Based Medicine. Could you please provide more
context or details about this topic so that I could better understand your request?

CPU times: user 2.38 s, sys: 140 ms, total: 2.52 s
Wall time: 2.52 s
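
Step 6 asks for an empty answer to the irrelevant prompt, whereas the chain returns a refusal sentence. One possible post-processing step, sketched below, maps refusals to an empty string. `empty_if_refusal` and the marker phrases are assumptions made for illustration (the phrases are taken from the assistant outputs in this notebook) and are not part of the original pipeline.

```python
# Phrases that signal a refusal, taken from the assistant outputs above.
REFUSAL_MARKERS = (
    "do not have enough information",
    "don't have enough information",
)


def empty_if_refusal(answer):
    """Return "" when the answer is a refusal, otherwise the stripped answer."""
    normalized = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if any(marker in normalized for marker in REFUSAL_MARKERS):
        return ""
    return answer.strip()


print(repr(empty_if_refusal(
    "I am sorry but I do not have enough information to provide an overall analysis."
)))
print(repr(empty_if_refusal(" Russia announced battlefield nuclear drills. ")))
```

Note that simple phrase matching also blanks answers that begin with a refusal and then continue with content, which may or may not be the desired behavior.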
