<pre>
Modified version of: https://github.com/muntasirhsn/Retrieval-Augmented-Generation-with-Llama-2/blob/main/RAG_with_Lama_2_and_LangChain_2.ipynb

Modified and updated by:  Bálint Gyires-Tóth
</pre>

# RAG + Llama 3 8B Instruct + LangChain
Retrieval-Augmented Generation (RAG) is a method that combines the capabilities of large language models with external or proprietary data sources. It involves extracting relevant information from a large corpus and then generating context-appropriate responses to queries.

In this notebook, we combine an instruction-tuned LLM (Gemma 2B or Llama 3 8B) with RAG.

## Installing dependencies



In [None]:
# Version specifiers are quoted so the shell does not treat '>=' as a redirection.
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imageio 2.31.6 requires pillow<10.1.0,>=8.3.2, but you have pillow 10.3.0 which is incompatible.

Important
------
Restart the kernel after installing the packages.

In [None]:
import os
os.kill(os.getpid(), 9)

## Imports
Next, we import the necessary Python libraries.

In [None]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

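# Force UTF-8 as the preferred encoding (works around locale errors in some notebook environments)
locale.getpreferredencoding = lambda: "UTF-8"

# Set your private Hugging Face User Access Token here.
# It is required for gated models whose license you have accepted.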
HUGGINGFACE_UAT=""
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Next, we download a model from Hugging Face. Two variants are listed: a smaller one (Gemma 2B) and a larger one (Llama 3 8B). Start with the smaller one, then experiment with the larger one.

In [None]:
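# Set model_name to one of the two options. As written, the second assignment
# overrides the first, so the 8B model is the one that gets loaded.
model_name = "google/gemma-2b-it" # 2B language model from Google
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI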

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
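
The very small `temperature` set in the generation config above is meant to approximate greedy decoding. As a minimal sketch in plain Python, using made-up logits that are purely illustrative, the temperature-scaled softmax below shows how the sampling distribution collapses onto the highest-scoring token as the temperature approaches zero. The function name `softmax_with_temperature` is hypothetical and not part of any library used in this notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for `logits` scaled by `temperature`."""
    # Subtract the maximum first so that exp() never overflows,
    # even when the temperature is tiny.
    top = max(logits)
    weights = [math.exp((x - top) / temperature) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical logits for three candidate tokens (illustrative values only).
example_logits = [2.0, 1.0, 0.5]

probs_default = softmax_with_temperature(example_logits, 1.0)
probs_cold = softmax_with_temperature(example_logits, 1e-7)

print(probs_default)  # probability mass is spread over all three tokens
print(probs_cold)     # practically all mass sits on the first token
```

In `transformers`, setting `do_sample=False` requests greedy decoding directly, which is the more common way to obtain reproducible output.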


Testing the model with basic prompts.

In [None]:
template_gemma = """
<bos><start_of_turn>user
{text}<end_of_turn>
<start_of_turn>model
"""

template_llama3 = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

if "gemma" in model_name:
  template=template_gemma
else:
  template=template_llama3

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)


In [None]:
text = "What is Murray Lenister's book The strange people about?"
result = llm.invoke(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is Murray Lenister's book The
strange people about?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I
couldn't find any information on a book called "The Strange People" by Murray Lenister. It's
possible that the book doesn't exist or is not well-known.  However, there are several authors with
similar names who have written books on various topics. Could you be referring to a different author
or book title?  If you could provide more context or details about the book, such as the genre,
publication date, or a brief summary, I may be able to help you better.


## RAG on the web
In this section, we download content from the internet, vectorise it and store the vectors, then search these vectors and generate the answer using the associated text.

In [None]:
web_loader = UnstructuredURLLoader(
    urls=["https://www.gutenberg.org/cache/epub/73515/pg73515-images.html"], mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Next, due to the limited receptive field (context window) of the LLMs, we split the text into smaller parts with some overlap.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=512)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)

837
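
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to break at separators such as paragraph and line breaks, so real chunks need not have exactly this size. The helper `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.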

We then build the vector database using a small sentence-embedding model, which maps each chunk to a vector.

In [None]:
embeddings = HuggingFaceEmbeddings() # default model_name="sentence-transformers/all-mpnet-base-v2"



The texts and their vectors are stored in a FAISS index (Chroma is a similar alternative).

In [None]:
%%time

# Create the vectorized db with FAISS

db_web = FAISS.from_documents(chunked_web_doc, embeddings)

# Create the vectorized db with Chroma
# from langchain.vectorstores import Chroma
# db_web = Chroma.from_documents(chunked_web_doc, embeddings)

CPU times: user 3.22 s, sys: 23.1 ms, total: 3.24 s
Wall time: 3.62 s
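
Conceptually, a vector store keeps (text, vector) pairs and returns the texts whose vectors are closest to the query vector. The sketch below is a brute-force, hypothetical stand-in (`ToyVectorIndex`) that uses cosine similarity for readability. FAISS relies on optimized index structures and its own distance metric, so this illustrates the idea rather than the library's implementation. The three-dimensional vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ToyVectorIndex:
    """Hypothetical stand-in for a vector store: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self.entries = []

    def add(self, text, vector):
        self.entries.append((text, vector))

    def search(self, query_vector, k):
        """Return the k stored texts most similar to the query, best first."""
        scored = [(cosine_similarity(query_vector, vec), text) for text, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(text, score) for score, text in scored[:k]]

# Made-up 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
index = ToyVectorIndex()
index.add("chunk about a mob", [1.0, 0.0, 0.0])
index.add("chunk about the Strangers", [0.0, 1.0, 0.0])
index.add("chunk about both", [1.0, 1.0, 0.0])

hits = index.search([0.0, 1.0, 0.1], k=2)
print(hits)
```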


We then use LangChain's `RetrievalQA` chain. It needs a retriever, an LLM and a `chain_type`. When the chain is called, the retriever fetches the content most similar to the query from the vector store. With `chain_type="stuff"`, all retrieved documents are concatenated into a single context and passed to the language model in one call. The LLM then generates the response based on the retrieved documents. [Read more in the LangChain retriever documentation](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa).
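
As a rough illustration of the "stuff" strategy, the sketch below joins the retrieved texts into one context string and fills a prompt template with it. The helper `build_stuff_prompt`, the simplified template and the blank-line separator are assumptions made for this example and do not reproduce LangChain's internals.

```python
def build_stuff_prompt(template, documents, question):
    """Concatenate all retrieved texts into one context and fill the template."""
    context = "\n\n".join(documents)
    return template.format(context=context, question=question)

# Simplified, hypothetical template; the notebook's real templates add model-specific chat tokens.
toy_template = "Use the following context to answer the question.\n\n{context}\n\nQuestion: {question}"

retrieved = ["Title: The strange people", "Author: Murray Leinster"]
stuffed = build_stuff_prompt(toy_template, retrieved, "Who wrote the book?")
print(stuffed)
```

Because everything is placed into a single prompt, the number and size of retrieved chunks must fit within the model's context window.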

In [None]:
%%time


prompt_template_gemma = """
<bos><start_of_turn>user
Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}

Question: {question}<end_of_turn>

<start_of_turn>model
"""

prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

if "gemma" in model_name:
  prompt_template=prompt_template_gemma
else:
  prompt_template=prompt_template_llama3


CPU times: user 4 µs, sys: 2 µs, total: 6 µs
Wall time: 9.54 µs


In [None]:
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # Similarity search is the default retrieval method; set search_type="mmr" to use MMR instead.
    # With search_type="similarity_score_threshold", score_threshold sets the minimum relevance of returned documents.
    # k is the maximum number of documents returned (default: 4).
    # Example with a stricter threshold:
    # retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8}),
    # return_source_documents=True,  # optional: also return the source documents used for the answer
    retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 10, 'score_threshold': 0.1}),
    chain_type_kwargs={"prompt": prompt},
)
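
The effect of `k` and `score_threshold` can be sketched in plain Python as follows. The helper `filter_hits` and the scores are hypothetical, and relevance is assumed to lie in [0, 1] with higher meaning more similar. The sketch shows only the selection rule, not LangChain's retriever code.

```python
def filter_hits(scored_hits, k, score_threshold):
    """Keep at most k hits whose relevance score reaches score_threshold, best first."""
    ranked = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)
    kept = [hit for hit in ranked if hit[1] >= score_threshold]
    return kept[:k]

# Hypothetical (text, relevance) pairs.
hits = [("chunk A", 0.62), ("chunk B", 0.08), ("chunk C", 0.35), ("chunk D", 0.11)]

print(filter_hits(hits, k=10, score_threshold=0.1))
print(filter_hits(hits, k=2, score_threshold=0.1))
print(filter_hits(hits, k=10, score_threshold=0.9))  # nothing is relevant enough
```

When no chunk reaches the threshold, the list is empty and the chain receives an empty context.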


In [None]:
query = "What is Murray Lenister's book The strange people about?"
result = Chain_web.invoke(query)
result

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'query': "What is Murray Lenister's book The strange people about?",
 'result': '\n<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nUse the following context to answer the question at the end. Do not use any other information. If you can\'t find the relevant information in the context, just say you don\'t have enough information to answer the question. Don\'t try to make up an answer.\n\nThe Project Gutenberg eBook of The strange people\n\n*** END OF THE PROJECT GUTENBERG EBOOK THE STRANGE PEOPLE ***\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE STRANGE PEOPLE ***\n\nTitle: The strange people\n\nAuthor: Murray Leinster\n\nTHE STRANGE PEOPLE\n\nThe mob had appeared from Bendale. On horse-back, in motor-cars and in wagons\r\ndrawn by teams, what seemed to be the whole population had come raging out to\r\nCoulters. The farmers of the valley had put their women-folk together and come\r\narmed with weapons, from shotguns to pitchforks. And they had surged into the\r\nhills 

In [None]:
print(fill(result['result'].strip(), width=100))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  The Project Gutenberg eBook of The strange people  *** END OF THE PROJECT GUTENBERG
EBOOK THE STRANGE PEOPLE ***  *** START OF THE PROJECT GUTENBERG EBOOK THE STRANGE PEOPLE ***
Title: The strange people  Author: Murray Leinster  THE STRANGE PEOPLE  The mob had appeared from
Bendale. On horse-back, in motor-cars and in wagons  drawn by teams, what seemed to be the whole
population had come raging out to  Coulters. The farmers of the valley had put their women-folk
together and come  armed with weapons, from shotguns to pitchforks. And they had surged into the
hills in quest of the Strange People. All had forgotten that the only thing  genuinely proved
against the Strangers was
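
The printed result echoes the whole prompt, including the retrieved context, before the model's answer. A minimal sketch of extracting only the answer, assuming the Llama 3 chat format used in this notebook: split on the last assistant header. The helper `extract_answer` and the sample string are hypothetical.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"

def extract_answer(full_text, marker=ASSISTANT_HEADER):
    """Return the text after the last occurrence of `marker`, or the whole text if absent."""
    _, found, answer = full_text.rpartition(marker)
    return answer.strip() if found else full_text.strip()

# Shortened, made-up pipeline output in the Llama 3 chat format.
raw = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is 2 + 2?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n4"
)
print(extract_answer(raw))
```

The `transformers` text-generation pipeline also accepts `return_full_text=False`, which avoids the echo at the source.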

In [None]:
%%time

query = "Who are the strange people in the book The strange people?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  THE STRANGE PEOPLE  Title: The strange people  The Project Gutenberg eBook of The
strange people  *** END OF THE PROJECT GUTENBERG EBOOK THE STRANGE PEOPLE ***  *** START OF THE
PROJECT GUTENBERG EBOOK THE STRANGE PEOPLE ***  They were two hundred people of unknown origin who
spoke English far purer  than the New Hampshireites around them and avoided contact with their
neighbors with a passionate sincerity. They could not be classified even by  the expert on races of
men who had written the article. They were not  Americans or Anglo-Saxons. They were not any known
people. But whatever they  were, they were splendid specimens, and they were hated by their
neighbors,  and they k

It is also worth checking for hallucination: no answer from the LLM is better than a plausible but wrong one. We therefore ask a question that the indexed text cannot answer.

In [None]:
%%time

query = "What is artificial intelligence?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  What is artificial
intelligence?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Artificial intelligence (AI)
refers to the development of computer systems that are capable of performing tasks that typically
require human intelligence, such as learning, problem-solving, and decision-making. AI systems can
be trained on large amounts of data to recognize patterns, make predictions, and take actions based
on those predictions. They can also improve their performance over time through machine learning
algorithms, which enable them to adapt to new situations and learn from experience.
CPU times: user 20.3 s, sys: 0 ns, tot

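If the system works well, the model should decline to answer. In the run above, the retrieved context was empty, yet the model answered from its own knowledge, so the instruction was not followed.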
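
In the output above, the context block of the prompt is empty, which suggests that no chunk passed the score threshold. One possible mitigation, sketched here with hypothetical stand-ins for the retriever and the LLM call, is to skip generation entirely when retrieval returns nothing. This is an illustrative design and not part of the notebook's chain.

```python
FALLBACK = "I don't have enough information to answer the question."

def answer_with_guard(question, retrieve, generate):
    """Call `generate` only when `retrieve` returns at least one document."""
    documents = retrieve(question)
    if not documents:
        return FALLBACK  # nothing relevant was found, so skip the LLM entirely
    return generate(question, documents)

# Hypothetical stand-ins for the retriever and the LLM call.
def fake_retrieve(question):
    return ["some retrieved chunk"] if "strange" in question.lower() else []

def fake_generate(question, documents):
    return f"Answer based on {len(documents)} document(s)."

print(answer_with_guard("Who are the strange people?", fake_retrieve, fake_generate))
print(answer_with_guard("What is artificial intelligence?", fake_retrieve, fake_generate))
```

The guard covers only the empty-retrieval case; a model can still ignore the instruction when irrelevant chunks are retrieved.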

## RAG with local documents (PDF)

The first step is to download two longer PDF files: NVIDIA's 2023 annual report and 2022 annual review.

In [None]:
!wget -O document1.pdf --no-check-certificate "https://s201.q4cdn.com/141608511/files/doc_financials/2023/ar/2023-Annual-Report-1.pdf"
!wget -O document2.pdf --no-check-certificate "https://s201.q4cdn.com/141608511/files/doc_financials/2022/ar/2022-Annual-Review.pdf"


--2024-05-02 15:22:29--  https://s201.q4cdn.com/141608511/files/doc_financials/2023/ar/2023-Annual-Report-1.pdf
Resolving s201.q4cdn.com (s201.q4cdn.com)... 68.70.205.1, 68.70.205.3, 68.70.205.2, ...
Connecting to s201.q4cdn.com (s201.q4cdn.com)|68.70.205.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41968287 (40M) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-02 15:22:29 (237 MB/s) - ‘document1.pdf’ saved [41968287/41968287]

--2024-05-02 15:22:29--  https://s201.q4cdn.com/141608511/files/doc_financials/2022/ar/2022-Annual-Review.pdf
Resolving s201.q4cdn.com (s201.q4cdn.com)... 68.70.205.1, 68.70.205.3, 68.70.205.2, ...
Connecting to s201.q4cdn.com (s201.q4cdn.com)|68.70.205.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22363773 (21M) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-02 15:22:30 (189 MB/s) - ‘document2.pdf’ saved [22363773/22363773]



...and then we create a loader for each file.

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

# One loader per downloaded file (paths assume the Colab working directory /content).
loaders = [UnstructuredPDFLoader(fn) for fn in ["/content/document1.pdf","/content/document2.pdf"]]


As before, we load the documents and split them into smaller pieces.

In [None]:
chunked_pdf_doc = []

# Smaller chunks than for the web page; the splitter is created once and reused.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...


1838

We then compute an embedding vector for each chunk and index the vectors with FAISS.

In [None]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 29.4 s, sys: 7.83 ms, total: 29.4 s
Wall time: 29.4 s


Again, we use a RetrievalQA chain to answer questions from the indexed documents.

In [None]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
query = "What are the risk factors for NVIDIA?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  13  ITEM 1A. RISK FACTORS  In evaluating NVIDIA, the following risk factors should be
considered in addition to the other information in this Annual Report on Form 10-K. Purchasing or
owning NVIDIA common stock involves investment risks including, but not limited to, the risks
described below. Any one of the following risks could harm our business, financial condition,
results of operations or reputation, which could cause our stock price to decline, and you may lose
all or a part of your investment. Additional risks, trends and uncertainties not presently known to
us or that we currently believe are immaterial may also harm our business, financial condition,
results of operat

In [None]:
%%time

query = "What are the possible job profiles at NVIDIA?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  To be competitive and execute our business strategy successfully, we must recruit,
develop, and retain talented employees, including qualified executives, scientists, engineers, and
technical and non-technical staff.  Recruitment  As the demand for global technical talent continues
to be competitive, we have grown our technical workforce and have been successful in attracting top
talent to NVIDIA. We have attracted strong talent globally with our differentiated hiring strategies
for university, professional, executive and diverse recruits. The COVID-19 pandemic created expanded
hiring opportunities in new geographies and provided increased flexibility for employees to work
fro

Finally, let's examine whether the current system is hallucinating.

In [None]:
%%time

query = "What is the NVIDIA CEO's dog name?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  Chief Executive Officer  Chief Executive Officer  NVIDIA Corporation  Robert K. Burgess
Independent Consultant  Chris A. Malachowsky Founder and NVIDIA Fellow  Tench Coxe Former Managing
Director  Sutter Hill Ventures  Executive Team  Colette M. Kress Executive Vice President and  John
O. Dabiri Centennial Professor of Aeronautics and Mechanical Engineering  Chief Financial Officer
Jay Puri Executive Vice President  California Institute of Technology  Worldwide Field Operations
Persis S. Drell Provost  Debora Shoquist Executive Vice President  Stanford University  Operations
Dawn Hudson Former Chief Marketing Officer  Timothy S. Teter Executive Vice President  National
Footbal