# RAG with Llama 2 and LangChain
Retrieval-Augmented Generation (RAG) is a technique that combines a retriever with a generative language model to produce more accurate, grounded responses. It retrieves relevant information from a large corpus and then generates a contextually appropriate response to the query. Here we use a quantized version of the Llama 2 13B LLM with LangChain to perform generative QA with RAG. This notebook has been tested in Google Colab on a T4 GPU; change the runtime type to T4 GPU before running it.

## Install Packages

In [1]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0"
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install langchain
!pip install chromadb
!pip install sentence_transformers # ==2.2.2
!pip install unstructured
!pip install pdf2image
!pip install pdfminer.six
!pip install unstructured-pytesseract
!pip install unstructured-inference
!pip install faiss-gpu

Looking in indexes: https://pypi.org/simple, https://huggingface.github.io/autogptq-index/whl/cu118/
Collecting auto-gptq
  Downloading https://huggingface.github.io/autogptq-index/whl/cu118/auto-gptq/auto_gptq-0.5.0%2Bcu118-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.0 MB)
Collecting accelerate>=0.22.0 (from auto-gptq)
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.0.6-py3-none-any.whl (12.2 MB)
Collecting peft>=0.5.0 (from auto-gptq)
  Downloading peft-0.6

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


## Restart Runtime

In [None]:
import os
os.kill(os.getpid(), 9)  # send SIGKILL to the current process so that Colab restarts the runtime

## Load Llama 2
We will use the quantized version of the Llama 2 13B chat model from Hugging Face for our RAG task.

In [1]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

model_name = "TheBloke/Llama-2-13b-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
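gen_cfg.max_new_tokens = 512        # upper bound on the length of each generated answer
gen_cfg.temperature = 0.0000001     # near-zero (stand-in for 0.0): sampling becomes almost deterministic
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11   # values above 1.0 discourage repeated tokens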

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)
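
The configuration above combines `do_sample=True` with a temperature very close to zero. As a rough illustration of why this behaves almost like greedy decoding, here is a minimal sketch of a temperature-scaled softmax. The function name and the logits are made up for the example and are not part of `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into sampling probabilities at a given temperature."""
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp((x - top) / temperature) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
probs_default = softmax_with_temperature(logits, 1.0)
probs_near_zero = softmax_with_temperature(logits, 1e-7)
print(probs_default)    # probability mass is spread over all three tokens
print(probs_near_zero)  # nearly all mass sits on the highest-scoring token
```

With a temperature this small, sampling picks the top-scoring token almost every time, which is presumably why the notebook uses it in place of an exact 0.0.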

#### Test LLM with Llama 2 prompt structure and LangChain PromptTemplate

In [2]:
from textwrap import fill
from langchain.prompts import PromptTemplate

template = """
<s>[INST] <<SYS>>
You are an AI assistant. You are truthful, unbiased and honest in your response.

If you are unsure about an answer, truthfully say "I don't know"
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

text = "Explain artificial intelligence in a few lines"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Sure! Here is my explanation of artificial intelligence:  Artificial intelligence (AI) refers to the
development of computer systems that can perform tasks that typically require human intelligence,
such as learning, problem-solving, and decision-making. These systems use algorithms and machine
learning techniques to analyze data and make predictions or take actions based on that data. Some
examples of AI include natural language processing, image recognition, and autonomous vehicles.
Overall, the goal of AI is to create machines that can think and act like humans, but with greater
speed, accuracy, and consistency.
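
The template above follows the Llama 2 chat layout: a system message wrapped in `<<SYS>>` markers, placed inside an `[INST] ... [/INST]` block together with the user message. A small sketch of a helper that assembles this layout is shown below. `build_llama2_prompt` is a hypothetical name introduced only for this example, and the system message is shortened.

```python
def build_llama2_prompt(system_message: str, user_message: str) -> str:
    """Wrap a system message and a user message in Llama 2 chat markers."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_message.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )

example_prompt = build_llama2_prompt(
    'You are an AI assistant. If you are unsure about an answer, truthfully say "I don\'t know"',
    "Explain artificial intelligence in a few lines",
)
print(example_prompt)
```

The resulting string could be passed to `llm(...)` in the same way as `prompt.format(text=text)` above.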


In [3]:
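import locale
# Report UTF-8 as the preferred encoding (a common Colab workaround for shell commands that require a UTF-8 locale)
locale.getpreferredencoding = lambda: "UTF-8"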

## RAG from web pages
### A. Create a vector store for the context/external data
Here, we'll create embedding vectors from the unstructured data loaded from the source and store them in a vector store.

#### Load the document

Depending on the type of the source data, we can use the appropriate data loader from LangChain to load the data.



In [4]:
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format

web_loader = UnstructuredURLLoader(
    urls=["https://en.wikipedia.org/wiki/Solar_System"], mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


#### Split the documents into chunks

Because the context window of an LLM is limited, the data needs to be divided into smaller chunks with a text splitter such as ``CharacterTextSplitter`` or ``RecursiveCharacterTextSplitter``, so that each chunk is small enough to be fed into the LLM. Here each chunk holds up to 1024 characters and overlaps its neighbour by up to 128 characters.

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)

1446
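
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this toy `chunk_text` function (a name made up for this example) does not attempt.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still seen in full by at least one chunk.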

#### Create a vector database of the chunked documents with HuggingFace embeddings

In [6]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

We can use either Chroma or FAISS to create the [Vector Store](https://python.langchain.com/docs/modules/data_connection/vectorstores.html).

In [7]:
%%time

# Create the vectorized db with FAISS
from langchain.vectorstores import FAISS
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

# Create the vectorized db with Chroma
# from langchain.vectorstores import Chroma
# db_web = Chroma.from_documents(chunked_web_doc, embeddings)

CPU times: user 3.8 s, sys: 26.3 ms, total: 3.82 s
Wall time: 4.01 s
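
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea with cosine similarity on made-up three-dimensional vectors. The metric a real store uses depends on its configuration, and `cosine_similarity`, `top_k` and `toy_store` are illustrative names rather than FAISS or LangChain APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_store = [
    ("A chunk about the Sun.", [0.9, 0.1, 0.0]),
    ("A chunk about Jupiter.", [0.1, 0.9, 0.1]),
    ("A chunk about the Kuiper belt.", [0.0, 0.2, 0.9]),
]
hits = top_k([0.2, 0.8, 0.1], toy_store, k=2)
for text, score in hits:
    print(f"{score:.3f}  {text}")
```

A dedicated index such as FAISS exists to make this lookup fast when the store holds many thousands of vectors, where a linear scan like the one above becomes slow.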


### B. Use RetrievalQA chain
We instantiate a RetrievalQA chain from LangChain, which takes a retriever, an LLM and a chain_type as input arguments. When the QA chain receives a query, the retriever fetches the information relevant to the query from the vector store. The ``chain_type="stuff"`` method stuffs all the retrieved information into the context and makes a single call to the language model. The LLM then generates the response from the retrieved documents. [See information on LangChain retrievers](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa).

**LLM prompt structure**

We can also pass in the recommended prompt structure for Llama 2 for the QA. This lets us instruct the LLM to use only the available context to answer the question. If it cannot find information relevant to the query in the context, it is told **NOT** to make up an answer and to say instead that it cannot find the relevant information in the context.

In [8]:
%%time

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# use the recommended prompt style for the Llama 2 LLM
prompt_template = """
<s>[INST] <<SYS>>
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

<</SYS>>

{context}

Question: {question} [/INST]
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8})
    # Similarity search is the default way to retrieve documents relevant to a query, but we can use MMR by setting search_type="mmr"
    # k defines how many documents are returned; defaults to 4.
    # score_threshold sets a minimum relevance for documents returned by the retriever when using the "similarity_score_threshold" search type.
    # return_source_documents=True, # Optional parameter, returns the source documents used to answer the question
    retriever=db_web.as_retriever(), # (search_kwargs={'k': 5, 'score_threshold': 0.8}),
    chain_type_kwargs={"prompt": prompt},
)
query = "When was the solar system formed?"
result = Chain_web(query)
result

CPU times: user 2.51 s, sys: 125 ms, total: 2.63 s
Wall time: 2.79 s


{'query': 'When was the solar system formed?',
 'result': '\nBased on the provided context, I can answer the question. According to the text, the Solar System was formed approximately 4.6 billion years ago.'}

In [9]:
print(fill(result['result'].strip(), width=100))

Based on the provided context, I can answer the question. According to the text, the Solar System
was formed approximately 4.6 billion years ago.
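
To make the "stuff" step concrete, the sketch below shows roughly what happens between retrieval and the LLM call: the retrieved chunks are joined into a single context string, which is then substituted into the prompt template together with the question. `stuff_documents`, `toy_template` and the two example chunks are assumptions made for illustration; this is not LangChain's actual implementation.

```python
def stuff_documents(chunks: list[str], question: str, template: str) -> str:
    """Join all retrieved chunks into one context block and fill the prompt."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return template.format(context=context, question=question)

# Shortened, hypothetical version of the QA template, for illustration only.
toy_template = (
    "<s>[INST] <<SYS>>\n"
    "Use the following context to answer the question at the end.\n"
    "<</SYS>>\n\n"
    "{context}\n\n"
    "Question: {question} [/INST]"
)
retrieved = [
    "The Solar System formed about 4.6 billion years ago.",
    "It formed from the gravitational collapse of a giant molecular cloud.",
]
stuffed_prompt = stuff_documents(retrieved, "When was the solar system formed?", toy_template)
print(stuffed_prompt)
```

Because every retrieved chunk ends up in one prompt, the number of chunks (`k`) times the chunk size has to fit inside the model's context window along with the instructions and the question.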


In [10]:
%%time

query = "How was the solar system formed?"
result = Chain_web(query)
print(fill(result['result'].strip(), width=100))

According to the context provided, the solar system was formed through a process known as accretion,
where small particles of dust and gas came together to form larger bodies, eventually forming the
planets and other objects we see today. This process is thought to have occurred over millions of
years, with the Sun forming from a cloud of hot, dense gas and the planets forming from the leftover
material. The exact details of the formation of the solar system are still the subject of ongoing
research and debate, but the general idea of accretion is widely accepted as the most likely
explanation for how the solar system came to be.
CPU times: user 8.85 s, sys: 164 ms, total: 9.01 s
Wall time: 9.16 s


In [11]:
%%time

query = "Why do the planets orbit the Sun in the same direction that the Sun is rotating?"
result = Chain_web(query)
print(fill(result['result'].strip(), width=100))

Based on the provided context, the reason why the planets orbit the Sun in the same direction that
the Sun is rotating is because they formed from a rotating disk of gas and dust that surrounded the
young Sun. This disk was likely created by the collapse of a giant cloud of gas and dust, and it
rotated in the same direction as the Sun due to conservation of angular momentum. As the material in
the disk cooled and condensed, it formed into the planets we see today, all orbiting in the same
direction as the Sun due to their shared origin and the laws of motion.
CPU times: user 7.23 s, sys: 136 ms, total: 7.37 s
Wall time: 7.42 s


In [12]:
%%time

query = "What are planets of the solar system composed of? Give a detailed response."
result = Chain_web(query)
print(fill(result['result'].strip(), width=100))

Based on the provided context, I can answer the question as follows:  The planets of the solar
system are composed of a variety of materials, including rocks, metals, and gases. The terrestrial
planets, which include Mercury, Venus, Earth, and Mars, are primarily composed of rock and metal.
These planets have a definite surface and are densely packed with minerals and metals, such as iron
and silicates.  The four giant planets, Jupiter, Saturn, Uranus, and Neptune, are composed mainly of
hydrogen and helium gases, with some heavier elements such as methane and ammonia. These planets
have no solid surface and are made up of layers of gas and liquid.  The Kuiper belt, a region of the
outer Solar System beyond the orbit of Neptune, contains mostly icy objects, while the asteroid belt
between the orbits of Mars and Jupiter is composed of rocky objects.  In summary, the planets of the
solar system are composed of a mix of rocks, metals, and gases, with the terrestrial planets being
primaril

Let's test our LLM with a query that is not relevant to the context. The model should respond that it does not have enough information to respond to this query.

In [13]:
%%time

query = "How does the tranformers architecture work?"
result = Chain_web(query)
print(fill(result['result'].strip(), width=100))

I cannot provide a detailed answer to how the Transformers architecture works based on the given
context. The context only mentions "ring systems" and "retrograde manner," which do not provide
enough information to explain the Transformers architecture. The Transformers architecture is a
complex topic that requires a comprehensive understanding of deep learning and neural networks, and
it is not possible to explain it fully within the scope of this context.
CPU times: user 4.61 s, sys: 81.8 ms, total: 4.69 s
Wall time: 4.69 s


**The model responded as expected. It does not have the information to answer the question!**

## RAG from PDF Files

Download the PDF file

In [14]:
!gdown "https://github.com/muntasirhsn/datasets/raw/main/Solar-System-Wikipedia.pdf" # this is just a pdf print of the Solar System page on Wikipedia!

Downloading...
From: https://github.com/muntasirhsn/datasets/raw/main/Solar-System-Wikipedia.pdf
To: /content/Solar-System-Wikipedia.pdf


Load PDF Files

In [15]:
from langchain.document_loaders import UnstructuredPDFLoader
pdf_loader = UnstructuredPDFLoader("/content/Solar-System-Wikipedia.pdf")
pdf_doc = pdf_loader.load()
updated_pdf_doc = filter_complex_metadata(pdf_doc)

Split the document into chunks

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_pdf_doc = text_splitter.split_documents(updated_pdf_doc)
len(chunked_pdf_doc)

241

Create the vector store

In [17]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 5.29 s, sys: 7.14 ms, total: 5.3 s
Wall time: 5.3 s


RAG with RetrievalQA

In [18]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "When was the solar system formed?"
result = Chain_pdf(query)
print(fill(result['result'].strip(), width=100))

Based on the provided context, the solar system was formed approximately 4.6 billion years ago. This
information can be found in the second paragraph of the Wikipedia article, which states that the
Solar System was formed "over 4.6 billion years ago" from the gravitational collapse of a giant
molecular cloud.
CPU times: user 6.65 s, sys: 678 ms, total: 7.33 s
Wall time: 7.44 s


In [19]:
%%time

query = "How does the tranformers architecture work?"
result = Chain_pdf(query)
print(fill(result['result'].strip(), width=100))

I cannot answer your question because the provided text does not contain information about
transformers architecture. The text discusses the structure and composition of the solar system, the
Nice model, and the expansion of the Sun. It also mentions Christiaan Huygens and his discovery of
Titan, as well as the observations of the transit of Venus in 1639 by Jeremiah Horrocks and William
Crabtree. None of this information relates to transformers architecture. Therefore, I cannot provide
an answer to your question based on the given context.
CPU times: user 8.15 s, sys: 464 ms, total: 8.61 s
Wall time: 8.61 s
