# RAG with Llama 2 and LangChain
Retrieval-Augmented Generation (RAG) is a technique that combines a retriever with a generative language model to produce more accurate responses. It retrieves relevant information from a large corpus and then generates a contextually appropriate response to the query. Here we use a quantized version of the Llama 2 13B chat model with LangChain to perform generative QA with RAG. This notebook has been tested in Google Colab with a T4 GPU. Please change the runtime type to T4 GPU before running the notebook.
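
To make the retrieve-then-generate flow concrete, here is a minimal, dependency-free sketch. It is not what the notebook runs: the word-overlap retriever and the `generate` callable are stand-ins (assumptions for illustration) for the vector-store retriever and the Llama 2 pipeline used later.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rag_answer(query: str, corpus: List[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve context, place it in the prompt, then call the generator."""
    context = "\n".join(retrieve(query, corpus, k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)


corpus = [
    "The Solar System formed about 4.6 billion years ago.",
    "Transformers use attention layers.",
    "Jupiter is the largest planet in the Solar System.",
]

# Stand-in "LLM" that simply echoes the prompt it receives.
print(rag_answer("When was the Solar System formed?", corpus,
                 generate=lambda p: p, k=1))
```

Swapping the toy retriever for an embedding-based one and the echo function for a real LLM gives the pipeline built in the rest of this notebook.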

## Install Packages

In [2]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0"
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install langchain
!pip install chromadb
!pip install sentence_transformers # ==2.2.2
!pip install unstructured
!pip install pdf2image
!pip install pdfminer.six
!pip install unstructured-pytesseract
!pip install unstructured-inference
!pip install faiss-gpu
!pip install pikepdf
!pip install pypdf

## Restart Runtime

## Load Llama 2
We will use a quantized version of the Llama 2 13B chat model from Hugging Face for our RAG task.

In [1]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

model_name = "TheBloke/Llama-2-13b-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)



#### Test LLM with Llama 2 prompt structure and LangChain PromptTemplate

In [2]:
from textwrap import fill
from langchain.prompts import PromptTemplate

template = """
<s>[INST] <<SYS>>
You are an AI assistant. You are truthful, unbiased and honest in your response.

If you are unsure about an answer, truthfully say "I don't know"
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

text = "Explain artificial intelligence in a few lines"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))



Sure! Here is my explanation of artificial intelligence:  Artificial intelligence (AI) refers to the
development of computer systems that can perform tasks that typically require human intelligence,
such as learning, problem-solving, and decision-making. These systems use algorithms and machine
learning techniques to analyze data and make predictions or take actions based on that data. Some
examples of AI include natural language processing, image recognition, and autonomous vehicles.
Overall, the goal of AI is to create machines that can think and act like humans, but with greater
speed, accuracy, and consistency.
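
The template above follows the Llama 2 chat layout: a system message wrapped in `<<SYS>>` markers, followed by the user message, all inside `[INST] ... [/INST]`. As a small sketch, the same layout can be produced with a plain helper function. The name `build_llama2_prompt` is hypothetical and not part of LangChain or Transformers.

```python
def build_llama2_prompt(user_message: str, system_message: str) -> str:
    """Wrap a system and a user message in the Llama 2 chat markers
    used by the template in this notebook."""
    return (
        f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


example = build_llama2_prompt(
    user_message="Explain artificial intelligence in a few lines",
    system_message="You are an AI assistant.",
)
print(example)
```

The resulting string can be passed to the LLM in the same way as the output of `prompt.format(...)`.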


In [3]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

## RAG from web pages
### A. Create a vector store for the context/external data
Here, we'll create embedding vectors of the unstructured data loaded from the source and store them in a vector store.

#### Load the document

Depending on the type of the source data, we can use the appropriate data loader from LangChain to load the data.



In [6]:
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format

web_loader = UnstructuredURLLoader(
    urls=["https://en.wikipedia.org/wiki/Solar_System"], mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)



#### Split the documents into chunks

Due to the limited size of the context window of an LLM, the data need to be divided into smaller chunks with a text splitter like ``CharacterTextSplitter`` or ``RecursiveCharacterTextSplitter``. In this way, the smaller chunks can be fed into the LLM.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)

1452
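
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. This is a deliberate simplification: ``RecursiveCharacterTextSplitter`` prefers to break on separators such as paragraphs and sentences, whereas this sketch (with the hypothetical name `split_with_overlap`) cuts at fixed character offsets.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters, where each window
    starts chunk_overlap characters before the previous one ended."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means that text near a chunk boundary appears in both neighbouring chunks, so a sentence cut by the boundary is less likely to be lost to the retriever.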

#### Create a vector database of the chunked documents with HuggingFace embeddings

In [8]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()


We can use either Chroma or FAISS to create the [Vector Store](https://python.langchain.com/docs/modules/data_connection/vectorstores.html).

In [9]:
%%time

# Create the vectorized db with FAISS
from langchain.vectorstores import FAISS
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

# Create the vectorized db with Chroma
# from langchain.vectorstores import Chroma
# db_web = Chroma.from_documents(chunked_web_doc, embeddings)

CPU times: user 3.56 s, sys: 34.7 ms, total: 3.6 s
Wall time: 3.92 s
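
Conceptually, a vector store keeps one embedding per chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The sketch below shows this idea with a brute-force search over tiny hand-made vectors. Cosine similarity is used here as an illustrative metric only; the metric and index structure FAISS actually uses depend on how the index is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], doc_vecs, k=2)
print(hits)
```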


### B. Use RetrievalQA chain
We instantiate a RetrievalQA chain from LangChain, which takes a retriever, an LLM and a `chain_type` as input arguments. When the QA chain receives a query, the retriever retrieves information relevant to the query from the vector store. The ``chain_type="stuff"`` method stuffs all the retrieved information into the context and makes a single call to the language model. The LLM then generates the response from the retrieved documents. [See information on LangChain retrievers](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa).
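
The "stuff" strategy can be pictured as plain string concatenation: all retrieved chunks are joined and placed into a single prompt. The following sketch shows the idea only; `stuff_documents`, the template and the separator are assumptions for illustration, not LangChain's internal implementation.

```python
from typing import List

# Illustrative template; the notebook uses the Llama 2 prompt format instead.
PROMPT = "Context:\n{context}\n\nQuestion: {question}"


def stuff_documents(docs: List[str], question: str,
                    template: str = PROMPT, separator: str = "\n\n") -> str:
    """Join all retrieved chunks and insert them into one prompt."""
    context = separator.join(docs)
    return template.format(context=context, question=question)


docs = ["Chunk one.", "Chunk two."]
stuffed = stuff_documents(docs, "When was the solar system formed?")
print(stuffed)
```

Because everything goes into one prompt, the number and size of retrieved chunks must fit within the model's context window.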

**LLM prompt structure**

We can also pass in the recommended prompt structure for Llama 2 for the QA. In this way, we can instruct our LLM to use only the available context to answer our question. If it cannot find information relevant to our query in the context, it is instructed **NOT** to make up an answer, but rather to say that it is unable to find relevant information in the context.

In [12]:
%%time

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Use the recommended prompt style for the Llama 2 LLM
prompt_template = """
<s>[INST] <<SYS>>
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

<</SYS>>

{context}

Question: {question} [/INST]
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # Similarity search is the default way to retrieve documents relevant to a query; set search_type="mmr" to use MMR instead.
    # k defines how many documents are returned; defaults to 4.
    # score_threshold sets a minimum relevance for the returned documents when using the "similarity_score_threshold" search type, e.g.:
    # retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8})
    # return_source_documents=True,  # optional parameter; also returns the source documents used to answer the question
    retriever=db_web.as_retriever(), # (search_kwargs={'k': 5, 'score_threshold': 0.8}),
    chain_type_kwargs={"prompt": prompt},
)
query = "When was the solar system formed?"
result = Chain_web.invoke(query)
result

CPU times: user 2.15 s, sys: 127 ms, total: 2.28 s
Wall time: 2.44 s


{'query': 'When was the solar system formed?',
 'result': '\nBased on the information provided, the Solar System was formed approximately 4.568 billion years ago.'}

In [13]:
print(fill(result['result'].strip(), width=100))

Based on the information provided, the Solar System was formed approximately 4.568 billion years
ago.
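
The comments in the cell above mention MMR (maximal marginal relevance) as an alternative to plain similarity search. As a rough sketch of the idea, MMR picks documents one at a time, rewarding similarity to the query and penalising similarity to documents already picked. The function and parameter names below are illustrative, and the code is a textbook-style version rather than the vector store's own implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
        k: int = 4, lambda_mult: float = 0.5) -> List[int]:
    """Greedy MMR: lambda_mult=1.0 is pure relevance, lower values favour diversity."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0.0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.3]
doc_vecs = [[1.0, 0.0], [1.0, 0.05], [0.5, 1.0]]  # the first two are near-duplicates
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # trades relevance for diversity
```

With near-duplicate chunks in the store, a diversity-aware selection like this can give the LLM a broader context than the top-k most similar chunks alone.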


In [27]:
%%time

query = "Explain in detail how the solar system was formed."
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))



The solar system was formed 4.568 billion years ago from the gravitational collapse of a region
within a large molecular cloud. This region was likely several light-years across and may have given
birth to several stars. The cloud consisted mostly of hydrogen, with some helium, and small amounts
of heavier elements fused by previous generations of stars. The gravitational collapse led to the
formation of a protostar, which eventually became the Sun. The remaining material in the cloud
condensed into small, rocky bodies called planetesimals, which collided and merged to form the
planets we know today.
CPU times: user 8.04 s, sys: 240 ms, total: 8.28 s
Wall time: 8.69 s


In [31]:
%%time

query = "Why do the planets orbit the Sun in the same direction that the Sun is rotating?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))



Based on the information provided, the reason why the planets orbit the Sun in the same direction
that the Sun is rotating is because they formed from a rotating disk of gas and dust that surrounded
the young Sun. This disk was likely formed through the collapse of a giant cloud of gas and dust,
and it rotated in the same direction as the Sun due to conservation of angular momentum. As the
material in the disk cooled and condensed, it formed into the planets we see today, all orbiting the
Sun in the same direction due to their shared origin and the conservation of angular momentum.
CPU times: user 9.27 s, sys: 158 ms, total: 9.42 s
Wall time: 10.5 s


In [32]:
%%time

query = "What are the planets of the solar system composed of? Give a detailed response."
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))



Based on the provided context, the planets of the solar system are composed of the following:  1.
Rocky materials: The four inner planets - Mercury, Venus, Earth, and Mars - are primarily composed
of rocky materials. 2. Gaseous materials: The four giant planets - Jupiter, Saturn, Uranus, and
Neptune - are primarily composed of gaseous materials, such as hydrogen and helium. 3. Icy
materials: The Kuiper belt, which is located beyond the asteroid belt and consists of objects that
are mostly icy, is thought to be the source of many short-period comets. 4. Dust and small
particles: The solar system also contains a vast amount of dust and small particles that are found
in the interplanetary space between the planets.  It is important to note that the composition of
the planets can vary significantly depending on their location within the solar system. For example,
the inner planets are much denser than the gas giants, and the gas giants have no solid surfaces due
to their high temperatures 

### C. Hallucination Check
Hallucination in RAG refers to the generation of content by an LLM that is not based on the retrieved knowledge.

Let's test our LLM with a query that is not relevant to the context. The model should respond that it does not have enough information to respond to this query.

In [33]:
%%time

query = "How does the tranformers architecture work?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))



I cannot provide a detailed answer to how the transformers architecture works based on the given
context. The context only mentions "ring systems" and "retrograde manner," which do not provide
enough information to explain the transformers architecture. Therefore, I cannot answer the question
with certainty.
CPU times: user 3.27 s, sys: 106 ms, total: 3.38 s
Wall time: 3.37 s


The model responded as expected. The context provided to it does not contain any information on the transformer architecture, so it cannot answer this question!
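
Beyond reading the answers by eye, one crude way to flag possibly ungrounded answers is to measure how many of the answer's content words also occur in the retrieved context. The sketch below is a simple heuristic under that assumption, not a method used in this notebook and not a reliable hallucination detector; `grounding_score` is a hypothetical helper.

```python
import re
from typing import Set


def _content_words(text: str) -> Set[str]:
    """Lower-cased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def grounding_score(answer: str, context: str) -> float:
    """Fraction of the answer's content words that also appear in the context."""
    answer_words = _content_words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _content_words(context)) / len(answer_words)


context = "The Solar System formed 4.568 billion years ago from a molecular cloud."
print(grounding_score("The Solar System formed 4.568 billion years ago.", context))  # 1.0
print(grounding_score("Transformers rely on attention layers.", context))            # 0.0
```

To apply something like this to the chain, the retrieved chunks would need to be available, for example by enabling the `return_source_documents=True` option mentioned in the chain setup.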

## RAG from PDF Files

Download the PDF file

In [34]:
!gdown "https://github.com/muntasirhsn/datasets/raw/main/Solar-System-Wikipedia.pdf" # this is just a pdf print of the Solar System page on Wikipedia!



Load PDF Files

In [35]:
from langchain.document_loaders import UnstructuredPDFLoader
pdf_loader = UnstructuredPDFLoader("/content/Solar-System-Wikipedia.pdf")
pdf_doc = pdf_loader.load()
updated_pdf_doc = filter_complex_metadata(pdf_doc)

Split the document into chunks

In [36]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_pdf_doc = text_splitter.split_documents(updated_pdf_doc)
len(chunked_pdf_doc)

241

Create the vector store

In [37]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 5.62 s, sys: 7.29 ms, total: 5.63 s
Wall time: 5.61 s


### RAG with RetrievalQA

In [39]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "When was the solar system formed?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))



Based on the provided context, the solar system was formed approximately 4.6 billion years ago. This
information can be found in the second paragraph of the Wikipedia article, which states that the
Solar System was formed "over 4.6 billion years ago" from the gravitational collapse of a giant
molecular cloud.
CPU times: user 6.39 s, sys: 848 ms, total: 7.24 s
Wall time: 7.79 s


In [40]:
%%time

query = "Explain in detail how the solar system was formed."
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))



Based on the provided context, here is a detailed explanation of how the Solar System was formed:
The Solar System was formed approximately 4.6 billion years ago from the gravitational collapse of a
giant molecular cloud. This cloud was likely several light-years across and may have birthed several
stars. The cloud consisted mostly of hydrogen, with some helium, and small amounts of heavier
elements fused by previous generations of stars. As the region that would become the Solar System
collapsed, conservation of angular momentum caused it to rotate faster, leading to the formation of
a protoplanetary disk with a diameter of roughly 200 AU (30 billion km; 19 billion mi) and a hot,
dense protostar at the center.  Over time, the contracting nebula continued to flatten into a disk,
with the majority of the mass collecting at the center to form the Sun. As the material in the disk
collided and coalesced, hundreds of protoplanets may have formed, but many of these objects either
merged or w

### Hallucination Check

In [43]:
%%time

query = "How does the tranformers architecture work?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))



I cannot answer your question because the provided text does not contain information about
transformers architecture. The text discusses the structure and composition of the solar system, the
Nice model, and the expansion of the Sun. It also mentions Christiaan Huygens and his discovery of
Titan, as well as the observations of the transit of Venus in 1639 by Jeremiah Horrocks and William
Crabtree. None of this information relates to transformers architecture. Therefore, I cannot provide
an answer to your question based on the given context.
CPU times: user 8.3 s, sys: 515 ms, total: 8.82 s
Wall time: 9.06 s
