# Retrieval Augmented Generation (RAG)

## Overview
*   Motivation for RAG
*   Idea behind RAG
*   Advantages and Disadvantages
*   Implementation to augment question + answer
*   Advanced applications


#### Imagine you went to live under a rock in August 2006. When you come out in 2024, you are asked how many planets revolve around the sun. What would you say?
![pluto](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/pluto_planets.jpeg?raw=1)

This is similar to LLMs, which are trained on data up to a certain cutoff and then asked about things that are not in their training data. Understandably, an LLM will either be unable to answer or will hallucinate an answer that is probably wrong.

### What can be done?

Have the LLM go to the library using **Retrieval Augmented Generation (RAG)**!

RAG involves adding your own data (via a retrieval tool) to the prompt that you pass into a large language model.


![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag-overview.original.png?raw=1)
Image credit: https://scriv.ai/guides/retrieval-augmented-generation-overview/
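To make the idea concrete, here is a minimal sketch of the "retrieve, then augment the prompt" step. It is only an illustration under simplifying assumptions: the `library` list, the helper names, and the word-overlap ranking are all hypothetical stand-ins, and a real pipeline would rank passages by embedding similarity instead.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (a stand-in for embedding search)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(query, documents, k=1):
    """Prepend the top-k retrieved passages to the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical mini "library" of passages the model was never trained on.
library = [
    "In August 2006 the IAU reclassified Pluto as a dwarf planet, leaving eight planets.",
    "FAISS is a library for efficient similarity search over dense vectors.",
]

prompt = augment("How many planets revolve around the sun?", library)
print(prompt)
```

The resulting `prompt` string is what would be passed to the LLM in place of the bare question.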

RAG has been shown to improve LLM prediction accuracy without increasing the number of model parameters.

![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag_acc_v_size.png?raw=1)

*Image credit: Yu, Wenhao. "Retrieval-augmented generation across heterogeneous knowledge." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 2022.*

RAG also increases explainability by giving the source for information.

![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag_source_locator.png?raw=1)

Image credit: https://ai.stanford.edu/blog/retrieval-based-NLP/

## Advantages and Disadvantages

### Advantages

*   Provides domain-specific context
*   Improves predictive performance and reduces hallucinations
*   Does not increase model parameters
*   Less labor intensive than fine-tuning LLMs

### Disadvantages

*   May introduce latency since we are adding a relatively costly search step
*   If your dataset includes private information, you may inadvertently expose that information to other users.
*   The data you want to use needs to be curated, and you need to decide how it should be accessed. This adds time to the initial set-up.


# Implementation

### 1. Install + load relevant modules:
*   langchain
*   torch
*   transformers
*   sentence-transformers
*   datasets
*   faiss-cpu  
*   pypdf
*  unstructured[pdf]
*  huggingface_hub (add hf_token)




In [14]:
!pip install langchain==0.1.5
!pip install --quiet langchain_experimental
!pip install torch
!pip install transformers
!pip install faiss-cpu
!pip install pypdf
!pip install sentence-transformers
!pip install unstructured==0.12.3
!pip install unstructured[pdf]==0.12.3
!pip install tiktoken
!pip install huggingface_hub
!pip install ipywidgets  # without it, the HF token prompt raises an error
!pip install unstructured  # needed below
!pip install unstructured[pdf]



In [15]:
# Download supporting data from llm-workshop + MIT OpenCourseWare

!git clone https://github.com/argonne-lcf/llm-workshop.git
!wget https://ocw.mit.edu/courses/6-00-introduction-to-computer-science-and-programming-fall-2008/e1c8c4fcfc48f347033239c8a023403d_6-00F08-L10.pdf




In [16]:
# Create a HF token key from https://huggingface.co/settings/tokens so that you
# can login to HF from inside this notebook
from huggingface_hub import login

import os
from getpass import getpass

hf_token = getpass('Enter huggingfacehub api token: ')
login(token=hf_token, add_to_git_credential=True)

Token is valid (permission: fineGrained).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /home/codespace/.cache/huggingface/token
Login successful


### 2. Choose a dataset to use and then load it into your code
Many types of data can be loaded into this pipeline, though the most commonly used are PDFs and websites.

In this example we use PDFs, which are hosted in the directory `llm-workshop/tutorials/04-rag/PDFs`, and load them with LangChain's `DirectoryLoader`.

To load websites instead, we could use LangChain's `WebBaseLoader`.
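For the website option mentioned above, a loader call might look like the sketch below. The helper name `load_web_pages` and the placeholder URL are assumptions made for illustration; `WebBaseLoader` itself comes from `langchain_community` and additionally needs `beautifulsoup4` and network access.

```python
def load_web_pages(urls):
    """Load each URL into a LangChain Document (one Document per page)."""
    # Imported lazily so this helper can be defined before the packages are installed.
    from langchain_community.document_loaders import WebBaseLoader

    loader = WebBaseLoader(list(urls))
    return loader.load()

# Example (uncomment to run; the URL is only a placeholder for a page you want to index):
# web_docs = load_web_pages(["https://example.com"])
# print(len(web_docs), web_docs[0].metadata)
```

The returned documents can be chunked and embedded in the same way as the PDFs in the following steps.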



In [17]:
from langchain_community.document_loaders import DirectoryLoader

# Recursively find every PDF under the directory and parse each one into a Document.
loader = DirectoryLoader('llm-workshop/tutorials/04-rag/PDFs', glob="**/*.pdf", show_progress=True)
papers = loader.load()



Error loading file llm-workshop/tutorials/04-rag/PDFs/2307.07443.pdf

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

### 3. Now, we need to split our documents into chunks.
We want each chunk to be longer than a single word but much shorter than an entire page. This matters for the similarity search between the query and the documents: the query is embedded and compared against every embedded chunk in the dataset, and the chunks with the greatest similarity are then added to the query as context.

It is essential to choose the chunking method according to your data type.
There are different ways to do this:

Fixed size
*   Token: Splits text on token boundaries, grouping a fixed number of tokens into each chunk.
*   Character: Splits on a user-defined character (separator).

Recursive
*  Recursively splits text. Useful for keeping related pieces of text next to each other.

Document based
*   HTML: Splits text based on HTML-specific characters.
*   Markdown: Splits on Markdown-specific characters
*   Code: Splits text based on characters specific to coding languages.

Semantic chunking
*   Uses embeddings to group text by meaning. The text is split into sentences, neighboring sentences are grouped together (for example, into groups of 3 sentences), and adjacent groups that are similar in the embedding space are merged into one chunk.
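The semantic chunking idea from the list above can be sketched in a few lines. This is a toy illustration, not LangChain's implementation: the bag-of-words `embed` function stands in for a neural encoder, sentences are compared one at a time rather than in groups, and the fixed `threshold` of 0.7 is an assumed value (library implementations typically derive the breakpoint from the distribution of distances).

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy embedding: a bag-of-words count vector (a real pipeline would use a neural encoder)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def semantic_chunks(text, threshold=0.7):
    """Start a new chunk wherever consecutive sentences are farther apart than the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(nxt)) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks

text = ("Binary search halves the search range. Binary search needs a sorted list. "
        "Pluto is a dwarf planet. Pluto is far from the sun.")
chunks = semantic_chunks(text)
for chunk in chunks:
    print(chunk)
```

Here the topic change between the second and third sentences is the only place where the distance exceeds the threshold, so the text is split there.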

Here we use recursive splitting, where the text is split using an ordered list of separators. The default separators are `["\n\n", "\n", " ", ""]`. A large text is first split on `\n\n`. If a resulting piece is still larger than the chunk size, it is split on the next separator, `\n`, and so on until every piece fits within the chunk size. Note that the cell below passes only `"\n\n"` as a separator, so it splits on paragraph breaks alone and cannot fall back to finer separators.
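The recursive strategy described above can be sketched in plain Python. This is a simplified illustration written for this tutorial, not the code behind `RecursiveCharacterTextSplitter`; the function name `recursive_split` is hypothetical and chunk overlap is ignored.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing left to split on, so the oversized piece is kept whole
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut into fixed-size character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbors while they still fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, chunk_size=10))
```

With only `("\n\n",)` as separators, a paragraph longer than `chunk_size` is returned whole in this sketch, which is why the choice of separator list matters.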


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
character_chunker = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=["\n\n"])
char_chunks = character_chunker.split_documents(papers)

In [None]:
for i in char_chunks[0:5]:
  print(i, "\n")

In [None]:
print(f"{len(papers)} papers have been split into {len(char_chunks)} chunks.")

#### Example: Comparing Naive Chunking with Semantic Chunking

Using a lecture transcript from MIT OpenCourseWare, [Binary Trees: Fall 2008 Lecture 10](https://ocw.mit.edu/courses/6-00-introduction-to-computer-science-and-programming-fall-2008/resources/6-00f08-l10/), we can see the difference between naive chunking and semantic chunking.
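The "naive" side of this comparison is fixed-size chunking with overlap. As a rough sketch of the idea, assuming a plain character window (the helper name `fixed_size_chunks` is hypothetical, and LangChain's `CharacterTextSplitter` works differently in that it splits on a separator and then merges pieces):

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap=0):
    """Slide a window of chunk_size characters over text, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

sample = "abcdefghij"
print(fixed_size_chunks(sample, chunk_size=4, chunk_overlap=2))
```

Because the window ignores sentence and topic boundaries, a chunk can start or end mid-thought, which is the weakness semantic chunking tries to address.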

In [None]:
from langchain.docstore.document import Document
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("e1c8c4fcfc48f347033239c8a023403d_6-00F08-L10.pdf")
# load() returns one Document per PDF page.
pages = loader.load()

# The first page of the lecture is the license; skip it and join the text of all other pages.
lecture_text = "".join(elem.page_content for elem in pages[1:])

lecture = Document(page_content=lecture_text, metadata={"source": "local"})


In [None]:
#Initialize the encoder model.
import torch
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/msmarco-distilbert-dot-v5"
# Use the GPU when one is available; otherwise fall back to the CPU.
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

encoder = HuggingFaceEmbeddings(
  model_name=model_name,
  model_kwargs=model_kwargs,
  encode_kwargs=encode_kwargs
)

#Perform semantic chunking.
from langchain_experimental.text_splitter import SemanticChunker

#initializing the splitter.
semantic_chunker = SemanticChunker(encoder, buffer_size=5)

#list of grouped_sentences (buffers)
buffers = semantic_chunker.split_documents([lecture])
buffers = [buffer.page_content for buffer in buffers]
semantic_chunks = semantic_chunker.create_documents(buffers)

In [None]:
from langchain.text_splitter import CharacterTextSplitter
character_chunker = CharacterTextSplitter(chunk_size=500, chunk_overlap=150, separator=" ")
char_chunks = character_chunker.split_documents([lecture])

In [None]:
import random
random.shuffle(semantic_chunks)
random.shuffle(char_chunks)

print("Semantic chunking")
for i in semantic_chunks[0:10]:
  print(i, "\n")

print("\n\n ----------- \n\n")

print("Fixed-size character chunking")
for i in char_chunks[0:10]:
  print(i, "\n")

In [None]:
print(f"Number of chunks produced by semantic chunking: {len(semantic_chunks)}")
print(f"Number of chunks produced by character chunking: {len(char_chunks)}")


#### ProTip: Semantic chunking is not well suited to poorly parsed PDF content.

In [None]:
#Initialize the biomedical domain-specific encoder model.
import torch
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "pritamdeka/S-PubMedBert-MS-MARCO"
# Use the GPU when one is available; otherwise fall back to the CPU.
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

biomedical_encoder = HuggingFaceEmbeddings(
  model_name=model_name,
  model_kwargs=model_kwargs,
  encode_kwargs=encode_kwargs
)


In [None]:
#Perform semantic chunking on scientific papers.
from langchain_experimental.text_splitter import SemanticChunker

#initializing the splitter.
semantic_chunker = SemanticChunker(biomedical_encoder, buffer_size=5)

#list of grouped_sentences (buffers)
buffers = semantic_chunker.split_documents(papers)
buffers = [buffer.page_content for buffer in buffers]
semantic_chunks = semantic_chunker.create_documents(buffers)

for i in semantic_chunks[0:10]:
  print(i, "\n")

### 4. Then we embed the chunked texts using a Transformer and create a FAISS vector database
Embedding the chunks lets us search them by similarity to a query. Let's investigate the retrieved documents for a query.

Vector databases, also called vector storage, efficiently store and retrieve vector data, which are arrays of numerical values representing points in multi-dimensional space. They're useful for handling data like embeddings from deep learning models or numerical features. Unlike traditional relational databases, which aren't optimized for vectors, vector databases offer efficient storage, indexing, and querying for high-dimensional and variable-length vectors.

Popular vector stores include:
1. Chroma
2. FAISS
3. Pinecone
4. Weaviate
5. Qdrant

Here, we build the vector store with FAISS.
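To show what a vector store does conceptually, here is a brute-force sketch that stores vectors with a payload and returns the nearest ones by cosine similarity. The class `TinyVectorStore` and the 2-dimensional example vectors are made up for illustration; libraries such as FAISS add indexing structures so the search stays fast for large collections.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Brute-force in-memory store that compares the query against every stored vector."""

    def __init__(self):
        self._items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self._items.append((list(vector), payload))

    def search(self, query, k=3):
        """Return the k (score, payload) pairs most similar to the query, best first."""
        scored = [(cosine_similarity(query, vec), payload) for vec, payload in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about proteins")
store.add([0.0, 1.0], "chunk about planets")
store.add([0.9, 0.1], "chunk about protein design")

results = store.search([1.0, 0.0], k=2)
for score, payload in results:
    print(f"{score:.3f}  {payload}")
```

In the real pipeline the vectors come from the encoder model, and the payloads are the text chunks.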

In [None]:
from langchain_community.vectorstores import FAISS

# Embed every chunk and build an in-memory FAISS index.
faiss_vector_db = FAISS.from_documents(semantic_chunks, biomedical_encoder)
question = "Do you have any information on publications about RFDiffusion?"
# Retrieve the 3 chunks nearest to the query in embedding space.
searchDocs = faiss_vector_db.similarity_search(question, k=3)

#investigate top-3 nearest (most relevant) documents for the query.
for doc in searchDocs:
  print(doc.page_content)

![vector_database](https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/04-rag/rag_images/vector_database.png?raw=1)

Image credit: https://blog.gopenai.com/primer-on-vector-databases-and-retrieval-augmented-generation-rag-using-langchain-pinecone-37a27fb10546

### 5. Initialize the LLM that will be used for question answering

Here, we use a pretrained model flan-t5-large as part of a HuggingFacePipeline. This will later be chained with the vector database for RAG.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.llms import HuggingFacePipeline

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
# Wrap the model in a transformers text2text pipeline, then expose it to LangChain.
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(
   pipeline=pipe,
   model_kwargs={"temperature": 0, "max_length": 2048, "max_new_tokens": 1024, "device": "cuda"},
)


### 6. Retrieve data and use it to answer a question

![rag_workflow](https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/04-rag/rag_images/rag_workflow.png?raw=1)

Image credit: https://blog.gopenai.com/retrieval-augmented-generation-101-de05e5dc21ef

Let's ask questions the model could only answer if it actually read the texts!

In [None]:
from langchain.prompts import PromptTemplate

template = """You are an honest and helpful AI. You are always truthful and concise in your answers. Please answer the question with the provided context.
If you don't know the answer, please say I don't know.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
  llm=llm,
  chain_type="stuff",
  retriever=faiss_vector_db.as_retriever(),
  chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
result = qa_chain.invoke({"query": "What technique proposed in 2023 can be used to predict protein folding?"})
print(result["result"])

Now let's ask the chain where to find the article related to RFdiffusion.

In [None]:
qa_chain.invoke({"query": "Where was the RFdiffusion paper published?"})

In [None]:
qa_chain.invoke({"query": "What can I use RFdiffusion model for?"})
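Since one of RAG's advantages is being able to point to sources, it may be useful to return the retrieved chunks along with the answer. The sketch below assumes the `llm`, `faiss_vector_db`, and `QA_CHAIN_PROMPT` objects from the cells above; the helper name `answer_with_sources` is hypothetical, while `return_source_documents=True` is an existing `RetrievalQA` option.

```python
def answer_with_sources(llm, retriever, question, prompt=None):
    """Run a "stuff" RetrievalQA chain and return (answer, list of source identifiers)."""
    # Imported lazily so the helper can be defined before LangChain is installed.
    from langchain.chains import RetrievalQA

    chain_type_kwargs = {"prompt": prompt} if prompt is not None else {}
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,  # also return the chunks that were retrieved
        chain_type_kwargs=chain_type_kwargs,
    )
    result = chain.invoke({"query": question})
    # Chunks without a "source" entry in their metadata are reported as "unknown".
    sources = [doc.metadata.get("source", "unknown") for doc in result["source_documents"]]
    return result["result"], sources

# Example (assumes the objects defined in the cells above):
# answer, sources = answer_with_sources(
#     llm, faiss_vector_db.as_retriever(),
#     "Where was the RFdiffusion paper published?", QA_CHAIN_PROMPT)
# print(answer, sources)
```

Note that chunks rebuilt from plain strings, as in the semantic chunking cell, may carry no `source` metadata, in which case the helper reports `"unknown"`; inspecting `page_content` of the returned documents is an alternative.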