<a href="https://colab.research.google.com/github/argonne-lcf/llm-workshop/blob/main/tutorials/04-rag/RAGTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation (RAG)

## Overview
*   Motivation for RAG
*   Idea behind RAG
*   Advantages and Disadvantages
*   Implementation to augment question + answer
*   Advanced applications


#### Imagine you went to live under a rock in August 2006. When you come out in 2024, you are asked how many planets revolve around the sun. What would you say?
![pluto](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/pluto_planets.jpeg?raw=1)

This is similar to the situation of LLMs: they are trained on data up to a certain cutoff date and are then asked about information they never saw. In such cases, an LLM will either be unable to answer or will hallucinate a plausible but likely wrong answer.

### What can be done?

Have the LLM go to the library using **Retrieval Augmented Generation (RAG)**!

RAG involves adding your own data (via a retrieval tool) to the prompt that you pass into a large language model.


![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag-overview.original.png?raw=1)
Image credit: https://scriv.ai/guides/retrieval-augmented-generation-overview/
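The flow in the figure can be sketched in a few lines of plain Python. This is a toy illustration, not the pipeline built later in this tutorial: the `library` list, the word-overlap `retrieve` function, and the `augment` helper are hypothetical stand-ins for a real vector search and prompt template.

```python
# Toy RAG loop: retrieve the most relevant snippet, then prepend it to the prompt.
# All names and data here are illustrative assumptions.
library = [
    "In August 2006 the IAU reclassified Pluto as a dwarf planet.",
    "The solar system has eight planets.",
    "Jupiter is the largest planet in the solar system.",
]

def tokenize(text):
    """Lowercase a string and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(question, context_snippets):
    """Build the prompt that would be sent to the LLM."""
    context = "\n".join(context_snippets)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "How many planets does the solar system have?"
prompt = augment(question, retrieve(question, library))
print(prompt)
```

A real system would replace the word-overlap ranking with an embedding-based search, as done in the implementation section.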

RAG has been shown to improve LLM prediction accuracy without increasing the number of model parameters.

![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag_acc_v_size.png?raw=1)

*Image credit: Yu, Wenhao. "Retrieval-augmented generation across heterogeneous knowledge." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 2022.*

RAG also improves explainability, because the answer can point to the source of its information.

![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag_source_locator.png?raw=1)

Image credit: https://ai.stanford.edu/blog/retrieval-based-NLP/

## Advantages and Disadvantages

### Advantages

*   Provides domain-specific context
*   Improves predictive performance and reduces hallucinations
*   Does not increase model parameters
*   Less labor intensive than fine-tuning LLMs

### Disadvantages

*   May introduce latency since we are adding a relatively costly search step
*   If your dataset includes private information, you may inadvertently expose that information to other users.
*   The data you want to use needs to be curated, and you need to decide how it will be accessed. This adds time to the initial setup.


# Implementation

### 1. Install + load relevant modules:
*   langchain
*   torch
*   transformers
*   sentence-transformers
*   datasets
*   faiss-cpu
*   pypdf
*   unstructured[pdf]
*   tiktoken
*   huggingface_hub (requires a Hugging Face token, `hf_token`)




In [None]:
!pip install langchain
!pip install torch
!pip install transformers
!pip install faiss-cpu
!pip install pypdf
!pip install sentence-transformers
!pip install unstructured
!pip install unstructured[pdf]
!pip install tiktoken
!pip install huggingface_hub
from huggingface_hub import login

# Replace the placeholder with your own Hugging Face access token.
hf_token = "your-hf-token"
login(token=hf_token, add_to_git_credential=True)

Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### 2. Choose a dataset to use and then load it into your code
Many kinds of data can be loaded for RAG; the most common are PDFs and websites.

Websites can be loaded with LangChain's `WebBaseLoader`.

In this example, we use PDFs and load them with LangChain's `DirectoryLoader`.

The PDFs are hosted in the directory `llm-workshop/tutorials/04-rag/PDFs`, which becomes available after cloning the workshop repository.



In [None]:
! git clone https://github.com/argonne-lcf/llm-workshop.git

Cloning into 'llm-workshop'...
remote: Enumerating objects: 243, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 243 (delta 44), reused 8 (delta 3), pack-reused 170
Receiving objects: 100% (243/243), 42.92 MiB | 31.48 MiB/s, done.
Resolving deltas: 100% (124/124), done.


In [None]:
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('llm-workshop/tutorials/04-rag/PDFs', glob="**/*.pdf", show_progress=True)
documents = loader.load()

### 3. Now, we need to split our documents into chunks.
Each chunk should be longer than a single word but much shorter than an entire page. This matters for the similarity search between the query and the documents: the embedded query is compared against the embedded chunks, and the chunks with the greatest similarity are added to the query as context.

It is essential to choose the chunking method according to your data type.
There are different ways to do this:

Fixed size
*   Token: Splits text into chunks that each contain a fixed number of tokens.
*   Character: Splits text on a user-defined character.

Recursive
*  Recursively splits text. Useful for keeping related pieces of text next to each other.

Document based
*   HTML: Splits text based on HTML-specific characters.
*   Markdown: Splits on Markdown-specific characters
*   Code: Splits text based on characters specific to coding languages.

Semantic chunking
*   Uses embeddings to assess the semantic relationship between pieces of text. In essence, the text is split into sentences, the sentences are grouped (for example, in groups of 3), and groups that are close in the embedding space are merged.

Here we use recursive splitting, where the text is split using an ordered list of separators. The default separators are `["\n\n", "\n", " ", ""]`. A large text is first split on `\n\n`. If a resulting piece is still larger than the chunk size, it is split on the next separator, `\n`, and so on until every piece fits within the chunk size.
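The recursive idea can be sketched in plain Python. This is a simplified illustration and not LangChain's actual implementation: the function name `recursive_split` is hypothetical, sizes are measured in characters, and overlap between chunks is ignored.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Simplified sketch: split on the first separator, recurse with the
    remaining separators into pieces that are still too long, then merge
    neighbouring pieces back together while they fit.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge adjacent pieces, re-inserting the separator between them.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "RAG retrieves text.\n\nIt adds the text to the prompt.\nThen the LLM answers."
for chunk in recursive_split(sample, chunk_size=35):
    print(repr(chunk))
```

With a chunk size of 35 characters, the paragraph break is tried first and the line break second, so each sentence of the sample ends up in its own chunk.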


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(documents)
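The `chunk_size` and `chunk_overlap` arguments can be pictured as a sliding window over the text. The sketch below is a simplified, character-level view with a hypothetical helper, `sliding_window_chunks`; the real splitter cuts at separators, so its overlaps will generally not be exact.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, stop, step)]

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each window repeats the last two characters of the previous one, so a sentence that straddles a chunk boundary is less likely to be cut off from its context.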

### 4. Then we embed the chunked texts using a Transformer.
Embedding lets us represent each chunk in a form that can be searched.

An embedding converts text into a numerical vector. RAG compares the embedding of the user query with the embeddings of the chunks in the knowledge library and retrieves the closest ones.

In this example, we use a small embedding model, MiniLM.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

# Sentence-transformers model used to embed each chunk.
modelPath = "sentence-transformers/all-MiniLM-L6-v2"
# Run the embedding model on the CPU.
model_kwargs = {'device': 'cpu'}
# Keep the raw (unnormalized) embedding vectors.
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
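To make the comparison between a query embedding and chunk embeddings concrete, here is a minimal sketch using cosine similarity, one common similarity measure. The 3-dimensional vectors and chunk names are made up for illustration; real MiniLM embeddings have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a query and two chunks.
query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {
    "chunk about protein design": [0.8, 0.2, 0.1],
    "chunk about the State of the Union": [0.0, 0.1, 0.9],
}

# Score every chunk against the query and pick the most similar one.
scores = {name: cosine_similarity(query_vec, vec) for name, vec in chunk_vecs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```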

### 5. Create a vector database
Vector databases, also called vector storage, efficiently store and retrieve vector data, which are arrays of numerical values representing points in multi-dimensional space. They're useful for handling data like embeddings from deep learning models or numerical features. Unlike traditional relational databases, which aren't optimized for vectors, vector databases offer efficient storage, indexing, and querying for high-dimensional and variable-length vectors.

There are various types of vector databases:
1. Chroma
2. FAISS
3. Pinecone
4. Weaviate
5. Qdrant

Here, we build the vector store with FAISS.

![vector_database](https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/04-rag/rag_images/vector_database.png?raw=1)

Image credit: https://blog.gopenai.com/primer-on-vector-databases-and-retrieval-augmented-generation-rag-using-langchain-pinecone-37a27fb10546
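Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors lie closest to a query vector. The toy class below does this by brute force with Euclidean distance; `ToyVectorStore` and its data are hypothetical, and real libraries such as FAISS add index structures so that not every stored vector has to be compared.

```python
import math

class ToyVectorStore:
    """In-memory store that keeps (vector, text) pairs and searches them exhaustively."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        """Store one vector together with the text it represents."""
        self._items.append((list(vector), text))

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors are closest to the query (Euclidean distance)."""
        scored = [(math.dist(query_vector, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0])
        return [text for _, text in scored[:k]]

store = ToyVectorStore()
store.add([0.0, 1.0], "chunk A")
store.add([1.0, 0.0], "chunk B")
store.add([0.9, 0.1], "chunk C")
print(store.search([1.0, 0.0], k=2))
```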

In [None]:
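from langchain.vectorstores import FAISS

# Embed every chunk and build a FAISS index over the resulting vectors.
db = FAISS.from_documents(docs, embeddings)

# Retrieve the chunks most similar to the question and print the best match.
question = "What is RF Fold?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)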

RFdiffusion, allowing it to efficiently target this site (right bar, pink). C-D) As well as conditioning on hotspot residue information, a fine-tuned RFdiffusion model can also condition on input fold information (secondary structure and block-adjacency information - see Supplementary Methods 4.5). This effectively allows the specification of a (for instance, particularly compatible) fold that the binder should adopt. C) Two examples showing binders can be specified to adopt either a ferredoxin fold (left) or a particular helical bundle fold (right). D) Quantification of the efficiency of fold-conditioning. Secondary structure inputs were accurately respected (top, pink). Note that in this design target and target site, RFdiffusion without fold-specification made generally helical


### 6. Initialize the LLM that will be used for question answering

Here, we use a pretrained model flan-t5-large as part of a HuggingFacePipeline. This will later be chained with the vector database for RAG.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline

# Load the FLAN-T5 tokenizer and sequence-to-sequence model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
# Wrap them in a text2text-generation pipeline that LangChain can call.
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={"temperature": 0, "max_length": 2048},
)
#'HuggingFaceH4/zephyr-7b-beta'

### 7. Retrieve data and use it to answer a question

![rag_workflow](https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/04-rag/rag_images/rag_workflow.png?raw=1)

Image credit: https://blog.gopenai.com/retrieval-augmented-generation-101-de05e5dc21ef

Let's ask questions that the model could only answer if it actually used the retrieved texts!

In [None]:
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
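To see what the model ultimately receives, the template can be filled in by hand. The sketch below mimics the "stuff" strategy, in which the retrieved chunks are concatenated and substituted for `{context}`. The chunk texts and the question are made up, and joining the chunks with blank lines is an assumption about the formatting.

```python
# Plain-Python view of prompt stuffing; no LangChain objects are involved.
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""

# Hypothetical chunks standing in for what the retriever would return.
retrieved_chunks = [
    "Hypothetical chunk 1 returned by the retriever.",
    "Hypothetical chunk 2 returned by the retriever.",
]

# Concatenate the chunks and substitute them, with the question, into the template.
filled_prompt = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What is RFdiffusion?",
)
print(filled_prompt)
```

Because every retrieved chunk is pasted into the prompt, its length grows with both the number of chunks and the chunk size.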

In [None]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
  llm=llm,
  chain_type="stuff",
  retriever=db.as_retriever(),
  chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
result = qa_chain({"query": "What technique proposed in 2023 can be used to predict protein folding?"})
print(result["result"])

Token indices sequence length is longer than the specified maximum sequence length for this model (1076 > 512). Running this sequence through the model will result in indexing errors


RFdiffusion
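The warning in the output above reports that the stuffed prompt (1076 tokens) is longer than the maximum sequence length of 512 specified for this model. One possible mitigation is to keep only as many retrieved chunks as fit a token budget; retrieving fewer chunks (for example with `db.as_retriever(search_kwargs={"k": 2})`) or using a smaller `chunk_size` are other options. The sketch below is illustrative only: both helper names are hypothetical, and the word count is a crude stand-in for the model's real tokenizer.

```python
def select_chunks_within_budget(chunks, max_tokens, count_tokens):
    """Keep chunks in ranked order until adding another would exceed max_tokens."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

def approx_token_count(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

# Three made-up chunks of 200 words each, ordered from most to least relevant.
ranked_chunks = ["alpha " * 200, "beta " * 200, "gamma " * 200]
kept = select_chunks_within_budget(ranked_chunks, max_tokens=450, count_tokens=approx_token_count)
print(len(kept))
```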


Now let's ask the chain where to find the article about RFdiffusion.

In [None]:
qa_chain({"query": "Which scientific article should I read to learn about RFdiffusion for protein folding?"})



{'query': 'Which scientific article should I read to learn about RFdiffusion for protein folding?',
 'result': 'Nature | Vol 620 | 31 August 2023 | 1091'}

## Exercise

Use any of the frameworks/models here to load in your favorite websites and ask the model a question regarding them.

Hint:
```
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(["https://www.espn.com/", "https://www.google.com"])

# To bypass SSL verification errors during fetching, you can set the "verify" option
# (this disables certificate checks, so use it only for testing):
loader.requests_kwargs = {'verify': False}

data = loader.load()
```

## More applications

### RAG using Llama 2, Langchain and ChromaDB

In [None]:
!pip install einops xformers \
bitsandbytes chromadb


In [None]:
!pip install accelerate



Load the model and set up the pipeline

In [None]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = 'cuda'

# Quantization configuration for loading a large model with less GPU memory
# (requires the `bitsandbytes` library).
# Note: with load_in_4bit=False the 4-bit settings below are not applied;
# set it to True to actually load the model in 4-bit precision.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=False,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
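A back-of-the-envelope calculation shows why quantization matters for a model of this size. The numbers below assume roughly 7 billion parameters and count the weights only; activations, the key-value cache, and framework overhead are ignored, so real memory use will be higher.

```python
# Approximate weight memory for a ~7-billion-parameter model at different precisions.
NUM_PARAMS = 7e9  # assumption: a round 7 billion parameters
BYTES_PER_PARAM = {"float32": 4.0, "float16 / bfloat16": 2.0, "4-bit (nf4)": 0.5}

# Memory in gigabytes = parameters * bytes per parameter / 1e9.
estimates_gb = {name: NUM_PARAMS * size / 1e9 for name, size in BYTES_PER_PARAM.items()}
for name, gb in estimates_gb.items():
    print(f"{name:>20}: ~{gb:.1f} GB")
```

Under these assumptions, 4-bit weights need about an eighth of the memory of 32-bit weights.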

In [None]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Prepare model, tokenizer: 73.339 sec.


In [None]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 12.021 sec.


Prepare function to test pipeline

In [None]:
def test_model(tokenizer, pipeline, prompt_to_test):
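    """
    Run a single prompt through the pipeline and print the generated text
    together with the inference time.

    Args:
        tokenizer: the tokenizer (used for its end-of-sequence token id)
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt string

    Returns:
        None
    """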
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
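In `test_model`, `do_sample=True` together with `top_k=10` asks the pipeline to sample each next token from the 10 highest-scoring candidates instead of always taking the single best one. The sketch below illustrates the idea on a toy vocabulary; `top_k_sample` and the logits are hypothetical, and this is not how `transformers` implements it internally.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Unnormalized softmax weights over the surviving logits (shifted by the
    # maximum for numerical stability); choices() normalizes them internally.
    max_logit = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring tokens (indices 0 and 2) can ever be drawn; a larger `k` allows more varied output.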

In [None]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")


Test inference: 732.623 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.
The State of the Union address is an annual speech delivered by the President of the United States to Congress, in which the President reports on the state of the union and outlines legislative priorities for the upcoming year.


Set up Huggingface pipeline

In [None]:
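# HuggingFacePipeline is imported from langchain.llms in the setup cell earlier in this section.
# If this cell raises `NameError: name 'HuggingFacePipeline' is not defined`, re-run that cell first.
llm = HuggingFacePipeline(pipeline=query_pipeline)
# check again that everything is working
llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")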

Load in data

In [None]:
loader = TextLoader("llm-workshop/tutorials/04-rag/state_union2023.txt",
                    encoding="utf8")
documents = loader.load()

Chunk data recursively

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

In [None]:
all_splits

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\n\nLast year COVID-19 kept us apart. This year we are finally together again.\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\n\nWith a duty to one another to the American people to the Constitution.\n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.\n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated.\n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined.\n\nHe met the Ukrainian people.\n\nBiden condemns Putin in scathing State of the Union speech, in 180 seconds', metadata={'source': 'llm-workshop/tutorials/04-rag/state_union2023.txt'})

Embed and store in Chroma Vector store

In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device':'cpu'}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

In [None]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

Set up chain

In [None]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

Test RAG

In [None]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [None]:
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)

Get sources...

In [None]:
# Retrieve the chunks most similar to the query and show where they came from.
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")