# RAG - Retrieval-Augmented Generation

# RAG Hands-on Tutorial

This tutorial uses the [Langchain framework](https://www.langchain.com/).

Suggested code references:
- Langchain RAG from scratch [link](https://github.com/langchain-ai/rag-from-scratch/tree/main)
- Langchain RAG quickstart [link](https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/)

Session hands-on code in [github.com/ncsa/cyber2a-workshop](https://github.com/ncsa/cyber2a-workshop)

Session technical details in course book : [cyber2a.github.io/cyber2a-course/sections/foundation-models.html](https://cyber2a.github.io/cyber2a-course/sections/foundation-models.html)


## Recap

![With and Without RAG](notebook_images/rag-with-without.png "With and Without RAG") 

![RAG](notebook_images/rag-before-after.png "RAG")

## RAG - *Retrieval*-Augmented Generation

### Knowledge DB

* Vector database (Beginners [blog 1](https://medium.com/data-and-beyond/vector-databases-a-beginners-guide-b050cbbe9ca0), Pinecone [blog 2](https://www.pinecone.io/learn/vector-database/))


![knowledge DB](notebook_images/knowledge-db.png "Knowledge DB")


### Vector DB Retrieval

![Vector DB Retrieval](notebook_images/vectordb-retrieval.png "Vector DB Retrieval")

### RAG - Retrieval Steps

1. Prepare data 
2. Create a vector store and insert data
3. Search the vector store and retrieve relevant documents

In [None]:
# basic imports
import os
import json
import logging
import sys
import pandas as pd

from dotenv import load_dotenv
load_dotenv(override=True)

# create and configure logger
logging.basicConfig(level=logging.INFO, datefmt='%Y-%m-%dT%H:%M:%S',
                    format='%(asctime)-15s.%(msecs)03dZ %(levelname)-7s : %(name)s - %(message)s',
                    handlers=[logging.StreamHandler(sys.stdout)]
                    )
# create log object with current module name
log = logging.getLogger(__name__)

### 1. Prepare data
- Load data from different sources
- We will use the proceedings of the [Arctic data symposium 2023](https://arcticdata.io/catalog/portals/pisymposium2023).
- Examples: [Proceedings final report](https://permafrostcoasts.org/wp-content/uploads/2024/08/2023-PI-Symposium-final-report-web.pdf), [participant bios](https://arcticdata.io/metacat/d1/mn/v2/object/urn:uuid:6e613b84-842c-4ac9-993d-a863d7040aa5)
- The data is in the data/docs directory.


### 1.1 Data Loaders
- Langchain provides different data loaders for different file types
- Loaded data is returned as objects of the Langchain [document class](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)

![Langchain document class](notebook_images/langchain-document-class.png "Langchain document class")
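
To make the shape of a loaded document concrete, here is a minimal sketch of a Document-like object written with the standard library only. `SimpleDocument` and the file path are hypothetical names used for illustration; this is not Langchain's implementation, only the two properties (`page_content` and `metadata`) mirror what the loaders return.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# A PDF loader typically yields one document per page, with metadata
# recording the source file and the page number (path below is made up).
page = SimpleDocument(
    page_content="Welcome to the symposium.",
    metadata={"source": "data/example.pdf", "page": 0},
)

print(page.page_content)
print(page.metadata["source"], page.metadata["page"])
```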


In [None]:
# data loaders
from langchain_community.document_loaders import CSVLoader, DataFrameLoader, PyPDFLoader, Docx2txtLoader, UnstructuredRSTLoader, DirectoryLoader


class DataLoaders:
    """
    various data loaders
    """
    def __init__(self, data_dir_path):
        self.data_dir_path = data_dir_path
    
    def csv_loader(self):
        csv_loader_kwargs = {
                            "csv_args":{
                                "delimiter": ",",
                                "quotechar": '"',
                                },
                            }
        dir_csv_loader = DirectoryLoader(self.data_dir_path, glob="**/*.csv", use_multithreading=True,
                                    loader_cls=CSVLoader, 
                                    loader_kwargs=csv_loader_kwargs,
                                    )
        return dir_csv_loader
    
    def pdf_loader(self):
        dir_pdf_loader = DirectoryLoader(self.data_dir_path, glob="**/*.pdf",
                                    loader_cls=PyPDFLoader,
                                    )
        return dir_pdf_loader
    
    def word_loader(self):
        dir_word_loader = DirectoryLoader(self.data_dir_path, glob="**/*.docx",
                                    loader_cls=Docx2txtLoader,
                                    )
        return dir_word_loader
    
    def rst_loader(self):
        rst_loader_kwargs = {
                        "mode":"single"
                        }
        dir_rst_loader = DirectoryLoader(self.data_dir_path, glob="**/*.rst",
                                    loader_cls=UnstructuredRSTLoader, 
                                    loader_kwargs=rst_loader_kwargs
                                    )
        return dir_rst_loader
    

In [None]:
# load data
data_dir_path = os.getenv('DATA_DIR_PATH', "data")
data_loader = DataLoaders(data_dir_path=data_dir_path)
log.info("Loading files from directory %s", data_dir_path)
dir_csv_loader = data_loader.csv_loader()
dir_word_loader = data_loader.word_loader()
dir_pdf_loader = data_loader.pdf_loader()
dir_rst_loader = data_loader.rst_loader()
csv_data = dir_csv_loader.load()
word_data = dir_word_loader.load()
pdf_data = dir_pdf_loader.load()
rst_data = dir_rst_loader.load()

In [None]:
for doc in pdf_data:
    print(doc)
    break

### 1.2 Format into text and metadata
- Convert data to a list of texts and metadata 
- Metadata can be used for filtering the data
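
As a small illustration of metadata-based filtering, the sketch below keeps only the texts whose metadata matches a given source. The sample texts, file names, and the helper `filter_by_source` are assumptions made up for this example; the `source` and `row_page` keys follow the custom metadata used in this notebook.

```python
from typing import Dict, List, Tuple

# Made-up sample data in the same "parallel lists" layout used in this section
texts_demo = ["row about sea ice", "page about permafrost", "page about tundra"]
metadatas_demo = [
    {"source": "data/a.csv", "row_page": 0},
    {"source": "data/b.pdf", "row_page": 3},
    {"source": "data/b.pdf", "row_page": 4},
]


def filter_by_source(texts: List[str], metadatas: List[Dict], source: str) -> List[Tuple[str, Dict]]:
    """Return (text, metadata) pairs whose 'source' equals the requested one."""
    return [(t, m) for t, m in zip(texts, metadatas) if m["source"] == source]


pdf_only = filter_by_source(texts_demo, metadatas_demo, "data/b.pdf")
print(pdf_only)
```

Vector stores usually offer this kind of filter at query time, so a search can be restricted to one file or file type.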


In [None]:
# get text and metadata from the data
def get_text_metadatas(csv_data=None, pdf_data=None, word_data=None, rst_data=None):
    """
    Each document class has page_content and metadata properties
    Separate text and metadata content from Document class
    Have custom metadata if needed
    """
    # treat missing inputs as empty lists so any subset of loaders can be passed
    csv_data, pdf_data = csv_data or [], pdf_data or []
    word_data, rst_data = word_data or [], rst_data or []

    csv_texts = [doc.page_content for doc in csv_data]
    # custom metadata
    csv_metadatas = [{'source': doc.metadata['source'], 'row_page': doc.metadata['row']} for doc in csv_data]   # metadata={'source': 'filename.csv', 'row': 0}
    pdf_texts = [doc.page_content for doc in pdf_data]
    pdf_metadatas = [{'source': doc.metadata['source'], 'row_page': doc.metadata['page']} for doc in pdf_data]  # metadata={'source': 'data/filename.pdf', 'page': 8}
    word_texts = [doc.page_content for doc in word_data]
    word_metadatas = [{'source': doc.metadata['source'], 'row_page': ''} for doc in word_data]
    rst_texts = [doc.page_content for doc in rst_data]
    rst_metadatas = [{'source': doc.metadata['source'], 'row_page': ''} for doc in rst_data]         # metadata={'source': 'docs/images/architecture/index.rst'}

    texts = csv_texts + pdf_texts + word_texts + rst_texts
    metadatas = csv_metadatas + pdf_metadatas + word_metadatas + rst_metadatas
    return texts, metadatas


texts , metadatas = get_text_metadatas(csv_data, pdf_data, word_data, rst_data)

### 1.3 Chunking

![Chunk Optimization](notebook_images/rag-chunking.png "Chunk Optimization")

- Split texts into chunks for embedding
- Return a list of document chunks (list of langchain [document class](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html))

In [None]:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from typing import List

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000,
        chunk_overlap=200,
        separators=[
            "\n\n", "\n", ". ", " ", ""
        ]  # try to split on paragraphs... fallback to sentences, then chars, ensure we always fit in context window
    )

docs: List[Document] = text_splitter.create_documents(texts=texts, metadatas=metadatas)


In [None]:
print(docs[0])
print("Number of documents: ", len(docs))
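
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is an assumption-laden toy: it counts characters and ignores separators, whereas the splitter used above measures length in tokens and prefers to break on paragraph and sentence boundaries. `chunk_text` is a hypothetical helper, not a Langchain function.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.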


### 1.4 Embeddings

- Embeddings are mathematical representations of data points in a high-dimensional space.
- In the context of natural language processing:
    1. Word Embeddings: Individual words are represented as real-valued vectors in a multi-dimensional space.
    2. Semantic Capture: These embeddings capture the semantic meaning and relationships of the text.
    3. Similarity Principle: Words with similar meanings tend to have similar vector representations.

- We will use OpenAI embeddings with the text-embedding-ada-002 model, which accepts at most 8191 input tokens according to the OpenAI documentation.
- HF Embedding models leaderboard [link](https://huggingface.co/spaces/mteb/leaderboard)
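
The similarity principle can be sketched with cosine similarity, a common way to compare embedding vectors. The 3-dimensional vectors below are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the first two phrases are related, the third is not
toy_embeddings = {
    "glacier": [0.9, 0.1, 0.0],
    "ice sheet": [0.8, 0.2, 0.1],
    "spreadsheet": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["glacier"], toy_embeddings["ice sheet"])
sim_far = cosine_similarity(toy_embeddings["glacier"], toy_embeddings["spreadsheet"])
print(f"glacier vs ice sheet:   {sim_close:.3f}")
print(f"glacier vs spreadsheet: {sim_far:.3f}")
```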

In [None]:
# embeddings 
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

### RAG - Retrieval Steps

1. ~~Prepare data~~
2. Create a vector store and insert data
3. Search the vector store and retrieve relevant documents

### 2. Vector Store

![Inserting into DB](notebook_images/inserting-db.png "Inserting into DB")

Source Credits : [Blog.demir](https://blog.demir.io/hands-on-with-rag-step-by-step-guide-to-integrating-retrieval-augmented-generation-in-llms-ac3cb075ab6f)


- We will use the [Qdrant](https://qdrant.tech/) vector store for this example
- For today, the vector store is kept in local memory
- Qdrant also provides a Docker image that can be used to run a vector store, locally or hosted remotely. Eg: [Qdrant docker container running locally](http://localhost:6333/dashboard)

- Blog post on vector stores [link](https://medium.com/google-cloud/vector-databases-are-all-the-rage-872c888fa348)

In [None]:
# creating a qdrant vector store in local memory

from langchain_community.vectorstores import Qdrant

# qdrant collection name
collection_name = os.getenv('QDRANT_COLLECTION_NAME', "data-collection")

# create vector store in local memory
vectorstore = Qdrant.from_documents(
    documents=docs,
    embedding=embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name=collection_name,
    )
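
Conceptually, a vector store keeps an embedding next to each text and ranks stored texts by similarity to the query embedding. The sketch below is a toy written with the standard library, not Qdrant's implementation: `ToyVectorStore`, `bag_of_words`, and `sparse_cosine` are hypothetical names, and word counts stand in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple


def bag_of_words(text: str) -> Dict[str, float]:
    """Stand-in 'embedding': lowercase word counts as a sparse vector."""
    return dict(Counter(text.lower().split()))


def sparse_cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (vector, text, metadata) rows in a Python list."""

    def __init__(self, embed: Callable[[str], Dict[str, float]]):
        self._embed = embed
        self._rows: List[Tuple[Dict[str, float], str, dict]] = []

    def add_texts(self, texts: List[str], metadatas: List[dict]) -> None:
        for text, meta in zip(texts, metadatas):
            self._rows.append((self._embed(text), text, meta))

    def similarity_search(self, query: str, k: int = 2) -> List[Tuple[str, dict, float]]:
        """Return the k stored texts most similar to the query."""
        query_vec = self._embed(query)
        scored = [(sparse_cosine(query_vec, vec), text, meta) for vec, text, meta in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [(text, meta, score) for score, text, meta in scored[:k]]


store = ToyVectorStore(embed=bag_of_words)
store.add_texts(
    ["permafrost thaw affects arctic coasts",
     "tundra restoration supports arctic ecosystems",
     "the workshop lunch menu"],
    [{"source": "a"}, {"source": "b"}, {"source": "c"}],
)
hits = store.similarity_search("arctic permafrost thaw", k=2)
for text, meta, score in hits:
    print(f"{score:.3f}  {meta['source']}  {text}")
```

A production store replaces the linear scan with an approximate nearest-neighbour index, which is what makes search fast on large collections.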

### RAG - Retrieval Steps

1. ~~Prepare data~~
2. ~~Create a vector store and insert data~~
3. Search the vector store and retrieve relevant documents

## 3. Retrieve relevant documents
Create a retriever from the vector store

In [None]:
# Retriever to retrieve relevant snippets
retriever = vectorstore.as_retriever()

### RAG - Retrieval Steps

1. ~~Prepare data~~
2. ~~Create a vector store and insert data~~
3. ~~Search the vector store and retrieve relevant documents~~

## RAG - Retrieval-Augmented *Generation*


![RAG LLM](notebook_images/rag-llm.png "RAG LLM")

![LLM prompting](notebook_images/rag-prompting.png "LLM Prompting")

## 4. Call LLM

### 4.1 Prompting
- Use a prompt template [link](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.prompt.PromptTemplate.html)
    - includes input parameters that can be dynamically changed
- Use Langchain hub to pull prompts [link](https://smith.langchain.com/hub)
    - easy to share and reuse prompts
    - can see what are the popular prompts for specific use cases
    - Eg: [rag-prompt](https://smith.langchain.com/hub/rlm/rag-prompt)
- Use a custom prompt
```
from langchain_core.prompts import PromptTemplate

qa_prompt_template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
    1. If the question has some initial findings, use that as context.
    2. If you don't know the answer, don't try to make up an answer. Just say **I can't find the final answer but you may want to check the following sources** and add the source documents as a list.
    3. If you find the answer, write the answer in a concise way and add the list of sources that are **directly** used to derive the answer. Exclude the sources that are irrelevant to the final answer.

    {context}

    Question: {question}
    Helpful Answer:"""

rag_chain_prompt = PromptTemplate.from_template(qa_prompt_template) 
```
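
To see what "input parameters that can be dynamically changed" means, the sketch below fills a template with plain Python string formatting. This is only an analogy for how the `{context}` and `{question}` placeholders get substituted; `toy_template` and `fill_prompt` are hypothetical names and the context text is placeholder content.

```python
toy_template = """Answer the question using only the context below.

{context}

Question: {question}
Helpful Answer:"""


def fill_prompt(template: str, context: str, question: str) -> str:
    """Substitute the two named placeholders in the template."""
    return template.format(context=context, question=question)


filled = fill_prompt(
    toy_template,
    context="Chunk A text.\n\nChunk B text.",
    question="What does chunk A say?",
)
print(filled)
```

The same template can be reused for every query: only the retrieved context and the user question change between calls.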


#### 4.1 Prompting

rlm/rag-prompt from Langchain

![RLM RAG prompt](notebook_images/rlm-rag-prompt.png "rlm/rag-prompt")

In [None]:
# prompting

from langchain import hub
prompt = hub.pull("rlm/rag-prompt")

### 4.2 Call LLM
- We will use 
    - OpenAI GPT-4o-mini and 
    - Ollama llama3.2 model (hosted by NCSA)
- Each model has its own formats and parameters

In [None]:
# formatting the documents as a string before calling the LLM
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [None]:
# set up the OpenAI GPT-4o-mini chat model
from langchain_openai import ChatOpenAI

# create a chat openai model
llm: ChatOpenAI = ChatOpenAI(
            temperature=0,
            model="gpt-4o-mini",
            max_retries=500,
        )

In [None]:
# call GPT4o-mini
llm.invoke("What is the capital of the world?")

## 5 RAG 

![RAG system](notebook_images/rag-system.png "RAG system")


### 5 RAG Chain
Combining it all together

- Context is the retrieved docs from the retriever/vector db
- RunnablePassthrough() is used to pass the user query as is to the chain
- format_docs is used to format the documents as a string
- prompt is used to call the prompt template
- llm is used to call the LLM
- StrOutputParser() is used to parse the output from the LLM

In [None]:
# rag chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

openai_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
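
The data flow of this chain can be sketched in plain Python, with each stage as an ordinary function. Everything here is a stand-in invented for illustration (the tiny corpus, the keyword retriever, and the stub LLM that returns a canned message); it only mirrors the order of the stages, not how Langchain runs them.

```python
from typing import Dict, List

TOY_CORPUS = ["permafrost thaw affects coasts", "lunch is served at noon"]


def toy_retriever(question: str) -> List[str]:
    """Return chunks sharing at least one word with the question."""
    words = set(question.lower().split())
    return [chunk for chunk in TOY_CORPUS if words & set(chunk.lower().split())]


def toy_format_docs(chunks: List[str]) -> str:
    return "\n\n".join(chunks)


def toy_prompt(inputs: Dict[str, str]) -> str:
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\nAnswer:"


def toy_llm(prompt_text: str) -> Dict[str, str]:
    """Stub model: returns a message object instead of calling a real LLM."""
    return {"content": f"(stub answer based on {len(prompt_text)} prompt characters)"}


def toy_output_parser(message: Dict[str, str]) -> str:
    return message["content"]


def toy_rag_chain(question: str) -> str:
    # the question is passed through unchanged; the context comes from retrieval
    inputs = {
        "context": toy_format_docs(toy_retriever(question)),
        "question": question,
    }
    return toy_output_parser(toy_llm(toy_prompt(inputs)))


result = toy_rag_chain("what affects permafrost")
print(result)
```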

In [None]:
# call openai rag chain
openai_rag_chain.invoke("What were the goals of the symposium?")


In [None]:
# call openai rag chain
openai_rag_chain.invoke("What is the capital of the world?")

In [None]:
# call ollama llama3.2:latest

from langchain_community.llms import Ollama

ollama_api_key = os.getenv('OLLAMA_API_KEY')
ollama_jwt_token = os.getenv('OLLAMA_JWT_TOKEN')  # read from the environment but not used below
ollama_headers = {"Authorization": f"Bearer {ollama_api_key}"}

# create an Ollama model
ollamallm: Ollama = Ollama(
    base_url="https://ollama.software.ncsa.illinois.edu/ollama",
    model="llama3.2:latest",
    headers=ollama_headers
    )

In [None]:
# call llama3.2 model
ollamallm.invoke("What is the capital of the world?")

In [None]:
# ollama rag chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

ollama_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ollamallm
    | StrOutputParser()
)

In [None]:
# call ollama rag chain
ollama_rag_chain.invoke("Who is the president of USA?")

In [None]:
## adding sources to openai rag chain

from langchain_core.runnables import RunnableParallel

openai_rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

openai_rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=openai_rag_chain_from_docs)

In [None]:
# call openai rag chain with source
# this will return the answer and the sources (context)
openai_rag_chain_with_source.invoke("What were the goals of the symposium?")
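
The chain with sources returns a dictionary holding the retrieved documents (`context`), the original `question`, and the generated `answer`. The plain-Python sketch below mimics that shape and shows one way to list the distinct sources; the helper names, the stub functions, and the file paths are all hypothetical.

```python
from typing import Callable, Dict, List


def answer_with_sources(question: str,
                        retrieve: Callable[[str], List[dict]],
                        answer_from: Callable[[dict], str]) -> dict:
    """Keep the retrieved documents next to the answer instead of discarding them."""
    result = {"context": retrieve(question), "question": question}
    result["answer"] = answer_from(result)
    return result


def list_sources(result: dict) -> List[str]:
    """Distinct 'source' values of the retrieved documents, in retrieval order."""
    seen: List[str] = []
    for doc in result["context"]:
        source = doc["metadata"]["source"]
        if source not in seen:
            seen.append(source)
    return seen


# Stubs standing in for the retriever and the LLM part of the chain
def stub_retrieve(question: str) -> List[dict]:
    return [
        {"page_content": "chunk 1", "metadata": {"source": "data/report.pdf"}},
        {"page_content": "chunk 2", "metadata": {"source": "data/report.pdf"}},
        {"page_content": "chunk 3", "metadata": {"source": "data/bios.csv"}},
    ]


def stub_answer(inputs: Dict) -> str:
    return f"(stub answer using {len(inputs['context'])} chunks)"


demo_result = answer_with_sources("example question", stub_retrieve, stub_answer)
print(demo_result["answer"])
print(list_sources(demo_result))
```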

In [None]:
openai_rag_chain_with_source.invoke("Why is tundra restoration and rehabilitation important?")

In [None]:
openai_rag_chain_with_source.invoke("Who is Brenadette Adams?")

## RAG Steps

1. Prepare data
2. Create a vector store and insert data
3. Search the vector store and retrieve relevant documents
4. Call LLM with the user query and the retrieved documents
5. Return the LLM response to the user