# Retrieval Augmented Generation

Overview:
- Introduction to RAG with LlamaIndex
- Data ingestion
  - PDF
  - Web pages
  - Code
- Data splitting
  - Token splitting
  - Sentence splitting
  - Structured data splitting
  - Semantic chunking
- Vectorization
  - Embeddings
  - Vector storage
- Retrieval
  - Keyword search
  - Vector search
  - Hybrid search
- Advanced methods
  - Query rewriting
  - Multi-hop retrieval

# Introduction to RAG with LlamaIndex

LlamaIndex is a library for working with large language models.
One of its main strengths is its ability to ingest documents into a vector index and use them to answer questions.
This is known as Retrieval Augmented Generation (RAG).

To start, we will use a low-code, high-level abstraction to build a basic PDF question-answering system.
We will read in PDFs, split them into chunks, embed them, and store them in a vector database.
Then, we will use an abstraction known as a `QueryEngine` that implements RAG to answer questions about the documents.
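
Before reaching for the library, it can help to see the whole loop in miniature. The sketch below is a toy illustration of the steps just described (split, embed, store, retrieve, build a prompt), not how LlamaIndex implements them. The bag-of-words "embedding", the in-memory list used as a store, and every name in it (`split_into_chunks`, `embed`, `retrieve`, `build_prompt`, `toy_store`) are assumptions made for illustration; a real system would use a learned embedding model and send the final prompt to an LLM.

```python
import math
import re
from collections import Counter

def split_into_chunks(text, chunk_size=9):
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, store, top_k=1):
    """Return the `top_k` chunks whose vectors are most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent to the LLM."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

toy_document = (
    "Retrieval finds the chunks most similar to the question. "
    "Generation passes those chunks to a model as context."
)
# The "vector database" here is just a list of (chunk, vector) pairs
toy_store = [(chunk, embed(chunk)) for chunk in split_into_chunks(toy_document)]
toy_question = "What does retrieval find?"
toy_prompt = build_prompt(toy_question, retrieve(toy_question, toy_store))
print(toy_prompt)
```

The `QueryEngine` used below plays the role of `retrieve` plus `build_prompt` plus the LLM call in this toy version.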

In [39]:
# If we're in colab, use userdata to get the OPENAI_API_KEY
import os
from rich import print
from pathlib import Path

try:
    # The import only succeeds inside Colab, so announce Colab after it
    from google.colab import userdata
    print("Colab detected - using colab secrets.")
    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
except ImportError:
    print("Not in colab - using local environment variables.")
    from dotenv import load_dotenv
    load_dotenv("../.env")


In [40]:
import os
import requests

# Create data directory if it doesn't exist
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

# Download the PDF file
pdf_url = "https://arxiv.org/pdf/2407.21783"
pdf_path = os.path.join(data_dir, "2407.21783.pdf")

if not os.path.exists(pdf_path):
    response = requests.get(pdf_url, timeout=60)
    response.raise_for_status()  # Fail loudly instead of saving an error page as a PDF
    with open(pdf_path, "wb") as f:
        f.write(response.content)
    print(f"Downloaded PDF to {pdf_path}")
else:
    print(f"PDF already exists at {pdf_path}")


In [41]:
from llama_index.core.readers import SimpleDirectoryReader
documents = SimpleDirectoryReader(data_dir).load_data()

In [44]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from torch.backends.mps import is_available as is_mps_available
from torch.cuda import is_available as is_cuda_available

if is_mps_available():
    device = "mps"
elif is_cuda_available():
    device = "cuda"
else:
    device = "cpu"

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", device=device)
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [54]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini")

query_engine = index.as_query_engine(llm=llm)

In [11]:
response = query_engine.query("How many new Llama models are mentioned in the paper?")

In [None]:
print(response.response)

## Exercise: Create a Gradio interface for a question-answering system

Your goal in this exercise is to create a Gradio interface for a question-answering system.
Your application should:
- Use the query engine created above to answer questions about the uploaded PDF
- Display the question and answer in the UI

If you need a challenge:
- Use the `gr.File` component to allow the user to upload any PDF and ask questions about it.


In [18]:
import gradio as gr

In [59]:
from tempfile import TemporaryDirectory
import os
from pathlib import Path

def ingest_documents(file_obj):
    """
    Process an uploaded PDF file and create a query engine for it.

    Args:
        file_obj: Path to the uploaded PDF (this code assumes `gr.File` passes a file path),
            or None if nothing was uploaded

    Returns:
        tuple: (query engine, None to clear the upload component, name of the uploaded file)
    """
    if file_obj is None:
        # Return one value per Gradio output component
        return None, None, ""

    with TemporaryDirectory() as temp_dir:
        file_path = os.path.join(temp_dir, 'tmp.pdf')
        # Copy the upload into a directory that contains nothing else
        file_bytes = Path(file_obj).read_bytes()

        with open(file_path, "wb") as f:
            f.write(file_bytes)

        documents = SimpleDirectoryReader(temp_dir).load_data()
        llm = OpenAI(model="gpt-4o-mini")
        index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
        query_engine = index.as_query_engine(llm=llm)

    return query_engine, None, Path(file_obj).name

def predict(query_engine, question):
    if query_engine is None:
        return "Please upload a PDF and click Submit first."
    # Convert the response object to plain text for the Textbox
    return str(query_engine.query(question))


In [None]:
with gr.Blocks() as demo:
    query_engine = gr.State()
    gr.Markdown("## RAG Demo: Question answering with a PDF")
    with gr.Row():
        with gr.Column(scale=1):   
            file_upload = gr.File(label="Upload a PDF file", file_types=[".pdf"], file_count="single")
            submit_button = gr.Button("Submit")
            pdf_name = gr.Textbox(label="You are asking about...")
        with gr.Column(scale=3):
            # Named question_box rather than `input` to avoid shadowing the Python builtin
            question_box = gr.Textbox(label="Enter a question")
            output = gr.Textbox(label="Answer")

    submit_button.click(fn=ingest_documents, inputs=file_upload, outputs=[query_engine, file_upload, pdf_name])
    question_box.submit(fn=predict, inputs=[query_engine, question_box], outputs=output)

demo.launch()

# Data ingestion

Data comes in many different formats: a PDF, a web page, a code file, and so on.
Each format may need its own processing pipeline to extract the text, split it correctly, and vectorize it.

Luckily, LlamaIndex (like other libraries) provides many built-in and add-on tools to help you ingest almost any data type.
This time, instead of a PDF, let's load a web page.
We will use one of the classes provided by [`llama-index-readers-web`](https://llamahub.ai/l/readers/llama-index-readers-web?from=readers) to load data from a web page.
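
To give a feel for the kind of format-specific processing a web reader has to do, here is a minimal standard-library sketch that pulls the visible text out of an HTML string. It is only an illustration, not how the `llama-index-readers-web` classes work internally; `VisibleTextExtractor`, `html_to_plain_text`, and `sample_html` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_plain_text(html):
    """Return the visible text of an HTML string, one text fragment per line."""
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.parts)

sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>"
)
print(html_to_plain_text(sample_html))
```

Note that this sketch throws away the document structure (headings, lists, links). Converting to Markdown instead keeps that structure available for splitting.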

In this section, we will:
- Load a web page as Markdown
- Split it into chunks following the structured format of the Markdown
- Embed the chunks
- Store the chunks in a vector database
- Create a query engine from the vector database and use it to answer a question
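
To make the second step concrete, here is a minimal sketch of heading-based splitting in plain Python. It is a simplified stand-in for structure-aware splitting, not the `MarkdownNodeParser` implementation: `split_markdown_by_headings` and `sample_markdown` are hypothetical names, and the sketch ignores details such as `#` lines inside fenced code blocks, which it would wrongly treat as headings.

```python
import re

# A Markdown ATX heading: 1 to 6 '#' characters, whitespace, then the title
HEADING_PATTERN = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_headings(markdown_text):
    """Split Markdown into (heading, body) sections, one per heading.

    Text that appears before the first heading is returned with heading None.
    """
    sections = []
    current_heading = None
    current_lines = []

    def flush():
        # Emit the section collected so far, unless it is completely empty
        if current_heading is not None or any(text.strip() for text in current_lines):
            sections.append((current_heading, "\n".join(current_lines).strip()))

    for line in markdown_text.splitlines():
        match = HEADING_PATTERN.match(line)
        if match:
            flush()
            current_heading = match.group(2).strip()
            current_lines = []
        else:
            current_lines.append(line)
    flush()
    return sections

sample_markdown = "# Guide\nIntro text.\n\n## Setup\nInstall the package.\n"
for heading, body in split_markdown_by_headings(sample_markdown):
    print(f"{heading!r}: {body!r}")
```

Each resulting section stays on one topic, which tends to make for more coherent chunks than cutting the text at a fixed number of tokens.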


In [1]:
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.ingestion import IngestionPipeline
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb


In [81]:
web_docs = SimpleWebPageReader(html_to_text=True).load_data(['https://en.wikipedia.org/wiki/Wikipedia'])

In [132]:
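# Drop the collection left over from a previous run (this raises if the collection does not exist yet)
chromadb.EphemeralClient().delete_collection("wikipedia")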

In [137]:
collection = chromadb.EphemeralClient().create_collection("wikipedia", get_or_create=True)
vector_store = ChromaVectorStore(collection)

In [138]:
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser.from_defaults(),
        embed_model,
    ], 
    vector_store=vector_store
)

In [139]:
nodes = pipeline.run(documents=web_docs)

In [140]:
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)

In [180]:
llm = OpenAI(model="gpt-4o-mini")
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("How many languages are there exactly? Quote the exact text as well.")
print(response.response)

In [181]:
# Use thefuzz (the renamed fuzzywuzzy package) to find the closest match in the source text
from thefuzz import fuzz, process
# Get the line of the top-ranked source node that best matches the response
top_match, match_score = process.extractOne(response.response, response.source_nodes[0].text.splitlines(), scorer=fuzz.ratio)
assert top_match in response.source_nodes[0].text
print(f"Quote from source: '{top_match}'")
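
The same grounding check can be sketched without a third-party dependency using `difflib` from the standard library. This is an illustrative alternative rather than a drop-in replacement: the scores are on a 0 to 1 scale instead of 0 to 100 and may not match `fuzz.ratio` exactly, and `best_matching_line`, `toy_source`, and `toy_answer` are hypothetical names with made-up text.

```python
from difflib import SequenceMatcher

def best_matching_line(answer, source_text):
    """Return the (line, score) pair from source_text most similar to answer."""
    candidate_lines = [line for line in source_text.splitlines() if line.strip()]
    if not candidate_lines:
        return None, 0.0
    scored = [
        (line, SequenceMatcher(None, answer.lower(), line.lower()).ratio())
        for line in candidate_lines
    ]
    # Pick the candidate with the highest similarity score
    return max(scored, key=lambda pair: pair[1])

# Made-up source text and answer, for illustration only
toy_source = "Cats are mammals.\nThe library was founded in 1990.\nIt has three reading rooms."
toy_answer = "The library was founded in 1990."
toy_line, toy_score = best_matching_line(toy_answer, toy_source)
print(f"Quote from source: '{toy_line}' (similarity {toy_score:.2f})")
```

A low best score is a hint that the model's "quote" does not actually appear in the retrieved text and may have been paraphrased or invented.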