# Building a Retrieval-Augmented Generation (RAG) System Workshop

## Learning Objectives
### Remember
- Define the key components of a RAG system
- Identify the essential libraries required for building a RAG pipeline
- List the main steps in the RAG process

### Understand
- Explain how document embedding works
- Describe the role of vector databases in RAG
- Interpret the relationship between chunks, embeddings, and retrieval

### Apply
- Implement a document loading and chunking pipeline
- Configure a language model for generation
- Set up a vector store for document retrieval

### Analyze
- Compare different chunking strategies
- Examine the impact of various parameter settings
- Debug common issues in RAG systems

### Evaluate
- Assess the quality of generated responses
- Test system performance with different configurations
- Judge the effectiveness of retrieval strategies

### Create
- Design a complete RAG pipeline
- Develop custom prompt templates
- Build an interactive question-answering system

## Prerequisites
- Basic understanding of Python
- Familiarity with machine learning concepts
- Understanding of basic NLP terminology

Let's begin our journey into building a RAG system!

## What is RAG (Retrieval-augmented generation)

**Retrieval-augmented generation (RAG)** is a technique for augmenting LLM knowledge with additional, often private or real-time, data. LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the model's knowledge with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as retrieval-augmented generation (RAG).

## Set up llama.cpp Python for Intel CPUs and GPUs
The llama.cpp SYCL backend is designed to support Intel GPUs and builds on the cross-platform nature of SYCL.

We will set up a Python environment and a corresponding custom kernel for Jupyter Notebook, then install and build llama.cpp, which the RAG application will use.

### Step 1: Create and activate Python environment:

Open a terminal, make sure Miniforge is installed, and create a new virtual environment:

```
    conda create -n llm-sycl python=3.11

    conda activate llm-sycl

```
_Note: In case you want to remove the virtual environment, run the following command:_
```
    conda remove -n llm-sycl --all
```

### Step 2: Set up a custom kernel for Jupyter Notebook:

Run the following commands in the terminal to set up a custom kernel for Jupyter Notebook.

```
    conda install -c conda-forge ipykernel

    python -m ipykernel install --user --name=llm-sycl
```
_Note: In case you want to remove the custom kernel from Jupyter, run the following command:_
```
    python -m jupyter kernelspec uninstall llm-sycl
```

<img src="Assets/llm4.png">

### Step 3: Install and Build llama.cpp

### For Linux

#### 1. Enable oneAPI environment

Make sure the oneAPI Base Toolkit is installed; its SYCL compiler is needed to build llama.cpp.

Run the following commands in a terminal to initialize the oneAPI environment and check the available devices:

```
    source /opt/intel/oneapi/setvars.sh
    sycl-ls
```

#### 2. Install and build llama.cpp Python

Run the following command in a terminal to install and build llama.cpp:

```
    CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python==0.3.1
```

### For Windows

#### 1. Enable oneAPI environment

Make sure the oneAPI Base Toolkit is installed; its SYCL compiler is needed to build llama.cpp.

Type `oneapi` in Windows search, then open the "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022" app.

Run the following commands to initialize the oneAPI environment and check the available devices:

```
    @call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
    sycl-ls
```

#### 2. Install build tools

* Download and install [CMake for Windows](https://cmake.org/download/).
* Recent versions of Visual Studio install Ninja by default. If yours does not, install it manually from https://ninja-build.org/.

#### 3. Install and build llama.cpp Python

* On the oneAPI command line window, step into the llama.cpp main directory and run the following:
  
```
    set CMAKE_GENERATOR=Ninja
    set CMAKE_C_COMPILER=cl
    set CMAKE_CXX_COMPILER=icx
    set CXX=icx
    set CC=cl
    set CMAKE_ARGS="-DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DCMAKE_CXX_COMPILER=icx -DCMAKE_C_COMPILER=cl"
    
    pip install llama-cpp-python==0.3.1 -U --force --no-cache-dir --verbose
```


## RAG Application Details

A typical RAG application has two main components:

- **Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

- **Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

**Indexing**

1. `Load`: First we need to load our data. We’ll use DocumentLoaders for this.
2. `Split`: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.
3. `Store`: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

![Indexing pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/dfed2ba3-0c3a-4e0e-a2a7-01638730486a)

**Retrieval and generation**

1. `Retrieve`: Given a user input, relevant splits are retrieved from storage using a Retriever.
2. `Generate`: An LLM produces an answer using a prompt that includes the question and the retrieved data.

![Retrieval and generation pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/f0545ddc-c0cd-4569-8c86-9879fdab105a)
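
To make the two stages concrete, here is a minimal, dependency-free sketch of the same flow. It is only an illustration under simplifying assumptions: word overlap stands in for a real embedding model, a plain Python list stands in for the vector store, and the names `build_index`, `retrieve`, and `build_prompt` are hypothetical helpers, not LangChain APIs.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(chunks):
    """Indexing: keep each chunk next to its word set (a stand-in for an embedding)."""
    return [(chunk, tokenize(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    """Retrieval: rank chunks by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_words), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Generation input: place the retrieved chunks and the question in one prompt."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

# Toy "documents" that have already been split into chunks.
chunks = [
    "SYCL is a cross-platform programming model.",
    "Chroma is a vector store used for similarity search.",
    "Text splitters break large documents into smaller chunks.",
]
index = build_index(chunks)                      # happens once, offline
question = "What is a vector store?"
top_chunks = retrieve(index, question, k=1)      # happens per query
prompt = build_prompt(question, top_chunks)      # this string would go to the LLM
print(prompt)
```

The rest of the workshop replaces each stand-in with a real component: an embedding model, a Chroma vector store, and a Llama model that receives the prompt.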


## Install Python Modules
In Jupyter Lab, select `llm-sycl` as the kernel.

You can now install the required modules and run Python code in the Jupyter notebook.

In [None]:
import sys
!{sys.executable} -m pip install -r rag/requirements.txt
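
After installation, a quick check can confirm that the packages are visible to the selected kernel. The sketch below uses only the standard library; the module list is an assumption derived from the imports used later in this notebook, and `find_missing_modules` is a hypothetical helper.

```python
import importlib.util

# Top-level module names taken from the imports used later in this notebook.
REQUIRED_MODULES = [
    "langchain",
    "langchain_community",
    "langchain_huggingface",
    "langchain_chroma",
    "llama_cpp",
    "huggingface_hub",
]

def find_missing_modules(module_names):
    """Return the top-level modules that cannot be found in the active environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, check that the `llm-sycl` kernel is selected before re-running the install cell.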

## 1. Setting Up the Environment

First, let's import all necessary libraries. We'll go through each import and understand its role in our RAG system.

Key Components:
- LangChain: Framework for developing applications powered by language models
- HuggingFace: Platform for accessing pre-trained models and embeddings
- Chroma: Vector store for efficient similarity search
- LlamaCpp: Efficient C++ implementation of Llama models

In [None]:
import os
import time
import warnings
import hashlib
warnings.filterwarnings("ignore")

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.llms import LlamaCpp
from langchain import chains, text_splitter
from langchain_core.prompts import PromptTemplate
from huggingface_hub import hf_hub_download
from langchain_core.callbacks import BaseCallbackHandler
import sys

# Setting constants and environment variables
VECTOR_DB_DIR = "vector_dbs"

## 2. Understanding Streaming Callbacks

In RAG systems, especially when working with large language models, it's important to provide real-time feedback to users. The streaming callback handler allows us to see the model's output token by token, creating a more interactive experience.

Key Concepts:
- Callbacks in LangChain
- Token-by-token streaming
- Real-time output handling
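
The callback pattern can be tried without loading a model. The following sketch is a simplified stand-in, not LangChain code: `fake_llm_stream` is a hypothetical generator that imitates a model emitting tokens, and `CollectingHandler` is a plain class that only mirrors the `on_llm_new_token` method name.

```python
class CollectingHandler:
    """Hypothetical handler: records tokens and echoes them as they arrive."""

    def __init__(self):
        self.tokens = []

    def on_llm_new_token(self, token, **kwargs):
        self.tokens.append(token)
        print(token, end="", flush=True)

def fake_llm_stream(text):
    """Stand-in for a model: yield the text one space-delimited piece at a time."""
    for piece in text.split(" "):
        yield piece + " "

def generate(text, handler):
    """Call the handler once per token, the way a streaming LLM would."""
    for token in fake_llm_stream(text):
        handler.on_llm_new_token(token)
    return "".join(handler.tokens)

handler = CollectingHandler()
result = generate("RAG combines retrieval with generation", handler)
```

The key point is that the handler is called once per token while generation is still running, so output appears incrementally instead of all at once.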

In [None]:
class StreamingStdOutCallbackHandler(BaseCallbackHandler):
    """Callback handler for streaming LLM outputs to standard output.
    
    This handler intercepts tokens as they're generated by the language model
    and prints them to stdout in real-time, creating a streaming effect.
    
    Attributes:
        None
        
    Methods:
        on_llm_new_token: Handles each new token generated by the LLM
    """
    
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(token, end="", flush=True)

def get_source_hash(source, source_type):
    """Generate a unique hash for the source"""
    source_string = f"{source_type}:{source}"
    return hashlib.md5(source_string.encode()).hexdigest()

def get_source_input():
    """Get source type and location from user input."""
    while True:
        source_type = input("Enter source type (url/local) or 'quit' to exit: ").lower()
        if source_type == 'quit':
            return None, None
        if source_type in ['url', 'local']:
            break
        print("Invalid source type. Please enter 'url' or 'local'")

    if source_type == 'url':
        source = input("Enter the URL: ")
    else:
        print("\nAvailable files in data folder:")
        data_dir = os.path.join(os.getcwd(), "data")
        if os.path.exists(data_dir):
            files = os.listdir(data_dir)
            for file in files:
                print(f"- {file}")
        source = input("\nEnter filename or directory name from data folder: ")

    return source, source_type

## 3. Document Loading and Processing

A crucial part of any RAG system is how it handles document loading and processing. We'll create functions to:
1. Load documents from URLs or local files
2. Split documents into manageable chunks
3. Initialize our embedding model

Understanding these processes is crucial as they directly impact the quality of our retrieval system.

* Document loaders in RAG read the source material that will later be retrieved during question answering.
* A loader reads raw content from a source (a web page, a text file, a PDF) and returns it as a list of `Document` objects, each holding the text and its metadata. Splitting and embedding happen in later steps.
* Loaders feed the retriever indirectly: the loaded documents are split, embedded, and stored, and the retriever then searches that store for the documents most relevant to a query.
* `WebBaseLoader` is a document loader designed to load documents from the web.
* Use `WebBaseLoader` when the documents for retrieval are not stored locally or in a Hugging Face dataset, but are located on the web. This is useful when you want your question-answering system to use the most up-to-date information available online.

### Text splitter

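* `RecursiveCharacterTextSplitter` splits text by trying a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) until each piece fits within the chunk size.
* Its `split_documents` method splits larger documents into smaller chunks for easier processing.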
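
To see what `chunk_size` and overlap mean in practice, here is a deliberately simplified sketch. It uses fixed-size character windows rather than the separator-based recursive strategy of `RecursiveCharacterTextSplitter`, and `chunk_text` is a hypothetical helper written for this illustration.

```python
def chunk_text(text, chunk_size=20, overlap=5):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "Retrieval-augmented generation adds external context to a prompt."
chunks = chunk_text(sample, chunk_size=20, overlap=5)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

Each chunk repeats the last few characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk. Larger overlaps reduce the chance of losing context at boundaries but increase the number of chunks to embed and store.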

In [None]:
def load_document(source, source_type="local"):
    """Load a document from a URL or from the local data folder.
    
    URLs are loaded with LangChain's WebBaseLoader. Local paths are resolved
    relative to the "data" directory and loaded with DirectoryLoader,
    PyPDFLoader, or TextLoader, depending on the path.
    
    Args:
        source (str): URL, or file/directory name inside the data folder
        source_type (str): Either "url" or "local"
        
    Returns:
        list: List of Document objects containing the loaded content
    """
    if source_type == "url":
        print("Loading document from URL...")
        loader = WebBaseLoader(source)
        return loader.load()
    else:
        current_dir = os.getcwd()
        data_dir = os.path.join(current_dir, "data")
        full_path = os.path.join(data_dir, source)
        
        if not os.path.exists(full_path):
            raise FileNotFoundError(f"File or directory not found: {full_path}")
        
        print(f"Loading document from: {full_path}")
        
        if os.path.isdir(full_path):
            loader = DirectoryLoader(
                full_path,
                glob="**/*.*",
                loader_cls=TextLoader
            )
        elif source.endswith('.pdf'):
            loader = PyPDFLoader(full_path)
        else:
            loader = TextLoader(full_path)
        
        return loader.load()

def split_document(text, chunk_size=1000, overlap=100):
    """Split documents into smaller chunks for processing.
    
    Uses recursive character text splitting to create overlapping chunks
    of text that are small enough to process effectively.
    
    Args:
        text (list): List of Document objects to split
        chunk_size (int): Maximum size of each chunk
        overlap (int): Number of characters to overlap between chunks
        
    Returns:
        list: List of split Document objects
    """
    print("Splitting document into chunks...")
    text_splitter_instance = text_splitter.RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    return text_splitter_instance.split_documents(text)

def initialize_embedding_fn(model_name="sentence-transformers/paraphrase-MiniLM-L3-v2"):
    """Initialize the embedding model for converting text to vector representations.
    
    Uses HuggingFace's sentence transformers to create embeddings that
    capture semantic meaning of text chunks.
    
    Args:
        model_name (str): Name of the HuggingFace model to use
        
    Returns:
        HuggingFaceEmbeddings: Initialized embedding model
    """
    
    print(f"Initializing embedding model with {model_name}...")
    return HuggingFaceEmbeddings(model_name=model_name)

## 4. Vector Store Management

The vector store is a crucial component in RAG systems. It enables efficient similarity search over document embeddings, allowing us to retrieve relevant context for any given query.

Key Concepts:
- Vector databases
- Persistence and caching
- Similarity search
- Document retrieval

In Retrieval Augmented Generation (RAG), embeddings play a crucial role in retrieving relevant documents for a given query:

* In RAG, each document in the knowledge base is represented as a dense vector, also known as an embedding. These embeddings are typically generated by a transformer model.
* When a query is received, it is also converted into an embedding using the same transformer model. This ensures that the query and the documents are in the same vector space, making it possible to compare them.
* Retrieval: The retrieval step in RAG involves finding the documents whose embeddings are most similar to the query embedding. This is typically done using a nearest neighbor search.
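
The nearest neighbor idea can be sketched with hand-written vectors. This example assumes cosine similarity as the comparison metric, which is one common choice; the metric a vector store actually uses depends on its configuration. The three-dimensional vectors and the helper names are made up for illustration, whereas real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec, store, k=2):
    """Brute-force nearest neighbor search: score every stored vector, keep the top k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy "vector store": (chunk text, embedding) pairs with invented values.
store = [
    ("chunk about GPUs", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about compilers", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend this is the embedded user query
results = nearest_chunks(query_vec, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A vector database performs the same ranking, but with index structures that avoid scoring every stored vector for each query.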


In [None]:
def get_or_create_embeddings(source, source_type, embedding_fn):
    """Create or load a persistent vector store for document embeddings.
    
    This function manages the vector database, either loading an existing one
    or creating a new one if it doesn't exist. It handles document processing,
    embedding generation, and storage.
    
    Args:
        source (str): URL, or file/directory name inside the data folder
        source_type (str): Either "url" or "local"
        embedding_fn: Embedding model used to generate embeddings
        
    Returns:
        Chroma: Initialized vector store with document embeddings
    """
    source_hash = get_source_hash(source, source_type)
    persist_dir = os.path.join(VECTOR_DB_DIR, source_hash)

    # Check if embeddings already exist
    if os.path.exists(persist_dir):
        print("Loading existing embeddings...")
        return Chroma(persist_directory=persist_dir, embedding_function=embedding_fn)

    # Create new embeddings if they don't exist
    start_time = time.time()
    print("Creating new embeddings...")
    document = load_document(source, source_type)
    documents = split_document(document)
    vector_store = Chroma.from_documents(
        documents=documents,
        embedding=embedding_fn,
        persist_directory=persist_dir
    )
    
    print(f"Embedding time: {time.time() - start_time:.2f} seconds")
    return vector_store
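
The caching behavior depends entirely on how the persist directory is derived. The sketch below repeats the hashing logic from `get_source_hash` in a standalone form so it can be tried on its own; `persist_dir_for` is a hypothetical helper and `notes.txt` is a made-up filename.

```python
import hashlib
import os

VECTOR_DB_DIR = "vector_dbs"

def persist_dir_for(source, source_type):
    """Map a (source_type, source) pair to a stable directory name."""
    source_hash = hashlib.md5(f"{source_type}:{source}".encode()).hexdigest()
    return os.path.join(VECTOR_DB_DIR, source_hash)

dir_a = persist_dir_for("notes.txt", "local")
dir_b = persist_dir_for("notes.txt", "local")
dir_c = persist_dir_for("notes.txt", "url")

print(dir_a == dir_b)  # same source and type: the stored embeddings are reused
print(dir_a == dir_c)  # different source type: a separate vector store is created
```

Note that the key is built only from the source name and type. If the file contents, the chunk size, or the embedding model change, the directory name stays the same and the old embeddings are loaded, so delete the matching folder under `vector_dbs` to force a rebuild.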

## 5. Language Model Configuration

The language model is the heart of our RAG system. Proper configuration is essential for optimal performance and resource usage.

Key Aspects:
- Model parameters
- GPU optimization
- Generation settings
- Performance tuning

In [None]:
def create_llm(model_path):
    """Initialize and configure the Llama language model.
    
    Sets up the language model with optimized parameters for performance
    and resource usage. Configures GPU acceleration and generation settings.
    
    Args:
        model_path (str): Path to the model weights
        
    Returns:
        LlamaCpp: Configured language model instance
    """
    return LlamaCpp(
        model_path=model_path,
        n_gpu_layers=33,
        n_batch=256,
        n_ctx=4096,
        f16_kv=True,
        verbose=True,
        streaming=True,
        temperature=0.7,
        max_tokens=512,
        top_p=0.95,
        repeat_penalty=1.1
    )
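
When tuning these values, it helps to check that the retrieved context, the prompt, and the generated answer fit within `n_ctx` together. The sketch below is a rough estimate only: the ratio of 4 characters per token and the 200 characters of prompt overhead are assumptions, and `context_budget` is a hypothetical helper. The other numbers come from this workshop's defaults (`n_ctx=4096`, `max_tokens=512`, 4 retrieved chunks of up to 1000 characters).

```python
def estimate_tokens(num_chars, chars_per_token=4):
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return -(-num_chars // chars_per_token)  # ceiling division

def context_budget(n_ctx, max_tokens, k, chunk_size, prompt_overhead_chars=200):
    """Estimate how much of the context window one RAG request would use."""
    context_tokens = estimate_tokens(k * chunk_size)          # retrieved chunks
    overhead_tokens = estimate_tokens(prompt_overhead_chars)  # template + question
    used = context_tokens + overhead_tokens + max_tokens      # plus room for the answer
    return {"used": used, "remaining": n_ctx - used, "fits": used <= n_ctx}

budget = context_budget(n_ctx=4096, max_tokens=512, k=4, chunk_size=1000)
print(budget)
```

If you raise the chunk size or the number of retrieved chunks, re-run the estimate; once `fits` becomes false, the prompt risks exceeding the model's context window.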

## 6. Question-Answering Interface

The question-answering interface provides the user interaction layer of our RAG system. It handles:
1. User input processing
2. Response generation
3. Error handling
4. Performance monitoring

In [None]:
def ask_questions(qachain):
    """Handle user questions and generate responses using the RAG pipeline.
    
    Manages the interaction loop with users, processes questions, generates
    responses, and handles any errors that occur during generation.
    
    Args:
        qachain: The configured question-answering chain
        
    Returns:
        bool: Whether to continue the interaction loop
    """
    while True:
        try:
            question = input("\nEnter your question (or 'switch' to change document, 'quit' to exit): ")
            if question.lower() == 'quit':
                return False
            if question.lower() == 'switch':
                return True
                
            start_time = time.time()
            # Generation settings such as max_tokens and temperature are set
            # once in create_llm(); the chain itself only needs the query.
            response = qachain.invoke(
                {"query": question},
                config={"callbacks": [StreamingStdOutCallbackHandler()]}
            )
            print(f"\nResponse time: {time.time() - start_time:.2f} seconds")
            
        except KeyboardInterrupt:
            print("\nGeneration interrupted by user")
            return False
        except Exception as e:
            print(f"\nAn error occurred: {str(e)}")

## 7. Main Application Logic

The main function orchestrates all components of our RAG system. It:
1. Initializes components
2. Sets up the RAG pipeline
3. Manages the interaction loop
4. Handles errors and cleanup

### Retrievers

* Retrievers are responsible for fetching relevant documents from a document store or knowledge base given a query. The retrieved documents are then used by the generator to produce a response.
* `RetrievalQA` is a LangChain chain that uses a retriever to fetch relevant documents for a question and then passes them, together with the question, to the LLM, which generates the answer.
* RetrievalQA can be seen as a two-step process:
    * Retrieval: The retriever fetches relevant documents from the document store given a query.    
    * Generation: The generator uses the retrieved documents to generate a response.
* This two-step process allows RAG to leverage the strengths of both retrieval-based and generation-based approaches to question answering. The retriever allows RAG to efficiently search a large document store, while the generator allows RAG to generate detailed and coherent responses.


In [None]:
#from langchain_community.vectorstores.chroma import Chroma
from langchain_chroma import Chroma
def main():
    """Main function to run the RAG system.
    
    Orchestrates the entire RAG pipeline, including model initialization,
    vector store setup, and the question-answering loop.
    
    Args:
        None
        
    Returns:
        None
    """
    model_name_or_path = "TheBloke/Llama-2-7B-Chat-GGUF"
    model_basename = "llama-2-7b-chat.Q4_K_M.gguf"
    print("Downloading Model...\n   " + model_name_or_path + "/" + model_basename)
    MODEL_PATH = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
    print("Download Complete.")
    
    # Initialize components that don't need to be recreated
    embedding_fn = initialize_embedding_fn()
    chat_model = create_llm(MODEL_PATH)
    
    # Create base prompt template
    prompt_template = """
    Answer the question based on the context below. Keep your answer concise.
    If you don't know, just say "I don't know."

    Context: {context}

    Question: {question}
    Answer:
    """
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
    chain_type_kwargs = {"prompt": prompt}

    while True:
        try:
            # Get source input from user
            source, source_type = get_source_input()
            if source is None:  # User chose to quit
                break

            # Get or create embeddings for the current document
            vector_store = get_or_create_embeddings(source, source_type, embedding_fn)
            
            # Setup retriever and chain
            retriever = vector_store.as_retriever(search_kwargs={"k": 4})
            qachain = chains.RetrievalQA.from_chain_type(
                llm=chat_model,
                retriever=retriever,
                chain_type="stuff",
                chain_type_kwargs=chain_type_kwargs,
                return_source_documents=False
            )

            print("\nModel is ready! Ask your questions.")
            if not ask_questions(qachain):
                break
                
        except Exception as e:
            print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nProgram interrupted by user")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

## Ipywidgets Implementation

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import time
import os
import sys

# Set up widgets for the interface
source_type_dropdown = widgets.Dropdown(
    options=['url', 'local'],
    value='local',
    description='Source Type:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='300px')
)

source_input = widgets.Text(
    description='Source:',
    placeholder='Enter URL or filename from data folder',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='600px')
)

file_list = widgets.Select(
    options=[],
    description='Available Files:',
    disabled=False,
    layout=widgets.Layout(width='400px', height='150px', display='none')
)

question_input = widgets.Text(
    placeholder='Ask a question about the document...',
    layout=widgets.Layout(width='800px')
)

submit_button = widgets.Button(
    description='Ask',
    button_style='primary',
    layout=widgets.Layout(width='100px')
)

exit_button = widgets.Button(
    description='Exit',
    button_style='danger',
    layout=widgets.Layout(width='100px')
)

model_status = widgets.HTML(value="<b>Status:</b> Ready to load document")
response_area = widgets.Output(layout=widgets.Layout(border='1px solid #ddd', padding='10px', width='900px', height='300px'))

# Main widget container (for easier cleanup)
main_container = widgets.VBox()

# Setup components
embedding_fn = initialize_embedding_fn()
vector_store = None
qachain = None

model_name_or_path = "TheBloke/Llama-2-7B-Chat-GGUF"
model_basename = "llama-2-7b-chat.Q4_K_M.gguf"
MODEL_PATH = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
chat_model = create_llm(MODEL_PATH)

# Create base prompt template
prompt_template = """
Answer the question based on the context below. Keep your answer concise.
If you don't know, just say "I don't know."

Context: {context}

Question: {question}
Answer:
"""
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": prompt}

# Function to update file list
def update_file_list():
    data_dir = os.path.join(os.getcwd(), "data")
    if os.path.exists(data_dir):
        files = os.listdir(data_dir)
        file_list.options = files
        if files:
            file_list.value = files[0]
    else:
        file_list.options = ["No files found in data directory"]

# Handler for source type change
def source_type_changed(change):
    if change['new'] == 'local':
        update_file_list()
        file_list.layout.display = 'block'
    else:
        file_list.layout.display = 'none'

# Handler for file selection
def file_selected(change):
    if change['new'] and source_type_dropdown.value == 'local':
        source_input.value = change['new']

# Widget handlers
def load_document_button_click(b):
    global vector_store, qachain
    
    with response_area:
        clear_output()
        source = source_input.value
        source_type = source_type_dropdown.value
        
        if not source:
            print("Please enter a source URL or filename")
            return
        
        model_status.value = "<b>Status:</b> Loading document..."
        try:
            # Get or create embeddings for the current document
            vector_store = get_or_create_embeddings(source, source_type, embedding_fn)
            
            # Setup retriever and chain
            retriever = vector_store.as_retriever(search_kwargs={"k": 4})
            qachain = chains.RetrievalQA.from_chain_type(
                llm=chat_model,
                retriever=retriever,
                chain_type="stuff",
                chain_type_kwargs=chain_type_kwargs,
                return_source_documents=False
            )
            
            model_status.value = "<b>Status:</b> Model ready! Ask your questions."
            print("Document loaded successfully! You can now ask questions.")
        except Exception as e:
            model_status.value = "<b>Status:</b> Error loading document"
            print(f"Error: {str(e)}")

class WidgetStreamHandler(BaseCallbackHandler):
    def __init__(self, output_widget):
        self.output_widget = output_widget
        self.generated_text = ""
        
    def on_llm_new_token(self, token, **kwargs):
        self.generated_text += token
        with self.output_widget:
            clear_output(wait=True)
            print(self.generated_text)

def ask_question_button_click(b):
    question = question_input.value
    
    if not question:
        with response_area:
            clear_output()
            print("Please enter a question")
        return
    
    if vector_store is None or qachain is None:
        with response_area:
            clear_output()
            print("Please load a document first")
        return
    
    with response_area:
        clear_output()
        model_status.value = "<b>Status:</b> Generating response..."
        stream_handler = WidgetStreamHandler(response_area)
        
        try:
            start_time = time.time()
            # Generation settings are fixed in create_llm(); the chain only needs the query.
            qachain.invoke(
                {"query": question},
                config={"callbacks": [stream_handler]}
            )
            elapsed = time.time() - start_time
            model_status.value = f"<b>Status:</b> Response generated in {elapsed:.2f} seconds"
            
        except Exception as e:
            model_status.value = "<b>Status:</b> Error generating response"
            print(f"Error: {str(e)}")

def exit_application(b):
    # Clear all widget outputs
    with response_area:
        clear_output()
    
    # Clean up resources
    global vector_store, qachain
    vector_store = None
    qachain = None
    
    # Clear the main container and show exit message
    main_container.children = [widgets.HTML("<h3>RAG application has been closed. You can run this cell again to restart.</h3>")]

# Connect handlers
source_type_dropdown.observe(source_type_changed, names='value')
file_list.observe(file_selected, names='value')

load_document_button = widgets.Button(
    description='Load Document',
    button_style='info',
    layout=widgets.Layout(width='150px')
)
load_document_button.on_click(load_document_button_click)
submit_button.on_click(ask_question_button_click)
exit_button.on_click(exit_application)

# Initialize the file list if "local" is selected
if source_type_dropdown.value == 'local':
    update_file_list()
    file_list.layout.display = 'block'

# Create button row
button_row = widgets.HBox([question_input, submit_button, exit_button])

# Create header with title
header = widgets.HTML("<h3>RAG Interactive Interface</h3>")

# Build the main container
main_container.children = [
    header,
    widgets.HBox([source_type_dropdown, source_input, load_document_button]),
    file_list,
    button_row,
    model_status,
    response_area
]

# Display the interface
display(main_container)

## Workshop Summary

In this workshop, we've built a complete RAG system from scratch, covering:

### Technical Components
1. Document Processing
   - Web and local document loading
   - Text chunking strategies
   - Embedding generation

2. Vector Store Management
   - Persistent storage
   - Efficient retrieval
   - Similarity search

3. Language Model Integration
   - Model configuration
   - Parameter optimization
   - Response generation

### Key Learnings
1. System Architecture
   - Understanding RAG pipeline components
   - Component interaction
   - Error handling

2. Performance Optimization
   - Memory management
   - GPU utilization
   - Response time optimization

3. Best Practices
   - Code organization
   - Documentation
   - Error handling

### Next Steps
To further improve the system, consider:
1. Implementing different embedding models
2. Experimenting with chunk sizes and overlap
3. Adding evaluation metrics
4. Implementing caching mechanisms
5. Adding support for different document types
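
As a starting point for the evaluation item, here is a minimal sketch of one simple retrieval metric, hit rate at k: the fraction of test questions for which at least one relevant chunk appears among the top k retrieved chunks. The chunk ids and relevance labels below are invented for illustration; in practice you would label a small set of questions by hand.

```python
def hit_rate_at_k(retrieved_per_query, relevant_per_query, k=4):
    """Fraction of queries whose top-k retrieved ids include at least one relevant id."""
    if not retrieved_per_query:
        return 0.0
    hits = 0
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        if set(retrieved[:k]) & set(relevant):
            hits += 1
    return hits / len(retrieved_per_query)

# Hypothetical results for three test questions (ids are made up).
retrieved = [["c1", "c7", "c3"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
relevant = [["c3"], ["c1"], ["c5", "c0"]]

score = hit_rate_at_k(retrieved, relevant, k=2)
print(f"hit rate@2 = {score:.2f}")
```

Tracking this number while varying chunk size, overlap, or the embedding model gives a quick signal of whether a change helps retrieval before you judge the generated answers.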

This workshop provides a foundation for building and understanding RAG systems, which you can extend and customize for your specific use cases.