# Simple RAG (Retrieval-Augmented Generation) System

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## Key Components

1. PDF processing and text extraction
2. Text chunking for manageable processing
3. Vector store creation using Chroma and a local HuggingFace embedding model ([FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings are supported alternatives)
4. Retriever setup for querying the processed documents
5. Evaluation of the RAG system

## Method Details

### Document Preprocessing

1. The PDF is loaded using PyPDFLoader.
2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.

### Text Cleaning

A custom function `replace_t_with_space` is applied to clean the text chunks. As its name suggests, it replaces tab characters (`\t`) with spaces, which can appear in text extracted from PDFs.

### Vector Store Creation

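1. A local HuggingFace model (`all-MiniLM-L6-v2`) creates vector representations of the text chunks; OpenAI and other embedding providers can be substituted.
2. A Chroma vector store is created from these embeddings for efficient similarity search; FAISS is imported as an alternative.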

### Retriever Setup

1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.

### Encoding Function

The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.

## Key Features

1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.
2. Configurable Chunking: Allows adjustment of chunk size and overlap.
3. Efficient Retrieval: Uses a vector store (Chroma here, with FAISS as an alternative) for fast similarity search.
4. Evaluation: Includes a function to evaluate the RAG system's performance.

## Usage Example

The code includes a test query: "What are the types of climate change?". This demonstrates how to use the retriever to fetch relevant context from the processed document.

## Evaluation

The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Uses a vector store (Chroma or FAISS) for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses pretrained embedding models (a local HuggingFace model by default, with OpenAI and others as options) for text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

# Package Installation and Imports

The cell below installs all necessary packages required to run this notebook.


In [15]:
# Install required packages
!pip install pypdf==5.6.0
!pip install PyMuPDF==1.26.1
!pip install python-dotenv==1.1.0
!pip install rank_bm25==0.2.2
!pip install faiss-cpu==1.11.0
!pip install deepeval
!pip install langchain langchain-community langchain-openai langchain-cohere chromadb
!pip install sentence-transformers



In [16]:
# Clone the repository to access helper functions and evaluation modules
!git clone https://github.com/NirDiamant/RAG_TECHNIQUES.git
import sys
sys.path.append('RAG_TECHNIQUES')

# If you need to run with the latest data
# !cp -r RAG_TECHNIQUES/data .

fatal: destination path 'RAG_TECHNIQUES' already exists and is not an empty directory.


In [18]:
import os
import sys
from dotenv import load_dotenv
from google.colab import userdata

# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key environment variable (comment out if not using OpenAI).
# userdata.get can raise an error when the secret is missing or access is not
# granted, so fall back to prompting for the key.
try:
    openai_api_key = userdata.get('OPENAI_API_KEY')
except Exception:
    openai_api_key = None
os.environ["OPENAI_API_KEY"] = openai_api_key or input("Please enter your OpenAI API key: ")

from langchain.text_splitter import RecursiveCharacterTextSplitter
# PyPDFLoader is used by encode_pdf below
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS, Chroma

from helper_functions import (
    retrieve_context_per_question,
    replace_t_with_space,
    show_context,
)
from evaluation.evalute_rag import evaluate_rag

### Read Docs

In [19]:
# Download required data files
import os
os.makedirs('data', exist_ok=True)

# Download the PDF document used in this notebook
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf


--2025-11-29 22:49:55--  https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206372 (202K) [application/octet-stream]
Saving to: ‘data/Understanding_Climate_Change.pdf’


2025-11-29 22:49:55 (11.4 MB/s) - ‘data/Understanding_Climate_Change.pdf’ saved [206372/206372]


In [20]:
path = "data/Understanding_Climate_Change.pdf"

### Encode document

In [21]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A Chroma vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings using a local HuggingFace model
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # Create vector store
    vectorstore = Chroma.from_documents(cleaned_texts, embeddings)

    return vectorstore

### Other Embedding Options

Here are some other popular embedding options you can use with LangChain, along with example code snippets. Remember to install the necessary packages and set up API keys for cloud-based services.

#### 1. OpenAIEmbeddings

To use OpenAI's embedding models (e.g., `text-embedding-ada-002` or `text-embedding-3-small`), you'll need the `langchain-openai` package and your `OPENAI_API_KEY` set.

```python
# Install if not already installed
!pip install langchain-openai

from langchain_openai import OpenAIEmbeddings

# Ensure OPENAI_API_KEY is set in your environment variables
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Or for older models: embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
```

#### 2. CohereEmbeddings

Cohere offers strong embedding models. You'll need the `langchain-cohere` package and your `COHERE_API_KEY`.

```python
# Install if not already installed
!pip install langchain-cohere

from langchain_cohere import CohereEmbeddings

# Ensure COHERE_API_KEY is set in your environment variables
embeddings = CohereEmbeddings()
```

#### 3. GoogleGenerativeAIEmbeddings (for Gemini)

If you're using Google's Gemini models, you can use `GoogleGenerativeAIEmbeddings`. You'll need `langchain-google-genai` and `GOOGLE_API_KEY`.

```python
# Install if not already installed
!pip install langchain-google-genai

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Ensure GOOGLE_API_KEY is set in your environment variables
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
```

#### 4. BedrockEmbeddings (AWS)

For AWS users leveraging Amazon Bedrock, you can access models like `Amazon Titan Embeddings`. This requires `langchain-community` and AWS credentials configured.

```python
# Install if not already installed
!pip install langchain-community boto3

import boto3
from langchain_community.embeddings import BedrockEmbeddings

# Configure AWS client (e.g., via environment variables or boto3 config)
embeddings = BedrockEmbeddings(client=boto3.client("bedrock-runtime", region_name="us-east-1"))
```

To switch to any of these, you would simply replace the `HuggingFaceEmbeddings` instantiation in your `encode_pdf` function with the chosen embedding provider's instantiation, ensuring all prerequisites (installations, API keys) are met.

In [22]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Create retriever

In [23]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

### Test retriever

In [25]:
test_query = "What are the types of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

Context 1:
Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human 
activities, particularly the burning of fossil fuels and deforestation, have significantly 
contributed to climate change. 
Historical Context 
The Earth's climate has changed throughout history. Over the past 650,000 years, there have 
been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about 
11,700 years ago marking the beginning of the modern climate era and human civilization. 
Most of these climate changes are attributed to very small variations in Earth's orbit that 
change the amount of solar energy our planet receives. During the Holocene epoch, which


Context 2:
Most of these climate change

### Evaluate results

In [None]:
#Note - this currently works with OPENAI only
evaluate_rag(chunks_query_retriever)

{'questions': ['1. **Multiple Choice: Causes of Climate Change**',
  '   - What is the primary cause of the current climate change trend?',
  '     A) Solar radiation variations',
  '     B) Natural cycles of the Earth',
  '     C) Human activities, such as burning fossil fuels',
  '     D) Volcanic eruptions',
  '',
  '2. **True or False: Climate Change Impacts**',
  '   - True or False: Climate change only affects the temperature of the planet, not weather patterns, sea levels, or ecosystems.',
  '',
  '3. **Short Answer: Mitigation Strategies**',
  '   - Describe two effective strategies that could be implemented to mitigate the effects of climate change.',
  '',
  '4. **Matching: Climate Change Terminology**',
  '   - Match the following terms with their correct definitions:',
  '     A) Greenhouse Gases',
  '     B) Carbon Footprint',
  '     C) Renewable Energy',
  '     D) Adaptation',
  '     - Definitions:',
  '       1. The total amount of greenhouse gases produced to directl

# RAG Learning Reference
The sections below extend the notebook into a learning reference. They explain RAG core concepts, document preprocessing (PDF loading, text chunking, and cleaning), embeddings and vector stores (including different types and trade-offs), retriever setup (focusing on the `k` parameter), RAG evaluation principles and metrics, and advanced RAG topics (such as advanced chunking, reranking, hybrid search, and production considerations).

## Summarize RAG Core Concepts

### Definition of RAG
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. Document content is encoded into a searchable vector store; at query time, the most relevant chunks are retrieved and passed to a language model as context, so that the generated answer is grounded in the source documents. This notebook focuses on the retrieval half of that pipeline.

### Main Components of the Implemented RAG System
Based on the notebook's description, the key components of this RAG system are:
1.  **PDF processing and text extraction**: Loading documents from PDF files using `PyPDFLoader`.
2.  **Text chunking**: Splitting extracted text into manageable chunks using `RecursiveCharacterTextSplitter`.
3.  **Text cleaning**: Applying custom functions like `replace_t_with_space` for specific formatting issues.
4.  **Vector store creation**: Generating vector representations of text chunks using `HuggingFaceEmbeddings` (or other embedding models) and storing them in a `Chroma` vector store (or FAISS).
5.  **Retriever setup**: Configuring a retriever (e.g., `as_retriever`) to fetch the most relevant chunks for a given query.

### Purpose and Problems Solved by RAG
The primary purpose of a RAG system is to enable efficient and relevant information retrieval and question-answering from large documents or collections. It aims to solve problems such as:
*   **Scalability**: By processing large documents in chunks.
*   **Flexibility**: Allowing adjustment of parameters like chunk size and number of retrieved results.
*   **Efficiency**: Utilizing vector stores (like FAISS or Chroma) for fast similarity searches in high-dimensional spaces.
*   **Integration with Advanced NLP**: Leveraging state-of-the-art text representation through embeddings (e.g., OpenAI, HuggingFace, Cohere, Google, Bedrock).
*   **Providing context**: Ensuring that generated answers are grounded in specific, retrieved information rather than general knowledge, leading to more accurate and factual responses.

## Document Preprocessing in RAG

Document preprocessing is a crucial initial step in building an effective Retrieval-Augmented Generation (RAG) system. It involves transforming raw documents into a format suitable for indexing and retrieval.

### 1. PDF Loading

The first step is to load the PDF documents into the system. In this notebook, `PyPDFLoader` from `langchain_community.document_loaders` is used for this purpose. This loader takes the path to a PDF file and extracts its content, returning a list of `Document` objects, one per page by default, each carrying the page text along with metadata such as the source path and page number.

### 2. Text Chunking

Once the PDF content is loaded, it's often too large to be processed effectively by language models or for efficient similarity search. This is where **text chunking** comes in. The `RecursiveCharacterTextSplitter` is employed to break down the large document into smaller, manageable pieces called "chunks."

*   **`chunk_size`**: This parameter sets the maximum length of each chunk. Length is measured by the splitter's `length_function`; the notebook passes `len`, so `chunk_size=1000` means at most 1,000 characters, not tokens. A well-chosen `chunk_size` ensures that each chunk contains enough context to be meaningful, but not so much that it becomes unwieldy.
*   **`chunk_overlap`**: This parameter specifies the number of characters that consecutive chunks share. Overlap is vital for maintaining context across chunk boundaries. Without overlap, information at the end of one chunk and the beginning of the next might be disjointed, leading to missed relationships or incomplete answers during retrieval. A reasonable `chunk_overlap` helps ensure that queries that span chunk boundaries can still retrieve relevant information.
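
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` additionally tries to split on natural separators such as paragraph and line breaks, which this sketch does not attempt. The function name `sliding_window_chunks` is hypothetical and not part of the notebook.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.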

### 3. Text Cleaning

After chunking, the extracted text often contains artifacts or formatting inconsistencies from the original PDF (e.g., tab characters, unusual characters, or excessive whitespace). The `replace_t_with_space` function is applied to the text chunks and, as its name suggests, replaces tab characters with spaces. Text cleaning is important because:

*   **Improves Embedding Quality**: Clean text leads to more accurate and meaningful embeddings, as the embedding model doesn't have to contend with noise.
*   **Enhances Search Accuracy**: Cleaner text makes it easier for the retrieval system to match queries with relevant document sections, as irrelevant characters or formatting issues are removed.
*   **Better Generation Quality**: The cleaner the input context provided to the language model, the higher the quality of the generated response will be.
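
As an illustration of this kind of cleaning step, the sketch below replaces tab characters with spaces in a list of chunk objects. It assumes the helper behaves as its name suggests; the `Chunk` class and `replace_tabs_with_spaces` function are hypothetical stand-ins, not the notebook's `Document` type or its `replace_t_with_space` helper.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (text only)."""
    page_content: str


def replace_tabs_with_spaces(chunks):
    """Replace every tab character with a single space, modifying chunks in place."""
    for chunk in chunks:
        chunk.page_content = chunk.page_content.replace("\t", " ")
    return chunks


raw = [Chunk("Climate\tchange refers to"), Chunk("long-term\tshifts")]
cleaned = replace_tabs_with_spaces(raw)
print([c.page_content for c in cleaned])
```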

## Describe Embeddings and Vector Stores

### Embeddings
**What they are**: Embeddings are numerical representations of text (words, phrases, sentences, or even entire documents) in a high-dimensional vector space. These vectors are designed in such a way that semantically similar pieces of text are located closer to each other in this space, while dissimilar texts are farther apart.

**How they work**: Machine learning models (often deep neural networks) are trained on vast amounts of text data to learn these representations. When you input a text, the model processes it and outputs a fixed-size array of numbers (the embedding vector).

**Role in RAG**: In a Retrieval-Augmented Generation (RAG) system, embeddings are crucial for converting human-readable text into a format that can be efficiently searched and compared by algorithms. When a query is made, it is first embedded into a vector. This query vector is then used to find the most similar document chunks (also represented as embedding vectors) in the vector store.
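
The idea of "closeness" between embeddings can be sketched with cosine similarity, one common measure of how similar two vectors are. The three-dimensional vectors below are made up for illustration; a real model such as `all-MiniLM-L6-v2` outputs vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings", not the output of any real model.
query_vec = [0.9, 0.1, 0.0]
chunk_about_climate = [0.8, 0.2, 0.1]
chunk_about_cooking = [0.0, 0.1, 0.9]

sim_climate = cosine_similarity(query_vec, chunk_about_climate)
sim_cooking = cosine_similarity(query_vec, chunk_about_cooking)
print(sim_climate > sim_cooking)
```

The query vector points in nearly the same direction as the climate chunk and almost orthogonally to the cooking chunk, so the climate chunk would be ranked first.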

### Vector Stores
**What they are**: A vector store (or vector database) is a specialized database designed to efficiently store, manage, and query embedding vectors. Unlike traditional databases that store structured data or documents, vector stores are optimized for similarity searches based on vector distance.

**How they store embeddings**: Vector stores index the embedding vectors of your document chunks. Each chunk of text from your PDF, after being converted into an embedding, is stored along with its original text and any associated metadata. This indexing allows for fast retrieval of vectors that are "closest" to a given query vector.

**Function in RAG**: Their primary function in RAG is to enable rapid and accurate retrieval of contextually relevant information. When a user asks a question, the question is embedded, and the vector store quickly identifies the top-k (e.g., top 2) most similar document embeddings. The original text chunks corresponding to these embeddings are then passed to the language model as context to generate an informed answer.
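
To show what "store vectors, then return the top-k closest" means mechanically, here is a toy in-memory store that scans every vector on each search. It is only a sketch: `TinyVectorStore` is a hypothetical class, the vectors are made up, and real vector stores typically add index structures so that large collections can be searched without comparing the query against every stored vector.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: keeps (vector, text) pairs and scans all of them on search."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors have the highest cosine similarity."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([0.9, 0.1], "Greenhouse gases trap heat.")
store.add([0.8, 0.3], "Burning fossil fuels emits CO2.")
store.add([0.1, 0.9], "Sourdough needs a starter.")

top_chunks = store.search([1.0, 0.0], k=2)
print(top_chunks)
```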

### Types of Embeddings
The notebook demonstrates and mentions several types of embedding models, primarily categorized by their origin and deployment:

1.  **`HuggingFaceEmbeddings` (e.g., `all-MiniLM-L6-v2`)**: These typically leverage models available on the Hugging Face Hub. They are often open-source, can be run locally (CPU or GPU), and offer a balance of performance and cost-effectiveness. `all-MiniLM-L6-v2` is a popular choice for its efficiency and good performance for many tasks. **Use Case**: Good for local development, privacy-sensitive applications, or when controlling computational resources is key.

2.  **`OpenAIEmbeddings` (e.g., `text-embedding-3-small`, `text-embedding-ada-002`)**: These are proprietary models provided by OpenAI. They are known for high quality and ease of use, accessible via API. Newer models like `text-embedding-3-small` offer improved performance and cost efficiency. **Use Case**: When high accuracy is paramount, and budget allows for API calls.

3.  **`CohereEmbeddings`**: Proprietary models from Cohere, similar to OpenAI in offering high-quality embeddings via API. Cohere often emphasizes enterprise-grade solutions. **Use Case**: Similar to OpenAI, when high-quality commercial embeddings are preferred.

4.  **`GoogleGenerativeAIEmbeddings` (for Gemini, e.g., `models/embedding-001`)**: Google's embedding models, often part of their broader generative AI offerings. They provide another option for high-quality, cloud-based embeddings. **Use Case**: For users already integrated into the Google Cloud ecosystem or preferring Google's models.

5.  **`BedrockEmbeddings` (AWS, e.g., Amazon Titan Embeddings)**: Provided by AWS Bedrock, these allow access to various foundation models, including embedding models like Amazon Titan. Ideal for users within the AWS ecosystem. **Use Case**: Enterprises utilizing AWS for their infrastructure.

### Types of Vector Stores

1.  **`FAISS` (Facebook AI Similarity Search)**: This is a library for efficient similarity search and clustering of dense vectors. FAISS is known for its speed and scalability, especially for in-memory vector indexing. It's often used for large-scale similarity search problems. **Advantages**: Extremely fast for search, highly optimized for CPU/GPU, good for local or single-node deployments. **Differences**: Primarily an in-memory library, meaning it's not inherently persistent or distributed without additional infrastructure.

2.  **`Chroma`**: A more recent, open-source vector database that aims to be simple to use and provides persistence out-of-the-box. It can be run locally or in a client-server mode, offering a good balance between ease of use and features. **Advantages**: Simplicity, persistence, includes a Python client, supports filtering metadata. **Differences**: Can handle persistence and is designed as a full-fledged database, unlike FAISS which is a library focused on indexing algorithms.

### Trade-offs

**When choosing Embedding Models:**

*   **Cost**: Proprietary API-based models (OpenAI, Cohere, Google, Bedrock) incur per-token or per-call costs, which can scale significantly with usage. Open-source models (HuggingFace) are free to use but require managing your own computational resources.
*   **Performance/Quality**: Commercial models are generally strong out of the box and are updated by their providers. Open-source models vary widely in quality, but the best ones are competitive, and they can be fine-tuned for a specific domain.
*   **Open-source vs. Proprietary**: Open-source offers transparency, customizability, and no vendor lock-in. Proprietary offers convenience, often higher baseline performance, and managed infrastructure.
*   **Latency**: Local open-source models can offer lower latency if computations are performed on powerful local hardware. API calls introduce network latency.
*   **Privacy**: Running open-source models locally means data never leaves your environment, which is crucial for sensitive applications.

**When choosing Vector Stores:**

*   **Scalability**: For very large datasets (billions of vectors), dedicated cloud-native vector databases (like Pinecone, Weaviate, Milvus) might be necessary, offering distributed architectures. FAISS is scalable in terms of search speed but typically runs on a single machine or cluster of machines with careful orchestration. Chroma is suitable for moderate-to-large datasets.
*   **Persistence**: Some vector stores (like Chroma) offer built-in persistence, meaning your index survives restarts. Others (like FAISS in its basic form) require you to manually save and load the index.
*   **Deployment Complexity**: FAISS requires more manual setup and management for persistence and scaling. Chroma is simpler to deploy and manage for basic use cases. Cloud-based vector databases handle all infrastructure management but come with higher operational costs.
*   **Features**: Vector stores differ in features like filtering (e.g., by metadata), hybrid search (combining vector and keyword search), and integrations with other tools. FAISS is highly focused on pure vector search. Chroma offers good metadata filtering capabilities.

## Elaborate on Retriever Setup

### Retriever Configuration Explained

In a Retrieval-Augmented Generation (RAG) system, a **retriever** is the component responsible for fetching relevant information from a knowledge base (here, the vector store) so that a large language model (LLM) can use it to generate an informed response to a query. Its primary role is to bridge the gap between a user's question and the large amount of data stored in the vector database.

The retriever is configured from a vector store using the `.as_retriever()` method, as seen in the notebook:

```python
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})
```

This line of code transforms our `chunks_vector_store` (which holds the embedded PDF chunks) into an object capable of performing similarity searches and returning relevant documents.

#### The `k` parameter

The `k` parameter within `search_kwargs` is critical. It defines the **number of top-scoring document chunks** that the retriever will return for a given query. When a query is made, the retriever finds the most similar chunks in the vector store, and `k` dictates how many of those top-ranked chunks are actually passed on.

Let's discuss the impact of different `k` values:

*   **Small `k` (e.g., `k=1` or `k=2`)**: A small `k` value means fewer document chunks are retrieved. While this can lead to very focused context, it carries a significant risk of **missing relevant information** if the most pertinent details are spread across several chunks or if the single top-ranked chunk isn't comprehensive enough. This can result in incomplete or less accurate answers from the LLM.

*   **Large `k` (e.g., `k=10` or `k=20`)**: Conversely, a large `k` value retrieves many document chunks. This increases the chances of capturing all relevant information, but it also introduces the potential for **irrelevant information (noise)**. Too many chunks can overwhelm the LLM, make it harder for the model to identify the truly important details, increase computational load, and potentially dilute the most relevant information within a sea of less useful data. This can lead to slower response times and sometimes less coherent answers.

*   **Optimal `k`**: The ideal `k` value is a **trade-off between recall (getting all relevant information) and precision (getting only relevant information)**. There isn't a universally 'best' `k`. It often requires experimentation and iteration based on the specific dataset, the complexity of the queries, and the desired quality of the LLM's responses. A good starting point is usually between 2 and 5, but tuning this parameter is essential for optimizing the RAG system's performance for a particular use case.
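
The recall/precision trade-off described above can be sketched with a hand-labeled example. The chunk ids, the ranking, and the set of relevant chunks below are all hypothetical; in practice the ranking would come from the retriever and the labels from human annotation or an LLM judge.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k retrieved chunk ids against a labeled relevant set."""
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranking returned by a retriever, best match first.
ranked_ids = ["c7", "c2", "c9", "c4", "c1", "c5"]
# Hypothetical ground truth: the chunks a human marked as relevant.
relevant_ids = {"c7", "c9", "c1"}

for k in (1, 2, 3, 5):
    precision, recall = precision_recall_at_k(ranked_ids, relevant_ids, k)
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy ranking, recall only reaches 1.0 at `k=5`, by which point two of the five retrieved chunks are noise. Sweeping `k` over a labeled query set in this way is one simple method for choosing a value empirically.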

## Discuss RAG Evaluation

### General Evaluation Principles for RAG Systems
Evaluating Retrieval-Augmented Generation (RAG) systems is crucial to ensure they provide accurate, relevant, and helpful responses. The core principle is to assess how well the system retrieves pertinent information and then uses that information to generate a coherent and correct answer. This involves looking at both the retrieval component and the generation component, as failures in either can degrade overall performance.

Key principles include:
1.  **Relevance**: Is the retrieved context directly related to the user's query?
2.  **Factual Consistency (Faithfulness)**: Is the generated answer supported by the retrieved context, and does it avoid hallucinating information?
3.  **Answer Correctness**: Is the generated answer factually accurate according to the real world, regardless of the retrieved context?
4.  **Completeness**: Does the generated answer address all aspects of the user's query?
5.  **Conciseness**: Is the generated answer to the point and free of unnecessary verbosity?

### Key Metrics for RAG Performance
Several metrics can be used to quantitatively and qualitatively evaluate RAG systems:

*   **Relevance**: Measures how well the retrieved documents and the generated answer align with the user's query. This can be assessed by human annotators or automated models.
    *   **Context Relevance**: How relevant are the retrieved documents to the query?
    *   **Answer Relevance**: How relevant is the generated answer to the query?

*   **Completeness**: Assesses whether the generated answer covers all the information implicitly or explicitly requested in the query, based on the provided context.

*   **Conciseness**: Evaluates if the answer is brief and direct, avoiding superfluous details while still being informative.

*   **Faithfulness (Groundedness)**: This metric checks if all claims made in the generated answer are directly supported by the retrieved source documents. It's crucial for preventing hallucinations.

*   **Answer Correctness / Accuracy**: Compares the generated answer to a ground truth answer to determine its factual accuracy. This is often the ultimate goal of a Q&A system.

*   **RAGAS Metrics**: A popular framework for RAG evaluation that provides metrics like:
    *   **Context Precision**: How precise is the retrieved context in relation to the query and ground truth?
    *   **Context Recall**: How much of the necessary context was retrieved?
    *   **Answer Semantic Similarity**: How similar is the generated answer to the ground truth answer semantically?
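
Rank-aware retrieval metrics reward placing relevant chunks near the top of the results. The sketch below uses average precision over hand-labeled relevance flags as a stand-in for that idea. It is not RAGAS's implementation, which relies on an LLM to judge relevance; the function name and the labels are hypothetical.

```python
def average_precision(relevance_flags):
    """Mean of precision@i over the ranks i that hold a relevant chunk.

    relevance_flags[i] is True when the chunk at rank i + 1 is relevant.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# The same two relevant chunks, ranked early versus late (hypothetical labels).
relevant_first = average_precision([True, True, False, False])
relevant_last = average_precision([False, False, True, True])
print(relevant_first, relevant_last)
```

Both result lists contain the same relevant chunks, but the one that ranks them first scores higher, which plain precision over the whole list would not capture.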

### Strategies to Generalize the `evaluate_rag` Function
The current `evaluate_rag` function appears to rely on OpenAI for its internal evaluation (as indicated by the comment `#Note - this currently works with OPENAI only`). To generalize it to work with different LLMs or evaluation criteria, several strategies can be employed:

1.  **Abstract LLM Calls**: Instead of hardcoding OpenAI API calls, abstract the LLM interaction into an interface or a configurable parameter. This would allow users to plug in different LLMs (e.g., Cohere, Google Gemini, local models like Llama 3) by simply changing a configuration or passing a different LLM object.
    *   **Example**: Modify `evaluate_rag` to accept an `eval_llm` parameter that could be `OpenAI()`, `Cohere()`, or `GoogleGenerativeAI()`, each configured with its respective API key.

2.  **Configurable Evaluation Prompts**: The prompts used to instruct the LLM for evaluation (e.g., judging relevance, completeness, conciseness) should be configurable. Different LLMs might respond better to different prompt structures, and users might want to adjust criteria.

3.  **Modular Metric Calculation**: Separate the logic for calculating each metric (relevance, completeness, conciseness, faithfulness) into distinct functions or classes. This allows for easier modification, addition of new metrics (like RAGAS metrics), or removal of existing ones.

4.  **Support for Local Embedding Models**: The `encode_pdf` function already uses `HuggingFaceEmbeddings`, but the evaluation logic should be equally independent of any one embedding provider. If a metric relies on embeddings (for example, semantic similarity between the generated and ground truth answers), the embedding model should be a parameter rather than hardcoded to OpenAI.

5.  **External Evaluation Framework Integration**: Integrate with dedicated RAG evaluation libraries like RAGAS or DeepEval. These frameworks provide a rich set of metrics and often abstract away the LLM interaction for evaluation, offering more robust and standardized assessment capabilities.

6.  **Parameterization of Thresholds/Criteria**: If any metrics involve thresholds (e.g., a score above X is considered relevant), these should be exposed as parameters to allow users to fine-tune the evaluation stringency.

By implementing these strategies, the `evaluate_rag` function can become a more flexible and powerful tool for assessing RAG systems, adaptable to a wider range of LLMs, embedding models, and specific evaluation requirements.
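Several of these strategies can be combined in one small sketch: the judge LLM is passed in as a plain callable, the prompts are a configurable dictionary, each metric is scored independently, and the pass threshold is a parameter. Everything here is an assumption for illustration: `evaluate_rag_generic`, `parse_score`, the prompt wording, and the 1 to 5 scale are hypothetical and do not reflect the notebook's actual `evaluate_rag` signature.

```python
from typing import Callable, Dict, Optional

# Hypothetical prompt templates; the wording is illustrative, not the notebook's.
DEFAULT_PROMPTS: Dict[str, str] = {
    "relevance": (
        "Rate from 1 to 5 how relevant the answer is to the question.\n"
        "Question: {question}\nAnswer: {answer}\nScore:"
    ),
    "faithfulness": (
        "Rate from 1 to 5 how well the answer is supported by the context.\n"
        "Context: {context}\nAnswer: {answer}\nScore:"
    ),
}


def parse_score(reply: str) -> int:
    """Naively return the first digit 1-5 in the judge's reply, or 0 if none."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    return 0


def evaluate_rag_generic(
    question: str,
    context: str,
    answer: str,
    judge: Callable[[str], str],
    prompts: Optional[Dict[str, str]] = None,
    pass_threshold: int = 4,
) -> Dict[str, Dict[str, object]]:
    """Score each metric with whatever LLM the `judge` callable wraps."""
    prompts = prompts or DEFAULT_PROMPTS
    results = {}
    for metric, template in prompts.items():
        # Unused placeholders are simply ignored by str.format.
        prompt = template.format(question=question, context=context, answer=answer)
        score = parse_score(judge(prompt))
        results[metric] = {"score": score, "passed": score >= pass_threshold}
    return results


# Stub judge so the sketch runs without any API key.
def stub_judge(prompt: str) -> str:
    return "Score: 4"


report = evaluate_rag_generic(
    question="What is the main cause of climate change?",
    context="Burning fossil fuels releases greenhouse gases.",
    answer="Mainly the burning of fossil fuels.",
    judge=stub_judge,
)
print(report)
```

To use a real model, replace `stub_judge` with a function that sends the prompt to the LLM of your choice and returns its text reply. Adding a metric is then a matter of adding one entry to the prompts dictionary.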

## Advanced RAG Topics for Further Learning

To build more robust, efficient, and sophisticated Retrieval-Augmented Generation (RAG) systems, it's beneficial to explore advanced techniques beyond the basic setup. These topics enhance retrieval quality, optimize performance, and address real-world deployment challenges.

### 1. Advanced Chunking Strategies

While simple fixed-size text splitting is a good starting point, advanced chunking strategies aim to create more semantically meaningful or contextually rich chunks. This improves the quality of retrieval by ensuring that relevant information is captured within a single chunk.

*   **Semantic Chunking**: Instead of splitting purely by character count, semantic chunking uses embeddings to identify natural breaks in meaning. Chunks are formed based on semantic similarity, ensuring that a single chunk discusses a coherent topic. This can involve clustering sentences or paragraphs based on their vector representations.
*   **Summary-Based Chunking**: This involves creating larger initial chunks, then generating a summary for each chunk. The summary itself can be embedded and used for retrieval, or included alongside the original chunk. This helps capture the essence of longer sections without overloading the embedding model with too much detail.
*   **Agentic Chunking**: For very complex documents, an 'agent' (e.g., an LLM) might be employed to intelligently determine how to best split the document, considering hierarchies, sections, and the overall structure, to create optimal chunks for retrieval.
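As a rough illustration of semantic chunking, the sketch below starts a new chunk whenever two adjacent sentences fall below a similarity threshold. To stay dependency-free it uses a toy bag-of-words `embed` function as a stand-in for a real embedding model; that function, `semantic_chunks`, and the `threshold` value are all assumptions for this example.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Group consecutive sentences, breaking where adjacent similarity drops."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])  # topic shift: start a new chunk
        else:
            chunks[-1].append(cur)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Greenhouse gases trap heat in the atmosphere.",
    "Carbon dioxide is the most common greenhouse gas.",
    "FAISS builds an index over embedding vectors.",
    "The index supports fast nearest neighbour search over vectors.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With a real embedding model, the same loop applies; only `embed` and a suitable threshold change, and the threshold usually needs tuning per corpus.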

### 2. Reranking

After an initial retrieval step, where a large number of potentially relevant documents or chunks are fetched, reranking is used to re-order these results based on a more refined measure of relevance. This is crucial because initial retrieval (e.g., cosine similarity with embeddings) might be good at casting a wide net, but reranking helps pinpoint the most precise answers.

*   **Why Reranking?**: The primary retriever often retrieves a broader set of documents. A reranker applies a more sophisticated model to score the relevance of each retrieved chunk to the query, placing the most pertinent chunks at the top. This significantly improves the precision of the final context provided to the LLM.
*   **Common Techniques/Models**: Reranking models are often transformer-based architectures fine-tuned for relevance scoring. Popular options include cross-encoders (e.g., from the `sentence-transformers` library) and hosted reranking models from providers like Cohere.
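The retrieve-then-rerank flow can be sketched as follows. The scoring function is pluggable: here `overlap_score` is a deliberately simple token-overlap stand-in so the example runs without downloading a model, and in practice it would be replaced by a cross-encoder's relevance score. The function names and `top_n` parameter are hypothetical.

```python
import re


def overlap_score(query, passage):
    """Toy stand-in for a cross-encoder: share of query tokens in the passage."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    p = set(re.findall(r"[a-z]+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Re-score every candidate against the query and keep the best top_n."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    # Python's sort is stable, so ties keep the initial retriever order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


# Pretend these came back from a broad first-stage vector search.
candidates = [
    "Sea levels rise as ice sheets melt.",
    "The main cause of climate change is burning fossil fuels.",
    "Climate models project future change in temperature.",
]
top = rerank("What is the main cause of climate change?", candidates)
for passage in top:
    print(passage)
```

A common pattern is to retrieve generously in the first stage (say, a few dozen chunks) and pass only the reranked top few to the LLM.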

### 3. Hybrid Search

Hybrid search combines different retrieval methods to leverage their individual strengths. The most common form combines sparse retrieval (keyword-based) with dense retrieval (vector similarity).

*   **Sparse Retrieval (e.g., BM25)**: This method scores documents by keyword overlap using term-frequency statistics (BM25 is a refinement of TF-IDF-style weighting). It's excellent for exact matches such as names, identifiers, and rare terms, but weaker on paraphrased or conceptual queries.
*   **Dense Retrieval (Vector Search)**: This method uses embeddings to find semantically similar documents, even if they don't share exact keywords. It's powerful for understanding the intent behind a query.
*   **Combining Strengths**: By performing both sparse and dense searches and then combining or fusing their results (e.g., using Reciprocal Rank Fusion - RRF), hybrid search can achieve better overall retrieval performance. It captures both explicit keyword relevance and implicit semantic relevance.
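Reciprocal Rank Fusion is simple enough to show in full. Each document receives a score of `1 / (k + rank)` from every ranked list it appears in, and the scores are summed. The sketch below assumes the two retrievers return ranked lists of document IDs; the IDs are made up, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


sparse_results = ["d3", "d1", "d2"]  # e.g., from a BM25 retriever
dense_results = ["d1", "d4", "d3"]   # e.g., from a vector store
fused = reciprocal_rank_fusion([sparse_results, dense_results])
print(fused)
```

Because RRF uses only ranks, it avoids having to normalize BM25 scores and cosine similarities onto a common scale. Documents that both retrievers rank highly (here `d1` and `d3`) rise to the top.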

### 4. Production Considerations

Deploying RAG systems in a production environment introduces several practical challenges and considerations:

*   **Scalability**: The system must handle a large number of concurrent queries and potentially very large document collections. This requires an efficient vector index or database (e.g., the FAISS library, or stores such as Chroma, Pinecone, and Weaviate), distributed architectures, and optimized retrieval algorithms.
*   **Latency**: Users expect quick responses. Minimizing the time taken for retrieval, reranking, and generation is critical. This involves efficient infrastructure, optimized model inference, and potentially caching mechanisms.
*   **Cost Optimization**: Running embedding models, large language models, and vector databases can be expensive. Strategies include choosing cost-effective models, optimizing API calls, and efficient resource allocation.
*   **Data Freshness**: Information can become outdated quickly. A production RAG system needs a robust mechanism to update its document corpus and vector store regularly, ensuring that retrieved information is current and accurate.
*   **Monitoring**: Continuous monitoring of retrieval quality, generation relevance, latency, and system health is essential. This allows for prompt detection of issues, performance bottlenecks, and drift in model effectiveness, enabling iterative improvements and maintenance.
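As one small example of how latency and data freshness trade off, the sketch below caches retrieval results per query but expires them after a time-to-live, so repeated queries are fast while stale results are eventually refreshed. The `TTLCache` class and `fake_retrieve` function are hypothetical; a production system would more likely use a shared cache service and invalidate entries when the corpus is re-indexed.

```python
import time


class TTLCache:
    """Cache values per key and recompute them once they are older than ttl."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # fresh hit: skip the expensive call
        value = compute(key)
        self._store[key] = (now, value)
        return value


# Stand-in for an expensive retriever call.
def fake_retrieve(query):
    print(f"retrieving for: {query}")
    return [f"chunk about {query}"]


cache = TTLCache(ttl_seconds=300)
cache.get_or_compute("climate change causes", fake_retrieve)  # calls the retriever
cache.get_or_compute("climate change causes", fake_retrieve)  # served from cache
```

The TTL is the freshness knob: a shorter value picks up document updates sooner at the cost of more retriever calls.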

## Summary

### Key Findings

The notebook's explanations cover the following areas:

*   **RAG Core Concepts**: RAG is defined as a system that combines information retrieval with text generation. Its main components are PDF processing, text chunking, text cleaning, vector store creation (e.g., using `HuggingFaceEmbeddings` and `Chroma`), and retriever setup. The stated goals are scalability, flexibility, efficiency, integration with advanced NLP techniques, and grounding generated answers in retrieved context.
*   **Document Preprocessing**: Detailed explanations were provided for PDF loading (`PyPDFLoader`), text chunking (`RecursiveCharacterTextSplitter` with `chunk_size` and `chunk_overlap`), and text cleaning (e.g., `replace_t_with_space`). The importance of these steps for improving embedding quality, search accuracy, and generation quality was highlighted.
*   **Embeddings and Vector Stores**:
    *   **Embeddings**: Defined as numerical representations of text that capture semantic similarity, crucial for efficient search in RAG.
    *   **Vector Stores**: Explained as specialized databases (`FAISS`, `Chroma`) optimized for storing and querying these embedding vectors for rapid retrieval of relevant information.
    *   **Types and Trade-offs**: Various embedding models were discussed (e.g., `HuggingFaceEmbeddings`, `OpenAIEmbeddings`, `CohereEmbeddings`, `GoogleGenerativeAIEmbeddings`, `BedrockEmbeddings`), along with their respective trade-offs in terms of cost, performance, open-source vs. proprietary nature, latency, and privacy. Similarly, differences and advantages between `FAISS` and `Chroma` were outlined, considering aspects like scalability, persistence, and deployment complexity.
*   **Retriever Setup**: The process of configuring a retriever from a vector store using the `.as_retriever()` method was clarified. A particular focus was placed on the `k` parameter, which determines the number of top-scoring document chunks retrieved. The analysis explained that a small `k` risks missing relevant information, while a large `k` can introduce noise and overwhelm the LLM, emphasizing the need for an optimal balance through experimentation.
*   **RAG Evaluation**:
    *   **Principles**: Core evaluation principles were outlined, including relevance, factual consistency (faithfulness), answer correctness, completeness, and conciseness.
    *   **Key Metrics**: Important metrics for RAG performance were detailed, such as context relevance, answer relevance, completeness, conciseness, faithfulness (groundedness), answer accuracy, and the role of RAGAS metrics (Context Precision, Context Recall, Answer Semantic Similarity).
    *   **Generalization Strategies**: Concrete strategies were proposed to generalize the `evaluate_rag` function beyond OpenAI, including abstracting LLM calls, configuring evaluation prompts, modularizing metric calculation, supporting local embedding models, integrating external evaluation frameworks (like RAGAS), and parameterizing evaluation thresholds.
*   **Advanced RAG Topics**: Suggestions for further learning were introduced, covering:
    *   **Advanced Chunking**: Strategies like semantic, summary-based, and agentic chunking for more contextually rich chunks.
    *   **Reranking**: Techniques to re-order initial retrieval results for improved precision.
    *   **Hybrid Search**: Combining sparse (keyword-based) and dense (vector similarity) retrieval for enhanced overall performance.
    *   **Production Considerations**: Practical challenges and considerations for deploying RAG systems, such as scalability, latency, cost optimization, data freshness, and continuous monitoring.

### Insights or Next Steps

*   The notebook now covers RAG from its basic components through to advanced and production considerations, so it can serve as a reference for developers and researchers implementing a similar pipeline.
*   The strategies for generalizing the `evaluate_rag` function matter for building flexible RAG systems: they remove the dependency on a single provider and let the evaluation adapt to different LLMs and criteria.
