# 02 Introduction to RAG with Ollama

This notebook demonstrates a **Retrieval-Augmented Generation (RAG)** pipeline using the **Ollama** library for generating responses, **ChromaDB** for efficient vector-based retrieval, and **Jinja** for dynamic prompt templating. RAG combines the power of large language models with a vector database to enable dynamic and context-aware response generation based on external knowledge.

### Steps Covered:
1. **Data Preparation**: Load and preprocess text data to generate embeddings.
2. **Embedding Creation**: Use a language model to encode textual data into vector representations.
3. **Storage and Retrieval**: Store embeddings in ChromaDB and perform similarity-based queries.
4. **Dynamic Prompt Generation**: Use **Jinja** templates to build structured prompts dynamically based on retrieved data.
5. **Response Generation**: Use Ollama to generate relevant responses based on templated prompts and retrieved context.

### Key Libraries:
- **[Ollama](https://github.com/ollama/ollama-python)**: A Python library for interacting with language models for text generation and embedding.
- **[ChromaDB](https://www.trychroma.com/)**: A vector database for storing and querying high-dimensional embeddings.
- **[Jinja](https://jinja.palletsprojects.com/)**: A templating engine for building flexible and reusable text prompts.

This step-by-step guide provides a hands-on approach to building and understanding the RAG pipeline, leveraging modularity and clarity for easy adaptation into your own projects.

## 02.01 Installing Required Libraries

This cell installs the necessary Python libraries for implementing the RAG pipeline.

1. **`ollama`**:
   - A library for interacting with large language models.
   - Enables functionalities like generating embeddings and responses.
   - Official documentation: [Ollama GitHub](https://github.com/ollama/ollama-python).

2. **`chromadb`**:
   - A high-performance vector database for storing and retrieving embeddings.
   - Used for similarity searches in the RAG pipeline.
   - Official website: [ChromaDB](https://www.trychroma.com/).

3. **`Jinja2`**:
   - A templating engine for dynamically generating text prompts based on retrieved data.
   - Simplifies creating structured, reusable prompts.
   - Official documentation: [Jinja2 Documentation](https://jinja.palletsprojects.com/).

### Why `%pip`?
- The `%pip install` magic installs packages into the environment of the running notebook kernel, which avoids accidentally installing them into a different Python environment.

**Note**: You normally need to run this cell only once. Re-running it is harmless: pip skips any package that is already installed.
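
To confirm that the packages landed in the environment the kernel is using, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are assumed to match the `%pip install` commands in this notebook.

```python
import sys
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# The interpreter behind the running kernel; %pip installs into this environment.
print(f"Kernel interpreter: {sys.executable}")

for name in ("ollama", "chromadb", "Jinja2"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```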

In [None]:
# 02.01 Installing Required Libraries

# Installing the Ollama library for interaction with large language models.
%pip install ollama

# Installing ChromaDB, a vector database for storing and retrieving high-dimensional embeddings.
%pip install chromadb

# Installing Jinja2, a templating engine for dynamically generating structured prompts.
%pip install Jinja2

## 02.02 Configuring the Embedding Model

This section sets up the **Ollama Embedding Function** for generating vector representations, which will be used by ChromaDB for similarity-based retrieval.

### Steps and Requirements:
1. **Importing Embedding Functions**:
   - The `chromadb.utils.embedding_functions` module provides utilities for defining embedding functions compatible with ChromaDB.

2. **Embedding Model Configuration**:
   - The `OllamaEmbeddingFunction` is configured with:
     - `url`: The API endpoint for Ollama's embedding service.
     - `model_name`: Specifies the embedding model, `nomic-embed-text`, for vector generation.
   - Make sure the Ollama server is running and the `nomic-embed-text` embedding model has been installed (pulled) before proceeding.

### Additional Resources:
- **Supported Embedding Models**:
  - Refer to the [Ollama Supported Embedding Models](https://ollama.com/search?c=embedding) for a complete list of embedding models.
  - Install the required embedding model before proceeding.
- **Embedding Overview**:
  - Learn more about embeddings and their applications in [IBM's Documentation](https://www.ibm.com/think/topics/embedding).

### How to Verify Installed Models:
To list every model installed and available in Ollama (embedding models included), run:
```python
import ollama
available_models = ollama.list()
for model in available_models.models:
    print(model.model)
```

Ensure that the `nomic-embed-text:latest` model is listed in the output. If it is not, either install it or change `model_name` to an embedding model that is already installed.
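
The check can also be done programmatically. The following is a minimal sketch, not part of the Ollama API: the helpers `normalize_model_name` and `is_model_available` are hypothetical, the list of names is made-up sample data, and the sketch assumes that a model name without an explicit tag refers to the `latest` tag.

```python
def normalize_model_name(name):
    """Append the ':latest' tag when a model name carries no explicit tag."""
    return name if ":" in name else f"{name}:latest"


def is_model_available(installed_names, wanted):
    """Check whether `wanted` appears among the installed model names."""
    wanted = normalize_model_name(wanted)
    return any(normalize_model_name(name) == wanted for name in installed_names)


# Hypothetical sample of names, shaped like the output of the loop above.
installed = ["llama3.2:3b", "nomic-embed-text:latest"]
print(is_model_available(installed, "nomic-embed-text"))   # True
print(is_model_available(installed, "mxbai-embed-large"))  # False
```

In the notebook, `installed` could instead be built from the listing shown earlier, for example `[m.model for m in ollama.list().models]`.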

### Why Embeddings?

Embedding models convert textual data into high-dimensional vectors, enabling efficient similarity searches in a vector database like ChromaDB.

In [None]:
# 02.02 Configuring the Embedding Model

# Importing the embedding functions utility from ChromaDB.
import chromadb.utils.embedding_functions as embedding_functions

# Setting up an Ollama embedding function for ChromaDB.
# Ensure that the embedding model 'nomic-embed-text' is installed and available.
# This will utilize Ollama's embedding API for generating vector representations.
ollama_embedding = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",  # API endpoint for Ollama's embedding service.
    model_name="nomic-embed-text:latest"  # The embedding model to be used for generating vectors.
)

## 02.03 Generating Sample Embeddings

This section demonstrates how to generate embeddings for a sample text using the configured Ollama embedding function.

### Steps:
1. **Input Text**:
   - A sample text is passed as a list to the `ollama_embedding` function. Each string in the list will be converted into a high-dimensional vector.

2. **Generating Embeddings**:
   - The `OllamaEmbeddingFunction` processes the input text and returns its vector representation.

3. **Output**:
   - The output, `sample_embeddings`, is a list of vectors corresponding to the input text. These embeddings are suitable for tasks like similarity searches, clustering, or input to machine learning models.

### Example Use Case:
Given the input text:
```python
"This is a sample text to try Ollama embedding at the workshop"
```
The output will be a list containing one vector, such as:
```
[[0.1, 0.2, 0.3, ...]]  # Example representation of a high-dimensional vector.
```
### Why Embeddings Are Useful:

- **Numerical Representation**: Transform text into fixed-length numeric vectors that can be compared and computed on.
- **Similarity Search**: Enable searching for semantically similar texts in a vector database like ChromaDB.
- **Downstream Applications**: Useful in clustering, classification, or as input to models.

**Notes:**

- Ensure that the embedding model (`nomic-embed-text:latest`) is properly installed and running.
- Embeddings are high-dimensional numerical data, so the output may not be human-readable but is critical for computational tasks.
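
To make the idea of similarity between embeddings concrete, here is a minimal sketch of cosine similarity in plain Python. The three vectors are made-up, 3-dimensional stand-ins for real embeddings (which have hundreds of dimensions), and the variable names are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Texts with related meaning are expected to point in similar directions.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```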

In [None]:
# 02.03 Generating Sample Embeddings

# Generating embeddings for a sample text using the configured Ollama embedding function.
# The input is a list of strings, and the function returns corresponding vector representations.
sample_embeddings = ollama_embedding(
    ["This is a sample text to try Ollama embedding at the workshop"]
)

# Printing the generated embeddings to verify the output.
# Note: The output is a high-dimensional vector representation of the input text.
print(sample_embeddings)

## 02.04 Using a Persistent Vector Database with ChromaDB

This section explains using a **Persistent Database** with ChromaDB and its configuration for this demonstration.

### Why Use a Persistent Database?

1. **Efficiency**:
   - When working with large datasets, generating embeddings and indexing them can be computationally expensive and time-consuming.
   - A persistent database ensures that the embeddings and indexes are saved to disk, avoiding the need to re-index data every time the notebook is run.

2. **Reusability**:
   - Once the database is created, it can be reused across multiple sessions or notebooks.
   - You can load the database and immediately perform similarity queries without regenerating embeddings.

3. **Performance**:
   - ChromaDB's persistent storage is optimized for rapid retrieval of embeddings, enabling faster searches in subsequent runs.

### How ChromaDB Saves Data
ChromaDB saves embeddings, indexes, and metadata on disk at the specified path (`./vector-db/made-with-cc` in this case). 
- The data is stored in a structured format that supports efficient querying and retrieval.
- By using persistent storage, ChromaDB keeps the data consistent between sessions and removes the need to rebuild the index in memory on every run.

### Demonstration Example

For this demonstration, we use the book **"Made with Creative Commons"**:
- **Source**: [Made with Creative Commons](https://creativecommons.org/share-your-work/made-with-cc/)
- **License**: The book is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license, which allows free use and sharing with attribution, provided that adaptations are shared under the same license.

#### Prepared Data:
- The text of the book has been pre-processed:
  - Paragraphs were extracted line by line and saved into a text file: `made-with-cc.txt`.
  - This file is stored in the current working directory.
  
#### Notes:
- For your own projects, you may need to preprocess your data and create text files containing paragraphs or other units of text for embedding and indexing.
- This demonstration does not cover text extraction or preparation techniques.

### Persistent Database Initialization
In this cell:
- **`chromadb.PersistentClient()`**:
  - Configures a client with persistent storage.
  - Saves the database at `./vector-db/made-with-cc`.
  - When this notebook is rerun, the database will be reloaded from disk without requiring re-indexing.

### Key Advantages:
- Saves time by avoiding redundant computations.
- Simplifies workflows by separating data preparation from query execution.

In [None]:
# 02.04 Using a Persistent Vector Database with ChromaDB

# Importing ChromaDB, a vector database for efficient embedding storage and retrieval.
import chromadb

# Initializing a Persistent ChromaDB client.
# The database will be stored persistently on disk at the specified path.
cc_client = chromadb.PersistentClient(path="./vector-db/made-with-cc")  # Path to the persistent database.

## 02.05 Helper Functions for Data Processing

This section defines a set of helper functions to streamline the process of loading and embedding data into the vector database.

### Functions Overview:

1. **`generate_line_number_id(index)`**:
   - Creates a unique ID for each document from its 1-based position among the non-empty lines of the input file.
   - Helps in tracking and referencing individual text entries.

2. **`get_embedding_for_document(document)`**:
   - Generates a vector embedding for a given text document using the configured `ollama_embedding` function.
   - Includes error handling to catch and report issues during the embedding process.

3. **`load_documents_with_line_ids_and_embeddings(file_path, collection)`**:
   - Loads documents from a text file, assigns a line-based ID to each, generates embeddings, and adds them to the specified ChromaDB collection.
   - Steps:
     - Reads the text file line by line, skipping empty lines.
     - Generates unique IDs for each document.
     - Uses `get_embedding_for_document` to compute embeddings sequentially.
     - Adds the documents, IDs, and embeddings to the ChromaDB collection using `collection.upsert()`.

### Key Points:
- **Error Handling**: Gracefully handles file not found errors and embedding failures.
- **Data Processing**: Each document is processed line by line, ensuring clean and consistent indexing.
- **Embedding Generation**: Embeddings are generated using the pre-configured `ollama_embedding` function.

### Example Usage:
```python
# Assuming 'my_collection' is a ChromaDB collection object.
load_documents_with_line_ids_and_embeddings("made-with-cc.txt", my_collection)
```
This approach simplifies loading and embedding data into the database, making it easy to adapt for larger or custom datasets.
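
Because blank lines are dropped before IDs are assigned, the IDs produced by the helper match the file's line numbers only when the file contains no blank lines. If the original line numbers matter for your data, a variation along these lines keeps them. This is a sketch of an alternative, not the notebook's implementation; the function name `read_documents_with_line_ids` is hypothetical.

```python
import io


def read_documents_with_line_ids(lines):
    """Return (ids, documents), using each line's position in the file as its ID."""
    ids, documents = [], []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if text:  # Skip blank lines but keep counting them.
            ids.append(str(line_number))
            documents.append(text)
    return ids, documents


# An in-memory stand-in for a text file with blank lines between paragraphs.
sample_file = io.StringIO("First paragraph.\n\nSecond paragraph.\n   \nThird paragraph.\n")
sample_ids, sample_documents = read_documents_with_line_ids(sample_file)
print(sample_ids)  # ['1', '3', '5']
```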

In [None]:
# 02.05 Helper Functions for Data Processing

def generate_line_number_id(index):
    """
    Generates an ID based on the line number.

    Args:
        index (int): The zero-based index of the line in the file.

    Returns:
        str: The line number ID as a string (1-based index).
    """
    return str(index + 1)

def get_embedding_for_document(document):
    """
    Generates an embedding for a given document using the `ollama_embedding` function.

    Args:
        document (str): The document text.

    Returns:
        list: The embedding for the document.
    """
    try:
        return ollama_embedding([document])[0]
    except Exception as e:
        print(f"Error generating embedding for document: {document[:30]}... Error: {e}")
        return None

def load_documents_with_line_ids_and_embeddings(file_path, collection):
    """
    Loads documents from a text file, assigns a line number as the ID to each document,
    generates embeddings for each document, and adds them to a collection.

    Args:
        file_path (str): The path to the text file containing documents (one per line).
        collection (object): The collection object to which the documents, IDs, and embeddings will be added.

    Functionality:
        - Reads a text file line by line.
        - Strips whitespace from each line and skips empty lines.
        - Generates a line number ID for each document.
        - Generates embeddings for each document sequentially.
        - Adds the documents, IDs, and embeddings to the collection using `collection.upsert()`.

    Example Usage:
        collection = some_vector_database.collection("made-with-cc")
        load_documents_with_line_ids_and_embeddings("made-with-cc.txt", collection)
    """
    try:
        # Open and read the text file line by line
        with open(file_path, 'r') as file:
            lines = file.readlines()

        # Process each line to strip whitespace and remove empty entries
        documents = [line.strip() for line in lines if line.strip()]

        # Generate line number IDs for each document
        ids = [generate_line_number_id(i) for i in range(len(documents))]

        # Generate embeddings sequentially, keeping only documents that embed successfully.
        # (An empty placeholder is not a valid embedding and would make the upsert fail.)
        kept_documents, kept_ids, embeddings = [], [], []
        for doc_id, doc in zip(ids, documents):
            embedding = get_embedding_for_document(doc)
            if embedding is not None:
                kept_documents.append(doc)
                kept_ids.append(doc_id)
                embeddings.append(embedding)

        # Add the documents, IDs, and embeddings to the collection
        collection.upsert(documents=kept_documents, ids=kept_ids, embeddings=embeddings)

        print(f"Successfully added {len(kept_documents)} of {len(documents)} documents with embeddings to the collection.")

    except FileNotFoundError:
        print(f"Error: The file at {file_path} was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

## 02.06 Initializing and Loading the ChromaDB Collection

This section demonstrates how to initialize or retrieve a ChromaDB collection and ensure it is populated with data for querying.

### Key Steps:
1. **Create or Retrieve a Collection**:
   - The `get_or_create_collection()` method ensures that a collection named `"made-with-cc"` exists in the database.
   - If the collection does not already exist, it will be created.

2. **Check if the Collection is Empty**:
   - The `count()` method checks the number of documents currently stored in the collection.
   - If the collection is empty (`count() == 0`):
     - A message is displayed, and the helper function `load_documents_with_line_ids_and_embeddings` is called to load the data.
     - This step processes the `made-with-cc.txt` file and populates the collection with documents and embeddings.
   - If the collection is not empty:
     - A message is displayed indicating the collection is ready for querying, along with the current document count.

### Example Usage:
- When the notebook is run for the first time, the collection will be populated with embeddings from `made-with-cc.txt`.
- Subsequent runs will skip reloading if the collection is already populated, saving time and resources.

### Benefits:
- **Persistence**:
  - Leveraging ChromaDB's persistent storage, this approach avoids redundant re-indexing when the notebook is rerun.
- **Efficiency**:
  - Ensures that embeddings are only generated once, making the pipeline faster for future queries.
  
### Output Messages:
- For an empty collection:
  ```
  Collection 'made-with-cc' is empty. Loading documents... Please wait...
  ```
- For a preloaded collection:
  ```
  Loaded collection 'made-with-cc'. Collection contains X documents.
  This collection is ready to be queried.
  ```

In [None]:
# 02.06 Initializing and Loading the ChromaDB Collection

# Create or retrieve a collection named "made-with-cc" in the persistent ChromaDB database.
cc_collection = cc_client.get_or_create_collection(name="made-with-cc")

# Check if the collection is empty.
if cc_collection.count() == 0:
    # If the collection is empty, print a message and load the documents.
    print("Collection 'made-with-cc' is empty. Loading documents... Please wait...")
    load_documents_with_line_ids_and_embeddings("made-with-cc.txt", cc_collection)
else:
    # If the collection is not empty, print the current document count.
    print(f"Loaded collection 'made-with-cc'. Collection contains {cc_collection.count()} documents. \nThis collection is ready to be queried.")

## 02.07 Querying the Database for Matching Concepts

This section demonstrates how to query the ChromaDB collection to retrieve documents that match a specific concept based on vector embeddings.

### Example Prompt
For this demonstration, the query prompt is:

```
How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?
```

This prompt reflects a conceptual query designed to retrieve semantically similar paragraphs from the database.

### Understanding Similarity and Distance
- **Concept of Distance**:
  - Distance is a numerical measure of how closely the query embedding matches document embeddings in the database.
  - A **smaller distance** indicates a higher similarity between the query and the document's concept.
  - A **larger distance** suggests that the document is less relevant to the given query.

- **How It Works**:
  - Each paragraph in the database is represented as a high-dimensional vector (embedding).
  - The query is also converted into a vector using the same embedding function.
  - The database computes the distance between the query vector and each document vector, returning the documents with the smallest distances.

- **Practical Meaning**:
  - The closer the distance, the more aligned the document's idea or concept is with the given query.
  - This allows for semantic (concept-based) matching rather than simple keyword matching.
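
The ranking step described above can be sketched with a brute-force search over toy vectors. This is only an illustration under stated assumptions: it uses squared Euclidean distance, whereas the metric ChromaDB applies depends on how the collection is configured, and ChromaDB uses an index rather than comparing the query against every vector. The helper names `squared_l2` and `nearest` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vector, doc_vectors, n_results=2):
    """Return (doc_id, distance) pairs for the n closest documents, nearest first."""
    scored = [(doc_id, squared_l2(query_vector, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Made-up 2-dimensional vectors standing in for real paragraph embeddings.
toy_vectors = {"1": [0.0, 1.0], "2": [1.0, 1.0], "3": [5.0, 5.0]}
print(nearest([1.0, 0.0], toy_vectors))  # [('2', 1.0), ('1', 2.0)]
```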

### Query Process
1. **Query Embeddings**:
   - The prompt is converted into vector embeddings using the preconfigured `ollama_embedding` function.
   - These embeddings serve as the basis for similarity matching.

2. **ChromaDB Query**:
   - The `cc_collection.query()` method is used to search the database for the top `n_results` (10 in this case) matching documents based on similarity.

3. **Output Format**:
   - Results include:
     - Document IDs: Unique identifiers for the matched paragraphs.
     - Distances: Numerical values representing the similarity between the query and the results.
     - Paragraph Text: The actual content of the matched documents.

### Advanced Output Rendering
- **HTML Table**:
  - Results are presented in a styled HTML table using Jupyter's `IPython.display.HTML` for better readability.
  - The table includes columns for:
    - Document ID (Paragraph Number).
    - Distance (lower means more similar).
    - Paragraph Text (Matched Content).

- **Plain Text Option**:
  - If you prefer, uncomment the following lines to print the results in plain text format:
  ```python
  for id_num, document, distance in zip(ids, documents, distances):
      print(f"\n[{id_num}] [{distance}] {document}")
  ```

### Benefits:

- **Efficiency**: ChromaDB retrieves matches quickly, leveraging its persistent indexing.
- **Contextual Matching**: Embeddings allow semantic similarity searches, providing results even for abstract or indirect matches.

In [None]:
# 02.07 Querying the Database for Matching Concepts

# Query the ChromaDB collection for documents matching the given concept.
results = cc_collection.query(
    query_embeddings=ollama_embedding([
        "How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"
    ]),
    n_results=10  # Number of top matching results to return.
)

# Extract results data
ids = results['ids'][0]  # Extract the IDs of the matching documents.
documents = results['documents'][0]  # Extract the document content of the matches.
distances = results['distances'][0]  # Extract the distances between the query and matches.

# The results are rendered below as an HTML table using IPython's display utilities.
# To print them as plain text instead, uncomment the two lines below.

#for id_num, document, distance in zip(ids, documents, distances):
#    print(f"\n[{id_num}] [{distance}] {document}")

# Jupyter advanced options: Render results as an HTML table
import html  # For escaping characters such as < and & in the paragraph text
from IPython.display import HTML, display

# Prepare data for the HTML table
data = [[id_num, round(distance, 3), html.escape(doc)] for id_num, distance, doc in zip(ids, distances, documents)]

# Define an HTML table with styling
html_table = '''
<table style="width:100%; table-layout: auto; border-collapse: collapse;">
    <colgroup>
        <col style="white-space: nowrap;"> <!-- Paragraph Number column -->
        <col style="white-space: nowrap;"> <!-- Distance column -->
        <col style="width: auto;"> <!-- Paragraph Text column -->
    </colgroup>
    <tr style="background-color: #f8f9fa;">
        <th style="border: 1px solid #dee2e6; padding: 8px;">No</th>
        <th style="border: 1px solid #dee2e6; padding: 8px;">Distance</th>
        <th style="border: 1px solid #dee2e6; padding: 8px; text-align: left;">Paragraph Text</th>
    </tr>
    <tr>{}</tr>
</table>
'''.format(
    '</tr><tr>'.join(
        '<td style="border: 1px solid #dee2e6; padding: 8px; text-align: center;">{}</td><td style="border: 1px solid #dee2e6; padding: 8px; text-align: center;">{}</td><td style="border: 1px solid #dee2e6; padding: 12px; text-align: justify;">{}</td>'.format(*row) for row in data
    )
)

# Display the table
display(HTML(html_table))

## 02.08 Streaming and Displaying Markdown Responses from Ollama

This section provides a helper function to stream responses from an Ollama model and display them dynamically in Markdown format within a Jupyter Notebook. Alternatively, you can call the `chat` function directly for static, non-streamed output.

### Functionality

1. **`stream_markdown()`**:
   - Streams a response from the specified model and input prompt.
   - Dynamically updates the displayed content in Markdown as the response is received.

2. **Core Features**:
   - **Streaming**: The response is processed in chunks using Ollama's `chat` API with the `stream=True` option.
   - **Markdown Display**: Chunks are rendered as Markdown in real time using Jupyter's `display()` and `Markdown()` functions.
   - **Update Control**:
     - Updates the display every 300 milliseconds (configurable via `UPDATE_INTERVAL`) or if the buffer exceeds 500 characters.
     - Ensures smooth and responsive streaming in the notebook.
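
The update-control rule can be examined in isolation from Ollama and Jupyter. The sketch below is a simplified stand-in for the display logic: the generator name `throttled_updates` and its `clock` parameter are hypothetical, and the fake clock exists only to make the example deterministic.

```python
import time


def throttled_updates(chunks, interval=0.3, max_buffer=500, clock=time.time):
    """Yield the accumulated text whenever the interval elapses or the buffer fills."""
    full_text, buffer = "", ""
    last_update = clock()
    for chunk in chunks:
        buffer += chunk
        now = clock()
        if (now - last_update > interval) or len(buffer) > max_buffer:
            full_text += buffer
            buffer = ""
            last_update = now
            yield full_text
    if buffer:  # Flush whatever is left once the stream ends.
        yield full_text + buffer


# A fake clock that advances 0.2 s per call, so the example is deterministic.
ticks = iter(i * 0.2 for i in range(100))
updates = list(throttled_updates(["Hel", "lo, ", "wor", "ld!"], clock=lambda: next(ticks)))
print(updates)  # ['Hello, ', 'Hello, world!']
```

Four chunks arrive, but only two display refreshes happen, which is the effect the 300-millisecond interval is meant to achieve.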

### Usage Example
```python
response = stream_markdown(
    'llama3.2:3b',  # Model name
    'How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?'  # Prompt
)
```

The response will be streamed and displayed dynamically in Markdown, making it more readable and interactive.

### Alternative Option

- For static responses without streaming, use the `chat()` function directly and display the response after processing:

```python
from ollama import chat
from IPython.display import display, Markdown

response = chat(
    model='llama3.2:3b',
    messages=[{'role': 'user', 'content': 'How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?'}]
)

display(Markdown(response['message']['content']))
```

- To print the raw stream as plain text, without Markdown rendering:

```python
from ollama import chat
stream = chat(
    model='llama3.2:3b',
    messages=[{'role': 'user', 'content': 'How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```

### Benefits of Streaming

- Real-time feedback during long responses.
- Improved interactivity for exploratory tasks.

In [None]:
# 02.08 Streaming and Displaying Markdown Responses from Ollama

from IPython.display import display, Markdown  # For rendering Markdown in Jupyter Notebook
from ollama import chat  # For interacting with the Ollama chat API
import time  # For controlling update intervals in the streaming output

def stream_markdown(model, prompt):
    """
    Streams a response from an Ollama chat model and displays it as Markdown in a Jupyter Notebook.

    Args:
        model (str): The name of the model to query.
        prompt (str): The input prompt for the model.

    Returns:
        str: The full response streamed from the model.
    """
    # Start the streaming chat with the specified model and user-provided prompt
    stream = chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True,
    )
    
    # Initialize variables for tracking the response and buffer updates
    full_response = ""  # Stores the complete response
    buffer = ""  # Temporarily stores incoming chunks
    last_update = time.time()  # Tracks the last time the display was updated
    UPDATE_INTERVAL = 0.3  # Minimum time interval (in seconds) for updates
    
    try:
        # Process the streamed response chunk by chunk
        for chunk in stream:
            chunk_content = chunk['message']['content']  # Extract the text content from the chunk
            buffer += chunk_content  # Append new content to the buffer
            current_time = time.time()
            
            # Update the Markdown display if enough time has passed or buffer is too large
            if (current_time - last_update > UPDATE_INTERVAL) or len(buffer) > 500:
                full_response += buffer  # Add buffered content to the full response
                display(Markdown(full_response), clear=True)  # Render the Markdown
                buffer = ""  # Clear the buffer
                last_update = current_time
        
        # Final update to display any remaining content in the buffer
        if buffer:
            full_response += buffer
            display(Markdown(full_response), clear=True)
            
    except Exception as e:
        print(f"Error occurred: {e}")  # Print any errors encountered during streaming
        
    return full_response  # Return the complete response as a string

# Usage Example
response = stream_markdown(
    'llama3.2:3b',  # Model name
    'Explain what the llama3.2 model is and its main purpose in simple terms as a table.'  # Prompt for the model
)

## 02.09 Querying the Default Model Knowledge

In this section, we query the model using a simple prompt without providing any additional context or external data. This demonstrates how the model generates responses based solely on its default training and internal knowledge.

### Purpose
This section serves as a **baseline** for comparison:
- By observing the output generated without additional context, we can later compare it to outputs produced with the **RAG pipeline**.
- This highlights how including external data (retrieved from ChromaDB in RAG) enriches the model's responses.

### Example Prompt
For this demonstration, the prompt is:

```
"How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"
```

This query is designed to explore how Creative Commons licenses impact the digital commons, balancing open sharing with market-based approaches.

### Usage
The `stream_markdown()` function streams the model's response dynamically and displays it in Markdown format. The response reflects the model's understanding of the topic based on its pre-trained knowledge.

### Key Insights
- **Default Knowledge**: The model's response is limited to what it has learned during training. It does not incorporate external or real-time context.
- **Reference Output**: The generated output serves as a reference point to evaluate the impact of augmenting the model with external data using RAG.

### Next Step
In subsequent sections, we will use the RAG pipeline to provide additional context to the model, compare the results, and demonstrate the value of integrating external knowledge.

In [None]:
# 02.09 Querying the Default Model Knowledge

query = "How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"
response = stream_markdown(
    'llama3.2:3b',  # Model name
    query  # Prompt for the model
)

## 02.10 Retrieval-Augmented Generation (RAG) Example

In this section, we use the **RAG pipeline** to enhance the model's output by providing it with additional context retrieved from a vector database. This demonstrates how augmenting the model with external knowledge improves the relevance and specificity of its responses.

### Steps
1. **Define the Query**:
   - The query is a conceptual question:
     ```
     "How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"
     ```

2. **Retrieve Related Paragraphs**:
   - Using ChromaDB, we query the vector database to retrieve the top 10 paragraphs most relevant to the query.
   - These paragraphs are selected based on their semantic similarity to the query embeddings.

3. **Construct the RAG Prompt**:
   - The retrieved paragraphs are combined with the query to create a context-rich prompt:
     ```
     "How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms? - Answer using these references: [retrieved paragraphs]"
     ```

4. **Stream the Response**:
   - The `stream_markdown()` function sends the context-enhanced prompt to the model (`llama3.2:3b`) and streams the generated response.

### Benefits of RAG
- **Context-Aware Responses**:
  - By incorporating retrieved references, the model can provide more specific and informed answers.
- **Relevance**:
  - The retrieved paragraphs ground the model's response in external data, which helps keep the answer aligned with the provided context.

### Comparison with Default Knowledge
- Unlike the earlier section, where the model relied solely on its pre-trained knowledge, this approach augments its capabilities with domain-specific data retrieved at query time.

### Example Output
For the query:

```
"How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"
```

The RAG-enhanced response should be more specific to the source material, drawing directly on the retrieved paragraphs.

In [None]:
# 02.10 Retrieval-Augmented Generation (RAG) Example

query = "How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"

related_paragraphs = cc_collection.query(
    query_embeddings=ollama_embedding([
        query
    ]),
    n_results=10  # Number of top matching results to return.
)['documents'][0]

basic_rag_prompt = f"{query} - Answer using these references: {' '.join(related_paragraphs)}"

response = stream_markdown(
    'llama3.2:3b',  # Model name
    basic_rag_prompt  # RAG Prompt for the model
)

## 02.11 Enhanced RAG Prompt Generation Using Jinja2

This section demonstrates how to create a detailed and structured prompt for **Retrieval-Augmented Generation (RAG)** using Jinja2 templates. The enhanced prompt incorporates retrieved contextual information with clear instructions and paragraph citations.

### Key Features of the RAG Template
- **Context Integration**:
  - Includes the retrieved paragraphs with paragraph IDs (e.g., `[p1]`).
  - Ensures the model has clear and structured context to generate responses.
- **Instructions**:
  - Guides the model to:
    - Prioritize the provided context.
    - Use logical flow and coherence.
    - Cite paragraphs explicitly using `[p{number}]` notation.
    - Acknowledge gaps if relevant information is missing.
    - Synthesize multiple paragraphs for comprehensive answers.

### `create_rag_prompt` Function
- **Purpose**:
  - Combines a user query and retrieved context into a formatted RAG prompt.
- **Validation**:
  - Ensures the `context_docs` dictionary contains both `documents` and `ids`.
  - Checks that both lists are non-empty and aligned in length.
- **Formatting**:
  - Formats each paragraph with its corresponding ID.
  - Joins all paragraphs into a cohesive block of context.
  - Renders the final prompt using the Jinja2 template.
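
The validation and formatting steps above depend on the shape of the query result. The sketch below uses a hand-built dictionary as a stand-in for what `cc_collection.query(...)` returns: each value holds one inner list per query, which is why the function reads index `[0]`. The IDs and paragraph text are placeholders invented for illustration, and `format_context` is a hypothetical helper that covers only the pairing step, not the full `create_rag_prompt` function.

```python
# Hypothetical stand-in for a ChromaDB query result for a single query.
# The IDs and texts are placeholders, not real notebook data.
mock_results = {
    "ids": [["12", "47"]],
    "documents": [[
        "  Licenses let creators grant permissions in advance. ",
        "The commons grows as works are shared.",
    ]],
}

def format_context(results):
    """Pair each retrieved document with its ID and join them into one context block."""
    docs = results["documents"][0]  # documents for the first (and only) query
    ids = results["ids"][0]         # matching IDs for the first query
    if len(docs) != len(ids):
        raise ValueError("Number of documents must match number of ids")
    # strip() removes stray whitespace around each paragraph before labeling it
    return "\n\n".join(f"[p{pid}] {doc.strip()}" for pid, doc in zip(ids, docs))

formatted = format_context(mock_results)
print(formatted)
```

Printing `formatted` shows two labeled paragraphs separated by a blank line, which is the block that the template receives as `context`.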

### Example Workflow
1. **Retrieve Context**:
   - Use a vector database like ChromaDB to retrieve paragraphs related to the query.
2. **Generate Prompt**:
   - Use the `create_rag_prompt` function to combine the query and retrieved paragraphs.
3. **Example Query**:
   ```
   "How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"
   ```
4. **Generated Prompt**:
   - The output includes the question, the formatted context, and the instructions.

### Benefits
- **Improved Clarity**:
  - Structured prompts guide the model toward focused answers.
- **Enhanced Citation**:
  - Paragraph IDs make it possible to trace each claim back to its source paragraph.
- **Reusability**:
  - Jinja2 templates can be customized and reused for other use cases.

In [None]:
# 02.11 Enhanced RAG Prompt Generation Using Jinja2

from jinja2 import Template  # Importing Jinja2 for creating and rendering templates.

# Enhanced RAG template with instructions for answering using context.
rag_template = Template("""
You are a knowledgeable assistant tasked with answering questions based on provided context.
Focus on providing accurate, relevant information from the given sources while maintaining logical flow.

Context:
{{ context }}

Question: {{ question }}

Instructions:
- Answer based primarily on the provided context
- Maintain coherent logical flow
- Cite specific paragraphs using [p{number}] notation when referencing information
- If information is missing from context, acknowledge this
- Synthesize information across multiple paragraphs when relevant
- Use paragraph numbers to build a cohesive narrative while maintaining accurate citations
""")

def create_rag_prompt(query, context_docs):
    """
    Creates an enhanced RAG prompt combining the query and relevant context with paragraph IDs.
    
    Args:
        query (str): User's question
        context_docs (dict): Retrieved documents from vector search containing:
            - documents: list of document content
            - ids: list of paragraph IDs
    
    Returns:
        str: Generated prompt with context and instructions.
    """
    # Validate input structure
    if not all(key in context_docs for key in ["documents", "ids"]):
        raise ValueError("context_docs must contain both 'documents' and 'ids' keys")
    
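    # ChromaDB returns one inner list per query; this function uses the first (index 0).
    if not context_docs["documents"] or not context_docs["ids"]:
        raise ValueError("Both documents and ids lists must not be empty")

    # Guard against a query that matched nothing (inner lists empty).
    if not context_docs["documents"][0] or not context_docs["ids"][0]:
        raise ValueError("No documents were retrieved for the query")

    if len(context_docs["documents"][0]) != len(context_docs["ids"][0]):
        raise ValueError("Number of documents must match number of ids")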

    # Combine documents with their paragraph IDs
    context_entries = []
    for doc, para_id in zip(context_docs["documents"][0], context_docs["ids"][0]):
        # Format each context entry with its paragraph ID
        context_entries.append(f"[p{para_id}] {doc.strip()}")
    
    # Join all retrieved documents with clear separation
    context = "\n\n".join(context_entries)
    
    # Generate prompt using template
    prompt = rag_template.render(
        question=query,
        context=context
    )
    
    return prompt

# Example usage
query = "How do Creative Commons licenses contribute to fostering the digital commons and balancing it with market norms?"

# Retrieve relevant documents using vector search.
related_paragraphs = cc_collection.query(
    query_embeddings=ollama_embedding([query]),
    n_results=10
)

# Create an improved RAG prompt using the retrieved paragraphs.
improved_rag_prompt = create_rag_prompt(query, related_paragraphs)

# Display the prompt for inspection.
print(improved_rag_prompt)

## 02.12 Generating a Response with the Improved RAG Prompt

This section demonstrates how to query the model using the **Improved RAG Prompt**. The enhanced prompt incorporates context retrieved from a vector database, along with structured instructions to guide the model's response generation.

### Workflow
1. **Improved Prompt**:
   - The `improved_rag_prompt` is generated using the `create_rag_prompt` function.
   - It combines:
     - The user query.
     - Context paragraphs retrieved using vector search.
     - Explicit instructions on how the model should use the provided context.

2. **Model Query**:
   - The `stream_markdown()` function sends the enhanced RAG prompt to the model (`llama3.2:3b`).
   - The response is streamed and displayed dynamically in Markdown format.

### Purpose
- To showcase how the model generates context-aware and citation-rich responses when provided with external knowledge.
- The improved RAG prompt enhances the model's ability to provide accurate, relevant, and coherent answers.

### Expected Output
- The response will cite specific paragraphs from the provided context using `[p{number}]` notation.
- It will synthesize information across multiple paragraphs, maintaining logical coherence.
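
One way to check the citation behaviour described above is to extract the `[p...]` markers from the response text and compare them with the IDs that were actually retrieved. The sketch below is a hedged illustration: it assumes the response is available as a plain string, the helper names `extract_citations` and `unknown_citations` are hypothetical, and the sample response and IDs are invented for the example.

```python
import re

def extract_citations(response_text):
    """Return distinct paragraph IDs cited as [p<id>], in order of first appearance."""
    seen = []
    for cited_id in re.findall(r"\[p([^\]\s]+)\]", response_text):
        if cited_id not in seen:
            seen.append(cited_id)
    return seen

def unknown_citations(response_text, retrieved_ids):
    """Return cited IDs that do not appear among the retrieved paragraph IDs."""
    known = {str(i) for i in retrieved_ids}
    return [c for c in extract_citations(response_text) if c not in known]

# Invented sample text standing in for a model response.
sample_response = (
    "Licenses grant permissions in advance [p12]. "
    "Sharing expands the commons [p47][p12], as noted in [p99]."
)
retrieved = ["12", "47"]  # placeholder IDs standing in for the retrieved paragraph IDs

cited = extract_citations(sample_response)
missing = unknown_citations(sample_response, retrieved)
print(cited)    # ['12', '47', '99']
print(missing)  # ['99']
```

A non-empty `missing` list suggests the model cited a paragraph that was never supplied, which is worth inspecting by hand.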

### Comparison with Previous Outputs
- Compare this output with the earlier responses to see how retrieved context and explicit citation instructions change the answer relative to the model's default knowledge-based response.

In [None]:
# 02.12 Generating a Response with the Improved RAG Prompt
# Query the model with the improved RAG prompt, which includes retrieved context and structured instructions.
response = stream_markdown(
    'llama3.2:3b',  # Model name
    improved_rag_prompt  # Enhanced RAG prompt combining the query and retrieved context
)