# RAG + Embeddings Endpoint

In this notebook, we'll learn how to use the kluster.ai [embeddings endpoint](https://docs.kluster.ai/api-reference/reference/#create-embeddings) in a Retrieval-Augmented Generation (RAG) pipeline with PDF document support, using [LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/).


**Models:**
1. **Embeddings**: Leveraging the [BAAI/bge-m3](https://platform.kluster.ai/models) model.
2. **Language Model (LLM) for Querying**: Utilizing the [meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8](https://platform.kluster.ai/playground?model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) model.


### Setup and Prerequisites


In [None]:
# Install the necessary packages, including the PDF reader for LlamaIndex
%pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like llama-index-readers-file requests

### API keys and Model endpoints

We'll use `getpass` to enter the kluster.ai API key securely, without displaying it on screen.

You can get an API key from your [kluster.ai account](https://platform.kluster.ai/apikeys).

In [None]:
import os
import logging
import sys
import requests
import json
from getpass import getpass
from pprint import pprint

# Get API key securely using getpass
KLUSTER_API_KEY = getpass("Enter your kluster.ai API key: ")
KLUSTER_BASE_URL = "https://api.kluster.ai/v1" # kluster.ai base URL

os.environ["OPENAI_API_KEY"] = KLUSTER_API_KEY # LlamaIndex uses this env var for OpenAI-compatible APIs
os.environ["OPENAI_API_BASE"] = KLUSTER_BASE_URL
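
If you rerun the notebook often, typing the key each time gets tedious. The sketch below is one possible variation: it checks an environment variable first and only falls back to the `getpass` prompt when the variable is unset. The variable name `KLUSTER_API_KEY` and the helper `load_api_key` are assumptions made for this illustration, not names defined by kluster.ai or LlamaIndex.

```python
import os
from getpass import getpass


def load_api_key(env_var="KLUSTER_API_KEY", prompt_fn=getpass):
    """Return the API key from the environment, or prompt for it.

    `env_var` is a hypothetical variable name chosen for this example.
    `prompt_fn` is injectable so the helper can be exercised without a terminal.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to a hidden interactive prompt
        key = prompt_fn("Enter your kluster.ai API key: ").strip()
    if not key:
        raise ValueError("An API key is required")
    return key
```

Calling `load_api_key()` in place of the bare `getpass(...)` call would keep the rest of the cell unchanged.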

## Embedding Demonstration

Let's first demonstrate how to **generate embeddings** directly using the kluster.ai dedicated endpoint.

This helps illustrate what embeddings look like and how they're used in RAG systems.

In [None]:
from openai import OpenAI

# Configure kluster.ai client
client = OpenAI(
    base_url=KLUSTER_BASE_URL,
    api_key=KLUSTER_API_KEY
)

# Generate embedding for our example text about Paris
sample_text = "The capital of France is Paris. It is known for the Eiffel Tower."

response = client.embeddings.create(
    model="BAAI/bge-m3",
    input=sample_text,
    encoding_format="float"
)

# Print the first 10 dimensions of the embedding vector
print(f"Sample text: '{sample_text}'")
print(f"Model used: {response.model}")
print(f"Embedding dimensions: {len(response.data[0].embedding)}")
print("\nFirst 10 dimensions of the embedding vector:")
print(response.data[0].embedding[:10])

# Show token usage information
print(f"\nToken usage: {response.usage.prompt_tokens} tokens")

Sample text: 'The capital of France is Paris. It is known for the Eiffel Tower.'
Model used: BAAI/bge-m3
Embedding dimensions: 1024

First 10 dimensions of the embedding vector:
[0.01739501953125, 0.048370361328125, -0.006679534912109375, 0.0302734375, -0.01477813720703125, -0.00627899169921875, -0.0053253173828125, 0.004657745361328125, -0.019866943359375, 0.03375244140625]

Token usage: 17 tokens
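
To make the link between these vectors and retrieval more concrete, here is a minimal sketch of cosine similarity, the kind of score commonly used to compare a query embedding with document embeddings. The three-dimensional vectors are toy values made up for illustration; real `BAAI/bge-m3` embeddings have 1024 dimensions, as shown above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumed values)
query_vec = [1.0, 0.0, 1.0]
related_vec = [2.0, 0.0, 2.0]    # points in the same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, related_vec))    # close to 1: very similar
print(cosine_similarity(query_vec, unrelated_vec))  # 0: no similarity
```

In a RAG system, the chunks whose embeddings score highest against the query embedding are the ones passed to the LLM as context.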


## Adding a PDF Document

To verify that our RAG pipeline works, we need a document to serve as the **knowledge base**.

We'll store the PDF in a `sample_pdfs` directory in the same folder as this notebook.

In [4]:
import urllib.request
import os

# Create a directory for our PDFs if it doesn't exist
pdf_dir = "sample_pdfs"
os.makedirs(pdf_dir, exist_ok=True)

# Download a sample PDF about polar bears (you can replace it with your own PDFs)
sample_pdf_url = "https://portals.iucn.org/library/sites/library/files/documents/SSC-OP-007.pdf"
pdf_path = os.path.join(pdf_dir, "polar_bears.pdf")

if not os.path.exists(pdf_path):
    print(f"Downloading sample PDF to {pdf_path}...")
    urllib.request.urlretrieve(sample_pdf_url, pdf_path)
    print("Download complete!")
else:
    print(f"Sample PDF already exists at {pdf_path}")

Sample PDF already exists at sample_pdfs/polar_bears.pdf


### Load document

In [16]:
# Import the directory reader, which picks a PDF parser based on the file extension
from llama_index.core import SimpleDirectoryReader

# Load documents from the PDF file
print(f"Loading PDF from {pdf_dir}...")
pdf_reader = SimpleDirectoryReader(input_dir=pdf_dir)
documents = pdf_reader.load_data()

print(f"Loaded {len(documents)} document(s) from PDF file")


Loading PDF from sample_pdfs...
Loaded 115 document(s) from PDF file


## Configure LlamaIndex Components

To use [LlamaIndex](https://docs.llamaindex.ai/en/stable/) with **kluster.ai**, we need to set up `OpenAILike` for the LLM and `OpenAILikeEmbedding` for the embedding model.

In [None]:
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.core import Settings

# Configure the LLM from kluster.ai with LlamaIndex
llm = OpenAILike(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    api_base=KLUSTER_BASE_URL,
    api_key=KLUSTER_API_KEY,
    is_chat_model=True
)

# Configure the embedding model from kluster.ai
embed_model = OpenAILikeEmbedding(
    model_name="BAAI/bge-m3",
    api_base=KLUSTER_BASE_URL, 
    api_key=KLUSTER_API_KEY
)

# Set the global settings for LlamaIndex
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512 # Set chunk size for document splitting
Settings.chunk_overlap = 20 # Set chunk overlap for document splitting

print("LlamaIndex LLM and Kluster AI Embedding Model configured.")
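
To give a feel for what `chunk_size` and `chunk_overlap` control, the sketch below splits a sequence into overlapping windows. It is a deliberate simplification: it counts plain list items (for example, words), whereas LlamaIndex's own splitter, to my understanding, counts tokens and tries to respect sentence boundaries. The helper `chunk_words` is hypothetical and not part of LlamaIndex.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split `words` into windows of `chunk_size` items that share
    `chunk_overlap` items with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the input
    return chunks


# Small example: 10 items, windows of 4 with an overlap of 1
for chunk in chunk_words(list(range(10)), chunk_size=4, chunk_overlap=1):
    print(chunk)
```

A larger overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of embedding more text.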

## Create an Index

Now we'll create a `VectorStoreIndex` from our PDF document.

- **Index**: A searchable structure built from your documents for fast similarity search.
- **Vector Store**: Stores the embeddings (vectors) for each document chunk, enabling rapid retrieval.

**Why this matters**: Creating a `VectorStoreIndex` allows our RAG pipeline to quickly find and use the most relevant content from the PDF, grounding LLM responses in real document data.
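
The idea behind a vector store can be sketched in a few lines of plain Python. The toy class below keeps normalized vectors in a list and ranks them by cosine similarity against a query vector. It is an illustration under simplifying assumptions, not how `VectorStoreIndex` is implemented; the class name, the chunk labels, and the three-dimensional vectors are all made up for this example.

```python
import math


class ToyVectorStore:
    """In-memory store that ranks entries by cosine similarity."""

    def __init__(self):
        self._entries = []  # list of (text, unit-length vector) pairs

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vec]

    def add(self, text, embedding):
        self._entries.append((text, self._normalize(embedding)))

    def search(self, query_embedding, top_k=2):
        """Return the `top_k` (score, text) pairs with the highest similarity."""
        q = self._normalize(query_embedding)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), text)
            for text, vec in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add("chunk A", [0.9, 0.1, 0.0])  # toy embeddings, assumed values
store.add("chunk B", [0.1, 0.9, 0.0])
store.add("chunk C", [0.0, 0.2, 0.9])

for score, text in store.search([0.8, 0.2, 0.1], top_k=2):
    print(f"{score:.3f}  {text}")
```

A production vector store adds persistence and approximate nearest-neighbor search, but the ranking principle is the same.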


In [7]:
from llama_index.core import VectorStoreIndex

# Create the index from the PDF document
print("Creating index from PDF document...")
index = VectorStoreIndex.from_documents(
    documents
)
print("Index created successfully!")

Creating index from PDF document...
Index created successfully!


### Query the Index and Compare with Non-RAG Responses

Now we'll compare responses generated with RAG (using our knowledge base) against direct LLM responses generated without any context.
First, let's create the query engine.

In [8]:
# Create a query engine for RAG
query_engine = index.as_query_engine()

# Function to get a direct response from the LLM without using RAG
def get_direct_llm_response(query):
    """Get a response directly from the LLM without using RAG"""
    return llm.complete(query).text

print("Query engines prepared. Ready to compare RAG vs non-RAG responses!")

Query engines prepared. Ready to compare RAG vs non-RAG responses!
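
Conceptually, the query engine retrieves the most relevant chunks and places them in the prompt ahead of the question, while the direct call sends the question alone. The sketch below illustrates that assembly step. The helper `build_rag_prompt` and its template wording are hypothetical, written only for this example; they are not LlamaIndex's actual prompt.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and a question into a single prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for text retrieved from the index
prompt = build_rag_prompt(
    "What does the document say about topic X?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt)
```

The non-RAG path in `get_direct_llm_response` skips this step entirely, which is why the model has to rely on its training data alone.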


### Test your RAG

Now we'll **compare two responses** to the same query: one from the RAG query engine, which draws on our knowledge base, and one from the LLM alone.
We'll ask specific questions that require information from the PDF, which the LLM alone is unlikely to answer accurately.

In [None]:
# Query about content from the polar bear PDF
pdf_query = "Fact check this: <quote> The NWT suggested caution regarding a proposal that polar bear hides be transportable to the U.S. on CITES permits. It was suggested that whalebone carvings and seal-skin products be considered first and then if there are no political problems, possibly consider polar bears.</quote> If you don't know, say 'I don't know'."

print(f"Query: {pdf_query}\n")

print("--- RAG Response (using our knowledge base) ---")
rag_response = query_engine.query(pdf_query)
print(f"{rag_response}")

print("--- Direct LLM Response (without RAG) ---")
direct_response = get_direct_llm_response(pdf_query)
print(direct_response)

Query: Fact check this: <quote> The NWT suggested caution regarding a proposal that polar bear hides be transportable to the U.S. on CITES permits. It was suggested that whalebone carvings and seal-skin products be considered first and then if there are no political problems, possibly consider polar bears.</quote> If you don't know, say 'I don't know'.

--- RAG Response (using our knowledge base) ---
The statement is true. The given context information contains the exact quote on page_label: 10, confirming that the NWT indeed suggested caution regarding the proposal to transport polar bear hides to the U.S. on CITES permits and recommended considering whalebone carvings and seal-skin products first.
--- Direct LLM Response (without RAG) ---
To fact-check the given quote, we need to verify its content and context. The quote appears to refer to a discussion or meeting involving the Northwest Territories (NWT) government or representatives, concerning the potential export of polar bear hid

Next, we run further queries against the knowledge base to see how well the RAG system retrieves relevant passages and grounds its answers in the PDF document.
This shows how retrieval-augmented generation compares with direct LLM responses.

In [13]:
# Query about a specific technical detail in the paper
technical_query = "What does the Toxicology and Monitoring of Pollutant Levels in Polar Bear Tissue says about the CHC levels? IMPORTANT: If you don't know, say 'I don't know'."

print(f"Query: {technical_query}\n")

print("--- RAG Response (using our knowledge base) ---")
rag_response = query_engine.query(technical_query)
print(f"{rag_response}\n")

print("--- Direct LLM Response (without RAG) ---")
direct_response = get_direct_llm_response(technical_query)
print(direct_response)

Query: What does the Toxicology and Monitoring of Pollutant Levels in Polar Bear Tissue says about the CHC levels? IMPORTANT: If you don't know, say 'I don't know'.

--- RAG Response (using our knowledge base) ---
The levels of CHCs were generally inversely correlated to latitude, and reanalysis of polar bear fat samples showed that the level of most CHCs, especially chlordane compounds, had increased from 1969 to 1984 in Hudson Bay and Baffin Bay bears.

--- Direct LLM Response (without RAG) ---
I don't know the specific details about what the Toxicology and Monitoring of Pollutant Levels in Polar Bear Tissue says about the CHC levels. If you're looking for accurate information on this topic, I recommend consulting the original research or a reliable scientific summary.


In [14]:
# Query about authors and publication details
authors_query = "Who are the authors of the Polar Bear Paper?. IMPORTANT: If you don't know, say 'I don't know'."

print(f"Query: {authors_query}\n")

print("--- RAG Response (using our knowledge base) ---")
rag_response = query_engine.query(authors_query)
print(f"{rag_response}\n")

print("--- Direct LLM Response (without RAG) ---")
direct_response = get_direct_llm_response(authors_query)
print(direct_response)

Query: Who are the authors of the Polar Bear Paper?. IMPORTANT: If you don't know, say 'I don't know'.

--- RAG Response (using our knowledge base) ---
Steven C. Amstrup and Oystein Wiig are the compilers and editors of the Polar Bear publication, as mentioned on page 3. However, the authors of specific papers or research mentioned in the document include Stirling, Schweinsburg, Kolenosky, Juniper, Robertson, Luttich, Calvelt, Sjare, Taylor, Bunnell, DeMaster, and Smith. Without more specific information about the "Polar Bear Paper", it's difficult to provide a definitive answer. Therefore, a more accurate response would be that the compilers and editors are Steven C. Amstrup and Oystein Wiig, but there are multiple authors for the various research papers cited.

--- Direct LLM Response (without RAG) ---
I don't know.


## Conclusion

This notebook demonstrated a RAG system built with LlamaIndex and kluster.ai that uses a PDF document as its knowledge source. We've seen:

1. **How embeddings work**: We generated an embedding with the BAAI/bge-m3 model and inspected its dimensions and first values.
2. **PDF integration**: We loaded and processed an IUCN publication on polar bears as our knowledge base.
3. **RAG vs. Direct LLM**: We compared responses from our RAG system to direct LLM outputs.

**Key observations:**
- RAG responses include specific information from the PDF that may not be in the LLM's training data
- For queries about details in the paper, RAG provides more precise and accurate answers
- RAG helps ground the model's responses in the actual content of the document rather than relying on the model's pre-trained knowledge

**Next steps:**
- Try with your own PDFs or other document types
- Experiment with different chunking strategies to optimize retrieval