# RAG Pipeline with LlamaIndex and Qdrant

This notebook demonstrates a complete Retrieval-Augmented Generation (RAG) pipeline using:
- **LlamaIndex**: For indexing and retrieval orchestration
- **Qdrant**: As the vector database backend
- **Reranking**: To improve retrieval quality

The pipeline combines document retrieval with LLM-based generation for improved accuracy.

## 1. Import Required Libraries

In [1]:
def install_packages():
    packages = [
        "llama-index>=0.9.0",
        "llama-index-vector-stores-qdrant>=0.1.0",
        "llama-index-embeddings-huggingface>=0.1.0",
        "qdrant-client>=2.7.0",
        "sentence-transformers>=2.2.0",  # For SentenceTransformer reranking
        "python-dotenv>=1.0.0",
    ]
    
    for package in packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

# Uncomment to install packages
# install_packages()

In [2]:
# Import core libraries
import os
from typing import List, Optional
from dotenv import load_dotenv

# LlamaIndex imports
from llama_index.core import Document, VectorStoreIndex, Settings, SimpleDirectoryReader
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SentenceTransformerRerank

# Qdrant imports
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Additional imports
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

print("✓ Libraries imported successfully")

  from .autonotebook import tqdm as notebook_tqdm


✓ Libraries imported successfully


## 2. Initialize Qdrant Vector Database

In [3]:
# Initialize Qdrant Vector Database
VECTOR_COLLECTION_NAME = "documents"
VECTOR_DIMENSION = 384  # Dimension for HuggingFace embeddings

# Create an in-memory Qdrant instance (use ":memory:" for testing)
# For production, use: QdrantClient(url="http://localhost:6333")
qdrant_client = QdrantClient(":memory:")

# Recreate collection if it exists
try:
    qdrant_client.delete_collection(VECTOR_COLLECTION_NAME)
    logger.info(f"Deleted existing collection: {VECTOR_COLLECTION_NAME}")
except:
    pass

# Create new collection
qdrant_client.create_collection(
    collection_name=VECTOR_COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_DIMENSION,
        distance=Distance.COSINE
    ),
)

logger.info(f"✓ Created Qdrant collection: {VECTOR_COLLECTION_NAME}")
logger.info(f"  Vector dimension: {VECTOR_DIMENSION}")
logger.info(f"  Distance metric: COSINE")

INFO:__main__:Deleted existing collection: documents
INFO:__main__:✓ Created Qdrant collection: documents
INFO:__main__:  Vector dimension: 384
INFO:__main__:  Distance metric: COSINE


## 3. Set Up Embeddings and Configure Settings

In [4]:
# Initialize HuggingFace embeddings
# Using sentence-transformers model for semantic embeddings
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cache_folder="./models"
)

logger.info("✓ HuggingFace embedding model loaded")
logger.info(f"  Model: BAAI/bge-small-en-v1.5")

# Configure global settings for LlamaIndex
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50

logger.info("✓ LlamaIndex settings configured")

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
INFO:sentence_transformers.SentenceTransformer:1 prompt is loaded, with the key: query
INFO:__main__:✓ HuggingFace embedding model loaded
INFO:__main__:  Model: BAAI/bge-small-en-v1.5
INFO:__main__:✓ LlamaIndex settings configured


## 4. Load and Index Documents

In [5]:
# Load documents from the docs folder using LlamaIndex
docs_path = "../docs"

# Check if docs folder exists
if not os.path.exists(docs_path):
    logger.error(f"Docs folder not found at {docs_path}")
    logger.info("Please create a docs folder with markdown or text files")
    documents = []
else:
    # Use SimpleDirectoryReader to load all documents from the folder
    reader = SimpleDirectoryReader(input_dir=docs_path, recursive=True)
    documents = reader.load_data()
    logger.info(f"✓ Loaded {len(documents)} documents from {docs_path}")

if documents:
    print(f"\n✓ Successfully loaded {len(documents)} document(s)")
    print(f"  Docs folder: {os.path.abspath(docs_path)}\n")
    for i, doc in enumerate(documents, 1):
        title = doc.metadata.get('file_name', 'Unknown')
        content_preview = doc.text[:80].replace('\n', ' ') + "..." if len(doc.text) > 80 else doc.text.replace('\n', ' ')
        print(f"  {i}. {title}")
        print(f"     Preview: {content_preview}\n")
else:
    print("⚠ No documents loaded. Add markdown or text files to the docs folder.")
    print("  Example: ../docs/*.md")

INFO:__main__:✓ Loaded 32 documents from ../docs



✓ Successfully loaded 32 document(s)
  Docs folder: /Users/poornimata/Downloads/VectorDB/docs

  1. dspy.pdf
     Preview: Preprint DSP Y: C OMPILING DECLARATIVE LANGUAGE MODEL CALLS INTO SELF -I MPROVIN...

  2. dspy.pdf
     Preview: Preprint calls in existing LM pipelines and in popular developer frameworks are ...

  3. dspy.pdf
     Preview: Preprint 2 R ELATED WORK This work is inspired by the role that Torch (Collobert...

  4. dspy.pdf
     Preview: Preprint 3.1 N ATURAL LANGUAGE SIGNATURES CAN ABSTRACT PROMPTING & FINETUNING In...

  5. dspy.pdf
     Preview: Preprint ing Predict to ChainOfThought in the above program leads to a system th...

  6. dspy.pdf
     Preview: Preprint In DSPy, training sets may be small, potentially a handful of examples,...

  7. dspy.pdf
     Preview: Preprint ation of DSPy, we focus on demonstrations and find that simple rejectio...

  8. dspy.pdf
     Preview: Preprint Table 1: Results with in-context learning on GSM8K math word problems. ...



In [13]:
len(documents)

32

## 5. Create Vector Store Index with LlamaIndex

In [None]:
# Create a QdrantVectorStore instance
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=VECTOR_COLLECTION_NAME
)

logger.info("✓ QdrantVectorStore created")

# Create VectorStoreIndex from the vector store and documents
# This will automatically generate embeddings and store them in Qdrant
vector_index = VectorStoreIndex.from_documents(
    documents=documents,
    vector_store=vector_store,
    embed_model=embed_model,
)

logger.info("✓ VectorStoreIndex created and documents indexed")
logger.info(f"  Total documents indexed: {len(documents)}")

# Display index information
print(f"\nVector Index Information:")
print(f"  Collection: {VECTOR_COLLECTION_NAME}")
print(f"  Indexed Documents: {len(documents)}")
print(f"  Vector Dimension: {VECTOR_DIMENSION}")

INFO:__main__:✓ QdrantVectorStore created
INFO:__main__:✓ VectorStoreIndex created and documents indexed
INFO:__main__:  Total documents indexed: 32



Vector Index Information:
  Collection: documents
  Indexed Documents: 32
  Vector Dimension: 384


## 6. Configure Reranking

In [7]:
# Initialize SentenceTransformer Reranker
# The reranker will improve retrieval quality by reordering results based on semantic relevance
# No API key required - runs locally using sentence transformers

try:
    reranker = SentenceTransformerRerank(
        model="BAAI/bge-reranker-base",
        top_n=3,  # Keep top 3 results after reranking
    )
    reranker_available = True
    logger.info("✓ SentenceTransformer Reranker initialized")
    logger.info("  Model: BAAI/bge-reranker-base")
    logger.info("  Top N: 3")
    logger.info("  Runs locally - no API key required")
except Exception as e:
    logger.error(f"Could not initialize SentenceTransformer Reranker: {e}")
    logger.info("✓ RAG pipeline will work without reranking")
    reranker = None
    reranker_available = False

print("\n" + "="*50)
print("Reranking Configuration")
print("="*50)
if reranker_available:
    print("Status: ✓ SentenceTransformer Reranking ENABLED")
    print("Model: BAAI/bge-reranker-base")
    print("Note: Local reranking - no API key required")
else:
    print("Status: ⚠ SentenceTransformer Reranking DISABLED")
    print("Note: Install: pip install sentence-transformers")
print("="*50)

KeyboardInterrupt: 

## 7. Create RAG Query Engine

In [8]:
# Create a retriever from the vector index
retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=5,  # Retrieve top 5 similar documents
)

logger.info("✓ Vector Index Retriever created")
logger.info("  Similarity Top K: 5")

# Create the query engine with optional reranking
""" if reranker:
    query_engine = RetrieverQueryEngine(
        retriever=retriever,
        node_postprocessors=[reranker],  # Add reranker to improve results
    )
    logger.info("✓ RAG Query Engine created WITH reranking")
else: """
query_engine = RetrieverQueryEngine(
    retriever=retriever,
)
logger.info("✓ RAG Query Engine created WITHOUT reranking")

print("\n" + "="*50)
print("RAG Pipeline Configuration")
print("="*50)
print(f"Retriever: VectorIndexRetriever")
print(f"Vector Store: Qdrant (Collection: {VECTOR_COLLECTION_NAME})")
print(f"Similarity Top K: 5")
#print(f"Reranker: {'Cohere' if reranker else 'None'}")
print(f"Embedding Model: BAAI/bge-small-en-v1.5")
print("="*50)

INFO:__main__:✓ Vector Index Retriever created
INFO:__main__:  Similarity Top K: 5
INFO:__main__:✓ RAG Query Engine created WITHOUT reranking



RAG Pipeline Configuration
Retriever: VectorIndexRetriever
Vector Store: Qdrant (Collection: documents)
Similarity Top K: 5
Embedding Model: BAAI/bge-small-en-v1.5


## 8. Test RAG Pipeline with Retrieval

In [9]:
# Define sample queries to test the RAG pipeline
test_queries = [
    "What is machine learning?",
    "How does deep learning work?",
    "Tell me about vector databases",
    "What is RAG and why is it useful?",
]

# Test retrieval without generation
print("\n" + "="*70)
print("TESTING RAG PIPELINE - RETRIEVAL RESULTS")
print("="*70)

for i, query in enumerate(test_queries, 1):
    print(f"\n{'─'*70}")
    print(f"Query {i}: {query}")
    print(f"{'─'*70}")
    
    # Use retriever to get nodes
    retrieved_nodes = retriever.retrieve(query)
    
    print(f"Retrieved {len(retrieved_nodes)} documents:\n")
    for j, node in enumerate(retrieved_nodes, 1):
        score = node.score if hasattr(node, 'score') else "N/A"
        source = node.metadata.get('source', 'Unknown') if hasattr(node, 'metadata') else 'Unknown'
        text = node.text[:100] + "..." if len(node.text) > 100 else node.text
        
        print(f"  [{j}] Source: {source} | Score: {score}")
        print(f"      Text: {text}\n")


TESTING RAG PIPELINE - RETRIEVAL RESULTS

──────────────────────────────────────────────────────────────────────
Query 1: What is machine learning?
──────────────────────────────────────────────────────────────────────
Retrieved 5 documents:

  [1] Source: Unknown | Score: 0.6872841694952635
      Text: This is inspired by formative work by Bergstra et al.
(2010; 2013), Paszke et al. (2019), and Wolf e...

  [2] Source: Unknown | Score: 0.6810863004225625
      Text: In-context learning methods now routinely invoke tools, leading to LM pipelines that use retrieval
m...

  [3] Source: Unknown | Score: 0.6770315009422205
      Text: Preprint
calls in existing LM pipelines and in popular developer frameworks are generally implemente...

  [4] Source: Unknown | Score: 0.6687011254713254
      Text: Preprint
DSP Y: C OMPILING DECLARATIVE LANGUAGE
MODEL CALLS INTO SELF -I MPROVING PIPELINES
Omar Kha...

  [5] Source: Unknown | Score: 0.664805908749748
      Text: to logically connect the mo

## 9. Advanced: Creating a Full RAG Pipeline with LLM Generation

For complete RAG generation with an LLM, you would integrate with a language model provider:

In [10]:
# Configure Ollama LLM for RAG pipeline
from llama_index.llms.ollama import Ollama

# Initialize Ollama with llama3.2 1b model
llm = Ollama(
    base_url="http://localhost:11434",
    model="llama3.2:1b",
    temperature=0.7,
    context_window=2048,
    request_timeout=60.0,
)

logger.info("✓ Ollama LLM initialized")
logger.info("  Model: llama3.2:1b")
logger.info("  Base URL: http://localhost:11434")
logger.info("  Temperature: 0.7")
logger.info("  Context Window: 2048")

# Set the LLM in global settings
Settings.llm = llm

# Create the full RAG query engine with Ollama
rag_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[],
)

print("\n" + "="*70)
print("FULL RAG PIPELINE READY WITH OLLAMA LLAMA3.2 1B")
print("="*70)
print("Query Engine Configuration:")
print(f"  LLM: Ollama (llama3.2:1b)")
print(f"  Retriever: VectorIndexRetriever (top 5)")
#print(f"  Reranker: {'Cohere' if reranker else 'None'}")
print(f"  Vector Store: Qdrant")
print("="*70)

INFO:__main__:✓ Ollama LLM initialized
INFO:__main__:  Model: llama3.2:1b
INFO:__main__:  Base URL: http://localhost:11434
INFO:__main__:  Temperature: 0.7
INFO:__main__:  Context Window: 2048



FULL RAG PIPELINE READY WITH OLLAMA LLAMA3.2 1B
Query Engine Configuration:
  LLM: Ollama (llama3.2:1b)
  Retriever: VectorIndexRetriever (top 5)
  Vector Store: Qdrant


## 11. Test Full RAG Pipeline with Ollama LLM

In [11]:
# Test full RAG pipeline with Ollama generation
print("\n" + "="*70)
print("TESTING FULL RAG PIPELINE WITH OLLAMA GENERATION")
print("="*70)

# Test queries
rag_test_queries = [
    "What is machine learning?",
    "How does vector database work?",
    "Explain RAG in simple terms",
]

for i, query in enumerate(rag_test_queries, 1):
    print(f"\n{'─'*70}")
    print(f"Query {i}: {query}")
    print(f"{'─'*70}")
    
    try:
        # Execute query through full RAG pipeline
        response = rag_query_engine.query(query)
        
        print(f"\nGenerated Response:")
        print(f"{response}\n")
        
        # Display source documents
        print(f"Retrieved Documents Used:")
        if hasattr(response, 'source_nodes') and response.source_nodes:
            for j, node in enumerate(response.source_nodes, 1):
                source = node.metadata.get('source', 'Unknown') if hasattr(node, 'metadata') else 'Unknown'
                score = node.score if hasattr(node, 'score') else "N/A"
                print(f"  [{j}] {source} (Score: {score})")
        
    except Exception as e:
        logger.error(f"Error processing query: {e}")
        print(f"✗ Error: {e}")
        print("\nNote: Ensure Ollama is running with: ollama serve")
        print("And llama3.2:1b model is available: ollama run llama3.2:1b")

print(f"\n{'='*70}")
print("RAG pipeline testing complete!")
print("="*70)


TESTING FULL RAG PIPELINE WITH OLLAMA GENERATION

──────────────────────────────────────────────────────────────────────
Query 1: What is machine learning?
──────────────────────────────────────────────────────────────────────


INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"



Generated Response:
Machine learning is a broad field that encompasses various techniques for developing intelligent systems capable of learning and adapting to new information. At its core, machine learning involves algorithms and statistical models that enable machines to make predictions or take actions based on data and patterns learned from experience. This includes the use of language models, which are computer programs designed to understand, interpret, and generate human language.

Retrieved Documents Used:
  [1] Unknown (Score: 0.6872841694952635)
  [2] Unknown (Score: 0.6810863004225625)
  [3] Unknown (Score: 0.6770315009422205)
  [4] Unknown (Score: 0.6687011254713254)
  [5] Unknown (Score: 0.664805908749748)

──────────────────────────────────────────────────────────────────────
Query 2: How does vector database work?
──────────────────────────────────────────────────────────────────────


INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"



Generated Response:
The general operation of a vector database typically involves storing and manipulating vectors, which are mathematical representations of data points in a high-dimensional space. Vectors can be thought of as numerical summaries or descriptors of the underlying data, allowing for efficient querying and retrieval of specific vectors.

Vector databases often employ techniques such as dimensionality reduction, indexing, and similarity searching to facilitate fast and accurate operations on vector data. This enables applications like image recognition, natural language processing, and recommendation systems to leverage vector representations of words, images, or other types of data in a scalable and efficient manner.

Retrieved Documents Used:
  [1] Unknown (Score: 0.7472497618909842)
  [2] Unknown (Score: 0.7251891859935607)
  [3] Unknown (Score: 0.7166444358715993)
  [4] Unknown (Score: 0.7067593485576941)
  [5] Unknown (Score: 0.7022880451162229)

───────────────────

INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"



Generated Response:
RAG (Retrieval-Augmented Generation) is a technique used to enhance computer systems' performance, particularly in question answering or text generation tasks. It combines input from external knowledge sources with the output of another system to create a more comprehensive understanding of the topic at hand.

External knowledge sources are utilized to retrieve relevant information, which is then analyzed and integrated into the generated response by the model like DSPy program. This process improves accuracy and relevance of responses, allowing RAG systems to produce more precise answers in various domains.

Retrieved Documents Used:
  [1] Unknown (Score: 0.653691818245061)
  [2] Unknown (Score: 0.6235403138668445)
  [3] Unknown (Score: 0.6229857847746048)
  [4] Unknown (Score: 0.6225677549925328)
  [5] Unknown (Score: 0.6191630385709631)

RAG pipeline testing complete!


## 10. Summary and Key Components

In [None]:
print("""
╔════════════════════════════════════════════════════════════════════════════╗
║                    RAG PIPELINE ARCHITECTURE SUMMARY                       ║
╚════════════════════════════════════════════════════════════════════════════╝

1. DOCUMENT INGESTION
   └─ Load documents from various sources
   └─ Split into chunks (512 tokens with 50 token overlap)
   └─ Generate embeddings (BAAI/bge-small-en-v1.5)

2. VECTOR STORAGE (QDRANT)
   └─ Store embeddings in Qdrant vector database
   └─ Collection: documents
   └─ Vector dimension: 384
   └─ Distance metric: Cosine Similarity

3. RETRIEVAL
   └─ Query engine retrieves top 5 most similar documents
   └─ Uses vector similarity search for fast retrieval

4. RERANKING (OPTIONAL)
   └─ Cohere Rerank models improve relevance ordering
   └─ Reduces results to top 3 after reranking
   └─ Requires COHERE_API_KEY for API access

5. GENERATION (OPTIONAL)
   └─ Feed retrieved context to LLM
   └─ Supports: OpenAI, Ollama, HuggingFace, etc.
   └─ LLM generates answer grounded in retrieved documents

KEY FEATURES:
✓ No LangChain dependency
✓ Pure LlamaIndex implementation
✓ Qdrant vector database integration
✓ Cohere reranking for improved retrieval
✓ Modular and extensible architecture
✓ Supports custom embeddings and LLMs

NEXT STEPS:
1. Load your own documents
2. Configure embedding model if needed
3. Set COHERE_API_KEY for reranking (optional)
4. Integrate an LLM provider for generation
5. Customize retrieval parameters for your use case
╚════════════════════════════════════════════════════════════════════════════╝
""")