# RAG Pipeline - Linear Execution Format

**🎓 Practice Implementation**

This is a simplified, hands-on practice execution of a RAG (Retrieval-Augmented Generation) pipeline using the most basic components possible. The focus is on understanding the core workflow, not production-grade code.

## What's Implemented:

- **Vector Database:** Qdrant
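- **Embedding Model:** beautyyuyanli/multilingual-e5-large (via Replicate); nateraw/bge-large-en-v1.5 was the first choice, but its cold start did not work as expected
- **LLM:** openai/gpt-5-nano (via Replicate)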
- **Dataset:** atitaarora/qdrant_doc
- **Evaluation:** RAGAS

## What's Intentionally Simplified:

⚠️ This notebook **purposefully omits**:
- ❌ Proper Python file structure & organization
- ❌ Complex text parsers or NLP preprocessing
- ❌ Advanced embedding models or fine-tuning
- ❌ Production-ready error handling
- ❌ Scalable database schemas
- ❌ Comprehensive logging & monitoring

**Goal:** Learn the RAG pipeline fundamentals by building it from scratch with minimal dependencies.

## Step 1: Install Dependencies


In [None]:
# Install required packages
%pip install -q qdrant-client
%pip install -q langchain
%pip install -q langchain-community
%pip install -q replicate
%pip install -q datasets
%pip install -q ragas
%pip install -q openai
%pip install -q sentence-transformers
%pip install -q numpy pandas tqdm

## Step 2: Import Libraries

In [1]:
import os
import replicate
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
from tqdm import tqdm
import json


## Step 3: Set API Key

In [2]:
with open("notebook_config.json", "r") as f:
    config = json.load(f)

In [3]:
replicate_api_token = config["api_keys"]["replicate_api_token"]
openai_api_key = config["api_keys"]["openai_api_key"]
qdrant_url = config["qdrant"]["url"]
qdrant_api_key = config["qdrant"]["api_key"]

os.environ["REPLICATE_API_TOKEN"] = replicate_api_token
print("✅ Configuration loaded from notebook_config.json")

✅ Configuration loaded from notebook_config.json
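
The cell above assumes `notebook_config.json` has a particular nested layout. As a minimal sketch, the check below reports which of the four keys read in this notebook are missing, so a bad config fails before any API call. `REQUIRED_KEYS` and `validate_config` are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
# Keys this notebook reads from notebook_config.json (taken from the cell above).
REQUIRED_KEYS = {
    "api_keys": ["replicate_api_token", "openai_api_key"],
    "qdrant": ["url", "api_key"],
}


def validate_config(config):
    """Return a list of missing 'section.key' paths; an empty list means the config is usable."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section, {})
        for key in keys:
            if not values.get(key):
                missing.append(f"{section}.{key}")
    return missing


# Example with placeholder values (not real credentials).
example_config = {
    "api_keys": {"replicate_api_token": "r8_xxx", "openai_api_key": "sk-xxx"},
    "qdrant": {"url": "https://example.invalid:6333"},
}
print(validate_config(example_config))  # ['qdrant.api_key']
```

Returning the missing paths instead of raising on the first one makes it easier to fix the whole file in one pass.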


## Step 4: Load Dataset


In [4]:
# Load the Qdrant documentation dataset
dataset = load_dataset("atitaarora/qdrant_doc", split="train")
print(f"Loaded {len(dataset)} documents")
print(f"Dataset columns: {dataset.column_names}")

Loaded 491 documents
Dataset columns: ['text', 'source']


## Step 4.1: Replicate API test

In [5]:
# Validate the Replicate API token with a single embedding call
print("🔍 Validating Replicate API Token...\n")
test_text = "This is a test sentence for embedding generation."

test_output = replicate.run(
    "beautyyuyanli/multilingual-e5-large:a06276a89f1a902d5fc225a9ca32b6e8e6292b7f3b136518878da97c458e2bad",
    input={
        "text": test_text,
        "batch_size": 32,
        "normalize_embeddings": True,
    },
)
print("✅ API call successful!")
print(f"Output type: {type(test_output)}")
print(f"Output length: {len(test_output) if test_output else 'None'}")
if test_output:
    # Index 1 is the second element of the returned list
    print(f"First 5 values: {test_output[1][:5]}")
else:
    print("⚠️  Output is None or empty!")

🔍 Validating Replicate API Token...

✅ API call successful!
Output type: <class 'list'>
Output length: 3
First 5 values: [0.001451187883503735, -0.0232482198625803, -0.01903114840388298, -0.03325076773762703, 0.012469109147787094]


## Step 5: Extract Text from Dataset

In [6]:
# Extract documents from dataset
documents = []
for item in dataset:
    if 'text' in item:
        documents.append(item['text'])
    elif 'content' in item:
        documents.append(item['content'])
    else:
        # Try to get the first string field
        for key, value in item.items():
            if isinstance(value, str) and len(value) > 50:
                documents.append(value)
                break

print(f"Extracted {len(documents)} documents")
print(f"Average document length: {np.mean([len(doc) for doc in documents]):.0f} characters")


Extracted 491 documents
Average document length: 7345 characters


## Step 6: Split Documents into Chunks


In [7]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split documents
chunks = []
for doc in documents:
    splits = text_splitter.split_text(doc)
    chunks.extend(splits)


print(f"Created {len(chunks)} chunks from {len(documents)} documents")
print(f"Average chunk length: {np.mean([len(chunk) for chunk in chunks]):.0f} characters")
print(f"\nExample chunk:")
print(chunks[0][200:250])


Created 4877 chunks from 491 documents
Average chunk length: 840 characters

Example chunk:
ogo/voiceflow.svg

  -  /img/customers-logo/bosch-
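
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that uses a fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first breaks on the listed separators and then merges pieces, which is why the average chunk above (840 characters) is shorter than `chunk_size`. `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = "abcdefghij" * 3  # 30 characters
pieces = sliding_window_chunks(demo, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(len(piece), piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 4 characters of one piece are the first 4 of the next.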


## Step 7: Create Embedding Function Using Replicate


In [8]:
def get_embeddings_replicate(texts):
    """
    Get embeddings for a list of texts using the multilingual-e5-large model on Replicate.
    Returns one flat embedding vector (a list of floats) per input text.
    """
    embeddings = []

    for text in tqdm(texts, desc="Generating embeddings"):
        try:
            output = replicate.run(
                # "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1", - a better model, but its cold start is not working as expected.
                "beautyyuyanli/multilingual-e5-large:a06276a89f1a902d5fc225a9ca32b6e8e6292b7f3b136518878da97c458e2bad",
                input={
                    "text": text,
                    "batch_size": 32,
                    "normalize_embeddings": True
                }
            )
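            # The API returns a list of vectors, so pick one by index
            if isinstance(output, list) and len(output) > 1:
                # Index 1 is used because the Step 4.1 debug output was read from test_output[1].
                # NOTE: Step 4.1 returned 3 elements for a single input text, so this index
                # may not correspond to the text that was sent.
                embedding = output[1]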
            elif isinstance(output, list) and len(output) == 1:
                # Fallback: if only one element, use it
                embedding = output[0]
            else:
                # Direct assignment if already correct format
                embedding = output

            # Ensure it's a flat list of floats
            if isinstance(embedding, list) and len(embedding) > 0:
                if isinstance(embedding[0], list):
                    # Still nested, unwrap once more
                    embedding = embedding[0]

            embeddings.append(embedding)

        except Exception as e:
            print(f"Error getting embedding: {e}")
            # Use zero vector as fallback (1024 dimensions for this model)
            embeddings.append([0.0] * 1024)

    return embeddings

print("✅ Embedding function defined with proper unwrapping")

✅ Embedding function defined with proper unwrapping
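
The unwrapping logic appears both in the embedding function and again in the retrieval function later on. A minimal sketch of one shared helper is shown below. `flatten_embedding` and its `expected_dim` parameter are hypothetical names, and unlike the notebook's function it always takes the first element at each nesting level, which is only appropriate when a single text was embedded.

```python
def flatten_embedding(output, expected_dim=None):
    """Unwrap nested lists until a flat list of numbers remains.

    Always takes the first element at each nesting level, which is only
    correct when the API was asked to embed a single text.
    """
    embedding = output
    while isinstance(embedding, list) and embedding and isinstance(embedding[0], list):
        embedding = embedding[0]
    if not isinstance(embedding, list) or not embedding:
        raise ValueError(f"Expected a non-empty list, got {type(embedding).__name__}")
    if not all(isinstance(x, (int, float)) for x in embedding):
        raise ValueError("Embedding contains non-numeric values")
    if expected_dim is not None and len(embedding) != expected_dim:
        raise ValueError(f"Expected {expected_dim} dimensions, got {len(embedding)}")
    return embedding


print(flatten_embedding([[0.1, 0.2, 0.3]]))    # [0.1, 0.2, 0.3]
print(flatten_embedding([[[1.0, 2.0]]]))       # [1.0, 2.0]
```

Raising on a malformed result, rather than silently passing it on, keeps bad vectors out of the database.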


### 🔍 Debug: Check Query Embedding Format


In [10]:
# Debug: Check what format the query embedding has
test_query = "test query"
test_embedding = get_embeddings_replicate([test_query])

print(f"Type of result: {type(test_embedding)}")
print(f"Length: {len(test_embedding)}")
print(f"Type of first element: {type(test_embedding[0])}")

# Check if it's nested
if test_embedding[0] is not None:
    if isinstance(test_embedding[0], list):
        print(f"First element is a list with length: {len(test_embedding[0])}")
        print(f"First 5 values: {test_embedding[0][:5]}")

        # Check if it's double-nested
        if len(test_embedding[0]) > 0 and isinstance(test_embedding[0][0], list):
            print("⚠️  ISSUE: Double-nested list detected!")
            print("Actual embedding is at: test_embedding[0][0]")
        else:
            print("✅ Correctly formatted - single list of floats")
    else:
        print(f"First element type: {type(test_embedding[0])}")
else:
    print("❌ First element is None!")


Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.25it/s]

Type of result: <class 'list'>
Length: 1
Type of first element: <class 'list'>
First element is a list with length: 1024
First 5 values: [0.001451187883503735, -0.0232482198625803, -0.01903114840388298, -0.03325076773762703, 0.012469109147787094]
✅ Correctly formatted - single list of floats





## Step 8: Generate Embeddings for a Sample of Chunks

In [13]:
# For demo purposes, let's use a subset of chunks
# Remove this limit for production use
sample_size = min(25, len(chunks))  # Adjust based on your needs
chunks_sample = chunks[:sample_size]

print(f"Processing {len(chunks_sample)} chunks for embeddings...")
print("This may take a few minutes...")

# Generate embeddings
chunk_embeddings = get_embeddings_replicate(chunks_sample)

print(f"\nGenerated {len(chunk_embeddings)} embeddings")
print(f"Embedding dimension: {len(chunk_embeddings[0])}")



Processing 25 chunks for embeddings...
This may take a few minutes...


Generating embeddings: 100%|██████████| 25/25 [00:13<00:00,  1.86it/s]


Generated 25 embeddings
Embedding dimension: 1024
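
Each chunk triggers one remote call, and the embedding function falls back to a zero vector on any exception. Since the code comments mention cold-start trouble with hosted models, a retry with exponential backoff may be a gentler first response than a zero vector. The sketch below is generic and does not call Replicate; `call_with_retry` and the `flaky` stand-in are hypothetical names.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, doubling the delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            sleep(delay)


# Stand-in for a remote call that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("model is still booting")
    return [0.1, 0.2, 0.3]


result = call_with_retry(flaky, base_delay=0.01)
print(result)  # [0.1, 0.2, 0.3]
```

In the notebook, `fn` could be a `lambda` wrapping the existing `replicate.run(...)` call, with the zero-vector fallback kept only for the case where all attempts fail.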





### 🔍 Debug: Check That All Embeddings Were Returned

In [18]:
# Debug: Check what's in chunk_embeddings
print(f"Number of embeddings: {len(chunk_embeddings)}")
print(f"First embedding type: {type(chunk_embeddings[0])}")
print(f"First embedding value: {chunk_embeddings[0][:5]}")

# Count how many are None
none_count = sum(1 for emb in chunk_embeddings if emb is None)
print(f"\n⚠️  Number of None embeddings: {none_count}")

# Check if any are valid
valid_count = sum(1 for emb in chunk_embeddings if emb is not None and isinstance(emb, list))
print(f"✅ Number of valid embeddings: {valid_count}")


Number of embeddings: 25
First embedding type: <class 'list'>
First embedding value: [0.001451187883503735, -0.0232482198625803, -0.01903114840388298, -0.03325076773762703, 0.012469109147787094]

⚠️  Number of None embeddings: 0
✅ Number of valid embeddings: 25
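
The check above only counts `None` values, so it would not notice if every chunk received the same vector or if the zero-vector fallback was used. The first five values printed here match those printed for the test sentence and the test query earlier, which makes such a check worth adding. A minimal sketch, with hypothetical helper names:

```python
def count_distinct_embeddings(embeddings, decimals=6):
    """Count how many distinct vectors exist after rounding each value."""
    seen = {tuple(round(x, decimals) for x in emb) for emb in embeddings}
    return len(seen)


def count_zero_vectors(embeddings):
    """Count all-zero vectors, i.e. the fallback used when an API call failed."""
    return sum(1 for emb in embeddings if all(x == 0 for x in emb))


sample = [[0.1, 0.2], [0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
print(count_distinct_embeddings(sample))  # 3
print(count_zero_vectors(sample))         # 1
```

For 25 different chunks, `count_distinct_embeddings(chunk_embeddings)` would be expected to be close to 25; a result of 1 would mean every chunk got the same vector.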


## Step 9: Initialize Qdrant Client (Cloud Mode)


In [19]:
qdrant_client = QdrantClient(
    url=qdrant_url,
    api_key=qdrant_api_key
)
print("✅ Connected to Qdrant")

✅ Connected to Qdrant


## Step 10: Create Qdrant Collection

In [28]:
# Collection name
collection_name = "qdrant_docs"

# Get embedding dimension
embedding_dim = len(chunk_embeddings[0])
print(f"Embedding dimension: {embedding_dim}")

# Create collection
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE)
)
print(f"Created collection '{collection_name}' with dimension {embedding_dim}")

Embedding dimension: 1024
Created collection 'qdrant_docs' with dimension 1024


## Step 11: Upload Vectors to Qdrant

In [29]:
# Prepare points for upload
points = []
for idx, (chunk, embedding) in enumerate(zip(chunks_sample, chunk_embeddings)):
    point = PointStruct(
        id=idx,
        vector=embedding,
        payload={"text": chunk, "chunk_id": idx}
    )
    points.append(point)

# Upload to Qdrant
qdrant_client.upsert(
    collection_name=collection_name,
    points=points
)

print(f"Uploaded {len(points)} vectors to Qdrant")

# Verify collection
collection_info = qdrant_client.get_collection(collection_name)
print(f"Collection info: {collection_info.points_count} points")

Uploaded 25 vectors to Qdrant
Collection info: 25 points
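
A single `upsert` is fine for 25 points, but the full dataset produces 4877 chunks, and sending them in one request may be slow or exceed request size limits. One common approach is to upload in fixed-size batches. The sketch below shows only the batching logic with stand-in data; `batched` is a hypothetical helper and the batch size is an arbitrary choice.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the list of PointStruct objects built above.
fake_points = list(range(10))
for batch in batched(fake_points, batch_size=4):
    # In the notebook this would be:
    # qdrant_client.upsert(collection_name=collection_name, points=batch)
    print(len(batch), batch)
```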


## ✅ Qdrant Collection Status

**Collection initialized successfully!**

- ✅ Collection Name: `qdrant_docs`
- ✅ Vectors uploaded: 25 embeddings (1024 dimensions each)
- ✅ Verified in Qdrant Cloud UI

The vector database is now ready for semantic search queries.

## Step 12: Define Retrieval Function


In [34]:
# Fixed retrieval function with proper embedding format handling
def retrieve_documents(query, top_k=3):
    """
    Retrieve top_k most relevant documents for a query
    """
    # Get query embedding from Replicate
    query_embedding_result = get_embeddings_replicate([query])

    # Extract the actual embedding (handle nested structure)
    query_embedding = query_embedding_result[0]

    # If it's still nested (double list), unwrap it
    if isinstance(query_embedding, list) and len(query_embedding) > 0:
        if isinstance(query_embedding[0], list):
            # Double nested - take first element
            query_embedding = query_embedding[0]

    # Ensure it's a flat list of floats
    if not isinstance(query_embedding, list):
        raise ValueError(f"Expected list, got {type(query_embedding)}")

    if len(query_embedding) == 0:
        raise ValueError("Empty embedding vector")

    # Verify it contains numbers
    if isinstance(query_embedding[0], list):
        raise ValueError("Embedding is still nested! Check embedding generation.")

    # Search in Qdrant (query_points replaces the deprecated search method)
    search_results = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=top_k
    ).points

    # Extract documents
    retrieved_docs = []
    for result in search_results:
        retrieved_docs.append({
            "text": result.payload["text"],
            "score": result.score,
            "chunk_id": result.payload["chunk_id"]
        })

    return retrieved_docs

print("✅ Fixed retrieval function defined!")


✅ Fixed retrieval function defined!
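
Because the collection was created with `Distance.COSINE`, the `score` returned for each hit is the cosine similarity between the query vector and the stored vector. The plain-Python sketch below shows how that number behaves; `cosine_similarity` is an illustrative helper, not how Qdrant computes it internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 4))  # 1.0 (same direction)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 4))  # 0.0 (orthogonal)
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 4))  # 1.0 (same direction, different length)
```

A score of exactly 1.0 therefore means the query and the stored vector point in the same direction, which for unrelated texts is a sign that something upstream is wrong.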


## Step 13: Test Retrieval


In [41]:
# Test query
test_query = "How does Qdrant handle hybrid search?"

print(f"Query: {test_query}\n")
print("Retrieving documents...\n")

retrieved_docs = retrieve_documents(test_query, top_k=3)

print("Retrieved documents:\n")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"Document {i} (Score: {doc['score']:.4f})")
    print(f"Text: {doc['text']}...\n")
    # break

Query: How does Qdrant handle hybrid search?

Retrieving documents...



Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.21it/s]

Retrieved documents:

Document 1 (Score: 1.0000)
Text: ---

logos:

  -  /img/customers-logo/flipkart.svg

  -  /img/customers-logo/x.svg

  -  /img/customers-logo/quora.svg

sitemapExclude: true

---...

Document 2 (Score: 1.0000)
Text: image:

    src: /img/customers-case-studies/case-study.png

    alt: Preview

cases:

- id: 0

  logo:

    src: /img/customers-case-studies/visua.svg

    alt:  Visua Logo

  image:

    src: /img/customers-case-studies/case-visua.png

    alt: The hands of a person in a medical gown holding a tablet against the background of a pharmacy shop

  title: VISUA improves quality control process for computer vision with anomaly detection by 10x.

  link:

    text: Read Story

    url: /blog/case-study-visua/

- id: 1

  logo:

    src: /img/customers-case-studies/dust.svg

    alt: Dust Logo

  image:

    src: /img/customers-case-studies/case-dust.png

    alt: A man in a jeans shirt is holding a smartphone, only his hands are visible. In the foreground,




## Step 14: Define LLM Function (OpenAI via Replicate)


In [42]:
# LLM generation function: calls the openai/gpt-5-nano model through Replicate
def generate_response(query, context_docs):
    """
    Generate response using LLM via Replicate
    """
    # Build context from retrieved documents
    context = "\n\n".join([doc["text"] for doc in context_docs])

    # Create prompt
    prompt = f"""You are a helpful assistant. Use the following context to answer the question.

Context:
{context}

Question: {query}

Answer: Provide a comprehensive answer based on the context above. If the context doesn't contain enough information, say so."""

    # Generate response with the openai/gpt-5-nano model hosted on Replicate
    try:
        output = replicate.run(
            "openai/gpt-5-nano",
            input={
                "prompt": prompt,
                "max_tokens": 500,
                "temperature": 0.7
            }
        )

        # Concatenate output if it's a generator
        if hasattr(output, '__iter__') and not isinstance(output, str):
            response = "".join(output)
        else:
            response = output

        return response
    except Exception as e:
        print(f"Error generating response: {e}")
        return "Error: Could not generate response"

print("LLM function defined")


LLM function defined
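
`generate_response` joins every retrieved chunk into the prompt without any size limit. With `top_k=3` and 1000-character chunks that is harmless, but a larger `top_k` could overflow the model's context. A minimal sketch of a character budget is shown below; `build_context` and the `max_chars` default are hypothetical, and a character count is only a rough proxy for tokens.

```python
def build_context(context_docs, max_chars=3000, separator="\n\n"):
    """Join document texts in order, stopping before the character budget is exceeded."""
    parts = []
    used = 0
    for doc in context_docs:
        text = doc["text"]
        extra = len(text) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(text)
        used += extra
    return separator.join(parts)


docs = [{"text": "a" * 10}, {"text": "b" * 10}, {"text": "c" * 10}]
context = build_context(docs, max_chars=25)
print(len(context))  # 22: two documents plus one separator
```

Since the retrieved documents arrive sorted by score, stopping at the first one that does not fit keeps the most relevant chunks.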


## Step 15: Complete RAG Pipeline Execution


In [None]:
# Complete RAG pipeline
query = "What is Qdrant and how does it handle vector search?"

print(f"Query: {query}\n")
print("=" * 80)

# Step 1: Retrieve
print("\nStep 1: Retrieving relevant documents...")
retrieved_docs = retrieve_documents(query, top_k=3)
print(f"Retrieved {len(retrieved_docs)} documents")

# Step 2: Generate
print("\nStep 2: Generating response...")
response = generate_response(query, retrieved_docs)

# Display results
print("\n" + "=" * 80)
print("RESPONSE:")
print("=" * 80)
print(response)
print("\n" + "=" * 80)



Query: What is Qdrant and how does it handle vector search?


Step 1: Retrieving relevant documents...


Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  3.04it/s]


Retrieved 3 documents

Step 2: Generating response...

RESPONSE:
Based on the provided context, there isn’t a direct, explicit definition of what Qdrant is or how it handles vector search. However, there are surrounding hints from user testimonials and case studies that imply its role and capabilities:

- Several customers reference Qdrant in the context of fast document retrieval and similarity search:
  - Leonard Püttmann (data scientist) says: “Amidst the hype around vector databases, Qdrant is by far my favorite one. It’s super fast (written in Rust) and open-source! At Kern AI we use Qdrant for fast document retrieval and to do quick similarity search for text data.”
  - Stanislas Polu (Dust) states: “Qdrant's the best. By. Far.”
  - A Dust case study mentions: “Dust uses Qdrant for RAG, achieving millisecond retrieval, reducing costs by 50%, and boosting scalability.”

From these, we can infer the following about Qdrant (as a product category and its typical use 

In [48]:
# Display sources
print("\nSOURCES:")
print("=" * 80)
for i, doc in enumerate(retrieved_docs, 1):
    print(f"\nSource {i} (Relevance: {doc['score']:.4f}):")
    print(doc['text'][:300] + "...")



SOURCES:

Source 1 (Relevance: 1.0000):
---

logos:

  -  /img/customers-logo/flipkart.svg

  -  /img/customers-logo/x.svg

  -  /img/customers-logo/quora.svg

sitemapExclude: true

---...

Source 2 (Relevance: 1.0000):
image:

    src: /img/customers-case-studies/case-study.png

    alt: Preview

cases:

- id: 0

  logo:

    src: /img/customers-case-studies/visua.svg

    alt:  Visua Logo

  image:

    src: /img/customers-case-studies/case-visua.png

    alt: The hands of a person in a medical gown holding a tab...

Source 3 (Relevance: 1.0000):
- id: 9

  name: Leonard Püttmann

  position: data scientist

  avatar:

    src: /img/customers/leonard-puttmann.svg

    alt: Avatar

  text: Amidst the hype around vector databases, Qdrant is by far my favorite one. It's super fast (written in Rust) and open-source! At Kern AI we use Qdrant for ...


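### Next steps

1. Load and embed the entire dataset instead of the 25-chunk sample.
2. Compare the chunks with those in the other notebook from the workshops.
3. Run the pipeline against the eval section to see if it's still failing.
4. Investigate why every similarity score is 1.0000. The same first five embedding values are printed for the test sentence, the test query, and the chunk embeddings, which suggests the model returns the same vectors regardless of the input text.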
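
One hypothesis for item 4, stated as an assumption rather than a confirmed fix: the embedding model may not read the `text` field at all and may instead expect a `texts` field holding a JSON-encoded list of strings, in which case it would fall back to built-in default inputs. That would also explain why a single input returned 3 elements. The field name and format below are assumptions that should be checked against the model's input schema on Replicate before use; `build_embedding_input` is a hypothetical helper.

```python
import json


def build_embedding_input(texts, batch_size=32, normalize_embeddings=True):
    """Build an input payload with the texts serialized as one JSON list.

    ASSUMPTION: the field name "texts" and its JSON-string format are not
    confirmed by this notebook; check the model's input schema first.
    """
    return {
        "texts": json.dumps(list(texts)),
        "batch_size": batch_size,
        "normalize_embeddings": normalize_embeddings,
    }


payload = build_embedding_input(["first chunk", "second chunk"])
print(payload["texts"])  # ["first chunk", "second chunk"]
```

If the assumption holds, the response would contain one vector per input text in the same order, so several chunks could be embedded per call and the `output[1]` indexing would no longer be needed.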