
### 🧠 What is Qdrant?

Qdrant (pronounced “quadrant”) is a **vector database**: a database designed to store numerical vectors (embeddings) and **search them by similarity**, which lets you find text by meaning instead of exact words.

---

### 🍕 Real-World Analogy: Qdrant is Like a Smart Pizza Finder

#### Imagine this:

You own a massive **pizza recipe book** with **10,000+ pages**. Each page has one pizza recipe.

Now you want to find:

> “A pizza that uses spicy meat, not too cheesy, and has thin crust.”

### 🔍 What a regular database would do:

It looks for **exact words** like “spicy”, “cheesy”, “crust” and gives you any recipe that contains those words — even if it’s irrelevant.

### 🧠 What Qdrant does:

* It **converts each recipe** into a **mathematical vector** that captures the **meaning** of the recipe (spicy, texture, toppings, etc.).
* It also converts your search query into a vector.
* Then it **compares meanings**, not just words, and shows you the **closest match**.

👉 You say “spicy meat, thin crust”, and it gives you:
🧀 “Calabrese Pepperoni Pizza” — even if the words don’t match exactly.
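
To make "comparing meanings" concrete, here is a minimal sketch of nearest-neighbour ranking by cosine similarity. The three-number vectors and their axes (spiciness, cheesiness, crust thinness) are made up for illustration; a real embedding model produces hundreds of dimensions with no human-readable axes, and Qdrant uses an approximate index rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings": (spiciness, cheesiness, crust thinness)
recipes = {
    "Calabrese Pepperoni Pizza": [0.9, 0.3, 0.8],
    "Four Cheese Deep Dish": [0.1, 0.95, 0.1],
    "Margherita": [0.2, 0.6, 0.7],
}
query = [0.9, 0.2, 0.9]  # "spicy meat, not too cheesy, thin crust"

# Score every recipe against the query and sort best-first
ranked = sorted(
    ((name, cosine_similarity(query, vec)) for name, vec in recipes.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

The spicy, thin-crust recipe ranks first even though no words were compared, only vector directions.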

---

### 💼 In Your Case:

* **PDF policies** are like your recipe book.
* **Each policy rule** is like a page or paragraph.
* **Qdrant helps your AI system quickly find the most relevant policy rule** when someone types:

  > “Can I email sensitive info over Gmail?”

Even if the policy never mentions “Gmail” by name, Qdrant can surface semantically related rules such as:

> “Public email services must not be used for confidential data.”

---

### ✅ Why Use Qdrant (vs traditional DBs)?

| Feature           | Qdrant (Vector DB)                 | SQL/NoSQL (Traditional)        |
| ----------------- | ---------------------------------- | ------------------------------ |
| Search by meaning | ✅ Yes (semantic search)            | ❌ Only keyword match           |
| Scalable for AI   | ✅ Yes, handles 100K+ vectors fast  | ❌ Not optimized for embeddings |
| Built for LLMs    | ✅ Integrates with embedding models | ❌ Not designed for this        |




## 🧠 TL;DR: Qdrant vs ChromaDB

| Feature        | **Qdrant**                          | **ChromaDB**                         |
| -------------- | ----------------------------------- | ------------------------------------ |
| Best for       | Scalable, production AI search      | Fast local prototyping with LLMs     |
| Think of it as | A smart **library server**          | A lightweight **personal notebook**  |
| Deploys on     | Local or cloud, works well at scale | Local file system (no network infra) |
| Built for      | Enterprise-grade search + scale     | Fast, easy local dev & prototyping   |

---

## 🍕 Analogy: A Pizza Restaurant's Recipe Management System

### 📚 Qdrant = **Enterprise Recipe Library System**

Imagine you're a large pizza chain with 200 locations.

* You store **all your recipes in a central database**.
* You want any chef in any location to search:

  > “Show me low-fat spicy meat pizzas”
* The system:

  * Understands **meaning**
  * Returns **relevant recipes**
  * Works fast even with **millions of recipes**
  * Can be deployed in **cloud or production server**

🔧 That’s **Qdrant**:
✅ Fast
✅ Smart (semantic search)
✅ Scalable
✅ Reliable in enterprise environments

---

### 🗒️ ChromaDB = **Chef’s Personal Recipe Notebook**

Now imagine you’re a single pizza chef experimenting at home.

* You keep **10-50 recipes** on your laptop.
* You want to quickly search:

  > “spicy crustless vegan pizza”
* You don't care about multi-location support; you just want quick local lookup.
* You just use a **simple notebook with some tags**.

🟢 That’s **ChromaDB**:
✅ Super fast locally
✅ No setup
✅ Perfect for personal use or small experiments
❌ Not made for production or large scale

---

## 💼 When to Use What?

| Use Case                                  | Qdrant        | ChromaDB     |
| ----------------------------------------- | ------------- | ------------ |
| Local LLM prototype (1 dev)               | ❌ Overkill    | ✅ Ideal      |
| Production compliance engine (like yours) | ✅ Perfect     | ❌ Not enough |
| Works with Azure, Docker, cloud infra     | ✅ Yes         | ⚠️ Hacky     |
| Needs full-text + vector search           | ✅ Has filters | ❌ Basic only |
| Persistent, shared across services        | ✅ Yes         | ⚠️ Possible via client-server mode |

---

## ✅ Recommendation for You

Since you're building:

* A **compliance system**
* For **enterprise**
* That uses **20+ policies and growing**
* With **semantic lookups and real users**

➡️ **Stick with Qdrant**. It’s production-ready, fast, and scalable.



In [7]:
import fitz                # To read text from PDFs (PyMuPDF)
import os, re, json        # File handling and regular expressions
from datetime import datetime, timezone
import numpy as np        # For numerical operations
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer


In [2]:
PDF_DIR = "../data/policy_docs"
PROCESSED_JSON = "../data/processed/policy_rules.json"
QDRANT_HOST = "localhost"
QDRANT_PORT = 6333
COLLECTION_NAME = "policy_rules"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"




## 🧠 What is a Distilled Embedding Model?

> A **distilled model** is a smaller, faster version of a larger model (the teacher), trained to reproduce the teacher's behavior with fewer parameters.

💡 Think of it as:

* The **student** who learns from a larger **teacher model** (e.g., BERT, RoBERTa)
* Much **faster** to run
* Slightly less accurate, but **good enough** for real-time systems like email compliance, chatbots, etc.

---

## 🏭 In Enterprise, Why Do You Use Distilled Models?

| Concern                 | Distilled Model Solves It                                  |
| ----------------------- | ---------------------------------------------------------- |
| 🔥 Real-time latency    | Fast inference (often under 100 ms per sentence; depends on hardware) |
| 🧠 Memory usage         | Smaller size (100MB–500MB vs. multi-GB)                    |
| 🧾 Deployability        | Easier to run in Docker, AKS, edge devices, or local boxes |
| ✅ Cost control          | Small enough to run on CPU, so no GPU or paid API is required |
| 📦 Model auditability   | Fixed weights give reproducible outputs that are easy to version and audit |
| 🚫 No cloud fine-tuning | You can fine-tune it **locally or on Ray**                 |
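
The memory figures above follow from simple arithmetic: weight size is roughly parameter count times bytes per parameter. The sketch below uses approximate parameter counts as assumptions (about 22.7M for `all-MiniLM-L6-v2`, and a hypothetical 7B-parameter LLM for contrast) and ignores activations and runtime overhead.

```python
def weight_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the model weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


# Approximate parameter counts (assumptions for illustration)
MINILM_PARAMS = 22_700_000        # all-MiniLM-L6-v2, stored as 32-bit floats
LARGE_LLM_PARAMS = 7_000_000_000  # hypothetical 7B LLM, stored as 16-bit floats

minilm_mb = weight_size_mb(MINILM_PARAMS, 4)
large_llm_mb = weight_size_mb(LARGE_LLM_PARAMS, 2)

print(f"MiniLM (fp32):  {minilm_mb:.0f} MB")
print(f"7B LLM (fp16):  {large_llm_mb:.0f} MB")
```

The MiniLM estimate lands close to the size of the `model.safetensors` file that `sentence-transformers` downloads for this model.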

---

## ✅ Recommended Distilled Embedding Models (Enterprise Perspective)

Here’s how to choose, based on your use case (email + policy matching):

| Model Name                     | Type         | Size (approx.)  | Dim | Pros                                             | When to Use                       |
| ------------------------------ | ------------ | --------------- | --- | ------------------------------------------------ | --------------------------------- |
| `all-MiniLM-L6-v2`             | SentenceBERT | \~90MB          | 384 | Fast, high-quality sentence-level embeddings     | ✅ Default go-to model             |
| `distilroberta-base`           | Encoder      | \~300MB         | 768 | Good for fine-tuning, classification             | Use if building a classifier      |
| `TinyLlama-1.1B`               | LLM          | \~2.2GB (fp16)  | N/A | Small generative LLM, for Q\&A or classification | For chat-based tools              |
| `intfloat/e5-small`            | Embeddings   | \~130MB         | 384 | Optimized for semantic search + RAG              | For high-recall retrieval         |
| `nomic-ai/nomic-embed-text-v1` | Embedder     | \~550MB         | 768 | Multi-purpose, newer                             | For newer vector DBs like LanceDB |

---

## 🎯 For You — Recommended Model

Since you’re doing **policy violation detection + RAG Q\&A**, your use case involves both:

1. **Email/Document Embedding for Search**
2. **Policy Chunk Embedding for Retrieval**

✅ **Best Option:** `all-MiniLM-L6-v2`

* Balanced: Small, fast, enterprise-tested
* 384-dim vector (works well with Qdrant)
* Already supported by `sentence-transformers`
* Good zero-shot sentence similarity

---

## ✅ When to Consider Alternatives

| Situation                                | Consider Instead              |
| ---------------------------------------- | ----------------------------- |
| Need more accuracy, can afford 2× slower | `all-MiniLM-L12-v2`           |
| Want better performance on short phrases | `intfloat/e5-small` or `all-mpnet-base-v2` |
| Need general LLM for chat + generation   | `TinyLlama` + LoRA tuning     |

---

## 📦 Enterprise Notes

* Always pin versions in `requirements.txt` for reproducibility
* Benchmark on a **small validation set** (your labeled JSONL)
* Track `embedding_dim`, `model_name`, and `vector_index_id` in logs/metadata
* Store model artifacts in a versioned location (Azure Blob, etc.)
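
One way to follow the tracking advice above is to write a small metadata record next to each index build. This is a sketch, not a Qdrant feature: the `IndexMetadata` class, the `build_metadata` helper and the example values are illustrative, with field names taken from the bullet list.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IndexMetadata:
    """Describes which model produced the vectors in a given index."""
    model_name: str
    embedding_dim: int
    vector_index_id: str
    built_at: str


def build_metadata(model_name: str, embedding_dim: int, vector_index_id: str) -> IndexMetadata:
    # Record the build time in UTC so logs from different hosts line up
    return IndexMetadata(
        model_name=model_name,
        embedding_dim=embedding_dim,
        vector_index_id=vector_index_id,
        built_at=datetime.now(timezone.utc).isoformat(),
    )


meta = build_metadata("all-MiniLM-L6-v2", 384, "policy_rules")
print(json.dumps(asdict(meta), indent=2))
```

Comparing `embedding_dim` in this record against the collection's vector size before an upload is a cheap way to catch a model swap that would otherwise fail or silently degrade search.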

---

## 🔚 Summary

| What You Need                    | Recommended Distilled Model |
| -------------------------------- | --------------------------- |
| Email + Policy semantic matching | ✅ `all-MiniLM-L6-v2`        |
| Lightweight, fast, deployable    | ✅ Yes                       |
| Good for RAG + compliance chat   | ✅ Yes                       |


In [3]:
model = SentenceTransformer(EMBEDDING_MODEL)
client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)




## ✅ What It's Doing

1. **`SentenceTransformer(...)`**:

   * Loads a **pretrained embedding model** (like `all-MiniLM-L6-v2`)
   * This model **converts a sentence into a fixed-size vector**
   * These vectors represent the **meaning** of the text

2. **`QdrantClient(...)`**:

   * Connects to your **local or remote Qdrant vector database**
   * Allows you to **insert, search, or manage** semantic data

---

## ⚙️ Technical Explanation

### 1️⃣ `SentenceTransformer(EMBEDDING_MODEL)`

* Loads a distilled transformer model (MiniLM) that was fine-tuned on sentence pairs with a **Siamese-style contrastive objective**, so similar sentences end up with similar vectors
* When you call:

  ```python
  model.encode("Can I send client data via Gmail?")
  ```

  It returns a **384-dimensional vector** (a NumPy array of shape `(384,)`) like:

  ```python
  [0.11, -0.42, ..., 0.03]  # embedding (illustrative values)
  ```

💡 These embeddings are:

* **Dense** (not sparse like TF-IDF)
* **Context-aware**
* **Universal** across topics (policies, emails, etc.)
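
To see why dense embeddings matter, compare them with a sparse bag-of-words representation, used here as a simplified stand-in for TF-IDF without the weighting. In the sketch below, the question and a relevant policy sentence share only one token, so their keyword-based similarity is low even though a reader would call them closely related.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Sparse representation: token -> count (absent tokens are simply not stored)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sparse_cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


question = bag_of_words("Can I send client data via Gmail?")
policy = bag_of_words("Public email services must not be used for confidential data.")

shared = set(question) & set(policy)
score = sparse_cosine(question, policy)
print(shared, round(score, 3))
```

A dense embedding model would be expected to score this pair much higher, because it can place "Gmail" near "public email services" and "client data" near "confidential data".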

---

### 2️⃣ `QdrantClient(...)`

* Connects to your Qdrant instance running on `localhost:6333`
* Lets you **push** and **query** vectors along with **metadata** (like title, file, date)
* Used later for **semantic search** or **chatbot retrieval**

---

## 🧠 Real-World Analogy

### Imagine this is a Library System

| Component                  | Real World Analogy                                                                                                                               |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `SentenceTransformer(...)` | A **smart librarian** who reads each document and summarizes its meaning as a special “DNA code” (vector)                                        |
| `QdrantClient(...)`        | A **digital filing cabinet** that stores those DNA codes, indexed for super-fast lookup                                                          |
| Search later               | Asking the librarian: “Which policies talk about remote access over Gmail?” and she pulls the 3 most semantically related pages from the cabinet |

---

## 🏭 Why It’s Used in Industry

| Feature               | Why It Matters in Enterprise                          |
| --------------------- | ----------------------------------------------------- |
| `SentenceTransformer` | Fast, accurate semantic embedding without fine-tuning |
| Embeddings            | Enables similarity search, clustering, classification |
| `QdrantClient`        | Easy integration with scalable vector DB              |
| Fast + Explainable    | Runs fast, gives traceable results (with metadata)    |

Used in:

* **Chatbots**
* **Compliance scanners**
* **Semantic search portals**
* **Email/document matching engines**



In [5]:
def extract_sections(text):
    sections = []
    pattern = re.compile(r"(?<=\n)(\d{1,2}\.?\s?[A-Z].{5,}?)(?=\n)")
    matches = list(pattern.finditer(text))

    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        heading = match.group(1).strip()
        content = text[start:end].strip()
        if content:
            sections.append({"title": heading, "content": content})
    return sections

This function, `extract_sections(text)`, **splits a block of policy document text into structured sections** such as "1. Introduction" and "2. Purpose".

The breakdown below goes through the logic line by line.

---

## 🧠 Purpose of the Function

To extract **titled sections** from a plain-text version of a PDF, such as:

```
1. Introduction
This policy outlines the purpose of...

2. Purpose
To ensure secure handling of...

3. Scope
This applies to all employees...
```

➡️ The goal is to return:

```python
[
  {"title": "1. Introduction", "content": "This policy outlines..."},
  {"title": "2. Purpose", "content": "To ensure secure handling..."},
  ...
]
```

---

## 🔍 Line-by-Line Breakdown

```python
def extract_sections(text):
    sections = []
```

✅ Starts with an empty list to store extracted sections.

---

```python
    pattern = re.compile(r"(?<=\n)(\d{1,2}\.?\s?[A-Z].{5,}?)(?=\n)")
```

🧠 This regex looks for lines like:

* `1. Introduction`
* `2 Purpose`
* `11. Legal Compliance`

### ⚙️ Regex Explained:

| Part         | Meaning                                                                                                      |
| ------------ | ------------------------------------------------------------------------------------------------------------ |
| `(?<=\n)`    | The heading must come right after a newline, so a heading on the very first line of the text is not matched  |
| `\d{1,2}\.?` | Section number of one or two digits with an optional dot, like `1` or `12.`                                  |
| `\s?`        | Optional single whitespace character                                                                         |
| `[A-Z]`      | First letter of the title must be a capital letter                                                           |
| `.{5,}?`     | At least 5 more characters on the same line, so a short title like `Scope` is skipped                        |
| `(?=\n)`     | The heading must be followed by a newline                                                                    |


---

```python
    matches = list(pattern.finditer(text))
```

🔍 Finds all matches of section headings in the document.

---

```python
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
```

📏 For each section title:

* `start` is where the title ends
* `end` is where the **next** section starts
* So you're grabbing the **text between two titles** — the section body

---

```python
        heading = match.group(1).strip()
        content = text[start:end].strip()
```

📝 Clean the section title and its content.

---

```python
        if content:
            sections.append({"title": heading, "content": content})
```

✅ Only store non-empty sections to avoid noise.

---

```python
    return sections
```

📤 Returns a list of structured sections.
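
Running the function on a small sample shows two edge cases worth knowing about. Because of the `(?<=\n)` lookbehind, a heading on the very first line is matched only if the text starts with a newline, and because the pattern requires five more characters after the capital letter, a short heading such as `3. Scope` is not detected, so its text stays attached to the previous section. The `RELAXED` pattern below is one possible alternative, not the notebook's method.

```python
import re

# Same pattern as in the notebook
HEADING = re.compile(r"(?<=\n)(\d{1,2}\.?\s?[A-Z].{5,}?)(?=\n)")
# Hypothetical alternative: ^ and $ anchor to line boundaries, shorter titles allowed
RELAXED = re.compile(r"^(\d{1,2}\.?[ \t]?[A-Z][^\n]{2,})$", re.MULTILINE)


def extract_sections(text, pattern=HEADING):
    sections = []
    matches = list(pattern.finditer(text))
    for i, match in enumerate(matches):
        # The body runs from the end of this heading to the start of the next one
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = text[match.end():end].strip()
        if content:
            sections.append({"title": match.group(1).strip(), "content": content})
    return sections


sample = (
    "1. Introduction\nThis policy outlines the purpose.\n\n"
    "2. Purpose\nTo ensure secure handling.\n\n"
    "3. Scope\nThis applies to all employees.\n"
)

titles_raw = [s["title"] for s in extract_sections(sample)]
titles_padded = [s["title"] for s in extract_sections("\n" + sample)]
titles_relaxed = [s["title"] for s in extract_sections(sample, RELAXED)]

print(titles_raw)      # the first-line heading and "3. Scope" are missed
print(titles_padded)   # a leading newline recovers "1. Introduction"
print(titles_relaxed)  # the relaxed pattern finds all three headings
```

Whether to relax the pattern is a trade-off: a shorter minimum length catches headings like "Scope" but is also more likely to treat a numbered list item in the body as a heading.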



In [6]:
def pdf_to_text(file_path):
    # Open the PDF with PyMuPDF and close it again once the text is extracted
    with fitz.open(file_path) as doc:
        return "\n".join(page.get_text() for page in doc)

In [8]:
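def process_pdfs(pdf_dir):
    """Extract titled sections from every PDF in pdf_dir."""
    all_sections = []
    # Sort for a deterministic order; os.listdir() order is arbitrary
    for file in sorted(os.listdir(pdf_dir)):
        if not file.lower().endswith(".pdf"):
            continue
        text = pdf_to_text(os.path.join(pdf_dir, file))
        sections = extract_sections(text)

        # Shared by all sections that come from this file
        meta = {
            "source": file,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

        for section in sections:
            record = {
                "title": section["title"],
                # Lowercased for consistency; the original casing is not kept
                "content": section["content"].lower(),
                "metadata": meta,
            }
            all_sections.append(record)
    return all_sections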
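
The records built here do not carry an identifier of their own, so IDs based on list position can shift whenever a PDF is added or removed. One option is to derive a deterministic ID from each record's source file and title; Qdrant accepts unsigned integers or UUIDs as point IDs. The `stable_point_id` helper and the namespace string are assumptions for illustration.

```python
import uuid

# Hypothetical namespace for this project; any fixed UUID works
POLICY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "policy_rules")


def stable_point_id(source: str, title: str) -> str:
    """Return the same UUID string for the same (source, title) pair on every run."""
    return str(uuid.uuid5(POLICY_NAMESPACE, f"{source}::{title}"))


record = {
    "title": "1. Introduction",
    "content": "this policy outlines the purpose.",
    "metadata": {"source": "email_policy.pdf"},
}
point_id = stable_point_id(record["metadata"]["source"], record["title"])
print(point_id)
```

With IDs like this, re-running the pipeline overwrites the same points instead of renumbering them. Two sections with an identical title in the same file would collide, so a section index could be added to the key if that happens in your documents.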

In [9]:
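def save_json(data, path):
    # Create the parent directory only if the path actually has one
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII policy text readable in the file
        json.dump(data, f, indent=2, ensure_ascii=False)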

In [10]:
def upload_to_qdrant(records):
    # get_collections() returns description objects, so compare by name
    existing = {c.name for c in client.get_collections().collections}
    if COLLECTION_NAME not in existing:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(
                # Vector size must be positive: 384 for all-MiniLM-L6-v2
                size=model.get_sentence_embedding_dimension(),
                distance=Distance.COSINE,
            ),
        )

    # Embed and upsert on every call, not only when the collection is new
    points = []
    for idx, rec in enumerate(records):
        vector = model.encode(rec["content"]).tolist()
        payload = {
            "title": rec["title"],
            "content": rec["content"],
            "metadata": rec["metadata"],
        }
        points.append(PointStruct(id=idx, vector=vector, payload=payload))
    client.upsert(collection_name=COLLECTION_NAME, points=points)
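
For larger document sets, sending every point in a single `upsert` call can produce a very large request. A common pattern is to upload in fixed-size batches. The `batched` helper and the batch size of 64 below are illustrative choices, and the `client.upsert` loop is shown only as a comment because it needs a running Qdrant instance.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


points = list(range(150))  # stand-in for a list of PointStruct objects
batch_sizes = [len(batch) for batch in batched(points, 64)]
print(batch_sizes)  # [64, 64, 22]

# With a live client, each batch would be sent separately:
# for batch in batched(points, 64):
#     client.upsert(collection_name=COLLECTION_NAME, points=batch)
```

Batching also limits how much work is lost if one request fails, since only that batch needs to be retried.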

In [None]:
"""if __name__ == "__main__":
    records = process_pdfs(PDF_DIR)
    save_json(records, PROCESSED_JSON)
    upload_to_qdrant(records)
    print(f"✅ Processed and indexed {len(records)} policy sections.")"""