### 🔹 **Embeddings & Vector DB**

**Goal**: Prepare the data so it can be semantically searched.


1. **Data Collection**

   * Gather course documents (PDFs, HTML, Markdown, etc.)

2. **Text Preprocessing**

   * Clean the text (remove noise like headers, footers, boilerplate)
   * Optionally normalize the text (lowercase, strip punctuation, etc.)


In [2]:
from markdown_cleaner import MarkdownCleaner

cleaner = MarkdownCleaner()
cleaned_content = cleaner.clean_file("devopt-document.md")
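
The `markdown_cleaner` module is not shown here, so the following is only a minimal sketch of the kind of cleaning described above. The function name `clean_markdown` and the specific noise patterns (HTML comments, "Page N of M" lines) are assumptions chosen for illustration, not the behaviour of `MarkdownCleaner`.

```python
import re


def clean_markdown(text: str, lowercase: bool = False) -> str:
    """Hypothetical cleaner: strips some common noise and tidies whitespace."""
    # Drop HTML comments, which often carry editor or export boilerplate.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Drop lines that contain only a page marker such as "Page 3 of 10".
    text = re.sub(
        r"^[ \t]*page \d+( of \d+)?[ \t]*$",
        "",
        text,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    # Strip trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Optional normalization step.
    if lowercase:
        text = text.lower()
    return text.strip()
```

Lowercasing is off by default in this sketch because embedding models generally handle mixed case, and identifiers such as repository names are easier to read when left untouched.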

3. **Chunking**

   * Split documents into smaller chunks (e.g., 500–1000 tokens)
   * Optionally add overlap (to preserve context across chunks)


In [3]:
from supports import text_chunking

chunks = text_chunking(cleaned_content)
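
The `text_chunking` helper comes from the local `supports` module and is not shown. As a rough illustration of fixed-size chunking with overlap, here is a sketch that counts whitespace-separated words instead of model tokens. The name `chunk_words` and its default sizes are assumptions, not the helper's actual behaviour.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text, so the last
        # chunk is not followed by a smaller chunk made only of overlap.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk repeats the last word of the previous one, which is the context-preserving effect the overlap bullet refers to. A token-based version would swap `text.split()` for a tokenizer.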

In [4]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Split where the embedding distance between neighbouring sentences exceeds
# a threshold. Other threshold types: "standard_deviation", "interquartile".
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)
documents = text_splitter.create_documents([cleaned_content])
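
To give a feel for what a percentile breakpoint means, here is a simplified sketch that starts a new chunk wherever the cosine distance between neighbouring sentence vectors exceeds a chosen percentile of all such distances. It uses hand-written toy vectors and hypothetical function names. The library's actual implementation may differ in details such as sentence splitting and how neighbouring sentences are combined before embedding.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_by_percentile(sentences, vectors, q=95):
    """Group sentences, breaking where the distance to the next one is large."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A higher percentile yields fewer, larger chunks, because fewer distances exceed the threshold.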

In [5]:
for doc in documents:
    print(doc)
    print("---")

page_content='### Git Repository
- Name: kdeployment
    - Description: Kubernetes manifest for deploying applications in Development, Staging and Production environments. - URL: https://bitbucket.org/cellcardvasdev/kdeployment

- Name: number-management
    - Description: management of phone numbers and related data.'
---
page_content='- URL: https://bitbucket.org/cellcardvasdev/number-management

- Name: dms-connector
    - Description: dms-connector
    - URL: https://bitbucket.org/cellcardvasdev/dms-connector

- Name: specialnumberseles-v2
    - Description: specialnumberseles-v2
    - URL: https://bitbucket.org/cellcardvasdev/specialnumberseles-v2

- Name: data-plus
    - Description:
    - URL: https://bitbucket.org/cellcardvasdev/data-plus

- Name: mlbb-addon-subscription
    - Description:
    - URL: https://bitbucket.org/cellcardvasdev/mlbb-addon-subscription

- Name:
    - Description:
    - URL: https://bitbucket.org/cellcardvasdev/socialpack'
---


4. **Embedding**

   * Use an embedding model such as OpenAI's `text-embedding-3-small` or one from the `sentence-transformers` library
   * Convert each chunk into a vector
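
The `text_embedding` helper used in the next cell lives in the local `supports` module and is not shown. The sketch below is one guess at what such a helper might look like when backed by the OpenAI Python SDK. The function name `embed_text`, its signature, and the empty-input guard are assumptions. The client is passed in so the function can be exercised without network access.

```python
def embed_text(text: str, client, model: str = "text-embedding-3-small") -> list[float]:
    """Return the embedding vector for one chunk of text.

    `client` is expected to behave like an `openai.OpenAI` instance.
    """
    if not text.strip():
        # Defensive choice for this sketch: fail early on blank input.
        raise ValueError("cannot embed empty text")
    response = client.embeddings.create(model=model, input=text)
    # One input string produces one item in `response.data`.
    return list(response.data[0].embedding)


# Usage (requires the `openai` package and an API key in OPENAI_API_KEY):
#   from openai import OpenAI
#   vector = embed_text("some chunk of text", OpenAI())
```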


In [6]:
import uuid

from qdrant_client.http.models import PointStruct
from supports import text_embedding

points = []
for document in documents:
    embedding = text_embedding(document.page_content)
    points.append(PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload={"source": "devopt-document.md", "text": document.page_content}
    ))
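
Because `uuid.uuid4()` produces a fresh random ID on every run, re-running the cell above and upserting again adds new points instead of overwriting the old ones. One possible alternative, sketched here with a hypothetical helper named `chunk_id`, is to derive the ID from the chunk's source and text with `uuid.uuid5`, so the same chunk always maps to the same point ID.

```python
import uuid


def chunk_id(source: str, text: str) -> str:
    """Deterministic UUID string for a chunk (hypothetical helper)."""
    # uuid5 hashes the namespace and name, so equal inputs give equal IDs.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{text}"))
```

Passing `id=chunk_id("devopt-document.md", document.page_content)` to `PointStruct` would make re-indexing idempotent: Qdrant's upsert overwrites a point whose ID already exists. The trade-off is that an edited chunk gets a new ID, so stale points for the old text would need to be deleted separately.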

5. **Insert into Vector DB**

   * Store vectors in Qdrant
   * Attach metadata (e.g., source, page number, doc title)


In [7]:
from qdrant_client.http.models import VectorParams, Distance
from supports import qdrant_client as qdrant, collection_name

if not qdrant.collection_exists(collection_name):
    qdrant.create_collection(
        collection_name=collection_name,
        # size must equal the embedding model's output dimension
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
    )
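
The collection's `size=1536` matches the default output dimension of `text-embedding-3-small`. A different model may produce a different length (for example, the `all-MiniLM-L6-v2` model from `sentence-transformers` yields 384-dimensional vectors), and Qdrant rejects vectors whose length does not match the collection. A small pre-flight check along these lines can catch the mismatch before the upsert. The function name is hypothetical.

```python
EXPECTED_DIM = 1536  # assumed to match size= in the VectorParams call above


def find_dimension_mismatches(vectors, expected_dim=EXPECTED_DIM):
    """Return (index, actual_length) for every vector of the wrong length."""
    return [
        (index, len(vector))
        for index, vector in enumerate(vectors)
        if len(vector) != expected_dim
    ]
```

For the notebook's `points` list, this could be called as `find_dimension_mismatches([p.vector for p in points])`. An empty result means every vector fits the collection.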

In [8]:
from supports import qdrant_client as qdrant, collection_name
qdrant.upsert(collection_name=collection_name, points=points)

print(f"✅ Indexed {len(points)} chunks into Qdrant.")

✅ Indexed 2 chunks into Qdrant.


[View the collection in the local Qdrant dashboard](http://localhost:6333/dashboard#/collections/cellcard_dataplus)