What chunking does:

Embedding models and LLMs have limited context windows, and a single embedding of a long document blurs its individual topics.
Chunking splits long documents into smaller, overlapping pieces so that:

	• embeddings capture local meaning
	• retrieval finds the right part of a document
	• generation is grounded in the relevant passage, which reduces hallucination

Bad chunking = bad RAG. This step is critical.

#### 02 – Chunk Azure Compute Docs for RAG

This notebook reads raw Azure Compute documentation from Unity Catalog,
splits documents into overlapping text chunks, and writes the results
as a new Delta table for downstream embedding and retrieval.

Input table:
- databricks_rag_demo.default.raw_azure_compute_docs

Output table:
- databricks_rag_demo.default.azure_compute_doc_chunks

Chunking design:

We will use:

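	•	Chunk size: 400 words (a rough, word-based proxy for tokens)
	•	Overlap: 50 words, so each chunk starts 350 words after the previous one
	•	Deterministic chunk IDs (SHA-256 of doc_id and chunk_index)
	•	Metadata preserved (doc_id, source, category, title, url, ingest_time)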

Fixed-size windows with a small overlap are a common starting point for RAG; treat the exact numbers as tunable.
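
As a rough sketch of what these settings imply, the number of chunks per document follows from the stride (chunk size minus overlap). The helper name `expected_chunk_count` is hypothetical, and it assumes the simple sliding-window loop used later in this notebook, where a new chunk starts every 350 words for as long as the start position is still inside the document.

```python
import math

def expected_chunk_count(n_words: int, chunk_size: int = 400, overlap: int = 50) -> int:
    """Chunks produced by a window that starts every (chunk_size - overlap) words."""
    stride = chunk_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return math.ceil(n_words / stride)

for n in (0, 350, 351, 400, 1000):
    print(n, expected_chunk_count(n))
```

Under this loop a 400-word document yields two chunks, and the second one holds only the final 50 words, which already appear in the first. That is harmless for a demo, but it may be worth filtering out if index size matters.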

In [0]:
import re
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

In [0]:
# Load raw docs from Unity Catalog
raw_df = spark.table(
    "databricks_rag_demo.default.raw_azure_compute_docs"
)

raw_df.select("doc_id", "category", "title").limit(5).show(truncate=False)

+----------------------------------------------------------------------------+--------------------------+----------------------------------------------+
|doc_id                                                                      |category                  |title                                         |
+----------------------------------------------------------------------------+--------------------------+----------------------------------------------+
|virtual-machine-scale-sets/standby-pools-create.md                          |virtual-machine-scale-sets|standby-pools-create                          |
|virtual-machine-scale-sets/standby-pools-update-delete.md                   |virtual-machine-scale-sets|standby-pools-update-delete                   |
|virtual-machine-scale-sets/azure-hybrid-benefit-linux.md                    |virtual-machine-scale-sets|azure-hybrid-benefit-linux                    |
|virtual-machine-scale-sets/flexible-virtual-machine-scale-sets-powershell.md|virt

In [0]:
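# Approximate tokens by words: lowercase the text and keep runs of word characters.
# Note: this drops punctuation and casing, so chunk_text is normalized, not verbatim.
def tokenize(text: str):
    return re.findall(r"\b\w+\b", text.lower())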
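
To make the effect of this word-based tokenizer concrete, here is a small standalone run on a made-up sentence (the sample string is an assumption, not taken from the docs table). It shows why the stored chunks later in this notebook read like `it s available` and `pay as you go`: apostrophes, hyphens, and other punctuation act as separators.

```python
import re

def tokenize(text: str):
    # Same word-based approximation as in the cell above.
    return re.findall(r"\b\w+\b", text.lower())

sample = "It's available for Pay-As-You-Go images (RHEL, SLES)."
tokens = tokenize(sample)
print(tokens)
# ['it', 's', 'available', 'for', 'pay', 'as', 'you', 'go', 'images', 'rhel', 'sles']
```

If the downstream embedding model benefits from original casing and punctuation, one option would be to split on whitespace instead and keep the raw words; that is a design choice this notebook does not make.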

In [0]:
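# Chunking function (with overlap)
def chunk_text(text, chunk_size=400, overlap=50):
    # A window that does not advance would loop forever, so reject it early.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Null or empty documents yield no chunks instead of failing the UDF.
    if not text:
        return []

    tokens = tokenize(text)
    chunks = []

    start = 0
    chunk_index = 0

    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]  # slicing past the end is safe
        chunk_str = " ".join(chunk_tokens)

        chunks.append({
            "chunk_index": chunk_index,
            "chunk_text": chunk_str
        })

        chunk_index += 1
        start += chunk_size - overlap  # advance by the stride (350 by default)

    return chunks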
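
The overlap is easier to see with toy sizes. The sketch below repeats the sliding-window logic in a compact form so that it runs on its own, then chunks 12 numbered words with a window of 5 and an overlap of 2; the input and the small parameter values are made up for illustration.

```python
import re

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def chunk_text(text, chunk_size=400, overlap=50):
    # Compact copy of the sliding-window logic, repeated so this example is standalone.
    tokens = tokenize(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append({
            "chunk_index": len(chunks),
            "chunk_text": " ".join(tokens[start:start + chunk_size]),
        })
        start += chunk_size - overlap
    return chunks

# Toy input: 12 numbered words, windows of 5 words with 2 words of overlap.
text = " ".join(f"w{i}" for i in range(12))
demo = chunk_text(text, chunk_size=5, overlap=2)
for c in demo:
    print(c["chunk_index"], c["chunk_text"])
# 0 w0 w1 w2 w3 w4
# 1 w3 w4 w5 w6 w7
# 2 w6 w7 w8 w9 w10
# 3 w9 w10 w11
```

Each chunk begins with the last two words of the previous one, which is the same relationship the real table shows between consecutive chunks with a 50-word overlap.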

In [0]:
# A Spark UDF (user-defined function) is a custom function, usually written in Python,
# that Spark applies to DataFrame columns, distributed across the cluster.

# Declare the UDF's return type (an array of chunk structs), then register the UDF
chunk_schema = ArrayType(
    StructType([
        StructField("chunk_index", IntegerType(), False),
        StructField("chunk_text", StringType(), False)
    ])
)

chunk_udf = F.udf(chunk_text, chunk_schema)

# Apply chunking + explode
chunked_df = (
    raw_df
    .withColumn("chunks", chunk_udf("raw_text"))
    .withColumn("chunk", F.explode("chunks"))
    .select(
        "doc_id",
        "source",
        "category",
        "title",
        "url",
        F.col("chunk.chunk_index").alias("chunk_index"),
        F.col("chunk.chunk_text").alias("chunk_text"),
        "ingest_time"
    )
)

# Add a stable chunk ID:
chunked_df = chunked_df.withColumn(
    "chunk_id",
    F.sha2(
        F.concat_ws("::", F.col("doc_id"), F.col("chunk_index")),
        256
    )
)
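
For reference, the same kind of ID can be computed outside Spark. This is a minimal sketch with a hypothetical helper, `make_chunk_id`; it assumes that hashing the UTF-8 bytes of `doc_id::chunk_index` and taking the hex digest mirrors `sha2(concat_ws(...), 256)`, which has not been checked against the table in this notebook.

```python
import hashlib

def make_chunk_id(doc_id: str, chunk_index: int) -> str:
    """SHA-256 hex digest of '<doc_id>::<chunk_index>' (hypothetical helper)."""
    key = f"{doc_id}::{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

a = make_chunk_id("vm.md", 0)
b = make_chunk_id("vm.md", 0)
c = make_chunk_id("vm.md", 1)
print(len(a), a == b, a != c)
# 64 True True
```

The ID depends only on the document and the chunk position, so re-running the notebook reproduces it. It does not depend on the chunk text: if the chunk size or overlap changes, the same ID will point at different content.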

##### Step 0: Input DataFrame

Before chunking, the input DataFrame raw_df has one row per document:

| doc_id | category | title | raw_text              | ingest_time |
|-------|----------|-------|-----------------------|-------------|
| vm.md | virtual-machines | Intro | "very long text..."   | ...         |

raw_text is one long string.

##### Step 1: Apply UDF

`.withColumn("chunks", chunk_udf("raw_text"))`

What this does:

- Applies Spark UDF chunk_udf to raw_text
- Creates a new column called chunks
- Keeps all existing columns

| doc_id | raw_text     | chunks |
|-------|--------------|--------|
| vm.md | "...long..."  | `[ {chunk_index: 0, chunk_text: "..."}, {chunk_index: 1, chunk_text: "..."} ]` |


There is still one row per document.

##### Step 2: Explode

`.withColumn("chunk", F.explode("chunks"))`

What this does: `explode` takes an array column and emits one output row per array element.

Before explode (1 row):
```text
chunks = [
  {chunk_index: 0, chunk_text: "..."},
  {chunk_index: 1, chunk_text: "..."}
]
```

After explode (2 rows):

| doc_id | chunk        |
|--------|--------------|
| vm.md  | {0, "..."}   |
| vm.md  | {1, "..."}   |

Spark repeats the values of all other columns on each new row. Rows whose array is empty or null produce no output rows.

Data shape now:

| doc_id | raw_text | chunks | chunk        |
|--------|----------|--------|--------------|
| vm.md  | ...      | [...]  | {0, "..."}   |
| vm.md  | ...      | [...]  | {1, "..."}   |

There is now one row per chunk.
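
The row-multiplying behavior described in this step can be mimicked in plain Python. The function `explode_rows` below is a hypothetical stand-in for `F.explode` on a list of dictionaries, not Spark code; it is only meant to show how the other columns are copied onto each new row and how empty arrays disappear.

```python
def explode_rows(rows, array_col, out_col):
    """Plain-Python analogue of withColumn(out_col, F.explode(array_col))."""
    exploded = []
    for row in rows:
        for element in row[array_col] or []:  # empty or None arrays yield no rows
            new_row = dict(row)               # copy every existing column
            new_row[out_col] = element        # add the single array element
            exploded.append(new_row)
    return exploded

docs = [
    {"doc_id": "vm.md", "chunks": [{"chunk_index": 0, "chunk_text": "..."},
                                   {"chunk_index": 1, "chunk_text": "..."}]},
    {"doc_id": "empty.md", "chunks": []},
]
rows = explode_rows(docs, "chunks", "chunk")
print([(r["doc_id"], r["chunk"]["chunk_index"]) for r in rows])
# [('vm.md', 0), ('vm.md', 1)]
```

Note that `empty.md` is absent from the result. In Spark, `explode_outer` is the variant that keeps such rows with a null element, should documents without chunks need to be tracked.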

##### Step 3: Reorganize

`.select(...)`

What this does:

- Drops columns that are no longer needed:
	- raw_text
	- chunks
- **Extracts fields from the chunk struct**
- Flattens the schema
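
In plain-Python terms, this step keeps a few named columns and lifts the two struct fields to the top level. The helper `flatten_chunk_row` and the sample row are hypothetical and only illustrate the reshaping; the notebook itself does this with `select` and `F.col("chunk.<field>")`.

```python
def flatten_chunk_row(row, keep=("doc_id", "category", "title")):
    """Keep selected columns and lift the chunk struct's fields to top-level columns."""
    flat = {name: row[name] for name in keep}
    flat["chunk_index"] = row["chunk"]["chunk_index"]
    flat["chunk_text"] = row["chunk"]["chunk_text"]
    return flat

# One exploded row, still carrying raw_text and the full chunks array.
exploded_row = {
    "doc_id": "vm.md",
    "category": "virtual-machines",
    "title": "Intro",
    "raw_text": "very long text...",
    "chunks": [{"chunk_index": 0, "chunk_text": "first chunk..."},
               {"chunk_index": 1, "chunk_text": "second chunk..."}],
    "chunk": {"chunk_index": 1, "chunk_text": "second chunk..."},
}

flat_row = flatten_chunk_row(exploded_row)
print(list(flat_row))
# ['doc_id', 'category', 'title', 'chunk_index', 'chunk_text']
```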

##### Final result

chunked_df (selected columns; source, url, ingest_time, and chunk_id are omitted here):

| doc_id | category | title | chunk_index | chunk_text          |
|--------|----------|-------|-------------|---------------------|
| vm.md  | virtual-machines | Intro | 0           | "first chunk..."    |
| vm.md  | virtual-machines | Intro | 1           | "second chunk..."   |

In [0]:
chunked_df.count()

5758

In [0]:
display(chunked_df.limit(5))

doc_id,source,category,title,url,chunk_index,chunk_text,ingest_time,chunk_id
virtual-machine-scale-sets/standby-pools-create.md,azure-compute-docs,virtual-machine-scale-sets,standby-pools-create,https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/standby-pools-create.md,0,title create a standby pool for virtual machine scale sets description learn how to create a standby pool to reduce scale out latency with virtual machine scale sets author mimckitt ms author mimckitt ms service azure virtual machine scale sets ms custom ignite 2024 ms topic how to ms date 5 6 2025 ms reviewer cynthn customer intent as a cloud infrastructure administrator i want to create and manage a standby pool for virtual machine scale sets so that i can reduce scale out latency and ensure high availability of resources create a standby pool important for standby pools to successfully create and manage resources it requires access to the associated resources in your subscription ensure the correct permissions are assigned to the standby pool resource provider in order for your standby pool to function properly for detailed instructions see configure role permissions for standby pools this article steps through creating a standby pool for virtual machine scale sets with flexible orchestration create a standby pool portal note setting the standby pool vm state to hibernated is not yet available in the azure portal to configure a standby pool with a hibernated vm state use an alternative sdk such as cli or powershell 1 navigate to your virtual machine scale set 2 under availability scale select standby pool 3 select manage pool 4 provide a name for your pool select a provisioning state and set the maximum and minimum ready capacity 5 select save image type content source media standby pools enable standby pool after vmss creation png alt text a screenshot showing how to enable a standby pool on an existing virtual machine scale set in the azure portal you can also configure a standby pool during virtual machine scale set creation by navigating to the 
management tab and checking the box to enable standby pools image type content source media standby pools enable standby pool during vmss create png alt text a screenshot showing how to enable a standby pool during the virtual machine scale set create experience in the portal cli create a standby pool and associate it with an existing scale set using az standby vm pool create powershell create a standby pool and associate it with an existing scale set using new azstandbyvmpool arm template create a standby pool and associate it with an existing scale set create a template and deploy it using az,2026-01-07T07:18:12.183441Z,eb530c82dafe9473e49352dd206a5287feb8ec035d56b0c3e80eb62b1d15e376
virtual-machine-scale-sets/standby-pools-create.md,azure-compute-docs,virtual-machine-scale-sets,standby-pools-create,https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/standby-pools-create.md,1,it with an existing scale set using az standby vm pool create powershell create a standby pool and associate it with an existing scale set using new azstandbyvmpool arm template create a standby pool and associate it with an existing scale set create a template and deploy it using az deployment group create or new azresourcegroupdeployment bicep create a standby pool and associate it with an existing scale set deploy the template using az deployment group create or new azresourcegroupdeployment rest create a standby pool and associate it with an existing scale set using create or update next steps learn how to update and delete a standby pool,2026-01-07T07:18:12.183441Z,731634abb2b494dc2ed1029f4b3eefea16271b25a44eb7e8dbd758c5b1a9ff9e
virtual-machine-scale-sets/standby-pools-update-delete.md,azure-compute-docs,virtual-machine-scale-sets,standby-pools-update-delete,https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/standby-pools-update-delete.md,0,title delete or update a standby pool for virtual machine scale sets description learn how to delete or update a standby pool for virtual machine scale sets author mimckitt ms author mimckitt ms service azure virtual machine scale sets ms custom ignite 2024 ms topic how to ms date 5 6 2025 ms reviewer cynthn customer intent as a cloud administrator i want to update or delete standby pools in virtual machine scale sets so that i can manage resource allocation and optimize performance based on operational requirements update or delete a standby pool important for standby pools to successfully create and manage resources it requires access to the associated resources in your subscription ensure the correct permissions are assigned to the standby pool resource provider in order for your standby pool to function properly for detailed instructions see configure role permissions for standby pools you can update the state of the instances and the max ready capacity of your standby pool at any time the standby pool name can only be set during standby pool creation if updating the provisioning state to hibernated ensure that the scale set is properly configured to use hibernated vms for more information see hibernation overview when changing the provisioning state of your standby pool transitioning between the following states below are supported transitioning between a stopped deallocated state and a hibernated state is not supported if using a stopped deallocated pool and you want to instead use a hibernated pool first transition to a running pool then update the provisioning state to hibernated initial state updated state running stopped deallocated running hibernated stopped deallocated running hibernated running hibernated stopped deallocated 
update a standby pool portal note setting the standby pool vm state to hibernated is not yet available in the azure portal to configure a standby pool with a hibernated vm state use an alternative sdk such as cli or powershell 1 navigate to virtual machine scale set the standby pool is associated with 2 under availability scale select standby pool 3 select manage pool 4 update the configuration and save any changes image type content source media standby pools managed standby pool after vmss create png alt text a screenshot of the networking tab in the azure portal during the virtual machine scale set creation process cli update an existing standby pool using az standby vm pool update powershell update an existing,2026-01-07T07:18:12.183971Z,b0d36c396ec479a0446845a2f8ece217e0f048a75bb38eb977213b10db21a084
virtual-machine-scale-sets/standby-pools-update-delete.md,azure-compute-docs,virtual-machine-scale-sets,standby-pools-update-delete,https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/standby-pools-update-delete.md,1,image type content source media standby pools managed standby pool after vmss create png alt text a screenshot of the networking tab in the azure portal during the virtual machine scale set creation process cli update an existing standby pool using az standby vm pool update powershell update an existing standby pool using update azstandbyvmpool arm template update an existing standby pool deployment deploy the updated template using az deployment group create or new azresourcegroupdeployment bicep update an existing standby pool deployment deploy the updated template using az deployment group create or new azresourcegroupdeployment rest update an existing standby pool using create or update delete a standby pool portal 1 navigate to virtual machine scale set the standby pool is associated with 2 under availability scale select standby pool 3 select delete pool 4 select delete image type content source media standby pools delete standby pool portal png alt text a screenshot showing how to delete a standby pool in the portal cli delete an existing standby pool using az standbypool delete powershell delete an existing standby pool using remove azstandbyvmpool rest delete an existing standby pool using delete next steps review the most frequently asked questions about standby pools for virtual machine scale sets,2026-01-07T07:18:12.183971Z,51e07a453a45d39f14184f4139da8bfaaaa5c3fe4de6e36566d0dc753ea01517
virtual-machine-scale-sets/azure-hybrid-benefit-linux.md,azure-compute-docs,virtual-machine-scale-sets,azure-hybrid-benefit-linux,https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/azure-hybrid-benefit-linux.md,0,title azure hybrid benefit for linux virtual machine scale sets description learn how azure hybrid benefit can apply to virtual machine scale sets and save you money on linux virtual machines in azure services virtual machine scale sets author mathapli manager rochakm ms service azure virtual machine scale sets ms subservice azure hybrid benefit ms collection linux ms topic concept article ms date 06 14 2024 ms author mathapli ms custom kr2b contr experiment devx track azurecli linux related content sfi image nochange customer intent as a cloud administrator i want to utilize azure hybrid benefit for linux virtual machine scale sets so that i can reduce costs associated with running my rhel and sles instances while leveraging existing subscriptions explore azure hybrid benefit for linux virtual machine scale sets azure hybrid benefit can reduce the cost of running your red hat enterprise linux rhel and suse linux enterprise server sles virtual machine scale sets azure hybrid benefit for linux virtual machine scale sets is generally available now it s available for all rhel and sles pay as you go images from azure marketplace when you enable azure hybrid benefit the only fee that you incur is the cost of your scale set infrastructure note this article focuses on virtual machine scale sets running in uniform orchestration mode we recommend using flexible orchestration for new workloads for more information see orchestration modes for virtual machine scale sets in azure what is azure hybrid benefit for linux virtual machine scale sets azure hybrid benefit allows you to switch your virtual machine scale sets to bring your own subscription byos billing you can use your cloud access licenses from red hat or suse for this you can also switch pay 
as you go instances to byos without the need to redeploy a virtual machine scale set deployed from pay as you go azure marketplace images is charged both infrastructure and software fees when azure hybrid benefit is enabled image type content source media azure hybrid benefit linux azure hybrid benefit linux cost png alt text diagram that shows the effect of azure hybrid benefit on costs for linux virtual machines which linux virtual machines can use azure hybrid benefit azure hybrid benefit can be used on all rhel and sles pay as you go images from azure marketplace azure hybrid benefit isn t yet available for rhel or sles byos images or,2026-01-07T07:18:12.184325Z,647e2624515f93a04071cc5c18946bba1268828e815694c709fc00c16e70d977


In [0]:
# Write chunk table to Unity Catalog

(
    chunked_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(
        "databricks_rag_demo.default.azure_compute_doc_chunks"
    )
)

In [0]:
%sql

SELECT COUNT(*) FROM databricks_rag_demo.default.azure_compute_doc_chunks;



count(1)
5758


In [0]:
%sql

SELECT category, COUNT(*) AS chunks FROM databricks_rag_demo.default.azure_compute_doc_chunks GROUP BY category ORDER BY chunks DESC;


category,chunks
virtual-machines,3259
service-fabric,1865
virtual-machine-scale-sets,341
container-instances,268
azure-impact-reporting,25


In [0]:
%sql

SELECT doc_id, chunk_index, LENGTH(chunk_text) AS chunk_len FROM databricks_rag_demo.default.azure_compute_doc_chunks LIMIT 10

doc_id,chunk_index,chunk_len
virtual-machines/extensions/salt-minion.md,0,2383
virtual-machines/extensions/salt-minion.md,1,527
virtual-machines/extensions/backup-azure-sql-server-running-azure-vm.md,0,3559
virtual-machines/extensions/backup-azure-sql-server-running-azure-vm.md,1,1269
virtual-machines/extensions/custom-script-linux.md,0,2322
virtual-machines/extensions/custom-script-linux.md,1,2072
virtual-machines/extensions/custom-script-linux.md,2,2647
virtual-machines/extensions/custom-script-linux.md,3,2407
virtual-machines/extensions/custom-script-linux.md,4,2300
virtual-machines/extensions/custom-script-linux.md,5,2471
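
`LENGTH` counts characters, not words, so the values above (roughly 2,000 to 3,600 characters for full chunks) cannot be compared directly with the 400-word target. Because each chunk is a space-joined list of tokens, splitting on whitespace recovers the word count. The helpers below are a hypothetical sketch for checking a handful of chunk strings pulled into Python, not part of the notebook.

```python
def word_count(chunk_text: str) -> int:
    # Chunks are space-joined tokens, so a whitespace split recovers the token count.
    return len(chunk_text.split())

def oversized(chunk_texts, chunk_size=400):
    """Return the chunks whose word count exceeds the configured chunk size."""
    return [c for c in chunk_texts if word_count(c) > chunk_size]

# Made-up chunks: one exactly at the limit, one short tail, one too long.
full = " ".join(["word"] * 400)
tail = " ".join(["word"] * 50)
too_long = " ".join(["word"] * 401)

print(word_count(full), word_count(tail), len(oversized([full, tail, too_long])))
# 400 50 1
```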


In [0]:
%sql
-- Optional cleanup: drop the chunk table to reset the environment.

DROP TABLE IF EXISTS databricks_rag_demo.default.azure_compute_doc_chunks;