What chunking does:

Embedding models and LLMs accept a bounded amount of input, and a single embedding of a long document blurs its many topics together.
Chunking splits long documents into overlapping, reasonably self-contained pieces so that:

	• embeddings capture local meaning
	• retrieval finds the right part of a document
	• generation is grounded in relevant passages, which reduces hallucination

Bad chunking leads to bad RAG, so this step is critical.

#### 02 – Chunk Azure Compute Docs for RAG

This notebook reads raw Azure Compute documentation from Unity Catalog,
splits documents into overlapping text chunks, and writes the results
as a new Delta table for downstream embedding and retrieval.

Input table:
- databricks_rag_demo.default.raw_azure_compute_docs

Output table:
- databricks_rag_demo.default.azure_compute_doc_chunks

Chunking design:

We will use:

	• Chunk size: 400 words (a rough, word-based proxy for tokens)
	• Overlap: 50 words
	• Deterministic chunk IDs
	• Metadata preserved (doc_id, category, title, url)

These values are a common starting point for RAG; tune them against retrieval quality on your own data.
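
To make the size and overlap settings concrete, the sketch below lists the token offsets each window would cover for a document of a given length. It assumes the sliding-window scheme used later in this notebook (each window starts `chunk_size - overlap` tokens after the previous one); the helper name `window_bounds` is hypothetical.

```python
def window_bounds(n_tokens, chunk_size=400, overlap=50):
    """Return (start, end) token offsets for each sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # 350 with the defaults
    return [
        (start, min(start + chunk_size, n_tokens))
        for start in range(0, n_tokens, stride)
    ]

print(window_bounds(1000))
# [(0, 400), (350, 750), (700, 1000)]
```

Consecutive windows share 50 tokens (350 to 400, then 700 to 750). Note that under this scheme a document of exactly 400 tokens yields two windows, (0, 400) and (350, 400), and the second is fully contained in the first.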

In [0]:
import re
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

In [0]:
%run ./00_constants

In [0]:
# Load raw docs from Unity Catalog
raw_df = spark.table(
    RAW_DOCS_TABLE
)

raw_df.select("doc_id", "category", "title").limit(5).show(truncate=False)

+-----------------------------------------------+----------------+---------------------------+
|doc_id                                         |category        |title                      |
+-----------------------------------------------+----------------+---------------------------+
|virtual-machines/hc-series-performance.md      |virtual-machines|hc-series-performance      |
|virtual-machines/premium-storage-performance.md|virtual-machines|premium-storage-performance|
|virtual-machines/understand-vm-reboots.md      |virtual-machines|understand-vm-reboots      |
|virtual-machines/disks-understand-billing.md   |virtual-machines|disks-understand-billing   |
|virtual-machines/hibernate-resume.md           |virtual-machines|hibernate-resume           |
+-----------------------------------------------+----------------+---------------------------+



In [0]:
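# We'll approximate tokens by words.
# Note: this lowercases the text and drops punctuation, so chunk_text is a normalized copy of the original.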

def tokenize(text: str):
    return re.findall(r"\b\w+\b", text.lower())
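
As a quick illustration of what this word-based tokenizer does, the snippet below runs the same regular expression on a made-up sentence. Lowercasing and punctuation removal are side effects worth knowing about: text rebuilt from these tokens loses case and punctuation, which is visible in the chunk output further down.

```python
import re

def tokenize(text: str):
    # Same pattern as the notebook cell: lowercase, then keep runs of word characters.
    return re.findall(r"\b\w+\b", text.lower())

sample = "HC-series VMs reach 190 GB/s on STREAM Triad."  # made-up sample sentence
tokens = tokenize(sample)
print(tokens)
# ['hc', 'series', 'vms', 'reach', '190', 'gb', 's', 'on', 'stream', 'triad']
print(" ".join(tokens))
# hc series vms reach 190 gb s on stream triad
```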

In [0]:
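# Chunking function (sliding window with overlap)
def chunk_text(text, chunk_size=400, overlap=50):
    # Guard against a non-positive stride, which would loop forever.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Treat a NULL raw_text as an empty document.
    tokens = tokenize(text or "")
    chunks = []

    start = 0
    chunk_index = 0

    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_str = " ".join(chunk_tokens)

        chunks.append({
            # chunk_index is unique only within a single doc_id, not globally.
            "chunk_index": chunk_index,
            "chunk_text": chunk_str
        })

        chunk_index += 1
        # Advance by the stride so consecutive chunks share `overlap` tokens.
        start += chunk_size - overlap

    return chunks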
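
To see the overlap in action, the sketch below runs a condensed copy of the function above on a ten-word toy input with a small window. The sizes (4 and 1) are chosen only to keep the output readable.

```python
import re

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def chunk_text(text, chunk_size=400, overlap=50):
    # Condensed copy of the notebook function, for a standalone demo.
    tokens = tokenize(text)
    stride = chunk_size - overlap
    return [
        {"chunk_index": i, "chunk_text": " ".join(tokens[start:start + chunk_size])}
        for i, start in enumerate(range(0, len(tokens), stride))
    ]

text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
demo_chunks = chunk_text(text, chunk_size=4, overlap=1)
for c in demo_chunks:
    print(c["chunk_index"], c["chunk_text"])
# 0 w0 w1 w2 w3
# 1 w3 w4 w5 w6
# 2 w6 w7 w8 w9
# 3 w9
```

Each chunk starts with the last word of the previous one. The final chunk (`w9`) is entirely contained in the chunk before it, so a short, redundant tail chunk is possible with this scheme.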

In [0]:
# A Spark UDF (user-defined function) is a custom function, here written in Python,
# that Spark applies to DataFrame columns, distributed across the cluster.

# Define the return schema and register the Spark UDF
chunk_schema = ArrayType(
    StructType([
        StructField("chunk_index", IntegerType(), False),
        StructField("chunk_text", StringType(), False)
    ])
)

chunk_udf = F.udf(chunk_text, chunk_schema)

# Apply chunking + explode
chunked_df = (
    raw_df
    .withColumn("chunks", chunk_udf("raw_text"))
    .withColumn("chunk", F.explode("chunks"))
    .select(
        "doc_id",
        "source",
        "category",
        "title",
        "url",
        F.col("chunk.chunk_index").alias("chunk_index"),
        F.col("chunk.chunk_text").alias("chunk_text"),
        "ingest_time"
    )
)

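# Add a stable chunk ID: SHA-256 of "<doc_id>::<chunk_index>"
chunked_df = chunked_df.withColumn(
    "chunk_id",
    F.sha2(
        # Cast explicitly so the hash input does not rely on implicit int-to-string coercion.
        F.concat_ws("::", F.col("doc_id"), F.col("chunk_index").cast("string")),
        256
    )
)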
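
The same ID can be reproduced outside Spark, which is handy when debugging a single chunk. The sketch below assumes that Spark's `sha2(..., 256)` returns the lowercase hex SHA-256 digest of the UTF-8 bytes of the concatenated string; the helper name `make_chunk_id` is hypothetical.

```python
import hashlib

def make_chunk_id(doc_id: str, chunk_index: int) -> str:
    # Mirrors concat_ws("::", doc_id, chunk_index) followed by sha2(..., 256).
    key = f"{doc_id}::{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

id_a = make_chunk_id("virtual-machines/hibernate-resume.md", 0)
id_b = make_chunk_id("virtual-machines/hibernate-resume.md", 1)
print(len(id_a))      # 64 hex characters
print(id_a != id_b)   # True: different chunks get different IDs
```

Because the ID depends only on `doc_id` and `chunk_index`, re-running the notebook yields the same IDs, but changing the chunk size or overlap changes which text each ID refers to.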

##### Step 0: Input DataFrame

Before chunking, the input DataFrame `raw_df` has one row per document (the `source` and `url` columns are omitted here for brevity):

| doc_id | category | title | raw_text              | ingest_time |
|-------|----------|-------|-----------------------|-------------|
| vm.md | virtual-machines | Intro | "very long text..."   | ...         |

`raw_text` holds the entire document as a single long string.

##### Step 1: Apply UDF

`.withColumn("chunks", chunk_udf("raw_text"))`

What this does:

- Applies the Spark UDF `chunk_udf` to `raw_text`
- Creates a new array column called `chunks`
- Keeps all existing columns

| doc_id | raw_text     | chunks |
|-------|--------------|--------|
| vm.md | "...long..."  | `[ {chunk_index: 0, chunk_text: "..."}, {chunk_index: 1, chunk_text: "..."} ]` |


There is still one row per document.

##### Step 2: Explode

`.withColumn("chunk", F.explode("chunks"))`

What explode does: `explode` takes an array column and turns each element into its own row.

Before explode (1 row):
```text
chunks = [
  {chunk_index: 0, chunk_text: "..."},
  {chunk_index: 1, chunk_text: "..."}
]
```

After explode (2 rows):

| doc_id | chunk        |
|--------|--------------|
| vm.md  | {0, "..."}   |
| vm.md  | {1, "..."}   |

Spark repeats the values of all other columns on each new row.

Data shape now:

| doc_id | raw_text | chunks | chunk        |
|--------|----------|--------|--------------|
| vm.md  | ...      | [...]  | {0, "..."}   |
| vm.md  | ...      | [...]  | {1, "..."}   |

There is now one row per chunk.
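
For readers new to Spark, the behavior of `explode` can be mimicked in plain Python. This is only an analogy on a list of dicts, not how Spark executes it; the helper name `explode_rows` and the sample rows are hypothetical.

```python
rows = [
    {"doc_id": "vm.md", "chunks": [
        {"chunk_index": 0, "chunk_text": "first chunk"},
        {"chunk_index": 1, "chunk_text": "second chunk"},
    ]},
    {"doc_id": "empty.md", "chunks": []},  # a document that produced no chunks
]

def explode_rows(rows, array_col, out_col):
    """Emit one output row per array element, copying all other columns."""
    out = []
    for row in rows:
        for element in row[array_col]:
            out.append({**row, out_col: element})
    return out

exploded = explode_rows(rows, "chunks", "chunk")
print([(r["doc_id"], r["chunk"]["chunk_index"]) for r in exploded])
# [('vm.md', 0), ('vm.md', 1)]
```

As with `F.explode`, a row whose array is empty produces no output rows, so `empty.md` disappears. In Spark, `F.explode_outer` keeps such rows with a null element instead.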

##### Step 3: Reorganize

`.select(...)`

What it does:

- Drops columns that are no longer needed:
	- raw_text
	- chunks
- **Extracts fields from the chunk struct**
- Flattens the schema
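
The flattening step can likewise be pictured in plain Python: keep the metadata columns, pull the two fields out of the nested `chunk` value, and drop everything else. The helper name `flatten` and the sample rows are hypothetical, and only `doc_id` is kept here to keep the example short.

```python
exploded = [
    {"doc_id": "vm.md", "raw_text": "...", "chunks": ["..."],
     "chunk": {"chunk_index": 0, "chunk_text": "first chunk"}},
    {"doc_id": "vm.md", "raw_text": "...", "chunks": ["..."],
     "chunk": {"chunk_index": 1, "chunk_text": "second chunk"}},
]

KEEP = ["doc_id"]  # the notebook also keeps source, category, title, url, ingest_time

def flatten(row):
    flat = {name: row[name] for name in KEEP}           # raw_text and chunks are dropped
    flat["chunk_index"] = row["chunk"]["chunk_index"]   # like F.col("chunk.chunk_index")
    flat["chunk_text"] = row["chunk"]["chunk_text"]     # like F.col("chunk.chunk_text")
    return flat

final = [flatten(r) for r in exploded]
print(final[0])
# {'doc_id': 'vm.md', 'chunk_index': 0, 'chunk_text': 'first chunk'}
```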

##### Final result

chunked_df (some columns omitted):

| doc_id | category         | title | chunk_index | chunk_text          |
|--------|------------------|-------|-------------|---------------------|
| vm.md  | virtual-machines | Intro | 0           | "first chunk..."    |
| vm.md  | virtual-machines | Intro | 1           | "second chunk..."   |

In [0]:
chunked_df.count()

947

In [0]:
display(chunked_df.limit(5))

doc_id,source,category,title,url,chunk_index,chunk_text,ingest_time,chunk_id
virtual-machines/hc-series-performance.md,azure-compute-docs,virtual-machines,hc-series-performance,https://learn.microsoft.com/en-us/azure/virtual-machines/hc-series-performance.md,0,title hc series vm size performance description learn about performance testing results for hc series vm sizes in azure ms service azure virtual machines ms subservice hpc ms topic concept article ms date 07 25 2024 ms reviewer cynthn ms author padmalathas author cynthn customer intent as a cloud architect i want to analyze the performance results of hc series vm sizes so that i can select the optimal configuration for my high performance computing workloads hc series virtual machine sizes applies to heavy_check_mark linux vms heavy_check_mark windows vms heavy_check_mark flexible scale sets heavy_check_mark uniform scale sets several performance tests have been run on hc series sizes the following are some of the results of this performance testing workload hb stream triad 190 gb s intel mlc avx 512 high performance linpack hpl 3520 gigaflops rpeak 2970 gigaflops rmax rdma latency bandwidth 1 05 microseconds 96 8 gb s fio on local nvme ssd 1 3 gb s reads 900 mb s writes ior on 4 azure premium ssd p30 managed disks raid0 780 mb s reads 780 mb writes mpi latency mpi latency test from the osu microbenchmark suite is run sample scripts are on github bash bin mpirun_rsh np 2 hostfile hostfile mv2_cpu_mapping insert core osu_latency mpi bandwidth mpi bandwidth test from the osu microbenchmark suite is run sample scripts are on github bash mvapich2 2 3 install bin mpirun_rsh np 2 hostfile hostfile mv2_cpu_mapping insert core mvapich2 2 3 osu_benchmarks mpi pt2pt osu_bw mellanox perftest the mellanox perftest package has many infiniband tests such as latency ib_send_lat and bandwidth ib_send_bw an example command is below console numactl physcpubind insert core ib_send_lat a next steps read about the latest announcements hpc workload examples and performance results at the azure compute tech 
community blogs for a higher level architectural view of running hpc workloads see high performance computing hpc on azure,2026-01-11T23:00:12.677298Z,e0a4a830db8bf7306f91c8934db2e3eb86a108e1367dcd3ee553adb9082f5212
virtual-machines/premium-storage-performance.md,azure-compute-docs,virtual-machines,premium-storage-performance,https://learn.microsoft.com/en-us/azure/virtual-machines/premium-storage-performance.md,0,title azure premium storage design for high performance description design high performance apps by using azure premium ssd managed disks azure premium storage offers high performance low latency disk support for i o intensive workloads running on azure vms author roygara ms service azure disk storage ms custom linux related content ms topic concept article ms date 06 29 2021 ms author rogarana customer intent as a developer i want to optimize application performance on premium storage so that i can ensure my high performance apps meet the demands of i o intensive workloads efficiently azure premium storage design for high performance applies to heavy_check_mark linux vms heavy_check_mark windows vms heavy_check_mark flexible scale sets heavy_check_mark uniform scale sets this article provides guidelines for building high performance applications by using azure premium storage you can use the instructions provided in this document combined with performance best practices applicable to technologies used by your application to illustrate the guidelines we use sql server running on premium storage as an example throughout this document while we address performance scenarios for the storage layer in this article you need to optimize the application layer for example if you re hosting a sharepoint farm on premium storage you can use the sql server examples from this article to optimize the database server you can also optimize the sharepoint farm s web server and application server to get the most performance this article helps to answer the following common questions about optimizing application performance on premium storage how can you measure your application performance why aren t you seeing expected high performance which factors influence your application 
performance on premium storage how do these factors influence performance of your application on premium storage how can you optimize for input output operations per second iops bandwidth and latency we provide these guidelines specifically for premium storage because workloads running on premium storage are highly performance sensitive we provide examples where appropriate you can also apply some of these guidelines to applications running on infrastructure as a service iaas vms with standard storage disks note sometimes what appears to be a disk performance issue is actually a network bottleneck in these situations you should optimize your network performance if you re looking to benchmark your disk see the following articles for linux benchmark your application on azure disk storage for windows benchmark a disk if your vm supports,2026-01-11T23:00:12.680018Z,d289e62cefb20a7fac3323cc6fccb65d54c83a66cddd95e1b68b00c3e473ed43
virtual-machines/premium-storage-performance.md,azure-compute-docs,virtual-machines,premium-storage-performance,https://learn.microsoft.com/en-us/azure/virtual-machines/premium-storage-performance.md,1,to be a disk performance issue is actually a network bottleneck in these situations you should optimize your network performance if you re looking to benchmark your disk see the following articles for linux benchmark your application on azure disk storage for windows benchmark a disk if your vm supports accelerated networking make sure it s enabled if it s not enabled you can enable it on already deployed vms on both windows and linux before you begin if you re new to premium storage first read select an azure disk type for iaas vms and scalability targets for premium page blob storage accounts application performance indicators we assess whether an application is performing well or not by using performance indicators like how fast an application is processing a user request how much data an application is processing per request how many requests an application is processing in a specific period of time how long a user has to wait to get a response after submitting their request the technical terms for these performance indicators are iops throughput or bandwidth and latency in this section we discuss the common performance indicators in the context of premium storage in the section performance application checklist for disks you learn how to measure these performance indicators for your application later in optimize application performance you learn about the factors that affect these performance indicators and recommendations to optimize them iops iops is the number of requests that your application is sending to storage disks in one second an input output operation could be read or write sequential or random online transaction processing oltp applications like an online retail website need to process many concurrent user requests immediately the user requests are 
insert and update intensive database transactions which the application must process quickly for this reason oltp applications require very high iops oltp applications handle millions of small and random i o requests if you have such an application you must design the application infrastructure to optimize for iops for more information on all the factors to consider to get high iops see optimize application performance when you attach a premium storage disk to your high scale vm azure provisions for you a guaranteed number of iops according to the disk specification for example a p50 disk provisions 7 500 iops each high scale vm size also has a specific iops limit that,2026-01-11T23:00:12.680018Z,7eb206e0baf9dccd72fd1c0d1c1db060d37cdd64f5b5f190080eb48111b91d8e
virtual-machines/premium-storage-performance.md,azure-compute-docs,virtual-machines,premium-storage-performance,https://learn.microsoft.com/en-us/azure/virtual-machines/premium-storage-performance.md,2,optimize application performance when you attach a premium storage disk to your high scale vm azure provisions for you a guaranteed number of iops according to the disk specification for example a p50 disk provisions 7 500 iops each high scale vm size also has a specific iops limit that it can sustain for example a standard gs5 vm has an 80 000 iops limit throughput throughput or bandwidth is the amount of data that your application is sending to the storage disks in a specified interval if your application is performing input output operations with large i o unit sizes it requires high throughput data warehouse applications tend to issue scan intensive operations that access large portions of data at a time and commonly perform bulk operations in other words such applications require higher throughput if you have such an application you must design its infrastructure to optimize for throughput in the next section we discuss the factors you must tune to achieve this optimization when you attach a premium storage disk to a high scale vm azure provisions throughput according to that disk specification for example a p50 disk provisions 250 mb sec disk throughput each high scale vm size also has a specific throughput limit that it can sustain for example standard gs5 vm has a maximum throughput of 2 000 mb sec there s a relation between throughput and iops as shown in the following formula it s important to determine the optimal throughput and iops values that your application requires as you try to optimize one the other is also affected for more information about optimizing iops and throughput see optimize application performance latency latency is the time it takes an application to receive a single request send it to storage disks and send the response to the client 
latency is a critical measure of an application s performance in addition to iops and throughput the latency of a premium storage disk is the time it takes to retrieve the information for a request and communicate it back to your application premium storage provides consistently low latencies premium disks are designed to provide single digit millisecond latencies for most i o operations if you enable readonly host caching on premium storage disks you can get much lower read latency for more information on disk caching see disk caching when you optimize your application to get,2026-01-11T23:00:12.680018Z,8a7ca450c100c7094a4f2c614375f4a0d2de277efab28564f383eef914d2016a
virtual-machines/premium-storage-performance.md,azure-compute-docs,virtual-machines,premium-storage-performance,https://learn.microsoft.com/en-us/azure/virtual-machines/premium-storage-performance.md,3,low latencies premium disks are designed to provide single digit millisecond latencies for most i o operations if you enable readonly host caching on premium storage disks you can get much lower read latency for more information on disk caching see disk caching when you optimize your application to get higher iops and throughput it affects the latency of your application after you tune the application performance evaluate the latency of the application to avoid unexpected high latency behavior some control plane operations on managed disks might move the disk from one storage location to another moving the disk between storage locations is orchestrated via a background copy of data which can take several hours to complete typically the time is less than 24 hours depending on the amount of data in the disks during that time your application can experience higher than usual read latency because some reads can get redirected to the original location and take longer to complete during a background copy there s no effect on write latency for most disk types for premium ssd v2 and ultra disks if the disk has a 4k sector size it experiences higher read latency if the disk has a 512e sector size it experiences higher read and write latency the following control plane operations might move the disk between storage locations and cause increased latency update the storage type detach and attach a disk from one vm to another create a managed disk from a vhd create a managed disk from a snapshot convert unmanaged disks to managed disks performance application checklist for disks the first step in designing high performance applications running on premium storage is understanding the performance requirements of your application after you gather performance requirements you can 
optimize your application to achieve the most optimal performance in the previous section we explained the common performance indicators iops throughput and latency you must identify which of these performance indicators are critical to your application to deliver the desired user experience for example high iops matters most to oltp applications processing millions of transactions in a second high throughput is critical for data warehouse applications processing large amounts of data in a second extremely low latency is crucial for real time applications like live video streaming websites next measure the maximum performance requirements of your application throughout its lifetime use the following sample checklist as a,2026-01-11T23:00:12.680018Z,b1bd0487444e54435660f660ad8e0a26db5dcbabe8a447256d6a0040ca2df674


In [0]:
# Write chunk table to Unity Catalog

# Parentheses allow the multi-line method chain; braces would build a Python set instead.
(
    chunked_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(CHUNKS_TABLE)
)

In [0]:
spark.sql(f"""
    SELECT COUNT(*) FROM {CHUNKS_TABLE};
""").display()


count(1)
947


In [0]:
spark.sql(f"""
    SELECT category, COUNT(*) AS chunks FROM {CHUNKS_TABLE} GROUP BY category ORDER BY chunks DESC
""").display()

category,chunks
service-fabric,267
virtual-machines,266
virtual-machine-scale-sets,210
container-instances,176
azure-impact-reporting,28


In [0]:
spark.sql(f"""
    SELECT doc_id, chunk_index, LENGTH(chunk_text) AS chunk_len FROM {CHUNKS_TABLE} LIMIT 10
""").display()

doc_id,chunk_index,chunk_len
virtual-machines/hc-series-performance.md,0,1938
virtual-machines/premium-storage-performance.md,0,2633
virtual-machines/premium-storage-performance.md,1,2474
virtual-machines/premium-storage-performance.md,2,2380
virtual-machines/premium-storage-performance.md,3,2524
virtual-machines/premium-storage-performance.md,4,2650
virtual-machines/premium-storage-performance.md,5,2278
virtual-machines/premium-storage-performance.md,6,2449
virtual-machines/premium-storage-performance.md,7,2347
virtual-machines/premium-storage-performance.md,8,2273
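
Character lengths alone do not confirm that chunks respect the configured size. Because `chunk_text` is built by joining tokens with single spaces, splitting on spaces recovers the word count. The sketch below is one possible sanity check on rows collected to the driver; the helper name `find_oversized` and the sample rows are hypothetical.

```python
def find_oversized(rows, chunk_size=400):
    """Return (doc_id, chunk_index, n_words) for chunks longer than chunk_size words."""
    problems = []
    for row in rows:
        n_words = len(row["chunk_text"].split(" "))
        if n_words > chunk_size:
            problems.append((row["doc_id"], row["chunk_index"], n_words))
    return problems

# Hypothetical rows standing in for spark.table(CHUNKS_TABLE).collect()
sample_rows = [
    {"doc_id": "a.md", "chunk_index": 0, "chunk_text": " ".join(["w"] * 400)},
    {"doc_id": "a.md", "chunk_index": 1, "chunk_text": " ".join(["w"] * 401)},
]
print(find_oversized(sample_rows))
# [('a.md', 1, 401)]
```

An empty result on the real table would indicate that every chunk stays within the 400-word limit.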


In [0]:
# Optional cleanup: drop the chunk table to reset the environment.

# spark.sql(f"""
#     DROP TABLE {CHUNKS_TABLE}
# """).display()