What chunking does:

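LLMs and embedding models have limited context windows, and a single embedding of a long document blurs its local meaning, so they do not work reliably on full documents.
Chunking splits long documents into overlapping, semantically meaningful pieces so that:

	• embeddings capture local meaning
	• retrieval finds the right part of a document
	• generation is grounded in the relevant passage, which reduces hallucination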

Bad chunking = bad RAG. This step is critical.

#### 02 - Chunk Azure Compute Docs for RAG

This notebook reads raw Azure Compute documentation from Unity Catalog, splits documents into overlapping text chunks, and writes the results as a new Delta table for downstream embedding and retrieval.

Input table:
- databricks_rag_demo.default.raw_azure_compute_docs

Output table:
- databricks_rag_demo.default.azure_compute_doc_chunks

Chunking design:

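- Chunk size: ~400 words (words are used as a rough stand-in for tokens)
- Overlap: ~50 words
- Deterministic chunk IDs
- Metadata preserved (doc_id, category, title, url)

These values are a common starting point for RAG pipelines; the best settings depend on the embedding model and the documents.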
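
To get a feel for these settings, the number of chunks a document yields can be estimated from its word count. The sketch below assumes the sliding-window loop used later in this notebook, where a new chunk starts every `chunk_size - overlap` words for as long as the start offset is still inside the document. `estimate_chunk_count` is a hypothetical helper name, not part of the notebook.

```python
import math

def estimate_chunk_count(num_words: int, chunk_size: int = 400, overlap: int = 50) -> int:
    """Estimate how many chunks a sliding window yields for one document.

    Assumes a new chunk starts every (chunk_size - overlap) words and that
    chunks are emitted while the start offset is still inside the document.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # 350 words with the defaults
    return math.ceil(num_words / stride)

for n in (0, 350, 400, 1500):
    print(n, estimate_chunk_count(n))
```

Note that a 400-word document already produces two chunks under this scheme, because the second window starts at word 350.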

In [0]:
import re
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

In [0]:
%run ./00_constants

In [0]:
# Load raw docs from Unity Catalog
raw_df = spark.table(RAW_DOCS_TABLE)
raw_df.select("doc_id", "category", "title", "url").limit(5).show(truncate=False)

+------------------------------------------------------------------------------------------+-------------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|doc_id                                                                                    |category           |title                                                              |url                                                                                                                                                              |
+------------------------------------------------------------------------------------------+-------------------+-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------

In [0]:
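# Approximate tokens by words: lowercase the text and keep runs of word
# characters (letters, digits, underscore). Punctuation and casing are dropped.

def tokenize(text: str):
    return re.findall(r"\b\w+\b", text.lower())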
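
As a small illustration of what this tokenizer keeps and discards, the sketch below runs the same regex on a made-up sentence (the sample string is an assumption for demonstration only).

```python
import re

def tokenize(text: str):
    # Same regex as the notebook cell: lowercase, then keep runs of word characters.
    return re.findall(r"\b\w+\b", text.lower())

sample = "Requires version 2.0.55 or later. Don't skip az_group_create!"
print(tokenize(sample))
# ['requires', 'version', '2', '0', '55', 'or', 'later', 'don', 't', 'skip', 'az_group_create']
```

Version numbers and contractions are split at punctuation, while identifiers containing underscores stay whole. This is why the chunk text shown later in the notebook contains fragments such as "2 0 55" and "don t".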

In [0]:
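# Chunking function (with overlap)
# Splits text into chunks of up to 400 words; consecutive chunks share 50 words.
def chunk_text(text, chunk_size=400, overlap=50):
    if overlap >= chunk_size:
        # Otherwise the window would never advance and the loop would not end.
        raise ValueError("overlap must be smaller than chunk_size")

    tokens = tokenize(text or "")  # treat a null raw_text as an empty document
    chunks = []

    start = 0
    chunk_index = 0

    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_str = " ".join(chunk_tokens)

        chunks.append({
            # chunk_index is unique only within a single doc_id, not globally.
            "chunk_index": chunk_index,
            "chunk_text": chunk_str
        })

        chunk_index += 1
        start += chunk_size - overlap  # advance by the stride (350 words by default)

    return chunks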
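
To see where the window boundaries fall, the following sketch runs the same sliding-window logic on ten placeholder words with a deliberately small `chunk_size` and `overlap`. The input text and the small parameter values are assumptions chosen only to keep the output readable.

```python
import re

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def chunk_text(text, chunk_size=400, overlap=50):
    # Same sliding-window logic as the notebook cell, condensed.
    tokens = tokenize(text)
    chunks, start, chunk_index = [], 0, 0
    while start < len(tokens):
        chunks.append({
            "chunk_index": chunk_index,
            "chunk_text": " ".join(tokens[start:start + chunk_size]),
        })
        chunk_index += 1
        start += chunk_size - overlap
    return chunks

# Ten made-up words so the window boundaries are easy to follow.
text = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
demo_chunks = chunk_text(text, chunk_size=5, overlap=2)
for chunk in demo_chunks:
    print(chunk["chunk_index"], chunk["chunk_text"])
# 0 w0 w1 w2 w3 w4
# 1 w3 w4 w5 w6 w7
# 2 w6 w7 w8 w9
# 3 w9
```

Each chunk repeats the last two words of the previous one. The final chunk here consists only of words already present in chunk 2; the loop emits such a short tail whenever the last window starts inside the overlap region of the one before it.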

%md
##### Step 0: Input DataFrame

Before chunking, the input DataFrame raw_df has one row per document:

| doc_id | category | title | raw_text              | ingest_time |
|-------|----------|-------|-----------------------|-------------|
| vm.md | virtual-machines | Intro | "very long text..."   | ...         |

raw_text is one long string.

##### Step 1: Apply UDF

`.withColumn("chunks", chunk_udf("raw_text"))`

What this does:

- Applies Spark UDF chunk_udf to raw_text
- Creates a new column called chunks
- Keeps all existing columns

| doc_id | raw_text     | chunks |
|-------|--------------|--------|
| vm.md | "...long..."  | `[ {chunk_index: 0, chunk_text: "..."}, {chunk_index: 1, chunk_text: "..."} ]` |


There is still one row per document.

##### Step 2: Explode

`.withColumn("chunk", F.explode("chunks"))`

What explode does: it takes an array column and turns each element into its own row.

Before explode (1 row):
```text
chunks = [
  {chunk_index: 0, chunk_text: "..."},
  {chunk_index: 1, chunk_text: "..."}
]
```

After explode (2 rows):

| doc_id | chunk        |
|--------|--------------|
| vm.md  | {0, "..."}   |
| vm.md  | {1, "..."}   |

Spark duplicates all other columns automatically.

Data shape now:

| doc_id | raw_text | chunks | chunk        |
|--------|----------|--------|--------------|
| vm.md  | ...      | [...]  | {0, "..."}   |
| vm.md  | ...      | [...]  | {1, "..."}   |

There is now one row per chunk.
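
The row-multiplying behavior of explode can be mimicked in plain Python. The sketch below is an analogy only, not how Spark implements it; `explode_rows` is a hypothetical helper and the row values are illustrative.

```python
# One input row whose "chunks" field holds an array of structs (illustrative values).
rows = [
    {"doc_id": "vm.md", "chunks": [
        {"chunk_index": 0, "chunk_text": "first chunk..."},
        {"chunk_index": 1, "chunk_text": "second chunk..."},
    ]},
]

def explode_rows(rows, array_col, out_col):
    """Emit one output row per array element, copying every other field."""
    exploded = []
    for row in rows:
        # A missing or empty array contributes no output rows.
        for element in row[array_col] or []:
            exploded.append({**row, out_col: element})
    return exploded

exploded = explode_rows(rows, "chunks", "chunk")
for row in exploded:
    print(row["doc_id"], row["chunk"])
```

As in this sketch, `F.explode` produces no rows for a document whose array is null or empty, so such documents disappear from the output.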

##### Step 3: Reorganize

`.select(...)`

What it does:

- Drops columns that are no longer needed:
	- raw_text
	- chunks
- **Extracts fields from the chunk struct**
- Flattens the schema
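
In plain-Python terms, this step keeps the document metadata and lifts the struct fields up to top-level columns. The sketch below is an analogy under that assumption; `flatten_chunk_row` is a hypothetical name and the row values are illustrative.

```python
def flatten_chunk_row(row):
    """Keep document metadata and promote the chunk struct's fields to columns."""
    return {
        "doc_id": row["doc_id"],
        "category": row["category"],
        "title": row["title"],
        "chunk_index": row["chunk"]["chunk_index"],
        "chunk_text": row["chunk"]["chunk_text"],
    }

# Shape of one row after the explode step (raw_text and chunks are still present).
exploded_row = {
    "doc_id": "vm.md",
    "category": "virtual-machines",
    "title": "Intro",
    "raw_text": "very long text...",
    "chunks": ["..."],
    "chunk": {"chunk_index": 0, "chunk_text": "first chunk..."},
}

flat_row = flatten_chunk_row(exploded_row)
print(flat_row)
```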

##### Final result

chunked_df:

| doc_id | category         | title | chunk_index | chunk_text          |
|--------|------------------|-------|-------------|---------------------|
| vm.md  | virtual-machines | Intro | 0           | "first chunk..."    |
| vm.md  | virtual-machines | Intro | 1           | "second chunk..."   |

In [0]:
# UDF = User Defined Function.
# A Spark UDF is a custom function you write (here in Python) that Spark
# applies to DataFrame columns, distributed across the cluster.

# Register Spark UDF
chunk_schema = ArrayType(
    StructType([
        StructField("chunk_index", IntegerType(), False),
        StructField("chunk_text", StringType(), False)
    ])
)

chunk_udf = F.udf(chunk_text, chunk_schema)

# Apply chunking + explode
chunked_df = (
    raw_df
    .withColumn("chunks", chunk_udf("raw_text"))
    .withColumn("chunk", F.explode("chunks"))
    .select(
        "doc_id",
        "source",
        "category",
        "title",
        "url",
        F.col("chunk.chunk_index").alias("chunk_index"),
        F.col("chunk.chunk_text").alias("chunk_text"),
        "ingest_time"
    )
)

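# Add a stable chunk ID: SHA-256 of "<doc_id>::<chunk_index>".
# chunk_index alone is unique only within a document, so doc_id is part of the key.
chunked_df = chunked_df.withColumn(
    "chunk_id",
    F.sha2(
        # Cast the integer index explicitly instead of relying on implicit coercion.
        F.concat_ws("::", F.col("doc_id"), F.col("chunk_index").cast("string")),
        256
    )
)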
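
The same ID scheme can be sketched outside Spark with the standard library. This assumes the hash is taken over the UTF-8 bytes of `doc_id`, the separator `::`, and the decimal chunk index; it has not been checked against the IDs shown in this notebook. `make_chunk_id` is a hypothetical helper name.

```python
import hashlib

def make_chunk_id(doc_id: str, chunk_index: int) -> str:
    """Return a deterministic hex SHA-256 ID for (doc_id, chunk_index)."""
    key = f"{doc_id}::{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

doc = "container-instances/container-instances-quickstart.md"
id_a = make_chunk_id(doc, 0)
id_b = make_chunk_id(doc, 1)
print(len(id_a), id_a != id_b, id_a == make_chunk_id(doc, 0))
```

Because the ID depends only on `doc_id` and `chunk_index`, re-running the notebook with the same chunking parameters yields the same IDs. Changing `chunk_size` or `overlap` keeps the IDs but changes the text behind them, so downstream embeddings would need to be rebuilt in that case.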

In [0]:
chunked_df.count()

938

In [0]:
display(chunked_df.limit(5))

doc_id,source,category,title,url,chunk_index,chunk_text,ingest_time,chunk_id
container-instances/container-instances-quickstart.md,azure-compute-docs,container-instances,container-instances-quickstart,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/container-instances/container-instances-quickstart.md,0,title quickstart deploy docker container to container instance azure cli description in this quickstart you use the azure cli to quickly deploy a containerized web app that runs in an isolated azure container instance ms topic quickstart ms author tomcassidy author tomvcassidy ms service azure container instances services container instances ms date 11 17 2025 ms update cycle 180 days ms custom mvc devx track azurecli mode api customer intent as a developer i want to quickly deploy a docker container using the command line so that i can run my web application without managing complex orchestration platforms quickstart deploy a container instance in azure using the azure cli use azure container instances to run serverless docker containers in azure with simplicity and speed deploy an application to a container instance on demand when you don t need a full container orchestration platform like azure kubernetes service in this quickstart you use the azure cli to deploy an isolated docker container and make its application available with a fully qualified domain name fqdn a few seconds after you execute a single deployment command you can browse to the application running in the container view an app deployed to azure container instances in browser aci app browser include quickstarts free trial note include azure cli prepare your environment md this quickstart requires version 2 0 55 or later of the azure cli if using azure cloud shell the latest version is already installed warning best practice user s credentials passed via command line interface cli are stored as plain text in the backend storing credentials in plain text is a security risk microsoft advises customers to store user credentials in cli environment 
variables to ensure they are encrypted transformed when stored in the backend create a resource group azure container instances like all azure resources must be deployed into a resource group resource groups allow you to organize and manage related azure resources first create a resource group named myresourcegroup in the eastus location with the az group create az group create command azurecli interactive az group create name myresourcegroup location eastus create a container now that you have a resource group you can run a container in azure to create a container instance with the azure cli provide a resource group name container instance name and docker container image to the az container create az container create command in this,2026-01-15T00:41:03.347812Z,270cb38d2abc6af424f9c11d46123bbfb03530b32545c638a1b734f1e9767434
container-instances/container-instances-quickstart.md,azure-compute-docs,container-instances,container-instances-quickstart,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/container-instances/container-instances-quickstart.md,1,eastus create a container now that you have a resource group you can run a container in azure to create a container instance with the azure cli provide a resource group name container instance name and docker container image to the az container create az container create command in this quickstart you use the public mcr microsoft com azuredocs aci helloworld image this image packages a small web app written in node js that serves a static html page you can expose your containers to the internet by specifying one or more ports to open a dns name label or both in this quickstart you deploy a container with a dns name label so that the web app is publicly reachable execute a command similar to the following to start a container instance set a dns name label value that s unique within the azure region where you create the instance if you receive a dns name label not available error message try a different dns name label azurecli interactive az container create resource group myresourcegroup name mycontainer image mcr microsoft com azuredocs aci helloworld dns name label aci demo ports 80 os type linux memory 1 5 cpu 1 to deploy the container into a specific availability zone use the zone argument and specify the logical zone number azurecli interactive az container create resource group myresourcegroup name mycontainer image mcr microsoft com azuredocs aci helloworld dns name label aci demo ports 80 os type linux memory 1 5 cpu 1 zone 1 important zonal deployments are only available in regions that support availability zones to see if your region supports availability zones see azure regions list within a few seconds you should get a response from the azure cli indicating the deployment completed check its status with the 
az container show az container show command azurecli interactive az container show resource group myresourcegroup name mycontainer query fqdn ipaddress fqdn provisioningstate provisioningstate out table when you run the command the container s fully qualified domain name fqdn and its provisioning state are displayed output fqdn provisioningstate aci demo eastus azurecontainer io succeeded if the container s provisioningstate is succeeded go to its fqdn in your browser if you see a web page similar to the following congratulations you successfully deployed an application running in a docker container to azure view an app deployed to azure container instances in browser aci,2026-01-15T00:41:03.347812Z,4590942c3f8226cca172e9dc3d9d9a7d9d3ec44487ec8925019efd2940a29a16
container-instances/container-instances-quickstart.md,azure-compute-docs,container-instances,container-instances-quickstart,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/container-instances/container-instances-quickstart.md,2,io succeeded if the container s provisioningstate is succeeded go to its fqdn in your browser if you see a web page similar to the following congratulations you successfully deployed an application running in a docker container to azure view an app deployed to azure container instances in browser aci app browser if at first the application isn t displayed you might need to wait a few seconds while dns propagates then try refreshing your browser pull the container logs when you need to troubleshoot a container or the application it runs or just see its output start by viewing the container instance s logs pull the container instance logs with the az container logs az container logs command azurecli interactive az container logs resource group myresourcegroup name mycontainer the output displays the logs for the container and should show the http get requests generated when you viewed the application in your browser output listening on port 80 ffff 10 240 255 55 21 mar 2019 17 43 53 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 55 21 mar 2019 17 44 36 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 55 21 mar 2019 17 44 36 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 attach output streams in addition to viewing the logs you can attach your local standard out and standard error streams to that of the container first execute the az container attach az container attach command to attach your local console to the container s output streams 
azurecli interactive az container attach resource group myresourcegroup name mycontainer once attached refresh your browser a few times to generate some more output when you re done detach your console with control c you should see output similar to the following sample output container mycontainer is in state running count 1 last timestamp 2019 03 21 17 27 20 00 00 pulling image mcr microsoft com azuredocs aci helloworld count 1 last timestamp 2019 03 21,2026-01-15T00:41:03.347812Z,efdb721d468c11a3a99e75cc4e8e6035cd5a5f41f5030dbe830b7063ebed29e1
container-instances/container-instances-quickstart.md,azure-compute-docs,container-instances,container-instances-quickstart,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/container-instances/container-instances-quickstart.md,3,done detach your console with control c you should see output similar to the following sample output container mycontainer is in state running count 1 last timestamp 2019 03 21 17 27 20 00 00 pulling image mcr microsoft com azuredocs aci helloworld count 1 last timestamp 2019 03 21 17 27 24 00 00 successfully pulled image mcr microsoft com azuredocs aci helloworld count 1 last timestamp 2019 03 21 17 27 27 00 00 created container count 1 last timestamp 2019 03 21 17 27 27 00 00 started container start streaming logs listening on port 80 ffff 10 240 255 55 21 mar 2019 17 43 53 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 55 21 mar 2019 17 44 36 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 55 21 mar 2019 17 44 36 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 55 21 mar 2019 17 47 01 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 56 21 mar 2019 17 47 12 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 clean up resources when you re done with the container remove it using the az container delete az container delete command azurecli interactive az container delete resource group myresourcegroup name mycontainer to verify that the container deleted execute the az container list command azurecli interactive az container list resource group 
myresourcegroup output table the mycontainer container shouldn t appear in the command s output if you have no other containers in the resource group no output is displayed if you re done with the myresourcegroup resource group and all the resources it contains delete it with the az group delete az,2026-01-15T00:41:03.347812Z,7dcffc0990c5524e7f077598b05c63b9d72c6b43a822f73804f50bc776df74b6
container-instances/container-instances-quickstart.md,azure-compute-docs,container-instances,container-instances-quickstart,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/container-instances/container-instances-quickstart.md,4,output table the mycontainer container shouldn t appear in the command s output if you have no other containers in the resource group no output is displayed if you re done with the myresourcegroup resource group and all the resources it contains delete it with the az group delete az group delete command azurecli interactive az group delete name myresourcegroup next steps in this quickstart you created an azure container instance by using a public microsoft image if you d like to build a container image and deploy it from a private azure container registry continue to the azure container instances tutorial div class nextstepaction azure container instances tutorial to try out options for running containers in an orchestration system on azure see the azure kubernetes service aks container service quickstarts images aci app browser media container instances quickstart view an application running in an azure container instance png links external app github repo https github com azure samples aci helloworld git azure account https azure microsoft com free node js https nodejs org links internal az container attach cli azure container az_container_attach az container create cli azure container az_container_create az container delete cli azure container az_container_delete az container list cli azure container az_container_list az container logs cli azure container az_container_logs az container show cli azure container az_container_show az group create cli azure group az_group_create az group delete cli azure group az_group_delete azure cli install cli azure install azure cli container service azure aks intro kubernetes,2026-01-15T00:41:03.347812Z,d3d8c22762ecb7f17ad0fcb21dffcf4a62c1cb084a79134ba1d9aaae63ec0e64


In [0]:
# Write chunk table to Unity Catalog
# Parentheses (not braces) wrap the chained call; braces would build a Python set.

(
    chunked_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(CHUNKS_TABLE)
)

In [0]:
spark.sql(f"""
    SELECT COUNT(*) FROM {CHUNKS_TABLE};
""").display()


count(1)
938


In [0]:
spark.sql(f"""
    SELECT category, COUNT(*) AS chunks FROM {CHUNKS_TABLE} GROUP BY category ORDER BY chunks DESC
""").display()

category,chunks
virtual-machines,272
service-fabric,230
virtual-machine-scale-sets,221
container-instances,187
azure-impact-reporting,28
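
A quick consistency check on the two query results above: the per-category counts should add up to the total row count. The numbers below are copied from the outputs shown in this notebook; the variable names are illustrative.

```python
# Per-category chunk counts, copied from the GROUP BY output above.
category_counts = {
    "virtual-machines": 272,
    "service-fabric": 230,
    "virtual-machine-scale-sets": 221,
    "container-instances": 187,
    "azure-impact-reporting": 28,
}
total_chunks = 938  # from SELECT COUNT(*)

category_total = sum(category_counts.values())
print(category_total, category_total == total_chunks)
```

A mismatch here would suggest rows with a null category, since a GROUP BY would report those under a separate null group.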


In [0]:
spark.sql(f"""
    SELECT doc_id, chunk_index, LENGTH(chunk_text) AS chunk_len FROM {CHUNKS_TABLE} LIMIT 10
""").display()

# chunk_len is measured in characters, not bytes. Full 400-word chunks come out
# at roughly 2,400 characters, i.e. about 6 characters per word including the joining space.

doc_id,chunk_index,chunk_len
container-instances/container-instances-quickstart.md,0,2484
container-instances/container-instances-quickstart.md,1,2428
container-instances/container-instances-quickstart.md,2,2202
container-instances/container-instances-quickstart.md,3,2038
container-instances/container-instances-quickstart.md,4,1641
container-instances/container-instances-using-azure-container-registry.md,0,2510
container-instances/container-instances-using-azure-container-registry.md,1,2319
container-instances/container-instances-using-azure-container-registry.md,2,2480
container-instances/container-instances-using-azure-container-registry.md,3,2623
container-instances/container-instances-using-azure-container-registry.md,4,478
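
The relationship between `chunk_len` and word length can be worked out from the output above. A chunk of n words joined by single spaces has a length equal to the sum of its word lengths plus n - 1 spaces. The sketch below applies this to chunks 0 to 2 of each document, which must contain exactly 400 words: both documents have a chunk 4, so each has more than 1,400 words, and chunk 2 ends at word 1,100. The helper name is hypothetical.

```python
chunk_size = 400

# chunk_len values for chunks 0-2 of the two documents shown above.
full_chunk_lengths = [2484, 2428, 2202, 2510, 2319, 2480]

def mean_word_length(chunk_len: int, words: int = chunk_size) -> float:
    """Average characters per word, excluding the (words - 1) joining spaces."""
    return (chunk_len - (words - 1)) / words

for length in full_chunk_lengths:
    print(length, round(mean_word_length(length), 2))
```

For these chunks the average word is roughly 4.5 to 5.3 characters long, which is consistent with about 6 characters per word once the separating space is counted.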


In [0]:
# Optional cleanup: uncomment the lines below to drop the chunk table.

# spark.sql(f"""
#     DROP TABLE {CHUNKS_TABLE}
# """).display()