We now have one row per chunk. In this notebook we convert each chunk's text into a numeric vector (an embedding) so that it can later be used to:
- search by semantic similarity
- retrieve the relevant chunks
- feed them to an LLM for RAG.
 
The notebook reads chunk-level documents from Unity Catalog, generates text embeddings using an external embedding model, and stores the resulting vectors as a Delta table for semantic search and retrieval-augmented generation (RAG).
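
To make "search by semantic similarity" concrete, here is a minimal sketch of ranking chunks by cosine similarity to a query vector. The function names and the toy 3-dimensional vectors are assumptions for illustration only; real embeddings in this project have far more dimensions and would be read from the output table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy data: (chunk_id, embedding)
toy_chunks = [
    ("chunk-a", [1.0, 0.0, 0.0]),
    ("chunk-b", [0.9, 0.1, 0.0]),
    ("chunk-c", [0.0, 0.0, 1.0]),
]

results = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
print(results)
```

In a RAG flow, the text of the top-ranked chunks would then be placed into the LLM prompt as context.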

Input and Output:
- Input table: databricks_rag_demo.default.azure_compute_doc_chunks
- Output table: databricks_rag_demo.default.azure_compute_doc_embeddings

Embedding strategy (important decisions)

For this project we will:
- Use OpenAI-style embeddings (works with OpenAI or Azure OpenAI)
- Generate embeddings in batches (not per row)
- Store embeddings as ARRAY<FLOAT> (simple, portable)
- Keep metadata alongside vectors

This is a common pattern in production pipelines.

In [0]:
import mlflow
# Disable mlflow autologging
mlflow.autolog(disable=True)
mlflow.openai.autolog(disable=True)

In [0]:
%run ./00_install_deps_and_restart

In [0]:

%run ./00_constants

In [0]:
%run ./00_utils

Collecting openai<2.0.0,>=1.0.0
  Downloading openai-1.109.1-py3-none-any.whl.metadata (29 kB)
Collecting anyio<5,>=3.5.0 (from openai<2.0.0,>=1.0.0)
  Downloading anyio-4.12.1-py3-none-any.whl.metadata (4.3 kB)
Collecting httpx<1,>=0.23.0 (from openai<2.0.0,>=1.0.0)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai<2.0.0,>=1.0.0)
  Downloading jiter-0.12.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting sniffio (from openai<2.0.0,>=1.0.0)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting tqdm>4 (from openai<2.0.0,>=1.0.0)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai<2.0.0

In [0]:
%run ./00_init_openai_client
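
The helper notebooks are not shown here, so the exact definition of `embed_texts` used later is unknown. As a rough sketch only, a batch embedding helper for an OpenAI-style client could look like the following. The name `embed_texts_sketch`, the explicit `client` parameter, and the model name `text-embedding-3-small` are assumptions, not the notebook's actual code. A stand-in client is used so the example runs without network access or credentials.

```python
from types import SimpleNamespace


def embed_texts_sketch(texts, client, model="text-embedding-3-small"):
    """Embed a list of strings in one request; return vectors in input order.

    `client` is expected to expose `embeddings.create(model=..., input=...)`,
    as the OpenAI Python client (v1.x) does. The model name is an assumption.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the input it belongs to;
    # sorting by it keeps vectors aligned with the input texts.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class _FakeEmbeddings:
    """Stand-in for a real client, for illustration only."""

    def create(self, model, input):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
            for i, text in enumerate(input)
        ]
        # Return items out of order to show that sorting restores alignment.
        return SimpleNamespace(data=list(reversed(data)))


fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vectors = embed_texts_sketch(["ab", "abcd"], client=fake_client)
print(vectors)
```

With a real client, the same function would be called with a client object created from the OpenAI or Azure OpenAI configuration instead of `fake_client`.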

In [0]:
import os
import time
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

In [0]:
chunks_df = spark.table(CHUNKS_TABLE)

In [0]:
spark.sql(f"""
    SELECT doc_id, category, chunk_id, chunk_text FROM {CHUNKS_TABLE} LIMIT 3
""").display()

doc_id,category,chunk_id,chunk_text
container-instances/container-instances-quickstart.md,container-instances,270cb38d2abc6af424f9c11d46123bbfb03530b32545c638a1b734f1e9767434,title quickstart deploy docker container to container instance azure cli description in this quickstart you use the azure cli to quickly deploy a containerized web app that runs in an isolated azure container instance ms topic quickstart ms author tomcassidy author tomvcassidy ms service azure container instances services container instances ms date 11 17 2025 ms update cycle 180 days ms custom mvc devx track azurecli mode api customer intent as a developer i want to quickly deploy a docker container using the command line so that i can run my web application without managing complex orchestration platforms quickstart deploy a container instance in azure using the azure cli use azure container instances to run serverless docker containers in azure with simplicity and speed deploy an application to a container instance on demand when you don t need a full container orchestration platform like azure kubernetes service in this quickstart you use the azure cli to deploy an isolated docker container and make its application available with a fully qualified domain name fqdn a few seconds after you execute a single deployment command you can browse to the application running in the container view an app deployed to azure container instances in browser aci app browser include quickstarts free trial note include azure cli prepare your environment md this quickstart requires version 2 0 55 or later of the azure cli if using azure cloud shell the latest version is already installed warning best practice user s credentials passed via command line interface cli are stored as plain text in the backend storing credentials in plain text is a security risk microsoft advises customers to store user credentials in cli environment variables to ensure they are encrypted transformed when stored in the backend create a resource group azure container 
instances like all azure resources must be deployed into a resource group resource groups allow you to organize and manage related azure resources first create a resource group named myresourcegroup in the eastus location with the az group create az group create command azurecli interactive az group create name myresourcegroup location eastus create a container now that you have a resource group you can run a container in azure to create a container instance with the azure cli provide a resource group name container instance name and docker container image to the az container create az container create command in this
container-instances/container-instances-quickstart.md,container-instances,4590942c3f8226cca172e9dc3d9d9a7d9d3ec44487ec8925019efd2940a29a16,eastus create a container now that you have a resource group you can run a container in azure to create a container instance with the azure cli provide a resource group name container instance name and docker container image to the az container create az container create command in this quickstart you use the public mcr microsoft com azuredocs aci helloworld image this image packages a small web app written in node js that serves a static html page you can expose your containers to the internet by specifying one or more ports to open a dns name label or both in this quickstart you deploy a container with a dns name label so that the web app is publicly reachable execute a command similar to the following to start a container instance set a dns name label value that s unique within the azure region where you create the instance if you receive a dns name label not available error message try a different dns name label azurecli interactive az container create resource group myresourcegroup name mycontainer image mcr microsoft com azuredocs aci helloworld dns name label aci demo ports 80 os type linux memory 1 5 cpu 1 to deploy the container into a specific availability zone use the zone argument and specify the logical zone number azurecli interactive az container create resource group myresourcegroup name mycontainer image mcr microsoft com azuredocs aci helloworld dns name label aci demo ports 80 os type linux memory 1 5 cpu 1 zone 1 important zonal deployments are only available in regions that support availability zones to see if your region supports availability zones see azure regions list within a few seconds you should get a response from the azure cli indicating the deployment completed check its status with the az container show az container show command azurecli interactive az container show resource group 
myresourcegroup name mycontainer query fqdn ipaddress fqdn provisioningstate provisioningstate out table when you run the command the container s fully qualified domain name fqdn and its provisioning state are displayed output fqdn provisioningstate aci demo eastus azurecontainer io succeeded if the container s provisioningstate is succeeded go to its fqdn in your browser if you see a web page similar to the following congratulations you successfully deployed an application running in a docker container to azure view an app deployed to azure container instances in browser aci
container-instances/container-instances-quickstart.md,container-instances,efdb721d468c11a3a99e75cc4e8e6035cd5a5f41f5030dbe830b7063ebed29e1,io succeeded if the container s provisioningstate is succeeded go to its fqdn in your browser if you see a web page similar to the following congratulations you successfully deployed an application running in a docker container to azure view an app deployed to azure container instances in browser aci app browser if at first the application isn t displayed you might need to wait a few seconds while dns propagates then try refreshing your browser pull the container logs when you need to troubleshoot a container or the application it runs or just see its output start by viewing the container instance s logs pull the container instance logs with the az container logs az container logs command azurecli interactive az container logs resource group myresourcegroup name mycontainer the output displays the logs for the container and should show the http get requests generated when you viewed the application in your browser output listening on port 80 ffff 10 240 255 55 21 mar 2019 17 43 53 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 55 21 mar 2019 17 44 36 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 ffff 10 240 255 55 21 mar 2019 17 44 36 0000 get http 1 1 304 mozilla 5 0 windows nt 10 0 win64 x64 applewebkit 537 36 khtml like gecko chrome 72 0 3626 121 safari 537 36 attach output streams in addition to viewing the logs you can attach your local standard out and standard error streams to that of the container first execute the az container attach az container attach command to attach your local console to the container s output streams azurecli interactive az container attach resource group myresourcegroup name mycontainer once attached refresh your 
browser a few times to generate some more output when you re done detach your console with control c you should see output similar to the following sample output container mycontainer is in state running count 1 last timestamp 2019 03 21 17 27 20 00 00 pulling image mcr microsoft com azuredocs aci helloworld count 1 last timestamp 2019 03 21


In [0]:
# Sanity check: preview a few rows.
# F is already imported above. chunk_text is too long to display in full,
# so show only its first 200 characters.
chunks_df.select(
    "doc_id",
    "category",
    "chunk_index",
    F.substring("chunk_text", 1, 200).alias("chunk_preview")
).show(3, truncate=False)

+-----------------------------------------------------+-------------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|doc_id                                               |category           |chunk_index|chunk_preview                                                                                                                                                                                           |
+-----------------------------------------------------+-------------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|container-instances/container-instances-quickstart.md|container-instances|0          |title quickstart deploy docker container to co

In [0]:
# Collect all chunk rows to the driver; they are split into batches of BATCH_SIZE when embedding

BATCH_SIZE = 64

rows = chunks_df.select(
    "chunk_id",
    "doc_id",
    "category",
    "title",
    "url",
    "chunk_index",
    "chunk_text"
).collect()

print(f"Found {len(rows)} rows")

Found 938 rows
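
As a quick check of the batching arithmetic, the sketch below works out how many API calls 938 rows produce with a batch size of 64, and how large the final batch is. Only the two numbers come from the notebook; the variable names are illustrative.

```python
import math

TOTAL_ROWS = 938  # row count reported above
BATCH_SIZE = 64

# Start offsets of each batch, as produced by range(0, len(rows), BATCH_SIZE)
starts = list(range(0, TOTAL_ROWS, BATCH_SIZE))
num_batches = math.ceil(TOTAL_ROWS / BATCH_SIZE)
last_batch_size = TOTAL_ROWS - starts[-1]

print(len(starts), num_batches)     # 15 15
print(starts[-1], last_batch_size)  # 896 42
```

Python slicing (`rows[i:i + BATCH_SIZE]`) simply returns the shorter final batch, so no special handling is needed for the last 42 rows.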


In [0]:
# The loop below can run for a while, depending on how many chunks there are

# Generate embeddings
embedded_rows = []

for i in range(0, len(rows), BATCH_SIZE):
    print("Chunk in row: ", i)
    batch = rows[i:i + BATCH_SIZE]
    texts = [r.chunk_text for r in batch]

    embeddings = embed_texts(texts)

    for r, emb in zip(batch, embeddings):
        embedded_rows.append((
            r.chunk_id,
            r.doc_id,
            r.category,
            r.title,
            r.url,
            r.chunk_index,
            r.chunk_text,
            emb
        ))

    # Used during initial testing to limit data: stop after the first 10 batches
    # if i >= 9 * BATCH_SIZE: break

    time.sleep(0.5)  # be polite to API

Chunk in row:  0
Chunk in row:  64
Chunk in row:  128
Chunk in row:  192
Chunk in row:  256
Chunk in row:  320
Chunk in row:  384
Chunk in row:  448
Chunk in row:  512
Chunk in row:  576
Chunk in row:  640
Chunk in row:  704
Chunk in row:  768
Chunk in row:  832
Chunk in row:  896
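
The loop above pauses a fixed 0.5 seconds between batches, and a single failed API call would abort the whole run. One possible hardening, shown here only as a sketch, is to wrap each call in a retry with exponential backoff. The helper `call_with_retry` and its parameters are hypothetical and not part of the notebook. Catching a bare `Exception` keeps the example short; in practice it would be narrowed to the client's rate-limit and timeout errors.

```python
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Demonstration with a function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient error")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
print(result, calls["n"], delays)
```

In the embedding loop, the call would become something like `call_with_retry(lambda: embed_texts(texts))`.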


In [0]:
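# Create embeddings DataFrame
# Only column names are given, so Spark infers the types. Python floats are
# inferred as DOUBLE, which makes "embedding" ARRAY<DOUBLE>; pass an explicit
# schema with ArrayType(FloatType()) if ARRAY<FLOAT> is required.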

embeddings_df = spark.createDataFrame(
    embedded_rows,
    schema=[
        "chunk_id",
        "doc_id",
        "category",
        "title",
        "url",
        "chunk_index",
        "chunk_text",
        "embedding"
    ]
)

# Check vector length:
embeddings_df.select(F.size("embedding").alias("dim")).distinct().show()

+----+
| dim|
+----+
|1536|
+----+
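
The check above runs after the DataFrame has been built. The same property can also be verified on the plain Python lists before handing them to Spark, which makes a bad batch easier to locate. The following is a minimal sketch; the helper name `check_dimensions` and the sample data are assumptions for illustration.

```python
def check_dimensions(vectors, expected_dim):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Hypothetical sample: two well-formed vectors and one truncated vector
sample = [[0.1] * 1536, [0.2] * 1536, [0.3] * 10]
bad = check_dimensions(sample, expected_dim=1536)
print(bad)  # [2]
```

An empty result means every vector has the expected length.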



In [0]:
(
    embeddings_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(EMB_TABLE)
)

In [0]:
spark.sql(f"""
    SELECT COUNT(*) FROM {EMB_TABLE}
""").display()


count(1)
938


In [0]:
spark.sql(f"""
    SELECT category, size(embedding) AS embedding_dim FROM {EMB_TABLE} LIMIT 5
""").display()

category,embedding_dim
container-instances,1536
container-instances,1536
container-instances,1536
container-instances,1536
container-instances,1536
