We now have one row per chunk. In this notebook we convert each chunk's text into a numeric vector (an embedding) so that it can later be used to search by semantic similarity, retrieve the most relevant chunks, and feed them to an LLM for RAG.

The notebook reads chunk-level documents from Unity Catalog, generates text embeddings using an external embedding model, and stores the resulting vectors as a Delta table for semantic search and retrieval-augmented generation (RAG).
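
To make "search by semantic similarity" concrete, here is a minimal sketch of how stored vectors are typically compared: score each chunk against a query vector with cosine similarity and keep the best matches. The helper names (`cosine_similarity`, `top_k`) and the 3-dimensional toy vectors are illustrative assumptions, not part of this project; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vectors, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings (hypothetical data).
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 4))
```

A brute-force scan like this is only meant to show the idea; a later retrieval step would normally delegate the ranking to a vector index.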

Input and Output:
- Input table: databricks_rag_demo.default.azure_compute_doc_chunks
- Output table: databricks_rag_demo.default.azure_compute_doc_embeddings

Embedding strategy (important decisions)

For this project we will:
- Use OpenAI-style embeddings (works with OpenAI or Azure OpenAI)
- Generate embeddings in batches (not per row)
- Store embeddings as ARRAY<FLOAT> (simple, portable)
- Keep metadata alongside vectors

This is a common pattern for embedding pipelines of this size.

In [0]:
import os
import time
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

In [0]:
chunks_df = spark.table(
    "databricks_rag_demo.default.azure_compute_doc_chunks"
)

In [0]:
%sql
SELECT doc_id, category, chunk_id, chunk_text FROM databricks_rag_demo.default.azure_compute_doc_chunks LIMIT 3

doc_id,category,chunk_id,chunk_text
virtual-machines/extensions/salt-minion.md,virtual-machines,8c083b147e62c8532ca84e73b344b0079e8b0cadebffb92a9edfd1546f525b8a,title salt minion for linux or windows azure vms description install salt minion on linux or windows vms using the vm extension ms topic concept article ms service azure virtual machines ms subservice extensions ms custom devx track arm template devx track azurecli devx track terraform linux related content ms author gabsta author gabstamsft ms date 08 18 2025 customer intent as a cloud administrator i want to install salt minion on my azure vms using vm extensions so that i can effectively manage and automate configurations across my infrastructure install salt minion on linux or windows vms using the vm extension prerequisites a microsoft azure account with one or more windows or linux vms a salt master either on premises or in a cloud that can accept connections from salt minions hosted on azure the salt minion vm extension requires that the target vm is connected to the internet in order to fetch salt packages include vm assist troubleshooting tools supported platforms azure vm running any of the following supported os ubuntu 20 04 22 04 x86_64 debian 10 11 x86_64 oracle linux 7 8 9 x86_64 rhel 7 8 9 x86_64 microsoft windows 10 11 pro x86_64 microsoft windows server 2012 r2 2016 2019 2022 datacenter x86_64 if you want another distro to be supported assuming salt supports it an issue can be filed on gitlab supported salt minion versions 3006 and up onedir extension details publisher name turtletraction oss linux extension name salt minion linux windows extension name salt minion windows salt minion settings master_address salt master address to connect to localhost by default minion_id minion id hostname by default salt_version salt minion version to install for example 3006 1 latest by default install salt minion using the azure portal 1 select one of your vms 2 in the left menu click extensions applications 3 click add 4 in the gallery 
type salt minion in the search bar 5 select the salt minion tile and click next 6 enter configuration parameters in the provided form see salt minion settings 7 click review create install salt minion using the azure cli to uninstall it install salt minion using the azure arm template install salt minion using terraform assuming that you have defined a vm resource in terraform named vm_ubuntu then use something like this to install the extension on it support for commercial support or assistance with salt
virtual-machines/extensions/salt-minion.md,virtual-machines,b31b8f47afa253eb5a55d07e79f3ae3aec9ba1cb892b4161ca98873280c7eff0,the azure cli to uninstall it install salt minion using the azure arm template install salt minion using terraform assuming that you have defined a vm resource in terraform named vm_ubuntu then use something like this to install the extension on it support for commercial support or assistance with salt you can visit the extension creator turtletraction the source code of this extension is available on gitlab for azure related issues you can file an azure support incident go to the azure support site and select get support
virtual-machines/extensions/backup-azure-sql-server-running-azure-vm.md,virtual-machines,e81e730867bab0e1924516a758e63f50decf7d4ec97fe49e4e0640fe734eb34a,title azure backup for sql server running in azure vm description in this article learn how to register azure backup in sql server running in an azure virtual machine ms topic concept article ms service azure virtual machines ms subservice extensions ms author gabsta ms reviewer jushiman author gabstamsft ms collection windows ms date 08 18 2025 customer intent as a database administrator i want to register azure backup for my sql server running in an azure vm so that i can ensure reliable backup and recovery of my database workloads azure backup for sql server running in azure vm azure backup amongst other offerings provides support for backing up workloads such as sql server running in azure vms since the sql application is running within an azure vm the backup service needs permission to access the application and fetch the necessary details to do that azure backup installs the azurebackupwindowsworkload extension on the vm in which the sql server is running during the registration process triggered by the user include vm assist troubleshooting tools prerequisites for the list of supported scenarios refer to the supportability matrix supported by azure backup network connectivity azure backup supports nsg tags deploying a proxy server or listed ip ranges for details on each of the methods refer this article extension schema the extension schema and property values are the configuration values runtime settings that service is passing to crp api these config values are used during registration and upgrade azurebackupwindowsworkload extension also uses this schema the schema is pre set a new parameter can be added in the objectstr field the following json shows the schema for the workloadbackup extension property values name value example data type locale en us string taskid 1c0ae461 9d3b 418c a505 bb31dfe2095d 
string objectstr br publicsettings eyjjb250ywluzxjqcm9wzxj0awvzijp7iknvbnrhaw5lckleijoimzvjmjqxytitogrjny00zge5lwi4ntmtmjdjytjhndzlm2zkiiwiswrnz210q29udgfpbmvyswqiojm0nty3odg5lcjszxnvdxjjzulkijoimdu5nwiwogetyzi4zi00zmfllwe5oditotkwowmymgvjnjvhiiwiu3vic2nyaxb0aw9uswqioijkngezotliny1iyjayltq2mwmtoddmys1jntm5odi3ztgzntqilcjvbmlxdwvdb250ywluzxjoyw1lijoiodm4mdzjodutntq4os00nmnhlweyztctnwmznznhyjg3otcyin0sinn0yw1wtglzdci6w3siu2vydmljzu5hbwuiojusilnlcnzpy2vtdgftcfvybci6imh0dha6xc9cl015v0xgywjtdmmuy29tin1dfq string commandstarttimeutcticks 636967192566036845 string vmtype microsoft compute virtualmachines string objectstr br protectedsettings eyjjb250ywluzxjqcm9wzxj0awvzijp7iknvbnrhaw5lckleijoimzvjmjqxytitogrjny00zge5lwi4ntmtmjdjytjhndzlm2zkiiwiswrnz210q29udgfpbmvyswqiojm0nty3odg5lcjszxnvdxjjzulkijoimdu5nwiwogetyzi4zi00zmfllwe5oditotkwowmymgvjnjvhiiwiu3vic2nyaxb0aw9uswqioijkngezotliny1iyjayltq2mwmtoddmys1jntm5odi3ztgzntqilcjvbmlxdwvdb250ywluzxjoyw1lijoiodm4mdzjodutntq4os00nmnhlweyztctnwmznznhyjg3otcyin0sinn0yw1wtglzdci6w3siu2vydmljzu5hbwuiojusilnlcnzpy2vtdgftcfvybci6imh0dha6xc9cl015v0xgywjtdmmuy29tin1dfq string logsbloburi span https span seapod01coord1exsapk732 blob core windows net bcdrextensionlogs 111111111 1111 1111 1111 111111111111 vmubuntu1404ltsc v2 logs txt sv 2014 02 14 sr b sig dbwyhwfeac5yjzisgxokk 2fewqq2ao1vs1e0rdw 2flsbw 3d st 2017 11 09t14 3a33 3a29z se 2017 11 09t17 3a38 3a29z sp rw string statusbloburi span https span seapod01coord1exsapk732 blob core windows net bcdrextensionlogs 111111111 1111 1111 1111 111111111111 vmubuntu1404ltsc v2 status txt sv 2014 02 14 sr b sig 96rzbptkcjmv7qfexm5idub 2filktwgblwbwg6ih96ao 3d st 2017 11 09t14 3a33 3a29z se 2017 11 09t17 3a38 3a29z sp


In [0]:
# Sanity check: chunk_text is too long to show in full, so preview the first 200 characters
chunks_df.select(
    "doc_id",
    "category",
    "chunk_index",
    F.substring("chunk_text", 1, 200).alias("chunk_preview")
).show(3, truncate=False)

+-----------------------------------------------------------------------+----------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|doc_id                                                                 |category        |chunk_index|chunk_preview                                                                                                                                                                                           |
+-----------------------------------------------------------------------+----------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|virtual-machines/extensions/salt-minion.md                             |virtual-machine

How to create an Azure OpenAI service and get an API key:
- In Azure Portal: Create Azure OpenAI resource
- Deploy a model: text-embedding-3-small or text-embedding-ada-002
- Get:
	- Endpoint
	- API key


How to store the API key for the notebook (one-time):
- Workspace → Secrets
- Create scope: openai
  - Key: OPENAI_API_KEY
  - Value: your key
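
The client cell further down reads `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` from the environment, so the stored secret has to be copied there first. The sketch below is one possible way to do that, assuming the scope `openai` and key `OPENAI_API_KEY` described above. The helper name `export_azure_openai_env` and the endpoint placeholder are hypothetical; in a Databricks notebook the getter would be `dbutils.secrets.get`.

```python
import os


def export_azure_openai_env(get_secret, endpoint,
                            scope="openai", key="OPENAI_API_KEY"):
    """Copy the stored secret into the environment variables the client reads.

    `get_secret` is any callable accepting `scope=` and `key=` keyword
    arguments, for example `dbutils.secrets.get` inside Databricks.
    """
    os.environ["AZURE_OPENAI_API_KEY"] = get_secret(scope=scope, key=key)
    os.environ["AZURE_OPENAI_ENDPOINT"] = endpoint


# Assumed usage inside a Databricks notebook (placeholder endpoint):
# export_azure_openai_env(
#     dbutils.secrets.get,
#     "https://<your-resource>.openai.azure.com/",
# )
```

Passing the getter in as an argument keeps the helper testable outside Databricks, where `dbutils` is not defined.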

In [0]:
%pip install openai

Collecting openai
  Downloading openai-2.14.0-py3-none-any.whl.metadata (29 kB)
Collecting anyio<5,>=3.5.0 (from openai)
  Downloading anyio-4.12.1-py3-none-any.whl.metadata (4.3 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.10.0 (from openai)
  Downloading jiter-0.12.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting sniffio (from openai)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting tqdm>4 (from openai)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.9-py3-none-any.whl.metadata (21 kB)
Collecting h11>=0.16 (fro

In [0]:
# Initialize the Azure OpenAI embedding client
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"]
)

In [0]:
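# We do NOT request an embedding for one row at a time; we send batches for cost and performance.

def embed_texts(texts, model="text-embedding-3-small"):
    """Embed a list of strings in one API call; returns one vector per input, in input order."""
    # With Azure OpenAI, `model` is the name of your embedding deployment.
    response = client.embeddings.create(
        model=model,
        input=texts
    )
    # Sort by index so each vector lines up with its input text.
    return [d.embedding for d in sorted(response.data, key=lambda d: d.index)]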
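
Batched calls can still fail transiently (rate limits, timeouts). A minimal sketch of one way to handle that is a small retry wrapper with exponential backoff. The name `with_retries` and its parameters are hypothetical and not part of the original notebook. It catches every `Exception` for brevity; in practice you would probably narrow this to the client's rate-limit and timeout errors.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt, then retry.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Assumed usage with the function defined above:
# embeddings = with_retries(lambda: embed_texts(texts))
```

The `sleep` parameter exists only so the wait can be replaced in tests; the default behaves like a plain `time.sleep`.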

In [0]:
# Collect all chunk rows to the driver; they are split into batches of BATCH_SIZE in the next cell

BATCH_SIZE = 64

rows = chunks_df.select(
    "chunk_id",
    "doc_id",
    "category",
    "title",
    "url",
    "chunk_index",
    "chunk_text"
).collect()

print(f"Found {len(rows)} rows")

Found 5758 rows
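
Calling `collect()` is fine for 5758 rows, but it pulls the whole table into driver memory. For a larger table, one option is to stream rows (for example from `chunks_df.toLocalIterator()`) and group them lazily. The sketch below shows such a grouping helper on a plain Python iterable, together with the batch count to expect at this table size. The name `batched` is an illustrative choice, not something the notebook defines.

```python
import math
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


total_rows = 5758   # row count reported above
batch_size = 64     # same value as BATCH_SIZE

# Number of API calls needed to embed every row.
expected_batches = math.ceil(total_rows / batch_size)

# Stand-in for the real rows: any iterable works, including a row iterator.
batches = list(batched(range(total_rows), batch_size))
print(expected_batches, len(batches), len(batches[-1]))
```

The last batch is shorter than the rest whenever the row count is not a multiple of the batch size, which the embedding loop already handles through slicing.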


In [0]:
#  Generate embeddings

embedded_rows = []

for i in range(0, len(rows), BATCH_SIZE):
    print(f"{i}")
    batch = rows[i:i + BATCH_SIZE]
    texts = [r.chunk_text for r in batch]

    embeddings = embed_texts(texts)

    for r, emb in zip(batch, embeddings):
        embedded_rows.append((
            r.chunk_id,
            r.doc_id,
            r.category,
            r.title,
            r.url,
            r.chunk_index,
            r.chunk_text,
            emb
        ))

    # Demo cap: stop after the first 12 batches (768 rows)
    if i >= 11 * BATCH_SIZE:
        break

    time.sleep(0.5)  # be polite to the API

0
64
128
192
256
320
384
448
512
576
640
704



In [0]:
# Create embeddings DataFrame

embeddings_df = spark.createDataFrame(
    embedded_rows,
    schema=[
        "chunk_id",
        "doc_id",
        "category",
        "title",
        "url",
        "chunk_index",
        "chunk_text",
        "embedding"
    ]
)

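# Python floats are inferred as DOUBLE; cast so the column is stored as ARRAY<FLOAT>
embeddings_df = embeddings_df.withColumn(
    "embedding", F.col("embedding").cast(ArrayType(FloatType()))
)

# Check vector length:
embeddings_df.select(F.size("embedding").alias("dim")).distinct().show()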

+----+
| dim|
+----+
|1536|
+----+
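
With the dimension known, a quick back-of-the-envelope estimate shows what storing vectors as 32-bit floats costs. This is a rough sketch of the raw, uncompressed vector payload only; the actual Delta table size will differ because of Parquet encoding, compression, and the metadata columns. The row count is the full chunk table (5758), not the 768-row demo subset.

```python
import struct

DIM = 1536    # embedding dimension reported above
ROWS = 5758   # rows in the full chunk table

bytes_per_float = struct.calcsize("<f")   # standard size of a 32-bit float
bytes_per_vector = DIM * bytes_per_float
total_bytes = ROWS * bytes_per_vector

print(f"{bytes_per_vector} bytes per vector")
print(f"{total_bytes / 1024 ** 2:.1f} MiB of raw vectors for {ROWS} rows")
```

Stored as 64-bit doubles, the same vectors would take twice as much raw space.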



In [0]:
(
    embeddings_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(
        "databricks_rag_demo.default.azure_compute_doc_embeddings"
    )
)

In [0]:
%sql
SELECT COUNT(*) FROM databricks_rag_demo.default.azure_compute_doc_embeddings;

count(1)
768


In [0]:
%sql
SELECT category, size(embedding) AS embedding_dim FROM databricks_rag_demo.default.azure_compute_doc_embeddings LIMIT 5;