We now have one row per chunk. In this notebook we convert each chunk's text into a numeric vector (an embedding) so that it can later be used to search by semantic similarity, retrieve the relevant chunks, and feed them to an LLM for RAG.
 
It reads chunk-level documents from Unity Catalog, generates text embeddings using an external embedding model, and stores the resulting vectors as a Delta table for semantic search and retrieval-augmented generation (RAG).
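
To make "semantic similarity" concrete, here is a minimal sketch of how two embedding vectors are commonly compared using cosine similarity. The function name `cosine_similarity` and the tiny three-dimensional vectors are illustrative assumptions, not part of this project's code or real model output.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration only
query = [1.0, 0.0, 1.0]
chunk_a = [2.0, 0.0, 2.0]  # points in the same direction as the query
chunk_b = [0.0, 1.0, 0.0]  # orthogonal to the query

score_a = cosine_similarity(query, chunk_a)
score_b = cosine_similarity(query, chunk_b)
```

A score near 1 means the two texts point in nearly the same direction in embedding space, while a score near 0 means they are unrelated; retrieval ranks chunks by this kind of score against the query vector.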

Input and Output:
- Input table: databricks_rag_demo.default.azure_compute_doc_chunks
- Output table: databricks_rag_demo.default.azure_compute_doc_embeddings

Embedding strategy (important decisions)

For this project we will:
- Use OpenAI-style embeddings (works with OpenAI or Azure OpenAI)
- Generate embeddings in batches (not per row)
- Store embeddings as ARRAY<FLOAT> (simple, portable)
- Keep metadata alongside vectors

This is a common pattern for embedding pipelines.
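
The batching decision can be illustrated with a small helper. This is only a sketch: `iter_batches` is a hypothetical name, and the placeholder strings stand in for real chunk texts. The point is that one API request carries many texts, so the number of requests drops from one per row to one per batch.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder texts standing in for chunk_text values
texts = [f"chunk {n}" for n in range(947)]

batches = list(iter_batches(texts, 64))
num_requests = len(batches)          # one request per batch
last_batch_size = len(batches[-1])   # the final batch may be shorter
```

With 947 texts and a batch size of 64 this yields 15 batches, the last of which holds the remaining 51 texts.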

In [0]:
%run ./00_install_deps_and_restart

In [0]:

%run ./00_constants

In [0]:
%run ./00_utils

In [0]:
%run ./00_init_openai_client

In [0]:
import os
import time
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

In [0]:
chunks_df = spark.table(CHUNKS_TABLE)

In [0]:
spark.sql(f"""
    SELECT doc_id, category, chunk_id, chunk_text FROM {CHUNKS_TABLE} LIMIT 3
""").display()

doc_id,category,chunk_id,chunk_text
virtual-machines/hc-series-performance.md,virtual-machines,e0a4a830db8bf7306f91c8934db2e3eb86a108e1367dcd3ee553adb9082f5212,title hc series vm size performance description learn about performance testing results for hc series vm sizes in azure ms service azure virtual machines ms subservice hpc ms topic concept article ms date 07 25 2024 ms reviewer cynthn ms author padmalathas author cynthn customer intent as a cloud architect i want to analyze the performance results of hc series vm sizes so that i can select the optimal configuration for my high performance computing workloads hc series virtual machine sizes applies to heavy_check_mark linux vms heavy_check_mark windows vms heavy_check_mark flexible scale sets heavy_check_mark uniform scale sets several performance tests have been run on hc series sizes the following are some of the results of this performance testing workload hb stream triad 190 gb s intel mlc avx 512 high performance linpack hpl 3520 gigaflops rpeak 2970 gigaflops rmax rdma latency bandwidth 1 05 microseconds 96 8 gb s fio on local nvme ssd 1 3 gb s reads 900 mb s writes ior on 4 azure premium ssd p30 managed disks raid0 780 mb s reads 780 mb writes mpi latency mpi latency test from the osu microbenchmark suite is run sample scripts are on github bash bin mpirun_rsh np 2 hostfile hostfile mv2_cpu_mapping insert core osu_latency mpi bandwidth mpi bandwidth test from the osu microbenchmark suite is run sample scripts are on github bash mvapich2 2 3 install bin mpirun_rsh np 2 hostfile hostfile mv2_cpu_mapping insert core mvapich2 2 3 osu_benchmarks mpi pt2pt osu_bw mellanox perftest the mellanox perftest package has many infiniband tests such as latency ib_send_lat and bandwidth ib_send_bw an example command is below console numactl physcpubind insert core ib_send_lat a next steps read about the latest announcements hpc workload examples and performance results at the azure compute tech community blogs for a higher level architectural view of 
running hpc workloads see high performance computing hpc on azure
virtual-machines/premium-storage-performance.md,virtual-machines,d289e62cefb20a7fac3323cc6fccb65d54c83a66cddd95e1b68b00c3e473ed43,title azure premium storage design for high performance description design high performance apps by using azure premium ssd managed disks azure premium storage offers high performance low latency disk support for i o intensive workloads running on azure vms author roygara ms service azure disk storage ms custom linux related content ms topic concept article ms date 06 29 2021 ms author rogarana customer intent as a developer i want to optimize application performance on premium storage so that i can ensure my high performance apps meet the demands of i o intensive workloads efficiently azure premium storage design for high performance applies to heavy_check_mark linux vms heavy_check_mark windows vms heavy_check_mark flexible scale sets heavy_check_mark uniform scale sets this article provides guidelines for building high performance applications by using azure premium storage you can use the instructions provided in this document combined with performance best practices applicable to technologies used by your application to illustrate the guidelines we use sql server running on premium storage as an example throughout this document while we address performance scenarios for the storage layer in this article you need to optimize the application layer for example if you re hosting a sharepoint farm on premium storage you can use the sql server examples from this article to optimize the database server you can also optimize the sharepoint farm s web server and application server to get the most performance this article helps to answer the following common questions about optimizing application performance on premium storage how can you measure your application performance why aren t you seeing expected high performance which factors influence your application performance on premium storage how do these factors influence performance of your 
application on premium storage how can you optimize for input output operations per second iops bandwidth and latency we provide these guidelines specifically for premium storage because workloads running on premium storage are highly performance sensitive we provide examples where appropriate you can also apply some of these guidelines to applications running on infrastructure as a service iaas vms with standard storage disks note sometimes what appears to be a disk performance issue is actually a network bottleneck in these situations you should optimize your network performance if you re looking to benchmark your disk see the following articles for linux benchmark your application on azure disk storage for windows benchmark a disk if your vm supports
virtual-machines/premium-storage-performance.md,virtual-machines,7eb206e0baf9dccd72fd1c0d1c1db060d37cdd64f5b5f190080eb48111b91d8e,to be a disk performance issue is actually a network bottleneck in these situations you should optimize your network performance if you re looking to benchmark your disk see the following articles for linux benchmark your application on azure disk storage for windows benchmark a disk if your vm supports accelerated networking make sure it s enabled if it s not enabled you can enable it on already deployed vms on both windows and linux before you begin if you re new to premium storage first read select an azure disk type for iaas vms and scalability targets for premium page blob storage accounts application performance indicators we assess whether an application is performing well or not by using performance indicators like how fast an application is processing a user request how much data an application is processing per request how many requests an application is processing in a specific period of time how long a user has to wait to get a response after submitting their request the technical terms for these performance indicators are iops throughput or bandwidth and latency in this section we discuss the common performance indicators in the context of premium storage in the section performance application checklist for disks you learn how to measure these performance indicators for your application later in optimize application performance you learn about the factors that affect these performance indicators and recommendations to optimize them iops iops is the number of requests that your application is sending to storage disks in one second an input output operation could be read or write sequential or random online transaction processing oltp applications like an online retail website need to process many concurrent user requests immediately the user requests are insert and update intensive database transactions which the application 
must process quickly for this reason oltp applications require very high iops oltp applications handle millions of small and random i o requests if you have such an application you must design the application infrastructure to optimize for iops for more information on all the factors to consider to get high iops see optimize application performance when you attach a premium storage disk to your high scale vm azure provisions for you a guaranteed number of iops according to the disk specification for example a p50 disk provisions 7 500 iops each high scale vm size also has a specific iops limit that


In [0]:
# sanity check

from pyspark.sql import functions as F

# chunk_text is too long to display in full, so show only the first 200 characters
chunks_df.select(
    "doc_id",
    "category",
    "chunk_index",
    F.substring("chunk_text", 1, 200).alias("chunk_preview")
).show(3, truncate=False)

+-----------------------------------------------+----------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|doc_id                                         |category        |chunk_index|chunk_preview                                                                                                                                                                                           |
+-----------------------------------------------+----------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|virtual-machines/hc-series-performance.md      |virtual-machines|0          |title hc series vm size performance description learn about performance testing re

How to create an Azure OpenAI resource and get an API key:
- In Azure Portal: create an Azure OpenAI resource
- In the Foundry portal: deploy an embedding model such as text-embedding-3-small or text-embedding-ada-002
- Get:
  - Endpoint
  - API key
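
The `embed_texts` helper that this notebook calls later comes from one of the `%run` notebooks and is not shown here. As a rough idea of what such a helper could look like, the sketch below sends a batch of texts through the `embeddings.create` call of an OpenAI-style client. The name `embed_texts_sketch`, the default model name, and the fake client used for the demo are all assumptions for illustration; with Azure OpenAI the `model` argument would be the deployment name.

```python
from types import SimpleNamespace

def embed_texts_sketch(client, texts, model="text-embedding-3-small"):
    """Embed a list of strings in one request; return vectors in input order."""
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of its input text; sort to be safe.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

class FakeEmbeddings:
    """Stand-in for a real client, so the sketch runs without network access."""
    def create(self, model, input):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
            for i, text in enumerate(input)
        ]
        # Return items out of order to show that sorting restores input order.
        return SimpleNamespace(data=list(reversed(data)))

fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = embed_texts_sketch(fake_client, ["ab", "abcd"])
```

Passing the client in as an argument keeps the function easy to test and lets the same code work with either an OpenAI or an Azure OpenAI client object.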


How to store the API key for the notebook (one-time)
- Workspace → Secrets
- Create scope: openai
  - Key: OPENAI_API_KEY
  - Value: your key
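
Once the secret exists, a notebook can read it with `dbutils.secrets.get` rather than hard-coding the key. The sketch below wraps that call; `read_openai_key` is a hypothetical helper name, and the fake `dbutils` object exists only so the example runs outside Databricks.

```python
from types import SimpleNamespace

def read_openai_key(dbutils, scope="openai", key="OPENAI_API_KEY"):
    """Fetch the API key from a Databricks secret scope."""
    value = dbutils.secrets.get(scope=scope, key=key)
    if not value:
        raise ValueError(f"secret {scope}/{key} is empty")
    return value

class FakeSecrets:
    """Stand-in for dbutils.secrets, for running outside Databricks."""
    def __init__(self, store):
        self._store = store

    def get(self, scope, key):
        return self._store[(scope, key)]

fake_dbutils = SimpleNamespace(
    secrets=FakeSecrets({("openai", "OPENAI_API_KEY"): "not-a-real-key"})
)
api_key = read_openai_key(fake_dbutils)
```

Inside a Databricks notebook the call would simply be `read_openai_key(dbutils)`, using the `dbutils` object that the runtime provides.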

In [0]:
# Collect all chunk rows to the driver; they are embedded in batches of BATCH_SIZE below

BATCH_SIZE = 64

rows = chunks_df.select(
    "chunk_id",
    "doc_id",
    "category",
    "title",
    "url",
    "chunk_index",
    "chunk_text"
).collect()

print(f"Found {len(rows)} rows")

Found 947 rows


In [0]:
# Generate embeddings
# This can run for a while, depending on how many chunks there are

embedded_rows = []

for i in range(0, len(rows), BATCH_SIZE):
    print("Chunk: ", i)
    batch = rows[i:i + BATCH_SIZE]
    texts = [r.chunk_text for r in batch]

    embeddings = embed_texts(texts)

    for r, emb in zip(batch, embeddings):
        embedded_rows.append((
            r.chunk_id,
            r.doc_id,
            r.category,
            r.title,
            r.url,
            r.chunk_index,
            r.chunk_text,
            emb
        ))

    # Used during initial testing to limit data: stop after the first 10 batches
    # if i >= 9 * BATCH_SIZE: break

    time.sleep(0.5)  # be polite to API

Chunk:  0
Chunk:  64
Chunk:  128
Chunk:  192
Chunk:  256
Chunk:  320
Chunk:  384
Chunk:  448
Chunk:  512
Chunk:  576
Chunk:  640
Chunk:  704
Chunk:  768
Chunk:  832
Chunk:  896



In [0]:
# Create embeddings DataFrame

embeddings_df = spark.createDataFrame(
    embedded_rows,
    schema=[
        "chunk_id",
        "doc_id",
        "category",
        "title",
        "url",
        "chunk_index",
        "chunk_text",
        "embedding"
    ]
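).withColumn(
    # Python floats are inferred as DOUBLE; cast so the column matches ARRAY<FLOAT>
    "embedding",
    F.col("embedding").cast(ArrayType(FloatType())),
)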

# Check vector length:
embeddings_df.select(F.size("embedding").alias("dim")).distinct().show()

+----+
| dim|
+----+
|1536|
+----+



In [0]:
(
    embeddings_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(EMB_TABLE)
)

In [0]:
spark.sql(f"""
    SELECT COUNT(*) FROM {EMB_TABLE}
""").display()


count(1)
947


In [0]:
spark.sql(f"""
    SELECT category, size(embedding) AS embedding_dim FROM {EMB_TABLE} LIMIT 5
""").display()

category,embedding_dim
virtual-machines,1536
virtual-machines,1536
virtual-machines,1536
virtual-machines,1536
virtual-machines,1536
