#### 01 - Ingest Azure Compute Documentation into Databricks Unity Catalog

This notebook downloads Azure Compute documentation from GitHub, cleans Markdown content, and writes the data into a Unity Catalog–managed Delta table.

**Execution environment**
- Run this notebook on **Azure Databricks (Premium tier)** with **Unity Catalog enabled**
- Uses a **single-user Databricks cluster** (DBR 15+)
- Writes data as **Unity Catalog–managed Delta tables**

**What this notebook does**
- Downloads the latest Azure Compute documentation from the public GitHub repository  
  `MicrosoftDocs/azure-compute-docs` using a shallow Git clone
- Parses and cleans Markdown files under the `articles/` directory
- Extracts metadata such as document ID, category, title, source URL, and ingestion time
- Persists the processed documents into a governed Delta table: `databricks_rag_demo.default.raw_azure_compute_docs`

This notebook establishes the **raw document ingestion layer** for a Retrieval-Augmented Generation (RAG) pipeline and intentionally avoids legacy DBFS-based storage in favor of **Unity Catalog–managed data objects**.


In [0]:
%run ./00_constants
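
The later cells rely on four names loaded by `00_constants`: `DOC_REPO_URL`, `DOC_LOCAL_DOWNLOAD_DIR`, `MAX_DOCS_PER_CATEGORY`, and `RAW_DOCS_TABLE`. That notebook is not shown here, so the following is only a sketch of what it might contain. The download directory and table name are taken from elsewhere in this notebook; the clone URL, the cap value, and the `CATALOG` / `SCHEMA` helper names are assumptions.

```python
# Hypothetical contents of 00_constants (a sketch, not the actual file).

# Assumed clone URL for the public MicrosoftDocs/azure-compute-docs repository
DOC_REPO_URL = "https://github.com/MicrosoftDocs/azure-compute-docs.git"

# Matches the path printed by the download cell
DOC_LOCAL_DOWNLOAD_DIR = "/tmp/azure_compute_docs"

# Assumed value; the real per-category cap is not shown in this notebook
MAX_DOCS_PER_CATEGORY = 50

# CATALOG and SCHEMA are illustrative helper names
CATALOG, SCHEMA = "databricks_rag_demo", "default"

# Three-level Unity Catalog name: catalog.schema.table
RAW_DOCS_TABLE = f"{CATALOG}.{SCHEMA}.raw_azure_compute_docs"
```

Keeping the catalog and schema in separate variables would make it easier to point the same pipeline at another workspace catalog.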

In [0]:
%sql

-- The workspace creates a default catalog with the same name as the workspace; most of the work happens in this catalog
SHOW CATALOGS;

catalog
databricks_rag_demo
samples
system


In [0]:
%sql SHOW SCHEMAS IN databricks_rag_demo;

databaseName
default
information_schema


In [0]:
import os
import re
from pathlib import Path
from pyspark.sql import Row
from datetime import datetime
import shutil
import stat
from collections import defaultdict

In [0]:
TARGET_ARTICLE_PATH = f"{DOC_LOCAL_DOWNLOAD_DIR}/articles"

In [0]:
def download_azure_compute_docs():

    ## delete existing download
    if os.path.exists(DOC_LOCAL_DOWNLOAD_DIR):
        print(f"Deleting {DOC_LOCAL_DOWNLOAD_DIR}")
        shutil.rmtree(DOC_LOCAL_DOWNLOAD_DIR)
    else:
        print(f"Folder does not exist: {DOC_LOCAL_DOWNLOAD_DIR}")

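    try:
        import git  # noqa: F401
    except ImportError:
        # GitPython is a Python wrapper around the git command; it performs the clone below.
        # Install it with pip in a subprocess rather than %pip plus a Python restart:
        # restarting Python mid-function would discard the constants loaded by %run.
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "gitpython"])

    from git import Repo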

    # SHALLOW clone
    Repo.clone_from(
        DOC_REPO_URL,
        DOC_LOCAL_DOWNLOAD_DIR,
        depth=1  # fetch only the latest commit instead of the full history
    )

download_azure_compute_docs()
os.listdir(TARGET_ARTICLE_PATH)

Folder does not exist: /tmp/azure_compute_docs


['container-instances',
 'virtual-machines',
 'virtual-machine-scale-sets',
 'azure-impact-reporting',
 'service-fabric']
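
Before choosing a per-category cap, it can help to see how many Markdown files each top-level folder holds. The helper below is a minimal sketch using only the standard library; `count_markdown_per_category` is a hypothetical name, and it groups files by the first path component, the same rule the ingestion cell later uses for `category`.

```python
from collections import Counter
from pathlib import Path


def count_markdown_per_category(articles_dir: str) -> Counter:
    """Count *.md files under each top-level entry of articles_dir (hypothetical helper)."""
    root = Path(articles_dir)
    counts = Counter()
    for md_file in root.rglob("*.md"):
        # First path component relative to the articles directory,
        # e.g. "virtual-machines" for virtual-machines/linux/overview.md
        category = md_file.relative_to(root).parts[0]
        counts[category] += 1
    return counts
```

In this notebook it could be called as `count_markdown_per_category(TARGET_ARTICLE_PATH).most_common()` once the clone has finished.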

In [0]:
# Function to clean markdown text
def clean_markdown(md_text: str) -> str:
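    # Remove code blocks (currently disabled, so fenced code stays in the text)
    #md_text = re.sub(r"```.*?```", "", md_text, flags=re.S)
    # Remove inline images: ![alt](path)
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)
    # Remove inline links but keep the link text: [text](url) -> text
    md_text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", md_text)
    # Remove heading markers (any run of "#" followed by a space)
    md_text = re.sub(r"#+ ", "", md_text)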
    return md_text.strip()
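
To see what these substitutions do in practice, the sketch below repeats the three active regular expressions on a small sample and compares them with a hypothetical variant, `clean_markdown_anchored`, that only strips `#` runs at the start of a line. The variant is an illustration, not part of this notebook; it shows that the unanchored pattern also removes `# ` from trailing shell comments.

```python
import re


def clean_markdown(md_text: str) -> str:
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)      # drop inline images
    md_text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", md_text)  # keep link text only
    md_text = re.sub(r"#+ ", "", md_text)                   # drop "#" runs followed by a space
    return md_text.strip()


def clean_markdown_anchored(md_text: str) -> str:
    """Hypothetical variant: strip '#' runs only when they start a line."""
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)
    md_text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", md_text)
    md_text = re.sub(r"(?m)^#{1,6} +", "", md_text)
    return md_text.strip()


sample = "\n".join([
    "## Create a container",
    "See [the docs](https://example.com/docs) for details.",
    "![diagram](./media/diagram.png)",
    "RES_GROUP=myresourcegroup # Resource Group name",
])

original = clean_markdown(sample)
anchored = clean_markdown_anchored(sample)
print(original)
print(anchored)
```

Neither version touches reference-style images such as `![alt][ref]` or YAML front matter, so both remain in `raw_text`.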

In [0]:
def prepare_spark_row_data():
    from datetime import timezone  # needed for a timezone-aware ingest timestamp

    article_path = Path(TARGET_ARTICLE_PATH)
    rows = []
    min_length = 500  # skip stubs / TOCs

    counts = defaultdict(int)

    for md_file in article_path.rglob("*.md"):
        try:
            with open(md_file, "r", encoding="utf-8", errors="ignore") as f:
                raw_md = f.read()

            cleaned = clean_markdown(raw_md)
            if len(cleaned) < min_length:
                continue

            rel_path = str(md_file.relative_to(article_path))
            category = rel_path.split("/", 1)[0] if "/" in rel_path else rel_path

            # Cap the number of docs per category; remove this check to ingest all docs
            if counts[category] >= MAX_DOCS_PER_CATEGORY:
                continue

            rows.append(Row(
                doc_id=rel_path,
                source="azure-compute-docs",
                category=category,  # e.g. virtual-machines
                title=md_file.stem,
                raw_text=cleaned,
                url=f"https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/{rel_path}",
                # datetime.utcnow() is deprecated since Python 3.12; use an explicit UTC timestamp
                ingest_time=datetime.now(timezone.utc)
            ))

            counts[category] += 1

        except Exception as e:
            # Fail-safe: skip bad files, but report them instead of failing silently
            print(f"Skipping {md_file}: {e}")
            continue
    return rows

rows = prepare_spark_row_data()
print(f"Found {len(rows)} docs")
docs_df = spark.createDataFrame(rows)

display(docs_df.limit(2))





Found 214 docs


doc_id,source,category,title,raw_text,url,ingest_time
container-instances/container-instances-quickstart.md,azure-compute-docs,container-instances,container-instances-quickstart,"--- title: Quickstart - Deploy Docker container to container instance - Azure CLI description: In this quickstart, you use the Azure CLI to quickly deploy a containerized web app that runs in an isolated Azure container instance ms.topic: quickstart ms.author: tomcassidy author: tomvcassidy ms.service: azure-container-instances services: container-instances ms.date: 11/17/2025 ms.update-cycle: 180-days ms.custom: mvc, devx-track-azurecli, mode-api Customer intent: As a developer, I want to quickly deploy a Docker container using the command line, so that I can run my web application without managing complex orchestration platforms. --- Quickstart: Deploy a container instance in Azure using the Azure CLI Use Azure Container Instances to run serverless Docker containers in Azure with simplicity and speed. Deploy an application to a container instance on-demand when you don't need a full container orchestration platform like Azure Kubernetes Service. In this quickstart, you use the Azure CLI to deploy an isolated Docker container and make its application available with a fully qualified domain name (FQDN). A few seconds after you execute a single deployment command, you can browse to the application running in the container: ![View an app deployed to Azure Container Instances in browser][aci-app-browser] !INCLUDE [quickstarts-free-trial-note] !INCLUDE [azure-cli-prepare-your-environment.md] - This quickstart requires version 2.0.55 or later of the Azure CLI. If using Azure Cloud Shell, the latest version is already installed.  > [!WARNING]  > Best practice: User’s credentials passed via command line interface (CLI) are stored as plain text in the backend. 
Storing credentials in plain text is a security risk; Microsoft advises customers to store user credentials in CLI environment variables to ensure they are encrypted/transformed when stored in the backend. Create a resource group Azure container instances, like all Azure resources, must be deployed into a resource group. Resource groups allow you to organize and manage related Azure resources. First, create a resource group named *myResourceGroup* in the *eastus* location with the [az group create][az-group-create] command: ```azurecli-interactive az group create --name myResourceGroup --location eastus ``` Create a container Now that you have a resource group, you can run a container in Azure. To create a container instance with the Azure CLI, provide a resource group name, container instance name, and Docker container image to the [az container create][az-container-create] command. In this quickstart, you use the public `mcr.microsoft.com/azuredocs/aci-helloworld` image. This image packages a small web app written in Node.js that serves a static HTML page. You can expose your containers to the internet by specifying one or more ports to open, a DNS name label, or both. In this quickstart, you deploy a container with a DNS name label so that the web app is publicly reachable. Execute a command similar to the following to start a container instance. Set a `--dns-name-label` value that's unique within the Azure region where you create the instance. If you receive a ""DNS name label not available"" error message, try a different DNS name label. 
```azurecli-interactive az container create --resource-group myResourceGroup --name mycontainer --image mcr.microsoft.com/azuredocs/aci-helloworld --dns-name-label aci-demo --ports 80 --os-type linux --memory 1.5 --cpu 1 ``` To deploy the container into a specific availability zone, use the `--zone` argument and specify the logical zone number: ```azurecli-interactive az container create --resource-group myResourceGroup --name mycontainer --image mcr.microsoft.com/azuredocs/aci-helloworld --dns-name-label aci-demo --ports 80 --os-type linux --memory 1.5 --cpu 1 --zone 1 ``` > [!IMPORTANT] > Zonal deployments are only available in regions that support availability zones. To see if your region supports availability zones, see Azure Regions List. Within a few seconds, you should get a response from the Azure CLI indicating the deployment completed. Check its status with the [az container show][az-container-show] command: ```azurecli-interactive az container show --resource-group myResourceGroup --name mycontainer --query ""{FQDN:ipAddress.fqdn,ProvisioningState:provisioningState}"" --out table ``` When you run the command, the container's fully qualified domain name (FQDN) and its provisioning state are displayed. ```output FQDN ProvisioningState --------------------------------- ------------------- aci-demo.eastus.azurecontainer.io Succeeded ``` If the container's `ProvisioningState` is **Succeeded**, go to its FQDN in your browser. If you see a web page similar to the following, congratulations! You successfully deployed an application running in a Docker container to Azure. ![View an app deployed to Azure Container Instances in browser][aci-app-browser] If at first the application isn't displayed, you might need to wait a few seconds while DNS propagates, then try refreshing your browser. Pull the container logs When you need to troubleshoot a container or the application it runs (or just see its output), start by viewing the container instance's logs. 
Pull the container instance logs with the [az container logs][az-container-logs] command: ```azurecli-interactive az container logs --resource-group myResourceGroup --name mycontainer ``` The output displays the logs for the container, and should show the HTTP GET requests generated when you viewed the application in your browser. ```output listening on port 80 ::ffff:10.240.255.55 - - [21/Mar/2019:17:43:53 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ::ffff:10.240.255.55 - - [21/Mar/2019:17:44:36 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ::ffff:10.240.255.55 - - [21/Mar/2019:17:44:36 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ``` Attach output streams In addition to viewing the logs, you can attach your local standard out and standard error streams to that of the container. First, execute the [az container attach][az-container-attach] command to attach your local console to the container's output streams: ```azurecli-interactive az container attach --resource-group myResourceGroup --name mycontainer ``` Once attached, refresh your browser a few times to generate some more output. When you're done, detach your console with `Control+C`. You should see output similar to the following sample: ```output Container 'mycontainer' is in state 'Running'... 
(count: 1) (last timestamp: 2019-03-21 17:27:20+00:00) pulling image ""mcr.microsoft.com/azuredocs/aci-helloworld"" (count: 1) (last timestamp: 2019-03-21 17:27:24+00:00) Successfully pulled image ""mcr.microsoft.com/azuredocs/aci-helloworld"" (count: 1) (last timestamp: 2019-03-21 17:27:27+00:00) Created container (count: 1) (last timestamp: 2019-03-21 17:27:27+00:00) Started container Start streaming logs: listening on port 80 ::ffff:10.240.255.55 - - [21/Mar/2019:17:43:53 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ::ffff:10.240.255.55 - - [21/Mar/2019:17:44:36 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ::ffff:10.240.255.55 - - [21/Mar/2019:17:44:36 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ::ffff:10.240.255.55 - - [21/Mar/2019:17:47:01 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ::ffff:10.240.255.56 - - [21/Mar/2019:17:47:12 +0000] ""GET / HTTP/1.1"" 304 - ""-"" ""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"" ``` Clean up resources When you're done with the container, remove it using the [az container delete][az-container-delete] command: ```azurecli-interactive az container delete --resource-group myResourceGroup --name mycontainer ``` To verify that the container deleted, execute the az container list command: ```azurecli-interactive az container list --resource-group myResourceGroup --output table ``` The **mycontainer** container shouldn't appear in the command's output. 
If you have no other containers in the resource group, no output is displayed. If you're done with the *myResourceGroup* resource group and all the resources it contains, delete it with the [az group delete][az-group-delete] command: ```azurecli-interactive az group delete --name myResourceGroup ``` Next steps In this quickstart, you created an Azure container instance by using a public Microsoft image. If you'd like to build a container image and deploy it from a private Azure container registry, continue to the Azure Container Instances tutorial. > [!div class=""nextstepaction""] > Azure Container Instances tutorial To try out options for running containers in an orchestration system on Azure, see the [Azure Kubernetes Service (AKS)][container-service] quickstarts. [aci-app-browser]: ./media/container-instances-quickstart/view-an-application-running-in-an-azure-container-instance.png [app-github-repo]: https://github.com/Azure-Samples/aci-helloworld.git [azure-account]: https://azure.microsoft.com/free/ [node-js]: https://nodejs.org [az-container-attach]: /cli/azure/container#az_container_attach [az-container-create]: /cli/azure/container#az_container_create [az-container-delete]: /cli/azure/container#az_container_delete [az-container-list]: /cli/azure/container#az_container_list [az-container-logs]: /cli/azure/container#az_container_logs [az-container-show]: /cli/azure/container#az_container_show [az-group-create]: /cli/azure/group#az_group_create [az-group-delete]: /cli/azure/group#az_group_delete [azure-cli-install]: /cli/azure/install-azure-cli [container-service]: /azure/aks/intro-kubernetes",https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/container-instances/container-instances-quickstart.md,2026-01-15T00:41:03.347812Z
container-instances/container-instances-using-azure-container-registry.md,azure-compute-docs,container-instances,container-instances-using-azure-container-registry,"--- title: Deploy container image from Azure Container Registry using a service principal description: Learn how to deploy containers in Azure Container Instances by pulling container images from an Azure container registry using a service principal. ms.topic: how-to ms.author: tomcassidy author: tomvcassidy ms.service: azure-container-instances services: container-instances ms.date: 11/17/2025 ms.custom: mvc, devx-track-azurecli, devx-track-arm-template Customer intent: As a cloud developer, I want to deploy container images from a container registry using a service principal, so that I can ensure secure and manageable access control for automated deployments in Azure Container Instances. --- Deploy to Azure Container Instances from Azure Container Registry using a service principal Azure Container Registry is an Azure-based, managed container registry service used to store private Docker container images. This article describes how to pull container images stored in an Azure container registry when deploying to Azure Container Instances. One way to configure registry access is to create a Microsoft Entra service principal and password, and store the sign-in credentials in an Azure key vault. Prerequisites **Azure container registry**: You need an Azure container registry--and at least one container image in the registry--to complete the steps in this article. If you need a registry, see Create a container registry using the Azure CLI. **Azure CLI**: The command-line examples in this article use the Azure CLI and are formatted for the Bash shell. You can install the Azure CLI locally, or use the [Azure Cloud Shell][cloud-shell-bash]. Limitations * Windows containers don't support system-assigned managed identity-authenticated image pulls with ACR, only user-assigned. 
Configure registry authentication In a production scenario where you provide access to ""headless"" services and applications, we recommend you configure registry access by using a service principal. A service principal allows you to provide Azure role-based access control (Azure RBAC) to your container images. For example, you can configure a service principal with pull-only access to a registry. Azure Container Registry provides more authentication options. In the following section, you create an Azure key vault and a service principal, and store the service principal's credentials in the vault. Create key vault If you don't already have a vault in Azure Key Vault, create one with the Azure CLI using the following commands. Update the `RES_GROUP` variable with the name of an existing resource group in which to create the key vault, and `ACR_NAME` with the name of your container registry. For brevity, commands in this article assume that your registry, key vault, and container instances are all created in the same resource group.  Specify a name for your new key vault in `AKV_NAME`. The vault name must be unique within Azure and must be 3-24 alphanumeric characters in length, begin with a letter, end with a letter or digit, and can't contain consecutive hyphens. ```azurecli RES_GROUP=myresourcegroup Resource Group name ACR_NAME=myregistry Azure Container Registry registry name AKV_NAME=mykeyvault Azure Key Vault vault name az keyvault create -g $RES_GROUP -n $AKV_NAME ``` Create service principal and store credentials Now create a service principal and store its credentials in your key vault. The following commands use [az ad sp create-for-rbac][az-ad-sp-create-for-rbac] to create the service principal, and [az keyvault secret set][az-keyvault-secret-set] to store the service principal's **password** in the vault. Be sure to take note of the service principal's **appId** upon creation. 
```azurecli Create service principal az ad sp create-for-rbac \  --name http://$ACR_NAME-pull \  --scopes $(az acr show --name $ACR_NAME --query id --output tsv) \  --role acrpull SP_ID=xxxx Replace with your service principal's appId Store the registry *password* in the vault az keyvault secret set \  --vault-name $AKV_NAME \  --name $ACR_NAME-pull-pwd \  --value $(az ad sp show --id $SP_ID --query password --output tsv) ``` The `--role` argument in the preceding command configures the service principal with the *acrpull* role, which grants it pull-only access to the registry. To grant both push and pull access, change the `--role` argument to *acrpush*. Next, store the service principal's *appId* in the vault, which is the **username** you pass to Azure Container Registry for authentication. ```azurecli Store service principal ID in vault (the registry *username*) az keyvault secret set \  --vault-name $AKV_NAME \  --name $ACR_NAME-pull-usr \  --value $(az ad sp show --id $SP_ID --query appId --output tsv) ``` You created an Azure key vault and stored two secrets in it: * `$ACR_NAME-pull-usr`: The service principal ID, for use as the container registry **username**. * `$ACR_NAME-pull-pwd`: The service principal password, for use as the container registry **password**. You can now reference these secrets by name when you or your applications and services pull images from the registry. Deploy container with Azure CLI Now that the service principal credentials are stored in Azure Key Vault secrets, your applications and services can use them to access your private registry. First get the registry's login server name by using the [az acr show][az-acr-show] command. The login server name is all lowercase and similar to `myregistry.azurecr.io`. 
```azurecli ACR_LOGIN_SERVER=$(az acr show --name $ACR_NAME --resource-group $RES_GROUP --query ""loginServer"" --output tsv) ``` Execute the following az container create][az-container-create] command to deploy a container instance. The command uses the service principal's credentials stored in Azure Key Vault to authenticate to your container registry, and assumes you previously pushed the [aci-helloworld image to your registry. Update the `--image` value if you'd like to use a different image from your registry. ```azurecli az container create \  --name aci-demo \  --resource-group $RES_GROUP \  --image $ACR_LOGIN_SERVER/aci-helloworld:v1 \  --registry-login-server $ACR_LOGIN_SERVER \  --registry-username $(az keyvault secret show --vault-name $AKV_NAME -n $ACR_NAME-pull-usr --query value -o tsv) \  --registry-password $(az keyvault secret show --vault-name $AKV_NAME -n $ACR_NAME-pull-pwd --query value -o tsv) \  --dns-name-label aci-demo-$RANDOM \  --query ipAddress.fqdn ``` The `--dns-name-label` value must be unique within Azure, so the preceding command appends a random number to the container's DNS name label. The output from the command displays the container's fully qualified domain name (FQDN), for example: ```output ""aci-demo-25007.eastus.azurecontainer.io"" ``` Once the container starts successfully, you can navigate to its FQDN in your browser to verify the application is running successfully. Deploy with Azure Resource Manager template You can specify the properties of your Azure container registry in an Azure Resource Manager template by including the `imageRegistryCredentials` property in the container group definition. For example, you can specify the registry credentials directly: ```JSON [...] ""imageRegistryCredentials"": [  {  ""server"": ""imageRegistryLoginServer"",  ""username"": ""imageRegistryUsername"",  ""password"": ""imageRegistryPassword""  } ] [...] 
``` For complete container group settings, see the Resource Manager template reference. For details on referencing Azure Key Vault secrets in a Resource Manager template, see Use Azure Key Vault to pass secure parameter value during deployment. Deploy with Azure portal If you maintain container images in an Azure container registry, you can easily create a container in Azure Container Instances using the Azure portal. When using the portal to deploy a container instance from a container registry, you must enable the registry's admin account. The admin account is designed for a single user to access the registry, mainly for testing purposes. 1. In the Azure portal, navigate to your container registry. 1. To confirm that the admin account is enabled, select **Access keys**, and under **Admin user** select **Enable**. 1. Select **Repositories**, then select the repository that you want to deploy from, right-click the tag for the container image you want to deploy, and select **Run instance**.  ![""Run instance"" in Azure Container Registry in the Azure portal][acr-runinstance-contextmenu] 1. Enter a name for the container and a name for the resource group. You can also change the default values if you wish.  ![Create menu for Azure Container Instances][acr-create-deeplink] 1. Once the deployment completes, you can navigate to the container group from the notifications pane to find its IP address and other properties.  ![Details view for Azure Container Instances container group][aci-detailsview] Next steps For more information about Azure Container Registry authentication, see Authenticate with an Azure container registry. 
[acr-create-deeplink]: ./media/container-instances-using-azure-container-registry/acr-create-deeplink.png [aci-detailsview]: ./media/container-instances-using-azure-container-registry/aci-detailsview.png [acr-runinstance-contextmenu]: ./media/container-instances-using-azure-container-registry/acr-runinstance-contextmenu.png [cloud-shell-bash]: https://shell.azure.com/bash [cloud-shell-try-it]: https://shell.azure.com/powershell [az-acr-show]: /cli/azure/acr#az_acr_show [az-ad-sp-create-for-rbac]: /cli/azure/ad/sp#az_ad_sp_create_for_rbac [az-container-create]: /cli/azure/container#az_container_create [az-keyvault-secret-set]: /cli/azure/keyvault/secret#az_keyvault_secret_set",https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/container-instances/container-instances-using-azure-container-registry.md,2026-01-15T00:41:03.348408Z
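
Every metadata column in the rows above except `raw_text` and `ingest_time` follows from the file's path relative to `articles/`. The sketch below restates that mapping as a standalone function; `describe_doc` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# Same URL prefix the ingestion cell uses when building the url column
BASE_URL = "https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles"


def describe_doc(rel_path: str) -> dict:
    """Derive doc_id, category, title and url from a path relative to articles/."""
    p = PurePosixPath(rel_path)
    return {
        "doc_id": rel_path,
        "category": p.parts[0],   # first folder, e.g. container-instances
        "title": p.stem,          # file name without the .md suffix
        "url": f"{BASE_URL}/{rel_path}",
    }


doc = describe_doc("container-instances/container-instances-quickstart.md")
print(doc)
```

Because `title` is just the file stem, a later step could replace it with the `title:` field from the YAML front matter that is still present at the top of `raw_text`.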


In [0]:
# In Databricks notebooks, `spark` is a pre-initialized SparkSession that Databricks creates for you.

type(spark)

pyspark.sql.connect.session.SparkSession

In [0]:
# Write the DataFrame to the Delta table, replacing any existing contents

(
    docs_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(RAW_DOCS_TABLE)
)

In [0]:
## Validate that the data was written to the table

spark.sql(f"""
    SELECT COUNT(*) FROM {RAW_DOCS_TABLE}
""").display()

count(1)
214


In [0]:
## Drop the table if needed to clean up the environment

# spark.sql(f"""
#     DROP TABLE {RAW_DOCS_TABLE}
# """).display()