#### 01 - Ingest Azure Compute Documentation into Databricks Unity Catalog

This notebook downloads Azure Compute documentation from GitHub, cleans Markdown content, and writes the data into a Unity Catalog–managed Delta table.

**Execution environment**
- Run this notebook on **Azure Databricks (Premium tier)** with **Unity Catalog enabled**
- Uses a **single-user Databricks cluster** (DBR 15+)
- Writes data as **Unity Catalog–managed Delta tables**

**What this notebook does**
- Downloads the latest Azure Compute documentation from the public GitHub repository  
  `MicrosoftDocs/azure-compute-docs` using a shallow Git clone
- Parses and cleans Markdown files under the `articles/` directory
- Extracts metadata such as document ID, category, title, source URL, and ingestion time
- Persists the processed documents into a governed Delta table:  
  `databricks_rag_demo.default.raw_azure_compute_docs`

This notebook establishes the **raw document ingestion layer** for a
Retrieval-Augmented Generation (RAG) pipeline and intentionally avoids
legacy DBFS-based storage in favor of **Unity Catalog–managed data objects**.


In [0]:
%sql

-- The workspace creates a default catalog with the same name as the workspace; this notebook works mostly in that catalog.
SHOW CATALOGS;

catalog
databricks_rag_demo
samples
system


In [0]:
%sql SHOW SCHEMAS IN databricks_rag_demo;

databaseName
default
information_schema


In [0]:
# GitPython is a Python wrapper around the git command; it is used below to clone the repository.
%pip install gitpython

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
from git import Repo
import os
import re
from pathlib import Path
from pyspark.sql import Row
from datetime import datetime
import shutil
import stat

In [0]:
REPO_URL = "https://github.com/MicrosoftDocs/azure-compute-docs.git"

TARGET_DIR = "/tmp/azure-compute-docs" # it will be created on the driver VM’s local disk.
TARGET_ARTICLE_PATH = f"{TARGET_DIR}/articles"

DEFAULT_CATELOG_NAME = "databricks_rag_demo"
TABLE_NAME="raw_azure_compute_docs"

In [0]:
%sh rm -rf /tmp/azure-compute-docs
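
The cleanup above can also be done from Python, so that the path is taken from a variable rather than repeated in a shell command. The following is a minimal sketch; `reset_dir` is a hypothetical helper name, and the demo runs against a throwaway temporary directory rather than the real clone target.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(path):
    """Delete `path` and everything under it; do nothing if it is absent."""
    target = Path(path)
    if target.exists():
        shutil.rmtree(target)
    return target


# Demo on a temporary directory (stands in for the clone target).
demo_root = Path(tempfile.mkdtemp())
(demo_root / "articles").mkdir()
(demo_root / "articles" / "a.md").write_text("stub")

reset_dir(demo_root)   # removes the directory tree
reset_dir(demo_root)   # second call is a no-op, so the cell can be re-run safely
```

In the notebook this would be called as `reset_dir(TARGET_DIR)` before cloning.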

In [0]:
def download_azure_compute_docs():
    
    # SHALLOW clone
    Repo.clone_from(
        REPO_URL,
        TARGET_DIR,
        depth=1 # Depth = how much git history you download
    )

download_azure_compute_docs()
os.listdir(f"{TARGET_ARTICLE_PATH}")

['service-fabric',
 'virtual-machines',
 'container-instances',
 'azure-impact-reporting',
 'virtual-machine-scale-sets']

In [0]:
# Clean Markdown text before it is stored
def clean_markdown(md_text: str) -> str:
    # Code-block removal is disabled; uncomment the next line to strip fenced blocks
    # md_text = re.sub(r"```.*?```", "", md_text, flags=re.S)
    # Remove images
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)
    # Replace links with their link text
    md_text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", md_text)
    # Remove heading markers at the start of a line (text such as "C# " is left intact)
    md_text = re.sub(r"^#+ ", "", md_text, flags=re.M)
    return md_text.strip()
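
The cleaning step leaves the YAML front matter (the `--- title: ... ---` block at the top of each article) inside the text. If that is not wanted, it could be split off before cleaning. The sketch below is an optional extension, not part of the notebook: `split_front_matter` is a hypothetical helper, and it assumes simple `key: value` lines rather than full YAML.

```python
import re

# Front matter: a block delimited by lines of three dashes at the very start of the file.
FRONT_MATTER_RE = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*\n", re.S)


def split_front_matter(md_text):
    """Return (metadata, body). Metadata is empty if no front matter is found."""
    match = FRONT_MATTER_RE.match(md_text)
    if not match:
        return {}, md_text

    metadata = {}
    for line in match.group(1).splitlines():
        # Only top-level "key: value" lines are handled; nested YAML is skipped.
        if line.startswith((" ", "\t")):
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip().strip('"')
    return metadata, md_text[match.end():]


sample = "---\ntitle: Manage state\nms.date: 07/11/2022\n---\n# Heading\nBody text.\n"
metadata, body = split_front_matter(sample)
```

With this in place, `metadata.get("title")` could serve as a more readable title than the file name, falling back to the file stem when the key is missing.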

In [0]:
from datetime import timezone


def prepare_spark_row_data():
    article_path = Path(TARGET_ARTICLE_PATH)
    rows = []
    min_length = 500  # skip stubs / TOCs

    for md_file in article_path.rglob("*.md"):
        try:
            # Undecodable bytes are dropped rather than failing the whole file
            with open(md_file, "r", encoding="utf-8", errors="ignore") as f:
                raw_md = f.read()

            cleaned = clean_markdown(raw_md)

            if len(cleaned) < min_length:
                continue

            # Path relative to articles/, e.g. virtual-machines/overview.md
            rel_path = str(md_file.relative_to(article_path))
            category = rel_path.split("/", 1)[0] if "/" in rel_path else rel_path

            rows.append(Row(
                doc_id=rel_path,
                source="azure-compute-docs",
                category=category,  # e.g. virtual-machines
                title=md_file.stem,
                raw_text=cleaned,
                url=f"https://learn.microsoft.com/en-us/azure/{rel_path}",
                # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated
                ingest_time=datetime.now(timezone.utc)
            ))

        except Exception as e:
            # Fail-safe: report and skip files that cannot be read or parsed
            print(f"Skipping {md_file}: {e}")
            continue
    return rows

rows = prepare_spark_row_data()
docs_df = spark.createDataFrame(rows)

display(docs_df.limit(2))





doc_id,source,category,title,raw_text,url,ingest_time
service-fabric/service-fabric-best-practices-infrastructure-as-code.md,azure-compute-docs,service-fabric,service-fabric-best-practices-infrastructure-as-code,"--- title: Azure Service Fabric infrastructure as Code Best Practices description: Best practices and design considerations for managing Azure Service Fabric as infrastructure as code. ms.topic: concept-article ms.author: tomcassidy author: tomvcassidy ms.service: azure-service-fabric services: service-fabric ms.date: 07/14/2022 Customer intent: ""As a cloud administrator, I want to utilize Infrastructure as Code to deploy and manage Azure Service Fabric clusters, so that I can ensure consistent and efficient resource configuration and maintenance."" --- Infrastructure as code In a production scenario, create Azure Service Fabric clusters using Resource Manager templates. Resource Manager templates provide greater control of resource properties and ensure that you have a consistent resource model. Sample Resource Manager templates are available for Windows and Linux in the Azure samples on GitHub. These templates can be used as a starting point for your cluster template. Download `azuredeploy.json` and `azuredeploy.parameters.json` and edit them to meet your custom requirements. 
!INCLUDE [updated-for-az] To deploy the `azuredeploy.json` and `azuredeploy.parameters.json` templates you downloaded above, use the following Azure CLI commands: ```azurecli ResourceGroupName=""sfclustergroup"" Location=""westus"" az group create --name $ResourceGroupName --location $Location az deployment group create --name $ResourceGroupName --template-file azuredeploy.json --parameters @azuredeploy.parameters.json ``` Creating a resource using PowerShell ```powershell $ResourceGroupName=""sfclustergroup"" $Location=""westus"" $Template=""azuredeploy.json"" $Parameters=""azuredeploy.parameters.json"" New-AzResourceGroup -Name $ResourceGroupName -Location $Location New-AzResourceGroupDeployment -Name $ResourceGroupName -TemplateFile $Template -TemplateParameterFile $Parameters ``` Service Fabric resources You can deploy applications and services onto your Service Fabric cluster via Azure Resource Manager. See Manage applications and services as Azure Resource Manager resources for details. The following are best practice Service Fabric application specific resources to include in your Resource Manager template resources. 
```json {  ""apiVersion"": ""2019-03-01"",  ""type"": ""Microsoft.ServiceFabric/clusters/applicationTypes"",  ""name"": ""[concat(parameters('clusterName'), '/', parameters('applicationTypeName'))]"",  ""location"": ""[variables('clusterLocation')]"", }, {  ""apiVersion"": ""2019-03-01"",  ""type"": ""Microsoft.ServiceFabric/clusters/applicationTypes/versions"",  ""name"": ""[concat(parameters('clusterName'), '/', parameters('applicationTypeName'), '/', parameters('applicationTypeVersion'))]"",  ""location"": ""[variables('clusterLocation')]"", }, {  ""apiVersion"": ""2019-03-01"",  ""type"": ""Microsoft.ServiceFabric/clusters/applications"",  ""name"": ""[concat(parameters('clusterName'), '/', parameters('applicationName'))]"",  ""location"": ""[variables('clusterLocation')]"", }, {  ""apiVersion"": ""2019-03-01"",  ""type"": ""Microsoft.ServiceFabric/clusters/applications/services"",  ""name"": ""[concat(parameters('clusterName'), '/', parameters('applicationName'), '/', parameters('serviceName'))]"",  ""location"": ""[variables('clusterLocation')]"" } ``` To deploy your application using Azure Resource Manager, you first must create a sfpkg Service Fabric Application package. 
The following Python script is an example of how to create a sfpkg: ```python Create SFPKG that needs to be uploaded to Azure Storage Blob Container microservices_sfpkg = zipfile.ZipFile(  self.microservices_app_package_name, 'w', zipfile.ZIP_DEFLATED) package_length = len(self.microservices_app_package_path) for root, dirs, files in os.walk(self.microservices_app_package_path):  root_folder = root[package_length:]  for file in files:  microservices_sfpkg.write(os.path.join(  root, file), os.path.join(root_folder, file)) microservices_sfpkg.close() ``` Virtual machine OS automatic upgrade configuration Upgrading your virtual machines is a user initiated operation, and it is recommended that you enable virtual machine scale set automatic image upgrades for your Service Fabric cluster node patch management. Patch Orchestration Application (POA) is an alternative solution that is intended for non-Azure hosted clusters. Although POA can be used in Azure, hosting it requires more management than simply enabling scale set automatic OS image upgrades. The following are the virtual machine scale set Resource Manager template properties to enable automtic OS upgrades: ```json ""upgradePolicy"": {  ""mode"": ""Automatic"",  ""automaticOSUpgradePolicy"": {  ""enableAutomaticOSUpgrade"": true,  ""disableAutomaticRollback"": false  } }, ``` When using automatic OS upgrades with Service Fabric, the new OS image is rolled out one Update Domain at a time to maintain high availability of the services running in Service Fabric. To utilize Automatic OS Upgrades in Service Fabric, your cluster must be configured to use the Silver Durability Tier or higher. Ensure the following registry key is set to false to prevent your windows host machines from initiating uncoordinated updates: HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU. 
Set the following virtual machine scale set template properties to disable Windows Update: ```json ""osProfile"": {  ""computerNamePrefix"": ""{vmss-name}"",  ""adminUsername"": ""{your-username}"",  ""secrets"": [],  ""windowsConfiguration"": {  ""provisionVMAgent"": true,  ""enableAutomaticUpdates"": false  }  }, ``` Service Fabric cluster upgrade configuration The following is the Service Fabric cluster template property to enable automatic upgrade: ```json ""upgradeMode"": ""Automatic"", ``` To manually upgrade your cluster, download the cab/deb distribution to a cluster virtual machine, and then invoke the following PowerShell: ```powershell Copy-ServiceFabricClusterPackage -Code -CodePackagePath <""local_VM_path_to_msi""> -CodePackagePathInImageStore ServiceFabric.msi -ImageStoreConnectionString ""fabric:ImageStore"" Register-ServiceFabricClusterPackage -Code -CodePackagePath ""ServiceFabric.msi"" Start-ServiceFabricClusterUpgrade -Code -CodePackageVersion <""msi_code_version""> ``` Next steps * Create a cluster on VMs or computers running Windows Server: Service Fabric cluster creation for Windows Server * Create a cluster on VMs or computers running Linux: Create a Linux cluster * Learn about Service Fabric support options",https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-best-practices-infrastructure-as-code.md,2026-01-08T05:31:03.035457Z
service-fabric/service-fabric-reliable-actors-access-save-remove-state.md,azure-compute-docs,service-fabric,service-fabric-reliable-actors-access-save-remove-state,"--- title: Manage Azure Service Fabric state description: Learn about accessing, saving, and removing state for an Azure Service Fabric Reliable Actor, and considerations when designing an application. ms.topic: how-to ms.author: tomcassidy author: tomvcassidy ms.service: azure-service-fabric services: service-fabric ms.date: 07/11/2022 Customer intent: As a developer working with Reliable Actors, I want to manage actor state through access, saving, and removal methods, so that I can ensure data consistency and reliability in my cloud-based applications. --- Access, save, and remove Reliable Actors state Reliable Actors are single-threaded objects that can encapsulate both logic and state and maintain state reliably. Every actor instance has its own state manager: a dictionary-like data structure that reliably stores key/value pairs. The state manager is a wrapper around a state provider. You can use it to store data regardless of which persistence setting is used. State manager keys must be strings. Values are generic and can be any type, including custom types. Values stored in the state manager must be data contract serializable because they might be transmitted over the network to other nodes during replication and might be written to disk, depending on an actor's state persistence setting. The state manager exposes common dictionary methods for managing state, similar to those found in Reliable Dictionary. For information, see best practices in managing actor state. Access state State is accessed through the state manager by key. State manager methods are all asynchronous because they might require disk I/O when actors have persisted state. Upon first access, state objects are cached in memory. 
Repeat access operations access objects directly from memory and return synchronously without incurring disk I/O or asynchronous context-switching overhead. A state object is removed from the cache in the following cases: * An actor method throws an unhandled exception after it retrieves an object from the state manager. * An actor is reactivated, either after being deactivated or after failure. * The state provider pages state to disk. This behavior depends on the state provider implementation. The default state provider for the `Persisted` setting has this behavior. You can retrieve state by using a standard *Get* operation that throws `KeyNotFoundException`(C#) or `NoSuchElementException`(Java) if an entry does not exist for the key: ```csharp [StatePersistence(StatePersistence.Persisted)] class MyActor : Actor, IMyActor {  public MyActor(ActorService actorService, ActorId actorId)  : base(actorService, actorId)  {  }  public Task GetCountAsync()  {  return this.StateManager.GetStateAsync(""MyState"");  } } ``` ```Java @StatePersistenceAttribute(statePersistence = StatePersistence.Persisted) class MyActorImpl extends FabricActor implements MyActor {  public MyActorImpl(ActorService actorService, ActorId actorId)  {  super(actorService, actorId);  }  public CompletableFuture getCountAsync()  {  return this.stateManager().getStateAsync(""MyState"");  } } ``` You can also retrieve state by using a *TryGet* method that does not throw if an entry does not exist for a key: ```csharp class MyActor : Actor, IMyActor {  public MyActor(ActorService actorService, ActorId actorId)  : base(actorService, actorId)  {  }  public async Task GetCountAsync()  {  ConditionalValue result = await this.StateManager.TryGetStateAsync(""MyState"");  if (result.HasValue)  {  return result.Value;  }  return 0;  } } ``` ```Java class MyActorImpl extends FabricActor implements MyActor {  public MyActorImpl(ActorService actorService, ActorId actorId)  {  super(actorService, actorId);  }  
public CompletableFuture getCountAsync()  {  return this.stateManager().tryGetStateAsync(""MyState"").thenApply(result -> {  if (result.hasValue()) {  return result.getValue();  } else {  return 0;  });  } } ``` Save state The state manager retrieval methods return a reference to an object in local memory. Modifying this object in local memory alone does not cause it to be saved durably. When an object is retrieved from the state manager and modified, it must be reinserted into the state manager to be saved durably. You can insert state by using an unconditional *Set*, which is the equivalent of the `dictionary[""key""] = value` syntax: ```csharp [StatePersistence(StatePersistence.Persisted)] class MyActor : Actor, IMyActor {  public MyActor(ActorService actorService, ActorId actorId)  : base(actorService, actorId)  {  }  public Task SetCountAsync(int value)  {  return this.StateManager.SetStateAsync(""MyState"", value);  } } ``` ```Java @StatePersistenceAttribute(statePersistence = StatePersistence.Persisted) class MyActorImpl extends FabricActor implements MyActor {  public MyActorImpl(ActorService actorService, ActorId actorId)  {  super(actorService, actorId);  }  public CompletableFuture setCountAsync(int value)  {  return this.stateManager().setStateAsync(""MyState"", value);  } } ``` You can add state by using an *Add* method. This method throws `InvalidOperationException`(C#) or `IllegalStateException`(Java) when it tries to add a key that already exists. 
```csharp [StatePersistence(StatePersistence.Persisted)] class MyActor : Actor, IMyActor {  public MyActor(ActorService actorService, ActorId actorId)  : base(actorService, actorId)  {  }  public Task AddCountAsync(int value)  {  return this.StateManager.AddStateAsync(""MyState"", value);  } } ``` ```Java @StatePersistenceAttribute(statePersistence = StatePersistence.Persisted) class MyActorImpl extends FabricActor implements MyActor {  public MyActorImpl(ActorService actorService, ActorId actorId)  {  super(actorService, actorId);  }  public CompletableFuture addCountAsync(int value)  {  return this.stateManager().addOrUpdateStateAsync(""MyState"", value, (key, old_value) -> old_value + value);  } } ``` You can also add state by using a *TryAdd* method. This method does not throw when it tries to add a key that already exists. ```csharp [StatePersistence(StatePersistence.Persisted)] class MyActor : Actor, IMyActor {  public MyActor(ActorService actorService, ActorId actorId)  : base(actorService, actorId)  {  }  public async Task AddCountAsync(int value)  {  bool result = await this.StateManager.TryAddStateAsync(""MyState"", value);  if (result)  {  // Added successfully!  }  } } ``` ```Java @StatePersistenceAttribute(statePersistence = StatePersistence.Persisted) class MyActorImpl extends FabricActor implements MyActor {  public MyActorImpl(ActorService actorService, ActorId actorId)  {  super(actorService, actorId);  }  public CompletableFuture addCountAsync(int value)  {  return this.stateManager().tryAddStateAsync(""MyState"", value).thenApply((result)->{  if(result)  {  // Added successfully!  }  });  } } ``` At the end of an actor method, the state manager automatically saves any values that have been added or modified by an insert or update operation. A ""save"" can include persisting to disk and replication, depending on the settings used. Values that have not been modified are not persisted or replicated. 
If no values have been modified, the save operation does nothing. If saving fails, the modified state is discarded and the original state is reloaded. You can also save state manually by calling the `SaveStateAsync` method on the actor base: ```csharp async Task IMyActor.SetCountAsync(int count) {  await this.StateManager.AddOrUpdateStateAsync(""count"", count, (key, value) => count > value ? count : value);  await this.SaveStateAsync(); } ``` ```Java interface MyActor {  CompletableFuture setCountAsync(int count)  {  this.stateManager().addOrUpdateStateAsync(""count"", count, (key, value) -> count > value ? count : value).thenApply();  this.stateManager().saveStateAsync().thenApply();  } } ``` Remove state You can remove state permanently from an actor's state manager by calling the *Remove* method. This method throws `KeyNotFoundException`(C#) or `NoSuchElementException`(Java) when it tries to remove a key that doesn't exist. ```csharp [StatePersistence(StatePersistence.Persisted)] class MyActor : Actor, IMyActor {  public MyActor(ActorService actorService, ActorId actorId)  : base(actorService, actorId)  {  }  public Task RemoveCountAsync()  {  return this.StateManager.RemoveStateAsync(""MyState"");  } } ``` ```Java @StatePersistenceAttribute(statePersistence = StatePersistence.Persisted) class MyActorImpl extends FabricActor implements MyActor {  public MyActorImpl(ActorService actorService, ActorId actorId)  {  super(actorService, actorId);  }  public CompletableFuture removeCountAsync()  {  return this.stateManager().removeStateAsync(""MyState"");  } } ``` You can also remove state permanently by using the *TryRemove* method. This method does not throw when it tries to remove a key that does not exist. 
```csharp [StatePersistence(StatePersistence.Persisted)] class MyActor : Actor, IMyActor {  public MyActor(ActorService actorService, ActorId actorId)  : base(actorService, actorId)  {  }  public async Task RemoveCountAsync()  {  bool result = await this.StateManager.TryRemoveStateAsync(""MyState"");  if (result)  {  // State removed!  }  } } ``` ```Java @StatePersistenceAttribute(statePersistence = StatePersistence.Persisted) class MyActorImpl extends FabricActor implements MyActor {  public MyActorImpl(ActorService actorService, ActorId actorId)  {  super(actorService, actorId);  }  public CompletableFuture removeCountAsync()  {  return this.stateManager().tryRemoveStateAsync(""MyState"").thenApply((result)->{  if(result)  {  // State removed!  }  });  } } ``` Next steps State that's stored in Reliable Actors must be serialized before it's written to disk and replicated for high availability. Learn more about Actor type serialization. Next, learn more about Actor diagnostics and performance monitoring.",https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-actors-access-save-remove-state.md,2026-01-08T05:31:03.036412Z
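
The `url` values in the rows above end in `.md`, because they are built directly from the repository path. The sketch below shows one way the metadata fields could be derived from a relative path with the extension dropped from the URL. It rests on the assumption that published Microsoft Learn pages are served without the `.md` suffix; `describe_article` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# Assumption: articles/<category>/<name>.md is published under this prefix without ".md".
BASE_URL = "https://learn.microsoft.com/en-us/azure"


def describe_article(rel_path):
    """Derive document metadata from a path relative to the articles/ directory."""
    path = PurePosixPath(rel_path)
    # Files directly under articles/ have no category directory.
    category = path.parts[0] if len(path.parts) > 1 else ""
    return {
        "doc_id": str(path),
        "category": category,
        "title": path.stem,
        "url": f"{BASE_URL}/{path.with_suffix('')}",
    }


info = describe_article("service-fabric/service-fabric-best-practices-infrastructure-as-code.md")
```

Here `info["url"]` drops the extension while `info["doc_id"]` keeps the original path, so the document ID still maps back to the file in the repository.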


In [0]:
# In Databricks notebooks, `spark` is a pre-initialized SparkSession that Databricks creates for you.

type(spark)

pyspark.sql.connect.session.SparkSession

In [0]:
def save_to_delta_table(docs_df):

    # Write to a Unity Catalog managed Delta table, replacing any existing data
    (
        docs_df
        .write
        .format("delta")
        .mode("overwrite")
        .saveAsTable(f"{DEFAULT_CATELOG_NAME}.default.{TABLE_NAME}")
    )

save_to_delta_table(docs_df)
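
The table is addressed through Unity Catalog's three-level namespace, `catalog.schema.table`. A small helper can assemble and validate that name before it is passed to `saveAsTable`. This is a sketch only: `qualified_table_name` is a hypothetical helper, and the identifier rule it enforces (letters, digits, underscores) is a deliberately conservative assumption, not the full set of names Unity Catalog accepts.

```python
import re

# Conservative rule assumed for this sketch: plain identifiers that need no quoting.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table_name(catalog, schema, table):
    """Build a catalog.schema.table name, rejecting parts that would need quoting."""
    parts = (catalog, schema, table)
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Unsupported identifier: {part!r}")
    return ".".join(parts)


full_name = qualified_table_name(
    "databricks_rag_demo", "default", "raw_azure_compute_docs"
)
```

Rejecting unexpected characters early gives a clearer error than a failed write, for example when a workspace name contains a hyphen.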

In [0]:
%sql

-- Validate that the data was written to the table
SELECT COUNT(*) FROM databricks_rag_demo.default.raw_azure_compute_docs

count(1)
1751
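
A row count confirms that the write happened, but not that the rows are well formed. Before writing, the prepared rows could also be checked for duplicate document IDs and summarized per category. The sketch below is a hypothetical pre-write check, shown on plain dictionaries with made-up sample values; it only relies on key lookup such as `row["doc_id"]`.

```python
from collections import Counter


def summarize_rows(rows):
    """Return the total row count, duplicated doc_ids, and a per-category count."""
    id_counts = Counter(row["doc_id"] for row in rows)
    duplicates = sorted(doc_id for doc_id, n in id_counts.items() if n > 1)
    per_category = Counter(row["category"] for row in rows)
    return {
        "total": len(rows),
        "duplicates": duplicates,
        "per_category": dict(per_category),
    }


# Illustrative sample rows, not real notebook output.
sample_rows = [
    {"doc_id": "virtual-machines/a.md", "category": "virtual-machines"},
    {"doc_id": "virtual-machines/b.md", "category": "virtual-machines"},
    {"doc_id": "service-fabric/c.md", "category": "service-fabric"},
]
summary = summarize_rows(sample_rows)
```

An empty `duplicates` list indicates that `doc_id` can safely act as a unique key for later stages of the pipeline.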


In [0]:
%sql

-- Optional cleanup: drop the table to reset the environment
DROP TABLE databricks_rag_demo.default.raw_azure_compute_docs;