#### 01 - Ingest Azure Compute Documentation into Databricks Unity Catalog

This notebook downloads Azure Compute documentation from GitHub, cleans Markdown content, and writes the data into a Unity Catalog–managed Delta table.

**Execution environment**
- Run this notebook on **Azure Databricks (Premium tier)** with **Unity Catalog enabled**
- Uses a **single-user Databricks cluster** (DBR 15+)
- Writes data as **Unity Catalog–managed Delta tables**

**What this notebook does**
- Downloads the latest Azure Compute documentation from the public GitHub repository  
  `MicrosoftDocs/azure-compute-docs` using a shallow Git clone
- Parses and cleans Markdown files under the `articles/` directory
- Extracts metadata such as document ID, category, title, source URL, and ingestion time
- Persists the processed documents into a governed Delta table:  
  `databricks_rag_demo.default.raw_azure_compute_docs`

This notebook establishes the **raw document ingestion layer** for a
Retrieval-Augmented Generation (RAG) pipeline and intentionally avoids
legacy DBFS-based storage in favor of **Unity Catalog–managed data objects**.


GitPython is a Python wrapper around the `git` command. It is used here to clone the repository.

%pip install gitpython

In [None]:
import os
import re
import shutil
from datetime import datetime
from pathlib import Path

from git import Repo
from pyspark.sql import Row

In [None]:
REPO_URL = "https://github.com/MicrosoftDocs/azure-compute-docs.git"

TARGET_DIR = "/tmp/azure-compute-docs" # it will be created on the driver VM’s local disk.
TARGET_ARTICLE_PATH = f"{TARGET_DIR}/articles"

# The workspace creates a catalog with the same name as the workspace; most of the work happens in this catalog.
DEFAULT_CATELOG_NAME = "databricks_rag_demo"
TABLE_NAME = "raw_azure_compute_docs"

In [None]:
def download_azure_compute_docs():
    
    # clean existing or any partial clones
    if os.path.exists(TARGET_DIR):
        shutil.rmtree(TARGET_DIR)

    # SHALLOW clone
    Repo.clone_from(
        REPO_URL,
        TARGET_DIR,
        depth=1 # Depth = how much git history you download
    )

download_azure_compute_docs()
os.listdir(TARGET_DIR)  # confirm the clone landed on the driver's local disk
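
A quick sanity check after cloning can confirm that the `articles/` directory actually contains Markdown files before parsing starts. The following is a minimal sketch using only `pathlib`; the helper name `count_markdown_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def count_markdown_files(root: str) -> int:
    """Count *.md files under root, recursively. Returns 0 if root is not a directory."""
    root_path = Path(root)
    if not root_path.is_dir():
        return 0
    return sum(1 for _ in root_path.rglob("*.md"))
```

In the notebook this could be called as `count_markdown_files(TARGET_ARTICLE_PATH)`; a result of 0 would suggest that the clone failed or that the repository layout has changed.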

In [None]:
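# Strip Markdown syntax that adds noise for retrieval, keeping the readable text
def clean_markdown(md_text: str) -> str:
    # Remove fenced code blocks (re.S lets "." span newlines)
    md_text = re.sub(r"```.*?```", "", md_text, flags=re.S)
    # Remove images (must run before the link rule, which would otherwise leave "!alt")
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)
    # Replace links with their link text
    md_text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", md_text)
    # Remove heading markers at the start of a line only, so text like "C# " is untouched
    md_text = re.sub(r"^#{1,6} +", "", md_text, flags=re.M)
    return md_text.strip()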
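
To see what the cleaning step does, the sketch below applies the four substitutions (code blocks, images, links, heading markers) to a small made-up sample. It is a standalone illustration, not the notebook's code: the rule list, the `clean` helper, and the sample text are assumptions of this sketch, and the heading rule is anchored to the start of a line so that text such as `C# ` is left alone. The fence marker is built programmatically only to keep the example easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker

# (pattern, replacement, flags), applied in order
RULES = [
    (FENCE + r".*?" + FENCE, "", re.S),   # fenced code blocks
    (r"!\[.*?\]\(.*?\)", "", 0),          # images (before links, or "!alt" is left behind)
    (r"\[(.*?)\]\(.*?\)", r"\1", 0),      # links: keep only the link text
    (r"^#{1,6} +", "", re.M),             # heading markers at the start of a line
]


def clean(md_text: str) -> str:
    for pattern, replacement, flags in RULES:
        md_text = re.sub(pattern, replacement, md_text, flags=flags)
    return md_text.strip()


# Made-up sample document for illustration
sample = "\n".join([
    "## Create a VM",
    "See the [quickstart](./quickstart.md) for details.",
    "![diagram](media/vm.png)",
    FENCE + "bash",
    "az vm create",
    FENCE,
    "Written in C# and Python.",
])

cleaned = clean(sample)
print(cleaned)
```

The heading marker, the link target, the image, and the fenced block disappear, while the link text and the `C#` mention survive. The blank lines left behind by removed blocks are not collapsed.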

In [None]:
def prepare_data():
    article_path = Path(TARGET_ARTICLE_PATH)
    rows = []
    min_length = 500  # skip stubs / TOCs

    for md_file in article_path.rglob("*.md"):
        try:
            with open(md_file, "r", encoding="utf-8", errors="ignore") as f:
                raw_md = f.read()

            cleaned = clean_markdown(raw_md)

            if len(cleaned) < min_length:
                continue

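            # Path relative to articles/, e.g. virtual-machines/<...>/<name>.md
            rel_path = md_file.relative_to(article_path)

            rows.append(Row(
                doc_id=str(rel_path),
                source="azure-compute-docs",
                category=rel_path.parts[0],  # top-level folder, e.g. virtual-machines
                title=md_file.stem,
                raw_text=cleaned,
                # Published URLs drop the .md extension
                url=f"https://learn.microsoft.com/en-us/azure/{rel_path.with_suffix('').as_posix()}",
                ingest_time=datetime.utcnow()
            ))

        except Exception as e:
            # Fail-safe: report and skip files that cannot be read or parsed
            print(f"Skipping {md_file}: {e}")
            continue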
    return rows

rows = prepare_data()
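
The metadata fields are all derived from each file's path relative to the `articles/` directory. The sketch below shows that derivation in isolation with `pathlib` only. The helper name `describe_article` and the example file path are hypothetical, and the URL pattern mirrors the one assumed by the notebook; whether every article resolves at that URL is not guaranteed.

```python
from pathlib import PurePosixPath

ARTICLE_ROOT = PurePosixPath("/tmp/azure-compute-docs/articles")
BASE_URL = "https://learn.microsoft.com/en-us/azure"  # assumed URL prefix


def describe_article(md_file: PurePosixPath) -> dict:
    """Derive document metadata from a Markdown file path under ARTICLE_ROOT."""
    rel = md_file.relative_to(ARTICLE_ROOT)
    return {
        "doc_id": str(rel),                           # path relative to articles/
        "category": rel.parts[0],                     # top-level folder
        "title": md_file.stem,                        # file name without extension
        "url": f"{BASE_URL}/{rel.with_suffix('')}",   # drop the .md extension
    }


# Hypothetical file path, used only for illustration
meta = describe_article(ARTICLE_ROOT / "virtual-machines" / "linux" / "quick-create-cli.md")
print(meta)
```

Working from the relative path matters here: the first component of an absolute path is the filesystem root, not the documentation category.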

In [None]:
def save_to_table(rows):

    docs_df = spark.createDataFrame(rows)

    # Overwrite the Unity Catalog-managed Delta table
    (
        docs_df
        .write
        .format("delta")
        .mode("overwrite")
        .saveAsTable(f"{DEFAULT_CATELOG_NAME}.default.{TABLE_NAME}")
    )

save_to_table(rows)

In [None]:
### Validate that the data was written to the table successfully

TODO ADD
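
One possible way to fill in this step is to read the table back and compute a few basic health metrics. The sketch below is an assumption about what the validation could look like, not the author's implementation: `validate_table` is a hypothetical helper, and it relies only on the `doc_id` and `raw_text` columns written earlier in the notebook. The Spark session is passed in as a parameter, so the function itself needs no imports.

```python
def validate_table(spark, table_name: str) -> dict:
    """Read the table back and return simple row-level health metrics."""
    df = spark.table(table_name)

    total_rows = df.count()
    # doc_id is the relative file path, so it is expected to be unique
    distinct_ids = df.select("doc_id").distinct().count()
    # Rows with missing or empty text would be useless for retrieval
    empty_text = df.filter("raw_text IS NULL OR length(raw_text) = 0").count()

    return {
        "rows": total_rows,
        "distinct_doc_ids": distinct_ids,
        "empty_raw_text": empty_text,
    }
```

In the notebook this could be called as `validate_table(spark, f"{DEFAULT_CATELOG_NAME}.default.{TABLE_NAME}")`. A healthy load would show a non-zero row count, as many distinct `doc_id` values as rows, and no empty `raw_text` values.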

