# SharePoint Indexed Knowledge Source Setup (Agentic Retrieval)

This notebook sets up the **Indexed SharePoint** approach using `IndexedSharePointKnowledgeSource`.

When you create the knowledge source, Azure AI Search **automatically generates** a full indexer pipeline:
- **Data source** → points to your SharePoint site
- **Skillset** → chunks and optionally vectorizes content
- **Index** → stores enriched, searchable content
- **Indexer** → drives the pipeline

**Pipeline**: Knowledge Source (auto-generates indexer pipeline) → Knowledge Base → Retrieve

**Reference**: [Create an indexed SharePoint knowledge source](https://learn.microsoft.com/en-us/azure/search/agentic-knowledge-source-how-to-sharepoint-indexed?pivots=python)

### Comparison of Approaches

| Feature | Indexer (Notebook 1) | FoundryIQ (Notebook 2) | Indexed KS (This notebook) |
|---------|---------------------|------------------------|----------------------------|
| Index built by | You (manual) | None (Copilot API) | Auto-generated |
| Cross-tenant | Supported | Same-tenant only | Supported |
| Auth | App permissions | User identity (delegated) | App permissions |
| Rate limit | Standard Search limits | 200 req/user/hour | Standard Search limits |
| Chat model | Not required | Required | Optional (image verbalization) |
| Embedding model | You configure | N/A | Required (in knowledge source) |
| Security trimming | ACL fields in index | Automatic (user's access) | Configurable (`ingestionPermissionOptions`) |
| Licensing | Azure AI Search only | M365 Copilot license | Azure AI Search only |
| Customization | Full control | Minimal | Medium (ingestion params) |

> **Prerequisites**:
> - Azure AI Search (Basic+) with [semantic ranker enabled](https://learn.microsoft.com/en-us/azure/search/semantic-how-to-enable-disable)
> - Azure OpenAI with an **embedding model** deployed (`text-embedding-ada-002`, `text-embedding-3-small`, or `text-embedding-3-large`)
> - Azure OpenAI with a **chat model** deployed (optional, for image verbalization): `gpt-4o`, `gpt-4o-mini`, `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`, `gpt-5`, `gpt-5-mini`, `gpt-5-nano`
> - Entra ID app registration with **application permissions** (`Files.Read.All`, `Sites.Read.All`) and a client secret ([SharePoint indexer prerequisites](https://learn.microsoft.com/en-us/azure/search/search-how-to-index-sharepoint-online#prerequisites))
> - Copy `.env.template` to `.env` in this `notebooks/` folder and fill in values

## Step 0: Configuration

Loads values from `notebooks/.env` (same folder as this notebook).

This approach requires:
- Azure AI Search credentials
- Azure OpenAI credentials (including API key for embedding model)
- SharePoint connection details (site URL, Entra app ID, client secret, tenant ID)
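
The configuration cell prints any missing variables but lets execution continue. A fail-fast variant is sketched here, assuming you would rather stop before later cells run with partial configuration. The helper names `find_missing` and `require_env` and the sample values are hypothetical and not part of the notebook.

```python
import os

# Variable names taken from the configuration cell of this notebook
REQUIRED_VARS = [
    "AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY",
    "SPO_ENDPOINT", "SPO_APP_ID", "SPO_APP_SECRET", "SPO_TENANT_ID",
]

def find_missing(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

def require_env(names, env=None):
    """Raise instead of printing, so later cells never run with partial config."""
    missing = find_missing(names, env)
    if missing:
        raise RuntimeError(f"Missing env vars: {', '.join(missing)}")

# Hypothetical, partially filled environment used only for illustration
sample_env = {
    "AZURE_SEARCH_ENDPOINT": "https://example.search.windows.net",
    "SPO_APP_ID": "",
}
print(find_missing(["AZURE_SEARCH_ENDPOINT", "SPO_APP_ID", "SPO_TENANT_ID"], sample_env))
```

Calling `require_env(REQUIRED_VARS)` right after `load_dotenv()` would stop the notebook at the first incomplete configuration.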

In [None]:
import os
import json
import time
import requests
from dotenv import load_dotenv

# Load .env from the same directory as this notebook
load_dotenv()

# Azure AI Search
SEARCH_URL = os.getenv("AZURE_SEARCH_ENDPOINT")
SEARCH_API_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
API_VERSION = os.getenv("AZURE_SEARCH_API_VERSION", "2025-11-01-preview")

# Azure OpenAI
AOAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AOAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AOAI_EMBEDDING_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-ada-002")
AOAI_EMBEDDING_MODEL = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL", "text-embedding-ada-002")
AOAI_CHAT_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o")

# SharePoint / Entra ID
SPO_ENDPOINT = os.getenv("SPO_ENDPOINT")
SPO_APP_ID = os.getenv("SPO_APP_ID")
SPO_APP_SECRET = os.getenv("SPO_APP_SECRET")
SPO_TENANT_ID = os.getenv("SPO_TENANT_ID")

# Build SharePoint connection string (application permissions with client secret)
SPO_CONNECTION_STRING = (
    f"SharePointOnlineEndpoint={SPO_ENDPOINT};"
    f"ApplicationId={SPO_APP_ID};"
    f"ApplicationSecret={SPO_APP_SECRET};"
    f"TenantId={SPO_TENANT_ID}"
)

# Common headers
HEADERS = {
    "Content-Type": "application/json",
    "api-key": SEARCH_API_KEY
}

# Resource names
KNOWLEDGE_SOURCE_NAME = "sharepoint-indexed-ks"
KNOWLEDGE_BASE_NAME = "sharepoint-indexed-kb"

# Validate
missing = [v for v in [
    "AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY",
    "SPO_ENDPOINT", "SPO_APP_ID", "SPO_APP_SECRET", "SPO_TENANT_ID"
] if not os.getenv(v)]

if missing:
    print(f"❌ Missing env vars: {', '.join(missing)}")
    print("   Copy .env.template to .env and fill in values.")
else:
    print(f"Search endpoint:     {SEARCH_URL}")
    print(f"OpenAI endpoint:     {AOAI_ENDPOINT}")
    print(f"Embedding model:     {AOAI_EMBEDDING_MODEL} ({AOAI_EMBEDDING_DEPLOYMENT})")
    print(f"Chat deployment:     {AOAI_CHAT_DEPLOYMENT}")
    print(f"SharePoint site:     {SPO_ENDPOINT}")
    print(f"Entra App ID:        {SPO_APP_ID}")
    print(f"Tenant ID:           {SPO_TENANT_ID}")
    print("Configuration loaded ✓")

### Helper function

In [None]:
def call_search_api(method, path, body=None, extra_headers=None, expected_status=None):
    """Make a call to Azure AI Search REST API."""
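    # Strip any trailing slash so the endpoint and path join with a single "/"
    url = f"{SEARCH_URL.rstrip('/')}/{path}"
    params = {"api-version": API_VERSION}
    headers = {**HEADERS, **(extra_headers or {})}

    # A timeout keeps a stalled connection from hanging the notebook indefinitely
    response = requests.request(method, url, headers=headers, params=params, json=body, timeout=60)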

    status = response.status_code
    if expected_status and status not in expected_status:
        print(f"❌ {method} {path} → HTTP {status}")
        print(response.text)
        return None

    print(f"✓ {method} {path} → HTTP {status}")
    if response.text:
        try:
            return response.json()
        except ValueError:
            return response.text
    return None

## Step 1: Create Indexed SharePoint Knowledge Source

An `IndexedSharePointKnowledgeSource` tells Azure AI Search to:
1. Connect to your SharePoint site using **application permissions** (client secret)
2. **Auto-generate** a full indexer pipeline (data source, skillset, index, indexer)
3. Index and chunk content, optionally creating vector embeddings

### Key parameters

| Parameter | Description |
|-----------|-------------|
| `connectionString` | SharePoint connection string with app credentials ([format](https://learn.microsoft.com/en-us/azure/search/search-how-to-index-sharepoint-online#connection-string-format)) |
| `containerName` | `defaultSiteLibrary` (default library) or `allSiteLibraries` (all libraries) |
| `query` | Optional — filter to specific libraries ([query syntax](https://learn.microsoft.com/en-us/azure/search/search-how-to-index-sharepoint-online#controlling-which-documents-are-indexed)) |

### Ingestion parameters

| Parameter | Description |
|-----------|-------------|
| `embeddingModel` | **Required** — text embedding model (`text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`) |
| `chatCompletionModel` | Optional — enables image verbalization (`gpt-4o`, `gpt-4.1`, etc.) |
| `contentExtractionMode` | `minimal` (default text extraction) or `standard` (Azure Content Understanding) |
| `disableImageVerbalization` | `false` (default) — set `true` to skip image verbalization |
| `ingestionSchedule` | Optional — schedule for automatic re-indexing |
| `ingestionPermissionOptions` | Optional — `user_ids`, `group_ids`, or `rbac_scope` for document-level permissions |

> **Note**: Once created, the auto-generated objects (data source, index, skillset, indexer) are named based on the knowledge source name. **Do not edit** these objects directly — modifying them can break the pipeline.
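
The parameters in the two tables can be assembled with a small builder. This is a minimal sketch, not an official client: the function names are hypothetical, the endpoint and key values are placeholders, and placing `query` next to `containerName` is an assumption based on the key parameters table.

```python
VALID_CONTAINERS = {"defaultSiteLibrary", "allSiteLibraries"}

def azure_openai_model(resource_uri, deployment_id, api_key, model_name):
    """Build the model block shared by embeddingModel and chatCompletionModel."""
    return {
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
            "resourceUri": resource_uri,
            "deploymentId": deployment_id,
            "apiKey": api_key,
            "modelName": model_name,
        },
    }

def build_knowledge_source_body(name, connection_string, embedding_model,
                                chat_model=None, container_name="defaultSiteLibrary",
                                query=None, disable_image_verbalization=False):
    """Assemble an indexedSharePoint knowledge source request body."""
    if container_name not in VALID_CONTAINERS:
        raise ValueError(f"containerName must be one of {sorted(VALID_CONTAINERS)}")
    ingestion = {
        "disableImageVerbalization": disable_image_verbalization,
        "contentExtractionMode": "minimal",
        "embeddingModel": embedding_model,  # required
    }
    if chat_model is not None:
        ingestion["chatCompletionModel"] = chat_model  # optional
    parameters = {
        "connectionString": connection_string,
        "containerName": container_name,
        "ingestionParameters": ingestion,
    }
    if query:
        # Assumption: query sits alongside containerName
        parameters["query"] = query
    return {"name": name, "kind": "indexedSharePoint",
            "indexedSharePointParameters": parameters}

# Placeholder values for illustration only
embedding = azure_openai_model("https://example.openai.azure.com",
                               "text-embedding-3-small", "<api-key>",
                               "text-embedding-3-small")
body = build_knowledge_source_body("sharepoint-indexed-ks", "<connection-string>",
                                   embedding, container_name="allSiteLibraries")
print(sorted(body["indexedSharePointParameters"]["ingestionParameters"]))
```

Validating `containerName` locally catches a typo before the service rejects the request.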

In [None]:
knowledge_source_body = {
    "name": KNOWLEDGE_SOURCE_NAME,
    "kind": "indexedSharePoint",
    "description": "Indexed SharePoint knowledge source for policy documents",
    "indexedSharePointParameters": {
        "connectionString": SPO_CONNECTION_STRING,
        "containerName": "defaultSiteLibrary",
        "ingestionParameters": {
            "disableImageVerbalization": False,
            "contentExtractionMode": "minimal",
            "embeddingModel": {
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": AOAI_ENDPOINT,
                    "deploymentId": AOAI_EMBEDDING_DEPLOYMENT,
                    "apiKey": AOAI_API_KEY,
                    "modelName": AOAI_EMBEDDING_MODEL
                }
            },
            # Optional: uncomment to enable image verbalization with a chat model
            # "chatCompletionModel": {
            #     "kind": "azureOpenAI",
            #     "azureOpenAIParameters": {
            #         "resourceUri": AOAI_ENDPOINT,
            #         "deploymentId": AOAI_CHAT_DEPLOYMENT,
            #         "apiKey": AOAI_API_KEY,
            #         "modelName": AOAI_CHAT_DEPLOYMENT
            #     }
            # },
        }
    }
}

result = call_search_api("POST", "knowledgesources", knowledge_source_body, expected_status=[201])
if result:
    print(f"\nKnowledge source '{KNOWLEDGE_SOURCE_NAME}' created.")

    # Show the auto-generated resources
    params = result.get("indexedSharePointParameters", {})
    created = params.get("createdResources", {})
    if created:
        print("\n--- Auto-generated resources ---")
        for rtype, rname in created.items():
            print(f"  {rtype}: {rname}")

    print(f"\n{json.dumps(result, indent=2)}")

## Step 2: Check Ingestion Status

After you create the knowledge source, Azure AI Search starts indexing SharePoint content automatically. The cell below polls the ingestion status until ingestion finishes or `max_wait` seconds elapse.

The status endpoint returns:
- `synchronizationStatus`: `creating` → `active` → idle
- `currentSynchronizationState`: items processed, failed, skipped
- `lastSynchronizationState`: summary of the last completed sync
- `statistics`: average sync duration and items processed

> **Note**: Initial indexing may take several minutes depending on the number of documents and their sizes.
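
To make the shape of the status response concrete, here is a small sketch that flattens it into a summary. The sample payload is invented for illustration, and the field names are the ones the polling cell in this notebook reads; they have not been checked against the service here.

```python
def summarize_status(payload):
    """Flatten a knowledge source status payload into a few counters."""
    # currentSynchronizationState can be null when no sync is running
    current = payload.get("currentSynchronizationState") or {}
    return {
        "status": payload.get("synchronizationStatus", "unknown"),
        "processed": current.get("itemUpdatesProcessed", 0),
        "failed": current.get("itemsUpdatesFailed", 0),
        "skipped": current.get("itemsSkipped", 0),
    }

# Hypothetical payload, shaped after the fields listed above
sample_status = {
    "synchronizationStatus": "active",
    "currentSynchronizationState": {
        "itemUpdatesProcessed": 12,
        "itemsUpdatesFailed": 1,
        "itemsSkipped": 0,
    },
    "lastSynchronizationState": None,
}
print(summarize_status(sample_status))
```

Using `or {}` rather than a `.get` default matters because a key that is present with a null value would otherwise return `None`.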

In [None]:
def check_ingestion_status(knowledge_source_name, poll=False, interval=15, max_wait=600):
    """Check (and optionally poll) the ingestion status of a knowledge source."""
    elapsed = 0
    while True:
        result = call_search_api(
            "GET",
            f"knowledgesources/{knowledge_source_name}/status",
            expected_status=[200]
        )

        if result:
            sync_status = result.get("synchronizationStatus", "unknown")
            current = result.get("currentSynchronizationState") or {}
            last = result.get("lastSynchronizationState") or {}
            stats = result.get("statistics") or {}

            processed = current.get("itemUpdatesProcessed", 0)
            failed = current.get("itemsUpdatesFailed", 0)
            skipped = current.get("itemsSkipped", 0)

            print(f"\n  Status: {sync_status}")
            print(f"  Items processed: {processed}, failed: {failed}, skipped: {skipped}")

            if last:
                print(f"  Last sync: {last.get('startTime', '?')} → {last.get('endTime', '?')}")
                print(f"    Processed: {last.get('itemUpdatesProcessed', 0)}, "
                      f"Failed: {last.get('itemsUpdatesFailed', 0)}")

            if stats:
                print(f"  Total syncs: {stats.get('totalSynchronization', 0)}, "
                      f"Avg duration: {stats.get('averageSynchronizationDuration', '?')}")

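            # Keep polling while ingestion is still in progress ("creating" or "active");
            # return immediately for a one-time check or for any other status
            if not poll or sync_status not in ("creating", "active"):
                return result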

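        # A failed request during a one-time check should not enter the wait loop
        if not poll:
            return result

        if elapsed >= max_wait:
            print(f"\n⏱ Timeout after {max_wait}s. Check status again later.")
            return result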

        print(f"\n⏳ Waiting {interval}s... (elapsed: {elapsed}s / {max_wait}s max)")
        time.sleep(interval)
        elapsed += interval

# Poll until ingestion finishes or max_wait elapses (set poll=False for a one-time check)
check_ingestion_status(KNOWLEDGE_SOURCE_NAME, poll=True, interval=15, max_wait=120)

## Step 3: Review Auto-Generated Resources

When you create an indexed SharePoint knowledge source, Azure AI Search generates four objects:
- **Data source** (`{name}-datasource`) — connection to SharePoint
- **Index** (`{name}-index`) — stores chunks, vectors, and metadata
- **Skillset** (`{name}-skillset`) — chunking and embedding pipeline
- **Indexer** (`{name}-indexer`) — orchestrates the pipeline

You can also verify these in the [Azure portal](https://portal.azure.com) under your search service.

> ⚠️ **Do not edit** these auto-generated objects. Modifying them can break the pipeline.
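
The naming pattern above can be expressed as a quick lookup, which is handy when searching for the objects in the portal. This is a sketch that assumes the `{name}-suffix` pattern listed in this step holds; the `createdResources` block returned by the service remains the authoritative source, and the function names are hypothetical.

```python
def expected_resource_names(knowledge_source_name):
    """Derive the object names implied by the {name}-suffix pattern."""
    suffixes = ("datasource", "index", "skillset", "indexer")
    return {suffix: f"{knowledge_source_name}-{suffix}" for suffix in suffixes}

def name_mismatches(expected, created):
    """Return resource types whose reported name differs from the expected one."""
    return {
        rtype: (name, created.get(rtype))
        for rtype, name in expected.items()
        if created.get(rtype) != name
    }

expected = expected_resource_names("sharepoint-indexed-ks")
for rtype, rname in expected.items():
    print(f"{rtype}: {rname}")
```

Passing the `createdResources` dictionary from the knowledge source response as `created` shows whether the service followed the same pattern.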

In [None]:
# Retrieve the knowledge source definition to see created resources
result = call_search_api("GET", f"knowledgesources/{KNOWLEDGE_SOURCE_NAME}", expected_status=[200])
if result:
    params = result.get("indexedSharePointParameters", {})
    created = params.get("createdResources", {})

    if created:
        print("=== Auto-Generated Resources ===")
        for rtype, rname in created.items():
            print(f"  {rtype}: {rname}")

        # Check indexer status for details
        indexer_name = created.get("indexer")
        if indexer_name:
            print(f"\n=== Indexer Status ({indexer_name}) ===")
            indexer_status = call_search_api(
                "GET", f"indexers/{indexer_name}/status", expected_status=[200]
            )
            if indexer_status:
                last = indexer_status.get("lastResult", {})
                print(f"  Status: {last.get('status', 'unknown')}")
                print(f"  Items processed: {last.get('itemCount', 0)}")
                print(f"  Items failed: {last.get('failedItemCount', 0)}")
                if last.get("errors"):
                    print(f"  Errors:")
                    for err in last["errors"][:5]:
                        print(f"    - {err.get('errorMessage', '')[:200]}")

        # Check index document count
        index_name = created.get("index")
        if index_name:
            print(f"\n=== Index Stats ({index_name}) ===")
            index_stats = call_search_api(
                "GET", f"indexes/{index_name}/stats", expected_status=[200]
            )
            if index_stats:
                print(f"  Document count: {index_stats.get('documentCount', 0)}")
                print(f"  Storage size: {index_stats.get('storageSize', 0):,} bytes")
    else:
        print("No created resources found yet. The pipeline may still be provisioning.")

## Step 4: Create Knowledge Base

A Knowledge Base ties together:
- The knowledge source (indexed SharePoint)
- An AI model for **query planning** and **answer synthesis**

For indexed SharePoint knowledge sources, set `includeReferenceSourceData` to `true` in retrieve requests to pull the source document URL into citations.

The request uses `PUT` (create-or-update), so rerunning the cell is idempotent.

In [None]:
knowledge_base_body = {
    "name": KNOWLEDGE_BASE_NAME,
    "description": "Knowledge base for indexed SharePoint policy documents",
    "knowledgeSources": [
        {"name": KNOWLEDGE_SOURCE_NAME}
    ],
    "models": [
        {
            "kind": "azureOpenAI",
            "azureOpenAIParameters": {
                "resourceUri": AOAI_ENDPOINT,
                "deploymentId": AOAI_CHAT_DEPLOYMENT,
                "apiKey": AOAI_API_KEY,
                "modelName": AOAI_CHAT_DEPLOYMENT   # assumes the deployment is named after its model, e.g. "gpt-4o"
            }
        }
    ]
}

# PUT uses knowledgebases('name') format; Prefer header is required
result = call_search_api(
    "PUT",
    f"knowledgebases('{KNOWLEDGE_BASE_NAME}')",
    knowledge_base_body,
    extra_headers={"Prefer": "return=representation"},
    expected_status=[200, 201]
)
if result:
    print(f"\nKnowledge base '{KNOWLEDGE_BASE_NAME}' created/updated.")
    print(json.dumps(result, indent=2))

## Step 5: Query the Knowledge Base

The [`retrieve` action](https://learn.microsoft.com/en-us/azure/search/agentic-retrieval-how-to-retrieve) queries the indexed SharePoint data through the knowledge base.

Key differences from the FoundryIQ approach:
- **No user identity token** — data is in the search index, so the `api-key` is sufficient
- **No Copilot license** required
- **Standard search limits** apply (no 200 req/user/hour restriction)
- Set `includeReferenceSourceData` to `true` to get source document URLs in citations

> **Tip**: Make sure ingestion (Step 2) is complete before querying. You can verify with Step 3.
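
Once a retrieve response is available, its references can be ranked before display. The sketch below assumes the reference fields that the query cell in this notebook reads (`webUrl`, `docKey`, `rerankerScore`); the sample response and URLs are invented, and `top_references` is a hypothetical helper.

```python
def top_references(result, limit=3):
    """Return (location, score) pairs for the highest-scoring references."""
    refs = result.get("references", [])
    # Keep only references that carry a numeric reranker score
    scored = [r for r in refs if isinstance(r.get("rerankerScore"), (int, float))]
    scored.sort(key=lambda r: r["rerankerScore"], reverse=True)
    return [
        (r.get("webUrl") or r.get("docKey") or "N/A", r["rerankerScore"])
        for r in scored[:limit]
    ]

# Hypothetical response fragment for illustration
sample_result = {
    "references": [
        {"webUrl": "https://contoso.sharepoint.com/docs/a.docx", "rerankerScore": 2.1},
        {"docKey": "chunk-42", "rerankerScore": 3.4},
        {"webUrl": "https://contoso.sharepoint.com/docs/b.pdf"},
    ]
}
for location, score in top_references(sample_result, limit=2):
    print(f"{score:.1f}  {location}")
```

References without a score are dropped here; whether to keep them at the end of the list instead is a design choice.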

In [None]:
SEARCH_QUERY = "What do you know of ZAVA?"  # Change this to your query

retrieve_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": SEARCH_QUERY
                }
            ]
        }
    ],
    "includeActivity": True,
    "knowledgeSourceParams": [
        {
            "knowledgeSourceName": KNOWLEDGE_SOURCE_NAME,
            "kind": "indexedSharePoint",
            "includeReferences": True,
            "includeReferenceSourceData": True   # Required for source doc URLs in citations
        }
    ]
}

result = call_search_api(
    "POST",
    f"knowledgebases('{KNOWLEDGE_BASE_NAME}')/retrieve",
    retrieve_body,
    expected_status=[200, 206]
)

if result:
    # Display response messages
    print("=== Answer ===")
    for msg in result.get("response", []):
        for part in msg.get("content", []):
            text = part.get("text", "")
            print(text[:2000])

    # Display references
    refs = result.get("references", [])
    if refs:
        print(f"\n--- {len(refs)} Reference(s) ---")
        for i, ref in enumerate(refs, 1):
            url = ref.get('webUrl', ref.get('docKey', 'N/A'))
            score = ref.get('rerankerScore', '')
            chunk = ref.get('chunkId', '')
            print(f"  {i}. {url}")
            if score:
                print(f"     Reranker score: {score}")
            if chunk:
                print(f"     Chunk: {chunk}")

    # Display activity summary
    activity = result.get("activity", [])
    if activity:
        print(f"\n--- Activity ({len(activity)} steps) ---")
        for a in activity:
            elapsed = a.get('elapsedMs', '?')
            atype = a.get('type')
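            # Collect every available detail so token counts and result counts can both appear
            details = []
            if 'inputTokens' in a:
                details.append(f"in: {a['inputTokens']}, out: {a.get('outputTokens', '?')}")
            if 'count' in a:
                details.append(f"{a['count']} results")
            extra = f" ({'; '.join(details)})" if details else ""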
            print(f"  {atype}: {elapsed}ms{extra}")
            if 'error' in a:
                print(f"    ❌ Error: {json.dumps(a['error'], indent=4)}")
else:
    print("❌ No result returned. Check the error output above.")
    print("   Make sure ingestion is complete (Step 2) before querying.")

## Step 6: List Knowledge Sources & Knowledge Bases

Verify your resources were created successfully.

In [None]:
print("=== Knowledge Sources ===")
result = call_search_api("GET", "knowledgesources", expected_status=[200])
if result:
    for ks in result.get("value", []):
        kind = ks.get('kind', '?')
        name = ks.get('name', '?')
        desc = ks.get('description', '')
        print(f"  - {name} ({kind}): {desc}")

print("\n=== Knowledge Bases ===")
result = call_search_api("GET", "knowledgebases", expected_status=[200])
if result:
    for kb in result.get("value", []):
        name = kb.get('name', '?')
        sources = [s.get('name') for s in kb.get('knowledgeSources', [])]
        print(f"  - {name}: sources={sources}")

## Cleanup (Optional)

Deleting the knowledge source also deletes all auto-generated objects (data source, index, skillset, indexer).

**Important**: You must delete the knowledge base **before** the knowledge source, since a knowledge source can't be deleted while it's referenced by a knowledge base.

> If you used an existing index to create a knowledge source, your index is **not** deleted.
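
The deletion order can be captured in a small helper. In this sketch `delete` is a hypothetical callable that takes a path and returns `True` on success; the notebook's `call_search_api` returns `None` for both a successful 204 and a failure, so it would need a thin wrapper before being used this way.

```python
def cleanup(delete, knowledge_base_name, knowledge_source_name):
    """Delete the knowledge base first, then the knowledge source."""
    steps = [
        f"knowledgebases('{knowledge_base_name}')",
        f"knowledgesources('{knowledge_source_name}')",
    ]
    for path in steps:
        if not delete(path):
            # Stop early: the knowledge source cannot be deleted while
            # a knowledge base still references it
            return False
    return True

# Stand-in delete function that only records the order of calls
calls = []

def fake_delete(path):
    calls.append(path)
    return True

cleanup(fake_delete, "sharepoint-indexed-kb", "sharepoint-indexed-ks")
print(calls)
```

Stopping at the first failure avoids a confusing second error from the knowledge source delete.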

In [None]:
# Uncomment to delete all resources
# Step 1: Delete the knowledge base first (it references the knowledge source)
# call_search_api("DELETE", f"knowledgebases('{KNOWLEDGE_BASE_NAME}')", expected_status=[204, 404])
# print(f"Knowledge base '{KNOWLEDGE_BASE_NAME}' deleted.")

# Step 2: Delete the knowledge source (also deletes auto-generated data source, index, skillset, indexer)
# call_search_api("DELETE", f"knowledgesources('{KNOWLEDGE_SOURCE_NAME}')", expected_status=[204, 404])
# print(f"Knowledge source '{KNOWLEDGE_SOURCE_NAME}' deleted (including all auto-generated resources).")

## Limitations & Notes

### Generated Pipeline
- **Do not edit** auto-generated objects (data source, index, skillset, indexer) — modifying them can break the pipeline
- Object names are based on the knowledge source name and **cannot be changed**
- Deleting the knowledge source **deletes all generated objects** (unless you used an existing index)
- The generated indexer runs once on creation; use `ingestionSchedule` to automate re-indexing

### SharePoint Indexer Limitations
- Only **document library** content is indexed — no SharePoint Lists, .ASPX pages, or OneNote notebooks
- Renaming a SharePoint folder breaks incremental indexing (treated as new content)
- No support for tenants with [Conditional Access](https://learn.microsoft.com/en-us/entra/identity/conditional-access/overview) enabled
- No Private Endpoint support — use [firewall rules](https://learn.microsoft.com/en-us/azure/search/service-configure-firewall) instead
- Encrypted files require [Purview sensitivity label configuration](https://learn.microsoft.com/en-us/azure/search/search-indexer-sensitivity-labels)

### Supported File Formats
CSV, EML, EPUB, GZ, HTML, JSON, KML, Markdown, Office (DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG), ODF (ODT/ODS/ODP), PDF, plain text, RTF, XML, ZIP

### Embedding & Content Extraction
- **Embedding models**: `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`
- **Content extraction modes**:
  - `minimal` — standard text extraction (default)
  - `standard` — advanced cracking + chunking via Azure Content Understanding (requires `aiServices` and `assetStore`)
- **Image verbalization**: Optional, requires a chat model (`gpt-4o`, `gpt-4.1`, etc.) and `disableImageVerbalization: false`
- Only `apiKey` and `deploymentId` are editable after creation for embedding and chat models

### Permissions
- Application permissions required: `Files.Read.All` + `Sites.Read.All`
- For [ACL sync](https://learn.microsoft.com/en-us/azure/search/search-indexer-sharepoint-access-control-lists): `Files.Read.All` + `Sites.FullControl.All`
- Use `ingestionPermissionOptions` (`user_ids`, `group_ids`, `rbac_scope`) for document-level permissions

### Preview API
- Uses `2025-11-01-preview` — API may change before GA