# Getting Started: SharePoint Indexed Knowledge Source

This notebook demonstrates how to ingest and index SharePoint Online documents into Azure AI Search for fast, pre-indexed retrieval.

## What You'll Learn

- Create an Azure AD App Registration for SharePoint access
- Configure a SharePoint connection string
- Create an indexed SharePoint knowledge source
- Monitor ingestion progress
- Query pre-indexed SharePoint documents

## Prerequisites

- Azure subscription and a Microsoft 365 tenant with SharePoint Online
- SharePoint Online site with documents
- Azure CLI installed and logged in (`az login`)
- Existing Azure AI Foundry project (see notebook 01)
- Existing Azure AI Search service (see notebook 01)
- Admin access to Azure AD, now Microsoft Entra ID (to create the App Registration and grant admin consent)

## Architecture Overview

```
SharePoint Online → [Ingestion] → Azure AI Search Index → Knowledge Base → Retrieval API
                         ↓
               Chunking + Embedding
```

**Note:** Indexed SharePoint sources ingest documents into Azure AI Search, enabling fast retrieval without real-time SharePoint queries.
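
To make the "Chunking" stage in the diagram concrete, here is a toy sketch of splitting a document into overlapping character windows. It is only an illustration: the strategy the service applies during ingestion is not described in this notebook, and `chunk_text`, `size`, and `overlap` are hypothetical names chosen for the example. The embedding stage is left out because it needs a deployed model.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = chunk_text("abcdefghij", size=4, overlap=2)
print(chunks)
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.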

## Step 1: Create Azure AD App Registration for SharePoint

To access SharePoint programmatically, you need to create an App Registration in Azure AD.
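
The cells in this step shell out to the Azure CLI and parse its JSON output. A small wrapper can make failures visible before `json.loads` runs on empty output. The helper below is a sketch, not part of the Azure CLI: `run_json` is a hypothetical name, and the demo uses the Python interpreter as a stand-in command so it runs without `az` installed.

```python
import json
import subprocess
import sys


def run_json(cmd: list[str]) -> dict:
    """Run a command, raise if it exits non-zero, and parse its stdout as JSON."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with exit code {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


# Stand-in command; in this notebook you would pass something like
# ["az", "ad", "app", "create", "--display-name", APP_NAME, "--output", "json"].
demo = run_json([sys.executable, "-c", "import json; print(json.dumps({'appId': 'demo'}))"])
print(demo["appId"])
```

With a wrapper like this, an expired login or missing permission shows up as a `RuntimeError` carrying the CLI's stderr instead of a JSON decoding error.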

In [None]:
# Configuration
import subprocess
import json

APP_NAME = "SharePoint-Knowledge-App"

# Get the tenant ID of the signed-in account (check=True raises if `az` fails)
result = subprocess.run(
    ["az", "account", "show", "--query", "tenantId", "-o", "tsv"],
    capture_output=True,
    text=True,
    check=True,
)
TENANT_ID = result.stdout.strip()
print(f"Tenant ID: {TENANT_ID}")

In [None]:
# Create App Registration (check=True raises if `az` fails, before JSON parsing)
result = subprocess.run(
    ["az", "ad", "app", "create",
     "--display-name", APP_NAME,
     "--output", "json"],
    capture_output=True,
    text=True,
    check=True,
)

app_info = json.loads(result.stdout)
APP_ID = app_info["appId"]      # Application (client) ID
OBJECT_ID = app_info["id"]      # Directory object ID of the app

print("App Registration created!")
print(f"Application (client) ID: {APP_ID}")
print(f"Object ID: {OBJECT_ID}")

In [None]:
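# Create Service Principal (check=True raises if `az` fails,
# so the success message only prints when the call worked)
result = subprocess.run(
    ["az", "ad", "sp", "create",
     "--id", APP_ID,
     "--output", "json"],
    capture_output=True,
    text=True,
    check=True,
)

print("Service Principal created!")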

In [None]:
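# Grant SharePoint permissions
# SharePoint API ID: 00000003-0000-0ff1-ce00-000000000000
# Sites.Read.All permission ID: 205e70e5-aba6-4c52-a976-6d2d46c48043
# "=Role" below requests an application permission, so the ID must be an
# app role exposed by that API. Verify the pair in your tenant first, e.g.:
#   az ad sp show --id 00000003-0000-0ff1-ce00-000000000000 \
#     --query "appRoles[].{value:value,id:id}" -o table
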
# Add API permission for Sites.Read.All
!az ad app permission add \
  --id {APP_ID} \
  --api 00000003-0000-0ff1-ce00-000000000000 \
  --api-permissions 205e70e5-aba6-4c52-a976-6d2d46c48043=Role

print("SharePoint permission added (Sites.Read.All)")
print("\nIMPORTANT: You must manually grant admin consent in Azure Portal:")
print(f"1. Go to: https://portal.azure.com/#view/Microsoft_AAD_RegisteredApps/ApplicationMenuBlade/~/CallAnAPI/appId/{APP_ID}")
print("2. Click 'Grant admin consent for <your-tenant>'")
print("3. Wait for the consent to be granted before proceeding")

In [None]:
# Create client secret (--append keeps any existing credentials)
result = subprocess.run(
    ["az", "ad", "app", "credential", "reset",
     "--id", APP_ID,
     "--append",
     "--output", "json"],
    capture_output=True,
    text=True,
    check=True,
)

secret_info = json.loads(result.stdout)
CLIENT_SECRET = secret_info["password"]

# Do not print any part of the secret: notebook output is saved with the file
print("Client secret created and stored in CLIENT_SECRET.")
print("\nIMPORTANT: Save this secret securely - it won't be shown again!")

## Step 2: Configure SharePoint Connection String

Build the connection string for SharePoint access.
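
Because the connection string is a semicolon-delimited list of `key=value` pairs, an empty value or a stray `;` in any field silently corrupts it. The sketch below builds the same format as the next cell with basic validation. `build_sharepoint_connection_string` is a hypothetical helper name, and the checks are assumptions about what is worth catching early, not rules enforced by the service.

```python
def build_sharepoint_connection_string(
    site_url: str, app_id: str, secret: str, tenant_id: str
) -> str:
    """Join the four fields into the key=value;... format used in this notebook."""
    fields = {
        "SharePointOnlineEndpoint": site_url,
        "ApplicationId": app_id,
        "ApplicationSecret": secret,
        "TenantId": tenant_id,
    }
    for key, value in fields.items():
        if not value:
            raise ValueError(f"{key} is empty")
        if ";" in value:
            raise ValueError(f"{key} contains ';', which would break the format")
    if not site_url.startswith("https://"):
        raise ValueError("SharePointOnlineEndpoint should be an https:// site URL")
    return ";".join(f"{key}={value}" for key, value in fields.items())


# Placeholder values for illustration only
print(build_sharepoint_connection_string(
    "https://yourtenant.sharepoint.com/sites/yoursite", "app-id", "secret", "tenant-id"
))
```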

In [None]:
# SharePoint site configuration
SHAREPOINT_SITE_URL = "https://yourtenant.sharepoint.com/sites/yoursite"  # Update this!

# Build connection string
SHAREPOINT_CONNECTION_STRING = (
    f"SharePointOnlineEndpoint={SHAREPOINT_SITE_URL};"
    f"ApplicationId={APP_ID};"
    f"ApplicationSecret={CLIENT_SECRET};"
    f"TenantId={TENANT_ID}"
)

print("SharePoint connection string configured!")
print(f"Site: {SHAREPOINT_SITE_URL}")

## Step 3: Configure Existing Resources

Set up references to existing Azure resources.
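
Hard-coding API keys in a notebook makes them easy to commit by accident. One alternative is to read them from environment variables and fail early when a value is missing or still a `<placeholder>`. This is a sketch: `require_env` and the variable name `SEARCH_ENDPOINT` are hypothetical, not names this notebook or Azure defines.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable `name`, rejecting empty or placeholder values."""
    value = os.environ.get(name, "").strip()
    if not value or value.startswith("<"):
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value


# Demo value so the sketch runs on its own; export real values in your shell instead
os.environ.setdefault("SEARCH_ENDPOINT", "https://example.search.windows.net")
SEARCH_ENDPOINT = require_env("SEARCH_ENDPOINT")
print(SEARCH_ENDPOINT)
```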

In [None]:
# Existing resources (from notebook 01 or your own)
EXISTING_SEARCH_ENDPOINT = "https://<your-search-service>.search.windows.net"
EXISTING_SEARCH_API_KEY = "<your-search-api-key>"
EXISTING_FOUNDRY_ENDPOINT = "https://<your-foundry-project>.services.ai.azure.com/api/projects/<project-name>"
EXISTING_AZURE_OPENAI_KEY = "<your-api-key>"
EXISTING_EMBEDDING_DEPLOYMENT = "text-embedding-3-small"
EXISTING_CHAT_DEPLOYMENT = "gpt-4o-mini"

# API version
API_VERSION = "2025-11-01-preview"

## Step 4: Create Indexed SharePoint Knowledge Source

Create a knowledge source that ingests SharePoint documents.
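
The request body in the next cell embeds the SharePoint connection string and model API keys, so printing it (or an echoed copy) for debugging can leak them into saved notebook output. The sketch below masks those values before printing. `redact` and `SECRET_KEYS` are hypothetical names, and the key list covers only the secret-bearing fields that appear in this notebook.

```python
import json

# Field names from this notebook's request bodies that carry secrets
SECRET_KEYS = {"apiKey", "connectionString"}


def redact(obj):
    """Return a copy of a JSON-like structure with secret values masked."""
    if isinstance(obj, dict):
        return {
            key: "<redacted>" if key in SECRET_KEYS and value else redact(value)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item) for item in obj]
    return obj


sample = {
    "name": "sharepoint-indexed-source",
    "indexedSharePointParameters": {
        "connectionString": "SharePointOnlineEndpoint=...;ApplicationSecret=...",
        "ingestionParameters": {
            "embeddingModel": {"azureOpenAIParameters": {"apiKey": "k", "deploymentId": "d"}}
        },
    },
}
print(json.dumps(redact(sample), indent=2))
```

The function builds new containers rather than editing in place, so the original body can still be sent with `requests.put`.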

In [None]:
import requests

KNOWLEDGE_SOURCE_NAME = "sharepoint-indexed-source"

url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}?api-version={API_VERSION}"

headers = {
    "api-key": EXISTING_SEARCH_API_KEY,
    "Content-Type": "application/json"
}

body = {
    "name": KNOWLEDGE_SOURCE_NAME,
    "kind": "indexedSharePoint",
    "description": "Indexed SharePoint Online documents",
    "indexedSharePointParameters": {
        "connectionString": SHAREPOINT_CONNECTION_STRING,
        "containerName": "allSiteLibraries",  # Options: allSiteLibraries, defaultSiteLibrary
        "query": "",  # Optional: filter files (e.g., "*.pdf OR *.docx")
        "ingestionParameters": {
            "identity": None,
            "embeddingModel": {
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": EXISTING_FOUNDRY_ENDPOINT,
                    "deploymentId": EXISTING_EMBEDDING_DEPLOYMENT,
                    "modelName": EXISTING_EMBEDDING_DEPLOYMENT,
                    "apiKey": EXISTING_AZURE_OPENAI_KEY
                }
            },
            "chatCompletionModel": {
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": EXISTING_FOUNDRY_ENDPOINT,
                    "deploymentId": EXISTING_CHAT_DEPLOYMENT,
                    "modelName": EXISTING_CHAT_DEPLOYMENT,
                    "apiKey": EXISTING_AZURE_OPENAI_KEY
                }
            },
            "contentExtractionMode": "minimal"  # Options: minimal, comprehensive
        }
    }
}

response = requests.put(url, headers=headers, json=body)
print(f"Status: {response.status_code}")
print(json.dumps(response.json(), indent=2))

## Step 5: Monitor Ingestion Progress

Check the status of document ingestion.

In [None]:
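import time

status_url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}/status?api-version={API_VERSION}"

MAX_WAIT_SECONDS = 60 * 60   # Stop polling after an hour instead of looping forever
POLL_INTERVAL_SECONDS = 15
deadline = time.monotonic() + MAX_WAIT_SECONDS

print("Monitoring ingestion progress...\n")
while True:
    response = requests.get(status_url, headers=headers)
    if not response.ok:
        # Surface HTTP errors (bad key, wrong name) instead of polling on "unknown"
        print(f"Status request failed: {response.status_code} {response.text}")
        break
    status = response.json()

    current_status = status.get("status", "unknown")
    print(f"Status: {current_status}")

    if "documentsProcessed" in status:
        print(f"Documents processed: {status['documentsProcessed']}")

    if current_status == "succeeded":
        print("\n✅ Ingestion completed successfully!")
        print(json.dumps(status, indent=2))
        break
    elif current_status == "failed":
        print("\n❌ Ingestion failed!")
        print(json.dumps(status, indent=2))
        break
    elif time.monotonic() >= deadline:
        print(f"\nStopped polling after {MAX_WAIT_SECONDS} seconds; last status: {current_status}")
        break

    time.sleep(POLL_INTERVAL_SECONDS)
    print("---")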

## Step 6: Create Knowledge Base

Create a knowledge base using the indexed SharePoint source.

In [None]:
KNOWLEDGE_BASE_NAME = "sharepoint-indexed-kb"

url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeBases/{KNOWLEDGE_BASE_NAME}?api-version={API_VERSION}"

body = {
    "name": KNOWLEDGE_BASE_NAME,
    "description": "Knowledge base with indexed SharePoint documents",
    "knowledgeSources": [
        {
            "name": KNOWLEDGE_SOURCE_NAME
        }
    ],
    "models": [
        {
            "kind": "azureOpenAI",
            "azureOpenAIParameters": {
                "resourceUri": EXISTING_FOUNDRY_ENDPOINT,
                "deploymentId": EXISTING_CHAT_DEPLOYMENT,
                "modelName": EXISTING_CHAT_DEPLOYMENT,
                "apiKey": EXISTING_AZURE_OPENAI_KEY
            }
        }
    ],
    "outputMode": "answerSynthesis",
    "retrievalInstructions": "Retrieve relevant information from SharePoint documents.",
    "answerInstructions": "Provide clear, accurate answers with citations from SharePoint."
}

response = requests.put(url, headers=headers, json=body)
print(f"Status: {response.status_code}")
print(json.dumps(response.json(), indent=2))

## Step 7: Query the Knowledge Base

Query the pre-indexed SharePoint documents.
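
Both queries in this step send the same message envelope and differ only in the question and the optional per-source parameters. A small builder keeps that nesting in one place. `build_retrieve_body` is a hypothetical helper; the field names are the ones used in the cells of this step, and it sets only a subset of the per-source options shown there.

```python
from typing import Optional


def build_retrieve_body(
    question: str,
    source_name: Optional[str] = None,
    reranker_threshold: Optional[float] = None,
) -> dict:
    """Build a retrieve request body with one user message and optional source params."""
    body = {
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": question}]}
        ],
        "includeActivity": True,
    }
    if source_name is not None:
        params = {
            "knowledgeSourceName": source_name,
            "kind": "indexedSharePoint",
            "includeReferences": True,
        }
        if reranker_threshold is not None:
            params["rerankerThreshold"] = reranker_threshold
        body["knowledgeSourceParams"] = [params]
    return body


print(build_retrieve_body("What are the latest project updates?"))
```

The result can be passed as `json=` to `requests.post` in place of a hand-written `query_body`.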

In [None]:
# Simple query
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeBases/{KNOWLEDGE_BASE_NAME}/retrieve?api-version={API_VERSION}"

query_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are the latest project updates?"
                }
            ]
        }
    ],
    "includeActivity": True
}

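response = requests.post(url, headers=headers, json=query_body)
response.raise_for_status()  # Fail with the HTTP error rather than a KeyError below
result = response.json()

# The retrieve API returns the synthesized answer under "response"
# and the citations under a top-level "references" list.
print("Answer:")
for message in result.get("response", []):
    for part in message.get("content", []):
        print(part.get("text", ""))
print("\nReferences:")
for ref in result.get("references", []):
    # Reference field names vary by source kind; print(ref) to inspect them
    print(f"- {ref.get('title', 'Unknown')}: {ref.get('url', 'No URL')}")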

In [None]:
# Query with source-specific parameters
query_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Find policy documents related to data privacy"
                }
            ]
        }
    ],
    "includeActivity": True,
    "knowledgeSourceParams": [
        {
            "knowledgeSourceName": KNOWLEDGE_SOURCE_NAME,
            "kind": "indexedSharePoint",
            "includeReferences": True,
            "includeReferenceSourceData": True,
            "alwaysQuerySource": True,
            "rerankerThreshold": 0.5
        }
    ]
}

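response = requests.post(url, headers=headers, json=query_body)
response.raise_for_status()  # Fail with the HTTP error rather than a KeyError below
result = response.json()

# The synthesized answer is returned under "response"
print("Answer:")
for message in result.get("response", []):
    for part in message.get("content", []):
        print(part.get("text", ""))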

## Step 8: Optional - Schedule Automatic Re-indexing

You can configure automatic re-indexing to keep documents up-to-date.
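
The schedule interval used in this step is an ISO 8601 duration such as `PT24H` or `P1D`. A typo in that string only surfaces when the service rejects the request, so it can help to validate it locally first. The parser below is a sketch covering only the day, hour, and minute designators. `parse_interval` is a hypothetical helper, and it does not check any minimum or maximum interval the service may enforce.

```python
import re
from datetime import timedelta

# Subset of ISO 8601 durations: PnD, PTnH, PTnM and combinations such as P1DT12H
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?)?$"
)


def parse_interval(value: str) -> timedelta:
    """Convert a simple ISO 8601 duration string into a timedelta."""
    match = _DURATION.match(value)
    # Reject non-matches, a dangling "T", and strings with no components ("P", "PT")
    if not match or value.endswith("T") or not any(match.groupdict().values()):
        raise ValueError(f"Unsupported ISO 8601 duration: {value!r}")
    parts = {unit: int(number) for unit, number in match.groupdict().items() if number}
    return timedelta(**parts)


for example in ("PT6H", "PT12H", "PT24H", "P1D"):
    print(example, "->", parse_interval(example))
```

Note that `PT24H` and `P1D` parse to the same length of time.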

In [None]:
# Update knowledge source with ingestion schedule
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}?api-version={API_VERSION}"

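# Get current configuration
# Note: if the GET response masks secrets (connection string, API keys),
# restore them in current_config before sending it back with PUT.
response = requests.get(url, headers=headers)
response.raise_for_status()
current_config = response.json()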

# Add ingestion schedule
current_config["indexedSharePointParameters"]["ingestionParameters"]["ingestionSchedule"] = {
    "interval": "PT24H"  # Re-index every 24 hours (ISO 8601 duration)
}
# Other examples:
# "PT6H" - every 6 hours
# "PT12H" - every 12 hours
# "P1D" - every 1 day

response = requests.put(url, headers=headers, json=current_config)
print(f"Status: {response.status_code}")
if response.ok:
    print("Ingestion schedule updated to run every 24 hours")
else:
    print(f"Schedule update failed: {response.text}")

## Container Name Options

When creating an indexed SharePoint knowledge source, you can choose from:

- **`allSiteLibraries`**: Indexes all document libraries in the SharePoint site
- **`defaultSiteLibrary`**: Indexes only the default "Shared Documents" library

## Query Filter Options

Use the `query` parameter to filter which files are indexed:

```python
# Index only PDF files
"query": "*.pdf"

# Index PDF and Word documents
"query": "*.pdf OR *.docx"

# Index all files (default)
"query": ""
```
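
A misspelled `containerName` is easy to miss inside a large request body. The sketch below validates the scope settings before they go into the request. `sharepoint_scope` and `ALLOWED_CONTAINERS` are hypothetical names, and the allowed set contains only the two options listed in this notebook; the API may accept others.

```python
# The two container options described in this notebook
ALLOWED_CONTAINERS = {"allSiteLibraries", "defaultSiteLibrary"}


def sharepoint_scope(container_name: str = "allSiteLibraries", query: str = "") -> dict:
    """Return the containerName/query pair after checking the container option."""
    if container_name not in ALLOWED_CONTAINERS:
        allowed = ", ".join(sorted(ALLOWED_CONTAINERS))
        raise ValueError(f"Unknown containerName {container_name!r}; expected one of: {allowed}")
    return {"containerName": container_name, "query": query.strip()}


print(sharepoint_scope("defaultSiteLibrary", "*.pdf OR *.docx"))
```

The returned dictionary can be merged into `indexedSharePointParameters` with `dict.update`.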

## Cleanup

Clean up resources when done.
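
The cleanup cells print the raw HTTP status code, which leaves the reader to interpret it. The sketch below maps a DELETE status to a short message using general HTTP conventions, which are an assumption here rather than documented behaviour of this API: a 2xx code means deleted, and 404 means the resource was already gone. `describe_delete` is a hypothetical helper.

```python
def describe_delete(resource: str, status_code: int) -> str:
    """Turn a DELETE response status into a readable outcome message."""
    if 200 <= status_code < 300:
        return f"{resource}: deleted"
    if status_code == 404:
        return f"{resource}: not found (already deleted?)"
    return f"{resource}: delete failed with HTTP {status_code}"


for code in (204, 404, 409):
    print(describe_delete("knowledge base", code))
```

The cells below delete the knowledge base before the knowledge source it references.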

In [None]:
# Delete knowledge base
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeBases/{KNOWLEDGE_BASE_NAME}?api-version={API_VERSION}"
response = requests.delete(url, headers=headers)
print(f"Delete knowledge base: {response.status_code}")

In [None]:
# Delete knowledge source
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}?api-version={API_VERSION}"
response = requests.delete(url, headers=headers)
print(f"Delete knowledge source: {response.status_code}")

In [None]:
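# Optional: Delete App Registration (report success only if the CLI call worked)
result = subprocess.run(
    ["az", "ad", "app", "delete", "--id", APP_ID],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    print(f"App Registration deleted: {APP_NAME}")
else:
    print(f"Failed to delete App Registration: {result.stderr.strip()}")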

## Summary

In this notebook, you learned how to:

1. Create an Azure AD App Registration for SharePoint access
2. Configure SharePoint permissions (Sites.Read.All)
3. Build a SharePoint connection string
4. Create an indexed SharePoint knowledge source
5. Monitor document ingestion progress
6. Query pre-indexed SharePoint documents
7. Schedule automatic re-indexing

## Key Differences: Indexed vs. Remote SharePoint

| Feature | Indexed SharePoint | Remote SharePoint |
|---------|-------------------|-------------------|
| **Setup Complexity** | Higher (App Registration) | Lower (User token) |
| **Query Latency** | Low (pre-indexed) | Higher (real-time) |
| **Data Freshness** | Scheduled updates | Always current |
| **Storage Cost** | Yes (Azure AI Search) | No |
| **Query Cost** | Lower | Higher |
| **Best For** | Production workloads | Ad-hoc queries |

## Next Steps

- Explore OneLake knowledge sources (notebook 04)
- Combine multiple SharePoint sites into one knowledge base
- Implement hybrid approaches (indexed + remote)
- Set up monitoring and alerting for ingestion failures