# SharePoint FoundryIQ Setup (Agentic Retrieval)

This notebook sets up the **FoundryIQ / Agentic Retrieval** approach using `RemoteSharePointKnowledgeSource`.

Instead of building an index pipeline, this approach uses a **Knowledge Base** that connects directly to SharePoint via the [Copilot Retrieval API](https://learn.microsoft.com/en-us/microsoft-365-copilot/extensibility/api/ai-services/retrieval/overview).

**Pipeline**: Knowledge Source → Knowledge Base → Retrieve

**Reference**: [Create a remote SharePoint knowledge source](https://learn.microsoft.com/en-us/azure/search/agentic-knowledge-source-how-to-sharepoint-remote?pivots=python)

### Key Differences from Indexer Approach

| Feature | Indexer (Notebook 1) | FoundryIQ (This notebook) |
|---------|---------------------|---------------------------|
| Index needed | Yes (you build it) | No (handled by Copilot API) |
| Cross-tenant | Supported | Same-tenant only |
| Auth | App permissions | User identity (delegated) |
| Rate limit | Standard Search limits | 200 req/user/hour |
| Chat model | Not required | Required (gpt-4o / gpt-4.1 / gpt-5) |
| Security trimming | ACL fields in index | Automatic (user's access) |
| Licensing | Azure AI Search only | **Microsoft 365 Copilot license required** |

> **Prerequisites**:
> - Azure AI Search (Basic+) with [semantic ranker enabled](https://learn.microsoft.com/en-us/azure/search/semantic-how-to-enable-disable)
> - Azure OpenAI with a supported chat model deployed (`gpt-4o`, `gpt-4o-mini`, `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`, `gpt-5`, `gpt-5-mini`, `gpt-5-nano`)
> - Azure & Microsoft 365 must be in the **same Entra ID tenant**
> - **Microsoft 365 Copilot license** (usage is billed through M365)
> - SharePoint site accessible to the querying user
> - `az` CLI installed and logged in (for user identity token)
> - Copy `.env.template` to `.env` in this `notebooks/` folder and fill in values

## Step 0: Configuration

This cell loads configuration values from `notebooks/.env` (the same folder as this notebook) and reports any required variables that are missing.

In [None]:
import os
import json
import requests
import subprocess
from dotenv import load_dotenv

# load_dotenv() looks for .env starting in the current working directory (normally this notebook's folder)
load_dotenv()

# Azure AI Search
SEARCH_URL = os.getenv("AZURE_SEARCH_ENDPOINT")
SEARCH_API_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
API_VERSION = os.getenv("AZURE_SEARCH_API_VERSION", "2025-11-01-preview")

# Azure OpenAI (chat model for query planning)
AOAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AOAI_CHAT_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o")

# SharePoint
SPO_ENDPOINT = os.getenv("SPO_ENDPOINT")
SPO_TENANT_ID = os.getenv("SPO_TENANT_ID")

# Common headers
HEADERS = {
    "Content-Type": "application/json",
    "api-key": SEARCH_API_KEY
}

# Resource names
KNOWLEDGE_SOURCE_NAME = "sharepoint-knowledge-source"
KNOWLEDGE_BASE_NAME = "sharepoint-knowledge-base"

# Validate
missing = [v for v in ["AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_ADMIN_KEY",
                        "AZURE_OPENAI_ENDPOINT", "SPO_ENDPOINT", "SPO_TENANT_ID"] if not os.getenv(v)]
if missing:
    print(f"\u274c Missing env vars: {', '.join(missing)}")
    print("   Copy .env.template to .env and fill in values.")
else:
    print(f"Search endpoint:  {SEARCH_URL}")
    print(f"OpenAI endpoint:  {AOAI_ENDPOINT}")
    print(f"Chat deployment:  {AOAI_CHAT_DEPLOYMENT}")
    print(f"SharePoint site:  {SPO_ENDPOINT}")
    print("Configuration loaded \u2713")

### Helper function

In [None]:
def call_search_api(method, path, body=None, extra_headers=None, expected_status=None):
    """Make a call to Azure AI Search REST API."""
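    url = f"{SEARCH_URL.rstrip('/')}/{path}"  # tolerate a trailing slash in the endpoint
    params = {"api-version": API_VERSION}
    headers = {**HEADERS, **(extra_headers or {})}

    # A timeout keeps a stalled request from hanging the notebook indefinitely.
    response = requests.request(method, url, headers=headers, params=params, json=body, timeout=60)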

    status = response.status_code
    if expected_status and status not in expected_status:
        print(f"\u274c {method} {path} \u2192 HTTP {status}")
        print(response.text)
        return None

    print(f"\u2713 {method} {path} \u2192 HTTP {status}")
    if response.text:
        try:
            return response.json()
        except ValueError:
            return response.text
    return None

## Step 1: Create Knowledge Source

A `RemoteSharePointKnowledgeSource` connects to SharePoint at **query time** via the Copilot Retrieval API.

Unlike the indexer approach, **no data is copied** and no site URL is configured at creation — the user's identity token at query time determines which SharePoint content is accessible.

> The Azure subscription and the SharePoint site **must be in the same Entra ID tenant**.

### Optional parameters (`remoteSharePointParameters`)
- `containerTypeId` — for SharePoint Embedded connections (omit for standard SharePoint Online)
- `filterExpression` — [KQL expression](https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference) to scope retrieval at definition time
- `resourceMetadata` — metadata fields to return with results (e.g., `Author`, `Title`, `LastModifiedTime`)

### Filter expression examples

| Scenario | Expression |
|----------|-----------|
| Filter to a single site by ID | `SiteID:"00aa00aa-bb11-cc22-dd33-44ee44ee44ee"` |
| Filter to files under a path | `Path:"https://contoso.sharepoint.com/sites/mysite/Shared Documents/en/mydocs"` |
| Filter by file type | `FileExtension:"docx" OR FileExtension:"pdf" OR FileExtension:"pptx"` |
| Filter by date range | `LastModifiedTime >= 2024-07-22 AND LastModifiedTime <= 2025-01-08` |
| Filter by sensitivity label | `InformationProtectionLabelId:"f0ddcc93-d3c0-4993-b5cc-76b0a283e252"` |
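
The expressions in the table can also be composed in code rather than typed by hand. The sketch below is one possible way to do that; `kql_any_extension` and `kql_and` are hypothetical helper names (not part of any SDK), and it assumes plain string composition with parentheses is enough for your filters.

```python
def kql_any_extension(extensions):
    """OR together FileExtension terms, e.g. FileExtension:"docx" OR FileExtension:"pdf"."""
    return " OR ".join(f'FileExtension:"{ext.lstrip(".")}"' for ext in extensions)


def kql_and(*clauses):
    """AND together non-empty clauses, parenthesizing each so OR groups keep their meaning."""
    parts = [f"({c})" for c in clauses if c]
    return " AND ".join(parts)


file_filter = kql_any_extension(["docx", ".pdf"])
date_filter = "LastModifiedTime >= 2024-07-22"
filter_expression = kql_and(file_filter, date_filter)

# Same shape as the knowledge source body in this notebook, with the optional block filled in.
example_source_body = {
    "name": "sharepoint-knowledge-source",
    "kind": "remoteSharePoint",
    "remoteSharePointParameters": {
        "filterExpression": filter_expression,
        "resourceMetadata": ["Author", "LastModifiedTime"],
    },
}
print(filter_expression)
```

Because invalid KQL is silently ignored, printing the final expression before creating the knowledge source makes it easier to spot a malformed filter.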

In [None]:
knowledge_source_body = {
    "name": KNOWLEDGE_SOURCE_NAME,
    "kind": "remoteSharePoint",
    # remoteSharePointParameters is optional for standard SharePoint Online.
    # Uncomment below to add KQL filtering or metadata fields:
    # "remoteSharePointParameters": {
    #     "filterExpression": "FileType:docx OR FileType:pdf",
    #     "resourceMetadata": ["Author", "LastModifiedTime"]
    # }
}

result = call_search_api("POST", "knowledgesources", knowledge_source_body, expected_status=[201])
if result:
    print(f"\nKnowledge source '{KNOWLEDGE_SOURCE_NAME}' created.")
    print(json.dumps(result, indent=2))

## Step 2: Create Knowledge Base

A Knowledge Base ties together:
- The knowledge source (SharePoint)
- A chat model for **query planning** (any supported model from the prerequisites, such as `gpt-4o` or `gpt-4.1`)

The model decomposes user queries into sub-queries for retrieval.

The request uses `PUT` (create-or-update), so rerunning the cell is idempotent.
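
A deployment name does not have to match the underlying model name, so it can help to keep the two separate when assembling the definition. The following is a minimal sketch under that assumption; `build_knowledge_base_body` is a hypothetical helper, the endpoint and deployment name are placeholders, and the field names are simply those used in this notebook's request body.

```python
def build_knowledge_base_body(name, source_name, aoai_endpoint, deployment,
                              model_name=None, description=""):
    """Assemble a knowledge base definition with one source and one Azure OpenAI model."""
    body = {
        "name": name,
        "knowledgeSources": [{"name": source_name}],
        "models": [
            {
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": aoai_endpoint,
                    "deploymentId": deployment,
                    # Falls back to the deployment name, which is only right
                    # when the deployment is named after the model.
                    "modelName": model_name or deployment,
                },
            }
        ],
    }
    if description:
        body["description"] = description
    return body


body = build_knowledge_base_body(
    name="sharepoint-knowledge-base",
    source_name="sharepoint-knowledge-source",
    aoai_endpoint="https://example-resource.openai.azure.com",  # placeholder endpoint
    deployment="chat-prod",   # hypothetical deployment name
    model_name="gpt-4o",
)
```

The resulting dictionary can be passed to the same `PUT` call as the hand-written body.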

In [None]:
knowledge_base_body = {
    "name": KNOWLEDGE_BASE_NAME,
    "description": "SharePoint knowledge base for policy documents",
    "knowledgeSources": [
        {"name": KNOWLEDGE_SOURCE_NAME}
    ],
    "models": [
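        {
            "kind": "azureOpenAI",
            "azureOpenAIParameters": {
                "resourceUri": AOAI_ENDPOINT,
                "deploymentId": AOAI_CHAT_DEPLOYMENT,
                # modelName is the model itself (e.g. "gpt-4o"); reusing the deployment
                # name here assumes the deployment is named after the model.
                "modelName": AOAI_CHAT_DEPLOYMENT
            }
        }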
    ]
}

# Note: PUT URL uses knowledgebases('name') format; Prefer header is required
result = call_search_api(
    "PUT",
    f"knowledgebases('{KNOWLEDGE_BASE_NAME}')",
    knowledge_base_body,
    extra_headers={"Prefer": "return=representation"},
    expected_status=[200, 201]
)
if result:
    print(f"\nKnowledge base '{KNOWLEDGE_BASE_NAME}' created/updated.")
    print(json.dumps(result, indent=2))

## Step 3: Get User Identity Token

FoundryIQ queries SharePoint **on behalf of the user**. Two separate auth mechanisms are needed:

1. **`api-key` header** → authenticates to Azure AI Search (already configured in Step 0)
2. **`x-ms-query-source-authorization` header** → user identity token passed raw (no `Bearer` prefix)

The source authorization token must be scoped for **Azure AI Search** (`https://search.azure.com/.default`), as documented in the [remote SharePoint knowledge source](https://learn.microsoft.com/en-us/azure/search/agentic-knowledge-source-how-to-sharepoint-remote) guide. Azure AI Search then uses this token to call the Copilot Retrieval API on your behalf.

> Make sure you're logged in: `az login`
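
When a retrieve call is rejected, it can be useful to look inside the token before suspecting anything else. The sketch below decodes a JWT payload without verifying its signature (debugging only) and reports the `aud`, `tid`, and `exp` claims. The helper names are hypothetical, and the demo token is self-made with invented values, so the audience string of a real token may look different.

```python
import base64
import json
import time


def decode_jwt_claims(token):
    """Return the payload claims of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def describe_token(token, expected_tenant_id=None, now=None):
    """Summarize audience, tenant match, and remaining lifetime for a token."""
    claims = decode_jwt_claims(token)
    now = time.time() if now is None else now
    return {
        "audience": claims.get("aud"),
        "tenant_matches": expected_tenant_id is None
        or claims.get("tid") == expected_tenant_id,
        "seconds_left": int(claims.get("exp", 0) - now),
    }


def _fake_token(claims):
    """Build an unsigned token purely to exercise the helpers; not a real credential."""
    def enc(data):
        raw = json.dumps(data).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{enc({'alg': 'none'})}.{enc(claims)}."


demo = _fake_token({"aud": "https://search.azure.com", "tid": "tenant-123", "exp": 2000})
summary = describe_token(demo, expected_tenant_id="tenant-123", now=1000)
print(summary)
```

With a real token, passing `SPO_TENANT_ID` as `expected_tenant_id` gives a quick check of the same-tenant requirement.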

In [None]:
def get_user_token(scope="https://search.azure.com/.default"):
    """Get a user identity token scoped for Azure AI Search (used for SharePoint access)."""
    try:
        result = subprocess.run(
            ["az", "account", "get-access-token",
             "--scope", scope,
             "--query", "accessToken", "-o", "tsv"],
            capture_output=True,
            text=True,
            check=True
        )
        token = result.stdout.strip()
        print(f"\u2713 Token obtained for scope: {scope}")
        print(f"  Token length: {len(token)} chars")
        return token
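    except FileNotFoundError:
        print("\u274c 'az' CLI not found on PATH. Install the Azure CLI first.")
        return None
    except subprocess.CalledProcessError as e:
        print("\u274c Failed to get token. Run 'az login' first.")
        print(e.stderr)
        return None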

# Token scoped for Azure AI Search (Search service uses it to call Copilot Retrieval API on your behalf)
USER_TOKEN = get_user_token("https://search.azure.com/.default")

## Step 4: Query the Knowledge Base

The [`retrieve` action](https://learn.microsoft.com/en-us/azure/search/agentic-retrieval-how-to-retrieve) queries SharePoint through the knowledge base.

Key points:
- `x-ms-query-source-authorization` header carries the user identity token (raw JWT, no `Bearer` prefix)
- Results are automatically filtered by the user's SharePoint permissions
- The chat model decomposes your query into sub-queries for better retrieval
- Rate limited to **200 requests/user/hour** (Copilot Retrieval API limit)
- Query character limit: **1,500 characters**
- Maximum **25 results** per knowledge source per query
- Use `knowledgeSourceParams` to pass runtime overrides (e.g., `filterExpressionAddOn` for query-time KQL filters)

> **Tip**: Queries about the *content* of documents work best. Queries about file locations or dates should use `filterExpressionAddOn` instead.
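
Since the rate limit is per user, a client that issues many queries may want to throttle itself instead of waiting for the service to refuse requests. The class below is a minimal sliding-window sketch, not something the service provides; `SlidingWindowLimiter` is a hypothetical name, and the demo uses a fake clock with a tiny window so the behavior is visible without waiting an hour.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any rolling `window_seconds` period."""

    def __init__(self, max_calls=200, window_seconds=3600, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._calls = deque()

    def try_acquire(self):
        """Record a call and return True if under the limit; otherwise return False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True


# Demo: two calls allowed per 10 "seconds" on a clock we control.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_calls=2, window_seconds=10, clock=lambda: fake_now[0])
results = [limiter.try_acquire(), limiter.try_acquire(), limiter.try_acquire()]
fake_now[0] = 10.0  # the first two calls have now aged out
results.append(limiter.try_acquire())
print(results)
```

In practice you would keep one limiter per user (defaults: 200 calls per 3,600 seconds) and call `try_acquire()` before each retrieve request.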

In [None]:
SEARCH_QUERY = "What are the company travel expense policies?"  # Change this to your query

if not USER_TOKEN:
    print("\u274c No user token. Run Step 3 first.")
else:
    retrieve_body = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": SEARCH_QUERY
                    }
                ]
            }
        ],
        "includeActivity": True,
        # Optional: override knowledge source settings at query time
        "knowledgeSourceParams": [
            {
                "knowledgeSourceName": KNOWLEDGE_SOURCE_NAME,
                "kind": "remoteSharePoint",
                "includeReferences": True,
                "includeReferenceSourceData": True,
                # Optional: add a KQL filter at query time (AND'd with knowledge source filter)
                # "filterExpressionAddOn": "FileExtension:\"docx\""
            }
        ]
    }

    result = call_search_api(
        "POST",
        f"knowledgebases('{KNOWLEDGE_BASE_NAME}')/retrieve",
        retrieve_body,
        extra_headers={
            "x-ms-query-source-authorization": USER_TOKEN   # raw JWT, no "Bearer " prefix
        },
        expected_status=[200, 206]
    )

    if result:
        # Display response messages
        print("=== Answer ===")
        for msg in result.get("response", []):
            for part in msg.get("content", []):
                text = part.get("text", "")
                print(text[:2000])

        # Display references
        refs = result.get("references", [])
        if refs:
            print(f"\n--- {len(refs)} Reference(s) ---")
            for i, ref in enumerate(refs, 1):
                url = ref.get('webUrl', ref.get('docKey', 'N/A'))
                score = ref.get('rerankerScore', '')
                label = ref.get('searchSensitivityLabelInfo') or {}  # tolerate a null value
                label_name = f" [{label.get('displayName')}]" if label.get('displayName') else ""
                print(f"  {i}. [{ref.get('type')}] {url} (score: {score}){label_name}")

        # Display activity summary (including errors)
        activity = result.get("activity", [])
        if activity:
            print(f"\n--- Activity ({len(activity)} steps) ---")
            for a in activity:
                elapsed = a.get('elapsedMs', '?')
                atype = a.get('type')
                extra = ""
                if 'inputTokens' in a:
                    extra = f" (in: {a['inputTokens']}, out: {a.get('outputTokens', '?')})"
                if 'count' in a:
                    extra += f" ({a['count']} results)"  # append so token counts are not overwritten
                print(f"  {atype}: {elapsed}ms{extra}")
                if 'error' in a:
                    print(f"    \u274c Error: {json.dumps(a['error'], indent=4)}")
    else:
        print("\u274c No result returned. Check the error output above.")

## Step 5: List Knowledge Sources & Knowledge Bases

Verify your resources were created successfully.

In [None]:
print("=== Knowledge Sources ===")
result = call_search_api("GET", "knowledgesources", expected_status=[200])
if result:
    for ks in result.get("value", []):
        print(f"  - {ks.get('name')} ({ks.get('kind')})")

print("\n=== Knowledge Bases ===")
result = call_search_api("GET", "knowledgebases", expected_status=[200])
if result:
    for kb in result.get("value", []):
        print(f"  - {kb.get('name')}: {kb.get('description', '')}")

## Cleanup (Optional)

Delete the knowledge base and knowledge source. Since no index was created, there's nothing else to clean up.

In [None]:
# Uncomment to delete all resources
# call_search_api("DELETE", f"knowledgebases('{KNOWLEDGE_BASE_NAME}')", expected_status=[204, 404])
# call_search_api("DELETE", f"knowledgesources('{KNOWLEDGE_SOURCE_NAME}')", expected_status=[204, 404])
# print("All resources deleted.")

## Limitations & Notes

### Licensing
- **Microsoft 365 Copilot license** is required — usage is billed through M365
- Without a Copilot license, retrieve calls will return errors

### Tenant & Identity
- **Same-tenant only**: Azure subscription and M365 must share the same Entra ID tenant
- **User identity**: Each query runs with the calling user's SharePoint permissions
- No support for Copilot connectors or OneDrive content — only SharePoint sites

### Query Limits ([Copilot Retrieval API](https://learn.microsoft.com/en-us/microsoft-365-copilot/extensibility/api/ai-services/retrieval/overview))
- **200 requests** per user per hour
- **1,500 character** query limit
- Maximum **25 results** from a single query
- Hybrid queries only supported for: `.doc`, `.docx`, `.pptx`, `.pdf`, `.aspx`, `.one`
- Multimodal retrieval (tables, images, charts) is **not supported**
- Invalid KQL filter expressions are silently ignored
- Results from Copilot Retrieval API are returned **unordered** (reranking is done by Azure AI Search)
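
Some of these limits can be checked on the client before a request is sent. The sketch below is one way to do so; `check_query` and `supports_hybrid` are hypothetical helpers, and it assumes the 1,500 character limit applies to the query text you submit.

```python
MAX_QUERY_CHARS = 1500
HYBRID_EXTENSIONS = {".doc", ".docx", ".pptx", ".pdf", ".aspx", ".one"}


def check_query(text):
    """Return the trimmed query, or raise ValueError if it is empty or too long."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("Query is empty.")
    if len(cleaned) > MAX_QUERY_CHARS:
        raise ValueError(
            f"Query is {len(cleaned)} characters; the limit is {MAX_QUERY_CHARS}."
        )
    return cleaned


def supports_hybrid(filename):
    """Return True if the file extension is in the hybrid-query list above."""
    dot = filename.rfind(".")
    return dot != -1 and filename[dot:].lower() in HYBRID_EXTENSIONS


print(check_query("  What are the company travel expense policies?  "))
print(supports_hybrid("Travel-Policy.DOCX"), supports_hybrid("budget.xlsx"))
```

Failing fast on an oversized query avoids spending one of the hourly requests on a call that cannot succeed.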

### Model & Search
- **Chat model required**: `gpt-4o`, `gpt-4o-mini`, `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`, `gpt-5`, `gpt-5-mini`, `gpt-5-nano`
- **No custom embeddings**: Unlike the indexer approach, you can't customize chunking or vector dimensions
- **Preview API**: This feature uses `2025-11-01-preview` and may change