# Getting Started: OneLake (Microsoft Fabric) Knowledge Source

This notebook demonstrates how to create a knowledge source from Microsoft Fabric OneLake, enabling you to index documents stored in your Fabric Lakehouse.

## What You'll Learn

- Set up Microsoft Fabric workspace and Lakehouse
- Upload documents to OneLake
- Create an indexed OneLake knowledge source
- Configure automatic re-indexing schedules
- Query OneLake documents through the knowledge base

## Prerequisites

- Azure subscription
- Microsoft Fabric capacity or trial
- Azure CLI installed and logged in (`az login`)
- Existing Azure AI Foundry project (see notebook 01)
- Existing Azure AI Search service (see notebook 01)

## Architecture Overview

```
Microsoft Fabric OneLake → [Ingestion] → Azure AI Search Index → Knowledge Base → Retrieval API
                                ↓
                    Chunking + Embedding + Image Processing
```

**Note:** OneLake knowledge sources support automatic re-indexing and can process both text and image content.

## Step 1: Set Up Microsoft Fabric Workspace and Lakehouse

First, create a workspace and lakehouse in Microsoft Fabric.

In [None]:
# Configuration
import json  # used in later cells to pretty-print API responses

# Note: Microsoft Fabric workspace creation is typically done through the Fabric portal
# Visit: https://app.fabric.microsoft.com/

print("To create a Fabric workspace and lakehouse:")
print("1. Go to https://app.fabric.microsoft.com/")
print("2. Click 'Workspaces' → 'New workspace'")
print("3. Name your workspace (e.g., 'knowledge-demo-workspace')")
print("4. In the workspace, click 'New' → 'Lakehouse'")
print("5. Name your lakehouse (e.g., 'knowledge-lakehouse')")
print("\nOnce created, get the IDs from the URL or workspace settings.")

In [None]:
# Microsoft Fabric configuration
# You'll need to get these IDs from the Fabric portal

# From workspace URL: https://app.fabric.microsoft.com/groups/{WORKSPACE_ID}/...
FABRIC_WORKSPACE_ID = "<your-workspace-id>"  # GUID format

# From lakehouse properties
LAKEHOUSE_ITEM_ID = "<your-lakehouse-item-id>"  # GUID format

# Path within the lakehouse where documents are stored
TARGET_PATH = "/Files/documents"  # OneLake path (always starts with /Files/)

print(f"Fabric Workspace ID: {FABRIC_WORKSPACE_ID}")
print(f"Lakehouse Item ID: {LAKEHOUSE_ITEM_ID}")
print(f"Target Path: {TARGET_PATH}")
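
Both IDs are expected in GUID format, so a quick check before calling any API can catch a placeholder that was never replaced. The sketch below uses only the standard library; `is_guid` is a hypothetical helper name, not part of any Fabric or Azure SDK.

```python
import uuid


def is_guid(value: str) -> bool:
    """Return True if value is a canonical hyphenated GUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces and un-hyphenated hex; require the canonical form.
    return str(parsed) == value.lower()


# Example values: one well-formed GUID and one unreplaced placeholder.
for candidate in ("123e4567-e89b-12d3-a456-426614174000", "<your-workspace-id>"):
    print(f"{candidate}: {'ok' if is_guid(candidate) else 'not a GUID'}")
```

Running the same check against `FABRIC_WORKSPACE_ID` and `LAKEHOUSE_ITEM_ID` before Step 4 gives an earlier and clearer failure than a rejected request.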

## Step 2: Upload Documents to OneLake

Upload sample documents to your Fabric Lakehouse.

In [None]:
# Create sample documents locally
import os

os.makedirs("onelake_docs", exist_ok=True)

# Sales report
with open("onelake_docs/sales_report_q1.txt", "w") as f:
    f.write("""
Q1 2024 Sales Report

Executive Summary:
Total revenue for Q1 2024 reached $12.5M, representing a 23% increase year-over-year.

Key Highlights:
- Enterprise segment grew 45% with 12 new enterprise customers
- Product launches in March contributed $2.1M in new revenue
- Customer retention rate improved to 94%
- Average deal size increased from $45K to $58K

Regional Performance:
- North America: $6.2M (50% of total)
- Europe: $4.1M (33% of total)
- Asia-Pacific: $2.2M (17% of total)
""")

# Marketing strategy
with open("onelake_docs/marketing_strategy_2024.txt", "w") as f:
    f.write("""
2024 Marketing Strategy

Objectives:
1. Increase brand awareness by 40%
2. Generate 5,000 qualified leads per quarter
3. Improve marketing ROI to 5:1

Key Initiatives:
- Content Marketing: Publish 2 whitepapers and 20 blog posts per month
- Digital Advertising: Focus on LinkedIn and Google Ads
- Events: Host 4 webinars per quarter and attend 6 industry conferences
- Partnership Marketing: Collaborate with 3 technology partners

Budget Allocation:
- Digital Advertising: 40%
- Content Creation: 25%
- Events: 20%
- Marketing Technology: 15%
""")

# Product roadmap
with open("onelake_docs/product_roadmap.txt", "w") as f:
    f.write("""
Product Roadmap 2024-2025

Q2 2024:
- Launch AI-powered analytics dashboard
- Introduce mobile app (iOS and Android)
- Add multi-language support (Spanish, French, German)

Q3 2024:
- Release API v2 with enhanced capabilities
- Implement advanced security features (SSO, RBAC)
- Launch customer success portal

Q4 2024:
- Introduce predictive insights engine
- Add integrations with Salesforce and HubSpot
- Release white-label solution for partners

Q1 2025:
- Launch enterprise plan with dedicated support
- Implement blockchain-based audit trails
- Release industry-specific templates
""")

print("Sample documents created!")
print("\nNext steps:")
print("1. Go to your Fabric Lakehouse")
print("2. Navigate to Files → documents (create folder if needed)")
print("3. Upload the files from ./onelake_docs/")
print("   OR use the OneLake file system APIs")
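
For the API route, OneLake exposes an ADLS Gen2-compatible endpoint. As a rough sketch, the helper below only builds the destination URL for each local file, assuming the GUID-based addressing form `https://onelake.dfs.fabric.microsoft.com/<workspace-id>/<lakehouse-id>/<path>`. The function name `onelake_file_url` is hypothetical, and the upload itself (authentication plus the actual write calls) is deliberately left out.

```python
from urllib.parse import quote

# Assumed OneLake DFS host; confirm it against your tenant's region settings.
ONELAKE_DFS_HOST = "https://onelake.dfs.fabric.microsoft.com"


def onelake_file_url(workspace_id: str, lakehouse_id: str,
                     target_path: str, file_name: str) -> str:
    """Build the OneLake DFS URL for a file under a lakehouse folder."""
    parts = (workspace_id, lakehouse_id, target_path.strip("/"), file_name)
    # quote() keeps "/" intact and percent-encodes spaces and other unsafe characters.
    return f"{ONELAKE_DFS_HOST}/{quote('/'.join(p for p in parts if p))}"


# Example with placeholder IDs; substitute the values from Step 1.
for name in ("sales_report_q1.txt", "marketing_strategy_2024.txt", "product_roadmap.txt"):
    print(onelake_file_url("workspace-guid", "lakehouse-guid", "/Files/documents", name))
```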

## Step 3: Configure Existing Resources

Set up references to existing Azure resources.

In [None]:
# Existing resources (from notebook 01 or your own)
EXISTING_SEARCH_ENDPOINT = "https://<your-search-service>.search.windows.net"
EXISTING_SEARCH_API_KEY = "<your-search-api-key>"
EXISTING_FOUNDRY_ENDPOINT = "https://<your-foundry-project>.services.ai.azure.com/api/projects/<project-name>"
EXISTING_AZURE_OPENAI_KEY = "<your-api-key>"
EXISTING_EMBEDDING_DEPLOYMENT = "text-embedding-3-small"
EXISTING_CHAT_DEPLOYMENT = "gpt-4o-mini"

# API version
API_VERSION = "2025-11-01-preview"
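
Hardcoding keys in a notebook makes them easy to leak when the file is shared. One possible alternative, sketched here, reads the values from environment variables and reports which ones still hold a placeholder. The variable names (`SEARCH_ENDPOINT`, `SEARCH_API_KEY`, `AZURE_OPENAI_KEY`) and the helper `unset_settings` are assumptions for illustration only.

```python
import os

# Hypothetical environment variable names; rename them to match your setup.
SETTINGS = {
    "search_endpoint": os.environ.get(
        "SEARCH_ENDPOINT", "https://<your-search-service>.search.windows.net"),
    "search_api_key": os.environ.get("SEARCH_API_KEY", "<your-search-api-key>"),
    "azure_openai_key": os.environ.get("AZURE_OPENAI_KEY", "<your-api-key>"),
}


def unset_settings(settings: dict) -> list:
    """Return the names of settings that still contain a <placeholder>."""
    return sorted(name for name, value in settings.items()
                  if "<" in value and ">" in value)


missing = unset_settings(SETTINGS)
if missing:
    print(f"Still using placeholders for: {', '.join(missing)}")
```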

## Step 4: Create OneLake Knowledge Source

Create a knowledge source that ingests documents from OneLake.

In [None]:
import requests

KNOWLEDGE_SOURCE_NAME = "onelake-docs-source"

url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}?api-version={API_VERSION}"

headers = {
    "api-key": EXISTING_SEARCH_API_KEY,
    "Content-Type": "application/json"
}

body = {
    "name": KNOWLEDGE_SOURCE_NAME,
    "kind": "indexedOneLake",
    "description": "Knowledge source from Microsoft Fabric OneLake",
    "indexedOneLakeParameters": {
        "fabricWorkspaceId": FABRIC_WORKSPACE_ID,
        "lakehouseId": LAKEHOUSE_ITEM_ID,
        "targetPath": TARGET_PATH,
        "ingestionParameters": {
            "identity": None,
            "embeddingModel": {
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": EXISTING_FOUNDRY_ENDPOINT,
                    "deploymentId": EXISTING_EMBEDDING_DEPLOYMENT,
                    "modelName": EXISTING_EMBEDDING_DEPLOYMENT,
                    "apiKey": EXISTING_AZURE_OPENAI_KEY
                }
            },
            "chatCompletionModel": {
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": EXISTING_FOUNDRY_ENDPOINT,
                    "deploymentId": EXISTING_CHAT_DEPLOYMENT,
                    "modelName": EXISTING_CHAT_DEPLOYMENT,
                    "apiKey": EXISTING_AZURE_OPENAI_KEY
                }
            },
            "disableImageVerbalization": False,  # Enable image processing
            "ingestionSchedule": {
                "interval": "PT6H"  # Re-index every 6 hours
            },
            "contentExtractionMode": "minimal"  # Options: minimal, standard
        }
    }
}

response = requests.put(url, headers=headers, json=body)
print(f"Status: {response.status_code}")
print(json.dumps(response.json(), indent=2))
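
The request body contains API keys, and the service may echo parts of the configuration back in its response. Before printing or logging either one, it can help to mask secret fields. The following is a minimal sketch; `redact` is a hypothetical helper, and the set of key names it masks is an assumption based on the fields used in this notebook.

```python
import json


def redact(obj, secret_keys=("apiKey", "api-key")):
    """Return a copy of obj with non-empty values of secret keys masked."""
    if isinstance(obj, dict):
        return {
            key: "<redacted>" if key in secret_keys and value
            else redact(value, secret_keys)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item, secret_keys) for item in obj]
    return obj


# Example payload shaped like the embedding model section of the request body.
sample = {"embeddingModel": {"azureOpenAIParameters": {"deploymentId": "demo", "apiKey": "secret"}}}
print(json.dumps(redact(sample), indent=2))
```

The original dictionary is left unchanged, so the unredacted body can still be sent to the service.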

## Step 5: Monitor Ingestion Progress

Check the status of document ingestion from OneLake.

In [None]:
import time

status_url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}/status?api-version={API_VERSION}"

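MAX_WAIT_SECONDS = 30 * 60  # stop polling after 30 minutes instead of looping forever
deadline = time.monotonic() + MAX_WAIT_SECONDS

print("Monitoring OneLake ingestion progress...\n")
while True:
    response = requests.get(status_url, headers=headers)
    response.raise_for_status()  # surface auth or routing errors instead of polling on them
    status = response.json()

    current_status = status.get("status", "unknown")
    print(f"Status: {current_status}")

    if "documentsProcessed" in status:
        print(f"Documents processed: {status['documentsProcessed']}")
    if "lastIndexerExecutionTime" in status:
        print(f"Last execution: {status['lastIndexerExecutionTime']}")

    if current_status == "succeeded":
        print("\n✅ Ingestion completed successfully!")
        print(json.dumps(status, indent=2))
        break
    elif current_status == "failed":
        print("\n❌ Ingestion failed!")
        print(json.dumps(status, indent=2))
        break

    if time.monotonic() >= deadline:
        print("\n⏱️ Timed out waiting for ingestion; last status payload:")
        print(json.dumps(status, indent=2))
        break

    time.sleep(15)
    print("---")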
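
The same polling pattern can be factored into a reusable function with a timeout. This is a sketch only: `wait_for_ingestion` is a hypothetical helper, the terminal states `"succeeded"` and `"failed"` are taken from the cell above, and the status fetcher is passed in as a callable so the logic can be exercised without a live service.

```python
import time


def wait_for_ingestion(fetch_status, timeout_s=1800, interval_s=15,
                       terminal_states=("succeeded", "failed"),
                       sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal state is seen or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("status", "unknown") in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"No terminal status after {timeout_s} seconds")
        sleep(interval_s)


# Example with canned responses in place of a real HTTP call.
canned = iter([{"status": "running"}, {"status": "succeeded", "documentsProcessed": 3}])
final = wait_for_ingestion(lambda: next(canned), sleep=lambda seconds: None)
print(final)
```

In the notebook, `fetch_status` could be `lambda: requests.get(status_url, headers=headers).json()`.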

## Step 6: Create Knowledge Base

Create a knowledge base using the OneLake source.

In [None]:
KNOWLEDGE_BASE_NAME = "onelake-kb"

url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeBases/{KNOWLEDGE_BASE_NAME}?api-version={API_VERSION}"

body = {
    "name": KNOWLEDGE_BASE_NAME,
    "description": "Knowledge base with OneLake business documents",
    "knowledgeSources": [
        {
            "name": KNOWLEDGE_SOURCE_NAME
        }
    ],
    "models": [
        {
            "kind": "azureOpenAI",
            "azureOpenAIParameters": {
                "resourceUri": EXISTING_FOUNDRY_ENDPOINT,
                "deploymentId": EXISTING_CHAT_DEPLOYMENT,
                "modelName": EXISTING_CHAT_DEPLOYMENT,
                "apiKey": EXISTING_AZURE_OPENAI_KEY
            }
        }
    ],
    "outputMode": "answerSynthesis",
    "retrievalInstructions": "Retrieve accurate business information from OneLake documents.",
    "answerInstructions": "Provide data-driven insights with specific citations from business documents."
}

response = requests.put(url, headers=headers, json=body)
print(f"Status: {response.status_code}")
print(json.dumps(response.json(), indent=2))

## Step 7: Query the Knowledge Base

Query OneLake documents through the knowledge base.

In [None]:
# Query about sales performance
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeBases/{KNOWLEDGE_BASE_NAME}/retrieve?api-version={API_VERSION}"

query_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What was the total revenue in Q1 2024 and how did different regions perform?"
                }
            ]
        }
    ],
    "includeActivity": True
}

response = requests.post(url, headers=headers, json=query_body)
result = response.json()

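# Assumption: the retrieve response carries the answer in a top-level "response"
# list of messages and the sources in a top-level "references" list.
print("Answer:")
for message in result.get("response", []):
    for part in message.get("content", []):
        print(part.get("text", ""))
print("\nReferences:")
for ref in result.get("references", []):
    print(f"- {ref.get('title') or ref.get('id', 'Unknown')}")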

In [None]:
# Query about product roadmap
query_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What new features are planned for Q3 2024?"
                }
            ]
        }
    ],
    "includeActivity": True
}

response = requests.post(url, headers=headers, json=query_body)
result = response.json()

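print("Answer:")
# Assumption: the answer text sits under response[].content[].text
for message in result.get("response", []):
    print("".join(part.get("text", "") for part in message.get("content", [])))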

In [None]:
# Query with advanced parameters
query_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Summarize our 2024 marketing strategy and budget allocation"
                }
            ]
        }
    ],
    "includeActivity": True,
    "retrievalReasoningEffort": {
        "kind": "medium"
    },
    "knowledgeSourceParams": [
        {
            "knowledgeSourceName": KNOWLEDGE_SOURCE_NAME,
            "kind": "indexedOneLake",
            "includeReferences": True,
            "includeReferenceSourceData": True,
            "alwaysQuerySource": True,
            "rerankerThreshold": 0.45
        }
    ]
}

response = requests.post(url, headers=headers, json=query_body)
result = response.json()

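print("Answer:")
# Assumption: the answer text sits under response[].content[].text
for message in result.get("response", []):
    print("".join(part.get("text", "") for part in message.get("content", [])))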
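
The three query cells above repeat the same message structure. A small builder, sketched below, keeps the question text and any optional parameters in one place. `build_query` is a hypothetical helper; the field names it emits are the ones already used in this notebook.

```python
def build_query(text, include_activity=True, **extra):
    """Build a retrieve request body for a single user question."""
    body = {
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ],
        "includeActivity": include_activity,
    }
    # Optional top-level fields such as retrievalReasoningEffort are passed through as-is.
    body.update(extra)
    return body


# Example: the Q3 roadmap question with a reasoning-effort override.
query_body = build_query(
    "What new features are planned for Q3 2024?",
    retrievalReasoningEffort={"kind": "medium"},
)
print(query_body)
```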

## Step 8: Update Ingestion Schedule

Modify the automatic re-indexing schedule as needed.

In [None]:
# Update to re-index every 12 hours
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}?api-version={API_VERSION}"

# Get current configuration
response = requests.get(url, headers=headers)
current_config = response.json()

# Update ingestion schedule
current_config["indexedOneLakeParameters"]["ingestionParameters"]["ingestionSchedule"] = {
    "interval": "PT12H"  # Every 12 hours
}
# Other interval examples:
# "PT1H" - every 1 hour
# "PT6H" - every 6 hours
# "PT24H" or "P1D" - every 24 hours

response = requests.put(url, headers=headers, json=current_config)
print(f"Status: {response.status_code}")
if response.ok:
    print("Ingestion schedule updated to run every 12 hours")
else:
    print(f"Update failed: {response.text}")
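
The interval values are ISO 8601 durations. To sanity-check a value before sending it, a simple parser can convert it to a `timedelta`. This sketch handles only the day, hour, minute, and second designators shown in the examples above; `parse_interval` is a hypothetical helper, and any minimum or maximum interval enforced by the service is not checked here.

```python
import re
from datetime import timedelta

# Matches durations such as PT6H, PT30M, P1D, or P1DT12H (no weeks, months, or years).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_interval(value: str) -> timedelta:
    """Convert a simple ISO 8601 duration string to a timedelta."""
    match = _DURATION.match(value)
    parts = {k: int(v) for k, v in match.groupdict().items() if v} if match else {}
    # Reject strings with no components and a dangling "T" such as "P1DT".
    if not parts or value.endswith("T"):
        raise ValueError(f"Unsupported interval: {value!r}")
    return timedelta(**parts)


for example in ("PT1H", "PT6H", "PT12H", "P1D"):
    print(example, "=", parse_interval(example))
```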

## OneLake Path Structure

The `targetPath` values used in this notebook follow this structure:

```
/Files/<folder-path>
```

Examples:
- `/Files/` - root Files folder
- `/Files/documents` - documents subfolder
- `/Files/data/reports` - nested folder structure

The knowledge source will index all files within the specified path and its subfolders.
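
A small normalizer can enforce this structure before the path is sent to the service. The sketch below is an illustration under the assumption that only paths under `/Files` are valid targets, as described above; `normalize_target_path` is a hypothetical helper. Note that it returns the root folder as `/Files` without a trailing slash.

```python
import posixpath


def normalize_target_path(path: str) -> str:
    """Normalize a lakehouse path and require it to live under /Files."""
    # Collapse duplicate slashes and ".." segments, and force a single leading slash.
    cleaned = posixpath.normpath("/" + path.strip().strip("/"))
    if cleaned != "/Files" and not cleaned.startswith("/Files/"):
        raise ValueError(f"Target path must be under /Files: {path!r}")
    return cleaned


for raw in ("Files/documents/", "/Files/data//reports", "/Files/"):
    print(repr(raw), "->", normalize_target_path(raw))
```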

## Cleanup

Clean up resources when done.

In [None]:
# Delete knowledge base
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeBases/{KNOWLEDGE_BASE_NAME}?api-version={API_VERSION}"
response = requests.delete(url, headers=headers)
print(f"Delete knowledge base: {response.status_code}")

In [None]:
# Delete knowledge source
url = f"{EXISTING_SEARCH_ENDPOINT}/knowledgeSources/{KNOWLEDGE_SOURCE_NAME}?api-version={API_VERSION}"
response = requests.delete(url, headers=headers)
print(f"Delete knowledge source: {response.status_code}")

## Summary

In this notebook, you learned how to:

1. Set up a Microsoft Fabric workspace and Lakehouse
2. Upload documents to OneLake
3. Create an indexed OneLake knowledge source
4. Configure automatic re-indexing schedules
5. Monitor ingestion progress
6. Query OneLake documents through the knowledge base
7. Update ingestion schedules dynamically

## OneLake vs. Blob Storage

| Feature | OneLake | Blob Storage |
|---------|---------|-------------|
| **Integration** | Native Fabric integration | Azure Storage |
| **Data Lake Features** | Built-in (Delta, Parquet) | Requires ADLS Gen2 |
| **Analytics** | Power BI, Spark notebooks | External tools |
| **Pricing** | Fabric capacity | Storage + transactions |
| **Governance** | Fabric workspace | Azure RBAC |
| **Best For** | Fabric-native workflows | General-purpose storage |

## Next Steps

- Explore existing Azure AI Search Index sources (notebook 05)
- Combine OneLake with other knowledge sources
- Set up Delta Lake tables in OneLake for structured data
- Implement data transformation pipelines before indexing
- Configure image verbalization for documents with charts/diagrams