# Fabric Scanner - SPN Read + User Write

This notebook demonstrates running the Fabric Scanner with:
- **Service Principal (SPN)** for reading from the Scanner API
- **User authentication** for writing to the lakehouse

This configuration provides:
- Automated scanning with SPN credentials
- Individual user accountability for data writes
- Principle of least privilege (the SPN can hold the read-only Viewer role)

## Prerequisites

1. **Service Principal Setup** (for Scanner API reads):
   - Create App Registration in Azure AD
   - Add API permissions: Power BI Service → `Tenant.Read.All`, `Workspace.Read.All`
   - Enable in Power BI Admin Portal → "Allow service principals to use Fabric APIs"
   - Store credentials as environment variables:
     - `FABRIC_SP_TENANT_ID`
     - `FABRIC_SP_CLIENT_ID`
     - `FABRIC_SP_CLIENT_SECRET`

2. **User Account** (for lakehouse writes):
   - Must have Contributor or Admin role on the lakehouse
   - Will be prompted for interactive login during notebook execution

3. **Lakehouse Configuration**:
   - Attach lakehouse to this notebook
   - Note workspace ID and lakehouse ID for upload configuration
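
Before running the cells below, it can help to confirm that the three Service Principal variables from step 1 are actually set. The following is a minimal sketch; `missing_spn_vars` is a hypothetical helper written for illustration and is not part of the scanner script.

```python
import os

# Variable names taken from the prerequisites list above.
REQUIRED_SPN_VARS = (
    "FABRIC_SP_TENANT_ID",
    "FABRIC_SP_CLIENT_ID",
    "FABRIC_SP_CLIENT_SECRET",
)


def missing_spn_vars(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper: `env` defaults to the process environment, and a
    plain dict can be passed instead for testing.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SPN_VARS if not env.get(name, "").strip()]


missing = missing_spn_vars()
if missing:
    print("Missing Service Principal settings: " + ", ".join(missing))
else:
    print("All Service Principal settings are present.")
```

Checking up front avoids silently falling back to the `<YOUR_...>` placeholder values used in the configuration cell.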

In [None]:
# Import the scanner script
%run ./fabric_scanner_cloud_connections.py

## Configuration: SPN for Reading, User for Writing

In [None]:
# AUTHENTICATION CONFIGURATION

# Use Service Principal for Scanner API (reads)
AUTH_MODE = "spn"  # Service Principal for all API reads

# Use User Authentication for Lakehouse Uploads (writes)
import os
os.environ["UPLOAD_USE_USER_AUTH"] = "true"  # Enable user auth for uploads

# Service Principal credentials (for Scanner API reads)
# Set these as environment variables or Fabric secrets
TENANT_ID = os.getenv("FABRIC_SP_TENANT_ID", "<YOUR_TENANT_ID>")
CLIENT_ID = os.getenv("FABRIC_SP_CLIENT_ID", "<YOUR_CLIENT_ID>")
CLIENT_SECRET = os.getenv("FABRIC_SP_CLIENT_SECRET", "<YOUR_CLIENT_SECRET>")

# Verify configuration
print("✅ Configuration:")
print(f"   Scanner API auth: Service Principal (Tenant: {TENANT_ID[:8]}...)")
print("   Lakehouse upload auth: User Account (interactive login)")
print(f"   Running in Fabric: {RUNNING_IN_FABRIC}")

## Optional: Configure Lakehouse Upload Details

If you are running locally and want to upload to a Fabric lakehouse:

In [None]:
# Only needed for local execution with lakehouse upload
# Skip this cell if running in Fabric notebook

if not RUNNING_IN_FABRIC:
    UPLOAD_TO_LAKEHOUSE = True
    LAKEHOUSE_WORKSPACE_ID = "<YOUR_WORKSPACE_ID>"  # Workspace containing lakehouse
    LAKEHOUSE_ID = "<YOUR_LAKEHOUSE_ID>"  # Lakehouse ID
    LAKEHOUSE_UPLOAD_PATH = "Files/scanner"  # Path within lakehouse
    
    print("📤 Lakehouse upload enabled")
    print(f"   Workspace: {LAKEHOUSE_WORKSPACE_ID}")
    print(f"   Lakehouse: {LAKEHOUSE_ID}")
else:
    print("ℹ️  Running in Fabric - lakehouse attached automatically")

## Initialize Authentication

This will:
1. Authenticate the SPN for the Scanner API
2. Prompt for user login when the first write occurs
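
For orientation, step 1 corresponds to an OAuth 2.0 client-credentials token request against the Microsoft identity platform. The sketch below only builds that request and sends nothing. `build_token_request` is a hypothetical helper, the scope value is an assumption (the Power BI resource's `.default` scope), and the internals of `initialize_authentication()` are not shown in this notebook, so it may well use a library such as MSAL instead.

```python
from urllib.parse import urlencode

# Assumption: the .default scope of the Power BI resource.
POWER_BI_SCOPE = "https://analysis.windows.net/powerbi/api/.default"


def build_token_request(tenant_id, client_id, client_secret, scope=POWER_BI_SCOPE):
    """Build (url, form_body) for a client-credentials token request.

    Hypothetical helper for illustration; it performs no network I/O.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })
    return url, body


# Example with placeholder values; never print a real secret.
url, body = build_token_request("<tenant-id>", "<client-id>", "<client-secret>")
print(url)
```

The user login in step 2 is a separate interactive flow and produces a different token, which is what keeps reads and writes under distinct identities.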

In [None]:
# Initialize SPN authentication for Scanner API
initialize_authentication()

print("\n✅ Service Principal authenticated for Scanner API")
print("ℹ️  User authentication will be requested on first lakehouse write")

## Run Full Tenant Scan

This will:
1. Use SPN to call Scanner API (read all workspaces)
2. Prompt for user login when saving to lakehouse
3. Save results to lakehouse table with user credentials
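
Reading all workspaces in step 1 usually means submitting workspace IDs to the Scanner API in batches, because each scan request accepts a limited number of workspaces. The sketch below shows that batching only. `chunk_workspace_ids` is a hypothetical helper, and the default of 100 IDs per batch is an assumption based on Microsoft's documented limit at the time of writing. How `run_cloud_connection_scan` batches internally is not shown here.

```python
def chunk_workspace_ids(workspace_ids, batch_size=100):
    """Split workspace IDs into consecutive batches of at most batch_size.

    Hypothetical helper; batch_size=100 is an assumed per-request limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(workspace_ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# Example: 250 made-up workspace IDs split into batches of 100, 100 and 50.
example_ids = [f"workspace-{n}" for n in range(250)]
batches = chunk_workspace_ids(example_ids)
print([len(batch) for batch in batches])
```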

In [None]:
# Full tenant scan with SPN read + User write
run_cloud_connection_scan(
    enable_full_scan=True,
    include_personal=True,
    table_name="tenant_cloud_connections"
)

# Note: On first write, you'll be prompted to authenticate as a user
# This provides individual accountability for data writes

## Alternative: Large Shared Tenant Mode
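
The `max_batches_per_hour` setting used in this mode can be read as a pacing budget: a cap of 450 presumably leaves headroom under the Scanner API's hourly request limit. The sketch below works out what such a cap implies. Both functions are hypothetical helpers for illustration, and the scanner's actual throttling logic is not shown in this notebook.

```python
def min_seconds_between_batches(max_batches_per_hour):
    """Average spacing between batches that stays within the hourly cap."""
    if max_batches_per_hour <= 0:
        raise ValueError("max_batches_per_hour must be positive")
    return 3600.0 / max_batches_per_hour


def estimated_hours(total_batches, max_batches_per_hour):
    """Rough lower bound on wall-clock hours needed for total_batches."""
    if max_batches_per_hour <= 0:
        raise ValueError("max_batches_per_hour must be positive")
    return total_batches / max_batches_per_hour


# With the cap of 450 used in this mode:
print(min_seconds_between_batches(450))  # 8.0 seconds between batches
print(estimated_hours(900, 450))         # 2.0 hours for 900 batches
```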

In [None]:
# For large tenants with rate limiting
run_cloud_connection_scan(
    enable_full_scan_chunked=True,
    max_batches_per_hour=450,
    include_personal=True,
    group_by_capacity=True,
    table_name="tenant_cloud_connections"
)

## Alternative: Incremental Scan (Last 24 Hours)

In [None]:
# Incremental scan with hash optimization
run_cloud_connection_scan(
    enable_incremental_scan=True,
    incremental_hours_back=24,
    enable_hash_optimization=True,
    table_name="tenant_cloud_connections"
)

## Verify Results

In [None]:
# Query results from lakehouse table
if RUNNING_IN_FABRIC and SPARK_AVAILABLE:
    df = spark.sql("SELECT * FROM tenant_cloud_connections LIMIT 10")
    display(df)
    
    # Show connection summary
    summary = spark.sql("""
        SELECT 
            connector,
            COUNT(DISTINCT workspace_id) as workspace_count,
            COUNT(*) as connection_count
        FROM tenant_cloud_connections
        GROUP BY connector
        ORDER BY connection_count DESC
    """)
    print("\n📊 Connection Summary:")
    display(summary)
else:
    print("Results saved to local files in ./scanner_output/curated/")

## Benefits of This Configuration

✅ **Automated Scanning**: SPN allows unattended scheduled scans  
✅ **User Accountability**: All data writes are tracked to individual users  
✅ **Least Privilege**: SPN can have Viewer role (read-only), user has write permissions  
✅ **Audit Trail**: User authentication provides clear audit log for data modifications  
✅ **Separation of Duties**: Different credentials for read vs write operations  

## Security Notes

- Service Principal credentials should be stored in Azure Key Vault or Fabric secrets
- User authentication uses standard Microsoft login flow (MFA supported)
- Token caching minimizes login prompts during same session
- All API calls respect rate limits and include retry logic
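
The retry logic mentioned in the last note is commonly implemented as exponential backoff. The following is a generic sketch of that pattern, not the scanner script's actual implementation. `call_with_retry` and its parameters are hypothetical names chosen for illustration.

```python
import time


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    Waits base_delay * 2**attempt seconds after each failed attempt and
    re-raises the last exception once max_attempts is exhausted.
    The sleep function is injectable so the pattern can be tested quickly.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A production version would typically retry only on throttling or transient errors (for example HTTP 429 or 5xx) and honor a `Retry-After` header when the service returns one.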