# ☁️ Apache Iceberg Cloud Integration Tutorial

Welcome to the comprehensive Cloud Integration tutorial! In this notebook, you'll learn:

1. **Cloud Storage Integration**
2. **AWS S3 + Glue Configuration**
3. **Azure ADLS + Synapse Setup**
4. **GCP Cloud Storage + BigQuery**
5. **Multi-Cloud Strategies**
6. **Security and Permissions**
7. **Performance Optimization**
8. **Monitoring and Observability**

## 📋 Prerequisites

- Basic understanding of Iceberg concepts
- Cloud account access (AWS/Azure/GCP)
- Understanding of cloud storage concepts

## 1. 🚀 Initialize Environment

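Import the libraries used throughout the notebook and pin the Python interpreter that Spark uses. This cell does not start a Spark session; the cloud-specific session configurations appear in the sections that follow.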

In [None]:
import os
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
import json

# Set Python path for Spark consistency
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'

print("☁️ Cloud Integration Tutorial Environment Setup")
print("ℹ️ This tutorial demonstrates cloud integration patterns")
print("ℹ️ Some examples require actual cloud credentials to execute")
print("ℹ️ We'll provide configuration examples and best practices")

## 2. 🔧 Cloud Configuration Patterns

Learn the different ways to configure Iceberg for cloud environments.

In [None]:
# Cloud Configuration Examples (for reference)
print("☁️ CLOUD CONFIGURATION PATTERNS")
print("\n📝 Configuration patterns for different cloud providers:")

# Helper function to show configuration
def show_cloud_config(provider, config_dict):
    print(f"\n🌐 {provider} Configuration:")
    print("```python")
    for key, value in config_dict.items():
        print(f".config('{key}', '{value}')")
    print("```")

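# AWS S3 Configuration
# Note: S3FileIO authenticates through the AWS SDK default credential chain.
# The fs.s3a.* settings below only affect paths read via Hadoop's S3A filesystem;
# prefer IAM roles over static keys (see Section 3).
aws_config = {
    "spark.sql.catalog.prod": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.prod.type": "glue",
    "spark.sql.catalog.prod.warehouse": "s3a://my-iceberg-warehouse/",
    "spark.sql.catalog.prod.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.hadoop.fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "spark.hadoop.fs.s3a.secret.key": "YOUR_SECRET_KEY",
    "spark.hadoop.fs.s3a.endpoint.region": "us-west-2"
}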

show_cloud_config("AWS S3 + Glue", aws_config)

# Azure ADLS Configuration
azure_config = {
    "spark.sql.catalog.prod": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.prod.type": "hadoop",
    "spark.sql.catalog.prod.warehouse": "abfss://container@account.dfs.core.windows.net/warehouse",
    "spark.hadoop.fs.azure.account.key.account.dfs.core.windows.net": "YOUR_ACCOUNT_KEY",
    "spark.hadoop.fs.azure.account.auth.type.account.dfs.core.windows.net": "SharedKey"
}

show_cloud_config("Azure ADLS Gen2", azure_config)

# GCP Cloud Storage Configuration
gcp_config = {
    "spark.sql.catalog.prod": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.prod.type": "hadoop",
    "spark.sql.catalog.prod.warehouse": "gs://my-iceberg-bucket/warehouse",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/service-account.json",
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
}

show_cloud_config("GCP Cloud Storage", gcp_config)
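
`show_cloud_config` prints every value verbatim, which is fine for placeholders but risky once real credentials are in the dictionary. The sketch below masks secret-looking values before printing. `redact_config` and `SENSITIVE_MARKERS` are hypothetical names introduced here for illustration; they are not part of Spark or Iceberg, and the marker list is an assumption that you would adapt to your own key names.

```python
# Hypothetical helper (not part of Iceberg or Spark): mask secret-looking
# values before a configuration dictionary is printed or logged.
SENSITIVE_MARKERS = ("secret", "access.key", "account.key", "password", "token")


def redact_config(config):
    """Return a copy of config with values of sensitive-looking keys masked."""
    redacted = {}
    for key, value in config.items():
        lowered = key.lower()
        if any(marker in lowered for marker in SENSITIVE_MARKERS):
            redacted[key] = "****"
        else:
            redacted[key] = value
    return redacted


example = {
    "spark.sql.catalog.prod.type": "glue",
    "spark.hadoop.fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "spark.hadoop.fs.s3a.secret.key": "YOUR_SECRET_KEY",
}
safe = redact_config(example)
for key, value in safe.items():
    print(f"{key} = {value}")
```

The original dictionary is left untouched, so the unmasked values can still be passed to the Spark builder while only the redacted copy is shown to the reader.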

## 3. 🛡️ Security and Authentication

Learn cloud security best practices for Iceberg.

In [None]:
# Security Best Practices
print("🛡️ CLOUD SECURITY BEST PRACTICES")
print("\n🔐 Authentication Methods:")

security_patterns = {
    "AWS": {
        "IAM Roles (Recommended)": [
            "Use IAM roles instead of access keys",
            "Assign minimal required permissions",
            "Enable CloudTrail for audit logging",
            "Use KMS for encryption at rest"
        ],
        "S3 Permissions": [
            "s3:GetObject, s3:PutObject for data files",
            "s3:ListBucket for warehouse bucket",
            "s3:DeleteObject for maintenance operations"
        ],
        "Glue Permissions": [
            "glue:GetDatabase, glue:GetTable",
            "glue:CreateTable, glue:UpdateTable",
            "glue:DeleteTable (for maintenance)"
        ]
    },
    "Azure": {
        "Managed Identity (Recommended)": [
            "Use system or user-assigned managed identities",
            "Avoid storing connection strings in code",
            "Enable Azure Monitor for logging",
            "Use Azure Key Vault for secrets"
        ],
        "ADLS Permissions": [
            "Storage Blob Data Contributor role",
            "Storage Blob Data Reader for read-only access",
            "Storage Account Contributor for management"
        ]
    },
    "GCP": {
        "Service Accounts (Recommended)": [
            "Use service accounts with minimal permissions",
            "Enable audit logging with Cloud Logging",
            "Use Cloud KMS for encryption",
            "Store keys in Secret Manager"
        ],
        "Cloud Storage Permissions": [
            "Storage Object Admin for full access",
            "Storage Object Viewer for read-only",
            "Storage Legacy Bucket Reader for metadata"
        ]
    }
}

for cloud, categories in security_patterns.items():
    print(f"\n🌐 {cloud} Security:")
    for category, items in categories.items():
        print(f"  📋 {category}:")
        for item in items:
            print(f"    • {item}")
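
When a role or managed identity is not available and a secret has to be supplied, one common way to keep it out of the notebook is to read it from environment variables and fail early if it is missing. The sketch below assumes that approach. `require_env` and the variable names `AZURE_CLIENT_ID` and `AZURE_CLIENT_SECRET` are hypothetical, chosen only for illustration.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for each variable, raising if any is unset or empty."""
    # environ defaults to the real process environment; a plain dict can be
    # passed instead, which keeps the function easy to test.
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demonstration with an in-memory mapping so the example runs anywhere.
fake_env = {"AZURE_CLIENT_ID": "demo-id", "AZURE_CLIENT_SECRET": "demo-secret"}
creds = require_env(["AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"], environ=fake_env)
print(sorted(creds))
```

In a real session you would call `require_env([...])` without the `environ` argument and pass the returned values to the Spark builder instead of string literals.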

## 4. 🏗️ AWS S3 + Glue Integration

Deep dive into AWS cloud integration patterns.

In [None]:
# AWS Integration Examples
print("🏗️ AWS S3 + GLUE INTEGRATION")
print("\n📝 Complete AWS setup example:")

# AWS Spark Session Configuration
aws_spark_config = """
# Complete AWS Spark Session Setup
spark = SparkSession.builder \\
    .appName("IcebergAWS") \\
    .config("spark.jars.packages", 
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3,"
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "software.amazon.awssdk:bundle:2.20.18") \\
    .config("spark.sql.extensions", 
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \\
    .config("spark.sql.catalog.glue_catalog", 
            "org.apache.iceberg.spark.SparkCatalog") \\
    .config("spark.sql.catalog.glue_catalog.type", "glue") \\
    .config("spark.sql.catalog.glue_catalog.warehouse", 
            "s3a://my-iceberg-warehouse/") \\
    .config("spark.sql.catalog.glue_catalog.io-impl", 
            "org.apache.iceberg.aws.s3.S3FileIO") \\
    .config("spark.hadoop.fs.s3a.endpoint.region", "us-west-2") \\
    .config("spark.hadoop.fs.s3a.impl", 
            "org.apache.hadoop.fs.s3a.S3AFileSystem") \\
    .getOrCreate()
"""

print(aws_spark_config)

# AWS IAM Policy Example
print("\n🔐 AWS IAM Policy Example:")
iam_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-iceberg-warehouse",
                "arn:aws:s3:::my-iceberg-warehouse/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateDatabase",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable"
            ],
            "Resource": "*"
        }
    ]
}

print(json.dumps(iam_policy, indent=2))
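
The policy above grants the Glue actions on every resource and applies both bucket-level and object-level S3 actions to both ARNs. A tighter variant scopes each statement to the resources it needs. The sketch below is one way to generate such a policy. `build_iceberg_policy` is a hypothetical helper, the account ID `123456789012` is a placeholder, and whether this exact set of Glue resources is sufficient depends on the operations you run, so treat it as a starting point.

```python
import json


def build_iceberg_policy(bucket, region, account_id, database):
    """Build an IAM policy scoped to one warehouse bucket and one Glue database."""
    glue_prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions target the objects under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {
                # Glue actions limited to the catalog, one database, and its tables.
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                ],
                "Resource": [
                    f"{glue_prefix}:catalog",
                    f"{glue_prefix}:database/{database}",
                    f"{glue_prefix}:table/{database}/*",
                ],
            },
        ],
    }


scoped_policy = build_iceberg_policy(
    bucket="my-iceberg-warehouse",
    region="us-west-2",
    account_id="123456789012",  # placeholder account ID
    database="sales_db",
)
print(json.dumps(scoped_policy, indent=2))
```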

In [None]:
# AWS Example Operations
print("🚀 AWS EXAMPLE OPERATIONS")
print("\n📝 Example SQL operations with AWS Glue catalog:")

aws_operations = [
    {
        "operation": "Create Database",
        "sql": "CREATE DATABASE IF NOT EXISTS glue_catalog.sales_db"
    },
    {
        "operation": "Create Table",
        "sql": """
CREATE TABLE glue_catalog.sales_db.orders (
    order_id bigint,
    customer_id bigint,
    product_name string,
    quantity int,
    price decimal(10,2),
    order_date date
) USING ICEBERG
PARTITIONED BY (bucket(16, customer_id), days(order_date))
LOCATION 's3a://my-iceberg-warehouse/sales_db/orders'
"""
    },
    {
        "operation": "Insert Data",
        "sql": """
INSERT INTO glue_catalog.sales_db.orders VALUES
    (1001, 5001, 'Laptop', 1, 999.99, DATE '2024-01-15'),
    (1002, 5002, 'Mouse', 2, 29.99, DATE '2024-01-16')
"""
    },
    {
        "operation": "Query with Partition Pruning",
        "sql": """
SELECT * FROM glue_catalog.sales_db.orders 
WHERE order_date >= DATE '2024-01-01' 
AND customer_id = 5001
"""
    }
]

for op in aws_operations:
    print(f"\n📋 {op['operation']}:")
    print(f"```sql\n{op['sql'].strip()}\n```")

print("\n💡 Benefits of AWS Glue Integration:")
aws_benefits = [
    "Centralized metadata catalog",
    "Integration with AWS analytics services",
    "Automatic schema discovery",
    "Built-in data governance features",
    "Support for multiple compute engines"
]

for benefit in aws_benefits:
    print(f"  ✅ {benefit}")

## 5. 💙 Azure ADLS + Synapse Integration

Explore Azure cloud integration patterns.

In [None]:
# Azure Integration Examples
print("💙 AZURE ADLS + SYNAPSE INTEGRATION")
print("\n📝 Complete Azure setup example:")

# Azure Spark Session Configuration
azure_spark_config = """
# Complete Azure Spark Session Setup
spark = SparkSession.builder \\
    .appName("IcebergAzure") \\
    .config("spark.jars.packages", 
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3,"
            "org.apache.hadoop:hadoop-azure:3.3.4") \\
    .config("spark.sql.extensions", 
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \\
    .config("spark.sql.catalog.azure_catalog", 
            "org.apache.iceberg.spark.SparkCatalog") \\
    .config("spark.sql.catalog.azure_catalog.type", "hadoop") \\
    .config("spark.sql.catalog.azure_catalog.warehouse", 
            "abfss://warehouse@mystorageaccount.dfs.core.windows.net/iceberg") \\
    .config("spark.hadoop.fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net", 
            "OAuth") \\
    .config("spark.hadoop.fs.azure.account.oauth.provider.type.mystorageaccount.dfs.core.windows.net", 
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") \\
    .config("spark.hadoop.fs.azure.account.oauth2.client.id.mystorageaccount.dfs.core.windows.net", 
            "your-client-id") \\
    .config("spark.hadoop.fs.azure.account.oauth2.client.secret.mystorageaccount.dfs.core.windows.net", 
            "your-client-secret") \\
    .config("spark.hadoop.fs.azure.account.oauth2.client.endpoint.mystorageaccount.dfs.core.windows.net", 
            "https://login.microsoftonline.com/your-tenant-id/oauth2/token") \\
    .getOrCreate()
"""

print(azure_spark_config)

print("\n🔐 Azure Service Principal Setup:")
azure_setup_steps = [
    "1. Create Azure Service Principal in Azure AD",
    "2. Assign 'Storage Blob Data Contributor' role to storage account",
    "3. Note down: Client ID, Client Secret, Tenant ID",
    "4. Create ADLS Gen2 storage account with hierarchical namespace",
    "5. Configure Synapse workspace with storage account"
]

for step in azure_setup_steps:
    print(f"  {step}")
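
The five OAuth settings in the session above all repeat the storage account host name, which makes them easy to mistype. The sketch below generates the same keys from the values noted in step 3. `abfs_oauth_conf` is a hypothetical helper and the argument values are placeholders; the property names are the ones used in the session configuration above.

```python
def abfs_oauth_conf(account, tenant_id, client_id, client_secret):
    """Return the Spark settings for service principal (OAuth) access to ADLS Gen2."""
    host = f"{account}.dfs.core.windows.net"
    prefix = "spark.hadoop.fs.azure.account"
    return {
        f"{prefix}.auth.type.{host}": "OAuth",
        f"{prefix}.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"{prefix}.oauth2.client.id.{host}": client_id,
        f"{prefix}.oauth2.client.secret.{host}": client_secret,
        f"{prefix}.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


oauth_conf = abfs_oauth_conf(
    account="mystorageaccount",
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",  # placeholder; load real secrets from a vault
)
for key in sorted(oauth_conf):
    print(key)
```

Each pair can then be applied to a builder in a loop, for example `for k, v in oauth_conf.items(): builder = builder.config(k, v)`.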

In [None]:
# Azure Example Operations
print("🚀 AZURE EXAMPLE OPERATIONS")
print("\n📝 Example operations with Azure ADLS:")

azure_operations = [
    {
        "operation": "Create Database",
        "sql": "CREATE DATABASE IF NOT EXISTS azure_catalog.analytics_db"
    },
    {
        "operation": "Create Table with ADLS Location",
        "sql": """
CREATE TABLE azure_catalog.analytics_db.customer_events (
    event_id bigint,
    customer_id bigint,
    event_type string,
    event_time timestamp,
    properties map<string, string>
) USING ICEBERG
PARTITIONED BY (days(event_time), bucket(32, customer_id))
LOCATION 'abfss://warehouse@mystorageaccount.dfs.core.windows.net/analytics_db/customer_events'
"""
    },
    {
        "operation": "Synapse Analytics Integration",
        "sql": """
-- Spark SQL: run from a Spark pool with azure_catalog configured (e.g. a Synapse Spark pool)
SELECT 
    event_type,
    COUNT(*) as event_count,
    COUNT(DISTINCT customer_id) as unique_customers
FROM azure_catalog.analytics_db.customer_events
WHERE event_time >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY event_type
"""
    }
]

for op in azure_operations:
    print(f"\n📋 {op['operation']}:")
    print(f"```sql\n{op['sql'].strip()}\n```")

print("\n💡 Benefits of Azure Integration:")
azure_benefits = [
    "Native integration with Synapse Analytics",
    "Microsoft Entra ID (formerly Azure Active Directory) authentication",
    "Integration with Power BI for visualization",
    "Azure Monitor for observability",
    "Cost-effective storage with lifecycle policies"
]

for benefit in azure_benefits:
    print(f"  ✅ {benefit}")

## 6. 🌐 GCP Cloud Storage + BigQuery Integration

Learn Google Cloud Platform integration patterns.

In [None]:
# GCP Integration Examples
print("🌐 GCP CLOUD STORAGE + BIGQUERY INTEGRATION")
print("\n📝 Complete GCP setup example:")

# GCP Spark Session Configuration
gcp_spark_config = """
# Complete GCP Spark Session Setup
spark = SparkSession.builder \\
    .appName("IcebergGCP") \\
    .config("spark.jars.packages", 
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3,"
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.11") \\
    .config("spark.sql.extensions", 
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \\
    .config("spark.sql.catalog.gcp_catalog", 
            "org.apache.iceberg.spark.SparkCatalog") \\
    .config("spark.sql.catalog.gcp_catalog.type", "hadoop") \\
    .config("spark.sql.catalog.gcp_catalog.warehouse", 
            "gs://my-iceberg-bucket/warehouse") \\
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", 
            "/path/to/service-account-key.json") \\
    .config("spark.hadoop.fs.gs.impl", 
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \\
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", 
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \\
    .getOrCreate()
"""

print(gcp_spark_config)

print("\n🔐 GCP Service Account Setup:")
gcp_setup_steps = [
    "1. Create GCP Service Account in IAM & Admin",
    "2. Assign least-privilege roles (see the IAM roles listed below) rather than broad Storage Admin / BigQuery Admin",
    "3. Generate and download JSON key file",
    "4. Create Cloud Storage bucket for warehouse",
    "5. Enable BigQuery API for your project"
]

for step in gcp_setup_steps:
    print(f"  {step}")

# GCP IAM Roles
print("\n🔐 Required GCP IAM Roles:")
gcp_roles = {
    "Storage Object Admin": "Full control over Cloud Storage objects",
    "Storage Legacy Bucket Reader": "List buckets and read bucket metadata",
    "BigQuery Data Editor": "Create and modify BigQuery datasets/tables",
    "BigQuery Job User": "Run BigQuery jobs"
}

for role, description in gcp_roles.items():
    print(f"  📋 {role}: {description}")

In [None]:
# GCP Example Operations
print("🚀 GCP EXAMPLE OPERATIONS")
print("\n📝 Example operations with GCP Cloud Storage:")

gcp_operations = [
    {
        "operation": "Create Database",
        "sql": "CREATE DATABASE IF NOT EXISTS gcp_catalog.data_lake_db"
    },
    {
        "operation": "Create Table with GCS Location",
        "sql": """
CREATE TABLE gcp_catalog.data_lake_db.web_analytics (
    session_id string,
    user_id bigint,
    page_url string,
    event_time timestamp,
    user_agent string,
    geo_location struct<country: string, city: string>
) USING ICEBERG
PARTITIONED BY (days(event_time), bucket(64, user_id))
LOCATION 'gs://my-iceberg-bucket/warehouse/data_lake_db/web_analytics'
"""
    },
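    {
        "operation": "BigQuery External Table",
        "description": "Create BigQuery external table to query Iceberg data",
        "sql": """
-- BigQuery SQL to create external table.
-- uris must reference the table's current metadata JSON file, not the table directory;
-- replace <current-metadata-file> with that file's name.
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.web_analytics_external`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-iceberg-bucket/warehouse/data_lake_db/web_analytics/metadata/<current-metadata-file>.metadata.json']
)
"""
    }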
]

for op in gcp_operations:
    print(f"\n📋 {op['operation']}:")
    if 'description' in op:
        print(f"    {op['description']}")
    print(f"```sql\n{op['sql'].strip()}\n```")

print("\n💡 Benefits of GCP Integration:")
gcp_benefits = [
    "Integration with BigQuery for serverless analytics",
    "Cost-effective Cloud Storage with lifecycle management",
    "Integration with Dataflow for stream processing",
    "Cloud Monitoring for observability",
    "Looker Studio (formerly Data Studio) for visualization"
]

for benefit in gcp_benefits:
    print(f"  ✅ {benefit}")
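
A BigQuery external table over Iceberg data is defined against a specific table metadata JSON file, so the definition has to be refreshed when the table commits a new version. The sketch below picks the newest metadata file from a list of names. It assumes the two common naming schemes (`v<N>.metadata.json` from the Hadoop catalog and `<N>-<uuid>.metadata.json` from other catalogs) and does not handle compressed metadata file names. `latest_metadata_file` is a hypothetical helper and the listing is made-up sample data; in practice the names would come from listing the table's `metadata/` folder.

```python
import re

# Matches Hadoop-catalog names such as v3.metadata.json and names such as
# 00003-<uuid>.metadata.json written by other catalogs.
_METADATA_RE = re.compile(r"^(?:v(\d+)|(\d+)-[0-9a-f-]+)\.metadata\.json$")


def latest_metadata_file(file_names):
    """Return the name with the highest metadata version, or None if none match."""
    best_version, best_name = -1, None
    for name in file_names:
        base = name.rsplit("/", 1)[-1]  # tolerate full object paths
        match = _METADATA_RE.match(base)
        if match is None:
            continue
        version = int(match.group(1) or match.group(2))
        if version > best_version:
            best_version, best_name = version, name
    return best_name


# Made-up listing for illustration only.
listing = [
    "v1.metadata.json",
    "v2.metadata.json",
    "v10.metadata.json",
    "version-hint.text",
    "some-manifest-list.avro",
]
print(latest_metadata_file(listing))
```

The version is compared numerically, so `v10` correctly sorts after `v2`, which a plain string sort would get wrong.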

## 7. 🔄 Multi-Cloud Strategies

Learn patterns for multi-cloud and hybrid deployments.

In [None]:
# Multi-Cloud Patterns
print("🔄 MULTI-CLOUD STRATEGIES")
print("\n🌐 Multi-cloud deployment patterns:")

multicloud_patterns = {
    "🏢 Hybrid Cloud": {
        "description": "On-premises + one cloud provider",
        "use_cases": [
            "Gradual cloud migration",
            "Data residency requirements",
            "Cost optimization"
        ],
        "challenges": [
            "Network latency",
            "Data transfer costs",
            "Security complexity"
        ]
    },
    "☁️ Multi-Cloud": {
        "description": "Multiple cloud providers",
        "use_cases": [
            "Vendor lock-in avoidance",
            "Best-of-breed services",
            "Disaster recovery"
        ],
        "challenges": [
            "Increased complexity",
            "Data consistency",
            "Cross-cloud networking"
        ]
    },
    "🌍 Data Distribution": {
        "description": "Geographically distributed data",
        "use_cases": [
            "Global applications",
            "Compliance requirements",
            "Performance optimization"
        ],
        "challenges": [
            "Data governance",
            "Consistency models",
            "Cross-region costs"
        ]
    }
}

for pattern, details in multicloud_patterns.items():
    print(f"\n{pattern}: {details['description']}")
    print("  📈 Use Cases:")
    for use_case in details['use_cases']:
        print(f"    • {use_case}")
    print("  ⚠️ Challenges:")
    for challenge in details['challenges']:
        print(f"    • {challenge}")

In [None]:
# Multi-Cloud Configuration Example
print("🔧 MULTI-CLOUD CONFIGURATION EXAMPLE")
print("\n📝 Federated catalog setup across clouds:")

multicloud_config = """
# Multi-Cloud Spark Session with Multiple Catalogs
# The builder chain is wrapped in parentheses because a backslash
# continuation cannot be followed by a comment line.
spark = (
    SparkSession.builder
    .appName("IcebergMultiCloud")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3,"
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "software.amazon.awssdk:bundle:2.20.18,"
            "org.apache.hadoop:hadoop-azure:3.3.4,"
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.11")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # AWS Catalog
    .config("spark.sql.catalog.aws", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.aws.type", "glue")
    .config("spark.sql.catalog.aws.warehouse", "s3a://aws-warehouse/")
    # Azure Catalog
    .config("spark.sql.catalog.azure", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.azure.type", "hadoop")
    .config("spark.sql.catalog.azure.warehouse",
            "abfss://warehouse@account.dfs.core.windows.net/")
    # GCP Catalog
    .config("spark.sql.catalog.gcp", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.gcp.type", "hadoop")
    .config("spark.sql.catalog.gcp.warehouse", "gs://gcp-warehouse/")
    .getOrCreate()
)
"""

print(multicloud_config)

print("\n📊 Multi-cloud query examples:")
multicloud_queries = [
    {
        "operation": "Cross-cloud data federation",
        "sql": """
-- Join data across clouds
SELECT 
    a.customer_id,
    a.order_total,
    b.event_count
FROM aws.sales.orders a
JOIN azure.analytics.customer_events b
  ON a.customer_id = b.customer_id
WHERE a.order_date >= CURRENT_DATE - INTERVAL 30 DAYS
"""
    },
    {
        "operation": "Data migration between clouds",
        "sql": """
-- Migrate data from AWS to GCP
CREATE TABLE gcp.migration.orders AS
SELECT * FROM aws.sales.orders
WHERE order_date >= DATE '2024-01-01'
"""
    }
]

for query in multicloud_queries:
    print(f"\n📋 {query['operation']}:")
    print(f"```sql{query['sql']}```")
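
Each catalog in the multi-cloud session follows the same three-key pattern under `spark.sql.catalog.<name>`, so the settings can be generated from a short list of catalog specifications rather than written by hand. The sketch below shows one way to do that. `catalog_conf` is a hypothetical helper, and the `io-impl` entry for the AWS catalog is carried over from the AWS section as an assumption about what you would want in a combined session.

```python
def catalog_conf(name, catalog_type, warehouse, extra=None):
    """Return the Spark settings that register one Iceberg catalog."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,
        f"{prefix}.warehouse": warehouse,
    }
    # Optional catalog-specific properties, e.g. io-impl.
    for key, value in (extra or {}).items():
        conf[f"{prefix}.{key}"] = value
    return conf


catalogs = [
    ("aws", "glue", "s3a://aws-warehouse/",
     {"io-impl": "org.apache.iceberg.aws.s3.S3FileIO"}),
    ("azure", "hadoop", "abfss://warehouse@account.dfs.core.windows.net/", None),
    ("gcp", "hadoop", "gs://gcp-warehouse/", None),
]

merged = {}
for name, catalog_type, warehouse, extra in catalogs:
    merged.update(catalog_conf(name, catalog_type, warehouse, extra))

for key in sorted(merged):
    print(f"{key} = {merged[key]}")
```

Because every key is prefixed with its catalog name, the three dictionaries merge without collisions.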

## 8. ⚡ Performance Optimization for Cloud

Learn cloud-specific performance optimization techniques.

In [None]:
# Cloud Performance Optimization
print("⚡ CLOUD PERFORMANCE OPTIMIZATION")
print("\n🚀 Cloud-specific optimization strategies:")

cloud_optimizations = {
    "🗄️ Storage Optimization": {
        "File Size": [
            "Target 128MB-1GB files for optimal cloud storage",
            "Set the write.target-file-size-bytes table property",
            "Monitor file count vs. size trade-offs"
        ],
        "Compression": [
            "Use ZSTD for a good balance of compression ratio and speed",
            "Consider GZIP for compatibility",
            "Test compression ratios for your data"
        ],
        "Partitioning": [
            "Align partitions with query patterns",
            "Avoid over-partitioning (partitions holding less than ~1GB of data)",
            "Use hidden partitioning for flexibility"
        ]
    },
    "🌐 Network Optimization": {
        "Data Locality": [
            "Co-locate compute and storage in same region",
            "Use regional storage classes",
            "Consider data transfer costs"
        ],
        "Caching": [
            "Enable Spark SQL adaptive query execution",
            "Use broadcast joins for small tables",
            "Cache frequently accessed metadata"
        ]
    },
    "💰 Cost Optimization": {
        "Storage Classes": [
            "Use appropriate storage tiers (Standard/IA/Archive)",
            "Implement lifecycle policies",
            "Monitor storage costs by table"
        ],
        "Compute": [
            "Right-size compute instances",
            "Use spot/preemptible instances when possible",
            "Auto-scaling for variable workloads"
        ]
    }
}

for category, subcategories in cloud_optimizations.items():
    print(f"\n{category}:")
    for subcat, tips in subcategories.items():
        print(f"  📋 {subcat}:")
        for tip in tips:
            print(f"    • {tip}")
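
The file-size and partitioning guidelines above are easier to apply with a quick back-of-the-envelope calculation. The sketch below estimates how many data files a table would have at a given target file size and checks the 1GB-per-partition guideline. It assumes data is spread evenly across files and partitions, which real tables rarely are, and the function names and the 500 GiB table are hypothetical.

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3


def estimate_file_count(total_bytes, target_file_bytes):
    """Approximate number of data files if every file reaches the target size."""
    return math.ceil(total_bytes / target_file_bytes)


def is_over_partitioned(total_bytes, partition_count, min_partition_bytes=GiB):
    """True when the average partition holds less than min_partition_bytes."""
    return total_bytes / partition_count < min_partition_bytes


table_bytes = 500 * GiB  # hypothetical table size

print(estimate_file_count(table_bytes, 128 * MiB))  # 4000 files at 128 MiB
print(estimate_file_count(table_bytes, 512 * MiB))  # 1000 files at 512 MiB
print(is_over_partitioned(table_bytes, 2000))       # True: 0.25 GiB per partition
print(is_over_partitioned(table_bytes, 250))        # False: 2 GiB per partition
```

Quadrupling the target file size cuts the file count by the same factor, which reduces the number of object-store requests a full scan has to issue.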

In [None]:
# Performance Configuration Examples
print("🔧 PERFORMANCE CONFIGURATION EXAMPLES")
print("\n📝 Spark configurations for cloud optimization:")

performance_configs = {
    # Iceberg table properties: set with TBLPROPERTIES or ALTER TABLE ... SET TBLPROPERTIES
    "File Size Optimization (Iceberg table properties)": {
        "write.target-file-size-bytes": "134217728",  # 128MB
        "write.delete.target-file-size-bytes": "67108864"  # 64MB
    },
    "Compression and Read Settings (Iceberg table properties)": {
        "write.parquet.compression-codec": "zstd",
        "write.parquet.compression-level": "3",
        "read.parquet.vectorization.enabled": "true"
    },
    # Spark session settings: set with .config(...) on the SparkSession builder
    "Adaptive Execution and Serialization (Spark session settings)": {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    }
}

for category, configs in performance_configs.items():
    print(f"\n📋 {category}:")
    for key, value in configs.items():
        # Keys are printed as-is; they are not catalog options, so no catalog prefix is added
        print(f"  {key} = {value}")

# Example table properties
print("\n📊 Example table properties for performance:")
table_properties_sql = """
CREATE TABLE prod.analytics.optimized_table (
    id bigint,
    data string,
    timestamp timestamp
) USING ICEBERG
PARTITIONED BY (days(timestamp))
TBLPROPERTIES (
    'write.target-file-size-bytes' = '134217728',
    'write.parquet.compression-codec' = 'zstd',
    'write.metadata.compression-codec' = 'gzip',
    'history.expire.max-snapshot-age-ms' = '2592000000'  -- 30 days
)
"""
print(table_properties_sql)
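
The same properties can be applied to a table that already exists with `ALTER TABLE ... SET TBLPROPERTIES`. The sketch below renders that statement from a dictionary so the settings can live in one place. `set_tblproperties_sql` is a hypothetical helper; it rejects values containing single quotes rather than attempting to escape them, and it only builds the string, leaving execution (for example through `spark.sql`) to you.

```python
def set_tblproperties_sql(table, properties):
    """Render an ALTER TABLE ... SET TBLPROPERTIES statement from a dict."""
    if not properties:
        raise ValueError("properties must not be empty")
    pairs = []
    for key, value in properties.items():
        # Keep the sketch simple: refuse quotes instead of escaping them.
        if "'" in key or "'" in str(value):
            raise ValueError(f"single quotes are not supported: {key!r}")
        pairs.append(f"    '{key}' = '{value}'")
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n" + ",\n".join(pairs) + "\n)"


statement = set_tblproperties_sql(
    "prod.analytics.optimized_table",
    {
        "write.target-file-size-bytes": 128 * 1024 * 1024,
        "write.parquet.compression-codec": "zstd",
    },
)
print(statement)
```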

## 9. 📊 Monitoring and Observability

Learn to monitor Iceberg in cloud environments.

In [None]:
# Cloud Monitoring Strategies
print("📊 CLOUD MONITORING AND OBSERVABILITY")
print("\n🔍 Key metrics to monitor:")

monitoring_metrics = {
    "📈 Performance Metrics": [
        "Query execution time",
        "Data scan rates (GB/sec)",
        "File count per table",
        "Average file size",
        "Partition pruning effectiveness"
    ],
    "💾 Storage Metrics": [
        "Total storage usage",
        "Storage growth rate",
        "Metadata size vs data size ratio",
        "Snapshot count per table",
        "Orphaned file count"
    ],
    "💰 Cost Metrics": [
        "Storage costs by table",
        "Compute costs by workload",
        "Data transfer costs",
        "API call costs",
        "Cost per query"
    ],
    "🔒 Security Metrics": [
        "Access patterns by user",
        "Failed authentication attempts",
        "Permission violations",
        "Data access audit logs",
        "Encryption status"
    ]
}

for category, metrics in monitoring_metrics.items():
    print(f"\n{category}:")
    for metric in metrics:
        print(f"  • {metric}")

# Cloud-specific monitoring tools
print("\n🛠️ Cloud-specific monitoring tools:")
cloud_tools = {
    "AWS": [
        "CloudWatch for metrics and logs",
        "CloudTrail for API audit logs",
        "Cost Explorer for cost analysis",
        "X-Ray for distributed tracing"
    ],
    "Azure": [
        "Azure Monitor for metrics",
        "Log Analytics for log analysis",
        "Cost Management for cost tracking",
        "Application Insights for performance"
    ],
    "GCP": [
        "Cloud Monitoring for metrics",
        "Cloud Logging for log management",
        "Cloud Billing for cost analysis",
        "Cloud Trace for performance"
    ]
}

for cloud, tools in cloud_tools.items():
    print(f"\n🌐 {cloud}:")
    for tool in tools:
        print(f"  • {tool}")

In [None]:
# Monitoring Queries Examples
print("📊 MONITORING QUERIES EXAMPLES")
print("\n📝 SQL queries for monitoring Iceberg tables:")

monitoring_queries = [
    {
        "name": "Table Storage Analysis",
        "sql": """
-- Analyze storage usage by table
SELECT 
    'your_table' as table_name,
    COUNT(*) as file_count,
    SUM(file_size_in_bytes) / 1024 / 1024 / 1024 as size_gb,
    AVG(file_size_in_bytes) / 1024 / 1024 as avg_file_size_mb,
    SUM(record_count) as total_records
FROM your_catalog.your_db.your_table.files
"""
    },
    {
        "name": "Snapshot Management",
        "sql": """
-- Monitor snapshot accumulation
SELECT 
    COUNT(*) as snapshot_count,
    MIN(committed_at) as oldest_snapshot,
    MAX(committed_at) as newest_snapshot,
    DATEDIFF(day, MIN(committed_at), MAX(committed_at)) as retention_days
FROM your_catalog.your_db.your_table.snapshots
"""
    },
    {
        "name": "Partition Efficiency",
        "sql": """
-- Analyze partition distribution
SELECT 
    partition,
    record_count,
    file_count,
    ROUND(record_count / file_count, 0) as avg_records_per_file
FROM your_catalog.your_db.your_table.partitions
ORDER BY record_count DESC
"""
    },
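    {
        "name": "Commit Cadence by Operation",
        "sql": """
-- Average time between consecutive commits, grouped by operation type.
-- The window function is computed in a CTE first because Spark does not
-- allow a window function inside an aggregate function.
WITH commits AS (
    SELECT
        operation,
        committed_at,
        LAG(committed_at) OVER (ORDER BY committed_at) as prev_committed_at
    FROM your_catalog.your_db.your_table.snapshots
)
SELECT 
    operation,
    COUNT(*) as operation_count,
    AVG(DATEDIFF(second, prev_committed_at, committed_at)) as avg_seconds_since_prev_commit
FROM commits
GROUP BY operation
"""
    }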
]

for query in monitoring_queries:
    print(f"\n📋 {query['name']}:")
    print(f"```sql{query['sql']}```")

print("\n💡 Monitoring Best Practices:")
monitoring_best_practices = [
    "Set up automated alerts for storage growth",
    "Monitor query performance trends",
    "Track cost metrics regularly",
    "Implement data quality checks",
    "Audit access patterns for security",
    "Regular snapshot cleanup automation"
]

for practice in monitoring_best_practices:
    print(f"  ✅ {practice}")
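
The first best practice, automated alerts, needs some rule that turns the numbers from the monitoring queries into a warning. The sketch below is one minimal way to express such rules in plain Python. `check_table_health` is a hypothetical helper, the input dictionary stands in for values you would collect from the `files` and `snapshots` metadata tables, and the thresholds (32 MB average file size, 500 snapshots) are illustrative assumptions rather than recommended values.

```python
def check_table_health(stats, min_avg_file_mb=32.0, max_snapshots=500):
    """Return a list of warning strings for one table's collected metrics."""
    warnings = []
    if stats["file_count"] > 0 and stats["avg_file_size_mb"] < min_avg_file_mb:
        warnings.append(
            f"small files: average {stats['avg_file_size_mb']:.1f} MB "
            f"across {stats['file_count']} files; consider compaction"
        )
    if stats["snapshot_count"] > max_snapshots:
        warnings.append(
            f"snapshot buildup: {stats['snapshot_count']} snapshots; "
            "consider expiring old snapshots"
        )
    return warnings


# Made-up metrics for illustration only.
sample_stats = {"file_count": 12000, "avg_file_size_mb": 4.5, "snapshot_count": 900}
for warning in check_table_health(sample_stats):
    print(f"WARNING: {warning}")
```

Returning a list of messages, rather than printing inside the function, lets the same check feed a log line, a dashboard, or an alerting hook.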

## 10. 🎉 Summary and Best Practices

Cloud integration summary and recommendations.

In [None]:
# Final summary
print("🎉 CLOUD INTEGRATION TUTORIAL COMPLETE!")
print("\n✅ What You've Learned:")

accomplishments = [
    "Cloud storage integration patterns for AWS, Azure, and GCP",
    "Security and authentication best practices",
    "Multi-cloud and hybrid deployment strategies",
    "Performance optimization for cloud environments",
    "Cost optimization techniques",
    "Monitoring and observability patterns",
    "Cloud-specific service integrations"
]

for i, accomplishment in enumerate(accomplishments, 1):
    print(f"   {i}. {accomplishment}")

print("\n💡 CLOUD INTEGRATION BEST PRACTICES:")

best_practices = {
    "🔒 Security First": [
        "Use IAM roles/managed identities instead of keys",
        "Implement least privilege access",
        "Enable audit logging and monitoring",
        "Encrypt data at rest and in transit"
    ],
    "⚡ Performance": [
        "Co-locate compute and storage",
        "Optimize file sizes for cloud storage",
        "Use appropriate compression algorithms",
        "Monitor and tune query performance"
    ],
    "💰 Cost Management": [
        "Choose appropriate storage tiers",
        "Implement lifecycle policies",
        "Monitor data transfer costs",
        "Right-size compute resources"
    ],
    "🔧 Operations": [
        "Automate deployment and configuration",
        "Implement proper backup strategies",
        "Set up comprehensive monitoring",
        "Plan for disaster recovery"
    ]
}

for category, practices in best_practices.items():
    print(f"\n{category}:")
    for practice in practices:
        print(f"   • {practice}")

print("\n🚀 Next Steps:")
next_steps = [
    "Set up a cloud environment for hands-on practice",
    "Implement monitoring and alerting",
    "Explore production pipeline patterns",
    "Study advanced multi-cloud architectures"
]

for step in next_steps:
    print(f"   → {step}")

print("\n🎯 Key Takeaway:")
print("   Cloud integration with Iceberg provides scalable, cost-effective,")
print("   and secure data lake solutions across all major cloud providers!")