# Iceberg Feature Groups in Seeknal

This notebook demonstrates how to use **Apache Iceberg** as the storage backend for Seeknal Feature Groups.

## What is Apache Iceberg?

**Apache Iceberg** is an open-source table format for huge analytic datasets. It provides:

- **ACID Transactions**: Atomic commits, so a failed write never leaves partial data visible
- **Time Travel**: Query features as of any retained snapshot or timestamp
- **Schema Evolution**: Add, rename, or drop feature columns without rewriting data files
- **Cloud Storage**: Native support for S3, GCS, and Azure Blob
- **Compatibility**: Works with DuckDB, Spark, Trino, and more

## Prerequisites

Before running this notebook, ensure:

1. **Seeknal is installed**: `pip install seeknal`
2. **DuckDB is available**: Comes with Seeknal
3. **REST Catalog is running**: e.g., Lakekeeper at `http://localhost:8181`
4. **Storage is configured**: S3 bucket or local filesystem

### Configure Catalog

Create or update `~/.seeknal/profiles.yml`:

In [None]:
# Show example profile configuration
profile_content = """
materialization:
  catalog:
    uri: http://localhost:8181  # Lakekeeper REST catalog
    warehouse: s3://my-bucket/warehouse
    bearer_token: optional_token  # If auth required
"""

print(profile_content)

## Setup: Initialize Project and Workspace

In [None]:
import os
from datetime import datetime
import pandas as pd
import duckdb

# Set environment variables (alternative to profiles.yml)
os.environ['LAKEKEEPER_URI'] = 'http://localhost:8181'
os.environ['LAKEKEEPER_WAREHOUSE'] = 's3://iceberg/warehouse'

print("Environment configured for Iceberg storage")
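
Reading both settings through one helper makes a malformed value fail early instead of surfacing later as a connection error. The sketch below is illustrative only: `resolve_catalog_settings` is a hypothetical helper, not part of Seeknal, and it assumes the two environment variable names and defaults used in this notebook.

```python
import os
from urllib.parse import urlparse


def resolve_catalog_settings(env=None):
    """Hypothetical helper (not a Seeknal API): read catalog settings with defaults."""
    env = os.environ if env is None else env
    uri = env.get("LAKEKEEPER_URI", "http://localhost:8181")
    warehouse = env.get("LAKEKEEPER_WAREHOUSE", "s3://iceberg/warehouse")

    # Fail fast on an obviously malformed catalog URI.
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Catalog URI must be an http(s) URL, got: {uri!r}")
    if not warehouse:
        raise ValueError("Warehouse must not be empty")

    return {"uri": uri, "warehouse": warehouse}


# Passing a dict instead of os.environ keeps the example reproducible.
settings = resolve_catalog_settings({"LAKEKEEPER_URI": "http://localhost:8181"})
print(settings)
```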

## Creating Feature Groups with Iceberg Storage

In [None]:
from seeknal.featurestore import (
    FeatureGroup,
    Materialization,
    OfflineMaterialization,
    OfflineStore,
    OfflineStoreEnum,
    IcebergStoreOutput,
)
from seeknal.entity import Entity

# Create an entity (defines the join key)
customer_entity = Entity(
    name="customer",
    join_keys=["customer_id"]
)

print("Entity created:", customer_entity.name)

In [None]:
# Create sample feature data
features_df = pd.DataFrame({
    "customer_id": ["A001", "A002", "A003", "A001", "A002"],
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "total_orders": [5, 10, 3, 2, 5],
    "total_spend": [100.0, 250.0, 75.0, 50.0, 125.0],
    "avg_order_value": [20.0, 25.0, 25.0, 25.0, 25.0],
    "days_since_last_order": [0, 0, 0, 1, 1],
})

print("Sample features:")
print(features_df.head())

In [None]:
# Create Feature Group with Iceberg storage
fg = FeatureGroup(
    name="customer_features",
    entity=customer_entity,
    materialization=Materialization(
        event_time_col="event_date",
        offline=True,
        offline_materialization=OfflineMaterialization(
            store=OfflineStore(
                kind=OfflineStoreEnum.ICEBERG,  # Use Iceberg!
                value=IcebergStoreOutput(
                    catalog="lakekeeper",        # Catalog name
                    namespace="prod",            # Namespace (database)
                    table="customer_features",   # Table name
                    mode="append"               # Write mode
                )
            ),
            mode="append"
        )
    )
)

store_value = fg.materialization.offline_materialization.store.value

print("Feature Group created with Iceberg storage")
print(f"Storage kind: {fg.materialization.offline_materialization.store.kind}")
# Print the fully qualified table name on a single line
print(f"Table: {store_value.catalog}.{store_value.namespace}.{store_value.table}")

## Writing Features to Iceberg

In [None]:
# Set the DataFrame and define features
fg.set_dataframe(features_df).set_features()

# Write to Iceberg table
result = fg.write(feature_start_time=datetime(2024, 1, 1))

print("Features written to Iceberg table!")
print("\nResult:")
print(f"  Path: {result['path']}")
print(f"  Rows: {result['num_rows']}")
print(f"  Storage: {result['storage_type']}")
# str() keeps the slice valid even if the snapshot ID is returned as an integer
print(f"  Snapshot ID: {str(result['snapshot_id'])[:16]}...")  # First 16 chars
print(f"  Table: {result['table']}")
print(f"  Namespace: {result['namespace']}")

## Querying Iceberg Tables with DuckDB

In [None]:
# Create DuckDB connection
con = duckdb.connect(":memory:")

# Load Iceberg extension
con.install_extension("iceberg")
con.load_extension("iceberg")

print("DuckDB Iceberg extension loaded")

In [None]:
# Attach REST catalog
catalog_uri = os.getenv('LAKEKEEPER_URI', 'http://localhost:8181')
warehouse_path = os.getenv('LAKEKEEPER_WAREHOUSE', 's3://iceberg/warehouse')

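# DuckDB's Iceberg extension takes the warehouse as the ATTACH target and the
# REST catalog URL as ENDPOINT. Replace AUTHORIZATION_TYPE with a SECRET if the
# catalog requires authentication.
con.execute(f"""
ATTACH '{warehouse_path}' AS seeknal_catalog (
    TYPE iceberg,
    ENDPOINT '{catalog_uri}',
    AUTHORIZATION_TYPE 'none'
)
""")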

print(f"Attached catalog: {catalog_uri}")
print(f"Warehouse: {warehouse_path}")

In [None]:
# Query the Iceberg table
query_result = con.execute("""
SELECT 
    customer_id,
    total_orders,
    total_spend,
    avg_order_value
FROM seeknal_catalog.prod.customer_features
ORDER BY customer_id, event_date
""").fetchall()

print("\nFeatures from Iceberg table:")
for row in query_result:
    print(f"  {row[0]}: orders={row[1]}, spend=${row[2]:.2f}, avg=${row[3]:.2f}")

## Time Travel with Iceberg

Iceberg allows querying data as of any snapshot. This is useful for:
- Debugging model performance at specific times
- Rolling back to previous feature versions
- Auditing feature changes over time

In [None]:
# Get snapshot history with the DuckDB Iceberg extension's iceberg_snapshots() function
snapshots = con.execute("""
SELECT snapshot_id, timestamp_ms
FROM iceberg_snapshots(seeknal_catalog.prod.customer_features)
ORDER BY timestamp_ms DESC
""").fetchall()

print("\nSnapshot history:")
for i, (snapshot_id, committed_at) in enumerate(snapshots[:5], 1):
    # Snapshot IDs are integers, so convert before truncating for display
    print(f"  {i}. {str(snapshot_id)[:16]}... at {committed_at}")

In [None]:
# Query as of a specific snapshot
if snapshots:
    # The history is ordered newest first, so index 0 is the latest snapshot
    latest_snapshot_id = snapshots[0][0]

    # DuckDB selects a snapshot per query with an AT clause
    historical_result = con.execute(f"""
        SELECT COUNT(*) AS row_count
        FROM seeknal_catalog.prod.customer_features AT (VERSION => {latest_snapshot_id})
    """).fetchone()

    print(f"\nRows as of snapshot {str(latest_snapshot_id)[:16]}...: {historical_result[0]}")
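
To time travel to a moment rather than to a known snapshot ID, pick the newest snapshot committed at or before that moment. This is a minimal sketch, assuming the history is a list of `(snapshot_id, committed_at)` tuples like the `snapshots` list fetched above. `snapshot_as_of` is a hypothetical helper, and the sample IDs and timestamps are made up for illustration.

```python
from datetime import datetime


def snapshot_as_of(snapshots, as_of):
    """Return the ID of the newest snapshot committed at or before `as_of`, or None."""
    eligible = [(ts, sid) for sid, ts in snapshots if ts <= as_of]
    if not eligible:
        return None
    # Tuples compare by timestamp first, so max() yields the latest eligible commit.
    return max(eligible)[1]


# Made-up snapshot history, newest first
history = [
    (3003, datetime(2024, 1, 3, 9, 0)),
    (2002, datetime(2024, 1, 2, 9, 0)),
    (1001, datetime(2024, 1, 1, 9, 0)),
]

print(snapshot_as_of(history, datetime(2024, 1, 2, 12, 0)))  # 2002
print(snapshot_as_of(history, datetime(2023, 12, 31)))       # None
```

Returning `None` when no snapshot is old enough lets the caller decide whether to fail or fall back to the earliest snapshot.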

## Append vs Overwrite Modes

In [None]:
# Create new data for append
new_data = pd.DataFrame({
    "customer_id": ["A004", "A005"],
    "event_date": ["2024-01-03", "2024-01-03"],
    "total_orders": [7, 15],
    "total_spend": [175.0, 375.0],
    "avg_order_value": [25.0, 25.0],
    "days_since_last_order": [0, 0],
})

# Append mode - adds new rows
fg_append = FeatureGroup(
    name="customer_features_append",
    entity=customer_entity,
    materialization=Materialization(
        event_time_col="event_date",
        offline=True,
        offline_materialization=OfflineMaterialization(
            store=OfflineStore(
                kind=OfflineStoreEnum.ICEBERG,
                value=IcebergStoreOutput(
                    catalog="lakekeeper",
                    namespace="prod",
                    table="customer_features_append",
                    mode="append"  # Append mode
                )
            )
        )
    )
)

fg_append.set_dataframe(features_df).set_features()
result1 = fg_append.write(feature_start_time=datetime(2024, 1, 1))

fg_append.set_dataframe(new_data).set_features()
result2 = fg_append.write(feature_start_time=datetime(2024, 1, 3))

print(f"After first write: {result1['num_rows']} rows")
print(f"After append: {result2['num_rows']} rows (should be >= {result1['num_rows']})")
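
The difference between the two write modes can be modelled without a catalog. The toy class below is a conceptual sketch, not Seeknal or Iceberg code. `ToyTable` is hypothetical, and it assumes only that every write commits a new snapshot, that append keeps the previously visible rows, and that overwrite replaces them.

```python
class ToyTable:
    """Conceptual model: each write commits a new immutable snapshot of visible rows."""

    def __init__(self):
        self.snapshots = []  # snapshots[i] holds the rows visible after write i

    def write(self, rows, mode="append"):
        current = self.snapshots[-1] if self.snapshots else ()
        if mode == "append":
            visible = current + tuple(rows)
        elif mode == "overwrite":
            visible = tuple(rows)
        else:
            raise ValueError(f"Unknown mode: {mode!r}")
        self.snapshots.append(visible)
        return len(visible)  # rows visible after this write


table = ToyTable()
print(table.write(["A001", "A002", "A003"]))         # 3
print(table.write(["A004", "A005"], mode="append"))  # 5
print(table.write(["A006"], mode="overwrite"))       # 1
print(len(table.snapshots[1]))                       # 5, the earlier snapshot is still readable
```

Because earlier snapshots are kept, an overwrite in this model hides rows from the current view without making them unreachable through time travel.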

## Using Iceberg Features in ML Pipelines

In [None]:
# Query features for ML training
training_data = con.execute("""
SELECT 
    customer_id,
    total_orders,
    total_spend,
    avg_order_value,
    days_since_last_order
FROM seeknal_catalog.prod.customer_features
WHERE event_date >= '2024-01-01'
""").fetchdf()

print("\nTraining data shape:", training_data.shape)
print("\nTraining data preview:")
print(training_data.head())

In [None]:
# Use features with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a simple classification target
training_data['high_value'] = (training_data['total_spend'] > 100).astype(int)

# Split data
X = training_data[['total_orders', 'avg_order_value', 'days_since_last_order']]
y = training_data['high_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\nModel accuracy: {accuracy:.2%}")

## Advanced: Schema Evolution

Iceberg allows schema evolution without rewriting data. Let's add a new feature:

In [None]:
# Create data with a new column
enhanced_data = pd.DataFrame({
    "customer_id": ["A001", "A002"],
    "event_date": ["2024-01-04", "2024-01-04"],
    "total_orders": [8, 12],
    "total_spend": [200.0, 300.0],
    "avg_order_value": [25.0, 25.0],
    "days_since_last_order": [0, 0],
    "customer_segment": ["VIP", "Regular"],  # NEW COLUMN!
})

print("Data with new 'customer_segment' column:")
print(enhanced_data)

In [None]:
# Write with schema evolution
# Note: In production, handle schema compatibility first
print("\nSchema evolution with Iceberg:")
print("  - New columns can be added as a metadata-only change")
print("  - Existing rows read as NULL for new columns")
print("  - No table rewrite required")
print("\nThis feature is powerful for:")
print("  - Adding new features over time")
print("  - A/B testing different feature sets")
print("  - Gradual schema migration")
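
One way to act on the compatibility note is to compare incoming columns against the table's current columns before writing. This is a minimal sketch, assuming column names are available as plain lists. `diff_columns` is a hypothetical helper, not a Seeknal function.

```python
def diff_columns(existing, incoming):
    """Return (added, missing) column names, preserving input order."""
    existing_set, incoming_set = set(existing), set(incoming)
    added = [c for c in incoming if c not in existing_set]
    missing = [c for c in existing if c not in incoming_set]
    return added, missing


table_columns = [
    "customer_id", "event_date", "total_orders",
    "total_spend", "avg_order_value", "days_since_last_order",
]
incoming_columns = table_columns + ["customer_segment"]

added, missing = diff_columns(table_columns, incoming_columns)
print("Added:", added)      # ['customer_segment']
print("Missing:", missing)  # []
```

Added columns are the additive case. A non-empty `missing` list deserves a closer look before writing, since it means the incoming data no longer carries a column the table expects.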

## Performance Considerations

In [None]:
# Performance tips for Iceberg Feature Groups:

tips = """
1. PARTITIONING: Partition by date for efficient time-based queries
   - Example: PARTITION BY days(event_date)

2. ZORDERING: Sort by frequently filtered columns
   - Example: ZORDER BY customer_id

3. SNAPSHOTS: Manage snapshot retention
   - Old snapshots can be expired to save metadata space

4. FILE SIZE: Aim for 128-256MB parquet files
   - Too small: metadata overhead
   - Too large: slow reads

5. CACHING: DuckDB caches Iceberg metadata
   - First query is slower, subsequent queries are fast
"""

print(tips)
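
Tip 1 relies on Iceberg's `day` partition transform, which the Iceberg specification defines as the number of days since 1970-01-01. The sketch below reproduces that arithmetic in plain Python to show which sample rows would share a partition. `day_partition` is a hypothetical helper for illustration and is not how Seeknal or Iceberg is invoked.

```python
from collections import defaultdict
from datetime import date


def day_partition(event_date):
    """Days since the Unix epoch for an ISO date string (YYYY-MM-DD)."""
    return (date.fromisoformat(event_date) - date(1970, 1, 1)).days


# event_date values from the sample features used earlier in this notebook
event_dates = ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]

rows_per_partition = defaultdict(int)
for d in event_dates:
    rows_per_partition[day_partition(d)] += 1

print(dict(rows_per_partition))  # {19723: 3, 19724: 2}
```

A filter such as `event_date >= '2024-01-02'` then only needs the files in the second partition.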

## Summary

In [None]:
summary = """
Iceberg Feature Groups provide:

✓ ACID transactions for reliable feature writes
✓ Time travel for debugging and rollback
✓ Schema evolution without data rewrites
✓ Cloud-native storage with S3/GCS/Azure support
✓ Compatible with DuckDB, Spark, Trino, and more
✓ Point-in-time joins prevent data leakage

Key Benefits:
- Production-ready storage for feature groups
- Team collaboration through shared catalog
- Audit trail through snapshot history
- Easy integration with existing ML pipelines
"""

print(summary)
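
A point-in-time join attaches to each label only the feature row that was already known at the label's timestamp. The sketch below shows the idea in plain Python using values from the sample data. `point_in_time_lookup` is a hypothetical helper and does not reflect how Seeknal implements the join.

```python
from bisect import bisect_right
from datetime import date

# total_spend history per customer, sorted by event date (values from the sample data)
feature_history = {
    "A001": [(date(2024, 1, 1), 100.0), (date(2024, 1, 2), 50.0)],
    "A002": [(date(2024, 1, 1), 250.0), (date(2024, 1, 2), 125.0)],
}


def point_in_time_lookup(customer_id, as_of):
    """Return the latest total_spend recorded on or before `as_of`, or None."""
    history = feature_history.get(customer_id, [])
    dates = [d for d, _ in history]
    # bisect_right counts the entries dated on or before as_of
    idx = bisect_right(dates, as_of)
    return history[idx - 1][1] if idx else None


print(point_in_time_lookup("A001", date(2024, 1, 1)))    # 100.0
print(point_in_time_lookup("A001", date(2024, 1, 5)))    # 50.0
print(point_in_time_lookup("A002", date(2023, 12, 31)))  # None
```

The lookup never returns a value dated after `as_of`, which is what keeps future information out of the training rows.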

## Next Steps

- **Production Setup**: Deploy Lakekeeper and configure S3/GCS storage
- **Feature Store Docs**: See `docs/getting-started-comprehensive.md`
- **API Reference**: Check `docs/api/` for detailed API docs
- **YAML Pipelines**: Use Iceberg in workflow YAML definitions

For issues or questions:
- GitHub: https://github.com/mta-tech/seeknal/issues