# Working with Iceberg Tables and DuckDB



### Why Pairing DuckDB + Iceberg Tables in Python Is a Smart Choice for Testing Structured Data

**1. Local + Lightweight = Fast Prototyping**
DuckDB runs in-process with zero setup. Iceberg lets you version structured data. Together, they let you test data lake behaviors *locally* without spinning up Spark clusters or managing Hive metastores.

**2. Iceberg Handles Table Semantics**
Iceberg brings schema evolution, partitioning, snapshotting, and data versioning to flat files. It treats a directory of data files (Parquet, Avro, or ORC) like a database table, which makes it well suited to simulating real-world data lake operations.

**3. DuckDB Understands Iceberg**
DuckDB reads Iceberg tables through its `iceberg` extension, which means you can write SQL to:

* Explore metadata
* Query snapshots
* Validate schema changes
* Run full SELECTs on local Parquet/Iceberg tables
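
As a rough sketch of what those queries can look like, the snippet below builds SQL around three table functions from DuckDB's `iceberg` extension: `iceberg_snapshots`, `iceberg_metadata`, and `iceberg_scan`. The helper names `iceberg_queries` and `explore_table` are made up for illustration. One assumption to check against your setup: DuckDB may need the path to a specific `*.metadata.json` file rather than the table directory when the table has no `version-hint.text` file.

```python
def iceberg_queries(table_location: str) -> dict[str, str]:
    """Build the SQL used to explore an Iceberg table with DuckDB.

    `table_location` is a table directory or a metadata JSON file path.
    """
    loc = table_location.replace("'", "''")  # escape single quotes for SQL
    return {
        # One row per snapshot (snapshot id, timestamp, manifest list).
        "snapshots": f"SELECT * FROM iceberg_snapshots('{loc}')",
        # One row per data file tracked by the current snapshot.
        "metadata": f"SELECT * FROM iceberg_metadata('{loc}')",
        # The table contents themselves.
        "rows": f"SELECT * FROM iceberg_scan('{loc}') LIMIT 10",
    }


def explore_table(table_location: str) -> dict[str, list]:
    """Run each query against an in-memory DuckDB connection."""
    import duckdb  # imported here so the query builder works without DuckDB

    con = duckdb.connect()
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    return {
        name: con.execute(sql).fetchall()
        for name, sql in iceberg_queries(table_location).items()
    }


# Example call once a table exists on disk:
# explore_table("data/blue_loakehouse/iceberg-tables/fraud_data")
```

Keeping the query strings separate from execution makes it easy to print and inspect the SQL before running it against a real table.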

**4. Python = Friendly Glue**
Using Python lets you:

* Generate test data with Faker
* Write schemas with PyIceberg
* Query & validate with DuckDB
* Build a reproducible and modular testing framework for ML, ETL, or fraud logic
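
To make the first step concrete, here is a minimal sketch of a reproducible mock-data generator. It deliberately uses the standard library's `random` module as a stand-in for Faker so it runs without extra dependencies; swapping in Faker calls for the description text would be a natural next step. The function name `make_mock_events` and the sample descriptions are invented for illustration, and the column names follow the `id`, `event_time`, and `description` fields used for the fraud table later in this notebook.

```python
import random
from datetime import datetime, timedelta

# Hypothetical sample values; replace with Faker output for richer text.
DESCRIPTIONS = [
    "card present purchase",
    "online purchase",
    "ATM withdrawal",
    "refund",
    "wire transfer",
]


def make_mock_events(n, seed=0, start=datetime(2024, 1, 1)):
    """Return n mock event rows as dicts, deterministic for a given seed."""
    rng = random.Random(seed)  # private RNG so global random state is untouched
    window_seconds = 30 * 24 * 3600  # spread events over a 30-day window
    rows = []
    for i in range(1, n + 1):
        rows.append(
            {
                "id": i,
                "event_time": start + timedelta(seconds=rng.randrange(window_seconds)),
                "description": rng.choice(DESCRIPTIONS),
            }
        )
    return rows


rows = make_mock_events(5, seed=42)
print(len(rows), rows[0]["id"])
```

Seeding the generator means the same rows come back on every run, which keeps later validation queries stable.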

You're not just testing raw data — you're simulating schema changes, ingesting mock data, validating logic, and building toward a lakehouse pipeline. All while staying in your local dev loop.

---


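## Install:
`pip install "pyiceberg[sql-sqlite,pyarrow]" duckdb`

The extras pull in SQLAlchemy (for a local SQLite catalog) and PyArrow (for reading and writing data files).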

## Prepare Your Directory:

In [1]:
import os

base_path = "data/blue_loakehouse/iceberg-tables"
table_name = "fraud_data"
table_path = os.path.join(base_path, table_name)

os.makedirs(table_path, exist_ok=True)


## Create a Table Using PyIceberg (File-based Catalog)

In [None]:
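from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType, TimestampType

# Define schema. PyIceberg builds a Schema from NestedField objects, each with
# a unique field ID. Identifier fields must be marked as required.
schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="event_time", field_type=TimestampType(), required=False),
    NestedField(field_id=3, name="description", field_type=StringType(), required=False),
    identifier_field_ids=[1],
)

# Set up catalog. PyIceberg has no plain "file" catalog type, so this uses the
# SQLite-backed SQL catalog (requires `pip install "pyiceberg[sql-sqlite,pyarrow]"`).
# The catalog name "local" and the file name catalog.db are arbitrary choices.
warehouse_path = os.path.abspath(base_path)
catalog = load_catalog(
    "local",
    type="sql",
    uri=f"sqlite:///{warehouse_path}/catalog.db",
    warehouse=f"file://{warehouse_path}",
)

# Tables live inside a namespace; "default" is an arbitrary name chosen here.
# Re-running this cell raises an error because the namespace already exists.
namespace = "default"
catalog.create_namespace(namespace)

# Create table
table = catalog.create_table(
    identifier=f"{namespace}.{table_name}",
    schema=schema,
    location=f"file://{os.path.abspath(table_path)}",
)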
