# 2. Parquet to Iceberg

This notebook reads a Parquet dataset from the `grupo-2` bucket in MinIO and writes it back to MinIO as an Apache Iceberg table (the table location used below is `s3://grupo-2/`). It uses `dlt`, PyArrow, and PyIceberg, and registers the table in a Nessie catalog so that it can be queried and versioned. Running it requires access to MinIO and the packages installed in the first cell. The notebook:

* Connects to the MinIO S3 API on port 9000 with credentials read from `.dlt/secrets.toml`.
* Reads Parquet files from a specified bucket path (e.g., `s3://grupo-2/grupo_2_parquet/df_data`).

In [1]:
%pip install pandas pyarrow fsspec dlt[filesystem] s3fs adlfs pyiceberg[s3fs,sql-sqlite] toml

Note: you may need to restart the kernel to use updated packages.


In [2]:
# General utilities
import os
import toml
import logging
from typing import Optional

# Data manipulation
import pandas as pd

# dlt: Reading from filesystem
import dlt
from dlt.sources.filesystem import filesystem, read_parquet

# PyArrow: Reading and Conversion
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs as fs

# PyIceberg
import pyiceberg
from pyiceberg.catalog import load_catalog
from pyiceberg.table import Table
from pyiceberg.schema import Schema, NestedField
from pyiceberg.types import (
    BooleanType, IntegerType, LongType, FloatType, DoubleType,
    StringType, TimestampType, DateType
)

In [3]:
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger("parquet_to_iceberg")

In [4]:
# Configure the pipeline
try:
    pipeline = dlt.pipeline(
        pipeline_name="sources",  
        destination="filesystem",
    )
    logger.info(f"Pipeline configured successfully with name: {pipeline.pipeline_name}")
except Exception as e:
    logger.error(f"Error configuring pipeline: {str(e)}")
    raise

2025-09-09 08:19:49,739 - INFO - Pipeline configured successfully with name: sources


In [5]:
# Load the configuration file
config = toml.load("/home/jovyan/work/.dlt/secrets.toml")

# Extract the credentials
creds = config["parquet_to_minio"]["destination"]["filesystem"]["credentials"]

# Export them as environment variables
os.environ["AWS_ACCESS_KEY_ID"] = creds["aws_access_key_id"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["aws_secret_access_key"]
os.environ["AWS_ENDPOINT_URL"] = creds.get("endpoint_url", "")
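
The cell above writes an empty string to `AWS_ENDPOINT_URL` when the secrets file has no `endpoint_url` key. A minimal sketch of an alternative, assuming the same three credential keys, is to build the mapping first and leave the endpoint out when it is missing. The helper name `credentials_to_env` and the example values (including the `minio` hostname) are hypothetical placeholders, not part of the original notebook.

```python
from typing import Dict, Mapping


def credentials_to_env(creds: Mapping[str, str]) -> Dict[str, str]:
    """Map dlt-style filesystem credentials to AWS environment variable names."""
    env = {
        "AWS_ACCESS_KEY_ID": creds["aws_access_key_id"],
        "AWS_SECRET_ACCESS_KEY": creds["aws_secret_access_key"],
    }
    # Only set the endpoint when one is configured, instead of exporting "".
    endpoint = creds.get("endpoint_url")
    if endpoint:
        env["AWS_ENDPOINT_URL"] = endpoint
    return env


# Placeholder credentials for illustration only.
example_creds = {
    "aws_access_key_id": "example-access-key",
    "aws_secret_access_key": "example-secret-key",
    "endpoint_url": "http://minio:9000",
}
env = credentials_to_env(example_creds)
print(sorted(env))
```

In the notebook this could be applied with `os.environ.update(credentials_to_env(creds))`.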

In [6]:
# If running locally, use the path to the directory where dlt saved the Parquet files
dataset = ds.dataset(
    source="s3://grupo-2/grupo_2_parquet/df_data",  # Adjust to your path
    format="parquet"
)

# Convert to an Arrow Table
table = dataset.to_table()

In [7]:
# Show schema
print(table.schema)

vendor_id: int32
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
ratecode_id: double
store_and_fwd_flag: string
pu_location_id: int32
do_location_id: int32
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
cbd_congestion_fee: double
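
When this Arrow schema is later passed to `catalog.create_table`, PyIceberg converts it to an Iceberg schema itself. The lookup below is only an illustration of how the Arrow type names printed above correspond to Iceberg primitive type names (for example, `int` is what PyIceberg models as `IntegerType` and `long` as `LongType`); it is not PyIceberg's implementation, and the names `ARROW_TO_ICEBERG` and `iceberg_type_for` are hypothetical.

```python
# Arrow type names (as printed by `table.schema`) mapped to Iceberg primitive type names.
ARROW_TO_ICEBERG = {
    "bool": "boolean",
    "int32": "int",
    "int64": "long",
    "float": "float",
    "double": "double",
    "string": "string",
    "date32[day]": "date",
    "timestamp[us]": "timestamp",  # no time zone; zoned timestamps are "timestamptz"
}


def iceberg_type_for(arrow_type: str) -> str:
    """Return the Iceberg type name for an Arrow type name, or raise ValueError."""
    try:
        return ARROW_TO_ICEBERG[arrow_type]
    except KeyError:
        raise ValueError(f"No mapping defined for Arrow type {arrow_type!r}") from None


# A few columns copied from the schema printed above.
sample_columns = {
    "vendor_id": "int32",
    "tpep_pickup_datetime": "timestamp[us]",
    "passenger_count": "double",
    "store_and_fwd_flag": "string",
    "payment_type": "int64",
}
mapped = {name: iceberg_type_for(t) for name, t in sample_columns.items()}
for name, iceberg_type in mapped.items():
    print(f"{name}: {iceberg_type}")
```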


In [8]:
catalog = load_catalog(
    "nessie",
    uri="http://nessie:19120/iceberg/",
    type="rest"
)

namespaces = catalog.list_namespaces()
print("Namespaces:", namespaces)

Namespaces: []


In [9]:
catalog.create_namespace("proyecto")

In [10]:
namespaces = catalog.list_namespaces()
print("Namespaces:", namespaces)

Namespaces: [('proyecto',)]
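
Re-running the notebook would call `create_namespace` on a namespace that already exists. A minimal sketch of a guard, using only the two catalog calls shown above, checks the listing first. The helper name `ensure_namespace` is hypothetical, and `FakeCatalog` is an in-memory stand-in used only to make the example runnable without a Nessie server; it is not a PyIceberg class.

```python
def ensure_namespace(catalog, name: str) -> bool:
    """Create `name` if the catalog does not list it yet.

    Returns True when the namespace was created, False when it already existed.
    Namespaces are listed as tuples, e.g. ('proyecto',), as in the output above.
    """
    if (name,) in catalog.list_namespaces():
        return False
    catalog.create_namespace(name)
    return True


class FakeCatalog:
    """In-memory stand-in that mimics the two catalog methods used here."""

    def __init__(self):
        self._namespaces = []

    def list_namespaces(self):
        return list(self._namespaces)

    def create_namespace(self, name):
        self._namespaces.append((name,))


fake = FakeCatalog()
first_call = ensure_namespace(fake, "proyecto")   # namespace is created
second_call = ensure_namespace(fake, "proyecto")  # namespace already exists
print(first_call, second_call, fake.list_namespaces())
```

With the real catalog, the call would be `ensure_namespace(catalog, "proyecto")`.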


In [11]:
table_schema = dataset.schema

In [12]:
try:
    catalog.create_table(
        "proyecto.grupo2",
        schema=table_schema,
        location="s3://grupo-2/"
    )
    logger.info("Table 'proyecto.grupo2' successfully created at location 's3://grupo-2/'.")
except Exception as e:
    logger.exception("Unexpected error during table creation.")

2025-09-09 08:19:51,482 - INFO - Table 'proyecto.grupo2' successfully created at location 's3://grupo-2/'.


In [13]:
try:
    dataset = catalog.load_table("proyecto.grupo2")
    logger.info("Table 'proyecto.grupo2' loaded successfully.")
except Exception as e:
    logger.exception("Unexpected error while loading 'proyecto.grupo2'.")

2025-09-09 08:19:51,594 - INFO - Table 'proyecto.grupo2' loaded successfully.


In [14]:
try:
    dataset.append(table)
    logger.info("Data successfully appended to 'proyecto.grupo2'.")
except Exception as e:
    logger.exception("Unexpected error during append operation.")

2025-09-09 08:19:53,203 - INFO - Data successfully appended to 'proyecto.grupo2'.
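
The log line confirms that the append call returned, but not how many rows are now in the table. A small sanity check, sketched below, compares the row count read back from the Iceberg table with the Arrow table that was appended. It assumes the Iceberg table was empty before the append (as in this run, where it was just created) and that reading it back fully into memory is acceptable. The helper name `appended_row_count_matches` is hypothetical, and the `SimpleNamespace` objects are stand-ins so the example runs without a catalog.

```python
from types import SimpleNamespace


def appended_row_count_matches(iceberg_table, arrow_table) -> bool:
    """Compare rows read back from the Iceberg table with the appended Arrow table.

    Relies on `iceberg_table.scan().to_arrow()` returning an Arrow table and on
    both Arrow tables exposing `num_rows`.
    """
    stored = iceberg_table.scan().to_arrow()
    return stored.num_rows == arrow_table.num_rows


# Stand-ins that mimic only the attributes the helper touches.
fake_arrow = SimpleNamespace(num_rows=3)
fake_iceberg = SimpleNamespace(
    scan=lambda: SimpleNamespace(to_arrow=lambda: SimpleNamespace(num_rows=3))
)
matches = appended_row_count_matches(fake_iceberg, fake_arrow)
print(matches)
```

In this notebook the Iceberg table is bound to `dataset` and the Arrow table to `table`, so the check would be `appended_row_count_matches(dataset, table)`.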


In [15]:
print(catalog.list_tables("proyecto"))

[('proyecto', 'grupo2')]
