# 2. Parquet to Iceberg

This notebook reads a Parquet file from the `grupo-2` bucket in MinIO and writes it to another bucket in the Apache Iceberg table format. The workflow uses `dlt` and the PyIceberg library, with the data managed as a table registered in the Nessie catalog for querying and versioning. It requires access to MinIO and an installed Iceberg library. The notebook:

* Uses the MinIO API on port 9000 with credentials read from `.dlt/secrets.toml`.
* Reads Parquet files from a specified bucket (e.g., `s3://grupo-2/grupo_2_parquet/df_data`).
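
When `.dlt/secrets.toml` is not available in the notebook container, the same settings can be supplied through environment variables, which `dlt` resolves using its `SECTION__KEY` naming convention. The following is a minimal sketch, not the notebook's actual configuration: the `minio` hostname, the placeholder credentials, the `s3://grupo-2` destination bucket URL, and the helper name `dlt_minio_env` are all assumptions made for illustration.

```python
import os


def dlt_minio_env(endpoint_url, access_key, secret_key, bucket_url):
    """Build dlt-style environment variables for a MinIO-backed filesystem.

    Hypothetical helper: it only assembles the variable names that dlt
    looks up for the filesystem source and the filesystem destination.
    """
    env = {"DESTINATION__FILESYSTEM__BUCKET_URL": bucket_url}
    for section in ("SOURCES__FILESYSTEM", "DESTINATION__FILESYSTEM"):
        prefix = f"{section}__CREDENTIALS__"
        env[prefix + "AWS_ACCESS_KEY_ID"] = access_key
        env[prefix + "AWS_SECRET_ACCESS_KEY"] = secret_key
        env[prefix + "ENDPOINT_URL"] = endpoint_url
    return env


# Assumed values: replace the hostname and the placeholders with your own.
env = dlt_minio_env(
    endpoint_url="http://minio:9000",
    access_key="<access-key>",
    secret_key="<secret-key>",
    bucket_url="s3://grupo-2",
)
os.environ.update(env)  # must run before dlt.pipeline(...) is created
```

Setting the variables before the pipeline is created keeps credentials out of the notebook cells that are shared or exported.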

In [1]:
!pip install pandas pyarrow fsspec dlt[filesystem] s3fs adlfs pyiceberg[rest-catalog]



In [None]:
from pyiceberg.catalog import load_catalog

# Connect to the Nessie catalog
catalog = load_catalog(
    "nessie",
    **{
        "uri": "http://nessie:19120",
    }
)

namespaces = catalog.list_namespaces()
print("Namespaces:", namespaces)
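
Because the table data lives in MinIO rather than AWS S3, the catalog usually also needs to know where the object store is. The sketch below assembles the properties that could be passed to `load_catalog`; the `s3.*` keys are PyIceberg's S3 FileIO settings, the catalog URI is the one used in this notebook, and the `minio` hostname, the placeholder credentials, and the helper name `nessie_catalog_properties` are assumptions.

```python
def nessie_catalog_properties(catalog_uri, s3_endpoint, access_key, secret_key):
    """Return a properties dict for a REST-style catalog backed by MinIO.

    Hypothetical helper: it only builds the dictionary and performs no I/O.
    """
    return {
        "type": "rest",                      # catalog implementation to use
        "uri": catalog_uri,                  # catalog endpoint from this notebook
        "s3.endpoint": s3_endpoint,          # MinIO API endpoint (port 9000)
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


# Assumed values: replace the hostname and the placeholders with your own.
props = nessie_catalog_properties(
    catalog_uri="http://nessie:19120",
    s3_endpoint="http://minio:9000",
    access_key="<access-key>",
    secret_key="<secret-key>",
)

# With PyIceberg installed, the dict would be unpacked into the call:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("nessie", **props)
```

Keeping the properties in one dictionary makes it easy to reuse the same settings in every cell that opens the catalog.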

In [None]:
catalog.create_table(
    "demo.post_2020",
    schema=post_2020.schema,
    location="s3://my-bucket/post_2020",
    properties={"write.format.default": "parquet"}
)

post_2020_table = catalog.load_table("demo.post_2020")
post_2020_table.overwrite(post_2020)


In [None]:
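pre_2020_table = catalog.load_table("demo.pre_2020")
# PyIceberg tables expose `upsert` rather than a SQL-style merge: rows that
# match on `Id` are updated and the remaining rows are inserted.
# Assumes PyIceberg >= 0.9 and that `pre_2020` is a PyArrow table.
pre_2020_table.upsert(pre_2020, join_cols=["Id"])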

In [None]:
table.append(df)  # Append the Arrow table to the Iceberg table (writes the underlying Parquet files)
len(table.scan().to_arrow())  # Scan the table, convert to Arrow, count rows
arrow_table = table.scan().to_arrow()  # Scan everything
arrow_table.to_pandas()  # Convert the full table to Pandas
arrow_table.to_pandas().head()  # Only the first rows

In [2]:
import pandas as pd
import dlt
from dlt.sources.filesystem import filesystem, readers, read_parquet
import pyarrow.parquet as pq
import pyarrow as pa
import fsspec
from pyiceberg.catalog import Catalog
from pyiceberg.io import load_file_io
from pyiceberg.schema import Schema
from pyiceberg.typedef import Properties
import logging
from typing import Optional

In [3]:
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger("parquet_to_iceberg")

In [4]:
# Configure the pipeline
try:
    pipeline = dlt.pipeline(
        pipeline_name="sources",  
        destination="filesystem",
        dataset_name="grupo_2_parquet"
    )
    logger.info(f"Pipeline configured successfully with name: {pipeline.pipeline_name}")
except Exception as e:
    logger.error(f"Error configuring pipeline: {str(e)}")
    raise

2025-09-08 04:50:50,397 - INFO - Pipeline configured successfully with name: sources


In [6]:
@dlt.resource(name="grupo_2_data", write_disposition="replace")
def minio_resource():
    fs = filesystem(
        bucket_url="s3://grupo-2/grupo_2_parquet/",
        file_glob="df_data/*.parquet"
    )
    for file in fs:
        yield file

In [None]:
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.table import Table

# Load the Nessie catalog
catalog = load_catalog("nessie", uri="http://nessie:19120")

# Create the Iceberg table if it does not exist
schema = Schema(...)  # Define your schema here
table = catalog.create_table_if_not_exists(
    identifier="grupo_2.iceberg_table",
    schema=schema,
    location="s3://grupo-2-iceberg/"
)

# Insert data (requires converting the dlt output to a compatible format)
# This depends on the engine: Spark, Pandas, Arrow, etc.

In [7]:
load_info = pipeline.run(minio_resource)
print(f"✅ Load completed: {load_info}")


  - encoding

Unless type hints are provided, these columns will not be materialized in the destination.
One way to provide type hints is to use the 'columns' argument in the '@dlt.resource' decorator.  For example:

@dlt.resource(columns={'encoding': {'data_type': 'text'}})



✅ Load completed: Pipeline sources load step completed in 0.22 seconds
1 load package(s) were loaded to destination filesystem and into dataset grupo_2_parquet
The filesystem destination used s3://grupo-2/grupo_2_parquet/df_data location to store data
Load package 1757307060.9524126 is LOADED and contains no failed jobs
