![iceberg-logo](https://www.apache.org/logos/res/iceberg/iceberg.png)

### [Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!](https://tabular.io/blog/docker-spark-and-iceberg/)

In [1]:
from pyiceberg import __version__

__version__

'0.9.1'

# Write support

This notebook demonstrates writing to Iceberg tables using PyIceberg. First, connect to the [catalog](https://iceberg.apache.org/concepts/catalog/#iceberg-catalogs), which tracks the tables and points to each table's current metadata.

In [2]:
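import os

# Settings for the catalog named "default". The MinIO address is the S3 endpoint,
# so it belongs in S3__ENDPOINT; the catalog URI is a separate property.
os.environ["PYICEBERG_CATALOG__DEFAULT__S3__ENDPOINT"] = "http://minio:9000"
os.environ["PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID"] = "admin"
os.environ["PYICEBERG_CATALOG__DEFAULT__S3__SECRET_ACCESS_KEY"] = "password"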
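The environment variable names follow a pattern documented by PyIceberg: double underscores separate the levels (`CATALOG`, the catalog name, then the property), and a single underscore inside a property part becomes a dash. The sketch below only illustrates that naming convention; `env_to_property` is a hypothetical helper written for this notebook, not a PyIceberg function.

```python
def env_to_property(var: str) -> tuple[str, str]:
    """Hypothetical helper: map a PYICEBERG_CATALOG__<NAME>__<KEY> variable
    to (catalog name, property key), following the documented naming pattern."""
    prefix = "PYICEBERG_CATALOG__"
    if not var.startswith(prefix):
        raise ValueError(f"Not a PyIceberg catalog variable: {var}")
    # Double underscores separate the catalog name and the nested key parts
    name, *parts = var[len(prefix):].split("__")
    # Within a part, a single underscore corresponds to a dash in the property key
    key = ".".join(part.lower().replace("_", "-") for part in parts)
    return name.lower(), key


print(env_to_property("PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID"))
print(env_to_property("PYICEBERG_CATALOG__DEFAULT__URI"))
```

Read this way, `PYICEBERG_CATALOG__DEFAULT__URI` sets `uri` on the catalog named `default`, which is why the S3 settings and the catalog URI need separate variables.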


In [7]:
from pyiceberg.catalog import load_catalog

# "sql" catalogs need a SQLAlchemy database URI; the SQLite path and warehouse bucket are assumptions.
catalog = load_catalog(
    "default",
    type="sql",
    uri="sqlite:////tmp/pyiceberg_catalog.db",
    warehouse="s3://warehouse/",
    **{"s3.endpoint": "http://minio:9000"},
)

# Loading data using Arrow

PyArrow loads a Parquet file into memory as a `pyarrow.Table`, and PyIceberg can then write that table to an Iceberg table.

In [3]:
import pyarrow.parquet as pq

df = pq.read_table("/home/iceberg/data/yellow_tripdata_2022-01.parquet")

df


# Create an Iceberg table

Next, create the Iceberg table directly from the schema of the `pyarrow.Table`.

In [4]:
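from pyiceberg.exceptions import NoSuchTableError

table_name = "default.taxi_dataset"

# Make sure the namespace exists, then drop any table left over from a previous run
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table(table_name)
except NoSuchTableError:
    pass

table = catalog.create_table(table_name, schema=df.schema)

table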

# Write the data

Let's append the data to the table. Appending and overwriting are equivalent here because the table is still empty. Afterwards, we scan the table to confirm that the data is there.

In [None]:
table.append(df)  # or table.overwrite(df)

assert len(table.scan().to_arrow()) == len(df)

table.scan().to_arrow()

In [None]:
str(table.current_snapshot())

# Append data

Let's append another month of data to the table.

In [None]:
df = pq.read_table("/home/iceberg/data/yellow_tripdata_2022-02.parquet")
table.append(df)

In [None]:
str(table.current_snapshot())

# Feature generation

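Suppose we want to train a model to determine which features contribute to the tip amount. `tip_per_mile`, the tip amount divided by the trip distance, is a good target to train the model on. Writing the data with this new column fails at first, because the table schema does not contain it yet, so we need to evolve the schema before writing.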

In [None]:
import pyarrow.compute as pc

df = table.scan().to_arrow()
df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))

try:
    table.overwrite(df)
except ValueError as e:
    print(f"Error: {e}")

In [None]:
with table.update_schema() as upd:
    upd.union_by_name(df.schema)

print(table.schema())

In [None]:
table.overwrite(df)

table