# Getting started with PyIceberg

PyIceberg is a Python implementation for accessing Iceberg tables without needing a JVM.

## Installation

In [1]:
%pip install --upgrade pip

Collecting pip
  Using cached pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Using cached pip-25.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.0
    Uninstalling pip-24.0:
      Successfully uninstalled pip-24.0
Successfully installed pip-25.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install "pyiceberg[s3fs,hive]"
%pip install "pyiceberg[sql-sqlite]"
%pip install pyarrow

Collecting pyiceberg[hive,s3fs]
  Using cached pyiceberg-0.10.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (4.9 kB)
Collecting cachetools<7.0,>=5.5 (from pyiceberg[hive,s3fs])
  Using cached cachetools-6.2.0-py3-none-any.whl.metadata (5.4 kB)
Collecting click<9.0.0,>=7.1.1 (from pyiceberg[hive,s3fs])
  Using cached click-8.3.0-py3-none-any.whl.metadata (2.6 kB)
Collecting fsspec>=2023.1.0 (from pyiceberg[hive,s3fs])
  Using cached fsspec-2025.9.0-py3-none-any.whl.metadata (10 kB)
Collecting mmh3<6.0.0,>=4.0.0 (from pyiceberg[hive,s3fs])
  Using cached mmh3-5.2.0-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.metadata (14 kB)
Collecting pydantic!=2.4.0,!=2.4.1,<3.0,>=2.0 (from pyiceberg[hive,s3fs])
  Using cached pydantic-2.11.9-py3-none-any.whl.metadata (68 kB)
Collecting pyparsing<4.0.0,>=3.1.0 (from pyiceberg[hive,s3fs])
  Using cached pyparsing-3.2.4-py3-none-any.whl.metadata (5.0 kB)
Collecting pyroaring<2

## Connecting to a catalog

Create a temporary location for Iceberg:

```bash
mkdir -p /tmp/warehouse
```


Set up the catalog:

In [3]:
from pyiceberg.catalog import load_catalog

warehouse_path = "/tmp/warehouse"
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
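To check that the catalog database was created, you can open the SQLite file directly. The following is a minimal sketch using only the standard library; the helper name `list_sqlite_tables` is hypothetical, and it assumes the catalog has already initialized the file at the `uri` configured above. It makes no assumption about which tables the catalog creates and simply lists whatever is there.

```python
import sqlite3
from pathlib import Path


def list_sqlite_tables(db_path: str) -> list[str]:
    """Return the names of all tables stored in a SQLite database file."""
    # sqlite3.connect() would silently create a missing file, so check first.
    if not Path(db_path).is_file():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Path taken from the catalog configuration above.
    db_file = "/tmp/warehouse/pyiceberg_catalog.db"
    if Path(db_file).is_file():
        print(list_sqlite_tables(db_file))
```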

## Write a PyArrow dataframe

First, download one month of taxi trip data (January 2023):

```bash
curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
```
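If you would rather stay inside the notebook than shell out to `curl`, the download can also be done from Python. This is a sketch using only the standard library; `download_file` is a hypothetical helper, not part of PyIceberg.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str) -> Path:
    """Stream the resource at `url` to `destination` and return the local path."""
    target = Path(destination)
    # Create the parent directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so the whole file is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

For the file used here, the call would be `download_file("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet", "/tmp/yellow_tripdata_2023-01.parquet")`.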

Load it into a PyArrow dataframe (a `pyarrow.Table`):

In [4]:
import pyarrow.parquet as pq

df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")

Create a new Iceberg table:

In [5]:
catalog.create_namespace("default")

table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)

Append the dataframe to the table:

In [6]:
table.append(df)
len(table.scan().to_arrow())

3066766

Now derive a tip-per-mile feature, for example to train a model on:

In [7]:
import pyarrow.compute as pc

df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))

Evolve the schema of the table with the new column:

In [8]:
with table.update_schema() as update_schema:
    update_schema.union_by_name(df.schema)

Now write the new dataframe to the Iceberg table. Note that `overwrite` replaces the rows already in the table rather than adding to them:

In [None]:
table.overwrite(df)

In [11]:
print(table.scan().to_arrow())

pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
tip_per_mile: double
----
VendorID: [[2,2,2,1,2,...,2,2,1,1,1],[1,2,2,2,2,...,1,1,1,2,2],...,[2,2,2,2,2,...,2,2,2,2,2],[2,2,2,2,2,...,2,2,2,2,2]]
tpep_pickup_datetime: [[2023-01-01 00:32:10.000000,2023-01-01 00:55:08.000000,2023-01-01 00:25:04.000000,2023-01-01 00:03:48.000000,2023-01-01 00:10:29.000000,...,2023-01-02 21:16:11.000000,2023-01-02 21:56:02.000000,2023-01-02 21:04:31.000000,2023-01-02 21:13:09.000000,2023-01-02 21:45:30.000000],[2023-01-02 21:49:54.000000,2023-01-02 21:17:06.000000,2023-01-02 21:35:06.000000,2023-01-02 21:18:43.000000,2

And we can see that 2371784 rows have a tip-per-mile greater than zero:

In [10]:
df = table.scan(row_filter="tip_per_mile > 0").to_arrow()
len(df)

2371784

## Explore Iceberg data and metadata files

Because the catalog is configured to use the local filesystem, we can inspect the data and metadata files that Iceberg wrote during the operations above:

```bash
find /tmp/warehouse/
```
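The `find` output can get long, so a per-extension summary is sometimes easier to read. Iceberg stores table metadata as JSON, manifests as Avro, and (by default) data as Parquet, so counting files by suffix gives a rough picture of what each operation produced. The helper below is a hypothetical sketch using only the standard library.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(root: str) -> Counter:
    """Count regular files under `root`, grouped by file extension."""
    counts: Counter = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Warehouse location used throughout this notebook.
    for suffix, count in sorted(count_files_by_suffix("/tmp/warehouse").items()):
        print(f"{suffix}: {count}")
```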