# Getting started with PyIceberg

PyIceberg is a Python implementation for accessing Iceberg tables without the need for a JVM.

## Installation

In [None]:
%pip install --upgrade pip

In [None]:
%pip install "pyiceberg[s3fs,hive]"
%pip install "pyiceberg[sql-sqlite]"
%pip install pyarrow

## Connecting to a catalog

Create a temporary location for Iceberg:

```bash
mkdir -p /tmp/warehouse
```
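
When working entirely inside the notebook, the same directory can also be created from Python. This is a minimal sketch using only the standard library; the variable name `warehouse_dir` is illustrative, and `exist_ok=True` keeps the cell safe to re-run.

```python
from pathlib import Path

# Same location as the shell command above; adjust if you use a different path.
warehouse_dir = Path("/tmp/warehouse")

# parents=True creates missing parent directories; exist_ok=True avoids an
# error when the directory is already there from an earlier run.
warehouse_dir.mkdir(parents=True, exist_ok=True)
print(warehouse_dir.is_dir())
```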


Set up the catalog:

In [None]:
from pyiceberg.catalog import load_catalog

warehouse_path = "/tmp/warehouse"
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

## Write a PyArrow dataframe

First, download one month of taxi trip data:

```bash
curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
```

Load it into a PyArrow table (referred to as a dataframe in the rest of this notebook):

In [None]:
import pyarrow.parquet as pq

df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")

Create a new Iceberg table:

In [None]:
catalog.create_namespace("default")

table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)

Append the dataframe to the table:

In [None]:
table.append(df)
len(table.scan().to_arrow())

Now generate a tip-per-mile feature that a model could later be trained on:

In [None]:
import pyarrow.compute as pc

df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))

Evolve the schema of the table with the new column:

In [None]:
with table.update_schema() as update_schema:
    update_schema.union_by_name(df.schema)

And now we can write the new dataframe to the Iceberg table:

In [None]:
table.overwrite(df)
print(table.scan().to_arrow())

We can see that 2371784 rows have a positive tip-per-mile:

In [None]:
df = table.scan(row_filter="tip_per_mile > 0").to_arrow()
len(df)

## Explore Iceberg data and metadata files

Since the catalog was configured to use the local filesystem, we can explore how Iceberg saved the data and metadata files from the operations above.

```bash
find /tmp/warehouse/
```
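
The same listing can be summarized from Python. With default settings, Iceberg typically writes Parquet data files, Avro manifest files, and JSON table metadata, so counting files by extension gives a quick overview of what the operations produced. This is a minimal sketch using only the standard library; `summarize_files` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def summarize_files(root):
    """Count the files below `root`, grouped by file extension."""
    return Counter(p.suffix for p in Path(root).rglob("*") if p.is_file())


# Point this at the warehouse directory used throughout the notebook.
for suffix, count in sorted(summarize_files("/tmp/warehouse").items()):
    print(f"{suffix or '(no extension)'}: {count}")
```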