![iceberg-logo](https://www.apache.org/logos/res/iceberg/iceberg.png)

In [None]:
from pyiceberg import __version__

__version__

## Load NYC Taxi/Limousine Trip Data

For this notebook, we will use the New York City Taxi and Limousine Commission Trip Record Data that's available on the AWS Open Data Registry. The dataset contains records of trips taken by taxis and for-hire vehicles in New York City. We'll save it into an Iceberg table called `taxis`.

First, load the Parquet file using PyArrow:

In [None]:
import pyarrow.parquet as pq

tbl_taxis = pq.read_table('/home/iceberg/data/yellow_tripdata_2021-04.parquet')
tbl_taxis

## Creating the table

Next, create the namespace and the `taxis` table, deriving the table schema from the Arrow schema:

In [None]:
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

cat = load_catalog('default')

try:
    cat.create_namespace('default')
except NamespaceAlreadyExistsError:
    pass

In [None]:
from pyiceberg.exceptions import NoSuchTableError

try:
    cat.drop_table('default.taxis')
except NoSuchTableError:
    pass

tbl = cat.create_table(
    'default.taxis',
    schema=tbl_taxis.schema
)

tbl

## Write the actual data into the table

This will create a new snapshot on the table:

In [None]:
tbl.overwrite(tbl_taxis)

tbl

## Append more data

Let's append another month of data to the table:

In [None]:
tbl.append(pq.read_table('/home/iceberg/data/yellow_tripdata_2021-05.parquet'))

tbl

## Load data into a PyArrow Dataframe

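We'll load the table through the REST catalog that comes with the setup, then scan it with a row filter so that only trips picked up on or after 1 May 2021 are read. The scan result is materialized as a PyArrow table and converted to a pandas DataFrame.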

In [None]:
tbl = cat.load_table('default.taxis')

sc = tbl.scan(row_filter="tpep_pickup_datetime >= '2021-05-01T00:00:00.000000'")
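
To give an intuition for why a row filter can reduce the amount of data read, here is a toy sketch of file pruning based on per-file column bounds. It does not use PyIceberg; the `DataFileStats` class, the file names, and the bounds are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataFileStats:
    """Hypothetical stand-in for per-file column bounds kept in table metadata."""
    path: str
    lower: datetime  # smallest tpep_pickup_datetime in the file
    upper: datetime  # largest tpep_pickup_datetime in the file


def files_to_read(files, cutoff):
    """Keep only files that may contain rows with pickup time >= cutoff."""
    return [f.path for f in files if f.upper >= cutoff]


# Illustrative bounds only; real files can contain out-of-range timestamps.
files = [
    DataFileStats("april.parquet", datetime(2021, 4, 1), datetime(2021, 4, 30, 23, 59, 59)),
    DataFileStats("may.parquet", datetime(2021, 5, 1), datetime(2021, 5, 31, 23, 59, 59)),
]

selected = files_to_read(files, datetime(2021, 5, 1))
print(selected)  # ['may.parquet']
```

With these assumed bounds, the April file can be skipped without opening it, because its upper bound is below the cutoff.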

In [None]:
df = sc.to_arrow().to_pandas()

In [None]:
len(df)

In [None]:
df.info()

In [None]:
df

In [None]:
df.hist(column='fare_amount')

In [None]:
import numpy as np
from scipy import stats

# Z-score of each fare: its distance from the mean, in standard deviations
fare_z = stats.zscore(df['fare_amount'])

# Remove everything more than 3 standard deviations from the mean
df = df[np.abs(fare_z) < 3]
# Remove fares that are zero or negative
df = df[df['fare_amount'] > 0]

In [None]:
df.hist(column='fare_amount')
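
The outlier filter above can be sketched in plain Python to show what it does step by step. This is a minimal illustration using only the standard library; the `fares` values are made up, and the helper names `zscores` and `drop_outliers` are hypothetical.

```python
import statistics


def zscores(values):
    """Population z-scores (ddof=0), the default definition in scipy.stats.zscore.

    Assumes the values are not all identical, otherwise the standard deviation is zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def drop_outliers(values, threshold=3.0):
    """Drop values whose |z-score| >= threshold, then drop non-positive values."""
    scores = zscores(values)
    kept = [v for v, z in zip(values, scores) if abs(z) < threshold]
    return [v for v in kept if v > 0]


# Made-up fares: mostly typical, one negative, one extreme.
fares = [10.0] * 20 + [12.0, -5.0, 500.0]
cleaned = drop_outliers(fares)
print(len(fares), len(cleaned))  # 23 21
```

In this made-up sample the extreme fare is removed by the z-score test, while the negative fare passes that test and is only removed by the second, positivity check. That is why both filters are applied.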

# DuckDB

Use DuckDB to query the DataFrame `df` directly.

In [None]:
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
%sql duckdb:///:memory:

In [None]:
%sql SELECT * FROM df LIMIT 20

In [None]:
%%sql --save tip_amount --no-execute

SELECT tip_amount
FROM df

In [None]:
%sqlplot histogram --table df --column tip_amount --bins 22 --with tip_amount


In [None]:
%%sql --save tip_amount_filtered --no-execute

WITH tip_amount_stddev AS (
    SELECT STDDEV_POP(tip_amount) AS tip_amount_stddev
    FROM df
)

SELECT tip_amount
FROM df, tip_amount_stddev
WHERE tip_amount > 0
  AND tip_amount < tip_amount_stddev * 3

In [None]:
%sqlplot histogram --table tip_amount_filtered --column tip_amount --bins 50 --with tip_amount_filtered
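
For comparison, the predicate in the SQL query above can be mirrored in plain Python. This is a sketch with made-up `tips` values and a hypothetical helper `filter_tips`; it assumes `STDDEV_POP` is the population standard deviation, which corresponds to `statistics.pstdev`.

```python
import statistics


def filter_tips(tips, factor=3.0):
    """Mirror the SQL predicate: 0 < tip_amount < factor * STDDEV_POP(tip_amount)."""
    limit = factor * statistics.pstdev(tips)
    return [t for t in tips if 0 < t < limit]


# Made-up tips: a few zeros, typical values, and one extreme tip.
tips = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 2.0, 3.0, 50.0]
kept = filter_tips(tips)
print(kept)  # [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0]
```

Note that this threshold is measured from zero rather than from the mean, so it is not the same test as the z-score filter used earlier for `fare_amount`.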


# Iceberg ❤️ PyArrow and DuckDB

This notebook shows how easily you can load Iceberg data through PyArrow and query it with DuckDB. Iceberg lets you read only the slice of data that your analysis needs, which cuts the time you wait for the data and avoids filling memory with data that you're not going to use.