# A simple example of using DuckDB with Apache Arrow on the NYC Taxi dataset

This notebook reads the NYC taxi dataset files for 2021 (about 29 million rows) and runs a few analytics queries on them. The dataset is too big to fit into the memory of a small notebook instance.

1. We read the data from S3 using Apache Arrow (`pyarrow`).

1. The zero-copy integration between DuckDB and Apache Arrow allows for rapid analysis of larger-than-memory datasets in Python and R using either SQL or relational APIs.

1. We create an in-memory DuckDB instance and use its connection to run some simple analytics queries in SQL.

Also see [https://duckdb.org/2021/12/03/duck-arrow.html](https://duckdb.org/2021/12/03/duck-arrow.html)

In [7]:
!pip install pyarrow duckdb

In [8]:
import duckdb
import pyarrow as pa
import pyarrow.dataset as ds

nyc = ds.dataset('s3://bigdatateaching/nyctaxi-yellow-tripdata/2021/')

In [9]:
# connect to an in-memory database
con = duckdb.connect()

**Running the following line on an `ml.t3.medium` instance, which has only 4 GiB of RAM, will cause the kernel to restart**, so it is commented out. However, as the rest of this notebook shows, we can still work with this dataset through DuckDB, because DuckDB loads only a subset of the data into memory at a time.

In [10]:
# con.execute("SELECT * FROM nyc").df()

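How many rows are in this dataset? Note that `count(passenger_count)` counts only the rows where `passenger_count` is not NULL, whereas `count(*)` counts every row.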

In [11]:
%%time
con.execute("SELECT count(passenger_count) as count FROM nyc").df()

CPU times: user 734 ms, sys: 136 ms, total: 870 ms
Wall time: 930 ms


| | count |
|---|---|
| 0 | 29425613 |


Compute the average of several fields, grouped by two derived fields: the month and the day of the month of each pickup.

In [12]:
%%time
query = """
select
  month,
  day,
  AVG(trip_distance) as avg_trip_distance,
  AVG(fare_amount) as avg_fare_amount,
  AVG(mta_tax) as avg_mta_tax,
  AVG(tip_amount) as avg_tip_amount,
  AVG(total_amount) as avg_total_amount,
  AVG(congestion_surcharge) as avg_congestion_surcharge
from (
  select
    trip_distance,
    passenger_count,
    mta_tax,
    tip_amount,
    fare_amount,
    total_amount,
    congestion_surcharge,
    date_part('month', tpep_pickup_datetime) as month,
    date_part('day', tpep_pickup_datetime) as day
  from
    nyc)
  group by month, day
"""
nyc_subset = con.execute(query).df()
nyc_subset

CPU times: user 7.87 s, sys: 1.71 s, total: 9.58 s
Wall time: 6.85 s


| | month | day | avg_trip_distance | avg_fare_amount | avg_mta_tax | avg_tip_amount | avg_total_amount | avg_congestion_surcharge |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 3.581000 | 13.536541 | 0.487243 | 1.978367 | 18.813154 | 2.094889 |
| 1 | 1 | 2 | 3.442048 | 13.502939 | 0.491315 | 2.144961 | 18.955158 | 2.156955 |
| 2 | 1 | 3 | 9.748679 | 14.787712 | 0.489620 | 2.315952 | 20.451181 | 2.080435 |
| 3 | 1 | 4 | 3.889088 | 13.051403 | 0.492872 | 2.009752 | 18.533799 | 2.182669 |
| 4 | 1 | 5 | 4.047790 | 12.346473 | 0.493153 | 1.933759 | 17.716130 | 2.214644 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 360 | 12 | 27 | 5.177866 | 14.625754 | 0.488190 | 2.468404 | 21.215707 | 2.210944 |
| 361 | 12 | 28 | 3.723458 | 14.518260 | 0.489409 | 2.438996 | 21.033789 | 2.226947 |
| 362 | 12 | 29 | 3.681428 | 14.405951 | 0.489325 | 2.461713 | 20.940976 | 2.235410 |
| 363 | 12 | 30 | 6.510600 | 14.468380 | 0.489143 | 2.460472 | 21.011377 | 2.239615 |
