# Analyzing the NYC Taxi dataset using RAPIDS

This notebook uses [RAPIDS](https://rapids.ai/) for accelerating analytics using GPU. Specifically we use the `ml.g4dn.xlarge` instance type that has 4 vCPUs, 16 GB RAM and 1 GPU. We run the same analytics tasks that we have run earlier on the NYC Taxi dataset that we have done previously with Spark, Dask and DuckDB. 

This notebook does require installing `%conda create -n rapids-22.08 -c rapidsai -c nvidia -c conda-forge rapids=22.08 python=3.9 cudatoolkit=11.5 dask-sql ipykernel -y` but it is best to package this conda environment in a container and attach that container to SageMaker Studio. This notebook assumes that the kernel being used is this custom kernel which has the `rapids` and `cudatoolkit` pre-installed. The details of this process are beyond the scope of this notebook [README.md](./README.md) for more details.

In [2]:
!pip install s3fs==2022.11.0 gputil==1.4.0

[0m

In [3]:
import os
import cudf
import GPUtil
import dask_cudf
import cupy as cp
import pandas as pd


In [4]:
cudf_df = cudf.read_parquet('s3://bigdatateaching/nyctaxi-yellow-tripdata/2021/*.parquet')

In [5]:
cudf_df.shape

(1369769, 19)

In [6]:
cudf_df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2021-01-01 00:30:10,2021-01-01 00:36:12,1.0,2.1,1.0,N,142,43,2,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5,
1,1,2021-01-01 00:51:20,2021-01-01 00:52:19,1.0,0.2,1.0,N,238,151,2,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0,
2,1,2021-01-01 00:43:30,2021-01-01 01:11:06,1.0,14.7,1.0,N,132,165,1,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0,
3,1,2021-01-01 00:15:48,2021-01-01 00:31:01,0.0,10.6,1.0,N,138,132,1,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0,
4,2,2021-01-01 00:31:49,2021-01-01 00:48:21,1.0,4.94,1.0,N,68,33,1,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5,


In [7]:
cudf_df.tpep_pickup_datetime.dtype

dtype('<M8[us]')

In [8]:
cudf_df['day'] = cudf.to_datetime(cudf_df['tpep_pickup_datetime']).dt.day
cudf_df['month'] = cudf.to_datetime(cudf_df['tpep_pickup_datetime']).dt.month
cudf_df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,day,month
0,1,2021-01-01 00:30:10,2021-01-01 00:36:12,1.0,2.10,1.0,N,142,43,2,...,3.00,0.5,0.00,0.00,0.3,11.80,2.5,,1,1
1,1,2021-01-01 00:51:20,2021-01-01 00:52:19,1.0,0.20,1.0,N,238,151,2,...,0.50,0.5,0.00,0.00,0.3,4.30,0.0,,1,1
2,1,2021-01-01 00:43:30,2021-01-01 01:11:06,1.0,14.70,1.0,N,132,165,1,...,0.50,0.5,8.65,0.00,0.3,51.95,0.0,,1,1
3,1,2021-01-01 00:15:48,2021-01-01 00:31:01,0.0,10.60,1.0,N,138,132,1,...,0.50,0.5,6.05,0.00,0.3,36.35,0.0,,1,1
4,2,2021-01-01 00:31:49,2021-01-01 00:48:21,1.0,4.94,1.0,N,68,33,1,...,0.50,0.5,4.06,0.00,0.3,24.36,2.5,,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1369764,2,2021-01-31 23:03:00,2021-01-31 23:33:00,,8.89,,,229,181,0,...,0.00,0.5,7.46,0.00,0.3,38.54,,,31,1
1369765,2,2021-01-31 23:29:00,2021-01-31 23:51:00,,7.43,,,41,70,0,...,0.00,0.5,0.00,6.12,0.3,39.50,,,31,1
1369766,2,2021-01-31 23:25:00,2021-01-31 23:38:00,,6.26,,,74,137,0,...,0.00,0.5,3.90,0.00,0.3,24.05,,,31,1
1369767,6,2021-01-31 23:01:06,2021-02-01 00:02:03,,19.70,,,265,188,0,...,0.00,0.5,0.00,0.00,0.3,54.48,,,31,1


In [9]:
%time
cudf_df.groupby(['month', 'day']).agg({'trip_distance': "mean",
                                       'fare_amount': "mean",
                                       'mta_tax': 'mean',
                                       'tip_amount': 'mean',
                                       'total_amount': 'mean',
                                       'congestion_surcharge': 'mean'})

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.48 µs


Unnamed: 0_level_0,Unnamed: 1_level_0,trip_distance,fare_amount,mta_tax,tip_amount,total_amount,congestion_surcharge
month,day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,2,3.441793,13.502239,0.491314,2.145086,18.954313,2.156934
1,23,2.764356,11.414721,0.492569,1.86094,16.605046,2.25727
1,17,3.260376,12.720475,0.492981,2.061331,18.03245,2.205055
1,20,5.550783,11.506614,0.493568,1.806717,16.913929,2.276005
1,3,9.749994,14.789198,0.489638,2.316048,20.452838,2.080468
1,6,4.86173,11.989868,0.493319,1.912365,17.352783,2.231545
1,18,3.056818,12.305442,0.493071,2.014083,17.612088,2.232386
1,9,5.164707,12.189154,0.492308,1.995121,17.436535,2.207145
1,21,3.23162,11.967786,0.494247,1.852851,17.441953,2.287558
1,24,3.242845,12.528322,0.492266,1.969346,17.85612,2.195916


In [10]:
%%time
for i in range(1000):
    cudf_df.groupby(['month', 'day']).agg({'trip_distance': "mean",
                                           'fare_amount': "mean",
                                           'mta_tax': 'mean',
                                           'tip_amount': 'mean',
                                           'total_amount': 'mean',
                                           'congestion_surcharge': 'mean'})
    if i % 100 == 0:
        print(GPUtil.showUtilization())

| ID | GPU | MEM |
------------------
|  0 |  1% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 61% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 62% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 62% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 62% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 61% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 61% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 61% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 62% |  4% |
None
| ID | GPU | MEM |
------------------
|  0 | 62% |  4% |
None
CPU times: user 6.81 s, sys: 7.08 s, total: 13.9 s
Wall time: 14.1 s
