## An example of using RAPIDS to speed up pandas operations on Hyperplane
- The task is to group and sort about 3 GB of data stored in an S3 bucket.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import os
import dask
from hyperplane import notebook_common as nc

In [4]:
client, cluster = nc.initialize_cluster(
    nprocs=1,
    nthreads=8,
    ram_gb_per_proc=7,
    cores_per_worker=2,
    num_workers=2,
    ngpus=1,
    scheduler_deploy_mode="local",
)

👉 Hyperplane: selecting worker node pool
👉 Hyperplane: selecting scheduler node pool
👉 Hyperplane: you can access your dask dashboard at https://jhub.ds.hyperplane.dev/hub/user-redirect/proxy/45601/status
👉 Hyperplane: to get logs from all workers, do `cluster.get_logs()`


In [5]:
client

Client
  Scheduler: tcp://10.0.21.3:43959
  Dashboard: http://10.0.21.3:45601/status
Cluster
  Workers: 2
  Cores: 16
  Memory: 14.57 GiB


In [6]:
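from dask.distributed import Client

# Attach a client to the cluster explicitly.
# initialize_cluster() above already returned a client, so this step is optional.
client = Client(cluster)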

In [7]:
import dask_cudf

In [8]:
file_path = "s3://dask-data/airline-data/*.csv"

In [9]:
# Read only the three columns needed for the aggregation.
flight_df = dask_cudf.read_csv(
    file_path,
    assume_missing=True,
    usecols=["UniqueCarrier", "FlightNum", "Distance"],
)

In [10]:
flight_df.head()

  UniqueCarrier  FlightNum  Distance
0            PS     1451.0     447.0
1            PS     1451.0     447.0
2            PS     1451.0     447.0
3            PS     1451.0     447.0
4            PS     1451.0     447.0


In [11]:
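# Lazily define the mean Distance per (UniqueCarrier, FlightNum) pair.
# Nothing is computed until .compute() is called.
flight_df_opt = flight_df.groupby(by=["UniqueCarrier", "FlightNum"]).Distance.mean()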

In [12]:
%%time
flight_df_results = flight_df_opt.compute()

CPU times: user 2.08 s, sys: 223 ms, total: 2.3 s
Wall time: 1min 21s


In [13]:
flight_df_results

UniqueCarrier  FlightNum
AS             994.0        586.166667
               920.0        828.272727
XE             4089.0       583.000000
WN             3524.0       839.513333
CO             874.0        588.002579
                               ...    
PI             1809.0       195.542662
NW             1912.0       354.528543
MQ             3238.0       212.074221
UA             2563.0       271.504673
WN             25.0         298.849527
Name: Distance, Length: 50003, dtype: float64
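
To make explicit what the aggregation above computes, here is a minimal sketch in plain Python of a mean `Distance` per `(UniqueCarrier, FlightNum)` pair. It is only an illustration of the groupby semantics, not how dask_cudf implements it. The sample rows and the helper name `mean_distance_by_flight` are hypothetical and are not taken from the airline dataset.

```python
from collections import defaultdict

# Hypothetical sample rows: (UniqueCarrier, FlightNum, Distance).
# These values are made up for illustration only.
rows = [
    ("PS", 1451.0, 447.0),
    ("PS", 1451.0, 453.0),
    ("AS", 994.0, 580.0),
    ("AS", 994.0, 592.0),
    ("AS", 920.0, 828.0),
]


def mean_distance_by_flight(rows):
    """Return {(carrier, flight_num): mean distance} for the given rows."""
    # Each key maps to [running sum of distances, number of rows seen].
    totals = defaultdict(lambda: [0.0, 0])
    for carrier, flight_num, distance in rows:
        acc = totals[(carrier, flight_num)]
        acc[0] += distance
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


means = mean_distance_by_flight(rows)
for key in sorted(means):
    print(key, means[key])
```

Each distinct key pair yields one output value, which is why the result above is a Series indexed by `UniqueCarrier` and `FlightNum`.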

In [None]:
cluster.close()