## **Notebook 2: Introduction to RAPIDS cuDF**

This notebook will guide you through the basics of cuDF, demonstrating its capabilities in accelerating data operations and highlighting key concepts such as automatic GPU acceleration and performance comparisons with traditional pandas operations.

Throughout this tutorial, we'll explore:
1. The similarities between pandas and cuDF, and how to use cuDF's pandas-compatible API
2. Performance comparisons between CPU-based pandas and GPU-accelerated cuDF operations
3. Techniques for working with large datasets, including grouping and aggregation
4. Practical examples using real-world data from NYC's 311 Service Request dataset

---
**GPU-Accelerated Data Manipulation with cuDF**

Within the Python ecosystem, [pandas](https://pandas.pydata.org/) is a popular library for working with structured tabular data. This type of data is often stored as tables or CSV files, which pandas can filter, transform, aggregate, merge, visualize, and more.

Below we create a pandas DataFrame, the library's core object for representing tabular data. It has one column for the key and one for the value.

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame()
df['key'] = [0, 0, 2, 2, 3]
df['value'] = [float(i + 10) for i in range(5)]
print(df)

   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


We can check where pandas is installed by evaluating the module in the cell below.

In [3]:
pd

<module 'pandas' from '/packages/envs/rapids25.02/lib/python3.12/site-packages/pandas/__init__.py'>

We can also perform operations on this data. For example, suppose we want to sum all values in the `value` column. We can do this with the following syntax:

In [4]:
aggregation = df['value'].sum()
print("Aggregation:", aggregation)

Aggregation: 60.0
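Summing a whole column is the simplest aggregation. As a small sketch on the same toy data (rebuilt here so the snippet is self-contained), the sum can also be computed per key with `groupby`, which is the pattern used for the larger benchmark in this notebook:

```python
import pandas as pd

# Same toy data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    "key": [0, 0, 2, 2, 3],
    "value": [float(i + 10) for i in range(5)],
})

# Sum the 'value' column separately for each distinct key
per_key = df.groupby("key")["value"].sum()
print(per_key)
```

Key 0 sums to 21.0, key 2 to 25.0, and key 3 to 14.0, which together give the 60.0 total computed above.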


The data stored in DataFrames is interoperable with major data analytics and deep learning frameworks in the Python ecosystem, including TensorFlow, PyTorch, CuPy, and Numba. This interoperability is largely made possible by DLPack, a standardized specification for tensor structures, which lets libraries that support it exchange data, in many cases without copying memory. Below we build a pandas DataFrame with 50 million rows directly from NumPy arrays, then time a grouped aggregation on the CPU as a baseline.

In [5]:
import numpy as np
from time import perf_counter

n = 50000000

data = {
    'key': np.random.randint(0, 1000, n),
    'value1': np.random.randn(n),
    'value2': np.random.randn(n)
}

pdf = pd.DataFrame(data)

start_time = perf_counter()
pandas_result = pdf.groupby('key').agg({'value1': ['sum', 'mean'], 'value2': ['min', 'max']})
stop_time = perf_counter()

pandas_time = stop_time - start_time
print(f"Pandas time: {pandas_time:.4f} seconds")

Pandas time: 1.3530 seconds


Just as CuPy is to NumPy, [cuDF](https://github.com/rapidsai/cudf) is to pandas: it lets us GPU-accelerate pandas-style code. Taking it a step further, loading `cudf.pandas` automatically accelerates existing pandas code on the GPU with zero code changes. Each operation is first attempted on the GPU (copying data from the CPU if necessary). If that fails, the operation is attempted on the CPU (copying data from the GPU if necessary), with synchronization handled under the hood.
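The try-the-GPU-then-fall-back behavior can be pictured with a toy dispatcher. This is only an illustration of the pattern, not how `cudf.pandas` is implemented: `run_with_fallback`, `fast_sum`, and `slow_sum` are hypothetical names invented for this sketch, and no GPU is involved.

```python
def run_with_fallback(fast_fn, slow_fn, *args, **kwargs):
    """Try the fast path first; if it raises, use the slow path instead."""
    try:
        return fast_fn(*args, **kwargs), "fast"
    except Exception:
        return slow_fn(*args, **kwargs), "slow"


def fast_sum(values):
    # Stand-in for an accelerated routine that only supports floats
    if not all(isinstance(v, float) for v in values):
        raise NotImplementedError("fast path supports floats only")
    return sum(values)


def slow_sum(values):
    # Stand-in for the general-purpose CPU implementation
    return sum(values)


result_a = run_with_fallback(fast_sum, slow_sum, [10.0, 11.0, 12.0])
result_b = run_with_fallback(fast_sum, slow_sum, [10, 11, 12])
print(result_a)  # (33.0, 'fast')
print(result_b)  # (33, 'slow')
```

The caller gets the same answer either way; only the path taken differs, which is the property that makes the acceleration transparent to existing code.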

To gain access to this zero-code-change acceleration, we load the extension.

In [6]:
%load_ext cudf.pandas
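The `%load_ext` magic only exists inside IPython and Jupyter. As a hedged sketch for plain Python scripts, the cuDF documentation describes either running `python -m cudf.pandas script.py` or calling `cudf.pandas.install()` before pandas is first imported. The `try`/`except` guard and the `accelerated` flag below are additions for this example so that it still runs on machines without cuDF:

```python
# Enable the accelerator before pandas is imported. The ImportError guard
# is an assumption of this sketch: it falls back to plain pandas when
# cuDF is not installed.
try:
    import cudf.pandas
    cudf.pandas.install()
    accelerated = True
except ImportError:
    accelerated = False

import pandas as pd

print("GPU acceleration enabled:", accelerated)
print(pd)
```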

Now, when we import pandas, we can see that it uses the cuDF accelerator under the hood!

In [7]:
import pandas as pd
pd

<module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>

Let's create the same dataframe as before.

In [8]:
df = pd.DataFrame()
df['key'] = [0, 0, 2, 2, 3]
df['value'] = [float(i + 10) for i in range(5)]
print(df)

   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


We can run the exact same code, but under the hood we are running on the GPU!

In [9]:
aggregation = df['value'].sum()
print(aggregation)

60.0


Now let's get back to our larger example. With the exact same code as above, we can compare the performance of native pandas with the `cudf.pandas` extension that unlocks GPU acceleration. There is also a profiler that tells us which calls ran on the GPU and how long they took!

In [10]:
%%cudf.pandas.profile

pdf = pd.DataFrame(data)

start_time = perf_counter()
pandas_cudf_result = pdf.groupby('key').agg({'value1': ['sum', 'mean'], 'value2': ['min', 'max']})
stop_time = perf_counter()

pandas_cudf_time = stop_time - start_time

print(f"Pandas time: {pandas_time:.4f} seconds")
print(f"pandas_cudf time: {pandas_cudf_time:.4f} seconds")
print(f"Speedup: {pandas_time / pandas_cudf_time:.2f}x")

Pandas time: 1.3530 seconds
pandas_cudf time: 0.2504 seconds
Speedup: 5.40x


That is a big speedup, without making any code changes!
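A single `perf_counter` measurement can be noisy, and the first GPU call may include one-time start-up costs. One hedged way to make the comparison more robust is to repeat each measurement and keep the best time. The helper `best_time` and its `repeats` parameter are hypothetical names for this example, and the toy workload stands in for the `groupby` calls above:

```python
from time import perf_counter


def best_time(fn, repeats=3):
    """Run fn() several times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return min(timings)


# Stand-in workload; in the notebook this would be the groupby/agg call
elapsed = best_time(lambda: sum(range(100_000)), repeats=5)
print(f"Best of 5: {elapsed:.6f} seconds")
```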

---

**`cudf.pandas` in Practice: 311 call dataset**

We are going to explore more features of `cudf.pandas` using a larger dataset. NYC Open Data provides a multitude of open, up-to-date datasets. We use the 311 Service Request data from 2010 to the present day. We start by loading the data and keeping the columns that are useful to us.

In [12]:
%%cudf.pandas.profile
# TODO: Change path to the csv
columns = ["created_date", "agency", "complaint_type", "descriptor",
           "incident_zip", "location_type", "landmark", "borough"]
df_311 = pd.read_csv('/home/vpatel69/python_notebooks/311_service_requests.csv')[columns]
df_311 = df_311.dropna()
df_311.sample(10)

| | created_date | agency | complaint_type | descriptor | incident_zip | location_type | landmark | borough |
|---|---|---|---|---|---|---|---|---|
| 957 | 2025-01-14T21:37:26.000 | DSNY | Dead Animal | Raccoon | 11427 | Street | SAWYER AVENUE | QUEENS |
| 990 | 2025-01-14T21:32:13.000 | NYPD | Noise - Street/Sidewalk | Loud Music/Party | 10454 | Street/Sidewalk | EAST 143 STREET | BRONX |
| 81 | 2025-01-15T00:11:31.000 | NYPD | Noise - Residential | Loud Music/Party | 11435 | Residential Building/House | HILLSIDE AVENUE | QUEENS |
| 799 | 2025-01-14T22:02:16.000 | NYPD | Noise - Residential | Banging/Pounding | 11214 | Residential Building/House | 84 STREET | BROOKLYN |
| 176 | 2025-01-14T23:48:48.000 | NYPD | Noise - Residential | Loud Television | 10456 | Residential Building/House | WASHINGTON AVENUE | BRONX |
| 891 | 2025-01-14T21:48:10.000 | NYPD | Noise - Residential | Loud Talking | 10467 | Residential Building/House | STEUBEN AVENUE | BRONX |
| 845 | 2025-01-14T21:54:38.000 | NYPD | Non-Emergency Police Matter | Trespassing | 10452 | Residential Building/House | FEATHERBED LANE | BRONX |
| 872 | 2025-01-14T21:50:56.000 | NYPD | Abandoned Vehicle | With License Plate | 11213 | Street/Sidewalk | BERGEN STREET | BROOKLYN |
| 633 | 2025-01-14T22:32:46.000 | NYPD | Non-Emergency Police Matter | Trespassing | 11373 | Residential Building/House | BROADWAY | QUEENS |
| 58 | 2025-01-15T00:20:34.000 | NYPD | Noise - Residential | Banging/Pounding | 11225 | Residential Building/House | EMPIRE BOULEVARD | BROOKLYN |
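The loading cell reads every column and then discards most of them. One possible refinement, sketched here, is to pass `usecols` to `pd.read_csv` so that only the needed columns are parsed. The inline CSV below is a tiny stand-in for the real file: the first two rows reuse values from the sample output, while the `extra` column and the third row (with a missing `landmark`) are made up for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; 'extra' and the last row are made up
csv_text = """created_date,agency,complaint_type,descriptor,incident_zip,location_type,landmark,borough,extra
2025-01-14T21:37:26.000,DSNY,Dead Animal,Raccoon,11427,Street,SAWYER AVENUE,QUEENS,x
2025-01-14T21:32:13.000,NYPD,Noise - Street/Sidewalk,Loud Music/Party,10454,Street/Sidewalk,EAST 143 STREET,BRONX,y
2025-01-14T21:30:00.000,NYPD,Illegal Parking,Blocked Hydrant,11201,Street/Sidewalk,,BROOKLYN,z
"""

wanted = ["created_date", "agency", "complaint_type", "descriptor",
          "incident_zip", "location_type", "landmark", "borough"]

# usecols restricts parsing to the listed columns
sample = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
sample = sample.dropna()  # drops the row with the missing landmark
print(sample.shape)
```

Whether this is faster on the GPU path depends on the file and the cuDF version, so it is worth profiling both variants.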


We can group the data into useful categories, for example by counting the requests in each borough.

In [13]:
%%cudf.pandas.profile
grouped = df_311.groupby('borough').size()
print(grouped)

borough
BRONX            111
BROOKLYN         189
MANHATTAN        116
QUEENS           192
STATEN ISLAND     11
dtype: int64
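Raw counts are often easier to compare as proportions. As a small sketch, the counts printed above can be turned into percentage shares of the total; on the full DataFrame, `df_311['borough'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the output above
counts = pd.Series(
    {"BRONX": 111, "BROOKLYN": 189, "MANHATTAN": 116,
     "QUEENS": 192, "STATEN ISLAND": 11},
    name="requests",
)

# Share of all requests per borough, as a percentage
shares = (counts / counts.sum() * 100).round(1)
print(shares.sort_values(ascending=False))
```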


We can take this further and find the most common complaint in each borough. Note again that we are taking advantage of cuDF acceleration without changing our pandas code.

In [14]:
%%cudf.pandas.profile
grouped = df_311.groupby(['borough', 'complaint_type']).size().reset_index(name='count')
most_common_complaints = grouped.sort_values('count', ascending=False).groupby('borough').first()

print(most_common_complaints[['complaint_type', 'count']])

                    complaint_type  count
borough                                  
BRONX          Noise - Residential     48
BROOKLYN       Noise - Residential     58
MANHATTAN      Noise - Residential     51
QUEENS             Illegal Parking     72
STATEN ISLAND  Noise - Residential      4
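The cell above relies on sorting before grouping so that `first()` picks the row with the largest count. An alternative sketch that states the intent more directly uses `idxmax` to locate the highest-count row in each borough. The small `demo` frame below is made-up data with the same columns as `grouped`:

```python
import pandas as pd

# Made-up counts with the same columns as the `grouped` frame
demo = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "QUEENS", "QUEENS"],
    "complaint_type": ["Noise - Residential", "Illegal Parking",
                       "Noise - Residential", "Illegal Parking"],
    "count": [5, 2, 3, 7],
})

# Index label of the max-count row within each borough
top_rows = demo.groupby("borough")["count"].idxmax()
most_common = demo.loc[top_rows].set_index("borough")
print(most_common)
```

Both approaches should agree when there are no ties; with ties, which complaint is reported may differ between them.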


Now take a look at other [pandas](https://pandas.pydata.org/docs/reference/frame.html#function-application-groupby-window) functions to gain additional insight into the dataset. Be sure to profile your code to get a sense of how the GPU is being utilized!

In [None]:
# TODO: Write your own cudf.pandas code to gain additional insight from the 311 Call Dataset
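As one possible starting point for the exercise (a sketch, not the intended solution), the `created_date` strings can be parsed into timestamps and the requests counted per hour of day. The timestamps below are copied from the sample output earlier in this section; on the real data the same two lines would operate on `df_311['created_date']`.

```python
import pandas as pd

# Timestamps copied from the ten sampled rows shown earlier
created = pd.Series([
    "2025-01-14T21:37:26.000", "2025-01-14T21:32:13.000",
    "2025-01-15T00:11:31.000", "2025-01-14T22:02:16.000",
    "2025-01-14T23:48:48.000", "2025-01-14T21:48:10.000",
    "2025-01-14T21:54:38.000", "2025-01-14T21:50:56.000",
    "2025-01-14T22:32:46.000", "2025-01-15T00:20:34.000",
], name="created_date")

# Parse the strings, extract the hour, and count requests per hour
hours = pd.to_datetime(created).dt.hour
per_hour = hours.value_counts().sort_index()
print(per_hour)
```

For these ten rows, hour 21 has five requests, hours 0 and 22 have two each, and hour 23 has one.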

---

**Please restart the kernel**

In [15]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In this notebook, we've explored the fundamentals of RAPIDS cuDF and its potential for accelerating data manipulation tasks in Python. We've seen how cuDF can provide significant speedups over traditional pandas operations, particularly for large-scale datasets. Key takeaways include:

1. cuDF offers a familiar pandas-like interface, making it easy to port existing code to GPU acceleration.
2. The cudf.pandas extension allows for automatic GPU acceleration with zero code changes.
3. Performance gains can be substantial, especially for operations like grouping and aggregation on large datasets.
4. Real-world applications, such as analyzing the NYC 311 Service Request dataset, demonstrate the practical benefits of using cuDF.
5. The built-in profiler helps identify which operations are running on the GPU and their execution times.

This GPU-accelerated library lets you process and analyze large datasets much faster, making practical some analyses that were previously limited by CPU compute time.