<img src='images/DLI_Header.png'>

# Accelerate Data Science Workflows with Zero Code Changes #

## 01 - 10 Minutes to RAPIDS cuDF's pandas accelerator mode (cudf.pandas) ##
cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a pandas-style DataFrame API. cuDF now provides a pandas accelerator mode (`cudf.pandas`) that brings accelerated computing to your pandas workflows without requiring any code changes.

**Table of Contents**
<br>
This notebook is a short introduction to `cudf.pandas`. It covers the following sections:
1. [Verify your setup](#s1-1)
2. [Download the data](#s1-2)
    * [Data License and Terms](#s1-2.1)
3. [Analysis using Standard pandas](#s1-3)
    * [Which parking violation is most commonly committed by vehicles from various U.S. states](#s1-3.1)
    * [Which vehicle body types are most frequently involved in parking violations](#s1-3.2)
    * [How do parking violations vary across days of the week](#s1-3.3)
    * [Let's time it](#s1-3.4)
4. [Using cuDF's pandas accelerator mode (cudf.pandas)](#s1-4)
5. [Understanding Performance](#s1-5)
    * [Profiling Functionality](#s1-5.1)
    * [Behind the scenes: What's going on here](#s1-5.2)
6. [Using third-party libraries with cuDF's pandas accelerator mode](#s1-6)
    * [Visualizing which states have more pickup trucks relative to other vehicles](#s1-6.1)
    * [Beyond just passing data: Accelerating third-party code](#s1-6.2)
7. [Conclusion](#s1-7)

<a name='s1-1'></a>
## ⚠️ Verify your setup ## 
First, we'll verify that you are running with an NVIDIA GPU and that cuDF is available.

In [None]:
!nvidia-smi  # this should display information about available GPUs

In [None]:
import cudf  # this should work without any errors

We'll also install `plotly-express` for visualizing data.

In [None]:
!pip install plotly-express

<a name='s1-2'></a>
## Download the data ##
The data we'll be working with is the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y) dataset from NYC Open Data. To provide faster download speeds, we'll download a copy of this dataset from an S3 bucket hosted by NVIDIA. The download should take about 30 seconds.

<a name='s1-2.1'></a>
### Data License and Terms ###
As this dataset originates from the NYC Open Data Portal, it's governed by their license and terms of use.

**Are there restrictions on how I can use Open Data?**
> Open Data belongs to all New Yorkers. There are no restrictions on the use of Open Data. Refer to Terms of Use for more information.

**[Terms of Use](https://opendata.cityofnewyork.us/overview/#termsofuse)**
> By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

> The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

> Submitting City Agencies are the authoritative source of data available on NYC Open Data. These entities are responsible for data quality and retain version control of data sets and feeds accessed on the Site. Data may be updated, corrected, or refreshed at any time.

In [None]:
!wget https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet

<a name='s1-3'></a>
## Analysis using Standard pandas ##
First, let's use pandas to read in some columns of the dataset:

In [None]:
import pandas as pd

In [None]:
# read five columns of the dataset:
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)

# view a random sample of 10 rows:
df.sample(10)

Next, we'll try to answer a few questions using the data.

<a name='s1-3.1'></a>
### Which parking violation is most commonly committed by vehicles from various U.S. states? ###
Each record in our dataset contains the state of registration of the offending vehicle, and the type of parking offence. Let's say we want to get the most common type of offence for vehicles registered in different states. We can do this in pandas using a combination of [value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) and [GroupBy.head](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.head.html):

In [None]:
(df[["Registration State", "Violation Description"]]  # get only these two columns
 .value_counts()  # get the count of offences per state and per type of offence
 .groupby("Registration State")  # group by state
 .head(1)  # get the first row in each group (the type of offence with the largest count)
 .sort_index()  # sort by state name
 .reset_index()
)

The code above uses [method chaining](https://tomaugspurger.net/posts/method-chaining/) to combine a series of operations into a single statement. You might find it useful to break the code up into multiple statements and inspect each of the intermediate results!
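As a minimal sketch of that suggestion, the same chain can be split into named steps on a tiny, made-up DataFrame. The rows and the variable names (`toy`, `pair_counts`, `top_per_state`, `toy_result`) are assumptions for illustration and are not taken from the parking dataset:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the parking dataset.
toy = pd.DataFrame({
    "Registration State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
    "Violation Description": [
        "NO PARKING", "NO PARKING", "EXPIRED METER",
        "EXPIRED METER", "EXPIRED METER", "NO PARKING",
    ],
})

# Step 1: count each (state, violation) pair, largest counts first.
pair_counts = toy[["Registration State", "Violation Description"]].value_counts()

# Step 2: keep the first (most frequent) pair within each state.
top_per_state = pair_counts.groupby("Registration State").head(1)

# Step 3: sort by state and turn the index levels back into columns.
toy_result = top_per_state.sort_index().reset_index()
print(toy_result)
```

Inspecting `pair_counts` and `top_per_state` separately shows why `head(1)` picks the most common offence: `value_counts` has already sorted the pairs by descending count.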

<a name='s1-3.2'></a>
### Which vehicle body types are most frequently involved in parking violations? ###
We can also investigate which vehicle body types most commonly appear in parking violations. 

In [None]:
(df
 .groupby(["Vehicle Body Type"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
)
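One detail worth keeping in mind: the `"count"` aggregation tallies only non-null values in the chosen column, whereas `size` counts every row in a group. The sketch below uses made-up rows (an assumption for illustration, with one deliberately missing value) to show the difference:

```python
import pandas as pd

# Made-up rows for illustration; the second summons number is missing.
toy_body = pd.DataFrame({
    "Vehicle Body Type": ["SUBN", "SUBN", "SDN"],
    "Summons Number": [1001.0, None, 1003.0],
})

grouped = toy_body.groupby("Vehicle Body Type")
non_null_counts = grouped["Summons Number"].count()  # skips the missing value
row_counts = grouped.size()                          # counts every row

print(non_null_counts)
print(row_counts)
```

If the counted column has no missing values, the two approaches should agree.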

<a name='s1-3.3'></a>
### How do parking violations vary across days of the week? ###

In [None]:
weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()

It looks like there are fewer violations on weekends, which makes sense! During the week, more people are driving in New York City.
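The `weekday_names` mapping relies on `Series.dt.weekday` numbering Monday as 0 and Sunday as 6. As a small check on two hand-picked dates (chosen here for illustration), the sketch below also shows `Series.dt.day_name()`, a possible alternative in pandas that produces the same labels without a manual dictionary:

```python
import pandas as pd

# 2022-01-03 was a Monday and 2022-01-08 was a Saturday.
dates = pd.Series(pd.to_datetime(["2022-01-03", "2022-01-08"]))

weekday_numbers = dates.dt.weekday.tolist()    # Monday=0 ... Sunday=6
weekday_labels = dates.dt.day_name().tolist()  # English day names by default

print(weekday_numbers, weekday_labels)
```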

<a name='s1-3.4'></a>
### Let's time it! ###
Loading and processing this data took a little time. Let's measure how long these pipelines take in pandas:

In [None]:
%%time

df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)

(df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)

In [None]:
%%time

(df
 .groupby(["Vehicle Body Type"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
)

In [None]:
%%time

weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()
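The `%%time` cell magic only works inside IPython or Jupyter. In a plain Python script, a rough equivalent can be built from the standard library's `time.perf_counter`. The `timed` helper and the workload below are hypothetical, used only for this sketch:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum of squares", timings):
    # Stand-in workload; replace with the pandas pipeline you want to measure.
    total = sum(i * i for i in range(100_000))

print(f"sum of squares: {timings['sum of squares']:.4f} s")
```

Collecting the timings in a dictionary makes it easy to compare the same pipeline with and without `cudf.pandas` later on.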

<a name='s1-4'></a>
## Using cuDF's pandas accelerator mode (cudf.pandas) ##
Now, let's re-run the pandas code above with the `cudf.pandas` extension loaded. Typically, you should load the `cudf.pandas` extension as the first step in your notebook, before importing any modules. Here, we explicitly restart the kernel to simulate that behavior.

In [None]:
get_ipython().kernel.do_shutdown(restart=True)

In [None]:
%load_ext cudf.pandas
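The `%load_ext` magic is specific to notebooks. For a standalone script, the cuDF documentation describes two alternatives: running it with `python -m cudf.pandas script.py` (where `script.py` stands for your own file), or calling `cudf.pandas.install()` before pandas is imported. The sketch below shows the second approach. The `try`/`except` fallback and the `accelerated` flag are assumptions added here so that the script still runs on a machine without cuDF or a usable GPU:

```python
try:
    import cudf.pandas
    cudf.pandas.install()  # must run before pandas is imported
    accelerated = True
except Exception:
    # cuDF is not installed or no usable GPU was found: use plain pandas.
    accelerated = False

import pandas as pd

print("cudf.pandas active:", accelerated)
```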

In [None]:
%%time

import pandas as pd

df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)

(df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)

In [None]:
%%time

(df
 .groupby(["Vehicle Body Type"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
)

In [None]:
%%time

weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()

Much faster! Operations that took 5-20 seconds can now potentially finish in just milliseconds without changing any code.

<a name='s1-5'></a>
## Understanding Performance ##
cuDF's pandas accelerator mode provides profiling utilities to help you better understand performance. With these tools, you can identify which parts of your code ran on the GPU and which parts ran on the CPU. They're accessible in the `cudf.pandas` namespace because the `cudf.pandas` extension was loaded above with `%load_ext cudf.pandas`.

<a name='s1-5.1'></a>
### Profiling Functionality ###
We can generate a per-function profile with `%%cudf.pandas.profile`, or a per-line profile with `%%cudf.pandas.line_profile`:

In [None]:
%%cudf.pandas.profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis, numeric_only=True)
    axis = 1

counts = small_df.groupby("a").b.count()

In [None]:
%%cudf.pandas.line_profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis, numeric_only=True)
    axis = 1

counts = small_df.groupby("a").b.count()

<a name='s1-5.2'></a>
### Behind the scenes: What's going on here? ###
When you load cuDF's pandas accelerator mode, pandas types like `Series` and `DataFrame` are replaced by proxy objects that dispatch operations to cuDF when possible. We can verify that `cudf.pandas` is active by looking at our `pd` variable:

In [None]:
pd

As a result, all pandas functions, methods, and created objects are proxies:

In [None]:
type(pd.read_csv)

Operations supported by cuDF will be **very** fast:

In [None]:
%%time
df.count(axis=0)

Operations not supported by cuDF will be slower, as they fall back to using pandas (copying data between the CPU and GPU under the hood as needed). For example, cuDF does not currently support `axis=1` for the `count` method. So, this operation will run on the CPU and be noticeably slower than the previous one.

In [None]:
%%time
df.count(axis=1) # This will use pandas, because cuDF doesn't support axis=1 for the .count() method
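The "try the GPU first, fall back to the CPU" idea can be pictured with a toy sketch in plain Python. This is not cuDF's actual implementation: the `FallbackProxy`, `FastBackend`, and `SlowBackend` classes are hypothetical names invented for this illustration, and the "fast" backend simply refuses `axis=1` to mimic an unsupported operation:

```python
class FastBackend:
    """Hypothetical accelerated backend that only supports axis=0."""

    def __init__(self, rows):
        self.rows = rows

    def count(self, axis=0):
        if axis != 0:
            raise NotImplementedError("axis=1 is not supported here")
        return [sum(v is not None for v in col) for col in zip(*self.rows)]


class SlowBackend:
    """Hypothetical reference backend that supports both axes."""

    def __init__(self, rows):
        self.rows = rows

    def count(self, axis=0):
        if axis == 0:
            return [sum(v is not None for v in col) for col in zip(*self.rows)]
        return [sum(v is not None for v in row) for row in self.rows]


class FallbackProxy:
    """Try the fast backend first; fall back to the slow one if unsupported."""

    def __init__(self, fast, slow):
        self._fast = fast
        self._slow = slow
        self.log = []  # records which backend served each call

    def call(self, name, *args, **kwargs):
        try:
            result = getattr(self._fast, name)(*args, **kwargs)
            self.log.append((name, "fast"))
        except NotImplementedError:
            result = getattr(self._slow, name)(*args, **kwargs)
            self.log.append((name, "slow"))
        return result


rows = [(1, "x"), (2, None), (None, "z")]
proxy = FallbackProxy(FastBackend(rows), SlowBackend(rows))

col_counts = proxy.call("count", axis=0)  # served by the fast backend
row_counts = proxy.call("count", axis=1)  # falls back to the slow backend
print(col_counts, row_counts, proxy.log)
```

The caller uses the same interface either way; only the path taken (and therefore the speed) differs, which is what the profiling output above helps you inspect.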

But the story doesn't end here. We often need to mix our own code with third-party libraries that other people have written. Many of these libraries accept pandas objects as inputs.

<a name='s1-6'></a>
## Using third-party libraries with cuDF's pandas accelerator mode ##
You can pass pandas objects to third-party libraries when using `cudf.pandas`, just like you would when using regular pandas. Below, we show an example of using [plotly-express](https://plotly.com/python/plotly-express/) to visualize the data we've been processing:

<a name='s1-6.1'></a>
### Visualizing which states have more pickup trucks relative to other vehicles ###

In [None]:
import plotly.express as px

df = df.rename(columns={
    "Registration State": "reg_state",
    "Vehicle Body Type": "vehicle_type",
})

# vehicle counts per state:
counts = df.groupby("reg_state").size().sort_index()
# vehicles with type "PICK" (Pickup Truck)
pickup_counts = df.where(df["vehicle_type"] == "PICK").groupby("reg_state").size()
# percentage of pickup trucks by state:
pickup_frac = ((pickup_counts / counts) * 100).rename("% Pickup Trucks")
del pickup_frac["MB"]  # (Manitoba is a huge outlier!)

# plot the results:
pickup_frac = pickup_frac.reset_index()
px.choropleth(pickup_frac, locations="reg_state", color="% Pickup Trucks", locationmode="USA-states", scope="usa")
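To make the percentage arithmetic above concrete, here is a small sketch on made-up rows (the `toy_vehicles` data is an assumption for illustration). It filters with a boolean mask, which in plain pandas yields the same group sizes as the `where` call above, and it shows that a state with no pickup trucks becomes `NaN` after the index-aligned division unless it is filled:

```python
import pandas as pd

# Made-up rows for illustration; CT has no pickup trucks at all.
toy_vehicles = pd.DataFrame({
    "reg_state": ["NY", "NY", "NY", "NY", "NJ", "NJ", "CT"],
    "vehicle_type": ["PICK", "SUBN", "SDN", "SDN", "PICK", "SDN", "SDN"],
})

all_counts = toy_vehicles.groupby("reg_state").size()
is_pickup = toy_vehicles["vehicle_type"] == "PICK"
pick_counts = toy_vehicles[is_pickup].groupby("reg_state").size()

# Division aligns on the state index; CT is absent from pick_counts,
# so its result is NaN rather than 0.
toy_frac = (pick_counts / all_counts) * 100
toy_frac_filled = toy_frac.fillna(0).rename("% Pickup Trucks")

print(toy_frac_filled)
```

Whether to fill missing states with 0 or leave them out of the map is a presentation choice; the notebook's own code leaves them as they are.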

<a name='s1-6.2'></a>
### Beyond just passing data: **Accelerating** third-party code ###
Being able to pass these proxy objects to libraries like Plotly is great, but the benefits don't end there. When you enable cuDF's pandas accelerator mode, pandas operations running **inside the third-party library's functions** will also benefit from GPU acceleration where possible! Below, you can see an image illustrating how `cudf.pandas` can accelerate the pandas backend in Ibis, a library that provides a unified DataFrame API to various backends. We ran this example on a system with an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU. By loading the `cudf.pandas` extension, pandas operations within Ibis can use the GPU with zero code change. It just works.

<p><img src='images/cuDF.png' width=720></p>

<a name='s1-7'></a>
## Conclusion ##
With cuDF's pandas accelerator mode, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and run your existing code on a GPU! To learn more about cuDF's pandas accelerator mode, we encourage you to visit [rapids.ai/cudf-pandas](https://rapids.ai/cudf-pandas).

**Well Done!** Let's move to the [next notebook](02_cuML.ipynb). 
