<a href="https://colab.research.google.com/drive/1-e6wcr7ehTQEsYYhf_Hj8uiNzMzLPmvb?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAPIDS Demo

Today, we're going to explore how to perform common data analytics tasks on GPUs using the RAPIDS libraries. Specifically, we'll use `cudf` and `cuml` for common DataFrame and machine learning tasks. Note that the setup portion of this notebook draws on [a setup notebook](https://colab.research.google.com/drive/13sspqiEZwso4NYTbsflpPyNFaVAAxUgr) linked in the RAPIDS documentation and is meant to be run in a Colab notebook.

The `cudf` and `cuml` demos are built off of the notebooks provided in the [RAPIDS notebook repositories](https://github.com/rapidsai/notebooks) on GitHub, which you can explore further if you are interested. There are many other relevant libraries in the RAPIDS ecosystem, e.g. `cugraph`, which allows you to perform network analyses on GPUs. Note that all of this code is intended to be run on a single GPU, but can be further parallelized in a multi-GPU cluster using either [Dask](https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html#when-to-use-cudf-and-dask-cudf) or [Spark 3.0](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/apache-spark-3/) (which we'll talk about later in the course).

## Setup

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [None]:
!nvidia-smi

Tue Apr  4 21:51:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
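
If you would rather check the allocation programmatically than read the table by eye, the following is a minimal sketch using only the Python standard library. It assumes `nvidia-smi` is on the `PATH` and relies on its `--query-gpu=name` option; the helper names (`parse_gpu_names`, `is_expected_gpu`, `detect_gpus`) are hypothetical and not part of RAPIDS or the setup script.

```python
import re
import shutil
import subprocess

# GPU models named in the text above; adjust if your runtime offers others.
EXPECTED_GPUS = ("T4", "P4", "P100")


def parse_gpu_names(smi_output):
    """Split the output of the name query into one GPU name per line."""
    return [line.strip() for line in smi_output.splitlines() if line.strip()]


def is_expected_gpu(name, expected=EXPECTED_GPUS):
    """True if any token of the name (split on spaces/hyphens) is an expected model."""
    return any(token in expected for token in re.split(r"[\s-]+", name))


def detect_gpus():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_names(result.stdout) if result.returncode == 0 else []


names = detect_gpus()
if not names:
    print("No NVIDIA GPU detected; check the runtime type.")
for name in names:
    status = "OK" if is_expected_gpu(name) else "not in the expected list"
    print(f"{name}: {status}")
```

On a machine without an NVIDIA driver this prints the "No NVIDIA GPU detected" message rather than raising an error.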

Then we run the setup script below, which:

1. Checks to make sure that the GPU is RAPIDS compatible
1. Installs the **current stable version** of RAPIDSAI's core libraries using pip and **will complete in about 3-4 minutes**

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

Our RAPIDS libraries are now installed on Colab, and we can import them into our session.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import cudf
import numpy as np

## GPU DataFrames: `cudf`

Let's take the AirBnB listing data that we were looking at last week and load it in as a `cudf` GPU DataFrame to demonstrate some of the capabilities we can expect.

In [None]:
df = cudf.read_csv('listings_chi.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2384,"Hyde Park - Walk to UChicago, 10 min to McCormick",2613,Rebecca,,Hyde Park,41.7879,-87.5878,Private room,65,2,182,2021-03-28,2.38,1,0
1,4505,394 Great Reviews. 127 y/o House. 40 yds to tr...,5775,Craig & Kathleen,,South Lawndale,41.85373,-87.6954,Entire home/apt,113,2,395,2020-07-14,2.67,1,180
2,7126,Tiny Studio Apartment 94 Walk Score,17928,Sarah,,West Town,41.90166,-87.68021,Entire home/apt,65,2,394,2021-04-11,2.74,1,267
3,9811,Barbara's Hideaway - Old Town,33004,At Home Inn,,Lincoln Park,41.91943,-87.63898,Entire home/apt,120,5,54,2021-01-15,0.63,11,1
4,10945,The Biddle House (#1),33004,At Home Inn,,Lincoln Park,41.91196,-87.63981,Entire home/apt,175,4,22,2021-03-25,0.26,11,125


Once we have that data, we can perform many of the standard DataFrame operations we perform on CPUs -- just accelerated by our GPU!

In [None]:
df.groupby(['neighbourhood', 'room_type']) \
  .price \
  .mean()

neighbourhood           room_type      
Hermosa                 Entire home/apt    110.642857
West Lawn               Entire home/apt    143.250000
Greater Grand Crossing  Private room        44.857143
Ashburn                 Private room        57.500000
Rogers Park             Shared room         54.333333
                                              ...    
East Side               Private room        20.000000
Calumet Heights         Shared room         15.750000
Forest Glen             Entire home/apt    194.500000
New City                Private room        39.222222
Douglas                 Private room        69.760000
Name: price, Length: 178, dtype: float64

One thing to note, though, is that not all of the functionality we might expect out of CPU clusters is available yet in the `cudf` DataFrame implementation.

For instance (and of particular note!), our ability to apply custom functions is still pretty limited. `cudf` uses Numba's CUDA compiler to translate this code for the GPU and [many standard `numpy` operations are not supported](https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html#numpy-support) (for instance, if you try to apply the distance calculation that we performed in last week's CPU Vectorization/Multithreading demonstration notebook, it will fail to compile for the GPU).

That being said, we can perform many base-Python operations inside of custom functions, so if you can express your custom functions in this way, it might be worth your while to do this work on a GPU. For example, let's create a custom price index that indicates whether an AirBnB is "Cheap" (0), "Moderately Expensive" (1), or "Very Expensive" (2) using `cudf`'s [`apply_rows` method](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#numba-kernels-for-dataframes):

In [None]:
def expensive(x, price_index):
    # passed through Numba's CUDA compiler and `for`
    # loop is automatically parallelized for GPU
    for i, price in enumerate(x):
        if price < 50:
            price_index[i] = 0
        elif price < 100:
            price_index[i] = 1
        else:
            price_index[i] = 2

# Use cudf's `apply_rows` API for applying function to every row in DataFrame
df = df.apply_rows(expensive,
                   incols={'price':'x'},
                   outcols={'price_index': int},
                   kwargs={})

# Confirm that price index created correctly
df[['price', 'price_index']].head()

Unnamed: 0,price,price_index
0,65,1
1,113,2
2,65,1
3,120,2
4,175,2
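
To sanity-check the kernel's output without a GPU, the same cut-offs can be expressed in plain Python. This is only a CPU reference sketch, not part of `cudf`; the names `THRESHOLDS` and `price_index_cpu` are hypothetical, and it uses the standard library's `bisect_right` in place of the `if`/`elif` chain.

```python
from bisect import bisect_right

# Upper bounds of the "Cheap" and "Moderately Expensive" buckets,
# matching the `price < 50` and `price < 100` tests in the kernel.
THRESHOLDS = [50, 100]


def price_index_cpu(price):
    """Return 0, 1, or 2 using the same cut-offs as the GPU kernel."""
    # bisect_right counts how many thresholds are <= price, so a price of
    # exactly 50 lands in bucket 1 and exactly 100 lands in bucket 2.
    return bisect_right(THRESHOLDS, price)


# Prices from the first five rows shown above.
head_prices = [65, 113, 65, 120, 175]
print([price_index_cpu(p) for p in head_prices])  # [1, 2, 1, 2, 2]
```

The printed values agree with the `price_index` column in the output above.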


## Training Machine Learning Models with `cuml`

In addition to preprocessing and analyzing data on GPUs, we can also train (a limited set of) Machine Learning models directly on our GPU using the `cuml` library in the RAPIDS ecosystem. This can give us a significant speedup in training time over libraries like `sklearn` on CPUs for large datasets.

For instance, let's train a linear regression model to predict the price of an AirBnB based on other values in its listing information (e.g. "reviews per month" and "minimum nights"). We'll fit the model both in scikit-learn on our CPU host and cuml on our GPU device and compare how long it takes.

In [None]:
from cuml.linear_model import LinearRegression as cuLinearRegression
from sklearn.linear_model import LinearRegression as skLinearRegression

We can see that for large datasets, `cuml` is quite a bit faster:

In [None]:
# subset and tile dataset to mimic larger dataset
df_sub = df[['reviews_per_month', 'minimum_nights', 'price']].astype(np.float32).dropna()
df_big = df_sub.tile(1000)
X = df_big[['reviews_per_month', 'minimum_nights']]
y = df_big[['price']]

%timeit fit = cuLinearRegression().fit(X, y)

21.8 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
# Copy the dataset from GPU device memory to CPU host memory
# to compare CPU and GPU results.
X_cpu = X.to_pandas()
y_cpu = y.to_pandas()

%timeit fit_cpu = skLinearRegression().fit(X_cpu, y_cpu)

509 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
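
The `%timeit` magic used above is only available in IPython/Jupyter. If you want a comparable measurement in a plain Python script, one possible sketch uses the standard library's `timeit` module. The helper `time_call` and the `stand_in_workload` function are hypothetical placeholders: in practice you would pass a callable that runs the model fit, and this sketch does not replicate `%timeit`'s automatic choice of loop count.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Time `func` with `repeat` runs of `number` calls each.

    Returns (mean, population standard deviation) of the per-call
    time in seconds.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


def stand_in_workload():
    # Placeholder for the real work, e.g. a call that fits a model.
    return sum(i * i for i in range(100_000))


mean_s, std_s = time_call(stand_in_workload)
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms per loop")
```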


If we take a look at other standard machine learning algorithms in the documentation (for instance [k-means clustering](https://github.com/rapidsai/cuml/blob/branch-23.04/notebooks/kmeans_demo.ipynb)), we can also see significant speedups over performing the same operations on large datasets in scikit-learn on a CPU.

Note, though, that this is only true of larger data. If we use our original, smaller dataset, we have basically the same performance on CPU and GPU.

In [None]:
# smaller data
X = df_sub[['reviews_per_month', 'minimum_nights']]
y = df_sub[['price']]

%timeit fit = cuLinearRegression().fit(X, y)

2.69 ms ± 60.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
X_cpu = X.to_pandas()
y_cpu = y.to_pandas()

%timeit fit_cpu = skLinearRegression().fit(X_cpu, y_cpu)

2.74 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
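
To put the four measurements side by side, the short sketch below computes the CPU-to-GPU time ratio from the mean times reported above. The numbers are copied from this notebook's outputs and will differ from run to run and across hardware; the dictionary layout is just one convenient way to organize them.

```python
# Mean fit times reported by %timeit above, in milliseconds.
timings_ms = {
    "large (tiled 1000x)": {"gpu": 21.8, "cpu": 509.0},
    "small (original)": {"gpu": 2.69, "cpu": 2.74},
}

# A ratio above 1 means the GPU fit was faster than the CPU fit.
speedups = {name: t["cpu"] / t["gpu"] for name, t in timings_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: CPU time / GPU time = {ratio:.1f}x")
# large (tiled 1000x): CPU time / GPU time = 23.3x
# small (original): CPU time / GPU time = 1.0x
```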


## Discussion:

1. Based on what you know about CPUs and GPUs, why do you think there is a disparity between the GPU's relative performance on large vs. small data?
2. Now that you've trained your model, you want to be able to generate predicted prices based on individual observations of 'reviews_per_month' and 'minimum_nights' (e.g. from a single listing with 1 review per month and 1 minimum night, you will make a single prediction via code like `fit.predict(np.array([[1, 1]]))`), so that you can provide recommendations via API to hosts as to how they should price their listings based on the features of their property. 

  Would it be better to run this prediction service on a CPU or continue running it on the GPU that you trained the model on (note that you can [pickle a model](https://docs.rapids.ai/api/cuml/stable/pickling_cuml_models/) that you have trained on a GPU and employ it on a CPU if desired)? Why? What are the tradeoffs?

