# Objective

This notebook showcases:
- the speed of RAPIDS cuDF/CuPy compared with pandas/NumPy
- how closely the RAPIDS API mirrors the pandas API
- RAPIDS cuML compared with scikit-learn (speed and accuracy)

In [17]:
!nvidia-smi

Tue Jan  3 02:47:27 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

If you are using Colab, run the following steps to set up your RAPIDS environment:

- Update gcc in Colab
- Install Conda
- Install the current stable version of the RAPIDS libraries
- Copy the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration

In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

In [1]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:21
🔁 Restarting kernel...


In [1]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

In [59]:
!pip install pycountry

import cudf
import cupy as cp

import pandas as pd
import numpy as np

import pycountry
import random

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

## Reading Data

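Deliberately create a large dataset of about one million rows (2**20 = 1,048,576) and write it to a CSV file. Then read that file with pandas and, for comparison, directly into GPU memory with cuDF.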

In [61]:
# make a large dataset
n_samples = 2**20
n_features = 10
data_type = np.float32

X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
X.columns = ['feature' + str(x) for x in range(n_features)]

# randomly generate countries for the dataframe
X['country'] = pd.Series(
    random.choices([x.name for x in pycountry.countries], k=len(X)),  # one country per row of X (df is not defined yet)
    index=X.index
)

pd.concat([X, pd.Series(y.astype(np.int32))], axis=1).to_csv('./test.csv', index=False)

del X, y

In [62]:
%%time 
df = pd.read_csv('./test.csv')
df.shape

CPU times: user 1.97 s, sys: 567 ms, total: 2.54 s
Wall time: 2.6 s


(1048576, 12)

For comparison, read the same file into a RAPIDS cuDF DataFrame:

In [63]:
%%time 
gdf = cudf.read_csv('./test.csv')
gdf.shape == df.shape

CPU times: user 91.5 ms, sys: 18 ms, total: 109 ms
Wall time: 117 ms


True

In [64]:
gdf.head()

Unnamed: 0,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,country,0
0,-0.535363,0.315857,-0.910635,1.166296,0.90993,-0.819358,0.755552,-0.013121,0.776556,0.002218,Syrian Arab Republic,1
1,-0.014249,1.654942,1.218648,0.592086,-0.08425,0.44154,0.843079,-1.560887,-1.205743,-0.583176,Tunisia,0
2,2.585151,-2.49061,0.784548,-1.266088,0.766932,-0.846633,-0.376109,2.247664,1.456157,-1.370838,Albania,0
3,-0.784457,1.620303,0.535078,0.01785,-0.822449,-0.137021,-0.234215,-1.680125,-1.717486,-0.076106,Gibraltar,0
4,-0.42547,-0.748202,-0.907098,-0.854806,0.480172,-1.076239,-0.919844,0.546936,-0.016022,0.809858,Congo,0


## Data Transformation

In [65]:
%%time 
df['feature0'] = df['feature0'].astype('float32')

CPU times: user 17.1 ms, sys: 25.1 ms, total: 42.1 ms
Wall time: 46.5 ms


In [66]:
%%time 
gdf['feature0'] = gdf['feature0'].astype('float32')

CPU times: user 1.74 ms, sys: 996 µs, total: 2.74 ms
Wall time: 1.98 ms


## Data Aggregation

In [67]:
%%time 
df['feature0'].mean()

CPU times: user 4.87 ms, sys: 0 ns, total: 4.87 ms
Wall time: 3.64 ms


-0.00045146846

In [68]:
%%time 
gdf['feature0'].mean()

CPU times: user 1.17 ms, sys: 2.01 ms, total: 3.18 ms
Wall time: 3.7 ms


-0.00045146862501266186

## Data Slicing

In [69]:
%%time 
e_countries_pd = df.loc[df['country'].str.startswith('E')]
e_countries_pd.head()

CPU times: user 295 ms, sys: 9 µs, total: 295 ms
Wall time: 295 ms


Unnamed: 0,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,country,0
33,0.088731,1.919771,-0.84066,2.117903,1.453341,-0.881644,-0.588299,-1.423739,-0.03095,1.318564,El Salvador,1
42,0.078306,-1.075666,1.036882,-0.288157,-0.994951,0.518903,-0.300352,1.040673,0.876103,0.464713,Equatorial Guinea,1
85,-0.142626,2.02586,0.127316,1.574368,1.727989,0.663708,-0.726141,-1.681019,-0.664001,0.222185,Estonia,1
90,1.465856,-1.135237,1.077965,0.804901,-0.235604,-1.324354,-0.366701,1.398157,1.984561,1.183827,Egypt,1
99,-0.453122,-1.757094,-0.57645,-0.406361,-1.098253,-0.778733,1.761517,1.71733,1.492604,-0.110965,Ethiopia,1


In [70]:
%%time
e_countries = gdf.loc[gdf['country'].str.startswith('E')]
e_countries.head()

CPU times: user 9.68 ms, sys: 995 µs, total: 10.7 ms
Wall time: 14.6 ms


Unnamed: 0,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,country,0
33,0.088731,1.919771,-0.84066,2.117903,1.453341,-0.881644,-0.588299,-1.423739,-0.03095,1.318564,El Salvador,1
42,0.078306,-1.075666,1.036882,-0.288157,-0.994951,0.518903,-0.300352,1.040673,0.876103,0.464713,Equatorial Guinea,1
85,-0.142626,2.02586,0.127316,1.574368,1.727989,0.663708,-0.726141,-1.681019,-0.664001,0.222185,Estonia,1
90,1.465856,-1.135237,1.077965,0.804901,-0.235604,-1.324354,-0.366701,1.398157,1.984561,1.183827,Egypt,1
99,-0.453122,-1.757094,-0.57645,-0.406361,-1.098253,-0.778733,1.761517,1.71733,1.492604,-0.110965,Ethiopia,1


In [73]:
%%time 
# countries whose name starts with 'U' and ends with 's'
u_s_countries_pd = df.loc[np.logical_and(df['country'].str.startswith('U'), df['country'].str.endswith('s'))]
u_s_countries_pd.tail()

CPU times: user 518 ms, sys: 3.02 ms, total: 521 ms
Wall time: 548 ms


Unnamed: 0,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,country,0
1048392,1.647939,-0.94176,1.908531,0.768186,-2.233725,0.184583,0.940199,1.187034,1.742352,-0.602549,United Arab Emirates,1
1048411,-2.230785,0.453977,0.454118,0.959416,0.108926,-0.693497,1.594232,-0.212688,0.430973,0.45605,United States Minor Outlying Islands,1
1048431,-0.42127,2.016044,-1.579364,1.28384,-1.909855,0.076988,-0.180474,-1.749363,-0.931163,1.083131,United Arab Emirates,0
1048444,0.088414,-0.008242,1.221826,1.279175,-0.247247,2.036476,1.897682,0.354428,1.23139,1.8446,United States Minor Outlying Islands,1
1048493,-0.137712,2.568535,-1.644909,0.573583,0.10455,0.438326,-1.485963,-2.515934,-2.201435,-1.054947,United States Minor Outlying Islands,0


In [74]:
%%time
# countries whose name starts with 'U' and ends with 's'
u_s_countries = gdf.loc[cp.logical_and(gdf['country'].str.startswith('U'), gdf['country'].str.endswith('s'))]
u_s_countries.tail()

CPU times: user 228 ms, sys: 12 ms, total: 240 ms
Wall time: 244 ms


Unnamed: 0,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,country,0
1048392,1.647939,-0.94176,1.908531,0.768186,-2.233725,0.184582,0.940199,1.187034,1.742352,-0.602549,United Arab Emirates,1
1048411,-2.230785,0.453977,0.454118,0.959416,0.108926,-0.693497,1.594232,-0.212688,0.430973,0.45605,United States Minor Outlying Islands,1
1048431,-0.42127,2.016044,-1.579364,1.28384,-1.909855,0.076988,-0.180474,-1.749363,-0.931163,1.083131,United Arab Emirates,0
1048444,0.088414,-0.008242,1.221826,1.279175,-0.247247,2.036476,1.897682,0.354428,1.23139,1.8446,United States Minor Outlying Islands,1
1048493,-0.137712,2.568535,-1.644909,0.573583,0.10455,0.438326,-1.485963,-2.515934,-2.201435,-1.054947,United States Minor Outlying Islands,0
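
The two-condition filter above combines its masks with `np.logical_and` / `cp.logical_and`. A common alternative is the `&` operator on the boolean Series themselves. The sketch below runs on pandas only, with a small made-up frame (`toy` and its values are illustrative, not taken from `test.csv`). The same expression is expected to work on a cuDF Series, but that is an assumption and is not exercised here.

```python
import pandas as pd

# Tiny stand-in for the notebook's `country` column (illustrative values only).
toy = pd.DataFrame({
    "country": ["United States", "Uganda", "United Arab Emirates", "Spain", "Uzbekistan"],
    "feature0": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Element-wise AND of two boolean masks using the `&` operator.
mask = toy["country"].str.startswith("U") & toy["country"].str.endswith("s")
u_s_countries_toy = toy.loc[mask]

print(u_s_countries_toy["country"].tolist())
```

Writing the mask with `&` keeps the expression identical for both libraries, so switching from `df` to `gdf` would not require swapping `np` for `cp`.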


## Summary Statistics

### Grouping

In [77]:
%%time
countries_pd = df[['country', 'feature0']].groupby(['country'])
avg_feature0_pd = countries_pd.mean()
print(avg_feature0_pd[:5])

                feature0
country                 
Afghanistan    -0.024402
Albania         0.009175
Algeria         0.015253
American Samoa -0.006133
Andorra        -0.028711
CPU times: user 135 ms, sys: 10.1 ms, total: 145 ms
Wall time: 187 ms


In [78]:
%%time
countries = gdf[['country', 'feature0']].groupby(['country'])
avg_feature0 = countries.mean()
print(avg_feature0[:5])

                   feature0
country                    
Grenada            0.020730
Indonesia          0.018211
Sweden             0.002776
Wallis and Futuna -0.003760
Puerto Rico        0.014092
CPU times: user 10.5 ms, sys: 1.01 ms, total: 11.5 ms
Wall time: 12.2 ms
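
The first five rows of the two group-by results above list different countries. The pandas result is ordered alphabetically by group key, while the cuDF result is not. This is presumably because cuDF's `groupby` does not sort keys by default (an assumption, not verified in this notebook). The pandas-only sketch below mimics the unsorted behaviour with `sort=False` on made-up data and shows that `sort_index()` restores a comparable ordering.

```python
import pandas as pd

# Illustrative data only; not the notebook's dataset.
toy = pd.DataFrame({
    "country": ["Sweden", "Albania", "Sweden", "Grenada", "Albania"],
    "feature0": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# sort=False keeps groups in order of first appearance instead of key order.
unsorted_means = toy.groupby("country", sort=False).mean()

# Sorting the index afterwards makes results comparable row by row.
sorted_means = unsorted_means.sort_index()

print(unsorted_means.index.tolist())
print(sorted_means.index.tolist())
```

Applying `sort_index()` to both `avg_feature0_pd` and `avg_feature0` before printing should make the two outputs line up, assuming cuDF exposes the same method.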


### Sorting

In [79]:
%%time 
df_feature0 = df['feature0'].sort_values()
print(df_feature0[:5])
print(df_feature0[-5:])

68195     -4.935226
1015392   -4.737764
573151    -4.571545
755089    -4.535323
156861    -4.533626
Name: feature0, dtype: float32
736820    4.425749
559937    4.458586
457167    4.502169
862358    4.759808
402546    5.256614
Name: feature0, dtype: float32
CPU times: user 203 ms, sys: 1.98 ms, total: 205 ms
Wall time: 207 ms




In [80]:
%%time 
gdf_feature0 = gdf['feature0'].sort_values()
print(gdf_feature0[:5])
print(gdf_feature0[-5:])

68195     -4.935226
1015392   -4.737764
573151    -4.571545
755089    -4.535323
156861    -4.533626
Name: feature0, dtype: float32
736820    4.425749
559937    4.458586
457167    4.502169
862358    4.759808
402546    5.256614
Name: feature0, dtype: float32
CPU times: user 14.6 ms, sys: 3 ms, total: 17.6 ms
Wall time: 17.7 ms


## Machine Learning
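cuML vs scikit-learn: train the same random forest classifier with both libraries and compare training time, prediction time, and accuracy.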

In [81]:
%%time
X = df.drop(columns=['0', 'country'])
# cuML Random Forest Classifier requires the labels to be integers
y = df['0']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    random_state=0)

CPU times: user 222 ms, sys: 1.97 ms, total: 224 ms
Wall time: 225 ms


In [82]:
%%time
X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)

y_cudf_train = cudf.Series(y_train.values)

CPU times: user 71.4 ms, sys: 5.98 ms, total: 77.4 ms
Wall time: 77.4 ms


### scikit-learn model

In [83]:
%%time
sk_model = skrfc(n_estimators=40,
                 max_depth=10,
                 max_features=1.0,
                 random_state=10)

sk_model.fit(X_train, y_train)

CPU times: user 5min 17s, sys: 394 ms, total: 5min 17s
Wall time: 5min 22s


In [84]:
%%time
sk_predict = sk_model.predict(X_test)
sk_acc = accuracy_score(y_test, sk_predict)

CPU times: user 1.11 s, sys: 4.99 ms, total: 1.12 s
Wall time: 1.11 s


### cuML model

In [88]:
%%time
cuml_model = curfc(n_estimators=40,
                   max_depth=10,
                   max_features=1.0,
                   random_state=10)

cuml_model.fit(X_cudf_train, y_cudf_train)

CPU times: user 2.65 s, sys: 1.61 s, total: 4.27 s
Wall time: 2.26 s


RandomForestClassifier()

In [89]:
%%time
fil_preds_orig = cuml_model.predict(X_cudf_test)

fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)

CPU times: user 48.2 ms, sys: 11 ms, total: 59.2 ms
Wall time: 42.6 ms


In [90]:
print("SKL accuracy: %s" % sk_acc)
print("CUML accuracy: %s" % fil_acc_orig)

SKL accuracy: 0.9360275864601135
CUML accuracy: 0.9360307455062866
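
To summarise the comparison, the sketch below turns the wall times reported in the cell outputs above into speedup ratios (CPU time divided by GPU time). The numbers are copied from single runs on one Tesla T4 session, so they are rough indications rather than benchmarks. The dictionary keys are labels chosen here for illustration.

```python
# Wall times in seconds, copied from the cell outputs above: (CPU library, RAPIDS).
wall_times = {
    "read_csv":   (2.6, 0.117),
    "astype":     (0.0465, 0.00198),
    "mean":       (0.00364, 0.0037),
    "startswith": (0.295, 0.0146),
    "groupby":    (0.187, 0.0122),
    "sort":       (0.207, 0.0177),
    "rf_fit":     (5 * 60 + 22, 2.26),   # 5min 22s for scikit-learn
    "rf_predict": (1.11, 0.0426),
}

# Speedup > 1 means the RAPIDS version was faster in that run.
speedups = {name: cpu / gpu for name, (cpu, gpu) in wall_times.items()}

for name, ratio in speedups.items():
    print(f"{name:<12}{ratio:8.1f}x")
```

In these runs the random forest fit shows the largest gap, while the simple `mean` aggregation takes roughly the same time on both libraries. The two accuracies agree to about five decimal places.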
