NVIDIA Rapids is a software package to enable data science computations on GPUs.  It offers dataframes that are mostly compatible with pandas but that can outperform the latter by a substantial factor for some operations.  Rapids dataframes are defined in the `cudf` module.

## Requirements

To keep the distinction between pandas and Rapids dataframes clear, we import both modules without aliasing in this notebook.

In [2]:
import pandas
import cudf
import numpy as np

## Creating dataframes

To run some performance tests, we create dataframes with a substantial amount of data.  The pandas dataframe is assigned to the variable `df`, the Rapids dataframe to `cf`.

In [3]:
nr_rows = 20_000_000
data = {
    'A': np.random.uniform(0.0, 1.0, size=nr_rows),
    'B': np.random.uniform(0.0, 1.0, size=nr_rows),
    'C': np.random.uniform(0.0, 1.0, size=nr_rows),
}

In [4]:
df = pandas.DataFrame(data)

The Rapids dataframe is created in exactly the same way.

In [5]:
cf = cudf.DataFrame(data)
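When a pandas dataframe already exists, it can also be copied to the GPU instead of being rebuilt from the dictionary.  The following is a minimal sketch using `cudf.from_pandas` and the `to_pandas` method of a Rapids dataframe.  The helper names `to_gpu` and `to_host` are hypothetical, and the fallback to plain pandas when `cudf` cannot be imported is an assumption made only so that the example also runs on a machine without a GPU.

```python
import numpy as np
import pandas

try:
    import cudf  # requires an NVIDIA GPU and a Rapids installation
except ImportError:
    cudf = None  # assumption: fall back to pandas-only behaviour


def to_gpu(frame):
    """Copy a pandas dataframe to GPU memory (no-op if cudf is unavailable)."""
    if cudf is None:
        return frame
    return cudf.from_pandas(frame)


def to_host(frame):
    """Return a pandas dataframe, copying from the GPU if necessary."""
    if isinstance(frame, pandas.DataFrame):
        return frame
    return frame.to_pandas()


small = pandas.DataFrame({'A': np.random.uniform(0.0, 1.0, size=1_000)})
on_gpu = to_gpu(small)          # host -> device copy
round_trip = to_host(on_gpu)    # device -> host copy
```

Each of these conversions moves the complete dataframe between host and device memory, so in a real workflow one would convert once and keep the data on the GPU for as long as possible.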

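It is interesting to check the data types used in the pandas and the Rapids dataframes: they are identical.  Note that the output below lists four columns because these cells were executed after column `D` was added in the section on creating new columns.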

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 4 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
dtypes: float64(4)
memory usage: 610.4 MB


In [16]:
cf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -----
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
dtypes: float64(4)
memory usage: 610.4 MB
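The reported memory usage can be verified by hand.  Each `float64` value occupies 8 bytes, and the figure printed by `info()` is computed by dividing by powers of 1024, so the "MB" in the output is in fact a binary megabyte (MiB).  A small worked calculation, ignoring the negligible size of the `RangeIndex`:

```python
nr_rows = 20_000_000
nr_columns = 4          # A, B, C and D
bytes_per_value = 8     # a float64 occupies 8 bytes

total_bytes = nr_rows * nr_columns * bytes_per_value
total_mib = total_bytes / 1024**2

print(f"{total_bytes:,} bytes = {total_mib:.1f} MiB")
# 640,000,000 bytes = 610.4 MiB
```

The same amount of memory has to be available on the GPU to hold the Rapids dataframe.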


## Applying functions

We can compare the time pandas and cudf need to compute the columnwise average.

In [6]:
%timeit df.A.mean()

31 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [7]:
%timeit cf.A.mean()

2.15 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


You will likely see a warning that the slowest run took substantially longer than the other runs.  This is due to data movement from the host memory to the GPU device memory.  This will of course also happen in practice, and it illustrates the importance of ensuring that the data required for the computations resides on the GPU.  Nevertheless, it is clear that on average the computation on the Rapids dataframe is faster by roughly a factor of 14 (31 ms versus 2.15 ms).
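To keep one-off costs such as the slow first run out of a measurement, one can perform an untimed warm-up call before timing.  The sketch below shows this idea in plain Python; the helper names `time_call` and `speedup` are hypothetical, and a stand-in workload is used so that the example runs without a GPU.  It also recomputes the speedup from the mean timings reported above.

```python
import statistics
import time


def time_call(func, repeats=7, warmup=1):
    """Return the wall-clock time in seconds of each timed call to func."""
    for _ in range(warmup):
        func()  # untimed: absorbs one-off costs such as a first transfer
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workload so that the sketch runs without a GPU.
timings = time_call(lambda: sum(range(100_000)))
print(f"median: {statistics.median(timings) * 1e3:.3f} ms")

# Mean timings reported above: 31 ms for pandas, 2.15 ms for cudf.
print(f"speedup: {speedup(31e-3, 2.15e-3):.1f}x")
# speedup: 14.4x
```

With the dataframes of this notebook, the call would be `time_call(lambda: cf.A.mean())`.  Reporting the median rather than the mean makes the result less sensitive to a single slow run.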

Computing the rowwise average is substantially slower because a dataframe is stored column by column, so each row's result has to combine values from every column.

In [8]:
%timeit df.mean(axis=1)

356 ms ± 35.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%timeit cf.mean(axis=1)

The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached.
60.8 ms ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Still, Rapids outperforms the pandas implementation.  However, in fairness, one should compare to the latest version of pandas, which has substantial performance improvements with respect to the 1.x releases.

In [10]:
pandas.__version__

'1.5.3'
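When only a few columns are involved, the rowwise average timed in this section can also be written as arithmetic on whole columns.  The sketch below checks on a small pandas dataframe that both formulations give the same values.  It uses pandas only so that it runs without a GPU, and it makes no claim about which formulation is faster; that would have to be timed on the actual data.

```python
import numpy as np
import pandas

rng = np.random.default_rng(seed=1234)
small = pandas.DataFrame({
    'A': rng.uniform(0.0, 1.0, size=1_000),
    'B': rng.uniform(0.0, 1.0, size=1_000),
    'C': rng.uniform(0.0, 1.0, size=1_000),
})

# Rowwise mean via the dedicated method...
row_mean = small.mean(axis=1)

# ...and the same quantity built from whole-column operations.
row_mean_columns = (small.A + small.B + small.C) / 3.0

assert np.allclose(row_mean, row_mean_columns)
```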

## Creating new columns

We can create a new column in a dataframe using an arithmetic expression on other columns.

In [13]:
%timeit df['D'] = 2.1*df.A + 3.5*df.B

141 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%timeit cf['D'] = 2.1*cf.A + 3.5*cf.B

30 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Again, the Rapids dataframe outperforms the pandas version by a substantial factor.

## Categorical data

We can introduce a column that contains categorical data by using the `cut` function.  In this case, we bin column A into 5 categories.  First we time the operation without storing the resulting values, then we store them in a new column, `'label'`, for further testing.

In [17]:
%timeit pandas.cut(df.A, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=['c1', 'c2', 'c3', 'c4', 'c5'])

455 ms ± 41.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%timeit cudf.cut(cf.A, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=['c1', 'c2', 'c3', 'c4', 'c5'])

44.5 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [35]:
df['label'] = pandas.cut(df.A, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                         labels=['c1', 'c2', 'c3', 'c4', 'c5'])

In [40]:
cf['label'] = cudf.cut(cf.A, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                       labels=['c1', 'c2', 'c3', 'c4', 'c5'])
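It helps to see how `cut` treats values that lie exactly on a bin edge.  By default, `pandas.cut` uses intervals that are open on the left and closed on the right, so the value 0.0 falls outside the first bin and becomes a missing value unless `include_lowest=True` is passed.  Since `np.random.uniform(0.0, 1.0)` samples from the half-open interval [0.0, 1.0), this can in principle occur for the data above, although it is extremely unlikely.  A small illustration with pandas only; the behaviour of `cudf.cut` is not verified here.

```python
import pandas

bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
labels = ['c1', 'c2', 'c3', 'c4', 'c5']
values = pandas.Series([0.0, 0.1, 0.2, 0.5, 1.0])

# Default: bins are (0.0, 0.2], (0.2, 0.4], ..., so 0.0 is not binned.
binned = pandas.cut(values, bins, labels=labels)
print(binned.tolist())
# [nan, 'c1', 'c1', 'c3', 'c5']

# include_lowest=True closes the first bin on the left as well.
binned_closed = pandas.cut(values, bins, labels=labels, include_lowest=True)
print(binned_closed.tolist())
# ['c1', 'c1', 'c1', 'c3', 'c5']
```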

## Group-by

Group-by operations are the bread and butter of data science, so we can check the performance using the column of categorical data we just added to the dataframes.

In [41]:
%timeit df[['label', 'A']].groupby('label').mean()

126 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [42]:
%timeit cf[['label', 'A']].groupby('label').mean()

74.2 ms ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Pivots

In [54]:
pandas.pivot_table(df, index='label', values=['A', 'B', 'C'], aggfunc=np.mean)

| label | A | B | C |
|-------|----------|----------|----------|
| c1 | 0.099957 | 0.499932 | 0.500068 |
| c2 | 0.299946 | 0.500151 | 0.499933 |
| c3 | 0.499957 | 0.500055 | 0.499867 |
| c4 | 0.700043 | 0.499943 | 0.500076 |
| c5 | 0.899999 | 0.499828 | 0.500103 |


Unfortunately, for pivot tables, the `cudf` implementation is quite different from the pandas implementation.  However, the same result can be obtained for the Rapids dataframe using a simple `groupby`, followed by applying `mean`.

In [53]:
cf[['label', 'A', 'B', 'C']].groupby('label').mean()

| label | A | B | C |
|-------|----------|----------|----------|
| c3 | 0.499957 | 0.500055 | 0.499867 |
| c5 | 0.899999 | 0.499828 | 0.500103 |
| c1 | 0.099957 | 0.499932 | 0.500068 |
| c2 | 0.299946 | 0.500151 | 0.499933 |
| c4 | 0.700043 | 0.499943 | 0.500076 |
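The equivalence between the pivot table and the group-by formulation can be checked on a small example.  The sketch below uses pandas only, so that it runs without a GPU, and made-up toy data.  The explicit `sort_index` gives a fixed row order; the cudf output above lists the groups in a different order than the pandas pivot table, so sorting is advisable before comparing results.

```python
import numpy as np
import pandas

# Toy data, for illustration only.
small = pandas.DataFrame({
    'label': ['c1', 'c2', 'c1', 'c2', 'c3'],
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [10.0, 20.0, 30.0, 40.0, 50.0],
})

pivoted = pandas.pivot_table(small, index='label', values=['A', 'B'],
                             aggfunc='mean')
grouped = small.groupby('label')[['A', 'B']].mean().sort_index()

# Both approaches yield the same per-label means in the same order.
assert list(pivoted.index) == list(grouped.index)
assert np.allclose(pivoted.to_numpy(), grouped.to_numpy())
```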


In [46]:
%timeit pandas.pivot_table(df, index='label', values=['A', 'B', 'C'], aggfunc=np.mean)

225 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [52]:
%timeit cf[['label', 'A', 'B', 'C']].groupby('label').mean()

The slowest run took 4.11 times longer than the fastest. This could mean that an intermediate result is being cached.
99.6 ms ± 66.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
