## Requirements

To keep pandas and polars dataframes clearly apart, we import both modules under their conventional aliases, `pd` and `pl`.

In [1]:
import pandas as pd
import polars as pl
import numpy as np

## Creating dataframes

To run some performance tests, we create dataframes with a substantial amount of data.  The pandas dataframe is assigned to the variable `df_pd`, the polars dataframe to `df_pl`.

In [2]:
nr_rows = 20_000_000
data = {
    'A': np.random.uniform(0.0, 1.0, size=nr_rows),
    'B': np.random.uniform(0.0, 1.0, size=nr_rows),
    'C': np.random.uniform(0.0, 1.0, size=nr_rows),
}

In [3]:
df_pd = pd.DataFrame(data)

The polars dataframe is created in exactly the same way.

In [4]:
df_pl = pl.DataFrame(data)

It is interesting to check the datatypes used in the pandas and the polars dataframes: they are identical, 64-bit floating point for all three columns.

In [5]:
df_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 3 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
dtypes: float64(3)
memory usage: 457.8 MB


In [6]:
df_pl.describe()

statistic,A,B,C
str,f64,f64,f64
"""count""",20000000.0,20000000.0,20000000.0
"""null_count""",0.0,0.0,0.0
"""mean""",0.499988,0.499994,0.499855
"""std""",0.288707,0.288642,0.288642
"""min""",5.8407e-08,8.308e-08,4.3475e-08
"""25%""",0.249857,0.250089,0.249803
"""50%""",0.499964,0.499873,0.499878
"""75%""",0.750029,0.749927,0.749737
"""max""",1.0,1.0,1.0


In [7]:
df_pl.estimated_size()

480000000

The size of the dataframes is essentially identical: 480,000,000 bytes is about 457.8 MiB, the figure pandas reports.

## Applying functions

We can time the difference between pandas and polars when computing the columnwise average.

In [8]:
%timeit df_pd.A.mean()

27.6 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [9]:
%timeit df_pl['A'].mean()

7.79 ms ± 300 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


On average, the computation on the polars dataframe is about a factor of 3.5 faster.

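Computing the rowwise average is substantially slower in both libraries due to the data structure used to represent the dataframe: the data is stored column by column, so a rowwise operation has to combine values from several columns.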

In [10]:
%timeit df_pd.mean(axis=1)

1.37 s ± 53.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
%timeit df_pl.select(pl.mean_horizontal(pl.all()).alias('mean'))

424 ms ± 78.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Again, polars outperforms the pandas implementation, this time by a factor of about 3.

## Creating new columns

We can create a new column in a dataframe using an arithmetic expression on other columns.

In [12]:
%timeit df_pd['D'] = 2.1*df_pd.A + 3.5*df_pd.B

132 ms ± 7.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [13]:
%timeit df_pl.select((2.1*pl.col('A') + 3.5*pl.col('B')).alias('D'))

90.3 ms ± 8.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Again, the polars dataframe outperforms the pandas version, here by a factor of about 1.5.

## Categorical data

We can introduce a column that contains categorical data by using the `cut` function.  In this case, we bin column A, using 5 categories.  First we time the operation without storing the resulting values, then we store them in a new column, `'label'` for further testing.

In [14]:
%timeit pd.cut(df_pd.A, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=['c1', 'c2', 'c3', 'c4', 'c5'])

392 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%timeit df_pl.select(pl.col('A').cut([0.2, 0.4, 0.6, 0.8, ], labels=['c1', 'c2', 'c3', 'c4', 'c5']))

468 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Interestingly, in this measurement the `cut` operation is about 20% slower in polars than in pandas.

In [16]:
df_pd['label'] = pd.cut(df_pd.A, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                         labels=['c1', 'c2', 'c3', 'c4', 'c5'])

In [17]:
df_pl = df_pl.select(pl.all(), pl.col('A').cut([0.2, 0.4, 0.6, 0.8, ],
                                               labels=['c1', 'c2', 'c3', 'c4', 'c5']).alias('label'))

## Group-by

Group-by operations are the bread and butter of data science, so we can check the performance using the column of categorical data we just added to the dataframes.

In [18]:
%timeit df_pd[['label', 'A']].groupby('label', observed=False).mean()

149 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [19]:
%timeit df_pl['label', 'A'].group_by('label').mean()

53.8 ms ± 817 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Pivots

In [20]:
pd.pivot_table(df_pd, index='label', values=['A', 'B', 'C'], aggfunc='mean', observed=False)

label,A,B,C
c1,0.1,0.500254,0.499845
c2,0.299966,0.50013,0.50001
c3,0.500008,0.499618,0.499957
c4,0.699962,0.5,0.499635
c5,0.899959,0.499968,0.499828


In [21]:
df_pl.pivot(['A', 'B', 'C'], index='label', aggregate_function='mean')

Unfortunately, for pivot tables, the polars implementation is quite different from the pandas implementation.  However, the equivalent of the pandas pivot table can be computed on the polars dataframe using a simple `group_by`, followed by applying `mean`.

In [22]:
df_pl['label', 'A', 'B', 'C'].group_by('label', maintain_order=True).mean()

label,A,B,C
cat,f64,f64,f64
"""c2""",0.299966,0.50013,0.50001
"""c4""",0.699962,0.5,0.499635
"""c3""",0.500008,0.499618,0.499957
"""c1""",0.1,0.500254,0.499845
"""c5""",0.899959,0.499968,0.499828


In [23]:
%timeit pd.pivot_table(df_pd, index='label', \
                       values=['A', 'B', 'C'], \
                       aggfunc='mean', \
                       observed=False)

352 ms ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
%timeit df_pl['label', 'A', 'B', 'C'].group_by('label').mean()

133 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Conclusion

Polars outperforms pandas on most of the operations timed here, with `cut` as the exception.  However, pandas has more features and it is easier to find examples and help.