## Speed up Pandas' GroupBy with Cython

Scenario: we have a large dataframe [comprising](https://en.wikipedia.org/wiki/User:Giraffedata/comprised_of) several smaller tables concatenated together.

We've done all our data cleaning and munging on the columns of the large dataframe, and now we need to break it back up into small dataframes.  This scenario may come up, hypothetically, when doing sequence modeling with recurrent neural networks and preparing mini-batches of sequences for the model to train against.

The obvious way to do it is `list(df.groupby(level=0, sort=False))`.  Unfortunately, Pandas can be really slow in this case -- making lots of little dataframes is an expensive operation.

Since we can make certain guarantees about our data, can we speed this up?
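The splitting approaches in this notebook rely on rows with the same label being adjacent, so that group boundaries are exactly the places where consecutive labels differ. A minimal sketch of a check for that property follows; `labels_are_contiguous` is a hypothetical helper written for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def labels_are_contiguous(index):
    """Return True if every label occupies one unbroken run of rows.

    Hypothetical helper for illustration; not part of pandas.
    """
    labels = np.asarray(index)
    if labels.size == 0:
        return True
    # Number of runs = number of places where neighbours differ, plus one.
    n_runs = int(np.count_nonzero(labels[1:] != labels[:-1])) + 1
    # If each label forms a single run, runs and unique labels match.
    return n_runs == pd.Index(labels).nunique()


contiguous = pd.Index(["a", "a", "b", "c", "c"])
scattered = pd.Index(["a", "b", "a"])
print(labels_are_contiguous(contiguous))  # True
print(labels_are_contiguous(scattered))   # False
```

If the check fails, a boundary-based split would return the same label more than once, so it is worth running once before trusting the faster versions.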

In [2]:
import pandas as pd
import numpy as np
from uuid import uuid4
import toolz

In [3]:
%load_ext Cython

In [4]:
nblocks = 10000
rowsperblock = np.random.randint(1, 7, size=nblocks)
ncols = 3

idx = list(toolz.concat([[uuid4().hex] * rpb for rpb in rowsperblock]))
df = pd.DataFrame(np.random.standard_normal(size=(rowsperblock.sum(), ncols)), index=idx)
print(df.info())
df.head(15)

<class 'pandas.core.frame.DataFrame'>
Index: 35241 entries, 4b8b61db1fc544e5b2f119e8ca15b5a6 to 0bbfa74e284c469db5b40c6dbc2f8cc2
Data columns (total 3 columns):
0    35241 non-null float64
1    35241 non-null float64
2    35241 non-null float64
dtypes: float64(3)
memory usage: 1.1+ MB
None


| | 0 | 1 | 2 |
|---|---|---|---|
| 4b8b61db1fc544e5b2f119e8ca15b5a6 | 0.810006 | 1.992543 | 0.872308 |
| 4b8b61db1fc544e5b2f119e8ca15b5a6 | -0.817873 | -0.522829 | 0.383479 |
| 4b8b61db1fc544e5b2f119e8ca15b5a6 | -1.212758 | -0.130093 | 0.029334 |
| 4b8b61db1fc544e5b2f119e8ca15b5a6 | 0.131763 | 0.375906 | 1.685487 |
| 3e9dea6b183f477f9d5886f0faf0fd9d | 0.715063 | -0.92354 | -1.677322 |
| 3e9dea6b183f477f9d5886f0faf0fd9d | -0.284031 | 0.377469 | -0.811794 |
| 3e9dea6b183f477f9d5886f0faf0fd9d | -0.750274 | 1.117333 | -0.69158 |
| 3e9dea6b183f477f9d5886f0faf0fd9d | 0.334399 | 0.510409 | -1.145321 |
| 3e9dea6b183f477f9d5886f0faf0fd9d | 0.315334 | -0.695746 | -0.185402 |
| 9f52d78519d24cf38a1ab2465f783f20 | 0.554443 | -0.496387 | -0.464667 |


In [5]:
df.index.is_monotonic_increasing

False

In [6]:
list(df.groupby(level=[0], sort=False))[:2]

[('4b8b61db1fc544e5b2f119e8ca15b5a6',
                                           0         1         2
  4b8b61db1fc544e5b2f119e8ca15b5a6  0.810006  1.992543  0.872308
  4b8b61db1fc544e5b2f119e8ca15b5a6 -0.817873 -0.522829  0.383479
  4b8b61db1fc544e5b2f119e8ca15b5a6 -1.212758 -0.130093  0.029334
  4b8b61db1fc544e5b2f119e8ca15b5a6  0.131763  0.375906  1.685487),
 ('3e9dea6b183f477f9d5886f0faf0fd9d',
                                           0         1         2
  3e9dea6b183f477f9d5886f0faf0fd9d  0.715063 -0.923540 -1.677322
  3e9dea6b183f477f9d5886f0faf0fd9d -0.284031  0.377469 -0.811794
  3e9dea6b183f477f9d5886f0faf0fd9d -0.750274  1.117333 -0.691580
  3e9dea6b183f477f9d5886f0faf0fd9d  0.334399  0.510409 -1.145321
  3e9dea6b183f477f9d5886f0faf0fd9d  0.315334 -0.695746 -0.185402)]

In [7]:
%timeit list(df.groupby(level=[0], sort=False))

954 ms ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%prun list(df.groupby(level=[0], sort=False))

 

### Creating lots of little dataframes is expensive
* Pandas has no fast path for constructing a DataFrame from data that's already clean, so every little dataframe pays the full construction overhead.
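As a rough sketch of the per-object overhead, the snippet below times wrapping one small array in a `DataFrame` many times against taking a NumPy view of it the same number of times. The block size, repeat counts, and function names are arbitrary choices for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
block = rng.standard_normal(size=(4, 3))  # one small block of rows
n_blocks = 1000


def make_frames():
    # Wrap the same small array in a new DataFrame each time.
    return [pd.DataFrame(block) for _ in range(n_blocks)]


def make_views():
    # Take a NumPy slice (a view, no copy) each time.
    return [block[:] for _ in range(n_blocks)]


t_frames = timeit.timeit(make_frames, number=3)
t_views = timeit.timeit(make_views, number=3)
print(f"DataFrame construction: {t_frames:.4f} s")
print(f"NumPy slicing:          {t_views:.4f} s")
```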

## Speedup version 1: use NumPy

* For this problem, we're willing to adjust the output and generate a sequence of NumPy arrays rather than a sequence of dataframes.
* With this adjustment, we'll see we can get a substantial speedup using NumPy operations.
* Once it's rewritten to use NumPy, we can get *further* speedups with Cython.

In [9]:
def splitby(df):
    idx = df.index
    # NumPy array of "posts" that delineate the row indices
    # with which to split the dataframe.
    posts = np.where(idx[1:] != idx[:-1])[0] + 1
    # One label per piece: the first row plus the row at each post.
    split_labels = idx[np.concatenate([[0], posts])]
    split_data = np.split(df.values, posts, 0)
    return list(zip(split_labels, split_data))
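To see how the "posts" work, here is the same boundary computation on a toy example. The arrays `labels` and `values` are made up for illustration; `np.flatnonzero` is used in place of `np.where(...)[0]`, which gives the same positions.

```python
import numpy as np

labels = np.array(["a", "a", "a", "b", "b", "c"], dtype=object)
values = np.arange(12.0).reshape(6, 2)

# Positions where a label differs from its predecessor: start of a new block.
posts = np.flatnonzero(labels[1:] != labels[:-1]) + 1
print(posts)  # [3 5]

# np.split cuts before each post, giving len(posts) + 1 pieces.
pieces = np.split(values, posts, axis=0)
print([p.shape for p in pieces])  # [(3, 2), (2, 2), (1, 2)]

# One label per piece: the first row, then the row at each post.
starts = np.concatenate([[0], posts])
print(list(labels[starts]))  # ['a', 'b', 'c']
```

The pieces returned by `np.split` are views into `values`, so no row data is copied.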

In [10]:
splitby(df)[:2]

[('4b8b61db1fc544e5b2f119e8ca15b5a6',
  array([[ 0.81000581,  1.99254348,  0.87230783],
         [-0.81787326, -0.52282892,  0.38347882],
         [-1.21275775, -0.13009344,  0.02933438],
         [ 0.13176327,  0.37590582,  1.68548682]])),
 ('3e9dea6b183f477f9d5886f0faf0fd9d',
  array([[ 0.71506328, -0.9235403 , -1.67732214],
         [-0.28403052,  0.37746918, -0.81179384],
         [-0.75027367,  1.11733273, -0.69158037],
         [ 0.33439915,  0.51040889, -1.14532109],
         [ 0.31533402, -0.69574639, -0.18540204]]))]

In [11]:
%timeit splitby(df)

17.5 ms ± 529 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Questions
* Why is this faster?
* What about `DataFrame.iloc`?  Can it give us fast slicing without the overhead?  Why or why not?
* If we double the number of columns, how do you expect the two versions to scale?

## Speedup version 2: Cython

In [81]:
%%cython -a

cimport cython
cimport numpy as cnp
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def splitby_cython(df):
    cdef:
        cnp.ndarray[double, ndim=2, mode="c"] cols = df.values
        cnp.ndarray[object, mode="c"] idx = df.index.values
        int n = idx.shape[0]
        list result = []
        int i, thispost = 0
        
    for i in range(1, n):
        if idx[i] != idx[i-1]:
            result.append((idx[i-1], cols[thispost:i]))
            thispost = i
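    # The final block carries the label of the last row (idx[i-1] would be
    # wrong when the last block is a single row).
    result.append((idx[n-1], cols[thispost:]))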
    return result
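Before trusting a compiled version, it can help to have a slow but easy-to-read version to compare against. The sketch below mirrors the same loop in plain Python and checks it against `groupby` on a toy frame; `splitby_reference` and `df_small` are hypothetical names introduced here, and the empty-frame guard is an addition for the sketch.

```python
import numpy as np
import pandas as pd


def splitby_reference(df):
    """Pure-Python mirror of the loop, for checking results (hypothetical helper)."""
    cols = df.values
    idx = df.index.to_numpy()
    n = idx.shape[0]
    result = []
    if n == 0:
        return result
    thispost = 0
    for i in range(1, n):
        if idx[i] != idx[i - 1]:
            # Rows thispost..i-1 share the label idx[i-1].
            result.append((idx[i - 1], cols[thispost:i]))
            thispost = i
    # The final block runs to the end and carries the last row's label.
    result.append((idx[n - 1], cols[thispost:]))
    return result


df_small = pd.DataFrame(
    np.arange(12.0).reshape(6, 2),
    index=["a", "a", "a", "b", "b", "c"],
)
expected = [(k, g.values) for k, g in df_small.groupby(level=0, sort=False)]
got = splitby_reference(df_small)
assert [k for k, _ in got] == [k for k, _ in expected]
assert all(np.array_equal(a, b) for (_, a), (_, b) in zip(got, expected))
```

The toy index deliberately ends with a single-row block, which is the case where picking the wrong label for the final block would show up.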