## Speed up Pandas' GroupBy with Cython

Scenario: we have a large dataframe [comprising](https://en.wikipedia.org/wiki/User:Giraffedata/comprised_of) several smaller tables concatenated together.

We've done all our data cleaning and munging on the columns of the large dataframe, and now we need to break it back up into small dataframes.  This scenario may, hypothetically, come up in sequence modeling with recurrent neural networks, when we need to prepare mini-batches of sequences for our model to train against.

A straightforward way to do it is `list(df.groupby(level=0, sort=False))`.  Unfortunately, Pandas can be really slow in this case -- making lots of little dataframes is an expensive operation.

Since we know certain constraints are satisfied, can we speed this up with NumPy and Cython?

In [None]:
import pandas as pd
import numpy as np
from uuid import uuid4
import toolz

In [None]:
%load_ext Cython

In [None]:
nblocks = 10000
rowsperblock = np.random.randint(1, 7, size=nblocks)
ncols = 3

idx = list(toolz.concat([[uuid4().hex] * rpb for rpb in rowsperblock]))
df = pd.DataFrame(np.random.standard_normal(size=(rowsperblock.sum(), ncols)), index=idx)
print(df.info())
df.head(15)

In [None]:
df.index.is_monotonic_increasing  # `is_monotonic` was removed in pandas 2.0

In [None]:
list(df.groupby(level=[0], sort=False))[:2]

In [None]:
%timeit list(df.groupby(level=[0], sort=False))

In [None]:
%prun list(df.groupby(level=[0], sort=False))

### Creating lots of little dataframes is expensive
* Pandas doesn't have a fast path for DataFrame creation with data that's already clean.
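
To get a rough feel for that cost, the sketch below times wrapping a small, already-clean block in a `DataFrame` against simply slicing the same rows out of a NumPy array. The names `values`, `n_calls`, `frame_cost`, and `slice_cost` are illustrative only, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# A small block of clean float data, similar in size to one group above.
values = np.random.standard_normal(size=(6, 3))
n_calls = 2000

# Per-call cost of wrapping three rows in a new DataFrame.
frame_cost = timeit.timeit(lambda: pd.DataFrame(values[0:3]), number=n_calls) / n_calls

# Per-call cost of taking the same rows as a plain NumPy view.
slice_cost = timeit.timeit(lambda: values[0:3], number=n_calls) / n_calls

print(f"DataFrame per block:     {frame_cost * 1e6:.2f} us")
print(f"ndarray slice per block: {slice_cost * 1e6:.2f} us")
```

Multiplying the per-block figures by the number of blocks gives a back-of-the-envelope estimate of how much of the `groupby` time goes into constructing frames.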

## NumPy version

* For this problem, we're willing to adjust the output and generate a sequence of NumPy arrays rather than a sequence of dataframes.
* With this adjustment, we'll see we can get a substantial speedup using NumPy operations.
* Once it's rewritten to use NumPy, then we can get *further* speedups with Cython.

In [None]:
def splitby(df):
    idx = df.index
    # NumPy array of "posts": the row positions at which the index label
    # changes, i.e. where to split the dataframe.
    posts = np.where(idx[1:] != idx[:-1])[0] + 1
    # Each block's label is the label of its first row.
    split_labels = idx[np.concatenate([[0], posts])]
    split_data = np.split(df.values, posts, 0)
    return list(zip(split_labels, split_data))

In [None]:
splitby(df)[:2]

In [None]:
%timeit splitby(df)
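
To make the "posts" idea in `splitby` concrete, here is a small worked example on a toy frame. The frame `toy` and the variable names are made up for illustration; the index is converted to an object array first so the comparison behaves the same regardless of how pandas stores string labels.

```python
import numpy as np
import pandas as pd

# Toy frame with three contiguous blocks: "a" x2, "b" x1, "c" x3.
toy = pd.DataFrame(
    np.arange(12.0).reshape(6, 2),
    index=["a", "a", "b", "c", "c", "c"],
)
labels = np.asarray(toy.index, dtype=object)

# A post is any row whose label differs from the row before it.
posts = np.where(labels[1:] != labels[:-1])[0] + 1

# Prepending 0 gives the first row of every block, and hence its label.
starts = np.concatenate([[0], posts])
block_labels = labels[starts]

# np.split cuts the values at the posts, yielding one array per block.
blocks = np.split(toy.values, posts, axis=0)
block_shapes = [b.shape for b in blocks]

print(posts.tolist())        # [2, 3]
print(list(block_labels))    # ['a', 'b', 'c']
print(block_shapes)          # [(2, 2), (1, 2), (3, 2)]
```

Note that this relies on equal labels being contiguous, which is the constraint we assumed at the start.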

### Questions
* Why is this faster?
* What about `DataFrame.iloc`?  Can it give us fast slicing without the overhead?  Why or why not?
* If we double the number of columns, how do you expect the two versions to scale?
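
One way to explore the `DataFrame.iloc` question empirically is to write a variant that finds the posts with NumPy but slices the frame with `iloc`, then time it against the two versions above. The function `splitby_iloc` below is a hypothetical sketch for that experiment, not part of pandas, and it makes no claim about which version wins.

```python
import numpy as np
import pandas as pd

def splitby_iloc(df):
    """Hypothetical variant of splitby that returns DataFrames via iloc."""
    labels = np.asarray(df.index, dtype=object)
    n = len(labels)
    if n == 0:
        return []
    # Row positions where the label changes.
    posts = np.where(labels[1:] != labels[:-1])[0] + 1
    starts = np.concatenate([[0], posts])
    stops = np.concatenate([posts, [n]])
    # Each piece is still a DataFrame, so pandas builds one object per block.
    return [(labels[a], df.iloc[a:b]) for a, b in zip(starts, stops)]

toy_iloc = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=["x", "x", "y", "z"])
pieces = splitby_iloc(toy_iloc)
print([(label, piece.shape) for label, piece in pieces])
```

Running `%timeit splitby_iloc(df)` next to the earlier timings should help answer the question for your own pandas version.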

## First Cython version

* With room for improvement...

In [None]:
%%cython -a

cimport cython
cimport numpy as cnp
import numpy as np

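def splitby_cython(df):
    cols = df.values
    idx = df.index.values
    n = idx.shape[0]
    result = []
    if n == 0:
        return result
    thispost = 0

    for i in range(1, n):
        if idx[i] != idx[i-1]:
            result.append((idx[i-1], cols[thispost:i]))
            thispost = i

    # The final block runs to the end; label it by its own first row
    # (idx[i-1] would be wrong when the last block has a single row).
    result.append((idx[thispost], cols[thispost:]))
    return result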

In [None]:
%timeit splitby_cython(df)

### Exercise

* Use constructs and patterns from previous notebooks to improve this result.

Some pointers:
* Think about how you can give Cython more type information to convert Python objects into C equivalents.
* Good candidates:
  * Loop indexing vars, arguments to `range()`.
  * NumPy arrays with `cdef cnp.ndarray[double] xyz = ...`.
  * Statically declaring `list` and other Python types.
* Remember the `@cython.boundscheck(False)` and `@cython.wraparound(False)` decorators.
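
When tuning the Cython version, it helps to confirm that each new attempt still produces the same blocks as the `groupby` baseline. The helper `check_split` below is a hypothetical sketch of such a check, and `naive_split` is only a pure-Python stand-in for whatever function is under test. It assumes equal labels are contiguous, as in the scenario above.

```python
import numpy as np
import pandas as pd

def check_split(df, candidate):
    """Hypothetical helper: compare candidate(df) with the groupby baseline."""
    expected = list(df.groupby(level=0, sort=False))
    got = candidate(df)
    assert len(got) == len(expected), "wrong number of blocks"
    for (want_label, want_frame), (label, block) in zip(expected, got):
        assert label == want_label, f"label mismatch: {label!r} != {want_label!r}"
        assert np.array_equal(np.asarray(block), want_frame.to_numpy()), "data mismatch"
    return True

def naive_split(df):
    """Pure-Python stand-in for the function being optimized."""
    labels = np.asarray(df.index, dtype=object)
    values = df.to_numpy()
    out, start = [], 0
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            out.append((labels[start], values[start:i]))
            start = i
    if len(labels):
        out.append((labels[start], values[start:]))
    return out

# The toy frame ends in single-row blocks, an easy edge case to get wrong.
toy_check = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=["p", "p", "q", "r"])
ok = check_split(toy_check, naive_split)
print(ok)
```

In the notebook, `check_split(df, splitby_cython)` can be run after each recompile, before looking at the `%timeit` numbers.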