# cuDF

Now let's move on to some higher-level APIs, starting with [cuDF](https://github.com/rapidsai/cudf). Like `pandas`, the `cudf` library is a dataframe package for working with tabular datasets.

Data is loaded onto the GPU and all operations are performed with GPU compute, but the API of `cudf` should feel very familiar to `pandas` users.

In [None]:
import cudf

## Data loading

In this tutorial we have some data stored in `data/`. Most of this data is too small to really benefit from GPU acceleration, but let's explore it anyway.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("https://github.com/jacobtomlinson/gpu-python-tutorial/raw/main/data/pageviews_small.csv", sep=" ")
df.head()

In [None]:
pageviews = cudf.read_csv("https://github.com/jacobtomlinson/gpu-python-tutorial/raw/main/data/pageviews_small.csv", sep=" ")
pageviews.head()

This `pageviews.csv` file contains just over `1M` records of pageview counts from Wikipedia in various languages.

Let's rename the columns and drop the unused `x` column.

In [None]:
pageviews.columns = ['project', 'page', 'requests', 'x']

pageviews = pageviews.drop('x', axis=1)

pageviews

Next, let's count how many English records are in this dataset.

In [None]:
print(pageviews[pageviews.project == 'en'].count())

Then let's perform a groupby where we count all of the pages by language.

In [None]:
grouped_pageviews = pageviews.groupby('project').count().reset_index()
grouped_pageviews

And finally, let's have a look at the results for English, French, Chinese and Polish specifically.

In [None]:
print(grouped_pageviews[grouped_pageviews.project.isin(['en', 'fr', 'zh', 'pl'])])

If you have used `pandas` before, then all of this syntax should be very familiar to you. In the same way that `cupy` implements a large portion of the `numpy` API, `cudf` implements a large portion of the `pandas` API.

The only difference is that all of our filtering and groupby operations happened on the GPU instead of the CPU, which can give much better performance once the data is large enough to benefit from GPU acceleration.
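To make the API overlap concrete, here is a minimal sketch that runs the same filter and groupby with whichever library is available. The `xdf` alias and the five-row `sample` frame are made up for illustration and are not part of the tutorial data; the fallback to `pandas` is only there so the snippet also runs on a machine without a GPU.

```python
# Use cuDF when it is installed, otherwise fall back to pandas.
# The alias `xdf` is hypothetical; it just stands for "either library".
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Tiny made-up sample with the same column names as the tutorial data.
sample = xdf.DataFrame({
    "project": ["en", "en", "fr", "pl", "en"],
    "page": ["Main_Page", "GPU", "Accueil", "Strona", "Python"],
    "requests": [10, 3, 7, 2, 5],
})

# Boolean-mask filtering: keep only the English rows.
english = sample[sample.project == "en"]
n_english = len(english)

# Groupby: count the rows for each project.
counts = sample.groupby("project").count().reset_index()

print(n_english)
print(counts)
```

Nothing below the import depends on which library was picked, which is the point of the shared API.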

### Strings

GPUs have historically been known for numerical work rather than for working with more complex objects. With cuDF, string operations are also accelerated by leveraging cuStrings under the hood.

This means operations like capitalizing strings can be parallelised on the GPU.

In [None]:
pageviews[pageviews.project == 'en'].page.str.upper()

In [None]:
pageviews_en = pageviews[pageviews.project == 'en']
print(pageviews_en.page.str.upper().head())
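As a small illustration of the `.str` accessor, the sketch below applies two string methods to a made-up Series. The `xdf` alias and the `pages` values are assumptions for the example, and the `pandas` fallback only exists so the snippet runs without a GPU; both libraries expose `.str.upper()` and `.str.len()`.

```python
# Use cuDF when it is installed, otherwise fall back to pandas.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Made-up page titles, not taken from the tutorial data.
pages = xdf.Series(["Main_Page", "gpu", "Python"])

# Each method is applied to every string in the column.
upper = pages.str.upper()
lengths = pages.str.len()

print(upper)
print(lengths)
```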

### UDFs

cuDF also has support for user defined functions (UDFs) that can be mapped over a Series or DataFrame in parallel on the GPU.

UDFs can be defined as pure Python functions that take a single value. Numba compiles them at runtime into something that can run on the GPU when we call `.apply()`.

In [None]:
def udf(x):
    if x < 5:
        return 0
    return x

In [None]:
pageviews.requests = pageviews.requests.apply(udf)

In [None]:
pageviews.requests
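For reference, this is what the UDF computes for each element, written as a plain Python loop over a made-up list of request counts. It is only a CPU sketch of the semantics: on the GPU the function is compiled and applied across the column in parallel rather than one value at a time.

```python
def udf(x):
    # Zero out small request counts, keep everything else unchanged.
    if x < 5:
        return 0
    return x

# Hypothetical request counts, not taken from the tutorial data.
requests = [12, 3, 5, 0, 48]

# Element-wise application, one value in and one value out.
result = [udf(x) for x in requests]
print(result)  # [12, 0, 5, 0, 48]
```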

It is also possible to use Numba directly to write kernels that take pointers to an input column and an output column along with additional arguments. The kernel can then use `cuda.grid` the same way we did in chapters 2/3 to get an index to operate on.

We then use `.forall()` to map our kernel over a column.

In [None]:
pageviews['mul_requests'] = 0.0

In [None]:
from numba import cuda

In [None]:
@cuda.jit
def multiply(in_col, out_col, multiplier):
    i = cuda.grid(1)
    if i < in_col.size: # boundary guard
        out_col[i] = in_col[i] * multiplier


In [None]:
multiply.forall(len(pageviews))(pageviews['requests'], pageviews['mul_requests'], 10.0)

In [None]:
print(pageviews.head())
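The boundary guard in the kernel matters because a launch usually starts more threads than there are elements: the thread count is rounded up to a whole number of blocks. The following CPU-only sketch imitates that behaviour in plain Python. The `launch_forall` helper and the fixed `block_size` of 4 are hypothetical stand-ins for illustration; in a real launch the block size is chosen for you and the threads run in parallel.

```python
import math

def multiply(i, in_col, out_col, multiplier):
    # `i` plays the role of cuda.grid(1): one index per simulated thread.
    if i < len(in_col):  # boundary guard, skips the surplus threads
        out_col[i] = in_col[i] * multiplier

def launch_forall(kernel, ntasks, block_size, *args):
    # Round the thread count up to a whole number of blocks.
    blocks = math.ceil(ntasks / block_size)
    total_threads = blocks * block_size
    # A GPU would run these concurrently; here we simply loop.
    for i in range(total_threads):
        kernel(i, *args)
    return total_threads

in_col = [1, 2, 3, 4, 5]
out_col = [0.0] * len(in_col)
threads = launch_forall(multiply, len(in_col), 4, in_col, out_col, 10.0)

print(threads)
print(out_col)
```

With five elements and blocks of four threads, eight simulated threads run and the last three do nothing thanks to the guard. Without it, those threads would index past the end of the column.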

## Rolling windows

cuDF also supports applying kernels over rolling windows. This is effectively a 1D stencil and allows us to perform operations based on our neighbors.

![](images/rolling-windows.png)

In [None]:
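def neighborhood_mean(window):
    # Mean of all values in the current window
    c = 0
    for val in window:
        c += val
    return c / len(window)

In [None]:
# Window of 3 values, centered on each row, requiring at least 1 value
pageviews.requests.rolling(window=3, min_periods=1, center=True).apply(neighborhood_mean)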
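To see what a centered window of size 3 with a minimum of 1 value produces, here is a plain-Python sketch of the same computation on a short made-up list. The `rolling_apply_centered` helper is hypothetical and assumes pandas-style edge handling, where the window is simply truncated at the start and end of the column. It only handles odd window sizes.

```python
def neighborhood_mean(window):
    # Mean of all values in the current window.
    c = 0
    for val in window:
        c += val
    return c / len(window)

def rolling_apply_centered(values, window, min_periods, func):
    # Hypothetical CPU helper; odd window sizes only.
    assert window % 2 == 1
    half = window // 2
    out = []
    for i in range(len(values)):
        # Clip the window to the bounds of the column.
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        # Too few values in the window gives a missing result.
        out.append(func(chunk) if len(chunk) >= min_periods else None)
    return out

values = [1, 2, 3, 4, 5]
means = rolling_apply_centered(values, 3, 1, neighborhood_mean)
print(means)  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

The first and last results average only two values because one neighbor is missing at each edge.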