# cuDF

Now let's move on to some higher-level APIs, starting with [cuDF](https://github.com/rapidsai/cudf). Similar to `pandas`, the `cudf` library is a dataframe package for working with tabular datasets.

Data is loaded onto the GPU and all operations are performed with GPU compute, but the API of `cudf` should feel very familiar to `pandas` users.

In [None]:
import cudf

In this tutorial we have some data stored in `data/`. Most of this data is too small to really benefit from GPU acceleration, but let's explore it anyway.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("https://github.com/jacobtomlinson/gpu-python-tutorial/raw/main/data/pageviews_small.csv", sep=" ")
df.head()

In [None]:
pageviews = cudf.read_csv("https://github.com/jacobtomlinson/gpu-python-tutorial/raw/main/data/pageviews_small.csv", sep=" ")
pageviews

This `pageviews.csv` file contains just over `1M` records of pageview counts from Wikipedia in various languages.

Let's rename the columns and drop the unused `x` column.

In [None]:
pageviews.columns = ['project', 'page', 'requests', 'x']

pageviews = pageviews.drop('x', axis=1)

pageviews

Next let's count how many English records are in this dataset.

In [None]:
print(pageviews[pageviews.project == 'en'].count())
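Note that `count()` on a dataframe reports the number of non-null values in each column, while `len()` reports the number of rows. The sketch below shows the difference on a small made-up table with one missing page title. The toy data is hypothetical, and the fallback to `pandas` is an assumption so the cell can also run on a machine without a GPU.

```python
try:
    import cudf as xdf  # GPU dataframe library used in this tutorial
except Exception:
    import pandas as xdf  # assumption: CPU fallback, same API for this sketch

# Hypothetical miniature of the pageviews table, with one missing page title.
toy = xdf.DataFrame({
    "project": ["en", "en", "fr", "en"],
    "page": ["Main_Page", None, "Accueil", "Python"],
    "requests": [10, 3, 7, 1],
})

english = toy[toy.project == "en"]

per_column = english.count()  # non-null values in each column
n_rows = len(english)         # number of rows, nulls included

print(per_column)
print(n_rows)
```

If you only need the number of matching rows, `len()` is usually the clearer choice.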

Then let's perform a groupby where we count all of the pages by language.

In [None]:
grouped_pageviews = pageviews.groupby('project').count().reset_index()
grouped_pageviews

And finally let's have a look at the results for English, French, Chinese and Polish specifically.

In [None]:
print(grouped_pageviews[grouped_pageviews.project.isin(['en', 'fr', 'zh', 'pl'])])

If you have used `pandas` before, all of this syntax should be very familiar to you. In the same way that `cupy` implements a large portion of the `numpy` API, `cudf` implements a large portion of the `pandas` API.

The main difference is that all of our filtering and groupby operations ran on the GPU instead of the CPU, which can give much better performance on larger datasets.
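To make the API overlap concrete, here is a minimal sketch that runs one function against both libraries. The helper `requests_per_project` and the toy data are hypothetical, and `cudf` is only exercised if it imports successfully on your machine.

```python
import pandas as pd

try:
    import cudf
    frame_libs = [pd, cudf]
except Exception:
    frame_libs = [pd]  # assumption: no usable GPU, so only pandas is exercised


def requests_per_project(xdf):
    """Sum requests per project using whichever dataframe library is passed in."""
    df = xdf.DataFrame({
        "project": ["en", "fr", "en", "pl", "fr", "en"],
        "requests": [10, 7, 3, 2, 1, 5],
    })
    totals = df.groupby("project")["requests"].sum().reset_index()
    # Sort explicitly so the row order does not depend on the library's defaults.
    return totals.sort_values("project").reset_index(drop=True)


results = {}
for lib in frame_libs:
    out = requests_per_project(lib)
    if hasattr(out, "to_pandas"):  # cudf results are copied back to the host
        out = out.to_pandas()
    results[lib.__name__] = dict(zip(out["project"], out["requests"]))

print(results)
```

The function body never mentions which library it is using, which is the point: for this subset of the API the two are interchangeable.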

In [None]:
pageviews[pageviews.project == 'en'].page.str.upper()

In [None]:
pageviews_en = pageviews[pageviews.project == 'en']
print(pageviews_en.page.str.upper().head())

In [None]:
import cupy as cp

In [None]:
pageview_array = pageviews.requests.values
print(type(pageview_array))

In [None]:
def udf(x):
    if x < 5:
        return 0
    return x

In [None]:
# Series.apply replaces the older Series.applymap in recent cudf releases
pageviews.requests = pageviews.requests.apply(udf)
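For a simple threshold like this one, a UDF is not the only option. As a sketch, the same result can be expressed with the built-in `where` method, which keeps values where the condition holds and substitutes `0` elsewhere. The toy series and the `as_list` helper are hypothetical, `Series.apply` is assumed to be available in your cudf version, and the `pandas` fallback is an assumption for machines without a GPU.

```python
try:
    import cudf as xdf
except Exception:
    import pandas as xdf  # assumption: CPU fallback, same API for this sketch


def udf(x):
    if x < 5:
        return 0
    return x


def as_list(series):
    """Copy a series to a plain Python list (via pandas if it lives on the GPU)."""
    if hasattr(series, "to_pandas"):
        series = series.to_pandas()
    return series.tolist()


requests = xdf.Series([3, 0, 7, 12, 4, 5])  # hypothetical request counts

via_udf = as_list(requests.apply(udf))                  # element-wise UDF
via_where = as_list(requests.where(requests >= 5, 0))   # built-in, no UDF

print(via_udf)
print(via_where)
```

Both approaches zero out counts below 5. The UDF route becomes more useful once the logic is too involved for a single built-in call.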

In [None]:
pageviews['mul_requests'] = 0.0

In [None]:
from numba import cuda

In [None]:
@cuda.jit
def multiply(in_col, out_col, multiplier):
    i = cuda.grid(1)
    if i < in_col.size: # boundary guard
        out_col[i] = in_col[i] * multiplier


In [None]:
multiply.forall(len(pageviews))(pageviews['requests'], pageviews['mul_requests'], 10.0)
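To see why the kernel needs its boundary guard, it helps to remember that `forall` launches at least as many threads as there are rows, typically rounded up to a whole number of blocks. The sketch below emulates that on the CPU in plain Python. The `launch_1d` helper, the block size of 4 and the toy data are all hypothetical and are not how Numba schedules work internally.

```python
import math


def launch_1d(kernel, n_tasks, threads_per_block=128):
    """Call kernel(i) once per emulated thread, rounding up to whole blocks."""
    blocks = math.ceil(n_tasks / threads_per_block)
    for block in range(blocks):
        for thread in range(threads_per_block):
            # Same global index that cuda.grid(1) would give this thread.
            kernel(block * threads_per_block + thread)
    return blocks * threads_per_block


def make_multiply(in_col, out_col, multiplier):
    def kernel(i):
        if i < len(in_col):  # boundary guard: surplus threads do nothing
            out_col[i] = in_col[i] * multiplier
    return kernel


in_col = [3, 0, 7, 12, 5]
out_col = [0.0] * len(in_col)

threads_launched = launch_1d(
    make_multiply(in_col, out_col, 10.0), len(in_col), threads_per_block=4
)

print(threads_launched, out_col)
```

Here five rows need two blocks of four threads, so three threads have no row to work on. Without the guard those threads would index past the end of the column.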

In [None]:
print(pageviews.head())