In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


## Dataframe processing with cuDF

<!--
01234567890123456789012345678901234567890123456789012345678901234567890123456789
-->

[cuDF](https://github.com/rapidsai/cudf) provides a drop-in replacement for the
[pandas](https://pandas.pydata.org/) data analysis package.

cuDF makes the compute resources in GPGPUs accessible when manipulating
time-series and matrix data. It is built on and compatible with cupy and numpy
and supports both numerical and textual data.

This section assumes that you are already familiar with pandas and aims to
demonstrate how cuDF can be used instead.

A longer presentation covering similar material was given at
[NERSC](https://www.nersc.gov/users/training/events/rapids-hackathon/), and parts
of this material are based on it.

### cuDF as a drop-in replacement for pandas

<!--
01234567890123456789012345678901234567890123456789012345678901234567890123456789
-->

cuDF implements many of pandas' interfaces, and in many cases it can be used as a
drop-in replacement for pandas by simply changing `import pandas` to
`import cudf`. See its [compatibility notes](https://docs.rapids.ai/api/cudf/stable/basics/PandasCompat.html) for the differences.

First, let's create a simple dataframe with two columns named "key" and "value":

In [2]:
import pandas as pd

df = pd.DataFrame()
df['key'] = [0, 0, 2, 2, 3]
df['value'] = [float(i + 10) for i in range(5)]
print(df)

   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


and compute a reduction over one column of data:

In [3]:
df['value'].sum()

60.0

Next, the same code using cuDF:

In [4]:
import cudf

df = cudf.DataFrame()
df['key'] = [0, 0, 2, 2, 3]
df['value'] = [float(i + 10) for i in range(5)]
print(df)

   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


In [5]:
df['value'].sum()

60.0

However, cuDF dataframes are stored in GPU memory and use `cupy` under the hood:

In [6]:
print(type(df))
print(type(df['value'].values))

<class 'cudf.core.dataframe.DataFrame'>
<class 'cupy.core.core.ndarray'>
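
Because the pandas and cuDF cells above differ only in the import, code can be written against whichever library happens to be available. The following is a minimal sketch of that idea, assuming a script that should run with or without a GPU. The alias `xdf` and the function `total_value` are hypothetical names chosen for this illustration.

```python
# Try the GPU library first; fall back to pandas when cuDF is not installed.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf


def total_value(keys, values):
    """Build a two-column dataframe and sum the 'value' column."""
    df = xdf.DataFrame({'key': keys, 'value': values})
    return float(df['value'].sum())


total = total_value([0, 0, 2, 2, 3], [10.0, 11.0, 12.0, 13.0, 14.0])
print(xdf.__name__, total)
```

This only works for the part of the pandas interface that cuDF implements, so anything beyond simple operations should be checked against the compatibility notes.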


### Conversion to pandas or numpy arrays

If needed, cuDF objects can be converted to and from `pandas` and `numpy`.

In [7]:
pandas_df = df.to_pandas()
print(type(pandas_df))
print(pandas_df)

<class 'pandas.core.frame.DataFrame'>
   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


In [8]:
df = cudf.from_pandas(pandas_df)
print(type(df))
print(df)

<class 'cudf.core.dataframe.DataFrame'>
   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


In [9]:
cupy_ndarray = df.values
numpy_ndarray = df.values.get()
print(type(cupy_ndarray))
print(cupy_ndarray)
print(type(numpy_ndarray))
print(numpy_ndarray)

<class 'cupy.core.core.ndarray'>
[[ 0. 10.]
 [ 0. 11.]
 [ 2. 12.]
 [ 2. 13.]
 [ 3. 14.]]
<class 'numpy.ndarray'>
[[ 0. 10.]
 [ 0. 11.]
 [ 2. 12.]
 [ 2. 13.]
 [ 3. 14.]]


### Operating on cuDF data

<!--
01234567890123456789012345678901234567890123456789012345678901234567890123456789
-->
cuDF supports custom, user-supplied operations on data, which use the `numba`
JIT compiler to translate Python code to GPU code.

In [10]:
from numba import cuda
import numpy as np

In [53]:
np.random.seed(42)

df = cudf.DataFrame()
data_len = 1000
df['x'] = np.random.normal(10, 1, data_len)
df['y'] = np.random.normal(10, 1, data_len)
df['z'] = np.random.normal(10, 1, data_len)

df.head(10)

           x          y          z
0  10.496714  11.399355   9.324822
1   9.861736  10.924634   9.855481
2  10.647689  10.059630   9.207580
3  11.523030   9.353063   9.692038
4   9.765847  10.698223   8.106385
5   9.765863  10.393485  10.213294
6  11.579213  10.895193  10.001205
7  10.767435  10.635172   9.182911
8   9.530526  11.049553  10.659246
9  10.542560   9.464765  10.937570


### Pointwise operations using cuDF

In [52]:
def my_pow2(x):
    return x**2

out = df['x'].applymap(my_pow2)

print(out[:10])

0    110.181008
1     97.253831
2    113.373271
3    132.780217
4     95.371760
5     95.372081
6    134.078169
7    115.937651
8     90.830918
9    111.145572
Name: x, dtype: float64


### Row-wise operations using cuDF

<!--
01234567890123456789012345678901234567890123456789012345678901234567890123456789
-->
A simple way to process data is to apply a given function row-wise to the data
and append the result to the dataframe.

In [12]:
def my_sqrt(in1, in2, in3, out, thread, n):
    for i, (x,y,z) in enumerate(zip(in1, in2, in3)):
        out[i] = (x**n+y**n+z**n)**(1./n)
        thread[i] = cuda.threadIdx.x
    
out = df.apply_rows(my_sqrt,
                    incols={'x':'in1', 'y':'in2', 'z':'in3'},
                    outcols=dict(out=np.float64, thread=np.int32),
                    kwargs=dict(n=2))
print(out[:10])

           x          y          z          u        out  thread
0  10.496714  11.399355   9.324822   8.092192  18.085315       0
1   9.861736  10.924634   9.855481   9.139615  17.712480       1
2  10.647689  10.059630   9.207580   9.586394  17.301704       2
3  11.523030   9.353063   9.692038  11.887688  17.725564       3
4   9.765847  10.698223   8.106385  10.556553  16.599314       4
5   9.765863  10.393485  10.213294   8.664518  17.541607       5
6  11.579213  10.895193  10.001205  10.486036  18.783171       6
7  10.767435  10.635172   9.182911   8.452696  17.702271       7
8   9.530526  11.049553  10.659246  11.082691  18.070502       8
9  10.542560   9.464765  10.937570   9.528875  17.898541       9


<!--
01234567890123456789012345678901234567890123456789012345678901234567890123456789
-->
Depending on the number of threads available on the GPU and the size of the
array, a thread may end up processing multiple rows, which may or may not be
contiguous.

In [13]:
print(out[out['thread'] == 0])
print(out[out['thread'] == 511])

             x          y          z          u        out  thread
0    10.496714  11.399355   9.324822   8.092192  18.085315       0
512   9.761052   9.283178  10.548884  11.712040  17.109485       0
            x         y          z         u        out  thread
511  9.949762  9.806341  10.547265  9.123226  17.504481     511
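
The pattern in this output (rows 0 and 512 both report thread 0, row 511 reports thread 511) is what a strided assignment of rows to 512 threads would give. The pure-Python sketch below reproduces that pattern on the CPU. The thread count of 512 is an assumption read off the output, and the helper `rows_for_thread` is a hypothetical name, not part of cuDF or numba. The sketch does not claim to show how `apply_rows` schedules work internally.

```python
def rows_for_thread(thread_id, n_rows, n_threads):
    """Rows visited by one thread in a strided loop: t, t + T, t + 2T, ..."""
    return list(range(thread_id, n_rows, n_threads))


n_rows = 1000     # length of the dataframe used above
n_threads = 512   # assumed thread count, inferred from the printed output

rows_thread_0 = rows_for_thread(0, n_rows, n_threads)
rows_thread_511 = rows_for_thread(511, n_rows, n_threads)
print(rows_thread_0)    # [0, 512]
print(rows_thread_511)  # [511]
```

Note that `cuda.threadIdx.x` is the index of a thread within its block, so the same output would also appear if two blocks of 512 threads each handled one row per thread.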


### Row-wise operations using numba directly

In [47]:
@cuda.jit
def my_sqrt(x, y, z, out, thread, n):
    i = cuda.grid(1)
    if i < x.size: # boundary guard
        out[i] = (x[i]**n+y[i]**n+z[i]**n)**(1./n)
        thread[i] = cuda.threadIdx.x

out = cudf.DataFrame()
out['out'] = np.zeros(len(df['x']))
out['thread'] = np.zeros(len(df['x']), dtype=np.int32)

my_sqrt.forall(len(df['x']))(df['x'], df['y'], df['z'], out['out'], out['thread'], 2.)

print(out[:10])

         out  thread
0  18.085315       0
1  17.712480       1
2  17.301704       2
3  17.725564       3
4  16.599314       4
5  17.541607       5
6  18.783171       6
7  17.702271       7
8  18.070502       8
9  17.898541       9


In [46]:
print(out[out['thread'] == 0])
print(out[out['thread'] == 511])

           out  thread
0    18.085315       0
640  18.257649       0
           out  thread
511  17.504481     511
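
`forall` chooses a launch configuration automatically. When a kernel is instead launched with an explicit `kernel[blocks, threads_per_block]` configuration, the number of blocks is usually obtained by rounding up, which is why the kernel needs its boundary guard. The arithmetic can be sketched in plain Python. The block size of 640 is an assumption suggested by row 640 reporting thread 0 above, and `launch_config` is a hypothetical helper name.

```python
def launch_config(n_rows, threads_per_block):
    """Return (blocks, threads_per_block) with at least one thread per row."""
    blocks = (n_rows + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


n_rows = 1000
blocks, threads = launch_config(n_rows, 640)
total_threads = blocks * threads
idle_threads = total_threads - n_rows  # threads with i >= n_rows, skipped by the guard

# Global index i = blockIdx.x * blockDim.x + threadIdx.x,
# so row i reports threadIdx.x = i % blockDim.x.
thread_of_row_640 = 640 % threads
print(blocks, total_threads, idle_threads, thread_of_row_640)
```

Under these assumptions, two blocks are launched and the surplus threads in the second block do nothing because of the `if i < x.size` check.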


### More complex manipulations

#### Rolling average using apply_chunks

In [59]:
def my_rolling_avg(x, out, thread, window_half_width):
    for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
        if i < window_half_width or i >= len(x) - window_half_width:
            # If there is not enough data to fill the window,
            # take the average to be NaN
            out[i] = np.nan
        else:
            total = 0
            for j in range(i - window_half_width, i + window_half_width + 1):
                total += x[j]
            out[i] = total / (2.*window_half_width+1)
        thread[i] = cuda.threadIdx.x
    
out = df.apply_chunks(my_rolling_avg,
                      incols=['x'],
                      outcols=dict(out=np.float64, thread=np.int32),
                      chunks=6,
                      kwargs=dict(window_half_width=1))
print(out[:10])

           x          y          z        out  thread
0  10.496714  11.399355   9.324822        NaN       0
1   9.861736  10.924634   9.855481  10.335379       1
2  10.647689  10.059630   9.207580  10.677485       2
3  11.523030   9.353063   9.692038  10.645522       3
4   9.765847  10.698223   8.106385  10.351580       4
5   9.765863  10.393485  10.213294        NaN       5
6  11.579213  10.895193  10.001205        NaN       0
7  10.767435  10.635172   9.182911  10.625724       1
8   9.530526  11.049553  10.659246  10.280173       2
9  10.542560   9.464765  10.937570   9.869889       3
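
The `NaN` values at rows 0, 5 and 6 follow from the chunk size of 6: each group of six rows is processed as its own array, so the averaging window cannot reach across a chunk boundary. A plain-Python reference makes this visible. It is a sketch that assumes an integer `chunks` argument means fixed-size chunks, as the output above suggests. The function name `chunked_rolling_avg` and the synthetic input are illustrative only.

```python
import math


def chunked_rolling_avg(x, chunk_size, window_half_width):
    """Centered rolling mean computed independently inside each chunk."""
    out = [math.nan] * len(x)
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        # Only positions with a full window inside the chunk get a value.
        for i in range(window_half_width, len(chunk) - window_half_width):
            window = chunk[i - window_half_width:i + window_half_width + 1]
            out[start + i] = sum(window) / len(window)
    return out


x = [float(i) for i in range(12)]
out = chunked_rolling_avg(x, chunk_size=6, window_half_width=1)
print(out)
```

With twelve rows and a chunk size of 6, the first and last row of each chunk stay `NaN`, matching the positions seen in the GPU output.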


## Counting unique words in text

cuDF supports strings in dataframes, with an interface patterned after pandas. For details, please consult the [documentation](https://docs.rapids.ai/api/cudf/stable/api.html#strings).

In [None]:
df = cudf.DataFrame()
df['string'] = ['Mary', 'had', 'a', 'little', 'lamb']
df['data'] = [68.534, 35.5, 4., 9. , -5174.42050]
df[df['string'].str.contains('^[a-z]*$')]
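
The pattern `^[a-z]*$` matches strings made up solely of lowercase ASCII letters, so the filter is expected to drop `'Mary'`. The same check can be sketched on the CPU with Python's `re` module. This assumes that cuDF treats this simple pattern the same way as `re`, which should be verified for more complex expressions.

```python
import re

words = ['Mary', 'had', 'a', 'little', 'lamb']
pattern = re.compile(r'^[a-z]*$')

# Keep only the words that consist entirely of lowercase letters.
lowercase_only = [w for w in words if pattern.search(w)]
print(lowercase_only)  # ['had', 'a', 'little', 'lamb']
```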

In [None]:
import requests

# get text of Hamlet
url = 'https://www.gutenberg.org/ebooks/1787.txt.utf-8'
content = requests.get(url).content.decode('utf-8')

In [None]:
# strip out project Gutenberg header and footer
lines = content.split('\r\n')
# strip out license etc...
for first,line in enumerate(lines):
    if line == "ACT I. Scene I.":
        break
for last,line in enumerate(lines):
    if line == "THE END":
        break
lines = lines[first:last+1]
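
The two loops above leave `first` or `last` pointing at the final line if a marker is missing, for example when the downloaded text uses different line endings. A slightly more defensive variant, sketched here with a made-up five-line sample instead of the real text, uses `list.index`, which raises `ValueError` in that case. `slice_between` is a hypothetical helper name, and unlike the loops above it searches for the end marker only after the start marker.

```python
def slice_between(lines, start_marker, end_marker):
    """Return the lines from start_marker through end_marker, inclusive."""
    try:
        first = lines.index(start_marker)
        last = lines.index(end_marker, first)
    except ValueError as err:
        raise ValueError(f"marker not found: {err}") from err
    return lines[first:last + 1]


sample = ["license text", "ACT I. Scene I.", "Who's there?", "THE END",
          "more license text"]
body = slice_between(sample, "ACT I. Scene I.", "THE END")
print(body)
```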

In [None]:
# taken from https://gist.github.com/VibhuJawa/df3583ed553ac84b990619d7c49f2a73
# which is used in https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801
def get_word_count(text):
    """
    Return the number of occurrences of each word in the input strings.
    """
    ## Tokenize: convert sentences into a long list of words
    ## Get counts: Groupby each token to get value counts

    df = cudf.DataFrame()
    # tokenize sentences  into a nvstrings instance using nvtext.tokenize()
    # converting it into a single tall data-frame
    df['string'] = text.str.filter_alphanum(' ').str.tokenize()
    # Using Group by to do a value count for string columns
    # This will be natively supported soon
    # See: issue https://github.com/rapidsai/cudf/issues/1951

    df['counts'] = np.dtype('int32').type(0)
    
    res = df.groupby('string').count()
    res = res.reset_index(drop=False).sort_values(by='counts', ascending=False)
    return res
text = cudf.Series(lines)
%time get_word_count(text)

In [None]:
import re
def naive_cpu_get_word_count(text):
    words = {}
    for line in text:
        for w in re.sub("[^a-zA-Z0-9]", " ", line).split():
            try:
                words[w] += 1
            except KeyError:
                words[w] = 1
    return sorted(words.items(), key=lambda w: w[1], reverse=True)
%time naive_cpu_get_word_count(lines)
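
For comparison, the standard library's `collections.Counter` gives a more compact CPU baseline with the same tokenization rule. This is a sketch for checking results on a small sample, not a tuned benchmark, and `counter_word_count` is a hypothetical name.

```python
import re
from collections import Counter


def counter_word_count(lines):
    """Count alphanumeric tokens across all lines, most frequent first."""
    counts = Counter()
    for line in lines:
        # Same rule as above: replace non-alphanumeric characters by spaces.
        counts.update(re.sub("[^a-zA-Z0-9]", " ", line).split())
    return counts.most_common()


sample = ["To be, or not to be", "that is the question"]
result = counter_word_count(sample)
print(result[:3])
```

As in the naive version, counting is case-sensitive, so `To` and `to` are treated as different words.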
