# Introduction to cuDF

You will begin your accelerated data science training with an introduction to [cuDF](https://github.com/rapidsai/cudf), the RAPIDS API that enables you to create and manipulate GPU-accelerated dataframes. cuDF implements a very similar interface to Pandas so that Python data scientists can use it with very little ramp up. Throughout this notebook we will provide Pandas counterparts to the cuDF operations you perform to build your intuition about how much faster cuDF can be, even for seemingly simple operations.

## Objectives

By the time you complete this notebook you will be able to:

- Read and write data to and from disk with cuDF
- Perform basic data exploration and cleaning operations with cuDF

## Imports

Here we import cuDF and CuPy for GPU-accelerated dataframes and math operations, plus the CPU libraries Pandas and NumPy on which they are based and which we will use for performance comparisons:

In [1]:
import cudf
import cupy as cp

import pandas as pd
import numpy as np

## Reading and Writing Data

Using [cuDF](https://github.com/rapidsai/cudf), the RAPIDS API providing a GPU-accelerated dataframe, we can read data from [a variety of formats](https://rapidsai.github.io/projects/cudf/en/0.10.0/api.html#module-cudf.io.csv), including csv, json, parquet, feather, orc, and Pandas dataframes, among others.

For the first part of this workshop, we will be reading almost 60 million records (corresponding to the entire population of England and Wales), which were synthesized from official UK census data. Here we read this data from a local csv file directly into GPU memory:

In [20]:
%time gdf = cudf.read_csv('./data/pop_1-03.csv')
gdf.shape

CPU times: user 540 ms, sys: 316 ms, total: 856 ms
Wall time: 857 ms


(58479894, 6)

In [19]:
gdf.dtypes

age         int64
sex        object
county     object
lat       float64
long      float64
name       object
dtype: object

Here for comparison we read the same data into a Pandas dataframe:

In [5]:
%time df = pd.read_csv('./data/pop_1-03.csv')
gdf.shape == df.shape

CPU times: user 29.9 s, sys: 4.51 s, total: 34.4 s
Wall time: 34.4 s


True

Because of the sophisticated GPU memory management behind the scenes in cuDF, the first data load into a fresh RAPIDS memory environment is sometimes substantially slower than subsequent loads. The RAPIDS Memory Manager is preparing additional memory to accommodate the array of data science operations that you may be interested in using on the data, rather than allocating and deallocating the memory repeatedly throughout your workflow.

We will be using `gdf` regularly in this workshop to represent a GPU dataframe, as well as `df` for a CPU dataframe when comparing performance.
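
The `%time` magic used in the cells above is only available in IPython and Jupyter. As a minimal sketch, assuming you want similar wall-clock measurements in a plain Python script, a small context manager around `time.perf_counter` does the job. The name `wall_timer` and the stand-in workload are hypothetical and not part of the workshop code.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.1f} ms")


# Stand-in workload; in this notebook it would be a call such as
# cudf.read_csv(...) or pd.read_csv(...).
with wall_timer("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

Unlike `%time`, this sketch reports wall time only, not the user and system CPU times.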

### Writing to File

cuDF also provides methods for writing data to files. Here we create a new dataframe specifically containing residents of Blackpool county and then write it to `blackpool.csv`, before doing the same with Pandas for comparison.

#### cuDF

In [6]:
%time blackpool_residents = gdf.loc[gdf['county'] == 'BLACKPOOL']
print(f'{blackpool_residents.shape[0]} residents')

CPU times: user 364 ms, sys: 24 ms, total: 388 ms
Wall time: 1.46 s
139305 residents


In [7]:
%time blackpool_residents.to_csv('blackpool.csv')

CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 19.1 ms


#### Pandas

In [8]:
%time blackpool_residents_pd = df.loc[df['county'] == 'BLACKPOOL']

CPU times: user 3.18 s, sys: 232 ms, total: 3.41 s
Wall time: 3.39 s


In [9]:
%time blackpool_residents_pd.to_csv('blackpool_pd.csv')

CPU times: user 656 ms, sys: 12 ms, total: 668 ms
Wall time: 666 ms
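
By default `to_csv` also writes the dataframe's index, so a file written as above gains an extra leading column when it is read back. The sketch below, on a tiny made-up dataframe, compares the default with `index=False`. It assumes cuDF mirrors the Pandas `to_csv` signature here and falls back to Pandas when cuDF is not installed. The alias `xdf` and the toy data are illustrative only.

```python
import os
import tempfile

try:
    import cudf as xdf  # GPU dataframes, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback exposing the same calls

# Toy stand-in for the census data
residents = xdf.DataFrame({
    "county": ["BLACKPOOL", "DARLINGTON", "BLACKPOOL"],
    "age": [31, 45, 7],
})
blackpool = residents.loc[residents["county"] == "BLACKPOOL"]

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    blackpool.to_csv(with_index)                   # default: index is written too
    blackpool.to_csv(without_index, index=False)   # data columns only

    cols_with_index = list(xdf.read_csv(with_index).columns)
    cols_without_index = list(xdf.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```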


## Exercise: Initial Data Exploration

Now that we have some data loaded, let's do some initial exploration.

Use the `head` method and the `dtypes` and `columns` attributes of `gdf`, as well as the `value_counts` method on individual `gdf` columns, to orient yourself to the data. If you're interested, use the `%time` magic command to compare performance against the same operations on the Pandas `df`.

You can create additional interactive cells by clicking the `+` button above, or by switching to command mode with `Esc` and using the keyboard shortcuts `a` (for new cell above) and `b` (for new cell below).

If you fill up the GPU memory at any time, don't forget that you can restart the kernel and rerun the cells up to this point quite quickly.
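
As a rough illustration of the calls named above, here is a sketch on a tiny made-up dataframe rather than on `gdf`. It falls back to Pandas when cuDF is not installed. The alias `xdf` and the toy values are assumptions for illustration.

```python
try:
    import cudf as xdf  # GPU dataframes, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback exposing the same calls

toy = xdf.DataFrame({
    "sex": ["m", "f", "f", "m", "f"],
    "county": ["DARLINGTON", "NEWPORT", "NEWPORT", "BLACKPOOL", "DARLINGTON"],
})

print(toy.head(2))        # first two rows
print(toy.dtypes)         # one dtype per column
print(list(toy.columns))  # column labels

# Frequency of each distinct value in a single column
sex_counts = toy["sex"].value_counts()
print(sex_counts)
```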

In [14]:
# Begin your initial exploration here. Create more cells as needed.
gdf.head()

   age sex      county        lat      long     name
0    0   m  DARLINGTON  54.533644 -1.524401  FRANCIS
1    0   m  DARLINGTON  54.426256 -1.465314   EDWARD
2    0   m  DARLINGTON  54.555200 -1.496417    TEDDY
3    0   m  DARLINGTON  54.547906 -1.572341    ANGUS
4    0   m  DARLINGTON  54.477639 -1.605995  CHARLIE


In [25]:
res = gdf.loc[gdf["age"] == 35]
print(res)

          age sex      county        lat      long     name
12953107   35   m  DARLINGTON  54.539490 -1.582799    CALEB
12953108   35   m  DARLINGTON  54.528647 -1.565891   FINLEY
12953109   35   m  DARLINGTON  54.554245 -1.619266  TRISTAN
12953110   35   m  DARLINGTON  54.531760 -1.533673     JAKE
12953111   35   m  DARLINGTON  54.554819 -1.538520   HASEEB
...       ...  ..         ...        ...       ...      ...
41736279   35   f     NEWPORT  51.595090 -2.850952  CAITLYN
41736280   35   f     NEWPORT  51.542537 -3.050010  JESSICA
41736281   35   f     NEWPORT  51.583407 -2.928823    EMILY
41736282   35   f     NEWPORT  51.619775 -2.968337     EVIE
41736283   35   f     NEWPORT  51.547167 -2.854412   EVELYN

[775898 rows x 6 columns]


In [41]:
res = gdf.loc[gdf["age"] == 35]
res = res[["sex","name"]]
res.head()

         sex     name
12953107   m    CALEB
12953108   m   FINLEY
12953109   m  TRISTAN
12953110   m     JAKE
12953111   m   HASEEB


## Basic Operations with cuDF

Except for being much more performant with large datasets, cuDF looks and feels a lot like Pandas. In this section we highlight a few very simple operations. When performing data operations on cuDF dataframes, column operations are typically much more performant than row-wise operations.

### Converting Data Types

For machine learning later in this workshop, we will sometimes need to convert integer values into floats. Here we convert the `age` column from `int64` to `float32`, comparing performance with Pandas:

#### cuDF

In [42]:
%time gdf['age'] = gdf['age'].astype('float32')

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3 ms


#### Pandas

In [43]:
%time df['age'] = df['age'].astype('float32')

CPU times: user 152 ms, sys: 212 ms, total: 364 ms
Wall time: 362 ms


### Column-Wise Aggregations

Similarly, column-wise aggregations take advantage of the GPU's architecture and RAPIDS' memory format.

#### cuDF

In [44]:
%time gdf['age'].mean()

CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 6.5 ms


40.12419336806595

#### Pandas

In [45]:
%time df['age'].mean()

CPU times: user 196 ms, sys: 144 ms, total: 340 ms
Wall time: 337 ms


40.12419

### String Operations

Although strings are not a datatype traditionally associated with GPUs, cuDF supports powerful accelerated string operations.

#### cuDF

In [46]:
%time gdf['name'] = gdf['name'].str.title()

CPU times: user 12 ms, sys: 28 ms, total: 40 ms
Wall time: 41.8 ms


In [47]:
gdf.head()

   age sex      county        lat      long     name
0  0.0   m  DARLINGTON  54.533644 -1.524401  Francis
1  0.0   m  DARLINGTON  54.426256 -1.465314   Edward
2  0.0   m  DARLINGTON  54.555200 -1.496417    Teddy
3  0.0   m  DARLINGTON  54.547906 -1.572341    Angus
4  0.0   m  DARLINGTON  54.477639 -1.605995  Charlie


#### Pandas

In [48]:
%time df['name'] = df['name'].str.title()

CPU times: user 25.1 s, sys: 2.47 s, total: 27.6 s
Wall time: 27.5 s


In [49]:
df.head()

   age sex      county        lat      long     name
0  0.0   m  DARLINGTON  54.533644 -1.524401  Francis
1  0.0   m  DARLINGTON  54.426256 -1.465314   Edward
2  0.0   m  DARLINGTON  54.555200 -1.496417    Teddy
3  0.0   m  DARLINGTON  54.547906 -1.572341    Angus
4  0.0   m  DARLINGTON  54.477639 -1.605995  Charlie


## Data Subsetting with `loc` and `iloc`

cuDF also supports the core data subsetting tools `loc` (label-based locator) and `iloc` (integer-based locator).

### Range Selection

Our data's labels happen to be incrementing integers, so `loc` and `iloc` look similar here. As with Pandas, however, a `loc` slice includes both endpoints, whereas an `iloc` slice is half-open and omits the final value.

In [50]:
gdf.loc[100:105]

     age sex      county        lat      long      name
100  0.0   m  DARLINGTON  54.519527 -1.557723    Samuel
101  0.0   m  DARLINGTON  54.530248 -1.500405     Alden
102  0.0   m  DARLINGTON  54.515970 -1.628573    Samuel
103  0.0   m  DARLINGTON  54.543373 -1.664323  Muhammad
104  0.0   m  DARLINGTON  54.554589 -1.507385     Isaac
105  0.0   m  DARLINGTON  54.487209 -1.541073    Jayden


In [51]:
gdf.iloc[100:105]

     age sex      county        lat      long      name
100  0.0   m  DARLINGTON  54.519527 -1.557723    Samuel
101  0.0   m  DARLINGTON  54.530248 -1.500405     Alden
102  0.0   m  DARLINGTON  54.515970 -1.628573    Samuel
103  0.0   m  DARLINGTON  54.543373 -1.664323  Muhammad
104  0.0   m  DARLINGTON  54.554589 -1.507385     Isaac
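
The difference between labels and positions is easier to see when the index does not start at zero. The sketch below uses a tiny made-up dataframe with labels 10 to 50 and falls back to Pandas when cuDF is not installed. The alias `xdf` and the toy data are illustrative only.

```python
try:
    import cudf as xdf  # GPU dataframes, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback exposing the same calls

toy = xdf.DataFrame(
    {"name": ["Ann", "Bob", "Cy", "Dee", "Eve"]},
    index=[10, 20, 30, 40, 50],
)

by_label = toy.loc[20:40]     # labels 20, 30 and 40: the end label is included
by_position = toy.iloc[1:3]   # positions 1 and 2: the end position is excluded

print(len(by_label), len(by_position))  # 3 2
```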


### `loc` with Boolean Selection

We can use `loc` with boolean selections:

#### cuDF

In [52]:
%time e_names = gdf.loc[gdf['name'].str.startswith('E')]
e_names.head()

CPU times: user 12 ms, sys: 12 ms, total: 24 ms
Wall time: 23 ms


    age sex      county        lat      long    name
1   0.0   m  DARLINGTON  54.426256 -1.465314  Edward
6   0.0   m  DARLINGTON  54.501872 -1.667874  Eamonn
34  0.0   m  DARLINGTON  54.483065 -1.501312   Ethan
45  0.0   m  DARLINGTON  54.640205 -1.558986   Elvin
49  0.0   m  DARLINGTON  54.575450 -1.600592  Edward


#### Pandas

In [53]:
%time e_names_pd = df.loc[df['name'].str.startswith('E')]

CPU times: user 22.5 s, sys: 736 ms, total: 23.2 s
Wall time: 23.2 s


### Combining with NumPy Methods

We can combine cuDF methods with NumPy methods, just like Pandas. Here we use `np.logical_and` for elementwise boolean selection.

#### cuDF

In [54]:
%time ed_names = gdf.loc[np.logical_and(gdf['name'].str.startswith('E'), gdf['name'].str.endswith('d'))]
ed_names.head()

CPU times: user 344 ms, sys: 24 ms, total: 368 ms
Wall time: 367 ms


     age sex      county        lat      long    name
1    0.0   m  DARLINGTON  54.426256 -1.465314  Edward
49   0.0   m  DARLINGTON  54.575450 -1.600592  Edward
106  0.0   m  DARLINGTON  54.488042 -1.640927  Edward
145  0.0   m  DARLINGTON  54.492810 -1.509049  Edward
170  0.0   m  DARLINGTON  54.577920 -1.436109  Edward


For better performance at scale, we can use CuPy instead of NumPy, thereby performing the elementwise boolean `logical_and` operation on the GPU.

In [55]:
%time ed_names = gdf.loc[cp.logical_and(gdf['name'].str.startswith('E'), gdf['name'].str.endswith('d'))]
ed_names.head()

CPU times: user 256 ms, sys: 12 ms, total: 268 ms
Wall time: 266 ms


     age sex      county        lat      long    name
1    0.0   m  DARLINGTON  54.426256 -1.465314  Edward
49   0.0   m  DARLINGTON  54.575450 -1.600592  Edward
106  0.0   m  DARLINGTON  54.488042 -1.640927  Edward
145  0.0   m  DARLINGTON  54.492810 -1.509049  Edward
170  0.0   m  DARLINGTON  54.577920 -1.436109  Edward


#### Pandas

In [56]:
%time ed_names_pd = df.loc[np.logical_and(df['name'].str.startswith('E'), df['name'].str.endswith('d'))]

CPU times: user 34.7 s, sys: 736 ms, total: 35.5 s
Wall time: 35.4 s
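
As an alternative to `logical_and`, boolean Series can also be combined directly with the `&` operator, which both Pandas and cuDF support. The sketch below uses a handful of made-up names and falls back to Pandas when cuDF is not installed. The alias `xdf` and the sample data are illustrative only, and no timing claim is made for this variant.

```python
try:
    import cudf as xdf  # GPU dataframes, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback exposing the same calls

names = xdf.Series(["Edward", "Ethan", "Eamonn", "Fred", "Elrond"])

starts_with_e = names.str.startswith("E")
ends_with_d = names.str.endswith("d")

# Elementwise AND of the two boolean masks, then boolean selection
ed = names[starts_with_e & ends_with_d]

print(len(ed))  # 2 ("Edward" and "Elrond")
```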


## Exercise: Basic Data Cleaning

For this exercise we ask you to perform two simple data cleaning tasks using several of the techniques described above:

1. Modifying the data type of a couple of columns
2. Transforming string data into our desired format

### 1. Modify `dtypes`

Examine the `dtypes` of `gdf` and convert any 64-bit data types to their 32-bit counterparts.

In [63]:
gdf["lat"] = gdf["lat"].astype("float32")
gdf["long"] = gdf["long"].astype("float32")
gdf.dtypes

age       float32
sex        object
county     object
lat       float32
long      float32
name       object
dtype: object

#### Solution

In [65]:
# %load solutions/modify_dtypes
gdf['lat'] = gdf['lat'].astype('float32')
gdf['long'] = gdf['long'].astype('float32')
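
When many columns need narrowing, the target dtypes can be derived from `dtypes` instead of being listed by hand. The following is a minimal sketch on a made-up dataframe, assuming `astype` accepts a column-to-dtype mapping in cuDF as it does in Pandas. It falls back to Pandas when cuDF is not installed. The alias `xdf`, the `narrow` mapping, and the toy data are illustrative only.

```python
try:
    import cudf as xdf  # GPU dataframes, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback exposing the same calls

toy = xdf.DataFrame({
    "age": [0, 35, 70],
    "lat": [54.5, 51.6, 53.8],
    "name": ["Francis", "Emily", "Jake"],
})

# 64-bit dtypes and their 32-bit counterparts. Narrowing integers is only
# safe when every value fits in 32 bits, as the small values here do.
narrow = {"int64": "int32", "float64": "float32"}

targets = {
    col: narrow[str(dtype)]
    for col, dtype in toy.dtypes.items()
    if str(dtype) in narrow
}
toy = toy.astype(targets)

print(toy.dtypes)
```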


### 2. Title Case the Counties

As it stands, all of the counties are UPPERCASE:

In [66]:
gdf['county'].head()

0    DARLINGTON
1    DARLINGTON
2    DARLINGTON
3    DARLINGTON
4    DARLINGTON
Name: county, dtype: object

Convert them to title case as we have already done with the `name` column.

In [68]:
gdf["county"] = gdf["county"].str.title()
gdf.head()

   age sex      county        lat      long     name
0  0.0   m  Darlington  54.533646 -1.524401  Francis
1  0.0   m  Darlington  54.426254 -1.465314   Edward
2  0.0   m  Darlington  54.555199 -1.496417    Teddy
3  0.0   m  Darlington  54.547905 -1.572341    Angus
4  0.0   m  Darlington  54.477638 -1.605994  Charlie


#### Solution

In [70]:
# %load solutions/title_case_counties
gdf['county'] = gdf['county'].str.title()


## Exercise: Counties North of Sunderland

This exercise will require you to use the `loc` indexer and several of the techniques described above. Identify the latitude of the northernmost resident of Sunderland county (the person with the maximum `lat` value), and then determine which counties have any residents north of this resident. Use the `unique` method of a cuDF `Series` to deduplicate the result.

In [99]:
nSud = gdf.loc[gdf["county"] == "Sunderland"]
maxLat = nSud["lat"].max()
north = gdf.loc[gdf["lat"] > maxLat]
res = north["county"].unique()
res.head()

0          County Durham
1                Cumbria
2              Gateshead
3    Newcastle Upon Tyne
4         North Tyneside
Name: county, dtype: object

#### Solution

In [93]:
# %load solutions/counties_north_of_sunderland
sunderland_residents = gdf.loc[gdf['county'] == 'Sunderland']
northmost_sunderland_lat = sunderland_residents['lat'].max()
counties_with_pop_north_of = gdf.loc[gdf['lat'] > northmost_sunderland_lat]['county'].unique()


<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [100]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

## Next

In the next section, you will do some actual data preparation for use in our machine learning models later. As part of your work, we will create custom functions with CuPy, which can be used as a GPU-accelerated drop-in replacement for NumPy with drastic performance benefits.