# Introduction to cuDF

You will begin your accelerated data science training with an introduction to [cuDF](https://github.com/rapidsai/cudf), the RAPIDS API that enables you to create and manipulate GPU-accelerated dataframes. cuDF implements a very similar interface to Pandas so that Python data scientists can use it with very little ramp-up. Throughout this notebook we will provide Pandas counterparts to the cuDF operations you perform, to build your intuition about how much faster cuDF can be, even for seemingly simple operations.

## Objectives

By the time you complete this notebook you will be able to:

- Read and write data to and from disk with cuDF
- Perform basic data exploration and cleaning operations with cuDF

## Imports

Here we import cuDF and CuPy for GPU-accelerated dataframes and math operations, along with their CPU counterparts Pandas and NumPy, whose interfaces they mirror and which we will use for performance comparisons:

In [None]:
import cudf
import cupy as cp

import pandas as pd
import numpy as np

## Reading and Writing Data

Using [cuDF](https://github.com/rapidsai/cudf), we can read data from [a variety of formats](https://rapidsai.github.io/projects/cudf/en/0.10.0/api.html#module-cudf.io.csv), including CSV, JSON, Parquet, Feather, and ORC files, as well as existing Pandas dataframes.

For the first part of this workshop, we will be reading almost 60 million records (corresponding to the entire population of England and Wales) which were synthesized from official UK census data. Here we read this data from a local CSV file directly into GPU memory:

In [None]:
%time gdf = cudf.read_csv('../data/data_pop.csv')
gdf.shape

In [None]:
gdf.drop(gdf.columns[0], axis=1, inplace=True)

In [None]:
gdf.dtypes

Here for comparison we read the same data into a Pandas dataframe:

In [None]:
%time df = pd.read_csv('../data/data_pop.csv')

In [None]:
df.drop(df.columns[0], axis=1, inplace=True)
gdf.shape == df.shape

Because of the sophisticated GPU memory management behind the scenes in cuDF, the first data load into a fresh RAPIDS memory environment is sometimes substantially slower than subsequent loads. The RAPIDS Memory Manager is preparing additional memory to accommodate the array of data science operations that you may be interested in using on the data, rather than allocating and deallocating the memory repeatedly throughout your workflow.

We will be using `gdf` regularly in this workshop to represent a GPU dataframe, as well as `df` for a CPU dataframe when comparing performance.

### Writing to File

cuDF also provides methods for writing data to files. Here we create a new dataframe specifically containing residents of Blackpool county and then write it to `blackpool.csv`, before doing the same with Pandas for comparison.

#### cuDF

In [None]:
%time blackpool_residents = gdf.loc[gdf['county'] == 'Blackpool']
print(f'{blackpool_residents.shape[0]} residents')

In [None]:
%time blackpool_residents.to_csv('blackpool.csv')

#### Pandas

In [None]:
%time blackpool_residents_pd = df.loc[df['county'] == 'Blackpool']

In [None]:
%time blackpool_residents_pd.to_csv('blackpool_pd.csv')
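
One detail worth knowing when writing CSV files: by default, Pandas `to_csv` also writes the index as an unnamed first column, which then reappears as an extra column on the next read. The sketch below illustrates this with Pandas on a made-up three-row frame; the toy data and the in-memory buffers are assumptions for illustration, not the workshop data. Whether your cuDF version accepts the same `index` argument is worth checking in its documentation.

```python
import io

import pandas as pd

# Toy data standing in for the workshop dataframe (values are made up).
toy = pd.DataFrame({"name": ["ann", "bob", "cy"], "age": [31, 45, 27]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # has an extra leading column
print(list(reloaded_clean.columns))       # just the original columns
```

Passing `index=False` when writing avoids having to drop the extra column after reading the file back.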

## Exercise: Initial Data Exploration

Now that we have some data loaded, let's do some initial exploration.

Use the `head` method and the `dtypes` and `columns` attributes on `gdf`, as well as the `value_counts` method on individual `gdf` columns, to orient yourself to the data. If you're interested, use the `%time` magic command to compare performance against the same operations on the Pandas `df`.

You can create additional interactive cells by clicking the `+` button above, or by switching to command mode with `Esc` and using the keyboard shortcuts `a` (for new cell above) and `b` (for new cell below).

If you fill up the GPU memory at any time, don't forget that you can restart the kernel and rerun the cells up to this point quite quickly.

In [None]:
# Begin your initial exploration here. Create more cells as needed.


## Basic Operations with cuDF

Apart from being much faster on large datasets, cuDF looks and feels a lot like Pandas. In this section we highlight a few very simple operations. When performing data operations on cuDF dataframes, column-wise operations are typically much faster than row-wise operations.
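
To make the column-wise versus row-wise distinction concrete, here is a minimal sketch using Pandas on a small made-up frame. The column values and the arithmetic are purely illustrative assumptions. Both approaches give the same numbers, but the first is a single vectorized expression over whole columns, while the second calls a Python function once per row.

```python
import numpy as np
import pandas as pd

# Made-up data; the column names echo the workshop data but the values do not.
toy = pd.DataFrame({
    "age": np.arange(1000, dtype="float64"),
    "lat": np.linspace(50.0, 55.0, 1000),
})

# Column-wise: one vectorized expression applied to entire columns.
col_result = toy["age"] * 2 + toy["lat"]

# Row-wise: a Python function invoked once per row (slow at scale).
row_result = toy.apply(lambda row: row["age"] * 2 + row["lat"], axis=1)

print(np.allclose(col_result, row_result))
```

Wrapping each statement in `%time` on a larger frame should show the gap clearly; the column-wise form is the one that maps well onto a GPU.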

### Converting Data Types

For machine learning later in this workshop, we will sometimes need to convert integer values into floats. Here we convert the `age` column from `int64` to `float32`, comparing performance with Pandas:

#### cuDF

In [None]:
%time gdf['age'] = gdf['age'].astype('float32')

#### Pandas

In [None]:
%time df['age'] = df['age'].astype('float32')
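
As a rough illustration of what this conversion changes, the sketch below uses Pandas and NumPy on a made-up series of 100 ages (an assumption for illustration only). It shows that `float32` halves the storage of `int64`, and that `float32` represents integers exactly only up to 2**24, which is far more than enough for ages.

```python
import numpy as np
import pandas as pd

# Made-up ages 0..99 stored as 64-bit integers.
ages = pd.Series(np.arange(100, dtype="int64"))
ages32 = ages.astype("float32")

print(ages.dtype, ages32.dtype)
# 8 bytes per value versus 4 bytes per value.
print(ages.to_numpy().nbytes, ages32.to_numpy().nbytes)

# float32 has a 24-bit significand: 2**24 is exact, 2**24 + 1 is not.
print(int(np.float32(2**24)), int(np.float32(2**24 + 1)))
```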

### Column-Wise Aggregations

Similarly, column-wise aggregations take advantage of the GPU's architecture and RAPIDS' columnar memory format.

#### cuDF

In [None]:
%time gdf['age'].mean()

#### Pandas

In [None]:
%time df['age'].mean()

### String Operations

Although strings are not a datatype traditionally associated with GPUs, cuDF supports powerful accelerated string operations.

#### cuDF

In [None]:
%time gdf['name'] = gdf['name'].str.title()

In [None]:
gdf.head()

#### Pandas

In [None]:
%time df['name'] = df['name'].str.title()

In [None]:
df.head()

## Data Subsetting with `loc` and `iloc`

cuDF also supports the core data subsetting tools `loc` (label-based locator) and `iloc` (integer-based locator).

### Range Selection

Our data's labels happen to be incrementing integers, so `loc` and `iloc` look similar here. As with Pandas, however, a `loc` slice is label-based and includes both endpoints, whereas an `iloc` slice is position-based and half-open (it omits the final position).

In [None]:
gdf.loc[100:105]

In [None]:
gdf.iloc[100:105]
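
The difference between the two slices above can be checked on a toy Pandas frame. This is a minimal sketch; the `value` column and the shifted index are assumptions made up for illustration. Shifting the labels away from the positions makes it clear that `loc` follows labels while `iloc` follows positions.

```python
import pandas as pd

# Toy frame with default integer labels 0..199.
toy = pd.DataFrame({"value": range(200)})

by_label = toy.loc[100:105]      # labels 100 through 105, inclusive
by_position = toy.iloc[100:105]  # positions 100 through 104
print(len(by_label), len(by_position))

# Shift the labels so they no longer match the positions.
shifted = toy.copy()
shifted.index = shifted.index + 1000

print(shifted.loc[1100:1105]["value"].tolist())  # selected by label
print(shifted.iloc[100:105]["value"].tolist())   # selected by position
```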

### `loc` with Boolean Selection

We can use `loc` with boolean selections:

#### cuDF

In [None]:
# as of version 0.10, the startswith method returns a list, so we convert it back to a Series for efficiency
# in a future version, that method and other string methods will return a Series when appropriate
%time e_names = gdf.loc[cudf.Series(gdf['name'].str.startswith('E'))]
e_names.head()

#### Pandas

In [None]:
%time e_names_pd = df.loc[df['name'].str.startswith('E')]

### Combining with NumPy Methods

We can combine cuDF methods with NumPy functions. Here we use `np.logical_and` for elementwise boolean selection.

#### cuDF

In [None]:
%time ed_names = gdf.loc[np.logical_and(gdf['name'].str.startswith('E'), gdf['name'].str.endswith('d'))]
ed_names.head()

For better performance, we can use CuPy instead of NumPy, thereby performing the elementwise boolean `logical_and` operation on GPU.

In [None]:
%time ed_names = gdf.loc[cudf.Series(cp.logical_and(cudf.Series(gdf['name'].str.startswith('E')), cudf.Series(gdf['name'].str.endswith('d'))))]
ed_names.head()

#### Pandas

In [None]:
%time ed_names_pd = df.loc[np.logical_and(df['name'].str.startswith('E'), df['name'].str.endswith('d'))]
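
For boolean Series, the `&` operator is an equivalent and more common spelling of `np.logical_and`. The sketch below checks this with Pandas on five made-up names (an assumption for illustration). In cuDF versions where string methods return a Series rather than a list, the same `&` expression should apply to the GPU masks as well, though that is worth verifying against the version you are running.

```python
import numpy as np
import pandas as pd

# Made-up names for illustration.
names = pd.Series(["Edward", "Emma", "David", "Enid", "Ed"])

starts_e = names.str.startswith("E")
ends_d = names.str.endswith("d")

mask_np = np.logical_and(starts_e, ends_d)  # NumPy function
mask_op = starts_e & ends_d                 # elementwise operator

print((mask_np == mask_op).all())
print(names[mask_op].tolist())
```

Note that `&` binds more tightly than comparison operators, so comparisons used as operands need parentheses, as in `(a > 1) & (b < 2)`.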

## Exercise 1: Basic Data Cleaning

For this exercise we ask you to modify the data types of a couple of columns.

### Modify `dtypes`

Examine the `dtypes` of `gdf` and convert any 64-bit data types to their 32-bit counterparts.

## Exercise 2: Counties North of Sunderland

This exercise will require you to use the `loc` indexer and several of the techniques described above. Identify the latitude of the northernmost resident of Sunderland county (the person with the maximum `lat` value), and then determine which counties have any residents north of this resident. Use the `unique` method of a cuDF `Series` to deduplicate the result.

## Next

In the next notebook we will return to fundamental cuDF operations, focusing on data analysis with grouping and sorting.