# Grouping and Sorting with cuDF

This notebook introduces grouping and sorting with cuDF, compares their performance with Pandas, and closes with a short data analysis exercise that combines both operations.

## Objectives

By the time you complete this notebook you will be able to:

- Perform GPU-accelerated group and sort operations with cuDF

## Imports

In [1]:
import cudf
import pandas as pd

## Read Data

We once again read the UK population data, returning to timed comparisons with Pandas.

In [2]:
%time gdf = cudf.read_csv('./data/pop_1-04.csv', dtype=['float32', 'str', 'str', 'float32', 'float32', 'str'])

CPU times: user 4.53 s, sys: 5.51 s, total: 10 s
Wall time: 11.1 s


In [3]:
%time df = pd.read_csv('./data/pop_1-04.csv')

CPU times: user 31 s, sys: 3.84 s, total: 34.9 s
Wall time: 34.9 s


In [31]:
gdf.dtypes

age       float32
sex        object
county     object
lat       float32
long      float32
name       object
dtype: object
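
The `dtype` list passed to `cudf.read_csv` above is given in column order, so each entry lines up with one of the six columns shown by `gdf.dtypes`. As a rough illustration of that mapping, the sketch below spells it out by column name using pandas on a tiny in-memory CSV. The sample rows and the names `csv_text`, `dtypes`, and `sample_df` are made up for this example, and whether your cuDF version accepts the same name-keyed dictionary is an assumption to check against its documentation.

```python
import io
import pandas as pd

# Hypothetical three-row sample with the same six column names as the population file.
csv_text = """age,sex,county,lat,long,name
0,m,Darlington,54.53,-1.52,Francis
1,f,Darlington,54.52,-1.57,Emma
2,m,Cornwall,50.27,-5.05,A
"""

# The same types as the positional list in the cuDF read, keyed by column name.
dtypes = {
    'age': 'float32',
    'sex': 'str',
    'county': 'str',
    'lat': 'float32',
    'long': 'float32',
    'name': 'str',
}

# Read from the in-memory text instead of a file on disk.
sample_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(sample_df.dtypes)
```

Passing the same types to both libraries can also make a timing comparison more even, since pandas otherwise infers wider numeric types on its own.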

In [32]:
gdf.shape

(58479894, 6)

In [None]:
gdf.head()

## Grouping and Sorting

### Record Grouping

Record grouping with cuDF works the same way as in Pandas.

#### cuDF

In [None]:
%%time
counties = gdf[['county', 'age']].groupby(['county'])
avg_ages = counties.mean()
print(avg_ages[:5])

#### Pandas

In [None]:
%%time
counties_pd = df[['county', 'age']].groupby(['county'])
avg_ages_pd = counties_pd.mean()
print(avg_ages_pd[:5])
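
To make the shape of the result concrete, here is a minimal sketch of the same group-then-mean pattern on a toy table. It uses pandas so that it runs without a GPU; the county names, ages, and the `toy` and `avg` variables are hypothetical and not taken from the population data.

```python
import pandas as pd

# Hypothetical toy data: two counties with a handful of ages each.
toy = pd.DataFrame({
    'county': ['Kent', 'Kent', 'Devon', 'Devon', 'Devon'],
    'age': [10.0, 30.0, 20.0, 40.0, 60.0],
})

# groupby builds the groups; the aggregation (mean) produces the result.
avg = toy[['county', 'age']].groupby(['county']).mean()

# The result has one row per county, indexed by the grouping key.
print(avg)
```

The grouping key becomes the index of the result, which is why the later steps in this notebook can look up or join on it directly.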

### Sorting

Sorting is also very similar to Pandas, though cuDF does not support in-place sorting.
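
Because the sort is not done in place, the sorted result has to be bound to a name of its own. The sketch below shows that pattern on a toy pandas Series so it can run without a GPU; the names in the data and the `names` and `sorted_names` variables are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order.
names = pd.Series(['Zoe', 'A', 'Mia', 'Ben'])

# sort_values returns a new, sorted Series; keep it by assigning the result.
sorted_names = names.sort_values()

print(sorted_names.head(2))   # first entries in sorted order
print(names.tolist())         # the original Series keeps its order
```

The sorted Series keeps the original index labels, so the rows can still be traced back to their positions in the unsorted data.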

#### cuDF

In [None]:
%time gdf_names = gdf['name'].sort_values()
print(gdf_names[:5]) # yes, "A" is an infrequent but correct given name in the UK, according to census data
print(gdf_names[-5:])

#### Pandas

This operation takes a while with Pandas. Feel free to start the next exercise while you wait.

In [None]:
%time df_names = df['name'].sort_values()
print(df_names[:5])
print(df_names[-5:])

## Exercise: Youngest Names

For this exercise you will need to use both `groupby` and `sort_values`.

We would like to know which names are associated with the lowest average age and how many people have those names. Using the `mean` and `count` methods on the data grouped by name, identify the three names with the lowest mean age and their counts.

In [33]:
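# Group ages by name, then count the records in each group
name_group = gdf[['name','age']].groupby('name')
count_group = name_group.size()

# Mean age per name, sorted ascending; keep the three lowest
avg_age = name_group.mean()['age'].sort_values()
lowest_mean = avg_age[:3]
# Attach the per-name counts (column 0) to the three lowest mean ages
lowest_mean.to_frame().join(count_group.to_frame())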

                   age    0
Leart        34.911197  259
Luke-Junior  35.313725  255
Nameer       35.479675  246
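
As an alternative to joining two separate results, the mean and the count can be computed in one aggregation and sorted together. The following is a small sketch of that idea on toy data, run with pandas so it needs no GPU. The names, ages, and the `people`, `stats`, and `youngest` variables are hypothetical, and whether the same `agg` call behaves identically in your cuDF version is an assumption to verify.

```python
import pandas as pd

# Hypothetical toy data: three names with different numbers of people.
people = pd.DataFrame({
    'name': ['Ada', 'Ada', 'Bo', 'Bo', 'Bo', 'Cy'],
    'age': [4.0, 6.0, 30.0, 40.0, 50.0, 2.0],
})

# One pass over the groups yields both statistics as named columns.
stats = people.groupby('name')['age'].agg(['mean', 'count'])

# Sort by mean age (ascending) and keep the two youngest names.
youngest = stats.sort_values('mean').head(2)
print(youngest)
```

Named `mean` and `count` columns also avoid the unlabeled `0` column that appears when an unnamed count Series is converted to a frame.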


#### Solution

In [None]:
%load solutions/youngest_names

<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

As part of our larger data science goal for this workshop, we will be working with data reflecting the entire road network of Great Britain. In the next notebook you will be exposed to additional cuDF techniques for transforming columnar data into graph edge data, which we will then use to construct a GPU-accelerated graph with the `cuGraph` library.