# Grouping and Sorting with cuDF

In this notebook you will be introduced to grouping and sorting with cuDF, with performance comparisons to Pandas, before applying what you have learned in a short data analysis exercise.

## Objectives

By the time you complete this notebook you will be able to:

- Perform GPU-accelerated group and sort operations with cuDF

## Imports

In [2]:
import cudf
import pandas as pd

## Read Data

We once again read the UK population data, returning to timed comparisons with Pandas.

In [4]:
%time gdf = cudf.read_csv('./data/uk_pop.csv', dtype=['float32', 'str', 'str', 'float32', 'float32', 'str'])

CPU times: user 1.15 s, sys: 1.65 s, total: 2.79 s
Wall time: 6.41 s


In [6]:
%time df = pd.read_csv('./data/uk_pop.csv')

CPU times: user 29.1 s, sys: 4.71 s, total: 33.8 s
Wall time: 33.9 s


In [7]:
gdf.dtypes

age       float32
sex        object
county     object
lat       float32
long      float32
name       object
dtype: object

In [8]:
gdf.shape

(58479894, 6)

In [6]:
gdf.head()

   age sex      county        lat       long     name
0  0.0   m  Darlington  54.533638    -1.5244  Francis
1  0.0   m  Darlington  54.426254  -1.465314   Edward
2  0.0   m  Darlington  54.555199  -1.496417    Teddy
3  0.0   m  Darlington  54.547905  -1.572341    Angus
4  0.0   m  Darlington  54.477638  -1.605995  Charlie


## Grouping and Sorting

### Record Grouping

Record grouping with cuDF works the same way as in Pandas.
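
As a minimal sketch of the pattern used in the timed cells, here is the same select, group, aggregate sequence on a tiny made-up frame. The data and the names `toy` and `toy_avg_ages` are assumptions for illustration only. The sketch uses Pandas so that it runs without a GPU; given that grouping works the same way in cuDF, swapping `pd.DataFrame` for `cudf.DataFrame` should give the same result, though that is not verified here.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from uk_pop.csv).
toy = pd.DataFrame({
    "county": ["Barnet", "Barnet", "Bedford", "Bedford", "Bedford"],
    "age": [30.0, 40.0, 20.0, 30.0, 40.0],
})

# Keep only the columns needed, group by the key column, then aggregate.
toy_avg_ages = toy[["county", "age"]].groupby(["county"]).mean()
print(toy_avg_ages)
```

Selecting only `county` and `age` before grouping keeps the aggregation from touching columns that are not needed for the result.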

#### cuDF

In [7]:
%%time
counties = gdf[['county', 'age']].groupby(['county'])
avg_ages = counties.mean()
print(avg_ages[:5])

                                    age
county                                 
Barking And Dagenham          33.056845
Barnet                        37.629770
Barnsley                      41.201061
Bath And North East Somerset  39.822837
Bedford                       39.715300
CPU times: user 1.55 s, sys: 288 ms, total: 1.84 s
Wall time: 2.9 s


#### Pandas

In [8]:
%%time
counties_pd = df[['county', 'age']].groupby(['county'])
avg_ages_pd = counties_pd.mean()
print(avg_ages_pd[:5])

                                    age
county                                 
Barking And Dagenham          33.056845
Barnet                        37.629770
Barnsley                      41.201061
Bath And North East Somerset  39.822837
Bedford                       39.715300
CPU times: user 3.52 s, sys: 888 ms, total: 4.4 s
Wall time: 4.4 s


### Sorting

Sorting is also very similar to Pandas, though cuDF does not support in-place sorting, so assign the sorted result to a variable.
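
The following sketch shows the non-in-place pattern on a small made-up Series. The names `toy_names` and `sorted_names` are hypothetical, and Pandas is used so the example runs without a GPU. The point is that `sort_values` returns a new object and leaves the original order untouched, which is the only style of sorting cuDF offers according to the note above.

```python
import pandas as pd

# Toy data, made up for illustration.
toy_names = pd.Series(["Teddy", "Angus", "Francis", "Charlie"], name="name")

# sort_values returns a new Series; the original keeps its order.
sorted_names = toy_names.sort_values()

print(sorted_names.tolist())
print(toy_names.tolist())
```

The sorted Series keeps the original index labels, which is why the outputs in this section show non-consecutive row numbers next to each name.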

#### cuDF

In [9]:
%time gdf_names = gdf['name'].sort_values()
print(gdf_names[:5]) # yes, "A" is an infrequent but correct given name in the UK, according to census data
print(gdf_names[-5:])

CPU times: user 904 ms, sys: 952 ms, total: 1.86 s
Wall time: 1.85 s
26850     A
154537    A
165578    A
211428    A
236972    A
Name: name, dtype: object
58060377    Zyrah
58289490    Zyrah
58363665    Zyrah
58388727    Zyrah
58394184    Zyrah
Name: name, dtype: object


#### Pandas

This operation takes a while with Pandas (about two minutes in the run recorded below). Feel free to start the next exercise while you wait.

In [10]:
%time df_names = df['name'].sort_values()
print(df_names[:5])
print(df_names[-5:])

CPU times: user 2min 7s, sys: 1.34 s, total: 2min 8s
Wall time: 2min 8s
10811041    A
17931460    A
5060367     A
1842288     A
24866365    A
Name: name, dtype: object
47008072    Zyrah
47953653    Zyrah
31838209    Zyrah
53669567    Zyrah
54557840    Zyrah
Name: name, dtype: object


## Exercise: Youngest Names

For this exercise you will need to use both `groupby` and `sort_values`.

We would like to know which names are associated with the lowest average age and how many people have those names. Using the `mean` and `count` methods on the data grouped by name, identify the three names with the lowest mean age and their counts.

In [11]:
youngest_names = gdf[['name', 'age']].groupby(['name'])

In [18]:
name_mean = youngest_names.mean()
name_cnt  = youngest_names.count()

#### Solution

In [1]:
%load solutions/072_solution_01.py
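
One possible approach, sketched on made-up data, is shown below. This is not the contents of the solution file, only an illustration of combining `groupby`, `mean`, `count`, and `sort_values`. The frame `toy` and the names `by_name`, `youngest`, and `youngest_counts` are assumptions, and Pandas is used so the sketch runs without a GPU; the same calls would be expected to apply to `gdf`, though that is not verified here.

```python
import pandas as pd

# Toy data, made up for illustration (the exercise itself uses gdf).
toy = pd.DataFrame({
    "name": ["Ava", "Ava", "Ben", "Ben", "Ben", "Cy", "Dee", "Dee"],
    "age": [1.0, 3.0, 10.0, 20.0, 30.0, 5.0, 40.0, 60.0],
})

by_name = toy[["name", "age"]].groupby(["name"])
name_mean = by_name.mean()   # mean age per name
name_cnt = by_name.count()   # number of people per name

# Sort names by mean age, keep the three lowest, then look up their counts.
youngest = name_mean.sort_values("age").head(3)
youngest_counts = name_cnt.loc[youngest.index]

print(youngest)
print(youngest_counts)
```

Looking the counts up by the index of the sorted means keeps the two results in the same order, so each name lines up with its count.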

<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
# Passing True asks the kernel to restart after shutting down
app.kernel.do_shutdown(True)

## Next

As part of our larger data science goal for this workshop, we will be working with data reflecting the entire road network of Great Britain. In the next notebook you will be exposed to additional cuDF techniques for transforming columnar data into graph edge data, which we will use to construct a GPU-accelerated graph with the `cuGraph` library.