# Grouping and Sorting with cuDF

In this notebook you will be introduced to grouping and sorting with cuDF, with performance comparisons to Pandas, before applying what you have learned in a short data analysis exercise.

## Objectives

By the time you complete this notebook you will be able to:

- Perform GPU-accelerated group and sort operations with cuDF

## Imports

In [None]:
import cudf
import pandas as pd

## Read Data

We once again read the UK population data, returning to timed comparisons with Pandas.

In [None]:
%time gdf = cudf.read_csv('../data/data_pop.csv')

In [None]:
gdf.drop(gdf.columns[0], axis=1, inplace=True)

In [None]:
%time df = pd.read_csv('../data/data_pop.csv')

In [None]:
df.drop(df.columns[0], axis=1, inplace=True)
gdf.shape == df.shape

In [None]:
gdf.dtypes

In [None]:
gdf.shape

In [None]:
gdf.head()

## Grouping and Sorting

### Record Grouping

Record grouping with cuDF works the same way as in Pandas.

#### cuDF

In [None]:
%%time
counties = gdf[['county', 'age']].groupby(['county'])
avg_ages = counties.mean()
print(avg_ages[:5])

#### Pandas

In [None]:
%%time
counties_pd = df[['county', 'age']].groupby(['county'])
avg_ages_pd = counties_pd.mean()
print(avg_ages_pd[:5])

### Sorting

Sorting is also very similar to Pandas, though cuDF does not support in-place sorting.

#### cuDF

In [None]:
%time gdf_names = gdf['name'].sort_values()
print(gdf_names[:5]) # yes, "A" is an infrequent but correct given name in the UK, according to census data
print(gdf_names[-5:])

#### Pandas

This operation takes a while with Pandas. Feel free to start the next exercise while you wait.

In [None]:
%time df_names = df['name'].sort_values()
print(df_names[:5])
print(df_names[-5:])
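
Because `sort_values` returns a new object instead of modifying the original, the usual pattern is to rebind the result to a name. The sketch below shows that pattern on a small hypothetical frame (the `people` data is invented for illustration and is not the workshop dataset). It uses Pandas as a stand-in so it runs without a GPU; the same `sort_values` call, including the `ascending` argument, is assumed to work on a cuDF DataFrame.

```python
import pandas as pd  # stand-in; the same calls are assumed to work with cudf

# Hypothetical toy data, not the workshop dataset
people = pd.DataFrame({
    'name': ['Oliver', 'Amelia', 'Isla', 'George'],
    'age': [34, 27, 5, 61],
})

# sort_values returns a new, sorted object, so rebind the result
# rather than relying on an in-place sort
people = people.sort_values('age', ascending=False)

print(people['name'].tolist())
```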

## Exercise 3: Youngest Names

For this exercise you will need to use both `groupby` and `sort_values`.

We would like to know which names are associated with the lowest average age and how many people have those names. Using the `mean` and `count` methods on the data grouped by name, identify the three names with the lowest mean age and their counts.
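
If you would like a hint, the sketch below shows the general pattern (group, aggregate with `mean` and `count`, then sort and take the first few rows) on an unrelated, hypothetical frame. The `readings` data and its column names are invented for illustration. Pandas is used as a stand-in so the example runs anywhere; the same method names are assumed to be available in cuDF.

```python
import pandas as pd  # stand-in; the same calls are assumed to work with cudf

# Hypothetical toy data, unrelated to the population dataset
readings = pd.DataFrame({
    'station': ['north', 'south', 'north', 'east', 'south', 'north'],
    'temp_c': [4.0, 12.0, 6.0, 9.0, 14.0, 5.0],
})

# Group once, then compute both aggregations from the same grouping
by_station = readings.groupby('station')['temp_c']
summary = pd.DataFrame({
    'mean_temp': by_station.mean(),
    'n_readings': by_station.count(),
})

# Sort ascending by the mean and keep the two lowest groups
coldest = summary.sort_values('mean_temp').head(2)
print(coldest)
```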

# Visualize the Population

- Use Bokeh to visualize the population data

In [None]:
import cupy as cp

from bokeh import plotting as bplt
from bokeh import models as bmdl

## Setup Visualizations

RAPIDS can be used with a wide array of visualization tools, both open source and proprietary. This workshop does not teach any particular visualization library; we simply use the open source [Bokeh](https://bokeh.pydata.org/en/latest/index.html) to illustrate the results of some machine learning algorithms.

As such, please feel free to skim this section, which enables visualizations to be output in this notebook and creates a visualization helper function, `base_plot`, that we will use below.

In [None]:
# Turn on in-Jupyter viz
bplt.output_notebook()

In [None]:
# Helper function for visuals
# Note: plot_width/plot_height are the pre-3.0 Bokeh names; Bokeh 3.x uses width/height
def base_plot(data=None, padding=None,
              tools='pan,wheel_zoom,reset', plot_width=500, plot_height=500,
              x_range=(0, 100), y_range=(0, 100), **plot_args):

    # If we send in two columns of data, we can use them to auto-size the scale
    if data is not None and padding is not None:
        xs, ys = data.iloc[:, 0], data.iloc[:, 1]
        x_range = (xs.min() - padding, xs.max() + padding)
        y_range = (ys.min() - padding, ys.max() + padding)
        
    p = bplt.figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, 
        **plot_args)

    p.axis.visible = True
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None

    p.add_tools(bmdl.BoxZoomTool(match_aspect=True))

    return p

## Subset Data for Visualizations

Bokeh, [DataShader](http://datashader.org/), and other open source visualization projects are being connected with RAPIDS via the [cuXfilter](https://github.com/rapidsai/cuxfilter) framework. For simplicity in this workshop, we will use the standard CPU Bokeh. CPU performance can be a real bottleneck to our workflows, so the typical approach is to select subsets of our data to visualize, especially during initial iterations.

Here we draw a random sample of 100,000 rows from our data (sampled with replacement) and call the `to_pandas` method on that subset, so that we can pass the resulting Pandas DataFrame to Bokeh:

In [None]:
plot_subset = gdf.take(cp.random.choice(gdf.shape[0], size=100000, replace=True))
df_subset = plot_subset.to_pandas()

In [None]:
df_subset.head()

## Visualize Population Density and Distribution

To avoid overplotting, we lower the `alpha` value and reduce the `size` of each point.

In [None]:
options = dict(line_color=None, 
               fill_color='blue', 
               size=2,    # Reduce size to make points more distinct
               alpha=.05) # Reduce alpha to avoid overplotting
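
To see why a small `alpha` helps, it can be useful to work out how quickly overlapping points saturate. The sketch below assumes standard "source-over" alpha compositing, in which `n` stacked points of opacity `alpha` have a combined opacity of `1 - (1 - alpha) ** n`; whether a given renderer behaves exactly this way is an assumption, and `stacked_opacity` is a hypothetical helper, not part of Bokeh.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n overlapping points, each with opacity alpha,
    assuming standard source-over compositing."""
    return 1 - (1 - alpha) ** n

# With alpha=0.05, sparse regions stay faint while dense regions approach full opacity
for n in (1, 10, 50, 100):
    print(n, round(stacked_opacity(0.05, n), 3))
```

Under this assumption, a single point is barely visible, while regions where many points overlap become nearly opaque, so colour intensity acts as a rough proxy for density.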

We give the `easting` and `northing` columns of our data subset to our visualization helper function...

In [None]:
p = base_plot(data=df_subset[['easting', 'northing']], 
              padding=10000)

...plot circles for each datapoint...

In [None]:
p.circle(x=list(df_subset['easting']), y=list(df_subset['northing']), **options)

...and display.

In [None]:
bplt.show(p)

## Next

In the next notebook, you will begin using GPU-accelerated machine learning algorithms, applying K-means to identify the best locations for supply depots and then visualizing the results.