# Visualize the Population

In this notebook we visualize our UK population data with cuXfilter.

## Objectives

By the time you complete this notebook you will be able to:

- Use cuXfilter to visualize scatterplot data
- Use a cuXfilter widget to filter subsets of the data

## Imports

RAPIDS can be used with a wide array of visualizations, both open source and proprietary. Bokeh, [DataShader](http://datashader.org/), and other open source visualization projects are connected with RAPIDS via the [cuXfilter](https://github.com/rapidsai/cuxfilter) framework.

We import `cudf` as usual, plus `cuxfilter`, which we will be using to visualize the UK population data.

As `cuxfilter` loads, we will see empty rows appear underneath this cell; that is expected behavior.

In [1]:
import cudf

import cuxfilter as cxf

## Load Data

Here we load the grid coordinates and county columns from the England/Wales population data into a cuDF dataframe. This data was already transformed by your work in the first section of the workshop.

In [2]:
gdf = cudf.read_csv('./data/pop_2-02.csv', usecols=['easting', 'northing', 'county'])
print(gdf.dtypes)
gdf.shape

county       object
northing    float64
easting     float64
dtype: object


(58479894, 3)

In [3]:
gdf.head()

|   | county | northing | easting |
|---|--------|----------|---------|
| 0 | Darlington | 515491.5313 | 430772.1875 |
| 1 | Darlington | 503572.4688 | 434685.875 |
| 2 | Darlington | 517903.6563 | 432565.5313 |
| 3 | Darlington | 517059.9063 | 427660.625 |
| 4 | Darlington | 509228.6875 | 425527.7813 |


## Factorize Counties

cuXfilter widgets enable us to select entries from an integer column. To make that column, we can use cuDF's `factorize` method to convert a string column into an integer column, while keeping a map from the new integers to the old strings.

`factorize` produces the integer column and corresponding map as output, and we overwrite the old string column with the new integer column here. Alternatively, we could have appended the integer column as a new column, but the integer column is much more memory-efficient than the equivalent string column, and so we will often prefer to overwrite.
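To get a rough sense of the memory difference, the sketch below estimates the size of each representation for a column of this length. It assumes 4-byte integer codes and an Arrow-style string layout (one character buffer plus a 4-byte offset per row), and it ignores null masks. The average name length of 10 bytes is a hypothetical value chosen only for illustration, and the helper names are not part of cuDF.

```python
def code_column_bytes(n_rows, code_size=4):
    """Approximate size of an integer code column (assumes fixed-width codes)."""
    return n_rows * code_size


def string_column_bytes(n_rows, avg_len, offset_size=4):
    """Approximate size of a string column: character data plus one offset per row (and one extra)."""
    chars = n_rows * avg_len
    offsets = (n_rows + 1) * offset_size
    return chars + offsets


n_rows = 58_479_894   # row count reported by gdf.shape above
avg_name_len = 10     # hypothetical average county-name length in bytes

codes = code_column_bytes(n_rows)
strings = string_column_bytes(n_rows, avg_name_len)

print(f"integer codes: {codes / 1e6:.0f} MB")
print(f"string column: {strings / 1e6:.0f} MB")
```

Under these assumptions the string column can never be smaller than the code column, because its offsets alone take as much space as the codes.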

In [4]:
gdf['county'], county_names = gdf['county'].factorize()
gdf.head()

|   | county | northing | easting |
|---|--------|----------|---------|
| 0 | 37 | 515491.5313 | 430772.1875 |
| 1 | 37 | 503572.4688 | 434685.875 |
| 2 | 37 | 517903.6563 | 432565.5313 |
| 3 | 37 | 517059.9063 | 427660.625 |
| 4 | 37 | 509228.6875 | 425527.7813 |
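
The idea behind `factorize` can be sketched in plain Python. The helper below is illustrative only and is not cuDF's implementation; it assumes the codes follow the sorted order of the unique values, which may differ from the ordering cuDF uses. The sample county list is made up for the example.

```python
def factorize_sketch(values):
    """Return (codes, uniques) so that uniques[codes[i]] == values[i]."""
    uniques = sorted(set(values))                       # assumed ordering: sorted unique values
    code_of = {name: code for code, name in enumerate(uniques)}
    codes = [code_of[v] for v in values]                # one integer per original entry
    return codes, uniques


sample = ["Darlington", "Cumbria", "Darlington", "Bristol"]  # hypothetical sample data
codes, names = factorize_sketch(sample)

# Decoding the codes recovers the original strings.
decoded = [names[c] for c in codes]
print(codes, names)
```

The important property is the round trip: indexing the unique names by the codes gives back the original column.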


`factorize` assigns the integers 0, 1, 2, ... to the unique strings in the order they appear in `county_names`, so position `i` in `county_names` holds the name for integer `i`. The cuXfilter widget requires a dictionary that maps the integers to the strings, so we copy the names to host memory with `values_host`, which makes them iterable in Python, and pair each name with its position using `enumerate`. Since this should be a small list (to be a useful widget), we aren't worried about the copy taking significant time; alternatively, we could use `to_arrow()` followed by `to_pylist()` to convert the names to standard Python objects first.

In [6]:
county_map = dict(enumerate(county_names.values_host))
county_map[37]

'Darlington'
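
Because the codes are simply positions in the list of names, `enumerate` and an explicit `zip` of positions with names build the same dictionary. A minimal plain-Python sketch, using a short made-up list in place of `county_names`:

```python
# Hypothetical stand-in for the host copy of county_names.
names = ["Bristol", "Cumbria", "Darlington"]

# Two equivalent ways to build the integer -> string map.
map_from_enumerate = dict(enumerate(names))
map_from_zip = dict(zip(range(len(names)), names))

# The reverse map (string -> integer) is handy for looking up a county's code.
code_of = {name: code for code, name in map_from_enumerate.items()}

print(map_from_enumerate[2], code_of["Darlington"])
```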

## Visualize Population Density and Distribution

Using cuXfilter has three main steps:
1. Associate a data source with cuXfilter
2. Define the charts and widgets to use with the data
3. Create and show a dashboard containing those charts and widgets

### Associate a Data Source with cuXfilter

We need to know what our data source's column names will be for the next step, so it is usually helpful to define the data source first. However, note that the data source is not used in the *Define Charts and Widgets* step below--the same charts and widgets can be used with different data sources with the same column names!

In [7]:
cxf_data = cxf.DataFrame.from_dataframe(gdf)

### Define Charts and Widgets

We fix the chart `width` and then use the fact that `easting` and `northing` are both in meters to scale the chart `height` appropriately.

We also see how we will use the `county_map` made in the last step: it lets cuXfilter know how to display the selection widget, which is operating on the integer column behind the scenes.

In [8]:
chart_width = 600
scatter_chart = cxf.charts.datashader.scatter(x='easting', y='northing', 
                                              width=chart_width, 
                                              height=int((gdf['northing'].max() - gdf['northing'].min()) / 
                                                         (gdf['easting'].max() - gdf['easting'].min()) *
                                                          chart_width))

county_widget = cxf.charts.panel_widgets.multi_select('county', label_map=county_map)
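
The height calculation can be checked with a small worked example. The function name and the extents below are hypothetical; only the formula mirrors the cell above.

```python
def scaled_height(x_min, x_max, y_min, y_max, width):
    """Chart height that keeps one meter the same length on both axes."""
    aspect = (y_max - y_min) / (x_max - x_min)   # data height divided by data width
    return int(aspect * width)


# Hypothetical extents: 400 km of easting and 600 km of northing.
height = scaled_height(0, 400_000, 0, 600_000, width=600)
print(height)
```

With a data region 1.5 times taller than it is wide, a 600-pixel-wide chart needs 900 pixels of height to avoid stretching the map.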

### Create and Show the Dashboard

At this point, we list the charts and widgets that we want on the dashboard and pass parameters that determine its appearance.

In [9]:
dash = cxf_data.dashboard(charts=[scatter_chart], sidebar=[county_widget], theme=cxf.themes.dark, data_size_widget=True)

Then, we can view individual charts non-interactively as a preview...

In [10]:
scatter_chart.view()

And finally, we launch the whole dashboard for interactive cross-filtering. To do so, we need the address of the machine we are working on, which you can get by executing the next cell...

In [11]:
%%js
var host = window.location.host;
element.innerText = "'http://"+host+"'";

<IPython.core.display.Javascript object>

Set `my_url` in the next cell to the value just printed, making sure to include the quotes. 

**Note**: due to the cloud environment we are working in, you will need to ignore the "open cuxfilter dashboard" button that will appear, and instead, execute the following cell to generate a working link.

In [None]:
my_url = # TODO: Set this value to the print out of the cell above, including the quotes.
dash.show(my_url + "/lab", port=8789)

... and you can run the next cell to generate a link to the dashboard:

In [None]:
%%js
var host = window.location.host;
var url = 'http://'+host+'/lab/proxy/8789/';
element.innerHTML = '<a style="color:blue;" target="_blank" href='+url+'>Open Dashboard</a>';
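
The link built by the JavaScript above follows a simple pattern: the notebook host, then `/lab/proxy/`, then the port passed to `dash.show`. The Python sketch below reproduces that string construction; the host value is a hypothetical placeholder and `proxy_url` is an illustrative helper, not part of cuXfilter.

```python
def proxy_url(host, port=8789):
    """Build the proxied dashboard URL in the same shape as the JavaScript cell."""
    return f"http://{host}/lab/proxy/{port}/"


# Hypothetical host; in the notebook this comes from window.location.host.
print(proxy_url("example-host:8888"))
```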

Finally, once you are done with the dashboard, run the next cell to end it:

In [None]:
dash.stop()

<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will begin your use of GPU-accelerated machine learning algorithms, using K-means to identify the best locations for supply depots and then visualizing the results.