<img src="https://github.com/jupytercon/2020-exactlyallan/raw/master/images/RAPIDS-header-graphic.png">

<p style="background-color:red;color:white;font:2em"> Note: It is advised to close the previous notebook kernels and clear GPU memory in order to avoid GPU out-of-memory errors.</p>

# Data Inspection and Validation
***Loading data, vetting its quality, and understanding its shape***

## Overview
This intro notebook will use cuDF and hvplot (with bokeh charts) to load a public bike share dataset and get a general sense of what it contains, then run some cursory visualization to validate that the data is free of issues.

### cuDF and hvplot
- [cuDF](https://docs.rapids.ai/api/cudf/stable/), the core of RAPIDS, is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data in a pandas-like API.
- [hvplot](https://hvplot.holoviz.org/) is a high-level plotting API for the PyData ecosystem built on [HoloViews](http://holoviews.org/).

## Imports
Let's first make sure the necessary libraries are installed and can be imported.

In [None]:
import cudf
import hvplot.cudf
import cupy
import pandas as pd

## Data Size and GPU Speedups
This tutorial's dataset is about `2.1GB` unzipped and contains about `9 million rows`. While this will do for a tutorial, it's still too small to get a sense of the speedup possible with GPU acceleration. We've created a larger `300 million row` [2010 Census Visualization](https://github.com/rapidsai/plotly-dash-rapids-census-demo) application, available through the RAPIDS [GitHub page](https://github.com/rapidsai), as another demo.

## Loading Data into cuDF
We need to download and extract the sample data we will use for this tutorial. This notebook uses the Kaggle [Chicago Divvy Bicycle Sharing Data](https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data) dataset. Once the `data.csv` file is downloaded and extracted, point the path below at its location *(make sure `DATA_DIR` is the directory you saved the data file to)*:


In [None]:
from pathlib import Path

DATA_DIR = Path("../data")

In [None]:
# Download and Extract the dataset
! wget -N -P {DATA_DIR} https://rapidsai-data.s3.us-east-2.amazonaws.com/viz-data/data.tar.xz
! tar -xf {DATA_DIR}/data.tar.xz -C {DATA_DIR}

In [None]:
FILENAME = Path("data.csv")

We now read the `.csv` file into a cuDF DataFrame on the GPU (which behaves much like a pandas DataFrame).

In [None]:
df = cudf.read_csv(DATA_DIR / FILENAME)

## Mapping out the Data Shape
cuDF supports many of the standard pandas operations for a quick look at the data, e.g. to see the total number of rows...

In [None]:
len(df)

Or to inspect the column headers and first few rows...

In [None]:
df.head()

Or to see the full list of columns...

In [None]:
df.columns

Or see how many trips were made by subscribers.

In [None]:
df.groupby("usertype").size()
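
For intuition, `groupby(...).size()` counts the rows that share each distinct key. The following is a minimal plain-Python sketch of that idea, not the cuDF implementation; the sample values are made up for illustration and are not read from the dataset.

```python
from collections import Counter

# Hypothetical sample values; the real data lives in df["usertype"].
usertypes = ["Subscriber", "Customer", "Subscriber", "Subscriber", "Customer"]

# Counting occurrences per key is what groupby(...).size() computes.
counts = Counter(usertypes)
print(counts)
```

The GPU version produces the same kind of key-to-count mapping, just computed in parallel over the whole column.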

## Improving Data Utility
Now that we have a basic idea of how big our dataset is and what it contains, we want to start making the data more meaningful. This task can range from removing unnecessary columns, to mapping values to be more human readable, to formatting them so that our tools can understand them.

Having looked at the `df.head()` output above, the first thing we might want is to re-load the data, parsing the start and stop time columns as more usable datetime types:

In [None]:
df = cudf.read_csv(DATA_DIR / FILENAME, parse_dates=['starttime', 'stoptime'])

One thing we will want to do is to look at trips by day of week. Now that we have real datetime columns, we can use `dt.weekday` to add a `weekday` column to our `cudf` Dataframe:

In [None]:
df["weekday"] = df['starttime'].dt.weekday
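
The numbering convention matters when reading the result: like pandas, `dt.weekday` counts from Monday as 0 through Sunday as 6. Python's built-in `datetime.weekday()` follows the same convention, so a small standard-library sketch can illustrate it. The timestamps below are hypothetical and not taken from the dataset.

```python
from datetime import datetime

# Hypothetical trip start times (a Monday morning and a Saturday afternoon).
starts = [datetime(2017, 7, 3, 8, 15), datetime(2017, 7, 8, 14, 30)]

# weekday() returns 0 for Monday through 6 for Sunday.
weekdays = [t.weekday() for t in starts]

# Values 5 and 6 are Saturday and Sunday.
is_weekend = [d >= 5 for d in weekdays]
print(weekdays, is_weekend)
```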

## Inspecting Data Quality and Distribution
Another important step is getting a sense of the quality of the dataset. As these datasets are often larger than is feasible to look through row by row, mapping out the distribution of values early on helps find issues that can derail an analysis later.

Some examples are gaps in data, unexpected or empty value types, infeasible values, or incorrect projections. 
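
Two of these checks, missing values and infeasible values, can be sketched in a few lines. This is a plain-Python illustration on made-up records; the helper names and the feasible range (0 to one day, in minutes) are assumptions chosen for the example, not rules from the dataset.

```python
# Hypothetical records; None marks a missing value.
trips = [
    {"tripduration": 12.5, "temperature": 71.0},
    {"tripduration": None, "temperature": 68.0},
    {"tripduration": -3.0, "temperature": 75.0},
    {"tripduration": 9.0, "temperature": None},
]


def count_missing(rows, column):
    """Number of rows where `column` is missing."""
    return sum(1 for row in rows if row[column] is None)


def count_infeasible(rows, column, low, high):
    """Number of non-missing values that fall outside [low, high]."""
    return sum(
        1
        for row in rows
        if row[column] is not None and not (low <= row[column] <= high)
    )


missing = {col: count_missing(trips, col) for col in ("tripduration", "temperature")}
bad_durations = count_infeasible(trips, "tripduration", 0, 24 * 60)
print(missing, bad_durations)
```

On a real DataFrame the same questions are usually answered with null counts and boolean filters, but the logic is the same.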

## Gender and Subscriber Columns
We could do this in a numerical way, such as getting the totals from the 'gender' data column as a table:

In [None]:
mf_counts = df.groupby("gender").size().rename("count").reset_index()
mf_counts

While a table is technically functional, visualizing the values as bars helps to show the scale of the difference faster and more intuitively (hvplot's API makes this very simple):

In [None]:
mf_counts.hvplot.bar("gender","count").opts(title="Total trips by gender")

### A Note on Preattentive Attributes
This subconscious ability to quickly recognize patterns comes from our brain's natural ability to pick out [preattentive attributes](http://daydreamingnumbers.com/blog/preattentive-attributes-example/), such as height, orientation, or color. Imagine 100 values in a table and 100 in a bar chart, and how quickly you would be able to find the smallest and largest values in each.

### Try It Out
Now try using [hvplot's user guide](https://hvplot.holoviz.org/user_guide/Plotting.html) and our examples to create an hvplot chart that shows the distribution of `Subscriber` types:

In [None]:
# code here

The columns above may show some potentially useful disparities, but without supplemental data it would be hard to pose a follow-up question.


## Trip Starts
Instead, another question we might ask is: how many trip starts are there per day of the week? We can group the `cudf` DataFrame and call `hvplot.bar` directly on the result:

In [None]:
day_counts = df.groupby("weekday").size().rename("count").reset_index()
day_counts.hvplot.bar("weekday", "count").opts(title="Trip starts, per Week Day", yformatter="%0.0f")

With 0-4 being weekdays and 5-6 being the weekend, there is a clear drop-off in ridership on the weekends. Let's note that!


## Trips by Duration
Another quick look we can generate is to see the overall distribution of trip durations, this time using `hvplot.hist`:

In [None]:
# We selected an arbitrary 50 bins; try other values to see how the pattern changes
df.hvplot.hist(y="tripduration", bins=50).opts(
    title="Trip Duration Histogram", yformatter="%0.0f"
)

Clearly, most trips are less than 15 minutes long.
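
The number of bins shapes what a histogram reveals, so it helps to see what binning does. Below is a minimal sketch of equal-width binning in plain Python; it is not how hvplot computes its bins internally, and the `histogram` helper and sample durations are hypothetical.

```python
def histogram(values, bins):
    """Count values into `bins` equal-width bins spanning [min, max].

    Assumes the values are not all identical (the bin width would be zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not one past the end.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return counts, edges


# Hypothetical trip durations in minutes, with a long right tail.
durations = [2, 4, 5, 7, 8, 9, 11, 12, 14, 30, 60]
counts, edges = histogram(durations, bins=4)
print(counts, edges)
```

With only a few wide bins the short trips collapse into a single bar; more bins would separate them at the cost of noisier counts.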

`hvplot` also makes it simple to interrogate different dimensions. For example, we can add `groupby="month"` to our call to `hvplot.hist`, and automatically get a slider to see a histogram specific to each month:

In [None]:
df.hvplot.hist(y="tripduration", bins=50, groupby="month").opts(
    title="Trip Duration Histogram by Month", yformatter="%0.0f", width=400
)

By scrubbing between the months we can start to see a pattern of slightly longer trip durations emerge during the summer months.



## Trips vs Temperatures
Let's follow up on this by using `hvplot` to generate a KDE distribution directly from our `cudf` DataFrame of 9 million trips:

In [None]:
df.hvplot.kde(y="temperature").opts(title="Distribution of trip temperatures")

Clearly, most trips occur within a temperature sweet spot of around 65-80 degrees.
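
A kernel density estimate (KDE) smooths a sample into a continuous curve by placing a small bump on every observation and averaging them. The sketch below uses a Gaussian kernel with a hand-picked bandwidth; both choices are assumptions for illustration, and the plotting library selects its own kernel settings. The temperatures are made up.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function that averages one Gaussian bump per sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical trip temperatures, clustered around the low 70s.
temperatures = [55.0, 64.0, 68.0, 70.0, 72.0, 75.0, 78.0, 88.0]
density = gaussian_kde(temperatures, bandwidth=4.0)

# Evaluate on an integer grid and find where the estimated density peaks.
grid = range(40, 101)
peak = max(grid, key=density)
print(peak)
```

A smaller bandwidth follows individual samples more closely, while a larger one gives a smoother but flatter curve.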


The `hvplot.heatmap` method can group in two dimensions and colormap according to aggregations on those groups. Here we see *average* trip duration by year and month: 

In [None]:
df.hvplot.heatmap(x='month', y='year', C='tripduration',
                  reduce_function=cudf.DataFrame.mean, colorbar=True, cmap="Viridis")

So what we saw hinted at with the trip duration slider is much more clearly shown in this literal heatmap *(ba-dom-tss)*. 
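
The aggregation behind such a heatmap is a mean per (year, month) cell. Here is a minimal plain-Python sketch of that two-dimensional grouping, using made-up rows rather than the real data.

```python
from collections import defaultdict

# Hypothetical (year, month, tripduration) rows.
rows = [
    (2016, 1, 8.0),
    (2016, 1, 10.0),
    (2016, 7, 14.0),
    (2016, 7, 18.0),
    (2017, 7, 15.0),
]

# Accumulate a running [sum, count] for every (year, month) cell.
totals = defaultdict(lambda: [0.0, 0])
for year, month, duration in rows:
    cell = totals[(year, month)]
    cell[0] += duration
    cell[1] += 1

# The color of each heatmap cell corresponds to this mean.
mean_duration = {key: total / n for key, (total, n) in totals.items()}
print(mean_duration)
```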



## Trip Geography
Temperature and months aside, we might also want to bin the data geographically to check for anomalies. The `hvplot.hexbin` method can show the counts for trip starts overlaid on a tile map:

In [None]:
df.hvplot.hexbin(x='longitude_start', y='latitude_start', geo=True, tiles="OSM").opts(width=600, height=600)

Interestingly, there seems to be a strong concentration of trips in a core area that radiates outwards. Let's take note of that.

The location of the data compared to a current system map also seems to show that everything is where it should be, without any extraneous data points or off map projections:

<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master/images/DivvyBikesStation_ map.png" />
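
A visual comparison like this can be complemented by a numeric bounding-box test that flags coordinates far from the expected area, including swapped latitude/longitude pairs. The sketch below is illustrative only: the bounds are a rough box around the Chicago area chosen for this example, and the sample points are made up.

```python
# Rough, illustrative bounds for the Chicago area (an assumption, not from the dataset).
LAT_RANGE = (41.6, 42.1)
LON_RANGE = (-88.0, -87.5)


def outside_bounds(points, lat_range=LAT_RANGE, lon_range=LON_RANGE):
    """Return the (lat, lon) points that fall outside the bounding box."""
    return [
        (lat, lon)
        for lat, lon in points
        if not (
            lat_range[0] <= lat <= lat_range[1]
            and lon_range[0] <= lon <= lon_range[1]
        )
    ]


# Two plausible points, a (0, 0) placeholder, and a swapped lat/lon pair.
points = [(41.88, -87.63), (41.95, -87.65), (0.0, 0.0), (-87.63, 41.88)]
suspects = outside_bounds(points)
print(suspects)
```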

## Data Cleanup
Based on our inspection, this dataset is uncommonly well formatted and of high quality. But a little cleanup and formatting aids will make some things simpler in future notebooks. 

One thing that is missing is a list of just the station IDs and their coordinates. Let's generate that and save it for later. First, let's group by the unique "from" and "to" station ID values, so that we can later take a representative from each group:

In [None]:
from_ids = df.groupby("from_station_id")
to_ids = df.groupby("to_station_id")

It's possible (but unlikely) that a particular station is only a sink or source for trips. For good measure, let's make sure the group keys are identical:

In [None]:
all(from_ids.size().index.values == to_ids.size().index.values)
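
An element-wise comparison assumes both key arrays have the same length and order. As a hedged alternative, a set comparison sidesteps both assumptions and also reports which IDs differ. The sketch below uses small made-up ID lists in plain Python.

```python
# Hypothetical station IDs seen as trip origins and as trip destinations.
from_keys = [2, 3, 5, 7]
to_keys = [7, 5, 3, 2, 11]

# Set differences ignore ordering and tolerate different lengths.
only_from = sorted(set(from_keys) - set(to_keys))
only_to = sorted(set(to_keys) - set(from_keys))
identical = not only_from and not only_to
print(identical, only_from, only_to)
```

Here station 11 only ever appears as a destination, which an element-wise check could not report cleanly.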

Each group has items for a single station, which all have the same lat/lon. So let's make a new DataFrame by taking a representative from each group, then rename some columns:

In [None]:
# nth(0) takes the first row of each group, so single-trip stations are kept too
stations = from_ids.nth(0).to_pandas()
stations.index.name = "station_id"
stations.rename(columns={"latitude_start": "lat", "longitude_start": "lon"}, inplace=True)
stations = stations.reset_index().filter(["station_id", "lat", "lon"])
stations
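
Taking a representative from each group amounts to keeping the first row seen for every key. A minimal plain-Python sketch of that idea follows; the station IDs and coordinates are hypothetical.

```python
# Hypothetical (station_id, lat, lon) rows, one per trip start.
rows = [
    (2, 41.8765, -87.6205),
    (3, 41.8672, -87.6154),
    (2, 41.8765, -87.6205),
]

# setdefault stores a value only the first time a key is seen,
# so each station keeps the coordinates of its first row.
station_coords = {}
for station_id, lat, lon in rows:
    station_coords.setdefault(station_id, (lat, lon))

print(station_coords)
```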

Finally, write the results to "stations.csv" in our data directory:

In [None]:
stations.to_csv(DATA_DIR / "stations.csv", index=False)
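
One quick way to confirm that a file was written as intended is to read it straight back. The sketch below does this round trip with the standard-library `csv` module in a temporary directory, so it does not touch the real `stations.csv`; the two sample rows are made up.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical station rows with the same column names as the table above.
sample_stations = [
    {"station_id": 2, "lat": 41.8765, "lon": -87.6205},
    {"station_id": 3, "lat": 41.8672, "lon": -87.6154},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stations.csv"

    # Write a header row followed by one line per station.
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["station_id", "lat", "lon"])
        writer.writeheader()
        writer.writerows(sample_stations)

    # Read the file back; csv returns every value as a string.
    with path.open(newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded)
```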

## Summary of the Data
Overall this is an interesting and useful dataset. Our preliminary vetting found no issues with quality and already started to hint at areas to investigate:

- Weekday vs Weekend trip counts
- Bike trips vs weather correlation 
- Core vs Outward trip concentrations 

We will follow up with these findings in our next notebook.