# Data Inspection and Validation
**Successfully loading data and preparing it for visualization**

NOTE: The goal is to show how best to format data (DataFrame layout, appropriate size, etc.) to take full advantage of RAPIDS.

## Overview and Requirements
A very short version of the intro notebook, plus a restatement of the requirements.

## Imports
In this section we will show how RAPIDS tools like [cuDF](https://docs.rapids.ai/api/cudf/stable/) can be used with existing Python data visualization tools like [`hvplot`](https://hvplot.holoviz.org/) to load your data sets and get a quick look at them. Let's first make sure the necessary imports are present.

In [None]:
import cudf
import hvplot.cudf
import cupy

## Loading Data
We need to download and extract the sample data we will use for this tutorial. This notebook uses the Kaggle [Chicago Divvy Bicycle Sharing Data](https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data) dataset. Once the `data.csv` file is downloaded and unzipped, point the paths below at the location.

In [None]:
from pathlib import Path

DATA_DIR = Path("../data")
FILENAME = Path("data.csv")
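Before loading, it can help to confirm that the path actually points at a file, since a wrong path is a common failure at this step. The following is a minimal sketch using only the standard library; the helper name `require_file` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return `path` unchanged if it is an existing file, else raise a clear error."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the dataset at {path.resolve()}; "
            "download and unzip it first."
        )
    return path


# Same layout as the cell above: a data directory joined with a file name.
csv_path = Path("../data") / Path("data.csv")
print(csv_path.name)  # data.csv
```

Calling `require_file(DATA_DIR / FILENAME)` before reading would then fail early with a readable message if the download step was skipped.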

We read the CSV file into a cuDF DataFrame:

In [None]:
df = cudf.read_csv(DATA_DIR / FILENAME)
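If GPU memory is tight, loading only the columns that will be used and choosing narrower dtypes can help. The sketch below uses pandas on an in-memory CSV as a CPU stand-in, on the assumption that `cudf.read_csv` accepts the same `usecols` and `dtype` arguments. The sample rows are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Made-up rows that only mimic the shape of a trips file.
csv_text = (
    "trip_id,usertype,tripduration,temperature\n"
    "1,Subscriber,7.5,68.0\n"
    "2,Customer,22.0,71.1\n"
)

# Keep two columns and store the duration as 32-bit floats.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["usertype", "tripduration"],
    dtype={"tripduration": "float32"},
)
print(subset.dtypes)
```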

## Data Shape
cuDF supports many of the standard pandas operations for taking a quick look at the data. For example, to see the number of rows:

In [None]:
len(df)

Or to inspect the first few rows:

In [None]:
df.head()

Or to see the full list of columns:

In [None]:
df.columns

Or see how many trips were made by subscribers:

In [None]:
df.groupby("usertype").size()
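Beyond row counts and column names, the column types and missing-value counts are worth a look before plotting. The sketch below uses pandas on a tiny made-up frame as a CPU stand-in, on the assumption that the same `dtypes` and `isnull().sum()` calls apply to the cuDF DataFrame; the values are illustrative only.

```python
import pandas as pd

# Made-up sample with one missing duration.
sample = pd.DataFrame(
    {
        "tripduration": [7.0, 12.5, None],
        "usertype": ["Subscriber", "Customer", "Subscriber"],
    }
)

# Column types: numeric columns should not show up as strings/objects.
print(sample.dtypes)

# Number of missing values per column.
null_counts = sample.isnull().sum()
print(null_counts)
```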

## Data Utility
Which columns are useful, what do they mean in the real world, are they useful for my problem, and do I need to supplement the data?

Having looked at the `df.head()` output above, the first thing we might want is to re-load the data, parsing the start and end time columns as datetimes:

In [None]:
df = cudf.read_csv(DATA_DIR / FILENAME, parse_dates=["starttime", "stoptime"])

One thing we will want to do is look at trips by day of the week. Now that we have real datetime columns, we can use `dt.weekday` to add a `weekday` column to our `cudf` DataFrame:

In [None]:
df["weekday"] = df['starttime'].dt.weekday
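When reading the resulting column, it helps to know the numbering: `dt.weekday` counts from Monday as 0 through Sunday as 6. The sketch below shows this with pandas as a CPU stand-in, assuming the cuDF accessor follows the same convention; the timestamps are illustrative.

```python
import pandas as pd

# Illustrative timestamps: 2018-01-01 was a Monday, 2018-01-07 a Sunday.
starts = pd.Series(pd.to_datetime(["2018-01-01 08:00", "2018-01-07 17:30"]))

weekday = starts.dt.weekday
print(weekday.tolist())  # [0, 6]
```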

## Inspection
Various visualization techniques can help validate whether this data makes sense (e.g. a map to check geospatial encoding, a line chart for time series).

To get a very quick look at things, we will use [`hvplot`](https://hvplot.holoviz.org/).

A basic question we might want to ask is: how many trip starts are there per day of the week? We can group the `cudf` DataFrame and call `hvplot.bar` directly on the result:

In [None]:
day_counts = df.groupby("weekday").size().rename("count").reset_index()
day_counts.hvplot.bar("weekday", "count").opts(title="Trip starts, per Week Day", yformatter="%0.0f")

Another quick look we can generate is to see the overall distribution of trip durations, this time using `hvplot.hist`:

In [None]:
df.hvplot.hist(y="tripduration", bins=50).opts(
    title="Trip Duration Histogram", yformatter="%0.0f"
)

`hvplot` makes it simple to interrogate different dimensions. For example, we can add `groupby="month"` to our call to `hvplot.hist` and automatically get a slider that shows a histogram specific to each month.

In [None]:
df.hvplot.hist(y="tripduration", bins=50, groupby="month").opts(
    title="Trip Duration Histogram by Month", yformatter="%0.0f", width=600
)

`hvplot` can also generate KDE distributions, and since we are operating on `cudf` DataFrames, it can do so quickly:

In [None]:
df.hvplot.kde(y="temperature").opts(title="Distribution of trip temperatures")

The `hvplot.heatmap` method can group in two dimensions and colormap according to aggregations on those groups. Here we see *average* trip duration by year and month. 

In [None]:
df.hvplot.heatmap(
    x="month", y="year", C="tripduration",
    reduce_function=cudf.DataFrame.mean, colorbar=True, cmap="Viridis",
)
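To sanity-check a heatmap like this, the same aggregation can be computed as a plain table and compared against a few cells. The sketch below uses pandas on made-up rows as a CPU stand-in, assuming the equivalent `groupby(...).mean()` call works on the cuDF DataFrame; the numbers are illustrative only.

```python
import pandas as pd

# Made-up trips: two in January 2016, one in February 2016, one in January 2017.
trips = pd.DataFrame(
    {
        "year": [2016, 2016, 2016, 2017],
        "month": [1, 1, 2, 1],
        "tripduration": [10.0, 20.0, 8.0, 12.0],
    }
)

# Mean duration per (year, month) pair, i.e. one value per heatmap cell.
mean_duration = trips.groupby(["year", "month"])["tripduration"].mean()
print(mean_duration)
```

Each entry of `mean_duration` corresponds to one cell of the heatmap, so a handful of spot checks is usually enough to confirm that the axes and the aggregation are what was intended.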

We might also want to bin the data geographically. The `hvplot.hexbin` method can show the counts of trip starts overlaid on a tile map:

In [None]:
df.hvplot.hexbin(x='longitude_start', y='latitude_start', geo=True, tiles="OSM").opts(width=800, height=800)

## Cleanup

A little cleanup will make some things simpler in future notebooks. 

One thing that is missing is a list of just the station IDs and their coordinates. Let's generate that and save it for later. First, let's group by all the unique "from" and "to" station ID values:

In [None]:
from_ids = df.groupby("from_station_id")
to_ids = df.groupby("to_station_id")

It's possible (but unlikely) that a particular station is only a sink or source for trips. For good measure, let's make sure the group keys are identical.

In [None]:
set(from_ids.groups) == set(to_ids.groups)
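If the comparison returns `False`, set differences show which stations are the odd ones out. The sketch below uses plain Python sets with hypothetical station IDs standing in for the keys of `from_ids.groups` and `to_ids.groups`.

```python
# Hypothetical station ids, standing in for the two sets of group keys.
from_keys = {2, 3, 5, 7}
to_keys = {2, 3, 5, 11}

source_only = from_keys - to_keys  # stations that only start trips
sink_only = to_keys - from_keys    # stations that only end trips

print(sorted(source_only), sorted(sink_only))  # [7] [11]
```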

Each group has items for a single station, which all have the same lat/lon. So let's make a new DataFrame by taking a representative from each group, then rename some columns

In [None]:
# nth(0) takes the first row of each group, so stations with a single trip are kept
stations = from_ids.nth(0).to_pandas()
stations.index.name = "station_id"
stations.rename(columns={"latitude_start": "lat", "longitude_start": "lon"}, inplace=True)
stations = stations.reset_index().filter(["station_id", "lat", "lon"])
stations
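The step above relies on every trip from a given station carrying the same coordinates. One way to check that assumption is to count the distinct coordinate values per station. The sketch below uses pandas on made-up rows as a CPU stand-in, assuming `groupby(...).nunique()` behaves the same way on the cuDF DataFrame; station 3 is deliberately given two different latitudes.

```python
import pandas as pd

# Made-up trips: station 2 is consistent, station 3 has two latitudes.
trips = pd.DataFrame(
    {
        "from_station_id": [2, 2, 3, 3],
        "latitude_start": [41.88, 41.88, 41.90, 41.91],
        "longitude_start": [-87.62, -87.62, -87.63, -87.63],
    }
)

# Count distinct coordinate values per station; anything above 1 is inconsistent.
distinct = trips.groupby("from_station_id")[
    ["latitude_start", "longitude_start"]
].nunique()
inconsistent = distinct[(distinct > 1).any(axis=1)]

print(inconsistent.index.tolist())  # [3]
```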

Finally write the results to "stations.csv" in our data directory:

In [None]:
stations.to_csv(DATA_DIR / "stations.csv", index=False)
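A quick read-back confirms that the file was written with the expected columns and row count. The sketch below is self-contained: it writes a small made-up stations table to a temporary directory with pandas rather than touching the real data directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stations table with the same columns as the one saved above.
stations_demo = pd.DataFrame(
    {"station_id": [2, 3], "lat": [41.88, 41.90], "lon": [-87.62, -87.63]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "stations.csv"
    stations_demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())  # ['station_id', 'lat', 'lon']
print(len(reloaded) == len(stations_demo))  # True
```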

## Summary of Data