# Data Analysis with Visual Analytics
**Combining some basic analytics with visualization**

NOTE: the goal is to show the benefit of easy compatibility across the RAPIDS framework, and the speed at which we can iterate on a large(ish) dataset.

## Overview and Requirements
Super short version of the intro notebook, restating the requirements.

## Import

In [None]:
import cudf
import cugraph
import cupy
import cuspatial

import pandas as pd

import hvplot.cudf
import hvplot.pandas

## Load Cleaned Data / Check

In [None]:
from pathlib import Path

DATA_DIR = Path("./data")

In [None]:
df = cudf.read_csv(DATA_DIR / "data.csv", parse_dates=['starttime', 'stoptime'])

In [None]:
stations = pd.read_csv(DATA_DIR / "stations.csv")

## Run some analysis -> ? = profit?
Mix of cuML / cuspatial analysis to then visualize

## Intro to cuGraph / cuSpatial / cupy

We will analyze the data using some RAPIDS tools:

* [cuGraph](https://docs.rapids.ai/api/cugraph/stable/) is a collection of GPU-accelerated graph algorithms that process data found in GPU DataFrames. We can use it to compute PageRank or degree measures on the trip graph.

* [cuSpatial](https://docs.rapids.ai/api/cuspatial/nightly/) is a collection of GPU-accelerated algorithms for computing geospatial measures. We can use it to find trips within given bounding regions.

## Spatial

Let's take a look at some spatial measures and see if there are any interesting features.

We might start with the first station and see what the minimum and maximum trip lengths from it are.

In [None]:
r0 = df.iloc[0]
origin_lon, origin_lat = r0["longitude_start"], r0["latitude_start"]

The cuSpatial function `lonlat_to_cartesian` lets us quickly compute the x/y distances from the origin to every trip's ending location:

In [None]:
dist = cuspatial.lonlat_to_cartesian(origin_lon[0], origin_lat[0], df["longitude_end"], df["latitude_end"])
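To make the idea of "x/y distances relative to an origin" concrete, here is a minimal CPU-only sketch in plain Python. It is an assumption-laden illustration, not cuSpatial's implementation: the helper names `lonlat_offset_km` and `max_distance_km` are hypothetical, the Earth is treated as a sphere with a 40,000 km circumference, and sign conventions may differ from the library's (the resulting distance is unaffected by sign).

```python
import math

# Assumption: kilometres per degree on a sphere with a 40,000 km circumference.
KM_PER_DEGREE = 40000.0 / 360.0


def lonlat_offset_km(origin_lon, origin_lat, lon, lat):
    """Approximate x/y offset in km of (lon, lat) from the origin.

    Hypothetical helper: an equirectangular approximation that scales
    longitude differences by the cosine of the mid-latitude.
    """
    mid_lat = math.radians((origin_lat + lat) / 2.0)
    x = (lon - origin_lon) * KM_PER_DEGREE * math.cos(mid_lat)
    y = (lat - origin_lat) * KM_PER_DEGREE
    return x, y


def max_distance_km(origin_lon, origin_lat, points):
    """Largest straight-line distance in km from the origin to any point."""
    return max(
        math.hypot(*lonlat_offset_km(origin_lon, origin_lat, lon, lat))
        for lon, lat in points
    )
```

The approximation is reasonable for city-scale distances such as bike trips; on the GPU the same arithmetic runs over whole columns at once rather than point by point.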

CuPy functions can compute derived values on these GPU dataframes:

In [None]:
cupy.sqrt(cupy.max(dist.x**2 + dist.y**2))

What if we want to compute this max/min trip interval for every station as a starting point?

In [None]:
from_intervals = []

for idx, row in stations.iterrows():
    station_id, origin_lon, origin_lat = int(row["station_id"]), row["lon"], row["lat"]
    dist = cuspatial.lonlat_to_cartesian(origin_lon, origin_lat, df["longitude_end"], df["latitude_end"])
    from_intervals.append((station_id, float(cupy.sqrt(cupy.max(dist.x**2 + dist.y**2)))))

from_intervals = pd.DataFrame(from_intervals, columns=["station_id", "dist"])

In [None]:
from_longest = from_intervals.nlargest(25, "dist")

In [None]:
from_longest_loc = stations[stations.station_id.isin(from_longest.station_id)]

In [None]:
from_longest_loc.hvplot.points(x='lon', y='lat', size=300, geo=True, tiles="OSM").opts(width=800, height=800)

## Pagerank

We will use the `cugraph.pagerank` function to see if there are patterns for the "most popular" stations. 

In [None]:
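df["weekday"] = df['starttime'].dt.weekday
# The hour of day is used for the per-hour subsets below; derive it here
# in case the cleaned data does not already include it.
df["hour"] = df['starttime'].dt.hour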
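The weekday values matter later when days are split into weekdays and weekends. As a small sketch using only the standard library (and assuming the `.dt.weekday` accessor follows the same convention as Python's `datetime.weekday()`, as it does in pandas), Monday is 0 and Sunday is 6, so values 5 and 6 are the weekend. The helper `is_weekend` and the sample dates are made up for illustration.

```python
from datetime import datetime


def is_weekend(ts):
    """Return True for Saturday or Sunday.

    datetime.weekday() numbers days Monday == 0 through Sunday == 6.
    """
    return ts.weekday() >= 5


saturday = datetime(2024, 1, 6, 17, 30)  # illustrative date, a Saturday
monday = datetime(2024, 1, 8, 9, 0)      # illustrative date, a Monday
```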

### Single hour 

Let's see what it looks like to compute PageRank for a single hour of the day, e.g. 5 PM. First, subset the data to only the trips starting in that hour:

In [None]:
d17 = df[df["hour"]==17]

Next, group by (from_station_id, to_station_id) and take the group size. This gives every unique route between stations during that hour, along with the number of trips taken on each route:

In [None]:
g17 = d17.groupby(by=["from_station_id", "to_station_id"])
routes17 = g17.size().reset_index()
routes17.head()
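Conceptually, grouping by the station pair and taking the size is just counting how often each (from, to) pair occurs. A minimal pure-Python sketch of that counting step, using made-up station IDs rather than the real trip data:

```python
from collections import Counter

# Made-up (from_station_id, to_station_id) pairs, one per trip.
trips = [(1, 2), (1, 2), (2, 3), (1, 2), (3, 1)]

# Count how many trips took each unique route.
route_counts = Counter(trips)

# Flatten to (from, to, count) rows, similar in shape to the
# reset_index() output of a group-by size.
rows = sorted((src, dst, n) for (src, dst), n in route_counts.items())
```

The group-by version does the same thing on the GPU, which is what makes it practical on millions of trips.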

Now we can create a `cugraph.Graph` and load the trips as its edge list:

In [None]:
G = cugraph.Graph()

In [None]:
G.from_cudf_edgelist(d17, source='from_station_id', destination='to_station_id')

In [None]:
d17_page = cugraph.pagerank(G)
d17_page.head()
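For intuition about what the `pagerank` column represents, here is a small power-iteration sketch in plain Python. It is an illustrative textbook formulation, not cuGraph's implementation: the function name, the damping factor `alpha=0.85`, the tolerance, and the toy edge list are all assumptions, and it treats edges as directed and counts repeated edges multiple times, which may differ from how the graph above is constructed.

```python
def pagerank(edges, alpha=0.85, tol=1e-10, max_iter=100):
    """Compute PageRank scores for a directed edge list by power iteration."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}
        # Each node passes its damped rank equally along its outgoing edges.
        for src in nodes:
            if out_links[src]:
                share = alpha * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: stations 1 -> 2 -> 3 -> 1 form a loop, and station 4 feeds into 1.
toy_ranks = pagerank([(1, 2), (2, 3), (3, 1), (4, 1)])
```

In the toy graph, station 1 receives trips from two stations and ends up with the highest score, while station 4, which nobody rides to, ends up with the lowest.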

In [None]:
d17_top = d17_page.nlargest(10, "pagerank").to_pandas()
d17_top.head()

In [None]:
d17_page_locs = stations[stations.station_id.isin(d17_top.vertex)]
d17_page_locs.hvplot.points(x='lon', y='lat', size=300, geo=True, tiles="OSM").opts(width=800, height=800)

Let's repeat the same process and see the highest-ranked stations at noon:

In [None]:
d12 = df[df["hour"]==12]
g12 = d12.groupby(by=["from_station_id", "to_station_id"])
routes12 = g12.size().reset_index()

In [None]:
G = cugraph.Graph()
G.from_cudf_edgelist(d12, source='from_station_id', destination='to_station_id')
d12_page = cugraph.pagerank(G)
d12_page.head()

In [None]:
d12_top = d12_page.nlargest(10, "pagerank").to_pandas()

In [None]:
d12_page_locs = stations[stations.station_id.isin(d12_top.vertex)]
d12_page_locs.hvplot.points(x='lon', y='lat', size=300, geo=True, tiles="OSM").opts(width=800, height=800)

Now let's look at how stations rank on weekdays vs. weekends. The code below computes the PageRank for each individual day of the week and keeps the top 20 stations per day.

In [None]:
results = {}
for w in range(7):
    dfw = df[df["weekday"]==w]
    G = cugraph.Graph()
    G.from_cudf_edgelist(dfw, source='from_station_id', destination='to_station_id')
    df_page = cugraph.pagerank(G).nlargest(20, "pagerank")
    results[w] = set(df_page.to_pandas()["vertex"])

Now let's find out which stations were among the highest ranked on every weekday, and which on both weekend days:

In [None]:
weekday = set.intersection(*[results[i] for i in range(5)])
weekend = set.intersection(results[5], results[6])
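The intersection step keeps only the stations that appear in the top list on every one of the selected days. A tiny sketch with made-up station IDs (the dictionary `top_by_day` stands in for `results` and is purely illustrative) shows the behavior, plus one way to see which stations are specific to weekdays or weekends:

```python
# Made-up top-station sets keyed by weekday number (Monday == 0).
top_by_day = {
    0: {1, 2, 3},
    1: {1, 2, 4},
    2: {1, 2, 5},
    3: {1, 2, 3},
    4: {1, 2, 6},
    5: {2, 7, 8},
    6: {2, 8, 9},
}

# Stations that are top ranked on all five weekdays / both weekend days.
weekday_common = set.intersection(*(top_by_day[d] for d in range(5)))
weekend_common = set.intersection(top_by_day[5], top_by_day[6])

# Stations that are consistently popular only in one of the two groups.
weekday_only = weekday_common - weekend_common
weekend_only = weekend_common - weekday_common
```

A station that drops out of the top list on even one day is excluded, so these sets are a strict measure of consistency.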

In [None]:
weekend

In [None]:
weekday

Finally, we can plot these quickly using `hvplot`. Let's add a column to denote weekday/weekend so that we can group by it:

In [None]:
r1 = stations[stations.station_id.isin(weekend)]
r1 = r1.assign(type="Weekend")

r2 = stations[stations.station_id.isin(weekday)]
r2 = r2.assign(type="Weekday")

result = pd.concat([r1, r2])

In [None]:
result.hvplot.points(x='lon', y='lat', by='type', alpha=0.5, size=485, geo=True, tiles="OSM").opts(width=800, height=800)

## Summary of interesting analytics results 
Does not have to be significant, but notable.