# Data Analysis with Visual Analytics

**Combining some basic analytics with visualization**


## Overview and Requirements

In this notebook we will continue to explore the Divvy bikes dataset using a few new tools like [cugraph](https://docs.rapids.ai/api/cugraph/stable/), [cuspatial](https://github.com/rapidsai/cuspatial), and [cudf](https://docs.rapids.ai/api/cudf/stable/), and see how their results can feed directly into visualization tools like [hvplot](https://hvplot.holoviz.org/user_guide/index.html) and [datashader](https://datashader.org/index.html).

## Import

In addition to the libraries mentioned above, we will also use [cupy](https://docs.cupy.dev/en/stable/), NumPy, and Pandas directly.

In [None]:
import cudf
import cugraph
import cupy
import cuspatial

import numpy as np
import pandas as pd

import datashader as ds
import datashader.transfer_functions as tf

import hvplot.cudf
import hvplot.pandas

## Load Cleaned Data / Check

First let's load the data. In addition to the main Divvy `data.csv` file, we will also load the small `stations.csv` file that we prepared in the first notebook. 

In [None]:
from pathlib import Path

DATA_DIR = Path("./data")

In [None]:
df = cudf.read_csv(DATA_DIR / "data.csv", parse_dates=('starttime', 'stoptime'))

In [None]:
stations = pd.read_csv(DATA_DIR / "stations.csv")

We will eventually want to look for patterns in weekday vs weekend trips, so let's first add a column for that:

In [None]:
df["weekday"] = df['starttime'].dt.weekday
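As a quick check of the numbering that `dt.weekday` produces, the sketch below runs the same accessor on three hand-picked dates. It uses pandas as a CPU stand-in for cuDF, on the assumption that both follow the same Monday=0 convention. The dates are illustrative and are not taken from the Divvy data.

```python
import pandas as pd

# Three illustrative dates: a Monday, a Saturday, and a Sunday.
dates = pd.Series(pd.to_datetime(["2019-01-07", "2019-01-05", "2019-01-06"]))

# dt.weekday numbers days from Monday=0 through Sunday=6.
weekday_numbers = dates.dt.weekday.tolist()

# Days 5 and 6 are therefore the weekend.
is_weekend = dates.dt.weekday.isin([5, 6]).tolist()
```

This is the numbering that the weekday/weekend filters later in the notebook rely on.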

## Run some analysis -> ? = profit?

Mix of cuspatial / cugraph analysis to then visualize

## Intro to cuGraph / cuSpatial / cupy

We will analyze the data using some RAPIDS tools:

* [cuSpatial](https://docs.rapids.ai/api/cuspatial/nightly/) is a collection of GPU accelerated algorithms for computing geo-spatial measures. We can use it to compute trips within given bounding regions.

* [cuGraph](https://docs.rapids.ai/api/cugraph/stable/) is a collection of GPU accelerated graph algorithms that process data found in GPU DataFrames. We can use it to compute pagerank or degree measures on the trip graph.


## CuSpatial

Let's take a look at some spatial measures and see if there are any interesting features.

We might start with the first station, and see what the max trip length from it is.

In [None]:
r0 = df.iloc[0]
station_id, origin_lon, origin_lat = r0["from_station_id"], r0["longitude_start"], r0["latitude_start"]

The cuSpatial function `lonlat_to_cartesian` will let us quickly compute the x/y distances for every ending trip location. (Note that distances will be in *kilometers*.)

In [None]:
sub_df = df[df["from_station_id"]==station_id[0]]
dist = cuspatial.lonlat_to_cartesian(origin_lon[0], origin_lat[0], sub_df["longitude_end"], sub_df["latitude_end"])

CuPy functions can compute derived values on these GPU dataframes:

In [None]:
cupy.sqrt(cupy.max(dist.x**2 + dist.y**2))
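To give a feel for what a lon/lat to x/y conversion involves, here is a rough CPU sketch using an equirectangular approximation. This is not cuSpatial's implementation; the function name `approx_xy_km`, the Earth radius constant, and the sample coordinates are all assumptions made for illustration, and the sign convention of the offsets may differ from cuSpatial's (the resulting distance is unaffected).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def approx_xy_km(origin_lon, origin_lat, lon, lat):
    """Approximate x/y offsets in km of (lon, lat) from an origin.

    Equirectangular approximation: reasonable over city-scale distances.
    """
    mean_lat = math.radians((origin_lat + lat) / 2.0)
    x = math.radians(lon - origin_lon) * EARTH_RADIUS_KM * math.cos(mean_lat)
    y = math.radians(lat - origin_lat) * EARTH_RADIUS_KM
    return x, y


# One degree of latitude due north of an illustrative point in Chicago.
x, y = approx_xy_km(-87.63, 41.88, -87.63, 42.88)
dist_km = math.hypot(x, y)  # straight-line distance from the offsets
```

One degree of latitude comes out at roughly 111 km, which is a handy sanity check on the units.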

What if we want to compute this for all trips? We can compute the distances using every station as a starting point:

In [None]:
def trip_dists(df):
    results = []

    for idx, row in stations.iterrows():
        station_id, origin_lon, origin_lat = int(row["station_id"]), row["lon"], row["lat"]
        sub_df = df[df["from_station_id"]==station_id]
        res = cuspatial.lonlat_to_cartesian(origin_lon, origin_lat, sub_df["longitude_end"], sub_df["latitude_end"])
        res["dist"] = cupy.sqrt(res.x**2 + res.y**2)
        results.append(res)
        
    return cudf.concat(results)

In [None]:
all_from_dists = trip_dists(df)

In [None]:
all_from_dists.hvplot.hist(y="dist", normed=True)

It might also be interesting to break the distribution of trips down by weekday vs weekend:

In [None]:
weekend_trips = df[df["weekday"].isin([5, 6])] # weekend days = 5, 6 
weekday_trips = df[df["weekday"].isin(list(range(5)))]  # weekday days = 0..4

In [None]:
weekend_dists = trip_dists(weekend_trips)
weekday_dists = trip_dists(weekday_trips)

In [None]:
all_combined_dists =  cudf.concat([weekday_dists, weekend_dists])
all_combined_dists.head()

Plotting these two distributions together, we can see that weekday trips (orange) peak more sharply at shorter distances, while the weekend distribution (blue) has more long trips.

In [None]:
weekend_hist = weekend_dists.hvplot.hist(y="dist", alpha=0.3, bin_range=(0, 20), normed=True, color="blue")
weekday_hist = weekday_dists.hvplot.hist(y="dist", alpha=0.3, bin_range=(0, 20), normed=True, color="orange")
weekend_hist * weekday_hist

## CuDF

Let's use cuDF directly to group and aggregate our data to look for anything interesting about the flow of trips in and out of stations.

We want to look at the daily net flow of trips at each station, i.e. how many more (or fewer) trips *started* at a station vs *ended* at a station in a given day.

In order to group by day, we first take the "floor" of each timestamp divided by one day:

In [None]:
one_day = np.datetime64(1, 'D').astype('datetime64[ns]').astype('int64') 

df['from_day'] = df['starttime'].astype('int64') // one_day
df['to_day'] = df['stoptime'].astype('int64') // one_day
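The arithmetic behind this day index can be checked on a single value. The sketch below uses NumPy scalars rather than cuDF columns, and the timestamp is made up for illustration.

```python
import numpy as np

# Number of nanoseconds in one day, computed the same way as above.
one_day = np.datetime64(1, "D").astype("datetime64[ns]").astype("int64")

# An illustrative timestamp, viewed as nanoseconds since the epoch.
ts = np.datetime64("2019-06-15T17:42:00", "ns")
day_index = ts.astype("int64") // one_day

# Multiplying back by one_day recovers midnight of the same date.
midnight = (day_index * one_day).astype("datetime64[ns]")
```

Integer division discards the time of day, so every trip on the same calendar day (in the timestamp's own clock) maps to the same integer.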

Now we can group by the station id and day for both the departing and arriving cases. We name the size columns of the resulting DataFrames `out` and `in` respectively:

In [None]:
df_out = df.groupby(by=["from_station_id", "from_day"]).size().to_frame('out').reset_index()
df_in = df.groupby(by=["to_station_id", "to_day"]).size().to_frame('in').reset_index()

Let's rename the columns to be the same in both DataFrames:

In [None]:
df_out.rename(columns={"from_station_id": "station_id", "from_day": "day"}, inplace=True)
df_in.rename(columns={"to_station_id": "station_id", "to_day": "day"}, inplace=True)

And reset the index to be the (station id, day) pair:

In [None]:
df_out = df_out.set_index(["station_id", "day"])
df_in = df_in.set_index(["station_id", "day"])

Now we can join these two DataFrames to compute a `flow = out - in` column:

In [None]:
full_df = df_in.join(df_out, how="outer").fillna(0).reset_index()
full_df["flow"] = full_df["out"] - full_df["in"]
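The whole group, join, and subtract sequence can be traced by hand on a tiny example. The sketch below uses pandas as a CPU stand-in and four hypothetical trips on a single day; the names `trips`, `inn`, and `toy_flow` are illustrative only.

```python
import pandas as pd

# Hypothetical trips: each row is one trip, all on day index 0.
trips = pd.DataFrame({
    "from_station_id": [1, 1, 2, 3],
    "to_station_id":   [2, 2, 1, 2],
    "from_day":        [0, 0, 0, 0],
    "to_day":          [0, 0, 0, 0],
})

# Count departures and arrivals per (station, day).
out = (trips.groupby(["from_station_id", "from_day"]).size()
       .rename_axis(["station_id", "day"]).to_frame("out"))
inn = (trips.groupby(["to_station_id", "to_day"]).size()
       .rename_axis(["station_id", "day"]).to_frame("in"))

# Outer join keeps stations that appear on only one side; fill gaps with 0.
toy_flow = inn.join(out, how="outer").fillna(0).reset_index()
toy_flow["flow"] = toy_flow["out"] - toy_flow["in"]

flow_by_station = dict(zip(toy_flow["station_id"], toy_flow["flow"]))
```

Station 3 has departures but no arrivals, which is why the outer join and `fillna(0)` matter. Because every trip is counted once as a departure and once as an arrival, the flows across all stations sum to zero.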

Let's also convert our "day" values back to proper timestamps:

In [None]:
full_df["time"] = (full_df["day"] * one_day).astype('datetime64[ns]')
full_df = full_df[["station_id", "time", "flow"]]

Now we can take a glimpse at the resulting DataFrame which has the net trip flow by station per day. A positive number means there was an excess of trips *starting* at station that day. A negative number indicates an excess of trips *ending* at a station that day.

In [None]:
full_df.head()

We might like to look at the extreme behaviour. What is a high number of excess arrivals or departures at a station? Let's pull out individual timeseries for each station id, and look at the max/min for each station:

In [None]:
flows = []
for i in stations.station_id:
    subdf = full_df[full_df.station_id==i].drop(columns="station_id").set_index("time")
    flows.append((i, subdf.flow.max(), subdf.flow.min()))
flows = pd.DataFrame(flows, columns=["station_id", "max_out", "max_in"])
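The loop above filters the full DataFrame once per station. One possible alternative is a single grouped aggregation, sketched here in pandas on hypothetical data. Whether it would be faster on the GPU was not measured in this notebook, and unlike the loop it simply omits stations that have no rows at all.

```python
import pandas as pd

# Hypothetical daily net flows for two stations.
toy = pd.DataFrame({
    "station_id": [10, 10, 10, 20, 20],
    "flow":       [4, -7, 2, -1, 3],
})

# One pass: largest and smallest daily flow per station.
extremes = (toy.groupby("station_id")["flow"]
            .agg(max_out="max", max_in="min")
            .reset_index())
```

The result has the same three columns as `flows`, so the later cells would work on it unchanged.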

In [None]:
flows

With this information, we can see which stations had the largest ever excess departures (station 192) or arrivals (station 77):

In [None]:
flows.iloc[flows.max_out.argmax()]

In [None]:
flows.iloc[flows.max_in.argmin()]

Knowing about excess arrivals vs departures is probably important for Divvy to be able to manually re-allocate bikes. We could ask how many stations ever have a daily excess of more than 30 trips:

In [None]:
len(flows[flows.max_out > 30])

In [None]:
len(flows[flows.max_in < -30])
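The two cells above return counts. If a fraction of stations is wanted instead, taking the mean of the boolean mask is one way to get it. The sketch below uses a hypothetical four-station table, not the real `flows` data.

```python
import pandas as pd

# Hypothetical per-station extremes.
toy_flows = pd.DataFrame({
    "station_id": [1, 2, 3, 4],
    "max_out": [45, 12, 31, 8],
    "max_in": [-5, -40, -10, -33],
})

# The mean of a boolean mask is the fraction of rows where it is True.
frac_out = (toy_flows["max_out"] > 30).mean()
frac_in = (toy_flows["max_in"] < -30).mean()
```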

We can try to look at all of these trip flow series for each station at once, using Datashader. First we need to prepare a new DataFrame that has all the series as columns:

In [None]:
series = []

for i in stations.station_id:
    s = full_df[full_df.station_id==i][["time", "flow"]]
    s.rename(columns={"flow": f"s{i}"}, inplace=True)
    s = s.set_index("time")
    series.append(s)
    
df_wide = cudf.concat(series, axis=1).fillna(0)
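The same long-to-wide reshaping can also be expressed as a pivot. The sketch below shows this in pandas on hypothetical data; it is an alternative formulation for illustration, not what the notebook runs, and it assumes each (time, station) pair occurs at most once.

```python
import pandas as pd

# Hypothetical long-format flows: one row per (station, day).
long_df = pd.DataFrame({
    "station_id": [81, 81, 287],
    "time": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-02"]),
    "flow": [3, -1, 5],
})

# Pivot to one column per station; days with no record become 0.
wide = long_df.pivot(index="time", columns="station_id", values="flow").fillna(0)
wide.columns = [f"s{c}" for c in wide.columns]
```

Station 287 has no record for the first day, so its column is filled with 0 there, matching the `fillna(0)` in the cell above.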

The resulting Dataframe has a time series for every column, one for each station:

In [None]:
df_wide

It's simple to pull out individual stations for comparison using `hvplot`:

In [None]:
df_wide.hvplot(y=["s81", "s287"], alpha=0.3)

Lastly, let's take a look at the data with datashader. First we make a function `series_shade` that can take a wide dataframe of timeseries like the one we made above, and render *all* of the series at once using datashader:

In [None]:
def series_shade(df):
    cols = list(df.columns)
    
    itime = cudf.to_datetime(df.index).astype('int64')
    x_range = (itime[0], itime[-1])
    
    y_range = (df.min().min(), df.max().max())
    
    temp = cudf.DataFrame(df)
    temp["itime"] = itime
    
    cvs = ds.Canvas(plot_height=400, plot_width=1000)
    agg = cvs.line(temp, x="itime", y=cols, agg=ds.count(), axis=1)
    
    print(f"y range: ({y_range[0]}, {y_range[1]})")
    return tf.shade(agg, how='eq_hist')

Now let's pass in our daily net excess data to get a rough datashader plot:

In [None]:
series_shade(df_wide)

It's not completely clear what we can see here, but it points to some ideas for future exploration.

As a last experiment, let's make the same plot, but with *cumulative* excess trips:

In [None]:
df_cumulative = df_wide.cumsum()

In [None]:
series_shade(df_cumulative)
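To see why the cumulative view can reveal more than the daily one, consider a station with a small but persistent daily imbalance. The numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical daily net flow: a small but persistent excess of departures.
daily = pd.Series([2, -1, 3, 1, 2], name="s1")

# The running total shows the imbalance accumulating over time.
cumulative = daily.cumsum()
```

No single day exceeds 3 excess trips, yet the running total reaches 7 after five days, which is the kind of drift the cumulative plot makes visible.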

This is a bit more interesting and points to the notion that Divvy must be engaging in a lot of continual re-allocation of its bikes to offset these excess trips.

## Pagerank

Now we will use the `cugraph.pagerank` function to see if there are patterns for the "most popular" stations.

### Single hour 

First, let's see what it looks like to compute pagerank for a single hour of the day, e.g. 5PM. First subset the data to only look at trips starting at that hour:

In [None]:
d17 = df[df["hour"]==17]

Next group by (from_station_id, to_station_id) and then take the group size to get all the unique individual routes between stations that hour, along with the number of trips that took each of those routes:

In [None]:
g17 = d17.groupby(by=["from_station_id", "to_station_id"])
routes17 = g17.size().reset_index()
routes17.head()

Now we can create a `cugraph.Graph` 

In [None]:
G = cugraph.Graph()

In [None]:
G.from_cudf_edgelist(d17, source='from_station_id', destination='to_station_id')

In [None]:
d17_page = cugraph.pagerank(G)
d17_page.head()
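For intuition about what a pagerank score measures, here is a minimal pure-Python power iteration on a hypothetical three-station trip graph. It is a textbook sketch, not cuGraph's implementation; the function, the damping value of 0.85, and the iteration count are assumptions chosen for illustration, and cuGraph may treat duplicate edges or weights differently.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Basic power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes its rank along its outgoing edges in equal shares.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical trips between three stations; most trips end at station "C".
toy_trips = [("A", "C"), ("B", "C"), ("A", "B"), ("C", "A")]
ranks = pagerank(toy_trips)
top_station = max(ranks, key=ranks.get)
```

A station ranks highly when many trips arrive from stations that are themselves highly ranked, which is why "C" comes out on top here.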

Now that we have computed pagerank on the network of trips, let's see which stations rank as most important at 5PM (on any day)

In [None]:
d17_top = d17_page.nlargest(10, "pagerank").to_pandas()
d17_top.head()

Plotting these stations we can see that at 5PM the most important stations are all downtown:

In [None]:
d17_page_locs = stations[stations.station_id.isin(d17_top.vertex)]
d17_page_locs.hvplot.points(x='lon', y='lat', size=300, geo=True, tiles="OSM").opts(width=800, height=800)

Now let's look at how stations rank on weekdays vs weekends. The code below computes the pagerank broken out by individual day of the week.

In [None]:
results = {}
for w in range(7):
    dfw = df[df["weekday"]==w]
    G = cugraph.Graph()
    G.from_cudf_edgelist(dfw, source='from_station_id', destination='to_station_id')
    df_page = cugraph.pagerank(G).nlargest(20, "pagerank")
    results[w] = set(df_page.to_pandas()["vertex"])

Now let's find out what stations were highest ranked among all weekdays and weekend days

In [None]:
weekday = set.intersection(*[results[i] for i in range(5)]) # days 0..4 are weekdays
weekend = set.intersection(results[5], results[6])  # days 5 and 6 are the weekend
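The `set.intersection(*...)` idiom keeps only the stations that appear in every one of the per-day sets. A small sketch with hypothetical station ids:

```python
# Hypothetical top-station sets for three days.
top_by_day = {
    0: {35, 77, 91, 192},
    1: {35, 77, 192, 268},
    2: {35, 192, 268, 91},
}

# Unpack the per-day sets; only stations present in every set survive.
always_top = set.intersection(*[top_by_day[d] for d in range(3)])
```

A station that misses the top list on even one day is dropped, so this is a strict notion of "consistently important".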

Now we can see the stations that are important on all weekdays, and those that are important on both weekend days (and that there is not much overlap):

In [None]:
weekend

In [None]:
weekday

Finally we can plot these quickly using `hvplot`. Let's add a column to denote weekday/weekend so that we can group by that

In [None]:
r1 = stations[stations.station_id.isin(weekend)]
r1 = r1.assign(type="Weekend")

r2 = stations[stations.station_id.isin(weekday)]
r2 = r2.assign(type="Weekday")

result = pd.concat([r1, r2])

Looking at the plot, nearly all the important weekday stations are downtown, and on the weekend the important stations are further out, in popular districts around downtown:

In [None]:
result.hvplot.points(x='lon', y='lat', by='type', 
                     alpha=0.5, size=485, geo=True, tiles="OSM").opts(width=800, height=800)

We can note that the important weekday stations (for all hours) are clustered downtown, much like the top rush-hour stations at 5PM.

What about looking at our previous trip flow data with respect to weekday vs weekend?

In [None]:
flows[flows.station_id.isin(weekday)]

In [None]:
flows[flows.station_id.isin(weekend)]

For example, looking at all the weekday stations, we can see that some stations consistently have more excess departures or arrivals:

In [None]:
df_wide.hvplot(y=[f"s{n}" for n in weekday], alpha=0.2)

We can also look at the weekend stations. Here there is one station with a clear excess of arrivals. 

In [None]:
df_wide.hvplot(y=[f"s{n}" for n in weekend], alpha=0.2)

Hovering over the plot we can see that it is station 268, which turns out to be right by the waterfront museum district:

In [None]:
df[df["to_station_id"]==268].iloc[0]

## Summary of interesting analytics results 
Does not have to be significant but notable