bart-passenger-heatmap.Rmd

---
title: "BART Passenger Heatmap"
author: "Justin Perona"
date: "2019-03-20"

# code courtesy of R Markdown: The Definitive Guide by Xie, Allaire, and Grolemund
# https://bookdown.org/yihui/rmarkdown/html-document.html
output:
  html_document:
    toc: true
    toc_depth: 4
    toc_float:
      collapsed: true
      smooth_scroll: false
---

<style>
body {
text-align: justify}
</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction

In Winter Quarter 2018, I took ECI 254 (Exploring Data from Built Environment Using R) with Professor Deb Niemeier.
My final project for that class was to create a heatmap of passengers that traveled through the San Francisco Bay Area Rapid Transit (BART) system.

### Previous Work and Motivation

In Spring Quarter 2018, I took MAE 253 (Network Theory) with Professor Raissa D'Souza.
My final project for that class involved analyzing the network formed by the BART system with a team.
One of the papers I read for that project, [*Weighted complex network analysis of travel routes on the Singapore public transportation system*](https://doi.org/10.1016/j.physa.2010.08.015.) by Soh et al., argued that a dynamic (weighted) network analysis of a metro rail network like BART could give different insights into a network than a topological (unweighted) analysis.
In order to analyze the BART network dynamically, I wanted to create visualizations of the BART network at different times and days.
To do so, I ended up writing a Python script that would parse the BART hourly passenger origin-destination data [available on their website](https://www.bart.gov/about/reports/ridership) into a format that could be easily plotted in an external tool.
This allowed me to create the visualizations my team needed for the project.
That Python script and documentation on how to use it is available publicly on GitHub, in my repository [`jlperona/bart-hourly-dataset-parser`](https://github.com/jlperona/bart-hourly-dataset-parser).

After the class, I had a conversation with one of my teammates about how to take this work further.
I thought about creating a heatmap of the BART network using the same data I worked with.
The heatmap would make it easy to see at a glance where the majority of the passengers were travelling.
My teammate's suggestion was to use a visualization library like [D3.js](https://d3js.org/).
However, I didn't have experience with JavaScript, so I ended up shelving the idea.

When the final project for ECI 254 was announced, I started thinking of datasets I could use.
Professor Niemeier had already put Shiny in my mind, since a previous student had used Shiny for their final project.
After some consideration, I remembered my previous work in MAE 253, and my heatmap idea.
I decided that I wanted to try to implement the heatmap in Shiny and see how far I could get.

### Objective

My objective was to create a Shiny application that would display BART passenger data on an actual map.
The application should show where the BART stations and tracks were, as well as color each track by how many passengers traveled on it.
I also wanted users to be able to subset the input data by date and hour, so they could see how many passengers traveled at certain points in time.

### Changes Made After the Presentation

I demonstrated an incomplete version of the application in class on 2019-03-15.
Since then, I have made the following changes (as shown by [my commit history for the repository](https://github.com/jlperona/bart-passenger-heatmap/commits/master)):

* Modify the sidebar to be more useful, and actually use the options that the user input rather than ignoring it.
* Use Shiny's reactives to calculate and update the displayed data based on the user's input. I also coded it so only the tracks are re-rendered on an update, instead of the entire map.
* Validate the user's input and display errors if there's anything wrong. This makes the application more robust. Previously, it would crash if incorrect input was given.

For more detail on the topics listed above, see the *Additional Features* subsection in the *Methodology* section below.

## Methodology

There are two major portions to this application:

1. Using Python, parsing through the BART data to create an input file.
2. Using Shiny, collating the input file and rendering the heatmap.

### Parsing

The current parsing mechanism is a preparser that only needs to be run once to create the input data for the Shiny application.
The Shiny application then takes that input data, collates it, and displays it.

#### Input Data

BART provides hourly passenger origin-destination data [on their website](https://www.bart.gov/about/reports/ridership).
I found this dataset when I was working on the final project for MAE 253.
At the time of writing, their data ranges from 2011-01-01 to 2018-12-31, which is eight years' worth of data.
The files are separated by year, and in their uncompressed form are approximately 250 MB each, which totals 2 GB for the current eight years of data.
The data is in the following format:

```
2011-05-23,4,FRMT,UCTY,77
```

In order, each column stands for the following:

1. Date, in ISO 8601-style `YYYY-MM-DD` format.
2. Hour, in 24-hour time.
3. Origin station, in BART's 4-letter abbreviation code.
4. Destination station, in BART's 4-letter abbreviation code.
5. The number of passengers who traveled between those two stations. Combinations of stations for which there were zero passengers that traveled between them are not included in the data.

Based on this, the BART data can be interpreted as a [graph](https://en.wikipedia.org/wiki/Graph_\(discrete_mathematics\)).
The vertices are the stations, the edges represent passenger flow, and the weight of each edge is based on how many passengers traveled between that station pair.
The source data then can be interpreted as a fully-connected graph, since from any station there can be an entry in the data (and thus an edge in the graph) to any other station.
A fully-connected graph is hard to visualize, however.

#### Prior Work

Instead of attempting to visualize a fully-connected graph, it would be easier if the graph was in the same format as [the BART system map](https://www.bart.gov/system-map).
There would only be edges between adjacent stations on the network.
The edge weights then become how many passengers traveled between those two stations.
Their final destination may not have necessarily been either of those two stations, but they traveled on the track between the two to get to their final destination.
Interpreting this visually, we want to turn the graph representation on the left into the one on the right, with the weight of an edge on the left added to every edge on the shortest path between the two stations on the right:

![](./graphics/parser.png)

The BART data is not in this simplified format, however.
In [my previous work for MAE 253](https://github.com/jlperona/bart-hourly-dataset-parser), I wrote a parser script in Python that parses through the BART origin-destination data and exports a graph file that is in the same format as the system map.
I used the Python package `NetworkX` to do the heavy lifting in the parser.
`NetworkX` has tools to create, manipulate, and analyze graphs and networks.
Since it had a function to determine shortest paths in a graph, it was perfect for my use case.
The methodology behind that script is as follows:

1. Parse a base input graph to set up the network topology, such as the graph on the right above.
    * The graph is considered to be unweighted and undirected.
2. Read in the hourly origin-destination data line by line.
    1. Parse out the origin station, destination station, and number of passengers between those two stations at that date and time.
    2. Calculate the shortest path between the origin station and destination station.
        * Since the input graph is both unweighted and undirected, a [breadth-first search](https://en.wikipedia.org/wiki/Breadth-first_search) is the best algorithm to determine these shortest paths.
        * Due to the current topology of the network, there will only ever be one shortest path between stations. There are no possible alternate paths that could cause an ambiguity.
    3. Add the number of passengers to every edge along the shortest path.
3. Export the final weighted graph to a graph file similar in format to the base input graph.

#### Parsing On Demand in R

In my Shiny application, I needed the same functionality that the Python script from my previous work gave.
Being able to turn the BART data into something more easily visualized was crucial to making the application useful.

Initially, I tried to replicate my previous Python script using R.
Instead of exporting a graph file, though, I would access the final graph's weights and use that as the passenger data for each edge.
You can see that attempt in [my repository history](https://github.com/jlperona/bart-passenger-heatmap/blob/a2ccbe602d75cb41e230803973afae65edb0d149/other/igraph.R).
This would mean that the Shiny application would take in the BART data as a direct input.
Since `NetworkX` doesn't exist in R, I used a package called `igraph` instead.
`igraph` also provides graph and network analysis tools, and is available in R.
It provided the same functionality that I needed from `NetworkX`, so it seemed like an appropriate choice.

I was able to successfully implement the parsing using `igraph`, but I came across a problem.
Parsing through a one-day subset of the data with about 26,000 rows would take about 5 minutes to execute in R, which was unreasonable.
Scaling this to the entire eight-year dataset would have been infeasible.
Another issue was that I would have had to include all 2 GB of the BART data with my Shiny application.
Parsing on demand was not the correct solution here.

#### Preparsing in Python

Some of my friends suggested other solutions, such as building a database to contain all the data.
I eventually settled on creating a *preparser* that would do the heavy lifting ahead of time.
It would replicate most of the functionality of the original Python script.
However, instead of writing to a graph file, it wrote to an output CSV with the same format as the BART data.
Instead of there being thousands of rows for a given date-hour combination, it would combine them into only 49 rows, the number of edges in the current BART network.
This significantly reduces the final file size.

Most of the logic for the preparser was identical to my original script, save for the exporting.
Based on this, I wrote the preparser in Python, again leveraging `NetworkX` for its graph functions.
This script can be found in the `preparser/` subdirectory in this repository.
Running the preparser on my laptop took 1 hour with all eight years' worth of data.
In the end, though, I obtained an output file that was approximately 90 MB in size.
The original files were approximately 2 GB in size, which means I managed to create an input file that was about 4.4% the size of the input files.
This file is `date-hour-soo-dest-all.csv` in the `data/` subdirectory of my repository.

Now, instead of loading then parsing the BART data, the Shiny application loads the preparsed data and only runs smaller functions to collate the data together.
Instead of taking five minutes to parse one day, my application can display results in seconds.
An additional benefit is that the preparser is easy to run if BART adds more data.
I believe that this is the most technically impressive part of my work.

### Base Shiny Application

After preparsing the data, we now need to display it.
Using the code in this repository, Shiny will render the application.

#### File Structure

There are three files where my code for the Shiny application resides in, each with a different purpose.

* `ui.R`, which contains the code that sets up the final user interface (UI) of the application. There are no calculations done in this file.
* `global.R`, which contains any variables and processing that is used by the entire application.
* `server.R`, which contains the code that is executed by the application to be displayed to the user.

It is also possible to make a Shiny application using a single file, but I opted against that.
Splitting the code into clear, separate pieces made sense to me.

#### Mapping the Stations and Tracks

The first step was to render the BART stations and tracks.
BART provides station and line GIS data [on their website](https://www.bart.gov/schedules/developers/geo).
I was able to take their station data mostly as is.
However, their line data was one big line for each of BART's routes.
I needed separate lines between each station, so that I could color each of them separately depending on how many passengers traveled on that edge.

To do this, I imported BART's station and line data into ArcGIS.
I exported the station data into the GeoJSON format after removing up some unnecessary variables in the data.
For the track data, I had to split the lines for the routes into separate segments, using the stations as the points where to split them.
Due to the way it was split, I had to merge some pieces back together, and remove some duplicate segments of track.
Finally, I cleaned up some unnecessary variables in the data and exported the track data into GeoJSON as well.
You can find both of these files in the `geojson/` subdirectory in this repository.

Now that I had the GIS data in the format I needed, I could bring the data into R and map it.
To do so, I needed to use the following packages:

* `rgdal`, which I used to import the GeoJSON files into R
* `leaflet`, which is used for creating maps in R much like `mapview`
* `shiny`, the framework I used to build the app

Combining these together, I was able to successfully render the tracks on a map.

#### Coloring the Tracks

The next step was to actually create a heatmap and color each individual segment of track by the number of passengers.
`leaflet` makes it easy to color segments using a continuous variable, and the number of passengers is continuous.
However, I first needed to get the passenger data.
The preparsed data is separated by date and time, so I needed to aggregate the passenger count on each segment of track.
Once I did that, I had to merge that together with the track data so that `leaflet` knows what passenger count corresponds to which track segment.

To accomplish this, I used the following packages:

* `data.table` to quickly import the CSV file
* `fasttime` to quickly convert the dates in the preparsed data to `POSIXct`, a date format in R

I could have done all the functionality that these two packages provide in base R, but using these two made the processing go much faster.
With a slight modification to the track rendering mentioned in the previous section, I was able to successfully color the tracks.

#### User Interface (UI)

Finally, I had to create the interface that the user would interact with.
`ui.R` defines the UI; I opted for a simple main panel and side bar.
The main panel contains the map, which is rendered using `leaflet`.
The side bar contains inputs for the user to subset the preparsed data by date and time, as well as a button to update the map when pressed.
When I presented my app originally, the sidebar was non-functional, so there wasn't much purpose to it at that point.

### Additional Features

After the presentation, I added reactivity to my application, and validated user input as well.
These features built off the original base application that I had presented.

#### Reactivity

[According to the Shiny documentation](https://shiny.rstudio.com/articles/understanding-reactivity.html), *reactivity* is what makes Shiny apps responsive.
A reactive value will update itself when a value that it depends on is updated.
In this application, the values that can be updated are the user's requested choices to subset the data by date and time.
I wanted the tracks to be re-rendered and coloring to be updated if those values are changed.
Thus, I needed to make a reactive variable for the track data and number of passengers on each piece of track.
I did this with the [`reactive()`](https://shiny.rstudio.com/reference/shiny/1.1.0/reactive.html) function, which declares a variable as reactive.
In `server.R`, you can find the following function call:

```{r echo=TRUE,eval=FALSE}
passengerData <- reactive({
  # subset and collate the data
)}
```

`passengerData` depends on the date and time variables from the input boxes in the sidebar.
When those are updated, `passengerData` will update.
However, I didn't want to have it update every single time the user made a change.
I wanted the user to have to explicitly press a button to re-render the data, to cut down on needless re-renderings.
Further down in `server.R`, you can find the following function call:

```{r echo=TRUE,eval=FALSE}
observeEvent(input$updateDatetime,
             ignoreNULL = FALSE,
             ignoreInit = FALSE, {
  # validate user input and re-render the map
})
```

[`observeEvent()`](https://shiny.rstudio.com/reference/shiny/1.1.0/observeEvent.html) allows us to perform an action in response to an event.
In this case, it's when the user presses on the update button in the UI.
Now, whenever the button is pressed, the `observeEvent()` function is run, which checks if the user input is valid.
This is discussed further in the next section.
If it is, then it re-renders only the tracks using the function [`leafletProxy()`](https://www.rdocumentation.org/packages/leaflet/versions/2.0.2/topics/leafletProxy) in `leaflet`.
The base layer of the map and the stations are left untouched.
The tracks are dependent on `passengerData`, so that will be updated as well.
It will re-subset the data based on the user input and update the passenger counts.

Together, these two functions allowed me to implement the interactivity in my application.
Reactive expressions in Shiny are exceedingly powerful, and I only utilized a tiny portion of what they can do.

#### Input Validation

While allowing users to be able to subset the data is useful, it's possible that users may give invalid input.
For example, if the starting date in the desired subset is *after* the ending date, then there will be no rows in the desired subset, and `data.table` will throw an error.
The same applies if the user doesn't select any hours in the time selector.
Originally, the application would crash if that was the case.

In order to solve this, I added `if` statements to the `observeEvent()` call that check for a variety of these cases.
If it encounters invalid input, it throws up a dialog saying what the error was, then stops.
This is much more tolerant of failure than simply crashing.
The user is allowed to go back and change their input.
Another useful side effect is that nothing is re-rendered, saving some processing time.
Since `passengerData` isn't used if the input is invalid, it isn't updated either, so we save time there as well.

Note that the Shiny application also requires other input files, like the GeoJSON data to render the stations and tracks, the preparsed data CSV, and the icon used to indicate where a station is.
I consider validating these files unnecessary.
If any of these files are invalid, the Shiny application will have nothing to display, so having it handle this type of failure gracefully doesn't seem like it would be particularly useful.
R will also throw an error when attempting to build the application, so the user will know what went wrong.

## Results

The end result of my work is a working and robust heatmap for the BART data that allows users to subset the input data by date and hour.

### Source Code

I have made all source code for this application available on GitHub in my repository [`jlperona/bart-passenger-heatmap`](https://github.com/jlperona/bart-passenger-heatmap).

The code in that repository is licensed under the MIT License, which is the same license that my prior work [`jlperona/bart-hourly-dataset-parser`](https://github.com/jlperona/bart-hourly-dataset-parser) was licensed under.
I'm a strong believer in the open-source movement.
Making this code available to all under an [open source license](https://opensource.org/licenses) is important to me.
Since I was the only one working on this data and project in the class at the time, I felt comfortable making the code open source before I turned this report in.
That being said, I hope nobody else attempts to pass this work off as their own.

### Running the Application

There are multiple mechanisms by which a user can run the heatmap.
I have listed some of them below.

#### Clone and Run Locally

Users who wish to run the heatmap can download the code and run it locally on their own machine.
One mechanism by which they can do this is via cloning my Git repository using the following command:

```
git clone https://github.com/jlperona/bart-passenger-heatmap.git
```

Once a user has cloned my repository, they can then build the application in R.
This also allows users to make edits to the application, test them out, and potentially build their own similar application.

#### `shiny::runGitHub()`

Another mechanism to run the heatmap locally is provided by a function in the Shiny package.
If the application code is hosted on GitHub, Shiny makes it very easy to launch an application via the [`runGitHub()`](https://www.rdocumentation.org/packages/shiny/versions/0.9.0/topics/runGitHub) function.
It will download and launch Shiny applications that are hosted in a GitHub repository.
If a user wanted to run my application using this function, for instance, the following line of code would do so:

```{r echo=TRUE,eval=FALSE}
shiny::runGitHub("bart-passenger-heatmap", "jlperona")
```

The code above cannot be executed in a RMarkdown document and will give an error if attempted.
It is shown here for informative purposes.

#### *shinyapps.io*

Shiny apps are web apps and thus were meant to be hosted online.
I could have hosted this on a personal website, but I lack a server and the funds to maintain said server.
However, I found out that Shiny apps can be hosted on [*shinyapps.io*](https://www.shinyapps.io/), a website made by the creators of RStudio.

I have hosted my application on *shinyapps.io* at https://jlperona.shinyapps.io/bart-passenger-heatmap/.
This is how I demonstrated the application in my presentation.
However, *shinyapps.io* has tiered pricing depending on usage, limiting the number of hours an application can be run per month.
I am using the free tier, which means that I have a very limited amount of hours.
The link above may not work depending on whether I have exhausted my allotment for a given month.

#### Embedded in RMarkdown

Finally, it is possible to embed Shiny applications in RMarkdown.
You can either define the entire application inline using the [`shinyApp()`](https://shiny.rstudio.com/reference/shiny/latest/shinyApp.html) function, or include the application in another directory using the [`shinyAppDir()`](https://shiny.rstudio.com/reference/shiny/latest/shinyApp.html) function.
Since my application is split across multiple files and pulls from other data files, I would opt for the latter function.
In addition, you have to modify the YAML at the top of the document to tell RMarkdown that Shiny content is being included.
You can do so by adding the following line:

```
runtime: shiny
```

You will also need to install some packages to build this RMarkdown document, besides the normal ones typically used to build any RMarkdown document.
The following code will install the necessary packages:

```{r echo=TRUE,eval=FALSE}
install.packages(c("data.table",
                   "fasttime",
                   "leaflet",
                   "rgdal",
                   "shiny")
)
```

The code below is shown for informative purposes.
Unfortunately, you cannot export a RMarkdown with Shiny content embedded in it to one HTML file.
You would have to host the HTML file somewhere else.

```{r echo=TRUE,eval=FALSE}
# code courtesy of R Markdown: The Definitive Guide by Xie, Allaire, and Grolemund
# https://bookdown.org/yihui/rmarkdown/shiny-embedded.html
shinyAppDir("./bart-passenger-heatmap",
            options = list(width = "100%",
                           height = 900)
)
```

## Conclusion

I accomplished the objective I set out to achieve at the beginning of this project: a fully-functional Shiny application that shows a heatmap of the BART system using BART's own data.
In addition, I went slightly further and made the application more robust by validating user input.
On a more personal level, I was able to take an idea I've had in the back of my mind ever since I took MAE 253 and bring it to life.
I also got to learn how to use Shiny to build web apps.
I'm quite happy with the final result and my growth as an R programmer in this class.

### Methods for Improvement

The biggest flaw in this application right now is the UI layout.
I am not a UI or user experience (UX) expert, but I think that the UI can definitely be improved.
However, to my understanding, to do so in Shiny requires Cascading Style Sheets (CSS) experience, which I don't have.

Another "flaw" (which I would consider a difference in opinion) is the layout of the map.
I wanted to see the BART network overlaid on a real map.
However, this means that some of the edges are much larger than the others.
For example, the edges between the San Francisco stations are very difficult to see when the application is initially started.
You have to zoom in in order to see their colors.
A "better" heatmap would make it easier to see all the edges at once, which means that the stations would be evenly spaced.
Due to the setup of this application, it would be relatively easy to create a different heatmap by simply changing the station GeoJSON data.
Since it's open source, anybody could take this as a base and create that application, if they so desire.

### Future Work

There are a couple of ways that I can think of to improve this application at the moment.
All of these should be possible with my skill set.

* Instead of calculating the sum of the number of passengers on an edge, show other summary statistics, such as the mean or median.
* Other visualization types, such as a [chord diagram](https://en.wikipedia.org/wiki/Chord_diagram) that shows passengers' final destinations from a given source station.

I have another idea for future work, which isn't related to improving my application.
I will admit that it was somewhat difficult to get started with Shiny.
[This blog post on DataScience.com](https://www.datascience.com/blog/beginners-guide-to-shiny-and-leaflet-for-interactive-mapping) was the most helpful thing for getting me off the ground, but I was doing some things differently than their tutorial, which complicated matters.
That being said, I spent most of my time attempting to figure out a few more advanced pieces of Shiny, such as reactives and validating user input.
In contrast, I was able to get other things, like the preparser, done relatively quickly.
I do think that the Shiny documentation could be better for certain topics, like the ones I've mentioned above.
That is certainly an area for improvement.

## Acknowledgements

I'd like to acknowledge my teammates in MAE 253 who helped me with the previous work that guided this project:

* Baotuan Nguyen, formerly a Master's student in Computer Science at UC Davis. He's currently a software development engineer at Workday, Inc.
* Heidi Schweizer, formerly a PhD student in Agricultural and Resource Economics at UC Davis. She's currently an assistant professor at North Carolina State University.

I'd also like to acknowledge Professor Deb Niemeier, who let me attempt this project even though there was a good chance I'd fail (at least, from my perspective).
I've learned a lot about both R and Shiny through this project, and I also created something which I think is both interesting and useful thanks to her class.