# Digging Into the Pronto Data Release

In October, [Pronto CycleShare](https://www.prontocycleshare.com/) turned one year old and released a treasure trove of data on the roughly 140,000 individual trips taken during its first year.
Here I want to dig into this data and answer a few questions:

- Many naysayers insist that Seattle is too cold, too wet, and too hilly to be a world-class bicycling city. How do these elements actually affect users of the Pronto system?
- What is the difference in usage between annual members and short-term users? How might Pronto evolve to be more useful to these groups?
- How do Pronto trips compare to trips by other cyclists in the city? Can characteristics of Pronto use give us insight into deeper trends within the city?
- Can we cleverly de-anonymize the data and learn about the usage patterns of individual members?

If you're interested in how the plots and figures below were created, I have made available all the Python code I used to run this analysis. For details, see http://github.com/jakevdp/ProntoData.

## Dataset Overview

The Pronto dataset catalogs roughly 140,000 rides from October 2014 through October 2015, split among Annual Members and Short-Term Pass Holders.
Splitting the data by rider type and viewing the number of rides gives us a good idea of aggregate rider behavior:

![Alt text](files/figs/daily_trend.png)
<small>*[figure source](BasicAnalysis.ipynb#Trips-Over-the-Year)*</small>

The clearest feature here is a strong **weekly cycle** that differs between Annual and Short-Term users: there are more rides by annual users during the week, and more rides by short-term users on the weekend.

One notable feature above is the large spike in short-term ridership in mid-April: this is likely due to the [American Planning Association national conference](http://www.planetizen.com/node/75958/seattle-sets-bikeshare-record-apa-town) which was held in Seattle that week.
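For readers who want to reproduce daily counts like those plotted above, here is a minimal sketch of the split-and-count step using pandas. The tiny `trips` table and its column names (`starttime`, `usertype`) are stand-ins I'm assuming for illustration; the released files may name things differently.

```python
import pandas as pd

# Tiny stand-in for the trip table; column names and rows are assumptions.
trips = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-04-18 09:05", "2015-04-18 14:30", "2015-04-18 16:10",  # Saturday
        "2015-04-20 08:15", "2015-04-20 17:40", "2015-04-20 12:00",  # Monday
    ]),
    "usertype": [
        "Short-Term Pass Holder", "Short-Term Pass Holder", "Annual Member",
        "Annual Member", "Annual Member", "Short-Term Pass Holder",
    ],
})

# One row per calendar day, one column per rider type, values = ride counts.
daily = (
    trips.groupby([trips["starttime"].dt.date, "usertype"])
         .size()
         .unstack(fill_value=0)
)
print(daily)
```

With the full dataset, a table shaped like `daily` is all that is needed to draw one line per rider type.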

Another interesting view of the data is to look at the hourly usage throughout each day.
Here we show the ride count by hour of the day, split by user type and weekday/weekend:

![Alt text](files/figs/hourly_trend.png)
<small>*[figure source](BasicAnalysis.ipynb#Trips-Over-a-Day)*</small>

This displays two distinct patterns of use: a double-peaked "commute pattern" of Annual riders from Monday to Friday, and a broad single-peaked "recreational pattern" for the remaining categories.
We will see more detail on this below.
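The hourly breakdown can be sketched in the same spirit: tag each trip with its hour and with a weekday/weekend flag, then count. As before, the sample rows and the column names are assumptions made for illustration.

```python
import pandas as pd

# Stand-in trip table (hypothetical rows and column names).
trips = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-04-20 08:15",  # Monday
        "2015-04-21 08:40",  # Tuesday
        "2015-04-21 17:20",  # Tuesday
        "2015-04-18 11:30",  # Saturday
        "2015-04-18 13:05",  # Saturday
        "2015-04-19 13:45",  # Sunday
    ]),
    "usertype": [
        "Annual Member", "Annual Member", "Annual Member", "Annual Member",
        "Short-Term Pass Holder", "Short-Term Pass Holder",
    ],
})

trips["hour"] = trips["starttime"].dt.hour
# dayofweek runs from 0 (Monday) to 6 (Sunday).
trips["daytype"] = trips["starttime"].dt.dayofweek.map(
    lambda d: "weekday" if d < 5 else "weekend"
)

# Rows: hour of day. Columns: (user type, weekday/weekend). Values: ride counts.
hourly = (
    trips.groupby(["hour", "usertype", "daytype"])
         .size()
         .unstack(["usertype", "daytype"], fill_value=0)
)
print(hourly)
```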

## Speed and Distance

The dataset contains the duration of each trip, as well as the start and end points.
By querying the Google Maps API for bicycling directions between each pair of stations, we can estimate the distance ridden on each trip.
Because riders may not go directly from point A to point B, this length is a *lower bound* on the ride distance.
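To make the lower-bound idea concrete, here is a small sketch of how a minimum speed can be derived once a route length is known for each station pair. The `distances` table stands in for the results of the directions query; the station names, column names, and numbers are all made up for illustration.

```python
import pandas as pd

# Hypothetical route lengths between station pairs, in kilometers.
distances = pd.DataFrame({
    "from_station": ["A", "A", "B"],
    "to_station":   ["B", "A", "A"],
    "route_km":     [2.0, 0.0, 2.4],
})

# Hypothetical trips with durations in seconds.
trips = pd.DataFrame({
    "from_station": ["A", "A", "B"],
    "to_station":   ["B", "A", "A"],
    "duration_s":   [600, 1800, 1440],
})

# Attach the route length to each trip, then divide by the duration in hours.
trips = trips.merge(distances, on=["from_station", "to_station"], how="left")
trips["min_speed_kmh"] = trips["route_km"] / (trips["duration_s"] / 3600)
print(trips)
```

A trip that starts and ends at the same station has a route length of zero, so its estimated minimum speed is zero no matter how far the rider actually went.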

Here we compare the distribution of ride durations and speed estimates for annual and short-term users:

![Alt text](files/figs/duration_and_speed.png)
<small>*[figure source](BasicAnalysis.ipynb#Trip-Distances-&-Rider-Speed)*</small>

The left panel shows the distribution of trip durations.
For annual members, we see that the most common ride duration is around 5 minutes.
Annual members are also very savvy about the 30-minute free ride limit, with only a small number of their trips surpassing this.
Short-term users, on the other hand, either don't mind the extra cost, or don't understand the intended use of the system.
My hunch is that these short-term users aren't fully aware of this pricing structure ("I paid for the day, right?") and likely walk away unhappy with the experience.
If I worked for Pronto, I would do more to make sure day-pass users understand the pricing structure!

The right panel shows the distribution of (minimum) riding speed.
There is a spike at speed zero for both sets of users, indicating rides that start and stop at the same location.
This is much more prevalent for short-term users, and probably indicates visitors using bikes to explore a neighborhood rather than to get from point A to point B.
Beyond this, the distributions for annual and short-term users are quite different, with annual riders showing on average a higher lower-bound speed.
You might be tempted to conclude here that annual members ride faster than day-pass users, but the data alone aren't sufficient to support this conclusion.
These data could also be explained if annual users tend to go from point A to point B by the most direct route, while day pass users tend to meander around and get to their destination indirectly.
I suspect that the reality is some mix of these two effects.

We can see even more detail by plotting the speed and distance against each other:

![Alt text](files/figs/distance_vs_speed.png)
<small>*[figure source](BasicAnalysis.ipynb#Trip-Distances-&-Rider-Speed)*</small>

This re-emphasizes the strong tendency of annual members to keep their rides below 30 minutes.
The sharp cutoff in rides at or near the 30-minute limit suggests that these users plan their rides to not exceed that, and that there are many users who would take longer trips if time allowed.
Short-term use is less affected by the 30-minute cutoff, but as I suggested above I believe this is more due to a misunderstanding of usage policy than users being willing to fork over more cash.

## Seattle's Challenges: Elevation and Weather

One oft-mentioned concern with the feasibility of bike share in Seattle is that it is a very hilly, cold, and rainy city – before Pronto's launch, armchair analysts predicted that *nobody* would ride when the weather is bad, and even in good weather all rides would just be downhill!
This idea was usually brought up as an argument against the feasibility of the system within the city ("Sure, bikeshare works other places, but it can't work here: Seattle is *special*! We're just so *special*!").
Let's take a look at ride trends with elevation and weather to see if this prediction was realized.

Elevation data is not included in the data release, but again we can turn to the Google Maps API to get what we need.

The distribution of elevation changes over the year is shown below:

![Alt text](files/figs/elevation.png)
<small>*[figure source](BasicAnalysis.ipynb#Trend-with-Elevation)*</small>

This shows that particularly with the annual members, downhill rides outnumber uphill rides by nearly a factor of 2!
This is especially true for rides with an elevation change of greater than 50 meters or so (for reference, 50 meters is about the elevation difference between Capitol Hill/11th and Pine and downtown/2nd and Pine).
Of the 142,000 trips logged in Pronto's first year, there were about 80,000 total downhill trips and 50,000 total uphill trips, which means Pronto staff had to shuttle almost 100 bikes per day from low-lying stations to higher-elevation stations!
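As a back-of-the-envelope check on the rebalancing load, we can work through the rounded trip counts quoted above. This only uses the approximate figures from the text, so the result is a rough order of magnitude rather than a precise count.

```python
# Rounded counts quoted in the text (approximate, not exact tallies).
downhill_trips = 80_000
uphill_trips = 50_000
days_in_year = 365

# Every downhill trip not matched by an uphill trip leaves a bike
# stranded at a lower station until it is moved back up.
net_downhill = downhill_trips - uphill_trips
bikes_per_day = net_downhill / days_in_year

print(f"net downhill flow: {net_downhill} bikes per year")
print(f"roughly {bikes_per_day:.0f} bikes per day")
```

With these rounded inputs the net flow comes out to roughly 80 bikes per day, the same order of magnitude as the "almost 100" figure.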

Next let's take a look at the trends with weather. We will look at the effect of temperature and precipitation, separating weekdays from weekends and annual users from short-term users:

![Alt text](files/figs/temperature.png)
<small>*[figure source](BasicAnalysis.ipynb#Trend-with-Weather)*</small>

![Alt text](files/figs/precipitation.png)
<small>*[figure source](BasicAnalysis.ipynb#Trend-with-Weather)*</small>

The broad trends are exactly as one might expect in our climate: more people opt to ride their bicycles city-wide on warm, sunny days.
One interesting feature here is seen in both the precipitation and temperature plots: On Mondays through Fridays, the slope of the trend line is about equal for Annual and Short-term users. But on weekends, the annual members seem to be *less affected by weather* while short-term users are *more affected by weather*.
This suggests that the number of "opportunistic" riders (those who see a nice day and decide to go on a Pronto ride) is larger for annual members during the week, and larger for short-term users on the weekend.
Additionally, we see that on Monday to Friday, annual users essentially always outnumber short-term users, while on the weekends short-term users outnumber annual users *as long as the weather is good*.
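The "slope of the trend line" comparison can be illustrated with a simple least-squares fit. The numbers below are invented purely to show the mechanics (they are not taken from the Pronto data), and a straight-line fit via `np.polyfit` is my assumption about a reasonable way to summarize the trend.

```python
import numpy as np

# Invented weekend observations: daily high temperature and ride counts.
temps = np.array([45, 50, 55, 60, 65, 70, 75, 80], dtype=float)
annual_rides = np.array([100, 108, 121, 130, 142, 149, 161, 170], dtype=float)
short_term_rides = np.array([20, 45, 80, 110, 150, 175, 215, 240], dtype=float)

# Fit rides = slope * temperature + intercept for each group.
slope_annual, intercept_annual = np.polyfit(temps, annual_rides, deg=1)
slope_short, intercept_short = np.polyfit(temps, short_term_rides, deg=1)

print(f"annual:     {slope_annual:.2f} extra rides per degree")
print(f"short-term: {slope_short:.2f} extra rides per degree")
```

A steeper slope means ridership responds more strongly to a change in temperature, which is the sense in which one group is "more affected by weather" than another.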

How does all this bode for Seattle's cycle share? The trends are as expected (people tend to bike downhill on sunny days), and I suspect that for most readers, the extent to which these quantitative trends condemn Pronto depends mostly on how they felt before seeing them.

### Comparing with the Fremont Bridge

Another interesting question we can answer is how the number of Pronto rides relates to the number of *total* bicycle trips in Seattle.
The latter numbers are very difficult to pin down, but we do have a nice source of ridership data in the [Fremont Bridge Bike Counter](http://www.seattle.gov/transportation/bikecounter_fremont.htm), which has been logging bicycle trips across the Fremont Bridge for the past three years.
The ratio of daily Pronto trips to daily Fremont Bridge trips is shown below:

![Alt text](files/figs/compare_with_fremont.png)
<small>*[figure source](CompareWithFremont.ipynb)*</small>

We see that the ratio for annual members hovers around 10% throughout the year: each day, for every annual-member Pronto trip, there are about ten bicycle trips across the Fremont Bridge. This ratio is remarkably stable over the year, though it appears to be slightly lower during the summer months.
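Computing such a ratio is mostly a matter of lining up two daily series by date. Here is a minimal sketch with invented counts; the variable names are hypothetical, and the real counter data would first need to be summed to daily totals.

```python
import pandas as pd

# Invented daily totals; note that the Pronto series is missing June 3.
pronto_annual = pd.Series(
    [310, 120, 330],
    index=pd.to_datetime(["2015-06-01", "2015-06-02", "2015-06-04"]),
)
fremont = pd.Series(
    [3100, 1200, 2900, 3300],
    index=pd.to_datetime(
        ["2015-06-01", "2015-06-02", "2015-06-03", "2015-06-04"]
    ),
)

# Division aligns on the date index; days missing from either series
# become NaN and are dropped.
ratio = (pronto_annual / fremont).dropna()
print(ratio)
```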

## Data Summary

The above views of the data paint an interesting picture regarding the use of Pronto, and I see several main takeaway points:

- Annual Members and Day Pass users show markedly different behavior in aggregate: annual members seem to use Pronto mostly for commuting from point A to point B on Monday-Friday, while short-term users use Pronto primarily on weekends to explore particular areas of town.
- While annual members seem savvy to the pricing structure, one out of four short-term-pass rides exceeds the half-hour limit and incurs an additional usage fee. For the sake of the customer, Pronto should probably make an effort to better inform short-term users of this pricing structure.
- Elevation and weather affect use just as you would expect: there are nearly twice as many downhill trips as uphill trips, and cold and rain significantly decrease the number of rides on a given day. The effect of weather over the course of the year is comparable to that seen for riders crossing the Fremont Bridge.

With this basic understanding of the data, we can now go on to ask some more sophisticated questions.

## What Days do Pronto Users Work?

We have found above that there are distinct differences in the hourly ride counts between annual and short-term users, and between weekdays and weekends.
One way we can explore this deeper is to use [*Unsupervised Machine Learning*](https://en.wikipedia.org/wiki/Unsupervised_learning) approaches to try to discover structure in these hourly trends.
What I'm going to do here is a bit abstract, but bear with me: each day has 24 hours, and we can count the number of rides over the course of a day to get a 24-component vector describing the day.
In this way, you can view each day as *a single point in a 24-dimensional space*, and ask questions about how the resulting cluster of 365 points in 24 dimensions behaves.
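In code, the "one point per day" idea amounts to building a table with one row per day and 24 columns, one for each hour. The sketch below uses a handful of made-up start times and an assumed `starttime` column.

```python
import pandas as pd

# Stand-in trip table with hypothetical start times.
trips = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-04-20 08:15", "2015-04-20 08:50", "2015-04-20 17:30",
        "2015-04-25 13:10", "2015-04-25 14:45",
    ]),
})

date = trips["starttime"].dt.date.rename("date")
hour = trips["starttime"].dt.hour.rename("hour")

# Count rides per (day, hour), then spread the hours out into columns.
# Reindexing guarantees all 24 hours appear, even those with no rides.
X = (
    trips.groupby([date, hour])
         .size()
         .unstack(fill_value=0)
         .reindex(columns=range(24), fill_value=0)
)
print(X.shape)  # (number of days, 24)
```

Each row of `X` is then the 24-component vector describing one day.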

Now, humans are very good at visualizing two-dimensional or three-dimensional data: the plots above are mostly two-dimensional (plotting "x" values vs. "y" values), but as the dimension grows this gets more difficult.
To gain an intuition about high-dimensional data, scientists often use what are known as *dimensionality reduction* algorithms.
That is, we'd like to reduce the dimensions of the data from 24 to 2, while maintaining some semblance of the data structure.
A very common method for such things is [*Principal Component Analysis*](https://en.wikipedia.org/wiki/Principal_component_analysis), which is a way of automatically rotating and stretching high-dimensional data to create a suitable low-dimensional projection which preserves important relationships.
Applying such an analysis to the Pronto hourly data over the course of the year yields the following representation of the data, where we color the points by total daily rides:

![Alt text](files/figs/pca_raw.png)
<small>*[figure source](WorkHabits.ipynb#Principal-Component-Analysis)*</small>

What's notable here is that there are two distinct "types" of days within the data represented by the two oblong clusters, and that the more rides there are in a given day, the more the clusters diverge.
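For the curious, here is a sketch of what a two-component PCA projection involves. To keep it self-contained it uses synthetic days (a double-peaked shape and a single-peaked shape, each scaled by a random daily volume) and computes the projection directly with NumPy's SVD rather than a library PCA class, so it should be read as an illustration of the technique and not as the exact code behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24)

def bump(center, width):
    """Gaussian-shaped bump of ride activity centered on a given hour."""
    return np.exp(-0.5 * ((hours - center) / width) ** 2)

# Two synthetic daily shapes (assumptions, loosely mimicking the plots above).
commute = bump(8, 1.5) + bump(17, 1.5)
recreation = bump(14, 3.5)

# 40 synthetic days of each shape, scaled by a random daily volume plus noise.
shapes = np.vstack([np.tile(commute, (40, 1)), np.tile(recreation, (40, 1))])
volumes = rng.uniform(50, 300, size=80)
X = volumes[:, None] * shapes + rng.normal(0, 2, size=(80, 24))

# PCA: center the data, then project onto the top two right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T

# Fraction of the total variance captured by each component.
explained = S**2 / np.sum(S**2)
print(X2.shape, explained[:2].sum())
```

Because every synthetic day is a scaled copy of one of two shapes, the projected points fall along two elongated arms that spread apart as the daily volume grows, much like the two oblong clusters described above.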

To see what these clusters actually represent, we can use another unsupervised machine learning method, a cluster detection algorithm (specifically a [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model)) to identify the 24-dimensional groups of points and automatically assign cluster membership.
After doing this, we can plot the average hourly trends for each group:

![Alt text](files/figs/pca_clustering.png)
<small>*[figure source](WorkHabits.ipynb#Automated-Clustering)*</small>
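The clustering step can be sketched with scikit-learn's `GaussianMixture`. As with any illustration of this kind, the data here are synthetic (two made-up daily shapes scaled by a random volume), and the settings shown are my assumptions; the clustering behind the figure may be configured differently.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
hours = np.arange(24)

def bump(center, width):
    """Gaussian-shaped bump of ride activity centered on a given hour."""
    return np.exp(-0.5 * ((hours - center) / width) ** 2)

# Two synthetic daily shapes (assumptions for illustration only).
commute = bump(8, 1.5) + bump(17, 1.5)
recreation = bump(14, 3.5)

# 100 synthetic days of each type: 0 = commute-like, 1 = recreation-like.
n = 100
true_type = np.repeat([0, 1], n)
shapes = np.where(true_type[:, None] == 0, commute, recreation)
volumes = rng.uniform(100, 300, size=2 * n)
X = volumes[:, None] * shapes + rng.normal(0, 2, size=(2 * n, 24))

# Fit a two-component mixture in the full 24-dimensional space and
# assign each day to its most likely component.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

# Cluster numbering is arbitrary, so compare up to a swap of the labels.
agreement = max(np.mean(labels == true_type), np.mean(labels != true_type))

# Average hourly profile of each detected cluster.
profiles = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(f"agreement with the synthetic day types: {agreement:.2f}")
```

Averaging the rows assigned to each label, as `profiles` does, is what produces a mean hourly trend per cluster.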

We see that the pattern reflected in these two groups of points is exactly the commute/recreation split that we saw in the hourly data above.
But we've gone a bit further than before: what we have done here is to create a model whereby we can *classify* each day of the year as a "commute day" in red or a "recreation day" in purple.
For example, we can split between annual and short-term users to get a better idea of where they lie:

![Alt text](files/figs/pca_annual_vs_shortterm.png)
<small>*[figure source](WorkHabits.ipynb#Automated-Clustering)*</small>

The results match our intuition from exploring the data above: the red "commute" cluster is made up entirely of annual riders, while the purple "recreational" cluster is a mix of annual and short-term riders (with one lone short-term day straying into the commute cluster).
Our intuition is that "commute" patterns would happen from Monday to Friday, while "recreational" patterns would happen on the weekends.
This is largely borne out in the data, with the exception of just a few days.
Consider this plot, where we change the colors to distinguish between *Annual Member Weekdays* and everything else:

![Alt text](files/figs/pca_true_weekends.png)
<small>*[figure source](WorkHabits.ipynb#Automated-Clustering)*</small>

We see that there are several black points (Monday-Friday for Annual Members) which stray into the "wrong" cluster.
