How do the New York City and San Francisco taxi markets differ?
In 2014, I wanted to know how Uber was expanding the market for on-demand mobility. I suspected that in cities like New York the traditional taxi ecosystem was already well developed, so Uber would not be a huge change. But cities like San Francisco have sparser taxi service, so Uber might expand the market more there.
This project provides a baseline by identifying differences between the taxi markets in SF and NYC. To do that, I used timestamped GPS data on taxi pickups and dropoffs to classify different types of trips in each city.
Latent Dirichlet Allocation (LDA) is an unsupervised learning model that was originally developed to classify text documents into topics, as described by Blei et al. (2003). The central idea is that each document in a given collection contains a mixture of latent topics, and these underlying topics give rise to a predictable vocabulary. Thus the pattern of word frequency in the document collection can be used to infer the topics it contains.
In this application I used LDA to classify trips based on their spatial and temporal attributes. (Other examples here and here.)
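To make the text analogy concrete, here is a minimal sketch of one possible way to turn a trip into LDA "words". It is an illustration under assumptions, not necessarily the encoding used in the notebooks: the grid size `CELL_DEG`, the helper names `grid_cell` and `trip_to_tokens`, the token format, and the two example trips are all hypothetical.

```python
from datetime import datetime
import math

# Hypothetical grid size in degrees (an assumption, not the project's value).
CELL_DEG = 0.01


def grid_cell(lat, lon, cell_deg=CELL_DEG):
    """Snap a coordinate to a coarse grid cell and return it as a string label."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return f"{row}_{col}"


def trip_to_tokens(pickup_time, pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Turn one trip into a list of discrete 'words' describing when and where it happened."""
    day_type = "weekend" if pickup_time.weekday() >= 5 else "weekday"
    return [
        f"hour_{pickup_time.hour:02d}",
        f"day_{day_type}",
        f"pickup_{grid_cell(pickup_lat, pickup_lon)}",
        f"dropoff_{grid_cell(dropoff_lat, dropoff_lon)}",
    ]


# Two made-up trips, for illustration only.
trips = [
    (datetime(2013, 10, 5, 23, 15), 37.7749, -122.4194, 37.6213, -122.3790),
    (datetime(2013, 10, 7, 8, 5), 40.7580, -73.9855, 40.7061, -74.0087),
]

docs = [trip_to_tokens(*trip) for trip in trips]
for doc in docs:
    print(doc)
```

Token lists like these could then be handed to Gensim in the usual way: build a `corpora.Dictionary` from `docs`, convert each list with `doc2bow`, and fit a `models.LdaModel` on the resulting corpus. How coarse to make the time and space bins is a judgment call that directly shapes which trip types the model can separate.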
This analysis uses timestamped taxi GPS records from the SFMTA and the NY Taxi and Limousine Commission.
- San Francisco: complete trip records from one of the city’s larger taxi companies
- October 2012 and mid-July through October 2013 (>700,000 trips).
- New York: all trips in the city
- I chose to use only October 2013, more than 15 million trips. Of these I randomly sampled 10%.
- available here.
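The 10% sample of the New York data could be drawn in several ways. The sketch below shows one simple option, assuming rows are kept independently with a fixed probability; the function name `sample_rows` and the seed are hypothetical, and the notebooks may have sampled differently.

```python
import random


def sample_rows(rows, fraction=0.10, seed=0):
    """Keep each row independently with probability `fraction` (Bernoulli sampling)."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return [row for row in rows if rng.random() < fraction]


# Stand-in for trip records; real rows would come from the trip files.
rows = range(100_000)
sample = sample_rows(rows)
print(len(sample))
```

This approach works in a single pass over a large file and yields approximately, not exactly, 10% of the rows.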
The results suggest New Yorkers tend to use taxis in a broader range of cases, especially for commuting. Taxi usage in San Francisco is more specialized, with a greater focus on late night and airport trips.
NYC has more distinct trip types than does SF. E.g., it has two distinct types for morning and evening commute.
All the trip types together. Note the spikes in SF on Friday and Saturday nights.
This demonstrates a way to infer taxi trip purposes from GPS data alone, without need for surveys or reliance on potentially biased data sources.
The Jupyter notebooks contain the code I used for data preparation, model building, and some visualization, including an implementation of the Gensim LDA model. The code was written in Python 2. It is untested and not very clean. In the future I may find time to make it more reproducible.
See the project write-up for methodological details.