A latent Dirichlet allocation (LDA) model for taxi trips

Question

How do the New York City and San Francisco taxi markets differ?

Motivation

In 2014, I wanted to know how Uber was expanding the market for on-demand mobility. I suspected that in cities like New York the traditional taxi ecosystem was already well developed, so Uber would not be a huge change. But cities like San Francisco have sparser taxi service, so Uber may be more market-expanding there.

This project provides a baseline by identifying differences between the taxi markets in SF and NYC. To do that, I used timestamped GPS data on taxi pickups and dropoffs to classify different types of trips in each city.

The Model

LDA is an unsupervised learning model that was originally developed to classify text documents into topics, as described by Blei et al. (2003). The central idea is that each document in a given collection contains a mixture of latent topics, and each topic gives rise to a predictable vocabulary. Thus the pattern of word frequencies across the document collection can be used to infer the topics it contains.
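To make the generative idea concrete, here is a minimal sketch of how LDA assumes a single document is produced: draw a topic mixture, then draw each word from one of the topics. It uses only the Python standard library. The vocabulary, the two topics, and the parameter values are hypothetical and chosen purely for illustration; they are not taken from this project's data or results.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocabulary, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Choose a topic for this word position according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Choose a word from that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)

# Hypothetical vocabulary and topics, for illustration only.
vocabulary = ["airport", "late_night", "commute_am", "commute_pm"]
topic_word_probs = [
    [0.7, 0.3, 0.0, 0.0],  # a "leisure/airport"-like topic
    [0.0, 0.0, 0.5, 0.5],  # a "commute"-like topic
]

theta, words = generate_document(
    topic_word_probs, vocabulary, alpha=[0.5, 0.5], n_words=10, rng=rng
)
print(theta, words)
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.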

In this application I used LDA to classify trips based on their spatial and temporal attributes. (Other examples here and here.)
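One way to apply a text model to trips is to treat each trip's attributes as discrete "words". The sketch below shows one possible encoding, assuming day of week, hour of day, and a coarse latitude/longitude grid cell for pickup and dropoff. The function name `trip_to_tokens`, the token format, and the 0.01-degree cell size are assumptions made for illustration; the encoding actually used in this project may differ (see the write-up for the method).

```python
import math
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def grid_cell(lat, lon, cell_deg):
    """Map a coordinate to a coarse grid cell label (hypothetical scheme)."""
    return f"{math.floor(lat / cell_deg)}_{math.floor(lon / cell_deg)}"


def trip_to_tokens(pickup_time, pickup_lat, pickup_lon,
                   dropoff_lat, dropoff_lon, cell_deg=0.01):
    """Encode one trip as a list of discrete tokens.

    The token vocabulary here is an assumption for illustration only.
    """
    return [
        f"dow_{DAY_NAMES[pickup_time.weekday()]}",   # day of week
        f"hour_{pickup_time.hour:02d}",              # hour of day
        f"pu_{grid_cell(pickup_lat, pickup_lon, cell_deg)}",    # pickup cell
        f"do_{grid_cell(dropoff_lat, dropoff_lon, cell_deg)}",  # dropoff cell
    ]


# A made-up late-night trip, not a record from the dataset.
tokens = trip_to_tokens(
    datetime(2013, 10, 4, 23, 30),
    pickup_lat=40.7549, pickup_lon=-73.9840,
    dropoff_lat=40.6895, dropoff_lon=-73.9815,
)
print(tokens)
```

With an encoding like this, a "document" could be a single trip or a group of trips, and the inferred topics would correspond to trip types that share times and places.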

The Data

This analysis uses timestamped taxi GPS records from the SFMTA and the NYC Taxi and Limousine Commission.

San Francisco: complete trip records from one of the city’s larger taxi companies
- October 2012 and mid-July through October 2013 (>700,000 trips).
New York: all trips in the city
- I used only October 2013 (more than 15 million trips), from which I randomly sampled 10%.
- Available here.
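A 10% random sample like the one described for New York can be drawn in a single streaming pass, without loading every record into memory. The sketch below keeps each row independently with probability 0.10, so the sample size is approximately, not exactly, 10%. The function name `sample_rows`, the seed, and the stand-in data are assumptions for illustration, not the notebook's actual code.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Yield each row independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    for row in rows:
        if rng.random() < fraction:
            yield row


# Stand-in for trip records; a real run would iterate over the trip file.
trips = range(100_000)
sample = list(sample_rows(trips, fraction=0.10))
print(len(sample))
```

If an exact sample size is needed, `random.sample` on an in-memory list is an alternative, at the cost of holding all rows at once.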

Results

The results suggest New Yorkers tend to use taxis in a broader range of cases, especially for commuting. Taxi usage in San Francisco is more specialized, with a greater focus on late night and airport trips.

NYC has more distinct trip types than SF. For example, it has separate types for the morning and evening commutes.

All the trip types together. Note the spikes in SF on Friday and Saturday nights.

This demonstrates a way to infer taxi trip purposes from GPS data alone, without the need for surveys or reliance on potentially biased data sources.

Project Status

The Jupyter notebooks contain the code I used for data preparation, model building, and some visualization, including an implementation of the Gensim LDA model. The code was written in Python 2. It is untested and not very clean. In the future I may find time to make it more reproducible.

See the project write-up for methodological details.

