Analyzing New York City Taxi Data: a MapReduce approach

Objective: Understand the taxi transportation dynamics of New York City (NYC) and how they have been affected by Uber, in order to support better-informed policy-making on mobility in NYC.

Using MapReduce to analyze Taxi and Uber data from NYC.


Task A: Analyzing taxi demand at big concerts

0. Getting Data


  • We obtained a dataset of 325 concerts in NYC, held between 2009 and 2015, using the Bandsintown API.
  • We manually verified the coordinates of several venues, because some entries carried default coordinates that did not match any known venue.
  • We used “” to get the information in CSV format and converted it to a JSON file.

Taxi rides

  • We downloaded two types of data: Uber rides and yellow cab rides.
  • We used scripts to download the monthly yellow cab files from January 2009 to December 2015. We also obtained monthly Uber ride files for April 2014 to September 2014.
  • The data was uploaded to an S3 bucket.

1. Counting taxi rides by event

  • We counted how many taxi rides occurred within a three-hour window starting at the beginning of each event (as reported by the API), at a distance of no more than 200 meters from the venue coordinates.
  • Processing a one-month file with 20 instances on AWS takes about 24 minutes (e.g. using python3 ~/…/ -r emr s3://…/yellow_tripdata_2013-03.csv).
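
The matching rule above (a three-hour window and a 200-meter radius) can be sketched in plain Python as follows. This is only an illustration, not the code in this repository: the function names, the tuple layouts, and the use of the haversine formula for distance are assumptions, and the README does not say whether the pickup or the drop-off point of a ride is compared against the venue.

```python
import math
from collections import Counter
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def ride_matches_event(ride_time, ride_lat, ride_lon,
                       event_start, venue_lat, venue_lon,
                       window=timedelta(hours=3), radius_m=200.0):
    """True if the ride falls inside the time window and the radius."""
    in_window = event_start <= ride_time <= event_start + window
    if not in_window:
        return False
    return haversine_m(ride_lat, ride_lon, venue_lat, venue_lon) <= radius_m


def count_rides_by_event(rides, events):
    """Count matching rides per event.

    rides:  iterable of (time, lat, lon) tuples (hypothetical layout)
    events: iterable of (event_id, start_time, lat, lon) tuples (hypothetical layout)
    """
    counts = Counter()
    for ride_time, ride_lat, ride_lon in rides:
        for event_id, start, venue_lat, venue_lon in events:
            if ride_matches_event(ride_time, ride_lat, ride_lon,
                                  start, venue_lat, venue_lon):
                counts[event_id] += 1
    return counts
```

In a MapReduce job the inner loop would typically live in the mapper, emitting (event_id, 1) for each match, with the reducer summing the values per event.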

2. Comparing taxi demand before and after Uber started operations in NYC

  • Results were analyzed separately in a spreadsheet. We divided total ride counts by the total capacity of the venues (in the case of Madison Square Garden) to compare demand before and after Uber began operating.

Task B: Destination likelihood

0. Getting Data


  • We obtained the coordinates of the polygon outlining Manhattan from here
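
A boundary polygon like this is what makes it possible to keep only trips that start and end within Manhattan. One common way to do that check is a ray-casting point-in-polygon test, sketched below. The function names are hypothetical, and the rectangle used in the usage example is a crude stand-in, not the real Manhattan polygon.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. polygon is a list of (lon, lat) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def trip_within_area(pickup, dropoff, polygon):
    """True if both (lon, lat) endpoints of a trip lie inside the polygon."""
    return (point_in_polygon(pickup[0], pickup[1], polygon)
            and point_in_polygon(dropoff[0], dropoff[1], polygon))


# Illustrative bounding rectangle only; NOT the actual Manhattan boundary.
EXAMPLE_AREA = [(-74.02, 40.70), (-73.93, 40.70), (-73.93, 40.88), (-74.02, 40.88)]
```

For example, `trip_within_area((-73.985, 40.758), (-73.97, 40.78), EXAMPLE_AREA)` keeps the trip, while a trip with an endpoint at longitude -73.80 is dropped.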

Taxi rides

  • We used the same information as the previous task.

1. Clustering with K-Means

  • For each year (2009-2015), we identify a set of cluster centroids (starting with K=10) for taxi pickup and drop-off locations in three time categories: weekday daytime, weekday nighttime, and weekends. We only consider trips that start and end within Manhattan.
  • K-means code via uchicago-cs/cmsc12300
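
For readers unfamiliar with K-means, the idea can be sketched as a single-machine version of Lloyd's algorithm. This is not the uchicago-cs/cmsc12300 code and not a MapReduce implementation. The function names, the random initialization, and the use of squared Euclidean distance directly on (lon, lat) pairs are assumptions made for illustration; over an area as small as Manhattan that distance is a rough but usable approximation.

```python
import random


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
                      + (point[1] - centroids[i][1]) ** 2,
    )


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of 2-D points; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as a start
    for _ in range(iterations):
        # Assignment step: accumulate coordinate sums and counts per cluster.
        sums = [[0.0, 0.0, 0] for _ in range(k)]
        for p in points:
            i = nearest_centroid(p, centroids)
            sums[i][0] += p[0]
            sums[i][1] += p[1]
            sums[i][2] += 1
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            (sx / n, sy / n) if n else centroids[i]
            for i, (sx, sy, n) in enumerate(sums)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids
```

In MapReduce terms, the assignment step maps each point to its nearest centroid and the update step reduces each cluster's points to their mean, with one job per iteration. `nearest_centroid` is also the function one would reuse to label each trip with its pickup and drop-off cluster.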

2. Trip probability

  • For each trip starting and ending in Manhattan, we determine which pickup cluster and which drop-off cluster it belongs to. We reduce on pickup locations and break the results down into 30-minute increments. We then calculate the probability (as a relative frequency) of going to any given drop-off cluster at that time from that starting region.
  • We look at how the probability of different destinations changes throughout the day from different starting points (e.g. “If I’m in Times Square at midnight, where am I likely to go?” versus “If I’m in Times Square at 7pm…?”) and how this differs on a weekend versus a weekday.
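
The relative-frequency calculation described above can be sketched as follows. The function names and the input tuple layout are hypothetical, and binning by the pickup time is an assumption. Each trip is assumed to be already labeled with its pickup and drop-off cluster.

```python
from collections import Counter, defaultdict


def half_hour_bin(hour, minute):
    """Map a time of day to one of 48 half-hour bins (0 = 00:00-00:29)."""
    return hour * 2 + (1 if minute >= 30 else 0)


def destination_probabilities(trips):
    """Relative frequency of each drop-off cluster per (pickup cluster, bin).

    trips: iterable of (pickup_cluster, hour, minute, dropoff_cluster),
           where hour and minute are assumed to come from the pickup time.
    """
    counts = defaultdict(Counter)
    for pickup, hour, minute, dropoff in trips:
        counts[(pickup, half_hour_bin(hour, minute))][dropoff] += 1

    probs = {}
    for key, dropoffs in counts.items():
        total = sum(dropoffs.values())
        probs[key] = {d: n / total for d, n in dropoffs.items()}
    return probs
```

Looking up `probs[(pickup_cluster, half_hour_bin(0, 0))]` and `probs[(pickup_cluster, half_hour_bin(19, 0))]` for the same pickup cluster gives the midnight and 7pm destination distributions to compare. Running the function separately on weekday and weekend trips gives the weekday versus weekend comparison.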

