For this lab, you are going to be combining the fare data you used in **Lab 1** with some trip data about those same exact taxi rides.

Here is the schema of the trip dataset, found in `trip_data_1_trimmed.csv`:

* `medallion`: The ID of the cab being operated
* `hack_license`: The ID of the person operating the cab
* `vendor_id`: The type of vendor operating the cab, can either be `CMT` or `VTS`, no clue what these two types mean
* `rate_code`: Designates the kind of ride this is, must be `1` through `6`, any other number is incorrect
* `store_and_fwd_flag`: Can be either `Y`,`N`, or Nan
* `pickup_datetime`: The time when the ride started
* `dropoff_datetime`: The time when the ride ended
* `passenger_count`: The number of passengers during the ride
* `trip_time_in_secs`: How long the trip took
* `trip_distance`: Distance of the trip, to the nearest 1/10 mile
* `pickup_longitude`: Longitude of pickup location
* `pickup_latitude`: Latitude of pickup location
* `dropoff_longitude`: Longitude of dropoff location
* `dropoff_latitude`: Latitude of dropoff location

And here is the schema for the fare dataset, as a refresher (you're going to be using it too :)), found in `fare_data_1_trimmed.csv`:

* `medallion`: The ID of the cab being operated
* `hack_license`: The ID of the person operating the cab
* `vendor_id`: The type of vendor operating the cab, can either be `CMT` or `VTS`, no clue what these two types mean
* `pickup_datetime`: The time when the ride started
* `payment_type`: How the trip was paid, `UNK` stands for unknown, I have no idea what `NOC` stands for, but lets assume its some known way to pay
* `fare_amount`: Base fare cost of the trip
* `surcharge`: Additional charges that are not tolls
* `mta_tax`: The mta has to get its cut, right? :)
* `tip_amount`: How generous the rider(s) decided to be
* `tolls_amount`: How much money you had to pay in tolls
* `total_amount`: How much the trip cost, all in

One final piece of information you will need, the approximate latitude/longitude bounds of a bounding box centered around each borough in NYC:

* Queens,40.800760,40.542920,-73.700272,-73.962616
* Manhattan,40.874663,40.701293,-73.910759,-74.018721
* Bronx,40.915255,40.785743,-73.765274,-73.933406
* Brooklyn,40.739877,40.57042,-73.864754,-74.04344
* Staten Island,40.651812,40.477399,-74.034547,-74.259090

So, as was the case with the last lab, you are going to be tasked with answering a whole slew of questions about these data, except the questions here should be significantly more challenging.

Before beginning, please:

1. Remove any rides that did not have an appropriate `rate_code` 
* Convert pickup and dropoff locations from latitude/longitude to (approximate) borough. This will be very challening. I suggest the following approach, although you can try others:
  1. Calculate the exact center of each borough's bounding box and store each in a new variable (you should have one of these per borough, so 5 variables)
  * Find the distance from the given ride's pickup/dropoff location and the center of each borough found in A. 
  * Pick the borough that had the smallest distance from the given location to its bounding box center. (this is messy because it ignores cases where the start/end location isn't in the 5 boroughs, but its the best we can do with the information you're provided)
* Once this is done, join the `trip_data` and `fare_data` datasets together. You will need to join the datasets on more than one column, but you will have to figure out what those columns are!

Once you've gotten the preprocessing out of the way, answer the following questions:

1. What was the most common borough start location?
  * End location?
  * Pair of start/end locations? Excluding manhattan/manhattan?
* Which driver (`hack_license`) carried the most passengers, on average?
* Which driver had the highest tip percentage, on average, for those drivers that made at least 5 trips in the dataset?
* Was there any relationship between (correlation) when a taxi ride ended (get the closest minute within the day) and the tip percentage on the fare?
* Did the trip time correlate with the cost of the trip?
  * What about the tip amount?
* Which cab tended to generate the most revenue, on average, when picking people up in Manhattan?
* What was the average cost of all of the trips that originated in Brooklyn?
* Which driver made the most money overall? Assume the only money made was from tips.
  * Which driver made the most money on average?
  * Which driver made the most money in each borough?
* What was the average trip distance and trip cost for intra-borough (same starting/ending borough) trips?
  1. For inter-borough (different starting/ending borough) trips?
* Which borough had the cheapest tippers (had the smallest average tip percentage)? Assume that if a trip starts within some borough, then that trip belongs in that borough.
* Which driver logged the most miles in this dataset?
* What was the average toll amount for intra-borough rides? For inter-borough  rides?

In [4]:
import pandas as pd
import numpy as np
import math as math
from geopy.distance import great_circle

In [7]:
queensCenter = ((40.800760+40.542920)/2,(-73.700272-73.962616)/2)
brookCenter = ((40.739877+40.57042)/2,(-73.864754-74.04344)/2)
bronxCenter = ((40.915255+40.785743)/2,(-73.765274-73.933406)/2)
manhattanCenter = ((40.874663+40.701293)/2,(-73.910759-74.018721)/2)
siCenter = ((40.651812+40.477399)/2,(-74.034547-74.259090)/2)
boroughDict = {}
boroughDict["queens"] = queensCenter
boroughDict["brooklyn"] = brookCenter
boroughDict["bronx"] = bronxCenter
boroughDict["manhattan"] = manhattanCenter
boroughDict["staten"] = siCenter

In [8]:
def get_closest_borough(latitude,longitude,max_dist = 20):
    global boroughDict
    borough_distances = {borough:great_circle(boroughDict[borough],(latitude,longitude)).miles for borough in boroughDict}
    min_borough = min(borough_distances, key=borough_distances.get)
    if borough_distances[min_borough] < max_dist:
        return min_borough 
    else:
        return "outside_nyc"

In [9]:
tripData = pd.read_csv("/Users/sfogelson/code/flatiron_school/intro-datascience-workshop/weekend1/nycTaxiData/trip_data_1_trimmed.csv")

In [10]:
fareData = pd.read_csv("/Users/sfogelson/code/flatiron_school/intro-datascience-workshop/weekend1/nycTaxiData/trip_fare_1_trimmed.csv")

In [12]:
trip_and_fare_data = tripData.merge(fareData,on=["medallion","hack_license","vendor_id","pickup_datetime"])
del tripData
del fareData

In [14]:
#test_df = trip_and_fare_data.head()
test_df.columns

Index([u'medallion', u'hack_license', u'vendor_id', u'rate_code', u'store_and_fwd_flag', u'pickup_datetime', u'dropoff_datetime', u'passenger_count', u'trip_time_in_secs', u'trip_distance', u'pickup_longitude', u'pickup_latitude', u'dropoff_longitude', u'dropoff_latitude', u'payment_type', u'fare_amount', u'surcharge', u'mta_tax', u'tip_amount', u'tolls_amount', u'total_amount'], dtype='object')

In [23]:
trip_and_fare_data["borough_start"] = [get_closest_borough(lat,lon) for lat,lon in trip_and_fare_data[['pickup_latitude','pickup_longitude']].values]
trip_and_fare_data["borough_end"] = [get_closest_borough(lat,lon) for lat,lon in trip_and_fare_data[['dropoff_latitude','dropoff_longitude']].values]

In [28]:
trip_and_fare_data.borough_end.value_counts()

manhattan      4281466
brooklyn        509705
outside_nyc      93366
queens           73612
bronx            32389
staten            9463
dtype: int64

In [8]:
#2. Which driver (`hack_license`) carried the most passengers, on average?
passengers_per_driver = trip_and_fare_data.groupby("hack_license")["passenger_count"].mean()
largest_avg_passengers = passengers_per_driver[passengers_per_driver==passengers_per_driver.max()]
print "The driver that carried the most passengers on average:"
print largest_avg_passengers

In [16]:
#3. Which driver had the highest tip percentage, on average, for those drivers that made at least 5 trips in the dataset?
#this will make all columns have the same value, tip percentage, for each row, so if you pick one column, its
#values will be the same as any other column's
filteredDrivers = trip_and_fare_data.groupby("hack_license").filter(lambda x: x.size>=5).groupby("hack_license")[["tip_amount","total_amount"]].agg(lambda x: np.mean(x.tip_amount/x.total_amount))["tip_amount"]
print "The driver with the highest tip percentage was:"
print filteredDrivers[filteredDrivers==filteredDrivers.max()]

The driver with the highest tip percentage was:
hack_license
3E0D0714047240704CB51E0EB3A0101C    0.3125
Name: tip_amount, dtype: float64


In [25]:
#Was there any relationship between (correlation) when a taxi ride ended (get the closest minute within the day) 
#and the tip percentage on the fare?
#trip_and_fare_data["pickup_datetime"] = pd.to_datetime(trip_and_fare_data.pickup_datetime,unit='s')
#trip_and_fare_data["dropoff_datetime"] = pd.to_datetime(trip_and_fare_data.dropoff_datetime,unit='s')
trip_and_fare_data["tip_percentage"] = trip_and_fare_data.tip_amount/trip_and_fare_data.total_amount
trip_and_fare_data["minute_in_day"] = (trip_and_fare_data.dropoff_datetime.dt.hour*60)+trip_and_fare_data.dropoff_datetime.dt.minute
print "The correlation between tip percentage and minute in the day was:"
print trip_and_fare_data.minute_in_day.corr(trip_and_fare_data.tip_percentage)

The correlation between tip percentage and minute in the day was:
0.00806605983931


In [28]:
#Did the trip time correlate with the cost of the trip?
print "The correlation between trip time and trip cost was:"
print trip_and_fare_data.trip_time_in_secs.corr(trip_and_fare_data.total_amount)
  #What about the tip amount?
print "The correlation between trip time and tip amount was:"
print trip_and_fare_data.trip_time_in_secs.corr(trip_and_fare_data.tip_amount)

The correlation between trip time and trip cost was:
0.771897747749
The correlation between trip time and tip amount was:
0.499344485629


In [30]:
#Which driver made the most money overall? Assume the only money made was from tips.
hack_groups = trip_and_fare_data.groupby("hack_license")
total_made_in_tips = hack_groups["tip_amount"].sum()
print "The driver that made the most money in tips was:"
print total_made_in_tips[total_made_in_tips==total_made_in_tips.max()]
#  * Which driver made the most money on average?
avg_tips = hack_groups["tip_amount"].mean()
print "The driver that made the most money on average per trip:"
print avg_tips[avg_tips==avg_tips.max()]

The driver that made the most money in tips was:
hack_license
2BF7915E6DC6252344DA12975B2B3E06    1183.13
Name: tip_amount, dtype: float64
The driver that made the most money on average per trip:
hack_license
6C36C7C13C8B025DB8C66FA2E091D37F    14.25
Name: tip_amount, dtype: float64


In [31]:
# Which driver logged the most miles in this dataset?
total_miles = hack_groups["trip_distance"].sum()
print "The driver that logged the most miles was:"
print total_miles[total_miles==total_miles.max()]

The driver that logged the most miles was:
hack_license
1E94B13BB698BC3C98178429C45FDEED    1807.8
Name: trip_distance, dtype: float64
