For this lab, you will be using the `trip_fare_1_trimmed.csv` file found in the `day1` folder that you should have received when you signed up for this course.

This dataset contains a very large number of distinct trips taken in cabs in the NYC area in 2013 (5 million of them, to be exact!).

The dataset contains the following information at the top of the file (this is called the header):

* `medallion`: The ID of the cab being operated
* `hack_license`: The ID of the person operating the cab
* `vendor_id`: The vendor associated with the cab, either `CMT` or `VTS` (most likely Creative Mobile Technologies and VeriFone Transportation Systems, the companies that supply the in-cab equipment)
* `pickup_datetime`: The time when the ride started
* `payment_type`: How the trip was paid for. `UNK` stands for unknown; `NOC` most likely stands for "no charge", but for this lab let's just treat it as a known payment type
* `fare_amount`: Base fare cost of the trip
* `surcharge`: Additional charges that are not tolls
* `mta_tax`: The MTA has to get its cut, right? :)
* `tip_amount`: How generous the rider(s) decided to be
* `tolls_amount`: How much money you had to pay in tolls
* `total_amount`: How much the trip cost, all in

Phew, that's a lot of info!

Here is the assignment:

* What was the most expensive/least expensive trip taken?
* Does the overall `total_amount` paid per ride correlate with `tip_amount` per ride?
* Does it correlate when you remove all rides with unknown `payment_type`?
* Calculate the average cost of a trip in this dataset given the following conditions:
  1. Across the whole dataset
  2. Across the whole dataset when the `payment_type` is known (not `UNK`)
  3. For each `payment_type`
  4. Which `payment_type` had the highest average cost?
  5. Which `payment_type` had the largest spread in how much people paid (largest standard deviation)?
  6. Which `payment_type` had the most generous people (had the highest average tip), including unknown payment types?
  7. What hour in the day were people most generous, on average, when they got into a cab?
  8. What hour of the day did people fluctuate the most in terms of tips? That is, do some hours lead to unpredictable tip amounts? 
* Which person (`hack_license`) made the most money:
  1. In total
  2. On a per-trip basis, given that they took at least 20 trips
* Does the number of trips a given cabbie takes (her/his experience) correlate with how well she/he is tipped? If so, in what direction?
* Does the number of times a given cab is used correlate with how well the person driving the cab is tipped? That is, are there "lucky" cabs?
* Which `vendor_id` had the higher average `surcharge` on a per-hour basis?
* Which hour in the day: 
  1. Did people most frequently take rides?
  2. Did people least frequently take rides?
  3. Had the largest number of unique cabs on the street?
  4. Had the fewest unique cabs on the street?
  5. What is the average number of cabs on the streets in NYC in each quarter of the day (at least in this dataset)?

Use the rest of this notebook to work through all these questions. 

I will be coming around to everyone to help/guide you in your data science quest!

If you can tackle all of these questions, then you've learned a lot already! 

If not, don't worry, this stuff is hard and I will gladly help/guide you through all of this.

But take charge of your learning!

This means:

* Ask a neighbor to help if you don't understand something. 
* If your neighbor can't help you, try using:
  * the interactive documentation I showed you how to use earlier
  * [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/index.html)
  * [google](http://www.google.com)
  * [stackoverflow](http://stackoverflow.com) to see if someone in the internet ether has had a similar problem before
  * if none of this works, then I will gladly help you
* This will accomplish at least two things:
  * It will get you to use online resources and take charge of your learning
  * It will get you to learn alternative approaches (those I did not show you today) to solving your problem

I've started the bare-bones script for you by:

* importing what I ~~think~~ know you will need
* loading the dataset into a variable called `fareData` that stores the data as a `DataFrame` object (you might need to change the path to where the file is located on your system)
* formatting the timestamp for you, because spending 30+ minutes figuring out how to do it is not the point of the assignment. This way, all of the functions in `fareData.pickup_datetime.dt` can immediately be used on the `pickup_datetime` column of your dataset.
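
As a rough sketch of what the timestamp formatting buys you, here is a toy example. The two timestamps are made up (they are not rows from the real file), and `toy` is just an illustrative name:

```python
import pandas as pd

# Toy stand-in for the real file: two made-up pickup timestamps stored as strings.
toy = pd.DataFrame({"pickup_datetime": ["2013-01-01 05:11:48", "2013-01-01 19:02:00"]})

# Parse the strings into datetime64 values, using the same format string as the notebook.
toy["pickup_datetime"] = pd.to_datetime(toy["pickup_datetime"], format="%Y-%m-%d %H:%M:%S")

# Once the column is datetime64, the .dt accessor exposes parts such as hour and dayofyear.
toy["hour"] = toy["pickup_datetime"].dt.hour
toy["dayofyear"] = toy["pickup_datetime"].dt.dayofyear
print(toy)
```

If the column were left as plain strings, `.dt` would raise an error, which is why the conversion comes first.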

The rest I leave to you. Happy hacking!

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the data (change this path to wherever the file lives on your system)
fareData = pd.read_csv("/Users/sergey.fogelson/code/flatiron_school/intro-datascience-workshop/day1/nycTaxiData/trip_fare_1_trimmed.csv")
# Parse the pickup timestamps so that everything under fareData.pickup_datetime.dt works
fareData.pickup_datetime = pd.to_datetime(fareData.pickup_datetime,format="%Y-%m-%d %H:%M:%S")
# Confirm that every column has the appropriate type (pickup_datetime should be datetime64);
# if it doesn't, something is wrong and we need to figure out what
fareData.dtypes

medallion                  object
hack_license               object
vendor_id                  object
pickup_datetime    datetime64[ns]
payment_type               object
fare_amount               float64
surcharge                 float64
mta_tax                   float64
tip_amount                float64
tolls_amount              float64
total_amount              float64
dtype: object

In [3]:
#1. What was the most expensive/least expensive trip taken?
mostExpensiveTrip = fareData[fareData.total_amount == fareData.total_amount.max()]
leastExpensiveTrip = fareData[fareData.total_amount == fareData.total_amount.min()]
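
The boolean-mask approach above keeps every row that ties for the extreme value. A possible alternative, sketched here on made-up data, is `idxmax`/`idxmin`, which return only the first matching row label. The `toy` frame and its values are assumptions for illustration:

```python
import pandas as pd

# Made-up trips; rows B and D tie for the cheapest fare.
toy = pd.DataFrame({
    "medallion": ["A", "B", "C", "D"],
    "total_amount": [12.5, 3.0, 58.25, 3.0],
})

# idxmax/idxmin return the index label of the FIRST extreme value only.
most_expensive = toy.loc[toy["total_amount"].idxmax()]
least_expensive = toy.loc[toy["total_amount"].idxmin()]

# The boolean mask keeps every row tied for the minimum (rows B and D here).
all_cheapest = toy[toy["total_amount"] == toy["total_amount"].min()]
print(all_cheapest)
```

Which one you want depends on whether ties matter for the question being asked.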

In [4]:
#2. Does the overall `total_amount` paid per ride correlate with `tip_amount` per ride?
totalToTipCorr = fareData.total_amount.corr(fareData.tip_amount)

In [5]:
#3. Does it correlate when you remove all rides with unknown `payment_type`?
fareDataNoUnk = fareData[fareData.payment_type!="UNK"]
totalToTipCorrNoUnk = fareDataNoUnk.total_amount.corr(fareDataNoUnk.tip_amount)
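
One thing to keep in mind when reading these correlations: if `total_amount` really is "all in", then it already contains `tip_amount`, which pushes the correlation up by construction. The sketch below uses made-up numbers and assumes, for simplicity, that the total is just fare plus tip (the real data has more components):

```python
import pandas as pd

# Made-up fares and tips, purely for illustration.
toy = pd.DataFrame({
    "fare_amount": [5.0, 10.0, 20.0, 40.0],
    "tip_amount": [0.0, 2.0, 1.0, 8.0],
})
# Simplifying assumption for this sketch: total = fare + tip.
toy["total_amount"] = toy["fare_amount"] + toy["tip_amount"]

# Series.corr computes the Pearson correlation by default.
corr_total = toy["total_amount"].corr(toy["tip_amount"])
corr_fare = toy["fare_amount"].corr(toy["tip_amount"])
print(corr_total, corr_fare)
```

Comparing tips against `fare_amount` instead of `total_amount` is one way to check how much of the relationship comes from the tip being counted twice.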

In [6]:
#4. Calculate the average cost of a trip in this dataset given the following conditions:
  #1. Across the whole dataset:
avgTripCost = fareData.total_amount.mean()
  #2. Across the whole dataset when the `payment_type` is known (not `UNK`):
avgTripCostNoUnk = fareDataNoUnk.total_amount.mean()
  #3. For each `payment_type`
avgTripCostPerPaymentType = fareData.groupby("payment_type")["total_amount"].mean()
  #4. Which `payment_type` had the highest average cost?
print("The payment type with the highest average cost was:")
print(avgTripCostPerPaymentType[avgTripCostPerPaymentType==avgTripCostPerPaymentType.max()])
  #5. Which `payment_type` had the largest spread in how much people paid (largest standard deviation)?
stdTripCostPerPaymentType = fareData.groupby("payment_type")["total_amount"].std()
print("The largest spread in payment type was:")
print(stdTripCostPerPaymentType[stdTripCostPerPaymentType==stdTripCostPerPaymentType.max()])
  #6. Which `payment_type` had the most generous people (had the highest average tip), including unknown payment types?
avgTipPerPaymentType = fareData.groupby("payment_type")["tip_amount"].mean()
print("The most generous payment type was:")
print(avgTipPerPaymentType[avgTipPerPaymentType==avgTipPerPaymentType.max()])
  #7. What hour in the day were people most generous, on average, when they got into a cab?
fareData["hour"] = fareData.pickup_datetime.dt.hour  # hour of pickup, 0-23
avgTipPerHour = fareData.groupby("hour")["tip_amount"].mean()
print("The hour with the most generous tippers was:")
print(avgTipPerHour[avgTipPerHour==avgTipPerHour.max()])
  #8. What hour of the day did people fluctuate the most in terms of tips? That is, do some hours lead to unpredictable tip amounts?
stdTipPerHour = fareData.groupby("hour")["tip_amount"].std()
print("The hour with the most unpredictable tippers was:")
print(stdTipPerHour[stdTipPerHour==stdTipPerHour.max()])

The payment type with the highest average cost was:
payment_type
UNK             21.172428
Name: total_amount, dtype: float64
The largest spread in payment type was:
payment_type
UNK             19.028116
Name: total_amount, dtype: float64
The most generous payment type was:
payment_type
UNK             3.262124
Name: tip_amount, dtype: float64
The hour with the most generous tippers was:
hour
5       2.458814
Name: tip_amount, dtype: float64
The hour with the most unpredictable tippers was:
hour
5       3.526782
Name: tip_amount, dtype: float64
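
The "group, aggregate, pick the largest" pattern used throughout this cell can also be written more compactly. Here is a minimal sketch on made-up data, computing the mean and standard deviation in one `agg` call and using `idxmax` to get the winning label. The `toy` frame and its numbers are assumptions, not values from the real dataset:

```python
import pandas as pd

# Made-up trips for two payment types.
toy = pd.DataFrame({
    "payment_type": ["NOC", "NOC", "UNK", "UNK"],
    "total_amount": [5.0, 25.0, 18.0, 22.0],
})

# One pass over the groups yields both statistics as columns.
summary = toy.groupby("payment_type")["total_amount"].agg(["mean", "std"])

# idxmax returns the group label with the largest value in each column.
top_mean_type = summary["mean"].idxmax()
top_std_type = summary["std"].idxmax()
print(summary)
```

Note that `idxmax` returns just the label (first one on ties), whereas the boolean mask in the cell above returns the label together with its value.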


In [11]:
#* Which person (`hack_license`) made the most money:
  #1. In total
totalAmount = fareData.groupby("hack_license")["total_amount"].sum()
print("The driver that made the most money was:")
print(totalAmount[totalAmount==totalAmount.max()])
  #2. On a per-trip basis, given that they took at least 20 trips
# Keep only the rows belonging to drivers with at least 20 trips
minNumTripsData = fareData.groupby("hack_license").filter(lambda x: x.shape[0] >= 20)
print("The person that made the most money, given they made at least 20 trips was:")
# NOTE: .sum() gives total earnings among those drivers; use .mean() here for a true per-trip figure
totalAmountFiltered = minNumTripsData.groupby("hack_license")["total_amount"].sum()
print(totalAmountFiltered[totalAmountFiltered==totalAmountFiltered.max()])

The driver that made the most money was:
hack_license
CFCD208495D565EF66E7DFF9F98764DA    10255.17
Name: total_amount, dtype: float64
The person that made the most money, given they made at least 20 trips was:
hack_license
CFCD208495D565EF66E7DFF9F98764DA    10255.17
Name: total_amount, dtype: float64
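
For the "per-trip basis" part of the question, one possible approach is to compute each driver's trip count and mean fare together, then filter on the count. This is only a sketch on made-up data; `MIN_TRIPS` is a hypothetical name, and it is set to 2 here (the assignment asks for 20) so that the toy data has qualifying drivers:

```python
import pandas as pd

# Made-up trips for three drivers.
toy = pd.DataFrame({
    "hack_license": ["d1", "d1", "d1", "d2", "d2", "d3"],
    "total_amount": [10.0, 12.0, 14.0, 30.0, 50.0, 100.0],
})

MIN_TRIPS = 2  # hypothetical threshold for the toy data; the assignment uses 20

# Trip count and average earnings per trip for each driver.
per_driver = toy.groupby("hack_license")["total_amount"].agg(["count", "mean"])

# Drop drivers with too few trips, then pick the highest per-trip average.
eligible = per_driver[per_driver["count"] >= MIN_TRIPS]
best_driver = eligible["mean"].idxmax()
print(eligible)
```

Driver `d3` has the highest single fare but only one trip, so the threshold excludes them.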

In [None]:
#* Does the number of trips a given cabbie takes (her/his experience) correlate with how well she/he is tipped? If so, in what direction?
print("Here is the correlation matrix of tips to number of trips per driver:")
# Per driver: number of trips ("size") and average tip ("mean")
hackGroupsSummary = fareData.groupby("hack_license")["tip_amount"].agg(["size", "mean"])
print(hackGroupsSummary.corr())

Here is the correlation matrix of tips to number of trips per driver:
          size      mean
size  1.000000 -0.251093
mean -0.251093  1.000000

In [None]:
#* Does the number of times a given cab is used correlate with how well the person driving the cab is tipped? That is, are there "lucky" cabs?
# Per cab: number of trips ("size") and average tip ("mean")
taxiGroupsSummary = fareData.groupby("medallion")["tip_amount"].agg(["size", "mean"])
print("Here is the correlation matrix of tips to number of trips per car:")
print(taxiGroupsSummary.corr())

Here is the correlation matrix of tips to number of trips per car:
          size      mean
size  1.000000 -0.225064
mean -0.225064  1.000000

vendor_id       CMT       VTS
hour                         
0          0.488799  0.490631
1          0.491713  0.492930
2          0.493461  0.494573
3          0.491010  0.492072
4          0.477923  0.484048
5          0.419006  0.457251
6          0.005230  0.000000
7          0.000843  0.000000
8          0.000978  0.000000
9          0.000870  0.000000
10         0.000778  0.000000
11         0.000882 

In [15]:
#* Which `vendor_id` had the highest average `surcharge` on a per-hour basis?
# Mean surcharge for every (vendor, hour) pair, with hours spread out as columns
perVendorMeans = fareData.groupby(["vendor_id","hour"])["surcharge"].mean().unstack(level=1)
# Average the hourly means for each vendor
perVendorMeans = perVendorMeans.mean(axis=1)
print("The vendor with the highest average surcharge per-hour is:")
print(perVendorMeans[perVendorMeans==perVendorMeans.max()])

The vendor with the highest average surcharge per-hour is:
vendor_id
CMT          0.317149
dtype: float64
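
Averaging the hourly means, as the cell above does, is not the same thing as averaging over all trips, because busy hours and quiet hours get equal weight. The sketch below shows the difference on made-up data; the surcharge values are invented for illustration:

```python
import pandas as pd

# Made-up trips: CMT has two trips at hour 0 and one at hour 6.
toy = pd.DataFrame({
    "vendor_id": ["CMT", "CMT", "CMT", "VTS", "VTS", "VTS"],
    "hour": [0, 0, 6, 0, 6, 6],
    "surcharge": [0.5, 0.5, 0.0, 0.5, 0.0, 1.0],
})

# Rows are vendors, columns are hours, cells are the mean surcharge.
hourly = toy.groupby(["vendor_id", "hour"])["surcharge"].mean().unstack(level="hour")

# Every hour counts equally here, regardless of how many trips it had.
mean_of_hourly_means = hourly.mean(axis=1)

# Every trip counts equally here.
pooled_mean = toy.groupby("vendor_id")["surcharge"].mean()
print(hourly)
```

For `CMT` the two numbers differ (0.25 versus roughly 0.33) because its hour-0 trips outnumber its hour-6 trips. The "per-hour basis" wording in the question is what motivates the first version.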


In [17]:
#* Which hour in the day: 
#  1. Did people most frequently take rides?
hourGroups = fareData.groupby("hour")
hourGroupsSizes = hourGroups.size()
print("The hour with the most rides is:")
print(hourGroupsSizes[hourGroupsSizes==hourGroupsSizes.max()])
#  2. Did people least frequently take rides?
print("The hour with the fewest rides is:")
print(hourGroupsSizes[hourGroupsSizes==hourGroupsSizes.min()])
#  3. Had the largest number of unique cabs on the street?
# Keep one row per (cab, hour) pair so each cab is counted once per hour
uniqueCabsHour = fareData.drop_duplicates(["medallion","hour"])
uniqueCabsPerHour = uniqueCabsHour.groupby("hour").size()
print("The hour with the most unique cabs on the street:")
print(uniqueCabsPerHour[uniqueCabsPerHour==uniqueCabsPerHour.max()])
#  4. Had the fewest unique cabs on the street?
print("The hour with the fewest unique cabs on the street:")
print(uniqueCabsPerHour[uniqueCabsPerHour==uniqueCabsPerHour.min()])
del uniqueCabsHour
#  5. What is the average number of cabs on the streets in NYC in each quarter of the day (at least in this dataset)?
# Bin the hours into four quarters: 0-5, 6-11, 12-17, 18-23
fareData["quarterDay"] = pd.cut(fareData.hour,[-1,5,11,17,np.inf])
earliestDay = fareData.pickup_datetime.dt.dayofyear.min()
lastDay = fareData.pickup_datetime.dt.dayofyear.max()
# NOTE: lastDay-earliestDay is one less than the number of calendar days covered
totalQuarters = (lastDay-earliestDay)*4
# NOTE: cabs are deduplicated across the whole dataset (not per day) and then divided by the
# number of quarter-days, so treat the result as a rough figure rather than a true per-day average
uniqueCabsQuarterDay = fareData.drop_duplicates(["medallion","quarterDay"])
print("The number of unique cabs per quarter of the day in this dataset is:")
print(uniqueCabsQuarterDay.groupby("quarterDay").size()/totalQuarters)

The hour with the most rides is:
hour
19      321171
dtype: int64
The hour with the fewest rides is:
hour
5       51066
dtype: int64
The hour with the most unique cabs on the street:
hour
18      13164
dtype: int64
The hour with the fewest unique cabs on the street:
hour
5       9184
dtype: int64
quarterDay
(-1, 5]       131.156250
(5, 11]       135.520833
(11, 17]      138.916667
(17, inf]     137.864583
dtype: float64
