## You will be using NYC taxi ride data. There are two files located in the `data/nycTaxiData/` folder: `trip_fare_500k.csv` and `trip_data_500k.csv`.

The answers will be posted 9/27 after class.

The first file is `trip_fare_500k.csv`, found in the `data/nycTaxiData/` folder.
This dataset contains a fairly large number of distinct trips taken in cabs in the NYC area in 2013 (500 thousand of them, to be exact!).

The dataset contains the following columns, named in the first line of the file (this is called the header):

* `medallion`: The ID of the cab being operated
* `hack_license`: The ID of the person operating the cab
* `vendor_id`: The type of vendor operating the cab, either `CMT` or `VTS` (no clue what these two types mean)
* `pickup_datetime`: The time when the ride started
* `payment_type`: How the trip was paid. `UNK` stands for unknown. I have no idea what `NOC` stands for, but let's assume it's some known way to pay
* `fare_amount`: Base fare cost of the trip
* `surcharge`: Additional charges that are not tolls
* `mta_tax`: The mta has to get its cut, right? :)
* `tip_amount`: How generous the rider(s) decided to be
* `tolls_amount`: How much money you had to pay in tolls
* `total_amount`: How much the trip cost.

Here are the columns of the trip dataset, found in `trip_data_500k.csv`:

* `medallion`: The ID of the cab being operated
* `hack_license`: The ID of the person operating the cab
* `vendor_id`: The type of vendor operating the cab, either `CMT` or `VTS` (no clue what these two types mean)
* `rate_code`: Designates the kind of ride this is, must be `1` through `6`, any other number is incorrect
* `store_and_fwd_flag`: Can be either `Y`, `N`, or `NaN`
* `pickup_datetime`: The time when the ride started
* `dropoff_datetime`: The time when the ride ended
* `passenger_count`: The number of passengers during the ride
* `trip_time_in_secs`: How long the trip took
* `trip_distance`: Distance of the trip, to the nearest 1/10 mile
* `pickup_longitude`: Longitude of pickup location
* `pickup_latitude`: Latitude of pickup location
* `dropoff_longitude`: Longitude of dropoff location
* `dropoff_latitude`: Latitude of dropoff location

First step - make your own copy of the notebook.

Use the rest of this notebook to work through all these questions. 

If you can tackle all of these questions, then you've learned a lot already! 

For tips and commands, see the pandas class notebooks or https://github.com/guipsamora/pandas_exercises.

If not, don't worry, this stuff is hard and T.J. and Ramesh will gladly help/guide you through all of this. Contact us through Slack with any questions.

But take charge of your learning! This means:

* Ask a classmate to help if you don't understand something.
* If your neighbor can't help you, try using:
  * [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/index.html)
  * [google](http://www.google.com)
  * [stackoverflow](http://stackoverflow.com) to see if someone in the internet ether has had a similar problem before
  * if none of this works, then I will gladly help you
* This will accomplish at least two things:
  * It will get you to use online resources and take charge of your learning
  * Get you to learn alternative approaches

I've started the bare-bones script for you by:

* importing what you will need.
* loading the two datasets into `DataFrame` objects (you might need to change the path to where the file is located on your system).
* formatting the timestamp for you, because spending 30+ minutes trying to figure out how to do it is not the point of the assignment. This way, all of the functions in `fareData.pickup_datetime.dt` can immediately be used on the `pickup_datetime` column of your dataset.

The rest I leave to you. Happy hacking!
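As a quick illustration of why the timestamp formatting matters, here is a minimal sketch on a tiny made-up frame (the `toy` DataFrame and its two timestamps are assumptions for illustration, not the real file). Once the column is `datetime64`, the `.dt` accessor exposes components such as the hour or the day of the week.

```python
import pandas as pd

# Toy stand-in for the fare file; the real data is read with pd.read_csv.
toy = pd.DataFrame({"pickup_datetime": ["2013-01-01 15:11:48", "2013-01-06 00:18:35"]})

# Parse the strings into datetime64 values using an explicit format.
toy["pickup_datetime"] = pd.to_datetime(toy["pickup_datetime"], format="%Y-%m-%d %H:%M:%S")

# The .dt accessor now works column-wise on the parsed timestamps.
hours = toy["pickup_datetime"].dt.hour
weekdays = toy["pickup_datetime"].dt.dayofweek  # Monday=0 ... Sunday=6

print(hours.tolist())
print(weekdays.tolist())
```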

In [1]:
from __future__ import print_function, unicode_literals, division
import pandas as pd
import numpy as np

## Let's start with the fare data:

In [2]:
fareData = pd.read_csv("../data/nycTaxiData/trip_fare_500k.csv")
fareData.pickup_datetime = pd.to_datetime(fareData.pickup_datetime,format="%Y-%m-%d %H:%M:%S")
fareData.dtypes #this is to confirm that the pickup_datetime column, as well as all of the other
# columns, are in the appropriate formats (pickup_datetime should be in datetime64 format)
# if it isn't something is wrong, and we need to figure what that is

medallion                  object
hack_license               object
vendor_id                  object
pickup_datetime    datetime64[ns]
payment_type               object
fare_amount               float64
surcharge                 float64
mta_tax                   float64
tip_amount                float64
tolls_amount              float64
total_amount              float64
dtype: object

<b>Are there any missing data (null-values)?</b>

In [3]:
print(fareData.shape)
print(fareData.isnull().sum())
print(fareData.isnull().any())
print(fareData.count())
print(fareData.size)
print(fareData.info())
print(fareData.describe())
# All of the output below indicates that there is no missing (null-valued) data

(500000, 11)
medallion          0
hack_license       0
vendor_id          0
pickup_datetime    0
payment_type       0
fare_amount        0
surcharge          0
mta_tax            0
tip_amount         0
tolls_amount       0
total_amount       0
dtype: int64
medallion          False
hack_license       False
vendor_id          False
pickup_datetime    False
payment_type       False
fare_amount        False
surcharge          False
mta_tax            False
tip_amount         False
tolls_amount       False
total_amount       False
dtype: bool
medallion          500000
hack_license       500000
vendor_id          500000
pickup_datetime    500000
payment_type       500000
fare_amount        500000
surcharge          500000
mta_tax            500000
tip_amount         500000
tolls_amount       500000
total_amount       500000
dtype: int64
5500000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 11 columns):
medallion          500000 non-null obj

<b>What was the most expensive/least expensive trip taken?</b>

In [4]:
print('The total amount of the most expensive trip taken was', fareData.total_amount.max(), 'dollars')
print('The total amount of the least expensive trip taken was', fareData.total_amount.min(), 'dollars')
# This is correct according to course instructors

The total amount of the most expensive trip taken was 460.5 dollars
The total amount of the least expensive trip taken was 2.5 dollars


<b>How does the overall `total_amount` paid per ride correlate with `tip_amount` per ride?</b>

In [5]:
# To correlate two columns in pandas we do the following, where we correlate total amount paid per ride with
# tip_amount per ride:
print(fareData.total_amount.corr(fareData.tip_amount))
# For the full correlation matrix among all numeric variables, we just call corr on 
# the whole DataFrame. pandas (in the version used here) computes correlations among the numeric columns only and 
# excludes all non-numeric columns from the resulting correlation matrix:
fareData.corr()
# All of this is correct according to course instructors.

0.67087530714


              fare_amount  surcharge    mta_tax  tip_amount  tolls_amount  total_amount
fare_amount           1.0  -0.064229  -0.256318    0.549345      0.614237      0.984733
surcharge       -0.064229        1.0    0.02395   -0.021421     -0.063364     -0.035148
mta_tax         -0.256318    0.02395        1.0   -0.134473     -0.260192     -0.258903
tip_amount       0.549345  -0.021421  -0.134473         1.0      0.413718      0.670875
tolls_amount     0.614237  -0.063364  -0.260192    0.413718           1.0      0.676342
total_amount     0.984733  -0.035148  -0.258903    0.670875      0.676342           1.0
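One caveat worth sketching: depending on the pandas version, calling `corr()` on a frame that still contains string columns may raise an error instead of silently dropping them. Selecting the numeric columns first should behave the same way across versions. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Small made-up frame with one string column and two numeric columns.
toy = pd.DataFrame({
    "payment_type": ["CSH", "CRD", "CRD", "CSH"],
    "fare_amount": [5.0, 10.0, 20.0, 8.0],
    "tip_amount": [0.0, 2.0, 4.0, 0.0],
})

# Keep only numeric columns, then compute the pairwise Pearson correlations.
numeric_corr = toy.select_dtypes(include="number").corr()
print(numeric_corr)
```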


<b>How do they correlate for only rides with cash `payment_type`?</b>

In [6]:
fareData_CSH_MASK = fareData.payment_type == 'CSH'
fareData_CSH=fareData[fareData_CSH_MASK]
fareData_CSH.total_amount.corr(fareData_CSH.tip_amount)
# We see that there is almost no correlation between total amount and tip amount per ride when cash is paid.
# This appears to be correct according to course instructors

0.0030334289521610526

In [7]:
# This is how the instructors did it:
fareData.payment_type.unique()
filteredDF = fareData[fareData.payment_type == 'CSH']
filteredDF.total_amount.corr(fareData.tip_amount)

0.0030334289521610526
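The two cells give the same number even though the second one correlates the filtered `total_amount` against the unfiltered `tip_amount`. That is because `Series.corr` aligns the two Series on their index before computing, so only the cash rows are actually compared. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "payment_type": ["CSH", "CRD", "CSH", "CRD", "CSH"],
    "total_amount": [7.0, 12.0, 10.5, 30.0, 6.0],
    "tip_amount": [0.0, 2.0, 1.0, 5.0, 0.5],
})
cash = toy[toy.payment_type == "CSH"]

# Filtered column against filtered column.
explicit = cash.total_amount.corr(cash.tip_amount)

# Filtered column against the unfiltered column: corr aligns on the index
# first, so only the rows present in both Series (the cash rows) are used.
aligned = cash.total_amount.corr(toy.tip_amount)

print(explicit, aligned)
```

Writing the filtered column on both sides is arguably easier to read, since it does not rely on the alignment step.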

<b>Calculate the average cost of a trip in this dataset given the following conditions:</b>
  1. Across the whole dataset
  2. Across the whole dataset when the `payment_type` is known (not `UNK`)
  3. For each `payment_type`. You can totally do this 1 by 1, but try to do this in a for loop.
  4. Which `payment_type` had the highest average cost?
  5. Which `payment_type` had the largest spread in how much people paid (largest standard deviation)?
  6. Which `payment_type` had the most generous people (had the highest average tip), including unknown payment types?
  7. What hour in the day were people most generous, on average, when they got into a cab?
  8. What hour of the day did people fluctuate the most in terms of tips? That is, do some hours lead to unpredictable tip amounts? 

In [8]:
#1 
print("The average cost of a trip across the whole dataset is:",fareData.total_amount.mean())
# This is correct according to instructors

The average cost of a trip across the whole dataset is: 14.17029444


In [9]:
#2
fareData_Known_MASK = fareData.payment_type !='UNK'
fareData_Known=fareData[fareData_Known_MASK]
# print(fareData_Known.total_amount)
print("Average cost of trip across whole data set when payment type is known:", fareData_Known.total_amount.mean())
# This is correct according to instructors. The instructors did this the following way:
print(fareData[fareData.payment_type != 'UNK'].total_amount.mean())

Average cost of trip across whole data set when payment type is known: 14.1630248047
14.1630248047


In [10]:
#3
print(fareData.payment_type.unique())

fareData_CSH_MASK = fareData.payment_type == 'CSH'
fareData_DIS_MASK = fareData.payment_type == 'DIS'
fareData_NOC_MASK = fareData.payment_type == 'NOC'
fareData_CRD_MASK = fareData.payment_type == 'CRD'
fareData_UNK_MASK = fareData.payment_type == 'UNK'

fareData_CSH=fareData[fareData_CSH_MASK]
# fareData_CSH = fareData[fareData.payment_type == 'CSH']  -- using this approach gives same results (sanity check)
fareData_DIS=fareData[fareData_DIS_MASK]
fareData_NOC=fareData[fareData_NOC_MASK]
fareData_CRD=fareData[fareData_CRD_MASK]
fareData_UNK=fareData[fareData_UNK_MASK]

print("Average cost of trip across whole data set when payment type is CSH:", fareData_CSH.total_amount.mean())
print("Average cost of trip across whole data set when payment type is DIS:", fareData_DIS.total_amount.mean())
print("Average cost of trip across whole data set when payment type is NOC:", fareData_NOC.total_amount.mean())
print("Average cost of trip across whole data set when payment type is CRD:", fareData_CRD.total_amount.mean())
print("Average cost of trip across whole data set when payment type is UNK:", fareData_UNK.total_amount.mean())
# payment_types = ['CSH', 'DIS', 'NOC', 'CRD', 'UNK']
# for types in payment_types:
    # print("Average cost of trip across whole data set when payment type is %s :" % types)
    # fareData[fareData.payment_type == types].mean()
# This is correct, but it would be best to use a for loop as the instructors did:
paytypes = fareData.payment_type.unique()
for paytype in paytypes:
    print("Average total fare when paid by " + paytype)
    print(fareData[fareData.payment_type == paytype].total_amount.mean())

['CSH' 'DIS' 'NOC' 'CRD' 'UNK']
Average cost of trip across whole data set when payment type is CSH: 11.6167465066
Average cost of trip across whole data set when payment type is DIS: 5.75
Average cost of trip across whole data set when payment type is NOC: 3.0
Average cost of trip across whole data set when payment type is CRD: 16.371266867
Average cost of trip across whole data set when payment type is UNK: 22.1166739606
Average total fare when paid by CSH
11.6167465066
Average total fare when paid by DIS
5.75
Average total fare when paid by NOC
3.0
Average total fare when paid by CRD
16.371266867
Average total fare when paid by UNK
22.1166739606
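Besides the one-by-one masks and the for loop, a `groupby` can compute the per-type averages in a single call. This is only a sketch on a made-up frame (`toy` and its values are assumptions), not the instructors' solution.

```python
import pandas as pd

toy = pd.DataFrame({
    "payment_type": ["CSH", "CRD", "CSH", "CRD", "UNK"],
    "total_amount": [7.0, 12.0, 11.0, 20.0, 22.0],
})

# One mean per payment type, indexed by payment_type.
mean_by_type = toy.groupby("payment_type")["total_amount"].mean()
print(mean_by_type)

# idxmax returns the index label (here, the payment type) of the largest mean.
print(mean_by_type.idxmax())
```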


In [11]:
#4
# According to the results of #3, the payment type with the highest average cost was 'UNK'.

In [12]:
#5
print("Standard deviation of total payment amount when payment type is CSH:", fareData_CSH.total_amount.std())
print("Standard deviation of total payment amount when payment type is DIS:", fareData_DIS.total_amount.std())
print("Standard deviation of total payment amount when payment type is NOC:", fareData_NOC.total_amount.std())
print("Standard deviation of total payment amount when payment type is CRD:", fareData_CRD.total_amount.std())
print("Standard deviation of total payment amount when payment type is UNK:", fareData_UNK.total_amount.std())
# We see that the largest spread in how much people paid was for the unknown ('UNK') payment method.
# This is correct, but would probably be best to use a for loop as the instructors did:
paytypes = fareData.payment_type.unique()
for paytype in paytypes:
    print("STD of fare when paid by " + paytype)
    print(fareData[fareData.payment_type == paytype].total_amount.std())

Standard deviation of total payment amount when payment type is CSH: 9.5683194441
Standard deviation of total payment amount when payment type is DIS: 2.47487373415
Standard deviation of total payment amount when payment type is NOC: nan
Standard deviation of total payment amount when payment type is CRD: 13.7538825481
Standard deviation of total payment amount when payment type is UNK: 21.0492771964
STD of fare when paid by CSH
9.5683194441
STD of fare when paid by DIS
2.47487373415
STD of fare when paid by NOC
nan
STD of fare when paid by CRD
13.7538825481
STD of fare when paid by UNK
21.0492771964
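The `nan` for `NOC` is worth a note. pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a single value, so a `nan` here is consistent with `NOC` having only one ride in this sample (that count is not checked in the notebook, so treat it as an assumption). A minimal sketch:

```python
import pandas as pd

single = pd.Series([3.0])

# Sample standard deviation divides by n - 1, which is 0 for one value.
sample_std = single.std()

# Population standard deviation (ddof=0) divides by n and is defined.
population_std = single.std(ddof=0)

print(sample_std, population_std)
```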


In [13]:
#6: Which payment_type had the most generous people (had the highest average tip), including unknown payment types?
print("The average tip for the CSH payment type: ", fareData_CSH.tip_amount.mean())
print("The average tip for the DIS payment type: ", fareData_DIS.tip_amount.mean())
print("The average tip for the NOC payment type: ", fareData_NOC.tip_amount.mean())
print("The average tip for the CRD payment type: ", fareData_CRD.tip_amount.mean())
print("The average tip for the UNK payment type: ", fareData_UNK.tip_amount.mean())
# So the payment_type with the most generous people is the unknown ('UNK') payment type.
print("So the payment_type with the most generous people is the unknown ('UNK') payment type.")
# This is correct, but it would again probably be best to use a for loop as instructors did:
for paytype in paytypes:
    print("Average tip amount when paid by " + paytype)
    print(fareData[fareData.payment_type == paytype].tip_amount.mean())

The average tip for the CSH payment type:  7.43515253916e-05
The average tip for the DIS payment type:  0.0
The average tip for the NOC payment type:  0.0
The average tip for the CRD payment type:  2.42857431205
The average tip for the UNK payment type:  3.43008752735
So the payment_type with the most generous people is the unknown ('UNK') payment type.
Average tip amount when paid by CSH
7.43515253916e-05
Average tip amount when paid by DIS
0.0
Average tip amount when paid by NOC
0.0
Average tip amount when paid by CRD
2.42857431205
Average tip amount when paid by UNK
3.43008752735


In [14]:
#7: What hour in the day were people most generous, on average, when they got into a cab?
fareData.pickup_datetime.dt.hour
fareData["hour"]=fareData.pickup_datetime.dt.hour
fareData.head(10)
hour = fareData.hour
fareData[hour==1].tip_amount.mean()
hourGroups = fareData.groupby("hour")
hourGroups
# print(hourGroups.describe())
# print(hourGroups.tip_amount.describe())
print(hourGroups.tip_amount.mean())

hourGroup_mean_tips = hourGroups.tip_amount.mean()
hourGroup_mean_tips[hourGroup_mean_tips==hourGroup_mean_tips.max()]
# Tells us that hour 5 (5:00 AM) is the hour of the day where people were most generous on average (highest mean tip).

hour
0     1.535690
1     1.379282
2     1.199969
3     1.254676
4     1.401401
5     1.942813
6     1.328011
7     1.271893
8     1.329713
9     1.339304
10    1.243790
11    1.194171
12    1.184760
13    1.202658
14    1.298043
15    1.283852
16    1.284983
17    1.296261
18    1.254164
19    1.266640
20    1.369344
21    1.454277
22    1.406743
23    1.495352
Name: tip_amount, dtype: float64


hour
5    1.942813
Name: tip_amount, dtype: float64

In [15]:
#7 was correct, but the instructors used a different approach as follows:
fareData['hour'] = fareData.pickup_datetime.dt.hour
meanByHourDF = fareData[['hour', 'tip_amount']].groupby('hour').mean().reset_index()
print(meanByHourDF[meanByHourDF.tip_amount == meanByHourDF.tip_amount.max()])
fareData.head()

   hour  tip_amount
5     5    1.942813


                          medallion                      hack_license vendor_id     pickup_datetime payment_type  fare_amount  surcharge  mta_tax  tip_amount  tolls_amount  total_amount  hour
0  89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170       CMT 2013-01-01 15:11:48          CSH          6.5        0.0      0.5         0.0           0.0           7.0    15
1  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472       CMT 2013-01-06 00:18:35          CSH          6.0        0.5      0.5         0.0           0.0           7.0     0
2  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472       CMT 2013-01-05 18:49:41          CSH          5.5        1.0      0.5         0.0           0.0           7.0    18
3  DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310       CMT 2013-01-07 23:54:15          CSH          5.0        0.5      0.5         0.0           0.0           6.0    23
4  DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310       CMT 2013-01-07 23:25:03          CSH          9.5        0.5      0.5         0.0           0.0          10.5    23


In [16]:
#8: What hour of day did people fluctuate most in terms of tips? That is, do some hours lead to unpredictable tip 
# amounts?
print(hourGroups.tip_amount.std())
hourGroup_std_tips = hourGroups.tip_amount.std()
hourGroup_std_tips[hourGroup_std_tips==hourGroup_std_tips.max()]
# Tells us that hour 5 (5:00 AM) is the hour of the day where people fluctuate the most in terms of tips (highest 
# standard deviation of tip amounts).

hour
0     2.507986
1     2.319315
2     1.869682
3     2.085183
4     2.659419
5     3.275576
6     2.395520
7     2.066681
8     1.908732
9     1.988527
10    1.913502
11    1.894122
12    1.954182
13    2.082766
14    2.370127
15    2.363344
16    2.446023
17    2.370470
18    1.886830
19    2.122856
20    1.950661
21    2.220345
22    2.037812
23    2.286110
Name: tip_amount, dtype: float64


hour
5    3.275576
Name: tip_amount, dtype: float64

In [17]:
#8 is correct, but instructors used a different approach as follows:
stdByHourDF = fareData[['hour', 'tip_amount']].groupby('hour').std().reset_index()
print(stdByHourDF[stdByHourDF.tip_amount == stdByHourDF.tip_amount.max()])

   hour  tip_amount
5     5    3.275576
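Both approaches above find the winning hour by comparing each value against the maximum. A shorter route, sketched here on made-up data (the `toy` frame is an assumption), is to aggregate the mean and standard deviation together and ask `idxmax` for the hour directly.

```python
import pandas as pd

toy = pd.DataFrame({
    "hour": [0, 0, 5, 5, 12, 12],
    "tip_amount": [1.0, 2.0, 0.0, 6.0, 1.0, 1.5],
})

# Mean and sample standard deviation of tips for each hour, side by side.
by_hour = toy.groupby("hour")["tip_amount"].agg(["mean", "std"])
print(by_hour)

# idxmax returns the hour label where each statistic is largest.
most_generous_hour = by_hour["mean"].idxmax()
most_variable_hour = by_hour["std"].idxmax()
print(most_generous_hour, most_variable_hour)
```

Note that `idxmax` returns only the first label if there is a tie, whereas the boolean-mask comparison used in the notebook returns all tied hours.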


<b>Which person (`hack_license`) made the most money:</b>
  1. In total
  2. On a per-trip basis, given that they took at least 20 trips

In [18]:
# Which person (hack_license) made the most money in tips in total?
license = fareData.hack_license
licenseGroup = fareData.groupby("hack_license")
licenseGroup
print(licenseGroup.tip_amount.sum())
licenseGroup_sum_tips = licenseGroup.tip_amount.sum()
licenseGroup_sum_tips[licenseGroup_sum_tips==licenseGroup_sum_tips.max()]
# Tells us that the person (hack_license) that made the most money in tip_amounts was BD0913D639AA03DA954EA97E2A3A1101
# ($225.34)

hack_license
0008B3E338CE8C3377E071A4D80D3694     39.05
000B8D660A329BBDBF888500E4BD8B98      2.00
00184958F5D5FD0A9EC0B115C5B55796     23.90
001C8AAB90AEE49F36FCAA7B4136C81A     59.32
0025133AD810DBE80D35FCA8BF0BCA1F     25.51
002C093A2CB9FD40C8C54AB5D158FC47     52.37
002FE84F0EA642650A650C2BE875DDD3     35.00
0031E634F79DA0E6B01239A8017F5928     22.60
00374328FBA75FBFCA7522671250F573      3.50
003A4E7151035D313A6778DE13A2F326      2.80
003C68DFE1EBE120556D011948C78829      1.00
00447A6197DBB329FBF764139ACA6EC4     63.03
0046F1E91AA13DEDE4F6EE775C6293AB     84.33
00567B1CBFD51DDFAC73359B09238922     44.17
0057CCB5BA8D29E343B3D6D275AB22D3     18.95
006114F940CB87B3ABDCE9BF6DF6FCC4    110.35
006313464EC98A24BB4EBC1E2419E439     45.60
006B6BD90C7B5C98599FBF541D056513     42.96
006FAD57CE21BB431C86C2845150765E      0.00
00711D0CC3FB5BC905BB62D9B62296D6     75.45
007357E7FFE212879B9B85C7F4681AE5     73.47
0074114ECFC6F04AA153AE4DE748888D     29.30
007439EEDB510EF8277C567BD200B08F     34.2

hack_license
BD0913D639AA03DA954EA97E2A3A1101    225.34
Name: tip_amount, dtype: float64

In [19]:
# Which person (hack_license) made the most money overall (total_amount) in total?
print(licenseGroup.total_amount.sum())
licenseGroup_sum_total = licenseGroup.total_amount.sum()
licenseGroup_sum_total[licenseGroup_sum_total==licenseGroup_sum_total.max()]
# Tells us that the person (hack_license) that made the most money in total_amount was CFCD208495D565EF66E7DFF9F98764DA    
# ($2517.28)

hack_license
0008B3E338CE8C3377E071A4D80D3694     439.85
000B8D660A329BBDBF888500E4BD8B98      22.00
00184958F5D5FD0A9EC0B115C5B55796     238.90
001C8AAB90AEE49F36FCAA7B4136C81A     596.72
0025133AD810DBE80D35FCA8BF0BCA1F     302.51
002C093A2CB9FD40C8C54AB5D158FC47     632.27
002FE84F0EA642650A650C2BE875DDD3     420.00
0031E634F79DA0E6B01239A8017F5928     203.10
00374328FBA75FBFCA7522671250F573     103.00
003A4E7151035D313A6778DE13A2F326      16.80
003C68DFE1EBE120556D011948C78829       7.00
00447A6197DBB329FBF764139ACA6EC4     540.53
0046F1E91AA13DEDE4F6EE775C6293AB     753.13
00567B1CBFD51DDFAC73359B09238922     559.77
0057CCB5BA8D29E343B3D6D275AB22D3     206.25
006114F940CB87B3ABDCE9BF6DF6FCC4    1131.65
006313464EC98A24BB4EBC1E2419E439     492.60
006B6BD90C7B5C98599FBF541D056513     585.46
006FAD57CE21BB431C86C2845150765E      14.00
00711D0CC3FB5BC905BB62D9B62296D6     615.25
007357E7FFE212879B9B85C7F4681AE5    1017.92
0074114ECFC6F04AA153AE4DE748888D     289.80
007439EEDB510EF8277

hack_license
CFCD208495D565EF66E7DFF9F98764DA    2517.28
Name: total_amount, dtype: float64

In [20]:
# The above is correct, but the instructors' solution seems better:
totalPerDriver = fareData[['hack_license', 'total_amount']].groupby('hack_license').sum()
print(totalPerDriver[totalPerDriver.total_amount == totalPerDriver.total_amount.max()])
print(totalPerDriver)

                                  total_amount
hack_license                                  
CFCD208495D565EF66E7DFF9F98764DA       2517.28
                                  total_amount
hack_license                                  
0008B3E338CE8C3377E071A4D80D3694        439.85
000B8D660A329BBDBF888500E4BD8B98         22.00
00184958F5D5FD0A9EC0B115C5B55796        238.90
001C8AAB90AEE49F36FCAA7B4136C81A        596.72
0025133AD810DBE80D35FCA8BF0BCA1F        302.51
002C093A2CB9FD40C8C54AB5D158FC47        632.27
002FE84F0EA642650A650C2BE875DDD3        420.00
0031E634F79DA0E6B01239A8017F5928        203.10
00374328FBA75FBFCA7522671250F573        103.00
003A4E7151035D313A6778DE13A2F326         16.80
003C68DFE1EBE120556D011948C78829          7.00
00447A6197DBB329FBF764139ACA6EC4        540.53
0046F1E91AA13DEDE4F6EE775C6293AB        753.13
00567B1CBFD51DDFAC73359B09238922        559.77
0057CCB5BA8D29E343B3D6D275AB22D3        206.25
006114F940CB87B3ABDCE9BF6DF6FCC4       1131.65
006313464EC98

In [21]:
# Which person (hack_license) made the most money on a per-trip basis, given that they took at least 20 trips?
# This one is more challenging. Here is the instructors' solution. Note that, as written, it does not restrict the
# result to drivers with at least 20 trips:

totalPerDriver2 = fareData[['hack_license', 'total_amount']] \
    .groupby('hack_license') \
    .agg(['sum', 'count']) \
    .reset_index() 
print(totalPerDriver2)    
totalPerDriver2['dollarPerTrip'] = totalPerDriver2.total_amount['sum']/totalPerDriver2.total_amount['count']
print(totalPerDriver2[totalPerDriver2.dollarPerTrip == totalPerDriver2.dollarPerTrip.max()])






                           hack_license total_amount      
                                                 sum count
0      0008B3E338CE8C3377E071A4D80D3694       439.85    32
1      000B8D660A329BBDBF888500E4BD8B98        22.00     1
2      00184958F5D5FD0A9EC0B115C5B55796       238.90    20
3      001C8AAB90AEE49F36FCAA7B4136C81A       596.72    34
4      0025133AD810DBE80D35FCA8BF0BCA1F       302.51    23
5      002C093A2CB9FD40C8C54AB5D158FC47       632.27    42
6      002FE84F0EA642650A650C2BE875DDD3       420.00    37
7      0031E634F79DA0E6B01239A8017F5928       203.10    16
8      00374328FBA75FBFCA7522671250F573       103.00     7
9      003A4E7151035D313A6778DE13A2F326        16.80     1
10     003C68DFE1EBE120556D011948C78829         7.00     1
11     00447A6197DBB329FBF764139ACA6EC4       540.53    19
12     0046F1E91AA13DEDE4F6EE775C6293AB       753.13    47
13     00567B1CBFD51DDFAC73359B09238922       559.77    48
14     0057CCB5BA8D29E343B3D6D275AB22D3       206.25    
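The question asks for the best per-trip earner among drivers with at least 20 trips, so a filter on the trip count is needed before taking the maximum. The following is a sketch of one way to do that on made-up data; the `toy` frame, the `min_trips` variable, and the `dollars_per_trip` column name are all assumptions for illustration, and the threshold is lowered to 2 so the toy data stays small.

```python
import pandas as pd

toy = pd.DataFrame({
    "hack_license": ["A", "A", "A", "B", "C", "C"],
    "total_amount": [10.0, 12.0, 14.0, 100.0, 20.0, 30.0],
})

min_trips = 2  # the exercise itself asks for 20

# Total earnings and number of trips per driver.
per_driver = toy.groupby("hack_license")["total_amount"].agg(["sum", "count"])
per_driver["dollars_per_trip"] = per_driver["sum"] / per_driver["count"]

# Keep only drivers with enough trips, then pick the highest per-trip average.
eligible = per_driver[per_driver["count"] >= min_trips]
best_driver = eligible["dollars_per_trip"].idxmax()

print(eligible)
print(best_driver)
```

Driver `B` has the highest per-trip amount overall but only one trip, so the filter removes it. That is the kind of one-off outlier the minimum-trip condition is meant to exclude.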

<b>Does the number of trips a given cabbie takes (her/his experience) correlate with how well she/he is tipped? If so, in what direction?</b>

In [22]:
# Does the number of trips a given cabbie takes (her/his experience) correlate with how well she/he is tipped?
# If so, in what direction?
fareData.groupby("hack_license")["tip_amount"].agg([np.size, np.mean]).corr()
# Recall that np.size returns the number of elements in the frame or series.
# The output indicates that the experience of the cabbie has a somewhat small negative correlation with how well
# he/she is tipped.

           size       mean
size        1.0  -0.291082
mean  -0.291082        1.0
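The same per-driver statistics can be requested by name instead of passing NumPy functions, which newer pandas versions tend to prefer. This is a sketch on made-up data (the `toy` frame is an assumption). It uses `"count"`, which counts non-null values and therefore equals the group size when there are no nulls, as in this dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "hack_license": ["A", "A", "A", "B", "B", "C"],
    "tip_amount": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0],
})

# Number of trips and mean tip per driver, requested by aggregation name.
stats = toy.groupby("hack_license")["tip_amount"].agg(["count", "mean"])

# Pearson correlation between trip count and mean tip across drivers.
r = stats["count"].corr(stats["mean"])

print(stats)
print(r)
```

In this toy data, drivers with more trips have lower mean tips, so the correlation is perfectly negative. The real data shows the same direction, but much more weakly.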


In [23]:
# Just checking some things out here and experimenting a bit....
print(fareData.hack_license.unique().size)
print(fareData.hack_license.size)
print(fareData.hack_license.nunique())
print(fareData.hack_license.shape)
print(fareData.hack_license.unique().shape)
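print(licenseGroup.size)  # missing parentheses: this prints the bound method itself; licenseGroup.size() would return the group sizes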
fareData_license=fareData.hack_license.value_counts()
print(fareData_license)
fareData[fareData.hack_license=='3E8BC9829EE46234B580C2DA5ED69C0C']
# print("Filtered number of users:",filteredRatings.UserID.unique().size)

14969
500000
14969
(500000,)
(14969,)
<bound method DataFrameGroupBy.size of <pandas.core.groupby.DataFrameGroupBy object at 0x119931e50>>
533ED7E1E4C118A91AE9F55CE74AAE66    127
00B7691D86D96AEBD21DD9E138F90840    126
9112D33A328C37CF6E8A6B364F0C6109    124
508B0C200B7911E94E3D58151FADD644    121
12DF08C467CE0D44897DFB82171CBC63    116
C9674190984BA193FFD8DDCC019804CF    114
052C966F82A2B106DC6414C453AD7CE6    109
CB9AEE4760DB0551500EAA538E8FC108    109
97F7B431B057B98EAED0F323C4347B62    106
AB6F028ECDB62E44BE3ACEDE4E935ED6    105
7B3DAEAD0556C7DC4BB925B0A8BED5D7    104
86A0B5CA6B8A48DF19FD6518C81AAE23    104
84096122CEADF9ECB7D894CFBF1C28A6    103
F49FD0D84449AE7F72F3BC492CD6C754    102
50B70C0D50EC8A4D887B3FAB5C6D99A6    102
895F9C541258B1B18B8EDD4756A845F0    102
14A72874665882E85B4FA12024820B60    101
91927349DF07550D9D15BDF632DD297A    101
99CA4DC32BAE022D8129833D033A4618    101
47D7E01679A8EFCF3CF025F675B79590    101
69996930170E51265187F2D360A2366D    100
B8E7F13D7811C680D1FDC

                               medallion                      hack_license vendor_id     pickup_datetime payment_type  fare_amount  surcharge  mta_tax  tip_amount  tolls_amount  total_amount  hour
480172  6094BBC8B7678D1870F1851735773972  3E8BC9829EE46234B580C2DA5ED69C0C       CMT 2013-01-20 07:17:57          CRD         52.0        0.0      0.5       30.69           4.8         87.99     7


<b>Does the number of times a given cab is used correlate with how well the person driving the cab is tipped? That is, are there "lucky" cabs?</b>

In [24]:
# Does the number of times a given cab is used correlate with how well the person driving the cab is tipped? That is, 
# are there "lucky" cabs?
# Recall that 'medallion' is ID of the cab being operated
print(fareData.groupby("medallion")["tip_amount"].agg([np.size, np.mean]).head())
fareData.groupby("medallion")["tip_amount"].agg([np.size, np.mean]).corr()

# There is a weak-to-moderate negative correlation here (about -0.35): cabs with more rides tend to have
# slightly lower average tips.


# DataFrameGroupBy.agg(arg, *args, **kwargs)
# Aggregate using input function or dict of {column -> function}

# Parameters:	
# arg : function or dict
# Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to 
# DataFrame.apply. If passed a dict, the keys must be DataFrame column names.
# Accepted Combinations are: 
# * string cythonized function name 
# * function
# * list of functions
# * dict of columns -> functions
# * nested dict of names -> dicts of functions
# Returns:	
# aggregated : DataFrame

# Notes: Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the 
# function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., 
# np.mean(arr_2d)).

                                   size      mean
medallion                                        
000318C2E3E6381580E5C99910A60668  129.0  0.986047
002B4CFC5B8920A87065FC131F9732D1   73.0  1.194110
002E3B405B6ABEA23B6305D3766140F1   88.0  1.447386
0030AD2648D81EE87796445DB61FCF20   40.0  1.943750
0035520A854E4F2769B37DAF5357426F   69.0  0.975217


          size      mean
size       1.0  -0.34822
mean  -0.34822       1.0


<b>Which `vendor_id` had the higher average `surcharge` on a per-hour basis?</b>


In [27]:
# Which vendor_id had the higher average surcharge on a per-hour basis?
# Recall that 'vendor_id' is the type of vendor operating the cab, can either be CMT or VTS
perVendorMeans = fareData.groupby(["vendor_id","hour"])["surcharge"].mean().unstack(level=1)
perVendorMeans = perVendorMeans.mean(axis=1)
print(perVendorMeans.head())
print("------------")
print(perVendorMeans[perVendorMeans==perVendorMeans.max()])

# DataFrame.unstack(level=-1, fill_value=None)
# Pivot a level of the (necessarily hierarchical) index labels, returning a DataFrame having a new level of column 
# labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output 
# will be a Series (the analogue of stack when the columns are not a MultiIndex). The level involved will 
# automatically get sorted.

# Parameters:	
# level : int, string, or list of these, default -1 (last level)
# Level(s) of index to unstack, can pass level name
# fill_value : replace NaN with this value if the unstack produces missing values
# Returns:	
# unstacked : DataFrame or Series

vendor_id
CMT    0.283565
VTS    0.293235
dtype: float64
------------
vendor_id
VTS    0.293235
dtype: float64
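The cell above averages the hourly means, which weights every hour equally regardless of how many rides it had. That is not the same as a vendor's overall mean surcharge. A small sketch on made-up data (the `toy` frame is an assumption) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    "vendor_id": ["CMT", "CMT", "CMT", "VTS", "VTS", "VTS"],
    "hour": [0, 0, 1, 0, 1, 1],
    "surcharge": [0.5, 0.5, 0.0, 0.5, 1.0, 0.0],
})

# Rows are vendors, columns are hours, values are the mean surcharge.
hourly = toy.groupby(["vendor_id", "hour"])["surcharge"].mean().unstack(level="hour")

# Average across the hour columns: each hour counts once per vendor.
mean_of_hourly_means = hourly.mean(axis=1)

# Plain per-vendor mean: each ride counts once.
overall_mean = toy.groupby("vendor_id")["surcharge"].mean()

print(hourly)
print(mean_of_hourly_means)
print(overall_mean)
```

For `CMT` the two numbers differ because hour 0 has twice as many rides as hour 1. The "per-hour basis" wording of the question suggests the first of the two.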


<b>Which hour in the day: </b>
  1. Did people most frequently take rides?
  2. Did people least frequently take rides?
  3. Had the largest number of unique cabs on the street?
  4. Had the least number of cabs in the street?
  5. What is the average number of cabs on the streets in NYC in each quarter of the day (at least in this dataset?)?

In [35]:
#1. Which hour of the day did people most frequently take rides?
print("----------- 1: ------------")
hourGroupsSizes = fareData.groupby("hour").size()
print("The hour with the most rides is:")
print(hourGroupsSizes[hourGroupsSizes==hourGroupsSizes.max()])
# 2. Which hour of the day did people least frequently take rides?
print("----------- 2: ------------")
hourGroupsSizes = fareData.groupby("hour").size()
print("The hour with the least rides is:")
print(hourGroupsSizes[hourGroupsSizes==hourGroupsSizes.min()])
#3. Which hour in the day had the largest number of unique cabs on the street?
print("----------- 3: ------------")
uniqueCabsHour = fareData.drop_duplicates(["medallion","hour"])
uniqueCabsPerHour = uniqueCabsHour.groupby("hour").size()
print("The hour with the most unique cabs on the street:")
print(uniqueCabsPerHour[uniqueCabsPerHour==uniqueCabsPerHour.max()])

# Signature: fareData.drop_duplicates(**kwargs)
# Docstring:
# Return DataFrame with duplicate rows removed, optionally only
# considering certain columns
# Parameters
# ----------
# subset : column label or sequence of labels, optional
#    Only consider certain columns for identifying duplicates, by
#    default use all of the columns
#
# Returns
# -------
# deduplicated : DataFrame

# 4. Which hour of the day had the least number of cabs in the street?  
print("----------- 4: ------------")
print("The hour with the least number of (unique) cabs on the street:")
print(uniqueCabsPerHour[uniqueCabsPerHour==uniqueCabsPerHour.min()])

#5. What is the average number of cabs on the streets in NYC in each quarter of the day (at least in this dataset)? 
print("----------- 5: ------------")
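# Bin the hours into four quarters of the day: (-1, 5], (5, 11], (11, 17], (17, inf]
fareData["quarterDay"] = pd.cut(fareData.hour,[-1,5,11,17,np.inf])
earliestDay = fareData.pickup_datetime.dt.dayofyear.min()
lastDay = fareData.pickup_datetime.dt.dayofyear.max()
# Note: lastDay-earliestDay is the gap between the first and last day,
# which is one less than the inclusive day count
totalQuarters = (lastDay-earliestDay)*4
# Each cab is counted once per quarter over the whole date range, so dividing by
# totalQuarters is a rough normalization rather than a true per-day average
uniqueCabsQuarterDay = fareData.drop_duplicates(["medallion","quarterDay"])
print("The number of unique cabs per quarter of the day in this dataset is:")
print(uniqueCabsQuarterDay.groupby("quarterDay").size()/totalQuarters)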
#
# Signature: pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
# Docstring:
# Return indices of half-open bins to which each value of `x` belongs.
#
# Parameters
# ----------
# x : array-like
#    Input array to be binned. It has to be 1-dimensional.

----------- 1: ------------
The hour with the most rides is:
hour
12    33564
dtype: int64
----------- 2: ------------
The hour with the least rides is:
hour
4    3962
dtype: int64
----------- 3: ------------
The hour with the most unique cabs on the street:
hour
14    6201
dtype: int64
----------- 4: ------------
The hour with the least number of (unique) cabs on the street:
hour
5    1935
dtype: int64
----------- 5: ------------
The number of unique cabs per quarter of the day in this dataset is:
quarterDay
(-1, 5]      73.460526
(5, 11]      86.302632
(11, 17]     92.434211
(17, inf]    91.552632
dtype: float64
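
The five answers above can also be reached with `idxmax`/`idxmin` and `nunique`, and question 5 can be read as "count the distinct cabs in each quarter of each day, then average over days". Here is a minimal sketch of that reading on a made-up six-row frame. The `toy` data, the `quarter` and `day` columns, and the `hour // 6` binning (which gives the same four bins as the `pd.cut` call) are all assumptions for illustration, not part of the original solution.

```python
import pandas as pd

# Made-up rides (hypothetical values, not taken from trip_fare_500k.csv).
toy = pd.DataFrame({
    "medallion": ["A", "B", "A", "A", "C", "B"],
    "pickup_datetime": pd.to_datetime([
        "2013-01-01 02:10:00",
        "2013-01-01 02:40:00",
        "2013-01-01 13:00:00",
        "2013-01-02 02:20:00",
        "2013-01-02 14:00:00",
        "2013-01-02 14:30:00",
    ]),
})
toy["hour"] = toy["pickup_datetime"].dt.hour

# Questions 1 and 2: idxmax/idxmin return the hour label itself.
rides_per_hour = toy.groupby("hour").size()
busiest_hour = rides_per_hour.idxmax()
quietest_hour = rides_per_hour.idxmin()

# Questions 3 and 4: distinct cabs per hour, same counts as drop_duplicates + size.
cabs_per_hour = toy.groupby("hour")["medallion"].nunique()

# Question 5: distinct cabs per (day, quarter), then the mean over days.
toy["quarter"] = toy["hour"] // 6            # 0: 0-5, 1: 6-11, 2: 12-17, 3: 18-23
toy["day"] = toy["pickup_datetime"].dt.normalize()
cabs_per_day_quarter = toy.groupby(["day", "quarter"])["medallion"].nunique()
avg_cabs_per_quarter = cabs_per_day_quarter.groupby(level="quarter").mean()

print(busiest_hour, quietest_hour)
print(cabs_per_hour)
print(avg_cabs_per_quarter)
```

Two things to keep in mind: `idxmax` and `idxmin` return only the first label when several hours tie, whereas the boolean-mask approach above prints all of them, and a (day, quarter) pair with no rides at all simply does not appear in the groupby, so it does not pull the average down.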


<b>Read in the trip data file - `trip_data_500k.csv`. Join the trip data and fare data datasets together. You will need to join the datasets on more than one column, but you will have to figure out what those columns are!</b>

In [36]:
tripData = pd.read_csv("../data/nycTaxiData/trip_data_500k.csv")
tripData.pickup_datetime = pd.to_datetime(tripData.pickup_datetime,format="%Y-%m-%d %H:%M:%S")
tripData.dtypes 

medallion                     object
hack_license                  object
vendor_id                     object
rate_code                      int64
store_and_fwd_flag            object
pickup_datetime       datetime64[ns]
dropoff_datetime              object
passenger_count                int64
trip_time_in_secs              int64
trip_distance                float64
pickup_longitude             float64
pickup_latitude              float64
dropoff_longitude            float64
dropoff_latitude             float64
dtype: object
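
Note that `dropoff_datetime` is still `object` in the dtypes above, since only the pickup column was converted. If you want both as timestamps, one option is to let `read_csv` parse them. The sketch below uses an in-memory CSV with a cut-down column set. The timestamps and durations are copied from the first two rows of the merged output further down, while the `A`/`B` medallions and the `sample_*` names are placeholders.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for trip_data_500k.csv (only four of the columns).
sample_csv = io.StringIO(
    "medallion,pickup_datetime,dropoff_datetime,trip_time_in_secs\n"
    "A,2013-01-01 15:11:48,2013-01-01 15:18:10,382\n"
    "B,2013-01-06 00:18:35,2013-01-06 00:22:54,259\n"
)

# parse_dates converts both columns to datetime64 while reading.
sample_trips = pd.read_csv(
    sample_csv, parse_dates=["pickup_datetime", "dropoff_datetime"]
)

# With both columns parsed, the trip duration can be cross-checked.
elapsed_secs = (
    sample_trips["dropoff_datetime"] - sample_trips["pickup_datetime"]
).dt.total_seconds()

print(sample_trips.dtypes)
print(elapsed_secs.tolist())
```

Comparing `elapsed_secs` against `trip_time_in_secs` is a cheap sanity check on the real file as well.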

In [38]:
full_data = tripData.merge(fareData,on=["medallion","hack_license","vendor_id","pickup_datetime"])
full_data.head()
# Signature: tripData.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, 
# right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
# Docstring:
# Merge DataFrame objects by performing a database-style join operation by
# columns or indexes.

# If joining columns on columns, the DataFrame indexes *will be
# ignored*. Otherwise if joining indexes on indexes or indexes on a column or
# columns, the index will be passed on.

# Parameters
# ----------
# right : DataFrame
# how : {'left', 'right', 'outer', 'inner'}, default 'inner'
#     * left: use only keys from left frame (SQL: left outer join)
#     * right: use only keys from right frame (SQL: right outer join)
#     * outer: use union of keys from both frames (SQL: full outer join)
#     * inner: use intersection of keys from both frames (SQL: inner join)
# on : label or list
#     Field names to join on. Must be found in both DataFrames. If on is
#     None and not merging on indexes, then it merges on the intersection of
#     the columns by default.
# left_on : label or list, or array-like
#     Field names to join on in left DataFrame. Can be a vector or list of
#     vectors of the length of the DataFrame to use a particular vector as
#     the join key instead of columns
# right_on : label or list, or array-like
#     Field names to join on in right DataFrame or vector/list of vectors per
#     left_on docs
# left_index : boolean, default False
#     Use the index from the left DataFrame as the join key(s). If it is a
#     MultiIndex, the number of keys in the other DataFrame (either the index
#     or a number of columns) must match the number of levels
# right_index : boolean, default False
#     Use the index from the right DataFrame as the join key. Same caveats as
#     left_index
# sort : boolean, default False
#     Sort the join keys lexicographically in the result DataFrame
# suffixes : 2-length sequence (tuple, list, ...)
#     Suffix to apply to overlapping column names in the left and right
#     side, respectively
# copy : boolean, default True
#     If False, do not copy data unnecessarily
# indicator : boolean or string, default False
#     If True, adds a column to output DataFrame called "_merge" with
#     information on the source of each row.
#     If string, column with information on source of each row will be added to
#     output DataFrame, and column will be named value of string.
#     Information column is Categorical-type and takes on a value of "left_only"
#     for observations whose merge key only appears in 'left' DataFrame,
#     "right_only" for observations whose merge key only appears in 'right'
#     DataFrame, and "both" if the observation's merge key is found in both.
    
# Returns
# -------
# merged : DataFrame
#     The output type will be the same as 'left', if it is a subclass
#     of DataFrame.


| | medallion | hack_license | vendor_id | rate_code | store_and_fwd_flag | pickup_datetime | dropoff_datetime | passenger_count | trip_time_in_secs | trip_distance | ... | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount | hour | quarterDay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 89D227B655E5C82AECF13C3F540D4CF4 | BA96DE419E711691B9445D6A6307C170 | CMT | 1 | N | 2013-01-01 15:11:48 | 2013-01-01 15:18:10 | 4 | 382 | 1.0 | ... | 40.751171 | CSH | 6.5 | 0.0 | 0.5 | 0.0 | 0.0 | 7.0 | 15 | (11, 17] |
| 1 | 0BD7C8F5BA12B88E0B67BED28BEA73D8 | 9FD8F69F0804BDB5549F40E9DA1BE472 | CMT | 1 | N | 2013-01-06 00:18:35 | 2013-01-06 00:22:54 | 1 | 259 | 1.5 | ... | 40.75066 | CSH | 6.0 | 0.5 | 0.5 | 0.0 | 0.0 | 7.0 | 0 | (-1, 5] |
| 2 | 0BD7C8F5BA12B88E0B67BED28BEA73D8 | 9FD8F69F0804BDB5549F40E9DA1BE472 | CMT | 1 | N | 2013-01-05 18:49:41 | 2013-01-05 18:54:23 | 1 | 282 | 1.1 | ... | 40.726002 | CSH | 5.5 | 1.0 | 0.5 | 0.0 | 0.0 | 7.0 | 18 | (17, inf] |
| 3 | DFD2202EE08F7A8DC9A57B02ACB81FE2 | 51EE87E3205C985EF8431D850C786310 | CMT | 1 | N | 2013-01-07 23:54:15 | 2013-01-07 23:58:20 | 2 | 244 | 0.7 | ... | 40.759388 | CSH | 5.0 | 0.5 | 0.5 | 0.0 | 0.0 | 6.0 | 23 | (17, inf] |
| 4 | DFD2202EE08F7A8DC9A57B02ACB81FE2 | 51EE87E3205C985EF8431D850C786310 | CMT | 1 | N | 2013-01-07 23:25:03 | 2013-01-07 23:34:24 | 1 | 560 | 2.1 | ... | 40.747868 | CSH | 9.5 | 0.5 | 0.5 | 0.0 | 0.0 | 10.5 | 23 | (17, inf] |
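
If you are unsure whether the four join columns are the right ones, the `indicator` option described in the docstring above is a handy way to check how many rows actually matched. The sketch below runs an outer merge on two made-up three-row frames. The `toy_*` frames and their values are hypothetical, and the `validate` argument is not in the signature quoted above, so it assumes a newer pandas release than the one used in this notebook.

```python
import pandas as pd

keys = ["medallion", "hack_license", "vendor_id", "pickup_datetime"]

# Hypothetical stand-ins for tripData and fareData.
toy_trips = pd.DataFrame({
    "medallion": ["A", "B", "C"],
    "hack_license": ["h1", "h2", "h3"],
    "vendor_id": ["CMT", "VTS", "CMT"],
    "pickup_datetime": pd.to_datetime(
        ["2013-01-01 15:11:48", "2013-01-06 00:18:35", "2013-01-05 18:49:41"]
    ),
    "trip_distance": [1.0, 1.5, 1.1],
})
toy_fares = pd.DataFrame({
    "medallion": ["A", "B", "D"],
    "hack_license": ["h1", "h2", "h4"],
    "vendor_id": ["CMT", "VTS", "CMT"],
    "pickup_datetime": pd.to_datetime(
        ["2013-01-01 15:11:48", "2013-01-06 00:18:35", "2013-01-07 23:54:15"]
    ),
    "total_amount": [7.0, 7.0, 6.0],
})

# The join keys should identify a ride uniquely in each frame.
assert not toy_trips.duplicated(keys).any()
assert not toy_fares.duplicated(keys).any()

# Outer merge keeps unmatched rows; "_merge" records where each row came from.
checked = toy_trips.merge(
    toy_fares, on=keys, how="outer", indicator=True, validate="one_to_one"
)
n_both = int((checked["_merge"] == "both").sum())
n_left_only = int((checked["_merge"] == "left_only").sum())
n_right_only = int((checked["_merge"] == "right_only").sum())

print(n_both, n_left_only, n_right_only)
```

On the real data, a large `left_only` or `right_only` count would suggest that the key columns are wrong or formatted differently in the two files.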


<b>Which driver (`hack_license`) carried the most passengers, on average?</b>

In [39]:
full_data.groupby("hack_license")["passenger_count"].mean().sort_values(ascending=False).head(1)

hack_license
DF1338A98DAA39B20B528EEC54081A3D    6.0
Name: passenger_count, dtype: float64
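
An average of exactly 6.0 may well come from a driver with only a handful of rides in this sample. One way to guard against that is to require a minimum number of rides before ranking. This is only a sketch: the `toy` frame is made up and the `min_rides` cutoff of 3 is an arbitrary assumption.

```python
import pandas as pd

# Made-up rides: d1 has a single 6-passenger trip, d2 and d3 have three trips each.
toy = pd.DataFrame({
    "hack_license": ["d1", "d2", "d2", "d2", "d3", "d3", "d3"],
    "passenger_count": [6, 5, 5, 4, 1, 2, 1],
})

min_rides = 3  # hypothetical cutoff, tune for the real data

# Mean passengers and number of rides for each driver.
per_driver = toy.groupby("hack_license")["passenger_count"].agg(["mean", "count"])

top_unfiltered = per_driver["mean"].idxmax()
eligible = per_driver[per_driver["count"] >= min_rides]
top_with_cutoff = eligible["mean"].idxmax()

print(per_driver)
print(top_unfiltered, top_with_cutoff)
```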

<b>How does the number of passengers correlate with the tip amount?</b>

In [40]:
full_data['passenger_count'].corr(full_data['tip_amount'])    # there doesn't seem to be much of a correlation here.

-0.0083289315821814744
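
A single Pearson coefficient can hide structure, so it may help to also look at the tip amount grouped by passenger count. The sketch below does both on a made-up frame whose values were chosen so the correlation comes out to zero. The `toy` data is hypothetical and is not a claim about the real dataset.

```python
import pandas as pd

# Made-up values, picked so that the Pearson correlation is exactly zero.
toy = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 3, 3],
    "tip_amount": [0.0, 2.0, 1.0, 1.0, 0.0, 2.0],
})

# Same call as above: Pearson correlation between the two columns.
pearson_r = toy["passenger_count"].corr(toy["tip_amount"])

# Summary of tips for each passenger count.
tip_by_passengers = toy.groupby("passenger_count")["tip_amount"].agg(
    ["mean", "median", "count"]
)

print(pearson_r)
print(tip_by_passengers)
```

The `CSH` rows in the merged output above all show a `tip_amount` of 0.0, so on the real data it may also be worth repeating the calculation for a single `payment_type` at a time.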