## You will be using NYC taxi ride data. There are two files located in the `data/nycTaxiData/` folder: `trip_fare_500k.csv` and `trip_data_500k.csv`.

The answers will be posted 9/27 after class.

The fare data is in the `trip_fare_500k.csv` file, found in the `data/nycTaxiData/` folder.
This dataset contains a fairly large number of distinct trips taken in cabs in the NYC area in 2013 (500 thousand of them, to be exact!).

The first line of the file (called the header) names the following columns:

* `medallion`: The ID of the cab being operated
* `hack_license`: The ID of the person operating the cab
* `vendor_id`: The type of vendor operating the cab; can be either `CMT` or `VTS` (no clue what these two types mean)
* `pickup_datetime`: The time when the ride started
* `payment_type`: How the trip was paid for. `UNK` stands for unknown. I have no idea what `NOC` stands for, but let's assume it's some known way to pay
* `fare_amount`: Base fare cost of the trip
* `surcharge`: Additional charges that are not tolls
* `mta_tax`: The MTA has to get its cut, right? :)
* `tip_amount`: How generous the rider(s) decided to be
* `tolls_amount`: How much money you had to pay in tolls
* `total_amount`: How much the trip cost.

Here are the columns of the trip dataset, found in `trip_data_500k.csv`:

* `medallion`: The ID of the cab being operated
* `hack_license`: The ID of the person operating the cab
* `vendor_id`: The type of vendor operating the cab; can be either `CMT` or `VTS` (no clue what these two types mean)
* `rate_code`: Designates the kind of ride this is; must be `1` through `6` (any other number is incorrect)
* `store_and_fwd_flag`: Can be either `Y`, `N`, or `NaN`
* `pickup_datetime`: The time when the ride started
* `dropoff_datetime`: The time when the ride ended
* `passenger_count`: The number of passengers during the ride
* `trip_time_in_secs`: How long the trip took, in seconds
* `trip_distance`: Distance of the trip, to the nearest 1/10 mile
* `pickup_longitude`: Longitude of pickup location
* `pickup_latitude`: Latitude of pickup location
* `dropoff_longitude`: Longitude of dropoff location
* `dropoff_latitude`: Latitude of dropoff location

First step - make your own copy of the notebook.

Use the rest of this notebook to work through all these questions. 

If you can tackle all of these questions, then you've learned a lot already!

If not, don't worry: this stuff is hard, and T.J. and Ramesh will gladly help and guide you through all of it. Contact us through Slack with any questions.

For tips and commands, see the pandas class notebooks or https://github.com/guipsamora/pandas_exercises.

But take charge of your learning! This means:

* Ask a classmate to help if you don't understand something.
* If your neighbor can't help you, try using:
  * [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/index.html)
  * [google](http://www.google.com)
  * [stackoverflow](http://stackoverflow.com) to see if someone in the internet ether has had a similar problem before
  * if none of this works, then I will gladly help you
* This will accomplish at least two things:
  * It will get you to use online resources and take charge of your learning
  * It will get you to learn alternative approaches

I've started the bare-bones script for you by:

* importing what you will need.
* loading the two datasets into `DataFrame` objects (you might need to change the path to where the file is located on your system).
* formatting the timestamp for you, because spending 30+ minutes trying to figure out how to do it is not the point of the assignment. This way, all of the functions in `fareData.pickup_datetime.dt` can immediately be used on the `pickup_datetime` column of your dataset.
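To give a feel for what the `.dt` accessor offers once a column is in datetime format, here is a minimal sketch on a tiny made-up table. The `toy` frame and its timestamps are invented for illustration; only `pickup_datetime` mirrors a real column name.

```python
import pandas as pd

# Toy data with made-up timestamps, standing in for the real pickup_datetime column
toy = pd.DataFrame({
    "pickup_datetime": [
        "2013-01-01 00:15:00",
        "2013-01-01 13:45:30",
        "2013-01-02 23:59:59",
    ],
})

# Same conversion as in the starter code: strings -> datetime64
toy["pickup_datetime"] = pd.to_datetime(
    toy["pickup_datetime"], format="%Y-%m-%d %H:%M:%S"
)

# Once the column is datetime64, .dt exposes its components
toy["hour"] = toy["pickup_datetime"].dt.hour          # 0-23
toy["weekday"] = toy["pickup_datetime"].dt.dayofweek  # Monday=0 ... Sunday=6

print(toy)
```

The derived `hour` column is the kind of thing you can later group by when a question asks about "hour in the day".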

The rest I leave to you. Happy hacking!

```python
from __future__ import print_function, unicode_literals, division
import pandas as pd
import numpy as np
```

## Let's start with the fare data:

```python
fareData = pd.read_csv("../data/nycTaxiData/trip_fare_500k.csv")
fareData.pickup_datetime = pd.to_datetime(fareData.pickup_datetime, format="%Y-%m-%d %H:%M:%S")
# Confirm that pickup_datetime, as well as all of the other columns, has the
# appropriate dtype (pickup_datetime should be datetime64). If it doesn't,
# something is wrong, and we need to figure out what that is.
fareData.dtypes
```

```
medallion                  object
hack_license               object
vendor_id                  object
pickup_datetime    datetime64[ns]
payment_type               object
fare_amount               float64
surcharge                 float64
mta_tax                   float64
tip_amount                float64
tolls_amount              float64
total_amount              float64
dtype: object
```

<b>Are there any missing data (null values)?</b>
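One possible way to look for null values is sketched below on a made-up frame (the `toy` values, including the payment codes, are invented; try the same calls on `fareData` yourself). It counts nulls per column with `isnull().sum()` and checks whether any exist at all.

```python
import numpy as np
import pandas as pd

# Made-up rows; None and np.nan both count as missing in pandas
toy = pd.DataFrame({
    "payment_type": ["CRD", "CSH", None, "CRD"],
    "tip_amount": [2.0, np.nan, 0.0, 1.5],
    "total_amount": [12.0, 8.5, 6.0, 10.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# True if at least one value anywhere in the frame is missing
any_missing = bool(toy.isnull().values.any())

print(missing_per_column)
print("Any missing values?", any_missing)
```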

<b>What was the most expensive/least expensive trip taken?</b>

<b>How does the overall `total_amount` paid per ride correlate with `tip_amount` per ride?</b>

<b>How do they correlate for only rides with cash `payment_type`?</b>
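As a hint for the correlation questions, the sketch below computes a Pearson correlation between two columns, first over all rows and then over a filtered subset. The data is made up, and the code `"CSH"` for cash is an assumption; check `fareData.payment_type.unique()` to see which codes the real file uses.

```python
import pandas as pd

# Made-up rides; "CSH" as the cash code is an assumption for this sketch
toy = pd.DataFrame({
    "payment_type": ["CRD", "CRD", "CSH", "CSH", "CRD", "CSH"],
    "tip_amount":   [2.0, 4.0, 0.0, 0.0, 3.0, 1.0],
    "total_amount": [12.0, 24.0, 8.0, 15.0, 18.0, 9.0],
})

# Series.corr uses the Pearson correlation coefficient by default
overall_r = toy["total_amount"].corr(toy["tip_amount"])

# Boolean mask keeps only the rows of one payment type
cash = toy[toy["payment_type"] == "CSH"]
cash_r = cash["total_amount"].corr(cash["tip_amount"])

print("all rides:", overall_r)
print("cash only:", cash_r)
```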

<b>Calculate the average cost of a trip in this dataset given the following conditions:</b>
  1. Across the whole dataset
  2. Across the whole dataset when the `payment_type` is known (not `UNK`)
  3. For each `payment_type`. You can totally do this one by one, but try to do it in a for loop.
  4. Which `payment_type` had the highest average cost?
  5. Which `payment_type` had the largest spread in how much people paid (largest standard deviation)?
  6. Which `payment_type` had the most generous people (had the highest average tip), including unknown payment types?
  7. What hour in the day were people most generous, on average, when they got into a cab?
  8. What hour of the day did people fluctuate the most in terms of tips? That is, do some hours lead to unpredictable tip amounts? 
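The per-group questions above can be approached either with an explicit loop or with `groupby`. Here is a rough sketch of both on made-up data (the payment codes other than `UNK`, the timestamps, and the amounts are all invented), so treat it as one possible pattern and not as the expected solution.

```python
import pandas as pd

# Made-up rides for illustration only
toy = pd.DataFrame({
    "payment_type": ["CRD", "CRD", "CSH", "CSH", "UNK"],
    "pickup_datetime": pd.to_datetime([
        "2013-01-01 08:10:00", "2013-01-01 08:50:00",
        "2013-01-01 17:05:00", "2013-01-01 17:30:00",
        "2013-01-01 23:00:00",
    ]),
    "tip_amount":   [2.0, 4.0, 0.0, 1.0, 5.0],
    "total_amount": [10.0, 20.0, 8.0, 12.0, 30.0],
})

# Option 1: an explicit for loop over the distinct payment types
for ptype in toy["payment_type"].unique():
    subset = toy[toy["payment_type"] == ptype]
    print(ptype, subset["total_amount"].mean())

# Option 2: groupby computes several statistics per group in one go
# (std is NaN for a group with a single row)
stats = toy.groupby("payment_type")["total_amount"].agg(["mean", "std"])
highest_mean_type = stats["mean"].idxmax()

# Grouping by the hour of pickup works the same way
tip_by_hour = toy.groupby(toy["pickup_datetime"].dt.hour)["tip_amount"].mean()

print(stats)
print(tip_by_hour)
```

Swapping `"mean"` for `"std"` in the last `groupby` gives the spread of tips per hour.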

<b>Which person (`hack_license`) made the most money:</b>
  1. In total
  2. On a per-trip basis, given that they took at least 20 trips
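For the "at least 20 trips" condition, one possible pattern is to aggregate per driver and then filter on the trip count. The sketch below uses made-up drivers and a lowered threshold (`MIN_TRIPS = 2`, hypothetical, so the toy data has something to filter), and it assumes `total_amount` is what counts as "money made"; you may decide another column fits better.

```python
import pandas as pd

# Made-up drivers and amounts
toy = pd.DataFrame({
    "hack_license": ["A", "A", "A", "B", "B", "C"],
    "total_amount": [10.0, 12.0, 8.0, 40.0, 20.0, 100.0],
})

MIN_TRIPS = 2  # the assignment asks for 20; lowered here for the toy data

# One row per driver: total earned, average per trip, number of trips
per_driver = toy.groupby("hack_license")["total_amount"].agg(["sum", "mean", "count"])

# Driver with the largest total
top_total = per_driver["sum"].idxmax()

# Keep only drivers with enough trips, then take the best per-trip average
eligible = per_driver[per_driver["count"] >= MIN_TRIPS]
top_per_trip = eligible["mean"].idxmax()

print(per_driver)
print("most in total:", top_total)
print("most per trip:", top_per_trip)
```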

<b>Does the number of trips a given cabbie takes (her/his experience) correlate with how well she/he is tipped? If so, in what direction?</b>

<b>Does the number of times a given cab is used correlate with how well the person driving the cab is tipped? That is, are there "lucky" cabs?</b>

<b>Which `vendor_id` had the higher average `surcharge` on a per-hour basis?</b>


<b>Which hour in the day: </b>
  1. Did people most frequently take rides?
  2. Did people least frequently take rides?
  3. Had the largest number of unique cabs on the street?
  4. Had the smallest number of unique cabs on the street?
  5. What is the average number of cabs on the streets in NYC in each quarter of the day (at least in this dataset)?
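Counting rides versus counting unique cabs per hour are two different aggregations, which the sketch below tries to make concrete on made-up data. Treating a "quarter of the day" as a six-hour block (`hour // 6`) is an assumption of this sketch, and hours with no rides at all are simply absent here, not counted as zero.

```python
import pandas as pd

# Made-up medallions and pickup times
toy = pd.DataFrame({
    "medallion": ["M1", "M1", "M2", "M3", "M3"],
    "pickup_datetime": pd.to_datetime([
        "2013-01-01 08:05:00", "2013-01-01 08:40:00", "2013-01-01 08:55:00",
        "2013-01-01 17:10:00", "2013-01-01 17:20:00",
    ]),
})

hour = toy["pickup_datetime"].dt.hour

# Number of rides that started in each hour
rides_per_hour = toy.groupby(hour).size()

# Number of distinct cabs seen in each hour
cabs_per_hour = toy.groupby(hour)["medallion"].nunique()

# Assumed definition: quarter 0 = hours 0-5, 1 = 6-11, 2 = 12-17, 3 = 18-23
quarter = cabs_per_hour.index.to_numpy() // 6
cabs_per_quarter = cabs_per_hour.groupby(quarter).mean()

print(rides_per_hour)
print(cabs_per_hour)
print(cabs_per_quarter)
```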

<b>Read in the trip data file - `trip_data_500k.csv`. Join the trip data and fare data datasets together. You will need to join the datasets on more than one column, but you will have to figure out what those columns are!</b>
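Joining on more than one column just means passing a list of column names to `pd.merge`. The sketch below uses two made-up frames with hypothetical key columns `key_a` and `key_b`, so which columns to use for the taxi data is still yours to figure out.

```python
import pandas as pd

# Two made-up tables sharing a pair of (hypothetical) key columns
left = pd.DataFrame({
    "key_a": ["x", "x", "y"],
    "key_b": [1, 2, 1],
    "fare": [5.0, 7.5, 9.0],
})
right = pd.DataFrame({
    "key_a": ["x", "x", "y"],
    "key_b": [1, 2, 2],
    "passengers": [1, 3, 2],
})

# A row survives an inner join only if BOTH key columns match
joined = pd.merge(left, right, on=["key_a", "key_b"], how="inner")

print(joined)
```

Comparing `len(joined)` with the lengths of the two inputs is a quick sanity check that the chosen keys match rows one to one.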

<b>Which driver (`hack_license`) carried the most passengers, on average?</b>

<b>How does the number of passengers correlate with the tip amount?</b>