
# New York City taxi trip data released by the city of New York.

We'll only work with a subset of this data — approximately 90,000 yellow taxi trips to and from New York City airports between January and June 2016. Below is information about selected columns from the dataset:

- pickup_year: the year of the trip
- pickup_month: the month of the trip (January is 1, December is 12)
- pickup_day: the day of the month of the trip
- pickup_location_code: the airport or borough where the trip started
- dropoff_location_code: the airport or borough where the trip ended
- trip_distance: the distance of the trip in miles
- trip_length: the length of the trip in seconds
- fare_amount: the base fare of the trip, in dollars
- total_amount: the total amount charged to the passenger, including all fees, tolls and tips



# NYC Taxi Trip Data

Source: [NYC Taxi and Limousine Commission](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)

This data set includes a 1/50th random sample of all trips between January and June 2016 that either start or end at an airport location.

## Column Summary

- `pickup_year` - The year of the trip.
- `pickup_month` - The month of the trip (January is `1`, December is `12`).
- `pickup_day` - The day of the month of the trip.
- `pickup_dayofweek` - The day of the week (Monday is `1`, Sunday is `7`).
- `pickup_time` - The time that the trip started, as one of six categories:
    - `0` - 0:00am-3:59am.
    - `1` - 4:00am-7:59am.
    - `2` - 8:00am-11:59am.
    - `3` - 12:00pm-3:59pm.
    - `4` - 4:00pm-7:59pm.
    - `5` - 8:00pm-11:59pm.
- `pickup_location_code` - The airport or [borough](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City) where the trip started, as one of eight categories:
    - `0` - Bronx.
    - `1` - Brooklyn.
    - `2` - JFK Airport.
    - `3` - LaGuardia Airport.
    - `4` - Manhattan.
    - `5` - Newark Airport.
    - `6` - Queens.
    - `7` - Staten Island.
- `dropoff_location_code` - The airport or borough where the trip finished, using the same eight category codes as `pickup_location_code`.
- `trip_distance` - The distance of the trip in miles.
- `trip_length` - The length of the trip in seconds.
- `fare_amount` - The base fare of the trip, in dollars.
- `fees_amount` - Any fees added to the fare, e.g. surcharges, extras, and MTA taxes.
- `tolls_amount` - The amount of all tolls paid during the trip.
- `tip_amount` - The tip added by the customer - does not include cash tips.
- `total_amount` - The total amount charged to the passenger, excluding cash tips.
- `payment_type` - The payment type, one of six categories:
    - `1` - Credit card.
    - `2` - Cash.
    - `3` - No charge.
    - `4` - Dispute.
    - `5` - Unknown.
    - `6` - Voided trip.
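
Because the location columns are stored as numeric codes, a small lookup table can make individual trips easier to read. The sketch below copies the eight codes from the summary above; the names `LOCATION_NAMES` and `describe_trip` are hypothetical helpers, not part of the dataset or the notebook.

```python
# Code -> name mapping, copied from the column summary above.
LOCATION_NAMES = {
    0: "Bronx",
    1: "Brooklyn",
    2: "JFK Airport",
    3: "LaGuardia Airport",
    4: "Manhattan",
    5: "Newark Airport",
    6: "Queens",
    7: "Staten Island",
}


def describe_trip(pickup_code, dropoff_code):
    """Return a readable 'pickup -> dropoff' label (hypothetical helper)."""
    # Codes may arrive as floats (e.g. 2.0), so cast to int before the lookup.
    pickup = LOCATION_NAMES[int(pickup_code)]
    dropoff = LOCATION_NAMES[int(dropoff_code)]
    return f"{pickup} -> {dropoff}"


trip_label = describe_trip(2.0, 4.0)
print(trip_label)  # JFK Airport -> Manhattan
```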

In [1]:
import csv
import numpy as np

In [2]:
# Convert taxi CSV into a NumPy ndarray

# Use a context manager so the file is closed once it has been read
with open('nyc_taxis.csv', 'r') as f:
    taxi_list = list(csv.reader(f))

# Remove the header row
taxi_list = taxi_list[1:]

In [3]:
# Convert all values to floats
converted_taxi_list = []

for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

In [4]:
# Convert converted_taxi_list to a NumPy ndarray
taxi = np.array(converted_taxi_list)
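
As an aside, the nested conversion loop and the call to `np.array` can usually be collapsed into one step, since NumPy parses numeric strings when given `dtype=float`. This is a minimal sketch on made-up sample rows (not values read from `nyc_taxis.csv`), and it assumes every field is a valid number.

```python
import numpy as np

# Made-up rows of strings, shaped like the output of csv.reader.
sample_rows = [
    ["2016", "1", "1", "21.0", "2037"],
    ["2016", "1", "2", "9.5", "1200"],
]

# dtype=float asks NumPy to parse each string as a float.
sample = np.array(sample_rows, dtype=float)

print(sample.dtype)
print(sample.shape)
```

If any field is empty or non-numeric, this raises a `ValueError`, so the explicit loop is still useful when rows need cleaning first.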

In [5]:
taxi.ndim, taxi.shape, taxi.size

(2, (2013, 15), 30195)
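
The three attributes are related: `ndim` is the length of `shape`, and `size` is the product of the entries in `shape`. The sketch below checks this on a placeholder array of zeros with the same shape as `taxi`; `demo` is an illustrative name only.

```python
import numpy as np

# Placeholder array with the same shape as taxi: 2013 rows, 15 columns.
demo = np.zeros((2013, 15))

rows, cols = demo.shape
print(demo.ndim)   # number of axes
print(demo.size)   # total number of elements
print(rows * cols) # same value as demo.size
```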

In [6]:
taxi[0]

array([2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
       4.000e+00, 2.100e+01, 2.037e+03, 5.200e+01, 8.000e-01, 5.540e+00,
       1.165e+01, 6.999e+01, 1.000e+00])

In [7]:
print(taxi)

[[2.016e+03 1.000e+00 1.000e+00 ... 1.165e+01 6.999e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 8.000e+00 5.430e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 0.000e+00 3.780e+01 2.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 5.000e+00 6.334e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 8.950e+00 4.475e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 0.000e+00 5.484e+01 2.000e+00]]


In [8]:
row_5 = taxi[5]
rows_426_to_500 = taxi[426:501]
row_45_column_13 = taxi[45, 13]
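
The same indexing patterns are easier to verify on a tiny array. This sketch uses a 4x5 array of consecutive integers (`demo` is an illustrative name) to show a single row, a row slice, a single value, and a whole column. Note that the slice end is exclusive, which is why `taxi[426:501]` is needed to include row 500.

```python
import numpy as np

# 4 rows x 5 columns, filled with 0..19 so positions are easy to check.
demo = np.arange(20).reshape(4, 5)

row_1 = demo[1]          # one row: [5 6 7 8 9]
rows_1_to_2 = demo[1:3]  # rows 1 and 2; the end index 3 is excluded
value_2_3 = demo[2, 3]   # row 2, column 3: 13
column_3 = demo[:, 3]    # every row, column 3: [3 8 13 18]

print(row_1, value_2_3, column_3)
```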

In [9]:
# Use vector addition to add fare_amount and fees_amount. Assign the result to fares_and_fees.
# fare_amount is the column at index 9 (counting from 0)
# fees_amount is the column at index 10 (counting from 0)

fares_and_fees = taxi[:, 9] + taxi[:, 10]
print(fares_and_fees)

[52.8 46.3 37.8 ... 52.8 35.8 49.3]
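
Vector addition works element by element, so it gives the same result as a Python loop over paired values, just without writing the loop. The sketch below compares the two approaches on made-up fares and fees; the variable names are illustrative.

```python
import numpy as np

# Made-up sample values, not read from the dataset.
fares = np.array([52.0, 45.0, 37.0])
fees = np.array([0.8, 1.3, 0.8])

# Vectorized: one expression adds every pair of elements.
vectorized = fares + fees

# Equivalent explicit loop, shown only for comparison.
looped = np.array([fare + fee for fare, fee in zip(fares, fees)])

print(vectorized)
print(np.allclose(vectorized, looped))
```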


In [10]:
# Calculate the average travel speed of each trip in miles per hour.

# trip_distance in miles = column at index 7
# trip_length in seconds = column at index 8

trip_mph = taxi[:, 7] / (taxi[:, 8] / 3600)  # 3600 seconds in an hour

print(trip_mph)

[37.11340206 38.58157895 31.27222982 ... 22.29907867 42.41551247
 36.90473407]
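
To sanity-check the formula, it can be applied by hand to the first trip. From the `taxi[0]` output above, that trip covered 21 miles in 2037 seconds. The constant name `SECONDS_PER_HOUR` is introduced here for readability only.

```python
import numpy as np

SECONDS_PER_HOUR = 3600

# Distance and duration of the first trip, taken from the taxi[0] output.
distance_miles = np.array([21.0])
length_seconds = np.array([2037.0])

# Convert seconds to hours, then divide miles by hours.
length_hours = length_seconds / SECONDS_PER_HOUR
speed_mph = distance_miles / length_hours

print(speed_mph)  # roughly 37.11, matching the first value of trip_mph
```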


In [11]:
# Calculate the minimum, maximum, and mean (average) speed from our trip_mph ndarray.

mph_min = trip_mph.min()
mph_max = trip_mph.max()
mph_mean = trip_mph.mean()

print(mph_min, mph_max, mph_mean)

0.0 82800.0 169.98315083655157
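
A maximum of 82,800 mph is not physically plausible, which suggests that some rows contain recording errors and that they inflate the mean. One possible way to look past them is a boolean mask. The sketch below runs on made-up speeds, and the 100 mph cutoff is an arbitrary assumption chosen for illustration, not a rule from the data source.

```python
import numpy as np

# Made-up speeds, including a zero and an implausible outlier.
speeds = np.array([37.1, 38.6, 0.0, 82800.0, 22.3])

# Assumed plausibility bounds: above 0 mph and below 100 mph.
MAX_PLAUSIBLE_MPH = 100
plausible_mask = (speeds > 0) & (speeds < MAX_PLAUSIBLE_MPH)

# Boolean indexing keeps only the elements where the mask is True.
plausible_speeds = speeds[plausible_mask]

print(plausible_speeds)
print(plausible_speeds.mean())
```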


In [12]:
# Check that fare_amount + fees_amount + tolls_amount + tip_amount = total_amount.

# We'll only review the first five rows in taxi so we can verify the results more easily.
taxi_first_five = taxi[:5]

# select these columns: fare_amount, fees_amount, tolls_amount, tip_amount
fare_components = taxi_first_five[:, 9:13]
fare_sums = fare_components.sum(axis=1)
fare_totals = taxi_first_five[:, 13]

print(fare_sums)
print(fare_totals)

[69.99 54.3  37.8  32.76 18.8 ]
[69.99 54.3  37.8  32.76 18.8 ]
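
Comparing the two printed arrays by eye works for five rows, but a programmatic check scales better. Sums of floats can differ from the stored total by a tiny rounding error, so a tolerance-based comparison such as `np.isclose` is safer than `==`. In this sketch the first row uses the values shown in `taxi[0]`; the second row is a made-up split that adds up to 54.3.

```python
import numpy as np

# Columns: fare_amount, fees_amount, tolls_amount, tip_amount.
components = np.array([
    [52.0, 0.8, 5.54, 11.65],  # first trip, from the taxi[0] output
    [45.0, 1.3, 0.0, 8.0],     # made-up split for illustration
])
totals = np.array([69.99, 54.3])

# Sum across each row, then compare with a floating-point tolerance.
sums = components.sum(axis=1)
row_matches = np.isclose(sums, totals)
all_match = np.allclose(sums, totals)

print(row_matches)
print(all_match)
```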
