
# NYC Taxi Trip Data

Source: [NYC Taxi and Limousine Commission](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)

This data set includes a 1/50th random sample of all trips between January and June 2016 that either start or end at an airport location.

## Column Summary

- `pickup_year` - The year of the trip.
- `pickup_month` - The month of the trip (January is `1`, December is `12`).
- `pickup_day` - The day of the month of the trip.
- `pickup_dayofweek` - The day of the week (Monday is `1`, Sunday is `7`).
- `pickup_time` - The time that the trip started, as one of six categories:
    - `0` - 0:00am-3:59am.
    - `1` - 4:00am-7:59am.
    - `2` - 8:00am-11:59am.
    - `3` - 12:00pm-3:59pm.
    - `4` - 4:00pm-7:59pm.
    - `5` - 8:00pm-11:59pm.
- `pickup_location_code` - The airport or [borough](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City) where the trip started, as one of eight categories:
    - `0` - Bronx.
    - `1` - Brooklyn.
    - `2` - JFK Airport.
    - `3` - LaGuardia Airport.
    - `4` - Manhattan.
    - `5` - Newark Airport.
    - `6` - Queens.
    - `7` - Staten Island.
- `dropoff_location_code` - The airport or borough where the trip finished, using the same eight category codes as `pickup_location_code`.
- `trip_distance` - The distance of the trip in miles.
- `trip_length` - The length of the trip in seconds.
- `fare_amount` - The base fare of the trip, in dollars.
- `fees_amount` - Any fees added to the fare, e.g., surcharges, extras, and MTA taxes.
- `tolls_amount` - The amount of all tolls paid during the trip.
- `tip_amount` - The tip added by the customer - does not include cash tips.
- `total_amount` - The total amount charged to the passenger, excluding cash tips.
- `payment_type` - The payment type, one of six categories:
    - `1` - Credit card.
    - `2` - Cash.
    - `3` - No charge.
    - `4` - Dispute.
    - `5` - Unknown.
    - `6` - Voided trip.
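
The six `pickup_time` categories listed above each span four hours, so the codes are consistent with integer-dividing the pickup hour by 4. The sketch below illustrates that relationship. The `hours` array is made-up example data, and whether the data set's authors derived the codes this way is an assumption.

```python
import numpy as np

# Hypothetical pickup hours on a 24-hour clock (not taken from the data set).
hours = np.array([0, 3, 4, 11, 12, 19, 20, 23])

# Each category covers a four-hour window, so floor division by 4
# reproduces the codes 0 to 5 from the column summary.
pickup_time = hours // 4

print(pickup_time)  # [0 0 1 2 3 4 5 5]
```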


## Column Indexes  
0. `pickup_year` - The year of the trip.  
1. `pickup_month` - The month of the trip (January is 1, December is 12).  
2. `pickup_day` - The day of the month of the trip.  
3. `pickup_dayofweek` - The day of the week (Monday is 1, Sunday is 7).  
4. `pickup_time` - The time that the trip started.  
5. `pickup_location_code` - The airport or borough where the trip started.  
6. `dropoff_location_code` - The airport or borough where the trip finished.  
7. `trip_distance` - The distance of the trip in miles.  
8. `trip_length` - The length of the trip in seconds.  
9. `fare_amount` - The base fare of the trip, in dollars.  
10. `fees_amount` - Any fees added to the fare, e.g., surcharges, extras, and MTA taxes.  
11. `tolls_amount` - The amount of all tolls paid during the trip.  
12. `tip_amount` - The tip added by the customer - does not include cash tips.  
13. `total_amount` - The total amount charged to the passenger, excluding cash tips.  
14. `payment_type` - The payment type.  
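
Because the data is loaded into a plain ndarray, columns are addressed by position rather than by name. One possible way to keep the indexes above readable in code is a small name-to-index lookup. The `COLUMNS` list and `COL` dictionary are hypothetical helpers, not part of the original notebook, and the example row uses the values of the first trip in the data set.

```python
import numpy as np

# Column names in file order, copied from the index list above.
COLUMNS = [
    "pickup_year", "pickup_month", "pickup_day", "pickup_dayofweek",
    "pickup_time", "pickup_location_code", "dropoff_location_code",
    "trip_distance", "trip_length", "fare_amount", "fees_amount",
    "tolls_amount", "tip_amount", "total_amount", "payment_type",
]

# Hypothetical helper: map each column name to its positional index.
COL = {name: i for i, name in enumerate(COLUMNS)}

# A single example row laid out in the same column order.
row = np.array([2016, 1, 1, 5, 0, 2, 4, 21.0, 2037, 52.0,
                0.8, 5.54, 11.65, 69.99, 1])

print(COL["trip_distance"])       # 7
print(row[COL["trip_distance"]])  # 21.0
```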

In [1]:
import numpy as np

In [2]:
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
taxi.shape

(2013, 15)
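
To see what `skip_header` and `delimiter` do without needing the real file, the sketch below feeds `np.genfromtxt` a tiny in-memory CSV. The CSV text is made up for illustration and only stands in for `nyc_taxis.csv`.

```python
import io
import numpy as np

# A tiny made-up CSV with a header row, standing in for nyc_taxis.csv.
csv_text = (
    "pickup_year,pickup_month,trip_distance\n"
    "2016,1,21.0\n"
    "2016,2,16.29\n"
)

# skip_header=1 drops the header line; the remaining fields are parsed as floats.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.shape)  # (2, 3)
print(sample.dtype)  # float64
```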

In [3]:
taxi[:5]

array([[2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
        4.000e+00, 2.100e+01, 2.037e+03, 5.200e+01, 8.000e-01, 5.540e+00,
        1.165e+01, 6.999e+01, 1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
        1.000e+00, 1.629e+01, 1.520e+03, 4.500e+01, 1.300e+00, 0.000e+00,
        8.000e+00, 5.430e+01, 1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
        6.000e+00, 1.270e+01, 1.462e+03, 3.650e+01, 1.300e+00, 0.000e+00,
        0.000e+00, 3.780e+01, 2.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
        6.000e+00, 8.700e+00, 1.210e+03, 2.600e+01, 1.300e+00, 0.000e+00,
        5.460e+00, 3.276e+01, 1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
        6.000e+00, 5.560e+00, 7.590e+02, 1.750e+01, 1.300e+00, 0.000e+00,
        0.000e+00, 1.880e+01, 2.000e+00]])

In [4]:
taxi.dtype

dtype('float64')

In [5]:
# Use boolean indexing to confirm the number of taxi rides in our data set from the month of January. 
# First, let's select just the pickup_month column, which is the second column in the ndarray:

pickup_month = taxi[:, 1]

In [6]:
# use a boolean operation to make a boolean array, where the value 1 corresponds to January:
january_bool = pickup_month == 1

# use the new boolean array to select only the items from pickup_month that have a value of 1:
january = pickup_month[january_bool]

# Finally, we use the .shape attribute to find out how many items are in our january ndarray, 
# which is equal to the number of taxi rides from the month of January. 
# We'll use [0] to extract the value from the tuple returned by .shape:

january_rides = january.shape[0]
print(january_rides)

800


In [7]:
# Do the same for February

february = pickup_month[pickup_month == 2]
february_rides = february.shape[0]
print(february_rides)

176
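
Counting matches does not require building the filtered array first. As a sketch on made-up month values (not the real `pickup_month` column), the boolean array itself can be summed, because each `True` counts as 1:

```python
import numpy as np

# Made-up pickup months, not the real column.
pickup_month = np.array([1, 1, 2, 3, 1, 2])

# Summing a boolean array counts its True values.
january_rides = (pickup_month == 1).sum()

# np.count_nonzero gives the same kind of count.
february_rides = np.count_nonzero(pickup_month == 2)

print(january_rides, february_rides)  # 3 2
```

Both forms give the same number as taking `.shape[0]` of the filtered array.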


Let's use what we've learned to analyze the average speed of trips. In the previous lesson (nyc_taxis.ipynb), we calculated the maximum trip speed to be 82,000 mph, which is clearly not accurate. Let's check whether there are any issues with the data. Recall that we calculated the average travel speed as follows:

`trip_mph = taxi[:,7] / (taxi[:,8] / 3600)`  
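
As a worked example of that formula, the first trip in the data set covers 21 miles in 2037 seconds. The second calculation is an illustrative case of a trip recorded as lasting only one second, which shows how a tiny duration inflates the computed speed.

```python
# First trip in the data set: 21 miles in 2037 seconds.
distance_miles = 21.0
length_seconds = 2037.0

# Dividing seconds by 3600 converts the duration to hours;
# miles divided by hours gives miles per hour.
mph = distance_miles / (length_seconds / 3600)
print(round(mph, 2))  # 37.11

# Illustrative case: 23 miles recorded as taking 1 second.
suspect_mph = 23.0 / (1.0 / 3600)
print(round(suspect_mph))  # 82800
```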

Next, we'll check for trips with an average speed greater than 20,000 mph:

In [8]:
# Create a boolean array for trips greater than 20,000 mph

trips_mph = taxi[:, 7] / (taxi[:, 8] / 3600)

trips_over_20000mph_bool = trips_mph > 20000

In [9]:
trips_over_20000mph_bool

array([False, False, False, ..., False, False, False])

In [10]:
# use the boolean array to select the rows for those trips, 
# and the pickup_location_code, dropoff_location_code, trip_distance, and
# trip_length columns

trips_over_20000mph = taxi[trips_over_20000mph_bool, 5:9]

In [11]:
trips_over_20000mph

array([[ 2. ,  2. , 23. ,  1. ],
       [ 2. ,  2. , 19.6,  1. ],
       [ 2. ,  2. , 16.7,  2. ],
       [ 3. ,  3. , 17.8,  2. ],
       [ 2. ,  2. , 17.2,  2. ],
       [ 3. ,  3. , 16.9,  3. ],
       [ 2. ,  2. , 27.1,  4. ]])
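
The indexing pattern used here combines a boolean array for the rows with a slice for the columns. A minimal sketch on a made-up array (the three columns are an assumed layout for illustration only):

```python
import numpy as np

# Made-up array: columns are [id, distance_miles, length_seconds].
trips = np.array([
    [0.0, 10.0, 1200.0],
    [1.0, 20.0,    2.0],
    [2.0,  5.0,  600.0],
    [3.0, 18.0,    3.0],
])

mph = trips[:, 1] / (trips[:, 2] / 3600)
too_fast = mph > 20000

# The boolean array selects rows; the slice 1:3 selects columns 1 and 2.
suspect = trips[too_fast, 1:3]
print(suspect)
```

Only the two rows with durations of a few seconds survive the filter, and only their distance and length columns are returned.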

Let's use this technique to examine the rows that have the highest values for the tip_amount column.

In [12]:
# Create a boolean array, tips_over_50_bool,
# that determines which rows have values for the tip_amount column of more than 50.

tips_over_50_bool = taxi[:, 12] > 50

In [13]:
tips_over_50_bool

array([False, False, False, ..., False, False, False])

In [14]:
# Use the tips_over_50_bool array to select all rows from taxi with tip amounts of more than 50,
# and the columns from indexes 5 to 13 inclusive. Assign the resulting array to top_tips.

top_tips = taxi[tips_over_50_bool, 5:14]

In [15]:
top_tips

array([[4.000e+00, 2.000e+00, 2.145e+01, 2.004e+03, 5.200e+01, 8.000e-01,
        0.000e+00, 5.280e+01, 1.056e+02]])

To help you practice without making changes to our original array, we have used the [ndarray.copy()](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.copy.html) method to make `taxi_modified`, a copy of our original for these exercises.

In [16]:
taxi_modified = taxi.copy()
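
The copy matters because basic slicing returns a view that shares memory with the original array, whereas `copy()` allocates independent data. A small sketch on a made-up array:

```python
import numpy as np

original = np.array([[1.0, 2.0], [3.0, 4.0]])

view = original[:, 0]     # basic slicing returns a view onto the same data
backup = original.copy()  # copy() allocates independent data

view[0] = 99.0            # writes through to `original`

print(original[0, 0])  # 99.0
print(backup[0, 0])    # 1.0
```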

In [17]:
taxi_modified[1066, 5]

4.0

In [18]:
# The value at column index 5 (pickup_location_code) of row index 1066 is incorrect.
# Use assignment to change this value to 1 in the taxi_modified ndarray.

taxi_modified[1066, 5] = 1
taxi_modified[1066, 5]

1.0

In [19]:
taxi_modified[:, 0]

array([2016., 2016., 2016., ..., 2016., 2016., 2016.])

In [20]:
# The first column (index 0) contains year values as four digit numbers 
# in the format YYYY (2016, since all trips in our data set are from 2016). 
# Use assignment to change these values to the YY format (16) in the taxi_modified ndarray.

taxi_modified[:, 0] = 16
taxi_modified[:, 0]

array([16., 16., 16., ..., 16., 16., 16.])

In [21]:
taxi_modified[:, 7][550:552]

array([9.88, 8.6 ])

In [22]:
np.average(taxi_modified[:, 7])

12.924778936910084

In [23]:
# The values at column index 7 (trip_distance) of rows index 550 and 551 are incorrect. 
# Use assignment to change these values in the taxi_modified ndarray to the mean value for that column.

taxi_modified[550:552, 7] = np.average(taxi_modified[:, 7])
taxi_modified[550:552, 7]

array([12.92477894, 12.92477894])

We again used the `ndarray.copy()` method to make `taxi_copy`, a copy of our original for this exercise.

1. Select the fourteenth column (index 13) in `taxi_copy`. Assign it to a variable named `total_amount`.
2. For rows where the value of `total_amount` is less than 0, use assignment to change the value to 0.


In [24]:
taxi_copy = taxi.copy()

In [25]:
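# Select the total_amount column (a view into taxi_copy), then set its negative values to 0.
total_amount = taxi_copy[:, 13]
total_amount[total_amount < 0] = 0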
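
When assigning through a boolean mask, the column index decides whether whole rows or only one column are overwritten. The sketch below contrasts the two on made-up data (the two-column layout is an assumption for illustration):

```python
import numpy as np

# Made-up data: columns are [trip_id, total_amount].
rows_version = np.array([[1.0, 12.5], [2.0, -3.0], [3.0, 8.0]])
column_version = rows_version.copy()

negative = rows_version[:, 1] < 0

# Mask alone: every column of the matching rows is overwritten.
rows_version[negative] = 0

# Mask plus column index: only the total_amount column is overwritten.
column_version[negative, 1] = 0

print(rows_version[1])    # [0. 0.]
print(column_version[1])  # [2. 0.]
```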

We have created a new copy of our taxi dataset, `taxis_modified`, with an additional column containing the value `0` for every row.

In our new column at index 15, assign the value 1 if the `pickup_location_code` (column index 5) corresponds to an airport location, leaving the value as 0 otherwise, by performing these three operations:
- For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
- For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
- For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.


In [26]:
taxi.shape

(2013, 15)

In [27]:
# create a new column filled with `0`.
zeros = np.zeros([taxi.shape[0], 1])
taxis_modified = np.concatenate([taxi, zeros], axis=1)

In [28]:
taxis_modified

array([[2.016e+03, 1.000e+00, 1.000e+00, ..., 6.999e+01, 1.000e+00,
        0.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 5.430e+01, 1.000e+00,
        0.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 3.780e+01, 2.000e+00,
        0.000e+00],
       ...,
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 6.334e+01, 1.000e+00,
        0.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 4.475e+01, 1.000e+00,
        0.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 5.484e+01, 2.000e+00,
        0.000e+00]])

In [29]:
taxis_modified[taxis_modified[:, 5] == 2, 15] = 1
taxis_modified[taxis_modified[:, 5] == 3, 15] = 1
taxis_modified[taxis_modified[:, 5] == 5, 15] = 1

In [30]:
taxis_modified

array([[2.016e+03, 1.000e+00, 1.000e+00, ..., 6.999e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 5.430e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 3.780e+01, 2.000e+00,
        1.000e+00],
       ...,
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 6.334e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 4.475e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 5.484e+01, 2.000e+00,
        1.000e+00]])
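
The three separate assignments can also be expressed as a single membership test. This is an alternative sketch using `np.isin` on a made-up two-column array, not the approach taken in the cells above:

```python
import numpy as np

# Made-up array: column 0 is a location code, column 1 is the flag column.
codes = np.array([[2.0, 0.0], [4.0, 0.0], [3.0, 0.0], [6.0, 0.0], [5.0, 0.0]])

AIRPORT_CODES = [2, 3, 5]  # JFK, LaGuardia, Newark

# True wherever the location code is one of the airport codes.
is_airport = np.isin(codes[:, 0], AIRPORT_CODES)
codes[is_airport, 1] = 1

print(codes[:, 1])  # [1. 0. 1. 0. 1.]
```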

We'll conclude this lesson with two challenges. Challenges are designed to help you practice the techniques you've learned in this lesson.

We supplied several hints to help you, but first try to complete the challenge without the hints, if you can. Don't be discouraged if these challenge steps take a few attempts to get right – working with data is an iterative process!

In this challenge, we want to figure out which airport is the most popular destination in our data set. To do that, we'll use boolean indexing to create three filtered arrays and then look at how many rows are in each array.

To complete this task, we'll need to check if the dropoff_location_code column (column index 6) is equal to one of the following values:

- 2: JFK Airport
- 3: LaGuardia Airport
- 5: Newark Airport

Instructions:

1. Using the original `taxi` ndarray, calculate how many trips had JFK Airport as their destination:
    - Use boolean indexing to select only the rows where the `dropoff_location_code` column (column index `6`) has a value that corresponds to JFK. Assign the result to `jfk`.
    - Calculate how many rows are in the new `jfk` array and assign the result to `jfk_count`.
2. Calculate how many trips from `taxi` had LaGuardia Airport as their destination:
    - Use boolean indexing to select only the rows where the `dropoff_location_code` column (column index `6`) has a value that corresponds to LaGuardia. Assign the result to `laguardia`.
    - Calculate how many rows are in the new `laguardia` array. Assign the result to `laguardia_count`.
3. Calculate how many trips from `taxi` had Newark Airport as their destination:
    - Select only the rows where the `dropoff_location_code` column has a value that corresponds to Newark, and assign the result to `newark`.
    - Calculate how many rows are in the new `newark` array and assign the result to `newark_count`.
4. After you have run your code, inspect the values for `jfk_count`, `laguardia_count`, and `newark_count` and see which airport has the most dropoffs.


In [31]:
jfk_count = taxi[taxi[:, 6] == 2].shape[0]
jfk_count

285

In [32]:
laguardia_count = taxi[taxi[:, 6] == 3].shape[0]
laguardia_count

308

In [33]:
newark_count = taxi[taxi[:, 6] == 5].shape[0]
newark_count

2

Our calculations above show that LaGuardia is the most common airport for dropoffs in our data set.
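
The same comparison can be written as a loop over the airport codes instead of three separate cells. A sketch on made-up dropoff codes, with `airport_names` as a hypothetical lookup built from the category list:

```python
import numpy as np

# Made-up dropoff codes, not the real column.
dropoff = np.array([2, 3, 3, 4, 5, 3, 2, 6])

# Hypothetical lookup from airport code to name.
airport_names = {2: "JFK", 3: "LaGuardia", 5: "Newark"}

# Count the rows matching each airport code.
counts = {
    name: int(np.count_nonzero(dropoff == code))
    for code, name in airport_names.items()
}
busiest = max(counts, key=counts.get)

print(counts)   # {'JFK': 2, 'LaGuardia': 3, 'Newark': 1}
print(busiest)  # LaGuardia
```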

Our second and final challenge involves removing potentially bad data from our data set, and then calculating some descriptive statistics on the remaining "clean" data.

We'll start by using boolean indexing to remove any rows that have an average speed greater than 100 mph (160 kph), which should remove the questionable data we have worked with over the past two lessons. Then, we'll use array methods to calculate the mean for specific columns of the remaining data. The columns we're interested in are:

- `trip_distance`, at column index 7
- `trip_length`, at column index 8
- `total_amount`, at column index 13

Instructions:

The `trip_mph` ndarray has been provided for you.

- Create a new ndarray, `cleaned_taxi`, containing only rows for which the values of `trip_mph` are less than 100.
- Calculate the mean of the `trip_distance` column of `cleaned_taxi`. Assign the result to `mean_distance`.
- Calculate the mean of the `trip_length` column of `cleaned_taxi`. Assign the result to `mean_length`.
- Calculate the mean of the `total_amount` column of `cleaned_taxi`. Assign the result to `mean_total_amount`.


In [34]:
trip_mph = taxi[:, 7] / (taxi[:, 8] / 3600)
trip_mph.shape

(2013,)

In [35]:
trip_mph

array([37.11340206, 38.58157895, 31.27222982, ..., 22.29907867,
       42.41551247, 36.90473407])

In [36]:
# Keep only the rows whose average speed is below 100 mph.
cleaned_taxi = taxi[trip_mph < 100]
cleaned_taxi.shape

In [37]:
cleaned_taxi[:2]  # show the first 2 rows

In [38]:
# Calculate the mean of the trip_distance column of cleaned_taxi. Assign the result to mean_distance.
# Calculate the mean of the trip_length column of cleaned_taxi. Assign the result to mean_length.
# Calculate the mean of the total_amount column of cleaned_taxi. Assign the result to mean_total_amount.

mean_distance = cleaned_taxi[:, 7].mean()
mean_length = cleaned_taxi[:, 8].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()

In [39]:
mean_distance

In [40]:
mean_length

In [41]:
mean_total_amount
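
The filter-then-average steps of this challenge can be sketched end to end on a small made-up array. The three-column layout below is an assumption for illustration and does not match the column indexes of `taxi`.

```python
import numpy as np

# Made-up data: columns are [trip_distance (miles), trip_length (seconds), total_amount].
trips = np.array([
    [10.0, 1200.0, 30.0],
    [20.0,    2.0, 60.0],  # implausibly fast: 20 miles in 2 seconds
    [ 5.0,  600.0, 20.0],
])

mph = trips[:, 0] / (trips[:, 1] / 3600)

# Keep only the rows with a plausible average speed.
cleaned = trips[mph < 100]

# mean(axis=0) averages down the rows, giving one mean per column.
means = cleaned.mean(axis=0)

print(cleaned.shape)  # (2, 3)
print(means.tolist())  # [7.5, 900.0, 25.0]
```

Computing all column means in one call with `axis=0` is an alternative to calling `.mean()` on each column separately.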