This is a subset of the New York City taxi trip data.

pickup_year | pickup_month | pickup_day |pickup_dayofweek|pick_time|pickup_location_code |dropoff_location_code |trip_distance|trip_length|fare_amount|fees_amount|tolls_amount|tip_amount|total_amount|Payment_type
--|--|--|--|--|--|--|--|--|--|--|--|--|--|--
2016|1|1|5|0|2|4|21.00|2037|52.0|0.8|5.54|11.65|69.99|1

-  pickup_month: January is 1, December is 12
-  pickup_location_code / dropoff_location_code: the airport or borough where the trip started/ended, as one of eight categories:
   0 - Bronx.
   1 - Brooklyn.
   2 - JFK Airport.
   3 - LaGuardia Airport.
   4 - Manhattan.
   5 - Newark Airport.
   6 - Queens.
   7 - Staten Island.
-  trip_distance: in miles
-  trip_length: in seconds
-  fare_amount: in dollars
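
The cells below refer to columns by integer index (7 for trip_distance, 12 for tip_amount, and so on). As a minimal sketch, the header and the location codes above can be turned into lookup dictionaries so the indices do not have to be memorized. The names `COLUMNS`, `COL` and `LOCATIONS` are hypothetical helpers, not part of the dataset or of NumPy, and `row` is the sample row from the table.

```python
import numpy as np

# Column names copied from the table header above, in order.
COLUMNS = [
    "pickup_year", "pickup_month", "pickup_day", "pickup_dayofweek",
    "pick_time", "pickup_location_code", "dropoff_location_code",
    "trip_distance", "trip_length", "fare_amount", "fees_amount",
    "tolls_amount", "tip_amount", "total_amount", "Payment_type",
]

# Hypothetical helper: column name -> column index.
COL = {name: i for i, name in enumerate(COLUMNS)}

# Location codes from the list above.
LOCATIONS = {
    0: "Bronx", 1: "Brooklyn", 2: "JFK Airport", 3: "LaGuardia Airport",
    4: "Manhattan", 5: "Newark Airport", 6: "Queens", 7: "Staten Island",
}

# The sample row shown in the table.
row = np.array([2016, 1, 1, 5, 0, 2, 4, 21.00, 2037, 52.0, 0.8, 5.54, 11.65, 69.99, 1])

# Codes are stored as floats in the ndarray, so convert to int before the lookup.
pickup = LOCATIONS[int(row[COL["pickup_location_code"]])]
dropoff = LOCATIONS[int(row[COL["dropoff_location_code"]])]
print(pickup, "->", dropoff, row[COL["trip_distance"]], "miles")
```

With this in place, `taxiarray[:, COL["trip_distance"]]` would select the same column as `taxiarray[:, 7]`.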

_Knowledge learned_:
-  read a text file into an ndarray
-  retrieve data from ndarrays
-  change dimensions
-  add rows or columns
-  change print style
-  sort values
-  modify values
-  remove bad data


In [1]:
# The solution using the csv module takes about 10 more lines: read the dataset, then convert all the values to float
#import csv
#taxi_list_header = list(csv.reader(open('nyc_taxis.csv','r')))
#taxi_list = taxi_list_header[1:]
#converted_list =[]                    # convert all the values to float.  every value in ndarray must be of the same type.
#for row in taxi_list:
#    converted_row = []
#    for item in row:
#        converted_row.append(float(item))
#    converted_list.append(converted_row)
#taxiarray = np.array(converted_list)  # convert list of lists to ndarray using np.array

import numpy as np
taxiarray = np.genfromtxt('nyc_taxis.csv',delimiter=',',skip_header=1)  # np.genfromtxt reads a delimited text file straight into an ndarray; skip_header=1 drops the header row

np.set_printoptions(suppress=True, precision=2)    # np.set_printoptions(suppress=True or precision=#)
print(taxiarray)

[[2016.      1.      1.   ...   11.65   69.99    1.  ]
 [2016.      1.      1.   ...    8.     54.3     1.  ]
 [2016.      1.      1.   ...    0.     37.8     2.  ]
 ...
 [2016.      6.     30.   ...    5.     63.34    1.  ]
 [2016.      6.     30.   ...    8.95   44.75    1.  ]
 [2016.      6.     30.   ...    0.     54.84    2.  ]]
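
To see why `skip_header=1` matters, here is a small sketch that runs `np.genfromtxt` on an in-memory CSV instead of the real file. The three-column `csv_text` is made-up stand-in data, not taken from nyc_taxis.csv. Without `skip_header`, the header text cannot be parsed as a float and the first row comes back as NaN.

```python
import io
import numpy as np

# Made-up CSV standing in for nyc_taxis.csv (assumption: same comma-delimited layout).
csv_text = (
    "pickup_year,trip_distance,trip_length\n"
    "2016,21.00,2037\n"
    "2016,16.29,1520\n"
)

# skip_header=1 drops the header line before parsing.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

# Without skip_header, the header strings are read as NaN values.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(data.shape, data.dtype)
print(with_header.shape, np.isnan(with_header[0]).all())
```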


In [2]:
# Calculate the average trip speed: trip_distance / trip_length
trip_speed = taxiarray[:,7]/taxiarray[:,8]*3600  # speed unit is miles/hour. Dividing one column by another gives a 1D array, which has a single axis rather than rows and columns
speedmax = trip_speed.max()
speedmin = trip_speed.min()
avespeed = trip_speed.mean()
print(speedmax,speedmin,avespeed)

82800.0 0.0 32.24258580925573
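
The column arithmetic above can be tried on a tiny array. This sketch uses made-up trips (the third one is deliberately implausible) to show that the result is 1D and that a single bad record dominates the maximum and pulls the mean far away from the median.

```python
import numpy as np

# Made-up rows: [trip_distance in miles, trip_length in seconds].
trips = np.array([
    [21.0, 2037.0],
    [ 1.0,  120.0],
    [23.0,    1.0],   # 23 miles in 1 second: clearly a bad record
])

# Element-wise division of two columns; multiplying by 3600 converts miles/second to miles/hour.
mph = trips[:, 0] / trips[:, 1] * 3600

print(mph.shape)                    # a single axis, not (rows, columns)
print(mph.max(), mph.mean())        # both dominated by the bad record
print(np.median(mph))               # the median is much less affected
```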


In [3]:
# The max speed looks wrong. We will add trip_speed to the array as a new column, then sort by it and look at the rows whose speed is too high.

trip_speed_2d = np.expand_dims(trip_speed, axis=1)             # np.expand_dims() turns the 1D array into a single column, shape (n, 1)
combined = np.concatenate([taxiarray,trip_speed_2d],axis = 1)  # np.concatenate([original, addon], axis=1) appends the column
taxisorted = combined[np.argsort(combined[:,15])]              # np.argsort returns the indices that sort column 15 (trip_speed); indexing with them reorders the rows
print(taxisorted)

# All of the rows with implausibly high speeds have the same pickup_location_code and dropoff_location_code. This suggests that the machines recording the data may fall back to the last known GPS signal when they cannot find the location: if a driver starts and finishes a fare quickly, the machine records an accurate time with inaccurate location data.


[[ 2016.       1.       3.   ...    24.84     1.       0.  ]
 [ 2016.       1.      22.   ...    63.34     1.       0.  ]
 [ 2016.       1.      14.   ...    52.8      1.       0.  ]
 ...
 [ 2016.       3.      28.   ...     4.3      2.   32040.  ]
 [ 2016.       2.      13.   ...     3.3      2.   70560.  ]
 [ 2016.       1.      22.   ...     3.3      2.   82800.  ]]
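
The three steps used above (expand to a column, concatenate, sort with `np.argsort`) are easier to follow on a small array. This is a sketch with made-up numbers. Reversing the index array gives a descending sort, which would put the suspicious rows first.

```python
import numpy as np

base = np.array([
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
])
extra = np.array([0.5, 0.1, 0.9])            # 1D, shape (3,)

# A 1D array cannot be concatenated along axis=1 with a 2D array, so make it (3, 1) first.
extra_col = np.expand_dims(extra, axis=1)
combined = np.concatenate([base, extra_col], axis=1)   # shape (3, 3)

# argsort returns row indices in ascending order of the last column.
order = np.argsort(combined[:, 2])
ascending = combined[order]
descending = combined[order[::-1]]           # reverse the indices for largest-first

print(order)
print(descending)
```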


In [4]:
# Remove the bad data: create a new ndarray that only contains the rows whose trip_speed (column index 15, in mph) is less than 100

cleaned_taxi = combined[combined[:,15]<100]
mean_distance = cleaned_taxi[:,7].mean()
mean_length = cleaned_taxi[:,8].mean()
mean_total_amount = cleaned_taxi[:,13].mean()

print('%.2f miles  %.2f seconds  %.2f dollars' %(mean_distance, mean_length, mean_total_amount))


12.67 miles  2239.50 seconds  48.98 dollars
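
As a small sketch of the same filtering idea on made-up rows: the comparison produces a boolean mask, the mask keeps the plausible rows, and `~` inverts it so the rejected rows can be inspected instead of silently discarded.

```python
import numpy as np

# Made-up rows: [trip_distance, trip_length, trip_speed in mph].
rows = np.array([
    [21.0, 2037.0,    37.1],
    [ 1.0,  120.0,    30.0],
    [23.0,    1.0, 82800.0],
])

keep = rows[:, 2] < 100      # boolean mask, one entry per row
cleaned = rows[keep]         # rows with a plausible speed
removed = rows[~keep]        # ~ inverts the mask: the rows that were dropped

print(keep)
print(cleaned.shape, removed.shape)
```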


In [5]:
# Boolean Indexing with NumPy
# How many pickups happened in each month?  The data set only has data of the first 6 months in 2016.

def monthdict(array):                       # this function builds a dictionary of the number of pickups in each month
    pickupdict = {}
    pickupmonth = array[:,1]                # 1. select the pickup_month column
    for num in range(1,7):
        monthbool = pickupmonth==num        # 2. make the boolean array: True for every row where pickupmonth == num
        pickups = pickupmonth[monthbool]    # 3. use the boolean array to filter the selected column. Note: index with [], not ().
        pickupnum = pickups.shape[0]        # 4. for this 1D array, shape[0] is the number of matching values
        pickupdict[num]= pickupnum
    return pickupdict
        
print(monthdict(taxiarray))

{1: 13481, 2: 13333, 3: 15547, 4: 14810, 5: 16650, 6: 15739}
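
A possible alternative to the loop, shown here only as a sketch on a made-up month column: `np.unique` with `return_counts=True` counts every distinct value in one call. This is a different technique from the boolean-indexing approach used in the cell above, and unlike that loop it leaves out months that never occur.

```python
import numpy as np

# Made-up pickup_month column (floats, as np.genfromtxt would produce).
months = np.array([1, 1, 2, 3, 3, 3, 6], dtype=float)

# Distinct values and how often each one appears.
values, counts = np.unique(months, return_counts=True)
per_month = {int(m): int(c) for m, c in zip(values, counts)}

# The same count for one month via a boolean mask: True counts as 1 when summed.
march = int((months == 3).sum())

print(per_month)
print(march)
```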


In [6]:
# Which trips had tips over $50?

tip_amount = taxiarray[:,12]           # 1. select the tip column
tip_bool = tip_amount>50               # 2. make the boolean array: True for every row where tip_amount > 50
top_tips = taxiarray[tip_bool,5:14]    # 3. index taxiarray with the rows where tip_bool is True and the columns at index 5 to 13

print(top_tips)                        # there are 16 trips with tips of more than $50.

[[    4.       2.      21.45  2004.      52.       0.8      0.      52.8
    105.6 ]
 [    3.       4.       9.2   1041.      27.       1.3      5.54    60.
     93.84]
 [    2.       0.      19.8   1671.      52.5      1.3      5.54    59.34
    118.68]
 [    4.       2.      18.42  2968.      52.       0.8      5.54    80.
    138.34]
 [    3.       6.       0.49   158.       3.5      1.8      0.      70.
     75.3 ]
 [    2.       2.       2.7    381.       9.5      0.8      0.      60.
     70.3 ]
 [    3.       4.       9.54  1210.      27.5      0.8      5.54    55.
     88.84]
 [    2.       4.      17.6   3251.      52.       0.8      5.54    65.
    123.34]
 [    4.       2.      38.2   9252.      52.       0.8      5.54    80.
    138.34]
 [    4.       2.      18.    2276.       0.01     0.3      5.54    62.
     67.85]
 [    2.       0.      26.21 17029.     180.5      0.8      5.54   100.
    286.84]
 [    2.       2.       0.      24.       2.5      0.8      0.      58.
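
Combining a boolean row mask with a column slice, as in `taxiarray[tip_bool,5:14]`, can be sketched on a small made-up array. The slice end is exclusive, so `1:4` selects columns 1, 2 and 3.

```python
import numpy as np

# Made-up rows: [trip id, fare_amount, tip_amount, total_amount].
data = np.array([
    [0.0, 52.0, 52.8, 105.6],
    [1.0,  9.5,  2.0,  12.3],
    [2.0,  3.5, 70.0,  75.3],
])

big_tip = data[:, 2] > 50        # row mask on the tip column
subset = data[big_tip, 1:4]      # matching rows, columns 1 to 3 only
n_big = int(big_tip.sum())       # number of matching rows

print(subset)
print(n_big)
```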
 

In [7]:
# Which airport is the most popular destination in the data set?
# the dropoff_location_code column is column index 6

jfk = taxiarray[taxiarray[:,6]==2].shape[0]   # array[array[:,6]==2] keeps whole rows; no column selection is needed because we only count the rows.
lga = taxiarray[taxiarray[:,6]==3].shape[0]
ewr = taxiarray[taxiarray[:,6]==5].shape[0]

print(jfk,lga,ewr)       # LGA is the most popular airport for dropoffs

11832 16602 63
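
The three near-identical lines above could also be written as one dictionary comprehension. This is a sketch on a made-up dropoff column: the airport codes come from the location list at the top, while the `airports` dictionary and the use of `np.count_nonzero` are choices made for this example, not the method used in the cell above.

```python
import numpy as np

# Made-up dropoff_location_code column.
dropoff = np.array([2, 3, 3, 4, 5, 3, 2, 4], dtype=float)

# Airport codes from the location list: 2 = JFK, 3 = LaGuardia, 5 = Newark.
airports = {"JFK": 2, "LGA": 3, "EWR": 5}

# count_nonzero counts the True entries of each boolean mask.
counts = {name: int(np.count_nonzero(dropoff == code)) for name, code in airports.items()}
busiest = max(counts, key=counts.get)

print(counts)
print(busiest)
```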
