### NYC Taxi dataset 

This is for internal use only. It aims to illustrate the use of different functions for selecting and changing values with the NumPy/pandas libraries.

## Read the data

First, load the packages, import the file, and get an overview of the data.

In [33]:
import csv
import numpy as np

# Use a context manager so the file is closed after reading
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# Alternatively, use NumPy's genfromtxt function (skipping the header row)
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)

In [34]:
taxi_list

[['pickup_year',
  'pickup_month',
  'pickup_day',
  'pickup_dayofweek',
  'pickup_time',
  'pickup_location_code',
  'dropoff_location_code',
  'trip_distance',
  'trip_length',
  'fare_amount',
  'fees_amount',
  'tolls_amount',
  'tip_amount',
  'total_amount',
  'payment_type'],
 ['2016',
  '1',
  '1',
  '5',
  '0',
  '2',
  '4',
  '21.00',
  '2037',
  '52.00',
  '0.80',
  '5.54',
  '11.65',
  '69.99',
  '1'],
 ['2016',
  '1',
  '1',
  '5',
  '0',
  '2',
  '1',
  '16.29',
  '1520',
  '45.00',
  '1.30',
  '0.00',
  '8.00',
  '54.30',
  '1'],
 ['2016',
  '1',
  '1',
  '5',
  '0',
  '2',
  '6',
  '12.70',
  '1462',
  '36.50',
  '1.30',
  '0.00',
  '0.00',
  '37.80',
  '2'],
 ['2016',
  '1',
  '1',
  '5',
  '0',
  '2',
  '6',
  '8.70',
  '1210',
  '26.00',
  '1.30',
  '0.00',
  '5.46',
  '32.76',
  '1'],
 ['2016',
  '1',
  '1',
  '5',
  '0',
  '2',
  '6',
  '5.56',
  '759',
  '17.50',
  '1.30',
  '0.00',
  '0.00',
  '18.80',
  '2'],
 ['2016',
  '1',
  '1',
  '5',
  '0',
  '4',
  '2',
 

The result above shows that the data is imported as a list of lists, and that every value is a string.
To prepare the data for further processing, all values need to be converted to float.

In [35]:
# Remove the header row 
taxi_list = taxi_list[1:]

# Change all values to float
converted_data_list = []

for rows in taxi_list:
    each_row = []
    for ele in rows:
        each_row.append(float(ele))
    converted_data_list.append(each_row)

# Convert the list into np ndarray
taxi = np.array(converted_data_list)
taxi

array([[2.016e+03, 1.000e+00, 1.000e+00, ..., 1.165e+01, 6.999e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 8.000e+00, 5.430e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 0.000e+00, 3.780e+01,
        2.000e+00],
       ...,
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 5.000e+00, 6.334e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 8.950e+00, 4.475e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 0.000e+00, 5.484e+01,
        2.000e+00]])
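
The nested loop above can be written more compactly. The following is a minimal sketch on a hypothetical three-column sample (`sample_rows` is made up for illustration and is not the real file); it shows a nested list comprehension and, as an alternative, letting `np.array` perform the string-to-float conversion through its `dtype` argument.

```python
import numpy as np

# Hypothetical sample mimicking the structure of taxi_list (header + string rows)
sample_rows = [
    ["pickup_year", "trip_distance", "fare_amount"],
    ["2016", "21.00", "52.00"],
    ["2016", "16.29", "45.00"],
]

# Drop the header, then convert every element to float in one expression
converted = [[float(value) for value in row] for row in sample_rows[1:]]
sample_array = np.array(converted)

# Alternative: let NumPy convert the strings while building the array
sample_array_alt = np.array(sample_rows[1:], dtype=float)

print(sample_array)
print(sample_array.dtype)
```

Both approaches should produce the same float array; the explicit loop in the cell above remains a reasonable choice when each step needs to be visible.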

## Array shapes

In [36]:
taxi.shape

(89560, 15)

## Selecting and slicing rows and items for ndarrays

To select data from a 2D ndarray, use the syntax ndarray[row_index, column_index], where row_index defines the location along the row axis and column_index defines the location along the column axis.
<br>
<br>
Like lists, array slicing is from the first specified index up to — but not including — the second specified index.
<br>
<br>
Examples are as follows:

In [37]:
row_0 = taxi[0]
rows_391_to_500 = taxi[391:501,]
row_21_column_5 = taxi[21,5]

In [38]:
cols = [1,4,7]
columns_1_4_7 = taxi[:,cols]
row_99_columns_5_to_8 = taxi[99,5:9]
rows_100_to_200_column_14 = taxi[100:201, 14]
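
Because the taxi array is large, the effect of each selection is easier to see on a small array. The sketch below uses a hypothetical 4x5 array (`demo`, not part of the dataset) and applies the same indexing patterns as the cells above.

```python
import numpy as np

# Hypothetical 4x5 array standing in for taxi: values 0..19, filled row by row
demo = np.arange(20).reshape(4, 5)

single_row = demo[1]            # row 1 -> 1D array of 5 values
row_block = demo[1:3]           # rows 1 and 2 (row 3 is excluded)
single_item = demo[2, 3]        # row 2, column 3 -> a single value
picked_cols = demo[:, [0, 2]]   # all rows, columns 0 and 2
col_slice = demo[3, 1:4]        # row 3, columns 1 to 3

print(single_row.shape, row_block.shape, picked_cols.shape)
print(single_item)   # 13
print(col_slice)     # [16 17 18]
```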

## Vectorized operations

Instead of a for loop, we can use vectorized operations to do calculations. Take the example of adding two columns of data: with the data in a list of lists, we would have to write a for loop and add each pair of values from each row individually.
<br>
<br>
With an ndarray, we can perform the same task with a single vectorized operation:

In [39]:
fare_amount = taxi[:,9]
fees_amount = taxi[:,10]

fare_and_fees = fare_amount + fees_amount

In [40]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

trip_mph = trip_distance_miles/trip_length_hours
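
To make the contrast concrete, here is a small side-by-side sketch. The three fare/fee pairs are hypothetical values chosen for illustration; the loop version works on a list of lists, the vectorized version on an ndarray.

```python
import numpy as np

# Hypothetical fare and fee values for three rides
rows = [[52.0, 0.8], [45.0, 1.3], [36.5, 1.3]]

# List-of-lists approach: add each pair inside an explicit loop
loop_sums = []
for fare, fee in rows:
    loop_sums.append(fare + fee)

# ndarray approach: one vectorized addition over whole columns
arr = np.array(rows)
vector_sums = arr[:, 0] + arr[:, 1]

print(loop_sums)
print(vector_sums)
```

Both versions give the same sums; the vectorized form is shorter and leaves the element-by-element work to NumPy.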

## Statistics for 1D array 

To calculate the minimum value of a 1D ndarray, we use the vectorized ndarray.min() method, like so:

In [41]:
mph_min = trip_mph.min()
mph_max = trip_mph.max()
mph_mean = trip_mph.mean()
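
One caveat when summarising a computed column such as `trip_mph`: if any trip has a duration of zero, the division yields `inf`, which then dominates `max()` and `mean()`. Whether the real file contains such rows is not checked here; the sketch below uses hypothetical values to show one way of excluding non-finite results with `np.isfinite` before computing statistics.

```python
import numpy as np

# Hypothetical distances (miles) and durations (seconds); the last ride lasts 0 seconds
distance = np.array([10.0, 4.0, 2.0])
seconds = np.array([1800.0, 900.0, 0.0])

# Division by zero produces inf and a RuntimeWarning; silence the warning for this demo
with np.errstate(divide="ignore", invalid="ignore"):
    mph = distance / (seconds / 3600)

# Keep only finite speeds before summarising
finite_mph = mph[np.isfinite(mph)]

print(mph.max())   # inf
print(finite_mph.min(), finite_mph.max(), finite_mph.mean())
```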

## Statistics for 2D array 

Next, we'll calculate statistics for 2D ndarrays. If we use the ndarray.max() method on a 2D ndarray without any additional parameters, it will return a single value, just like with a 1D array.
<br>
<br>
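But what if we wanted to find the maximum value of each row? We'd need to use the axis parameter and specify a value of 1 to indicate we want to calculate the maximum value for each row. The same axis parameter works with other methods such as ndarray.sum(); the cell below uses it to check that the fare components of each ride add up to the recorded total.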

In [42]:
# we'll compare against the first 5 rows only
taxi_first_five = taxi[:5]
# select these columns: fare_amount, fees_amount, tolls_amount, tip_amount
fare_components = taxi_first_five[:,9:13]

fare_sums = fare_components.sum(axis=1)
fare_totals = taxi_first_five[:,13]
print(fare_totals)
print(fare_sums)

[69.99 54.3  37.8  32.76 18.8 ]
[69.99 54.3  37.8  32.76 18.8 ]
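
The effect of the axis parameter on `max()` can be seen on a small array. This is a minimal sketch with a hypothetical 2x3 array (`demo`): `axis=1` collapses the columns and returns one value per row, while `axis=0` collapses the rows and returns one value per column.

```python
import numpy as np

# Hypothetical 2x3 array
demo = np.array([[1, 9, 3],
                 [7, 2, 8]])

overall_max = demo.max()      # single value across the whole array
row_max = demo.max(axis=1)    # one value per row
col_max = demo.max(axis=0)    # one value per column

print(overall_max)   # 9
print(row_max)       # [9 8]
print(col_max)       # [7 9 8]
```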


## Reading CSV files with NumPy 

We can also use the numpy.genfromtxt() function to read the nyc_taxis.csv file straight into an ndarray, which imports the data much more quickly and efficiently than the csv-based method used at the start of this notebook.

In [43]:
import numpy as np

taxi = np.genfromtxt('nyc_taxis.csv', delimiter=",")
taxi_shape = taxi.shape
print(taxi_shape)

(89561, 15)


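The shape has one more row than before because the header row was read as well. NumPy ndarrays can contain only one datatype, so numpy.genfromtxt() tried to convert the header strings to floats and stored them as nan (not a number).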
<br>
<br>
To leave the header row out, we can pass an additional parameter, skip_header, to the numpy.genfromtxt() function. The skip_header parameter accepts an integer, the number of rows from the start of the file to skip. Note that because this integer is the number of rows and not an index, skipping the first row requires a value of 1, not 0.

In [44]:
taxi = np.genfromtxt('nyc_taxis.csv',  delimiter=',', skip_header = 1)
taxi_shape = taxi.shape
print(taxi_shape)

(89560, 15)
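
The header behaviour can be reproduced without the CSV file. The sketch below feeds a hypothetical two-column CSV string (`csv_text`, invented for illustration) to `np.genfromtxt` through `io.StringIO`, once with and once without `skip_header`.

```python
import io
import numpy as np

# Hypothetical two-column CSV held in memory instead of a file on disk
csv_text = "fare_amount,tip_amount\n52.0,11.65\n45.0,8.0\n"

with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header.shape)      # (3, 2)
print(with_header[0])         # [nan nan]
print(without_header.shape)   # (2, 2)
```

The header strings cannot be parsed as floats, so they show up as a row of nan unless that row is skipped.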


## Boolean indexing 

The boolean array acts as a filter, so that the values corresponding to True become part of the result and the values corresponding to False are removed.

In [45]:
pickup_month = taxi[:,1]

february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0]
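
A small sketch of the same idea, using a hypothetical `months` array: the comparison produces a boolean array, indexing with it keeps only the True positions, and since True counts as 1, `sum()` on the boolean array is an alternative way to count matches.

```python
import numpy as np

# Hypothetical pickup months for five rides
months = np.array([1, 2, 2, 3, 2])

is_february = months == 2          # element-wise comparison -> boolean array
february_only = months[is_february]
february_count = is_february.sum() # True counts as 1

print(is_february)      # [False  True  True False  True]
print(february_only)    # [2 2 2]
print(february_count)   # 3
```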

## Boolean indexing with 2D ndarrays 

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods shown earlier. The only limitation is that the boolean array must have the same length as the dimension you're indexing.

In [46]:
tip_amount = taxi[:,12]

tip_bool = (tip_amount > 50)
top_tips = taxi[tip_bool, 5:14]
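
The length requirement can be illustrated on a hypothetical 3x3 array (`demo`). A mask with one entry per row filters rows, a mask with one entry per column filters columns, and a mask of the wrong length is rejected by NumPy with an `IndexError`.

```python
import numpy as np

# Hypothetical 3x3 array; the last column plays the role of a tip amount
demo = np.array([[1.0, 10.0, 60.0],
                 [2.0, 20.0,  5.0],
                 [3.0, 30.0, 75.0]])

row_mask = demo[:, 2] > 50                 # one entry per row
big_rows = demo[row_mask, :2]              # matching rows, first two columns

col_mask = np.array([True, False, True])   # one entry per column
some_cols = demo[:, col_mask]

# A mask whose length does not match the indexed dimension is an error
mismatch_raised = False
try:
    demo[np.array([True, False])]
except IndexError as err:
    mismatch_raised = True
    print("IndexError:", err)

print(big_rows)
print(some_cols.shape)
```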

## Assigning values in ndarrays  

So far, we've learned how to retrieve data from ndarrays. Next, we'll use the same indexing techniques we've already learned to modify values within an ndarray. 

In [47]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

taxi_modified[28214,5] = 1
taxi_modified[:,0] = 16
taxi_modified[1800:1802, 7] = taxi_modified[:,7].mean()

In [48]:
taxi_copy = taxi.copy()

total_amount = taxi_copy[:,13]
total_amount[total_amount < 0] = 0 
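
In the cell above, `total_amount` is a slice of `taxi_copy`, and a basic slice is a view on the same memory, so assigning through it also changes `taxi_copy` (while the original `taxi` stays untouched thanks to `.copy()`). The sketch below demonstrates this on a hypothetical 3x2 array.

```python
import numpy as np

# Hypothetical array with some negative amounts in column 1
demo = np.array([[1.0, -5.0],
                 [2.0,  7.0],
                 [3.0, -1.0]])
demo_copy = demo.copy()

amounts = demo_copy[:, 1]     # basic slice: a view, not a new array
amounts[amounts < 0] = 0      # in-place assignment through the view

print(np.shares_memory(amounts, demo_copy))   # True
print(demo_copy[:, 1])   # [0. 7. 0.]
print(demo[:, 1])        # [-5.  7. -1.]
```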

Another example is using a boolean array built from one column to assign values in a different column. For example, if the value in column 5 equals 2, then column 15 is set to 1, and so on. The syntax is as follows:

In [49]:
# create a new column filled with `0`.
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
print(taxi_modified)

taxi_modified[ taxi_modified[:,5] == 0, 15] = 0
taxi_modified[ taxi_modified[:,5] == 2, 15] = 1
taxi_modified[ taxi_modified[:,5] == 3, 15] = 1
taxi_modified[ taxi_modified[:,5] == 5, 15] = 1

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]
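
When several values should map to the same result, the repeated assignments can be collapsed into one with `np.isin`, which tests membership element-wise. This is a sketch on a hypothetical two-column array (`demo`), not a replacement for the cell above.

```python
import numpy as np

# Hypothetical array: column 0 holds a location code, column 1 is the new flag column
demo = np.array([[0.0, 0.0],
                 [2.0, 0.0],
                 [3.0, 0.0],
                 [4.0, 0.0],
                 [5.0, 0.0]])

# True where the code is one of 2, 3 or 5
flagged = np.isin(demo[:, 0], [2, 3, 5])
demo[flagged, 1] = 1

print(demo[:, 1])   # [0. 1. 1. 0. 1.]
```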


## Reference 

Some investigations and manipulations that can be carried out:

#### 1. For those rides with more than 50 US Dollar tips, how much was the total fee?

Here, we set a condition that keeps only the records with tips over 50 USD, and select the columns relevant to those records.

In [50]:
tip_amount = taxi[:,12]

tip_bool = (tip_amount > 50)
# Select the rows that meet the condition above, and the columns that contain the fee information
top_tips = taxi[tip_bool, 5:14]

top_tips

array([[4.0000e+00, 2.0000e+00, 2.1450e+01, 2.0040e+03, 5.2000e+01,
        8.0000e-01, 0.0000e+00, 5.2800e+01, 1.0560e+02],
       [3.0000e+00, 4.0000e+00, 9.2000e+00, 1.0410e+03, 2.7000e+01,
        1.3000e+00, 5.5400e+00, 6.0000e+01, 9.3840e+01],
       [2.0000e+00, 0.0000e+00, 1.9800e+01, 1.6710e+03, 5.2500e+01,
        1.3000e+00, 5.5400e+00, 5.9340e+01, 1.1868e+02],
       [4.0000e+00, 2.0000e+00, 1.8420e+01, 2.9680e+03, 5.2000e+01,
        8.0000e-01, 5.5400e+00, 8.0000e+01, 1.3834e+02],
       [3.0000e+00, 6.0000e+00, 4.9000e-01, 1.5800e+02, 3.5000e+00,
        1.8000e+00, 0.0000e+00, 7.0000e+01, 7.5300e+01],
       [2.0000e+00, 2.0000e+00, 2.7000e+00, 3.8100e+02, 9.5000e+00,
        8.0000e-01, 0.0000e+00, 6.0000e+01, 7.0300e+01],
       [3.0000e+00, 4.0000e+00, 9.5400e+00, 1.2100e+03, 2.7500e+01,
        8.0000e-01, 5.5400e+00, 5.5000e+01, 8.8840e+01],
       [2.0000e+00, 4.0000e+00, 1.7600e+01, 3.2510e+03, 5.2000e+01,
        8.0000e-01, 5.5400e+00, 6.5000e+01, 1.2334e+02],


#### 2. If there are empty values, assign a replacement such as the column mean

In [51]:
# Avoid changing the source data 
taxi_modified = taxi.copy()

# Change/manipulate the values
taxi_modified[:,0] = 16
taxi_modified[1800:1802, 7] = taxi_modified[:,7].mean()
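
The cell above assigns the mean to a fixed row range. If the missing entries are encoded as nan (an assumption; the taxi file has not been checked for this), they can be located with `np.isnan` and filled with `np.nanmean`, which ignores nan when averaging. A minimal sketch on a hypothetical column:

```python
import numpy as np

# Hypothetical distance column with two missing entries encoded as nan
distance = np.array([2.0, np.nan, 4.0, np.nan, 6.0])

missing = np.isnan(distance)          # True where the value is missing
column_mean = np.nanmean(distance)    # mean of the non-missing values

filled = distance.copy()              # keep the source data unchanged
filled[missing] = column_mean

print(column_mean)   # 4.0
print(filled)        # [2. 4. 4. 4. 6.]
```

Note that the plain `mean()` method returns nan as soon as one element is nan, which is why `np.nanmean` is used here.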

#### 3. Replace values in a certain column under certain conditions

For example, when the value in column 5 is 0, set the value in that same column to 1.

The same condition can also be applied to another column: when the value in column 5 is 0, set the value in another column, e.g. column 15, to 1.

In [52]:
# Example: can add a new column and assign the value of 0 
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi_modified, zeros], axis=1)
print(taxi_modified)

[[16.    1.    1.   ... 69.99  1.    0.  ]
 [16.    1.    1.   ... 54.3   1.    0.  ]
 [16.    1.    1.   ... 37.8   2.    0.  ]
 ...
 [16.    6.   30.   ... 63.34  1.    0.  ]
 [16.    6.   30.   ... 44.75  1.    0.  ]
 [16.    6.   30.   ... 54.84  2.    0.  ]]


In [53]:
# Replace the value in the same column e.g value in column 5 
taxi_modified[ taxi_modified[:,5] == 0, 5] = 1

In [54]:
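# Replace the value in another column, e.g. column 15.
# Note: after the previous cell, column 5 no longer contains 0, so this mask
# matches no rows; build the mask before modifying column 5 if both are needed.
taxi_modified[taxi_modified[:,5] == 0, 15] = 1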

#### 4. Select rows in an ndarray with conditions

To select a subset that meets a condition, e.g. only the rows where column 6 == 2:

In [55]:
jfk = taxi[taxi[:,6] == 2]
jfk_count = jfk.shape[0]
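
Conditions can also be combined. The sketch below uses a hypothetical four-row array (`demo`) with a location code and a tip amount; boolean arrays are combined with `&` (and), `|` (or) and `~` (not), and each comparison needs its own parentheses because these operators bind more tightly than `==` and `>`.

```python
import numpy as np

# Hypothetical columns: dropoff location code, tip amount
demo = np.array([[2.0, 60.0],
                 [2.0,  5.0],
                 [3.0, 70.0],
                 [6.0,  0.0]])
code = demo[:, 0]
tip = demo[:, 1]

both = demo[(code == 2) & (tip > 50)]     # code 2 AND tip over 50
either = demo[(code == 2) | (tip > 50)]   # code 2 OR tip over 50
not_code_2 = demo[~(code == 2)]           # every row whose code is not 2

print(both.shape[0], either.shape[0], not_code_2.shape[0])   # 1 3 2
```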