# NumPy Exploration

## New York City
We'll only work with a subset of this data - approximately 90,000 yellow taxi trips to and from New York City airports between January and June 2016. Below is information about selected columns from the data set:

1. pickup_year: The year of the trip.
2. pickup_month: The month of the trip (January is 1, December is 12).
3. pickup_day: The day of the month of the trip.
4. pickup_location_code: The airport or borough where the trip started.
5. dropoff_location_code: The airport or borough where the trip finished.
6. trip_distance: The distance of the trip in miles.
7. trip_length: The length of the trip in seconds.
8. fare_amount: The base fare of the trip, in dollars.
9. total_amount: The total amount charged to the passenger, including all fees, tolls and tips.

Now that we understand NumPy a little better, let's learn how to use the numpy.genfromtxt() function to read files into NumPy ndarrays. Here is the simplified syntax for the function, and an explanation of the two parameters:

    np.genfromtxt(filename, delimiter=None)

- filename: A positional argument, usually a string representing the path to the text file to be read.
- delimiter: A named argument, specifying the string used to separate each value.

In this case, because we have a CSV file, the delimiter is a comma. Here's how we'd read in a file named data.csv:

    data = np.genfromtxt('data.csv', delimiter=',')

In [3]:
import numpy as np

# read the CSV file, using a comma as the delimiter
taxi = np.genfromtxt("nyc_taxis.csv", delimiter=",")
taxi

# check the shape
taxi_shape = taxi.shape
taxi_shape

(89561, 15)

NumPy ndarrays can contain only one datatype.

We can use the ndarray.dtype attribute to see the internal datatype that has been used.

In [4]:
taxi.dtype

dtype('float64')

NumPy chose the float64 type, since it will allow most of the values from our CSV to be read. You can think of NumPy's float64 type as being identical to Python's float type (the "64" refers to the number of bits used to store the underlying value).
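To make the "one datatype per ndarray" rule concrete, here is a minimal sketch using small made-up arrays (the names `mixed_numbers` and `mixed_text` are ours, not part of the taxi data). It shows how NumPy picks a single type that can hold every value:

```python
import numpy as np

# Integers mixed with a float are all stored as float64.
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)      # float64

# Numbers mixed with a string are all stored as strings.
mixed_text = np.array([1, 2, "three"])
print(mixed_text.dtype.kind)    # U (a Unicode string dtype)
```

This is why a CSV containing a text header cannot be read cleanly into a float64 array.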

If we review the output of taxi, we can see that almost every value is a number. The exception is nan.

NaN is most commonly seen when a value is missing, but in this case, we have NaN values because the first line from our CSV file contains the names of each column. NumPy is unable to convert string values like pickup_year into the float64 data type.

For now, we need to remove this header row from our ndarray. We can do this the same way we would if our data was stored in a list of lists:

    taxi = taxi[1:]
Alternatively, we can pass an additional parameter, skip_header, to the numpy.genfromtxt() function. The skip_header parameter accepts an integer, the number of rows from the start of the file to skip. Note that because this integer should be the number of rows and not the index, skipping the first row would require a value of 1, not 0.

In [6]:
# use np.genfromtxt() again, this time with skip_header

taxi = np.genfromtxt("nyc_taxis.csv", delimiter=",", skip_header=1)

taxi_shape = taxi.shape
taxi_shape

(89560, 15)
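The header behaviour can be reproduced without the real file. The sketch below assumes a tiny two-column CSV held in memory (the `csv_text` string is a hypothetical stand-in for nyc_taxis.csv) and passes it to np.genfromtxt() through io.StringIO:

```python
import io
import numpy as np

# Hypothetical miniature CSV standing in for nyc_taxis.csv.
csv_text = "pickup_year,pickup_month\n2016,1\n2016,2\n"

# Without skip_header, the header row is read as NaN values.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(with_header.shape)           # (3, 2)
print(np.isnan(with_header[0]))    # [ True  True]

# skip_header=1 drops the first line before parsing.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(without_header.shape)        # (2, 2)
```

The row count drops by one, just as the taxi shape went from 89561 to 89560 rows.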

## Boolean operations on ndarray
Now, let's look at what happens when we perform a boolean operation between an ndarray and a single value:

    print(np.array([2,4,6,8]) < 5)
    [ True  True False False]
A similar pattern occurs – each value in the array is compared to five. If the value is less than five, True is returned. Otherwise, False is returned.

![img](https://s3.amazonaws.com/dq-content/290/vectorized_bool.svg)

In [8]:
# check which values in a are less than 3
a = np.array([1, 2, 3, 4, 5])

a_bool = a < 3
a_bool

array([ True,  True, False, False, False])

In [9]:
# check which values in b are equal to 'blue'
b = np.array(["blue", "blue", "red", "blue"])

b_bool = b == 'blue'
b_bool

array([ True,  True, False,  True])

## Boolean Indexing
To index using our new boolean array, we simply insert it in the square brackets, just like we would do with our other selection techniques:

![img](https://s3.amazonaws.com/dq-content/290/1d_bool_2.svg)

The boolean array acts as a filter, so that the values corresponding to True become part of the result and the values corresponding to False are removed.
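As a minimal sketch of this filtering behaviour, using made-up values (the arrays `c` and `c_bool` are hypothetical, not taken from the taxi data):

```python
import numpy as np

c = np.array([80.0, 70.0, 60.0, 90.0])          # hypothetical values
c_bool = np.array([True, True, False, True])    # one flag per value

# Only the positions marked True survive.
result = c[c_bool]
print(result)   # [80. 70. 90.]
```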

Let's use boolean indexing to confirm the number of taxi rides in our data set from the month of January. First, let's select just the pickup_month column, which is the second column in the ndarray:

In [11]:
# pickup_month

pickup_month = taxi[:,1]

Next, we use a boolean operation to make a boolean array, where the value 1 corresponds to January:



In [12]:
# january

january_bool = pickup_month == 1

Then we use the new boolean array to select only the items from pickup_month that have a value of 1:

In [13]:
january = pickup_month[january_bool]

Finally, we use the .shape attribute to find out how many items are in our january ndarray, which is equal to the number of taxi rides from the month of January. We'll use [0] to extract the value from the tuple returned by .shape:

In [14]:
january_rides = january.shape[0]
print(january_rides)

13481


There are 13,481 rides in our dataset from the month of January.
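Selecting the values and reading .shape is one way to count matches. As a hedged alternative, the boolean array can also be counted directly, since True is treated as 1 when summed. The sketch below uses a small hypothetical `months` array rather than the real pickup_month column:

```python
import numpy as np

months = np.array([1, 1, 2, 1, 3, 2])   # hypothetical pickup_month values
january_mask = months == 1

count_via_shape = months[january_mask].shape[0]
count_via_sum = january_mask.sum()                      # True counts as 1
count_via_count_nonzero = np.count_nonzero(january_mask)
print(count_via_shape, count_via_sum, count_via_count_nonzero)   # 3 3 3
```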

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned in the previous mission. The only limitation is that the boolean array must have the same length as the dimension you're indexing. Let's look at some examples:

![img](https://s3.amazonaws.com/dq-content/290/bool_dims_updated.svg)
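Here is a small sketch of the length rule in code. It assumes a made-up 4x3 array named `arr`; the masks are also hypothetical. A mask whose length does not match the dimension being indexed raises an IndexError in current NumPy versions:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])                       # shape (4, 3)

row_mask = np.array([True, False, True, False])     # length 4, matches rows
col_mask = np.array([False, True, True])            # length 3, matches columns

rows = arr[row_mask]       # keeps rows 0 and 2
cols = arr[:, col_mask]    # keeps columns 1 and 2

# A length-3 mask cannot index the 4-row dimension.
try:
    arr[col_mask]
    mismatch_error = None
except IndexError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```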

Let's verify if there are any issues with the data. Recall that we calculated the average travel speed as follows:

In [16]:
# calculate the average speed
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
trip_mph

array([37.11340206, 38.58157895, 31.27222982, ..., 22.29907867,
       42.41551247, 36.90473407])

Next, we'll check for trips with an average speed greater than 20,000 mph:

In [17]:
# create a boolean array for trips with average
# speeds greater than 20,000 mph

trip_mph_bool = trip_mph > 20000

# use the boolean array to select the matching rows, keeping the
# pickup_location, dropoff_location, trip_distance, and trip_length columns

trips_over_20000_mph = taxi[trip_mph_bool, 5:9]
trips_over_20000_mph

array([[ 2. ,  2. , 23. ,  1. ],
       [ 2. ,  2. , 19.6,  1. ],
       [ 2. ,  2. , 16.7,  2. ],
       [ 3. ,  3. , 17.8,  2. ],
       [ 2. ,  2. , 17.2,  2. ],
       [ 3. ,  3. , 16.9,  3. ],
       [ 2. ,  2. , 27.1,  4. ]])

We can see from the last column that these are all very short rides: every trip_length value is 4 seconds or less, which cannot be reconciled with the trip distances, all of which are more than 16 miles.
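To see why these rows are implausible, we can redo the speed arithmetic on a few of them. This sketch copies three distance and length pairs from the output above into small arrays (the array names are ours). A 23-mile trip recorded as lasting 1 second works out to roughly 82,800 mph:

```python
import numpy as np

# trip_distance (miles) and trip_length (seconds) for three of the rows above
distance_miles = np.array([23.0, 19.6, 27.1])
length_seconds = np.array([1.0, 1.0, 4.0])

# same formula as trip_mph: miles divided by hours
mph = distance_miles / (length_seconds / 3600)
print(mph)
print((mph > 20000).all())   # True
```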

In [19]:
# create a boolean array for tips with a value over 50

# isolate the tip_amount column
tip_amount = taxi[:,12]

# create the boolean array
tip_bool = tip_amount > 50

# select the matching rows
top_tips = taxi[tip_bool, 5:14]
top_tips

array([[4.0000e+00, 2.0000e+00, 2.1450e+01, 2.0040e+03, 5.2000e+01,
        8.0000e-01, 0.0000e+00, 5.2800e+01, 1.0560e+02],
       [3.0000e+00, 4.0000e+00, 9.2000e+00, 1.0410e+03, 2.7000e+01,
        1.3000e+00, 5.5400e+00, 6.0000e+01, 9.3840e+01],
       [2.0000e+00, 0.0000e+00, 1.9800e+01, 1.6710e+03, 5.2500e+01,
        1.3000e+00, 5.5400e+00, 5.9340e+01, 1.1868e+02],
       [4.0000e+00, 2.0000e+00, 1.8420e+01, 2.9680e+03, 5.2000e+01,
        8.0000e-01, 5.5400e+00, 8.0000e+01, 1.3834e+02],
       [3.0000e+00, 6.0000e+00, 4.9000e-01, 1.5800e+02, 3.5000e+00,
        1.8000e+00, 0.0000e+00, 7.0000e+01, 7.5300e+01],
       [2.0000e+00, 2.0000e+00, 2.7000e+00, 3.8100e+02, 9.5000e+00,
        8.0000e-01, 0.0000e+00, 6.0000e+01, 7.0300e+01],
       [3.0000e+00, 4.0000e+00, 9.5400e+00, 1.2100e+03, 2.7500e+01,
        8.0000e-01, 5.5400e+00, 5.5000e+01, 8.8840e+01],
       [2.0000e+00, 4.0000e+00, 1.7600e+01, 3.2510e+03, 5.2000e+01,
        8.0000e-01, 5.5400e+00, 6.5000e+01, 1.2334e+02],


So far, we've learned how to retrieve data from ndarrays. Next, we'll use the same indexing techniques we've already learned to modify values within an ndarray. The syntax we'll use (in pseudocode) is:

    ndarray[location_of_values] = new_value
Let's take a look at what that looks like in actual code. With our 1D array, we can specify one specific index location:

    a = np.array(['red','blue','black','blue','purple'])
    a[0] = 'orange'
    print(a)
    
    ['orange' 'blue' 'black' 'blue' 'purple']
Or we can assign multiple values at once:

    a[3:] = 'pink'
    print(a)
    ['orange' 'blue' 'black' 'pink' 'pink']
With a 2D ndarray, just like with a 1D ndarray, we can assign one specific index location:

    ones = np.array([[1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1]])
    ones[1,2] = 99
    print(ones)
    
    [[ 1  1  1  1  1]
     [ 1  1 99  1  1]
     [ 1  1  1  1  1]]
We can also assign a whole row...

    ones[0] = 42
    print(ones)
    
    [[42 42 42 42 42]
     [ 1  1 99  1  1]
     [ 1  1  1  1  1]]
...or a whole column:

    ones[:,2] = 0
    print(ones)
    
    [[42 42  0 42 42]
     [ 1  1  0  1  1]
     [ 1  1  0  1  1]]
     
## Solve

In [20]:
# create a copy of the taxi ndarray

taxi_modified = taxi.copy()

1. The value at column index 5 (pickup_location) of row index 28214 is incorrect. Use assignment to change this value to 1 in the taxi_modified ndarray.

In [21]:
taxi_modified[28214,5] = 1

2. The first column (index 0) contains year values as four digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the taxi_modified ndarray.

In [22]:
taxi_modified[:,0] = 16

3. The values at column index 7 (trip_distance) of rows index 1800 and 1801 are incorrect. Use assignment to change these values in the taxi_modified ndarray to the mean value for that column.

In [24]:
taxi_modified[1800:1802, 7] = taxi_modified[:,7].mean()

Boolean arrays become very powerful when we use them for assignment. Let's look at an example:

    a2 = np.array([1, 2, 3, 4, 5])

    a2_bool = a2 > 2

    a2[a2_bool] = 99

    print(a2)
    [ 1  2 99 99 99]
The boolean array controls the values that the assignment applies to, and the other values remain unchanged. Let's look at how this code works:

![img](https://s3.amazonaws.com/dq-content/290/bool_assignment_1.svg)

In [25]:
# create a copy of the taxi dataset

taxi_copy = taxi.copy()

1. Select the fourteenth column (index 13) in taxi_copy. Assign it to a variable named total_amount.

In [26]:
total_amount = taxi_copy[:,13]

2. For rows where the value of total_amount is less than 0, use assignment to change the value to 0.

In [27]:
total_amount[total_amount < 0] = 0
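It is worth noting why assigning through a column variable also changes the parent array: a column selected with basic slicing is a view of that array, not a copy. The sketch below illustrates this with a small hypothetical array named `data`, and uses np.shares_memory() to confirm the two share storage:

```python
import numpy as np

data = np.array([[1.0, -5.0],
                 [2.0, 10.0],
                 [3.0, -1.0]])    # hypothetical stand-in for taxi_copy

last_col = data[:, 1]             # basic slicing returns a view
print(np.shares_memory(data, last_col))   # True

last_col[last_col < 0] = 0        # in-place assignment through the view
print(data[:, 1])                 # [ 0. 10.  0.]
```

This is also the reason for working on a copy made with .copy() first, so the original ndarray stays untouched.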

Next, we'll look at an example of assignment using a boolean array with two dimensions:

![img](https://s3.amazonaws.com/dq-content/290/bool_assignment_2.svg)

The b > 4 boolean operation produces a 2D boolean array which then controls the values that the assignment applies to.
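A minimal sketch of this kind of assignment, assuming a made-up 3x3 array (the values here are ours and may differ from those in the diagram):

```python
import numpy as np

b = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

mask_2d = b > 4        # 2D boolean array with the same shape as b
b[mask_2d] = 99        # only the True positions are overwritten
print(b)
# [[ 1  2  3]
#  [ 4 99 99]
#  [99 99 99]]
```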

We can also use a 1D boolean array to perform assignment on a 2D array:

![img](https://s3.amazonaws.com/dq-content/290/bool_assignment_3.svg)

The c[:,1] > 2 boolean operation compares just one column's values and produces a 1D boolean array. We then use that boolean array as the row index for assignment, and 1 as the column index to specify the second column. Our boolean array is only applied to the second column, while all other values remain unchanged.
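The same idea in code, again as a sketch with made-up values for `c` (the diagram's actual values may differ):

```python
import numpy as np

c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

row_mask = c[:, 1] > 2     # 1D boolean array, one flag per row
c[row_mask, 1] = 99        # rows from the mask, column index 1 only
print(c)
# [[ 1  2  3]
#  [ 4 99  6]
#  [ 7 99  9]]
```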

The pseudocode syntax for this code is as follows, first using an intermediate variable:

    bool_array = array[:, column_for_comparison] == value_for_comparison
    array[bool_array, column_for_assignment] = new_value
and then all in one line:

    array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
    
Create a new copy of our taxi dataset, taxi_modified, with an additional column containing the value 0 for every row.

In [29]:
# create a new column filled with `0`.
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
print(taxi_modified)

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]
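The column of zeros has to be two-dimensional, with one row per row of the original array, for np.concatenate() to join it along axis=1. The sketch below shows this on a small hypothetical array named `small`, including what happens if a 1D array is passed instead:

```python
import numpy as np

small = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])                    # hypothetical 3x2 data

zeros_col = np.zeros((small.shape[0], 1))        # shape (3, 1)
widened = np.concatenate([small, zeros_col], axis=1)
print(widened.shape)                             # (3, 3)

# A 1D array of zeros has the wrong number of dimensions.
try:
    np.concatenate([small, np.zeros(small.shape[0])], axis=1)
    concat_error = None
except ValueError as exc:
    concat_error = exc
print(type(concat_error).__name__)
```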


### Solve
In our new column at index 15, assign the value 1 if the pickup_location_code (column index 5) corresponds to an airport location, leaving the value as 0 otherwise by performing these three operations:
1. For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
2. For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
3. For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.

In [30]:
# 1

taxi_modified[taxi_modified[:,5] == 2, 15] = 1

In [31]:
# 2

taxi_modified[taxi_modified[:,5] == 3, 15] = 1

In [32]:
# 3

taxi_modified[taxi_modified[:,5] == 5, 15] = 1
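The three assignments can also be collapsed into one. This is not part of the exercise, just a hedged alternative using np.isin(), which tests each value for membership in a list. The sketch uses a small hypothetical two-column array named `trips` instead of taxi_modified:

```python
import numpy as np

# columns: [pickup_location_code, airport flag] (hypothetical data)
trips = np.array([[2.0, 0.0],
                  [4.0, 0.0],
                  [3.0, 0.0],
                  [5.0, 0.0],
                  [0.0, 0.0]])

airport_codes = [2, 3, 5]     # JFK, LaGuardia, Newark
is_airport = np.isin(trips[:, 0], airport_codes)
trips[is_airport, 1] = 1
print(trips[:, 1])            # [1. 0. 1. 1. 0.]
```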

### Challenge part 1
To complete this task, we'll need to check if the dropoff_location_code column (column index 6) is equal to one of the following values:

- 2: JFK Airport
- 3: LaGuardia Airport
- 5: Newark Airport

#### Problem 1
Using the original taxi ndarray, calculate how many trips had JFK Airport as their destination:
- Use boolean indexing to select only the rows where the dropoff_location_code column (column index 6) has a value that corresponds to JFK. Assign the result to jfk.
- Calculate how many rows are in the new jfk array and assign the result to jfk_count.

In [42]:
jfk = taxi[taxi[:,6] == 2]
jfk_count = jfk.shape[0]
jfk_count

11832

In [43]:
laguardia = taxi[taxi[:,6] == 3]
laguardia_count = laguardia.shape[0]
laguardia_count

16602

In [44]:
newark = taxi[taxi[:,6] == 5]
newark_count = newark.shape[0]
newark_count

63

Our calculations show that LaGuardia is the most common airport for dropoffs in our data set.
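Counting one code at a time works well for three airports. As a sketch of an alternative, np.unique() with return_counts=True tallies every code in a single call. The `dropoff_codes` array below is hypothetical sample data, not the real column:

```python
import numpy as np

dropoff_codes = np.array([2, 3, 3, 5, 2, 3, 0])   # hypothetical codes

codes, counts = np.unique(dropoff_codes, return_counts=True)
count_by_code = dict(zip(codes.tolist(), counts.tolist()))
print(count_by_code)   # {0: 1, 2: 2, 3: 3, 5: 1}
```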

### Challenge Part 2
Our second and final challenge involves removing potentially bad data from our data set, and then calculating some descriptive statistics on the remaining "clean" data.

We'll start by using boolean indexing to remove any rows with an average trip speed greater than 100 mph (160 kph), which should remove the questionable data we have worked with over the past two missions. Then, we'll use array methods to calculate the mean for specific columns of the remaining data. The columns we're interested in are:

1. trip_distance, at column index 7
2. trip_length, at column index 8
3. total_amount, at column index 13

Instructions:
- Create a new ndarray, cleaned_taxi, containing only rows for which the values of trip_mph are less than 100.
- Calculate the mean of the trip_distance column of cleaned_taxi. Assign the result to mean_distance.
- Calculate the mean of the trip_length column of cleaned_taxi. Assign the result to mean_length.
- Calculate the mean of the total_amount column of cleaned_taxi. Assign the result to mean_total_amount.

In [59]:
# trip_mph formula
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

In [63]:
cleaned_taxi = taxi[trip_mph < 100]

In [64]:
mean_distance = cleaned_taxi[:,7].mean()

In [65]:
mean_length = cleaned_taxi[:,8].mean()

In [66]:
mean_total_amount = cleaned_taxi[:,13].mean()
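The whole clean-then-summarise workflow can be checked on a tiny example. The sketch below assumes a hypothetical three-column array named `trips` (the column positions differ from the real taxi data) and, as a variation on the per-column calls above, uses mean(axis=0) to get all the column means at once:

```python
import numpy as np

# columns: [trip_distance (miles), trip_length (seconds), total_amount]
trips = np.array([[10.0, 1200.0, 40.0],
                  [20.0, 2400.0, 60.0],
                  [23.0,    1.0, 70.0]])    # last row is implausibly fast

mph = trips[:, 0] / (trips[:, 1] / 3600)
cleaned = trips[mph < 100]                  # drops the last row

column_means = cleaned.mean(axis=0)         # one mean per column
print(cleaned.shape)                        # (2, 3)
print(column_means)
```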

# Syntax

## READING CSV FILES WITH NUMPY
Reading in a CSV file:

    import numpy as np
    taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)

## BOOLEAN ARRAYS
Creating a Boolean array from filtering criteria:

    np.array([2,4,6,8]) < 5

Boolean filtering for 1D ndarray:

    a = np.array([2,4,6,8])
    a_bool = a < 5
    a[a_bool]

Boolean filtering for 2D ndarray:

    tip_amount = taxi[:,12]
    tip_bool = tip_amount > 50
    top_tips = taxi[tip_bool, 5:14]

## ASSIGNING VALUES
Assigning values in a 2D ndarray using indices:

    taxi[28214,5] = 1
    taxi[:,0] = 16
    taxi[1800:1802,7] = taxi[:,7].mean()

Assigning values using Boolean arrays:

    taxi_modified[taxi_modified[:, 5] == 2, 15] = 1

# Concepts
Selecting values from an ndarray using Boolean arrays is very powerful. Using Boolean arrays helps us think in terms of filters on the data, instead of specific index values (like we did when working with Python lists).

# Resources
1. [Reading a CSV file into NumPy](https://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt)
2. [Indexing and selecting data](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html)