# *Introduction To NumPy*

**Here, we'll analyze taxi trip data released by the city of New York. The city publishes data on taxis and for-hire vehicles on the <a href="http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml">Taxi and Limousine Commission (TLC) website</a>. The data covers over 1.3 billion individual trips, reaches back as far as 2009, and is regularly updated.**

### Data Dictionary

> <font color=blue>***pickup_year***</font> - The year of the trip.<br>
> <font color=blue>***pickup_month***</font> - The month of the trip (January is 1, December is 12).<br>
> <font color=blue>***pickup_day***</font> - The day of the month of the trip.<br>
> <font color=blue>***pickup_location_code***</font> - The airport or borough where the trip started, as one of eight categories:<br>
> <font color=blue>***0 - Bronx.***</font><br>
> <font color=blue>***1 - Brooklyn.***</font><br>
> <font color=blue>***2 - JFK Airport.***</font><br>
> <font color=blue>***3 - LaGuardia Airport.***</font><br>
> <font color=blue>***4 - Manhattan.***</font><br>
> <font color=blue>***5 - Newark Airport.***</font><br>
> <font color=blue>***6 - Queens.***</font><br>
> <font color=blue>***7 - Staten Island.***</font><br>
> <font color=blue>***dropoff_location_code***</font> - The airport or borough where the trip finished, using the same eight category codes as pickup_location_code.<br>
> <font color=blue>***trip_distance***</font> - The distance of the trip in miles.<br>
> <font color=blue>***trip_length***</font> - The length of the trip in seconds.<br>
> <font color=blue>***fare_amount***</font> - The base fare of the trip, in dollars.<br>
> <font color=blue>***total_amount***</font> - The total amount charged to the passenger, including all fees, tolls and tips.<br>

**We have randomly sampled approximately 90,000 trips for our analysis, representing about one fiftieth of the trips in the six-month period from January to June 2016. Our data is stored in a CSV file called nyc_taxis.csv.**

### NYC Taxi-Airport Data
We have imported numpy, and used Python's csv module to import the nyc_taxis.csv file and convert it to a list of lists containing float values.

1. Add a line of code using the numpy.array() constructor to convert the converted_taxi_list variable to a NumPy ndarray. Assign the result to the variable name taxi.

In [4]:
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("nyc_taxis.csv", "r", newline="") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

# start writing your code below this comment
taxi = np.array(converted_taxi_list)
np.set_printoptions(precision=2, suppress=True)
print(taxi)

[[2016.      1.      1.   ...   11.65   69.99    1.  ]
 [2016.      1.      1.   ...    8.     54.3     1.  ]
 [2016.      1.      1.   ...    0.     37.8     2.  ]
 ...
 [2016.      6.     30.   ...    5.     63.34    1.  ]
 [2016.      6.     30.   ...    8.95   44.75    1.  ]
 [2016.      6.     30.   ...    0.     54.84    2.  ]]
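
As a minimal sketch of what the conversion step does, the example below builds an ndarray from a small list of lists and inspects it. The `sample_rows` values are a hypothetical stand-in for `converted_taxi_list`, not rows from the real file.

```python
import numpy as np

# Hypothetical stand-in for converted_taxi_list: three rows, four columns.
sample_rows = [
    [2016.0, 1.0, 1.0, 21.0],
    [2016.0, 1.0, 1.0, 16.29],
    [2016.0, 1.0, 1.0, 12.7],
]

sample = np.array(sample_rows)

print(sample.shape)  # (number of rows, number of columns)
print(sample.dtype)  # every element of an ndarray shares one data type
```

Because every inner list has the same length, NumPy produces a 2D array; the `shape` attribute is a quick way to confirm that the rows and columns came through as expected.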


### Understanding NumPy ndarrays
1. Select the first ten rows of the taxi ndarray, and assign the result to a new variable taxi_ten.
2. Use Python's print() function to display taxi_ten.

In [5]:
taxi_five = taxi[:5]
taxi_ten = taxi[:10]

print(taxi_ten)

[[2016.      1.      1.      5.      0.      2.      4.     21.   2037.
    52.      0.8     5.54   11.65   69.99    1.  ]
 [2016.      1.      1.      5.      0.      2.      1.     16.29 1520.
    45.      1.3     0.      8.     54.3     1.  ]
 [2016.      1.      1.      5.      0.      2.      6.     12.7  1462.
    36.5     1.3     0.      0.     37.8     2.  ]
 [2016.      1.      1.      5.      0.      2.      6.      8.7  1210.
    26.      1.3     0.      5.46   32.76    1.  ]
 [2016.      1.      1.      5.      0.      2.      6.      5.56  759.
    17.5     1.3     0.      0.     18.8     2.  ]
 [2016.      1.      1.      5.      0.      4.      2.     21.45 2004.
    52.      0.8     0.     52.8   105.6     1.  ]
 [2016.      1.      1.      5.      0.      2.      6.      8.45  927.
    24.5     1.3     0.      6.45   32.25    1.  ]
 [2016.      1.      1.      5.      0.      2.      6.      7.3   731.
    21.5     1.3     0.      0.     22.8     2.  ]
 [2016.      1. 

### Selecting and Slicing Rows and Items from ndarrays
1. From the taxi ndarray:
    - Select the row at index 0 and assign it to row_0.
    - Select every column for the rows at indexes 391 to 500 inclusive and assign them to rows_391_to_500.
    - Select the item at row index 21 and column index 5 and assign it to row_21_column_5.

In [8]:
row_0 = taxi[0]
rows_391_to_500 = taxi[391:501]
row_21_column_5 = taxi[21, 5]

print(rows_391_to_500)

[[2016.      1.      2.   ...    0.     26.3     2.  ]
 [2016.      1.      2.   ...    3.     30.3     1.  ]
 [2016.      1.      2.   ...    6.67   40.01    1.  ]
 ...
 [2016.      1.      2.   ...    4.96   29.76    1.  ]
 [2016.      1.      2.   ...    0.     32.84    2.  ]
 [2016.      1.      2.   ...    7.05   42.39    1.  ]]
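
The "inclusive" wording in the exercise is the part that tends to trip people up, because a slice's stop index is excluded. The sketch below uses a small hypothetical array (`grid`, whose values encode their own row and column) so the selected rows are easy to read off.

```python
import numpy as np

# Hypothetical 6x3 array: each value is row_index * 10 + column_index.
grid = np.arange(6)[:, np.newaxis] * 10 + np.arange(3)

first_row = grid[0]       # the row at index 0, returned as a 1D array
rows_2_to_4 = grid[2:5]   # rows 2, 3 and 4; the stop index 5 is excluded
single_item = grid[4, 1]  # the item at row index 4, column index 1

print(first_row)
print(rows_2_to_4)
print(single_item)
```

To include row `n` in a slice, the stop value has to be `n + 1`, which is why the solution above uses `taxi[391:501]` for rows 391 to 500.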


### Selecting Columns and Custom Slicing ndarrays
1. From the taxi ndarray:
    - Select every row for the columns at indexes 1, 4, and 7 and assign them to columns_1_4_7.
    - Select the columns at indexes 5 to 8 inclusive for the row at index 99 and assign them to row_99_columns_5_to_8.
    - Select the rows at indexes 100 to 200 inclusive for the column at index 14 and assign them to rows_100_to_200_column_14.

In [10]:
columns_1_4_7 = taxi[:, [1, 4, 7]]
row_99_columns_5_to_8 = taxi[99, 5:9]
rows_100_to_200_column_14 = taxi[100:201, 14]

print(columns_1_4_7)

[[ 1.    0.   21.  ]
 [ 1.    0.   16.29]
 [ 1.    0.   12.7 ]
 ...
 [ 6.    5.   17.48]
 [ 6.    5.   12.76]
 [ 6.    5.   17.54]]
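
The three selection patterns in this exercise return results with different numbers of dimensions. Here is a short sketch on a hypothetical 4x5 array (`grid`) that makes the resulting shapes visible.

```python
import numpy as np

grid = np.arange(20).reshape(4, 5)  # hypothetical 4x5 array holding 0..19

cols_0_2_4 = grid[:, [0, 2, 4]]   # list of column indexes: 2D result
row_1_cols_1_to_3 = grid[1, 1:4]  # one row, a slice of columns: 1D result
col_3_rows_0_to_2 = grid[0:3, 3]  # a slice of rows, one column: 1D result

print(cols_0_2_4.shape)
print(row_1_cols_1_to_3)
print(col_3_rows_0_to_2)
```

Selecting a single row or a single column with an integer drops that dimension, while a list or slice keeps it.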


### Vector Math

1. Use vector division to divide trip_distance_miles by trip_length_hours, assigning the result to trip_mph.
2. After you have run your code, use the variable inspector below the code box to inspect the contents of the new trip_mph variable.

In [11]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

trip_mph = trip_distance_miles/trip_length_hours

print(trip_mph)

[37.11 38.58 31.27 ... 22.3  42.42 36.9 ]
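
To see the vector division on a scale that can be checked by hand, the sketch below repeats the calculation for three trips only. The distances and durations are copied from the first three rows of the `taxi_ten` output; the variable names are illustrative.

```python
import numpy as np

# trip_distance (miles) and trip_length (seconds) for the first three trips.
distance_miles = np.array([21.0, 16.29, 12.7])
length_seconds = np.array([2037.0, 1520.0, 1462.0])

length_hours = length_seconds / 3600  # the scalar is applied to every element
mph = distance_miles / length_hours   # element-by-element division

print(mph.round(2))
```

No loop is needed: dividing one 1D ndarray by another of the same length pairs the elements up by position.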


### Calculating Statistics for 1D ndarrays

1. Use the ndarray.max() method to calculate the maximum value of trip_mph and assign the result to mph_max.
2. Use the ndarray.mean() method to calculate the average value of trip_mph and assign the result to mph_mean.

In [12]:
mph_min = trip_mph.min()
mph_max = trip_mph.max()
mph_mean = trip_mph.mean()

print(mph_mean)

32.24258580925573


### Calculating Statistics for 2D ndarrays

1. Using a single method, calculate the mean value for each column of taxi, and assign the result to taxi_column_means.

In [13]:
taxi_column_means = taxi.mean(axis = 0)
print(taxi_column_means)

[2016.      3.61   15.69    3.84    3.08    2.96    3.38   12.67 2235.98
   38.4     1.21    3.54    5.81   48.97    1.29]
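
The `axis` argument decides which dimension is collapsed. A small sketch with a hypothetical 2x3 array (`scores`) shows the three common cases side by side.

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])  # hypothetical 2x3 array

column_means = scores.mean(axis=0)  # collapses the rows: one value per column
row_means = scores.mean(axis=1)     # collapses the columns: one value per row
overall_mean = scores.mean()        # no axis: one value for the whole array

print(column_means)
print(row_means)
print(overall_mean)
```

For `taxi`, `axis=0` therefore gives one mean per column, which matches the 15 values printed above.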


### Adding Rows and Columns to ndarrays
1. Expand the dimensions of trip_mph to be a single column in a 2D ndarray, and assign the result to trip_mph_2d.
2. Add trip_mph_2d as a new column at the end of taxi, assigning the result back to taxi.
3. Use the print() function to display taxi and view the new column.

In [19]:
# These `ones` and `zeros` variables
# are different from the ones in the
# main lesson 

ones = np.array([[1, 1, 1, 1, 1, 1], 
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1]])
zeros = np.array([0, 0, 0])

print(ones)
print(zeros)
print() # creates a space in our output

print(ones.shape)
print(zeros.shape)
print()

zeros_2d = np.expand_dims(zeros,axis=1)
print(zeros_2d)
print(zeros_2d.shape)
print()

combined = np.concatenate([ones,zeros_2d],axis=1)
print(combined)
print()

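# the `trip_mph` variable is still available from the
# previous screen; note that re-running this cell appends
# the column again, which is why it appears twice in the output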

trip_mph_2d = np.expand_dims(trip_mph, axis = 1)
taxi = np.concatenate([taxi, trip_mph_2d], axis = 1)
print(taxi)

[[1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]]
[0 0 0]

(3, 6)
(3,)

[[0]
 [0]
 [0]]
(3, 1)

[[1 1 1 1 1 1 0]
 [1 1 1 1 1 1 0]
 [1 1 1 1 1 1 0]]

[[2016.      1.      1.   ...    1.     37.11   37.11]
 [2016.      1.      1.   ...    1.     38.58   38.58]
 [2016.      1.      1.   ...    2.     31.27   31.27]
 ...
 [2016.      6.     30.   ...    1.     22.3    22.3 ]
 [2016.      6.     30.   ...    1.     42.42   42.42]
 [2016.      6.     30.   ...    2.     36.9    36.9 ]]
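
The reason for expanding the dimensions first is that `np.concatenate` requires all inputs to have the same number of dimensions. The sketch below, using hypothetical `table` and `new_col` arrays, shows the failure and one equivalent way to add the missing axis.

```python
import numpy as np

table = np.ones((3, 2))               # hypothetical 2D array
new_col = np.array([7.0, 8.0, 9.0])   # 1D array with one value per row

try:
    # A 2D array and a 1D array cannot be joined directly.
    np.concatenate([table, new_col], axis=1)
except ValueError as err:
    print("concatenate failed:", err)

# Indexing with np.newaxis gives the same result as np.expand_dims(new_col, axis=1).
new_col_2d = new_col[:, np.newaxis]
widened = np.concatenate([table, new_col_2d], axis=1)

print(new_col_2d.shape)
print(widened)
```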


### Sorting ndarrays
1. Use numpy.argsort() to get the indices which would sort the trip_mph column from the taxi ndarray. The trip_mph column is at column index 15.
2. Use the indices from the previous instruction to sort the taxi ndarray, and assign the result to taxi_sorted.
3. Use the print() function to examine the taxi_sorted ndarray.

In [21]:
sorted_order = np.argsort(taxi[:, 15])  # avoid shadowing the built-in sorted()
taxi_sorted = taxi[sorted_order]
print(taxi_sorted)

[[ 2016.     6.    28. ...     1.     0.     0.]
 [ 2016.     3.     3. ...     1.     0.     0.]
 [ 2016.     4.     6. ...     4.     0.     0.]
 ...
 [ 2016.     3.    28. ...     2. 32040. 32040.]
 [ 2016.     2.    13. ...     2. 70560. 70560.]
 [ 2016.     1.    22. ...     2. 82800. 82800.]]
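
`numpy.argsort()` returns positions rather than sorted values, which is what allows whole rows to be reordered together. A minimal sketch on a hypothetical three-row array (`trips`):

```python
import numpy as np

# Hypothetical table: column 0 is an id, column 1 is the value to sort by.
trips = np.array([[1.0, 30.5],
                  [2.0, 12.0],
                  [3.0, 48.2]])

order = np.argsort(trips[:, 1])  # indexes that sort column 1 in ascending order
ascending = trips[order]         # rows reordered together, smallest value first
descending = trips[order[::-1]]  # reversed order, largest value first

print(order)
print(ascending)
print(descending)
```

Reversing the index array is one simple way to get a descending sort, since `argsort` itself only sorts in ascending order.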


### Analyzing Trips with High Average Speeds
1. Below are the last 10 rows of our taxi_sorted ndarray, with trip_mph values ranging between 15,570 and 82,800:

In [24]:
taxi_sorted[-10:]

array([[ 2016.  ,     2.  ,    19.  ,     5.  ,     4.  ,     2.  ,
            2.  ,    17.3 ,     4.  ,     2.5 ,     1.8 ,     0.  ,
            0.  ,     4.3 ,     2.  , 15570.  , 15570.  ],
       [ 2016.  ,     6.  ,     6.  ,     1.  ,     0.  ,     2.  ,
            2.  ,    18.7 ,     4.  ,     2.5 ,     1.3 ,     0.  ,
            0.  ,     3.8 ,     3.  , 16830.  , 16830.  ],
       [ 2016.  ,     4.  ,    12.  ,     2.  ,     4.  ,     2.  ,
            2.  ,    19.8 ,     4.  ,     2.5 ,     1.8 ,     0.  ,
            0.  ,     4.3 ,     2.  , 17820.  , 17820.  ],
       [ 2016.  ,     4.  ,    24.  ,     7.  ,     5.  ,     3.  ,
            3.  ,    16.9 ,     3.  ,    52.  ,     0.8 ,     0.  ,
            0.  ,    52.8 ,     3.  , 20280.  , 20280.  ],
       [ 2016.  ,     6.  ,    30.  ,     4.  ,     3.  ,     2.  ,
            2.  ,    27.1 ,     4.  ,    75.  ,     0.8 ,     0.  ,
            0.  ,    75.8 ,     2.  , 24390.  , 24390.  ],
       [ 2016.  ,     3. 

# Boolean Indexing with NumPy

### Reading CSV Files With NumPy
1. Import the NumPy library.
2. Use the numpy.genfromtxt() function to read the nyc_taxis.csv file into NumPy, skipping the first row, and assign the result to taxi.
3. Use the variable inspector under the code box to view the taxi ndarray after you have run your code.

In [27]:
import numpy as np
taxi = np.genfromtxt("nyc_taxis.csv", delimiter = ",")
taxi = taxi[1:]

taxi

array([[2016.  ,    1.  ,    1.  , ...,   11.65,   69.99,    1.  ],
       [2016.  ,    1.  ,    1.  , ...,    8.  ,   54.3 ,    1.  ],
       [2016.  ,    1.  ,    1.  , ...,    0.  ,   37.8 ,    2.  ],
       ...,
       [2016.  ,    6.  ,   30.  , ...,    5.  ,   63.34,    1.  ],
       [2016.  ,    6.  ,   30.  , ...,    8.95,   44.75,    1.  ],
       [2016.  ,    6.  ,   30.  , ...,    0.  ,   54.84,    2.  ]])
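
The header row has to be dropped because `numpy.genfromtxt()` cannot convert the column names to floats and reads them as `nan`. The sketch below demonstrates this with a hypothetical in-memory CSV (`csv_text`) in place of nyc_taxis.csv, and also shows the function's `skip_header` parameter as an alternative to slicing afterwards.

```python
import io
import numpy as np

# Hypothetical three-line CSV standing in for nyc_taxis.csv.
csv_text = "pickup_year,pickup_month,trip_distance\n2016,1,21.0\n2016,1,16.29\n"

# Reading everything: the text header becomes a row of nan values.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# Skipping the header while reading gives the same data as slicing with [1:].
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header[0])
print(without_header)
```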

### Boolean Arrays
1. Use vectorized boolean operations to:
    - Evaluate whether the elements in array a are less than 3 and assign the result to a_bool.
    - Evaluate whether the elements in array b are equal to "blue" and assign the result to b_bool.
    - Evaluate whether the elements in array c are greater than 100 and assign the result to c_bool.
2. Once you have run your code, use the variable inspector below the code box to view each boolean array.

In [29]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

# a, b and c are already ndarrays, so they can be compared directly
a_bool = a < 3
b_bool = b == "blue"
c_bool = c > 100

a_bool

array([ True,  True, False, False, False])

### Boolean Indexing with 1D ndarrays

1. Calculate the number of rides in the taxi ndarray that are from February:
    - Create a boolean array, february_bool, that evaluates whether the items in pickup_month are equal to 2.
    - Use the february_bool boolean array to index pickup_month, and assign the result to february.
    - Use the ndarray.shape attribute to find the number of items in february and assign the result to february_rides.
2. Calculate the number of rides in the taxi ndarray that are from March:
    - Create a boolean array, march_bool, that evaluates whether the items in pickup_month are equal to 3.
    - Use the march_bool boolean array to index pickup_month, and assign the result to march.
    - Use the ndarray.shape attribute to find the number of items in march and assign the result to march_rides.
3. Once you have run your code, use the variable inspector to view the number of rides for February and March.

In [33]:
pickup_month = taxi[:,1]

january_bool = pickup_month == 1
january = pickup_month[january_bool]
january_rides = january.shape[0]

february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0]

march_bool = pickup_month == 3
march = pickup_month[march_bool]
march_rides = march.shape[0]

march_rides

15547
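
Indexing and then reading `shape` is the approach the exercise asks for, but it is not the only way to count matches. As a sketch with hypothetical month values, the same count can be taken directly from the boolean array, since `True` counts as 1.

```python
import numpy as np

months = np.array([1, 2, 2, 3, 2, 1])  # hypothetical pickup_month values

feb_bool = months == 2

count_by_shape = months[feb_bool].shape[0]     # approach used in the exercise
count_by_sum = feb_bool.sum()                  # True is treated as 1
count_by_nonzero = np.count_nonzero(feb_bool)  # counts the True values

print(count_by_shape, count_by_sum, count_by_nonzero)
```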

### Boolean Indexing with 2D ndarrays

1. Create a boolean array, tip_bool, that determines which rows have values for the tip_amount column of more than 50.
2. Use the tip_bool array to select all rows from taxi with tip amounts of more than 50, and the columns from indexes 5 to 13 inclusive. Assign the resulting array to top_tips.
3. Once you have run your code, use the variable inspector to view the top_tips array.
    - To help you understand the data, the column names are (in order):<br> pickup_location_code,<br> dropoff_location_code,<br> trip_distance,<br> trip_length,<br> fare_amount,<br> fees_amount,<br> tolls_amount,<br> tip_amount,<br> total_amount.

In [35]:
tip_amount = taxi[:,12]
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]
top_tips

array([[    4.  ,     2.  ,    21.45,  2004.  ,    52.  ,     0.8 ,
            0.  ,    52.8 ,   105.6 ],
       [    3.  ,     4.  ,     9.2 ,  1041.  ,    27.  ,     1.3 ,
            5.54,    60.  ,    93.84],
       [    2.  ,     0.  ,    19.8 ,  1671.  ,    52.5 ,     1.3 ,
            5.54,    59.34,   118.68],
       [    4.  ,     2.  ,    18.42,  2968.  ,    52.  ,     0.8 ,
            5.54,    80.  ,   138.34],
       [    3.  ,     6.  ,     0.49,   158.  ,     3.5 ,     1.8 ,
            0.  ,    70.  ,    75.3 ],
       [    2.  ,     2.  ,     2.7 ,   381.  ,     9.5 ,     0.8 ,
            0.  ,    60.  ,    70.3 ],
       [    3.  ,     4.  ,     9.54,  1210.  ,    27.5 ,     0.8 ,
            5.54,    55.  ,    88.84],
       [    2.  ,     4.  ,    17.6 ,  3251.  ,    52.  ,     0.8 ,
            5.54,    65.  ,   123.34],
       [    4.  ,     2.  ,    38.2 ,  9252.  ,    52.  ,     0.8 ,
            5.54,    80.  ,   138.34],
       [    4.  ,     2.  ,    18.  ,

### Assigning Values in ndarrays
**To help you practice without making changes to our original array, we have used the ndarray.copy() method to make taxi_modified, a copy of our original for these exercises.**

1. The value at column index 5 (pickup_location) of row index 28214 is incorrect. Use assignment to change this value to 1 in the taxi_modified ndarray.
2. The first column (index 0) contains year values as four digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the taxi_modified ndarray.
3. The values at column index 7 (trip_distance) of the rows at indexes 1800 and 1801 are incorrect. Use assignment to change these values in the taxi_modified ndarray to the mean value for that column.

In [37]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

taxi_modified[28214, 5] = 1

taxi_modified[:, 0] = 16

taxi_modified[1800:1802, 7] = taxi_modified[:, 7].mean()

taxi_modified

array([[16.  ,  1.  ,  1.  , ..., 11.65, 69.99,  1.  ],
       [16.  ,  1.  ,  1.  , ...,  8.  , 54.3 ,  1.  ],
       [16.  ,  1.  ,  1.  , ...,  0.  , 37.8 ,  2.  ],
       ...,
       [16.  ,  6.  , 30.  , ...,  5.  , 63.34,  1.  ],
       [16.  ,  6.  , 30.  , ...,  8.95, 44.75,  1.  ],
       [16.  ,  6.  , 30.  , ...,  0.  , 54.84,  2.  ]])
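
The copy matters because basic slicing in NumPy returns a view onto the same data, so assigning through a slice changes the original array. The following sketch, on a hypothetical two-row array, contrasts a copy with a view.

```python
import numpy as np

original = np.array([[2016.0, 1.0],
                     [2016.0, 2.0]])  # hypothetical two-row array

safe_copy = original.copy()  # independent data
safe_copy[:, 0] = 16         # the scalar is assigned to every row of column 0
print(original[:, 0])        # the original is unchanged at this point

year_view = original[:, 0]   # basic slicing returns a view, not a copy
year_view[:] = 16            # this writes through to `original`
print(original[:, 0])
```

Working on `taxi_modified` rather than `taxi` keeps the original data intact for the later exercises.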

### Assignment using Boolean Arrays
**We have created a new copy of our taxi dataset, taxi_modified, with an additional column containing the value 0 for every row.**

1. In our new column at index 15, assign the value 1 if the pickup_location_code (column index 5) corresponds to an airport location, leaving the value as 0 otherwise by performing these three operations:
    - For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
    - For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
    - For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.

In [39]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

# create a new column filled with `0`.
zeros = np.zeros([taxi_modified.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)

taxi_modified[taxi_modified[:, 5] == 2, 15] = 1
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1

taxi_modified

array([[2016.  ,    1.  ,    1.  , ...,   69.99,    1.  ,    1.  ],
       [2016.  ,    1.  ,    1.  , ...,   54.3 ,    1.  ,    1.  ],
       [2016.  ,    1.  ,    1.  , ...,   37.8 ,    2.  ,    1.  ],
       ...,
       [2016.  ,    6.  ,   30.  , ...,   63.34,    1.  ,    1.  ],
       [2016.  ,    6.  ,   30.  , ...,   44.75,    1.  ,    1.  ],
       [2016.  ,    6.  ,   30.  , ...,   54.84,    2.  ,    1.  ]])
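
The three separate assignments can also be expressed as a single membership test. This is an alternative sketch rather than the exercise's intended solution; it uses `numpy.isin()` on hypothetical pickup codes, with the airport codes taken from the data dictionary.

```python
import numpy as np

pickup_codes = np.array([4.0, 2.0, 6.0, 3.0, 5.0, 1.0])  # hypothetical column 5 values
airport_codes = [2, 3, 5]  # JFK, LaGuardia and Newark, per the data dictionary

is_airport = np.isin(pickup_codes, airport_codes)  # True where the code is an airport

flags = np.zeros(pickup_codes.shape[0])
flags[is_airport] = 1  # boolean-array assignment, as in the exercise

print(is_airport)
print(flags)
```

The same boolean array could be built by combining the three comparisons with the `|` operator; `np.isin` mainly saves typing when the list of codes grows.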

### Challenge: Which is the Most Popular Airport?
1. Using the original taxi ndarray, calculate how many trips had JFK Airport as their destination:
    - Select only the rows where the dropoff_location_code column has a value that corresponds to JFK, and assign the result to jfk.
    - Calculate how many rows are in the new jfk array and assign the result to jfk_count.
2. Calculate how many trips from taxi had LaGuardia Airport as their destination:
    - Select only the rows where the dropoff_location_code column has a value that corresponds to LaGuardia, and assign the result to laguardia.
    - Calculate how many rows are in the new laguardia array and assign the result to laguardia_count.
3. Calculate how many trips from taxi had Newark Airport as their destination:
    - Select only the rows where the dropoff_location_code column has a value that corresponds to Newark, and assign the result to newark.
    - Calculate how many rows are in the new newark array and assign the result to newark_count.
4. After you have run your code, inspect the values for jfk_count, laguardia_count, and newark_count and see which airport has the most dropoffs.

In [41]:
jfk = taxi[taxi[:, 6] == 2]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:, 6] == 3]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:, 6] == 5]
newark_count = newark.shape[0]

print(jfk_count)
print(laguardia_count)
print(newark_count)

11832
16602
63
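
Counting each code with its own boolean index works well for three airports. As a hedged alternative, `numpy.unique()` with `return_counts=True` tallies every code in one call; the sketch below uses hypothetical drop-off codes rather than the real column.

```python
import numpy as np

dropoff_codes = np.array([2.0, 3.0, 3.0, 5.0, 4.0, 3.0, 2.0])  # hypothetical values

# codes holds the distinct values in sorted order, counts holds how often each occurs.
codes, counts = np.unique(dropoff_codes, return_counts=True)
count_by_code = dict(zip(codes.tolist(), counts.tolist()))

most_common = codes[np.argmax(counts)]  # code with the highest count

print(count_by_code)
print(most_common)
```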


### Challenge: Calculating Statistics for Trips on Clean Data

**The trip_mph ndarray has been provided for you.**

1. Create a new ndarray, cleaned_taxi, containing only rows for which the values of trip_mph are less than 100.
2. Calculate the mean of the trip_distance column of cleaned_taxi, and assign the result to mean_distance.
3. Calculate the mean of the trip_length column of cleaned_taxi, and assign the result to mean_length.
4. Calculate the mean of the total_amount column of cleaned_taxi, and assign the result to mean_total_amount.
5. Calculate the mean of the trip_mph, excluding values greater than 100, and assign the result to mean_mph.

In [43]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

cleaned_taxi = taxi[trip_mph < 100]

mean_distance = cleaned_taxi[:, 7].mean()
mean_length = cleaned_taxi[:, 8].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()
mean_mph = trip_mph[trip_mph < 100].mean()

cleaned_taxi

array([[2016.  ,    1.  ,    1.  , ...,   11.65,   69.99,    1.  ],
       [2016.  ,    1.  ,    1.  , ...,    8.  ,   54.3 ,    1.  ],
       [2016.  ,    1.  ,    1.  , ...,    0.  ,   37.8 ,    2.  ],
       ...,
       [2016.  ,    6.  ,   30.  , ...,    5.  ,   63.34,    1.  ],
       [2016.  ,    6.  ,   30.  , ...,    8.95,   44.75,    1.  ],
       [2016.  ,    6.  ,   30.  , ...,    0.  ,   54.84,    2.  ]])
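
The key idea in this challenge is that one boolean array, computed from `trip_mph`, can filter both the 2D array and the 1D speed values, because they share the same row order. Here is a small sketch with three hypothetical trips; the middle row reuses the 17.3 miles in 4 seconds seen in the sorted output earlier.

```python
import numpy as np

# Hypothetical rows: [trip_distance in miles, trip_length in seconds].
trips = np.array([[10.0, 1200.0],
                  [17.3, 4.0],     # implausible: 17.3 miles in 4 seconds
                  [5.0, 900.0]])

mph = trips[:, 0] / (trips[:, 1] / 3600)
plausible = mph < 100  # one boolean array, reused for both selections

cleaned = trips[plausible]        # keeps only the plausible rows
mean_mph = mph[plausible].mean()  # mean speed over the same rows

print(cleaned)
print(mean_mph)
```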