# Pandas and NumPy Fundamentals

To use the NumPy library, we first need to import it into our Python environment. NumPy is commonly imported using the alias np:

In [1]:
import numpy as np

The NumPy library takes advantage of a processor feature called Single Instruction Multiple Data (SIMD) to process data faster. SIMD allows a processor to perform the same operation on multiple data points in a single processor cycle:

![SIMD](https://s3.amazonaws.com/dq-content/289/vectorized.gif)

As a result, the NumPy version of our code would take only two processor cycles, a fourfold speed-up over the loop version! Replacing for loops with operations applied to multiple data points at once is called vectorization, and ndarrays make vectorization possible.
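As a rough illustration of the idea, the sketch below doubles a few made-up values twice: once with a for loop over a Python list and once with a single ndarray operation. The names `values`, `doubled_loop`, and `doubled_vectorized` are assumptions for illustration only.

```python
import numpy as np

# hypothetical sample data, not taken from the taxi data set
values = [1.0, 2.0, 3.0, 4.0]

# loop version: one multiplication per iteration
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# vectorized version: one expression applied to every element
doubled_vectorized = np.array(values) * 2

print(doubled_loop)        # [2.0, 4.0, 6.0, 8.0]
print(doubled_vectorized)  # [2. 4. 6. 8.]
```

Both versions produce the same values; the vectorized one simply hands the whole array to NumPy in a single step.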

So far, we've only practiced creating one-dimensional ndarrays, but ndarrays can also be two-dimensional:
![NDimensional](https://s3.amazonaws.com/dq-content/289/Two_Dim.svg)
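
A minimal sketch of creating a two-dimensional ndarray, assuming a small list of lists with made-up values (`data_2d` is a hypothetical name):

```python
import numpy as np

# a list of equal-length lists becomes a 2D ndarray
data_2d = np.array([[1, 2, 3],
                    [4, 5, 6]])

print(data_2d.ndim)   # 2
print(data_2d.shape)  # (2, 3)
```

The shape is reported as (rows, columns), which is the same order used when indexing.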

### NYC Taxi-Airport Data
We'll work with a subset of this data: approximately 90,000 yellow taxi trips to and from New York City airports between January and June 2016. Below is information about selected columns from the data set:
* `pickup_year`: The year of the trip.
* `pickup_month`: The month of the trip (January is 1, December is 12).
* `pickup_day`: The day of the month of the trip.
* `pickup_location_code`: The airport or borough where the trip started.
* `dropoff_location_code`: The airport or borough where the trip finished.
* `trip_distance`: The distance of the trip in miles.
* `trip_length`: The length of the trip in seconds.
* `fare_amount`: The base fare of the trip, in dollars.
* `total_amount`: The total amount charged to the passenger, including all fees, tolls and tips.

**Import the Data**

In [2]:
import csv
import numpy as np

# read the CSV into a list of rows; the with block closes the file afterwards
with open('data/nyc_taxis.csv') as f:
    taxi_list = list(csv.reader(f))

# remove header
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)
    
# convert to numpy array
taxi = np.array(converted_taxi_list)

In [3]:
type(taxi)

numpy.ndarray

In [4]:
print(taxi.shape)

(89560, 15)


There are 89,560 rows. Each row has 15 columns of data.
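
As an alternative sketch, assuming the file contains only numeric values apart from its header row, `np.genfromtxt` can read a CSV straight into an ndarray. The example below reads from an in-memory string with made-up values so that it runs without the data file; `sample_csv` and `sample` are hypothetical names.

```python
import io
import numpy as np

# made-up stand-in for a numeric CSV file with a header row
sample_csv = io.StringIO(
    "pickup_year,pickup_month,trip_distance\n"
    "2016,1,21.0\n"
    "2016,1,16.29\n"
)

# skip_header=1 drops the header row; the values are parsed as floats
sample = np.genfromtxt(sample_csv, delimiter=",", skip_header=1)

print(sample.shape)  # (2, 3)
print(sample.dtype)  # float64
```

With the real file, the path `'data/nyc_taxis.csv'` would take the place of `sample_csv`.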

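For any 2D array, the full syntax for selecting data is:
`ndarray[row_index,column_index]`
* Select multiple rows or columns with a slice such as `start:stop` (the stop index is excluded). A bare `:` selects everything along that dimension.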

In [5]:
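# a single integer selects one row with all of its columns
row_0 = taxi[0]
# the slice stops before index 501, so this selects rows 391 through 500
rows_391_to_500 = taxi[391:501]
# row index first, then column index: returns a single value
row_21_column_5 = taxi[21, 5]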

![partial selection](https://s3.amazonaws.com/dq-content/289/selection_1darray_updated.svg)
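
The same row selections can be tried on a small array where the results are easy to check by eye. This is a sketch with a hypothetical 4x3 array named `small`, standing in for `taxi`:

```python
import numpy as np

# hypothetical 4x3 array:
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]
small = np.arange(12).reshape(4, 3)

first_row = small[0]         # 1D result: [0 1 2]
middle_rows = small[1:3]     # 2D result: rows 1 and 2
single_value = small[2, 1]   # row 2, column 1: 7

print(first_row)
print(middle_rows.shape)     # (2, 3)
print(single_value)
```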

Selecting one or more columns:

![column selection](https://s3.amazonaws.com/dq-content/289/selection_columns_updated.svg)
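
A short sketch of the three common ways to select columns, again assuming a hypothetical 4x3 array named `small`: a single index, a slice, and a list of indexes.

```python
import numpy as np

# hypothetical 4x3 array holding the values 0 to 11
small = np.arange(12).reshape(4, 3)

# one column: the colon keeps every row, the result is 1D
col_1 = small[:, 1]              # [ 1  4  7 10]

# adjacent columns: a slice for the column dimension, the result is 2D
cols_0_and_1 = small[:, 0:2]     # shape (4, 2)

# non-adjacent columns: a list of column indexes
cols_0_and_2 = small[:, [0, 2]]  # shape (4, 2)

print(col_1)
print(cols_0_and_1.shape)
print(cols_0_and_2.shape)
```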

If we want to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:
![1D slice](https://s3.amazonaws.com/dq-content/289/selection_1darray_updated.svg)

Lastly, if we want to select a 2D slice, we can use slices for both dimensions:
![2D Slice](https://s3.amazonaws.com/dq-content/289/selection_2darray_updated.svg)
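
To make the difference between 1D and 2D slices concrete, here is a sketch on a hypothetical 4x5 array named `small`:

```python
import numpy as np

# hypothetical 4x5 array:
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]]
small = np.arange(20).reshape(4, 5)

# single row index + column slice: a 1D slice of a row
row_slice = small[2, 1:4]    # [11 12 13]

# row slice + single column index: a 1D slice of a column
col_slice = small[1:3, 4]    # [ 9 14]

# slices for both dimensions: a 2D block
block = small[1:3, 1:4]      # [[ 6  7  8], [11 12 13]]

print(row_slice.shape, col_slice.shape, block.shape)  # (3,) (2,) (2, 3)
```

Using a single integer for one dimension drops that dimension from the result, which is why the first two selections are 1D.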

### Challenge:
From the taxi ndarray:

1. Select every row for the columns at indexes 1, 4, and 7. Assign them to columns_1_4_7.

`cols = [1,4,7]`

`columns_1_4_7 = taxi[:,cols]`

2. Select the columns at indexes 5 to 8 inclusive for the row at index 99. Assign them to row_99_columns_5_to_8.

`row_99_columns_5_to_8 = taxi[99,5:9]`

3. Select the rows at indexes 100 to 200 inclusive for the column at index 14. Assign them to rows_100_to_200_column_14.

`rows_100_to_200_column_14 = taxi[100:201,14]`
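
One way to sanity-check these answers without the data file is to run the same selections on a placeholder array with the shape reported earlier, (89560, 15), and inspect the resulting shapes. The array `placeholder` below is an assumption: it is filled with zeros and is not the real data.

```python
import numpy as np

# placeholder with the same shape as taxi (all zeros, not the real data)
placeholder = np.zeros((89560, 15))

columns_1_4_7 = placeholder[:, [1, 4, 7]]
row_99_columns_5_to_8 = placeholder[99, 5:9]
rows_100_to_200_column_14 = placeholder[100:201, 14]

print(columns_1_4_7.shape)              # (89560, 3)
print(row_99_columns_5_to_8.shape)      # (4,)
print(rows_100_to_200_column_14.shape)  # (101,)
```

Because the stop index of a slice is excluded, "5 to 8 inclusive" needs `5:9` and "100 to 200 inclusive" needs `100:201`.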

### Vector Math
As we saw above, NumPy ndarrays make selecting data much easier. Beyond this, working with the selected data is a lot faster with vectorized operations, because each operation is applied to multiple data points at once.

Vectorized operations also make our code shorter and easier to read. Here's how we would add the first two columns of a 2D ndarray, `my_numbers`, with a vectorized operation instead of a loop:

In [7]:
# sums = my_numbers[:,0] + my_numbers[:,1]

Here are some key observations about this code:

* When we selected each column, we used the syntax `ndarray[:,c]`, where `c` is the index of the column we wanted to select. As we saw earlier, the colon selects all rows.
* To add the two 1D ndarrays (the two selected columns), we simply use the addition operator (`+`) between them.

Here's what happened behind the scenes:

![vector addition](https://s3.amazonaws.com/dq-content/289/vectorized_addition.svg)
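
A runnable sketch of this column addition, assuming a small made-up array for `my_numbers` (the values are hypothetical). A loop version is included only for comparison:

```python
import numpy as np

# hypothetical data; any 2D ndarray with at least two columns would do
my_numbers = np.array([[6, 5],
                       [1, 3],
                       [5, 6],
                       [1, 4]])

# loop version: add the two values in each row, one row at a time
sums_loop = []
for row in my_numbers:
    sums_loop.append(row[0] + row[1])

# vectorized version: add the two columns in a single expression
sums = my_numbers[:, 0] + my_numbers[:, 1]

print(sums)  # [11  4 11  5]
```

The result is a 1D ndarray with one sum per row, matching the values collected by the loop.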