# Pandas and numpy fundamentals

## Introduction to numpy
NumPy is a Python library for analyzing large datasets. Python's built-in list works well for small datasets, but for larger ones NumPy is the better tool for data analysis and processing.

**SIMD** (Single Instruction, Multiple Data): a processor feature that NumPy takes advantage of to process data more efficiently. Applying one operation to many values at once in this way is called vectorization.
We will work with a subset of New York City's taxi trip dataset, released by the city of New York and covering trips from January to June 2016. The dataset contains the following information:
- pickup_year - The year of the trip.
- pickup_month - The month of the trip (January is 1, December is 12).
- pickup_day - The day of the month of the trip.
- pickup_location_code - The airport or borough where the trip started.
- dropoff_location_code - The airport or borough where the trip finished, using the same eight category codes as pickup_location_code.
- trip_distance - The distance of the trip in miles.
- trip_length - The length of the trip in seconds.
- fare_amount - The base fare of the trip, in dollars.
- total_amount - The total amount charged to the passenger, including all fees, tolls and tips.

In [1]:
# use the csv module to read the dataset into a list of lists; it is converted into a 2D ndarray with the np.array constructor further below
from csv import reader
import numpy as np
with open("nyc_taxis.csv") as open_file:
    read_file = reader(open_file)
    taxi_trip_lists = list(read_file)
    # the header of the dataset
    taxi_trip_lists_header = taxi_trip_lists[0]
    # remove the header
    taxi_trip_lists = taxi_trip_lists[1:]
    print(taxi_trip_lists[:5])

[['2016', '1', '1', '5', '0', '2', '4', '21.00', '2037', '52.00', '0.80', '5.54', '11.65', '69.99', '1'], ['2016', '1', '1', '5', '0', '2', '1', '16.29', '1520', '45.00', '1.30', '0.00', '8.00', '54.30', '1'], ['2016', '1', '1', '5', '0', '2', '6', '12.70', '1462', '36.50', '1.30', '0.00', '0.00', '37.80', '2'], ['2016', '1', '1', '5', '0', '2', '6', '8.70', '1210', '26.00', '1.30', '0.00', '5.46', '32.76', '1'], ['2016', '1', '1', '5', '0', '2', '6', '5.56', '759', '17.50', '1.30', '0.00', '0.00', '18.80', '2']]


In [2]:
# convert every value from a string to a floating-point number
converted_taxi_data = []
for row in taxi_trip_lists:
    converted_taxi_row = []
    for item in row:
        item = float(item)
        converted_taxi_row.append(item)
    converted_taxi_data.append(converted_taxi_row)
print(converted_taxi_data[:5])

[[2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 4.0, 21.0, 2037.0, 52.0, 0.8, 5.54, 11.65, 69.99, 1.0], [2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 1.0, 16.29, 1520.0, 45.0, 1.3, 0.0, 8.0, 54.3, 1.0], [2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 6.0, 12.7, 1462.0, 36.5, 1.3, 0.0, 0.0, 37.8, 2.0], [2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 6.0, 8.7, 1210.0, 26.0, 1.3, 0.0, 5.46, 32.76, 1.0], [2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 6.0, 5.56, 759.0, 17.5, 1.3, 0.0, 0.0, 18.8, 2.0]]


In [29]:
taxi = np.array(converted_taxi_data)
print(taxi[:5])

[[2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
  2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 1.000e+00
  1.629e+01 1.520e+03 4.500e+01 1.300e+00 0.000e+00 8.000e+00 5.430e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  1.270e+01 1.462e+03 3.650e+01 1.300e+00 0.000e+00 0.000e+00 3.780e+01
  2.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  8.700e+00 1.210e+03 2.600e+01 1.300e+00 0.000e+00 5.460e+00 3.276e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  5.560e+00 7.590e+02 1.750e+01 1.300e+00 0.000e+00 0.000e+00 1.880e+01
  2.000e+00]]


In [30]:
print(type(taxi))

<class 'numpy.ndarray'>


In [31]:
# select a single item: ndarray[row_index, column_index]
taxi[0, 1]

1.0

In [21]:
# slicing taxi ndarray
taxi[:5, 6:]

array([[4.000e+00, 2.100e+01, 2.037e+03, 5.200e+01, 8.000e-01, 5.540e+00,
        1.165e+01, 6.999e+01, 1.000e+00],
       [1.000e+00, 1.629e+01, 1.520e+03, 4.500e+01, 1.300e+00, 0.000e+00,
        8.000e+00, 5.430e+01, 1.000e+00],
       [6.000e+00, 1.270e+01, 1.462e+03, 3.650e+01, 1.300e+00, 0.000e+00,
        0.000e+00, 3.780e+01, 2.000e+00],
       [6.000e+00, 8.700e+00, 1.210e+03, 2.600e+01, 1.300e+00, 0.000e+00,
        5.460e+00, 3.276e+01, 1.000e+00],
       [6.000e+00, 5.560e+00, 7.590e+02, 1.750e+01, 1.300e+00, 0.000e+00,
        0.000e+00, 1.880e+01, 2.000e+00]])

In [22]:
# select column 0 for the rows at indexes 0, 1, 3 and 4
print(taxi[[0,1,3,4],0])

[2016. 2016. 2016. 2016.]


- From the taxi ndarray:
  - Select the row at index 0 and assign it to row_0.
  - Select every column for the rows at indexes 391 to 500 inclusive and assign them to rows_391_to_500.
  - Select the item at row index 21 and column index 5 and assign it to row_21_column_5.

In [23]:
row_0 = taxi[0]
print(row_0)

[2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
 2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
 1.000e+00]


In [27]:
rows_391_to_500 = taxi[391:501]
print(rows_391_to_500[:5])

[[2.016e+03 1.000e+00 2.000e+00 6.000e+00 1.000e+00 4.000e+00 3.000e+00
  8.300e+00 1.081e+03 2.500e+01 1.300e+00 0.000e+00 0.000e+00 2.630e+01
  2.000e+00]
 [2.016e+03 1.000e+00 2.000e+00 6.000e+00 1.000e+00 4.000e+00 3.000e+00
  8.900e+00 1.114e+03 2.600e+01 1.300e+00 0.000e+00 3.000e+00 3.030e+01
  1.000e+00]
 [2.016e+03 1.000e+00 2.000e+00 6.000e+00 1.000e+00 4.000e+00 3.000e+00
  9.190e+00 9.930e+02 2.650e+01 1.300e+00 5.540e+00 6.670e+00 4.001e+01
  1.000e+00]
 [2.016e+03 1.000e+00 2.000e+00 6.000e+00 1.000e+00 4.000e+00 2.000e+00
  1.802e+01 1.654e+03 5.200e+01 8.000e-01 5.540e+00 1.167e+01 7.001e+01
  1.000e+00]
 [2.016e+03 1.000e+00 2.000e+00 6.000e+00 1.000e+00 4.000e+00 2.000e+00
  1.835e+01 1.765e+03 5.200e+01 8.000e-01 0.000e+00 0.000e+00 5.280e+01
  2.000e+00]]


In [26]:
row_21_column_5 = taxi[21, 5]
print(row_21_column_5)

4.0
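
The indexing forms used above can be summarized on a small array. This is only a sketch: `demo` and its values are made up for illustration and are not part of the taxi dataset.

```python
import numpy as np

# A small stand-in for the taxi ndarray: 4 rows, 3 columns (hypothetical values).
demo = np.array([
    [2016.0, 1.0, 10.5],
    [2016.0, 1.0, 7.25],
    [2016.0, 2.0, 3.0],
    [2016.0, 2.0, 12.0],
])

single_item = demo[1, 2]        # one value: row 1, column 2
one_row = demo[0]               # a whole row (same as demo[0, :])
one_column = demo[:, 2]         # a whole column, returned as a 1D array
row_slice = demo[1:3]           # rows 1 and 2; the stop index is excluded
picked_rows = demo[[0, 3], 1]   # column 1 of rows 0 and 3 only

print(single_item)
print(row_slice.shape)
print(picked_rows)
```

Because the stop index of a slice is excluded, selecting rows 391 to 500 inclusive requires `taxi[391:501]`, as in the cell above.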


## Vectorization
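Vectorization means applying an operation to a whole array at once instead of looping over its values. To use it, we first convert the list of lists into an ndarray, then select the columns we need, and finally apply operations to those columns.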
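
As a minimal sketch of these three steps, the example below computes an average speed for three trips, first with a plain loop and then with column operations. The distance and duration values are taken from the first rows printed earlier; the variable names and the speed calculation itself are assumptions added for illustration.

```python
import numpy as np

# Each row: [trip_distance in miles, trip_length in seconds]
trips_list = [[21.0, 2037.0], [16.29, 1520.0], [12.7, 1462.0]]

# Loop version: one Python-level division per row.
mph_loop = []
for distance, seconds in trips_list:
    mph_loop.append(distance / (seconds / 3600))

# Vectorized version: convert to an ndarray, select the columns,
# then operate on whole columns at once.
trips = np.array(trips_list)
trip_distance = trips[:, 0]
trip_length_hours = trips[:, 1] / 3600
mph_vectorized = trip_distance / trip_length_hours

print(mph_vectorized)
print(np.allclose(mph_loop, mph_vectorized))
```

Both versions give the same numbers; the vectorized one leaves the looping to NumPy's compiled code.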

## Reading file with numpy
In the previous section, we used Python's built-in csv module: we opened the file, called the reader function to read it, and used the list function to convert the result into a list of lists.
In this section, we will use NumPy's genfromtxt function to read a text file directly into an ndarray, and rely on vectorization to run operations quickly.

In [3]:
import numpy as np

In [11]:
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
print(taxi[:5])


[[2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
  2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 1.000e+00
  1.629e+01 1.520e+03 4.500e+01 1.300e+00 0.000e+00 8.000e+00 5.430e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  1.270e+01 1.462e+03 3.650e+01 1.300e+00 0.000e+00 0.000e+00 3.780e+01
  2.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  8.700e+00 1.210e+03 2.600e+01 1.300e+00 0.000e+00 5.460e+00 3.276e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  5.560e+00 7.590e+02 1.750e+01 1.300e+00 0.000e+00 0.000e+00 1.880e+01
  2.000e+00]]


In [12]:
print(taxi.dtype)

float64
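
To see why `skip_header=1` is passed, the sketch below reads a tiny in-memory CSV with and without it. The CSV text is a hypothetical stand-in for nyc_taxis.csv, and `io.StringIO` is used only so the example runs without the file.

```python
import io
import numpy as np

# A tiny in-memory CSV standing in for nyc_taxis.csv (hypothetical contents).
csv_text = "pickup_year,pickup_month,trip_distance\n2016,1,21.00\n2016,1,16.29\n"

# Without skip_header, the header text cannot be parsed as a float and becomes nan.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# With skip_header=1, only the numeric rows are read.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header.shape)
print(without_header.shape)
print(np.isnan(with_header[0]).all())
```

An ndarray holds a single dtype, so genfromtxt reads every value as a float by default, which is why `taxi.dtype` is float64.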


## Boolean array and indexing
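A Boolean array acts like a filter: comparing an ndarray with a value produces an array of True and False values, which can then be passed to the ndarray to select the matching elements.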

In [13]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])
# use vectorization to check, element by element, whether each value in a is less than 3
a_bool = a < 3

# element in b == "blue"
b_bool =  (b == "blue")

# c > 100

c_bool = c > 100
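
The cell above builds the Boolean arrays without displaying them. As a small sketch, this is what one of them contains and how it filters the array it came from:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
a_bool = a < 3
print(a_bool)        # [ True  True False False False]
print(a[a_bool])     # [1 2]

b = np.array(["blue", "blue", "red", "blue"])
b_bool = b == "blue"
print(b[b_bool])
```

The Boolean array must have the same length as the dimension it filters, which is always the case when it is derived from the array itself.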

## Boolean indexing with 1D ndarrays
- **shape**: attribute that gives the dimensions of an ndarray as a tuple. A 2D ndarray returns (rowNum, colNum); a 1D ndarray returns a one-element tuple such as (13481,).
- **Filter**: a comparison returns a Boolean array, which acts like a filter when it is later passed to the ndarray.


In [14]:
pickup_month = taxi[:, 1]
print(pickup_month)

[1. 1. 1. ... 6. 6. 6.]


In [26]:
january_bool = pickup_month == 1
print(january_bool)

january = pickup_month[january_bool]
print(january)

january_rides = january.shape
print(january_rides)
print(january_rides[0])

[ True  True  True ... False False False]
[1. 1. 1. ... 1. 1. 1.]
(13481,)
13481


In [27]:
february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0] # shape is a tuple; its first item is the number of elements
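
Counting through `shape` works, but the Boolean array can also be counted directly, since True counts as 1 and False as 0. The sketch below compares the approaches on a made-up `pickup_month_demo` array; it is an alternative, not the method used in this notebook.

```python
import numpy as np

# Hypothetical months for six trips.
pickup_month_demo = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 6.0])

february_bool = pickup_month_demo == 2

# Approach used above: filter, then read the length from shape.
rides_via_shape = pickup_month_demo[february_bool].shape[0]

# Alternatives: count the True values without building the filtered array.
rides_via_sum = int(february_bool.sum())
rides_via_count = int(np.count_nonzero(february_bool))

print(rides_via_shape, rides_via_sum, rides_via_count)
```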

## Boolean indexing with 2D ndarrays

In [31]:
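# column 12 is tip_amount
tip_amount = taxi[:, 12]

# Boolean filter for the trips with a tip above $50
tip_bool = tip_amount > 50

# keep the matching rows, then the columns at indexes 5 to 13
top_tips = taxi[tip_bool]
top_tips = top_tips[:, 5:14]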
print(top_tips)

[[4.0000e+00 2.0000e+00 2.1450e+01 2.0040e+03 5.2000e+01 8.0000e-01
  0.0000e+00 5.2800e+01 1.0560e+02]
 [3.0000e+00 4.0000e+00 9.2000e+00 1.0410e+03 2.7000e+01 1.3000e+00
  5.5400e+00 6.0000e+01 9.3840e+01]
 [2.0000e+00 0.0000e+00 1.9800e+01 1.6710e+03 5.2500e+01 1.3000e+00
  5.5400e+00 5.9340e+01 1.1868e+02]
 [4.0000e+00 2.0000e+00 1.8420e+01 2.9680e+03 5.2000e+01 8.0000e-01
  5.5400e+00 8.0000e+01 1.3834e+02]
 [3.0000e+00 6.0000e+00 4.9000e-01 1.5800e+02 3.5000e+00 1.8000e+00
  0.0000e+00 7.0000e+01 7.5300e+01]
 [2.0000e+00 2.0000e+00 2.7000e+00 3.8100e+02 9.5000e+00 8.0000e-01
  0.0000e+00 6.0000e+01 7.0300e+01]
 [3.0000e+00 4.0000e+00 9.5400e+00 1.2100e+03 2.7500e+01 8.0000e-01
  5.5400e+00 5.5000e+01 8.8840e+01]
 [2.0000e+00 4.0000e+00 1.7600e+01 3.2510e+03 5.2000e+01 8.0000e-01
  5.5400e+00 6.5000e+01 1.2334e+02]
 [4.0000e+00 2.0000e+00 3.8200e+01 9.2520e+03 5.2000e+01 8.0000e-01
  5.5400e+00 8.0000e+01 1.3834e+02]
 [4.0000e+00 2.0000e+00 1.8000e+01 2.2760e+03 1.0000e-02 3.0000e
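
The row filter and the column slice can also be written as a single indexing expression. The sketch below checks, on a small made-up `demo` array, that both forms select the same values.

```python
import numpy as np

# Hypothetical data: column 2 plays the role of tip_amount.
demo = np.array([
    [1.0, 10.0, 55.0, 70.0],
    [2.0, 12.0, 3.0, 20.0],
    [3.0, 30.0, 60.0, 95.0],
])

tip_amount = demo[:, 2]
tip_bool = tip_amount > 50

# Two steps, as in the cell above: filter the rows, then slice the columns.
two_steps = demo[tip_bool][:, 1:4]

# One step: Boolean array for the rows, slice for the columns.
one_step = demo[tip_bool, 1:4]

print(one_step)
print(np.array_equal(two_steps, one_step))
```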

## Values assignment to an ndarray

In [32]:
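# create a copy of the taxi ndarray so the original stays unchanged
taxi_modified = taxi.copy()

# assign 1 to the item at row index 28214, column index 5
taxi_modified[28214, 5] = 1

# assign 16 to every value in column 0 (pickup_year)
taxi_modified[:, 0] = 16

# mean of column 7 (trip_distance)
mean_value = taxi_modified[:, 7].mean()
print(mean_value)

# assign the mean to column 7 of the rows at indexes 1800 and 1801
taxi_modified[1800:1802, 7] = mean_value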

12.6674260830728
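
The call to `copy()` matters because a basic slice of an ndarray is a view on the same data, so assigning through it changes the original. A minimal sketch with made-up values:

```python
import numpy as np

original = np.array([[1.0, 2.0], [3.0, 4.0]])

view = original[:, 0]   # basic slicing returns a view on the same data
view[:] = 0             # so this also changes `original`

safe = original.copy()  # an independent copy
safe[:, 1] = 99         # `original` is not affected

print(original)
print(safe)
```

Here `original` ends up with zeros in its first column, while its second column is untouched by the assignment made on `safe`.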


## Values assignment with boolean arrays

In [55]:
a = np.array([1, 2, 3, 4, 5])
a 

array([1, 2, 3, 4, 5])

In [56]:
b = a > 2
b 

array([False, False,  True,  True,  True])

In [57]:
a[b] = 0

In [58]:
a

array([1, 2, 0, 0, 0])

In [59]:
# create a new column filled with `0`.
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
print(taxi_modified)

# where pickup_location_code (column 5) is 2, 'JFK Airport', set the new column (index 15) to 1
bool_airport = taxi_modified[:, 5] == 2
print(bool_airport)
taxi_modified[bool_airport, 15] = 1

# general pattern:
# array[array[:, column_for_comparison] == value_for_comparison,
#       column_for_assignment] = new_value
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1

taxi_modified[taxi_modified[:, 5] == 5, 15] = 1

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]
[ True  True  True ...  True  True  True]
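
The three assignments above repeat the same pattern once per airport code. As an alternative sketch, `np.isin` builds a single Boolean array for several codes at once. The `demo` array below is hypothetical: its first column stands in for pickup_location_code and its second for the new flag column.

```python
import numpy as np

# Columns: [pickup_location_code, is_airport flag]
demo = np.array([[2.0, 0.0], [4.0, 0.0], [3.0, 0.0], [5.0, 0.0], [6.0, 0.0]])

# Airport codes used in the cell above: 2, 3 and 5.
airport_codes = [2, 3, 5]

# True wherever the location code is one of the airport codes.
is_airport = np.isin(demo[:, 0], airport_codes)
demo[is_airport, 1] = 1

print(is_airport)
print(demo[:, 1])
```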


## Most popular airport 
We compare the number of rides to each airport, using the shape attribute to count the matching rows.


In [60]:
taxi_modified

array([[2.016e+03, 1.000e+00, 1.000e+00, ..., 6.999e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 5.430e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 3.780e+01, 2.000e+00,
        1.000e+00],
       ...,
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 6.334e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 4.475e+01, 1.000e+00,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 5.484e+01, 2.000e+00,
        1.000e+00]])

In [64]:
# create a Boolean filter for each of the three airports, based on column 6 (dropoff_location_code)
bool_jfk = taxi_modified[:, 6] == 2 # JFK Airport
bool_laguardia = taxi_modified[:, 6] == 3 # LaGuardia Airport
bool_newark = taxi_modified[:, 6] == 5 # Newark Airport

In [65]:
# passing the filter to the taxi_modified ndarrays
jfk_rides = taxi_modified[bool_jfk].shape[0]
message = "The number of rides for the JFK Airport is {}"
output = message.format(jfk_rides)
print(output)

The number of rides for the JFK Airport is 11832


In [66]:
# passing the filter to the taxi_modified ndarrays
laguardia_rides = taxi_modified[bool_laguardia].shape[0]
message = "The number of rides for the LaGuardia Airport is {}"
output = message.format(laguardia_rides)
print(output)

The number of rides for the LaGuardia Airport is 16602


In [None]:
# passing the filter to the taxi_modified ndarrays
newark_rides = taxi_modified[bool_newark].shape[0]
message = "The number of rides for the Newark Airport is {}"
output = message.format(newark_rides)
print(output)
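
The same comparison can be written once and repeated for each airport. The sketch below uses a short made-up `dropoff_location_code` array, so its counts are illustrative only and are not results from the taxi dataset; the airport codes are the ones used in the filters above.

```python
import numpy as np

# Hypothetical drop-off codes for seven trips.
dropoff_location_code = np.array([2.0, 3.0, 3.0, 5.0, 2.0, 3.0, 4.0])

airports = {"JFK Airport": 2, "LaGuardia Airport": 3, "Newark Airport": 5}

# Count the True values of each filter instead of building the filtered array.
ride_counts = {
    name: int((dropoff_location_code == code).sum())
    for name, code in airports.items()
}

for name, count in ride_counts.items():
    print("The number of rides for the {} is {}".format(name, count))

# The airport with the highest count.
most_popular = max(ride_counts, key=ride_counts.get)
print(most_popular)
```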