# An Introduction to Using NumPy for Fast Data Manipulation

This dataset is described as containing around 90,000 yellow taxi trips taken to and from New York City airports between January and June 2016. The nyc_taxis.csv file loaded below yields 2,013 rows.
 


In [1]:
# importing needed libraries
# import pandas as pd
import csv
import numpy as np

In [2]:
# open the file for reading; the with block closes it automatically
with open("nyc_taxis.csv", "r", newline="") as o_file:
    r_file = csv.reader(o_file)
    dataset = list(r_file)

dataset_f = dataset  # full dataset, header included
dataset_h = dataset[0]  # grab header
dataset = dataset[1:]  # remove header

In [3]:
# How is the data labeled?
print(dataset_h)

['pickup_year', 'pickup_month', 'pickup_day', 'pickup_dayofweek', 'pickup_time', 'pickup_location_code', 'dropoff_location_code', 'trip_distance', 'trip_length', 'fare_amount', 'fees_amount', 'tolls_amount', 'tip_amount', 'total_amount', 'payment_type']


## Column Meanings

pickup_year: the year of the trip

pickup_month: the month of the trip (January is 1, December is 12)

pickup_day: the day of the month of the trip

pickup_location_code: the airport or borough where the trip started

dropoff_location_code: the airport or borough where the trip ended

trip_distance: the distance of the trip in miles

trip_length: the length of the trip in seconds

fare_amount: the base fare of the trip, in dollars

total_amount: the total amount charged to the passenger, including all fees, tolls and tips


## Changing all values to floats

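A NumPy ndarray holds a single data type, and the csv module reads every value as a string, so we convert each value to a float before building the array.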

In [4]:
dataset_floats = []  # dataset with every value as a float

# converts to floats
for data in dataset:  # grab each row
    data_rows = []  # hold row item values
    for item in data:
        data_rows.append(float(item))  # convert each item to float
    dataset_floats.append(data_rows)  # append new values to new list
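
The same conversion can also be written as a nested list comprehension. This is a minimal sketch on a small made-up sample; `sample_rows` and `sample_floats` are hypothetical names, and the values only stand in for rows parsed from the CSV.

```python
# Hypothetical sample rows standing in for the parsed CSV (strings, header removed)
sample_rows = [
    ["2016", "1", "21.0", "2037"],
    ["2016", "3", "16.9", "33"],
]

# Nested list comprehension: convert every item of every row to a float
sample_floats = [[float(item) for item in row] for row in sample_rows]

print(sample_floats)
```

The result is equivalent to the explicit loops; which form to use is mostly a readability choice.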

## Using NumPy to convert lists to ndarrays

In [5]:
# dtype=object is used here for more readable printing;
# a numeric dtype (e.g. float) is usually faster for math
taxi_array = np.array(dataset_floats, dtype=object)

# check first row
print("First Row:\n", taxi_array[0], "\n")

# gathering info
taxi_shape = taxi_array.shape
print(
    "This Array has {} rows and {} columns".format(taxi_shape[0], taxi_shape[1])
)  # 2013 rows, 15 cols

First Row:
 [2016.0 1.0 1.0 5.0 0.0 2.0 4.0 21.0 2037.0 52.0 0.8 5.54 11.65 69.99 1.0] 

This Array has 2013 rows and 15 columns
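
The cell above passes `dtype=object`, which stores each value as a Python object. As a rough sketch of the difference, the example below builds the same small array both ways. The values in `rows` are a hypothetical sample, not the full dataset.

```python
import numpy as np

# Hypothetical sample: trip_distance and trip_length for two trips
rows = [[21.0, 2037.0], [16.9, 33.0]]

as_object = np.array(rows, dtype=object)  # array of Python float objects
as_float = np.array(rows, dtype=float)    # native 64-bit floats

print(as_object.dtype)  # object
print(as_float.dtype)   # float64
print(as_float.shape)   # (2, 2)
```

Both arrays support the slicing and arithmetic used in this notebook. The native float dtype is generally the one that lets NumPy run its vectorized operations at full speed.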


## What kind of information can we gather from this dataset?


1. We can find the **average travel speed** using the trip distance, length and vector math

In [6]:
# speed = distance / time

print("Needed Columns - ", "7", dataset_h[7], "8", dataset_h[8], "\n")
print("Trip Distance in miles: ", taxi_array[0][7], "\n")  # example trip distance
print("Trip Length in seconds: ", taxi_array[0][8], "\n")  # example trip length

# convert seconds to hours, 3600 seconds in an hour
trip_length_hours = taxi_array[:, 8] / 3600


taxi_mph = taxi_array[:, 7] / trip_length_hours

# check output
print(taxi_mph)

Needed Columns -  7 trip_distance 8 trip_length 

Trip Distance in miles:  21.0 

Trip Length in seconds:  2037.0 

[37.11340206185567 38.58157894736842 31.27222982216142 ...
 22.299078667611624 42.41551246537396 36.904734073641144]
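
To make the vector math concrete, here is a small worked example. The first trip uses the distance and duration printed above (21.0 miles, 2037.0 seconds); the second trip is made up for illustration.

```python
import numpy as np

# First trip: values printed above. Second trip: hypothetical.
distance_miles = np.array([21.0, 10.0])
length_seconds = np.array([2037.0, 1800.0])

# Element-wise operations: no explicit loop is needed
length_hours = length_seconds / 3600
mph = distance_miles / length_hours

print(mph)
```

For the first trip this is 21.0 / (2037.0 / 3600), about 37.11 mph, which matches the first value in the output above.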


## Using ndarray.min and ndarray.max

Now that we know the mph, we can gather further information such as: the **max, min, and mean mph**

In [7]:
taxi_max = taxi_mph.max()
taxi_min = taxi_mph.min()
taxi_mean = taxi_mph.mean()

"The slowest taxi ride was {} mph, the fastest was {} mph, and the average mph of a trip was {}".format(
    taxi_min, taxi_max, taxi_mean
)

'The slowest taxi ride was 0.0 mph, the fastest was 82800.0 mph, and the average mph of a trip was 169.98315083655177'

# Observations

Our top speed of 82,800 mph suggests an error in the dataset, most likely a recording malfunction.

An average of roughly 170 mph is also implausible for a taxi, so filtering out impossible speeds should give more realistic results.
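
A single extreme value is enough to distort the mean. The sketch below uses a made-up array of speeds (only the 82800.0 value comes from the output above) and compares the mean with the median, which is less sensitive to outliers.

```python
import numpy as np

# Hypothetical speeds in mph; 82800.0 mimics the bad reading seen above
speeds = np.array([20.0, 30.0, 40.0, 82800.0])

mean_speed = speeds.mean()        # pulled far upward by the outlier
median_speed = np.median(speeds)  # stays near the typical values

print(mean_speed)    # 20722.5
print(median_speed)  # 35.0
```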

## Next Steps

Using Boolean Vectors we can find the **total taxi rides per month**

In [8]:
# taxi_array
# months are stored as 1-12
# print(taxi_array[0, 1])

# month is the second column (index 1)
month = "January"
taxi_month_col = taxi_array[:, 1]

# what month to grab?
january_filter = taxi_month_col == 1  # avoids shadowing the built-in filter()
january_rides = taxi_month_col[january_filter]
# print(january_rides)

# count January rides
# shape[0] is the number of matching rows
print(
    "There were a total of {} rides in the month of {}".format(
        january_rides.shape[0], month
    )
)

There were a total of 800 rides in the month of January
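
Because `True` counts as 1 when summed, a boolean vector can also be counted directly without indexing the column first. This is a minimal sketch; `months` is a hypothetical sample rather than the real month column.

```python
import numpy as np

# Hypothetical month column (1 = January)
months = np.array([1.0, 1.0, 2.0, 3.0, 1.0, 6.0])

january_mask = months == 1           # boolean vector, one entry per row
january_count = january_mask.sum()   # each True adds 1 to the sum

print(january_mask)
print(january_count)  # 3
```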


In [9]:
def get_monthly_rides(num=None):
    """
    Returns a frequency table containing rides per month.
    num is an optional month number; if given, only that
    month's count is returned.
    """
    monthly_rides = {}

    for i in taxi_array[:, 1]:
        # values are floats, so keys look like "1.0"
        month = str(i)
        if month not in monthly_rides:
            monthly_rides[month] = 1
        else:
            monthly_rides[month] += 1

    if num is not None:
        selected_month = str(num) + ".0"

        if selected_month not in monthly_rides:
            return "Month not found."
        return monthly_rides[selected_month]
    else:
        return monthly_rides

In [10]:
print(get_monthly_rides())
print(get_monthly_rides(num=1))

{'1.0': 800, '2.0': 176, '3.0': 554, '4.0': 171, '6.0': 312}
800


In [11]:
# https://docs.python.org/3/library/collections.html
# Counter dict subclass for counting hashable objects
# Using a module to accomplish the above function in one line
from collections import Counter

print(Counter(taxi_array[:, 1]))

Counter({1.0: 800, 3.0: 554, 6.0: 312, 2.0: 176, 4.0: 171})
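
NumPy can build the same frequency table with `np.unique` and its `return_counts` option. The sketch below runs on a small hypothetical `months` array, not the real column.

```python
import numpy as np

# Hypothetical month column
months = np.array([1.0, 3.0, 1.0, 6.0, 3.0, 1.0])

# values holds the sorted unique months; counts holds how often each occurs
values, counts = np.unique(months, return_counts=True)
monthly = dict(zip(values.tolist(), counts.tolist()))

print(monthly)  # {1.0: 3, 3.0: 2, 6.0: 1}
```

Unlike `Counter`, which orders its printed output by frequency, `np.unique` returns the months in sorted order.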


# Observations cont.

We can see that January had the most taxi rides, with 800.

March comes in second with 554. There are no rides recorded for May.

## Next Steps

We will use boolean vectors to:
1. filter an entire array
2. find errors in our dataset
3. count rides by dropoff location

# Filtering an ndarray using Boolean Vectors

In [12]:
# filter example
# taxi_mph contains the average travel speed for each row
# print(taxi_mph)

# if the speed is over 500 mph
# set that row to True
mph_filter = taxi_mph > 500
# print(mph_filter)

# for rows that are True, grab cols 7-8 (trip_distance, trip_length)
print(dataset_h[7:9], "\n")
print(taxi_array[mph_filter, 7:9])

['trip_distance', 'trip_length'] 

[[16.9 33.0]
 [23.0 1.0]
 [19.6 1.0]
 [16.7 2.0]
 [17.8 2.0]
 [17.2 2.0]
 [16.9 3.0]
 [27.1 4.0]]


# Dataset Errors uncovered

Here we find **8** examples where the average speed was **over 500 mph**.

Most of these cover **less than 25 miles** with a recorded trip length of **under 5 seconds**.

Removing these from our dataset would help us ensure accurate information down the line.
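
Boolean vectors can be combined with `&` (and) and `|` (or) to express a pattern like the one described above. In this sketch the rows are taken from outputs shown earlier in the notebook, while the thresholds (10 miles, 5 seconds) are illustrative assumptions rather than rules from the dataset.

```python
import numpy as np

# Columns: trip_distance (miles), trip_length (seconds)
trips = np.array([
    [23.0, 1.0],      # flagged above as over 500 mph
    [16.9, 33.0],     # flagged above as over 500 mph
    [21.0, 2037.0],   # first row of the dataset, a plausible trip
])

# Each comparison needs its own parentheses when combined with &
suspicious = (trips[:, 0] > 10) & (trips[:, 1] < 5)

print(suspicious)         # [ True False False]
print(trips[suspicious])  # only the 23.0 mile, 1 second trip
```

The second row slips past this particular rule, which is one reason the speed-based filter is the more robust check.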

## Next Steps

We will use boolean indexing to **remove** the found errors

In [13]:
# shorthand mph calculation, reassigning taxi_mph
taxi_mph = taxi_array[:, 7] / (taxi_array[:, 8] / 3600)

# keep only the rows with speeds under 100 mph
cleaned_taxi = taxi_array[taxi_mph < 100]

# how many did we remove?
print("Original Size: ", taxi_array.shape[0])
print("New Size: ", cleaned_taxi.shape[0])
print("Removed: ", taxi_array.shape[0] - cleaned_taxi.shape[0])

Original Size:  2013
New Size:  2004
Removed:  9
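
The `~` operator inverts a boolean vector, which makes it easy to inspect exactly which rows a filter drops. This is a minimal sketch on made-up speeds; `keep`, `cleaned` and `removed` are hypothetical names.

```python
import numpy as np

# Hypothetical speeds in mph, two of them impossible
mph = np.array([37.1, 82800.0, 22.3, 1843.6])

keep = mph < 100       # True for rows we want to keep
cleaned = mph[keep]    # rows that pass the filter
removed = mph[~keep]   # ~ flips True/False, giving the dropped rows

print(cleaned)  # [37.1 22.3]
print(removed)
```

Checking the removed rows before discarding them helps confirm that the threshold is not throwing away valid trips.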


# Pulling additional information out of our cleaned dataset

Using the dropoff_location_code (column 6), we can find out **how many rides were taken to specific locations**

In [14]:
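# build a boolean vector of rows whose dropoff_location_code equals x
# and filter taxi_array with that bool vector
# note: this filters the original taxi_array; use cleaned_taxi
# to count only the cleaned rows
jfk = taxi_array[taxi_array[:, 6] == 2]
laguardia = taxi_array[taxi_array[:, 6] == 3]
newark = taxi_array[taxi_array[:, 6] == 5]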

# show the sizes
jfk_count = jfk.shape[0]
laguardia_count = laguardia.shape[0]
newark_count = newark.shape[0]

print("There were {} rides to JFK Airport".format(jfk_count))
print("There were {} rides to LaGuardia Airport".format(laguardia_count))
print("There were {} rides to Newark Airport".format(newark_count))

There were 285 rides to JFK Airport
There were 308 rides to LaGuardia Airport
There were 2 rides to Newark Airport
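
The three near-identical filters can be folded into one loop over a lookup table. The sketch below reuses the location codes from the cell above (2, 3 and 5); `dropoff_codes` is a hypothetical sample column and `airports` is a helper name introduced here.

```python
import numpy as np

# Hypothetical dropoff_location_code column
dropoff_codes = np.array([2.0, 3.0, 3.0, 5.0, 4.0, 2.0, 3.0])

# Location codes as used in the cell above
airports = {2: "JFK", 3: "LaGuardia", 5: "Newark"}

# One boolean vector per airport; summing it counts the matching rides
ride_counts = {
    name: int((dropoff_codes == code).sum())
    for code, name in airports.items()
}

for name, count in ride_counts.items():
    print("There were {} rides to {} Airport".format(count, name))
```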


# Conclusions

Using NumPy we were able to:
1. Efficiently manipulate n-dimensional arrays
2. Filter our dataset using booleans
3. Find and remove errors
4. Gather information and insights from clean data