## 1. Clean the data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
import cartopy.io.shapereader as shpreader

After loading the modules, I adjusted the print settings to get a better idea of what the data looked like, then loaded the CSV and parsed the pickup and dropoff times as datetimes.

In [None]:
np.set_printoptions(threshold=sys.maxsize)
pd.set_option('display.max_columns', None)
input = pd.read_csv('tlc_yellow_trips_2018_11_22.csv')
input['pickup_datetime']=pd.to_datetime(input['pickup_datetime'])
input['dropoff_datetime']=pd.to_datetime(input['dropoff_datetime'])

I dropped the following two columns, as they did not appear relevant.

In [None]:
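# Removed these as they seem irrelevant. The axis is given via the `columns`
# keyword because pandas 2.0 no longer accepts it as a positional argument.
input = input.drop(columns=['vendor_id', 'store_and_fwd_flag'])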

After looking at the columns and reading the provided PDF, I identified rows that are not useful and may hinder the analysis: voided trips (payment type 6, although there weren't any of those in this dataset) and trips with no passengers, zero distance, or a non-positive fare amount.

In [None]:
input = input[input['payment_type']!=6]

input = input[input['fare_amount']>0.]

input = input[input['passenger_count']!=0]

input = input[input['trip_distance']!=0.]
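
As a quick check of the filtering logic, the four conditions can be combined into one boolean mask and tried on a toy table. This is only a sketch: the rows below are invented for illustration, and `trips`, `keep` and `cleaned` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the notebook.
trips = pd.DataFrame({
    'payment_type':    [1, 6, 2, 1, 1],
    'fare_amount':     [12.5, 8.0, 0.0, 7.0, 9.5],
    'passenger_count': [1, 1, 2, 0, 3],
    'trip_distance':   [2.1, 1.0, 0.8, 1.5, 0.0],
})

# Each condition is True for rows that should be kept.
keep = (
    (trips['payment_type'] != 6)         # not a voided trip
    & (trips['fare_amount'] > 0)         # positive fare
    & (trips['passenger_count'] != 0)    # at least one passenger
    & (trips['trip_distance'] != 0)      # non-zero distance
)
cleaned = trips[keep]
print(len(trips) - len(cleaned), 'rows dropped')
```

Only the first toy row passes all four conditions. Applying the conditions one at a time, as in the cell above, gives the same result; a single mask just makes it easier to count how many rows each rule removes.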

I added an extra column, duration in minutes, to investigate whether it shows any interesting trends related to tips.

In [None]:
# Trip duration in minutes, taken from the pickup/dropoff timedelta
input['duration_min'] = (input['dropoff_datetime'] - input['pickup_datetime']).dt.total_seconds() / 60

I noticed that some durations were over 1400 minutes. Looking at their properties, they seemed to be normal rides whose start and end times had been switched, with the end date moved one day forward. Given that there were only around 200 of them, I decided to remove them altogether, using a cutoff of 1380 minutes (23 hours). In addition, I removed trips with zero or negative duration.

In [None]:
input = input[input['duration_min']<1380.]

input = input[input['duration_min']>0.]
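
To make the suspected swap concrete, here is a small worked example. It is a sketch under my reading of the pattern described above: the timestamps are hypothetical, and the notebook itself drops these rows rather than repairing them.

```python
import pandas as pd

# Hypothetical 20-minute ride on the day covered by the dataset.
true_pickup = pd.Timestamp('2018-11-22 10:00')
true_dropoff = pd.Timestamp('2018-11-22 10:20')

# Suspected corruption: the two times are swapped and the
# end date is pushed one day forward.
rec_pickup = pd.Timestamp('2018-11-22 10:20')
rec_dropoff = pd.Timestamp('2018-11-23 10:00')

true_min = (true_dropoff - true_pickup).total_seconds() / 60
recorded_min = (rec_dropoff - rec_pickup).total_seconds() / 60

print(recorded_min)         # 1420.0
print(1440 - recorded_min)  # 20.0, the original duration
```

Under this assumption the recorded duration is 1440 minutes minus the true one, so the 1380-minute cutoff catches swapped rides that really lasted under an hour. A swapped ride longer than that would slip through the filter.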

Also added an extra column, average speed, to see later if there was anything interesting.

In [None]:
input['speed_mph'] = (np.array(input['trip_distance'])*60)/np.array(input['duration_min'])

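I wrote the following function to remove statistical outliers from the data: it keeps only values lying within k interquartile ranges below the first quartile and above the third quartile. I tried a few different values of k but settled on k=3, as lower values seemed too strict.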

In [None]:
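def remove_outliers(df, key, k=3.):
    """Keep rows whose df[key] lies strictly inside the k*IQR fences."""
    q1 = df[key].quantile(q=0.25)
    q3 = df[key].quantile(q=0.75)
    iqr = q3 - q1  # interquartile range

    # Lower and upper fences applied in a single mask
    return df[(df[key] > q1 - k * iqr) & (df[key] < q3 + k * iqr)]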

I applied the function to speed, distance, duration and total amount, in that order.

In [None]:
input = remove_outliers(input, 'speed_mph')
input = remove_outliers(input, 'trip_distance')
input = remove_outliers(input, 'duration_min')
input = remove_outliers(input, 'total_amount')
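
To illustrate why a smaller k is stricter, the sketch below computes the fences for k=1.5 and k=3 on a toy series. The values and the helper `iqr_fences` are made up for illustration and are not taken from the dataset or the notebook.

```python
import pandas as pd

def iqr_fences(values, k=3.0):
    """Return the (lower, upper) fences at k interquartile ranges."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy speeds in mph, invented for illustration.
speeds = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 40.0, 60.0])

kept_counts = {}
for k in (1.5, 3.0):
    low, high = iqr_fences(speeds, k)
    kept = speeds[(speeds > low) & (speeds < high)]
    kept_counts[k] = len(kept)
    print(k, low, high, kept.tolist())
```

Here the quartiles are 11 and 28, so the upper fence is 53.5 for k=1.5 and 79 for k=3. The value 60 is rejected by the tighter fence but kept by the wider one.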

Loaded the shapefile to filter out any entries whose location ids we didn't have, i.e. those not in NYC.

In [None]:
shpfilename = '/home/Earth/mfalls/Downloads/junior-data-scientist-test-data-team-master/tlc_yellow_geom.shp'
reader = shpreader.Reader(shpfilename)
# Collect the zone IDs present in the shapefile
zone_ids = [int(zone.attributes['zone_id']) for zone in reader.records()]

input = input[input['pickup_location_id'].isin(zone_ids)]
input = input[input['dropoff_location_id'].isin(zone_ids)]
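
The zone filter can be tried without the shapefile by using a hand-written ID list. This is a sketch only: the IDs and rows below are hypothetical stand-ins, not values read from `tlc_yellow_geom.shp`.

```python
import pandas as pd

# Hypothetical stand-in for the IDs read from the shapefile.
zone_ids = [1, 2, 3]

# Toy trips, invented for illustration.
trips = pd.DataFrame({
    'pickup_location_id':  [1, 2, 264, 3],
    'dropoff_location_id': [2, 265, 1, 3],
})

# A trip is kept only if both ends fall in a known zone.
in_zone = (
    trips['pickup_location_id'].isin(zone_ids)
    & trips['dropoff_location_id'].isin(zone_ids)
)
kept = trips[in_zone]
print(kept)
```

Only the first and last toy rows survive, since the other two each have one end outside the known zones.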

Saved to csv. 

In [None]:
input.to_csv('tlc_yellow_trips_2018_11_22_CLEAN.csv')