A data analysis report by Giuseppe Di Bernardo. Date: September 06, 2016

# Exploring the dataset 

## Preparing the notebook

In [1]:
# magic command to display matplotlib plots inline within the ipython notebook webpage
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# import relevant modules 
import os

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns 

sns.set(font='sans')

## Reading the input file

In [3]:
# dir paths
data_dir = "../data/"
csv = "yellow_tripdata_2015-06.csv"
# join and normalize the path; the bare name on the last line displays it
fullcsv = os.path.normpath(os.path.join(data_dir, csv))
fullcsv

'../data/yellow_tripdata_2015-06.csv'

The trip records are provided to the *NYC Taxi and Limousine Commission (TLC)* by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). They are stored in `CSV` format and organized by year and month.  
Let's take a look at the data. We will use `pandas` for all the data cleanup and preparation. Each row of the `yellow_tripdata_2015-06.csv` file represents a single taxi trip, and the columns are the attributes of that trip.

In [4]:
# Load the trip records into a pandas DataFrame.
# parse_dates converts the pickup and dropoff timestamps from strings
# to datetime objects, which enables date- and time-based operations.
df = pd.read_csv(fullcsv, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
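
A monthly trip file can be large, so loading every column at once may use a lot of memory. The following is a minimal sketch of one way to keep this manageable, assuming only a subset of the columns is needed: `usecols` restricts the columns that are read, and `chunksize` reads the file in pieces. The inline `sample_csv`, the `keep_cols` selection, and the chunk size are illustrative stand-ins, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: three rows with a few of its columns.
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance\n"
    "2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63\n"
    "2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46\n"
    "1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40\n"
)

date_cols = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
keep_cols = date_cols + ["trip_distance"]  # assumed subset of interest

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single large frame.
chunks = pd.read_csv(
    sample_csv, usecols=keep_cols, parse_dates=date_cols, chunksize=2
)

# Concatenate the pieces; in practice each chunk could be filtered
# or aggregated first to reduce memory use.
df_small = pd.concat(chunks, ignore_index=True)
print(df_small.dtypes)
```

For the real file, `sample_csv` would be replaced by `fullcsv` and the chunk size raised to something much larger.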

In [5]:
# uncomment this to inspect the column data types and memory usage
# df.info() 

In [6]:
# a first glimpse: the first five trips of the file 
df.head() 

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [7]:
databegin = len(df)
print("We have " +str(databegin)+" trips in New York in June 2015")

# a double-check
# df.count(axis=0, level=None, numeric_only=False)  

We have 12324935 trips in New York in June 2015
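
The commented-out double-check relies on the difference between `len(df)`, which counts rows, and `df.count()`, which counts the non-null values in each column. A small sketch on a made-up frame shows how the two combine into a missing-value tally; the `toy` data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing tip value.
toy = pd.DataFrame({
    "passenger_count": [1, 2, 1],
    "tip_amount": [0.0, np.nan, 2.2],
})

n_rows = len(toy)            # number of rows, regardless of missing values
non_null = toy.count()       # non-null entries per column
missing = n_rows - non_null  # missing entries per column

print(n_rows)
print(missing)
```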


In [8]:
# check that the times were converted to datetime objects
df.tpep_pickup_datetime.head()
# df['tpep_pickup_datetime'].head()

0   2015-06-02 11:19:29
1   2015-06-02 11:19:30
2   2015-06-02 11:19:31
3   2015-06-02 11:19:31
4   2015-06-02 11:19:32
Name: tpep_pickup_datetime, dtype: datetime64[ns]
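
Once both timestamp columns have a `datetime64` dtype, date and time arithmetic works column-wise. As an illustration, here is a small sketch that derives the trip duration and the pickup hour. The `trips` frame reuses the timestamps of the first two trips listed earlier, and the `duration_min` and `pickup_hour` column names are assumptions introduced for this example.

```python
import pandas as pd

# Timestamps of the first two trips from the head of the file.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2015-06-02 11:19:29", "2015-06-02 11:19:30"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2015-06-02 11:47:52", "2015-06-02 11:27:56"]
    ),
})

# Subtracting two datetime columns yields a timedelta column.
duration = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]

# Convert the timedelta to minutes and extract the hour of pickup.
trips["duration_min"] = duration.dt.total_seconds() / 60
trips["pickup_hour"] = trips["tpep_pickup_datetime"].dt.hour

print(trips[["duration_min", "pickup_hour"]])
```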

In [9]:
# elapsed time between the first and the last row of the file
time_span = df.tpep_pickup_datetime.iloc[-1] - df.tpep_pickup_datetime.iloc[0]
print("We have " + str(time_span) + " of data observation for trips in New York in June 2015")

We have 28 days 10:34:53 of data observation for trips in New York in June 2015
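
The difference between the last and the first row equals the full observation window only if the rows are sorted by pickup time. A more robust sketch, under the assumption that the ordering is not guaranteed, takes the difference between the latest and the earliest pickup instead. The `pickups` series below is made-up data.

```python
import pandas as pd

# Hypothetical, unsorted pickup times.
pickups = pd.Series(pd.to_datetime([
    "2015-06-02 11:19:29",
    "2015-06-30 21:54:22",
    "2015-06-01 00:00:05",
]))

# Last row minus first row: depends on the row order.
first_last = pickups.iloc[-1] - pickups.iloc[0]

# Latest minus earliest: independent of the row order.
full_span = pickups.max() - pickups.min()

print(first_last)
print(full_span)
```

On unsorted data the first quantity can even be negative, while the second always covers the whole period.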


The trip data looks like this. The file for June has about **12 million rows**, and each row contains the `vendor id`, `rate code`, `store and forward flag`, `pickup date/time`, `dropoff date/time`, `passenger count`, `trip distance`, and `latitude/longitude` coordinates of the pickup and dropoff locations, plus the payment type and the fare, tip, and toll amounts. The possibilities are endless! I smell a tip analysis coming on :-) 

In [10]:
# replace the numeric codes with short labels; the mapping is passed as a dict
df.VendorID = df.VendorID.replace({1: 'CMT', 2: 'VFI'})
df.RateCodeID = df.RateCodeID.replace({1: 'STD', 2: 'JFK', 3: 'NEW', 4: 'NOW', 5: 'NEG', 6: 'GRP'})
df.payment_type = df.payment_type.replace({1: 'CRD', 2: 'CSH', 3: 'NOC', 4: 'DIS', 5: 'UNK', 6: 'VOI'})
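
To show what the dictionary-based `replace` does, here is a minimal sketch on a made-up `payment_type` column. Codes found in the dictionary are swapped for their labels, while any other value is left untouched; `value_counts()` then gives a quick tally per label. The `toy` frame and the unmapped code `9` are hypothetical.

```python
import pandas as pd

# Hypothetical payment codes, including one (9) that has no label.
toy = pd.DataFrame({"payment_type": [1, 2, 1, 9, 2, 1]})

labels = {1: "CRD", 2: "CSH", 3: "NOC", 4: "DIS", 5: "UNK", 6: "VOI"}

# Values present in the dict are replaced; others pass through unchanged.
toy["payment_type"] = toy["payment_type"].replace(labels)

# Tally of trips per payment label.
counts = toy["payment_type"].value_counts()
print(counts)
```

Checking the tally for leftover numeric codes is a cheap way to spot values that the mapping does not cover.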

It is convenient to visualize some of these attributes, e.g., the `payment_type`, to get a first insight into how their values are distributed: 