<center><h1>Exploratory Data Analysis for Taxi Trip Duration Project</h1></center>
In this project, we perform exploratory data analysis (EDA) on the taxi trip duration dataset to generate insights that will guide the subsequent machine learning work.

## Reading Files into Python

In [1]:
# import the libraries used for data handling (pandas, numpy) and plotting (matplotlib, seaborn)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# read the dataset
data = pd.read_csv('nyc_taxi/nyc_taxi_trip_duration.csv')

In [3]:
# to view the data, we can use the head method
# head(n): n is the number of rows to view from the top (default is 5)

data.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id1080784,2,2016-02-29 16:40:21,2016-02-29 16:47:01,1,-73.953918,40.778873,-73.963875,40.771164,N,400
1,id0889885,1,2016-03-11 23:35:37,2016-03-11 23:53:57,2,-73.988312,40.731743,-73.994751,40.694931,N,1100
2,id0857912,2,2016-02-21 17:59:33,2016-02-21 18:26:48,2,-73.997314,40.721458,-73.948029,40.774918,N,1635
3,id3744273,2,2016-01-05 09:44:31,2016-01-05 10:03:32,6,-73.96167,40.75972,-73.956779,40.780628,N,1141
4,id0232939,1,2016-02-17 06:42:23,2016-02-17 06:56:31,1,-74.01712,40.708469,-73.988182,40.740631,N,848


---

***`DATA DESCRIPTION`***
- ***id*** - a unique identifier for each trip
- ***vendor_id*** - a code indicating the provider associated with the trip record
- ***pickup_datetime*** - date and time when the meter was engaged
- ***dropoff_datetime*** - date and time when the meter was disengaged
- ***passenger_count*** - the number of passengers in the vehicle (driver entered value)
- ***pickup_longitude*** - the longitude where the meter was engaged
- ***pickup_latitude*** - the latitude where the meter was engaged
- ***dropoff_longitude*** - the longitude where the meter was disengaged
- ***dropoff_latitude*** - the latitude where the meter was disengaged
- ***store_and_fwd_flag*** - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip)
- ***trip_duration*** - (target) duration of the trip in seconds
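
Since `trip_duration` is described as the time between the meter being engaged and disengaged, it should match the difference between the two timestamps. The following is a minimal sketch of that consistency check, using only the first two rows of the `head()` output above (copied by hand into a small frame named `sample`, which is not part of the original notebook):

```python
import pandas as pd

# First two rows of the head() output shown earlier, copied by hand
sample = pd.DataFrame({
    "pickup_datetime": ["2016-02-29 16:40:21", "2016-03-11 23:35:37"],
    "dropoff_datetime": ["2016-02-29 16:47:01", "2016-03-11 23:53:57"],
    "trip_duration": [400, 1100],
})

pickup = pd.to_datetime(sample["pickup_datetime"])
dropoff = pd.to_datetime(sample["dropoff_datetime"])

# Elapsed time between meter-on and meter-off, in whole seconds
elapsed_seconds = (dropoff - pickup).dt.total_seconds().astype("int64")

# True where the recorded target matches the timestamp difference
matches = elapsed_seconds == sample["trip_duration"]
print(matches.all())  # True for these two rows
```

Running the same comparison on the full dataset would be one way to flag records whose target disagrees with their timestamps.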

In [4]:
# to check the dimensions of the dataset, we can use the shape attribute
data.shape

(729322, 11)

In [5]:
# last 5 rows using the tail() method
data.tail()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
729317,id3905982,2,2016-05-21 13:29:38,2016-05-21 13:34:34,2,-73.965919,40.78978,-73.952637,40.789181,N,296
729318,id0102861,1,2016-02-22 00:43:11,2016-02-22 00:48:26,1,-73.996666,40.737434,-74.00132,40.731911,N,315
729319,id0439699,1,2016-04-15 18:56:48,2016-04-15 19:08:01,1,-73.997849,40.761696,-74.001488,40.741207,N,673
729320,id2078912,1,2016-06-19 09:50:47,2016-06-19 09:58:14,1,-74.006706,40.708244,-74.01355,40.713814,N,447
729321,id1053441,2,2016-01-01 17:24:16,2016-01-01 17:44:40,4,-74.003342,40.743839,-73.945847,40.712841,N,1224


In [6]:
# stripping the "id" prefix to inspect the numeric part of each identifier
data['id'].str[2:]

0         1080784
1         0889885
2         0857912
3         3744273
4         0232939
           ...   
729317    3905982
729318    0102861
729319    0439699
729320    2078912
729321    1053441
Name: id, Length: 729322, dtype: object

In [7]:
#Printing all the columns present in data
data.columns

Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration'],
      dtype='object')

In [8]:
data.dtypes

id                     object
vendor_id               int64
pickup_datetime        object
dropoff_datetime       object
passenger_count         int64
pickup_longitude      float64
pickup_latitude       float64
dropoff_longitude     float64
dropoff_latitude      float64
store_and_fwd_flag     object
trip_duration           int64
dtype: object

## Variable Identification and Typecasting

### Integer Data Type

In [9]:
# Identifying variables with integer datatype
data.dtypes[data.dtypes == 'int64']

vendor_id          int64
passenger_count    int64
trip_duration      int64
dtype: object

### Float Data Type

In [10]:
# Identifying variables with float datatype
data.dtypes[data.dtypes == 'float64']

pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
dtype: object

### Object Data Type

In [11]:
# Identifying variables with object datatype
data.dtypes[data.dtypes == 'object']

id                    object
pickup_datetime       object
dropoff_datetime      object
store_and_fwd_flag    object
dtype: object
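
The three filters above can also be folded into one pass over `dtypes`. This is a small sketch on a hypothetical toy frame (`toy`, with values copied from the `head()` output), not the notebook's own code. The exact name of the string dtype depends on the pandas version, so the sketch makes no assumption about it.

```python
import pandas as pd

# Toy frame mirroring a few columns of the dataset
toy = pd.DataFrame({
    "id": ["id1080784", "id0889885"],
    "vendor_id": [2, 1],
    "pickup_longitude": [-73.953918, -73.988312],
    "store_and_fwd_flag": ["N", "N"],
})

# Map each dtype name to the list of columns that carry it
columns_by_dtype = {}
for column, dtype in toy.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

for dtype_name, columns in columns_by_dtype.items():
    print(dtype_name, columns)
```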

In [12]:
# Manually checking object types
data[['id','pickup_datetime','dropoff_datetime','store_and_fwd_flag']].head(7)

Unnamed: 0,id,pickup_datetime,dropoff_datetime,store_and_fwd_flag
0,id1080784,2016-02-29 16:40:21,2016-02-29 16:47:01,N
1,id0889885,2016-03-11 23:35:37,2016-03-11 23:53:57,N
2,id0857912,2016-02-21 17:59:33,2016-02-21 18:26:48,N
3,id3744273,2016-01-05 09:44:31,2016-01-05 10:03:32,N
4,id0232939,2016-02-17 06:42:23,2016-02-17 06:56:31,N
5,id1918069,2016-02-14 18:31:42,2016-02-14 18:55:57,N
6,id2429028,2016-04-20 20:30:14,2016-04-20 20:36:51,N


In [13]:
# typecasting "store_and_fwd_flag" and "vendor_id" to category type
data['store_and_fwd_flag'] = data['store_and_fwd_flag'].astype('category')
data['vendor_id'] = data['vendor_id'].astype('category')
# converting "id" to integer by stripping the "id" prefix
data['id'] = data['id'].str[2:].astype('int64')
# checking
data[['store_and_fwd_flag', 'vendor_id', 'id']].dtypes

store_and_fwd_flag    category
vendor_id             category
id                       int64
dtype: object
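
To illustrate why these casts are worth making, here is a hedged sketch on toy data. The `flags` series is hypothetical (it only reuses the two levels `N` and `Y`), while the two identifiers are copied from the `head()` output. It compares the memory footprint of an object column against its categorical version and shows what stripping the `id` prefix yields.

```python
import pandas as pd

# Hypothetical toy data: a flag column with the same two levels as store_and_fwd_flag
flags = pd.Series(["N", "Y"] * 5000, dtype="object")
as_category = flags.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = flags.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(category_bytes < object_bytes)  # True

# Identifiers copied from the head() output; the prefix "id" is dropped before casting
ids = pd.Series(["id1080784", "id0889885"])
numeric_ids = ids.str[2:].astype("int64")
print(numeric_ids.tolist())  # [1080784, 889885] (leading zeros are not preserved)
```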

### datetime Data Type

In [14]:
# creating two instances (pickup_date and dropoff_date) of DatetimeIndex class using "pickup_datetime" and "dropoff_datetime"
pickup_date = pd.DatetimeIndex(data['pickup_datetime'])
dropoff_date = pd.DatetimeIndex(data['dropoff_datetime'])

In [15]:
# extracting new columns from "pickup_date" and "dropoff_date"

# day of year for pickup and dropoff
data['doy_pickup'] = pickup_date.dayofyear
data['doy_dropoff'] = dropoff_date.dayofyear

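# ISO week of year for pickup and dropoff
# (isocalendar() is used because DatetimeIndex.weekofyear was removed in pandas 2.0)
data['woy_pickup'] = pickup_date.isocalendar().week.to_numpy(dtype='int64')
data['woy_dropoff'] = dropoff_date.isocalendar().week.to_numpy(dtype='int64')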

# day of week for pickup and dropoff
data['dow_pickup'] = pickup_date.dayofweek
data['dow_dropoff'] = dropoff_date.dayofweek

# month of year for pickup and dropoff
data['moy_pickup'] = pickup_date.month
data['moy_dropoff'] = dropoff_date.month

# hour of day for pickup and dropoff
data['hod_pickup'] = pickup_date.hour
data['hod_dropoff'] = dropoff_date.hour

# minute of hour for pickup and dropoff
data['mod_pickup'] = pickup_date.minute
data['mod_dropoff'] = dropoff_date.minute

# second of minute for pickup and dropoff
data['sod_pickup'] = pickup_date.second
data['sod_dropoff'] = dropoff_date.second


In [16]:
# checking new extracted columns using datetime
data[['pickup_datetime','dropoff_datetime','doy_pickup','doy_dropoff','woy_pickup','woy_dropoff','moy_pickup','moy_dropoff','dow_pickup','dow_dropoff','hod_pickup','hod_dropoff','mod_pickup','mod_dropoff','sod_pickup','sod_dropoff']].head()

Unnamed: 0,pickup_datetime,dropoff_datetime,doy_pickup,doy_dropoff,woy_pickup,woy_dropoff,moy_pickup,moy_dropoff,dow_pickup,dow_dropoff,hod_pickup,hod_dropoff,mod_pickup,mod_dropoff,sod_pickup,sod_dropoff
0,2016-02-29 16:40:21,2016-02-29 16:47:01,60,60,9,9,2,2,0,0,16,16,40,47,21,1
1,2016-03-11 23:35:37,2016-03-11 23:53:57,71,71,10,10,3,3,4,4,23,23,35,53,37,57
2,2016-02-21 17:59:33,2016-02-21 18:26:48,52,52,7,7,2,2,6,6,17,18,59,26,33,48
3,2016-01-05 09:44:31,2016-01-05 10:03:32,5,5,1,1,1,1,1,1,9,10,44,3,31,32
4,2016-02-17 06:42:23,2016-02-17 06:56:31,48,48,7,7,2,2,2,2,6,6,42,56,23,31


---

***`DESCRIPTION OF NEWLY GENERATED COLUMNS`***
- ***doy_pickup*** - day of year of pickup
- ***doy_dropoff*** - day of year of dropoff
- ***woy_pickup*** - week of year of pickup
- ***woy_dropoff*** - week of year of dropoff
- ***moy_pickup*** - month of year of pickup
- ***moy_dropoff*** - month of year of dropoff
- ***dow_pickup*** - day of week of pickup (Monday = 0, Sunday = 6)
- ***dow_dropoff*** - day of week of dropoff (Monday = 0, Sunday = 6)
- ***hod_pickup*** - hour of day of pickup
- ***hod_dropoff*** - hour of day of dropoff
- ***mod_pickup*** - minute of the hour of pickup
- ***mod_dropoff*** - minute of the hour of dropoff
- ***sod_pickup*** - second of the minute of pickup
- ***sod_dropoff*** - second of the minute of dropoff
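
The same features can be derived through the `.dt` accessor on a datetime Series, which may be more convenient than building a separate `DatetimeIndex`. The sketch below is an alternative formulation, not the notebook's own code. It uses the pickup timestamps of rows 0 and 3 from the output above, so the results can be compared against that table.

```python
import pandas as pd

# Pickup timestamps of rows 0 and 3 from the head() output above
pickups = pd.Series(pd.to_datetime(["2016-02-29 16:40:21", "2016-01-05 09:44:31"]))

features = pd.DataFrame({
    "doy_pickup": pickups.dt.dayofyear,
    "woy_pickup": pickups.dt.isocalendar().week.astype("int64"),  # ISO week number
    "moy_pickup": pickups.dt.month,
    "dow_pickup": pickups.dt.dayofweek,  # Monday = 0 ... Sunday = 6
    "hod_pickup": pickups.dt.hour,
    "mod_pickup": pickups.dt.minute,     # minute within the hour
    "sod_pickup": pickups.dt.second,     # second within the minute
})

print(features.iloc[0].tolist())  # [60, 9, 2, 0, 16, 40, 21]
print(features.iloc[1].tolist())  # [5, 1, 1, 1, 9, 44, 31]
```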


In [19]:
# dropping the raw datetime columns now that their features have been extracted
data = data.drop(columns = ['pickup_datetime','dropoff_datetime'])
data.dtypes

id                       int64
vendor_id             category
passenger_count          int64
pickup_longitude       float64
pickup_latitude        float64
dropoff_longitude      float64
dropoff_latitude       float64
store_and_fwd_flag    category
trip_duration            int64
doy_pickup               int64
doy_dropoff              int64
woy_pickup               int64
woy_dropoff              int64
dow_pickup               int64
dow_dropoff              int64
moy_pickup               int64
moy_dropoff              int64
hod_pickup               int64
hod_dropoff              int64
mod_pickup               int64
mod_dropoff              int64
sod_pickup               int64
sod_dropoff              int64
dtype: object

`pickup_datetime` and `dropoff_datetime` have been dropped using `data = data.drop(columns = ['pickup_datetime','dropoff_datetime'])`.

## Univariate Analysis: Numerical Variables

In [20]:
# Numerical datatypes
data.select_dtypes(include=['int64','float64','Int64']).dtypes

id                     int64
passenger_count        int64
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
trip_duration          int64
doy_pickup             int64
doy_dropoff            int64
woy_pickup             int64
woy_dropoff            int64
dow_pickup             int64
dow_dropoff            int64
moy_pickup             int64
moy_dropoff            int64
hod_pickup             int64
hod_dropoff            int64
mod_pickup             int64
mod_dropoff            int64
sod_pickup             int64
sod_dropoff            int64
dtype: object

In [None]:
# segregating variables into groups
pickup_date = ['doy_pickup','woy_pickup','moy_pickup','dow_pickup']
dropoff_date = ['doy_dropoff','woy_dropoff','moy_dropoff','dow_dropoff']
pickup_time = ['hod_pickup','mod_pickup','sod_pickup']
dropoff_time = ['hod_dropoff','mod_dropoff','sod_dropoff']
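
Once the variables are grouped, each group can be summarised in one call. The helper below is only a sketch of one possible approach: the function name `summarise` and the choice of statistics are assumptions, and the toy frame holds just the five pickup-time values shown in the earlier `head()` output.

```python
import pandas as pd

# Pickup-time values of the five rows shown in the earlier head() output
toy = pd.DataFrame({
    "hod_pickup": [16, 23, 17, 9, 6],
    "mod_pickup": [40, 35, 59, 44, 42],
    "sod_pickup": [21, 37, 33, 31, 23],
})
pickup_time_cols = ["hod_pickup", "mod_pickup", "sod_pickup"]


def summarise(frame, columns):
    """Return one row per column with basic univariate statistics."""
    return frame[columns].agg(["min", "max", "mean", "median", "std", "skew"]).T


summary = summarise(toy, pickup_time_cols)
print(summary)
```

Passing a different list of column names to the same function gives the corresponding table for that group.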