New York City Taxi and Limousine Commission data for Green Taxis
This DS challenge is designed to evaluate your skills and intuition regarding a real world data problem. 
Data set: New York City Taxi and Limousine Commission trip records
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

We'll use data from Green Taxis for September 2015. 

Load data and analyze:

1. Characterize the data and comment about its quality
2. Explore and visualize the data e.g. a histogram of trip distance
3. Find interesting trip statistics grouped by hour
4. The taxi drivers want to know what kind of trip yields better tips. Can you build a model for them and explain the model?
5. Pick one of the options below
(Option 1) Find an anomaly in the data and explain your findings.
(Option 2) Visualize the data to help understand trip patterns

Please submit the result in the form of runnable notebooks or scripts. A link to GitHub or other code repository would be great.
Please let us know if we need to do anything special to run your notebook (install packages, get extra data etc.)

In [103]:
import warnings
warnings.filterwarnings('ignore')

In [100]:
!pip install geopy
!pip install wget
!pip install xgboost





In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import requests
import datetime as dt
import dask.dataframe as dask_dataframe
import dask.distributed
import scipy

import geopandas
import wget
import xgboost
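try:
    from geopy.distance import vincenty
except ImportError:
    # vincenty was removed in geopy 2.0; geodesic is its replacement
    from geopy.distance import geodesic as vincenty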
from shapely.geometry import Point
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split

In [4]:
url=r'https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv'
s=requests.get(url).content
data=pd.read_csv(io.StringIO(s.decode('utf-8')))

In [2]:
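# Alternative to the download above: read a local copy (raw string so the backslashes are not treated as escapes)
data = pd.read_csv(r"D:\Google Drive\Applications\Vian.ai\green_tripdata_2015-09.csv")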

In [3]:
print(data.shape)
data.head()

(1494926, 21)


Unnamed: 0,VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,...,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
0,2,2015-09-01 00:02:34,2015-09-01 00:02:38,N,5,-73.979485,40.684956,-73.979431,40.68502,1,...,7.8,0.0,0.0,1.95,0.0,,0.0,9.75,1,2.0
1,2,2015-09-01 00:04:20,2015-09-01 00:04:24,N,5,-74.010796,40.912216,-74.01078,40.912212,1,...,45.0,0.0,0.0,0.0,0.0,,0.0,45.0,1,2.0
2,2,2015-09-01 00:01:50,2015-09-01 00:04:24,N,1,-73.92141,40.766708,-73.914413,40.764687,1,...,4.0,0.5,0.5,0.5,0.0,,0.3,5.8,1,1.0
3,2,2015-09-01 00:02:36,2015-09-01 00:06:42,N,1,-73.921387,40.766678,-73.931427,40.771584,1,...,5.0,0.5,0.5,0.0,0.0,,0.3,6.3,2,1.0
4,2,2015-09-01 00:00:14,2015-09-01 00:04:20,N,1,-73.955482,40.714046,-73.944412,40.714729,1,...,5.0,0.5,0.5,0.0,0.0,,0.3,6.3,2,1.0


In [4]:
data.dtypes

VendorID                   int64
lpep_pickup_datetime      object
Lpep_dropoff_datetime     object
Store_and_fwd_flag        object
RateCodeID                 int64
Pickup_longitude         float64
Pickup_latitude          float64
Dropoff_longitude        float64
Dropoff_latitude         float64
Passenger_count            int64
Trip_distance            float64
Fare_amount              float64
Extra                    float64
MTA_tax                  float64
Tip_amount               float64
Tolls_amount             float64
Ehail_fee                float64
improvement_surcharge    float64
Total_amount             float64
Payment_type               int64
Trip_type                float64
dtype: object

In [None]:
# Parse the pickup and dropoff timestamp strings into datetime columns
data['pickup_date'] = pd.to_datetime(data.lpep_pickup_datetime, format="%Y-%m-%d %H:%M:%S")
data['dropoff_date'] = pd.to_datetime(data.Lpep_dropoff_datetime, format="%Y-%m-%d %H:%M:%S")
data.head()
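
With the timestamps parsed, hour-of-day and trip duration can be derived directly, which is what the "statistics grouped by hour" task needs. The sketch below runs on a tiny made-up sample rather than the real file; `pickup_hour` and `duration_min` are hypothetical column names chosen for illustration.

```python
import pandas as pd

# Tiny made-up sample in the same string format as the raw file
sample = pd.DataFrame({
    "lpep_pickup_datetime": ["2015-09-01 00:02:34", "2015-09-01 00:45:00", "2015-09-01 13:10:00"],
    "Lpep_dropoff_datetime": ["2015-09-01 00:12:34", "2015-09-01 01:15:00", "2015-09-01 13:25:30"],
    "Trip_distance": [1.5, 6.0, 2.5],
})

fmt = "%Y-%m-%d %H:%M:%S"
sample["pickup_date"] = pd.to_datetime(sample["lpep_pickup_datetime"], format=fmt)
sample["dropoff_date"] = pd.to_datetime(sample["Lpep_dropoff_datetime"], format=fmt)

# Hour of day of the pickup and trip duration in minutes (hypothetical feature names)
sample["pickup_hour"] = sample["pickup_date"].dt.hour
sample["duration_min"] = (sample["dropoff_date"] - sample["pickup_date"]).dt.total_seconds() / 60

# One row per pickup hour: number of trips, mean distance, mean duration
hourly = sample.groupby("pickup_hour").agg(
    trips=("Trip_distance", "size"),
    mean_distance=("Trip_distance", "mean"),
    mean_duration_min=("duration_min", "mean"),
)
print(hourly)
```

The same `groupby` would apply to the full `data` frame once `pickup_date` and `dropoff_date` exist.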

In [9]:
data.drop(columns=['lpep_pickup_datetime','Lpep_dropoff_datetime'],inplace=True)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1494926 entries, 0 to 1494925
Data columns (total 21 columns):
VendorID                 1494926 non-null int64
Store_and_fwd_flag       1494926 non-null object
RateCodeID               1494926 non-null int64
Pickup_longitude         1494926 non-null float64
Pickup_latitude          1494926 non-null float64
Dropoff_longitude        1494926 non-null float64
Dropoff_latitude         1494926 non-null float64
Passenger_count          1494926 non-null int64
Trip_distance            1494926 non-null float64
Fare_amount              1494926 non-null float64
Extra                    1494926 non-null float64
MTA_tax                  1494926 non-null float64
Tip_amount               1494926 non-null float64
Tolls_amount             1494926 non-null float64
Ehail_fee                0 non-null float64
improvement_surcharge    1494926 non-null float64
Total_amount             1494926 non-null float64
Payment_type             1494926 non-null int64
Tr

In [11]:
data.isnull().values.any()

True
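
`isnull().values.any()` only confirms that at least one value is missing. A per-column count shows where the gaps are; in the `data.info()` output above, for instance, `Ehail_fee` has 0 non-null values. A minimal sketch on a made-up frame (the `Trip_type` gap here is invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Fare_amount": [7.8, 45.0, 4.0, 5.0],
    "Ehail_fee": [np.nan, np.nan, np.nan, np.nan],  # entirely empty, as in the real file
    "Trip_type": [2.0, 2.0, np.nan, 1.0],           # one gap, made up for this example
})

# Missing values per column, as a count and as a share of rows
null_counts = toy.isnull().sum()
null_share = toy.isnull().mean()
summary = pd.DataFrame({"n_missing": null_counts, "share_missing": null_share})
print(summary[summary.n_missing > 0])

# Columns that carry no information at all can be dropped outright
empty_cols = [c for c in toy.columns if toy[c].isnull().all()]
cleaned = toy.drop(columns=empty_cols)
```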

In [12]:
def distance(df):
    """
    Input: DataFrame with pickup and dropoff latitude/longitude columns
    Output: list of pickup-to-dropoff distances in miles, computed with geopy's vincenty
    """
    df_location = df[['Pickup_latitude','Pickup_longitude','Dropoff_latitude','Dropoff_longitude']].copy()
    try:
        # Row-wise distance between the (lat, lon) pickup and dropoff points
        distances = list(df_location.apply(lambda x: vincenty((x['Pickup_latitude'], x['Pickup_longitude']),
                                                             (x['Dropoff_latitude'], x['Dropoff_longitude'])).miles,
                                           axis=1))
        return distances
    except ValueError as ve:
        # A ValueError (e.g. invalid coordinates) makes the whole call return 0 instead of a list
        return 0
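
The row-wise `apply` above calls geopy once per trip, which is slow on roughly 1.5 million rows. A vectorized haversine formula is a common faster substitute. Note that this is a different technique from the one used in the notebook: it treats the Earth as a sphere, so results differ slightly from vincenty's ellipsoidal distance. `haversine_miles` and the radius constant are assumptions for illustration.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (spherical model)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars, NumPy arrays, or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Coordinates of the third trip shown in data.head()
d = haversine_miles(40.766708, -73.92141, 40.764687, -73.914413)
print(round(float(d), 3))

# Vectorized call over arrays (second pair is a made-up zero-length trip)
lats1 = np.array([40.766708, 40.7])
lons1 = np.array([-73.92141, -73.9])
lats2 = np.array([40.764687, 40.7])
lons2 = np.array([-73.914413, -73.9])
dists = haversine_miles(lats1, lons1, lats2, lons2)
```

On the full frame this would be called as `haversine_miles(data.Pickup_latitude, data.Pickup_longitude, data.Dropoff_latitude, data.Dropoff_longitude)`, with no Dask partitioning needed for this step.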

In [13]:
# Adapted from Ravi Shekhar's
# https://towardsdatascience.com/geospatial-operations-at-scale-with-dask-and-geopandas-4d92d00eb7e8
def assign_taxi_zones(df, lon_var, lat_var, locid_var):
    """Joins DataFrame with Taxi Zones shapefile.
    This function takes longitude values provided by `lon_var`, and latitude
    values provided by `lat_var` in DataFrame `df`, and performs a spatial join
    with the NYC taxi_zones shapefile.
    Parameters
    ----------
    df : pandas.DataFrame or dask.DataFrame
        DataFrame containing latitudes, longitudes, and location_id columns.
    lon_var : string
        Name of column in `df` containing longitude values. Invalid values
        should be np.nan.
    lat_var : string
        Name of column in `df` containing latitude values. Invalid values
        should be np.nan
    locid_var : string
        Name of series to return.
    """
    localdf = df[[lon_var, lat_var]].copy()

    shape_df = geopandas.read_file('nyu_2451_36743_WGS84/nyu_2451_36743.shp')
    shape_df.drop(['OBJECTID', "Shape_Area", "Shape_Leng", "borough", "zone"],
                  axis=1, inplace=True)
    shape_df = shape_df.to_crs({'init': 'epsg:4326'})

    try:
        local_gdf = geopandas.GeoDataFrame(
            localdf, crs={'init': 'epsg:4326'},
            geometry=[Point(xy) for xy in
                      zip(localdf[lon_var], localdf[lat_var])])

        local_gdf = geopandas.sjoin(
            local_gdf, shape_df, how='left', op='within')

        return local_gdf.LocationID.rename(locid_var)
    except ValueError as ve:
        # On failure, report the error and return an all-NaN series aligned with the input rows
        print(ve)
        return pd.Series(np.nan, index=localdf.index, name=locid_var)

In [None]:
import ssl
import zipfile
ssl._create_default_https_context = ssl._create_unverified_context
wget.download('https://archive.nyu.edu/bitstream/2451/36743/3/nyu_2451_36743_WGS84.zip')
zipfile.ZipFile('nyu_2451_36743_WGS84.zip', 'r').extractall()

In [14]:
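# dask.dataframe (as dask_dataframe) and dask.distributed are imported in the first cell
# Start a local Dask cluster and split the pandas frame into 30 partitions
client = dask.distributed.Client()
ddf = dask_dataframe.from_pandas(data, npartitions=30)

# Pickup-to-dropoff distance in miles, computed per partition
ddf['distance_calculated'] = ddf.map_partitions(distance)

# Taxi-zone IDs for pickup and dropoff points, via a spatial join against the zones shapefile
ddf['pickup_zone'] = ddf.map_partitions(assign_taxi_zones, "Pickup_longitude", "Pickup_latitude", "pickup_zone", meta=('pickup_zone', np.float32))
ddf['dropoff_zone'] = ddf.map_partitions(assign_taxi_zones, "Dropoff_longitude", "Dropoff_latitude", "dropoff_zone", meta=('dropoff_zone', np.float32))

# Trigger the computation and collect the result back into a pandas DataFrame
data = ddf.compute()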



In [15]:
data.head()

Unnamed: 0,VendorID,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,...,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type,pickup_date,dropoff_date,distance_calculated,pickup_zone,dropoff_zone
0,2,N,5,-73.979485,40.684956,-73.979431,40.68502,1,0.0,7.8,...,,0.0,9.75,1,2.0,2015-09-01 00:02:34,2015-09-01 00:02:38,0.005281,25.0,25.0
1,2,N,5,-74.010796,40.912216,-74.01078,40.912212,1,0.0,45.0,...,,0.0,45.0,1,2.0,2015-09-01 00:04:20,2015-09-01 00:04:24,0.000841,,
2,2,N,1,-73.92141,40.766708,-73.914413,40.764687,1,0.59,4.0,...,,0.3,5.8,1,1.0,2015-09-01 00:01:50,2015-09-01 00:04:24,0.392664,7.0,7.0
3,2,N,1,-73.921387,40.766678,-73.931427,40.771584,1,0.74,5.0,...,,0.3,6.3,2,1.0,2015-09-01 00:02:36,2015-09-01 00:06:42,0.62612,7.0,179.0
4,2,N,1,-73.955482,40.714046,-73.944412,40.714729,1,0.61,5.0,...,,0.3,6.3,2,1.0,2015-09-01 00:00:14,2015-09-01 00:04:20,0.583141,255.0,80.0


In [16]:
data.describe()

Unnamed: 0,VendorID,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,...,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type,distance_calculated,pickup_zone,dropoff_zone
count,1494926.0,1494926.0,1494926.0,1494926.0,1494926.0,1494926.0,1494926.0,1494926.0,1494926.0,1494926.0,...,1494926.0,1494926.0,0.0,1494926.0,1494926.0,1494926.0,1494922.0,1494926.0,1491780.0,1488380.0
mean,1.782045,1.097653,-73.83084,40.69114,-73.83728,40.69291,1.370598,2.968141,12.5432,0.35128,...,1.235727,0.1231047,,0.2920991,15.03215,1.540559,1.022353,12.99532,117.2517,130.4193
std,0.412857,0.6359437,2.776082,1.530882,2.677911,1.476698,1.039426,3.076621,10.08278,0.3663096,...,2.431476,0.8910137,,0.05074009,11.55316,0.5232935,0.1478288,241.7792,77.54385,76.94744
min,1.0,1.0,-83.31908,0.0,-83.42784,0.0,0.0,0.0,-475.0,-1.0,...,-50.0,-15.29,,-0.3,-475.0,1.0,1.0,0.0,1.0,1.0
25%,2.0,1.0,-73.95961,40.69895,-73.96782,40.69878,1.0,1.1,6.5,0.0,...,0.0,0.0,,0.3,8.16,1.0,1.0,0.8118087,52.0,63.0
50%,2.0,1.0,-73.94536,40.74674,-73.94504,40.74728,1.0,1.98,9.5,0.5,...,0.0,0.0,,0.3,11.76,2.0,1.0,1.465467,93.0,129.0
75%,2.0,1.0,-73.91748,40.80255,-73.91013,40.79015,1.0,3.74,15.5,0.5,...,2.0,0.0,,0.3,18.3,2.0,1.0,2.709305,181.0,193.0
max,2.0,99.0,0.0,43.17726,0.0,42.79934,9.0,603.1,580.5,12.0,...,300.0,95.75,,0.3,581.3,5.0,2.0,5394.451,263.0,263.0
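
The summary above already points at quality problems: coordinates of 0.0, a minimum `Fare_amount` of -475, a maximum `Trip_distance` of 603.1 miles, and a minimum `Passenger_count` of 0. One way to quantify such records is a set of boolean flags. The sketch below uses a small, partly made-up frame; the bounding box and the distance cut-off are assumptions for illustration, not values taken from the TLC documentation.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pickup_latitude":  [40.684956, 0.0, 40.766708, 40.714046],
    "Pickup_longitude": [-73.979485, 0.0, -73.92141, -73.955482],
    "Fare_amount":      [7.8, 45.0, -475.0, 5.0],
    "Trip_distance":    [0.0, 2.1, 0.59, 603.1],
    "Passenger_count":  [1, 0, 1, 1],
})

# Rough NYC-area bounding box (assumed bounds, for illustration only)
LAT_MIN, LAT_MAX = 40.4, 41.1
LON_MIN, LON_MAX = -74.3, -73.6
MAX_PLAUSIBLE_MILES = 100  # arbitrary cut-off for a single taxi trip

in_box = (toy.Pickup_latitude.between(LAT_MIN, LAT_MAX)
          & toy.Pickup_longitude.between(LON_MIN, LON_MAX))

flags = pd.DataFrame({
    "bad_coords": ~in_box,
    "negative_fare": toy.Fare_amount < 0,
    "implausible_distance": toy.Trip_distance > MAX_PLAUSIBLE_MILES,
    "no_passengers": toy.Passenger_count == 0,
})

# How many rows trip each check, and the rows that pass all of them
print(flags.sum())
clean = toy[~flags.any(axis=1)]
```

Keeping the flags as separate columns, rather than filtering immediately, makes it possible to report how many rows each rule affects before deciding what to drop.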


In [None]:
data.to_csv("data_after_dask.csv")