# NYC Taxi Trip-Duration Estimation

NOTE: This step is only needed if you haven't already installed Kaskada.

In [1]:
!pip install --upgrade kaskada

In [11]:
!pip install matplotlib seaborn xgboost scikit-learn

1. Create a local, in-memory Kaskada instance

In [2]:
from kaskada.api.session import LocalBuilder
from kaskada import table

session = LocalBuilder().build()

INFO:kaskada.api.release:Using latest release version: engine@v0.9.0
INFO:kaskada.api.release:Skipping download. Using binary: /Users/ryan.michael/.cache/kaskada/bin/engine@v0.9.0/kaskada-engine
INFO:kaskada.api.release:Skipping download. Using binary: /Users/ryan.michael/.cache/kaskada/bin/engine@v0.9.0/kaskada-manager
INFO:kaskada.api.local_session.local_service:Initializing manager process
INFO:kaskada.api.local_session.local_service:Logging manager STDOUT to /Users/ryan.michael/.cache/kaskada/logs/2023-07-25T10-20-48-manager-stdout.log
INFO:kaskada.api.local_session.local_service:Logging manager STDERR to /Users/ryan.michael/.cache/kaskada/logs/2023-07-25T10-20-48-manager-stdout.log
INFO:kaskada.api.local_session.local_service:Initializing engine process
INFO:kaskada.api.local_session.local_service:Logging engine STDOUT to /Users/ryan.michael/.cache/kaskada/logs/2023-07-25T10-20-48-engine-stdout.log
INFO:kaskada.api.local_session.local_service:Logging engine STDERR to /Users/ryan.m

In [3]:
%load_ext fenlmagic

INFO:fenlmagic:extension loaded


## Data Prep

We'll be using data originally obtained from https://chriswhong.com/open-data/foil_nyc_taxi/. More recent datasets published by NYC report locations at a coarser geographic granularity, whereas the original dataset released in 2013 contains fine-grained latitude and longitude coordinates.

The data has been cleaned by converting from CSV to Parquet, and then the "Trip" and "Fare" datasets have been combined with a simple join:

```sql
copy (
    select * from 'trip_data.parquet' 
    join (select * from 'trip_fare.parquet') as fares 
    on  trip_data.medallion = fares.medallion
    and trip_data.hack_license = fares.hack_license 
    and trip_data.pickup_datetime = fares.pickup_datetime
    and trip_data.vendor_id = fares.vendor_id
) to 'combined.parquet' (FORMAT PARQUET)
```

Both the trip and fare data contain 173,179,759 rows, but the join result contains 173,185,091 rows (5,332 more). The extra rows are presumed to be duplicates and are ignored for this analysis, as they constitute about 0.003% of the full dataset.
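A join produces more rows than either input whenever a join key occurs more than once on both sides. The sketch below illustrates that effect on toy data and reproduces the percentage quoted above. The helper `join_row_count` and the toy keys are hypothetical; the real join key is the combination of `medallion`, `hack_license`, `pickup_datetime` and `vendor_id`.

```python
from collections import Counter

def join_row_count(left_keys, right_keys):
    """Number of rows an inner equi-join yields for the given key columns."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each key contributes (occurrences on the left) * (occurrences on the right) rows.
    return sum(count * right[key] for key, count in left.items())

# Toy data: key "b" is duplicated on both sides, so it yields 2 * 2 = 4 rows.
trips = ["a", "b", "b", "c"]
fares = ["a", "b", "b", "c"]
toy_join_rows = join_row_count(trips, fares)

# Row counts reported in the text above.
input_rows = 173_179_759
joined_rows = 173_185_091
extra_rows = joined_rows - input_rows
extra_pct = 100 * extra_rows / input_rows

print(toy_join_rows, extra_rows, round(extra_pct, 3))
```

Four input rows per side become six joined rows in the toy example, which is the same mechanism assumed to explain the 5,332 extra rows.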

Since each trip has both a pickup and a dropoff timestamp, the dataset has been split into two files: one describing pickup events and one describing dropoff events. The pickup dataset keeps only fields that could plausibly be known at pickup time. Specifically, the following fields are omitted:

* rate_code: Rate code is the "final rate code in effect at the end of the trip"
* store_and_fwd_flag: Potentially reflects information about the route taken, for example tunnels.
* dropoff_datetime: The trip duration can't be known in advance
* trip_time_in_secs: See above
* trip_distance: See above
* payment_type: Payment is made at the end of the trip
* fare_amount: The fare amount depends on the trip duration
* surcharge: Surcharge depends on the fare amount
* mta_tax: See above
* tip_amount: Tips are recorded at the end of a ride
* tolls_amount: Toll amounts depend on the route taken
* total_amount: Total depends on multiple fields not known in advance

Specifically, the pickup dataset is created from the query:

```sql
copy (
    select medallion, hack_license, vendor_id, pickup_datetime, passenger_count, pickup_longitude, pickup_latitude,
    from 'combined.parquet'
) to 'pickups.parquet' (FORMAT PARQUET)
```
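The same column selection can be written as an allow-list in Python, which makes it easy to check that no field learned at dropoff time leaks into the pickup view. This is a minimal sketch: `PICKUP_TIME_COLUMNS`, `OMITTED_COLUMNS` and `pickup_view` are hypothetical names, and the numeric values in the sample row are made up for illustration.

```python
# Columns selected by the pickup query above.
PICKUP_TIME_COLUMNS = (
    "medallion", "hack_license", "vendor_id", "pickup_datetime",
    "passenger_count", "pickup_longitude", "pickup_latitude",
)

# Columns listed above as unknown at pickup time.
OMITTED_COLUMNS = {
    "rate_code", "store_and_fwd_flag", "dropoff_datetime", "trip_time_in_secs",
    "trip_distance", "payment_type", "fare_amount", "surcharge", "mta_tax",
    "tip_amount", "tolls_amount", "total_amount",
}

def pickup_view(row):
    """Keep only the fields that are allowed in the pickup dataset."""
    return {name: row[name] for name in PICKUP_TIME_COLUMNS}

# Illustrative combined row (values are invented, not taken from the dataset).
combined_row = {
    "medallion": "M1", "hack_license": "H1", "vendor_id": "CMT",
    "pickup_datetime": "2013-01-01 00:00:00", "passenger_count": 1,
    "pickup_longitude": -73.99, "pickup_latitude": 40.74,
    "trip_time_in_secs": 420, "fare_amount": 7.5,
}

pickup_row = pickup_view(combined_row)
print(sorted(pickup_row))
```

Using an allow-list rather than a deny-list means a column added to the source later is excluded by default.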

2. Create a table for the data

In [9]:
table.delete_table("Pickup")
table.create_table(
  # The table's name
  table_name = "Pickup",
  # The name of the column in the data that contains the time associated with each row
  time_column_name = "pickup_datetime",
  # The name of the column in the data that contains the entity key associated with each row
  entity_key_column_name = "hack_license",
  grouping_id = "License",
)

0,1
table_name,Pickup
entity_key_column_name,hack_license
time_column_name,pickup_datetime
grouping_id,License
version,0
create_time,2023-07-25T10:22:22.337424
update_time,2023-07-25T10:22:22.337425

0,1
request_id,3af2661c3ba78affbe3918ca6cec6061


In [10]:
table.delete_table("Dropoff")
table.create_table(
  # The table's name
  table_name = "Dropoff",
  # The name of the column in the data that contains the time associated with each row
  time_column_name = "dropoff_datetime",
  # The name of the column in the data that contains the entity key associated with each row
  entity_key_column_name = "hack_license",
  grouping_id = "License",
)

0,1
table_name,Dropoff
entity_key_column_name,hack_license
time_column_name,dropoff_datetime
grouping_id,License
version,0
create_time,2023-07-25T10:22:25.858957
update_time,2023-07-25T10:22:25.858958

0,1
request_id,3a579366eed811a526afe10929e5d177


3. Load the files' contents into the Pickup and Dropoff tables

In [11]:
table.load(table_name = "Pickup", file = "pickups_100000.parquet")

0,1
data_token_id,3d287d84-4f1e-4670-ae4c-9bac8ccf7133

0,1
request_id,ac99bee9d6e11eda0572857cf2f88258


In [12]:
table.load(table_name = "Dropoff", file = "combined_100000.parquet")

0,1
data_token_id,2c369e13-b9c2-4c29-b1bc-8aed3ad21913

0,1
request_id,463d4cf5431b4404ff4ed8e852a21402


In [19]:
# Downsample to a subset of licenses
from kaskada.slice_filters import EntityPercentFilter
import kaskada.client
filter_percentage = 50
entity_filter = EntityPercentFilter(filter_percentage)
kaskada.client.set_default_slice(entity_filter)

INFO:kaskada.client:Slicing set to: {'percent': {'percent': 50}}
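Downsampling by entity differs from downsampling by row: a license is either kept with all of its events or dropped entirely, so per-license histories stay intact. The sketch below shows one way such a filter could work, using a stable hash of the entity key. It illustrates the idea only and is not Kaskada's implementation; `keep_entity` and the sample rows are hypothetical.

```python
import hashlib

def keep_entity(entity_key, percent):
    """Decide deterministically whether an entity falls inside the sampled slice."""
    digest = hashlib.sha256(entity_key.encode("utf-8")).digest()
    # Map the key to a stable bucket in [0, 10000) and keep the lowest buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100

# Illustrative events: several rows per license.
rows = [
    {"hack_license": "A", "event": 1},
    {"hack_license": "B", "event": 1},
    {"hack_license": "A", "event": 2},
    {"hack_license": "C", "event": 1},
    {"hack_license": "B", "event": 2},
]

sampled = [row for row in rows if keep_entity(row["hack_license"], 50)]
print(len(sampled), "of", len(rows), "rows kept")
```

Because the decision depends only on the key, the same licenses are selected on every run and in every table that shares the key.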


In [17]:
%%fenl --var all
Pickup

Unnamed: 0,_time,_subsort,_key_hash,_key,medallion,hack_license,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_latitude,dropoff_longitude
0,2013-01-01 00:00:00,467252012394183385,15913485641726903213,33276CA24A915CBD668AF96873D07883,08E54F4C460720DDE43460E354486FBC,33276CA24A915CBD668AF96873D07883,CMT,2013-01-01 00:00:00,1,-73.999878,40.743343,40.748280,-74.003708
1,2013-01-01 00:00:00,467252012394183386,9006261832047477220,25BA06A87905667AA1FE5990E33F0E2E,76942C3205E17D7E7FE5A9F709D16434,25BA06A87905667AA1FE5990E33F0E2E,VTS,2013-01-01 00:00:00,3,-73.955925,40.781887,40.777832,-73.963181
2,2013-01-01 00:00:00,467252012394183387,7881235434597582542,252392FA3463A93EC0D0F81A4EE2D05B,5C7501130D5FBCA3C65B6C6A5554D164,252392FA3463A93EC0D0F81A4EE2D05B,VTS,2013-01-01 00:00:00,6,-74.006927,40.740765,40.739616,-73.982994
3,2013-01-01 00:00:00,467252012394183388,1043742014481570536,CABFA7E28A15A656CF1B30D141B72414,18F11AB7B8E48739B14DA4C4314F99FC,CABFA7E28A15A656CF1B30D141B72414,VTS,2013-01-01 00:00:00,1,-73.995056,40.760185,40.754955,-73.968590
4,2013-01-01 00:00:00,467252012394183389,10106009620658921357,BB899DFEA9CC964B50C540A1D685CCFB,468244D1361B8A3EB8D206CC394BC9E9,BB899DFEA9CC964B50C540A1D685CCFB,VTS,2013-01-01 00:00:00,1,-73.955383,40.779728,40.760326,-73.967758
...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,2013-01-01 03:22:00,467252012394283380,9094629955177181826,934F55A3C4D15C6AAEA16710A42E297D,BE386D8524FCD16B3727DCF0A32D9B25,934F55A3C4D15C6AAEA16710A42E297D,VTS,2013-01-01 03:22:00,2,-73.937996,40.803677,40.859760,-73.929893
99996,2013-01-01 03:22:00,467252012394283381,18372752153190913550,0FEB05AD962BB9F20DE34A02283DF5FB,390CF993D59D0DC7297CE2A0BF73D4F5,0FEB05AD962BB9F20DE34A02283DF5FB,VTS,2013-01-01 03:22:00,2,-73.973160,40.792767,40.758629,-73.959747
99997,2013-01-01 03:22:00,467252012394283382,4952570644952036406,B2B964896730C55143F4261F988E58B5,9730AFBE8E88604CA7BE8224C3D06F16,B2B964896730C55143F4261F988E58B5,VTS,2013-01-01 03:22:00,5,-73.940659,40.715668,40.661446,-73.982803
99998,2013-01-01 03:22:00,467252012394283383,12766428500220350347,95EA78362BDBD8561F3E9B53E50A0B26,CA2E404BE3197B507CBBA41D1E0EA70F,95EA78362BDBD8561F3E9B53E50A0B26,VTS,2013-01-01 03:22:00,2,-73.980217,40.722229,40.690746,-73.957146

0,1
state,SUCCESS
query_id,5b2c9056-ae42-4b2a-b4a3-a59f0b205b4d
schema,(see Schema tab)
expression,Pickup

0,1
time_preparing,0.069s
time_computing,0.058s
output_files,1

0,1
can_execute,True

0,1
request_id,ee9b87c8c625b059cd36af62db9983aa

Unnamed: 0,column_name,column_type
0,medallion,string
1,hack_license,string
2,vendor_id,string
3,pickup_datetime,timestamp_us
4,passenger_count,i64
5,pickup_longitude,f64
6,pickup_latitude,f64
7,dropoff_latitude,f64
8,dropoff_longitude,f64


## Explore!

Try some queries of your own in the block below. The [Reference > Queries](https://kaskada.io/docs-site/kaskada/main/developing/queries.html) section of the docs can help you get started.

Possible prediction targets:
* Fare - https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/overview
* Duration - https://www.kaggle.com/c/nyc-taxi-trip-duration
* Wait time
* Tip amount

Candidate features:
* Time of day
* Distance
* Year
* Day of week

Feature engineering based on
* https://www.kaggle.com/code/headsortails/nyc-taxi-eda-update-the-fast-the-curious
* https://www.kaggle.com/code/maheshdadhich/strength-of-visualization-python-visuals-tutorial
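Distance is one of the feature ideas listed above, and the query in the next cell approximates it with the Euclidean distance between coordinates measured in degrees. For comparison, the sketch below computes the great-circle (haversine) distance in kilometres. The function names and the choice of Earth radius are assumptions made for this illustration; nothing here is part of Kaskada.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def euclidean_degrees(lat1, lon1, lat2, lon2):
    """Straight-line distance in degrees, as used by the query's approximation."""
    return math.hypot(lat1 - lat2, lon1 - lon2)

# Pickup and dropoff coordinates from the first row of the Pickup table.
pickup = (40.743343, -73.999878)
dropoff = (40.748280, -74.003708)

print(haversine_km(*pickup, *dropoff))
print(euclidean_degrees(*pickup, *dropoff))
```

For this trip the haversine distance comes out at roughly 0.64 km. The degree-based value is unitless and treats a degree of longitude like a degree of latitude, which is why it degrades at larger scales or higher latitudes.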

In [21]:
%%fenl --var examples

# Cross-entity features

# Build entity keys based on geographic areas
# TODO: GIS bucketing a la H3.
# This just normalizes to reasonable extents for NYC 
# Latitude: [40.650, 40.850] => [0, 200]
# Longitude: [-74.05, -73.75] => [0, 200]
let pu_src_bin_id = {
    lat: (((Pickup.pickup_latitude - 40.65) * (200/0.2)) as i64)  | else(-1), 
    lon: (((Pickup.pickup_longitude + 74.05) * (200/0.3)) as i64) | else(-1),
}
let pu_dst_bin_id = {
    lat: (((Pickup.dropoff_latitude - 40.65) * (200/0.2)) as i64)  | else(-1), 
    lon: (((Pickup.dropoff_longitude + 74.05) * (200/0.3)) as i64) | else(-1),
}
let do_src_bin_id = {
    lat: (((Dropoff.pickup_latitude - 40.65) * (200/0.2)) as i64) | else(-1), 
    lon: (((Dropoff.pickup_longitude + 74.05) * (200/0.3)) as i64) | else(-1),
}
let do_dst_bin_id = {
    lat: (((Dropoff.dropoff_latitude - 40.65) * (200/0.2)) as i64) | else(-1),
    lon: (((Dropoff.dropoff_longitude + 74.05) * (200/0.3)) as i64) | else(-1),
}

# Compute some per-trip metrics for Dropoffs
let trip_dist = sqrt(
        ((Dropoff.pickup_latitude - Dropoff.dropoff_latitude) | powf(power=2)) +
        ((Dropoff.pickup_longitude - Dropoff.dropoff_longitude) | powf(power=2)))
let trip_speed = 100000 * trip_dist / Dropoff.trip_time_in_secs
let dropoff_with_metrics = Dropoff | extend({trip_dist, trip_speed})

# Re-key by trip source and destination
let dropoff_by_src_bin = dropoff_with_metrics | with_key(do_src_bin_id)
let dropoff_by_dst_bin = dropoff_with_metrics | with_key(do_dst_bin_id)

# Compute aggregates related to trips departing from a given bin
let departure_mean_speed_10m = dropoff_by_src_bin.trip_speed | mean(window=sliding(10,minutely()))
let departure_mean_speed_60m = dropoff_by_src_bin.trip_speed | mean(window=sliding(60,minutely()))
let departure_mean_speed_1d = dropoff_by_src_bin.trip_speed | mean(window=sliding(24,hourly()))
let departure_count_10m = dropoff_by_src_bin | count(window=sliding(10, minutely()))
let departure_count_60m = dropoff_by_src_bin | count(window=sliding(60, minutely()))
let departure_count_1d = dropoff_by_src_bin | count(window=sliding(24, hourly()))

# Compute aggregates related to trips arriving at a given bin
let arrival_mean_speed_10m = dropoff_by_dst_bin.trip_speed | mean(window=sliding(10,minutely()))
let arrival_mean_speed_60m = dropoff_by_dst_bin.trip_speed | mean(window=sliding(60,minutely()))
let arrival_mean_speed_1d = dropoff_by_dst_bin.trip_speed | mean(window=sliding(24,hourly()))
let arrival_count_10m = dropoff_by_dst_bin | count(window=sliding(10, minutely()))
let arrival_count_60m = dropoff_by_dst_bin | count(window=sliding(60, minutely()))
let arrival_count_1d = dropoff_by_dst_bin | count(window=sliding(24, hourly()))

in {
    # TODO:
    # hour of day
    # day of week
    # geo cluster

    # TODO: GIS functions so we can compute actual distances from latitude / longitude
    # This just takes the euclidean distance - probably OK for this example but terrible 
    # at larger scales or higher latitudes.
    distance: sqrt(
        ((Pickup.pickup_latitude - Pickup.dropoff_latitude) | powf(power=2)) +
        ((Pickup.pickup_longitude - Pickup.dropoff_longitude) | powf(power=2))) | else(-1),
    monthday: day_of_month(Pickup.pickup_datetime as timestamp_ns) | else(-1),
    passengers: Pickup.passenger_count | else(-1),
    vendor: Pickup.vendor_id,
    src_lat_bin: pu_src_bin_id.lat,
    src_lon_bin: pu_src_bin_id.lon,
    dst_lat_bin: pu_dst_bin_id.lat,
    dst_lon_bin: pu_dst_bin_id.lon,

    # Features related to recent trips departing from the same area
    departure_mean_speed_10m: departure_mean_speed_10m | lookup(pu_src_bin_id) | else (-1),
    departure_mean_speed_60m: departure_mean_speed_60m | lookup(pu_src_bin_id) | else (-1),
    departure_mean_speed_1d: departure_mean_speed_1d | lookup(pu_src_bin_id) | else (-1),
    departure_count_10m: departure_count_10m | lookup(pu_src_bin_id) | else (-1),
    departure_count_60m: departure_count_60m | lookup(pu_src_bin_id) | else (-1),
    departure_count_1d: departure_count_1d | lookup(pu_src_bin_id) | else (-1),

    # Features related to recent trips arriving in the same area
    arrival_mean_speed_10m: arrival_mean_speed_10m | lookup(pu_dst_bin_id) | else (-1),
    arrival_mean_speed_60m: arrival_mean_speed_60m | lookup(pu_dst_bin_id) | else (-1),
    arrival_mean_speed_1d: arrival_mean_speed_1d | lookup(pu_dst_bin_id) | else (-1),
    arrival_count_10m: arrival_count_10m | lookup(pu_dst_bin_id) | else (-1),
    arrival_count_60m: arrival_count_60m | lookup(pu_dst_bin_id) | else (-1),
    arrival_count_1d: arrival_count_1d | lookup(pu_dst_bin_id) | else (-1),
} 
# We'll make predictions from features computed at the time of each pickup
| when(is_valid(Pickup))

# We'll predict the duration of the trip, which we learn at the time of the next dropoff
| last(window=since(is_valid(Dropoff)))
| when(is_valid(Dropoff))
| extend({target: Dropoff.trip_time_in_secs})

# cleaning
| when($input.distance < 0.5) # distance outliers
| when($input.distance > 0) # very short trips
| when($input.target < 24 * 60 * 60) # trips longer than a day
| when($input.target > 60) # trips shorter than a minute

0,1
state,FAILURE
query_id,

0,1
time_preparing,0.0s
time_computing,0.0s
output_files,0

0,1
can_execute,False
fenl_diagnostics,(see Diagnostics tab)

0,1
request_id,74d3e3283d40b1957d980997e6677d2b

0,1
0,"error[E0010]: Invalid argument type(s)  --> Query:37:49  | 37 | let dropoff_by_src_bin = dropoff_with_metrics | with_key(do_src_bin_id)  | ^^^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'with_key'  |  = Expected 'key'"
1,"error[E0010]: Invalid argument type(s)  --> Query:38:49  | 38 | let dropoff_by_dst_bin = dropoff_with_metrics | with_key(do_dst_bin_id)  | ^^^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'with_key'  |  = Expected 'key'"
2,"error[E0010]: Invalid argument type(s)  --> Query:70:58  | 70 | departure_mean_speed_10m: departure_mean_speed_10m | lookup(pu_src_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
3,"error[E0010]: Invalid argument type(s)  --> Query:71:58  | 71 | departure_mean_speed_60m: departure_mean_speed_60m | lookup(pu_src_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
4,"error[E0010]: Invalid argument type(s)  --> Query:72:56  | 72 | departure_mean_speed_1d: departure_mean_speed_1d | lookup(pu_src_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
5,"error[E0010]: Invalid argument type(s)  --> Query:73:48  | 73 | departure_count_10m: departure_count_10m | lookup(pu_src_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
6,"error[E0010]: Invalid argument type(s)  --> Query:74:48  | 74 | departure_count_60m: departure_count_60m | lookup(pu_src_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
7,"error[E0010]: Invalid argument type(s)  --> Query:75:46  | 75 | departure_count_1d: departure_count_1d | lookup(pu_src_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
8,"error[E0010]: Invalid argument type(s)  --> Query:78:54  | 78 | arrival_mean_speed_10m: arrival_mean_speed_10m | lookup(pu_dst_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
9,"error[E0010]: Invalid argument type(s)  --> Query:79:54  | 79 | arrival_mean_speed_60m: arrival_mean_speed_60m | lookup(pu_dst_bin_id) | else (-1),  | ^^^^^^ ------------- Type: {lat: i64, lon: i64}  | | | Invalid types for call to 'lookup'  |  = Expected 'key'"
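The diagnostics above reject the `{lat: i64, lon: i64}` record passed to `with_key` and `lookup`, both of which expect a single key value. One possible workaround is to collapse the two bin indices into one integer before re-keying. The sketch below shows the arithmetic in Python. `to_bin` and `combined_key` are hypothetical helpers, mapping out-of-extent coordinates to -1 is an added assumption, and whether a single `i64` satisfies the `key` constraint has not been verified against engine v0.9.0.

```python
import math

# Extents used by the query: latitude [40.65, 40.85], longitude [-74.05, -73.75].
LAT_MIN, LAT_SPAN = 40.65, 0.2
LON_MIN, LON_SPAN = -74.05, 0.3
BINS = 200

def to_bin(value, lower, span):
    """Scale a coordinate to a bin index, or -1 when the value is missing."""
    if value is None or math.isnan(value):
        return -1
    # int() truncates toward zero, similar to an integer cast.
    return int((value - lower) * (BINS / span))

def combined_key(lat, lon):
    """Pack the latitude and longitude bins into one integer key."""
    lat_bin = to_bin(lat, LAT_MIN, LAT_SPAN)
    lon_bin = to_bin(lon, LON_MIN, LON_SPAN)
    if not (0 <= lat_bin <= BINS and 0 <= lon_bin <= BINS):
        return -1  # assumption: everything outside the extent shares one key
    # Each axis has BINS + 1 possible values (0..200), so this packing is unique.
    return lat_bin * (BINS + 1) + lon_bin

# Pickup coordinates from the first row of the Pickup table.
key = combined_key(40.743343, -73.999878)
print(key, divmod(key, BINS + 1))
```

The original bins can be recovered from the packed key with `divmod(key, BINS + 1)`, so the separate `src_lat_bin` and `src_lon_bin` features remain available.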


In [15]:
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import timedelta
import datetime as dt
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 10]
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

In [16]:
X = examples.dataframe.drop(["target", "_key", "_key_hash", "_subsort", "_time"], axis=1)
y = examples.dataframe["target"]


columns_to_encode = ['vendor']
ct = ColumnTransformer(
    # scikit-learn >= 1.2 uses sparse_output; older releases use sparse=False
    [('encoder', OneHotEncoder(sparse_output=False), columns_to_encode)],
    remainder='passthrough' 
)
X = ct.fit_transform(X)

Xtr, Xv, ytr, yv = train_test_split(X, y, train_size=50000, test_size=10000, random_state=42)
dtrain = xgb.DMatrix(Xtr, label=ytr)
dvalid = xgb.DMatrix(Xv, label=yv)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

# Try different parameters! My favorite is random search :)
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.7, 'max_depth': 5,
            'subsample': 0.7, 'lambda': 1., 'nthread': 4, 'booster' : 'gbtree',
            # 'reg:squarederror' replaces the deprecated alias 'reg:linear'
            'eval_metric': 'rmsle', 'objective': 'reg:squarederror'}

In [17]:
# You could try training for more boosting rounds
# 0.28976 leader, 0.37195 10%
model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)

[0]	train-rmsle:1.18370	valid-rmsle:1.17266
[10]	train-rmsle:0.56332	valid-rmsle:0.56635
[20]	train-rmsle:0.55745	valid-rmsle:0.56367
[30]	train-rmsle:0.55421	valid-rmsle:0.56307
[40]	train-rmsle:0.55128	valid-rmsle:0.56377
[50]	train-rmsle:0.54971	valid-rmsle:0.56464
[54]	train-rmsle:0.54875	valid-rmsle:0.56428
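The `rmsle` values in the training log are root mean squared logarithmic errors. As a reading aid, the sketch below computes the metric in plain Python from its usual definition, the square root of the mean of `(log(1 + prediction) - log(1 + label))^2`. The function name and the sample durations are illustrative and are not taken from the model's output.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error for non-negative values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    )
    return math.sqrt(total / len(actual))

# Illustrative trip durations in seconds (not real model predictions).
actual_secs = [300.0, 600.0, 1200.0]
predicted_secs = [450.0, 500.0, 1500.0]

print(rmsle(predicted_secs, actual_secs))
```

Because the error is measured on a log scale, it penalises relative rather than absolute mistakes. A validation RMSLE near 0.56 corresponds to predictions that are typically off by a factor of roughly 1.75 (e^0.56).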
