<img src="assets/title.png" width="800px"/>

<br><br>

# Experiments

In this notebook, we explore the data and iterate on modeling experiments.

## Part 1 - Data Exploration

Two datasets are used:
- TLC NYC Taxi trips (2015) - [link](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
- NOAA Climate data of JFK airport, NYC (2015) - [link](https://www.ncei.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094789/detail)

### TLC NYC Taxi trips
Contains taxi trips, whose duration we seek to predict.
<br><br>

| Column name | Description |
| :- | :- |
| vendor_id | TPEP provider that provided the record |
| pickup_datetime | Start date and time of the ride |
| dropoff_datetime | End date and time of the ride |
| passenger_count | Number of passengers |
| trip_distance | Distance of the ride, in miles |
| pickup_longitude | Longitude of the starting point of the ride |
| pickup_latitude | Latitude of the starting point of the ride |
| rate_code | The rate code |
| store_and_fwd_flag | Trip record held in vehicle memory before sending to the vendor |
| dropoff_longitude | Longitude of the end point of the ride |
| dropoff_latitude | Latitude of the end point of the ride |
| payment_type | Type of payment |
| fare_amount | Fare of the ride, in dollars |

More details on the data schema are available on the [NYC TLC website](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).
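The target used later in this notebook comes from `get_target` in `src.data`, which is not shown here. As a rough illustration only, and assuming the duration is the elapsed time between `pickup_datetime` and `dropoff_datetime`, it could be derived with pandas as follows. The toy values are made up.

```python
import pandas as pd

# Toy trips using the column names from the table above (values are made up).
trips = pd.DataFrame(
    {
        "pickup_datetime": ["2015-01-01 00:10:00", "2015-01-01 08:30:00"],
        "dropoff_datetime": ["2015-01-01 00:25:30", "2015-01-01 09:05:00"],
    }
)

# Parse the timestamps, then express the difference as a duration in minutes.
pickup = pd.to_datetime(trips["pickup_datetime"])
dropoff = pd.to_datetime(trips["dropoff_datetime"])
duration_min = (dropoff - pickup).dt.total_seconds() / 60

print(duration_min)
```

Whether the project measures duration in minutes or seconds is an assumption here; the unit only changes the final division.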


### NOAA Climate data of JFK airport, NYC
Contains weather information.
The most relevant columns are:
<br><br>

| Column name | Description |
| :- | :- |
| TMAX | Maximum temperature |
| TMIN | Minimum temperature |
| PRCP | Precipitation |
| SNOW | Snowfall |
| SNWD | Snow depth |
| ACMH | Average cloudiness midnight to midnight |
| TSUN | Total sunshine for the period |
| AWND | Average wind speed |

The full data schema is available on the [NOAA website](https://www.ncei.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094789/detail).
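How the weather data is combined with the trips is handled inside the project code and is not shown in this notebook. One plausible approach, sketched below under the assumption that the weather file has one row per day with a `DATE` column, is to join each trip to the weather of its pickup date. The `DATE` column name and all values are illustrative assumptions.

```python
import pandas as pd

# Toy trips and toy daily weather (all values are made up).
trips = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2015-01-01 00:10:00", "2015-01-02 08:30:00", "2015-01-02 18:00:00"]
        )
    }
)
weather = pd.DataFrame(
    {
        "DATE": pd.to_datetime(["2015-01-01", "2015-01-02"]),  # assumed column name
        "TMAX": [3.9, 5.6],
        "PRCP": [0.0, 1.3],
    }
)

# Truncate each pickup timestamp to midnight so it matches the daily weather key.
trips["DATE"] = trips["pickup_datetime"].dt.normalize()

# Left join keeps every trip; validate guards against duplicated weather days.
merged = trips.merge(weather, on="DATE", how="left", validate="many_to_one")

print(merged)
```

A left join keeps trips even when a day has no weather record, which leaves missing values to handle explicitly rather than silently dropping rows.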


In [None]:
# Disable logging to suppress noisy log output from third-party libraries
import logging

logging.disable()
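Note that `logging.disable()` silences every logger in the process, including any logging done by the notebook itself. If only some libraries are noisy, a more targeted option is to raise the level of their loggers individually. The sketch below is an alternative, not what this notebook does, and the logger names `noisy_lib` and `experiments` are hypothetical.

```python
import logging

# "noisy_lib" is a hypothetical logger name used only for illustration.
noisy_logger = logging.getLogger("noisy_lib")
noisy_logger.setLevel(logging.ERROR)  # drop DEBUG, INFO and WARNING from this library

# "experiments" is a hypothetical logger for the notebook's own messages.
app_logger = logging.getLogger("experiments")
app_logger.setLevel(logging.INFO)
```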


In [None]:
from src.data import get_train_dataset

data = get_train_dataset()


In [None]:
import ydata_profiling

ydata_profiling.ProfileReport(data).to_widgets()


In [None]:
from src.data import get_target

target = get_target(data)
target.head()


## Part 2 - Naive modeling

In [None]:
from sklearn.model_selection import train_test_split

from src.schemas import TaxiColumn
from src.config import config

train_idx, test_idx = train_test_split(
    data.sort_values(TaxiColumn.PICKUP_TIME).index,
    test_size=config.test_size,
    shuffle=False,
)

print(f"Train size: {len(train_idx)} trips")
print(f"Test size: {len(test_idx)} trips")
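Sorting by pickup time and passing `shuffle=False` makes the split chronological: the model is trained on earlier trips and tested on later ones. The toy example below illustrates this behavior on made-up data, using a plain `pickup_datetime` column name in place of `TaxiColumn.PICKUP_TIME`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: ten daily trips stored in scrambled order (values are made up).
toy = pd.DataFrame(
    {"pickup_datetime": pd.date_range("2015-01-01", periods=10, freq="D")}
).sample(frac=1, random_state=0)

# Without shuffling, the last 20% of the sorted index becomes the test set.
toy_train_idx, toy_test_idx = train_test_split(
    toy.sort_values("pickup_datetime").index,
    test_size=0.2,
    shuffle=False,
)

# The split returns index labels, so rows are selected with .loc.
latest_train = toy.loc[toy_train_idx, "pickup_datetime"].max()
earliest_test = toy.loc[toy_test_idx, "pickup_datetime"].min()

print(latest_train, earliest_test)
```

Every training trip precedes every test trip, which avoids evaluating the model on data older than what it was trained on.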


In [None]:
COLS_TO_EXTRACT = [
    TaxiColumn.VENDOR_ID,
    TaxiColumn.PASSENGER_COUNT,
    TaxiColumn.PICKUP_LON,
    TaxiColumn.PICKUP_LAT,
    TaxiColumn.DROPOFF_LON,
    TaxiColumn.DROPOFF_LAT,
]

features = data.loc[:, COLS_TO_EXTRACT]
features.head()


In [None]:
# train_idx and test_idx hold index labels (taken from data.index), so select rows with .loc
train_features, train_target = features.loc[train_idx], target.loc[train_idx]
test_features, test_target = features.loc[test_idx], target.loc[test_idx]


In [None]:
from sklearn.ensemble import RandomForestRegressor

RANDOM_STATE = 42

model = RandomForestRegressor(random_state=RANDOM_STATE).fit(train_features, train_target)


In [None]:
%matplotlib inline

from matplotlib import pyplot as plt

sorted_idx = model.feature_importances_.argsort()

plt.barh(model.feature_names_in_[sorted_idx], model.feature_importances_[sorted_idx])
plt.show()


## Part 3 - Evaluation

In [None]:
import pandas as pd
from sklearn.model_selection import cross_validate, TimeSeriesSplit

SCORING_METHODS = ('neg_mean_absolute_error', 'neg_mean_squared_error')
N_SPLITS = 5

model = RandomForestRegressor(random_state=RANDOM_STATE)
splitter = TimeSeriesSplit(n_splits=N_SPLITS)

cv_scores = cross_validate(
    model,
    features,
    target,
    scoring=SCORING_METHODS,
    cv=splitter,
)
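
`TimeSeriesSplit` builds folds with an expanding training window: each fold trains on all rows before its test block, so it assumes the rows are already in chronological order. A small sketch on twelve dummy rows (with `n_splits=3` chosen only to keep the output short) shows the resulting fold layout.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve dummy rows standing in for trips already sorted by pickup time.
toy_X = np.arange(12).reshape(-1, 1)

# Collect the row positions of each (train, test) fold as plain lists.
folds = [
    (train.tolist(), test.tolist())
    for train, test in TimeSeriesSplit(n_splits=3).split(toy_X)
]

for train_pos, test_pos in folds:
    print(f"train={train_pos} test={test_pos}")
```

If the feature table is not sorted by pickup time, sorting it before cross-validation would be needed for the folds to respect time order.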


In [None]:
pd.DataFrame(cv_scores).agg(['mean', 'std'])
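
The `neg_*` scorers return negated errors, because scikit-learn treats higher scores as better. To read the results as ordinary errors, the sign can be flipped back, and the squared error can be converted to a root mean squared error in the same unit as the target. The sketch below uses a made-up frame shaped like the `cross_validate` output rather than the real results.

```python
import numpy as np
import pandas as pd

# Hypothetical scores shaped like cross_validate output (numbers are made up).
toy_scores = pd.DataFrame(
    {
        "test_neg_mean_absolute_error": [-4.0, -5.0],
        "test_neg_mean_squared_error": [-25.0, -36.0],
    }
)

# Flip the sign to get positive errors; take the square root of the MSE for RMSE.
errors = pd.DataFrame(
    {
        "mae": -toy_scores["test_neg_mean_absolute_error"],
        "rmse": np.sqrt(-toy_scores["test_neg_mean_squared_error"]),
    }
)
summary = errors.agg(["mean", "std"])

print(summary)
```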


<img src="assets/nibble.png" width="300px"/>
