# [Question 1]

Write the commands that download the data and unzip the archive file.

To validate [Question 1], check that the directory tree looks like this:

```
nyc_tlc
├── misc
│   ├── taxi_zone_lookup.csv
│   ├── taxi_zones
│   │   ├── taxi_zones.dbf
│   │   ├── taxi_zones.prj
│   │   ├── taxi_zones.sbn
│   │   ├── taxi_zones.sbx
│   │   ├── taxi_zones.shp
│   │   ├── taxi_zones.shp.xml
│   │   └── taxi_zones.shx
│   └── taxi_zones.zip
└── trip_data
    ├── yellow_tripdata_2018-04.csv
    ├── yellow_tripdata_2018-05.csv
    └── yellow_tripdata_2018-06.csv

3 directories, 12 files
```
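
If `tree` is not installed, the same check can be done from Python. The sketch below is only an illustration: `missing_files` is a hypothetical helper (not part of `contest_helper`), and the file list is copied from the expected tree above.

```python
from pathlib import Path

# Relative paths copied from the expected tree above (12 files in total).
EXPECTED_FILES = [
    "misc/taxi_zone_lookup.csv",
    "misc/taxi_zones.zip",
    "misc/taxi_zones/taxi_zones.dbf",
    "misc/taxi_zones/taxi_zones.prj",
    "misc/taxi_zones/taxi_zones.sbn",
    "misc/taxi_zones/taxi_zones.sbx",
    "misc/taxi_zones/taxi_zones.shp",
    "misc/taxi_zones/taxi_zones.shp.xml",
    "misc/taxi_zones/taxi_zones.shx",
    "trip_data/yellow_tripdata_2018-04.csv",
    "trip_data/yellow_tripdata_2018-05.csv",
    "trip_data/yellow_tripdata_2018-06.csv",
]


def missing_files(root="nyc_tlc"):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_files('nyc_tlc')` means every expected file is in place.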

In [None]:
# TODO your solution goes here:


In [None]:
!tree nyc_tlc

## Basic Preparation

We import the packages used throughout the notebook and apply some global settings.

In [None]:
# imports

import time
import pickle
import datetime
import numpy as np
import pandas as pd
import geopandas as gp
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import xgboost as xgb
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error as mae
import logging
import contest_helper

# global setting
logger = logging.getLogger()
logger.setLevel(logging.INFO)

plt.rcParams['agg.path.chunksize'] = 10000
plt.rcParams['figure.figsize'] = [12, 8]
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:.3f}')


## Taxi Zones Shape Preparation

Because the newest NYC Taxi dataset provides only `PULocationID` and `DOLocationID` instead of `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude`, we can only predict requests per `PULocationID` (zone). We load [taxi_zone_lookup.csv] and [taxi_zones.shp] and use `geopandas` to visualize the zones in Manhattan (69 in total).

`contest_helper.NycTaxiAnalyzer` is a wrapper class that loads and presents the taxi data and the zone shapes. It exposes:
1. `contest_helper.NycTaxiAnalyzer.taxi_zone_lookup`: `pandas.DataFrame`
1. `contest_helper.NycTaxiAnalyzer.taxi_zones_shape`: `geopandas.GeoDataFrame`
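
To make the borough filter concrete, here is a minimal sketch with plain `pandas`. It assumes the lookup table has the columns `LocationID`, `Borough`, `Zone`, and `service_zone`, and it uses a few toy rows rather than the full file. How `load_shape` filters internally is not shown in this notebook, so treat this as an approximation only.

```python
import io

import pandas as pd

# Toy rows in the assumed layout of taxi_zone_lookup.csv (not the full file).
toy_csv = io.StringIO(
    "LocationID,Borough,Zone,service_zone\n"
    "1,EWR,Newark Airport,EWR\n"
    "4,Manhattan,Alphabet City,Yellow Zone\n"
    "7,Queens,Astoria,Boro Zone\n"
    "12,Manhattan,Battery Park,Yellow Zone\n"
)
lookup = pd.read_csv(toy_csv)

# Keep only the zones of one borough.
manhattan = lookup[lookup["Borough"] == "Manhattan"].reset_index(drop=True)
manhattan_ids = manhattan["LocationID"].tolist()
print(manhattan_ids)
```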

In [None]:
nyc_taxi_analyzer = contest_helper.NycTaxiAnalyzer()

nyc_taxi_analyzer.load_shape('nyc_tlc/misc/taxi_zone_lookup.csv',
                            'nyc_tlc/misc/taxi_zones/taxi_zones.shp',
                            borough='Manhattan')

nyc_taxi_analyzer.taxi_zones_shape.plot()


# [Question 2]

1. Load the Manhattan data from 2018-04 to 2018-06.
2. Define a function `filter_abnormal_data` that filters out abnormal data.
3. Call `filter_abnormal_data` on `contest_helper.NycTaxiAnalyzer.data`.

## Load data

We split the dataset into a training part and a validation part by setting `train_valid_split_datetime` to 2018-06-01 00:00:00.
We set `first_datetime` to 2018-04-01 00:00:00 and `last_datetime` to 2018-07-01 00:00:00.
We load all data from [nyc_tlc/trip_data/] between `first_datetime` and `last_datetime`.
We then use `matplotlib` and `geopandas` to visualize some columns, which helps us understand the trip data.
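
The three datetimes also determine how many 30-minute slots fall into each part. The sketch below counts them with the standard library; `slot_id` is a hypothetical helper and is not necessarily how `contest_helper` computes its own 30-minute ids.

```python
import datetime

SLOT = datetime.timedelta(minutes=30)


def slot_id(ts, origin):
    """Number of whole 30-minute slots between `origin` and `ts` (hypothetical helper)."""
    return (ts - origin) // SLOT


first = datetime.datetime(2018, 4, 1)
split = datetime.datetime(2018, 6, 1)
last = datetime.datetime(2018, 7, 1)

train_slots = slot_id(split, first)        # April + May: 61 days * 48 slots
total_slots = slot_id(last, first)         # April + May + June: 91 days * 48 slots
valid_slots = total_slots - train_slots    # June: 30 days * 48 slots
print(train_slots, valid_slots, total_slots)
```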

In [None]:
# first_datetime '2018-04-01 00:00:00'
fd = datetime.datetime.strptime('2018-04-01 00:00:00', '%Y-%m-%d %H:%M:%S')
# last_datetime '2018-07-01 00:00:00'
ld = datetime.datetime.strptime('2018-07-01 00:00:00', '%Y-%m-%d %H:%M:%S')
# train_valid_split_datetime '2018-06-01 00:00:00'
tvsd = datetime.datetime.strptime('2018-06-01 00:00:00', '%Y-%m-%d %H:%M:%S')

nyc_taxi_analyzer.load_data('nyc_tlc/trip_data/', first_datetime=fd, last_datetime=ld)
nyc_taxi_analyzer.data.head()


## Filter abnormal data

Define a function that filters out abnormal data.
A row is acceptable only if it satisfies all of the following:
1. `trip_distance > 0`
1. `trip_duration > 0`
1. `0 < trip_speed <= 200`
1. `total_amount > 0`

To validate [Question 2], check that the resulting shape is:

```
(24540246, 23)
```
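
As a rough illustration of the rules above, the sketch below applies them as one boolean mask on a toy DataFrame. It assumes the loaded data exposes columns named `trip_distance`, `trip_duration`, `trip_speed`, and `total_amount`; `filter_by_rules` is a hypothetical name, and the toy values are made up.

```python
import pandas as pd


def filter_by_rules(df):
    """Keep rows that satisfy all four rules (illustrative sketch)."""
    mask = (
        (df["trip_distance"] > 0)
        & (df["trip_duration"] > 0)
        & (df["trip_speed"] > 0)
        & (df["trip_speed"] <= 200)
        & (df["total_amount"] > 0)
    )
    return df[mask]


# Toy data: only the first row passes every rule.
toy = pd.DataFrame({
    "trip_distance": [1.2, 0.0, 3.0, 2.0, 5.0],
    "trip_duration": [300, 200, 0, 10, 600],
    "trip_speed": [14.4, 0.0, 0.0, 720.0, 30.0],
    "total_amount": [8.5, 5.0, 7.0, 6.0, -3.0],
})
kept = filter_by_rules(toy)
print(kept.shape)
```

Combining the conditions with `&` into a single mask avoids building intermediate copies of a frame with tens of millions of rows.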

In [None]:

def filter_abnormal_data(data):
    # TODO your solution goes here:


    return data

nyc_taxi_analyzer.data = filter_abnormal_data(nyc_taxi_analyzer.data)

nyc_taxi_analyzer.data.shape


# [Question 3]

Show statistics of the prepared sample data.

In [None]:
# TODO your solution goes here:



In [None]:

# your solution goes here:

def filter_abnormal_data(data):
    return data


sample = filter_abnormal_data(nyc_taxi_analyzer.data)

sample.shape


## [Challenge Question]

Add a new prediction algorithm, or change the parameters of the prediction algorithms defined below.
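
Before trying anything elaborate, it can help to have a trivial reference score. Among constant predictions, the median of the training labels minimizes MAE on the training set, so a new algorithm should at least beat it. The sketch below uses made-up labels; `median_baseline` and `mean_abs_error` are hypothetical names.

```python
import numpy as np


def median_baseline(train_Y, n_test):
    """Predict the training median for every test row (constant baseline)."""
    return np.full(n_test, np.median(train_Y), dtype=float)


def mean_abs_error(y_true, y_pred):
    """Mean absolute error, written out with NumPy."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Made-up labels for illustration only.
train_Y = np.array([0, 2, 3, 10, 50])
test_Y = np.array([1, 3, 5])

pred = median_baseline(train_Y, len(test_Y))
score = mean_abs_error(test_Y, pred)
print(pred, score)
```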

In [None]:
first_30min_id = nyc_taxi_analyzer.get_30min_id(fd)
last_30min_id = nyc_taxi_analyzer.get_30min_id(ld)
train_valid_split_30min_id = nyc_taxi_analyzer.get_30min_id(tvsd)
all_30min_index, all_30min_static = nyc_taxi_analyzer.get_all_index_and_static(last_30min_id, 'tpep_pickup_30min_id')

sample_30min_count, sample_30min_mean, sample_30min_sum, sample_30min_dropoff_count, sample_30min_dropoff_mean, sample_30min_dropoff_sum = nyc_taxi_analyzer.get_sample_group('tpep_pickup_30min_id', nyc_taxi_analyzer.data)
all_30min = nyc_taxi_analyzer.get_all(all_30min_index, sample_30min_count, sample_30min_mean, sample_30min_sum, sample_30min_dropoff_count, sample_30min_dropoff_mean, sample_30min_dropoff_sum)
all_30min_features = nyc_taxi_analyzer.get_all_features(all_30min, all_30min_static, nyc_taxi_analyzer.location_num)

## Train and Validate

We split all data into a training part and a validation part. We demonstrate three forecasting methods (XGBoost, LightGBM, and linear regression from sklearn) and evaluate each model with mean absolute error (MAE). We also visualize the prediction results between 2018-04-01 00:00:00 and 2018-04-01 00:05:00 using `geopandas` (the darker the color, the higher the demand); the same method can visualize any time slot.
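
The positional split in the next cell only works if the feature frame holds one row per (slot, zone) pair, ordered by slot first and zone second. That ordering is an assumption inferred from the slicing, not something this notebook confirms. The toy sketch below shows why `split_slot * n_zones` is then the right row offset.

```python
import numpy as np
import pandas as pd

n_slots, n_zones, split_slot = 6, 3, 4

# Toy frame with one row per (slot, zone), sorted by slot and then by zone.
toy = pd.DataFrame({
    "slot": np.repeat(np.arange(n_slots), n_zones),
    "zone": np.tile(np.arange(n_zones), n_slots),
})
toy["value"] = toy["slot"] * 10 + toy["zone"]

# Every slot contributes exactly n_zones rows, so the split is a row offset.
train = toy[: split_slot * n_zones]
valid = toy[split_slot * n_zones : n_slots * n_zones]
print(train.shape, valid.shape)
```

If the rows were ordered differently, a boolean mask on the slot column would be the safer way to split.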

In [None]:
manhattan_location_num = nyc_taxi_analyzer.location_num

train_X_30min = all_30min_features[:int(train_valid_split_30min_id)*manhattan_location_num]
print('train_X_30min:', train_X_30min.shape)
valid_X_30min = all_30min_features[int(train_valid_split_30min_id)*manhattan_location_num:int(last_30min_id)*manhattan_location_num]
print('valid_X_30min:', valid_X_30min.shape)
train_Y_30min = train_X_30min['value'].values
print('train_Y_30min:', len(train_Y_30min))
valid_Y_30min = valid_X_30min['value'].values
print('valid_Y_30min:', len(valid_Y_30min))

In [None]:
def xgb_train_validate(train_X, train_Y, test_X, test_Y):
    xg_train = xgb.DMatrix(train_X.drop('value', axis=1), label=train_Y)
    xg_test = xgb.DMatrix(test_X.drop('value', axis=1), label=test_Y)
    # set up parameters for xgboost
    param = {}
    param['eta'] = 0.1  # learning rate (library default: 0.3)
    param['max_depth'] = 6  # default: 6
    param['verbosity'] = 0  # silent; replaces the deprecated 'silent' flag
    param['nthread'] = 4  # number of threads
    param['gamma'] = 1
    param['subsample'] = 0.9
    param['min_child_weight'] = 1
    param['colsample_bytree'] = 0.9
    param['lambda'] = 1
    param['booster'] = 'gbtree'
    param['eval_metric'] = 'mae'
    param['objective'] = 'reg:squarederror'  # current name of the deprecated 'reg:linear'
    
    watchlist = [(xg_train, 'train'), (xg_test, 'test')]
    num_round = 100

    bst = xgb.train(param, xg_train, num_round, watchlist)

    imp = bst.get_fscore()
    print(sorted(imp.items(), key=lambda d: d[1], reverse=True))
    
    pred = bst.predict(xg_test)
    return pred

In [None]:
def lr_train_validate(train_X, train_Y, test_X, test_Y):
    rfc = LinearRegression()
    # np.float was removed from NumPy; the builtin float is equivalent
    rfc.fit(train_X.drop('value', axis=1), train_Y.astype(float))
    pred = rfc.predict(test_X.drop('value', axis=1))
    return pred

In [None]:
def lgb_train_validate(train_X, train_Y, test_X, test_Y):
    # create dataset for lightgbm
    lgb_train = lgb.Dataset(train_X.drop('value', axis=1), train_Y)
    lgb_eval = lgb.Dataset(test_X.drop('value', axis=1), test_Y, reference=lgb_train)

    # specify your configurations as a dict
    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': 31,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }

    print('Starting training...')
    # train
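    # early stopping is passed as a callback; newer LightGBM releases
    # no longer accept the early_stopping_rounds keyword in lgb.train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=100,
                    valid_sets=lgb_eval,
                    callbacks=[lgb.early_stopping(stopping_rounds=5)])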
    
    print('Starting predicting...')
    # predict
    pred = gbm.predict(test_X.drop('value', axis=1), num_iteration=gbm.best_iteration)
    # eval
    print('The mae of prediction is:', mae(test_Y, pred))
    return pred

In [None]:
# Add new prediction algorithm
def new_algo_train_validate(train_X, train_Y, test_X, test_Y):
    """
    :param train_X: DataFrame, (?, 35) train data including the 'value' column; the column is dropped below before training
    :param train_Y: array, train label data, which is actually train_X['value'].values
    :param test_X: DataFrame, (?, 35) test data including the 'value' column; the column is dropped below before predicting
    :param test_Y: array, test label data, which is actually test_X['value'].values
    :return: array, test prediction data
    """
    train_X = train_X.drop('value', axis=1)
    test_X = test_X.drop('value', axis=1)
    pred = np.zeros(len(test_Y))  # placeholder prediction: all zeros
    
    # TODO your solution goes here:
    
    
    return pred


In [None]:
# train and validate 30min slot
pred_30min_xgb = xgb_train_validate(train_X_30min, train_Y_30min, valid_X_30min, valid_Y_30min)
valid_30min_xgb_mae = mae(valid_Y_30min, pred_30min_xgb)
print('valid_30min_xgb_mae:', valid_30min_xgb_mae)
pred_30min_lr = lr_train_validate(train_X_30min, train_Y_30min, valid_X_30min, valid_Y_30min)
valid_30min_lr_mae = mae(valid_Y_30min, pred_30min_lr)
print('valid_30min_lr_mae:', valid_30min_lr_mae)
pred_30min_lgb = lgb_train_validate(train_X_30min, train_Y_30min, valid_X_30min, valid_Y_30min)
valid_30min_lgb_mae = mae(valid_Y_30min, pred_30min_lgb)
print('valid_30min_lgb_mae:', valid_30min_lgb_mae)
pred_30min_new_algo = new_algo_train_validate(train_X_30min, train_Y_30min, valid_X_30min, valid_Y_30min)
valid_30min_new_algo_mae = mae(valid_Y_30min, pred_30min_new_algo)
print('valid_30min_new_algo_mae:', valid_30min_new_algo_mae)
valid_pred_30min = pd.DataFrame(valid_X_30min, columns=['value'])
valid_pred_30min.reset_index(inplace=True)
valid_pred_30min['pred_xgb'] = pred_30min_xgb
valid_pred_30min['pred_lr'] = pred_30min_lr
valid_pred_30min['pred_lgb'] = pred_30min_lgb
valid_pred_30min['pred_new_algo'] = pred_30min_new_algo
print('valid_pred_30min:', valid_pred_30min.shape)

train_X_30min.to_csv('train_X_30min.csv', index=True)
valid_X_30min.to_csv('valid_X_30min.csv', index=True)
valid_pred_30min.to_csv('valid_pred_30min.csv', index=False)

In [None]:
# show evaluation results
print('valid_30min_xgb_mae:', valid_30min_xgb_mae)
print('valid_30min_lr_mae:', valid_30min_lr_mae)
print('valid_30min_lgb_mae:', valid_30min_lgb_mae)
print('valid_30min_new_algo_mae:', valid_30min_new_algo_mae)