## Create Machine Learning Models

In [1]:
import pandas as pd

### Predict a number of passengers

Loading 2018-taxi-trip-data-clean.csv

Due to dataset size, we have to split data into smaller chunks. Taxi data will be used to predict a number of passengers for each trip.

In [2]:
taxi_data = pd.read_csv('../processed_data/2018-taxi-trip-data-clean.csv', iterator=True, chunksize=1000000)

We have chosen SGDRegressor as an algorithm to predict a number of passengers.
It is a simple model, which allows fitting with chunks. We had to use incremental learning due to enormous dataset

In [3]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

In [4]:
SGD_pick = SGDRegressor()
SGD_drop = SGDRegressor()

We create empty lists, which will contain fragments from each chunk used in testing.

In [5]:
X_test_full_pick = []
y_test_full_pick = []

In [6]:
X_test_full_drop = []
y_test_full_drop = []

Each chunk of data for pickups and dropoffs is splitted into X's and y's and then into train and test dataframes.
Next SGD models are fitted with a corresponding train data. Test data is added to a list and will be used to evaluate the quality of a model.

In [7]:
for df in taxi_data:
    X_pick = df[['pickup_location', 'day_of_week_pickup_sin', 'day_of_week_pickup_cos', 'pickup_time_sin', 'pickup_time_cos']]
    y_pick = df['passenger_count']

    X_drop = df[['dropoff_location', 'day_of_week_dropoff_sin', 'day_of_week_dropoff_cos', 'dropoff_time_sin', 'dropoff_time_cos']]
    y_drop = df['passenger_count']

    X_train_pick, X_test_pick, y_train_pick, y_test_pick = train_test_split(X_pick, y_pick, test_size=0.05)
    X_train_drop, X_test_drop, y_train_drop, y_test_drop = train_test_split(X_drop, y_drop, test_size=0.05)

    SGD_pick.partial_fit(X_train_pick, y_train_pick)
    SGD_drop.partial_fit(X_train_drop, y_train_drop)

    X_test_full_pick.append(X_test_pick)
    y_test_full_pick.append(y_test_pick)

    X_test_full_drop.append(X_test_drop)
    y_test_full_drop.append(y_test_drop)

Creating dataframes out of lists of test data from each chunk

In [8]:
X_test_full_pick = pd.concat(X_test_full_pick)
y_test_full_pick = pd.concat(y_test_full_pick)

In [9]:
X_test_full_drop = pd.concat(X_test_full_drop)
y_test_full_drop = pd.concat(y_test_full_drop)

To assess the quality of our models, we have chosen mean squared error as our metric.

In [10]:
from sklearn.metrics import mean_squared_error

In [11]:
y_pred_pick = SGD_pick.predict(X_test_full_pick)
mean_squared_error(y_test_full_pick, y_pred_pick)

6.988030428159713e+23

In [12]:
y_pred_drop = SGD_drop.predict(X_test_full_drop)
mean_squared_error(y_test_full_drop, y_pred_drop)

5.078578208053911e+19

Unfortunately our results were not decent. We think it might be a result of quality of our data.

### Predict probability of pickup and dropoff

To predict the probability of a pickup and a dropoff we have chosen XGBRegressor as our model. We were interested in how this algorithm works. We were not forced to used model with incremental learning beacuse this datasets were much smaller.

In [13]:
import xgboost as xgb

In [14]:
xgb_pick = xgb.XGBRegressor(objective = 'reg:squarederror')
xgb_drop = xgb.XGBRegressor(objective = 'reg:squarederror')

Loading taxi_pickup_prob.csv and taxi_dropoff_prob.csv

In [15]:
taxi_pickup_prob = pd.read_csv('../processed_data/taxi_pickup_prob.csv')
taxi_dropoff_prob = pd.read_csv('../processed_data/taxi_dropoff_prob.csv')

Splitting data for pickups and dropoffs into X's and y's.

In [16]:
X_pick = taxi_pickup_prob[['pickup_location', 'day_of_week_pickup_sin', 'day_of_week_pickup_cos', 'pickup_time_sin', 'pickup_time_cos']]
y_pick = taxi_pickup_prob['probability_pickup']

In [17]:
X_drop = taxi_dropoff_prob[['dropoff_location', 'day_of_week_dropoff_sin', 'day_of_week_dropoff_cos', 'dropoff_time_sin', 'dropoff_time_cos']]
y_drop = taxi_dropoff_prob['probability_dropoff']

Creating train and test dataframes.

In [18]:
X_train_pick, X_test_pick, y_train_pick, y_test_pick = train_test_split(X_pick, y_pick, test_size=0.2)
X_train_drop, X_test_drop, y_train_drop, y_test_drop = train_test_split(X_drop, y_drop, test_size=0.2)

Fitting regressors with a corresponding train data.

In [19]:
xgb_pick.fit(X_train_pick, y_train_pick)
xgb_drop.fit(X_train_drop, y_train_drop)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

As with the SGD model, we assess the quality of XGBRegressors using mean squared error metric.

In [20]:
y_pred_pick = xgb_pick.predict(X_test_pick)
mean_squared_error(y_test_pick, y_pred_pick)

0.0029998766992110387

In [21]:
y_pred_drop = xgb_pick.predict(X_test_drop)
mean_squared_error(y_test_drop, y_pred_drop)

0.019340284877479325

This time we gtt better results.