## How long should the training window be (for tips)?

Pulling a year of data takes a long time, and we'd rather not pull more data than needed to train a model if the performance is similar. This notebook is a quick check of whether a year of training data is worthwhile; it is not conclusive.

We compare training on `2022-11` through `2023-11` vs. training on `2023-11` alone, and then look at each model's MSE when predicting `2023-12`.
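The month keys used throughout can also be generated up front instead of advancing a string pointer. This is a minimal sketch, assuming the same `YYYY-MM` key format as the parquet file names; the names `train_year`, `train_month` and `test_month` are illustrative and do not appear in the cells below.

```python
import pandas as pd

# All months in the long training window, 2022-11 through 2023-11 inclusive.
train_year = pd.period_range("2022-11", "2023-11", freq="M").strftime("%Y-%m").tolist()

# The single-month training window and the held-out test month.
train_month = ["2023-11"]
test_month = "2023-12"

print(len(train_year), train_year[0], train_year[-1])
```

Note that the inclusive range covers 13 monthly files, so the "year" window is slightly longer than 12 months.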

### Tips

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
import xgboost as xgb
from scipy.stats import zscore
import numpy as np
import statsmodels.api as sm

In [21]:
date_ptr = '2022-11'

month_data = defaultdict()

while date_ptr != '2024-01':
    date_dt = pd.to_datetime(date_ptr)
    month_df = pd.read_parquet(f'../data/tr_data/{date_ptr}.parquet')
    
    # drop tip outliers: keep rows within 2 standard deviations of that month's mean tip
    month_df['tip_amount_zscore'] = zscore(month_df['tip_amount'])
    month_df = month_df[np.abs(month_df['tip_amount_zscore']) < 2]
    month_data[date_ptr] = month_df
    date_dt = date_dt + pd.DateOffset(months=1)
    date_ptr = date_dt.strftime('%Y-%m')
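
The effect of the z-score filter in the loop above can be seen on a small made-up sample. This sketch computes the z-score by hand with the population standard deviation (`ddof=0`), which is what `scipy.stats.zscore` uses by default; the `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy tips: nine ordinary values and one extreme one (made-up numbers).
toy = pd.DataFrame({"tip_amount": [1.0, 2.0, 2.5, 3.0, 2.0, 1.5, 2.5, 3.0, 2.0, 50.0]})

# Z-score with the population standard deviation (ddof=0).
tips = toy["tip_amount"]
toy["tip_amount_zscore"] = (tips - tips.mean()) / tips.std(ddof=0)

# Keep rows within two standard deviations of the mean.
kept = toy[np.abs(toy["tip_amount_zscore"]) < 2]
print(len(toy), "->", len(kept))
```

Because the mean and standard deviation are computed on the unfiltered sample, a single extreme value inflates both, so only the most extreme rows end up removed.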

#### Collecting data

In [None]:
date_ptr = '2022-11'
year_data = []
while date_ptr != '2023-12':
    year_data.append(month_data[date_ptr])
    date_ptr = (pd.to_datetime(date_ptr) + pd.DateOffset(months=1)).strftime('%Y-%m')

year_df = pd.concat(year_data)
month_df = month_data['2023-11']

#### Training the models

In [23]:
year_model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    objective='reg:squarederror',
    random_state=42
).fit(year_df[['trip_distance']], year_df['tip_amount'])

month_model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    objective='reg:squarederror',
    random_state=42
).fit(month_df[['trip_distance']], month_df['tip_amount'])

#### Checking the MSE

In [24]:
from sklearn.metrics import mean_squared_error

test_data = month_data['2023-12']

year_pred = year_model.predict(test_data[['trip_distance']])
month_pred = month_model.predict(test_data[['trip_distance']])

In [25]:
# MSE of each model on the held-out 2023-12 data (y_true first, then y_pred)
print(f"Year MSE: {mean_squared_error(test_data['tip_amount'], year_pred)}")
print(f"Month MSE: {mean_squared_error(test_data['tip_amount'], month_pred)}")

Year MSE: 9.38975752680463
Month MSE: 8.910082613604944


The month model is actually slightly more accurate (MSE of about 8.91 vs. 9.39). Either way, a single month of data seems to be enough to fit a tip model that is at least as accurate as one trained on the full window.
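
To put the gap between the two tip models in perspective, the relative difference and the RMSE can be worked out from the printed values. This is a small arithmetic sketch; the two numbers are copied from the output above, and the variable names are illustrative.

```python
# MSE values copied from the printed output above.
year_mse = 9.38975752680463
month_mse = 8.910082613604944

# Relative reduction in MSE from training on one month instead of the full window.
rel_reduction = (year_mse - month_mse) / year_mse
print(f"Relative MSE reduction: {rel_reduction:.1%}")

# RMSE is in the same unit as tip_amount, which is easier to interpret than MSE.
year_rmse = year_mse ** 0.5
month_rmse = month_mse ** 0.5
print(f"Year RMSE: {year_rmse:.2f}, Month RMSE: {month_rmse:.2f}")
```

The reduction is roughly 5%, which is small enough that a single train/test split cannot establish that the month model is genuinely better, only that it is not clearly worse.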

### Fares

Does a single month of data also hold up for fares, particularly since we regress only on the lower quantile (the bottom 3% of trips by fare per unit distance)?

In [26]:
date_ptr = '2022-11'

month_fdata = defaultdict()

while date_ptr != '2024-01':
    date_dt = pd.to_datetime(date_ptr)
    month_df = pd.read_parquet(f'../data/tr_data/{date_ptr}.parquet')
    month_fdata[date_ptr] = month_df
    date_dt = date_dt + pd.DateOffset(months=1)
    date_ptr = date_dt.strftime('%Y-%m')

In [29]:
date_ptr = '2022-11'
year_fdata = []
while date_ptr != '2023-12':
    year_fdata.append(month_data[date_ptr])
    date_ptr = (pd.to_datetime(date_ptr) + pd.DateOffset(months=1)).strftime('%Y-%m')

year_fdf = pd.concat(year_fdata)
month_fdf = month_fdata['2023-11']

#### Fit the models

In [33]:
year_fdf['fare/distance'] = year_fdf['fare_amount'] / year_fdf['trip_distance']
month_fdf['fare/distance'] = month_fdf['fare_amount'] / month_fdf['trip_distance']

quantile_year = year_fdf[year_fdf['fare/distance'] <= year_fdf['fare/distance'].quantile(0.03)]
quantile_month = month_fdf[month_fdf['fare/distance'] <= month_fdf['fare/distance'].quantile(0.03)]
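
The lower-quantile selection above can be sketched on made-up data to show how few rows `quantile(0.03)` keeps. The `toy` frame below is invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# 100 toy trips of 1 unit distance each with fares 1..100, so fare/distance is 1..100.
toy = pd.DataFrame({
    "fare_amount": [float(i) for i in range(1, 101)],
    "trip_distance": [1.0] * 100,
})
toy["fare/distance"] = toy["fare_amount"] / toy["trip_distance"]

# Threshold at the 3rd percentile (linear interpolation by default),
# then keep rows at or below it.
threshold = toy["fare/distance"].quantile(0.03)
lower = toy[toy["fare/distance"] <= threshold]
print(threshold, len(lower))
```

Only about 3% of the rows survive, so the regression that follows is fitted on a small slice of each training set.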

In [35]:
year_fare_model = sm.OLS(quantile_year['fare_amount'], sm.add_constant(quantile_year['trip_distance'])).fit()
month_fare_model = sm.OLS(quantile_month['fare_amount'], sm.add_constant(quantile_month['trip_distance'])).fit()

year_fare_params = year_fare_model.params.values
month_fare_params = month_fare_model.params.values
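
The parameter order matters for the manual prediction that follows: `sm.add_constant` prepends the constant column by default, so index 0 is the intercept and index 1 is the slope on `trip_distance`. The sketch below illustrates that ordering on synthetic data. It swaps statsmodels for `np.linalg.lstsq` to stay dependency-light, and all numbers are made up.

```python
import numpy as np

# Synthetic trips (made-up): fare = 3.0 + 2.5 * distance, no noise.
distance = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
fare = 3.0 + 2.5 * distance

# Design matrix with the constant column first, mirroring sm.add_constant's default.
X = np.column_stack([np.ones_like(distance), distance])

# Ordinary least squares via numpy; params[0] is the intercept, params[1] the slope.
params, *_ = np.linalg.lstsq(X, fare, rcond=None)

# Manual prediction in the same intercept-plus-slope form the notebook uses.
new_distance = np.array([3.0, 10.0])
pred = new_distance * params[1] + params[0]
print(params, pred)
```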

#### Find the MSE

In [43]:
test_fdf = month_fdata['2023-12']

year_fpred = test_fdf['trip_distance'] * year_fare_params[1] + year_fare_params[0]
month_fpred = test_fdf['trip_distance'] * month_fare_params[1] + month_fare_params[0]

print(f"Year MSE: {mean_squared_error(test_fdf['fare_amount'], year_fpred)}")
print(f"Month MSE: {mean_squared_error(test_fdf['fare_amount'], month_fpred)}")

Year MSE: 284.2980587957826
Month MSE: 67.46598843222199


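Again, the model trained on the most recent month performs better (MSE of about 67.5 vs. 284.3). The comparison is not like-for-like, though: the year set is assembled from `month_data`, which has been filtered on tip z-score, while the month and test sets come from the unfiltered `month_fdata`. Part of the gap may come from that difference rather than from recency alone.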