## Benchmark 5: Structured bootstrapping
OK, so this is internal benchmarking strategy #2. I am going to start calling it bootstrapping because it uses random sampling with replacement and I am envisioning a bagging approach for the final model. The big takeaway from the first, cross-validation-like approach (which really ended up being more like bootstrapping anyway) was that, to get a good approximation of the public leaderboard score, we should not aggregate SMAPE scores across forecast origins. Doing so seems to result in worse performance. The likely explanation is that the aggregate mixes data from anomalous and non-anomalous timepoints, when in reality a given test timepoint is either an anomaly or not. The mixing makes the naive model do somewhat badly all the time, rather than OK on non-anomalous test timepoints and very badly on anomalous ones.

So, the strategy here will be to structure our bootstrapping while still treating each county as an individual datapoint. To generate a sample of size n, the procedure will be as follows:
1. Remove and sequester some timepoints dataset-wide for test data.
2. From the remaining, pick a random timepoint.
3. From the randomly chosen timepoint randomly choose n counties.
4. Go to step 2.
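
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used later in this notebook: the function name `structured_bootstrap`, its parameters, and the toy array layout `(timepoints, counties, features)` are all hypothetical, and holding out the last few timepoints is only one possible way to sequester test data.

```python
import numpy as np

def structured_bootstrap(data, n_counties, n_samples, n_test_timepoints, seed=None):
    """Hypothetical helper. data has shape (timepoints, counties, features)."""
    rng = np.random.default_rng(seed)

    # Step 1: sequester the last timepoints dataset-wide as test data (assumed split)
    n_train = data.shape[0] - n_test_timepoints
    train, test = data[:n_train], data[n_train:]

    samples = []
    for _ in range(n_samples):
        # Step 2: pick a random timepoint from the remaining data
        timepoint = rng.integers(n_train)
        # Step 3: from that timepoint, draw n counties with replacement
        counties = rng.integers(train.shape[1], size=n_counties)
        samples.append(train[timepoint, counties])
        # Step 4: loop back to step 2

    return np.stack(samples), test

# Toy data: 10 timepoints, 50 counties, 2 features
toy = np.arange(10 * 50 * 2, dtype=float).reshape(10, 50, 2)
samples, test = structured_bootstrap(
    toy, n_counties=8, n_samples=5, n_test_timepoints=3, seed=0
)
print(samples.shape, test.shape)
```

Each sample holds counties drawn from a single timepoint, so a score computed on one sample never mixes anomalous and non-anomalous timepoints.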

This procedure could be repeated with different test-train splits, or even with test-train splits on non-overlapping subsets of the data. Does that make it bootstrapping inside a cross-validation fold? I don't know. I think the specific terminology here is less important than a precise description of what we are actually doing.


1. [Abbreviations & definitions](#abbreviations_definitions)
2. [Load & inspect](#load_inspect)
3. [Build data structure](#build_data_structure)

<a name="abbreviations_definitions"></a>
### 1. Abbreviations & definitions
+ MBD: microbusiness density
+ MBC: microbusiness count
+ OLS: ordinary least squares
+ Model order: number of past timepoints used as input data for model training
+ Origin (forecast origin): last known point in the input data
+ Horizon (forecast horizon): number of future data points predicted by the model
+ SMAPE: Symmetric mean absolute percentage error
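
To make the SMAPE definition concrete, here is a small sketch. It assumes the common 0 to 200 percent form of the metric, with terms where both the actual and forecast values are zero counted as zero error; the competition's exact scoring code is not shown in this notebook. It also assumes that a naive model means carrying the last known value (the value at the forecast origin) forward across the horizon. The function name `smape` and the toy numbers are illustrative only.

```python
import numpy as np

def smape(actual, forecast):
    """SMAPE in percent (0 to 200). Terms with actual == forecast == 0 count as zero."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denominator = np.abs(actual) + np.abs(forecast)
    # Avoid 0/0: substitute a dummy denominator, then force those terms to zero
    safe_denominator = np.where(denominator == 0, 1.0, denominator)
    terms = np.where(
        denominator == 0,
        0.0,
        2.0 * np.abs(forecast - actual) / safe_denominator,
    )
    return 100.0 * terms.mean()

# Model order 3, horizon 2: the origin is the last value of history
history = np.array([3.0, 3.1, 3.0])
future = np.array([3.0, 3.3])

# Assumed naive model: carry the value at the forecast origin forward
naive_forecast = np.full_like(future, history[-1])
print(smape(future, naive_forecast))
```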

<a name="load_inspect"></a>
### 2. Load & inspect

In [1]:
# Add parent directory to path to allow import of config.py
import sys
sys.path.append('..')
import config as conf
import functions.data_manipulation_functions as data_funcs

import numpy as np
import pandas as pd
import multiprocessing as mp
from statistics import NormalDist

print(f'Python: {sys.version}')
print()
print(f'Numpy {np.__version__}')
print(f'Pandas {pd.__version__}')

Python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0]

Numpy 1.23.5
Pandas 1.4.3


In [3]:
# Load up training file, set dtype for first day of month
training_df = pd.read_csv(f'{conf.KAGGLE_DATA_PATH}/train.csv.zip', compression='zip')
training_df['first_day_of_month'] = pd.to_datetime(training_df['first_day_of_month'])

training_df.head()

| | row_id | cfips | county | state | first_day_of_month | microbusiness_density | active |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2019-08-01 | 1001 | Autauga County | Alabama | 2019-08-01 | 3.007682 | 1249 |
| 1 | 1001_2019-09-01 | 1001 | Autauga County | Alabama | 2019-09-01 | 2.88487 | 1198 |
| 2 | 1001_2019-10-01 | 1001 | Autauga County | Alabama | 2019-10-01 | 3.055843 | 1269 |
| 3 | 1001_2019-11-01 | 1001 | Autauga County | Alabama | 2019-11-01 | 2.993233 | 1243 |
| 4 | 1001_2019-12-01 | 1001 | Autauga County | Alabama | 2019-12-01 | 2.993233 | 1243 |


<a name="build_data_structure"></a>
### 3. Build data structure
The first thing to do is build a data structure that makes our structured sampling approach easy, and persist it to disk. Then we will work on our sampling strategy. This way, when experimenting with models, it will be easy to generate samples on the fly without having to rebuild the whole thing in memory each time.

The trick here is that in the first dimension we want the timepoints, then the counties, then the data columns. But, the 'timepoints' we want in the first dimension are really the forecast origin for a subset of the timecourse.

Maybe the best way to think about it is to forget the model order, forecast horizon and forecast origin location for now and just pick a data instance size. Then, when feeding a sample to a model, we can split each block into input and forecast portions on the fly however we want. This will also solve the problem we had in notebook #07, where NumPy complains about ragged arrays if model order != forecast horizon.
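
One way this idea could look in code is sketched below. It is an illustration under assumptions, not the structure built later in this notebook: the helper names `build_blocks` and `split_block`, the input layout `(counties, timepoints, features)`, and the toy sizes are all hypothetical. Every block has the same length, so the stacked array is rectangular, and the input and forecast portions are returned as two separate arrays, so they are free to differ in length.

```python
import numpy as np

def build_blocks(data, block_size):
    """Hypothetical helper. data: (counties, timepoints, features).
    Returns (blocks, counties, block_size, features), one block per start timepoint."""
    n_timepoints = data.shape[1]
    return np.stack([
        data[:, start:start + block_size, :]
        for start in range(n_timepoints - block_size + 1)
    ])

def split_block(blocks, model_order, horizon):
    """Split along the block's time axis into model input and forecast target."""
    if model_order + horizon > blocks.shape[-2]:
        raise ValueError('model_order + horizon exceeds block size')
    inputs = blocks[..., :model_order, :]
    targets = blocks[..., model_order:model_order + horizon, :]
    return inputs, targets

# Toy data: 4 counties, 12 timepoints, 2 data columns
toy = np.random.default_rng(0).random((4, 12, 2))
blocks = build_blocks(toy, block_size=8)

# Model order and horizon are chosen at split time, not at build time
inputs, targets = split_block(blocks, model_order=5, horizon=3)
print(blocks.shape, inputs.shape, targets.shape)
```

The first axis indexes the block's starting timepoint, the second the county, and the last two the time steps within the block and the data columns, which matches the dimension order described above.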

I am not sure whether there would be a speed advantage to sharding the data across multiple files here, i.e. each timepoint gets its own file and worker processes can then be assigned to sample from and train on the timepoints independently. Let's save that for a later optimization. That keeps complexity down out of the gate and avoids spending time prematurely optimizing something that ends up not being the right idea.