## Benchmark 5: Structured bootstrapping
OK, so this is internal benchmarking strategy #2. I am going to call it bootstrapping because I am using random sampling with replacement and am envisioning a bagging approach to the final model. The big takeaway from the first cross-validation-like approach (which really ended up being more like bootstrapping anyway...) was that to get good approximations of the public leaderboard score we should not aggregate SMAPE scores across forecast origins. Doing so seems to result in worse performance, the explanation being that we pool data from some anomalous and some non-anomalous timepoints when, in reality, a given timepoint is either an anomaly or not. Pooling causes the naive model to do somewhat badly all the time, rather than OK on non-anomalous test timepoints and very badly on anomalous ones.

So, the strategy here will be to structure our bootstrapping while still treating each county as an individual datapoint. To generate a sample of size n the procedure will be as follows:
1. Remove and sequester some timepoints dataset-wide for test data.
2. From the remaining, pick a random timepoint.
3. From the randomly chosen timepoint randomly choose n counties.
4. Go to step 2 and repeat until we have as many samples as we need.
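
The four steps above can be sketched roughly as follows. This is only a minimal illustration, not the final implementation: the function name `structured_bootstrap_sample`, the toy array and the assumed layout (timepoints, counties, rows, columns) are all made up for the example, and counties are drawn with replacement to match the bootstrapping idea.

```python
import numpy as np

def structured_bootstrap_sample(data, test_timepoints, n, rng):
    """Draw one sample: a random training timepoint, then n counties from it.

    Assumes data has shape (timepoints, counties, rows, columns).
    """
    # Step 1: sequester the test timepoints
    all_timepoints = np.arange(data.shape[0])
    train_timepoints = np.setdiff1d(all_timepoints, test_timepoints)

    # Step 2: pick one random timepoint from what is left
    timepoint = rng.choice(train_timepoints)

    # Step 3: pick n counties from that timepoint, with replacement
    counties = rng.choice(data.shape[1], size=n, replace=True)

    return timepoint, data[timepoint, counties]

rng = np.random.default_rng(seed=42)

# Made-up stand-in for the real data: 10 timepoints, 50 counties,
# blocks of 5 rows by 6 columns
toy_data = rng.normal(size=(10, 50, 5, 6))
test_timepoints = np.array([8, 9])

# Step 4: repeat for as many samples as needed
samples = [
    structured_bootstrap_sample(toy_data, test_timepoints, n=20, rng=rng)
    for _ in range(3)
]

for timepoint, sample in samples:
    print(f'Timepoint {timepoint}: sample shape {sample.shape}')
```

Each sample comes from a single timepoint, so scores can be computed per timepoint rather than pooled across forecast origins.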

This procedure could be repeated with different test-train splits, or even test-train splits on non-overlapping subsets of the data. Does that make it bootstrapping inside of a cross-validation fold? I don't know. I think the specific terminology here is less important than a precise description of what we are actually doing.


1. [Abbreviations & definitions](#abbreviations_definitions)
2. [Load & inspect](#load_inspect)
3. [Build data structure](#build_data_structure)

<a name="abbreviations_definitions"></a>
### 1. Abbreviations & definitions
+ MBD: microbusiness density
+ MBC: microbusiness count
+ OLS: ordinary least squares
+ Model order: number of past timepoints used as input data for model training
+ Origin (forecast origin): last known point in the input data
+ Horizon (forecast horizon): number of future data points predicted by the model
+ SMAPE: Symmetric mean absolute percentage error
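
For reference, here is a small sketch of SMAPE. It assumes the common definition that averages 2|F - A| / (|A| + |F|) and scales it to a percentage, with a term counted as 0 when both the actual and forecast values are 0; the function name `smape` is just for this example and is not taken from the project code.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)

    numerator = 2.0 * np.abs(forecast - actual)
    denominator = np.abs(actual) + np.abs(forecast)

    # Treat the term as 0 where both values are 0 to avoid dividing 0 by 0
    terms = np.divide(
        numerator,
        denominator,
        out=np.zeros_like(numerator),
        where=denominator != 0
    )

    return 100.0 * terms.mean()

print(smape([1.0, 2.0], [1.0, 2.0]))
print(smape([1.0], [3.0]))
```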

<a name="load_inspect"></a>
### 2. Load & inspect

In [None]:
# Add parent directory to path to allow import of config.py
import sys
sys.path.append('..')
import config as conf
import functions.data_manipulation_functions as data_funcs

import numpy as np
import pandas as pd
import multiprocessing as mp
from statistics import NormalDist

print(f'Python: {sys.version}')
print()
print(f'Numpy {np.__version__}')
print(f'Pandas {pd.__version__}')

In [None]:
# Load up training file, set dtype for first day of month
paths = conf.DataFilePaths()

training_df = pd.read_csv(f'{paths.KAGGLE_DATA_PATH}/train.csv.zip', compression='zip')
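# Store dates as int64 nanoseconds since the epoch; plain int can mean
# int32 on some platforms, which cannot hold these values
training_df['first_day_of_month'] = pd.to_datetime(training_df['first_day_of_month']).astype('int64')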
training_df.drop(['row_id', 'county','state'], axis=1, inplace=True)
training_df.head()

<a name="build_data_structure"></a>
### 3. Build data structure
First thing to do is build a data structure that will make resampling as easy as possible and persist it to disk. Then we will implement resampling. This way, when experimenting with or training models, it will be easy to generate samples on the fly without having to rebuild the whole thing in memory each time.

The trick here is that in the first dimension we want the timepoints, then the counties, then the data columns. But, the 'timepoints' we want in the first dimension are really the forecast origins for a subset of the timecourse.

Maybe the best way to think about it is to forget the model order, forecast horizon and forecast origin location for now and just pick a data instance size. Then, when feeding the sample to a model we can break each block into input and forecast halves on the fly however we want. This will also solve the problem we had in notebook #07 where if model order != forecast horizon NumPy complains about ragged arrays.
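
To make the "split on the fly" idea concrete, here is a minimal sketch. It assumes the rows of a block are consecutive timepoints in the second-to-last axis; the helper name `split_block` and the toy arrays are invented for the example.

```python
import numpy as np

def split_block(block, model_order):
    """Split blocks into input and forecast parts along the row (time) axis.

    Works on a single block (rows, columns) or a batch (counties, rows, columns).
    """
    inputs = block[..., :model_order, :]
    targets = block[..., model_order:, :]
    return inputs, targets

# One made-up block: 8 timepoints by 3 data columns
toy_block = np.arange(8 * 3).reshape(8, 3)
inputs, targets = split_block(toy_block, model_order=5)
print(inputs.shape, targets.shape)

# A made-up batch of 4 counties with the same block size
toy_batch = np.zeros((4, 8, 3))
batch_inputs, batch_targets = split_block(toy_batch, model_order=5)
print(batch_inputs.shape, batch_targets.shape)
```

Because every stored block has the same number of rows, the array is never ragged. The model order and forecast horizon only need to add up to at most the block size.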

Not sure whether there would be a speed advantage to sharding the data across multiple files here, i.e. each timepoint gets its own file and worker processes are then assigned to sample from and train on the timepoints independently. Let's save that for a later optimization, to minimize complexity out of the gate and avoid prematurely optimizing something that ends up not being the right idea.

In [None]:
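# First thing, let's give each timepoint an integer number so we don't have to
# work with the strings/datetimes in 'first_day_of_month' directly. Sort first
# so the numbering is chronological within each county regardless of file order.

training_df = training_df.sort_values(by=['cfips', 'first_day_of_month'])
training_df['timepoint'] = training_df.groupby(['cfips']).cumcount()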
training_df.head()

In [None]:
# Let's also add a column with difference detrended data (see notebook #02.2)

# Makes sure the rows are in chronological order within each county
training_df = training_df.sort_values(by=['cfips', 'first_day_of_month'])

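# Calculate and add column for month to month change in MBD
training_df['microbusiness_density_change'] = training_df.groupby(['cfips'])['microbusiness_density'].diff() # type: ignore

# The first timepoint in each county has no previous value to difference
# against, so its change is NaN and the row is dropped here
training_df.dropna(inplace=True)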
training_df.head()

In [None]:
# Going to need a list of unique cfips IDs to retrieve the counties
cfips_list = training_df['cfips'].drop_duplicates(keep='first').to_list()
print(f'Num counties: {len(cfips_list)}')

# The possible block left edges are the timepoints, so count those too
num_timepoints = training_df['timepoint'].nunique()
print(f'Num timepoints: {num_timepoints}\n')

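for block_size in range(2,39):

    timepoints = []

    # Timepoints run from 1 to num_timepoints (timepoint 0 was dropped by
    # the differencing step), so the last valid left edge is
    # num_timepoints - block_size + 1
    for left_edge in range(1, (num_timepoints - block_size + 2)):

        # The row selection below includes both edges, so a block of
        # block_size rows ends at left edge + block size - 1
        right_edge = left_edge + block_size - 1
        print(f'\tBlock range: {left_edge} - {right_edge}')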

        # Holder for individual blocks
        blocks = []

        # Now we go through each county and get this block range
        for cfips in cfips_list:

            # Get data for just this county
            county_data = training_df[training_df['cfips'] == cfips]

            # Get rows for the block range
            county_data = county_data.loc[(county_data['timepoint'] >= left_edge) & (county_data['timepoint'] <= right_edge)]

            # Convert block range rows to numpy and collect
            blocks.append(county_data.to_numpy())
        
        # Convert list of blocks to numpy once, report the shape and collect
        blocks = np.array(blocks)
        print(f'\tBlock shape: {blocks.shape}\n')

        timepoints.append(blocks)

    # Convert final result to numpy
    timepoints = np.array(timepoints)

    # Write to disk
    output_file = f'{paths.DATA_PATH}/parsed_data/structured_bootstrap_blocksize{block_size}.npy'
    np.save(output_file, timepoints)

In [None]:
print(f'Timepoints shape: {timepoints.shape}') # type: ignore
print()
print('Column types:')

for column in timepoints[0,0,0,0:]: # type: ignore
    print(f'\t{type(column)}')

print()
print(f'Example block:\n{timepoints[0,0,0:,]}') # type: ignore

OK, looks good. We could use int for some of these columns, but we need float for the MBD, so let's leave it alone rather than using mixed types in a NumPy array. So, that's it, easy. Let's round trip it and try recovering some of our values back into a nice pandas dataframe as a sanity check.
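
As a quick aside on the all-float choice, the sketch below (toy values only, not the real data) shows two things: pandas upcasts mixed int and float columns to a single float64 array, and midnight-aligned timestamps stored as int64 nanoseconds survive the trip through float64 unchanged. The second point holds for dates like these because they are multiples of a day in nanoseconds. It would not be safe to assume for arbitrary nanosecond timestamps.

```python
import numpy as np
import pandas as pd

# Mixed int and float columns become a single float64 array
toy_df = pd.DataFrame({
    'cfips': [1001, 1003],
    'microbusiness_density': [3.0, 7.2]
})
toy_array = toy_df.to_numpy()
print(f'Array dtype: {toy_array.dtype}')

# First-of-month timestamps as int64 nanoseconds since the epoch
dates_ns = np.array(
    ['2019-09-01', '2022-10-01'],
    dtype='datetime64[ns]'
).astype(np.int64)

# Cast to float64 and back, then check that nothing changed
round_tripped = dates_ns.astype(np.float64).astype(np.int64)
print(f'Dates survive float64: {np.array_equal(dates_ns, round_tripped)}')
```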

In [None]:
# Check round-trip
loaded_timepoints = np.load(output_file)

# Inspect
print(f'Timepoints shape: {loaded_timepoints.shape}')
print(f'Example block:\n{loaded_timepoints[0,0,0:,]}')

OK, not surprisingly - we got the same thing back. Last thing to do before we call this done is to test if we can get our dates, row_ids and cfips back into a format that matches the original data.

In [None]:
# Grab an example date column
test_dates = loaded_timepoints[0,0,0:,1] # type: ignore
print(f'Test dates: {test_dates}\ndtype: {type(test_dates)}\nelement dtype: {type(test_dates[0])}\n')

# Cast float64 to int64
test_dates = test_dates.astype(np.int64)
print(f'Test dates: {test_dates}\ndtype: {type(test_dates)}\nelement dtype: {type(test_dates[0])}\n')

# Convert to pandas dataframe with dtype datetime64[ns] and column name 'first_day_of_month'
# (the unit has to be spelled out; newer pandas rejects a unit-less 'datetime64')
test_dates_df = pd.DataFrame(pd.to_datetime(test_dates), columns=['first_day_of_month']).astype('datetime64[ns]')
test_dates_df.info()

In [None]:
test_dates_df.head()

In [None]:
# Grab an example cfips column
test_cfips = loaded_timepoints[0,0,0:,0] # type: ignore
print(f'Test cfips: {test_cfips}\ndtype: {type(test_cfips)}\nelement dtype: {type(test_cfips[0])}\n')

# Cast float64 to int64
test_cfips = test_cfips.astype(np.int64)
print(f'Test cfips: {test_cfips}\ndtype: {type(test_cfips)}\nelement dtype: {type(test_cfips[0])}\n')

# Convert to pandas dataframe with dtype int64 and column name 'cfips'
test_cfips_df = pd.DataFrame(test_cfips, columns=['cfips']).astype('int64')
test_cfips_df.info()

In [None]:
test_cfips_df.head()

OK, happy. Making the row ID is just a string join from here, so no problem. And if for some reason we want the string county or state back, we can use a CFIPS lookup table. Time to move on.
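
For completeness, a sketch of that string join. It assumes the row ID has the form `<cfips>_<YYYY-MM-DD>`, which is an assumption here and should be checked against the original `row_id` column. The toy values stand in for columns pulled out of a saved block.

```python
import numpy as np
import pandas as pd

# Made-up values in the same float64 form the saved blocks use
cfips_column = np.array([1001.0, 1001.0])
dates_column = np.array(
    ['2019-09-01', '2019-10-01'],
    dtype='datetime64[ns]'
).astype(np.int64).astype(np.float64)

# Recover int64 cfips and datetime dates as in the cells above
recovered_df = pd.DataFrame({
    'cfips': cfips_column.astype(np.int64),
    'first_day_of_month': pd.to_datetime(dates_column.astype(np.int64))
})

# Assumed row ID format: '<cfips>_<YYYY-MM-DD>'
recovered_df['row_id'] = (
    recovered_df['cfips'].astype(str)
    + '_'
    + recovered_df['first_day_of_month'].dt.strftime('%Y-%m-%d')
)

print(recovered_df)
```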