## Benchmark 5: Structured bootstrapping
OK, so this is internal benchmarking strategy #2. Going to start calling it bootstrapping because I am using random sampling with replacement and am envisioning a bagging approach to the final model. The big takeaway from the first cross-validation-like approach (which really ended up being more like bootstrapping anyway...) was that to get good approximations of the public leaderboard score we should not aggregate SMAPE scores across forecast origins. Aggregating seems to give worse scores, the explanation being that we are mixing data from some anomalous and some nonanomalous timepoints, when in reality a given timepoint is either an anomaly or not. This makes the naive model appear to do somewhat badly all the time, rather than OK on nonanomalous test timepoints and very badly on anomalous ones.

So, the strategy here will be to structure our bootstrapping while still treating each county as an individual datapoint. To generate a sample of size n the procedure will be as follows:
1. Remove and sequester some timepoints dataset-wide for test data.
2. From the remaining timepoints, pick one at random.
3. From that timepoint, randomly choose n counties.
4. Go to step 2.
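
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the final implementation: the function name `structured_bootstrap_sample`, its parameters (including the `test_fraction` default and the fixed number of draws) and the toy array are all hypothetical, and it assumes the data is laid out as (timepoints, counties, ...).

```python
import numpy as np

def structured_bootstrap_sample(data, n_counties, n_draws, test_fraction=0.2, seed=0):
    """Hypothetical sketch of the sampling procedure.

    data: array of shape (num_timepoints, num_counties, ...).
    Returns the sequestered test timepoint indices and a list of
    (timepoint_index, sampled_counties) tuples.
    """
    rng = np.random.default_rng(seed)
    num_timepoints, num_counties = data.shape[0], data.shape[1]

    # Step 1: sequester some timepoints dataset-wide for test data
    # (the fraction held out is an assumption made for this sketch)
    num_test = max(1, int(num_timepoints * test_fraction))
    test_idx = rng.choice(num_timepoints, size=num_test, replace=False)
    train_idx = np.setdiff1d(np.arange(num_timepoints), test_idx)

    samples = []
    for _ in range(n_draws):
        # Step 2: pick a random timepoint from the remaining ones
        t = int(rng.choice(train_idx))
        # Step 3: randomly choose n counties from it, with replacement
        counties = rng.choice(num_counties, size=n_counties, replace=True)
        samples.append((t, data[t, counties]))
        # Step 4: go back to step 2

    return test_idx, samples

# Toy stand-in: 10 timepoints, 5 counties, 2 data columns
toy_data = np.arange(10 * 5 * 2, dtype=float).reshape(10, 5, 2)
test_idx, samples = structured_bootstrap_sample(toy_data, n_counties=3, n_draws=4)
```

Each draw keeps its timepoint index alongside the sampled counties, so scores can later be computed per timepoint rather than pooled.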

This procedure could be repeated with different test-train splits, or even test-train splits on non-overlapping subsets of the data. Does that make it bootstrapping inside of a cross-validation fold? I don't know - I think the specific terminology here is less important than a precise description of what we are actually doing.


1. [Abbreviations & definitions](#abbreviations_definitions)
2. [Load & inspect](#load_inspect)
3. [Build data structure](#build_data_structure)

<a name="abbreviations_definitions"></a>
### 1. Abbreviations & definitions
+ MBD: microbusiness density
+ MBC: microbusiness count
+ OLS: ordinary least squares
+ Model order: number of past timepoints used as input data for model training
+ Origin (forecast origin): last known point in the input data
+ Horizon (forecast horizon): number of future data points predicted by the model
+ SMAPE: Symmetric mean absolute percentage error
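
Since SMAPE comes up throughout, here is a minimal sketch of how it could be computed. It assumes the common 0 to 200% form of the definition, with terms where both the actual and forecast values are zero counted as zero error. The function name `smape` is just for illustration, and the exact form should be checked against the competition's scoring rules.

```python
import numpy as np

def smape(actual, forecast):
    """SMAPE in percent (0 to 200). Terms where actual and forecast
    are both zero are counted as zero error (an assumption)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)

    difference = np.abs(forecast - actual)
    denominator = (np.abs(actual) + np.abs(forecast)) / 2.0

    # Avoid 0/0: leave the error term at zero where the denominator is zero
    terms = np.divide(
        difference,
        denominator,
        out=np.zeros_like(difference),
        where=denominator != 0,
    )
    return 100.0 * terms.mean()

perfect_score = smape([1.0, 2.0], [1.0, 2.0])
off_score = smape([1.0], [3.0])
```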

<a name="load_inspect"></a>
### 2. Load & inspect

In [1]:
# Add parent directory to path to allow import of config.py
import sys
sys.path.append('..')
import config as conf
#import functions.data_manipulation_functions as data_funcs

import shelve
import numpy as np
import pandas as pd
#import multiprocessing as mp
#from statistics import NormalDist

print(f'Python: {sys.version}')
print()
print(f'Numpy {np.__version__}')
print(f'Pandas {pd.__version__}')

Python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0]

Numpy 1.23.5
Pandas 1.4.3


In [2]:
# Load up training file, set dtype for first day of month
paths = conf.DataFilePaths()

training_df = pd.read_csv(f'{paths.KAGGLE_DATA_PATH}/train.csv.zip', compression='zip')

# Convert timepoint to an integer timestamp (nanoseconds since epoch) because we are going to use numpy for the final data structure
training_df['first_day_of_month'] = pd.to_datetime(training_df['first_day_of_month']).astype('int64')

# Prune redundant columns
training_df.drop(['row_id', 'county','state'], axis=1, inplace=True)

# Let's also add a column with difference detrended data (see notebook #02.2)

# Make sure the rows are in chronological order within each county
training_df = training_df.sort_values(by=['cfips', 'first_day_of_month'])

# Calculate and add column for month to month change in MBD (first order difference)
training_df['microbusiness_density_change'] = training_df.groupby(['cfips'])['microbusiness_density'].diff() # type: ignore

# Calculate and add column for month to month change in MBD change (second order difference)
training_df['microbusiness_density_change_change'] = training_df.groupby(['cfips'])['microbusiness_density_change'].diff() # type: ignore

# Now the first two rows of each county's time series contain NaN, because there is no preceding timepoint to subtract.
# So just drop those rows, resulting in a total time series length of 37
training_df.dropna(inplace=True)

# Inspect
training_df.head()

Unnamed: 0,cfips,first_day_of_month,microbusiness_density,active,microbusiness_density_change,microbusiness_density_change_change
2,1001,1569888000000000000,3.055843,1269,0.170973,0.293785
3,1001,1572566400000000000,2.993233,1243,-0.06261,-0.233583
4,1001,1575158400000000000,2.993233,1243,0.0,0.06261
5,1001,1577836800000000000,2.96909,1242,-0.024143,-0.024143
6,1001,1580515200000000000,2.909326,1217,-0.059764,-0.035621


<a name="build_data_structure"></a>
### 3. Build data structure
First thing to do is build a data structure that will make resampling as easy as possible and persist it to disk. Then we will implement resampling. This way, when experimenting with or training models, it will be easy to generate samples on the fly without having to rebuild the whole thing in memory each time.

The trick here is that in the first dimension we want the timepoints, then the counties, then the data columns. But, the 'timepoints' we want in the first dimension are really the forecast origins for a subset of the timecourse.

Maybe the best way to think about it is to forget the model order, forecast horizon and forecast origin location for now and just pick a data instance size. Then, when feeding the sample to a model we can break each block into input and forecast halves on the fly however we want. This will also solve the problem we had in notebook #07 where if model order != forecast horizon NumPy complains about ragged arrays.
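
As a rough sketch of what breaking a block into input and forecast parts on the fly could look like: the function name `split_block` and its arguments are hypothetical, and the toy array only mimics the (timepoints, counties, block rows, columns) layout. Because the input and forecast parts come back as two separate arrays, they can have different lengths without NumPy ever having to hold a ragged array.

```python
import numpy as np

def split_block(blocks, model_order, forecast_horizon):
    """Split blocks of shape (..., block_size, num_columns) into an
    input part and a forecast part along the block axis."""
    block_size = blocks.shape[-2]

    if model_order + forecast_horizon > block_size:
        raise ValueError('model_order + forecast_horizon exceeds block size')

    # First model_order rows are the model input
    inputs = blocks[..., :model_order, :]
    # The next forecast_horizon rows are the values to predict
    targets = blocks[..., model_order:model_order + forecast_horizon, :]

    return inputs, targets

# Toy stand-in: 4 timepoints, 10 counties, block size 8, 6 columns
toy_blocks = np.zeros((4, 10, 8, 6))
inputs, targets = split_block(toy_blocks, model_order=5, forecast_horizon=3)
```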

Not sure if there would be a speed advantage to sharding the data across multiple files here, i.e. each timepoint gets its own file and worker processes can then be assigned to sample from and train on the timepoints independently. It also might not be a bad idea to use h5py. Let's save that for a later optimization, to minimize complexity out of the gate and avoid spending time prematurely optimizing something that ends up not being the right idea.
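
For reference, the target layout described above (block left edges first, then counties, then rows within the block, then data columns) can be illustrated on a toy array. This sketch uses `sliding_window_view` from NumPy 1.20+ rather than explicit loops. It is an alternative shown only for illustration, and the toy dimensions and variable names are made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy stand-in: 3 counties, 5 timepoints, 2 data columns
num_counties, num_timepoints, num_columns = 3, 5, 2
toy = np.arange(num_counties * num_timepoints * num_columns, dtype=float)
toy = toy.reshape(num_counties, num_timepoints, num_columns)

block_size = 3

# Slide a window along the time axis. The window axis is appended last,
# giving shape (counties, left_edges, columns, block_size)
windows = sliding_window_view(toy, block_size, axis=1)

# Reorder axes to (left_edges, counties, block_size, columns)
toy_timepoints = windows.transpose(1, 0, 3, 2)
```

Indexing `toy_timepoints[left_edge, county]` then returns that county's rows from `left_edge` to `left_edge + block_size`.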

In [4]:
# Going to need a list of unique cfips IDs to retrieve the counties
cfips_list = training_df['cfips'].drop_duplicates(keep='first').to_list()
print(f'Num counties: {len(cfips_list)}')

# Get number of timepoints per county
num_timepoints = training_df['first_day_of_month'].nunique()
print(f'Num timepoints: {num_timepoints}')

# Start loop with two points - one for input and one for forecast.
# This is the smallest block that can be split into input and forecast;
# anything beyond carrying the last datapoint forward needs three or more
for block_size in range(2, num_timepoints + 1):

    timepoints = []

    for left_edge in range(0, (num_timepoints - block_size + 1)):

        # Holder for individual blocks
        blocks = []

        # Now we go through each county and get this block range
        for cfips in cfips_list:

            # Get data for just this county
            county_data = training_df[training_df['cfips'] == cfips]

            # Get rows for the block range
            county_data = county_data.iloc[left_edge:(left_edge + block_size)]

            # Convert block range rows to numpy and collect
            blocks.append(county_data.to_numpy())
        
        # Convert list of blocks to numpy and collect
        timepoints.append(np.array(blocks))

    # Convert final result to numpy
    timepoints = np.array(timepoints)
    print(f'Timepoints shape: {timepoints.shape}')

    # Write to disk
    output_file = f'{paths.DATA_PATH}/parsed_data/structured_bootstrap_blocksize{block_size}.npy'
    np.save(output_file, timepoints)

Num counties: 3135
Num timepoints: 37
Timepoints shape: (36, 3135, 2, 6)
Timepoints shape: (35, 3135, 3, 6)
Timepoints shape: (34, 3135, 4, 6)
Timepoints shape: (33, 3135, 5, 6)
Timepoints shape: (32, 3135, 6, 6)
Timepoints shape: (31, 3135, 7, 6)
Timepoints shape: (30, 3135, 8, 6)
Timepoints shape: (29, 3135, 9, 6)
Timepoints shape: (28, 3135, 10, 6)
Timepoints shape: (27, 3135, 11, 6)
Timepoints shape: (26, 3135, 12, 6)
Timepoints shape: (25, 3135, 13, 6)
Timepoints shape: (24, 3135, 14, 6)
Timepoints shape: (23, 3135, 15, 6)
Timepoints shape: (22, 3135, 16, 6)
Timepoints shape: (21, 3135, 17, 6)
Timepoints shape: (20, 3135, 18, 6)
Timepoints shape: (19, 3135, 19, 6)
Timepoints shape: (18, 3135, 20, 6)
Timepoints shape: (17, 3135, 21, 6)
Timepoints shape: (16, 3135, 22, 6)
Timepoints shape: (15, 3135, 23, 6)
Timepoints shape: (14, 3135, 24, 6)
Timepoints shape: (13, 3135, 25, 6)
Timepoints shape: (12, 3135, 26, 6)
Timepoints shape: (11, 3135, 27, 6)
Timepoints shape: (10, 3135, 28, 6

In [5]:
# Build index for column names

with shelve.open(paths.PARSED_DATA_COLUMN_INDEX, 'c') as col_index:
    # Use the source dataframe's columns rather than a variable left over from the loop above
    for i, col_name in enumerate(training_df.columns):
        col_index[col_name] = i

In [6]:
# Print out some diagnostic info about the last set of timepoints
print(f'Timepoints shape: {timepoints.shape}') # type: ignore
print()
print('Column types:')

for column in timepoints[0,0,0,0:]: # type: ignore
    print(f'\t{type(column)}')

print()
print(f'Example block:\n{timepoints[0,0,0:,]}') # type: ignore

Timepoints shape: (1, 3135, 37, 6)

Column types:
	<class 'numpy.float64'>
	<class 'numpy.float64'>
	<class 'numpy.float64'>
	<class 'numpy.float64'>
	<class 'numpy.float64'>
	<class 'numpy.float64'>

Example block:
[[ 1.0010000e+03  1.5698880e+18  3.0558431e+00  1.2690000e+03
   1.7097300e-01  2.9378470e-01]
 [ 1.0010000e+03  1.5725664e+18  2.9932332e+00  1.2430000e+03
  -6.2609900e-02 -2.3358290e-01]
 [ 1.0010000e+03  1.5751584e+18  2.9932332e+00  1.2430000e+03
   0.0000000e+00  6.2609900e-02]
 [ 1.0010000e+03  1.5778368e+18  2.9690900e+00  1.2420000e+03
  -2.4143200e-02 -2.4143200e-02]
 [ 1.0010000e+03  1.5805152e+18  2.9093256e+00  1.2170000e+03
  -5.9764400e-02 -3.5621200e-02]
 [ 1.0010000e+03  1.5830208e+18  2.9332314e+00  1.2270000e+03
   2.3905800e-02  8.3670200e-02]
 [ 1.0010000e+03  1.5856992e+18  3.0001674e+00  1.2550000e+03
   6.6936000e-02  4.3030200e-02]
 [ 1.0010000e+03  1.5882912e+18  3.0049484e+00  1.2570000e+03
   4.7810000e-03 -6.2155000e-02]
 [ 1.0010000e+03  1.5909

OK, looks good. We could use int for some of these columns, but we need float for the MBD, so let's leave it alone rather than using mixed types in a NumPy array. So, that's it - easy. Let's round trip it and try recovering some of our values back into a nice pandas dataframe as a sanity check.

In [7]:
# Check round-trip
loaded_timepoints = np.load(output_file)

# Inspect
print(f'Timepoints shape: {loaded_timepoints.shape}')
print(f'Example block:\n{loaded_timepoints[0,0,0:,]}')

Timepoints shape: (1, 3135, 37, 6)
Example block:
[[ 1.0010000e+03  1.5698880e+18  3.0558431e+00  1.2690000e+03
   1.7097300e-01  2.9378470e-01]
 [ 1.0010000e+03  1.5725664e+18  2.9932332e+00  1.2430000e+03
  -6.2609900e-02 -2.3358290e-01]
 [ 1.0010000e+03  1.5751584e+18  2.9932332e+00  1.2430000e+03
   0.0000000e+00  6.2609900e-02]
 [ 1.0010000e+03  1.5778368e+18  2.9690900e+00  1.2420000e+03
  -2.4143200e-02 -2.4143200e-02]
 [ 1.0010000e+03  1.5805152e+18  2.9093256e+00  1.2170000e+03
  -5.9764400e-02 -3.5621200e-02]
 [ 1.0010000e+03  1.5830208e+18  2.9332314e+00  1.2270000e+03
   2.3905800e-02  8.3670200e-02]
 [ 1.0010000e+03  1.5856992e+18  3.0001674e+00  1.2550000e+03
   6.6936000e-02  4.3030200e-02]
 [ 1.0010000e+03  1.5882912e+18  3.0049484e+00  1.2570000e+03
   4.7810000e-03 -6.2155000e-02]
 [ 1.0010000e+03  1.5909696e+18  3.0192919e+00  1.2630000e+03
   1.4343500e-02  9.5625000e-03]
 [ 1.0010000e+03  1.5935616e+18  3.0838373e+00  1.2900000e+03
   6.4545400e-02  5.0201900e-02]


OK, not surprisingly - we got the same thing back. Last thing to do before we call this done is to test if we can get our dates, row_ids and cfips back into a format that matches the original data.

In [8]:
# Grab an example date column
test_dates = loaded_timepoints[0,0,0:,1] # type: ignore
print(f'Test dates: {test_dates}\ndtype: {type(test_dates)}\nelement dtype: {type(test_dates[0])}\n')

# Cast float64 to int64
test_dates = test_dates.astype(np.int64)
print(f'Test dates: {test_dates}\ndtype: {type(test_dates)}\nelement dtype: {type(test_dates[0])}\n')

# Convert to pandas dataframe with dtype datetime64[ns] and column name 'first_day_of_month'
test_dates_df = pd.DataFrame(pd.to_datetime(test_dates), columns=['first_day_of_month']).astype('datetime64[ns]')
test_dates_df.info()

Test dates: [1.5698880e+18 1.5725664e+18 1.5751584e+18 1.5778368e+18 1.5805152e+18
 1.5830208e+18 1.5856992e+18 1.5882912e+18 1.5909696e+18 1.5935616e+18
 1.5962400e+18 1.5989184e+18 1.6015104e+18 1.6041888e+18 1.6067808e+18
 1.6094592e+18 1.6121376e+18 1.6145568e+18 1.6172352e+18 1.6198272e+18
 1.6225056e+18 1.6250976e+18 1.6277760e+18 1.6304544e+18 1.6330464e+18
 1.6357248e+18 1.6383168e+18 1.6409952e+18 1.6436736e+18 1.6460928e+18
 1.6487712e+18 1.6513632e+18 1.6540416e+18 1.6566336e+18 1.6593120e+18
 1.6619904e+18 1.6645824e+18]
dtype: <class 'numpy.ndarray'>
element dtype: <class 'numpy.float64'>

Test dates: [1569888000000000000 1572566400000000000 1575158400000000000
 1577836800000000000 1580515200000000000 1583020800000000000
 1585699200000000000 1588291200000000000 1590969600000000000
 1593561600000000000 1596240000000000000 1598918400000000000
 1601510400000000000 1604188800000000000 1606780800000000000
 1609459200000000000 1612137600000000000 1614556800000000000
 16172352000

In [9]:
test_dates_df.head()

Unnamed: 0,first_day_of_month
0,2019-10-01
1,2019-11-01
2,2019-12-01
3,2020-01-01
4,2020-02-01


In [10]:
# Grab an example cfips column
test_cfips = loaded_timepoints[0,0,0:,0] # type: ignore
print(f'Test cfips: {test_cfips}\ndtype: {type(test_cfips)}\nelement dtype: {type(test_cfips[0])}\n')

# Cast float64 to int64
test_cfips = test_cfips.astype(np.int64)
print(f'Test cfips: {test_cfips}\ndtype: {type(test_cfips)}\nelement dtype: {type(test_cfips[0])}\n')

# Convert to pandas dataframe with dtype int64 and column name 'cfips'
test_cfips_df = pd.DataFrame(test_cfips, columns=['cfips']).astype('int64')
test_cfips_df.info()

Test cfips: [1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001.
 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001.
 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001. 1001.
 1001.]
dtype: <class 'numpy.ndarray'>
element dtype: <class 'numpy.float64'>

Test cfips: [1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001
 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001
 1001 1001 1001 1001 1001 1001 1001 1001 1001]
dtype: <class 'numpy.ndarray'>
element dtype: <class 'numpy.int64'>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   cfips   37 non-null     int64
dtypes: int64(1)
memory usage: 424.0 bytes


In [11]:
test_cfips_df.head()

Unnamed: 0,cfips
0,1001
1,1001
2,1001
3,1001
4,1001


OK, happy - making the row ID is just a string join from here, so no problem. And if for some reason we want the string county or state back, we can use a CFIPS lookup table. Time to move on.
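
For completeness, a minimal sketch of that string join. It assumes the row ID has the form `<cfips>_<YYYY-MM-DD>`, which should be checked against the original `row_id` column since that column was dropped above. The toy arrays stand in for the cfips and date columns recovered from the float64 array, and the name `recovered_df` is made up for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for columns pulled out of the float64 array
cfips = np.array([1001.0, 1001.0, 1003.0])
dates = np.array([1.5698880e+18, 1.5725664e+18, 1.5698880e+18])

# Cast back to int64, then to datetime for the date column
recovered_df = pd.DataFrame({
    'cfips': cfips.astype(np.int64),
    'first_day_of_month': pd.to_datetime(dates.astype(np.int64)),
})

# Assumed row ID format: '<cfips>_<YYYY-MM-DD>'
recovered_df['row_id'] = (
    recovered_df['cfips'].astype(str)
    + '_'
    + recovered_df['first_day_of_month'].dt.strftime('%Y-%m-%d')
)
```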