## Cross-validation data preparation
Lots of choices to make here. One thing that stands out is the small number of timepoints, which I think is a major issue. Another way to think about the problem is to treat each county as an individual data point with 39 measurements, then break each county's time series into blocks and treat each block as a unique data instance. The final model will be trained to make predictions for any county, at any point in time, one at a time. This might be a tall order given the variation between counties, but it gives us a lot more datapoints to work with: 39 timepoints x 3,135 counties = 122,265 datapoints. Hopefully, exogenous data sources (e.g. demographic, economic or geographic data) will help somewhat with the between-county variability.

The plan is to use a rolling origin strategy with a forecast horizon of four months and a fixed model order. We should set things up so that the model order is an optimizable parameter, but let's pick four to start. So, we will predict four future values based on four past values, sliding across the time series for each county. This does raise a concern about data leakage: consecutive blocks overlap, so values that are forecast targets in one block become inputs to a later block. I think that's necessary to maximize the use of the small dataset. A few things we can do to mitigate:
1. Keep a real out-of-sample test set.
2. Do not perform any feature engineering, smoothing or normalization on the whole dataset before the split.
3. Randomize, randomize, randomize.
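
As a rough illustration of the first two points, here is a minimal sketch on made-up data. The toy array, the eight-month holdout and the choice of standard scaling are all assumptions for the example, not decisions made in this notebook. The idea is to carve off the test set first, then fit any scaling on the training portion only.

```python
import numpy as np

# Hypothetical toy data: 3 counties x 39 monthly values (not the real dataset)
rng = np.random.default_rng(seed=0)
n_counties, n_timepoints = 3, 39
density = rng.uniform(1.0, 5.0, size=(n_counties, n_timepoints))

# Assumption: hold out the final 8 months of every county as the
# out-of-sample test set, before any windowing or scaling happens
test_timepoints = 8
train_series = density[:, :-test_timepoints]
test_series = density[:, -test_timepoints:]

# Fit the scaling statistics on the training portion only, then reuse
# them on the test portion so no test information leaks into training
train_mean = train_series.mean()
train_std = train_series.std()
train_scaled = (train_series - train_mean) / train_std
test_scaled = (test_series - train_mean) / train_std

print(train_scaled.shape, test_scaled.shape)
```

The important part is the order of operations: split first, compute statistics second.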

Ok, enough ramble - let's get started.

In [8]:
# Add parent directory to path to allow import of config.py
import sys
sys.path.append('..')
import config as conf

import numpy as np
import pandas as pd

print(f'Python: {sys.version}')
print()
print(f'NumPy: {np.__version__}')
print(f'Pandas: {pd.__version__}')

Python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0]

NumPy: 1.23.5
Pandas: 1.4.3


In [9]:
# Load up training file, set dtype for first day of month

training_df = pd.read_csv(f'{conf.DATA_PATH}/train.csv.zip', compression='zip')
training_df['first_day_of_month'] = pd.to_datetime(training_df['first_day_of_month'])

training_df.head()

| | row_id | cfips | county | state | first_day_of_month | microbusiness_density | active |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2019-08-01 | 1001 | Autauga County | Alabama | 2019-08-01 | 3.007682 | 1249 |
| 1 | 1001_2019-09-01 | 1001 | Autauga County | Alabama | 2019-09-01 | 2.88487 | 1198 |
| 2 | 1001_2019-10-01 | 1001 | Autauga County | Alabama | 2019-10-01 | 3.055843 | 1269 |
| 3 | 1001_2019-11-01 | 1001 | Autauga County | Alabama | 2019-11-01 | 2.993233 | 1243 |
| 4 | 1001_2019-12-01 | 1001 | Autauga County | Alabama | 2019-12-01 | 2.993233 | 1243 |


In [10]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122265 entries, 0 to 122264
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   row_id                 122265 non-null  object        
 1   cfips                  122265 non-null  int64         
 2   county                 122265 non-null  object        
 3   state                  122265 non-null  object        
 4   first_day_of_month     122265 non-null  datetime64[ns]
 5   microbusiness_density  122265 non-null  float64       
 6   active                 122265 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 6.5+ MB


Luckily, this code won't have to run many times - basically once for each model order we want to test, so it's OK if it's a little inefficient. The first step will be to scan across each county's time course and generate input and prediction block pairs, with a static size for both to start.
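
To make the block pairing concrete, here is a small sketch on a toy series. The `series` array is a stand-in, not real county data, and using NumPy's `sliding_window_view` (available since NumPy 1.20) is just one possible way to do it, not necessarily what the rest of this notebook does. Each window of `model_order + forecast_horizon` consecutive points is split into an input part and a forecast part.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

model_order = 4
forecast_horizon = 4

# Toy series standing in for one county's 39 monthly values
series = np.arange(39)

# Every run of model_order + forecast_horizon consecutive points,
# then split each run into its input part and its forecast part
windows = sliding_window_view(series, model_order + forecast_horizon)
input_blocks = windows[:, :model_order]
forecast_blocks = windows[:, model_order:]

print(input_blocks.shape, forecast_blocks.shape)  # (32, 4) (32, 4)
print(input_blocks[0], forecast_blocks[0])        # [0 1 2 3] [4 5 6 7]
print(input_blocks[-1], forecast_blocks[-1])      # [31 32 33 34] [35 36 37 38]
```

With 39 timepoints, a model order of four and a horizon of four, each county yields 39 - 4 - 4 + 1 = 32 input-forecast pairs.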

In [11]:
# Get list of unique cfips
cfips_list = training_df['cfips'].drop_duplicates(keep='first').to_list()
print(f'Num counties: {len(cfips_list)}')
print(cfips_list[:10])

Num counties: 3135
[1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 1019]


Can't believe I'm only noticing this now - but I wonder if there is a reason why it seems like only odd cfips are included in the dataset? Let's ignore it for now, but definitely mental note material.

In [12]:
# Get rid of unnecessary string columns
trimmed_training_df = training_df.drop(['row_id', 'county', 'state'], axis=1)

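# Convert first day of month to int (nanoseconds since the Unix epoch);
# an explicit int64 avoids platform-dependent integer widths
trimmed_training_df['first_day_of_month'] = trimmed_training_df['first_day_of_month'].astype('int64')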

model_order = 4
forecast_horizon = 4
input_blocks = []
forecast_blocks = []

for cfips in cfips_list:
    # Pull out this county's rows as a (timepoints, columns) array; rows are
    # assumed to already be in chronological order within each county
    timeseries = trimmed_training_df[trimmed_training_df['cfips'] == cfips].to_numpy()

    # The origin is the index of the last input timepoint; slide it forward
    # one step at a time while a full forecast block still fits after it
    for origin in range(model_order - 1, len(timeseries) - forecast_horizon):
        input_block = timeseries[(origin - model_order + 1):(origin + 1)]
        forecast_block = timeseries[(origin + 1):(origin + forecast_horizon + 1)]

        input_blocks.append(input_block)
        forecast_blocks.append(forecast_block)

In [13]:
input_blocks = np.array(input_blocks)
forecast_blocks = np.array(forecast_blocks)
print(input_blocks.shape)
print(forecast_blocks.shape)

(100320, 4, 4)
(100320, 4, 4)


In [14]:
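# Stack inputs and forecasts along a new first axis. This only gives a regular
# 4-D array while model_order == forecast_horizon, since both block arrays
# must have the same shape
training_data = np.array([input_blocks, forecast_blocks])
training_data.shape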

(2, 100320, 4, 4)
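
For the "randomize" item on the mitigation list, a minimal sketch of shuffling an array laid out like this one. The `toy_data` array and the 80/20 split are assumptions for illustration only. The key detail is to apply a single permutation along the sample axis, so every input block stays paired with its own forecast block.

```python
import numpy as np

# Hypothetical stand-in with the same layout as training_data:
# axis 0 = (input, forecast), axis 1 = sample, axis 2 = timepoint, axis 3 = column
rng = np.random.default_rng(seed=42)
toy_data = rng.normal(size=(2, 10, 4, 4))

# One permutation of the sample axis, applied to inputs and forecasts
# together so the pairing between them is preserved
shuffled_index = rng.permutation(toy_data.shape[1])
shuffled = toy_data[:, shuffled_index]

# Assumption: an 80/20 split between training and validation samples
split = int(0.8 * shuffled.shape[1])
train, validation = shuffled[:, :split], shuffled[:, split:]

print(train.shape, validation.shape)
```

Because neighboring blocks from the same county overlap, a random split like this can still put near-duplicate windows on both sides. It complements the out-of-sample test set but does not replace it.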