## Cross-validation data preparation
There are a lot of choices to make here. One thing that stands out is the small number of timepoints, which I think is a major issue. Another way to think about the problem is to treat each county as an individual data point with 39 measurements, then break each county's time series into blocks and treat each block as a unique data instance. The final model will be trained to make predictions for any county, at any point in time, one at a time. This might be a tall order given the variation between counties, but it gives us a lot more datapoints to work with: 39 timepoints x 3,135 counties = 122,265 datapoints. Hopefully, exogenous data sources (e.g. demographic, economic or geographic data) will help somewhat with the between-county variability.

The plan is to use a rolling-origin strategy with a forecast horizon of four months and a fixed model order. We should set things up so that the model order is an optimizable parameter, but let's pick four to start. So, we will predict four future values from four past values, sliding across the time series for each county. This does raise a concern about data leakage, because consecutive blocks overlap: values in one block's forecast window show up in the input window of later blocks. I think it's necessary to maximize the use of the small dataset, though. A few things we can do to mitigate:
1. Keep a real out-of-sample test set.
2. Do not perform any feature engineering, smoothing or normalization on the whole dataset before the split.
3. Randomize, randomize, randomize.
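
To make the rolling-origin idea concrete, here is a minimal sketch on a toy series standing in for one county. It uses NumPy's `sliding_window_view` rather than the explicit loop used later in this notebook, and the toy values are made up for illustration. It also shows where the leakage concern comes from: neighbouring blocks share all but one of their values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy series standing in for one county's 39 monthly values
n_timepoints = 39
series = np.arange(n_timepoints, dtype=float)

model_order = 4       # past values fed to the model
forecast_horizon = 4  # future values to predict
block_width = model_order + forecast_horizon

# Every contiguous window of block_width values, advancing one month at a time
windows = sliding_window_view(series, block_width)

# Split each window at model_order into input and forecast parts
inputs = windows[:, :model_order]
targets = windows[:, model_order:]

print(windows.shape)          # (32, 8)
print(inputs[0], targets[0])  # [0. 1. 2. 3.] [4. 5. 6. 7.]
print(inputs[1], targets[1])  # [1. 2. 3. 4.] [5. 6. 7. 8.]

# Blocks per county times number of counties
print(windows.shape[0] * 3135)  # 100320
```

With 39 timepoints and a block width of 8, each county yields 39 - 8 + 1 = 32 blocks, so 3,135 counties give 100,320 blocks in total.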

Ok, enough ramble - let's get started.

In [10]:
# Add parent directory to path to allow import of config.py
import sys
sys.path.append('..')
import config as conf

import numpy as np
import pandas as pd

print(f'Python: {sys.version}')
print()
print(f'NumPy: {np.__version__}')
print(f'Pandas: {pd.__version__}')

Python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0]

NumPy: 1.23.5
Pandas: 1.4.3


In [11]:
# Load up training file, set dtype for first day of month

training_df = pd.read_csv(f'{conf.DATA_PATH}/train.csv.zip', compression='zip')
training_df['first_day_of_month'] = pd.to_datetime(training_df['first_day_of_month'])

training_df.head()

            row_id  cfips          county    state  first_day_of_month  microbusiness_density  active
0  1001_2019-08-01   1001  Autauga County  Alabama          2019-08-01               3.007682    1249
1  1001_2019-09-01   1001  Autauga County  Alabama          2019-09-01               2.884870    1198
2  1001_2019-10-01   1001  Autauga County  Alabama          2019-10-01               3.055843    1269
3  1001_2019-11-01   1001  Autauga County  Alabama          2019-11-01               2.993233    1243
4  1001_2019-12-01   1001  Autauga County  Alabama          2019-12-01               2.993233    1243


In [12]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122265 entries, 0 to 122264
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   row_id                 122265 non-null  object        
 1   cfips                  122265 non-null  int64         
 2   county                 122265 non-null  object        
 3   state                  122265 non-null  object        
 4   first_day_of_month     122265 non-null  datetime64[ns]
 5   microbusiness_density  122265 non-null  float64       
 6   active                 122265 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 6.5+ MB


Luckily, this code won't have to run many times - basically once for each model order we want to test - so it's OK if it's a little inefficient. It's more important that we nail this down in a way that is flexible and transparent. The first step is to scan across each county's time course and generate input and forecast block pairs.

Let's start by getting a list of unique county cfips id numbers to loop on.

In [13]:
# Get list of unique cfips
cfips_list = training_df['cfips'].drop_duplicates(keep='first').to_list()
print(f'Num counties: {len(cfips_list)}')
print(cfips_list[:10])

Num counties: 3135
[1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 1019]


Can't believe I'm only noticing this now, but I wonder if there is a reason why only odd cfips seem to be included in the dataset. Let's ignore it for now, but it's definitely mental note material.

In [14]:
# Get rid of unnecessary string columns: row_id is redundant to cfips plus
# first_day_of_month; county and state are both redundant to cfips
trimmed_training_df = training_df.drop(['row_id', 'county', 'state'], axis=1)

# Convert first day of month to int64 (nanoseconds since the epoch) so we
# don't have pandas data types in our result - want to use numpy.
# An explicit 'int64' avoids the platform-dependent width of plain int
trimmed_training_df['first_day_of_month'] = trimmed_training_df['first_day_of_month'].astype('int64')

# Data metaparameters
model_order = 4
forecast_horizon = 4
block_width = model_order + forecast_horizon

# Temporary holders for blocks
blocks = []

# Loop on county
for cfips in cfips_list:

    # Get data for current county
    timeseries = trimmed_training_df[trimmed_training_df['cfips'] == cfips]

    # Convert to numpy array
    timeseries = timeseries.to_numpy()

    # Loop on the timeseries index, stopping one block width
    # before the end
    for i in range(len(timeseries) - (block_width - 1)):

        # Get block, starting from current index
        block = timeseries[i:(i + block_width)]

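        # Split block into input (first model_order rows) and forecast
        # (remaining forecast_horizon rows); splitting in half would only
        # be correct when model_order == forecast_horizon
        block = np.split(block, [model_order])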

        # Add to list
        blocks.append(block)

# Convert the whole thing to numpy
blocks = np.array(blocks)

# Inspect
print(f'Shape: {blocks.shape}')
print()
print(blocks[0])

Shape: (100320, 2, 4, 4)

[[[1.0010000e+03 1.5646176e+18 3.0076818e+00 1.2490000e+03]
  [1.0010000e+03 1.5672960e+18 2.8848701e+00 1.1980000e+03]
  [1.0010000e+03 1.5698880e+18 3.0558431e+00 1.2690000e+03]
  [1.0010000e+03 1.5725664e+18 2.9932332e+00 1.2430000e+03]]

 [[1.0010000e+03 1.5751584e+18 2.9932332e+00 1.2430000e+03]
  [1.0010000e+03 1.5778368e+18 2.9690900e+00 1.2420000e+03]
  [1.0010000e+03 1.5805152e+18 2.9093256e+00 1.2170000e+03]
  [1.0010000e+03 1.5830208e+18 2.9332314e+00 1.2270000e+03]]]


The next step would be to partition the result into working and test sets, then partition the working set into training and validation sets, and permute the training and validation sets during model training. Let's leave that for the training run.
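
For reference, here is a rough sketch of one way the partitioning described above might look. Holding out whole counties is an assumption on my part, chosen so that overlapping blocks from the same county cannot straddle the split; holding out the most recent blocks in time would be another option. The toy array, the seed and the 20% test fraction are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily

# Toy stand-in for the blocks array: (n_blocks, 2, timepoints, features)
n_counties, blocks_per_county = 10, 32
toy_blocks = np.zeros((n_counties * blocks_per_county, 2, 4, 4))

# Feature column 0 holds the county id, as in the real array
county_of_block = np.repeat(np.arange(n_counties), blocks_per_county)
toy_blocks[:, :, :, 0] = county_of_block[:, None, None]

# Shuffle the unique county ids and hold out a fraction for testing
county_ids = np.unique(toy_blocks[:, 0, 0, 0])
rng.shuffle(county_ids)
n_test = int(0.2 * len(county_ids))  # 20% is an arbitrary choice
test_ids = county_ids[:n_test]

# Select blocks by county membership
test_mask = np.isin(toy_blocks[:, 0, 0, 0], test_ids)
test_set = toy_blocks[test_mask]
working_set = toy_blocks[~test_mask]

# Shuffle the working set before carving out validation data
working_set = working_set[rng.permutation(len(working_set))]

print(test_set.shape, working_set.shape)
```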

Last thing to do before we write this to disk is a sanity test. Let's see if we can get our dates, row_ids and cfips back into a format that matches the original data.

In [15]:
# Grab an example date column
test_dates = blocks[0,0,0:,1] # type: ignore
print(f'Test dates: {test_dates}\ndtype: {type(test_dates)}\nelement dtype: {type(test_dates[0])}\n')

# Cast float64 to int64
test_dates = test_dates.astype(np.int64)
print(f'Test dates: {test_dates}\ndtype: {type(test_dates)}\nelement dtype: {type(test_dates[0])}\n')

# Convert to pandas dataframe with dtype datetime64[ns] and column name 'first_day_of_month'
# (newer pandas versions reject the unit-less 'datetime64', so spell out the unit)
test_dates_df = pd.DataFrame(pd.to_datetime(test_dates), columns=['first_day_of_month']).astype('datetime64[ns]')
test_dates_df.info()

Test dates: [1.5646176e+18 1.5672960e+18 1.5698880e+18 1.5725664e+18]
dtype: <class 'numpy.ndarray'>
element dtype: <class 'numpy.float64'>

Test dates: [1564617600000000000 1567296000000000000 1569888000000000000
 1572566400000000000]
dtype: <class 'numpy.ndarray'>
element dtype: <class 'numpy.int64'>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   first_day_of_month  4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 160.0 bytes


In [16]:
test_dates_df.head()

  first_day_of_month
0         2019-08-01
1         2019-09-01
2         2019-10-01
3         2019-11-01


Ok, looks good. A little bit convoluted to get it back but at least we know we can do it if we need to.
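
One thing worth double-checking is whether the trip through float64 can corrupt the timestamps, since float64 cannot represent every int64 exactly. A quick sketch, assuming the 39 monthly timestamps are contiguous from 2019-08-01: midnight timestamps in nanoseconds are multiples of 86,400 x 10^9, which is divisible by 2^16, while float64 values of this magnitude are spaced 2^8 apart, so these particular values should survive the cast unchanged.

```python
import numpy as np
import pandas as pd

# 39 first-of-month timestamps, assumed contiguous from 2019-08-01
dates = pd.date_range('2019-08-01', periods=39, freq='MS')

# Nanoseconds since the epoch as int64, then the float64 cast that
# to_numpy() applies when a frame mixes int and float columns
as_int = dates.to_numpy().astype('datetime64[ns]').astype(np.int64)
as_float = as_int.astype(np.float64)

# Cast back to int64 and compare with the original values
recovered_int = as_float.astype(np.int64)
print((recovered_int == as_int).all())

# Rebuild the datetimes from the recovered integers
recovered = pd.to_datetime(recovered_int)
print(recovered[0], recovered[-1])
```

Timestamps that are not aligned to whole days would not necessarily round-trip this way, so this only holds for first-of-month data like ours.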

In [17]:
# Grab an example cfips column
test_cfips = blocks[0,0,0:,0] # type: ignore
print(f'Test cfips: {test_cfips}\ndtype: {type(test_cfips)}\nelement dtype: {type(test_cfips[0])}\n')

# Cast float64 to int64
test_cfips = test_cfips.astype(np.int64)
print(f'Test cfips: {test_cfips}\ndtype: {type(test_cfips)}\nelement dtype: {type(test_cfips[0])}\n')

# Convert to pandas dataframe with dtype int64 and column name 'cfips'
test_cfips_df = pd.DataFrame(test_cfips, columns=['cfips']).astype('int64')
test_cfips_df.info()

Test cfips: [1001. 1001. 1001. 1001.]
dtype: <class 'numpy.ndarray'>
element dtype: <class 'numpy.float64'>

Test cfips: [1001 1001 1001 1001]
dtype: <class 'numpy.ndarray'>
element dtype: <class 'numpy.int64'>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   cfips   4 non-null      int64
dtypes: int64(1)
memory usage: 160.0 bytes


In [18]:
test_cfips_df.head()

   cfips
0   1001
1   1001
2   1001
3   1001


Also not too bad. From here I am confident we can rebuild 'row_id', county and state if we need them. Row id is just a string concatenation of the cfips and the first day of the month, joined by an underscore. With the cfips, we can easily look up the county and state if we need them.
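
As a quick illustration of the row id claim, here is a sketch that rebuilds the ids for the first input block from the recovered cfips and date values. The arrays are typed in by hand from the outputs above so that the snippet stands on its own.

```python
import numpy as np
import pandas as pd

# cfips and dates for the first input block, copied from the outputs above
cfips = np.array([1001, 1001, 1001, 1001], dtype=np.int64)
dates = pd.to_datetime(np.array([
    1564617600000000000, 1567296000000000000,
    1569888000000000000, 1572566400000000000,
], dtype=np.int64))

# Format the dates as YYYY-MM-DD strings and join with the cfips
date_strings = dates.strftime('%Y-%m-%d')
row_ids = [f'{c}_{d}' for c, d in zip(cfips, date_strings)]

print(row_ids)
# ['1001_2019-08-01', '1001_2019-09-01', '1001_2019-10-01', '1001_2019-11-01']
```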

Calling this good. Let's write to disk and move on.

In [20]:
# Write to disk
output_file = f'{conf.DATA_PATH}/parsed_data/order{model_order}_horizon{forecast_horizon}.npy'
np.save(output_file, blocks)

# Check round-trip
loaded_blocks = np.load(output_file)

# Inspect
print(f'Shape: {loaded_blocks.shape}')
print()
print(loaded_blocks[0])

Shape: (100320, 2, 4, 4)

[[[1.0010000e+03 1.5646176e+18 3.0076818e+00 1.2490000e+03]
  [1.0010000e+03 1.5672960e+18 2.8848701e+00 1.1980000e+03]
  [1.0010000e+03 1.5698880e+18 3.0558431e+00 1.2690000e+03]
  [1.0010000e+03 1.5725664e+18 2.9932332e+00 1.2430000e+03]]

 [[1.0010000e+03 1.5751584e+18 2.9932332e+00 1.2430000e+03]
  [1.0010000e+03 1.5778368e+18 2.9690900e+00 1.2420000e+03]
  [1.0010000e+03 1.5805152e+18 2.9093256e+00 1.2170000e+03]
  [1.0010000e+03 1.5830208e+18 2.9332314e+00 1.2270000e+03]]]


OK, looks good. Time to start making some models.
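
For later reference, a sketch of how the saved array might be unpacked into model inputs and targets. The axis layout is (block, input/forecast, timepoint, feature), with feature columns in the order of `trimmed_training_df`. Using only `microbusiness_density` as the modelled variable is an assumption for this example, and the random toy array stands in for the real file.

```python
import numpy as np

# Toy stand-in with the same layout as the saved array:
# (n_blocks, input/forecast, timepoints, features)
blocks = np.random.default_rng(0).random((8, 2, 4, 4))

# Feature columns follow trimmed_training_df:
# 0 = cfips, 1 = first_day_of_month, 2 = microbusiness_density, 3 = active
DENSITY = 2

X = blocks[:, 0, :, DENSITY]  # past values, shape (n_blocks, model_order)
y = blocks[:, 1, :, DENSITY]  # future values, shape (n_blocks, forecast_horizon)

print(X.shape, y.shape)  # (8, 4) (8, 4)
```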