## Revealed test data exploration
Sometime in late February the leaderboard was updated and the original test data was revealed. Let's take a look at it and see what they gave us. We are also going to need to concatenate the new data to the old and regenerate our bootstrapping blocks.

1. [Abbreviations & definitions](#abbreviations_definitions)
2. [Load and inspect](#load_and_inspect)

<a name="abbreviations_definitions"></a>
### 1. Abbreviations & definitions
+ MBD: microbusiness density
+ MBC: microbusiness count

<a name="load_and_inspect"></a>
### 2. Load and inspect
Let's take a look at the new data and see what we got!

In [1]:
# Add parent directory to path to allow import of config.py
import sys
sys.path.append('..')
import config as conf
#import functions.plotting_functions as plot_funcs

# instantiate paths
paths = conf.DataFilePaths()

import numpy as np
import pandas as pd
# import matplotlib
# import statsmodels.api as sm

print(f'Python: {sys.version}')
print()
print(f'Numpy {np.__version__}')
print(f'Pandas {pd.__version__}')
# print(f'Matplotlib: {matplotlib.__version__}') # type: ignore
# print(f'Statsmodels: {sm.__version__}')

# # Replace matplotlib with pyplot interface
# del matplotlib
# import matplotlib.pyplot as plt

Python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0]

Numpy 1.23.5
Pandas 1.4.3


In [6]:
# Load up new data
revealed_test_df = pd.read_csv(f'{paths.KAGGLE_DATA_PATH}/revealed_test.csv')

# Load up old data
training_df = pd.read_csv(f'{paths.KAGGLE_DATA_PATH}/train.csv.zip', compression='zip')

# Print out some metadata and sample rows
revealed_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6270 entries, 0 to 6269
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   row_id                 6270 non-null   object 
 1   cfips                  6270 non-null   int64  
 2   county                 6270 non-null   object 
 3   state                  6270 non-null   object 
 4   first_day_of_month     6270 non-null   object 
 5   microbusiness_density  6270 non-null   float64
 6   active                 6270 non-null   int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 343.0+ KB


Looks like this file follows the same format as the original training data (same seven columns, same dtypes), as advertised.

In [7]:
# Print out some metadata and sample rows
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122265 entries, 0 to 122264
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   row_id                 122265 non-null  object 
 1   cfips                  122265 non-null  int64  
 2   county                 122265 non-null  object 
 3   state                  122265 non-null  object 
 4   first_day_of_month     122265 non-null  object 
 5   microbusiness_density  122265 non-null  float64
 6   active                 122265 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 6.5+ MB
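
Comparing the two `info()` printouts by eye works, but the same check can be done programmatically. Below is a minimal sketch, assuming we only care about column names, column order and dtypes. The helper name `same_schema` and the toy frames are hypothetical stand-ins for `training_df` and `revealed_test_df`.

```python
import pandas as pd


def same_schema(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True when both frames share column names, column order and dtypes."""
    same_columns = list(left.columns) == list(right.columns)
    same_dtypes = left.dtypes.tolist() == right.dtypes.tolist()
    return same_columns and same_dtypes


# Toy frames standing in for training_df and revealed_test_df (hypothetical values)
toy_train = pd.DataFrame({
    'row_id': ['1001_2022-10-01'],
    'cfips': [1001],
    'microbusiness_density': [3.4],
    'active': [1400],
})
toy_test = pd.DataFrame({
    'row_id': ['1001_2022-11-01'],
    'cfips': [1001],
    'microbusiness_density': [3.5],
    'active': [1410],
})

print(same_schema(toy_train, toy_test))
```

With the real data this would be called as `same_schema(training_df, revealed_test_df)`.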


In [8]:
training_df['microbusiness_density'].describe()

count    122265.000000
mean          3.817671
std           4.991087
min           0.000000
25%           1.639344
50%           2.586543
75%           4.519231
max         284.340030
Name: microbusiness_density, dtype: float64

In [9]:
revealed_test_df['microbusiness_density'].describe()

count    6270.000000
mean        4.025670
std         6.239498
min         0.000000
25%         1.709230
50%         2.695358
75%         4.685238
max       238.060420
Name: microbusiness_density, dtype: float64

OK, so the mean is a bit higher (4.03 vs. 3.82) and the standard deviation is larger (6.24 vs. 4.99). I wonder if this is why all of the public leaderboard scores seemed to drop after the switchover? Let's see what months they gave us.
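
The comparison above pits two new months against the whole training history. A fairer look might be to summarize the density month by month and compare the revealed months with the most recent training months. A rough sketch of that idea on toy data follows; the helper name `monthly_summary` and the toy values are made up for illustration, only the column names come from the data.

```python
import pandas as pd


def monthly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation and max of MBD for each timepoint."""
    grouped = df.groupby('first_day_of_month')['microbusiness_density']
    return grouped.agg(['mean', 'std', 'max'])


# Toy data: two counties, two months (hypothetical values)
toy = pd.DataFrame({
    'first_day_of_month': ['2022-10-01', '2022-10-01', '2022-11-01', '2022-11-01'],
    'microbusiness_density': [1.0, 3.0, 2.0, 6.0],
})

summary = monthly_summary(toy)
print(summary)
```

On the real data, the same function could be applied to `training_df` and `revealed_test_df` separately and the last few rows of the former compared with the latter.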

In [10]:
# Take the unique timepoints without modifying a column selection in place
new_timepoints = revealed_test_df['first_day_of_month'].drop_duplicates(keep='first')
print(f'Num new timepoints: {len(new_timepoints)}')
new_timepoints.head()

Num new timepoints: 2


0    2022-11-01
1    2022-12-01
Name: first_day_of_month, dtype: object

OK, so it looks like we got just November and December 2022. Interesting. I guess this means that, for now, we are predicting January? Also, is this all of the data that we are going to see? That is, once submission closes, are we predicting March, April and May 2023 with December 2022 as the last known datapoint? I think so. Sounds like we just need to fill in the submission template file and hope for the best.

I am starting to think we should begin running our own internal experiments this way so that we don't have to do a major pivot and rewrite just before the submission deadline. So, for a given model order, we would predict over a forecast horizon of 5 (January through May 2023). The public leaderboard score will come from the first timepoint in the forecast window, while the private leaderboard score will come from the last three months.
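
To make the forecast window concrete, here is a small sketch that lists the five forecast months following the last known timepoint and splits them the way described above. The public/private split reflects my reading of the competition setup, not something confirmed in the data, and the variable names are just for illustration.

```python
import pandas as pd

# Last timepoint in the revealed data
last_known = pd.Timestamp('2022-12-01')
horizon = 5

# Month-start dates after the last known timepoint (drop the first, which is last_known itself)
forecast_months = pd.date_range(start=last_known, periods=horizon + 1, freq='MS')[1:]

# Assumption: public score from the first forecast month, private score from the last three
public_months = forecast_months[:1]
private_months = forecast_months[-3:]

print('Forecast:', list(forecast_months.strftime('%Y-%m-%d')))
print('Public:  ', list(public_months.strftime('%Y-%m-%d')))
print('Private: ', list(private_months.strftime('%Y-%m-%d')))
```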

OK, so what now? First, let's combine these datasets and regenerate our bootstrapping blocks. Then I think we should generate a new naive control submission with all 5 forecast months populated. After that we can do the same thing with our best ARIMA model and finally move back to the GRU.
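
As a rough idea of what the naive control could look like, the sketch below carries each county's last known MBD forward across every forecast month. It assumes the submission file wants a `row_id` of the form `cfips_date` (matching the training data) plus a `microbusiness_density` column; the function name `naive_forecast` and the toy values are hypothetical.

```python
import pandas as pd


def naive_forecast(df: pd.DataFrame, forecast_months: list) -> pd.DataFrame:
    """Carry the last known MBD of each county forward to every forecast month."""
    # Last known value per county
    last_known = (
        df.sort_values(['cfips', 'first_day_of_month'])
        .groupby('cfips', as_index=False)
        .last()[['cfips', 'microbusiness_density']]
    )
    # Pair every county with every forecast month
    months = pd.DataFrame({'first_day_of_month': list(forecast_months)})
    out = last_known.merge(months, how='cross')
    # Assumed row_id format: <cfips>_<first_day_of_month>
    out['row_id'] = out['cfips'].astype(str) + '_' + out['first_day_of_month']
    return out[['row_id', 'microbusiness_density']]


# Toy data: two counties, two known months (hypothetical values)
toy = pd.DataFrame({
    'cfips': [1001, 1001, 1003, 1003],
    'first_day_of_month': ['2022-11-01', '2022-12-01', '2022-11-01', '2022-12-01'],
    'microbusiness_density': [3.0, 3.5, 7.0, 8.0],
})

submission = naive_forecast(toy, ['2023-01-01', '2023-02-01'])
print(submission)
```

The same function would take the combined training data and all five forecast months to populate the full template.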

In [17]:
updated_training_df = pd.concat([training_df, revealed_test_df], ignore_index=True)
updated_training_df.sort_values(['cfips', 'first_day_of_month'], inplace=True)
updated_training_df.reset_index(inplace=True, drop=True)
updated_training_df.head(42)


| | row_id | cfips | county | state | first_day_of_month | microbusiness_density | active |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2019-08-01 | 1001 | Autauga County | Alabama | 2019-08-01 | 3.007682 | 1249 |
| 1 | 1001_2019-09-01 | 1001 | Autauga County | Alabama | 2019-09-01 | 2.88487 | 1198 |
| 2 | 1001_2019-10-01 | 1001 | Autauga County | Alabama | 2019-10-01 | 3.055843 | 1269 |
| 3 | 1001_2019-11-01 | 1001 | Autauga County | Alabama | 2019-11-01 | 2.993233 | 1243 |
| 4 | 1001_2019-12-01 | 1001 | Autauga County | Alabama | 2019-12-01 | 2.993233 | 1243 |
| 5 | 1001_2020-01-01 | 1001 | Autauga County | Alabama | 2020-01-01 | 2.96909 | 1242 |
| 6 | 1001_2020-02-01 | 1001 | Autauga County | Alabama | 2020-02-01 | 2.909326 | 1217 |
| 7 | 1001_2020-03-01 | 1001 | Autauga County | Alabama | 2020-03-01 | 2.933231 | 1227 |
| 8 | 1001_2020-04-01 | 1001 | Autauga County | Alabama | 2020-04-01 | 3.000167 | 1255 |
| 9 | 1001_2020-05-01 | 1001 | Autauga County | Alabama | 2020-05-01 | 3.004948 | 1257 |
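
Before writing the combined data out, it may be worth checking that the concatenation did what we wanted: no `row_id` should appear twice, and every county should have the same number of timepoints. A minimal sketch on toy data is below. The helper name `check_panel` is hypothetical, and on the real data it would be called on `updated_training_df`.

```python
import pandas as pd


def check_panel(df: pd.DataFrame) -> tuple:
    """Return (number of duplicated row_ids, set of distinct per-county row counts)."""
    n_duplicates = int(df['row_id'].duplicated().sum())
    rows_per_county = set(df.groupby('cfips').size().tolist())
    return n_duplicates, rows_per_county


# Toy stand-ins for the old and new data (hypothetical values)
old = pd.DataFrame({
    'row_id': ['1001_2022-10-01', '1003_2022-10-01'],
    'cfips': [1001, 1003],
    'first_day_of_month': ['2022-10-01', '2022-10-01'],
})
new = pd.DataFrame({
    'row_id': ['1001_2022-11-01', '1003_2022-11-01'],
    'cfips': [1001, 1003],
    'first_day_of_month': ['2022-11-01', '2022-11-01'],
})

combined = pd.concat([old, new], ignore_index=True)
n_duplicates, rows_per_county = check_panel(combined)
print(f'Duplicated row_ids: {n_duplicates}')
print(f'Distinct rows-per-county counts: {rows_per_county}')
```

A clean result is zero duplicates and a single value in the set of per-county counts.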


In [21]:
# Write the new training data to disk
output_file = f'{paths.KAGGLE_DATA_PATH}/updated_train.csv.zip'
updated_training_df.to_csv(output_file, index=False, compression='zip')

In [22]:
# Check the round trip
new_training_df = pd.read_csv(output_file, compression='zip')
new_training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128535 entries, 0 to 128534
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   row_id                 128535 non-null  object 
 1   cfips                  128535 non-null  int64  
 2   county                 128535 non-null  object 
 3   state                  128535 non-null  object 
 4   first_day_of_month     128535 non-null  object 
 5   microbusiness_density  128535 non-null  float64
 6   active                 128535 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 6.9+ MB


In [23]:
new_training_df.head()

| | row_id | cfips | county | state | first_day_of_month | microbusiness_density | active |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2019-08-01 | 1001 | Autauga County | Alabama | 2019-08-01 | 3.007682 | 1249 |
| 1 | 1001_2019-09-01 | 1001 | Autauga County | Alabama | 2019-09-01 | 2.88487 | 1198 |
| 2 | 1001_2019-10-01 | 1001 | Autauga County | Alabama | 2019-10-01 | 3.055843 | 1269 |
| 3 | 1001_2019-11-01 | 1001 | Autauga County | Alabama | 2019-11-01 | 2.993233 | 1243 |
| 4 | 1001_2019-12-01 | 1001 | Autauga County | Alabama | 2019-12-01 | 2.993233 | 1243 |


Ok, good enough for now.
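
If we want something stricter than eyeballing `info()` and `head()`, the round trip can be verified with `pandas.testing.assert_frame_equal`, which raises if the reloaded frame differs from the original (floats are compared with a small tolerance by default). The sketch below uses a toy frame and a temporary directory; the file name and values are hypothetical.

```python
import os
import tempfile

import pandas as pd
from pandas.testing import assert_frame_equal

# Toy frame standing in for updated_training_df (hypothetical values)
toy = pd.DataFrame({
    'row_id': ['1001_2022-11-01', '1001_2022-12-01'],
    'cfips': [1001, 1001],
    'microbusiness_density': [3.442677, 3.470915],
    'active': [1463, 1475],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_train.csv.zip')
    # Same write and read calls as used for the real file
    toy.to_csv(path, index=False, compression='zip')
    reloaded = pd.read_csv(path, compression='zip')

# Raises AssertionError if values, dtypes or column labels differ
assert_frame_equal(toy, reloaded)
print('Round trip OK')
```

For the real file, the equivalent call would be `assert_frame_equal(updated_training_df, new_training_df)`.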