## Test Data Exploration
Now let's look at the test data file and make some decisions about how we will structure our predictions.

1. [Load and inspect](#load_and_inspect)
2. [Dataset structure](#dataset_structure)
3. [The plan](#the_plan)
4. [TODO](#TODO)

In [1]:
# Add parent directory to path to allow import of config.py
import sys
sys.path.append('..')
import config as conf

import pandas as pd

print(f'Python: {sys.version}')
print()
print(f'Pandas: {pd.__version__}')

Python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0]

Pandas: 1.4.3


<a name="load_and_inspect"></a>
### 1. Load and inspect

In [2]:
# Read csv into pandas dataframe
test_df = pd.read_csv(f'{conf.KAGGLE_DATA_PATH}/test.csv')

# Print out some metadata and sample rows
print(test_df.info())
print()
print(test_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25080 entries, 0 to 25079
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   row_id              25080 non-null  object
 1   cfips               25080 non-null  int64 
 2   first_day_of_month  25080 non-null  object
dtypes: int64(1), object(2)
memory usage: 587.9+ KB
None

            row_id  cfips first_day_of_month
0  1001_2022-11-01   1001         2022-11-01
1  1003_2022-11-01   1003         2022-11-01
2  1005_2022-11-01   1005         2022-11-01
3  1007_2022-11-01   1007         2022-11-01
4  1009_2022-11-01   1009         2022-11-01


In [3]:
# Set dtype on first day of month column
test_df['first_day_of_month'] = pd.to_datetime(test_df['first_day_of_month'])

Here are the column descriptions from the Kaggle competition site:
+ **row_id** - An ID code for the row.
+ **cfips** - A unique identifier for each county using the Federal Information Processing System. The first two digits correspond to the state FIPS code, while the following 3 represent the county.
+ **first_day_of_month** - The date of the first day of the month.
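
The two-plus-three digit layout of **cfips** described above can be sketched as follows. Note that pandas reads the column as int64, so leading zeros are lost (Alabama's `01001` shows up as `1001`) and the value has to be zero-padded before slicing. The `sample_df`, `state_fips` and `county_fips` names are mine, for illustration only.

```python
import pandas as pd

# Small illustrative sample; in the notebook this would be test_df
sample_df = pd.DataFrame({'cfips': [1001, 1003, 56045]})

# Zero-pad to five characters so single-digit state codes keep their leading zero
cfips_str = sample_df['cfips'].astype(str).str.zfill(5)

# First two digits: state FIPS code; last three: county code
sample_df['state_fips'] = cfips_str.str[:2]
sample_df['county_fips'] = cfips_str.str[2:]

print(sample_df)
```

Keeping the codes as strings rather than integers avoids dropping the leading zeros a second time.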

<a name="dataset_structure"></a>
### 2. Dataset structure

In [10]:
# Get the unique prediction dates without modifying the column in test_df
timepoints = test_df['first_day_of_month'].drop_duplicates().reset_index(drop=True)
print(f'Num timepoints: {len(timepoints)}')
timepoints.head(8)

Num timepoints: 8


0   2022-11-01
1   2022-12-01
2   2023-01-01
3   2023-02-01
4   2023-03-01
5   2023-04-01
6   2023-05-01
7   2023-06-01
Name: first_day_of_month, dtype: datetime64[ns]

In [8]:
county_counts = test_df.groupby(['cfips']).size()
print(f'Number of counties: {len(county_counts)}')
county_counts.head()

Number of counties: 3135


cfips
1001    8
1003    8
1005    8
1007    8
1009    8
dtype: int64

OK - so 8 timepoints for prediction: November 2022 to June 2023. With 3135 counties, that accounts for all 25080 rows (3135 × 8 = 25080). County count matches the training set.
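
The first few county counts suggest every county has all 8 months, but `head()` only shows five of them. A minimal sketch of a full check, run here on a toy stand-in (`toy_df` is made up for illustration; in the notebook it would be `test_df`):

```python
import pandas as pd

# Toy stand-in for test_df: 2 counties x 3 months
months = pd.to_datetime(['2022-11-01', '2022-12-01', '2023-01-01'])
toy_df = pd.DataFrame(
    [(cfips, month) for cfips in (1001, 1003) for month in months],
    columns=['cfips', 'first_day_of_month'],
)

n_counties = toy_df['cfips'].nunique()
n_timepoints = toy_df['first_day_of_month'].nunique()

# A complete panel has exactly one row per (county, month) pair:
# the row count equals the product, and no pair appears twice
is_complete = (
    len(toy_df) == n_counties * n_timepoints
    and not toy_df.duplicated(['cfips', 'first_day_of_month']).any()
)

print(f'{n_counties} counties x {n_timepoints} timepoints, complete: {is_complete}')
```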

<a name="the_plan"></a>
### 3. The plan

The competition description says that during the active phase (before March 14th?) only the most recent month of data will be used for the public leaderboard. I take that to mean that if we submit in February, January will be scored? Not completely sure. The contest description notes that old test data will be published in mid-February, presumably through January? But then that means we need to predict February in February. I guess that's not crazy since the actual timepoint is the first of the month. Here is an excerpt from a comment by a Kaggle staff member on the discussion board:

+ The private leaderboard will include March, April, and May. June was included in the submission file due to an error on my part. I'm inclined to leave it in since the extra submissions don't technically hurt anything and removing it would invalidate older submissions. That is confusing though so I would be open to biting the bullet and making the change.
+ Yes, on March 13th the public LB will be the month of February.

Right, seems like everyone is confused by this. Here is my understanding - the test/submission file doesn't change, but the date range being scored does. Up to mid February only November 2022 will be scored. At that point, new data up to and including January 2023 will be released and February 2023 will be the month scored up to the close of the contest. The final private leaderboard score will then be derived from the March, April and May data as it becomes available.

It's not clear to me whether or not the February data will be released in March - it kind of sounds like maybe not. This means we really need to predict 4 months into the future, with only the last three being scored.

This contest seems like a bit of a mess - lots of people commenting like they know what is going on but I'm not sure anyone really does. I think the safest thing to do here is work with predicting 4 months. This way, if the February data is released in March and we only have to predict 3 months, we suddenly have more training data and can easily adapt to predict a smaller timespan. This change would be much better than the other way around - i.e. working to predict 3 months for weeks and then having to switch to 4 at the last minute.
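
As a rough sketch of that 4-month plan, the forecast horizon and the scored months can be picked out of the eight test timepoints like this. The date boundaries encode my reading of the rules above (forecast February through May 2023, scored on March through May), so treat them as assumptions rather than confirmed facts; `horizon` and `scored` are illustrative names.

```python
import pandas as pd

# The eight monthly timepoints present in the test file
timepoints = pd.Series(pd.date_range('2022-11-01', '2023-06-01', freq='MS'))

# Assumption: forecasts start at February 2023 and run for four months
horizon_start = pd.Timestamp('2023-02-01')
horizon_end = pd.Timestamp('2023-05-01')
horizon = timepoints[(timepoints >= horizon_start) & (timepoints <= horizon_end)]

# Assumption: only March through May count toward the private leaderboard
scored = horizon[horizon >= pd.Timestamp('2023-03-01')]

print(f'Forecast months: {horizon.dt.strftime("%Y-%m").tolist()}')
print(f'Scored months: {scored.dt.strftime("%Y-%m").tolist()}')
```

If the February data does get released, dropping to 3 months only means moving `horizon_start` forward by one month.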

I'd also love to know for sure what month(s) is/are being actively scored right now. Seems like it should be a simple bit of info to post! We could probably figure it out with a few test submissions containing zeros or NaNs for all but one month... maybe I will have to look into it.
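
A minimal sketch of what such a probe submission might look like: keep real predictions for one month and zero out everything else, then see whether the public score moves relative to an all-zero submission. The `make_probe_submission` helper is hypothetical, and the `microbusiness_density` target column name is an assumption on my part since it does not appear in the test file. I also have not checked whether the scorer accepts NaNs, so this version sticks to zeros.

```python
import pandas as pd

def make_probe_submission(test_df, predictions, probe_month,
                          target_col='microbusiness_density'):
    """Keep predictions for probe_month only; set all other months to zero.

    target_col is an assumed name for the submission's value column.
    """
    keep = test_df['first_day_of_month'] == pd.Timestamp(probe_month)
    submission = test_df[['row_id']].copy()
    # Series.where keeps values where the mask is True, else uses 0.0
    submission[target_col] = predictions.where(keep, 0.0)
    return submission

# Toy stand-in for test_df and a made-up prediction per row
toy_test = pd.DataFrame({
    'row_id': ['1001_2022-11-01', '1001_2022-12-01'],
    'first_day_of_month': pd.to_datetime(['2022-11-01', '2022-12-01']),
})
toy_preds = pd.Series([3.0, 3.1])

probe = make_probe_submission(toy_test, toy_preds, '2022-11-01')
print(probe)
```

Submitting one probe per month would use up several daily submissions, so it may be worth probing only the months in doubt.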

<a name="TODO"></a>
### 4. TODO
1. Attribute quote
2. Proof/edit summary