In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt

In [3]:
data = pd.read_csv('data/aqi_daily_1980_to_2021.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')  # vectorized date parsing
data.loc[data['AQI']>500, 'AQI'] = 500  # truncate AQI values at 500

In [4]:
location_groups = data.groupby(['Latitude', 'Longitude', 'State Name', 'County Name']).groups
locations = list(location_groups)

In [5]:
len(locations)

1048

### Test if there is missing data

In [6]:
subset = data.loc[location_groups[locations[0]], 'Date']
np.setdiff1d(
    pd.date_range(subset.min(), subset.max()).unique(),
    subset.unique()
).shape[0] > 0

True
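The check above only says whether any day is missing at the first location. As a rough sketch, the same idea can be wrapped in a helper that returns the missing calendar days themselves. The name `missing_days` and the three-date demo series are hypothetical and are not part of the dataset.

```python
import pandas as pd

def missing_days(dates):
    """Return the calendar days between the first and last reading that have no reading."""
    # Normalize to midnight so that any time-of-day component cannot hide a match.
    observed = pd.DatetimeIndex(pd.to_datetime(dates)).normalize()
    full_range = pd.date_range(observed.min(), observed.max(), freq='D')
    return full_range.difference(observed)

# Hypothetical demo: readings on Jan 1, 2 and 5 leave Jan 3 and 4 missing.
demo = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-05']))
gaps = missing_days(demo)
print(len(gaps))
```

In the notebook this could be called as `missing_days(data.loc[location_groups[loc], 'Date'])` for each `loc` in `locations` to see how the gaps vary by location.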

## Reasoning about the distributions
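Here we have daily AQI readings (with some days missing) from 1048 locations across the US. That is, we have 1048 AQI time series, one per location, each with gaps. For a location $L$, let $T_L$ be the set of $n_L$ days that have a reading and $A_L^{(t)}$ the AQI at $L$ on day $t$:

$$
\begin{aligned}
A_L &= \left(A_L^{(t)} : t \in T_L\right) \\
T_L &= \{t_1, \ldots, t_{n_L}\}
\end{aligned}
$$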

The goal is to predict future AQI independently of location. So the task is to approximate a function $f$ defined as:

$$
f\left(\left(A_L^{(t)} : t \in \{t_i\}_{i=k}^{k+s_{in}-1}\right)\right)
    =\left(A_L^{(t)} : t \in \{t_i\}_{i=k+s_{in}}^{k+s_{in}+s_{out}-1}\right)
$$

Where $s_{in}$ is the number of days of data provided as input to the function, $s_{out}$ is the number of days of prediction received as output, and $k$ is an arbitrary starting index.
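To make the definition of $f$ concrete, here is a minimal sketch of how one series could be cut into (input, target) pairs with an input length `s_in` and an output length `s_out`. The helper name `make_windows` and the toy series are hypothetical. As in the formula, the windows run over consecutive *observed* readings, so a window may span a gap in calendar days.

```python
import numpy as np

def make_windows(series, s_in, s_out):
    """Cut a 1-D series into sliding (input, target) pairs of lengths s_in and s_out."""
    series = np.asarray(series)
    # Number of valid starting indices k such that k + s_in + s_out <= len(series).
    n_windows = len(series) - s_in - s_out + 1
    if n_windows <= 0:
        return np.empty((0, s_in)), np.empty((0, s_out))
    inputs = np.stack([series[k:k + s_in] for k in range(n_windows)])
    targets = np.stack([series[k + s_in:k + s_in + s_out] for k in range(n_windows)])
    return inputs, targets

# Toy series standing in for the AQI readings of one location.
example = np.arange(10)
X, y = make_windows(example, s_in=3, s_out=2)
print(X.shape, y.shape)
```

Each row of `X` is one input window and the matching row of `y` holds the readings that immediately follow it.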

Note that $f$ is agnostic of location and our goal is to predict *general* AQI trends. So our model (our learned function $\tilde{f}\approx f$) need not take location into account, and should approximate $f$ equally well regardless of location. As such, we can split the data so that the train and test sets consist of time series from distinct locations. That is, our train and test sets $D_{train}, D_{test}$ are constructed as follows:

$$
\begin{aligned}
D_{train} &= \{A_L : L \in \mathcal{L}_{train}\} \\
D_{test} &= \{A_L : L \in \mathcal{L}_{test}\}
\end{aligned}
$$

Where $\mathcal{L}_{train}$ and $\mathcal{L}_{test}$ are disjoint sets of locations, i.e. $\mathcal{L}_{train} \cap \mathcal{L}_{test} = \emptyset$.
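One possible way to build such a split is to shuffle the list of location keys and cut it in two. The sketch below is an assumption, not a method fixed by this notebook: the helper name `split_locations`, the 20% test fraction, the seed, and the demo location tuples are all hypothetical.

```python
import numpy as np

def split_locations(locations, test_fraction=0.2, seed=0):
    """Randomly partition location keys into disjoint train and test lists."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(locations))
    n_test = int(round(test_fraction * len(locations)))
    test = [locations[i] for i in order[:n_test]]
    train = [locations[i] for i in order[n_test:]]
    return train, test

# Hypothetical keys shaped like the (Latitude, Longitude, State Name, County Name) groups.
demo_locations = [(float(i), float(-i), 'State', f'County {i}') for i in range(10)]
train_locs, test_locs = split_locations(demo_locations, test_fraction=0.2)
print(len(train_locs), len(test_locs))
```

With the notebook's own variables, `split_locations(locations)` would return two lists of keys, and `data.loc[location_groups[loc]]` would then select the rows of each series in either set.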