# Dataset 
---------

In [1]:
import numpy as np 

The dataset comes in three splits: training, validation, and test.

In [2]:
train = np.load('./data/train.npz', allow_pickle=True) # train dataset
val = np.load('./data/val.npz', allow_pickle=True) # validation dataset 
test = np.load('./data/test.npz', allow_pickle=True) # test dataset 

We use the data based on rainfall during Hurricane Harvey, stored under the `harvey` key of each archive.

In [3]:
train = train['harvey']
val = val['harvey']
test = test['harvey']
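
The loading pattern above can be tried without the real files. The following is a minimal sketch: it writes a toy archive to a temporary directory and reads it back. The file name `toy.npz` and the toy samples are hypothetical; only the `harvey` key and the `allow_pickle=True` flag come from the cells above.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy samples: an object array of dictionaries,
# mimicking the "one dictionary per sample" layout.
samples = np.empty(2, dtype=object)
for i in range(2):
    samples[i] = {
        "static": np.zeros((4, 3)),
        "seq": np.ones(5, dtype=bool),
    }

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npz")  # hypothetical file name
    np.savez(path, harvey=samples)

    # allow_pickle=True is needed because the array stores Python objects.
    with np.load(path, allow_pickle=True) as archive:
        keys = archive.files        # names of the arrays in the archive
        loaded = archive["harvey"]  # object array of dictionaries

first_keys = sorted(loaded[0].keys())
print(keys, len(loaded), first_keys)
```

`archive.files` is a quick way to check which keys an `.npz` file contains before indexing into it.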

The number of samples in each split:

In [4]:
len(train), len(val), len(test)

(1063, 228, 228)
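
As a quick sanity check, the split sizes printed above can be turned into fractions. This sketch only uses the three counts from the output; the dictionary name `sizes` is chosen for illustration.

```python
# Split sizes copied from the output above.
sizes = {"train": 1063, "val": 228, "test": 228}

total = sum(sizes.values())
fractions = {name: n / total for name, n in sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

The counts correspond to roughly a 70/15/15 split.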

Let's look at one sample.

In [5]:
sample = train[0]

Each sample is a dictionary with the following keys:

In [6]:
# static 
sample['static'].shape

(1519, 3)

`static` has two axes: nodes (grid cells) and three static features per node.  

* feature 1: elevation of the cell, from the digital elevation model (DEM)  
* feature 2: distance from the cell to the closest stream  
* feature 3: Manning's friction (roughness) coefficient of the cell 
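
The column layout of `static` can be unpacked into named arrays. This is a sketch on a synthetic array of the same shape, assuming the columns follow the order of the feature list; the variable names are illustrative and not part of the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in with the same shape as sample['static'].
static = rng.random((1519, 3))

# Assumed column order, following the feature list.
elevation = static[:, 0]       # feature 1: DEM elevation
dist_to_stream = static[:, 1]  # feature 2: distance to closest stream
manning = static[:, 2]         # feature 3: Manning's coefficient

# Per-feature range across all nodes, useful before any rescaling.
feature_min = static.min(axis=0)
feature_max = static.max(axis=0)
print(elevation.shape, feature_min.shape, feature_max.shape)
```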

In [7]:
# seq 
sample['seq'].shape

(132,)

`seq` has one axis: time. It is a boolean (True/False) array, which matters mainly when many time series are batched together for training. It is optional here because we use a single rainfall event (Hurricane Harvey).
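
One common use of such a boolean array is as a mask over the time axis. The sketch below assumes that `True` marks a valid time step and `False` marks padding; the notebook does not state this, and for the single Harvey event the mask may well be all `True`. The arrays are synthetic stand-ins with the shapes shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for sample['data'] and sample['seq'].
data = rng.random((1519, 132, 8))
seq = np.zeros(132, dtype=bool)
seq[:100] = True  # assumption: first 100 steps are real, the rest is padding

# Keep only the time steps flagged as valid.
valid = data[:, seq, :]

# Mean of the first feature over nodes, for each valid time step.
mean_first_feature = valid[:, :, 0].mean(axis=0)
print(valid.shape, mean_first_feature.shape)
```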

In [8]:
# s_edges 
sample['s_edges'].shape

(5916, 2)

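`s_edges` is the edge list of the sample: 5916 edges, one per row, each given by two entries (presumably the indices of the two connected nodes).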
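
To illustrate how an edge list of shape `(E, 2)` can be used, here is a sketch on a tiny hypothetical graph. It assumes each row holds a `(source, target)` pair of integer node indices, which the notebook does not confirm for `s_edges`.

```python
import numpy as np

# Hypothetical 4-node chain, with each connection stored in both directions.
edges = np.array([[0, 1], [1, 0], [1, 2], [2, 1], [2, 3], [3, 2]])
num_nodes = 4

# Number of outgoing edges per node.
out_degree = np.bincount(edges[:, 0], minlength=num_nodes)

# Some graph libraries expect shape (2, E) instead of (E, 2).
edge_index = edges.T

# Adjacency list: neighbors of each node.
neighbors = {n: edges[edges[:, 0] == n, 1].tolist() for n in range(num_nodes)}
print(out_degree, edge_index.shape, neighbors)
```

The same calls should apply to `sample['s_edges']` with `num_nodes = 1519`, provided the index assumption holds.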

In [9]:
# bin 
sample['bin'].shape

(1519, 132, 1)

`bin` (optional) holds binary data marking wet versus dry cells. It has three axes: nodes, time, and a single binary value (0 or 1).
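
A binary array of this shape can be summarized as the fraction of flagged cells at each time step. The sketch below uses synthetic data and assumes that 1 means wet and 0 means dry; the notebook does not say which value is which.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in with the same shape as sample['bin'].
wet = rng.integers(0, 2, size=(1519, 132, 1))

# Drop the trailing singleton axis: (nodes, time, 1) -> (nodes, time).
wet_2d = wet[..., 0]

# Fraction of cells flagged as 1 at each time step (assumed to mean "wet").
wet_fraction = wet_2d.mean(axis=0)
print(wet_2d.shape, wet_fraction.shape)
```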

In [10]:
# data 
sample['data'].shape

(1519, 132, 8)

`data` has three axes: nodes, time, and 8 features. 

* feature 1: water depth  
* features 2 & 3: in velocity vector (two components) 
* features 4 & 5: out velocity vector (two components) 
* feature 6: in velocity norm 
* feature 7: out velocity norm 
* feature 8: rainfall
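
The feature axis of `data` can be split into named arrays by slicing the last axis. This is a sketch on a synthetic array of the same shape, assuming the last axis follows the order of the feature list; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in with the same shape as sample['data'].
data = rng.random((1519, 132, 8))

# Assumed layout of the last axis, following the feature list.
depth = data[..., 0]       # feature 1
v_in = data[..., 1:3]      # features 2 & 3
v_out = data[..., 3:5]     # features 4 & 5
v_in_norm = data[..., 5]   # feature 6
v_out_norm = data[..., 6]  # feature 7
rainfall = data[..., 7]    # feature 8

print(depth.shape, v_in.shape, v_out.shape, rainfall.shape)
```

On the real data, one could check whether `np.linalg.norm(v_in, axis=-1)` matches `v_in_norm`; that relationship is an assumption here, not something the notebook states.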