I think the main bottleneck in our training speed is the training data loader, probably because we get cache misses when gathering data at random indices. To speed this up, we will likely need to store the pre-shuffled data on disk.
I tried writing some code to do this, but I was running out of memory on the last step:
```python
from itertools import product
from random import shuffle

import numpy as np
import xarray as xr

ds = xr.open_dataset("./data/processed/training.nc")

# construct the indices
x = range(len(ds.x))
y = range(len(ds.y))
z = range(len(ds.z))
time = range(len(ds.time))

indices = list(product(x, y, time))
shuffle(indices)
transposed = list(zip(*indices))

# construct xarray indexers following
# http://xarray.pydata.org/en/stable/indexing.html#more-advanced-indexing
dims = ['x', 'y', 'time']
indexers = {
    dim: xr.DataArray(
        np.array(index),
        dims="sample",
        coords={'sample': np.arange(len(indices))})
    for dim, index in zip(dims, transposed)
}

# This step runs out of memory
shuffled_ds = ds.isel(**indexers)
```
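A possible stopgap before a full pipeline: apply the shuffled indices one slice at a time, writing each slice into a preallocated on-disk array, so only a small batch of samples is ever resident in memory. This is just a sketch with a synthetic numpy array standing in for the dataset; the shapes, the `chunk` size, and the `idx` layout are mine, not from the real code.

```python
import tempfile
from pathlib import Path

import numpy as np

out_path = Path(tempfile.mkdtemp()) / "shuffled.npy"

# Synthetic (time, z, y, x) array standing in for the real dataset.
nt, nz, ny, nx = 4, 5, 6, 7
data = np.arange(nt * nz * ny * nx, dtype=np.float32).reshape(nt, nz, ny, nx)

# Shuffled (x, y, time) index triples, analogous to the snippet above.
rng = np.random.default_rng(0)
idx = np.stack(
    np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nt), indexing="ij"),
    axis=-1,
).reshape(-1, 3)
rng.shuffle(idx)

# Gather one slice of samples at a time into a preallocated on-disk array,
# so only `chunk` z-columns are in memory at once.
n = idx.shape[0]
out = np.lib.format.open_memmap(
    out_path, mode="w+", dtype=data.dtype, shape=(n, nz))
chunk = 50
for start in range(0, n, chunk):
    xs, ys, ts = idx[start:start + chunk].T
    # Advanced indexing with a slice in the middle puts the broadcast
    # sample axis first, giving a (k, nz) block of z-columns.
    out[start:start + chunk] = data[ts, :, ys, xs]
out.flush()
```

This still pays the random-access cost on the read side, but caps peak memory at one chunk of samples.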
To speed this up, we will probably need to do a couple of steps, with on-disk caching for each step:
1. Transpose the data: `(time, z, y, x)` --> `(time, y, x, z)`
2. Reshape: `(time, y, x, z)` --> `(batch, time_and_next, z)`
3. Shuffle along the batch dimension.
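Roughly how I picture those steps, sketched with numpy on a synthetic array and with each step cached to disk. The shapes are made up, and I'm guessing that "time_and_next" means pairs of consecutive time steps; if it means something else, step 2 would change.

```python
import tempfile
from pathlib import Path

import numpy as np

cache = Path(tempfile.mkdtemp())

# Synthetic stand-in for the (time, z, y, x) training array.
nt, nz, ny, nx = 4, 5, 6, 7
rng = np.random.default_rng(0)
data = rng.standard_normal((nt, nz, ny, nx)).astype(np.float32)

# Step 1: transpose (time, z, y, x) -> (time, y, x, z), cache to disk.
np.save(cache / "step1.npy", data.transpose(0, 2, 3, 1))

# Step 2: reshape to (batch, time_and_next, z), cache again.  Reading
# "time_and_next" as pairs of consecutive time steps, each sample is the
# z-column at (t, y, x) stacked with the one at (t + 1, y, x).
step1 = np.load(cache / "step1.npy", mmap_mode="r")
pairs = np.stack([step1[:-1], step1[1:]], axis=-2)  # (nt-1, ny, nx, 2, nz)
np.save(cache / "step2.npy", pairs.reshape(-1, 2, nz))

# Step 3: shuffle along the batch dimension.  After the reshape each
# sample is a contiguous block of rows, so the gather reads sequentially
# instead of scattering across a 4-D array.
step2 = np.load(cache / "step2.npy", mmap_mode="r")
perm = rng.permutation(step2.shape[0])
np.save(cache / "step3.npy", np.asarray(step2)[perm])

shuffled = np.load(cache / "step3.npy")
```

The loader can then read `step3.npy` sequentially, which is the point of doing the shuffle once, up front, on disk.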
cc @sarenehan