# Saving/loading numpy arrays efficiently.

How should we save and load relatively large datasets to disk efficiently?
This notebook explores several options and shows some of the tradeoffs between them.
To test compression on realistic data, we use the MNIST dataset that ships with TensorFlow.
All tests here were run on an i3-6100 with a modern SSD.

_Warning: Running this notebook will write a few hundred megabytes to disk._

In [1]:
import os, numpy
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data")

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


### Option 1: Use `numpy.save`

This uses the numpy `.npy` format, which is very simple: a small header (about 80 bytes here) encoding the array's data type and shape, followed by the raw array contents.
[Documented here.](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.save.html)

In [2]:
%%time
numpy.save("/tmp/dump1.npy", mnist.train.images)
print "Megabytes on disk:", os.path.getsize("/tmp/dump1.npy") * 2**-20.0

Megabytes on disk: 164.489822388
CPU times: user 0 ns, sys: 112 ms, total: 112 ms
Wall time: 114 ms
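To make the header-plus-raw-data layout concrete, here is a minimal sketch that saves a small array into memory and pulls the header apart by hand. It is written for Python 3 (unlike the Python 2 cells in this notebook), and the helper name `split_npy_header` and the toy array are made up for illustration. The exact header size can differ between numpy versions because of padding.

```python
import io
import struct

import numpy


def split_npy_header(raw):
    """Return (version, header_text, header_bytes) for raw .npy bytes."""
    # Every .npy file starts with this 6-byte magic string.
    assert raw[:6] == b"\x93NUMPY"
    major, minor = struct.unpack("<BB", raw[6:8])
    if major == 1:
        # Format version 1.x stores the header length as a 2-byte integer.
        (header_len,) = struct.unpack("<H", raw[8:10])
        offset = 10
    else:
        # Later format versions use a 4-byte header length.
        (header_len,) = struct.unpack("<I", raw[8:12])
        offset = 12
    header_text = raw[offset:offset + header_len].decode("latin1")
    return (major, minor), header_text, offset + header_len


example = numpy.arange(12, dtype=numpy.float32).reshape((3, 4))
buf = io.BytesIO()
numpy.save(buf, example)
raw = buf.getvalue()

version, header_text, header_bytes = split_npy_header(raw)
print("format version:", version)
print("header:", header_text.strip())
# Everything after the header is the raw array data.
print("payload bytes:", len(raw) - header_bytes, "array bytes:", example.nbytes)
```

The header is a short text dictionary holding the dtype, memory order, and shape, which is why the file on disk is only slightly larger than the array itself.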


### Option 2: Use `numpy.savez`

Creates a _non_-compressed zip archive that contains the array as a `.npy`. Lets you easily save multiple arrays into one archive, and give them names, which is pretty nifty. [Documented here.](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.savez.html)

In [3]:
%%time
numpy.savez("/tmp/dump2.npz", dataset=mnist.train.images)
print "Megabytes on disk:", os.path.getsize("/tmp/dump2.npz") * 2**-20.0

Megabytes on disk: 164.489936829
CPU times: user 156 ms, sys: 244 ms, total: 400 ms
Wall time: 447 ms
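The named-arrays feature is easiest to see with more than one array. The following is a small Python 3 sketch that stores two toy arrays (stand-ins for images and labels, not the real MNIST data) in one in-memory archive and reads them back by name.

```python
import io

import numpy

# Toy stand-ins for an image array and a label array.
images = numpy.zeros((5, 784), dtype=numpy.float32)
labels = numpy.arange(5, dtype=numpy.uint8)

# Each keyword argument becomes the name of one array in the archive.
buf = io.BytesIO()
numpy.savez(buf, images=images, labels=labels)
buf.seek(0)

archive = numpy.load(buf)
print(sorted(archive.files))
restored_images = archive["images"]
restored_labels = archive["labels"]
print(restored_images.shape, restored_labels.dtype)
```

Arrays are only read from the archive when they are indexed by name, so loading one small array does not require reading the others.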


### Option 3: Use `numpy.savez_compressed`

Basically the same as `numpy.savez`, except it enables zip DEFLATE compression.

In [4]:
%%time
numpy.savez_compressed("/tmp/dump3.npz", dataset=mnist.train.images)
print "Megabytes on disk:", os.path.getsize("/tmp/dump3.npz") * 2**-20.0

Megabytes on disk: 13.8666152954
CPU times: user 2.37 s, sys: 180 ms, total: 2.55 s
Wall time: 2.55 s
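Because `.npz` files are ordinary zip archives, the difference between `numpy.savez` and `numpy.savez_compressed` can be checked with the standard `zipfile` module. This is a Python 3 sketch on a toy array of zeros; the helper name `describe` is made up for illustration.

```python
import io
import zipfile

import numpy

data = numpy.zeros((1000, 784), dtype=numpy.float32)

plain_buf = io.BytesIO()
numpy.savez(plain_buf, dataset=data)
packed_buf = io.BytesIO()
numpy.savez_compressed(packed_buf, dataset=data)


def describe(buf):
    """List (name, compression type, stored size, original size) per member."""
    buf.seek(0)
    with zipfile.ZipFile(buf) as archive:
        return [
            (info.filename, info.compress_type, info.compress_size, info.file_size)
            for info in archive.infolist()
        ]


plain = describe(plain_buf)
packed = describe(packed_buf)
print("savez:", plain)
print("savez_compressed:", packed)
print("stored?", plain[0][1] == zipfile.ZIP_STORED)
print("deflated?", packed[0][1] == zipfile.ZIP_DEFLATED)
```

Each array appears in the archive as a member named after its keyword with `.npy` appended.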


### Option 4: Take things into our own hands and use zlib compression.

Another option is to compress the array ourselves using zlib.
This doesn't store any information about the datatype or array shape, so we have to remember those separately ourselves!

In [5]:
%%time
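import zlib
# Level 9 is zlib's strongest (and slowest) compression setting.
compressor = zlib.compressobj(9)
with open("/tmp/dump4.z", "wb") as f:
    # buffer() exposes the array's raw bytes without copying them.
    f.write(compressor.compress(buffer(mnist.train.images)))
    f.write(compressor.flush())
print "Megabytes on disk:", os.path.getsize("/tmp/dump4.z") * 2**-20.0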

Megabytes on disk: 13.7638893127
CPU times: user 11.2 s, sys: 32 ms, total: 11.2 s
Wall time: 13.3 s
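One way to avoid remembering the datatype and shape separately is to compress the `.npy` serialization rather than the raw buffer, since the `.npy` header already records both. This is a Python 3 sketch of that idea, not something measured in this notebook; the function names `dumps_compressed` and `loads_compressed` are made up for illustration, and the whole serialized array is held in memory at once.

```python
import io
import zlib

import numpy


def dumps_compressed(array, level=6):
    """Serialize an array to .npy bytes, then zlib-compress them."""
    buf = io.BytesIO()
    numpy.save(buf, array)
    return zlib.compress(buf.getvalue(), level)


def loads_compressed(blob):
    """Invert dumps_compressed; dtype and shape come from the .npy header."""
    return numpy.load(io.BytesIO(zlib.decompress(blob)))


original = numpy.arange(24, dtype=numpy.float32).reshape((2, 3, 4))
blob = dumps_compressed(original, level=9)
restored = loads_compressed(blob)
print(restored.dtype, restored.shape, len(blob))
```

The same pattern should work with `bz2.compress` and `bz2.decompress` in place of the zlib calls.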


### Option 5: Take things into our own hands and use bz2 compression.

Same as above, but with another compression library that is also built into Python (and has a nicer interface).

In [6]:
%%time
import bz2
with bz2.BZ2File("/tmp/dump5.bz2", "wb") as f:
    f.write(buffer(mnist.train.images))
print "Megabytes on disk:", os.path.getsize("/tmp/dump5.bz2") * 2**-20.0

Megabytes on disk: 7.82636451721
CPU times: user 3.62 s, sys: 16 ms, total: 3.64 s
Wall time: 3.64 s


## Loading data.

Here are the corresponding methods for loading our arrays back again.

In [7]:
%%time
load1 = numpy.load("/tmp/dump1.npy")

CPU times: user 4 ms, sys: 60 ms, total: 64 ms
Wall time: 61.9 ms


In [8]:
%%time
archive2 = numpy.load("/tmp/dump2.npz")
load2 = archive2["dataset"]

CPU times: user 140 ms, sys: 52 ms, total: 192 ms
Wall time: 193 ms


In [9]:
%%time
archive3 = numpy.load("/tmp/dump3.npz")
load3 = archive3["dataset"]

CPU times: user 392 ms, sys: 56 ms, total: 448 ms
Wall time: 454 ms


In [10]:
%%time
with open("/tmp/dump4.z", "rb") as f:
    data = f.read().decode("zlib")
# Note how we have to separately remember the array data type and shape. :(
load4 = numpy.fromstring(data, dtype=numpy.float32).reshape((55000, 784))

CPU times: user 304 ms, sys: 64 ms, total: 368 ms
Wall time: 375 ms


In [11]:
%%time
with open("/tmp/dump5.bz2", "rb") as f:
    data = f.read().decode("bz2")
load5 = numpy.fromstring(data, dtype=numpy.float32).reshape((55000, 784))

CPU times: user 2.93 s, sys: 96 ms, total: 3.03 s
Wall time: 3.82 s
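The two manual loading cells rely on Python 2 behavior: `str.decode("zlib")` and `str.decode("bz2")` are not available on Python 3 `bytes`. A rough Python 3 equivalent of the raw round trip, sketched on a toy array rather than the MNIST data, uses `zlib.decompress`, `bz2.decompress`, and `numpy.frombuffer`.

```python
import bz2
import zlib

import numpy

original = numpy.linspace(0.0, 1.0, 20, dtype=numpy.float32).reshape((4, 5))

# Compress the raw array bytes; dtype and shape are NOT stored.
raw = original.tobytes()
zlib_blob = zlib.compress(raw, 9)
bz2_blob = bz2.compress(raw)

# When loading, we still have to supply the dtype and shape ourselves.
from_zlib = numpy.frombuffer(
    zlib.decompress(zlib_blob), dtype=numpy.float32
).reshape((4, 5))
from_bz2 = numpy.frombuffer(
    bz2.decompress(bz2_blob), dtype=numpy.float32
).reshape((4, 5))

print(numpy.array_equal(original, from_zlib))
print(numpy.array_equal(original, from_bz2))
```

Arrays created by `numpy.frombuffer` over `bytes` are read-only views; call `.copy()` on the result if the data needs to be modified.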


In [12]:
# Final sanity check that all loads were correct.
for test_array in (load1, load2, load3, load4, load5):
    assert numpy.array_equal(mnist.train.images, test_array)
print "All loaded arrays loaded correctly."

All loaded arrays loaded correctly.


# Conclusion.

We saw that `numpy.save` (paired with `numpy.load`) was the fastest option for both saving and loading, but offers no compression and doesn't let us save multiple arrays with distinct names.

The compression offered by `numpy.savez_compressed` reduced the file to about 8.4% of its original size, at the cost of taking about 22 times as long to save as `numpy.save` (2.55 s vs. 114 ms) and about 7 times as long to load (454 ms vs. 62 ms).

Finally, by taking things into our own hands and using bz2 to compress we managed to get the file compressed down to about 4.8% of its original size, with a save time only a little longer than `numpy.savez_compressed`, but a load time that was over 8 times longer, and over 60 times longer than loading the uncompressed result of `numpy.save`.
However, we unfortunately have to keep track of the array shape and datatype ourselves, which sucks a lot.

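Option 4 (using zlib manually) was lackluster: at compression level 9 it produced a file about the same size as `numpy.savez_compressed` (13.8 MB vs. 13.9 MB), but took over 5 times as long to save (13.3 s vs. 2.55 s) and also leaves us to track the shape and datatype ourselves.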
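The ratios quoted in this conclusion can be rechecked from the numbers printed by the cells above. In this small sketch the dictionaries simply copy the reported sizes and wall times, and the helper name `relative_to` is made up for illustration.

```python
# Sizes (MB) and wall times (seconds) copied from the cell outputs above.
sizes_mb = {
    "save": 164.489822388,
    "savez": 164.489936829,
    "savez_compressed": 13.8666152954,
    "zlib": 13.7638893127,
    "bz2": 7.82636451721,
}
save_wall_s = {
    "save": 0.114, "savez": 0.447, "savez_compressed": 2.55,
    "zlib": 13.3, "bz2": 3.64,
}
load_wall_s = {
    "save": 0.0619, "savez": 0.193, "savez_compressed": 0.454,
    "zlib": 0.375, "bz2": 3.82,
}


def relative_to(table, baseline="save"):
    """Express every entry as a multiple of the baseline entry."""
    return {name: value / table[baseline] for name, value in table.items()}


size_ratio = relative_to(sizes_mb)
save_ratio = relative_to(save_wall_s)
load_ratio = relative_to(load_wall_s)

for name in ("savez", "savez_compressed", "zlib", "bz2"):
    print(
        "%-17s size %5.1f%%  save x%5.1f  load x%5.1f"
        % (name, 100 * size_ratio[name], save_ratio[name], load_ratio[name])
    )
```

These are single-run timings, so the multiples are only rough indications.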