# Saving and Loading NumPy Arrays

While not part of Python itself, the NumPy array library forms the basis for nearly all numeric computation within Python.  A few core features of the Python language have been specialized to accommodate the NumPy community and library.  The most notable examples of the language definition being modified for the sake of NumPy are the extended slice notation and the matrix multiplication operator.

The *stride* argument to slices was added in Python 1.4 (long ago), but was not used by Python lists, tuples, or strings until 2.3 (still a long time ago).  Commas within compound slice descriptions are not used anywhere in Python's standard library, but the syntax exists so that NumPy (and later other libraries) can utilize it.  Similarly, the operator `@` is not used anywhere in Python itself or its standard library, but was added so that NumPy can use it to denote matrix multiplication (some other libraries have since utilized it for other purposes).
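To see what these two syntax features hand to a library, here is a minimal sketch using a hypothetical `Probe` class (not part of NumPy or the standard library) that reports what Python passes to its special methods.  It only illustrates the language mechanism, not how NumPy implements indexing or multiplication.

```python
class Probe:
    """Hypothetical class that echoes what the syntax passes in."""

    def __getitem__(self, index):
        # A comma-separated subscript arrives as one tuple of
        # slice objects and plain values.
        return index

    def __matmul__(self, other):
        # Python translates `a @ b` into `a.__matmul__(b)`.
        return f"matmul with {other!r}"


probe = Probe()
index = probe[2:6, 4, ::3]
print(index)
print(probe @ "anything")
```

NumPy arrays define these same two methods, which is how `arr[2:6, 4, 3:4]` and `@` acquire their array-specific meanings.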

In [1]:
import pickle
import numpy as np
!rm tmp/*

# Serializing with Pickle

Most Python objects, even those in extension libraries, can be serialized and deserialized with the `pickle` module.  Classes are able to define a few protocol methods that allow them to interoperate with pickling.  For most purposes, pickle is fine for representing NumPy arrays.  Let us create one and roundtrip it.

In [2]:
arr = np.random.randint(1, 100, 1_000_000).reshape(100, 100, 100)
arr.shape

(100, 100, 100)

In [3]:
# Syntax extras for NumPy (matrix multiply, multidimensional slices)
arr[2:6, 4, 3:4] @ arr[8:9, 3, 4:8]

array([[1456,  520, 3900, 4992],
       [ 420,  150, 1125, 1440],
       [1736,  620, 4650, 5952],
       [1204,  430, 3225, 4128]])

Dumping and loading a pickle of a NumPy array is the same as for any Python object.

In [4]:
# Context managers make sure the file handles are closed promptly
with open('tmp/arr.pkl', 'wb') as fh:
    pickle.dump(arr, fh)
with open('tmp/arr.pkl', 'rb') as fh:
    arr2 = pickle.load(fh)
arr2[:2]

array([[[87, 32,  7, ..., 25, 94, 44],
        [12, 13, 87, ..., 56, 97, 61],
        [36, 19, 75, ..., 41, 69, 95],
        ...,
        [61,  7, 25, ...,  7, 63, 29],
        [ 7, 24, 75, ..., 86, 81, 13],
        [75, 55, 87, ..., 62, 50, 54]],

       [[40, 15, 79, ..., 39, 74, 18],
        [64,  5, 18, ..., 20, 63, 34],
        [43, 91, 70, ..., 64, 48, 95],
        ...,
        [60,  8, 21, ...,  3, 34, 20],
        [ 9,  7, 32, ..., 59, 75,  8],
        [60, 12, 54, ...,  6, 29, 67]]])
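The printed slice looks right, but comparing a few displayed values is not a full check.  A minimal sketch of a programmatic roundtrip test follows; it uses a smaller, in-memory array (the names `original` and `restored` are illustrative, not from the cells above) so that it runs without touching the `tmp/` directory.

```python
import pickle

import numpy as np

# A small stand-in for the million-element array used above
rng = np.random.default_rng(42)
original = rng.integers(1, 100, size=(10, 10, 10))

# Roundtrip through bytes in memory rather than a file on disk
restored = pickle.loads(pickle.dumps(original))

same_values = np.array_equal(original, restored)
same_layout = (original.shape == restored.shape
               and original.dtype == restored.dtype)
print(same_values, same_layout)
```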

An advantage of pickles is that you can embed an array inside other structures, and pickle will handle that.

In [5]:
data = {'array': arr,
        'description': "A million random integers",
        'list': [5.4, 9.1, 3.4],
        'another_array': np.arange(5.0)}

data_bytes = pickle.dumps(data)
print(data_bytes[:35])

new_data = pickle.loads(data_bytes)
print("Description:", new_data['description'])
print("Extra data:", new_data['another_array'])

b'\x80\x04\x95\x92\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x05array\x94\x8c\x15numpy.core.'
Description: A million random integers
Extra data: [0. 1. 2. 3. 4.]


# Serializing with `np.savetxt()`

For NumPy arrays that are 1-D or 2-D, you can save them as delimited files with the `savetxt()` function.  This is a convenient way to save data to CSV or TSV that might be read by DataFrame libraries or similar tools.  However, arrays with more than two dimensions need to be reshaped to 2-D to be stored in this manner, and the original shape has to be recorded separately (here, in a header comment).

In [6]:
np.savetxt('tmp/arr-txt.tsv', 
           arr.reshape(100_000, 10),
           delimiter='\t',
           fmt='%d',
           header='Original shape: (100, 100, 100)',
           comments='# ')

In [7]:
!head tmp/arr-txt.tsv

# Original shape: (100, 100, 100)
87	32	7	44	5	52	55	47	71	1
31	71	64	13	68	66	76	28	47	20
93	21	74	20	34	29	4	72	17	68
36	72	57	76	69	62	39	83	57	72
91	45	25	82	94	25	34	36	50	98
11	62	71	94	61	28	30	94	61	93
72	2	47	67	56	10	50	45	82	35
93	66	98	35	63	64	36	42	4	82
62	18	44	14	48	14	99	77	43	95
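Reading such a file back is the mirror image of writing it.  The sketch below assumes a small stand-in array and a temporary directory (both hypothetical, chosen so the example is self-contained); `np.loadtxt()` skips the `#` comment line by default, so the original shape must be supplied by hand or parsed from the header separately.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
small = rng.integers(1, 100, size=(4, 5, 6))

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / 'small.tsv'
    # Flatten to 2-D for writing, as in the cell above
    np.savetxt(path, small.reshape(-1, 6), delimiter='\t', fmt='%d',
               header='Original shape: (4, 5, 6)', comments='# ')
    # The header line is ignored on load; dtype must be requested,
    # because loadtxt() produces floats unless told otherwise
    flat = np.loadtxt(path, delimiter='\t', dtype=np.int64)

restored = flat.reshape(4, 5, 6)
print(np.array_equal(small, restored))
```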


# Serializing with `np.save()`

The native NumPy serialization format is very simple and directly represents arrays on disk.  A `.npy` file is *slightly* faster to write than a pickle, and is *slightly* smaller on disk.  These differences are minimal and are swamped by disk caching effects and data size, respectively.  The actual advantage of `.npy` is precisely what it *does not* do: with the default `allow_pickle=False`, reading a serialized array will never instantiate custom classes, will never execute arbitrary code, and will never contain structures other than arrays.

In [8]:
%time np.save('tmp/arr.npy', arr)

CPU times: user 5.81 ms, sys: 8.29 ms, total: 14.1 ms
Wall time: 12.9 ms


In [9]:
arr3 = np.load('tmp/arr.npy')
arr3[:2]

array([[[87, 32,  7, ..., 25, 94, 44],
        [12, 13, 87, ..., 56, 97, 61],
        [36, 19, 75, ..., 41, 69, 95],
        ...,
        [61,  7, 25, ...,  7, 63, 29],
        [ 7, 24, 75, ..., 86, 81, 13],
        [75, 55, 87, ..., 62, 50, 54]],

       [[40, 15, 79, ..., 39, 74, 18],
        [64,  5, 18, ..., 20, 63, 34],
        [43, 91, 70, ..., 64, 48, 95],
        ...,
        [60,  8, 21, ...,  3, 34, 20],
        [ 9,  7, 32, ..., 59, 75,  8],
        [60, 12, 54, ...,  6, 29, 67]]])
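The safety property of `.npy` comes from `np.load()` refusing pickled content unless asked.  As a minimal sketch, assuming a recent NumPy where `allow_pickle` defaults to `False` on load, an array of arbitrary Python objects can be saved but is rejected on the way back in.  The in-memory buffer and variable names are illustrative only.

```python
import io

import numpy as np

# Object arrays can only be stored by pickling their elements
objects = np.array([{'a': 1}, {'b': 2}], dtype=object)

buf = io.BytesIO()
np.save(buf, objects)      # saving permits pickling by default
buf.seek(0)

try:
    np.load(buf)           # loading does not, unless allow_pickle=True
    refused = False
except ValueError as err:
    refused = True
    print("Refused:", err)
```

Passing `allow_pickle=True` to `np.load()` lifts the restriction, and with it the protection, so it is best reserved for files from trusted sources.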

# Serializing with `np.savez` and `np.savez_compressed`

An enhancement to the `.npy` format is the `.npz` format.  This uses a zipfile wrapper to aggregate multiple arrays in the same file.  Again, pickle could do this by putting them inside a dict or a list; the restriction is exactly the advantage for some cases.  The compressed version is preferable in almost all cases; for the last decade, the CPU time spent on compression has almost always been less than the time saved by writing fewer bytes to disk.

In [10]:
np.savez('tmp/arr', arr, data['another_array'])
np.savez_compressed('tmp/arr-compress', arr, data['another_array'])

The multiple arrays are available through a dict-like interface.  Arrays passed positionally are simply named `arr_0`, `arr_1`, and so on; to keep meaningful names, pass the arrays as keyword arguments instead, or store the mapping to your variable names by separate means.

In [11]:
arr_data = np.load('tmp/arr.npz')
for name in arr_data:
    print(name, arr_data[name].shape, arr_data[name].dtype)

print(arr_data['arr_1'])

arr_0 (100, 100, 100) int64
arr_1 (5,) float64
[0. 1. 2. 3. 4.]
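The generic `arr_0`, `arr_1` names arise only for positional arguments.  A short sketch of the keyword form follows; the names `grid` and `ramp` and the in-memory buffer are made up for illustration.

```python
import io

import numpy as np

buf = io.BytesIO()
# Keyword arguments become the names stored in the archive
np.savez_compressed(buf,
                    grid=np.zeros((3, 4), dtype=np.int64),
                    ramp=np.arange(5.0))
buf.seek(0)

with np.load(buf) as archive:
    names = sorted(archive.files)
    ramp = archive['ramp']

print(names)
print(ramp)
```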


# File sizes

We serialized the same data using several different formats.  The CPU times taken for all of them are negligible, but there are some notable patterns in disk usage.

In [12]:
!ls -la tmp

total 27516
drwxr-xr-x 2 dmertz dmertz    4096 Jun 28 15:11 .
drwxr-xr-x 7 dmertz dmertz    4096 Jun 28 13:56 ..
-rw-r--r-- 1 dmertz dmertz 1244755 Jun 28 15:11 arr-compress.npz
-rw-r--r-- 1 dmertz dmertz 8000128 Jun 28 15:11 arr.npy
-rw-r--r-- 1 dmertz dmertz 8000546 Jun 28 15:11 arr.npz
-rw-r--r-- 1 dmertz dmertz 8000165 Jun 28 15:11 arr.pkl
-rw-r--r-- 1 dmertz dmertz 2909133 Jun 28 15:11 arr-txt.tsv


Notice the initially surprising fact that the text format is not the largest.  Because all of our integers have at most two digits, each is stored with one or two bytes for the digits plus one for the delimiter or newline.  In contrast, an int64 value requires 8 bytes to store uncompressed.  If the data contained values closer to `sys.maxsize`, i.e. 9,223,372,036,854,775,807 on a 64-bit platform, the size of the text version could easily become larger.
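The arithmetic behind that comparison can be sketched directly.  The helper `text_bytes()` below is hypothetical and counts only digits plus one separator per value, ignoring the header line, so it approximates rather than reproduces the `savetxt()` file size.

```python
import sys

import numpy as np


def text_bytes(values):
    """Bytes to write integers as decimal text, one separator each."""
    return sum(len(str(int(v))) + 1 for v in values)


# Every value that randint(1, 100) can produce, once each
small = np.arange(1, 100, dtype=np.int64)
# The same count of values, all as large as possible
large = np.full(99, sys.maxsize, dtype=np.int64)

print("small:", text_bytes(small), "text vs", small.nbytes, "binary")
print("large:", text_bytes(large), "text vs", large.nbytes, "binary")
```

For the small values this works out to 288 bytes of text for 99 numbers, about 2.91 bytes each, which matches the roughly 2.9 MB file for a million values; the large values cost far more as text than the fixed 8 bytes of binary.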