<a id="home"></a>
# How to save and load NumPy data (the simple way)

In this document, we describe

- How to load data from a variety of built-in and external formats into NumPy arrays;
- How to save NumPy arrays to disk;
- The simplest solutions and best formats for your type of data.

### Contents:
- [An example of reading a netCDF file into NumPy](#example)
- NumPy built-in functionality:
    - [Saving a single array to disk](#builtin): the binary `npy` format (`np.load`, `np.save`)
    - [Saving several arrays in the same file](#npz): `np.savez` and `np.savez_compressed`.
    - [Plain text data](#text): `np.savetxt`, `np.loadtxt` and `np.genfromtxt`
    - [Saving structured arrays](#structured)
- [Zarr](#zarr): `save` and `load` functions for quick manipulation of NumPy arrays.
- [Final thoughts](#final): common problems and how to solve them; other options and formats.
    
For other information not covered in this page, go to the [Further Reading](#further_reading) section.

<a id="example"></a>
## Example: Reading data from a netCDF file and saving it as a NumPy array

For this example, we'll take a file from [gridMET](http://www.climatologylab.org/gridmet.html) containing meteorological data from the continental United States. 

gridMET is a

> dataset of daily high-spatial resolution (~4-km, 1/24th degree) surface meteorological data covering the contiguous US from 1979-yesterday.

This dataset is available as a set of [netCDF](https://www.unidata.ucar.edu/software/netcdf/) files corresponding to meteorological data from a selected region of the globe for a certain time period. For this document, we'll choose a dataset of maximum near-surface temperature data for a rectangle containing the United States for the year 2019.

### Reading the data

We'll use the file `tmmx_2019.nc` containing the temperature data for the year 2019. You can download the file manually by going to 

http://www.northwestknowledge.net/metdata/data/tmmx_2019.nc

or use the complete URL for the file in the `Dataset` function of the `netCDF4` module.

In [None]:
import numpy as np
from netCDF4 import Dataset

If you have downloaded the file manually, remove the `#` from the line below and run it:

In [None]:
#f = Dataset('tmmx_2019.nc')

If you want to download it automatically, run the cell below (**You need to have an active internet connection for this to work**):

In [None]:
f = Dataset('http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/tmmx/tmmx_2019.nc'+'#fillmismatch', 'r', format="NETCDF4")

Now, the air temperature information (in Kelvin) can be accessed by

In [None]:
air_temperature = f.variables['air_temperature']

The data is stored as a three-dimensional array; in the first axis we have the day when the temperature was measured, and the second and third axes are for latitude and longitude, respectively.

In [None]:
air_temperature.shape

Note that `air_temperature` is a [netCDF variable](https://www.unidata.ucar.edu/software/netcdf/docs/group__variables.html). We can read it as a NumPy array using the following syntax:
```python
temperatures = air_temperature[:]
```

To simplify our example, we'll read only the temperature data between July 1st and July 31st, 2019. These are indexed by 181 and 211, respectively, in the first axis of our `air_temperature` array (remember that when slicing, the stop value, 212 in the expression below, is not included in the final result).

In [None]:
temperatures = air_temperature[181:212, ...]

**Note:** You may see a `DeprecationWarning: tostring() is deprecated. Use tobytes() instead.` message after executing the command above. You can safely ignore this warning, as it is a [netCDF4 issue](https://github.com/Unidata/netcdf4-python/commit/b2f8d7e73c0df7e334f4280c7e206ec3684f499d).
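
The indices 181 and 211 used in the slice can be double-checked with the standard library. The sketch below assumes, as the text does, that index 0 of the first axis corresponds to January 1st, 2019; the variable names are illustrative only.

```python
from datetime import date

# Offsets in days from January 1st, 2019 (which is index 0 by assumption).
start = (date(2019, 7, 1) - date(2019, 1, 1)).days   # index of July 1st
last = (date(2019, 7, 31) - date(2019, 1, 1)).days   # index of July 31st

# A slice excludes its stop value, so the stop is one past the last day.
july = slice(start, last + 1)
print(start, last, july)
```

The resulting `july` object could be used in place of the literal `181:212` when indexing the first axis.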

Now, since there are missing data in this file, `temperatures` is a [masked array](https://numpy.org/doc/stable/reference/maskedarray.generic.html). If we had used
```python
temperatures = np.asarray(air_temperature)
```

we would see the missing values filled with the `missing_value` flag defined by the netCDF data:

In [None]:
f.variables['air_temperature'].missing_value
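
To make the difference between a masked array and a plain array with a flag value more concrete, here is a minimal sketch on synthetic data. The value `32767.0` is only a stand-in chosen for this illustration, not necessarily the flag used by the gridMET file.

```python
import numpy as np

# Synthetic 2x2 "temperature" grid; 32767.0 plays the role of the
# missing-value flag (an assumption made for this illustration).
raw = np.array([[290.5, 32767.0], [301.2, 295.0]])
masked = np.ma.masked_equal(raw, 32767.0)

print(masked.mask)            # True where the value is missing
print(masked.data)            # underlying values, flag included
print(masked.mean())          # statistics skip the masked entry
print(masked.filled(np.nan))  # plain ndarray with NaN replacing the flag
```

The `.data` and `.mask` attributes shown here are the same ones used later in this document when saving `temperatures`.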

<a id="builtin"></a>
## Saving a single array to disk

**The simplest recommended way to save an array to disk is by using the `numpy.save` function.** This allows you to save any ndarray to a binary file in NumPy [`.npy` format](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html). The `.npy` format is the standard binary file format in NumPy for persisting a single arbitrary NumPy array on disk. The format stores all of the shape and dtype information necessary to reconstruct the array correctly even on another machine with a different architecture. 

It's worth noting that since `NPY` is a binary file format, the contents of the file are not human-readable when opened with a text editor, for example. This is compensated by its read and write speed. If you need the data to be human-readable, see [np.savetxt](#text) below. 

Finally, if you need to save this data to be opened with other programming languages or applications, you might want to choose another format.

**Note:** `numpy.save` doesn't support saving arbitrary subclasses of NumPy arrays; in particular, [masked arrays](https://numpy.org/devdocs/reference/maskedarray.generic.html) are not supported.

In our example, we'll save only `temperatures.data` to avoid problems with masked arrays:

In [None]:
np.save("temperature_data.npy", temperatures.data)

Now, the data can be retrieved from this binary file by using the `np.load` function:

In [None]:
temps_read = np.load("temperature_data.npy")
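
If you don't have the netCDF file at hand, the same round trip can be tried on a small synthetic array. This is only a sketch: the array contents and the file name `example.npy` are made up for the illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

# Synthetic float32 array standing in for the temperature data.
original = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npy")
    np.save(path, original)
    restored = np.load(path)

# Shape, dtype and values all survive the round trip.
print(restored.shape, restored.dtype)
print(np.array_equal(original, restored))
```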

<a id="npz"></a>
## Storing several arrays in the same file: `numpy.savez`

To store several NumPy arrays in the same file, the recommended way is to use the `np.savez` function. As a convention, the data will be stored in a file with extension `.npz`. This file is a zip file containing multiple `.npy` files, one for each array. To recover the arrays stored in this file, we can once more use the `load` function.

For example, we could save the `.data` and `.mask` for the `temperatures` array in the same file:

In [None]:
np.savez("tempdata.npz", data=temperatures.data, mask=temperatures.mask)

Note that `numpy.savez` uses the names of its keyword arguments as labels for the arrays stored in the file. In our case, the arrays will be referred to in the file as `data` and `mask`.

Later, we can recover the arrays with `np.load`:

In [None]:
stored = np.load("tempdata.npz")

While `np.load` will directly return an array for `.npy` files, it will return a `NpzFile` object with a `.files` attribute for `.npz` files:

In [None]:
stored.files

The original data for `temperatures` can be recovered by

In [None]:
stored['data']
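
Since the data and the mask were saved separately, a masked array can be rebuilt from the two stored pieces. The following is a minimal sketch on synthetic values (the file name `example.npz` is hypothetical); it also opens the `NpzFile` in a `with` block so that the underlying file is closed afterwards.

```python
import os
import tempfile

import numpy as np

# Small synthetic masked array; the second value is treated as missing.
values = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npz")
    np.savez(path, data=values.data, mask=values.mask)

    # Arrays are read from the archive on access, so extract them
    # before the file is closed.
    with np.load(path) as stored:
        names = sorted(stored.files)
        rebuilt = np.ma.masked_array(stored["data"], mask=stored["mask"])

print(names)
print(rebuilt)
```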

You can also save several arrays into a single file in compressed `npz` format with `savez_compressed`:

In [None]:
np.savez_compressed("tempdata_compressed.npz", data=temperatures.data, mask=temperatures.mask)

While `savez_compressed` can take longer than `savez`, the resulting file sizes can be much smaller. In our example, the file sizes show a big difference:

| File name               | Size |
| -----------             | -----|
| tempdata.npz            | 216 MB |
| tempdata_compressed.npz |  21 MB |

**Note:** Using the `.npy` and `.npz` file extensions is a convention, not a requirement. Applications may wish to use these file formats but use an extension specific to the application. 
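
The size difference between `savez` and `savez_compressed` can be checked on any array by comparing the files on disk. The sketch below uses a constant synthetic array, which compresses unusually well; real measurements will usually compress less, so the ratio here should not be read as typical.

```python
import os
import tempfile

import numpy as np

# A constant array is a best case for compression (illustration only).
smooth = np.zeros((200, 200))

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.npz")
    packed = os.path.join(tmp, "packed.npz")
    np.savez(plain, data=smooth)
    np.savez_compressed(packed, data=smooth)
    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)

print(f"savez: {plain_size} bytes, savez_compressed: {packed_size} bytes")
```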

<a id="text"></a>
## Plain text data: `np.savetxt`, `np.loadtxt` and `np.genfromtxt`

If you are looking to load data from a `.csv` or `.txt` file, the best options might be `np.loadtxt` (if your dataset has no missing values) or `np.genfromtxt` (if your dataset has missing values or you need more flexibility when reading them).

In our case, we are looking to save data to a file, so we'll consider the `np.savetxt` function. However, we have a three-dimensional array, which is not supported by `np.savetxt` directly: it only accepts 1D and 2D arrays. One option would be to slice the data and save it separately. Another option would be to reshape the array so that the data from each day is presented as a (585, 1386) block, as demonstrated below:


In [None]:
temperatures.shape

We could do something like
```python
np.savetxt("temperature_data.csv", temperatures.reshape(31*585, 1386))
```
but be aware that this would take a long time. So to simplify, we demonstrate saving only a slice of the data. Let's say we wish to save the temperature data for July 8th, 2019 (day 7 of the month, starting from 0). We can save this data by doing

In [None]:
np.savetxt("temperature_data.csv", temperatures[7, ...])

This approach has the advantage of being human-readable, but has the disadvantage of being slow, creating large files and potentially causing loss of precision and information on the data. Use it with caution!

In [None]:
data = np.loadtxt("temperature_data.csv")
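
The reshape approach mentioned earlier can be tried quickly on a small synthetic array. This sketch assumes a `(days, lat, lon)` layout like the one in our example, with made-up sizes. Note that the text file does not record the original shape, so it has to be supplied again after loading.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in with layout (days, lat, lon).
cube = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cube.csv")
    # Stack the per-day blocks vertically: (2, 3, 4) -> (6, 4).
    np.savetxt(path, cube.reshape(-1, cube.shape[-1]))
    flat = np.loadtxt(path)

# The file only knows about a 2D table; restore the 3D shape by hand.
restored = flat.reshape(cube.shape)
print(flat.shape, restored.shape)
```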

The `numpy.savetxt` and `numpy.loadtxt` functions accept additional optional parameters such as header, footer, and delimiter, which may be useful if your original data comes from a pandas dataframe or a plain `.csv` file.
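
As a small illustration of these optional parameters, the sketch below writes a comma-separated file with a header line and reads it back. The column names and values are invented for the example.

```python
import os
import tempfile

import numpy as np

# Two columns: a day number and a temperature in Kelvin (made-up values).
table = np.array([[1.0, 290.5], [2.0, 301.25]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.csv")
    # The header is written as a comment line, prefixed with "# " by default.
    np.savetxt(path, table, delimiter=",", header="day,tmax_K", fmt="%.2f")

    with open(path) as fh:
        first_line = fh.readline().strip()

    # loadtxt skips lines starting with "#" by default.
    loaded = np.loadtxt(path, delimiter=",")

print(first_line)
print(loaded)
```

The `fmt` argument controls how many digits are written, which is one place where the loss of precision mentioned above can occur.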

While text files can be easier for sharing, `.npy` and `.npz` files (described above) are smaller and faster to read. If you need more sophisticated handling of your text file (for example, if you need to work with lines that contain missing values), you should use the [genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) function.
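
A minimal sketch of how `np.genfromtxt` handles a missing entry is shown below. The data is synthetic and read from an in-memory buffer instead of a file; with the default settings and a float result, an empty field is filled with `nan`.

```python
import io

import numpy as np

# Three columns; the middle field of the second row is empty (missing).
text = io.StringIO("1.0,290.5,10.0\n2.0,,11.0\n3.0,301.0,12.0\n")

data = np.genfromtxt(text, delimiter=",")
print(data)
```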

<a id="structured"></a>
## Structured arrays

So far, we've only dealt with an array of `float64` values, as can be seen from the `dtype` attribute of the `temperatures.data` array:

In [None]:
temperatures.data.dtype

Suppose we wished to store data with different `dtypes` in the same array, like we would for a table or [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). NumPy's [structured arrays](https://numpy.org/devdocs/user/basics.rec.html) are `ndarrays` whose datatype is a composition of simpler datatypes organized as a sequence of named fields. 

In our example, we can store the temperature information in a more human-readable way. Since we want measures of temperature for each day of July 2019, we define the two types of data we wish to represent: a `date` and the `temperatures` measured at each of the coordinates previously defined on the map.

In [None]:
dtypes = [('date', np.dtype('datetime64[D]')), ('temperatures', np.float64, (585, 1386))]

For more information on creating and manipulating dtypes, see the [dtypes documentation](https://numpy.org/doc/stable/reference/arrays.dtypes.html).

Now, we'll create a structured array with information organized with the `dtypes` we just defined:

In [None]:
temp_table = np.zeros(31, dtype=dtypes)
for i in range(31):
    temp_table[i] = (np.datetime64('2019-07-01') + np.timedelta64(i, 'D'), temperatures.data[i])

In [None]:
temp_table.dtype

All we have to do is save `temp_table` to a file with `np.save` and the dtypes information is preserved for later reading.

In [None]:
np.save("temp_table.npy", temp_table)

Reading back:

In [None]:
temp_table_read = np.load("temp_table.npy")

we can check that

In [None]:
temp_table_read.dtype

which is the same dtype we had for `temp_table`.
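
The same idea can be tried at a smaller scale. The sketch below builds a structured array with a tiny, made-up `(2, 3)` grid per day, accesses its fields by name, and checks that the dtype survives a save/load round trip. The file name `records.npy` is hypothetical.

```python
import os
import tempfile

import numpy as np

# Same field names as above, but with a tiny (2, 3) grid per record.
record = np.dtype([("date", "datetime64[D]"),
                   ("temperatures", np.float64, (2, 3))])

table = np.zeros(2, dtype=record)
table["date"] = np.arange("2019-07-01", "2019-07-03", dtype="datetime64[D]")
table["temperatures"][1] = 300.0   # fill the grid of the second day

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.npy")
    np.save(path, table)
    restored = np.load(path)

print(restored["date"])
print(restored["temperatures"].shape)
```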

<a id="zarr"></a>
## Zarr: an option for large arrays

[Zarr](https://zarr.readthedocs.io/en/stable/) is a Python package providing an implementation of chunked, compressed, N-dimensional arrays. If you are already familiar with HDF5 then Zarr arrays provide similar functionality, but with some additional flexibility. It is not a part of NumPy, but can be used to store very large arrays with fast data access and modest file sizes. Zarr also lets you read and write files to cloud storage systems.

In [None]:
import zarr

Zarr offers a number of options for persistent array storage. However, it has a different model of persistent storage than what we usually see. For example, the `zarr.convenience.save()` - or simply `zarr.save()` - function can be used to save an array to disk as a `DirectoryStore` object from Zarr. This means that if we do 

In [None]:
zarr.save('data/temperatures.zarr', temperatures)

our data has been stored in a directory called `temperatures.zarr` in a subdirectory `data` of the local path (if this path does not exist, it is created at the time of writing).

Now, we can recover the data using

In [None]:
temperatures_zarr = zarr.load('data/temperatures.zarr')

Note that

In [None]:
temperatures_zarr.dtype

as expected. 

As an alternative, if we wish to have a `.zip` file containing our array data, we can use

In [None]:
zarr.save('temperatures.zip', temperatures)

To read the data back, we can once again use `zarr.load()`:

In [None]:
temp_zarr = zarr.load('temperatures.zip')

<a id="final"></a>
## Final thoughts

Some things to be aware of when choosing a file format or storage strategy:

- Depending on the format chosen for the output file, you may experience loss of precision for the saved data.
- Choose a format that will be portable for your needs. If you wish to share your data with people using other tools or programming languages, make sure to choose an appropriate file format.
- As in the masked array example, there are workarounds that can be used when NumPy functionality is not complete. 
- If you are looking to process a `.csv` file with headers and column labels, you might want to look at [Pandas](https://pandas.pydata.org/) instead of NumPy.

### Pickle and JSON

If you are familiar with Python, you may have heard of the [pickle](https://docs.python.org/3/library/pickle.html) module. This allows a user to *serialize* a Python object, converting it to a byte stream (in our case, a binary file). This is, in general, not recommended, because pickled data can include malicious code which will be executed during unpickling. Furthermore, there may be portability issues: pickled objects may not be loadable on different Python installations. 
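
NumPy itself reflects this concern: arrays of Python objects can only be stored in `.npy` files by pickling them, and `np.load` refuses to unpickle unless `allow_pickle=True` is passed explicitly. The sketch below, using a made-up object array, shows the refusal; it is an illustration of the default behavior in recent NumPy versions, not a recommendation to enable pickling.

```python
import os
import tempfile

import numpy as np

# An object array can only be written to .npy by pickling its elements.
objects = np.array([{"station": "A"}, None], dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "objects.npy")
    np.save(path, objects)

    try:
        np.load(path)          # allow_pickle defaults to False
        refused = False
    except ValueError as err:
        refused = True
        print("np.load refused:", err)
```

Only pass `allow_pickle=True` for files that come from a source you trust.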

Some of the problems in the pickle module can be solved with a better serialization format, JSON. Unfortunately, [NumPy arrays are not JSON serializable](https://github.com/numpy/numpy/issues/12481) at this point.
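
A common workaround, sketched below, is to convert the array to nested Python lists with `tolist()` and store the dtype alongside the values. The layout of the JSON document (`dtype`, `shape`, `values`) is an ad hoc choice for this illustration, not a standard format.

```python
import json

import numpy as np

arr = np.array([[1.5, 2.5], [3.5, 4.5]])

# Nested lists and strings are JSON serializable; the ndarray itself is not.
payload = json.dumps({
    "dtype": str(arr.dtype),
    "shape": arr.shape,
    "values": arr.tolist(),
})

decoded = json.loads(payload)
rebuilt = np.array(decoded["values"], dtype=decoded["dtype"])
print(rebuilt)
```

This is only practical for small arrays, since the text representation is much larger and slower to parse than `.npy`.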

### A warning: `ndarray.tofile` and `np.fromfile`

These are convenience functions for quick storage of array data as text or binary format (default), but should not be used except in very specific cases. Information on endianness and precision is lost, so this method is **not** a good choice for files intended to archive data or transport data between machines with different endianness. Some of these problems can be overcome by outputting the data as text files, at the expense of speed and file size.

Use `np.save` and `np.load` instead.
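
The following sketch illustrates what is lost. A small synthetic `int32` array is written with `tofile`; reading it back without extra arguments interprets the bytes as `float64` and returns a flat array, so the dtype and shape must be known and supplied by hand.

```python
import os
import tempfile

import numpy as np

original = np.arange(6, dtype=np.int32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw.bin")
    original.tofile(path)   # writes raw bytes only: no shape, no dtype

    # Default dtype is float64, so the 24 bytes are read as 3 meaningless floats.
    naive = np.fromfile(path)

    # Correct only if the reader already knows the dtype and the shape.
    explicit = np.fromfile(path, dtype=np.int32).reshape(2, 3)

print(naive.shape, naive.dtype)
print(np.array_equal(explicit, original))
```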

**Links to Reference Documentation**:
- [`ndarray.tofile`](https://numpy.org/devdocs/reference/generated/numpy.ndarray.tofile.html)
- [`numpy.fromfile`](https://numpy.org/devdocs/reference/generated/numpy.fromfile.html)

### Other formats

There are a number of well-known and popular formats for storing large arrays that can also be used together with NumPy, including

- [HDF5](https://www.hdfgroup.org/) (using [h5py](https://www.h5py.org/))
- [netCDF](https://www.unidata.ucar.edu/software/netcdf/)
- [bloscpack](https://github.com/Blosc/bloscpack)

As a general rule, you should try NumPy's built-in functionality first, and use other formats only when that does not satisfy your requirements. 

<a id="further_reading"></a>
## Further Reading

- ["How to save and load NumPy objects"](https://numpy.org/devdocs/user/absolute_beginners.html#how-to-save-and-load-numpy-objects) section from the *NumPy: the absolute basics for beginners* tutorial.
- [I/O with NumPy (User Documentation)](https://numpy.org/devdocs/user/basics.io.html): a tutorial focused on the `np.genfromtxt` function
- [Input and output (Reference Documentation)](https://numpy.org/devdocs/reference/routines.io.html): a comprehensive list of I/O routines from NumPy
- [NPY Binary format (Reference Documentation)](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)
- [Zarr Tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html)