# Missing data: masked arrays

When working with real oceanographic data sets, there are often gaps.  One way of handling them is to fill them with NaN, and then write functions that treat NaNs however we desire.  For example, we might want to calculate the mean of all values that are not NaN.  In some such cases, there is already a numpy function to do this type of calculation: numpy includes `nanmean`, `nanmax`, `nanmin`, `nanargmax`, `nanargmin`, `nanstd`, `nanvar`, and `nansum`.  The use of NaN as a bad value flag is typical in Matlab code.
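
As a small illustration of the NaN-based approach, the sketch below compares `np.mean` with `np.nanmean` on an array with one gap.  The temperature values are made up for the example.

```python
import numpy as np

# Made-up temperature record with one missing sample flagged as NaN.
temp = np.array([20.1, np.nan, 19.8, 20.4])

plain_mean = np.mean(temp)     # NaN propagates, so the result is NaN
gap_mean = np.nanmean(temp)    # NaN is ignored: mean of the 3 valid values

print(plain_mean, gap_mean)
```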

Numpy, however, provides an alternative way to handle missing data: the `numpy.ma` module, with its `MaskedArray` subclass of the fundamental `numpy.ndarray` class.  There are a few rough edges in `numpy.ma`, but it has some substantial advantages over relying on NaN, so I use it extensively.  It is well supported in matplotlib, and in the two Python modules that I recommend for working with netCDF data.

The most obvious advantage of using masked arrays is that they work with any type of data, not just floating point.  A second advantage is that by always carrying along a Boolean mask array, they often simplify calculations.
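
To make the first advantage concrete, here is a minimal sketch with made-up integer counts.  An integer ndarray cannot hold NaN, but an integer masked array can flag a bad value and still keep its integer dtype.

```python
import numpy as np

# Made-up integer counts; the second one (-999) is flagged as bad.
counts = np.ma.array([3, -999, 5, 4], mask=[False, True, False, False])

print(counts.dtype)    # still an integer dtype
print(counts.sum())    # the masked value is left out of the sum
print(counts.count())  # number of unmasked values
```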

Regardless of the degree to which you end up using masked arrays in your own code, you will encounter them, so you need to know at least a few things about them.

## The bare minimum: conversion to ndarray

Suppose you are using a library that reads a file (e.g., netCDF) and returns the results as a masked array if any missing values were encountered.  So, you don't know at the outset whether your function will be getting a masked array or an ndarray, you don't know whether it will be integer or floating point, and you want to do your work with NaN-filled ndarrays.  Here is what you can do:

In [2]:
import numpy as np

# make an example of an integer masked array:
x = np.ma.array([1, 100, 2, 3], mask=[False, True, False, False],
                dtype=int)
print("The input integer masked array is:")
print(x)

xnan = np.ma.filled(x.astype(float), np.nan)
print("Converted to double precision, with nan:")
print(xnan)


The input integer masked array is:
[1 -- 2 3]
Converted to double precision, with nan:
[  1.  nan   2.   3.]


We first used the `astype(float)` method call to generate a double precision array.  This method is available to ndarrays and to masked arrays, so it would work even if x were an ndarray.  Next, this floating point array is used as the first argument to the `np.ma.filled` function, which returns an ndarray of the same dtype, but with its second argument used to replace the masked values.  If its first argument is already an ndarray, `np.ma.filled` returns that argument unchanged.

There are other ways of accomplishing this nan-filled conversion, sometimes more efficiently (that is, without copying the data unnecessarily), but the method above is adequate for now.
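
If this conversion is needed in several places, it can be wrapped in a small helper.  The function name `to_nan_filled` below is hypothetical, not part of numpy; it only repeats the `astype(float)` and `np.ma.filled` steps described above, with `np.asanyarray` added so that plain sequences are accepted too.

```python
import numpy as np

def to_nan_filled(a):
    """Return a float ndarray with any masked values replaced by NaN.

    Hypothetical helper: accepts a masked array, an ndarray, or a sequence.
    """
    a = np.asanyarray(a)    # keeps a masked array as a masked array
    return np.ma.filled(a.astype(float), np.nan)

masked_in = np.ma.array([1, 100, 2], mask=[False, True, False])
plain_in = np.array([1, 2, 3])

out_masked = to_nan_filled(masked_in)
out_plain = to_nan_filled(plain_in)
print(out_masked)
print(out_plain)
```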

## Taking advantage of masked arrays

Now let's see how we can take advantage of masked arrays instead of immediately converting them into ndarrays.

If we have input that might be a masked array or an ndarray, possibly with NaN and/or inf values, we can start by converting, if necessary, to a masked array:

In [3]:
# sample input ndarray:
x = np.array([1.0, 2.5, np.nan, 1.3, np.inf, 7.2])
print("input array with bad values:")
print(x)

xm = np.ma.masked_invalid(x)
print("masked version:")
print(xm)

input array with bad values:
[ 1.   2.5  nan  1.3  inf  7.2]
masked version:
[1.0 2.5 -- 1.3 -- 7.2]


The masked array has nearly all of the methods that an ndarray has, and a few special ones of its own.  For example, to find out how many unmasked values it contains, there is the `count` method:

In [4]:
print("xm has", xm.count(), "unmasked values")

xm has 4 unmasked values


To extract an ndarray containing only the unmasked values, use the `compressed` method:

In [5]:
print("unmasked values are", xm.compressed())

unmasked values are [ 1.   2.5  1.3  7.2]


For both ndarrays and masked arrays, there are often functions that correspond to methods, and vice versa. An advantage of using methods is that they inherently "do the right thing": the method of a masked array includes functionality to deal with the mask.  With both methods and functions, it is not always obvious when the returned object will be an ndarray and when it will be a masked array, so when it matters it is wise to check, either with a test or by reading the documentation.
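
A short sketch of such a check, using the same sample values as above: the `mean` method and the `np.ma.mean` function give the same result, and `isinstance` shows which returned objects are masked arrays.

```python
import numpy as np

x = np.array([1.0, 2.5, np.nan, 1.3, np.inf, 7.2])
xm = np.ma.masked_invalid(x)

# The method and the np.ma function both skip the masked values.
mean_method = xm.mean()
mean_function = np.ma.mean(xm)

# Check what kind of object comes back when it matters.
shifted = xm + 1           # arithmetic result
valid = xm.compressed()    # unmasked values only

print(mean_method, mean_function)
print(isinstance(shifted, np.ma.MaskedArray))
print(isinstance(valid, np.ma.MaskedArray))
```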

Sometimes it is useful to extract the mask, perhaps to use for masking another variable.  Use the `np.ma.getmaskarray` function to get a full-size boolean mask corresponding to a given array, masked or not:

In [14]:
x = np.arange(12).reshape(3, 4)
print("sample ndarray, x:")
print(x)
print("\nnp.ma.getmaskarray(x):")
print(np.ma.getmaskarray(x))

sample ndarray, x:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

np.ma.getmaskarray(x):
[[False False False False]
 [False False False False]
 [False False False False]]
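
For the use case mentioned above, masking one variable wherever another is bad, a minimal sketch might look like this.  The variable names `temp` and `salt` and their values are invented for illustration.

```python
import numpy as np

# Invented profiles: temp has one gap, salt is a plain ndarray.
temp = np.ma.masked_invalid(np.array([10.2, np.nan, 9.7, 9.5]))
salt = np.array([34.9, 35.0, 35.1, 35.2])

# getmaskarray always gives a full-size Boolean array, so it can be
# passed as a mask even if temp had no masked values at all.
mask = np.ma.getmaskarray(temp)
salt_m = np.ma.array(salt, mask=mask)

print(salt_m)
```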


When a masked array is created with no masked values, its `mask` attribute does not contain a full Boolean array; this is to save space and time, in case it turns out that nothing ever needs to be masked:

In [15]:
xm = np.ma.arange(6).reshape(2, 3)
print("fresh masked array, xm:")
print(xm)
print("\nxm.mask is actually np.ma.nomask, but prints as:")
print(xm.mask)

fresh masked array, xm:
[[0 1 2]
 [3 4 5]]

xm.mask is actually np.ma.nomask, but prints as:
False
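
One way to see the difference is to compare the result of `np.ma.getmask`, which returns the stored mask as it is, with the result of `np.ma.getmaskarray`.  This is only a sketch of the check, not something you need in routine use.

```python
import numpy as np

xm = np.ma.arange(6).reshape(2, 3)

# getmask returns the stored mask, which is the nomask singleton here.
has_no_mask = np.ma.getmask(xm) is np.ma.nomask

# getmaskarray expands it to a full Boolean array of the same shape.
full_mask = np.ma.getmaskarray(xm)

print(has_no_mask)
print(full_mask.shape, full_mask.dtype)
```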


## Masking values in a masked array

We have already seen one way of ending up with masked values: using `np.ma.masked_invalid`.  There are many more, e.g.:

In [17]:
x = np.arange(10)
xm = np.ma.masked_greater(x, 5)
print(xm)

[0 1 2 3 4 5 -- -- -- --]
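
A few of the other `np.ma` masking functions, sketched on the same kind of input:

```python
import numpy as np

x = np.arange(10)

# Mask wherever an arbitrary Boolean condition is True.
odd_masked = np.ma.masked_where(x % 2 == 1, x)

# Mask values outside the closed interval [2, 6].
clipped = np.ma.masked_outside(x, 2, 6)

# Mask every occurrence of a single flag value.
no_zero = np.ma.masked_equal(x, 0)

print(odd_masked)
print(clipped)
print(no_zero)
```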


In [16]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We can include a mask at array creation time:

In [9]:
xm = np.ma.array([1, 2, 3], mask=[True, False, False])
print(xm)

[-- 2 3]


Mask a set of values in an existing masked array using indexing:

In [10]:
xm = np.ma.arange(5)
xm[[1, 3]] = np.ma.masked
print(xm)

[0 -- 2 -- 4]
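
The same assignment works with other kinds of index.  The sketch below uses a Boolean condition and a slice; the cutoff value 3 is arbitrary.

```python
import numpy as np

x = np.arange(5)
xm = np.ma.array(x)    # wrap x as a masked array with nothing masked

# Boolean indexing: mask every element where the condition holds.
xm[x >= 3] = np.ma.masked

# A slice works the same way.
xm[:1] = np.ma.masked

print(xm)
```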


## Math

For operations like addition and multiplication, a masked value acts like a NaN: the output is masked. Division is more interesting: division by zero yields a masked value, not an error:

In [11]:
x = np.ma.array([1, 2, 3], mask=[False, False, True])
y = np.ma.array([1, 0, 1])
print(x * y)
print(x/y)

[1 0 --]
[1.0 -- --]
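
In effect, the mask of a result is the union of the input masks, plus any positions where the operation itself is invalid.  A minimal sketch that inspects the masks directly, with made-up values:

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[True, False, False, False])
b = np.ma.array([0.0, 0.0, 5.0, 6.0], mask=[False, False, True, False])

total = a + b    # masked where either input is masked
ratio = a / b    # additionally masked where the divisor is zero

print(np.ma.getmaskarray(total))
print(np.ma.getmaskarray(ratio))
```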


Similarly, evaluating a function with arguments outside the domain of that function simply yields masked values:

In [12]:
print(np.ma.arcsin([0.8, 1, 1.5]))

[0.9272952180016123 1.5707963267948966 --]
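
Other `np.ma` math functions behave the same way.  As a further sketch, `np.ma.sqrt` and `np.ma.log` mask the inputs that fall outside their domains; the input values are chosen only to hit those cases.

```python
import numpy as np

vals = [4.0, -1.0, 0.0]

root = np.ma.sqrt(vals)    # the negative input is outside the domain
logv = np.ma.log(vals)     # zero and negative inputs are outside the domain

print(root)
print(logv)
```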
