numpy.unique on masked arrays #16972

pulkin · 2020-07-29T14:27:29Z

I have several 1D arrays of varying but comparable lengths to be merged (vstack) into a contiguous 2D array.
I merge them into a masked array where padding entries are masked out.
I then run np.unique(..., return_inverse=True) on the masked array.
The output is two arrays: a masked key array of unique entries, which may include a single masked padding entry, and a plain inverse array whose size matches the input.

I would expect the opposite: the key array should be a plain 1D array and the inverse should be masked. There are two separate issues here:

1. len(key) should be the number of unique entries. Right now it is not: the masked element (fill_value 999999 in the example below) may or may not be present, depending on whether anything is masked. This makes masking pretty much useless for np.unique: if I pass a masked array, I clearly want masked entries excluded from the keys. Otherwise I could just call np.unique(masked_array.data).
2. Given that np.unique is a transparent operation (i.e. I can run it on both plain arrays and masked arrays), I would expect transparent output. Even without knowing what unique does or what it is for, inverse should definitely be a masked array, because its elements correspond one-to-one to the input.

As a result of this inconsistency I have to (a) check whether anything has been masked at all, (b) conditionally drop the padding entry from the key output, and (c) apply a mask to the inverse output. Something like the following:

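import numpy as np

def masked_unique(a):
    # Set the fill value to one more than the largest data value (requires numeric data)
    a = np.ma.masked_array(data=a.data, mask=a.mask, fill_value=a.data.max() + 1)
    key, inverse = np.unique(a, return_inverse=True)
    if np.any(a.mask):
        # Drop the trailing masked entry (assumes it sorts last, as in the example below)
        key = key[:-1]
    return np.array(key), np.ma.masked_array(data=inverse, mask=a.mask)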

Strictly speaking, I could just as well run np.unique on the raw a.data in the example above. Either way, I end up doing pretty much all the work myself.
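
For comparison, the behavior asked for above (plain keys, masked inverse) can be sketched without handling the padding entry at all, by running np.unique only on the unmasked values. This is a minimal sketch, not NumPy's implementation; the name unique_ignoring_mask is hypothetical, and it assumes the inverse should be returned flattened.

```python
import numpy as np


def unique_ignoring_mask(a):
    """Hypothetical helper: plain unique keys plus a masked, flattened inverse."""
    a = np.ma.asarray(a)
    # getmaskarray returns a full boolean mask, even if a.mask is nomask
    mask = np.ma.getmaskarray(a).ravel()
    # compressed() returns the unmasked values as a plain 1D ndarray
    key, inverse_valid = np.unique(a.compressed(), return_inverse=True)
    # Placeholder 0 at masked positions; those entries are hidden by the mask
    inverse = np.zeros(mask.shape, dtype=np.intp)
    inverse[~mask] = inverse_valid
    return key, np.ma.masked_array(inverse, mask=mask)


a = np.ma.masked_array([1, 2, 3, 4], mask=[0, 1, 0, 0])
key, inverse = unique_ignoring_mask(a)
```

With this sketch, len(key) is the number of distinct unmasked values, and key[inverse.compressed()] reproduces a.compressed().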

Reproducing code example:

>>> import numpy as np
>>> a = np.ma.masked_array([1, 2, 3, 4], [0, 1, 0, 0])
>>> np.unique(a, return_inverse=True)
(masked_array(data=[1, 3, 4, --],
             mask=[False, False, False,  True],
       fill_value=999999), array([0, 3, 1, 2]))

Numpy/Python version information:

1.18.4 3.8.3 (default, May 29 2020, 00:00:00) 
[GCC 10.1.1 20200507 (Red Hat 10.1.1-1)]


rossbar · 2020-08-03T17:11:53Z

Note that there is a unique in the ma namespace as well (np.ma.unique), but the behavior (at least for this example) is the same.
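
A quick way to check this locally is to call np.ma.unique on the array from the report and strip any masked key with compressed(). This is only a sketch; the exact repr and the position of the masked key may differ between NumPy versions, so no output is shown.

```python
import numpy as np

a = np.ma.masked_array([1, 2, 3, 4], mask=[0, 1, 0, 0])

# np.ma.unique accepts return_index and return_inverse
key, inverse = np.ma.unique(a, return_inverse=True)

# compressed() drops any masked key and returns a plain ndarray
plain_key = key.compressed()
```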

pulkin · 2020-08-06T12:54:22Z

This is what I ended up with. It is easier to implement for numeric arrays because there, as far as I remember, the masked entry is always at the end. Unfortunately, I am working with char arrays, which behave differently (probably at the masked sort level).

import numpy as np

def masked_unique(a, return_inverse=False, fill_value=None):
    """
    A proper implementation of `np.unique` for masked arrays.

    Parameters
    ----------
    a : np.ma.masked_array
        The array to process.
    return_inverse : bool
        If True, returns the masked inverse.
    fill_value : int, optional
        Value stored in the data of `inverse` at masked positions.
        Defaults to the number of unique unmasked entries.

    Returns
    -------
    key : np.ndarray
        Unique entries.
    inverse : np.ma.masked_array, optional
        Integer masked array with the inverse.
    """
    key = np.unique(a, return_inverse=return_inverse)
    if return_inverse:
        key, inverse = key
        barrier = np.argwhere(key.mask)
        if len(barrier) > 0:
            barrier = barrier.squeeze()  # all indices after the barrier have to be shifted (char only?)
            inverse[inverse > barrier] -= 1  # shift everything after the barrier
            if fill_value is None:
                inverse[a.mask.reshape(-1)] = len(key) - 1  # shift masked stuff to the end
            else:
                inverse[a.mask.reshape(-1)] = fill_value
        inverse = np.ma.masked_array(data=inverse, mask=a.mask)
    key = key.data[~np.ma.getmaskarray(key)]  # getmaskarray also handles mask=nomask
    if return_inverse:
        return key, inverse
    else:
        return key
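
The index shifting above relies on there being a single masked key. A position-independent way to express the same idea is a lookup table that maps old key indices to new ones, sketched below. The helper name drop_masked_keys and the use of -1 as a placeholder are assumptions made for illustration, not part of NumPy.

```python
import numpy as np


def drop_masked_keys(key, inverse):
    """Hypothetical helper: remove masked keys and renumber the inverse."""
    keep = ~np.ma.getmaskarray(key)
    # New index of each kept key: number of kept keys up to and including it, minus one
    remap = np.cumsum(keep) - 1
    remap[~keep] = -1  # placeholder for masked keys (assumption of this sketch)
    new_inverse = remap[np.asarray(inverse)]
    plain_key = np.ma.getdata(key)[keep]
    # Entries that pointed at a masked key become masked in the inverse
    return plain_key, np.ma.masked_array(new_inverse, mask=new_inverse < 0)


# Masked key deliberately placed in the middle, the case reported above for char arrays
key = np.ma.masked_array(["a", "x", "c"], mask=[0, 1, 0])
inverse = np.array([0, 1, 2, 2, 0])
plain_key, new_inverse = drop_masked_keys(key, inverse)
```

Because the table is built from the key mask itself, the sketch does not depend on where the masked key ends up after sorting.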

rossbar added the component: numpy.ma masked arrays label Aug 3, 2020
