numpy.unique on masked arrays #16972

Open
pulkin opened this issue Jul 29, 2020 · 2 comments
Labels
component: numpy.ma masked arrays

Comments


pulkin commented Jul 29, 2020

  • I have several 1D arrays of varying but comparable lengths that I want to merge (vstack) into a contiguous 2D array.
  • I merge them into a masked array in which the padding entries are masked out.
  • I then simply run np.unique(a, return_inverse=True) on the masked array.
  • The output is two arrays: a masked key array of unique entries, which may include a single masked padding entry, and a plain inverse array whose size matches the input.
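
For context, here is a minimal sketch of the padding step described above, assuming numeric inputs. The helper name stack_ragged is hypothetical and not part of NumPy; it zero-fills the padding and masks it out.

```python
import numpy as np

def stack_ragged(arrays):
    """Stack 1D arrays of different lengths into a 2D masked array.

    Hypothetical helper: padding entries are zero-filled and masked.
    """
    arrays = [np.asarray(x) for x in arrays]
    width = max(len(x) for x in arrays)
    dtype = np.result_type(*arrays)
    data = np.zeros((len(arrays), width), dtype=dtype)
    mask = np.ones((len(arrays), width), dtype=bool)  # start fully masked
    for i, x in enumerate(arrays):
        data[i, :len(x)] = x       # copy the real entries
        mask[i, :len(x)] = False   # and unmask them
    return np.ma.masked_array(data, mask=mask)

a = stack_ragged([[1, 2, 3], [4], [5, 6]])
```

Only the first len(x) entries of each row are unmasked, so a.compressed() returns the original values in row order.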

I would expect the opposite: the key array should be a plain 1D array and the inverse should be masked. There are two separate issues here:

  • len(key) should be the number of unique entries. Right now it is not: the masked element (fill value 999999 in the example below) may or may not be present, depending on whether anything is masked. This makes masking pretty much useless for np.unique: if I pass a masked array, I clearly want to keep masked entries out of the key array. Otherwise I could just as well call np.unique(masked_array.data).
  • Given that np.unique is a transparent operation (i.e. I can run it on both plain arrays and masked arrays), I would expect transparent output. Even without knowing what unique does or what it is for, inverse should definitely be a masked array, because its elements correspond one-to-one to the input.

As a result of this inconsistency I have to (a) check whether anything has been masked at all, (b) conditionally drop the padding entry from the key output, and (c) apply a mask to the inverse output. Something like the following:

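import numpy as np

def masked_unique(a):
    # Rebuild the array with a fill value larger than any data value.
    a = np.ma.masked_array(data=a.data, mask=a.mask, fill_value=a.data.max() + 1)
    key, inverse = np.unique(a, return_inverse=True)
    if np.any(a.mask):
        # Drop the masked padding entry, assumed to be sorted last.
        key = key[:-1]
    # Plain key array; the inverse carries the input mask.
    return np.array(key), np.ma.masked_array(data=inverse, mask=a.mask)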

Strictly speaking, I could equally run np.unique on the raw a.data in the above example to fix this. I end up doing pretty much all of the work myself.
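
A possible alternative workaround, sketched under the assumption that masked entries should simply be ignored: run np.unique only on the unmasked values and scatter the resulting indices back into a masked array of the input shape. The name masked_unique_compressed is hypothetical; this is not how np.unique or np.ma.unique behave, only an illustration of the output requested in this issue.

```python
import numpy as np

def masked_unique_compressed(a):
    """Unique over unmasked entries only (hypothetical helper).

    Returns a plain key array and a masked inverse shaped like `a`.
    """
    a = np.ma.asarray(a)
    mask = np.ma.getmaskarray(a)   # full boolean mask, even if nothing is masked
    data = np.ma.getdata(a)
    # np.unique sees a plain 1D ndarray here, so no masked entry can reach the keys.
    key, inv_valid = np.unique(data[~mask], return_inverse=True)
    # Masked positions hold a placeholder 0 and stay hidden behind the mask.
    inverse = np.zeros(a.shape, dtype=inv_valid.dtype)
    inverse[~mask] = inv_valid
    return key, np.ma.masked_array(inverse, mask=mask)

key, inverse = masked_unique_compressed(np.ma.masked_array([1, 2, 3, 4], [0, 1, 0, 0]))
```

Here len(key) is always the number of unique unmasked values, and key[inverse.compressed()] reproduces the unmasked input entries, regardless of where a masked sort would have placed the masked element.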

Reproducing code example:

>>> import numpy as np
>>> a = np.ma.masked_array([1, 2, 3, 4], [0, 1, 0, 0])
>>> np.unique(a, return_inverse=True)
(masked_array(data=[1, 3, 4, --],
             mask=[False, False, False,  True],
       fill_value=999999), array([0, 3, 1, 2]))

Numpy/Python version information:

1.18.4 3.8.3 (default, May 29 2020, 00:00:00) 
[GCC 10.1.1 20200507 (Red Hat 10.1.1-1)]
@rossbar rossbar added the component: numpy.ma masked arrays label Aug 3, 2020
Contributor

rossbar commented Aug 3, 2020

Note that there is also a unique in the ma namespace (np.ma.unique), but its behavior (at least for this example) is the same.

Author

pulkin commented Aug 6, 2020

This is what I ended up with. It is easier to implement for numeric arrays because there, as far as I remember, the masked entry is always at the end. Unfortunately, I am working with char arrays, which behave differently (probably at the masked sort level).

import numpy as np

def masked_unique(a, return_inverse=False, fill_value=None):
    """
    A proper implementation of `np.unique` for masked arrays.

    Parameters
    ----------
    a : np.ma.masked_array
        The array to process.
    return_inverse : bool
        If True, returns the masked inverse.
    fill_value : int, optional
        Value stored at the masked positions of the inverse array.

    Returns
    -------
    key : np.ndarray
        Unique entries.
    inverse : np.ma.masked_array, optional
        Integer masked array with the inverse.
    """
    key = np.unique(a, return_inverse=return_inverse)
    if return_inverse:
        key, inverse = key
        barrier = np.argwhere(key.mask)
        if len(barrier) > 0:
            barrier = barrier.squeeze()  # all indices after the barrier have to be shifted (char only?)
            inverse[inverse > barrier] -= 1  # shift everything after the barrier
            if fill_value is None:
                inverse[a.mask.reshape(-1)] = len(key) - 1  # shift masked stuff to the end
            else:
                inverse[a.mask.reshape(-1)] = fill_value
        inverse = np.ma.masked_array(data=inverse, mask=a.mask)
    key = key.compressed()  # unmasked unique entries as a plain 1D ndarray, also when no mask is set
    if return_inverse:
        return key, inverse
    else:
        return key
