Skip to content

Discussion: Use of masked arrays internally #31008

@scottshambaugh

Description

@scottshambaugh

Chatting with @timhoffm in #31004 (comment), I'd like to more widely discuss our use of masked arrays in internal code. My position is that they add a lot of maintenance headache and fragility, and we would be best off excising them from the codebase altogether.

Currently we have 185 counts of np.ma in the core library. Most of them are pass-through checks, or are immediately .filled() with nans.

Problems

  • Internal APIs do not usually document whether they expect or can handle masked arrays. It's easy to miss needing to handle the masked array case, contributors may not be familiar with them, and numpy functions still work so we can fail silently
  • Lots of code clutter and specific paths needed to handle masked arrays
  • Poor test coverage of masked array inputs/outputs
  • Performance degradation from passing around extra data and not being able to rely on more efficient array operations.
    Note also that a = np.ma.masked_array(a, mask=mask).filled() is slower than a = np.where(mask, np.nan, a) is slower than a[mask] = np.nan.

Proposal

  • Replace all instances of masked arrays with float arrays filled with NaNs, or in instances where the underlying data needs to be preserved (eg when clipping) then use an explicitly defined boolean mask variable
  • Sanitize all user inputs to pure numpy arrays and do not return masked arrays
  • Document a dev policy to minimize internal use of masked arrays, and to never return them

Migration issues

  • cbook.safe_masked_invalid() is used a few places and causes some other functions to return a masked array. I haven't traced this all the way through to see where those come out
  • All colors.Normalize subclasses preserve input masks in their output (this is the only docstring I see with a masked array output documented for the public API)

Deprecation warnings might be difficult to raise without spamming regular uses of these functions, since we can't tell when users use the mask on returned arrays. We may need to accept some silent breakage and just make the release notes clear.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions