ENH/PERF: allow mask to be optional in our masked ExtensionArrays #30435

Open
jorisvandenbossche opened this issue Dec 23, 2019 · 3 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 23, 2019

Our nullable, mask-based ExtensionArrays (currently integer and boolean, inheriting from MaskedArray) store two numpy arrays under the hood: _data and _mask. So we use a numpy boolean array (8 bits per element) as the mask, even when there are no missing values.
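To put a rough number on that cost, here is a small sketch using plain numpy arrays as stand-ins for _data and _mask. The variable names and the array size are assumptions for illustration only; this does not touch pandas internals.

```python
import numpy as np

n = 1_000_000

# Stand-ins for the _data and _mask arrays described above.
data = np.arange(n, dtype=np.int64)
mask = np.zeros(n, dtype=bool)  # all False: no missing values

data_bytes = data.nbytes  # 8 bytes per int64 element
mask_bytes = mask.nbytes  # 1 byte per element, even though nothing is masked

overhead = mask_bytes / data_bytes
print(data_bytes, mask_bytes, overhead)
```

For int64 data the all-False mask adds one eighth on top of the values. For boolean data, where the values are themselves one byte per element, a mask of the same length would double the footprint.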

One relatively easy memory and performance improvement could be achieved by allowing the mask to be None when there is no missing data. Since the mask is completely internal to the Array implementations, this should be possible to do.

(To be checked: how involved the ops code would become if it has to handle the mask as optional.)
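As a rough idea of what "mask is None" could mean for the ops code, here is a minimal sketch. OptionalMaskArray and combine_masks are hypothetical names invented for this illustration; this is not how pandas implements its masked arrays.

```python
from typing import Optional

import numpy as np


def combine_masks(a: Optional[np.ndarray], b: Optional[np.ndarray]) -> Optional[np.ndarray]:
    """Combine two optional masks; None stands for 'no missing values'."""
    if a is None and b is None:
        return None          # result has no missing values either
    if a is None:
        return b.copy()      # copy so the result does not alias an operand
    if b is None:
        return a.copy()
    return a | b


class OptionalMaskArray:
    """Hypothetical toy class, for illustration only."""

    def __init__(self, data: np.ndarray, mask: Optional[np.ndarray] = None):
        self._data = data
        self._mask = mask    # None means no mask has been allocated

    def isna(self) -> np.ndarray:
        # Materialize an all-False mask only when a caller asks for one.
        if self._mask is None:
            return np.zeros(len(self._data), dtype=bool)
        return self._mask.copy()

    def __add__(self, other: "OptionalMaskArray") -> "OptionalMaskArray":
        return OptionalMaskArray(
            self._data + other._data,
            combine_masks(self._mask, other._mask),
        )


a = OptionalMaskArray(np.array([1, 2, 3]))
b = OptionalMaskArray(np.array([10, 20, 30]), np.array([False, True, False]))
c = a + b
```

Every binary op would need a branch like the one in combine_masks, which is the extra complexity the note above asks about.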

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance ExtensionArray Extending pandas with custom dtypes or arrays. labels Dec 23, 2019
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Dec 23, 2019
@jorisvandenbossche jorisvandenbossche added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Jan 24, 2020
@jorisvandenbossche jorisvandenbossche changed the title ENH/PERF: allow mask to be None in our masked ExtensionArrays ENH/PERF: allow mask to be optional in our masked ExtensionArrays Jan 24, 2020
@jorisvandenbossche jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Jan 30, 2020
@mzeitlin11
Member

@jorisvandenbossche @jbrockmendel @jreback I'm planning to start looking into this, but want to check first that there isn't a more recently discussed preferred solution.

@jbrockmendel
Member

Not that I'm aware of.

@jbrockmendel
Member

I think this suffers from the same invalidation problem as DTA/TDA.freq (DatetimeArray/TimedeltaArray), xref #31218. Consider:

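import pandas as pd

arr = pd.array([1, 2, 3])  # nullable integer array, no missing values yet
arr2 = arr[1:]             # slice of arr
arr2[1] = pd.NA            # the first NA is introduced through the slice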

If a mask isn't allocated until the value is set into arr2, then we need to make sure to allocate a mask for arr as well. Certainly possible, but not for the faint of heart.
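The problem can be reproduced with a toy class. LazyMaskArray and set_na are hypothetical names made up for this sketch, and the class only supports slice indexing; it is an assumption about how a naive lazy-mask design might behave, not pandas code.

```python
import numpy as np


class LazyMaskArray:
    """Hypothetical toy class: the mask is allocated on the first NA assignment."""

    def __init__(self, data, mask=None):
        self._data = data
        self._mask = mask  # None until a missing value is set

    def __getitem__(self, key):
        # Only slices are handled here; numpy slices are views on the parent.
        mask = None if self._mask is None else self._mask[key]
        return LazyMaskArray(self._data[key], mask)

    def set_na(self, index):
        if self._mask is None:
            # Naive lazy allocation: a fresh mask that only this object knows about.
            self._mask = np.zeros(len(self._data), dtype=bool)
        self._mask[index] = True

    def isna(self):
        if self._mask is None:
            return np.zeros(len(self._data), dtype=bool)
        return self._mask


arr = LazyMaskArray(np.array([1, 2, 3]))
arr2 = arr[1:]     # shares the data buffer with arr, but there is no mask to share
arr2.set_na(1)     # allocates a mask for arr2 only

print(arr2.isna().tolist())  # [False, True]
print(arr.isna().tolist())   # [False, False, False]
```

In this sketch arr never learns that its last element was marked missing through arr2, because the mask was created after the slice was taken. Had both objects shared one mask from the start, the assignment would have been visible in both. Avoiding that would mean the slice has to keep track of its parent, or the parent of its slices, so that the mask can be allocated on both sides.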

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022