Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: IntegerArray/FloatingArray constructors mismatched NAs #44514

Merged
merged 17 commits into from Dec 1, 2021

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Nov 18, 2021

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

Same yak-shaving as #44495 (which turned out to be a dead end for this particular yak, but still a perf bump)

pandas/_libs/missing.pyx Outdated Show resolved Hide resolved
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Nov 20, 2021
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setitem change / bug fix is unrelated to the constructor fix? Or it's because you are testing that through setitem as well?

mask2 = isna(values)
if not (mask == mask2).all():
# e.g. if we have a timedelta64("NaT")
raise TypeError(f"{values.dtype} cannot be converted to a FloatingDtype")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, could it be libmissing.is_numeric_na that already raises on encountering a "non-numeric NA"? (is there a use case for is_numeric_na to not be strict about this, i.e. to get a "partial" mask?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you planning to address this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

planning to address in a follow-up

pandas/core/arrays/integer.py Show resolved Hide resolved
@jbrockmendel
Copy link
Member Author

The setitem change / bug fix is unrelated to the constructor fix? Or it's because you are testing that through setitem as well?

The setitem bug was identified first and the cause tracked back to the constructor.

@jreback jreback added this to the 1.4 milestone Nov 21, 2021
@jreback
Copy link
Contributor

jreback commented Nov 21, 2021

ok to merge, @jorisvandenbossche comments could be afollow up (or here is ok too)

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Nov 21, 2021 via email

@@ -357,6 +364,31 @@ def test_setitem_series(self, data, full_indexer):
)
self.assert_series_equal(result, expected)

def test_setitem_frame_2d_values(self, data, using_array_manager, request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this test out of the extension base tests, or remove the need to use using_array_manager? (this is not defined by external users of those tests, and would be a bit annoying to replicate)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

pandas/tests/frame/indexing/test_indexing.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member Author

gentle ping; this is a blocker for fixing a bug in Series.where, which in turn should allow us to share some more Block methods.

df = pd.DataFrame({"A": data})

# Avoiding using_array_manager fixture
# https://github.com/pandas-dev/pandas/pull/44514#discussion_r754002410
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for changing to not use the fixture. But generally, is this actually needed to have as base extension test? (since the fix was inside the Blocks code, it's not really testing a specific behaviour that the EA needs to have?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually needed to have as base extension test?

This seems like the best way to systematically test it for all EA dtypes.

it's not really testing a specific behaviour that the EA needs to have?

That seems like it applies to most tests that aren't directly testing EA methods.

@jorisvandenbossche
Copy link
Member

Two questions, but feel free to merge as well

@jbrockmendel
Copy link
Member Author

updated to raise inside is_numeric_na

@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

@jorisvandenbossche if you want to look or can merge

@jbrockmendel
Copy link
Member Author

i think comments have all been addressed here? bugfix follow-up is ready.

@jorisvandenbossche
Copy link
Member

@jbrockmendel
Copy link
Member Author

Yah, looks like a lot of time is being taken in is_numeric_na.

from asv_bench.benchmarks.groupby import *
self = Cumulative()
self.setup("Float64", "cumsum")

%timeit self.time_frame_transform("Float64", "cumsum")

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.791    1.791 {built-in method builtins.exec}
        1    0.000    0.000    1.791    1.791 <string>:1(<module>)
       10    0.014    0.001    1.791    0.179 groupby.py:569(time_frame_transform)
       10    0.000    0.000    1.773    0.177 generic.py:1179(transform)
       10    0.000    0.000    1.773    0.177 groupby.py:1608(_transform)
       10    0.000    0.000    1.773    0.177 groupby.py:3117(cumsum)
       10    0.000    0.000    1.773    0.177 generic.py:1103(_cython_transform)
       10    0.000    0.000    1.762    0.176 managers.py:1266(grouped_reduce)
       50    0.000    0.000    1.762    0.035 blocks.py:381(apply)
       50    0.000    0.000    1.758    0.035 generic.py:1118(arr_func)
       50    0.000    0.000    1.758    0.035 ops.py:919(_cython_operation)
       50    0.000    0.000    1.699    0.034 ops.py:587(cython_operation)
       50    0.001    0.000    1.697    0.034 ops.py:319(_ea_wrap_cython_operation)
       50    0.000    0.000    1.559    0.031 ops.py:376(_reconstruct_ea_result)
       50    0.000    0.000    1.558    0.031 floating.py:263(_from_sequence)
       50    0.001    0.000    1.557    0.031 floating.py:85(coerce_to_array)
       50    1.551    0.031    1.551    0.031 {pandas._libs.missing.is_numeric_na}
       50    0.000    0.000    0.134    0.003 ops.py:434(_cython_op_ndim_compat)

Looks like we are calling is_numeric_na in cases where we have float64 dtype, so can just use np.isnan. Should be an easy patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants