CI/TST: Don't require length for construct_1d_arraylike_from_scalar cast to float64 #47393

Merged: 11 commits into pandas-dev:main on Jun 22, 2022

mroeschke
Member

@mroeschke mroeschke commented Jun 16, 2022

@phofl
Member

phofl commented Jun 17, 2022

Most of the remaining ones look like things we want to change anyway? Saw one test that was not supposed to raise a FutureWarning

@mroeschke mroeschke added this to the 1.4.3 milestone Jun 17, 2022
@mroeschke mroeschke added the Compat pandas objects compatability with Numpy or Python functions label Jun 17, 2022
@mroeschke
Member Author

> Most of the remaining ones look like things we want to change anyway? Saw one test that was not supposed to raise a FutureWarning

Correct, the numpy RuntimeWarnings align with our 1.4 deprecation of converting np.nan to i8 dtype: #45136

The additional length change should also be backwards compatible, so I think these changes can be backported to 1.4.3 and be compatible with numpy 1.24

if is_integer_dtype(dtype) and isna(value):
    if not length:
        # GH 47391: numpy > 1.24 will raise filling np.nan into int dtypes
        return np.array([], dtype=dtype)
Member

this seems like a weird case. how does it happen? is it clear that we'd want to prioritize the dtype as being "right" instead of the value?

Member Author

I posted an example of the state that's reached in numpy/numpy#21784

Namely, before this change, length=0 would fall through all the if checks to np.empty(0, dtype=integer).fill(np.nan), which in numpy < 1.24 just leaves an empty integer array, equivalent to np.array([], dtype=integer), but in numpy >= 1.24 will raise
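
A small probe of the call described above, assuming only NumPy is installed. The helper name probe_empty_fill is made up for illustration, and which branch runs depends on the installed NumPy version, so no particular outcome is claimed here.

```python
import numpy as np


def probe_empty_fill(dtype="int64"):
    """Report what filling an empty integer array with NaN does on this NumPy."""
    arr = np.empty(0, dtype=dtype)
    try:
        # ndarray.fill works in place and returns None.
        arr.fill(np.nan)
    except Exception as exc:  # the exact exception type is not assumed here
        return "raised", type(exc).__name__
    # No error: the array is still empty and keeps its integer dtype.
    return "ok", str(arr.dtype)


print(probe_empty_fill())
```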

Contributor

I think the question (also for me from NumPy!) is which pandas code runs into this path. For pandas, the question is what the actual result should be (in the future). For me the question is how bad it will be if that code path breaks. Because especially if it is bad, we may want to make sure it doesn't break yet (from within NumPy).

(At this point I suspect that at least the NaN case may need a work-around in NumPy as well.)

Member Author

So an example test where this is hit is

        s = Series([], index=pd.date_range(start="2018-01-01", periods=0), dtype=int)
        result = s.apply(lambda x: x)
        tm.assert_series_equal(result, s)

so I think when operating over these empty Series/DataFrames, the value representation is np.nan when construct_1d_arraylike_from_scalar(np.nan, length=0, dtype=dtype) is called.

construct_1d_arraylike_from_scalar has an if length and is_integer_dtype(dtype) and isna(value) condition so that integer dtypes are not coerced to float64 for these empty Series/DataFrames (because we relied on np.empty(0, dtype=integer).fill(np.nan) == np.array([], dtype=integer)).
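
The dtype decision described in this comment can be sketched roughly as follows. This is a simplified, NumPy-only stand-in, not the pandas implementation: choose_dtype is a hypothetical name, and the NaN check only covers float NaN rather than everything pandas' isna recognises.

```python
import numpy as np


def choose_dtype(value, length, dtype):
    """Hypothetical, simplified version of the dtype decision discussed above."""
    dtype = np.dtype(dtype)
    value_is_nan = isinstance(value, float) and np.isnan(value)
    if length and np.issubdtype(dtype, np.integer) and value_is_nan:
        # A non-empty integer array cannot hold NaN, so fall back to float64.
        return np.dtype("float64")
    # Empty result (or no NaN): nothing forces a cast, keep the requested dtype.
    return dtype


print(choose_dtype(np.nan, 0, "int64"))
print(choose_dtype(np.nan, 3, "int64"))
```

With the length check in place, a zero-length request keeps its integer dtype, while a non-empty one is widened to float64.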

Member Author

So for these empty-like cases in pandas, it appears pandas was relying on np.empty(0, dtype=integer).fill(np.nan) to preserve integer dtypes. Having pandas explicitly preserve the dtype for these empty cases, e.g. np.array([], dtype=integer), is an okay change to make on our end IMO.

Member

sgtm. (but why not keep this bit unchanged and just put the subarr.fill(value) on L1715 inside an if length: to skip it for all zero-length arrays?)

Member Author

Ah sure I can make the change there instead
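
The suggestion above, guarding the fill rather than special-casing the dtype check, might look like the sketch below. The function name empty_then_fill is hypothetical and the surrounding pandas logic is omitted; only the subarr.fill(value) call and the if length: guard come from the discussion.

```python
import numpy as np


def empty_then_fill(value, length, dtype):
    """Hypothetical sketch: allocate, then fill only when there is something to fill."""
    subarr = np.empty(length, dtype=dtype)
    if length:
        # Zero-length arrays skip this call, so NaN is never cast to an integer.
        subarr.fill(value)
    return subarr


empty = empty_then_fill(np.nan, 0, "int64")
print(empty.shape, empty.dtype)
```

Because the guard applies to every zero-length array, it does not depend on which value or dtype would have triggered the error.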

@mroeschke mroeschke marked this pull request as ready for review June 17, 2022 20:18
@jreback
Contributor

jreback commented Jun 18, 2022

 =================================== FAILURES ===================================
______________ TestMergeDtypes.test_merge_on_ints_floats_warning _______________
[gw1] linux -- Python 3.10.4 /usr/share/miniconda/envs/test/bin/python
self = <pandas.tests.reshape.merge.test_merge.TestMergeDtypes object at 0x7f2644979420>
    def test_merge_on_ints_floats_warning(self):
        # GH 16572
        # merge will produce a warning when merging on int and
        # float columns where the float values are not exactly
        # equal to their int representation
        A = DataFrame({"X": [1, 2, 3]})
        B = DataFrame({"Y": [1.1, 2.5, 3.0]})
        expected = DataFrame({"X": [3], "Y": [3.0]})
        with tm.assert_produces_warning(UserWarning):
            result = A.merge(B, left_on="X", right_on="Y")
            tm.assert_frame_equal(result, expected)
        with tm.assert_produces_warning(UserWarning):
            result = B.merge(A, left_on="Y", right_on="X")
            tm.assert_frame_equal(result, expected[["Y", "X"]])
        # test no warning if float has NaNs
        B = DataFrame({"Y": [np.nan, np.nan, 3.0]})
>       with tm.assert_produces_warning(None):

this is failing

"are not equal to their int representation.",
UserWarning,
)
# GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int
Contributor

for 1.5 we ought to actually remove the nans first

Member Author

So in 1.5, add a deprecation noting that nans will be dropped?

Contributor

no i mean i think u can remove the nans before comparing to avoid the warning (this is all internal anyhow)

Member Author

Ah gotcha. Yeah can clean this for 1.5 in a separate PR
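
One way the "remove the nans before comparing" idea could be sketched, using NumPy only. The helper floats_equal_int_repr is hypothetical and is not the merge code in pandas; it also assumes the remaining values are finite, since infinities are not handled.

```python
import numpy as np


def floats_equal_int_repr(values):
    """Hypothetical helper: True if every non-NaN float equals its integer cast."""
    arr = np.asarray(values, dtype="float64")
    # Drop NaNs first so the float -> int cast never sees them.
    non_nan = arr[~np.isnan(arr)]
    return bool((non_nan == non_nan.astype("int64")).all())


print(floats_equal_int_repr([np.nan, np.nan, 3.0]))
print(floats_equal_int_repr([1.1, 2.5, 3.0]))
```

Filtering first means no NaN reaches the integer cast, so there is nothing for NumPy to warn about in that step.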

@@ -1696,7 +1696,7 @@ def construct_1d_arraylike_from_scalar(

     else:

-        if length and is_integer_dtype(dtype) and isna(value):
+        if is_integer_dtype(dtype) and isna(value):
Member

I think we still need the length here, as this part of the code is the logic that determines the dtype

Member

i.e. revert to original.

@simonjayhawkins
Member

will merge later today if no objections

Contributor

@jreback jreback left a comment

lgtm

@simonjayhawkins simonjayhawkins merged commit 2f3ac16 into pandas-dev:main Jun 22, 2022
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jun 22, 2022
…uct_1d_arraylike_from_scalar cast to float64
@simonjayhawkins
Member

Thanks @mroeschke

@mroeschke mroeschke deleted the ci/fix/numpy-dev branch June 22, 2022 16:47
simonjayhawkins added a commit that referenced this pull request Jun 22, 2022
…construct_1d_arraylike_from_scalar cast to float64) (#47460)

* Backport PR #47393: CI/TST: Don't require length for construct_1d_arraylike_from_scalar cast to float64

Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Labels
Compat pandas objects compatability with Numpy or Python functions
Development

Successfully merging this pull request may close these issues.

CI: nighlty numpy broke ci
6 participants