Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: treat nan-objects the same way float64-nans are treated - all na… #22207

Merged

Conversation

realead
Copy link
Contributor

@realead realead commented Aug 5, 2018

…ns are from the same equivalence class (GH22119)

@realead
Copy link
Contributor Author

realead commented Aug 5, 2018

asv_benchmark:
asv continuous -f 1.01 upstream/master HEAD -b ^series_methods -b ^algorithms
....                                                           
       before           after         ratio
     [d30c4a06]       [e583a0c9]
-      2.14±0.04s         672±10μs     0.00  series_methods.IsInForObjects.time_isin_nans

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

series_methods.IsInForObjects.time_isin_nans is a new test showing a test case which leads to O(n^2) for hashmap generation with the current implementation.

@realead realead closed this Aug 5, 2018
@realead realead reopened this Aug 5, 2018
@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 6, 2018
@gfyoung gfyoung requested a review from jreback August 6, 2018 04:37
@codecov
Copy link

codecov bot commented Aug 8, 2018

Codecov Report

Merging #22207 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22207      +/-   ##
==========================================
+ Coverage   92.06%   92.06%   +<.01%     
==========================================
  Files         169      169              
  Lines       50694    50698       +4     
==========================================
+ Hits        46671    46676       +5     
+ Misses       4023     4022       -1
Flag Coverage Δ
#multiple 90.47% <ø> (ø) ⬆️
#single 42.32% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 97.15% <0%> (ø) ⬆️
pandas/io/common.py 70.65% <0%> (ø) ⬆️
pandas/core/internals/blocks.py 94.45% <0%> (ø) ⬆️
pandas/core/missing.py 91.7% <0%> (+0.04%) ⬆️
pandas/core/indexes/base.py 96.43% <0%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 776fed3...5cc5e0f. Read the comment docs.

@@ -1661,9 +1662,13 @@ def test_isin_nan_common_object(self, nulls_fixture, nulls_fixture2):
# Test cartesian product of null fixtures and ensure that we don't
# mangle the various types (save a corner case with PyPy)

if PYPY and nulls_fixture is np.nan: # np.nan is float('nan') on PyPy
# all nans are the same
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this test seems now very restrictive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you remove the PYPY on purpose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mattip can you comment on the equivalence of nulls here? we don't test on PYPY, can you see if this still fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback After the fix the behavior is less subtle:

Index([nulls_fixture]).isin([nulls_fixture2])

is always True as long as nulls_fixture and nulls_fixture2 are nans, no matter whether those are the same nan-objects or different nan-objects. Thus PyPy is no longer a special case. But I must confess, I didn't test with PyPy and assumed this would be done somewhere in CI.

It seems as if the purpose of this test is to check, that the identity of the object is preserved (something similar to bug #22160). In this case np.nan is no longer a good tool to do this check, because isin bcomes agnostic to the exact identity of the nan-object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will check it on PyPy, thanks for the heads up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this particular test in its 25 variants passes on PyPy. There are other failures but they seem unrelated.

@jreback jreback added this to the 0.24.0 milestone Aug 8, 2018
@jreback jreback merged commit 9b18811 into pandas-dev:master Aug 9, 2018
@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

thanks @realead nice patch
and thanks @mattip for testing

@realead realead deleted the fix_hashmap_for_nan_objects_GH22119 branch August 9, 2018 19:34
realead added a commit to realead/pandas that referenced this pull request Aug 9, 2018
realead added a commit to realead/pandas that referenced this pull request Aug 12, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Aug 21, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Aug 23, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 5, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 8, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 12, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 15, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 17, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
4 participants