Skip to content

Conversation

realead
Copy link
Contributor

@realead realead commented Jul 13, 2018

The following issues

  1. hash(0.0) != hash(-0.0)
  2. hash(x) != hash(y) for different x,y which are nans

are solved by setting:

  1. hash(-0.0):=hash(0.0)
  2. hash(x):=hash(np.nan) for every x which is nan

@jbrockmendel
Copy link
Member

Test failure appears unrelated. Can you push again to re-run it?

@realead realead force-pushed the hash_for_float64_GH21866 branch from 0d7fe27 to 2cec96e Compare July 14, 2018 05:16
@codecov
Copy link

codecov bot commented Jul 14, 2018

Codecov Report

Merging #21904 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21904   +/-   ##
=======================================
  Coverage   91.99%   91.99%           
=======================================
  Files         167      167           
  Lines       50578    50578           
=======================================
  Hits        46530    46530           
  Misses       4048     4048
Flag Coverage Δ
#multiple 90.4% <ø> (ø) ⬆️
#single 42.17% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 322dbf4...124b095. Read the comment docs.

@realead realead force-pushed the hash_for_float64_GH21866 branch from 2cec96e to 94b7087 Compare July 14, 2018 13:41
@jreback
Copy link
Contributor

jreback commented Jul 14, 2018

can you run an asv and report any anomalies

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have any user level visible effects?
can you add a whatsnew note (0.24.0), depending on your answer above either API changes or other Enhancements section.

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jul 14, 2018
@realead
Copy link
Contributor Author

realead commented Jul 14, 2018

@jbrockmendel No, it was a problem in the one of the added test cases: somewhat naively 16GB memory were reserved (but not committed!), yet obviously different operating systems with different resources reacts differently to such a request.

It is a little bit strange, that the testing just died and didn't recover from that...

I removed these silly tests and now it looks better (at least it is clear, what goes wrong).

#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)

// correct for all inputs but not -0.0 and NaNs
#define kh_float64_hash_func_0_NAN(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a blank between cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Sorry for the silly question: Do you expect me to add a new commit with the improvements to the branch and you will fixup it when merging or should I amend the current commit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally push new commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to add a whatsnew in any event


// correct for all
#define kh_float64_hash_func(key) ((key) != (key) ? \
kh_float64_hash_func_NAN(Py_NAN) : \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about Py_NAN:

  1. Must the case of Py_NO_NAN be taken into account?
  2. There is PANDAS_NAN, but here Py_NAN didn't require additional includes.

PS: NAN from math.h isn't defined for some plattforms.

- Invalid construction of ``IntervalDtype`` will now always raise a ``TypeError`` rather than a ``ValueError`` if the subdtype is invalid (:issue:`21185`)
- Trying to reindex a ``DataFrame`` with a non unique ``MultiIndex`` now raises a ``ValueError`` instead of an ``Exception`` (:issue:`21770`)
-
- :class:`Float64HashTable` handles zeros/signed zeros and all flavors of NaNs consistently: it is no longer possible to have both, zero and signed-zero, as keys at the same time in a table, also there can be at most one NaN-key in a table (:issue:`21866`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not public, my question below was whether this has a public change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fixes the bug #21866, i.e. produces right results for some esoteric corner cases. This change of behavior can be observed by the end-user, but is this then a public change worth mentioning?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#21866 (comment)

is this the public case? you can list this, but just make it related to the effect on .unique()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add a test w.r.t. unique

@realead realead force-pushed the hash_for_float64_GH21866 branch from 4bf5983 to 0f77145 Compare July 16, 2018 18:42
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
- :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
- :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
- :class:`Float64HashTable` handles zeros/signed zeros and all flavors of NaNs consistently: it is no longer possible to have both, zero and signed-zero, as keys at the same time in a table, also there can be at most one NaN-key in a table (:issue:`21866`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, can you reword to just focus on .unique()

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc comments. rebase & ping on green.

tm.assert_numpy_array_equal(result, expected, strict_nan=True)

def test_signed_zero(self):
a = np.array([-0.0, 0.0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number here as a comment (and on test below)

- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
- :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
- :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
- :func:`unique` handles signed zeros consistently: it is no longer possible to have both, 0.0 and -0.0, in the same resulting array (:issue:`21866`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to bug fix / Numeric section.

@jreback jreback added this to the 0.24.0 milestone Jul 20, 2018
realead added 2 commits July 21, 2018 08:45
The following issues

   1)  hash(0.0) != hash(-0.0)
   2)  hash(x) != hash(y) for different x,y which are nans

are solved by setting:

   1) hash(-0.0):=hash(0.0)
   2) hash(x):=hash(np.nan) for every x which is nan
@realead realead force-pushed the hash_for_float64_GH21866 branch from 11131c9 to 124b095 Compare July 21, 2018 06:46
@realead
Copy link
Contributor Author

realead commented Jul 21, 2018

@jreback Done.

@jreback
Copy link
Contributor

jreback commented Jul 25, 2018

@jbrockmendel any comments? if ok pls merge.

@jbrockmendel
Copy link
Member

@realead good job catching a subtle bug. Thanks for taking point on this.

@jbrockmendel jbrockmendel merged commit 2c7c797 into pandas-dev:master Jul 25, 2018
@realead realead deleted the hash_for_float64_GH21866 branch August 9, 2018 19:34
realead added a commit to realead/pandas that referenced this pull request Aug 12, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Aug 21, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Aug 23, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 5, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 8, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 12, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 15, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 17, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
…-dev#21904)

* BUG: issues with hash-function for Float64HashTable (GH21866)

The following issues

   1)  hash(0.0) != hash(-0.0)
   2)  hash(x) != hash(y) for different x,y which are nans

are solved by setting:

   1) hash(-0.0):=hash(0.0)
   2) hash(x):=hash(np.nan) for every x which is nan

* add the id of the issue to tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issues with hash-function for float64 version of klib's hash-map

3 participants