
Fix numpy DeprecationWarning in hashing.py #308

Closed · lesteve opened this issue Feb 4, 2016 · 1 comment · Fixed by #309

lesteve commented Feb 4, 2016

From nilearn tests:

```
/home/ubuntu/virtualenvs/venv-system/local/lib/python2.7/site-packages/sklearn/externals/joblib/hashing.py:178: DeprecationWarning: Changing the shape of non-C contiguous array by
descriptor assignment is deprecated. To maintain
the Fortran contiguity of a multidimensional Fortran
array, use 'a.T.view(...).T' instead
  obj_bytes_view = obj.view(self.np.uint8)
```
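
For context, a minimal standalone sketch (not the joblib code, and not necessarily the fix used in #309) of the replacement pattern the warning points to: to reinterpret a Fortran-contiguous array as raw `uint8` bytes, transpose, view, and transpose back rather than calling `.view()` directly on the array.

```
import numpy as np

# Fortran-contiguous array, like the ones hashing.py receives.
a = np.asfortranarray(np.arange(6, dtype='<i8').reshape(2, 3))

# a.view(np.uint8) is the call on hashing.py line 178 that triggers the
# warning for non-C-contiguous arrays; the line below is numpy's suggested
# replacement, which keeps the Fortran layout.
obj_bytes_view = a.T.view(np.uint8).T

print(obj_bytes_view.flags['F_CONTIGUOUS'])  # True
print(obj_bytes_view.shape)                  # (16, 3): first axis grows by itemsize (8)
```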

aabadie commented Feb 4, 2016

For the record, this deprecation was announced in the numpy 1.10.2 release notes: https://github.com/numpy/numpy/blob/master/doc/release/1.10.2-notes.rst#deprecate-views-changing-dimensions-in-fortran-order

jdanbrown added a commit to jdanbrown/joblib that referenced this issue Feb 16, 2020
- Previously, `Hasher` would collect the full bytes output of `self.dump(obj)` into the `BytesIO` at `self.stream`, and then call `self.stream.getvalue()` to feed the bytes to `self._hash.update(bytes)`. This requires O(n) space inside `self.stream`.
- This PR proposes handing the Pickler's `file.write(bytes)` calls directly to `self._hash.update(bytes)`, without first accumulating O(n) bytes in an in-memory stream. This requires only O(1) space (a rough sketch of the idea is shown below).
- I haven't measured performance improvements (or regressions?), but I'd expect this to be a meaningful win when hashing large objects (multi-gigabyte), since hashing no longer has to allocate a comparable amount of RAM.
- Even when runtime performance isn't noticeably better, this at least makes hashing large objects memory safe, in that you don't need roughly 1x extra RAM to compute the hash of a large object.
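
For illustration, here is a rough standalone sketch of that idea, not the actual PR and not joblib's `Hasher` (which subclasses `Pickler`, pins the pickle protocol, and special-cases numpy arrays): a file-like object whose `write()` forwards every chunk straight into the hash, so the pickled bytes are never buffered in memory.

```
import hashlib
import pickle

class _HashWriter:
    """Minimal file-like object: every write() feeds the hash directly."""
    def __init__(self, hash_obj):
        self._hash = hash_obj

    def write(self, data):
        self._hash.update(data)
        return len(data)

def streaming_hash(obj, hash_name='md5'):
    h = hashlib.new(hash_name)
    # The Pickler writes its output to _HashWriter, so the bytes are consumed
    # by the hash as they are produced: O(1) extra space instead of buffering
    # the whole pickle in a BytesIO first.
    pickle.Pickler(_HashWriter(h), protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
    return h.hexdigest()

print(streaming_hash({'a': list(range(10))}))
```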

Questions:
- Do you see any reasons why this wouldn't be a desirable change?
- Also, this makes 1/1185 tests fail. Any guess as to why?

```
___________________________________________________ test_hashes_stay_the_same_with_numpy_objects ____________________________________________________

    @with_numpy
    def test_hashes_stay_the_same_with_numpy_objects():
        # We want to make sure that hashes don't change with joblib
        # version. For end users, that would mean that they have to
        # regenerate their cache from scratch, which potentially means
        # lengthy recomputations.
        rng = np.random.RandomState(42)
        # Being explicit about dtypes in order to avoid
        # architecture-related differences. Also using 'f4' rather than
        # 'f8' for float arrays because 'f8' arrays generated by
        # rng.random.randn don't seem to be bit-identical on 32bit and
        # 64bit machines.
        to_hash_list = [
            rng.randint(-1000, high=1000, size=50).astype('<i8'),
            tuple(rng.randn(3).astype('<f4') for _ in range(5)),
            [rng.randn(3).astype('<f4') for _ in range(5)],
            {
                -3333: rng.randn(3, 5).astype('<f4'),
                0: [
                    rng.randint(10, size=20).astype('<i8'),
                    rng.randn(10).astype('<f4')
                ]
            },
            # Non regression cases for joblib#308.
            # Generated with joblib 0.9.4.
            np.arange(100, dtype='<i8').reshape((10, 10)),
            # Fortran contiguous array
            np.asfortranarray(np.arange(100, dtype='<i8').reshape((10, 10))),
            # Non contiguous array
            np.arange(100, dtype='<i8').reshape((10, 10))[:, :2],
        ]

        # These expected results have been generated with joblib 0.9.0
        expected_dict = {'py2': ['80f2387e7752abbda2658aafed49e086',
                                 '0d700f7f25ea670fd305e4cd93b0e8cd',
                                 '83a2bdf843e79e4b3e26521db73088b9',
                                 '63e0efd43c0a9ad92a07e8ce04338dd3',
                                 '03fef702946b602c852b8b4e60929914',
                                 '07074691e90d7098a85956367045c81e',
                                 'd264cf79f353aa7bbfa8349e3df72d8f'],
                         'py3': ['10a6afc379ca2708acfbaef0ab676eab',
                                 '988a7114f337f381393025911ebc823b',
                                 'c6809f4b97e35f2fa0ee8d653cbd025c',
                                 'b3ad17348e32728a7eb9cda1e7ede438',
                                 '927b3e6b0b6a037e8e035bda134e0b05',
                                 '108f6ee98e7db19ea2006ffd208f4bf1',
                                 'bd48ccaaff28e16e6badee81041b7180']}

        py_version_str = 'py3' if PY3_OR_LATER else 'py2'
        expected_list = expected_dict[py_version_str]

        for to_hash, expected in zip(to_hash_list, expected_list):
>           assert hash(to_hash) == expected
E           AssertionError: assert '81b225876779e9dfd235bfd83bb555cb' == '10a6afc379ca2708acfbaef0ab676eab'
E             - 81b225876779e9dfd235bfd83bb555cb
E             + 10a6afc379ca2708acfbaef0ab676eab

joblib/test/test_hashing.py:459: AssertionError
========================================= 1 failed, 1184 passed, 21 skipped, 1 xfailed in 140.11s (0:02:20) =========================================
```