Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementations of type hashing. #3703

Merged
merged 28 commits into from
Feb 14, 2019
Merged
Show file tree
Hide file tree
Changes from 27 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
1d17146
Implementations of Number type hashing.
stuartarchibald Jan 15, 2019
9310fdf
Fix up set IR hash function resolution.
stuartarchibald Jan 23, 2019
fbe464b
Fix callconv mismatch in set relection logic.
sklam Jan 23, 2019
c7dfb7a
Fix set_hash_item function definition
sklam Jan 23, 2019
a900f42
Fixes on tests and add hash on return for new strings.
stuartarchibald Jan 23, 2019
835290c
Fix Python 2.7.
stuartarchibald Jan 23, 2019
b9a4412
Flake 8 fixes.
stuartarchibald Jan 23, 2019
3389e97
Adjust return types.
stuartarchibald Jan 24, 2019
666e45a
Update to use C type names.
stuartarchibald Jan 24, 2019
871900d
32 bit fixes
stuartarchibald Jan 25, 2019
9c86d0a
Missed the var module.
stuartarchibald Jan 25, 2019
0b384b6
Fix up for windows pointer sizes.
stuartarchibald Jan 28, 2019
a47d464
Fix up int overflows.
stuartarchibald Jan 28, 2019
0888bcc
Fix i64 min hash on 32bit
stuartarchibald Jan 29, 2019
8a36707
Add checks for hashing on Py2.
stuartarchibald Jan 29, 2019
8c6bdfd
Show numba -s in azure.
stuartarchibald Jan 29, 2019
e4b0da0
limit top end of ints for windows
stuartarchibald Jan 29, 2019
7e46bd9
tmp commit, only run windows
stuartarchibald Jan 30, 2019
0c54e9c
add numba -s
stuartarchibald Jan 30, 2019
08befe5
debug
stuartarchibald Jan 30, 2019
34a15cd
Only guarantee int hash = int matches up to 32bit max.
stuartarchibald Jan 30, 2019
a9e81ff
Only guarantee int hash = int matches up to signed 32bit max
stuartarchibald Jan 30, 2019
e184a24
Revert "tmp commit, only run windows"
stuartarchibald Jan 30, 2019
f8ddce8
reenable all tests
stuartarchibald Jan 30, 2019
65196c6
Add documentation
stuartarchibald Jan 31, 2019
3143774
Remove debug
stuartarchibald Jan 31, 2019
b82113d
Respond to PR feedback
stuartarchibald Feb 1, 2019
5ab801f
Fail on invalid cast.
stuartarchibald Feb 13, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
3 changes: 3 additions & 0 deletions buildscripts/azure/azure-windows.yml
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,9 @@ jobs:
# One of the tbb tests is failing on Azure. Removing tbb from testing
# until we can figure out why
conda remove -y tbb tbb-devel
pushd bin
numba -s
popd
python -m numba.tests.test_runtests
python runtests.py -m 2 -b -- numba.tests
displayName: 'Test'
57 changes: 57 additions & 0 deletions docs/source/developer/hashing.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@

================
Notes on Hashing
================

Numba supports the built-in :func:`hash` and does so by simply calling the
:func:`__hash__` member function on the supplied argument. This makes it
trivial to add hash support for new types as all that is required is the
application of the extension API :func:`overload_method` decorator to overload
a function for computing the hash value for the new type registered to the
type's :func:`__hash__` method. For example::

from numba.extending import overload_method

@overload_method(myType, '__hash__')
def myType_hash_overload(obj):
# implementation details


The Implementation
==================

The implementation of the Numba hashing functions strictly follows that of
Python 3. The only exception to this is that for hashing Unicode and bytes (for
content longer than ``sys.hash_info.cutoff``) the only supported algorithm is
``siphash24`` (default in CPython 3). As a result Numba will match Python 3
hash values for all supported types under the default conditions described.
Python 2 hashing support is set up to follow Python 3 and similar defaults are
hard coded for this purpose, including, perhaps most noticeably,
``sys.hash_info.cutoff`` is set to zero.

Unicode hash cache differences
------------------------------

Both Numba and CPython Unicode string internal representations have a ``hash``
member for the purposes of caching the string's hash value. This member is
always checked ahead of computing a hash value the with view of simply providing
a value from cache as it is considerably cheaper to do so. The Numba Unicode
string hash caching implementation behaves in a similar way to that of
CPython's. The only notable behavioral change (and its only impact is a minor
potential change in performance) is that Numba always computes and caches the
hash for Unicode strings created in ``nopython mode`` at the time they are boxed
for reuse in Python, this is too eager in some cases in comparison to CPython
which may delay hashing a new Unicode string depending on creation method. It
should also be noted that Numba copies in the ``hash`` member of the CPython
internal representation for Unicode strings when unboxing them to its own
representation so as to not recompute the hash of a string that already has a
hash value associated with it.

The accommodation of ``PYTHONHASHSEED``
---------------------------------------

The ``PYTHONHASHSEED`` environment variable can be used to seed the CPython
hashing algorithms for e.g. the purposes of reproduciblity. The Numba hashing
implementation directly reads the CPython hashing algorithms' internal state and
as a result the influence of ``PYTHONHASHSEED`` is replicated in Numba's
hashing implementations.
1 change: 1 addition & 0 deletions docs/source/developer/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,4 +18,5 @@ Developer Manual
stencil.rst
custom_pipeline.rst
environment.rst
hashing.rst
roadmap.rst
20 changes: 20 additions & 0 deletions docs/source/reference/pysupported.rst
Original file line number Diff line number Diff line change
Expand Up @@ -338,6 +338,7 @@ The following built-in functions are supported:
* :func:`divmod`
* :func:`enumerate`
* :class:`float`
* :func:`hash` (see :ref:`pysupported-hashing` below)
* :class:`int`: only the one-argument form
* :func:`iter`: only the one-argument form
* :func:`len`
Expand All @@ -353,6 +354,25 @@ The following built-in functions are supported:
(e.g. numbers and named tuples)
* :func:`zip`

.. _pysupported-hashing:

Hashing
-------

The :func:`hash` built-in is supported and produces hash values for all
supported hashable types with the following Python version specific behavior:

Under Python 3, hash values computed by Numba will exactly match those computed
in CPython under the condition that the :attr:`sys.hash_info.algorithm` is
``siphash24`` (default).

Under Python 2, hash values computed by Numba will follow the behavior
described for Python 3 with the :attr:`sys.hash_info.algorithm` emulated as
``siphash24``. No attempt is made to replicate Python 2 hashing behavior.

The ``PYTHONHASHSEED`` environment variable influences the hashing behavior in
precisely the manner described in the CPython documentation.


Standard library modules
========================
Expand Down
46 changes: 45 additions & 1 deletion numba/_helperlib.c
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
*/

#include "_pymodule.h"
#include <stddef.h>
#include <stdio.h>
#include <math.h>
#include "_math_c99.h"
Expand Down Expand Up @@ -1063,12 +1064,55 @@ numba_unpickle(const char *data, int n)
* Unicode helpers
*/

/* Developer note:
*
* The hash value of unicode objects is obtained via:
* ((PyASCIIObject *)(obj))->hash;
* The use comes from this definition:
* https://github.com/python/cpython/blob/6d43f6f081023b680d9db4542d19b9e382149f0a/Objects/unicodeobject.c#L119-L120
* and it's used extensively throughout the `cpython/Object/unicodeobject.c`
* source, not least in `unicode_hash` itself:
* https://github.com/python/cpython/blob/6d43f6f081023b680d9db4542d19b9e382149f0a/Objects/unicodeobject.c#L11662-L11679
*
* The Unicode string struct layouts are described here:
* https://github.com/python/cpython/blob/6d43f6f081023b680d9db4542d19b9e382149f0a/Include/cpython/unicodeobject.h#L82-L161
* essentially, all the unicode string layouts start with a `PyASCIIObject` at
* offset 0 (as of commit 6d43f6f081023b680d9db4542d19b9e382149f0a, somewhere
* in the 3.8 development cycle).
*
* For safety against future CPython internal changes, the code checks that the
* _base members of the unicode structs are what is expected in 3.7, and that
* their offset is 0. It then walks the struct to the hash location to make sure
* the offset is indeed the same as PyASCIIObject->hash.
* Note: The large condition in the if should evaluate to a compile time
* constant.
*/

#define MEMBER_SIZE(structure, member) sizeof(((structure *)0)->member)

NUMBA_EXPORT_FUNC(void *)
numba_extract_unicode(PyObject *obj, Py_ssize_t *length, int *kind) {
numba_extract_unicode(PyObject *obj, Py_ssize_t *length, int *kind,
Py_ssize_t *hash) {
#if (PY_MAJOR_VERSION >= 3) && (PY_MINOR_VERSION >= 3)
if (!PyUnicode_READY(obj)) {
*length = PyUnicode_GET_LENGTH(obj);
*kind = PyUnicode_KIND(obj);
/* this is here as a crude check for safe casting of all unicode string
* structs to a PyASCIIObject */
if (MEMBER_SIZE(PyCompactUnicodeObject, _base) == sizeof(PyASCIIObject) &&
MEMBER_SIZE(PyUnicodeObject, _base) == sizeof(PyCompactUnicodeObject) &&
offsetof(PyCompactUnicodeObject, _base) == 0 &&
offsetof(PyUnicodeObject, _base) == 0 &&
offsetof(PyCompactUnicodeObject, _base.hash) == offsetof(PyASCIIObject, hash) &&
offsetof(PyUnicodeObject, _base._base.hash) == offsetof(PyASCIIObject, hash)
) {
/* Grab the hash from the type object cache, do not compute it. */
*hash = ((PyASCIIObject *)(obj))->hash;
}
else {
/* cast is not safe, set the hash as -1 to force eval as needed */
*hash = -1;
stuartarchibald marked this conversation as resolved.
Show resolved Hide resolved
}
return PyUnicode_DATA(obj);
} else {
return NULL;
Expand Down
1 change: 0 additions & 1 deletion numba/dispatcher.py
Original file line number Diff line number Diff line change
Expand Up @@ -643,7 +643,6 @@ def compile(self, sig):
existing = self.overloads.get(tuple(args))
if existing is not None:
return existing.entry_point

# Try to load from disk cache
cres = self._cache.load_overload(sig, self.targetctx)
if cres is not None:
Expand Down
16 changes: 13 additions & 3 deletions numba/pythonapi.py
Original file line number Diff line number Diff line change
Expand Up @@ -1136,21 +1136,25 @@ def string_as_string_size_and_kind(self, strobj):
The ``buffer`` is a i8* of the output buffer.
The ``length`` is a i32/i64 (py_ssize_t) of the length of the buffer.
The ``kind`` is a i32 (int32) of the Unicode kind constant
The ``hash`` is a long/uint64_t (py_hash_t) of the Unicode constant hash
"""
if PYVERSION >= (3, 3):
p_length = cgutils.alloca_once(self.builder, self.py_ssize_t)
p_kind = cgutils.alloca_once(self.builder, Type.int())
p_hash = cgutils.alloca_once(self.builder, self.py_hash_t)
fnty = Type.function(self.cstring, [self.pyobj,
self.py_ssize_t.as_pointer(),
Type.int().as_pointer()])
Type.int().as_pointer(),
self.py_hash_t.as_pointer()])
fname = "numba_extract_unicode"
fn = self._get_function(fnty, name=fname)

buffer = self.builder.call(fn, [strobj, p_length, p_kind])
buffer = self.builder.call(fn, [strobj, p_length, p_kind, p_hash])
ok = self.builder.icmp_unsigned('!=',
ir.Constant(buffer.type, None),
buffer)
return (ok, buffer, self.builder.load(p_length), self.builder.load(p_kind))
return (ok, buffer, self.builder.load(p_length),
self.builder.load(p_kind), self.builder.load(p_hash))
else:
assert False, 'not supported on Python < 3.3'

Expand Down Expand Up @@ -1188,6 +1192,12 @@ def bytes_from_string_and_size(self, string, size):
fn = self._get_function(fnty, name=fname)
return self.builder.call(fn, [string, size])

def object_hash(self, obj):
fnty = Type.function(self.py_hash_t, [self.pyobj,])
fname = "PyObject_Hash"
fn = self._get_function(fnty, name=fname)
return self.builder.call(fn, [obj,])

def object_str(self, obj):
fnty = Type.function(self.pyobj, [self.pyobj])
fn = self._get_function(fnty, name="PyObject_Str")
Expand Down
2 changes: 1 addition & 1 deletion numba/targets/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -262,7 +262,7 @@ def refresh(self):
# Populate built-in registry
from . import (arraymath, enumimpl, iterators, linalg, numbers,
optional, polynomial, rangeobj, slicing, smartarray,
tupleobj, gdb_hook)
tupleobj, gdb_hook, hashing)
try:
from . import npdatetime
except NotImplementedError:
Expand Down
2 changes: 1 addition & 1 deletion numba/targets/boxing.py
Original file line number Diff line number Diff line change
Expand Up @@ -819,7 +819,7 @@ def _python_set_to_native(typ, obj, c, size, setptr, errorptr):
native = c.unbox(typ.dtype, itemobj)
with c.builder.if_then(native.is_error, likely=False):
c.builder.store(cgutils.true_bit, errorptr)
inst.add(native.value, do_resize=False)
inst.add_pyapi(c.pyapi, native.value, do_resize=False)

if typ.reflected:
inst.parent = obj
Expand Down