
Minsize argument for Dict.empty() #8110

Closed
stefanfed opened this issue May 25, 2022 · 9 comments · Fixed by #8183
Labels
feature_request, performance - run time (Performance issue occurring at run time)

Comments

@stefanfed
Contributor

Feature request

Being able to allocate memory for a Dict based on problem-specific information would minimize resizes and increase performance. A Dict is initialized by a call to the function numba_dict_new_minsize in dictobject.c. The starting number of buckets is fixed at D_MINSIZE = 8, which only allows for 5 entries without a resize.

The code comments state that this is suitable for the common case of a small dictionary used for passing keyword arguments. I believe an optional size argument would give users the ability to achieve significantly better performance with large hash tables. This would be in line with Numba's focus on fast numerical computation.

I've tried to modify Numba so that I could pass a size argument to Dict.empty() and have it call numba_dict_new instead of numba_dict_new_minsize. Unfortunately, I haven't been successful so far. Please consider adding such an option in a future release. Thank you.
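To illustrate, here is roughly what I have in mind (the parameter name below is just a placeholder, and the commented-out call shows the proposed form):

from numba import types
from numba.typed import Dict

# Current behaviour: the dictionary always starts with D_MINSIZE (8) buckets,
# so inserting more than 5 entries already triggers at least one resize.
d = Dict.empty(key_type=types.int64, value_type=types.float64)

# Proposed behaviour (placeholder keyword): pre-size the table for the
# expected number of entries so the insertion loop below never resizes.
# d = Dict.empty(types.int64, types.float64, n_keys=100_000)
for i in range(100_000):
    d[i] = float(i)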

@gmarkall
Member

Many thanks for the feature request, and for identifying the required changes. It sounds like you've made a good attempt at making the change so far - would you like to push your branch up somewhere and perhaps I can suggest how to complete the implementation?

@gmarkall added the "performance - run time" label on May 31, 2022
@gmarkall
Member

Noting from the triage meeting that a similar request is regularly made for typed lists as well.

@stefanfed
Contributor Author

Apologies for the delay in responding. I'm working on uploading the (unsuccessful) changes I made to a fork.

I believe typed lists have this feature under the empty_list method. The argument is called allocated.
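For example (going by the typed list API as I understand it):

from numba import types
from numba.typed import List

# Pre-allocate space for 1024 items up front; `allocated` is the existing
# pre-allocation hint mentioned above.
l = List.empty_list(types.int64, allocated=1024)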

@stefanfed reopened this on May 31, 2022
stefanfed added a commit to stefanfed/numba that referenced this issue Jun 10, 2022
@stefanfed
Contributor Author

Hi @gmarkall,

I've pushed my changes to a branch at https://github.com/stefanfed/numba/tree/dict_empty_allocated

This implementation of Dict.empty() works in both object and nopython modes, but only when the optional argument allocated is left at its default of None (or omitted); in that case it appears to behave the same as before the changes. When an integer is provided, it prints the following error message and crashes the interpreter:

>>> Dict.empty(nb.int64, nb.int64, allocated=8)
LLVM ERROR: Symbol not found: numba_dict_new

I haven't figured out how these functions are linked into LLVM from dictobject.h. It seems numba_dict_new is exported in the same way as numba_dict_new_minsize, based on the preceding NUMBA_EXPORT_FUNC(int) macro, though I'm not sure what that macro does exactly. Most likely, changes are needed in another file for LLVM to recognize numba_dict_new. My biggest problem is that my understanding of how Numba works with LLVM internally is lacking.

The other problem is that allocated must be a power of 2 according to the notes in dictobject.h. I'm unsure what the proper behavior would be if a number that isn't a power of 2 is passed. I'd suggest rounding it up to the nearest power of 2. However, it could also be rounded down, or it could fail, leaving the check up to the user.

I'm also wondering what the ideal method of rounding up to the next power of 2 would be, and at what stage to do it. Perhaps it should be done in the C implementation?

I'd appreciate your help in completing this. Thanks.

@gmarkall
Member

Hi @stefanfed - many thanks for pushing your branch!

Regarding this problem:

>>> Dict.empty(nb.int64, nb.int64, allocated=8)
LLVM ERROR: Symbol not found: numba_dict_new

The numba_dict_new function needs to be added to LLVM's symbol table. This is done using llvmlite.binding.add_symbol(). One place in Numba where a collection of these are added is in the _load_global_helpers() function:

numba/core/base.py (lines 155 to 158 in 2f89d45):

for c_helpers in (_helperlib.c_helpers, _dynfunc.c_helpers):
    for py_name, c_address in c_helpers.items():
        c_name = "numba_" + py_name
        ll.add_symbol(c_name, c_address)

The c_helpers dicts of these modules are constructed using a function with a macro called declmethod (for _helperlib, for example):

numba/_helpermod.c (lines 28 to 49 in 2f89d45):

static PyObject *
build_c_helpers_dict(void)
{
    PyObject *dct = PyDict_New();
    if (dct == NULL)
        goto error;

#define _declpointer(name, value) do {                 \
    PyObject *o = PyLong_FromVoidPtr(value);           \
    if (o == NULL) goto error;                         \
    if (PyDict_SetItemString(dct, name, o)) {          \
        Py_DECREF(o);                                  \
        goto error;                                    \
    }                                                  \
    Py_DECREF(o);                                      \
} while (0)

#define declmethod(func) _declpointer(#func, &numba_##func)
#define declpointer(ptr) _declpointer(#ptr, &numba_##ptr)

    declmethod(fixed_fmod);

In order to make this work for the numba_dict_new method, I would suggest adding a similar construct that builds a helpers dict for typed dict methods containing the numba_dict_new method, and adding it to the list of c_helpers dicts that get initialized in _load_global_helpers().
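As a rough sketch of the direction (exact placement to be worked out), the entry itself would follow the same pattern as the existing ones:

/* Inside a build_c_helpers_dict()-style function (placement illustrative):
 * declmethod(func) stores &numba_##func under the name "func", and
 * _load_global_helpers() then registers it via ll.add_symbol() as
 * "numba_" + name, which is what resolves the missing numba_dict_new
 * symbol at run time. */
declmethod(dict_new);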

It seems numba_dict_new has been exported in the same way as numba_dict_new_minsize based on the preceding NUMBA_EXPORT_FUNC(int) macro. I'm not sure what that does exactly, to be honest.

Basically it just adds __attribute__ ((visibility("hidden"))) on Unix-like platforms so that the symbols are defined but not exported by default (I believe this is already the default on Windows anyway).
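In other words, it's roughly equivalent to something like this (a sketch based on the description above, not the exact definition in Numba's headers):

/* Sketch: hidden visibility on Unix-like platforms, plain declaration
 * elsewhere (hidden is effectively the default on Windows). */
#if defined(_MSC_VER)
#define NUMBA_EXPORT_FUNC(_rettype) _rettype
#else
#define NUMBA_EXPORT_FUNC(_rettype) __attribute__ ((visibility("hidden"))) _rettype
#endif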

The other problem is that allocated must be a power of 2 according to the notes in dictobject.h. I'm unsure what the proper behavior would be if a number that isn't a power of 2 is passed. I'd suggest rounding it up to the nearest power of 2. However, it could also be rounded down, or it could fail, leaving the check up to the user.

I think rounding up and documenting somewhere that the behaviour is to round up would be acceptable.

I'm also wondering what the ideal method rounding to the nearest power of 2 would be and at what stage to do it. Perhaps it should be done in the C implementation?

My guess would be that this would be the easiest place to ensure that only power-of-2 sizes are allocated. Having no check or adjustment on the C side and relying on something on the Python side to round things up to power-of-2 sizes leaves more opportunity for a future change to accidentally bypass the rounding-up, and I'd imagine it will also be easier to reason about the correctness of your changes at the point of review.

I'd appreciate your help in completing this. Thanks.

Many thanks for your efforts so far! I hope we can get this to completion - please do let us know if we can provide further guidance.

stefanfed added a commit to stefanfed/numba that referenced this issue Jun 19, 2022
See numba#8110
The new 'allocated' parameter in Dict.empty() is the number of entries the dictionary should take without requiring a resize.
@stefanfed
Contributor Author

Hi @gmarkall,

I was able to fix the LLVM error by adding a single line to _helpermod.c. Thanks for pointing me there.

I added a function for rounding up to the next power of 2 in dictobject.c. To ensure no resizes are performed, I round up from 1.5 times the requested size because the load factor is set to 2/3 (after which the dictionary is resized). So the new parameter allocated is the number of dictionary entries that can be added without requiring a resize.
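In other words, the sizing works roughly like this (illustrative only; next_power_of_2 stands in for the rounding helper I added, and buckets_for_n_keys is just a name for this sketch):

#include <Python.h>  /* Py_ssize_t */

/* With a 2/3 load factor, a table of B buckets holds about B * 2 / 3
 * entries before a resize, so fitting n_keys entries needs at least
 * ceil(1.5 * n_keys) buckets, rounded up to a power of two. */
static Py_ssize_t
buckets_for_n_keys(Py_ssize_t n_keys)
{
    Py_ssize_t needed = n_keys + (n_keys + 1) / 2;  /* ceil(1.5 * n_keys) */
    return next_power_of_2(needed);                 /* hypothetical helper */
}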

The implementation appears to work correctly now. I tested it with various inputs where the number of keys to be added is known ahead of time. The performance gain is over 2x for some ideal inputs (like a monotonically increasing range of ints), and it grows in steps with the input size as expected. When the ints are shuffled, the ratio goes down but the amount of time saved appears to stay the same. I saw no performance regressions.

I have a concern about the way I round up to the next power of 2:
https://github.com/stefanfed/numba/blob/078c70148de5f10041cb11f57c4f5e8ecdecba33/numba/cext/dictobject.c#L282-L296

Because Py_ssize_t is signed, it overflows when the next size is 2^63 on 64-bit, for example. I really doubt anyone has anywhere near enough memory to get there, so this won't be a problem on 64-bit (though it could cause a crash if someone tried it). However, it could be a substantial problem on 32-bit. I wasn't sure how to resolve this - I'd appreciate your advice.
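One overflow-aware direction might be something like this (just a sketch, not what's in my branch, assuming D_MINSIZE and Py_ssize_t as used in dictobject.c):

#include <Python.h>  /* Py_ssize_t, PY_SSIZE_T_MAX */

/* Round n up to the next power of two, returning -1 instead of
 * overflowing when the result would not fit in a Py_ssize_t. */
static Py_ssize_t
round_up_pow2(Py_ssize_t n)
{
    Py_ssize_t p = 8;  /* D_MINSIZE in dictobject.c */
    while (p < n) {
        if (p > PY_SSIZE_T_MAX / 2)
            return -1;  /* next doubling would overflow */
        p <<= 1;
    }
    return p;
}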

I'm not sure why Py_ssize_t is used for the size parameter as opposed to size_t - the size can never be negative. Could I change that, or would it cause problems elsewhere? Also, I'll note that the rounding method may not be the most efficient, since it doesn't take advantage of bit-manipulation instructions (such as count leading zeros) available in most modern desktop processors. This likely isn't a priority, but it could be optimized.

Another change I hesitated to make involves numba_dict_new:
https://github.com/stefanfed/numba/blob/078c70148de5f10041cb11f57c4f5e8ecdecba33/numba/cext/dictobject.c#L560-L585

Currently, the only functionality of numba_dict_new_minsize is to call numba_dict_new with size=D_MINSIZE. Since we can guarantee D_MINSIZE is a power of 2, we don't need to round it up. Therefore, in numba_dict_new, I only attempt to round up if size != D_MINSIZE. An alternative is to copy all the code from numba_dict_new to numba_dict_new_minsize without the rounding up part. Would you say that's a good idea?

Should I open a pull request? I noticed that you require unit tests. I'm not 100% sure on this part. What kind of tests would you like me to write?

@stefanfed
Contributor Author

On second thought, numba_dict_new needs to be changed because the size parameter takes on 2 different meanings. In the minsize case, it's literally the allocation size, while in other cases it's the number of keys that can be inserted without a resize. So in the single case where someone passes allocated=8, it won't behave as I intend it to. I'll update it with the changes shortly.

stefanfed added a commit to stefanfed/numba that referenced this issue Jun 19, 2022
@stefanfed
Contributor Author

I restructured a few things in my latest commit. I renamed allocated to n_keys, because it makes more sense with the desired behavior: it's the number of keys to make space for so that a resize isn't needed while adding them. I also created a new function called numba_dict_new_noresize in dictobject.c which contains all of the new functionality:
https://github.com/stefanfed/numba/blob/acbf5c274b4509488b6a47380088c0589874b376/numba/cext/dictobject.c#L563-L581

No other functions have been touched. The new function is added to the LLVM symbol table and called from the Python implementation. The overflow concern with rounding up to the next power of 2 remains; I'll have a closer look at solutions, though ideally this argument would not use a signed type.
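For context, the shape of numba_dict_new_noresize is roughly the following (a sketch only - the real code is at the link above, and I'm assuming here that the signature mirrors the existing numba_dict_new):

/* Sketch (assumed signature): size the table so that n_keys insertions fit
 * under the 2/3 load factor, then delegate to the existing constructor.
 * round_up_pow2 is the overflow-aware helper sketched in my previous
 * comment. */
int
numba_dict_new_noresize(NB_Dict **out, Py_ssize_t n_keys,
                        Py_ssize_t key_size, Py_ssize_t val_size)
{
    Py_ssize_t buckets = round_up_pow2(n_keys + (n_keys + 1) / 2);
    if (buckets < 0)
        return -1;  /* requested size would overflow */
    return numba_dict_new(out, buckets, key_size, val_size);
}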

@gmarkall
Member

I was able to fix the LLVM error by adding a single line to _helpermod.c. Thanks for pointing me there.

Ah - I hadn't noticed that _helpermod.c already contained some of the dict_new methods - thanks for spotting that it's simpler than I suggested it could be 🙂

Because Py_ssize_t is signed, it overflows when the next size is 2^63 on 64-bit, for example. I really doubt anyone has anywhere near enough memory to get to there, so this won't be a problem on 64-bit (though it could cause a crash if someone tried it). However, it could be a substantial problem on 32-bit. I wasn't sure how to resolve this - I'd appreciate your advice.

If my maths is correct:

In [2]: 2 ** (8 * 4 - 2)
Out[2]: 1073741824

On a 32-bit system, that is still a large size (1GB of contiguous allocation) for a system that can only have 2GB of addressable memory per process (On both Windows and Linux x86, at least). I'd be inclined not to worry about this (do let me know if I seem to have an incorrect calculation or assumption here though please 🙂)

Currently, the only functionality of numba_dict_new_minsize is to call numba_dict_new with size=D_MINSIZE. Since we can guarantee D_MINSIZE is a power of 2, we don't need to round it up. Therefore, in numba_dict_new, I only attempt to round up if size != D_MINSIZE. An alternative is to copy all the code from numba_dict_new to numba_dict_new_minsize without the rounding up part. Would you say that's a good idea?

Probably not - it sounds more succinct as it is.

Should I open a pull request?

Yes, I think your progress is looking great and we're a lot of the way there with this, so I think opening a PR would be totally appropriate and a really helpful way for us to iterate this to completion.

I noticed that you require unit tests. I'm not 100% sure on this part. What kind of tests would you like me to write?

I'll have a look through what the possibilities are here, since I think you're correct in identifying that it's not obvious how to test this. In the meantime, please make any remaining changes and open the PR, and I'll add some suggestions for testing in comments on the PR.

Thanks for your work and perseverance so far!

stefanfed added a commit to stefanfed/numba that referenced this issue Jun 21, 2022
stefanfed added a commit to stefanfed/numba that referenced this issue Jul 6, 2022
stefanfed added a commit to stefanfed/numba that referenced this issue Jul 6, 2022