
ERROR: bucket overflow #55

Closed · victoryc opened this issue Apr 4, 2021 · 8 comments

victoryc commented Apr 4, 2021

I occasionally get the following error from code that normally works fine. Any idea what causes it?

526/512 ERROR: bucket overflow! (n=823, bucket_num=2/4, key=0)
512/512 ERROR: bucket overflow! (n=823, bucket_num=2/4, key=0)
514/512 ERROR: bucket overflow! (n=823, bucket_num=2/4, key=0)
516/512 ERROR: bucket overflow! (n=823, bucket_num=2/4, key=0)
528/512 ERROR: bucket overflow! (n=823, bucket_num=2/4, key=0)
530/512 ERROR: bucket overflow! (n=823, bucket_num=2/4, key=0)
...

An observation that may or may not be relevant: the error seems to occur when the number of coordinates in the sparse tensor is small. The same code ran fine, without producing the error above, when a larger input with more coordinates was fed in.

zhijian-liu self-assigned this Apr 5, 2021
zhijian-liu commented

Could you provide us with a sample code and data that can reproduce this error? Thanks!

victoryc commented Apr 5, 2021

My program is quite big (spanning multiple files) and the error occurs only occasionally, so I am not in a position to provide simple sample code that reproduces it right now.

Instead, I have tried to narrow it down; this is what I have found so far.
The error occurs when the torchsparse_backend.query_forward function is called at line 21 of functional/query.py. The top of the call stack:

forward (/home/username/projects/detect3d/myproj/torchsparse/torchsparse/nn/functional/query.py:20)
sphashquery (/home/username/projects/detect3d/myproj/torchsparse/torchsparse/nn/functional/query.py:40)
conv3d (/home/username/projects/detect3d/myproj/torchsparse/torchsparse/nn/functional/conv.py:144)
forward (/home/username/projects/detect3d/myproj/torchsparse/torchsparse/nn/modules/conv.py:72)
_call_impl (/home/username/anaconda3/envs/myproj/lib/python3.8/site-packages/torch/nn/modules/module.py:889)

The variable values in that function when the error occurs: C=823, hash_query.shape=([27, 823])

victoryc commented Apr 8, 2021

I found out the cause of this in the code.
In query.cpp, the following line is used to compute the size of the table:
int table_size = 2 * pow(2,ceil(log2((double)n)));
In hashmap.cuh, the number of buckets is computed to be table_size/BUCKET_SIZE, where BUCKET_SIZE is 512.

When the hash values of the coordinates are evenly distributed, there is no problem. Taking table_size to be twice the next power of 2 greater than or equal to n, as done in the line above, provides some tolerance for the case where the hash values are not evenly distributed, but that is only a heuristic. When the hash values are distributed more unevenly than that tolerance can absorb, the code fails with the "ERROR: bucket overflow!" messages.
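
To make the arithmetic concrete, here is a small standalone sketch (an illustration, not the torchsparse source itself) of how n = 823 ends up with only 4 buckets of 512 slots each, so a sufficiently skewed hash distribution overflows one of them:

  #include <cmath>
  #include <cstdio>

  const int BUCKET_SIZE = 512;  // mirrors the constant in hashmap.cuh

  int main() {
      int n = 823;  // number of coordinates from the error message above
      // table_size = 2 * (next power of 2 >= n), as in query.cpp
      int table_size = 2 * (int)pow(2, ceil(log2((double)n)));  // 2048
      int num_buckets = table_size / BUCKET_SIZE;               // 4
      printf("table_size=%d, num_buckets=%d\n", table_size, num_buckets);
      // If the hash function sends more than BUCKET_SIZE of the 823 keys
      // into one of these 4 buckets, insertion reports "ERROR: bucket
      // overflow!", which matches the "bucket_num=2/4" in the log above.
      return 0;
  }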

victoryc commented Apr 8, 2021

For the benefit of others running into this, let me mention how I addressed it for now. There used to be the following line in query.cpp:
int table_size = 2 * pow(2,ceil(log2((double)n)));
I replaced it with:

  const int nextPow2 = pow(2,ceil(log2((double)n)));
  // When n is large, the hash values tend to be more evenly distributed, and choosing table_size to be
  // 2 * nextPow2 typically suffices. For smaller n, the effect of an uneven distribution of hash values is more
  // pronounced, so we choose table_size to be 4 * nextPow2 to reduce the chance of bucket overflow.
  int table_size = (n < 2048) ? 4 * nextPow2 : 2 * nextPow2;

This way, for larger n, where memory usage matters more, table_size is computed the same as before. For smaller n, where the effect of unevenly distributed hash values is more pronounced and memory usage matters less, table_size is computed to be larger than before, reducing the chance of a bucket overflow error.

Original authors: I will submit a pull request with this change, but I won't be offended :-) if you reject it, since the change only reduces the chances of the bucket overflow error and doesn't address the issue in a robust, thorough way.

zhijian-liu commented

@victoryc, thank you so much for investigating this issue! I was discussing the same solution with @kentangSJTU this morning. I will be happy to take a look after you submit the PR. We are also thinking about making this a tunable knob so that users can change it based on their needs. Taking it one step further, it might also be possible to adjust the table size dynamically if there is an overflow.
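
For illustration, such a knob could look something like the sketch below; the environment variable name TORCHSPARSE_HASH_TABLE_MULTIPLIER is purely hypothetical and not an existing torchsparse option:

  #include <algorithm>
  #include <cmath>
  #include <cstdlib>

  // Hypothetical sketch: read the table-size multiplier from an environment
  // variable instead of hard-coding 2 (or 4) in query.cpp.
  int choose_table_size(int n) {
      int multiplier = 2;  // current default
      if (const char* env = std::getenv("TORCHSPARSE_HASH_TABLE_MULTIPLIER")) {
          multiplier = std::max(2, std::atoi(env));
      }
      int next_pow2 = (int)pow(2, ceil(log2((double)n)));
      return multiplier * next_pow2;
  }

The dynamic variant would be one step further: if insertion reports a bucket overflow, rebuild the table with a doubled multiplier and retry.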

victoryc closed this as completed Apr 8, 2021
harrydobbs commented

Hi there,
I am getting the same error. Are you still planning on adding a knob to tune?
Cheers,
Harry

zhijian-liu commented

Hi @harry1576, it's not in our plans for the near future. As a workaround, you may follow #55 (comment) to enlarge the table size further.
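
For example (my own illustration of "enlarge further", not an official recommendation), you could raise the multiplier in the modified line of query.cpp:

  // e.g. use 8 * nextPow2 instead of 4 * nextPow2 for small n; adjust to your data
  int table_size = (n < 2048) ? 8 * nextPow2 : 2 * nextPow2;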

Akola-Mbey-Denis commented

I followed the suggestions in this issue, but that did not fix it. Is there an alternative solution to this problem?
