Filling BFArray via buffer argument does not consistently work #231

Closed
dentalfloss1 opened this issue Feb 28, 2024 · 5 comments

Comments

@dentalfloss1
Collaborator

In the current version of test_udp_io.py in the ibverb-support branch, we fill bfarrays like so:
final = bf.ndarray(shape=(final.shape[0],4,4096), dtype='ci4', buffer=final.ctypes.data)
This method produces a bfarray that occasionally does not contain the correct data. It's possible this is platform dependent, as these tests tend not to fail on Mac.
The following method seems to work consistently:
final = bf.ndarray(final,dtype='ci4')

Attached is a test in which the first method fails on our local Ubuntu based machine while the second method passes. redo_test_udp_io.py.txt

@jaycedowell
Collaborator

@dentalfloss1 One thing I noticed when looking into this was that AccumulateOp is directly saving the ring's contents to final (see https://github.com/ledatelescope/bifrost/blob/ibverb-support/test/test_udp_io.py#L184 ). That's probably a bad idea since the ring's data could be overwritten or destroyed when the pipeline finishes. It would probably be better to save a copy of idata instead. I don't think that this is the root cause of this issue, but it could be a contributing factor.
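
To illustrate the concern about saving the ring's contents directly, here is a minimal NumPy-only sketch. It does not use Bifrost: the `ring` array is a hypothetical stand-in for a ring span whose memory is reused, and the variable names are made up for the example. It only shows how a saved reference changes under you while a saved copy does not.

```python
import numpy as np

# Hypothetical stand-in for a ring span: one block of memory reused every gulp.
ring = np.zeros(4, dtype=np.int8)

saved_refs = []    # references that alias the ring's memory
saved_copies = []  # independent snapshots of the ring's contents

for gulp in range(3):
    ring[:] = gulp + 1               # the "pipeline" overwrites the ring in place
    saved_refs.append(ring)          # no data is duplicated here
    saved_copies.append(ring.copy())

# Simulate the ring being reclaimed or cleared once the pipeline finishes.
ring[:] = 0

refs = np.array(saved_refs)
copies = np.array(saved_copies)

print(refs)    # every row reflects the ring's final state
print(copies)  # each row holds the data seen during its own gulp
```

In this sketch every row of `refs` ends up as zeros, while `copies` keeps the values 1, 2 and 3 from the individual gulps.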

@jaycedowell
Collaborator

I'm thinking through this some more. For the first part in AccumulateOp we have something like:

import numpy as np
final = []
for i in range(100):
    final.append( np.random.rand(500) )

for f in final:
    print(f.__array_interface__['data'][0])

On my Ubuntu desktop the values printed out increment by 4016 while on my Mac it's 4096. For reference the default Bifrost alignment is 4096.

For the next part in the test suite we do something like:

f = np.array(final)
print(f.__array_interface__['data'][0])

On my desktop I get something that is aligned at 16 while my Mac still goes to an alignment of 4096.

Then we do a transpose (which is probably in place) and a copy:

g = f.transpose(1,0).copy()
print(g.__array_interface__['data'][0])

This time on my desktop I get something aligned at 32 while the Mac still ends up at 4096.
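
The address checks above can be wrapped in a small helper that reports the alignment directly rather than leaving it to be read off the raw pointer value. This is a sketch only; `data_address`, `alignment` and `is_aligned` are hypothetical names, not part of NumPy or Bifrost.

```python
import numpy as np

def data_address(arr):
    """Address of the first byte of the array's data buffer."""
    return arr.__array_interface__['data'][0]

def alignment(addr, max_align=4096):
    """Largest power of two, capped at max_align, that divides addr."""
    align = 1
    while align < max_align and addr % (align * 2) == 0:
        align *= 2
    return align

def is_aligned(arr, boundary=4096):
    """True if the array's data starts on a multiple of boundary bytes."""
    return data_address(arr) % boundary == 0

f = np.array([np.random.rand(500) for _ in range(100)])
g = f.transpose(1, 0).copy()
for name, arr in (("f", f), ("g", g)):
    print(name, alignment(data_address(arr)), is_aligned(arr))
```

The printed alignment will vary by platform and allocator, which is the behavior being compared between the Ubuntu desktop and the Mac.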

My guess is that using bf.ndarray(buffer=...) is sensitive to how the provided buffer is aligned. I'm not really sure what the mechanism would be, though. I just don't see something like that in the code.
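
One way to probe the alignment guess without rebuilding Bifrost would be to control the alignment of the NumPy buffer itself. The sketch below over-allocates a byte array and slices it so the data starts at a chosen offset from a 4096-byte boundary. `aligned_empty` and its `skew` parameter are hypothetical helpers written for this example; the resulting array's `ctypes.data` could then be passed as `buffer=` in the same way as in the original report.

```python
import numpy as np

def aligned_empty(shape, dtype, boundary=4096, skew=0):
    """Uninitialized array whose data address is `skew` bytes past a
    multiple of `boundary`. Hypothetical helper, not a NumPy or Bifrost API."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # Over-allocate so that a suitably aligned start always fits in the block.
    raw = np.empty(nbytes + boundary + skew, dtype=np.uint8)
    offset = (-raw.ctypes.data) % boundary + skew
    # The returned view keeps `raw` alive through its .base chain.
    return raw[offset:offset + nbytes].view(dtype).reshape(shape)

page_aligned = aligned_empty((16, 4, 4096), np.int8)           # 4096-aligned
only_16 = aligned_empty((16, 4, 4096), np.int8, skew=16)       # deliberately off by 16

print(page_aligned.ctypes.data % 4096, only_16.ctypes.data % 4096)
```

If the failures followed the `skew=16` buffer and disappeared with the page-aligned one, that would support the alignment hypothesis; if both behaved the same, it would point elsewhere.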

If you build Bifrost with an alignment of 16 instead of 4096 do these failures disappear?

@jaycedowell
Collaborator

I tried the test suggested above and it does... something. I still get failures on my desktop but now they are things like a comparison with an array full of zeros. That's believable if you assume some packets are getting dropped for whatever reason.

I'm still not convinced that I'm seeing the whole picture.

@jaycedowell
Collaborator

jaycedowell commented Mar 16, 2024

And why are all of the self-hosted tests failing now with

checking for valid CUDA architectures... found: 50 52 53 60 61 62 70 72 75 80 86 87 89 90
configure: error: failed to find any
checking which CUDA architectures to target... 
Error: Process completed with exit code 1.

?

Update: This has been resolved.

@jaycedowell
Collaborator

Maybe this was all a problem with how we were directly saving the ring's contents rather than a copy. After refactoring the disk and UDP I/O tests in ibverb-support the problem seems to have largely gone away. I do occasionally see a test failure but it looks like it's comparing zeros (dropped packets) against real data.
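
To tell the remaining dropped-packet failures apart from genuine data mismatches, a comparison could skip frames that arrived as all zeros and report how many were skipped. This is a sketch under two assumptions that are not confirmed by the test code: that the first axis indexes packets or time samples, and that a dropped packet leaves its frame zero-filled. `compare_ignoring_drops` is a hypothetical helper name.

```python
import numpy as np

def compare_ignoring_drops(received, expected):
    """Compare two arrays frame by frame along axis 0, skipping frames of
    `received` that are entirely zero (assumed here to be dropped packets).
    Returns (frames_match, number_of_dropped_frames)."""
    received = np.asarray(received)
    expected = np.asarray(expected)
    flat = received.reshape(received.shape[0], -1)
    dropped = ~flat.any(axis=1)          # True where a whole frame is zero
    kept = ~dropped
    frames_match = np.array_equal(received[kept], expected[kept])
    return frames_match, int(dropped.sum())

expected = np.arange(1, 13).reshape(3, 4)
received = expected.copy()
received[1] = 0                           # simulate one dropped frame

ok, ndropped = compare_ignoring_drops(received, expected)
print(ok, ndropped)
```

Note that a frame whose expected contents really are all zeros would also be skipped, so test data for this kind of check should avoid all-zero frames.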

2 participants