Skip to content

Conversation

@markalle
Copy link
Contributor

There's a cuda check in the converter that checks one address and caches a flag for CONVERTOR_CUDA. But it was just checking pUserBuf which isn't necessarily anywhere near the data, and doesn't even need to be a legal address for that matter. MPI datatype displacements could put the actual data anywhere.

I added a lookup of the first non-loop pElem in the datatype description to use its .disp to find the first byte of data in the user's buf.

The code is probably being overly cautious, when it could just be coded as

    pElem = &(convertor->use_desc->desc[0]);
    while (!(pElem->elem.common.flags & OPAL_DATATYPE_FLAG_DATA)) {
        ++pElem;
    }
    // use pElem->elem.disp

But I was reluctant to write code that would run away in that while loop and fail ungracefully if the description array was bad for some reason or let it select a wrong first displacement if degenerate loops were a possibility in the description.


I'm not sure I feel about only looking at the first byte, when I'd think an MPI datatype could span cuda and non cuda memory, displacements can be anything. But this is still an improvement over just looking at pUserBuf which can be anything

@bosilca
Copy link
Member

bosilca commented Mar 12, 2021

You don't need this. Instead you should check the range between pBaseBuf + pData->true_lb and pBaseBuf + pData->true_ub.

@jjhursey
Copy link
Member

bot:ibm:gnu:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Mar 12, 2021
@markalle
Copy link
Contributor Author

Thanks, that seems much cleaner, I'll change it.

About using the true_ub, currently the cuda check uses cuPointerGetAttributes() on one address and I don't think there's an API for checking a range. I wasn't planning on checking the UB, partly because I'm not sure the top is that much more important than the middle and nobody wants the cudacheck walking through the entire datatype. Plus I don't know what I'd even have the cudacheck do if it saw different results for different parts of the userbuf.

So I'm leaning toward just using LB to change the current check, I don't know why I didn't do that initially instead of pElem.disp

@bosilca
Copy link
Member

bosilca commented Mar 16, 2021

Don't use LB, it point to nothing important. true_lb is the displacement of the first used byte from the buffer, so if you want to test something simple this is what you need.

For the memory range I was under the impression that CUDA devices are mapped by full GB, and that it is possible to use the MTRR registers to figure that out. Checking the first byte should be good enough for now.

@markalle markalle force-pushed the cuda_userbuf_offset branch from 82b1aac to 250451e Compare March 16, 2021 20:47
@markalle
Copy link
Contributor Author

I meant true_lb. My earlier "find the first byte" wouldn't necessarily equal true_lb, since it was "first" in the sense of data ordering, but in the context of picking an address for a cuda query I think true_lb makes as much sense as anything.

So I repushed, and went up the chain to look at the callers. It seems like some callers have a convertor available and for them it makes sense to pass in a userbuf, but other callers don't have a convertor so I'm changing them to add the true_lb themselves so they pass in the actual final address they want to have cuda-checked.

There's a cuda check in the converter that checks one address and
caches a flag for CONVERTOR_CUDA.  But it was just checking
pUserBuf which isn't necessarily anywhere near the data, and
doesn't even need to be a legal address for that matter.  MPI
datatype displacements could put the actual data anywhere.

Since opal_cuda_check_one_buf() was already using a convertor, I
left its buffer meaning as userbuf and inside it uses the convertor's
datatype for the offset.  Then for opal_cuda_check_bufs() since it's
just passing in two addresses with no convertor/datatype information,
it's expected to handle the offset itself and pass in the final address
to check

Signed-off-by: Mark Allen <markalle@us.ibm.com>
@markalle markalle force-pushed the cuda_userbuf_offset branch from 250451e to d2d90c4 Compare July 22, 2021 19:03
@markalle
Copy link
Contributor Author

rebased

I think this should be ready to go in now

@awlauria awlauria requested a review from bosilca July 22, 2021 19:11
@awlauria
Copy link
Contributor

bot:aws:retest

@awlauria
Copy link
Contributor

@markalle can you rebase? These files may have gotten moved around a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants