
prov/shm: introduce gdrcopy awareness to hmem copy #8881

Merged: 1 commit, May 5, 2023

Conversation

@wenduwan (Contributor) commented May 2, 2023:

From v1.19, the OFI_HMEM_DATA_GDRCOPY_HANDLE flag signals the presence of a gdrcopy handle in ofi_mr.hmem_data. The SHM provider can take advantage of gdrcopy to achieve lower memcpy latencies to/from CUDA devices.

This patch introduces the logic to select gdrcopy on hmem copy paths, with the exception of CUDA IPC, which is not supported by gdrcopy.
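In rough terms, the selection boils down to checking the new flag on the ofi_mr and otherwise falling back to the generic hmem copy. The sketch below is illustrative only, not the patch itself: the helper name copy_host_to_cuda is hypothetical, and the exact cuda_gdrcopy_to_dev/ofi_copy_to_hmem signatures and the encoding of the handle in hmem_data are assumptions.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include "ofi_hmem.h"	/* OFI_HMEM_DATA_GDRCOPY_HANDLE, cuda_gdrcopy_*, ofi_copy_to_hmem */
#include "ofi_mr.h"	/* struct ofi_mr */

/* Hypothetical helper: copy size bytes of host memory into a CUDA buffer
 * described by mr, preferring gdrcopy when the owner provider registered
 * a gdrcopy handle in mr->hmem_data. */
static void copy_host_to_cuda(struct ofi_mr *mr, void *dev, const void *host,
			      size_t size)
{
	if (mr->flags & OFI_HMEM_DATA_GDRCOPY_HANDLE) {
		assert(mr->hmem_data);
		/* CPU-driven copy through the gdrcopy mapping (assumed handle encoding). */
		cuda_gdrcopy_to_dev((uint64_t) (uintptr_t) mr->hmem_data,
				    dev, host, size);
	} else {
		/* Generic hmem copy path (cudaMemcpy underneath). */
		(void) ofi_copy_to_hmem(FI_HMEM_CUDA, mr->device, dev, host, size);
	}
}

Exactly where this dispatch should live (provider code versus the common mr_iov copy helpers) is what the review below works through.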

For OMB single-node cuda->cuda latency we see an improvement (via the EFA owner provider, which is responsible for the gdrcopy pinning).

pt2pt/osu_latency
Size (bytes)  Before (us)  After (us)
0 0.95 0.93
1 15.41 2.78
2 15.18 4.36
4 15.04 4.45
8 15.11 4.38
16 15.13 3.00
32 15.29 2.99
64 14.85 3.07
128 14.85 4.66
256 14.62 3.34
512 14.82 4.96
1024 15.13 6.38
2048 15.68 10.99
2560 15.57 12.77
3072 15.57 15.79
3584 15.65 15.80
4096 15.89 15.95
8192 15.58 15.59
16384 15.31 15.24
32768 15.11 15.12
65536 15.24 15.27
131072 15.52 15.60
262144 16.01 16.07
524288 17.24 19.95
1048576 19.28 19.29
2097152 23.23 23.20
4194304 30.93 30.90

			  uint64_t iov_offset)
{
	if (FI_VERSION_GE(smr_prov.fi_version, FI_VERSION(1,19)) &&
	    1 == iov_count && mr && mr[0] && FI_HMEM_CUDA == mr[0]->iface &&
Contributor:
Why 1 == iov_count?

Contributor:
I remember there was such a condition in shm for IPC; I guess the reason here is that we may not want to support mixed iface types.

Contributor Author (@wenduwan):
Yes. SHM currently does not support mixed HMEM ifaces. Here is an example: https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_msg.c#L312

Contributor:
shm does not support mixed ifaces for the IPC path, but it does support mixed ifaces in the inline/inject paths, which is what this is optimizing. I added the ofi_copy_from_mr_iov functions for that reason.
I'm thinking it could be better to make the gdrcopy size limit global and move this check into the mr_iov copy path, since the hmem data is part of the mr info. Thoughts?
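(A minimal sketch of the "global size limit" idea, for illustration only; the macro name and value are hypothetical, though the latency table above suggests the gdrcopy/cudaMemcpy crossover sits somewhere around 3-4 KiB.)

#include <stdbool.h>
#include <stddef.h>
#include "ofi_hmem.h"	/* FI_HMEM_CUDA, OFI_HMEM_DATA_GDRCOPY_HANDLE */
#include "ofi_mr.h"	/* struct ofi_mr */

/* Hypothetical library-wide threshold: below it gdrcopy tends to win,
 * above it the generic copy (cudaMemcpy) does; tunable per HW/driver. */
#define OFI_GDRCOPY_MAX_SIZE 3072

static bool use_gdrcopy(const struct ofi_mr *mr, size_t len)
{
	return mr && mr->iface == FI_HMEM_CUDA &&
	       (mr->flags & OFI_HMEM_DATA_GDRCOPY_HANDLE) &&
	       len <= OFI_GDRCOPY_MAX_SIZE;
}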

Contributor Author (@wenduwan) commented May 2, 2023:

> shm does not support mixed ifaces for the IPC path, but it does support mixed ifaces in the inline/inject paths, which is what this is optimizing.

Actually we are optimizing the current cuda IPC path, which only supports 1 hmem iface as mentioned. But what you are suggesting here also makes sense:

> it could be better to make the gdrcopy size limit global ...

It could be simpler to make the size limit global (because this limit is only related to the hardware and the cuda driver, not to the efa/shm/etc. provider).

> ... and move this check into the mr_iov copy path

I considered this option, but it somewhat conflicts with our intention of separating gdrcopy from the generic hmem copy. Our idea is to make the caller explicitly choose the copy method, so hiding gdrcopy inside the mr_iov copy path runs contrary to that goal.

You pointed out a valid issue: the current implementation will not use gdrcopy for the inline/inject paths when there are mixed ifaces. This is not an issue here, since we are optimizing the cuda IPC path, which has only 1 iface, i.e.

	/* Do not inline/inject if IPC is available so device to device
	 * transfer may occur if possible. */
	if (iov_count == 1 && desc && desc[0]) {
		smr_desc = (struct ofi_mr *) *desc;
		iface = smr_desc->iface;
		use_ipc = ofi_hmem_is_ipc_enabled(iface) &&
				smr_desc->flags & FI_HMEM_DEVICE_ONLY &&
				!(op_flags & FI_INJECT);

		if (FI_VERSION_GE(smr_prov.fi_version, FI_VERSION(1,19)) &&
		    FI_HMEM_CUDA == iface &&
		    (OFI_HMEM_DATA_GDRCOPY_HANDLE & smr_desc->flags)) {
			assert(smr_desc->hmem_data);
			gdrcopy_available = true;
		}
	}

An alternative to changing the common mr_iov path is to change e.g. smr_copy_from_mr_iov to iterate over the ifaces, so we can choose the proper memcpy method case by case. The downside is code duplication with the mr_iov copy, plus we don't have an immediate use for that.
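(A rough sketch of that alternative, purely illustrative: the function name is taken from the sentence above, but the body, the handle cast, and the cuda_gdrcopy_from_dev/ofi_copy_from_hmem signatures are assumptions; iov offsets and error handling are omitted.)

#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>
#include "ofi_hmem.h"
#include "ofi_mr.h"

#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

/* Walk the iov and, per entry, pick gdrcopy or the generic hmem copy based
 * on that entry's mr; this is what per-iface dispatch would look like at
 * the smr level. */
static size_t smr_copy_from_mr_iov(struct ofi_mr **mr, const struct iovec *iov,
				   size_t iov_count, void *buf, size_t size)
{
	size_t done = 0, len, i;

	for (i = 0; i < iov_count && done < size; i++) {
		len = MIN(iov[i].iov_len, size - done);
		if (mr && mr[i] && mr[i]->iface == FI_HMEM_CUDA &&
		    (mr[i]->flags & OFI_HMEM_DATA_GDRCOPY_HANDLE)) {
			cuda_gdrcopy_from_dev(
				(uint64_t) (uintptr_t) mr[i]->hmem_data,
				(char *) buf + done, iov[i].iov_base, len);
		} else {
			(void) ofi_copy_from_hmem(
				mr && mr[i] ? mr[i]->iface : FI_HMEM_SYSTEM,
				mr && mr[i] ? mr[i]->device : 0,
				(char *) buf + done, iov[i].iov_base, len);
		}
		done += len;
	}
	return done;
}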

@aingerson WDYT?

Contributor:
@wenduwan I see what you mean about the duplicate code... I'm leaning towards just doing it in the mr_iov path, maybe with a compile check around the gdr check. shm is the only provider using the mr_iov copy functions, and since the gdrcopy hmem data is in the ofi_mr, I think it makes sense to say that if you use those functions you are opting in to gdrcopy whenever the gdrcopy handle was registered.
If another provider/use case comes along that needs these functions but explicitly needs to avoid the gdrcopy path on a system where it is available and the mr was registered with the gdr handle in the hmem data... then we can cross that bridge when we come to it.

Contributor Author (@wenduwan):
@aingerson Sneaky! Yeah, if SHM is the only user then we can definitely do that.

But I just realized that there is another caveat. For gdrcopy to work, the caller must make sure that the src and dest are not both on cuda. Based on my reading SHM does not have this issue (I need to run more tests), so I will leave a note in the code.
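(For concreteness: gdrcopy maps device memory into the CPU's address space and drives the copy with CPU loads and stores, so exactly one side of the copy can be a CUDA buffer. In shm the buf side is always the host shared-memory region, which is why the provider can lean on gdrcopy here. The wrapper below is only a sketch; smr_gdrcopy is a hypothetical name and the gdrcopy entry-point signatures are assumptions.)

#include <stddef.h>
#include <stdint.h>
#include "ofi_hmem.h"	/* OFI_COPY_BUF_TO_IOV, cuda_gdrcopy_* */

/* buf is host memory (the shm region); dev is the gdrcopy-registered CUDA
 * buffer described by the mr; dir follows the existing buf/iov convention. */
static void smr_gdrcopy(uint64_t handle, void *dev, void *buf, size_t len,
			int dir)
{
	if (dir == OFI_COPY_BUF_TO_IOV)
		cuda_gdrcopy_to_dev(handle, dev, buf, len);	/* host -> device */
	else
		cuda_gdrcopy_from_dev(handle, buf, dev, len);	/* device -> host */
}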

Contributor:
Oh, if that's the case, then we can never use gdrcopy for the inline/inject paths, since those always copy to the shm region, which will never be cuda. If we have to make sure they are both cuda, then we can only call it on the progress IPC path when cmd->iface and rx_entry->iface are both cuda. I think it would be best to keep the ofi_mr copy functions as-is and just add a case in the progress IPC path to call the gdrcopy function.

Contributor Author (@wenduwan):
@aingerson lol I think you read this in reverse?

> For gdrcopy to work, the caller must make sure that the src and dest are not both on cuda

Contributor:
@wenduwan oh lol, yup. My bad!

Contributor Author (@wenduwan) commented May 2, 2023:

bot:aws:retest

@wenduwan force-pushed the shm_gdrcopy_aware branch 3 times, most recently from fbba715 to 3547270 (May 4, 2023 02:17)
@aingerson (Contributor) left a comment:

Just a few comments. I think this approach seems the best to me. Thank you!

Resolved review threads (outdated): prov/shm/src/smr_msg.c, src/hmem.c (×3)

Commit message:
From v1.19, OFI_HMEM_DATA_GDRCOPY_HANDLE flag will signal the presence
of gdrcopy handle in ofi_mr.hmem_data. SHM provider can take advantage
of gdrcopy to achieve lower memcpy latencies from/to cuda devices.

This patch introduces the logic to select gdrcopy on hmem copy paths,
with the exception of cuda IPC which is not supported by gdrcopy.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Contributor Author (@wenduwan) commented May 4, 2023:

@aingerson Please see update. Thanks for simplifying the end result!

@aingerson (Contributor) left a comment:

Thank you for all the adjustments!

@wenduwan requested a review from wzamazon, May 4, 2023 15:59
@shijin-aws (Contributor) left a comment:
LGTM, a minor comment.

	 * TODO: Fine tune the max data size to switch from gdrcopy to cudaMemcpy
	 * Note: buf must be on the host since gdrcopy does not support D2D copy
	 */
	if (dir == OFI_COPY_BUF_TO_IOV)
Contributor:
Hmmm, so there is no way to make cuda_gdrcopy* be called by ofi_hmem_copy*?

Contributor Author (@wenduwan):
For that we would need to pass along hmem_data in addition to device.

Based on our prior discussion this is not a good idea, since gdrcopy is the single use case right now. We prefer to keep the hmem copy API stable at this point and switch to gdrcopy at a higher level of the call stack. Hence we are not changing the hmem copy functions.
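(To make the trade-off concrete: the generic per-iface copy op only receives a device ID, so routing it through gdrcopy would mean widening that interface, roughly as sketched below. That is the change the discussion decides against; the signatures are simplified and the _v2 variant is purely hypothetical.)

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* ssize_t */

/* Today's generic op (simplified): no per-registration state reaches it,
 * so it has no way to learn about a gdrcopy handle. */
extern ssize_t (*copy_to_hmem)(uint64_t device, void *dest, const void *src,
			       size_t size);

/* Hypothetical extension that would let it choose gdrcopy internally;
 * every iface and every caller would then have to carry hmem_data along. */
extern ssize_t (*copy_to_hmem_v2)(uint64_t device, void *hmem_data, void *dest,
				  const void *src, size_t size);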

Contributor:
Got it, makes sense. I see it is wrapped inside ofi_mr_iov_copy, which should be fine.

@shijin-aws (Contributor):
What is the error in Intel CI?

@aingerson (Contributor):
@shijin-aws Unrelated CI error. You can ignore.

Contributor Author (@wenduwan) commented May 5, 2023:

@shefty @wzamazon Do you have additional comments?
