How to compile the CUDA-enabled version without GPU-enabled hfi1 driver? #57

RemiLacroix-IDRIS · 2020-08-24T21:49:33Z

Hello,

We currently manage all installations for our cluster from a node that has no GPU and, consequently, no GPU-enabled hfi1 driver.

Due to the following code snippet, this prevents us from building the CUDA-enabled version of PSM2:
https://github.com/intel/opa-psm2/blob/7a33bedc4bb3dff4e57c00293a2d70890db4d983/psm_hal_gen1/psm_hal_inline_i.h#L507-L516

Is there any way to work around that? There is a runtime check to ensure the hfi1 driver is actually GPU-enabled, wouldn't that be enough?

Best regards,
Rémi

mwheinz · 2020-08-25T13:48:58Z

If you look in the IFS package, the CUDA binaries should be there. You should be able to find the CUDA versions of the RPMs and, using commands like opascpall and opacmd, install them on the appropriate nodes.

RemiLacroix-IDRIS · 2020-08-25T15:46:39Z

We are in a context where we would like to build PSM2 instead of installing it from the RPMs.

ToddRimmer · 2020-08-25T15:49:16Z

> There is a runtime check to ensure the hfi1 driver is actually GPU-enabled, wouldn't that be enough?

I wish it were that simple. Unfortunately, cuda and nvidia code is not upstream. As such, we must develop 2 versions of the hfi1 driver and build the PSM user space accordingly. Only the cuda enabled version of the hfi1 driver contains the APIs and header files used by the cuda enabled PSM. So to build the cuda enabled PSM, you need to have the cuda enabled hfi1 driver installed so its header files are available. As Mike mentions, both packages are available in IFS.

There is an ongoing upstream effort, referred to as “DMAbuf”, which seeks to solve the issue of peer to peer DMA without direct driver to driver interactions. This mechanism, once accepted and integrated into other vendors' software, can resolve some of the issues.

RemiLacroix-IDRIS · 2020-08-25T16:07:48Z

That's unfortunate but thanks for the answer.

> So to build the cuda enabled PSM, you need to have the cuda enabled hfi1 driver installed so its header files are available.

Wouldn't it be possible to distribute the required headers with PSM and test at runtime that the actual driver has the proper capabilities?

BrendanCunningham · 2020-09-01T16:33:01Z

> That's unfortunate but thanks for the answer.
>
> So to build the cuda enabled PSM, you need to have the cuda enabled hfi1 driver installed so its header files are available.
>
> Wouldn't it be possible to distribute the required headers with PSM and test at runtime that the actual driver has the proper capabilities?

As we (PSM2) do not maintain hfi1 and we wish for PSM2 to build against the hfi1 headers installed on the system, we are not going to include the hfi1 headers with PSM2.

Runtime check

PSM2 does check at runtime whether the loaded hfi1 has matching GPUDirect capabilities:
https://github.com/intel/opa-psm2/blob/7a33bedc4bb3dff4e57c00293a2d70890db4d983/psm_context.c#L537-L550

That is, the following combinations do not work or are not advisable:

PSM2 (no CUDA) w/ hfi1-gpudirect => fatal
PSM2-CUDA w/ hfi1 (no GPUDirect support) => warning
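
For illustration only, the combinations above can be written out as a small lookup. This is a sketch and not the code from psm_context.c: the function name `gpudirect_compat` is hypothetical, and treating the two matching pairings as fine is an assumption based on the word "matching" above.

```python
def gpudirect_compat(psm2_cuda: bool, hfi1_gpudirect: bool) -> str:
    """Classify a PSM2 build / hfi1 driver pairing as 'ok', 'warning' or 'fatal'."""
    if hfi1_gpudirect and not psm2_cuda:
        return "fatal"    # PSM2 without CUDA on hfi1-gpudirect
    if psm2_cuda and not hfi1_gpudirect:
        return "warning"  # PSM2-CUDA on hfi1 without GPUDirect support
    return "ok"           # assumption: matching pairings pass the check


# Print the full 2x2 table of pairings.
for psm2_cuda in (False, True):
    for hfi1_gpudirect in (False, True):
        print(psm2_cuda, hfi1_gpudirect, gpudirect_compat(psm2_cuda, hfi1_gpudirect))
```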

Building on host that does not have hfi1-gpudirect headers

You can get the hfi1 headers needed to build PSM2 with CUDA support (uapi/rdma/hfi/hfi1_{user,ioctl}.h) from the ifs-kernel-updates-devel .rpm found in an IFS tarball (from Intel RDC).

IFS tarballs for most distros should have both CUDA and non-CUDA ifs-kernel-updates-devel .rpms. Right now, the hfi1 headers in both the CUDA and non-CUDA ifs-kernel-updates-devel .rpms have the required CUDA/GPUDirect definitions.

You can install the ifs-kernel-updates-devel .rpm on your build node (headers will go under /usr/include/uapi/rdma/hfi). Alternatively, you can extract the .rpm with rpmdev-extract, place the headers where you want, then edit IFS_HFI_HEADER_PATH in psm/buildflags.mak to point to the appropriate uapi/ grandparent of hfi/hfi1_{user,ioctl}.h. I have tried this and it works.
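
As a rough sketch of the second option: once the .rpm has been extracted, a helper like the one below could locate the directory to use for IFS_HFI_HEADER_PATH. It assumes the extracted tree keeps the uapi/rdma/hfi layout mentioned above and that the value wanted is the directory two levels above hfi/. The function name `find_hfi_header_root` is hypothetical and not part of PSM2 or IFS.

```python
from pathlib import Path
from typing import Optional


def find_hfi_header_root(search_root: str) -> Optional[Path]:
    """Return the directory two levels above hfi/ (normally uapi/) that holds
    rdma/hfi/hfi1_user.h and rdma/hfi/hfi1_ioctl.h, or None if not found."""
    for user_h in sorted(Path(search_root).rglob("hfi1_user.h")):
        hfi_dir = user_h.parent
        # Only accept the .../rdma/hfi/ layout described in the comment above.
        if hfi_dir.name != "hfi" or hfi_dir.parent.name != "rdma":
            continue
        # Both headers must be present side by side.
        if (hfi_dir / "hfi1_ioctl.h").is_file():
            return hfi_dir.parent.parent
    return None
```

Calling `find_hfi_header_root()` on the directory where the .rpm was extracted gives the path to set as IFS_HFI_HEADER_PATH; a result of `None` means the two headers were not found together under that directory.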

Let me know if this helps or if you have any more questions. Thanks.

Brendan

RemiLacroix-IDRIS · 2020-09-02T17:01:12Z

Just to be sure I understand correctly, this RPM is not installed by default?

BrendanCunningham · 2020-09-02T19:51:09Z

> Just to be sure I understand correctly, this RPM is not installed by default?

No, it should be: the IFS 'INSTALL' script should install ifs-kernel-updates-devel.

I am saying that if you did not install IFS on your build node, you can extract the hfi1 headers required to build PSM2 from the ifs-kernel-updates-devel .rpm found in the IFS tarball.

RemiLacroix-IDRIS · 2020-09-03T06:51:55Z

Ok, then I need to double-check what is happening here because I couldn't find any /usr/include/uapi directory on our nodes, although I am confident that we have IFS installed on those.
