
[NVPTX] Add support for -march=native in standalone NVPTX #79373

Merged
jhuber6 merged 4 commits into llvm:main on Jan 25, 2024

Conversation

jhuber6
Contributor

@jhuber6 jhuber6 commented Jan 24, 2024

Summary:
We support `--target=nvptx64-nvidia-cuda` as a way to target the NVPTX
architecture from standard C / C++. This patch reuses the existing support
for handling `--offload-arch=native` so that it also applies to the
standalone toolchain.
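For illustration, the intended invocation looks roughly like the following sketch (the source file name is hypothetical; the flag handling matches the driver test added in this patch):

$ clang kernel.c --target=nvptx64-nvidia-cuda -march=native -c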

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Jan 24, 2024
@llvmbot
Collaborator

llvmbot commented Jan 24, 2024

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)

Changes

Summary:
We support `--target=nvptx64-nvidia-cuda` as a way to target the NVPTX
architecture from standard C / C++. This patch reuses the existing support
for handling `--offload-arch=native` so that it also applies to the
standalone toolchain.


Full diff: https://github.com/llvm/llvm-project/pull/79373.diff

3 Files Affected:

  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+35-26)
  • (modified) clang/lib/Driver/ToolChains/Cuda.h (+5-5)
  • (modified) clang/test/Driver/nvptx-cuda-system-arch.c (+5)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 1462576ca870e6f..6215c43b5fc96bd 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -738,9 +738,18 @@ NVPTXToolChain::TranslateArgs(const llvm::opt::DerivedArgList &Args,
     if (!llvm::is_contained(*DAL, A))
       DAL->append(A);
 
-  if (!DAL->hasArg(options::OPT_march_EQ))
+  if (!DAL->hasArg(options::OPT_march_EQ)) {
     DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ),
                       CudaArchToString(CudaArch::CudaDefault));
+  } else if (DAL->getLastArgValue(options::OPT_march_EQ) == "native") {
+    auto GPUsOrErr = getSystemGPUArchs(Args);
+    if (!GPUsOrErr)
+      getDriver().Diag(diag::err_drv_undetermined_gpu_arch)
+          << getArchName() << llvm::toString(GPUsOrErr.takeError()) << "-march";
+    else
+      DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ),
+                        Args.MakeArgString(GPUsOrErr->front()));
+  }
 
   return DAL;
 }
@@ -783,6 +792,31 @@ void NVPTXToolChain::adjustDebugInfoKind(
   }
 }
 
+Expected<SmallVector<std::string>>
+NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const {
+  // Detect NVIDIA GPUs availible on the system.
+  std::string Program;
+  if (Arg *A = Args.getLastArg(options::OPT_nvptx_arch_tool_EQ))
+    Program = A->getValue();
+  else
+    Program = GetProgramPath("nvptx-arch");
+
+  auto StdoutOrErr = executeToolChainProgram(Program);
+  if (!StdoutOrErr)
+    return StdoutOrErr.takeError();
+
+  SmallVector<std::string, 1> GPUArchs;
+  for (StringRef Arch : llvm::split((*StdoutOrErr)->getBuffer(), "\n"))
+    if (!Arch.empty())
+      GPUArchs.push_back(Arch.str());
+
+  if (GPUArchs.empty())
+    return llvm::createStringError(std::error_code(),
+                                   "No NVIDIA GPU detected in the system");
+
+  return std::move(GPUArchs);
+}
+
 /// CUDA toolchain.  Our assembler is ptxas, and our "linker" is fatbinary,
 /// which isn't properly a linker but nonetheless performs the step of stitching
 /// together object files from the assembler into a single blob.
@@ -948,31 +982,6 @@ CudaToolChain::TranslateArgs(const llvm::opt::DerivedArgList &Args,
   return DAL;
 }
 
-Expected<SmallVector<std::string>>
-CudaToolChain::getSystemGPUArchs(const ArgList &Args) const {
-  // Detect NVIDIA GPUs availible on the system.
-  std::string Program;
-  if (Arg *A = Args.getLastArg(options::OPT_nvptx_arch_tool_EQ))
-    Program = A->getValue();
-  else
-    Program = GetProgramPath("nvptx-arch");
-
-  auto StdoutOrErr = executeToolChainProgram(Program);
-  if (!StdoutOrErr)
-    return StdoutOrErr.takeError();
-
-  SmallVector<std::string, 1> GPUArchs;
-  for (StringRef Arch : llvm::split((*StdoutOrErr)->getBuffer(), "\n"))
-    if (!Arch.empty())
-      GPUArchs.push_back(Arch.str());
-
-  if (GPUArchs.empty())
-    return llvm::createStringError(std::error_code(),
-                                   "No NVIDIA GPU detected in the system");
-
-  return std::move(GPUArchs);
-}
-
 Tool *NVPTXToolChain::buildAssembler() const {
   return new tools::NVPTX::Assembler(*this);
 }
diff --git a/clang/lib/Driver/ToolChains/Cuda.h b/clang/lib/Driver/ToolChains/Cuda.h
index 8a053f3393e1206..43c17ba7c0ba03d 100644
--- a/clang/lib/Driver/ToolChains/Cuda.h
+++ b/clang/lib/Driver/ToolChains/Cuda.h
@@ -168,6 +168,11 @@ class LLVM_LIBRARY_VISIBILITY NVPTXToolChain : public ToolChain {
   unsigned GetDefaultDwarfVersion() const override { return 2; }
   unsigned getMaxDwarfVersion() const override { return 2; }
 
+  /// Uses nvptx-arch tool to get arch of the system GPU. Will return error
+  /// if unable to find one.
+  virtual Expected<SmallVector<std::string>>
+  getSystemGPUArchs(const llvm::opt::ArgList &Args) const override;
+
   CudaInstallationDetector CudaInstallation;
 
 protected:
@@ -223,11 +228,6 @@ class LLVM_LIBRARY_VISIBILITY CudaToolChain : public NVPTXToolChain {
 
   const ToolChain &HostTC;
 
-  /// Uses nvptx-arch tool to get arch of the system GPU. Will return error
-  /// if unable to find one.
-  virtual Expected<SmallVector<std::string>>
-  getSystemGPUArchs(const llvm::opt::ArgList &Args) const override;
-
 protected:
   Tool *buildAssembler() const override; // ptxas
   Tool *buildLinker() const override;    // fatbinary (ok, not really a linker)
diff --git a/clang/test/Driver/nvptx-cuda-system-arch.c b/clang/test/Driver/nvptx-cuda-system-arch.c
index 037215fd52a88b2..bd001f82052dc38 100644
--- a/clang/test/Driver/nvptx-cuda-system-arch.c
+++ b/clang/test/Driver/nvptx-cuda-system-arch.c
@@ -31,3 +31,8 @@
 // RUN:   %clang -### --target=x86_64-unknown-linux-gnu -nogpulib --offload-arch=native --offload-new-driver --nvptx-arch-tool=%t/nvptx_arch_sm_70 -x cuda --cuda-path=%S/Inputs/CUDA_102/usr/local/cuda %s 2>&1 \
 // RUN:   | FileCheck %s --check-prefix=ARCH-sm_70
 // ARCH-sm_70: "-cc1" "-triple" "nvptx64-nvidia-cuda"{{.*}}"-target-cpu" "sm_70"
+
+// case when nvptx-arch is used via '-march=native'
+// RUN:   %clang -### --target=nvptx64-nvidia-cuda -nogpulib -march=native --nvptx-arch-tool=%t/nvptx_arch_sm_70 \
+// RUN:     --cuda-path=%S/Inputs/CUDA_102/usr/local/cuda %s 2>&1 | FileCheck %s --check-prefix=MARCH-sm_70
+// MARCH-sm_70: "-cc1" "-triple" "nvptx64-nvidia-cuda"{{.*}}"-target-cpu" "sm_70"

@llvmbot
Collaborator

llvmbot commented Jan 24, 2024

@llvm/pr-subscribers-clang


@Artem-B
Member

Artem-B commented Jan 25, 2024

This option may not work as well as one would hope.

Problem #1 is that it will drastically slow down compilation for some users. NVIDIA GPU drivers are loaded on demand, and the process takes a while (O(second), depending on the kind and number of GPUs). If you build on a headless machine, they will get loaded during the GPU probing step, and they will get unloaded after that. For each compilation. This will also affect folks who use AMD GPUs to run graphics, but use NVIDIA GPUs for compute (my current machine is set up that way). It can be worked around by enabling driver persistence, but there would be no obvious cues for the user that they would need to do so.

Problem #2 is that it will likely result in unnecessary compilation for a nontrivial subset of users who have separate GPUs dedicated to compute and do not care to compile for a separate GPU they use for graphics. The arch=native will create a working configuration, but would build more than necessary. Again, the end user would not be aware of that.

Problem #3 -- it adds an extra step to the reproducibility/debugging process. If/when someone reports an issue with a compilation done with -mnative, we'll inevitably have to start with clarifying questions -- what exactly was the hardware configuration of the machine where the compilation was done.

With my "GPU support dude for nontrivial number of users" hat on, I personally would really like not to open this can of worms. It's not a very big deal, but my gut is telling me that I will see all three cases once the option makes it into the tribal knowledge (hi, reddit & stack overflow!).

So, in short, the source code changes are OK, but I'm not a huge fan of -mnative in principle (both CPU and GPU variants).
If others find it useful, I'm OK with adding the option, but it should probably come with documented caveats so affected users have a chance to find the answer if/when they run into trouble.

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

Some interesting points, I'll try to clarify some things.

This option may not work as well as one would hope.

Problem #1 is that it will drastically slow down compilation for some users. NVIDIA GPU drivers are loaded on demand, and the process takes a while (O(second), depending on the kind and number of GPUs). If you build on a headless machine, they will get loaded during the GPU probing step, and they will get unloaded after that. For each compilation. This will also affect folks who use AMD GPUs to run graphics, but use NVIDIA GPUs for compute (my current machine is set up that way). It can be worked around by enabling driver persistence, but there would be no obvious cues for the user that they would need to do so.

On my machine, with the GPU drivers already loaded, calling nvptx-arch takes about 15ms. For the headless situation, I've noticed that if I have not started Xorg on my server it will take up to 250ms, which is what I'm assuming you're referring to. I think this latency is reasonable, but we'd probably want to document what it does under the hood.
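(For reference, a sketch of what such a query looks like on a hypothetical machine with a single sm_70 card; the tool name comes from this patch, and the output format is assumed from how the driver parses it, one sm_ architecture per line.)

$ nvptx-arch
sm_70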

Problem #2 is that it will likely result in unnecessary compilation for a nontrivial subset of users who have separate GPUs dedicated to compute and do not care to compile for a separate GPU they use for graphics. The arch=native will create a working configuration, but would build more than necessary. Again, the end user would not be aware of that.

It will target the first GPU it finds. We could maybe change the behavior to detect the newest, but the idea is just to target the user's system. I suppose this is somewhat different to the existing --offload-arch=native which will correctly compile for all supported GPUs.

Problem #3 -- it adds an extra step to the reproducibility/debugging process. If/when someone reports an issue with a compilation done with -mnative, we'll inevitably have to start with clarifying questions -- what exactly was the hardware configuration of the machine where the compilation was done.

I'm not so sure; the actual architecture will show up when doing -v or with an LLVM stack dump, so unless the bug report is really unhelpful it should be visible somewhere. But I suppose it's possible. I think that it's much less intuitive currently where we'll just have it default to sm_52 and then not execute anything when that fails to load. Either that or JIT the PTX we may or may not include.

With my "GPU support dude for nontrivial number of users" hat on, I personally would really like not to open this can of worms. It's not a very big deal, but my gut is telling me that I will see all three cases once the option makes it into the tribal knowledge (hi, reddit & stack overflow!).

So, in short, the source code changes are OK, but I'm not a huge fan of -mnative in principle (both CPU and GPU variants). If others find it useful, I'm OK with adding the option, but it should probably come with documented caveats so affected users have a chance to find the answer if/when they run into trouble.

There's some argument against the native options in general, but since users are accustomed to them from CPU targets, I feel it's helpful to make the GPU do the same for cases where the user just wants something that's guaranteed to work. This not working is weird considering that -mcpu=native works for AMDGPU and --offload-arch=native works for CUDA, HIP, and OpenMP currently.

@jlebar
Member

jlebar commented Jan 25, 2024

I think I'm with Art on this one.

Problem #2 [...] The arch=native will create a working configuration, but would build more than necessary.

It will target the first GPU it finds. We could maybe change the behavior to detect the newest, but the idea is just to target the user's system.

OK, but I think this is worse.

Now it's basically always incorrect to ship a build system which uses arch=native, because the people running the build might very reasonably have multiple GPUs in their system, and which GPU clang picks is unspecified.

But we all know people are going to do it anyway.

Given that this feature cannot correctly be used with a build system, and given that 99.99% of invocations of clang are from a build system that the user running the build did not write, it seems to me that we should not add a feature that is such a footgun when used with a build system.

(A non-CUDA C++ file compiled with march=native will almost surely run on your computer, whereas this might not, and it's unpredictable whether or not it will, depending on the order the NVIDIA driver returns GPUs in. So there's no good analogy here.)

If we were going to add this, I think we should compile for all the GPUs in your system, like Art had assumed. I think that's better, but it has other problems, like slow builds and also the fact that your graphics GPU is likely less powerful than your compute GPU, so now compilation is going to fail because you're e.g. using tensorcores and compiling for a GPU that doesn't have them. So again you can't really use arch=native in a build system, even if you say "requires an sm80 GPU", because really the requirement is "has an sm80 GPU and no others in the machine".

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

I think I'm with Art on this one.

Problem #2 [...] The arch=native will create a working configuration, but would build more than necessary.

It will target the first GPU it finds. We could maybe change the behavior to detect the newest, but the idea is just to target the user's system.

OK, but I think this is worse.

Now it's basically always incorrect to ship a build system which uses arch=native, because the people running the build might very reasonably have multiple GPUs in their system, and which GPU clang picks is unspecified.

It's not unspecified per se; it just picks the one the CUDA driver assigned to device ID zero, so it corresponds to the default device a layman would get when using CUDA.

The AMDGPU version has a warning when multiple GPUs are found. I should probably add the same thing here as it would make this explicit.

But we all know people are going to do it anyway.

Given that this feature cannot correctly be used with a build system, and given that 99.99% of invocations of clang are from a build system that the user running the build did not write, it seems to me that we should not add a feature that is such a footgun when used with a build system.

(A non-CUDA C++ file compiled with march=native will almost surely run on your computer, whereas this might not, and it's unpredictable whether or not it will, depending on the order the NVIDIA driver returns GPUs in. So there's no good analogy here.)

If we were going to add this, I think we should compile for all the GPUs in your system, like Art had assumed. I think that's better, but it has other problems, like slow builds and also the fact that your graphics GPU is likely less powerful than your compute GPU, so now compilation is going to fail because you're e.g. using tensorcores and compiling for a GPU that doesn't have them. So again you can't really use arch=native in a build system, even if you say "requires an sm80 GPU", because really the requirement is "has an sm80 GPU and no others in the machine".

We already do this for CUDA with --offload-arch=native. This handling is for targeting NVPTX directly, similar to OpenCL. That means there is no concept of multiple device passes; there can only be a single target architecture, just like compiling standard C++ code. I'd like to have -march=native because it makes it easier to just build something that works for testing purposes, and it's consistent with all the other native handling, since the NVPTX target is the only one that doesn't support it to my knowledge.

@Artem-B
Member

Artem-B commented Jan 25, 2024

It's not unspecified per se; it just picks the one the CUDA driver assigned to device ID zero, so it corresponds to the default device a layman would get when using CUDA.

The default "fastest card first" is also somewhat flaky. First, the "default" enumeration order is affected by the environment (could be by PCI ID, or by "highest-performance-first") which adds another external parameter the user may or may not be aware of. The "highest performance first" is also known to be wrong. E.g. on my machine CUDA runtime was picking a puny newer card I used for graphics over a 2 orders of magnitude faster compute card.

I think that it's much less intuitive currently where we'll just have it default to sm_52

That would fall under "any default choice of GPU will be wrong", with the implication that it's up to the user to explicitly provide the correct set of GPUs to target.

On the other hand, I'd be OK with providing --offload-arch=native translating into "compile for all present GPU variants", with a possibility to further adjust the selected set with the usual --no-offload-arch-foo, if the user needs to. This will at least produce code that will run on the machine where it's built, be somewhat consistent and is still adjustable by the user when the default choice will inevitably be wrong.

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

On the other hand, I'd be OK with providing --offload-arch=native translating into "compile for all present GPU variants", with a possibility to further adjust the selected set with the usual --no-offload-arch-foo, if the user needs to. This will at least produce code that will run on the machine where it's built, be somewhat consistent and is still adjustable by the user when the default choice will inevitably be wrong.

This is what we already do for --offload-arch=native on CUDA, but this is somewhat tangential. I've updated this patch to emit a warning in the case of multiple GPUs being detected, so I don't think there's a concern here with the user being confused. If they have two GPUs, the warning will tell them which one it's using with the correct sm_ value to specify it manually if they so wish. If there is only one GPU on the system, it should be obvious that it's going to be targeted.

@Artem-B
Member

Artem-B commented Jan 25, 2024

This is what we already do for --offload-arch=native on CUDA, but this is somewhat tangential. I've updated this patch to emit a warning in the case of multiple GPUs being detected, so I don't think there's a concern here with the user being confused. If they have two GPUs, the warning will tell them which one it's using with the correct sm_ value to specify it manually if they so wish.

User confusion is only part of the issue here. With any single GPU choice we would still potentially produce a nonworking binary, if our GPU choice does not match what the user wants.

"all GPUs" has the advantage of always producing the binary that's guaranteed to work. Granted, in the case of multiple GPUs it comes with the compilation time overhead, but I think it's a better trade-off than compiling faster, but not working. If the overhead is unacceptable, then we can tweak the build, but in that case, the user may as well just specify the desired architectures explicitly.

If there is only one GPU on the system, it should be obvious that it's going to be targeted.

This case works the same with either approach.

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

User confusion is only part of the issue here. With any single GPU choice we would still potentially produce a nonworking binary, if our GPU choice does not match what the user wants.

"all GPUs" has the advantage of always producing the binary that's guaranteed to work. Granted, in the case of multiple GPUs it comes with the compilation time overhead, but I think it's a better trade-off than compiling faster, but not working. If the overhead is unacceptable, then we can tweak the build, but in that case, the user may as well just specify the desired architectures explicitly.

I think the semantics of native on other architectures are clear enough here. This, combined with the fact that using -march=native will error out in the case of no GPUs available, or give a warning if more than one GPU is available, should make it sufficiently clear what it's doing. This obviously falls apart if you compile with -march=native and then move the binary off of the system you compiled it for, but I feel the same applies to standard x86-64 binaries.

Realistically, very, very few casual users are going to be using direct NVPTX targeting. The current use case is for building tests directly for the GPU without needing to handle calling amdgpu-arch and nvptx-arch manually in CMake. If this lands, I could simplify a lot of CMake code in my libc project by just letting the compiler handle the autodetection, removing one more random program dependency from the build process. AMDGPU already has -mcpu=native, so I'd like NVPTX to match if possible.

@Artem-B
Member

Artem-B commented Jan 25, 2024

I think the semantics of native on other architectures are clear enough here.

I don't think we have the same idea about that. Let's spell it out, so there's no confusion.

GCC manual says:

Using -march=native enables all instruction subsets supported by the local machine (hence the result might not run on different machines)

The way I read it "all instruction subsets supported by the local machine" would be what all-GPUs strategy would do. The binary is expected to run on all GPU architecture variants available on the machine.

Granted, gcc was not written with GPUs in mind, but it's a good baseline for establishing existing conventions for the meaning of -march=native.

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

I think the semantics of native on other architectures are clear enough here.

I don't think we have the same idea about that. Let's spell it out, so there's no confusion.

GCC manual says:

Using -march=native enables all instruction subsets supported by the local machine (hence the result might not run on different machines)

The way I read it "all instruction subsets supported by the local machine" would be what all-GPUs strategy would do. The binary is expected to run on all GPU architecture variants available on the machine.

Granted, gcc was not written with GPUs in mind, but it's a good baseline for establishing existing conventions for the meaning of -march=native.

This more or less depends on what your definition of "local machine" is when it comes to a system augmented with GPUs. The verbiage of "the local machine" implies an assumption that there is only one, which I personally find consistent with just selecting the first GPU found on the system. There is ambiguity in how we should treat this in the case of multiple GPUs, but that's what the warning message is for. It informs the user that the "native" architecture is somewhat ambiguous and that the first one was selected.

Further, our current default makes sense, because it corresponds to Device ID zero in CUDA, which means that unless you change the environment via CUDA_VISIBLE_DEVICES or something, it will work on the default device.

So, in the case where there is one device, the behavior is consistent with -march=native. In the case where there are two, we make an implicit decision to target the first GPU and inform the user. This method of compilation is not like CUDA, so we can't target all the GPUs at the same time. This will be useful in cases where we want to write code that simply targets a GPU that will "work". We have CMake code around LLVM already to do this, so it would be nice to get rid of that.

@Artem-B
Member

Artem-B commented Jan 25, 2024

This method of compilation is not like CUDA, so we can't target all the GPUs at the same time.

I think this is the key fact I was missing. If the patch is only for a standalone compilation which does not do multi-GPU compilation in principle, then your approach makes sense.

I was arguing from the normal offloading perspective, which does have the ability to target multiple GPUs.

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

This method of compilation is not like CUDA, so we can't target all the GPUs at the same time.

I think this is the key fact I was missing. If the patch is only for a standalone compilation which does not do multi-GPU compilation in principle, then your approach makes sense.

I was arguing from the normal offloading perspective, which does have the ability to target multiple GPUs.

Yes, this is more similar to OpenCL or just regular CPU compilation, where we have a single job that creates a simple executable, terminal-application style. So given a single target, the desire is a "pick the one that will work on the default CUDA device without me needing to check" type of thing.

Member

@Artem-B Artem-B left a comment

LGTM, as we can only handle a single GPU target during compilation.

@jlebar
Member

jlebar commented Jan 25, 2024

This method of compilation is not like CUDA, so we can't target all the GPUs at the same time.

Can you clarify for me -- what are you compiling where it's impossible to target multiple GPUs in the binary? I'm confused because Art is understanding that it's not CUDA, but we're modifying the CUDA driver here?

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

This method of compilation is not like CUDA, so we can't target all the GPUs at the same time.

Can you clarify for me -- what are you compiling where it's impossible to target multiple GPUs in the binary? I'm confused because Art is understanding that it's not CUDA, but we're modifying the CUDA driver here?

The idea is to simply compile C / C++ code directly targeting NVPTX rather than going through offloading languages like CUDA or OpenMP. This is more or less what cross-compiling is. We specify --target=nvptx64-nvidia-cuda, which instructs the compiler to cross-compile the C / C++ targeting NVPTX. This results in a workflow that is very close to compiling a standard executable, by design. This is mostly related to my work on the LLVM C library for GPUs, which I gave a talk on that goes into more detail.

Right now, with the LLVM libc infrastructure I can do the following on my AMD GPU.

#include <stdio.h>
int main() { puts("Hello World!"); }

And compile it and run it more or less.

$ clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc crt1.o
$ amdhsa_loader a.out
Hello World!

This works with AMD currently, and I want it to work for NVPTX so I can remove some ugly, annoying code in the libc project. This is how I'm running unit tests targeting the GPU in that project, which need to run on the user's GPU. I'd rather just use -march=native than detect it manually in CMake.
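(For comparison, the NVPTX analogue this patch aims to enable would look roughly like the sketch below; the loader name and the exact set of flags are illustrative, mirroring the AMD invocation above.)

$ clang hello.c --target=nvptx64-nvidia-cuda -march=native -flto -lc crt1.o
$ nvptx_loader a.out
Hello World!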

@jlebar
Member

jlebar commented Jan 25, 2024

I...think I understand.

Is the output of this compilation step a cubin, then?

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

I...think I understand.

Is the output of this compilation step a cubin, then?

Yes, it will spit out a simple cubin instead of a fatbinary. The NVIDIA toolchain is much worse about this stuff than the AMD one, but in general it works. You can check with -### or whatever like in https://godbolt.org/z/fjafj9Ehx.

@jlebar
Member

jlebar commented Jan 25, 2024

Got it, okay, thanks.

Since this change only applies to --target=nvptx64-nvidia-cuda, fine by me. Thanks for putting up with our scrutiny. :)

@jhuber6
Contributor Author

jhuber6 commented Jan 25, 2024

Got it, okay, thanks.

Since this change only applies to --target=nvptx64-nvidia-cuda, fine by me. Thanks for putting up with our scrutiny. :)

No problem, I probably should've been clearer in my commit messages.

@jhuber6 jhuber6 merged commit 82d335e into llvm:main Jan 25, 2024
3 of 4 checks passed
Labels
clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category