[NVPTX] Add support for -march=native in standalone NVPTX #79373
Conversation
Summary: We support `--target=nvptx64-nvidia-cuda` as a way to target the NVPTX architecture from a standard CPU. This patch simply extends the existing support for handling `--offload-arch=native` to also apply to the standalone toolchain.
@llvm/pr-subscribers-clang-driver
Author: Joseph Huber (jhuber6)
Full diff: https://github.com/llvm/llvm-project/pull/79373.diff
3 Files Affected:
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 1462576ca870e6f..6215c43b5fc96bd 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -738,9 +738,18 @@ NVPTXToolChain::TranslateArgs(const llvm::opt::DerivedArgList &Args,
if (!llvm::is_contained(*DAL, A))
DAL->append(A);
- if (!DAL->hasArg(options::OPT_march_EQ))
+ if (!DAL->hasArg(options::OPT_march_EQ)) {
DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ),
CudaArchToString(CudaArch::CudaDefault));
+ } else if (DAL->getLastArgValue(options::OPT_march_EQ) == "native") {
+ auto GPUsOrErr = getSystemGPUArchs(Args);
+ if (!GPUsOrErr)
+ getDriver().Diag(diag::err_drv_undetermined_gpu_arch)
+ << getArchName() << llvm::toString(GPUsOrErr.takeError()) << "-march";
+ else
+ DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ),
+ Args.MakeArgString(GPUsOrErr->front()));
+ }
return DAL;
}
@@ -783,6 +792,31 @@ void NVPTXToolChain::adjustDebugInfoKind(
}
}
+Expected<SmallVector<std::string>>
+NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const {
+ // Detect NVIDIA GPUs availible on the system.
+ std::string Program;
+ if (Arg *A = Args.getLastArg(options::OPT_nvptx_arch_tool_EQ))
+ Program = A->getValue();
+ else
+ Program = GetProgramPath("nvptx-arch");
+
+ auto StdoutOrErr = executeToolChainProgram(Program);
+ if (!StdoutOrErr)
+ return StdoutOrErr.takeError();
+
+ SmallVector<std::string, 1> GPUArchs;
+ for (StringRef Arch : llvm::split((*StdoutOrErr)->getBuffer(), "\n"))
+ if (!Arch.empty())
+ GPUArchs.push_back(Arch.str());
+
+ if (GPUArchs.empty())
+ return llvm::createStringError(std::error_code(),
+ "No NVIDIA GPU detected in the system");
+
+ return std::move(GPUArchs);
+}
+
/// CUDA toolchain. Our assembler is ptxas, and our "linker" is fatbinary,
/// which isn't properly a linker but nonetheless performs the step of stitching
/// together object files from the assembler into a single blob.
@@ -948,31 +982,6 @@ CudaToolChain::TranslateArgs(const llvm::opt::DerivedArgList &Args,
return DAL;
}
-Expected<SmallVector<std::string>>
-CudaToolChain::getSystemGPUArchs(const ArgList &Args) const {
- // Detect NVIDIA GPUs availible on the system.
- std::string Program;
- if (Arg *A = Args.getLastArg(options::OPT_nvptx_arch_tool_EQ))
- Program = A->getValue();
- else
- Program = GetProgramPath("nvptx-arch");
-
- auto StdoutOrErr = executeToolChainProgram(Program);
- if (!StdoutOrErr)
- return StdoutOrErr.takeError();
-
- SmallVector<std::string, 1> GPUArchs;
- for (StringRef Arch : llvm::split((*StdoutOrErr)->getBuffer(), "\n"))
- if (!Arch.empty())
- GPUArchs.push_back(Arch.str());
-
- if (GPUArchs.empty())
- return llvm::createStringError(std::error_code(),
- "No NVIDIA GPU detected in the system");
-
- return std::move(GPUArchs);
-}
-
Tool *NVPTXToolChain::buildAssembler() const {
return new tools::NVPTX::Assembler(*this);
}
diff --git a/clang/lib/Driver/ToolChains/Cuda.h b/clang/lib/Driver/ToolChains/Cuda.h
index 8a053f3393e1206..43c17ba7c0ba03d 100644
--- a/clang/lib/Driver/ToolChains/Cuda.h
+++ b/clang/lib/Driver/ToolChains/Cuda.h
@@ -168,6 +168,11 @@ class LLVM_LIBRARY_VISIBILITY NVPTXToolChain : public ToolChain {
unsigned GetDefaultDwarfVersion() const override { return 2; }
unsigned getMaxDwarfVersion() const override { return 2; }
+ /// Uses nvptx-arch tool to get arch of the system GPU. Will return error
+ /// if unable to find one.
+ virtual Expected<SmallVector<std::string>>
+ getSystemGPUArchs(const llvm::opt::ArgList &Args) const override;
+
CudaInstallationDetector CudaInstallation;
protected:
@@ -223,11 +228,6 @@ class LLVM_LIBRARY_VISIBILITY CudaToolChain : public NVPTXToolChain {
const ToolChain &HostTC;
- /// Uses nvptx-arch tool to get arch of the system GPU. Will return error
- /// if unable to find one.
- virtual Expected<SmallVector<std::string>>
- getSystemGPUArchs(const llvm::opt::ArgList &Args) const override;
-
protected:
Tool *buildAssembler() const override; // ptxas
Tool *buildLinker() const override; // fatbinary (ok, not really a linker)
diff --git a/clang/test/Driver/nvptx-cuda-system-arch.c b/clang/test/Driver/nvptx-cuda-system-arch.c
index 037215fd52a88b2..bd001f82052dc38 100644
--- a/clang/test/Driver/nvptx-cuda-system-arch.c
+++ b/clang/test/Driver/nvptx-cuda-system-arch.c
@@ -31,3 +31,8 @@
// RUN: %clang -### --target=x86_64-unknown-linux-gnu -nogpulib --offload-arch=native --offload-new-driver --nvptx-arch-tool=%t/nvptx_arch_sm_70 -x cuda --cuda-path=%S/Inputs/CUDA_102/usr/local/cuda %s 2>&1 \
// RUN: | FileCheck %s --check-prefix=ARCH-sm_70
// ARCH-sm_70: "-cc1" "-triple" "nvptx64-nvidia-cuda"{{.*}}"-target-cpu" "sm_70"
+
+// case when nvptx-arch is used via '-march=native'
+// RUN: %clang -### --target=nvptx64-nvidia-cuda -nogpulib -march=native --nvptx-arch-tool=%t/nvptx_arch_sm_70 \
+// RUN: --cuda-path=%S/Inputs/CUDA_102/usr/local/cuda %s 2>&1 | FileCheck %s --check-prefix=MARCH-sm_70
+// MARCH-sm_70: "-cc1" "-triple" "nvptx64-nvidia-cuda"{{.*}}"-target-cpu" "sm_70"
This option may not work as well as one would hope.

Problem #1 is that it will drastically slow down compilation for some users. NVIDIA GPU drivers are loaded on demand, and the process takes a while (O(seconds), depending on the kind and number of GPUs). If you build on a headless machine, the drivers will get loaded during the GPU probing step, and they will get unloaded after that. For each compilation. This will also affect folks who use AMD GPUs to run graphics but use NVIDIA GPUs for compute (my current machine is set up that way). It can be worked around by enabling driver persistence, but there would be no obvious cues for the user that they would need to do so.

Problem #2 is that it will likely result in unnecessary compilation for a nontrivial subset of users who have separate GPUs dedicated to compute and do not care to compile for the separate GPU they use for graphics.

Problem #3 -- it adds an extra step to the reproducibility/debugging process. If/when someone reports an issue with a compilation done with `-march=native`, one would first have to figure out which GPU it actually targeted.

With my "GPU support dude for a nontrivial number of users" hat on, I personally would really like not to open this can of worms. It's not a very big deal, but my gut is telling me that I will see all three cases once the option makes it into the tribal knowledge (hi, reddit & stack overflow!).

So, in short, the source code changes are OK, but I'm not a huge fan of `-march=native` itself.
Some interesting points; I'll try to clarify some things.
On my machine, which has the GPUs already loaded, calling `nvptx-arch` is quick, so the overhead mainly bites when the driver has to be loaded on demand as you describe.
It will target the first GPU it finds. We could maybe change the behavior to detect the newest, but the idea is just to target the user's system. I suppose this is somewhat different from the existing `--offload-arch=native` handling.
I'm not so sure; the actual architecture will show up when doing a verbose compilation with `-v`.
There's some argument against the `native` behavior in general, though.
I think I'm with Art on this one.
OK, but I think this is worse. Now it's basically always incorrect to ship a build system which uses arch=native, because the people running the build might very reasonably have multiple GPUs in their system, and which GPU clang picks is unspecified. But we all know people are going to do it anyway.

Given that this feature cannot correctly be used with a build system, and given that 99.99% of invocations of clang are from a build system that the user running the build did not write, it seems to me that we should not add a feature that is such a footgun when used with a build system.

(A non-CUDA C++ file compiled with march=native will almost surely run on your computer, whereas this won't, and it's unpredictable whether or not it will, depending on the order the nvidia driver returns GPUs in. So there's no good analogy here.)

If we were going to add this, I think we should compile for all the GPUs in your system, like Art had assumed. I think that's better, but it has other problems, like slow builds, and also the fact that your graphics GPU is likely less powerful than your compute GPU, so now compilation is going to fail because you're e.g. using tensorcores and compiling for a GPU that doesn't have them. So again you can't really use arch=native in a build system, even if you say "requires an sm80 GPU", because really the requirement is "has an sm80 GPU and no others in the machine".
It's not unspecified per se; it just picks the one the CUDA driver assigned to ID zero, so it will correspond to a layman using the default device in CUDA. The AMDGPU version emits a warning when multiple GPUs are found. I should probably add the same thing here, as it would make this explicit.
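For reference, what a tool like `nvptx-arch` does boils down to roughly the following. This is a minimal sketch using the CUDA driver API, not the actual implementation; error handling is trimmed:

```c
// Sketch of an nvptx-arch-style query: print one sm_<major><minor> line per
// GPU, in driver enumeration order, so the first line is device ID zero.
#include <cuda.h>
#include <stdio.h>

int main(void) {
  if (cuInit(0) != CUDA_SUCCESS)
    return 1; // No usable NVIDIA driver on this system.

  int Count = 0;
  if (cuDeviceGetCount(&Count) != CUDA_SUCCESS || Count == 0)
    return 1; // No NVIDIA GPU detected.

  for (int I = 0; I < Count; ++I) {
    CUdevice Dev;
    int Major = 0, Minor = 0;
    if (cuDeviceGet(&Dev, I) != CUDA_SUCCESS)
      return 1;
    cuDeviceGetAttribute(&Major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                         Dev);
    cuDeviceGetAttribute(&Minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR,
                         Dev);
    printf("sm_%d%d\n", Major, Minor);
  }
  return 0;
}
```

The driver change in this patch then just takes the first line of that output as the `-march` value.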
We already do this for CUDA with `--offload-arch=native`.
The default "fastest card first" is also somewhat flaky. First, the "default" enumeration order is affected by the environment (could be by PCI ID, or by "highest-performance-first") which adds another external parameter the user may or may not be aware of. The "highest performance first" is also known to be wrong. E.g. on my machine CUDA runtime was picking a puny newer card I used for graphics over a 2 orders of magnitude faster compute card.
That would fall under "any default choice for GPU will be wrong" and the implication that it's up to the user to explicitly provide the correct set of GPUs to target. On the other hand, I'd be OK with providing `native` as a way to target all the GPUs present on the machine.
This is what we already do for `--offload-arch=native` in the offloading toolchains.
User confusion is only part of the issue here. With any single GPU choice we would still potentially produce a nonworking binary if our GPU choice does not match what the user wants. "All GPUs" has the advantage of always producing a binary that's guaranteed to work. Granted, in the case of multiple GPUs it comes with compilation-time overhead, but I think that's a better trade-off than compiling faster but not working. If the overhead is unacceptable, then we can tweak the build, but in that case the user may as well just specify the desired architectures explicitly.
I think the semantics of `-march=native` are a bit different here, since `-march` only takes a single architecture in the first place. Realistically, very, very few casual users are going to be using direct NVPTX targeting. The current use-case is for building tests directly for the GPU without needing to handle calling them through an offloading runtime.
I don't think we have the same idea about that. Let's spell it out, so there's no confusion. The GCC manual says:

> Using `-march=native` enables all instruction subsets supported by the local machine (hence the meaning of native).
The way I read it, "all instruction subsets supported by the local machine" would be what the all-GPUs strategy does: the binary is expected to run on all GPU architecture variants available on the machine. Granted, gcc was not written with GPUs in mind, but it's a good baseline for establishing the existing conventions for the meaning of `-march=native`.
This more or less depends on what your definition of "local machine" is when it comes to a system augmented with GPUs. The verbiage of "the local machine" implies an assumption that there is only one, which I personally find consistent with just selecting the first GPU found on the system. There is ambiguity in how we should treat this in the case of multiple GPUs, but that's what the warning message is for: it informs the user that the "native" architecture is somewhat ambiguous and that the first one was selected. Further, our current default makes sense because it corresponds to device ID zero in CUDA, which means that unless you change the environment via `CUDA_VISIBLE_DEVICES`, it is the GPU the CUDA runtime would use by default. So, in the case where there is one device, the behavior is consistent with GCC's `-march=native`.
I think this is the key fact I was missing. If the patch is only for standalone compilation, which does not do multi-GPU compilation in principle, then your approach makes sense. I was arguing from the perspective of normal offloading, which does have the ability to target multiple GPUs.
Yes, this is more similar to OpenCL, or just regular CPU compilation, where we have a single job that creates a simple executable, terminal-application style. So, given a single target, the desire is a "pick me the one that will work on the default CUDA device without me needing to check" type thing.
LGTM, as we can only handle a single GPU target during compilation.
Can you clarify for me -- what are you compiling where it's impossible to target multiple GPUs in the binary? I'm confused because Art's understanding is that it's not CUDA, but we're modifying the CUDA driver here?
The idea is to simply compile C/C++ code directly targeting NVPTX rather than going through offloading languages like CUDA or OpenMP; this is more or less plain cross-compiling. We specify `--target=nvptx64-nvidia-cuda` and get a GPU executable out. Right now, with the LLVM C library for GPUs, we can compile something like:
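(A sketch: it assumes an llvm-libc build for NVPTX that supplies startup code and `printf`; the exact invocation may differ.)

```c
// hello.c -- plain C, cross-compiled straight to a GPU executable.
// Illustrative invocation, assuming an installed llvm-libc NVPTX build:
//   clang hello.c --target=nvptx64-nvidia-cuda -march=native
#include <stdio.h>

int main(void) {
  printf("Hello from NVPTX\n");
  return 0;
}
```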
And compile it and run it more or less.
This works with AMD currently, and I want it to work for NVPTX so I can remove some ugly, annoying code in the `libc` project.
I...think I understand. Is the output of this compilation step a cubin, then?
Yes, it will spit out a simple `cubin`, which a small host-side loader can then run directly.
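Running it is just the usual driver-API dance. A stripped-down sketch of what such a loader does follows; the real llvm-libc loader also sets up RPC so `printf` works, and the `_start` entry-point name here is an assumption:

```c
// Minimal host-side loader sketch: load a cubin and launch its entry point.
#include <cuda.h>
#include <stdio.h>

int main(int argc, char **argv) {
  if (argc < 2)
    return 1;

  cuInit(0);
  CUdevice Dev;
  cuDeviceGet(&Dev, 0); // Device ID zero, matching what -march=native picked.
  CUcontext Ctx;
  cuCtxCreate(&Ctx, 0, Dev);

  CUmodule Mod;
  if (cuModuleLoad(&Mod, argv[1]) != CUDA_SUCCESS) {
    fprintf(stderr, "failed to load %s\n", argv[1]);
    return 1;
  }

  CUfunction Entry;
  if (cuModuleGetFunction(&Entry, Mod, "_start") != CUDA_SUCCESS)
    return 1;

  // Launch a single thread and wait for the "program" to finish.
  cuLaunchKernel(Entry, /*grid=*/1, 1, 1, /*block=*/1, 1, 1,
                 /*sharedMemBytes=*/0, /*stream=*/0, /*params=*/0,
                 /*extra=*/0);
  cuCtxSynchronize();
  return 0;
}
```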
Got it, okay, thanks. Since this change only applies to the standalone `--target=nvptx64-nvidia-cuda` toolchain, my concern about build systems doesn't really apply.
No problem, I probably should've been clearer in my commit messages.