UCX: Adding support for RISC-V architecture; RV64G#8246
UCX: Adding support for RISC-V architecture; RV64G#8246jleidel wants to merge 711 commits intoopenucx:masterfrom
Conversation
yosefe
left a comment
There was a problem hiding this comment.
as general comment need to update ucx docx (under docs folder) to show riscv support and also see how can we add riscv to CI
src/ucm/bistro/bistro.c
Outdated
| } | ||
|
|
||
| #if defined(__riscv) | ||
| __sync_synchronize(); |
There was a problem hiding this comment.
maybe sync on all platforms? or use memory barriers macros?
(to avoid if defined here)
There was a problem hiding this comment.
There was a problem hiding this comment.
yes; for now it can be empty for all other architectures
There was a problem hiding this comment.
@yosefe what would be a good name for the macro? ucm_ifence?
There was a problem hiding this comment.
I have it implemented for arm, but we had no reason to expose it outside of arch:
ucx/src/ucs/arch/aarch64/cpu.h
Line 39 in 5761527
We have #define ucs_memory_cpu_fence and I think this one can follow similar naming convention #define ucs_instruction_cpu_fence
There was a problem hiding this comment.
I'm quire a bit confused why we need I-berrier here. Eventually, when patched applied we invoke ucs_clear_cache which flushes the I-cache.
There was a problem hiding this comment.
@ct-clmsn Do you still need the if defined, even after fixing cache flush ?
src/ucm/util/log.c
Outdated
| } | ||
| else if (trace_idx == 0) { |
There was a problem hiding this comment.
there are few code style issues, pls see https://github.com/openucx/ucx/wiki/Code-style-checking
There was a problem hiding this comment.
found and fixed indentation issues. we had errors running the .clang-format script.
| #define UCS_RISCV64_ATOMIC_H_ | ||
|
|
||
| #define UCS_DEFINE_ATOMIC_ADD(wordsize, suffix) \ | ||
| static inline void ucs_atomic_add##wordsize(volatile uint##wordsize##_t *ptr, \ |
There was a problem hiding this comment.
do we really want to define these, or could rely on compiler __builtin_XX constructs?
There was a problem hiding this comment.
Did a little quick grep work and saw that these files have dependencies on the functions generated by the macros:
- src/ucp/wireup/wireup_ep.c
- src/ucs/memory/rcache.c
- src/ucs/async/async.c
- src/ucs/async/thread.c
- src/uct/sm/base/sm_ep.c
not sure what the right answer looks like; this implementation attempts to maintain consistency with the other architecture implementations.
There was a problem hiding this comment.
see
Line 17 in 1a9873a
| * See file LICENSE for terms. | ||
| */ | ||
|
|
||
| #ifndef UCS_RV64_BITOPS_H_ |
There was a problem hiding this comment.
UCS_RISCV_BITOPS_
let's use RISCV name throughout the code and not RV64
There was a problem hiding this comment.
@jleidel @yosefe looking at linux source code I saw that in some places they use RISCV and in other RV64.
It seems line RV64 is an equivalent of aarch64, which actually describes the arch better.
In some other places I see people using riscv64. Anyway, if the code only works for RV64 ISA then maybe it is better to use RV64
There was a problem hiding this comment.
There was a problem hiding this comment.
from my perspective it does not matter as long as it's consistent
There was a problem hiding this comment.
@shamisp which is your preference? rv64 or riscv?
There was a problem hiding this comment.
@ct-clmsn my approach similar to Yossis' - as long as it is consistent it is all good. if it only works on 64b platforms IMHO rv64 is better name, but this is really up to you.
src/ucs/arch/riscv/cpu.c
Outdated
| } | ||
|
|
||
| while (fgets(buf, sizeof(buf), f)) { | ||
| if (sscanf(buf, "uarch : %s", value) == 1) { |
There was a problem hiding this comment.
no need so many spaces before :
src/ucs/arch/riscv/cpu.c
Outdated
| fclose(f); | ||
| } | ||
|
|
||
| void ucs_rv64_cpuid(ucs_rv64_cpuid_t *cpuid) |
There was a problem hiding this comment.
replace rv64 by riscv in all places
There was a problem hiding this comment.
@shamisp 's comment about riscv or rv64 applies here. will need to sort out naming convention.
src/ucs/arch/riscv/cpu.h
Outdated
| } | ||
|
|
||
|
|
||
| //double ucs_rv64_init_tsc_freq(); |
| #define ucs_memory_cpu_load_fence() ucs_compiler_fence() | ||
| #define ucs_memory_cpu_wc_fence() asm volatile ("fence w,w" ::: "memory") | ||
|
|
||
| static inline double ucs_arch_get_clocks_per_sec() |
src/ucs/arch/riscv/cpu.h
Outdated
| char isa[256]; | ||
| char mmu[256]; | ||
| char uarch[256]; |
There was a problem hiding this comment.
it's used to load data from the operating system as required by ucs; I've seen some debugging messages use this information. I'd like to retain it, removing it would be a fair amount of work.
There was a problem hiding this comment.
i don't see any debug message using "uarch" or "mmu" for example, can you pls point to it?
There was a problem hiding this comment.
This was implemented because the ARM support had ucs_aarch64_cpuid_t. ucs_aarch64_cpuid_t exists because of an #ifdef __aarch64__ in ./uct/ib/base/ib_device.c; an assumption was made that riscv64 would require similar work.
The type and the code that initializes the type are used in the riscv64 version of ucs_arch_get_cpu_vendor. ucs_arch_get_cpu_vendor is called in several places but the information stored in the type is never actually consumed or used.
This structure and that code have been removed.
| ucs_config_field_t ucs_arch_global_opts_table[] = { | ||
| #if ENABLE_BUILTIN_MEMCPY | ||
| {"BUILTIN_MEMCPY_MIN", "auto", | ||
| "Minimal threshold of buffer length for using built-in memcpy.", | ||
| ucs_offsetof(ucs_arch_global_opts_t, builtin_memcpy_min), UCS_CONFIG_TYPE_MEMUNITS}, | ||
|
|
||
| {"BUILTIN_MEMCPY_MAX", "auto", | ||
| "Maximal threshold of buffer length for using built-in memcpy.", | ||
| ucs_offsetof(ucs_arch_global_opts_t, builtin_memcpy_max), UCS_CONFIG_TYPE_MEMUNITS}, | ||
| #endif | ||
| {NULL} | ||
| }; |
There was a problem hiding this comment.
are we going to have special memcpy for riscv?
There was a problem hiding this comment.
A run of ucx_info, ucx_perf, etc without this call causes illegal instruction segmentation faults. I'd like to retain this code.
There was a problem hiding this comment.
aarch64 for example does not have it
There was a problem hiding this comment.
Quick note, there is still a tremendous amount of experimental code and/or lacking features in portions of the RISC-V system software and tool chains. Went back through the patch, updated some fence.i locations, and was able to remove this successfully.
src/ucs/arch/cpu.h
Outdated
| #elif defined(__aarch64__) | ||
| # include "aarch64/cpu.h" | ||
| #elif defined(__riscv) | ||
| # ifdef HAVE___CLEAR_CACHE |
There was a problem hiding this comment.
What is the problem with built-in clear cache ?
There was a problem hiding this comment.
It doesn't necessarily clear the cache in the manner expected. The particular implementation of RISC-V that we're testing on predates the ratification of the CMOBase set of extensions (Zicbom, Zicbop, Zicboz) that implement explicit cache flush operations. Given that the community was well aware of the forthcoming CMOBase infrastructure (ratified in Nov 2021), the toolchain and system libs have somewhat lagged their support of particular builtins. They're "implemented", but not fully implemented. This is quite literally a situation where we're running on rather early hardware/software.
src/ucs/arch/riscv/cpu.h
Outdated
| return UCS_ERR_UNSUPPORTED; | ||
| } | ||
|
|
||
| #if !HAVE___CLEAR_CACHE |
There was a problem hiding this comment.
In the above code you undefined HAVE___CLEAR_CACHE , so it is never defined.
There was a problem hiding this comment.
Thank you for pointing this out - this issue has been fixed.
src/ucs/arch/riscv/cpu.h
Outdated
| #if !HAVE___CLEAR_CACHE | ||
| static inline void ucs_arch_clear_cache(void *start, void *end) | ||
| { | ||
| __riscv_flush_icache(start, end, 0); |
There was a problem hiding this comment.
I suspect clear cache maps to the same call. IMHO something really broken here and this is why you had to introduce fence.i in multiple places.
There was a problem hiding this comment.
@jleidel's response clarifies things a bit. gcc has 2 intrinsics for barriers on both caches. __sync_synchronize for data cache and __riscv_flush_icache for instruction cache. It was a little unclear which cache is flushed by this particular ucs function; the naming conventions used in the macros for memory barriers are really nice.
Additionally, I've done some runtime gdb disassembly of both (__sync_synchronize , __riscv_flush_icache ) intrinsics; the disassembly for both intrinsics have no reference to the fence.i instruction. The disassembly looks like an attempt to perform both memory barriers using software (ie: loop until the cache clears) as opposed to using the hardware instruction.
There was a problem hiding this comment.
This is the reference I mentioned on the call:
ucx/src/ucs/arch/aarch64/cpu.h
Line 157 in 5761527
|
@ct-clmsn can you pls fix the commit title according to code style https://github.com/openucx/ucx/wiki/Guidance-for-contributors item 3? |
|
Not the merge commit; the other ones. |
9bbfaed to
b84d18d
Compare
|
@yosefe apologies for the 2 rebases - found an issue with a single message. I believe the commit titles/messages are now updated to meet the specification. |
src/ucm/mmap/install.c
Outdated
| do { \ | ||
| int exp_events = (_event) & (_mask); \ | ||
| (_data)->fired_events = 0; \ | ||
| ucs_rv64_isb(); \ |
There was a problem hiding this comment.
seems the only difference here is ucs_rv64_isb()
so better define a memory barrier macro, empty in all architectures except rv64, and use it.
There was a problem hiding this comment.
well, ISB is actually and an instruction on Arm.
There was a problem hiding this comment.
Why do you think instruction barrier is needed here - from the code is not as obvious.
There was a problem hiding this comment.
The RV64 manual provides RV64 mappings to fence instructions offered on other architectures (ARM, PPC, x86, etc). I implemented the ARM flavors of fences (as per the RV64 mapping) to provide something that looks familiar to ya'll. RV64's data cache flush support is not implemented in the available hardware; the ISA extension for data cache flush was ratified after the available hardware was released. The hardware vendor has implemented their own, non-ISA, instruction cache 'discard'. The non-ISA 'discard' instruction does not perform a 'write back' to memory when invoked; the cache flushes and memory is not touched. 'discard' is not a data cache flush to use in bistro because the instruction is vendor specific and does not 'write back' to memory. The official ISA instruction cache flush is implemented in the available hardware. Segmentation faults happen when fences and the icache flush are not placed before the first time a function modified by bistro is invoked.
src/ucm/mmap/install.c
Outdated
| #if defined(__riscv) | ||
| ucs_arch_clear_cache(p, p+ucm_get_page_size()); | ||
| #endif |
There was a problem hiding this comment.
why clear cache is needed here? and why only on rv64?
(trying to remove #if from the code as much as possible to generic arch macros)
There was a problem hiding this comment.
Segmentation faults happen when fences and the icache flush are not placed before the first time a bistro modified function is invoked.
There was a problem hiding this comment.
keep the cache flush. but make it unconditional, i.e do it for all architectures
There was a problem hiding this comment.
would a macro like the following suffice?
UCM_INVOKE(mmap, ...);
On RV64, cache flushes could be added by the macro prior to mmap(...) getting invoked. On the non-RV64 platform, the cacheflushes would not occur.
There was a problem hiding this comment.
@ct-clmsn should be something like:
if (events & (UCM_EVENT_SHMAT|UCM_EVENT_SHMDT|UCM_EVENT_VM_MAPPED|UCM_EVENT_VM_UNMAPPED)) {
ucs_arch_sync_inst_cache(p, p + ucm_get_page_size());
shmid = shmget(IPC_PRIVATE, ucm_get_page_size(), IPC_CREAT | SHM_R | SHM_W);
...
}And ucs_arch_sync_inst_cache() is defined to ucs_arch_clear_cache on rv4, empty on other archs.
However, looking at the code more, why is this needed only before shmget and not in other places?
src/ucm/util/replace.h
Outdated
| #define UCM_DEFINE_REPLACE_FUNC(_name, _rettype, _fail_val, ...) \ | ||
| _UCM_DEFINE_REPLACE_FUNC(ucm_override_##_name, ucm_##_name, _rettype, _fail_val, __VA_ARGS__) | ||
|
|
||
| #if defined(__riscv) |
There was a problem hiding this comment.
is the difference only ucm_trace? why needed to change?
There was a problem hiding this comment.
Found an issue with ucm_trace on RV64, there is a /memcpystrlen in ucm_trace that is causing segmentation faults. Removing the ucm_trace fixed the issue. This is a side effect of the lack of a data cache flush implementation in the hardware./This strlen issue is why ucm_trace is removed in src/ucm/mmap/install.c.
There was a problem hiding this comment.
apologies, the issue wasn't memcpy, the segmentation fault happens in ucm_log_vsnprintf; which is called by ucm_trace under the hood. There is a situation on the RV64 platform where the strlen on the line linked to above causes a segfault. I have a custom RV64 patch for ucm_log_vsnprintf to fix the strlen issue, not sure if ya'll would want to provide a custom architecture fix for that function; at the time it seemed less controversial to provide an architecture specific macro that dropped ucm_trace.
There was a problem hiding this comment.
@ct-clmsn what is the reason strlen fails on rv64 and not on other archs?
what is the fix?
IMO better try to fix the underlying issue rather and working around it
There was a problem hiding this comment.
I agree with @yosefe suggestion. Let's fix the actual problem and not hack around it.
| * See file LICENSE for terms. | ||
| */ | ||
|
|
||
| #ifndef UCS_RISCV64_ATOMIC_H_ |
There was a problem hiding this comment.
seems this file is not needed anymore
| union input { | ||
| uint64_t value; | ||
| struct pvalue { | ||
| uint32_t rhs; | ||
| uint32_t lhs; | ||
| } pvalue; | ||
| } n_sto = { .value = n }; | ||
|
|
||
| const int lhs = __builtin_ctz(n_sto.pvalue.lhs); | ||
| const int rhs = __builtin_ctz(n_sto.pvalue.rhs); | ||
|
|
||
| int val = rhs; | ||
| if(rhs < 0) { | ||
| val = 32 + lhs; | ||
| } |
There was a problem hiding this comment.
pls fix code style according to clang-format
| return (double) freq; | ||
| } | ||
|
|
||
| static inline ucs_cpu_model_t ucs_arch_get_cpu_model() |
| return UCS_CPU_FLAG_UNKNOWN; | ||
| } | ||
|
|
||
| static inline void ucs_cpu_init() |
There was a problem hiding this comment.
one more space line before
use 2 space lines in H file (in multiple other places as well)
|
|
||
| #define ucs_arch_wait_mem ucs_arch_generic_wait_mem | ||
|
|
||
| static inline void ucs_arch_clear_cache(void * start, void * end) { |
| #define UCS_ARCH_GLOBAL_OPTS_INITALIZER { \ | ||
| .builtin_memcpy_min = UCS_MEMUNITS_AUTO, \ | ||
| .builtin_memcpy_max = UCS_MEMUNITS_AUTO \ | ||
| } | ||
|
|
||
| /* built-in memcpy config */ | ||
| typedef struct ucs_arch_global_opts { | ||
| size_t builtin_memcpy_min; | ||
| size_t builtin_memcpy_max; | ||
| } ucs_arch_global_opts_t; |
There was a problem hiding this comment.
can remove builtin memcpy options for rv64, it's not used anyway
for example aarch64 does not have it
There was a problem hiding this comment.
thought this was removed previously, looks like the last push didn't remove it (merge artifact). will fix promptly.
| UCT_INSTANTIATE_UD_TEST_CASE(test_ud) | ||
|
|
||
|
|
||
| class test_ud_peer_failure : public ud_base_test { |
There was a problem hiding this comment.
is this a merge artifact? if not why needed to change UD tests?
| ucs_arch_clear_cache(p, p+ucm_get_page_size()); | ||
| #endif | ||
| shmid = shmget(IPC_PRIVATE, ucm_get_page_size(), IPC_CREAT | SHM_R | SHM_W); | ||
|
|
src/ucm/mmap/install.c
Outdated
| #if defined(__riscv) | ||
| ucs_arch_clear_cache(p, p+ucm_get_page_size()); | ||
| #endif |
There was a problem hiding this comment.
@ct-clmsn should be something like:
if (events & (UCM_EVENT_SHMAT|UCM_EVENT_SHMDT|UCM_EVENT_VM_MAPPED|UCM_EVENT_VM_UNMAPPED)) {
ucs_arch_sync_inst_cache(p, p + ucm_get_page_size());
shmid = shmget(IPC_PRIVATE, ucm_get_page_size(), IPC_CREAT | SHM_R | SHM_W);
...
}And ucs_arch_sync_inst_cache() is defined to ucs_arch_clear_cache on rv4, empty on other archs.
However, looking at the code more, why is this needed only before shmget and not in other places?
src/ucm/util/replace.h
Outdated
| #define _UCM_DEFINE_REPLACE_FUNC(_over_name, _ucm_name, _rettype, _fail_val, ...) \ | ||
| \ | ||
| /* Define a symbol which goes to the replacement - in case we are loaded first */ \ | ||
| _rettype _over_name(UCM_FUNC_DEFINE_ARGS(__VA_ARGS__)) \ |
There was a problem hiding this comment.
it seems like only the difference here is ucm_trace call ? What is the problem with the trace call ?
There was a problem hiding this comment.
Sorry I missed the above discussion
src/ucm/mmap/install.c
Outdated
| do { \ | ||
| int exp_events = (_event) & (_mask); \ | ||
| (_data)->fired_events = 0; \ | ||
| ucs_rv64_isb(); \ |
There was a problem hiding this comment.
Hmm, why ISB is needed here ? I don't see any obvious reason.
If there is good reason probably it is required on all architectures.
There was a problem hiding this comment.
More I read it and more I don't understand it. The ISB if followed by system call that switches to privilege mode. In theory it should include ISB and memory barrier. Out curiosity, have you tried to replace it with memory write barrier ?
|
@shamisp @yosefe did some testing today. It turns out riscv64's implementation of |
|
At this point, I believe we have exhausted all possible options for removing architecture-specific I$ flushes in the RV64 implementation. Given that this is currently the "most" functional platform we have available for UCX prototyping, we limited in our options with respect to removing architecture specific code from non-architecture specific sections. Our current platform lacks certain support for hardware-driven cache flushes (RV-CMOBASE extension), compat headers (for libfabrics support; kernel 5.11.0-1028) and certain ELF extensions (DT_JMPREL is missing in binutils). We've tried all combinations of architecture-agnostic user and system calls to arrive at a solution. Many of which would potentially induce performance issues in future platform variants. Given the "experimental" nature of RISC-V platforms in HPC, we would like to request that we allow this PR to merge as is. This would allow everyone in the RISC-V community to make forward progress on other HPC runtime/comm layers. However, we will also sign up to amend the code once we have a more functional/stable baseline platform to use for tests and experiments. The RISC-V "platforms" and "profiles" specs will define specific extension requirements that will ease the pain of porting complex runtimes such as OpenUCX to future RISC-V variants. |
@yosefe @ct-clmsn @jleidel Since the u-arch/arch issues cannot be resolved what about moving parts of the abstraction that got |
|
@shamisp @yosefe I've made some updates to the bistro support and placed compile guards around some of the openmp code. The icache issues have been resolved. The modifications have been tested against a sifive unmatched and a pine64/star64 device. The code runs successfully on both devices. Also, I've merged the implementation against master so, it is ready to go. |
f2d9c2b to
06a1c3f
Compare
|
@ct-clmsn can you pls fix the commit title? would need force-push |
| /** | ||
| * @brief JALR - Add 12 bit immediate to source register, save to destination register, jump and link from destination register | ||
| * | ||
| * @param[in] _reg register number (0-31), @param[out] _reg register number (0-31), @param[imm] 12 bit immmediate value |
There was a problem hiding this comment.
I fixed this in f9083eb
These are not out variables in the sense that doxygen would consider it. Also there is no "imm" type.
cffb8a6 to
942f468
Compare
This limit is no longer needed after moving protov1 to ucp rcache, and replacing datatype registration with ucp_memh pointer.
Need to always pass mr_id on rc_verbs transport, even if flush_remote is disabled. On machines with relaxed ordering and DevX md, we would use the atomic-KSM key for put operations after a fence. So need to use the right offset.
The active message lane may be used only for control messages, and not send data during rendezvous protocol. Signed-off-by: Yossi Itigin <yosefe@nvidia.com>
Low byte of flush mkey (aka atomic mr_id) can be 0. When unified mode is used, UCP uses local iface query to know the length of ep address. If one of the peers atomic mr_id is 0 while another is not, UCP uses worng ep address length and incorrectly unpacks UCP address.
Use simplified versions of mem_reg/dereg/pack, that do not rely on uct_ib_md_ops_t callbacks. Update some unit tests that called IB functions directly.
Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Need to always pass mr_id on rc_verbs transport, even if flush_remote is disabled. On machines with relaxed ordering and DevX md, we would use the atomic-KSM key for put operations after a fence. So need to use the right offset.
Low byte of flush mkey (aka atomic mr_id) can be 0. When unified mode is used, UCP uses local iface query to know the length of ep address. If one of the peers atomic mr_id is 0 while another is not, UCP uses worng ep address length and incorrectly unpacks UCP address.
Use simplified versions of mem_reg/dereg/pack, that do not rely on uct_ib_md_ops_t callbacks. Update some unit tests that called IB functions directly.
What
Adding support for the RISC-V architecture to OpenUCX. This includes support for generic RV64G (IMAFDC) micro architectures. Core development by @ct-clmsn and @jleidel
Why ?
Adds support for RISC-V architecture across all subcomponents. This includes support for basic build infrastructure, tooling, ucg, ucm, ucm/bistro, ucp, ucs and uct. Tested using the in-situ test suite on actual RISC-V RV64G hardware in a cluster configuration (see below for configuration details).
How ?
All development and testing was performed using the following parameters: