Skip to content

Comments

UCX: Adding support for RISC-V architecture; RV64G#8246

Closed
jleidel wants to merge 711 commits intoopenucx:masterfrom
tactcomplabs:rv64-support
Closed

UCX: Adding support for RISC-V architecture; RV64G#8246
jleidel wants to merge 711 commits intoopenucx:masterfrom
tactcomplabs:rv64-support

Conversation

@jleidel
Copy link

@jleidel jleidel commented May 19, 2022

What

Adding support for the RISC-V architecture to OpenUCX. This includes support for generic RV64G (IMAFDC) micro architectures. Core development by @ct-clmsn and @jleidel

Why ?

Adds support for RISC-V architecture across all subcomponents. This includes support for basic build infrastructure, tooling, ucg, ucm, ucm/bistro, ucp, ucs and uct. Tested using the in-situ test suite on actual RISC-V RV64G hardware in a cluster configuration (see below for configuration details).

How ?

All development and testing was performed using the following parameters:

  • SiFive HiFive Unmatched boards
  • Ubuntu 21.04
  • GCC 10.3.0

@shamisp
Copy link
Contributor

shamisp commented May 19, 2022

@yosefe - @jleidel 's company has signed CLA.

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as general comment need to update ucx docx (under docs folder) to show riscv support and also see how can we add riscv to CI

}

#if defined(__riscv)
__sync_synchronize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe sync on all platforms? or use memory barriers macros?
(to avoid if defined here)

Copy link

@ct-clmsn ct-clmsn May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the riscv processor has a fence.i instruction synchronizing both the instruction and data caches; it exists for self modifying code (JITs). would you and @shamisp be open to each architecture providing a macro that is selectively implemented to optionally support this processor feature?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes; for now it can be empty for all other architectures

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosefe what would be a good name for the macro? ucm_ifence?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have it implemented for arm, but we had no reason to expose it outside of arch:

#define ucs_aarch64_isb(_op) asm volatile ("isb " #_op ::: "memory")

We have #define ucs_memory_cpu_fence and I think this one can follow similar naming convention #define ucs_instruction_cpu_fence

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quire a bit confused why we need I-berrier here. Eventually, when patched applied we invoke ucs_clear_cache which flushes the I-cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ct-clmsn Do you still need the if defined, even after fixing cache flush ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bumping up the above question @ct-clmsn :)

Comment on lines 177 to 178
}
else if (trace_idx == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are few code style issues, pls see https://github.com/openucx/ucx/wiki/Code-style-checking

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found and fixed indentation issues. we had errors running the .clang-format script.

#define UCS_RISCV64_ATOMIC_H_

#define UCS_DEFINE_ATOMIC_ADD(wordsize, suffix) \
static inline void ucs_atomic_add##wordsize(volatile uint##wordsize##_t *ptr, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really want to define these, or could rely on compiler __builtin_XX constructs?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a little quick grep work and saw that these files have dependencies on the functions generated by the macros:

  • src/ucp/wireup/wireup_ep.c
  • src/ucs/memory/rcache.c
  • src/ucs/async/async.c
  • src/ucs/async/thread.c
  • src/uct/sm/base/sm_ep.c

not sure what the right answer looks like; this implementation attempts to maintain consistency with the other architecture implementations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see

# include "generic/atomic.h"
, maybe we can use generic/atomic.h for riscv as well

* See file LICENSE for terms.
*/

#ifndef UCS_RV64_BITOPS_H_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UCS_RISCV_BITOPS_
let's use RISCV name throughout the code and not RV64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jleidel @yosefe looking at linux source code I saw that in some places they use RISCV and in other RV64.
It seems line RV64 is an equivalent of aarch64, which actually describes the arch better.
In some other places I see people using riscv64. Anyway, if the code only works for RV64 ISA then maybe it is better to use RV64

Copy link

@ct-clmsn ct-clmsn May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed these issues. @shamisp; prior to submitting the patch, I chatted with @jleidel about the naming convention rv64 or riscv. we landed on riscv; we have a similar set of concerns and observations. if ya'll want or prefer rv64 that's an easy enough modification.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my perspective it does not matter as long as it's consistent

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamisp which is your preference? rv64 or riscv?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ct-clmsn my approach similar to Yossis' - as long as it is consistent it is all good. if it only works on 64b platforms IMHO rv64 is better name, but this is really up to you.

}

while (fgets(buf, sizeof(buf), f)) {
if (sscanf(buf, "uarch : %s", value) == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need so many spaces before :

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed these issues.

fclose(f);
}

void ucs_rv64_cpuid(ucs_rv64_cpuid_t *cpuid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace rv64 by riscv in all places

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamisp 's comment about riscv or rv64 applies here. will need to sort out naming convention.

}


//double ucs_rv64_init_tsc_freq();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this one.

#define ucs_memory_cpu_load_fence() ucs_compiler_fence()
#define ucs_memory_cpu_wc_fence() asm volatile ("fence w,w" ::: "memory")

static inline double ucs_arch_get_clocks_per_sec()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UCS_F_ALWAYS_INLINE

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed this one.

Comment on lines 49 to 51
char isa[256];
char mmu[256];
char uarch[256];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is the cpuid used?

Copy link

@ct-clmsn ct-clmsn May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used to load data from the operating system as required by ucs; I've seen some debugging messages use this information. I'd like to retain it, removing it would be a fair amount of work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't see any debug message using "uarch" or "mmu" for example, can you pls point to it?

Copy link

@ct-clmsn ct-clmsn May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was implemented because the ARM support had ucs_aarch64_cpuid_t. ucs_aarch64_cpuid_t exists because of an #ifdef __aarch64__ in ./uct/ib/base/ib_device.c; an assumption was made that riscv64 would require similar work.

The type and the code that initializes the type are used in the riscv64 version of ucs_arch_get_cpu_vendor. ucs_arch_get_cpu_vendor is called in several places but the information stored in the type is never actually consumed or used.

This structure and that code have been removed.

Comment on lines 16 to 27
ucs_config_field_t ucs_arch_global_opts_table[] = {
#if ENABLE_BUILTIN_MEMCPY
{"BUILTIN_MEMCPY_MIN", "auto",
"Minimal threshold of buffer length for using built-in memcpy.",
ucs_offsetof(ucs_arch_global_opts_t, builtin_memcpy_min), UCS_CONFIG_TYPE_MEMUNITS},

{"BUILTIN_MEMCPY_MAX", "auto",
"Maximal threshold of buffer length for using built-in memcpy.",
ucs_offsetof(ucs_arch_global_opts_t, builtin_memcpy_max), UCS_CONFIG_TYPE_MEMUNITS},
#endif
{NULL}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we going to have special memcpy for riscv?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A run of ucx_info, ucx_perf, etc without this call causes illegal instruction segmentation faults. I'd like to retain this code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aarch64 for example does not have it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note, there is still a tremendous amount of experimental code and/or lacking features in portions of the RISC-V system software and tool chains. Went back through the patch, updated some fence.i locations, and was able to remove this successfully.

#elif defined(__aarch64__)
# include "aarch64/cpu.h"
#elif defined(__riscv)
# ifdef HAVE___CLEAR_CACHE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the problem with built-in clear cache ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't necessarily clear the cache in the manner expected. The particular implementation of RISC-V that we're testing on predates the ratification of the CMOBase set of extensions (Zicbom, Zicbop, Zicboz) that implement explicit cache flush operations. Given that the community was well aware of the forthcoming CMOBase infrastructure (ratified in Nov 2021), the toolchain and system libs have somewhat lagged their support of particular builtins. They're "implemented", but not fully implemented. This is quite literally a situation where we're running on rather early hardware/software.

return UCS_ERR_UNSUPPORTED;
}

#if !HAVE___CLEAR_CACHE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above code you undefined HAVE___CLEAR_CACHE , so it is never defined.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out - this issue has been fixed.

#if !HAVE___CLEAR_CACHE
static inline void ucs_arch_clear_cache(void *start, void *end)
{
__riscv_flush_icache(start, end, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect clear cache maps to the same call. IMHO something really broken here and this is why you had to introduce fence.i in multiple places.

Copy link

@ct-clmsn ct-clmsn May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jleidel's response clarifies things a bit. gcc has 2 intrinsics for barriers on both caches. __sync_synchronize for data cache and __riscv_flush_icache for instruction cache. It was a little unclear which cache is flushed by this particular ucs function; the naming conventions used in the macros for memory barriers are really nice.

Additionally, I've done some runtime gdb disassembly of both (__sync_synchronize , __riscv_flush_icache ) intrinsics; the disassembly for both intrinsics have no reference to the fence.i instruction. The disassembly looks like an attempt to perform both memory barriers using software (ie: loop until the cache clears) as opposed to using the hardware instruction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reference I mentioned on the call:

static inline void ucs_arch_clear_cache(void *start, void *end)

@ct-clmsn
Copy link

ct-clmsn commented Jun 8, 2022

@shamisp @yosefe quick update: several modifications have been made and pushed into this PR.

@yosefe
Copy link
Contributor

yosefe commented Jun 8, 2022

@ct-clmsn can you pls fix the commit title according to code style https://github.com/openucx/ucx/wiki/Guidance-for-contributors item 3?
(would need to force-push to resolve)

@ct-clmsn
Copy link

ct-clmsn commented Jun 8, 2022

@yosefe change the automated commit message for the merge (commit # a3e2624)?

@yosefe
Copy link
Contributor

yosefe commented Jun 8, 2022

Not the merge commit; the other ones.

@ct-clmsn ct-clmsn force-pushed the rv64-support branch 2 times, most recently from 9bbfaed to b84d18d Compare June 8, 2022 20:36
@ct-clmsn
Copy link

ct-clmsn commented Jun 8, 2022

@yosefe apologies for the 2 rebases - found an issue with a single message. I believe the commit titles/messages are now updated to meet the specification.

do { \
int exp_events = (_event) & (_mask); \
(_data)->fired_events = 0; \
ucs_rv64_isb(); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems the only difference here is ucs_rv64_isb()
so better define a memory barrier macro, empty in all architectures except rv64, and use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, ISB is actually and an instruction on Arm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think instruction barrier is needed here - from the code is not as obvious.

Copy link

@ct-clmsn ct-clmsn Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RV64 manual provides RV64 mappings to fence instructions offered on other architectures (ARM, PPC, x86, etc). I implemented the ARM flavors of fences (as per the RV64 mapping) to provide something that looks familiar to ya'll. RV64's data cache flush support is not implemented in the available hardware; the ISA extension for data cache flush was ratified after the available hardware was released. The hardware vendor has implemented their own, non-ISA, instruction cache 'discard'. The non-ISA 'discard' instruction does not perform a 'write back' to memory when invoked; the cache flushes and memory is not touched. 'discard' is not a data cache flush to use in bistro because the instruction is vendor specific and does not 'write back' to memory. The official ISA instruction cache flush is implemented in the available hardware. Segmentation faults happen when fences and the icache flush are not placed before the first time a function modified by bistro is invoked.

Comment on lines 202 to 204
#if defined(__riscv)
ucs_arch_clear_cache(p, p+ucm_get_page_size());
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why clear cache is needed here? and why only on rv64?
(trying to remove #if from the code as much as possible to generic arch macros)

Copy link

@ct-clmsn ct-clmsn Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Segmentation faults happen when fences and the icache flush are not placed before the first time a bistro modified function is invoked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the cache flush. but make it unconditional, i.e do it for all architectures

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would a macro like the following suffice?

UCM_INVOKE(mmap, ...);

On RV64, cache flushes could be added by the macro prior to mmap(...) getting invoked. On the non-RV64 platform, the cacheflushes would not occur.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ct-clmsn should be something like:

if (events & (UCM_EVENT_SHMAT|UCM_EVENT_SHMDT|UCM_EVENT_VM_MAPPED|UCM_EVENT_VM_UNMAPPED)) {
    ucs_arch_sync_inst_cache(p, p + ucm_get_page_size());
    shmid = shmget(IPC_PRIVATE, ucm_get_page_size(), IPC_CREAT | SHM_R | SHM_W);
   ...
}

And ucs_arch_sync_inst_cache() is defined to ucs_arch_clear_cache on rv4, empty on other archs.

However, looking at the code more, why is this needed only before shmget and not in other places?

#define UCM_DEFINE_REPLACE_FUNC(_name, _rettype, _fail_val, ...) \
_UCM_DEFINE_REPLACE_FUNC(ucm_override_##_name, ucm_##_name, _rettype, _fail_val, __VA_ARGS__)

#if defined(__riscv)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the difference only ucm_trace? why needed to change?

Copy link

@ct-clmsn ct-clmsn Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found an issue with ucm_trace on RV64, there is a memcpy/strlen in ucm_trace that is causing segmentation faults. Removing the ucm_trace fixed the issue. This is a side effect of the lack of a data cache flush implementation in the hardware./This strlen issue is why ucm_trace is removed in src/ucm/mmap/install.c.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which memcpy was it?

Copy link

@ct-clmsn ct-clmsn Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies, the issue wasn't memcpy, the segmentation fault happens in ucm_log_vsnprintf; which is called by ucm_trace under the hood. There is a situation on the RV64 platform where the strlen on the line linked to above causes a segfault. I have a custom RV64 patch for ucm_log_vsnprintf to fix the strlen issue, not sure if ya'll would want to provide a custom architecture fix for that function; at the time it seemed less controversial to provide an architecture specific macro that dropped ucm_trace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ct-clmsn what is the reason strlen fails on rv64 and not on other archs?
what is the fix?
IMO better try to fix the underlying issue rather and working around it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @yosefe suggestion. Let's fix the actual problem and not hack around it.

* See file LICENSE for terms.
*/

#ifndef UCS_RISCV64_ATOMIC_H_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this file is not needed anymore

Comment on lines +20 to +34
union input {
uint64_t value;
struct pvalue {
uint32_t rhs;
uint32_t lhs;
} pvalue;
} n_sto = { .value = n };

const int lhs = __builtin_ctz(n_sto.pvalue.lhs);
const int rhs = __builtin_ctz(n_sto.pvalue.rhs);

int val = rhs;
if(rhs < 0) {
val = 32 + lhs;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls fix code style according to clang-format

return (double) freq;
}

static inline ucs_cpu_model_t ucs_arch_get_cpu_model()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more space line before

return UCS_CPU_FLAG_UNKNOWN;
}

static inline void ucs_cpu_init()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more space line before
use 2 space lines in H file (in multiple other places as well)


#define ucs_arch_wait_mem ucs_arch_generic_wait_mem

static inline void ucs_arch_clear_cache(void * start, void * end) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code style

Comment on lines +16 to +25
#define UCS_ARCH_GLOBAL_OPTS_INITALIZER { \
.builtin_memcpy_min = UCS_MEMUNITS_AUTO, \
.builtin_memcpy_max = UCS_MEMUNITS_AUTO \
}

/* built-in memcpy config */
typedef struct ucs_arch_global_opts {
size_t builtin_memcpy_min;
size_t builtin_memcpy_max;
} ucs_arch_global_opts_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove builtin memcpy options for rv64, it's not used anyway
for example aarch64 does not have it

Copy link

@ct-clmsn ct-clmsn Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought this was removed previously, looks like the last push didn't remove it (merge artifact). will fix promptly.

UCT_INSTANTIATE_UD_TEST_CASE(test_ud)


class test_ud_peer_failure : public ud_base_test {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a merge artifact? if not why needed to change UD tests?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge artifact; will fix.

ucs_arch_clear_cache(p, p+ucm_get_page_size());
#endif
shmid = shmget(IPC_PRIVATE, ucm_get_page_size(), IPC_CREAT | SHM_R | SHM_W);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the spaceline

Comment on lines 202 to 204
#if defined(__riscv)
ucs_arch_clear_cache(p, p+ucm_get_page_size());
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ct-clmsn should be something like:

if (events & (UCM_EVENT_SHMAT|UCM_EVENT_SHMDT|UCM_EVENT_VM_MAPPED|UCM_EVENT_VM_UNMAPPED)) {
    ucs_arch_sync_inst_cache(p, p + ucm_get_page_size());
    shmid = shmget(IPC_PRIVATE, ucm_get_page_size(), IPC_CREAT | SHM_R | SHM_W);
   ...
}

And ucs_arch_sync_inst_cache() is defined to ucs_arch_clear_cache on rv4, empty on other archs.

However, looking at the code more, why is this needed only before shmget and not in other places?

#define _UCM_DEFINE_REPLACE_FUNC(_over_name, _ucm_name, _rettype, _fail_val, ...) \
\
/* Define a symbol which goes to the replacement - in case we are loaded first */ \
_rettype _over_name(UCM_FUNC_DEFINE_ARGS(__VA_ARGS__)) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like only the difference here is ucm_trace call ? What is the problem with the trace call ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed the above discussion

do { \
int exp_events = (_event) & (_mask); \
(_data)->fired_events = 0; \
ucs_rv64_isb(); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why ISB is needed here ? I don't see any obvious reason.
If there is good reason probably it is required on all architectures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More I read it and more I don't understand it. The ISB if followed by system call that switches to privilege mode. In theory it should include ISB and memory barrier. Out curiosity, have you tried to replace it with memory write barrier ?

@ct-clmsn
Copy link

ct-clmsn commented Aug 9, 2022

@shamisp @yosefe did some testing today. It turns out riscv64's implementation of brk is not equivalent to syscall(SYS_brk, ...). This is the source of several issues and work arounds you have seen in the patch. I'm in the process of implementing an architecture specific solution in the function ucm_brk_syscall in src/util/sys.c around line 378. ucm_brk_syscall already has architecture specific code for calling syscall(SYS_brk, ...) for x86; comments in the function state architecture specific code should be used in the implementation. This should resolve all of the issues above, with exception to any outstanding character-spacing issues.

@ct-clmsn
Copy link

ct-clmsn commented Aug 11, 2022

@yosefe @shamisp ucm_log_vsnprintf (in the file src/ucm/util/log.c) accepts an argument va_list ap. A line of code value.s = va_arg(ap, char *); expects va_arg to return NULL or 0 when it currently does not behave like this on RV64. Working on a solution.

@jleidel
Copy link
Author

jleidel commented Aug 18, 2022

At this point, I believe we have exhausted all possible options for removing architecture-specific I$ flushes in the RV64 implementation. Given that this is currently the "most" functional platform we have available for UCX prototyping, we limited in our options with respect to removing architecture specific code from non-architecture specific sections. Our current platform lacks certain support for hardware-driven cache flushes (RV-CMOBASE extension), compat headers (for libfabrics support; kernel 5.11.0-1028) and certain ELF extensions (DT_JMPREL is missing in binutils). We've tried all combinations of architecture-agnostic user and system calls to arrive at a solution. Many of which would potentially induce performance issues in future platform variants. Given the "experimental" nature of RISC-V platforms in HPC, we would like to request that we allow this PR to merge as is. This would allow everyone in the RISC-V community to make forward progress on other HPC runtime/comm layers. However, we will also sign up to amend the code once we have a more functional/stable baseline platform to use for tests and experiments. The RISC-V "platforms" and "profiles" specs will define specific extension requirements that will ease the pain of porting complex runtimes such as OpenUCX to future RISC-V variants.

@shamisp
Copy link
Contributor

shamisp commented Aug 19, 2022

At this point, I believe we have exhausted all possible options for removing architecture-specific I$ flushes in the RV64 implementation. Given that this is currently the "most" functional platform we have available for UCX prototyping, we limited in our options with respect to removing architecture specific code from non-architecture specific sections. Our current platform lacks certain support for hardware-driven cache flushes (RV-CMOBASE extension), compat headers (for libfabrics support; kernel 5.11.0-1028) and certain ELF extensions (DT_JMPREL is missing in binutils). We've tried all combinations of architecture-agnostic user and system calls to arrive at a solution. Many of which would potentially induce performance issues in future platform variants. Given the "experimental" nature of RISC-V platforms in HPC, we would like to request that we allow this PR to merge as is. This would allow everyone in the RISC-V community to make forward progress on other HPC runtime/comm layers. However, we will also sign up to amend the code once we have a more functional/stable baseline platform to use for tests and experiments. The RISC-V "platforms" and "profiles" specs will define specific extension requirements that will ease the pain of porting complex runtimes such as OpenUCX to future RISC-V variants.

@yosefe @ct-clmsn @jleidel Since the u-arch/arch issues cannot be resolved what about moving parts of the abstraction that got ifdef-ed (UCM_FIRE_EVENT, ucs_arch_clear_cache, _UCM_DEFINE_REPLACE_FUNC) to arch files in bistro. This way we can have one generic implementation and RV64 will have an option to overwrite those without ifdefs in the main code. @yosefe ?

@ct-clmsn
Copy link

ct-clmsn commented Jul 13, 2023

@shamisp @yosefe I've made some updates to the bistro support and placed compile guards around some of the openmp code.

The icache issues have been resolved.

The modifications have been tested against a sifive unmatched and a pine64/star64 device. The code runs successfully on both devices.

Also, I've merged the implementation against master so, it is ready to go.

@ct-clmsn ct-clmsn force-pushed the rv64-support branch 2 times, most recently from f2d9c2b to 06a1c3f Compare July 19, 2023 14:49
@yosefe
Copy link
Contributor

yosefe commented Jul 26, 2023

@ct-clmsn can you pls fix the commit title? would need force-push

@ct-clmsn
Copy link

ct-clmsn commented Jul 26, 2023

@yosefe apologies, to clarify this commit? 06a1c3f

/**
* @brief JALR - Add 12 bit immediate to source register, save to destination register, jump and link from destination register
*
* @param[in] _reg register number (0-31), @param[out] _reg register number (0-31), @param[imm] 12 bit immmediate value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this in f9083eb

These are not out variables in the sense that doxygen would consider it. Also there is no "imm" type.

@ct-clmsn ct-clmsn force-pushed the rv64-support branch 2 times, most recently from cffb8a6 to 942f468 Compare July 26, 2023 13:55
Alexey-Rivkin and others added 26 commits July 26, 2023 12:59
This limit is no longer needed after moving protov1 to ucp rcache, and
replacing datatype registration with ucp_memh pointer.
Need to always pass mr_id on rc_verbs transport, even if flush_remote is
disabled. On machines with relaxed ordering and DevX md, we would use
the atomic-KSM key for put operations after a fence. So need to use the
right offset.
The active message lane may be used only for control messages, and not
send data during rendezvous protocol.

Signed-off-by: Yossi Itigin <yosefe@nvidia.com>
Low byte of flush mkey (aka atomic mr_id) can be 0.
When unified mode is used, UCP uses local iface query to know
the length of ep address. If one of the peers atomic mr_id is 0
while another is not, UCP uses worng ep address length and
incorrectly unpacks UCP address.
Use simplified versions of mem_reg/dereg/pack, that do not rely on
uct_ib_md_ops_t callbacks.
Update some unit tests that called IB functions directly.
Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Need to always pass mr_id on rc_verbs transport, even if flush_remote is
disabled. On machines with relaxed ordering and DevX md, we would use
the atomic-KSM key for put operations after a fence. So need to use the
right offset.
Low byte of flush mkey (aka atomic mr_id) can be 0.
When unified mode is used, UCP uses local iface query to know
the length of ep address. If one of the peers atomic mr_id is 0
while another is not, UCP uses worng ep address length and
incorrectly unpacks UCP address.
Use simplified versions of mem_reg/dereg/pack, that do not rely on
uct_ib_md_ops_t callbacks.
Update some unit tests that called IB functions directly.
@ct-clmsn
Copy link

@yosefe @jleidel @shamisp UCX has changed significantly since this PR was proposed. This PR is OBE (overcome by events) and I'd encourage merging #9168 as it's a more up to date proposal.

@shamisp shamisp closed this Jul 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.