perf vendor events riscv: Update SiFive CPU PMU events #986

Closed
wants to merge 16 commits

Conversation


@bjoto bjoto commented May 9, 2024

Pull request for series with
subject: perf vendor events riscv: Update SiFive CPU PMU events
version: 1
url: https://patchwork.kernel.org/project/linux-riscv/list/?series=851753

nick650823 and others added 16 commits April 28, 2024 14:50
When the CPUs in the same cluster are all in the idle state, the kernel
might put the cluster into a deeper low power state. Call
cluster_pm_enter() before entering the low power state and
cluster_pm_exit() after the cluster has woken up.

Signed-off-by: Nick Hu <nick.hu@sifive.com>
Link: https://lore.kernel.org/r/20240226065113.1690534-1-nick.hu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
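
For illustration, a minimal sketch of the pattern this commit
describes, using the kernel's generic cpu_pm API (cpu_pm_enter(),
cpu_cluster_pm_enter() and friends are real; enter_deep_state() is a
hypothetical stand-in for the firmware/SBI call):

    #include <linux/cpu_pm.h>

    static int idle_enter_cluster_state(void)
    {
            int ret;

            /* Notify CPU PM listeners that this CPU is going down. */
            ret = cpu_pm_enter();
            if (ret)
                    return ret;

            /* Last CPU in the cluster: notify cluster PM listeners too. */
            ret = cpu_cluster_pm_enter();
            if (!ret) {
                    enter_deep_state();     /* hypothetical firmware/SBI call */
                    cpu_cluster_pm_exit();  /* cluster has woken up */
            }

            cpu_pm_exit();
            return ret;
    }
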
Alexandre Ghiti <alexghiti@rivosinc.com> says:

Patch 1 removes a superfluous memory barrier, and patch 2 fixes the
issue with IPIs in the patching code.

* b4-shazam-merge:
  riscv: Fix text patching when IPI are used
  riscv: Remove superfluous smp_mb()

Link: https://lore.kernel.org/r/20240229121056.203419-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
…USH_CTX prctl"

Charlie Jenkins <charlie@rivosinc.com> says:

Improve the performance of icache flushing by creating a new prctl flag
PR_RISCV_SET_ICACHE_FLUSH_CTX. The interface is left generic to allow
for future expansions such as with the proposed J extension [1].

Documentation is also provided to explain the use case.

A patch has been sent to add PR_RISCV_SET_ICACHE_FLUSH_CTX to the
man-pages [2].

[1] https://github.com/riscv/riscv-j-extension
[2] https://lore.kernel.org/linux-man/20240124-fencei_prctl-v1-1-0bddafcef331@rivosinc.com

* b4-shazam-merge:
  cpumask: Add assign cpu
  documentation: Document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl
  riscv: Include riscv_set_icache_flush_ctx prctl
  riscv: Remove unnecessary irqflags processor.h include

Link: https://lore.kernel.org/r/20240312-fencei-v13-0-4b6bdc2bbf32@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
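
As a usage illustration, a minimal userspace sketch, assuming the uapi
constants this series adds (PR_RISCV_SET_ICACHE_FLUSH_CTX,
PR_RISCV_CTX_SW_FENCEI_ON, PR_RISCV_SCOPE_PER_PROCESS) are available
via the prctl headers:

    #include <sys/prctl.h>
    #include <linux/prctl.h>

    /* Ask the kernel to flush the icache on migration for every thread
     * in this process, so self-modifying code (e.g. a JIT) only needs a
     * local fence.i after rewriting instructions. */
    static int enable_cmodx(void)
    {
            return prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX,
                         PR_RISCV_CTX_SW_FENCEI_ON,
                         PR_RISCV_SCOPE_PER_PROCESS);
    }
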
IS_ENABLED(CONFIG_64BIT) in the initialization of pgtable_l{4,5}_enabled
is redundant; remove it.

Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240320064712.442579-2-dawei.li@shingroup.cn
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
pgtable_l{4,5}_enabled are read-only after initialization, so annotate
them explicitly with __ro_after_init.

Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240320064712.442579-3-dawei.li@shingroup.cn
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
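
The combined effect of these two patches, as a sketch (assuming the
declarations in arch/riscv/mm/init.c look roughly like this):

    /* Before: redundant IS_ENABLED(CONFIG_64BIT), writable after init. */
    bool pgtable_l4_enabled =
            IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL);

    /* After: the code is 64-bit only, and the flag never changes post-init. */
    bool pgtable_l4_enabled __ro_after_init = !IS_ENABLED(CONFIG_XIP_KERNEL);
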
…ired

After commit f51f7a0 ("riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC
for !dma_coherent"), non-coherent platforms with less than 4GB of
memory rely on users passing the "swiotlb=mmnn,force" kernel parameter
to enable DMA bouncing for unaligned kmalloc() buffers. Now let's go
further: if ZONE_DMA does not need bouncing, let the kernel
automatically allocate a 1MB swiotlb buffer per 1GB of RAM for
kmalloc() bouncing on non-coherent platforms, so that there is no need
to pass "swiotlb=mmnn,force" any more.

The "1MB swiotlb buffer per 1GB of RAM" ratio is taken from arm64.
Users can still force a smaller swiotlb buffer by passing
"swiotlb=mmnn".

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240325110036.1564-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
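
The sizing rule itself is simple arithmetic; a sketch of the ratio
only, not the exact code from this patch:

    #include <linux/types.h>

    /* "1MB of swiotlb per 1GB of RAM" is a fixed 1/1024 ratio:
     * e.g. 4GB of RAM yields a 4MB bounce buffer. */
    static unsigned long kmalloc_bounce_size(phys_addr_t ram_size)
    {
            return (unsigned long)(ram_size >> 10);
    }
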
Jisheng Zhang <jszhang@kernel.org> says:

This series selects ARCH_USE_CMPXCHG_LOCKREF to enable the
cmpxchg-based lockless lockref implementation for riscv. Then,
implement arch_cmpxchg64_{relaxed|acquire|release}.

After patch 1:
Using Linus' test case [1] on the TH1520 platform, I see an 11.2%
improvement. On the JH7110 platform, I see a 12.0% improvement.

After patch 2:
On both the TH1520 and JH7110 platforms, I didn't see an obvious
performance improvement with Linus' test case [1]. IMHO, this may be
related to the fence and lr.d/sc.d hardware implementations. In
theory, lr/sc without a fence could outperform lr/sc plus a fence, so
add the code here to leave room for improvement on newer hardware
platforms.

* b4-shazam-merge:
  riscv: cmpxchg: implement arch_cmpxchg64_{relaxed|acquire|release}
  riscv: select ARCH_USE_CMPXCHG_LOCKREF

Link: http://marc.info/?l=linux-fsdevel&m=137782380714721&w=4 [1]
Link: https://lore.kernel.org/r/20240325111038.1700-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
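
A sketch of what such a primitive boils down to, modelled on the style
of arch/riscv/include/asm/cmpxchg.h (the function name here is
illustrative, not the kernel's): the relaxed variant omits fences
entirely, while the acquire variant appends "fence r, rw" after the
loop and the release variant prepends "fence rw, w" before it.

    #include <linux/types.h>

    /* Relaxed 64-bit compare-and-exchange as an lr.d/sc.d retry loop. */
    static inline u64 cmpxchg64_relaxed_sketch(u64 *ptr, u64 old, u64 new)
    {
            u64 ret;
            unsigned int rc;

            __asm__ __volatile__ (
                    "0:     lr.d %0, %2\n"          /* load-reserved */
                    "       bne  %0, %z3, 1f\n"     /* mismatch: bail out */
                    "       sc.d %1, %z4, %2\n"     /* store-conditional */
                    "       bnez %1, 0b\n"          /* lost reservation: retry */
                    "1:\n"
                    : "=&r" (ret), "=&r" (rc), "+A" (*ptr)
                    : "rJ" (old), "rJ" (new)
                    : "memory");
            return ret;
    }
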
Currently, riscv Linux requires at least IMA, so all platforms have a
multiplier, and the 'mul' efficiency can be assumed to be comparable
to or better than a sequence of five or so register-dependent
arithmetic instructions. Select ARCH_HAS_FAST_MULTIPLIER to get
slightly nicer codegen. Refer to commit f9b4192 ("[PATCH] bitops:
hweight() speedup") for more details.

In a simple benchmark calling hweight64() in a loop, this gives:

about a 14% performance improvement on JH7110, tested on a Milkv Mars;

about a 23% performance improvement on TH1520 and SG2042, tested on a
Sipeed LPI4A and an SG2042 platform;

a slight performance drop on CV1800B, tested on a Milkv Duo. Among
all the riscv platforms at hand, this is the only one that sees a
slight drop, which means its 'mul' isn't quick enough. The same
situation exists on x86: the P4 doesn't have fast integer multiplies,
as noted in the commit above, yet x86 still selects
ARCH_HAS_FAST_MULTIPLIER. So let's select ARCH_HAS_FAST_MULTIPLIER,
which benefits almost all riscv platforms.

Samuel also provided some performance numbers:
On Unmatched: 20% speedup for __sw_hweight32 and 30% speedup for
__sw_hweight64.
On D1: 8% speedup for __sw_hweight32 and 8% slowdown for
__sw_hweight64.

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Tested-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240325105823.1483-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
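
For reference, what ARCH_HAS_FAST_MULTIPLIER changes: the generic
hweight code in lib/hweight.c can replace its final shift-and-add
reduction with a single multiply. A standalone sketch of the
multiplier path:

    #include <stdint.h>

    static unsigned int hweight64_mul(uint64_t w)
    {
            w -= (w >> 1) & 0x5555555555555555ULL;          /* 2-bit sums */
            w  = (w & 0x3333333333333333ULL) +
                 ((w >> 2) & 0x3333333333333333ULL);        /* 4-bit sums */
            w  = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;    /* byte sums  */
            /* One multiply gathers all byte sums into the top byte. */
            return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
    }
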
This set of PMU event descriptions applies not only to the SiFive U74
core configuration, but also to other SiFive cores that implement the
Bullet microarchitecture (such as U64, P270, and X280). Rename the
directory to be more generic.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
The EventCode field (as stored in the mhpmeventN CSRs) is actually 56
bits wide, but there is no need to keep leading zeroes in the JSON
files. Remove them to simplify review of the following change, which
regenerates the files in a way that does not include leading zeroes.

This change was performed automatically with `sed -i "s/0x0*/0x/"`.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Regenerate the event lists from the original hardware description. This
makes them consistent with the event lists for newer versions of the
hardware, allowing most files to be reused across hardware versions.

Signed-off-by: Eric Lin <eric.lin@sifive.com>
Co-developed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
SiFive Bullet microarchitecture cores with mimpid values starting with
0x07 or greater add new PMU events to support debug, trace, and counter
sampling and filtering (Sscofpmf).

All other PMU events are unchanged from earlier Bullet cores.

Signed-off-by: Eric Lin <eric.lin@sifive.com>
Co-developed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
SiFive Bullet microarchitecture cores with mimpid values starting with
0x0d or greater add new PMU events to count TLB miss stall cycles.

All other PMU events are unchanged from earlier Bullet cores.

Signed-off-by: Eric Lin <eric.lin@sifive.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
The SiFive Performance P550 core features an out-of-order
microarchitecture which exposes the same PMU events as Bullet,
plus events for UTLB hits and PTE cache misses/hits.

Add support for specifying these events using symbolic names.

Signed-off-by: Eric Lin <eric.lin@sifive.com>
Co-developed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
The SiFive Performance P650 core (including the vector-enabled P670 and
area-optimized P450/P470 variants) updates the P550 microarchitecture.
It brings in the debug, trace, and counter events from newer Bullet
cores, and adds new events for iTLB and dTLB multi-hits.

All other PMU events are unchanged from the P550 core.

Signed-off-by: Eric Lin <eric.lin@sifive.com>
Co-developed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>

bjoto commented May 9, 2024

Upstream branch: 0a16a17
series: https://patchwork.kernel.org/project/linux-riscv/list/?series=851753
version: 1

@bjoto bjoto closed this May 14, 2024
@bjoto bjoto deleted the series/851753=>for-next branch May 16, 2024 19:46