riscv: Improve PTDUMP to show RSW with non-zero value #24

Closed
wants to merge 3 commits

Conversation

bjoto

@bjoto bjoto commented Sep 15, 2023

Pull request for series with
subject: riscv: Improve PTDUMP to show RSW with non-zero value
version: 2
url: https://patchwork.kernel.org/project/linux-riscv/list/?series=783969

@bjoto
Author

bjoto commented Sep 15, 2023

Upstream branch: 0bb80ec
series: https://patchwork.kernel.org/project/linux-riscv/list/?series=783969
version: 2

The RSW field can be used to encode 2 bits of software-defined
information; currently, PTDUMP only prints RSW when its value is
1 or 3.

To fix this issue and enhance the debugging experience with
PTDUMP, use _PAGE_SOFT as the RSW mask and redefine _PAGE_SPECIAL
as (1 << 8), allowing PTDUMP to print RSW with any non-zero value;
otherwise, it prints an empty string for the row.

This patch also removes the val field from struct prot_bits, as
it is no longer needed.

Signed-off-by: Yu Chien Peter Lin <peterlin@andestech.com>
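
For reference, the bit layout this describes amounts to the following
definitions (a sketch based on the commit description, not the verbatim
diff; the actual macros live in arch/riscv/include/asm/pgtable-bits.h):

	/* Sketch: RSW occupies PTE bits 9:8 on RISC-V. */
	#define _PAGE_SPECIAL   (1 << 8)   /* RSW value 0x1 */
	#define _PAGE_SOFT      (3 << 8)   /* full two-bit RSW mask for PTDUMP */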
This patch introduces the PBMT field to PTDUMP, so it can
display the NC or IO memory attributes.

Signed-off-by: Yu Chien Peter Lin <peterlin@andestech.com>
This patch introduces the NAPOT field to PTDUMP, allowing it
to display the letter "N" for pages that have bit 63 set.

Signed-off-by: Yu Chien Peter Lin <peterlin@andestech.com>
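
Taken together, the three patches extend the pte_bits[] table in
arch/riscv/mm/ptdump.c roughly as follows (a sketch, not the verbatim
result; the val member is gone and the new entries print "N", the PBMT
mode, and the RSW value):

	struct prot_bits {
		u64 mask;
		const char *set;
		const char *clear;
	};

	static const struct prot_bits pte_bits[] = {
		{
			.mask  = _PAGE_NAPOT,          /* bit 63 */
			.set   = "N",
			.clear = ".",
		}, {
			.mask  = _PAGE_MTMASK_SVPBMT,  /* PBMT, bits 62:61 */
			.set   = "MT(%s)",
			.clear = "  ..  ",
		}, {
			.mask  = _PAGE_SOFT,           /* RSW, bits 9:8 */
			.set   = "RSW(%d)",
			.clear = "  ..  ",
		},
		/* ... remaining entries unchanged ... */
	};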
@bjoto
Author

bjoto commented Sep 20, 2023

At least one diff in series https://patchwork.kernel.org/project/linux-riscv/list/?series=783969 is irrelevant now. Closing PR.

@bjoto bjoto added superseded and removed new labels Sep 20, 2023
@bjoto bjoto closed this Sep 20, 2023
@bjoto bjoto deleted the series/783853=>for-next branch September 20, 2023 03:56
bjoto pushed a commit that referenced this pull request Mar 14, 2024
Currently our IO accessors all use register addressing without offsets,
but we could safely use offset addressing (without writeback) to
simplify and optimize the generated code.

To function correctly under a hypervisor which emulates IO accesses, we
must ensure that any faulting/trapped IO access results in an ESR_ELx
value with ESR_ELx.ISS.ISV=1 and with the transfer register described in
ESR_ELx.ISS.SRT. This means that we can only use loads/stores of a
single general purpose register (or the zero register), and must avoid
writeback addressing modes. However, we can use immediate offset
addressing modes, as these still provide ESR_ELx.ISS.ISV=1 and a valid
ESR_ELx.ISS.SRT when those accesses fault at Stage-2.

Currently we only use register addressing without offsets. We use the
"r" constraint to place the address into a register, and manually
generate the register addressing by surrounding the resulting register
operand with square brackets, e.g.

| static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
| {
|         asm volatile("str %x0, [%1]" : : "rZ" (val), "r" (addr));
| }

Due to this, sequences of adjacent accesses need to generate addresses
using separate instructions. For example, the following code:

| void writeq_zero_8_times(void *ptr)
| {
|        writeq_relaxed(0, ptr + 8 * 0);
|        writeq_relaxed(0, ptr + 8 * 1);
|        writeq_relaxed(0, ptr + 8 * 2);
|        writeq_relaxed(0, ptr + 8 * 3);
|        writeq_relaxed(0, ptr + 8 * 4);
|        writeq_relaxed(0, ptr + 8 * 5);
|        writeq_relaxed(0, ptr + 8 * 6);
|        writeq_relaxed(0, ptr + 8 * 7);
| }

... is compiled to:

| <writeq_zero_8_times>:
|     str     xzr, [x0]
|     add     x1, x0, #0x8
|     str     xzr, [x1]
|     add     x1, x0, #0x10
|     str     xzr, [x1]
|     add     x1, x0, #0x18
|     str     xzr, [x1]
|     add     x1, x0, #0x20
|     str     xzr, [x1]
|     add     x1, x0, #0x28
|     str     xzr, [x1]
|     add     x1, x0, #0x30
|     str     xzr, [x1]
|     add     x0, x0, #0x38
|     str     xzr, [x0]
|     ret

As described above, we could safely use immediate offset addressing,
which would allow the ADDs to be folded into the address generation for
the STRs, resulting in simpler and smaller generated assembly. We can do
this by using the "o" constraint to allow the compiler to generate
offset addressing (without writeback) for a memory operand, e.g.

| static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
| {
|         volatile u64 __iomem *ptr = addr;
|         asm volatile("str %x0, %1" : : "rZ" (val), "o" (*ptr));
| }

... which results in the earlier code sequence being compiled to:

| <writeq_zero_8_times>:
|     str     xzr, [x0]
|     str     xzr, [x0, #8]
|     str     xzr, [x0, #16]
|     str     xzr, [x0, #24]
|     str     xzr, [x0, #32]
|     str     xzr, [x0, #40]
|     str     xzr, [x0, #48]
|     str     xzr, [x0, #56]
|     ret

As Will notes at:

  https://lore.kernel.org/linux-arm-kernel/20240117160528.GA3398@willie-the-truck/

... some compilers struggle with a plain "o" constraint, so it's
preferable to use "Qo", where the additional "Q" constraint permits
using non-offset register addressing.
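
Combining the two, the write accessor sketched earlier ends up in this
form (mirroring the "o" example above, with only the constraint string
changed):

| static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
| {
|         volatile u64 __iomem *ptr = addr;
|         asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
| }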

This patch modifies our IO write accessors to use "Qo" constraints,
resulting in the better code generation described above. The IO read
accessors are left as-is because ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE
requires that non-offset register addressing is used, as the LDAR
instruction does not support offset addressing.
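
For contrast, a read accessor keeps plain register addressing so that
the LDAR alternative remains encodable; a sketch following the pattern
in arch/arm64/include/asm/io.h:

| static __always_inline u64 __raw_readq(const volatile void __iomem *addr)
| {
|         u64 val;
|         asm volatile(ALTERNATIVE("ldr %0, [%1]",
|                                  "ldar %0, [%1]",
|                                  ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE)
|                      : "=r" (val) : "r" (addr));
|         return val;
| }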

When compiling v6.8-rc1 defconfig with GCC 13.2.0, this saves ~4KiB of
text:

| [mark@lakrids:~/src/linux]% ls -al vmlinux-*
| -rwxr-xr-x 1 mark mark 153960576 Jan 23 12:01 vmlinux-after
| -rwxr-xr-x 1 mark mark 153862192 Jan 23 11:57 vmlinux-before
|
| [mark@lakrids:~/src/linux]% size vmlinux-before vmlinux-after
|    text    data     bss     dec     hex filename
| 26708921        16690350         622736 44022007        29fb8f7 vmlinux-before
| 26704761        16690414         622736 44017911        29fa8f7 vmlinux-after

... though due to internal alignment of sections, this has no impact on
the size of the resulting Image:

| [mark@lakrids:~/src/linux]% ls -al Image-*
| -rw-r--r-- 1 mark mark 43590144 Jan 23 12:01 Image-after
| -rw-r--r-- 1 mark mark 43590144 Jan 23 11:57 Image-before

Aside from the better code generation, there should be no functional
change as a result of this patch. I have lightly tested this patch,
including booting under KVM (where some devices such as PL011 are
emulated).

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240124111259.874975-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
bjoto pushed a commit that referenced this pull request May 2, 2024
The current implementation of the mov instruction with sign extension has the
following problems:

  1. It clobbers the source register if it is not stacked because it
     sign extends the source and then moves it to the destination.
  2. If the dst_reg is stacked, the current code doesn't write the value
     back in case of 64-bit mov.
  3. There is room for improvement by emitting fewer instructions.

The steps for fixing this and the instructions emitted by the JIT are explained
below with examples in all combinations:

Case A: offset == 32:
=====================

  Case A.1: src and dst are stacked registers:
  --------------------------------------------
    1. Load src_lo into tmp_lo
    2. Store tmp_lo into dst_lo
    3. Sign extend tmp_lo into tmp_hi
    4. Store tmp_hi to dst_hi

    Example: r3 = (s32)r3
	r3 is a stacked register

	ldr     r6, [r11, #-16]	// Load r3_lo into tmp_lo
	// str to dst_lo is not emitted because src_lo == dst_lo
	asr     r7, r6, #31	// Sign extend tmp_lo into tmp_hi
	str     r7, [r11, #-12] // Store tmp_hi into r3_hi

  Case A.2: src is stacked but dst is not:
  ----------------------------------------
    1. Load src_lo into dst_lo
    2. Sign extend dst_lo into dst_hi

    Example: r6 = (s32)r3
	r6 maps to {ARM_R5, ARM_R4} and r3 is stacked

	ldr     r4, [r11, #-16] // Load r3_lo into r6_lo
	asr     r5, r4, #31	// Sign extend r6_lo into r6_hi

  Case A.3: src is not stacked but dst is stacked:
  ------------------------------------------------
    1. Store src_lo into dst_lo
    2. Sign extend src_lo into tmp_hi
    3. Store tmp_hi to dst_hi

    Example: r3 = (s32)r6
	r3 is stacked and r6 maps to {ARM_R5, ARM_R4}

	str     r4, [r11, #-16] // Store r6_lo to r3_lo
	asr     r7, r4, #31	// Sign extend r6_lo into tmp_hi
	str     r7, [r11, #-12]	// Store tmp_hi to dst_hi

  Case A.4: Both src and dst are not stacked:
  -------------------------------------------
    1. Mov src_lo into dst_lo
    2. Sign extend src_lo into dst_hi

    Example: (bf) r6 = (s32)r6
	r6 maps to {ARM_R5, ARM_R4}

	// Mov not emitted because dst == src
	asr     r5, r4, #31 // Sign extend r6_lo into r6_hi

Case B: offset != 32:
=====================

  Case B.1: src and dst are stacked registers:
  --------------------------------------------
    1. Load src_lo into tmp_lo
    2. Sign extend tmp_lo according to offset.
    3. Store tmp_lo into dst_lo
    4. Sign extend tmp_lo into tmp_hi
    5. Store tmp_hi to dst_hi

    Example: r9 = (s8)r3
	r9 and r3 are both stacked registers

	ldr     r6, [r11, #-16] // Load r3_lo into tmp_lo
	lsl     r6, r6, #24	// Sign extend tmp_lo
	asr     r6, r6, #24	// ..
	str     r6, [r11, #-56] // Store tmp_lo to r9_lo
	asr     r7, r6, #31	// Sign extend tmp_lo to tmp_hi
	str     r7, [r11, #-52] // Store tmp_hi to r9_hi

  Case B.2: src is stacked but dst is not:
  ----------------------------------------
    1. Load src_lo into dst_lo
    2. Sign extend dst_lo according to offset.
    3. Sign extend dst_lo into dst_hi

    Example: r6 = (s8)r3
	r6 maps to {ARM_R5, ARM_R4} and r3 is stacked

	ldr     r4, [r11, #-16] // Load r3_lo to r6_lo
	lsl     r4, r4, #24	// Sign extend r6_lo
	asr     r4, r4, #24	// ..
	asr     r5, r4, #31	// Sign extend r6_lo into r6_hi

  Case B.3: src is not stacked but dst is stacked:
  ------------------------------------------------
    1. Sign extend src_lo into tmp_lo according to offset.
    2. Store tmp_lo into dst_lo.
    3. Sign extend src_lo into tmp_hi.
    4. Store tmp_hi to dst_hi.

    Example: r3 = (s8)r1
	r3 is stacked and r1 maps to {ARM_R3, ARM_R2}

	lsl     r6, r2, #24 	// Sign extend r1_lo to tmp_lo
	asr     r6, r6, #24	// ..
	str     r6, [r11, #-16] // Store tmp_lo to r3_lo
	asr     r7, r6, #31	// Sign extend tmp_lo to tmp_hi
	str     r7, [r11, #-12] // Store tmp_hi to r3_hi

  Case B.4: Both src and dst are not stacked:
  -------------------------------------------
    1. Sign extend src_lo into dst_lo according to offset.
    2. Sign extend dst_lo into dst_hi.

    Example: r6 = (s8)r1
	r6 maps to {ARM_R5, ARM_R4} and r1 maps to {ARM_R3, ARM_R2}

	lsl     r4, r2, #24	// Sign extend r1_lo to r6_lo
	asr     r4, r4, #24	// ..
	asr     r5, r4, #31	// Sign extend r6_lo to r6_hi
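
In every case above, the narrowing sign extension is just an LSL/ASR
pair, and the high word is the sign bit replicated. A sketch of the
arithmetic in C (helper names are illustrative, not taken from the
JIT source):

	/* Sign-extend the low "offset" bits of lo to 32 bits, as the
	 * LSL/ASR pairs above do (shift = 24 for an (s8) cast).
	 */
	static u32 sign_extend_lo(u32 lo, int offset)
	{
		int shift = 32 - offset;

		return (u32)(((s32)(lo << shift)) >> shift);
	}

	/* High word: the sign bit replicated, i.e. "asr #31". */
	static u32 sign_extend_hi(u32 lo)
	{
		return (u32)(((s32)lo) >> 31);
	}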

Fixes: fc83265 ("arm32, bpf: add support for sign-extension mov instruction")
Reported-by: syzbot+186522670e6722692d86@syzkaller.appspotmail.com
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Closes: https://lore.kernel.org/all/000000000000e9a8d80615163f2a@google.com
Link: https://lore.kernel.org/bpf/20240419182832.27707-1-puranjay@kernel.org
bjoto pushed a commit that referenced this pull request May 15, 2024
Inline calls to the bpf_get_smp_processor_id() helper in the JIT by
emitting a read from struct thread_info. The SP_EL0 system register
holds the pointer to the task_struct, and thread_info is its first
member, so the CPU number can be read directly from thread_info.

Here is how the ARM64 JITed assembly changes after this commit:

                                      ARM64 JIT
                                     ===========

              BEFORE                                    AFTER
             --------                                  -------

int cpu = bpf_get_smp_processor_id();        int cpu = bpf_get_smp_processor_id();

mov     x10, #0xfffffffffffff4d0             mrs     x10, sp_el0
movk    x10, #0x802b, lsl #16                ldr     w7, [x10, #24]
movk    x10, #0x8000, lsl #32
blr     x10
add     x7, x0, #0x0
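
Conceptually, the two emitted instructions compute the following (a
sketch, not the JIT source; the #24 above corresponds to
offsetof(struct thread_info, cpu) for this configuration):

	#include <linux/sched.h>

	static inline u32 inlined_smp_processor_id(void)
	{
		struct task_struct *tsk;

		/* Kernel-mode SP_EL0 holds current. */
		asm volatile("mrs %0, sp_el0" : "=r" (tsk));
		/* thread_info is the first member of task_struct. */
		return ((struct thread_info *)tsk)->cpu;
	}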

               Performance improvement using benchmark[1]

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

+---------------+-------------------+-------------------+--------------+
|      Name     |      Before       |        After      |   % change   |
|---------------+-------------------+-------------------+--------------|
| glob-arr-inc  | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s |   + 10.74%   |
| arr-inc       | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s |   + 5.37%    |
| hash-inc      | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s |   + 2.08%    |
+---------------+-------------------+-------------------+--------------+

[1] anakryiko@8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-5-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bjoto pushed a commit that referenced this pull request May 15, 2024
Puranjay Mohan says:

====================
bpf: Inline helpers in arm64 and riscv JITs

Changes in v5 -> v6:
arm64 v5: https://lore.kernel.org/all/20240430234739.79185-1-puranjay@kernel.org/
riscv v2: https://lore.kernel.org/all/20240430175834.33152-1-puranjay@kernel.org/
- Combine riscv and arm64 changes in single series
- Some coding style fixes

Changes in v4 -> v5:
v4: https://lore.kernel.org/all/20240429131647.50165-1-puranjay@kernel.org/
- Implement the inlining of the bpf_get_smp_processor_id() in the JIT.

NOTE: This needs to be based on:
https://lore.kernel.org/all/20240430175834.33152-1-puranjay@kernel.org/
to be built.

Manual run of bpf-ci with this series rebased on above:
kernel-patches/bpf#6929

Changes in v3 -> v4:
v3: https://lore.kernel.org/all/20240426121349.97651-1-puranjay@kernel.org/
- Fix coding style issue related to C89 standards.

Changes in v2 -> v3:
v2: https://lore.kernel.org/all/20240424173550.16359-1-puranjay@kernel.org/
- Fixed the xlated dump of percpu mov to "r0 = &(void __percpu *)(r0)"
- Made ARM64 and x86-64 use the same code for inlining. The only difference
  that remains is the per-cpu address of the cpu_number.

Changes in v1 -> v2:
v1: https://lore.kernel.org/all/20240405091707.66675-1-puranjay12@gmail.com/
- Add a patch to inline bpf_get_smp_processor_id()
- Fix an issue in MRS instruction encoding as pointed out by Will
- Remove CONFIG_SMP check because arm64 kernel always compiles with CONFIG_SMP

This series adds support for internal-only per-CPU instructions and
inlines the bpf_get_smp_processor_id() helper call for the ARM64 and
RISC-V BPF JITs.

Here is an example of calls to bpf_get_smp_processor_id() and
percpu_array_map_lookup_elem() before and after this series on ARM64.

                                         BPF
                                        =====
              BEFORE                                       AFTER
             --------                                     -------

int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
(85) call bpf_get_smp_processor_id#229032       (85) call bpf_get_smp_processor_id#8

p = bpf_map_lookup_elem(map, &zero);            p = bpf_map_lookup_elem(map, &zero);
(18) r1 = map[id:78]                            (18) r1 = map[id:153]
(18) r2 = map[id:82][0]+65536                   (18) r2 = map[id:157][0]+65536
(85) call percpu_array_map_lookup_elem#313512   (07) r1 += 496
                                                (61) r0 = *(u32 *)(r2 +0)
                                                (35) if r0 >= 0x1 goto pc+5
                                                (67) r0 <<= 3
                                                (0f) r0 += r1
                                                (79) r0 = *(u64 *)(r0 +0)
                                                (bf) r0 = &(void __percpu *)(r0)
                                                (05) goto pc+1
                                                (b7) r0 = 0

                                      ARM64 JIT
                                     ===========

              BEFORE                                       AFTER
             --------                                     -------

int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
mov     x10, #0xfffffffffffff4d0                mrs     x10, sp_el0
movk    x10, #0x802b, lsl #16                   ldr     w7, [x10, #24]
movk    x10, #0x8000, lsl #32
blr     x10
add     x7, x0, #0x0

p = bpf_map_lookup_elem(map, &zero);            p = bpf_map_lookup_elem(map, &zero);
mov     x0, #0xffff0003ffffffff                 mov     x0, #0xffff0003ffffffff
movk    x0, #0xce5c, lsl #16                    movk    x0, #0xe0f3, lsl #16
movk    x0, #0xca00                             movk    x0, #0x7c00
mov     x1, #0xffff8000ffffffff                 mov     x1, #0xffff8000ffffffff
movk    x1, #0x8bdb, lsl #16                    movk    x1, #0xb0c7, lsl #16
movk    x1, #0x6000                             movk    x1, #0xe000
mov     x10, #0xffffffffffff3ed0                add     x0, x0, #0x1f0
movk    x10, #0x802d, lsl #16                   ldr     w7, [x1]
movk    x10, #0x8000, lsl #32                   cmp     x7, #0x1
blr     x10                                     b.cs    0x0000000000000090
add     x7, x0, #0x0                            lsl     x7, x7, #3
                                                add     x7, x7, x0
                                                ldr     x7, [x7]
                                                mrs     x10, tpidr_el1
                                                add     x7, x7, x10
                                                b       0x0000000000000094
                                                mov     x7, #0x0
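
The mrs tpidr_el1 / add pair in the AFTER column is the JITed form of
the internal per-CPU instruction "r0 = &(void __percpu *)(r0)" from the
xlated dump above. In C terms it is roughly (a sketch, not kernel code;
tpidr_el1 holds this CPU's per-CPU offset in the arm64 kernel):

	static inline void *this_cpu_address(unsigned long percpu_off)
	{
		unsigned long base;

		asm volatile("mrs %0, tpidr_el1" : "=r" (base));
		return (void *)(percpu_off + base);
	}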

              Performance improvement found using benchmark[1]

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

  +---------------+-------------------+-------------------+--------------+
  |      Name     |      Before       |        After      |   % change   |
  |---------------+-------------------+-------------------+--------------|
  | glob-arr-inc  | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s |   + 10.74%   |
  | arr-inc       | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s |   + 5.37%    |
  | hash-inc      | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s |   + 2.08%    |
  +---------------+-------------------+-------------------+--------------+

[1] anakryiko@8dec900975ef

             RISCV64 JIT output for `call bpf_get_smp_processor_id`
            =======================================================

                  Before                           After
                 --------                         -------

           auipc   t1,0x848c                  ld    a5,32(tp)
           jalr    604(t1)
           mv      a5,a0

  Benchmark using [1] on QEMU.

  ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

  +---------------+------------------+------------------+--------------+
  |      Name     |     Before       |       After      |   % change   |
  |---------------+------------------+------------------+--------------|
  | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
  | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
  | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
  +---------------+------------------+------------------+--------------+
====================

Link: https://lore.kernel.org/r/20240502151854.9810-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>