Exploit s390's new and enhanced PCI memory I/O (MIO) instructions #1122

Merged
merged 1 commit into from
Feb 28, 2022

niklas88
Contributor

Hi All,

In the existing MMIO implementation, s390 relies on special syscalls to
access PCI memory. This was necessary because s390 originally only had
special privileged instructions for accessing PCI memory. With z15,
however, comes support for 4 new PCI memory I/O (MIO) instructions which
operate on virtually mapped PCI memory spaces.

While these are still special PCI access instructions rather than real
MMIO, they behave much more like standard MMIO access. There is a
load-like instruction pcilgi, a store-like instruction pcistgi, a block
store variant pcistbi for efficient memcpy, and a write barrier
instruction pciwb. The load and store variants always operate on a
64-bit register but only load/store the rightmost bytes of the register,
as controlled by a length value in a paired register (even-numbered
register rN + odd-numbered register r(N+1)).
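
To illustrate the "rightmost bytes" behaviour, here is a small portable model
in plain C. It is only a sketch: the function names are hypothetical, no PCI
instruction is executed, and the byte order in memory (most significant of the
selected bytes first, as on a big-endian machine) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of a store: write the `len` rightmost (least significant) bytes
 * of the 64-bit value `reg` to `dst`. Assumed byte order: big-endian.
 */
static inline void model_store_rightmost(uint8_t *dst, uint64_t reg, size_t len)
{
	assert(len >= 1 && len <= 8);
	for (size_t i = 0; i < len; i++)
		dst[i] = (uint8_t)(reg >> (8 * (len - 1 - i)));
}

/*
 * Model of a load: read `len` bytes from `src` into the rightmost bytes
 * of a 64-bit value; the remaining upper bytes are zero in this model.
 */
static inline uint64_t model_load_rightmost(const uint8_t *src, size_t len)
{
	uint64_t reg = 0;

	assert(len >= 1 && len <= 8);
	for (size_t i = 0; i < len; i++)
		reg = (reg << 8) | src[i];
	return reg;
}
```

In this model a 2-byte store of 0x1122334455667788 writes only 0x77 and 0x88,
and the matching 2-byte load returns 0x7788.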

As the previous PCI instructions did not operate on virtual memory
mappings at all, a kernel using them does not set up virtual memory
mappings and thus can't support user-space using the new instructions.

Also, as use of PCI MIO instructions can be disabled via the
pci=nomio kernel parameter, we can't rely solely on hardware support and
kernel version. Instead Linux exposes whether PCI MIO instructions are
enabled via an ELF hardware capability. With this patch we check for
this capability and, if enabled, use the newly introduced PCI MIO
instructions for MMIO access and barriers.
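
A minimal sketch of such a capability check follows. The names
hwcap_has_pci_mio(), demo_mio_supported and demo_detect_mio() are hypothetical,
and the fallback value for HWCAP_S390_PCI_MIO is an assumption that should be
verified against the kernel/glibc headers; only getauxval(AT_HWCAP) is the
actual glibc interface.

```c
#include <stdbool.h>

#if defined(__s390x__)
#include <sys/auxv.h>	/* getauxval(), AT_HWCAP */
#endif

/*
 * Assumed fallback for a libc that predates the constant; verify the
 * value against the kernel/glibc headers before relying on it.
 */
#ifndef HWCAP_S390_PCI_MIO
#define HWCAP_S390_PCI_MIO 2097152UL
#endif

/* Hypothetical flag; stays false unless the kernel reports PCI MIO. */
bool demo_mio_supported;

/* Pure helper: is the PCI MIO bit set in the given hwcap word? */
static inline bool hwcap_has_pci_mio(unsigned long hwcap)
{
	return (hwcap & HWCAP_S390_PCI_MIO) != 0;
}

/*
 * Runs once before main(). The bit is only meaningful on s390x, so on
 * other architectures the flag is simply left at false.
 */
static __attribute__((constructor)) void demo_detect_mio(void)
{
#if defined(__s390x__)
	demo_mio_supported = hwcap_has_pci_mio(getauxval(AT_HWCAP));
#endif
}
```

Because the check reads what the kernel reports rather than probing the CPU,
booting with pci=nomio would leave the flag false in this sketch.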

Note to reviewers: One thing I'm not sure about is the handling of the
ELF hardware capability in udma_barrier.h. That header is used without
mmio.h/mmio.c, at least in the compile test for DMA coherency support,
so I didn't want to add a dependency on the extern s390_mio_supported
variable. Instead I added a separate getauxval() call, but that doesn't
seem ideal either. I couldn't measure any performance impact even on
a 100 Gbit/s card, but it's still unnecessary overhead.

Thanks,
Niklas

@rleon rleon requested a review from jgunthorpe January 11, 2022 08:02
@rleon
Member

rleon commented Jan 16, 2022

I honestly don't know what to do with this PR.
It is all black magic for me :(.

@niklas88
Contributor Author

I honestly don't know what to do with this PR. It is all black magic for me :(.

It is quite special so that's understandable. Sadly I can't currently refer you to a specification, but I'm here to answer
questions. Also, while I tried to give a good summary in the commit message, I do realize I might
have missed some context, being very deep in the topic myself.

First some general s390 PCI and I/O background:

  • s390 historically doesn't have MMIO but does have DMA. Instead of using MMIO, I/O is done with special I/O CPU
    instructions and so-called Channel I/O devices, some of which are backwards compatible with the original channel I/O devices which predate PCI by about 35 years. So a lot of history there.
  • Today's channel devices use PCIe on the physical layer but this is entirely hidden from the OS.
  • As there already was physical PCI, support for the OS talking to normal PCI devices was introduced. Because there
    was no concept of MMIO, this was done with special PCI load/store instructions that refer to PCI BARs with a handle plus a
    BAR number plus an offset, not via a virtual address.
  • In Linux the above old-style PCI instructions are used with an indirection to adapt them for ioread()/iowrite(),
    which of course refer to PCI BARs with an __iomem address. I.e. the PCI support fakes MMIO for the rest of the kernel.
  • Because the above limits performance and is kind of an ugly hack in Linux, new instructions were added
    that refer to PCI memory with a virtual address. These new-style instructions behave more like MMIO, except that one
    can only access the mapped I/O memory with these select instructions. Any other access, like a normal load
    from this mapping, will result in an addressing exception. This also added support for Write Combining and
    an explicit barrier. Also, unlike the old instructions, these may be executed in user-space.
  • The new PCI instructions are currently not available in KVM or z/VM guests, so they co-exist with the old style
    and we need to switch between them at boot time to support the same kernel image on both.

I can also give some more pointers to related code in the Linux Kernel:

  • arch/s390/pci/pci_insn.c: This is where the PCI instructions are defined, both the legacy ones (_fh suffix) and the new ones (_mio suffix). As in this PR, this uses the .insn ASM thing because the assembler doesn't know the mnemonic. The wrapping functions are then used to implement ioread()/iowrite() in the kernel via the following define/call chain:
    iowrite64 → writeq → __raw_writeq (in include/asm-generic/io.h)
    __raw_writeq → zpci_write_u64 (in arch/s390/include/asm/io.h)
    zpci_write_u64 → zpci_store (in arch/s390/include/asm/pci_io.h)
    zpci_store → zpci_store_fh (old style) or __pcistg_mio (new style) (in arch/s390/pci/pci_insn.c)
  • arch/s390/pci/pci_mmio.c: This is the implementation of the syscall that the current rdma-core code calls for s390. Note that this already had to be updated to work at all when the kernel uses the newer instructions; this was commit f058599e22d ("s390/pci: Fix s390_mmio_read/write with MIO"). Sadly that one required even darker magic to basically do what this patch does, but from kernel space and as if it was executed by user-space. You can see the same pcilgi, pcistgi and pcistbi again encoded via the raw .insn thing. They are however surrounded with extra incantations using an instruction called sacf (Set Address Control Fast) to execute the PCI loads/stores with the address space of the user-space process invoking the syscall. This of course adds overhead compared with the user-space implementation in this PR, though our syscalls are quite fast so it's less than one might think.

@niklas88
Contributor Author

niklas88 commented Jan 24, 2022

@jgunthorpe I have pushed an updated version with the following changes:

  • Moved PCI instruction wrapper functions to util/s390_pci_insn.h
  • Dropped likely() around flag check
  • Moved mmio_memcpy_x64() to util/mmio.c. Simplified get_s390_max_write_size(). The 8 byte alignment matches the requirements of pcistbi but we need to make sure not to cross a 4K boundary.
  • Added flag check for mmio_flush_writes() and included util/s390_pci_insn.h; this needed an s390-specific check for the compile test
  • Added a loop for retrying pcistgi in case the hardware indicates being busy. This matches the kernel implementation but should very rarely be needed and isn't currently handled by the syscall implementation.
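
The 4K boundary constraint on block stores can be sketched as follows. This is
not the code from the patch: demo_max_write_size() and demo_block_copy() are
hypothetical names, the 128 byte per-store limit is an assumption, and a plain
memcpy() stands in for the pcistbi block store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BOUNDARY_SIZE ((size_t)4096)	/* must not be crossed */
#define DEMO_MAX_WRITE_SIZE ((size_t)128)	/* assumed per-store limit */

/* Largest chunk that may be written at dst without crossing a 4K boundary. */
static inline size_t demo_max_write_size(uintptr_t dst, size_t len)
{
	size_t to_boundary = DEMO_BOUNDARY_SIZE -
			     (size_t)(dst & (DEMO_BOUNDARY_SIZE - 1));
	size_t size = len < DEMO_MAX_WRITE_SIZE ? len : DEMO_MAX_WRITE_SIZE;

	return size < to_boundary ? size : to_boundary;
}

/*
 * Copy len bytes in boundary-respecting chunks and return the number of
 * chunks used. dst must be 8 byte aligned and len a multiple of 8, so
 * every chunk stays a multiple of 8 bytes as well.
 */
static inline size_t demo_block_copy(uint8_t *dst, const uint8_t *src,
				     size_t len)
{
	size_t chunks = 0;

	assert(((uintptr_t)dst & 7) == 0 && (len & 7) == 0);
	while (len > 0) {
		size_t n = demo_max_write_size((uintptr_t)dst, len);

		memcpy(dst, src, n);	/* stand-in for one block store */
		dst += n;
		src += n;
		len -= n;
		chunks++;
	}
	return chunks;
}
```

With these assumptions, a 128 byte write starting 64 bytes before a 4K boundary
is split into two 64 byte stores.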

I did not use the IFUNC mechanism in this version because I think a flag is simpler and allows us to save the function call overhead for the simple load/store case, where I'd also expect more gain from inlining, especially since the length parameter will always be known at compile time for these.

@niklas88 niklas88 force-pushed the s390_pci_mio branch 2 times, most recently from b2d7008 to 212b4f6 Compare January 24, 2022 09:57
@niklas88
Contributor Author

niklas88 commented Feb 3, 2022

I just pushed a new version with the following changes:

  • Use __attribute__((ifunc)) for mmio_memcpy_x64(). Talking to our toolchain team, it turns out that on s390 we actually get the ELF hwcaps as an argument to the resolver function, so that works out nicely. The toolchain folks mentioned that on s390 there shouldn't really be a window where the attribute doesn't work but the binutils support does, but I still included the fallback to be consistent with the x86_64 ifunc use.
  • I signed the commit with GPG, mostly as an exercise as I recently created this key. It does have signatures from the s390 kernel maintainers and some other IBM folks as well as my ancient university-era keys though.
  • I did some basic testing of the mmio_memcpy_x64() function as it doesn't seem to be called by a simple qperf -cm1 rc_bw test. Also tested the ifunc mechanism.
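
For readers unfamiliar with ifunc, a rough sketch of the idea follows. All
demo_* names are hypothetical, both copy routines are memcpy() stand-ins, and
the fallback value of HWCAP_S390_PCI_MIO is an assumption; the ifunc branch
relies on the s390 resolver being passed the hwcaps as described above and is
only compiled on s390x.

```c
#include <stddef.h>
#include <string.h>

#ifndef HWCAP_S390_PCI_MIO
#define HWCAP_S390_PCI_MIO 2097152UL	/* assumed value, verify against headers */
#endif

typedef void (*demo_copy_fn)(void *dst, const void *src, size_t len);

/* Call counters, only here to tell the two stand-ins apart. */
static unsigned int demo_mio_calls;
static unsigned int demo_syscall_calls;

/* Stand-in for a copy built on the block store instruction. */
static void demo_copy_mio(void *dst, const void *src, size_t len)
{
	demo_mio_calls++;
	memcpy(dst, src, len);
}

/* Stand-in for a copy built on the MMIO write syscall. */
static void demo_copy_syscall(void *dst, const void *src, size_t len)
{
	demo_syscall_calls++;
	memcpy(dst, src, len);
}

/* Pick an implementation from a hwcap word. */
static demo_copy_fn demo_pick_copy(unsigned long hwcap)
{
	return (hwcap & HWCAP_S390_PCI_MIO) ? demo_copy_mio : demo_copy_syscall;
}

#if defined(__s390x__) && defined(__GNUC__)
/* Resolver: called once by the dynamic loader with the hwcaps as argument. */
static demo_copy_fn demo_resolve_copy(unsigned long hwcap)
{
	return demo_pick_copy(hwcap);
}

void demo_copy(void *dst, const void *src, size_t len)
	__attribute__((ifunc("demo_resolve_copy")));
#else
/* Without ifunc on s390x there is no MIO to use, so take the syscall path. */
void demo_copy(void *dst, const void *src, size_t len)
{
	demo_pick_copy(0)(dst, src, len);
}
#endif
```

The selection happens once at load time, so callers of demo_copy() pay for an
indirect call but not for a flag check on every invocation.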

@niklas88 niklas88 force-pushed the s390_pci_mio branch 2 times, most recently from 14096c9 to 298bf24 Compare February 11, 2022 13:48
@niklas88
Contributor Author

I just pushed a v5. It is rebased on current master and again removes the busy wait for pcistgi. It's pretty embarrassing, but while such a busy wait is needed for the old PCI store, it isn't for the new one, and that was correct in my initial version. When I thought I had missed it I was looking at the wrong specification...

Member

@jgunthorpe jgunthorpe left a comment

It seems fine, just a few nitpicky things

util/mmio.c Outdated

static __attribute__((constructor)) void check_mio_supported(void)
{
#ifdef HWCAP_S390_PCI_MIO
Member

I'd rather you use the usual shim mechanism for this constant to ensure it is always declared regardless of the libc version, i.e. have cmake test it and replace sys/auxv.h like we do for other things.

Contributor Author

Ok, thanks, will look into that.

Contributor Author

In v6 that I just pushed, I added a buildlib/fixup-include/sys-auxv.h with the HWCAP_S390_PCI_MIO constant. We'll likely have backports for glibc 2.34 for this flag in the major distributions with s390 support anyway, but in the code the shim method is definitely nicer than the inline #ifdef.

#define mmio_flush_writes() asm volatile("" ::: "memory")
#elif defined(__loongarch__)
#define mmio_flush_writes() asm volatile("dbar 0" ::: "memory")
#elif defined(__riscv)
#define mmio_flush_writes() asm volatile("fence ow,ow" ::: "memory")
#elif defined(__s390x__)
#include "s390_mmio_insn.h"
Member

Why are read and write being defined through udma_barrier.h?

Contributor Author

I include the s390_mmio_insn.h for the s390_is_mio_supported flag. One way not to include the read/write would be to move the MAKE_READ/MAKE_WRITE to mmio.h and keep s390_mmio_insn.h just for the flag and the instruction wrappers. Would that be acceptable @jgunthorpe?

Member

You can just add an extern for the flag, it doesn't have to be in exactly one header. Just be sure to include all the headers that define the extern in the C file that instantiates it.

Contributor Author

I tried that; sadly it causes a redundant redeclaration warning in all files that include both udma_barrier.h and mmio.h.

Contributor Author

Now clearly we could work around this with an extra define, much like a header guard, such that only util/udma_barrier.h or util/s390_mmio_insn.h declares the flag, depending on which comes first. But that's kind of ugly. No?
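
For illustration only, the guard workaround could look roughly like the sketch
below, shown as a single file. The macro DEMO_S390_MIO_FLAG_DECLARED and the
variable demo_s390_is_mio_supported are hypothetical names, and the two guarded
blocks stand for what each of the two headers would contain.

```c
#include <stdbool.h>

/* What the first header might contain. */
#ifndef DEMO_S390_MIO_FLAG_DECLARED
#define DEMO_S390_MIO_FLAG_DECLARED
extern bool demo_s390_is_mio_supported;
#endif

/*
 * What the second header might contain. When both are included, the
 * guard skips this second declaration, so the compiler never sees the
 * redundant redeclaration it would otherwise warn about.
 */
#ifndef DEMO_S390_MIO_FLAG_DECLARED
#define DEMO_S390_MIO_FLAG_DECLARED
extern bool demo_s390_is_mio_supported;
#endif

/* The single definition, living in exactly one C file. */
bool demo_s390_is_mio_supported;
```

It works, but every header that needs the flag has to repeat the guard.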

Personally I think moving the MAKE_READ/MAKE_WRITE as well as the 8 bit read/write to util/mmio.h while including util/s390_mmio_insn.h is a clean solution. It even improves the match with the file name, keeping s390_mmio_insn.h strictly about the instructions. I also noticed I had the flag check both in s390_pciwb() itself and again in mmio_flush_writes(); with the latter removed I find it pretty clean. Basically in util/udma_barrier.h we're just left with #define mmio_flush_writes() s390_pciwb()

Contributor Author

I've pushed a v6 with the above idea for you to take a look.

Member

Oh blah, yes we have to deal with that extra warning in these headers.

Member

@jgunthorpe jgunthorpe left a comment

It seems OK overall, just do the cmake thing please

# glibc before 2.35 does not necessarily define the HWCAP_S390_PCI_MIO hardware
# capability bit constant. Check for it and if necessary shim it in such that
# kernel support for PCI MIO instructions can always be checked.
RDMA_Check_C_Compiles(HAVE_GLIBC_HWCAP_S390_PCI_MIO "
Member

Please look at the end of this file and print out a message when this has to be activated like all the other cases

Contributor Author

Added one at the bottom, wasn't sure before because there's already a Success/Failed message for the check itself.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
@jgunthorpe
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jgunthorpe jgunthorpe merged commit b23c91b into linux-rdma:master Feb 28, 2022