Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of late_initcall_sync broke build on RHEL7 kernels #37

Closed
solardiz opened this issue Jan 3, 2021 · 9 comments
Closed

Use of late_initcall_sync broke build on RHEL7 kernels #37

solardiz opened this issue Jan 3, 2021 · 9 comments
Labels
bug Something isn't working portability

Comments

@solardiz
Copy link
Contributor

solardiz commented Jan 3, 2021

Building an LKRG revision after ddc14c6 on a CentOS 7 system (deliberately not up to date, has an older RHEL7'ish kernel) results in these warnings:

/home/solar/lkrg-build/src/p_lkrg_main.c:681:1: warning: data definition has no type or storage class [enabled by default]
 late_initcall_sync(p_lkrg_register);
 ^
/home/solar/lkrg-build/src/p_lkrg_main.c:681:1: warning: type defaults to 'int' in declaration of 'late_initcall_sync' [-Wimplicit-int]
/home/solar/lkrg-build/src/p_lkrg_main.c:681:1: warning: parameter names (without types) in function declaration [enabled by default]
/home/solar/lkrg-build/src/p_lkrg_main.c:367:19: warning: 'p_lkrg_register' defined but not used [-Wunused-function]
 static int __init p_lkrg_register(void) {
                   ^

Despite of these, the build completes, but I wouldn't expect the module to work right (didn't try loading it).

We probably need to make our use of late_initcall_sync conditional upon kernel versions that support it.

@solardiz solardiz added the bug Something isn't working label Jan 3, 2021
@Adam-pi3
Copy link
Collaborator

Adam-pi3 commented Jan 4, 2021

@solardiz which kernel are you using? I've checked the following one and don't have any issue:

[root@localhost lkrg]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
[root@localhost lkrg]# uname -a
Linux localhost.localdomain 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost lkrg]# dmesg |grep p_lkrg
[  249.480278] p_lkrg: loading out-of-tree module taints kernel.
[  249.481277] p_lkrg: module verification failed: signature and/or required key missing - tainting kernel
[  249.483303] [p_lkrg] Loading LKRG...
[  249.600525] [p_lkrg] [kretprobe] register_kretprobe() for <ovl_create_or_link> failed! [err=-2]
[  249.600956] [p_lkrg] Can't hook 'ovl_create_or_link' function. This is expected if you are not using OverlayFS.
[  249.705233] [p_lkrg] LKRG initialized successfully!

@solardiz
Copy link
Contributor Author

which kernel are you using?

A much older RHEL7'ish kernel. Can you perhaps simply have LKRG build revert to the old behavior when building for RHEL7 kernels older than 3.10.0-1160? Perhaps if anyone ever builds LKRG into a RHEL7 kernel, they'll do so for a current RHEL7 kernel and thus have at least the version you tested above.

We could also revert to the old behavior on some upstream kernels (perhaps at least <= 3.10), but in practice for old kernels like this only RHEL7 matters.

@solardiz
Copy link
Contributor Author

Here's a weird workaround:

+++ b/src/p_lkrg_main.c
@@ -15,6 +15,10 @@
  *
  */
 
+#undef MODULE
+#include <linux/init.h>
+#define MODULE
+
 #include "p_lkrg_main.h"
 
 unsigned int log_level = 3;

Somehow that header only defines those things when non-MODULE. I might look into this some further.

@solardiz
Copy link
Contributor Author

Oh, that workaround makes the module build without warnings, but then loading it does nothing - so I guess late_initcall_sync didn't work from modules back then, and thus was properly excluded from the header file when building a module.

I suspect it's similar on some upstream kernels (e.g., various 3.x), where LKRG currently builds with the warnings, and loading it doesn't actually enable it. We probably do need to identify the kernel version where that changed.

@solardiz
Copy link
Contributor Author

Maybe this is the cleanest fix? -

+++ b/src/p_lkrg_main.c
@@ -678,7 +678,11 @@ static void __exit p_lkrg_deregister(void) {
 }
 
 
+#ifdef MODULE
+module_init(p_lkrg_register);
+#else
 late_initcall_sync(p_lkrg_register);
+#endif
 module_exit(p_lkrg_deregister);
 
 module_param(log_level, uint, 0000);

(Works on the old RHEL7'ish kernel here.)

Is there a good enough reason for us to prefer late_initcall_sync even when building as module? IIRC, a long time ago Adam had mentioned that (correspondingly older) LKRG worked fine even when loaded from initrd when it was still using module_init. So maybe we only need late_initcall_sync for linking into the kernel (not for other kinds of early load), which the above patch should still support?

@Adam-pi3
Copy link
Collaborator

@solardiz thanks for that research! late_initcall_sync gives us a guarantee that LKRG will be loaded very late which is somehow desired. It's better to use that. If you verified your patch and it works fine, can you push these changes?

@solardiz
Copy link
Contributor Author

@Adam-pi3 We already know late_initcall_sync is required (not just "somehow desired") when building LKRG into the kernel. The patch preserves that. However, it does not preserve the same "somehow desired" property when LKRG is a module. I think "somehow desired" isn't a strong enough reason to insist on using late_initcall_sync there just yet, given that we've been fine with 'module_init` until recently. I did verify the patch on that old CentOS 7. I think it should also work fine on newer kernels, and when building LKRG into the kernel, but we'd need to ask @sempervictus to retest the latter. I suppose I can push these changes first, and then we'll hopefully hear of any issues before the next release. OK to proceed?

@Adam-pi3
Copy link
Collaborator

I think we might have some misunderstanding what I mean by "somehow desired". LKRG should be loaded in the system after kernel is fully initialized, or at least after that phase of initialization when it is stable to have presumptions which LKRG have (e.g. .text section is not going to be modified without going to the official API like JUMP_LABEL and/or FTRACE). Some of the kernel modules might also install KPROBEs which is incompatible with LKRG (unless they are optimized to be FTRACE-based Probes). Taking this into account, it is "safer" to load LKRG in the last possible stage. Linux kernel defines the following priority of module loading (based on: https://elixir.bootlin.com/linux/latest/source/include/linux/module.h):

/*
 * In most cases loadable modules do not need custom
 * initcall levels. There are still some valid cases where
 * a driver may be needed early if built in, and does not
 * matter when built as a loadable module. Like bus
 * snooping debug drivers.
 */
#define early_initcall(fn)		module_init(fn)
#define core_initcall(fn)		module_init(fn)
#define core_initcall_sync(fn)		module_init(fn)
#define postcore_initcall(fn)		module_init(fn)
#define postcore_initcall_sync(fn)	module_init(fn)
#define arch_initcall(fn)		module_init(fn)
#define subsys_initcall(fn)		module_init(fn)
#define subsys_initcall_sync(fn)	module_init(fn)
#define fs_initcall(fn)			module_init(fn)
#define fs_initcall_sync(fn)		module_init(fn)
#define rootfs_initcall(fn)		module_init(fn)
#define device_initcall(fn)		module_init(fn)
#define device_initcall_sync(fn)	module_init(fn)
#define late_initcall(fn)		module_init(fn)
#define late_initcall_sync(fn)		module_init(fn)

#define console_initcall(fn)		module_init(fn)

Some kernel also defines:
#define security_initcall(fn) module_init(fn)

some of them don't define console_initcall. From what I saw, the most stable across all kernels which are guaranteed to be called very late is late_initcall_sync. It is actively used in even in kernel 3.11 (https://elixir.bootlin.com/linux/v3.11/C/ident/late_initcall_sync). I'm a bit surprised that RHEL has the problem which you mentioned. However, that's why I'm saying the it is 'somehow desired' to keep late_initcall_sync as a default behavior. However, if it produces some unexpected problems on a very old RHEL and you've find working solution by switching to default 'module_init' we should go to this path.

@solardiz
Copy link
Contributor Author

solardiz commented Jan 17, 2021

It is actively used in even in kernel 3.11

I guess those uses are in drivers compiled into the kernel, not loaded as modules. Notice how that would be consistent with the comment you quoted, and with module.h defining late_initcall_sync and all others merely as aliases to module_init. Also note how 3.11 (per that link you posted) doesn't have this in module.h yet, only having it in init.h - that's the issue we're having.

So mix fix is actually correct, resulting in the exact same behavior on older kernels that we have on newer ones, and leaving the latter unchanged.

solardiz added a commit that referenced this issue Jan 17, 2021
We switched to using late_initcall_sync() in order to have LKRG initialize
sufficiently late when it's linked into the kernel.  That change was a
no-op when building/loading LKRG as a module on recent kernels, because
their module.h defines late_initcall_sync() as an alias for module_init().
However, it broke LKRG on some older kernels, where late_initcall_sync()
wasn't defined for modules at all.

This commit fixes that by explicitly using module_init() when building LKRG
as a module.  This change is a no-op on recent kernels.

Fixes #37, updates ddc14c6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working portability
Projects
None yet
Development

No branches or pull requests

2 participants