New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
get rid of mptcp hooking in tcp_check_req() #16
Comments
The above schema cant' work:
Overall we lack a way to dispose the req socket 'stolen' by the MP_JOIN without our custom chunk |
see upstream commit: commit 90bf451
I'm unable to do any better than that |
Unfortunately, most versions of clang that support BTI are capable of miscompiling the kernel when converting a switch statement into a jump table. As an example, attempting to spawn a KVM guest results in a panic: [ 56.253312] Kernel panic - not syncing: bad mode [ 56.253834] CPU: 0 PID: 279 Comm: lkvm Not tainted 5.8.0-rc1 #2 [ 56.254225] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 [ 56.254712] Call trace: [ 56.254952] dump_backtrace+0x0/0x1d4 [ 56.255305] show_stack+0x1c/0x28 [ 56.255647] dump_stack+0xc4/0x128 [ 56.255905] panic+0x16c/0x35c [ 56.256146] bad_el0_sync+0x0/0x58 [ 56.256403] el1_sync_handler+0xb4/0xe0 [ 56.256674] el1_sync+0x7c/0x100 [ 56.256928] kvm_vm_ioctl_check_extension_generic+0x74/0x98 [ 56.257286] __arm64_sys_ioctl+0x94/0xcc [ 56.257569] el0_svc_common+0x9c/0x150 [ 56.257836] do_el0_svc+0x84/0x90 [ 56.258083] el0_sync_handler+0xf8/0x298 [ 56.258361] el0_sync+0x158/0x180 This is because the switch in kvm_vm_ioctl_check_extension_generic() is executed as an indirect branch to tail-call through a jump table: ffff800010032dc8: 3869694c ldrb w12, [x10, x9] ffff800010032dcc: 8b0c096b add x11, x11, x12, lsl #2 ffff800010032dd0: d61f0160 br x11 However, where the target case uses the stack, the landing pad is elided due to the presence of a paciasp instruction: ffff800010032e14: d503233f paciasp ffff800010032e18: a9bf7bfd stp x29, x30, [sp, #-16]! ffff800010032e1c: 910003fd mov x29, sp ffff800010032e20: aa0803e0 mov x0, x8 ffff800010032e24: 940017c0 bl ffff800010038d24 <kvm_vm_ioctl_check_extension> ffff800010032e28: 93407c00 sxtw x0, w0 ffff800010032e2c: a8c17bfd ldp x29, x30, [sp], #16 ffff800010032e30: d50323bf autiasp ffff800010032e34: d65f03c0 ret Unfortunately, this results in a fatal exception because paciasp is compatible only with branch-and-link (call) instructions and not simple indirect branches. A fix is being merged into Clang 10.0.1 so that a 'bti j' instruction is emitted as an explicit landing pad in this situation. Make in-kernel BTI depend on that compiler version when building with clang. Cc: Tom Stellard <tstellar@redhat.com> Cc: Daniel Kiss <daniel.kiss@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Dave Martin <Dave.Martin@arm.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20200615105524.GA2694@willie-the-truck Link: https://lore.kernel.org/r/20200616183630.2445-1-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
I compiled with AddressSanitizer and I had these memory leaks while I was using the tep_parse_format function: Direct leak of 28 byte(s) in 4 object(s) allocated from: #0 0x7fb07db49ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe) #1 0x7fb07a724228 in extend_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:985 #2 0x7fb07a724c21 in __read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1140 #3 0x7fb07a724f78 in read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1206 #4 0x7fb07a725191 in __read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1291 #5 0x7fb07a7251df in read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1299 #6 0x7fb07a72e6c8 in process_dynamic_array_len /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:2849 #7 0x7fb07a7304b8 in process_function /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3161 #8 0x7fb07a730900 in process_arg_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3207 #9 0x7fb07a727c0b in process_arg /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1786 #10 0x7fb07a731080 in event_read_print_args /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3285 #11 0x7fb07a731722 in event_read_print /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3369 #12 0x7fb07a740054 in __tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6335 #13 0x7fb07a74047a in __parse_event /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6389 #14 0x7fb07a740536 in tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6431 #15 0x7fb07a785acf in parse_event ../../../src/fs-src/fs.c:251 #16 0x7fb07a785ccd in parse_systems ../../../src/fs-src/fs.c:284 #17 0x7fb07a786fb3 in read_metadata ../../../src/fs-src/fs.c:593 #18 0x7fb07a78760e in ftrace_fs_source_init ../../../src/fs-src/fs.c:727 #19 0x7fb07d90c19c in add_component_with_init_method_data ../../../../src/lib/graph/graph.c:1048 #20 0x7fb07d90c87b in add_source_component_with_initialize_method_data ../../../../src/lib/graph/graph.c:1127 #21 0x7fb07d90c92a in bt_graph_add_source_component ../../../../src/lib/graph/graph.c:1152 #22 0x55db11aa632e in cmd_run_ctx_create_components_from_config_components ../../../src/cli/babeltrace2.c:2252 #23 0x55db11aa6fda in cmd_run_ctx_create_components ../../../src/cli/babeltrace2.c:2347 #24 0x55db11aa780c in cmd_run ../../../src/cli/babeltrace2.c:2461 #25 0x55db11aa8a7d in main ../../../src/cli/babeltrace2.c:2673 #26 0x7fb07d5460b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2) The token variable in the process_dynamic_array_len function is allocated in the read_expect_type function, but is not freed before calling the read_token function. Free the token variable before calling read_token in order to plug the leak. Signed-off-by: Philippe Duplessis-Guindon <pduplessis@efficios.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lore.kernel.org/linux-trace-devel/20200730150236.5392-1-pduplessis@efficios.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
…s metrics" test Linux 5.9 introduced perf test case "Parse and process metrics" and on s390 this test case always dumps core: [root@t35lp67 perf]# ./perf test -vvvv -F 67 67: Parse and process metrics : --- start --- metric expr inst_retired.any / cpu_clk_unhalted.thread for IPC parsing metric: inst_retired.any / cpu_clk_unhalted.thread Segmentation fault (core dumped) [root@t35lp67 perf]# I debugged this core dump and gdb shows this call chain: (gdb) where #0 0x000003ffabc3192a in __strnlen_c_1 () from /lib64/libc.so.6 #1 0x000003ffabc293de in strcasestr () from /lib64/libc.so.6 #2 0x0000000001102ba2 in match_metric(list=0x1e6ea20 "inst_retired.any", n=<optimized out>) at util/metricgroup.c:368 #3 find_metric (map=<optimized out>, map=<optimized out>, metric=0x1e6ea20 "inst_retired.any") at util/metricgroup.c:765 #4 __resolve_metric (ids=0x0, map=<optimized out>, metric_list=0x0, metric_no_group=<optimized out>, m=<optimized out>) at util/metricgroup.c:844 #5 resolve_metric (ids=0x0, map=0x0, metric_list=0x0, metric_no_group=<optimized out>) at util/metricgroup.c:881 #6 metricgroup__add_metric (metric=<optimized out>, metric_no_group=metric_no_group@entry=false, events=<optimized out>, events@entry=0x3ffd84fb878, metric_list=0x0, metric_list@entry=0x3ffd84fb868, map=0x0) at util/metricgroup.c:943 #7 0x00000000011034ae in metricgroup__add_metric_list (map=0x13f9828 <map>, metric_list=0x3ffd84fb868, events=0x3ffd84fb878, metric_no_group=<optimized out>, list=<optimized out>) at util/metricgroup.c:988 #8 parse_groups (perf_evlist=perf_evlist@entry=0x1e70260, str=str@entry=0x12f34b2 "IPC", metric_no_group=<optimized out>, metric_no_merge=<optimized out>, fake_pmu=fake_pmu@entry=0x1462f18 <perf_pmu.fake>, metric_events=0x3ffd84fba58, map=0x1) at util/metricgroup.c:1040 #9 0x0000000001103eb2 in metricgroup__parse_groups_test( evlist=evlist@entry=0x1e70260, map=map@entry=0x13f9828 <map>, str=str@entry=0x12f34b2 "IPC", metric_no_group=metric_no_group@entry=false, metric_no_merge=metric_no_merge@entry=false, metric_events=0x3ffd84fba58) at util/metricgroup.c:1082 #10 0x00000000010c84d8 in __compute_metric (ratio2=0x0, name2=0x0, ratio1=<synthetic pointer>, name1=0x12f34b2 "IPC", vals=0x3ffd84fbad8, name=0x12f34b2 "IPC") at tests/parse-metric.c:159 #11 compute_metric (ratio=<synthetic pointer>, vals=0x3ffd84fbad8, name=0x12f34b2 "IPC") at tests/parse-metric.c:189 #12 test_ipc () at tests/parse-metric.c:208 ..... ..... omitted many more lines This test case was added with commit 218ca91 ("perf tests: Add parse metric test for frontend metric"). When I compile with make DEBUG=y it works fine and I do not get a core dump. It turned out that the above listed function call chain worked on a struct pmu_event array which requires a trailing element with zeroes which was missing. The marco map_for_each_event() loops over that array tests for members metric_expr/metric_name/metric_group being non-NULL. Adding this element fixes the issue. Output after: [root@t35lp46 perf]# ./perf test 67 67: Parse and process metrics : Ok [root@t35lp46 perf]# Committer notes: As Ian remarks, this is not s390 specific: <quote Ian> This also shows up with address sanitizer on all architectures (perhaps change the patch title) and perhaps add a "Fixes: <commit>" tag. ================================================================= ==4718==ERROR: AddressSanitizer: global-buffer-overflow on address 0x55c93b4d59e8 at pc 0x55c93a1541e2 bp 0x7ffd24327c60 sp 0x7ffd24327c58 READ of size 8 at 0x55c93b4d59e8 thread T0 #0 0x55c93a1541e1 in find_metric tools/perf/util/metricgroup.c:764:2 #1 0x55c93a153e6c in __resolve_metric tools/perf/util/metricgroup.c:844:9 #2 0x55c93a152f18 in resolve_metric tools/perf/util/metricgroup.c:881:9 #3 0x55c93a1528db in metricgroup__add_metric tools/perf/util/metricgroup.c:943:9 #4 0x55c93a151996 in metricgroup__add_metric_list tools/perf/util/metricgroup.c:988:9 #5 0x55c93a1511b9 in parse_groups tools/perf/util/metricgroup.c:1040:8 #6 0x55c93a1513e1 in metricgroup__parse_groups_test tools/perf/util/metricgroup.c:1082:9 #7 0x55c93a0108ae in __compute_metric tools/perf/tests/parse-metric.c:159:8 #8 0x55c93a010744 in compute_metric tools/perf/tests/parse-metric.c:189:9 #9 0x55c93a00f5ee in test_ipc tools/perf/tests/parse-metric.c:208:2 #10 0x55c93a00f1e8 in test__parse_metric tools/perf/tests/parse-metric.c:345:2 #11 0x55c939fd7202 in run_test tools/perf/tests/builtin-test.c:410:9 #12 0x55c939fd6736 in test_and_print tools/perf/tests/builtin-test.c:440:9 #13 0x55c939fd58c3 in __cmd_test tools/perf/tests/builtin-test.c:661:4 #14 0x55c939fd4e02 in cmd_test tools/perf/tests/builtin-test.c:807:9 #15 0x55c939e4763d in run_builtin tools/perf/perf.c:313:11 #16 0x55c939e46475 in handle_internal_command tools/perf/perf.c:365:8 #17 0x55c939e4737e in run_argv tools/perf/perf.c:409:2 #18 0x55c939e45f7e in main tools/perf/perf.c:539:3 0x55c93b4d59e8 is located 0 bytes to the right of global variable 'pme_test' defined in 'tools/perf/tests/parse-metric.c:17:25' (0x55c93b4d54a0) of size 1352 SUMMARY: AddressSanitizer: global-buffer-overflow tools/perf/util/metricgroup.c:764:2 in find_metric Shadow bytes around the buggy address: 0x0ab9a7692ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0ab9a7692af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0ab9a7692b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0ab9a7692b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0ab9a7692b20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 =>0x0ab9a7692b30: 00 00 00 00 00 00 00 00 00 00 00 00 00[f9]f9 f9 0x0ab9a7692b40: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 0x0ab9a7692b50: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 0x0ab9a7692b60: f9 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00 0x0ab9a7692b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0ab9a7692b80: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb Shadow gap: cc </quote> I'm also adding the missing "Fixes" tag and setting just .name to NULL, as doing it that way is more compact (the compiler will zero out everything else) and the table iterators look for .name being NULL as the sentinel marking the end of the table. Fixes: 0a507af ("perf tests: Add parse metric test for ipc metric") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Ian Rogers <irogers@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Link: http://lore.kernel.org/lkml/20200825071211.16959-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
__apply_to_page_range() is also used to change and/or allocate page-table pages in the vmalloc area of the address space. Make sure these changes get synchronized to other page-tables in the system by calling arch_sync_kernel_mappings() when necessary. The impact appears limited to x86-32, where apply_to_page_range may miss updating the PMD. That leads to explosions in drivers like BUG: unable to handle page fault for address: fe036000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page *pde = 00000000 Oops: 0002 [#1] SMP CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ #16 Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015 EIP: __execlists_context_alloc+0x132/0x2d0 [i915] Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 <c7> 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81 EAX: 5a5a5a5a EBX: f60ca000 ECX: fe036000 EDX: 00000001 ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286 CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: fffe0ff0 DR7: 00000400 Call Trace: execlists_context_alloc+0x10/0x20 [i915] intel_context_alloc_state+0x3f/0x70 [i915] __intel_context_do_pin+0x117/0x170 [i915] i915_gem_do_execbuffer+0xcc7/0x2500 [i915] i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915] drm_ioctl_kernel+0x8f/0xd0 drm_ioctl+0x223/0x3d0 __ia32_sys_ioctl+0x1ab/0x760 __do_fast_syscall_32+0x3f/0x70 do_fast_syscall_32+0x29/0x60 do_SYSENTER_32+0x15/0x20 entry_SYSENTER_32+0x9f/0xf2 EIP: 0xb7f28559 Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296 Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan CR2: 00000000fe036000 It looks like kasan, xen and i915 are vulnerable. Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking after 30-or-so minutes, and machine is unusable" [sfr@canb.auug.org.au: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h] Link: https://lkml.kernel.org/r/20200825172508.16800a4f@canb.auug.org.au [chris@chris-wilson.co.uk: changelog addition] [pavel@ucw.cz: changelog addition] Fixes: 2ba3e69 ("mm/vmalloc: track which page-table levels were modified") Fixes: 86cf69f ("x86/mm/32: implement arch_sync_kernel_mappings()") Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Chris Wilson <chris@chris-wilson.co.uk> [x86-32] Tested-by: Pavel Machek <pavel@ucw.cz> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> [5.8+] Link: https://lkml.kernel.org/r/20200821123746.16904-1-joro@8bytes.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The evsel->unit borrows a pointer of pmu event or alias instead of owns a string. But tool event (duration_time) passes a result of strdup() caused a leak. It was found by ASAN during metric test: Direct leak of 210 byte(s) in 70 object(s) allocated from: #0 0x7fe366fca0b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5) #1 0x559fbbcc6ea3 in add_event_tool util/parse-events.c:414 #2 0x559fbbcc6ea3 in parse_events_add_tool util/parse-events.c:1414 #3 0x559fbbd8474d in parse_events_parse util/parse-events.y:439 #4 0x559fbbcc95da in parse_events__scanner util/parse-events.c:2096 #5 0x559fbbcc95da in __parse_events util/parse-events.c:2141 #6 0x559fbbc28555 in check_parse_id tests/pmu-events.c:406 #7 0x559fbbc28555 in check_parse_id tests/pmu-events.c:393 #8 0x559fbbc28555 in check_parse_cpu tests/pmu-events.c:415 #9 0x559fbbc28555 in test_parsing tests/pmu-events.c:498 #10 0x559fbbc0109b in run_test tests/builtin-test.c:410 #11 0x559fbbc0109b in test_and_print tests/builtin-test.c:440 #12 0x559fbbc03e69 in __cmd_test tests/builtin-test.c:695 #13 0x559fbbc03e69 in cmd_test tests/builtin-test.c:807 #14 0x559fbbc691f4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312 #15 0x559fbbb071a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364 #16 0x559fbbb071a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408 #17 0x559fbbb071a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538 #18 0x7fe366b68cc9 in __libc_start_main ../csu/libc-start.c:308 Fixes: f0fbb11 ("perf stat: Implement duration_time as a proper event") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20200915031819.386559-6-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The test_generic_metric() missed to release entries in the pctx. Asan reported following leak (and more): Direct leak of 128 byte(s) in 1 object(s) allocated from: #0 0x7f4c9396980e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e) #1 0x55f7e748cc14 in hashmap_grow (/home/namhyung/project/linux/tools/perf/perf+0x90cc14) #2 0x55f7e748d497 in hashmap__insert (/home/namhyung/project/linux/tools/perf/perf+0x90d497) #3 0x55f7e7341667 in hashmap__set /home/namhyung/project/linux/tools/perf/util/hashmap.h:111 #4 0x55f7e7341667 in expr__add_ref util/expr.c:120 #5 0x55f7e7292436 in prepare_metric util/stat-shadow.c:783 #6 0x55f7e729556d in test_generic_metric util/stat-shadow.c:858 #7 0x55f7e712390b in compute_single tests/parse-metric.c:128 #8 0x55f7e712390b in __compute_metric tests/parse-metric.c:180 #9 0x55f7e712446d in compute_metric tests/parse-metric.c:196 #10 0x55f7e712446d in test_dcache_l2 tests/parse-metric.c:295 #11 0x55f7e712446d in test__parse_metric tests/parse-metric.c:355 #12 0x55f7e70be09b in run_test tests/builtin-test.c:410 #13 0x55f7e70be09b in test_and_print tests/builtin-test.c:440 #14 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661 #15 0x55f7e70c101a in cmd_test tests/builtin-test.c:807 #16 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312 #17 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364 #18 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408 #19 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538 #20 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308 Fixes: 6d432c4 ("perf tools: Add test_generic_metric function") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20200915031819.386559-8-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The metricgroup__add_metric() can find multiple match for a metric group and it's possible to fail. Also it can fail in the middle like in resolve_metric() even for single metric. In those cases, the intermediate list and ids will be leaked like: Direct leak of 3 byte(s) in 1 object(s) allocated from: #0 0x7f4c938f40b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5) #1 0x55f7e71c1bef in __add_metric util/metricgroup.c:683 #2 0x55f7e71c31d0 in add_metric util/metricgroup.c:906 #3 0x55f7e71c3844 in metricgroup__add_metric util/metricgroup.c:940 #4 0x55f7e71c488d in metricgroup__add_metric_list util/metricgroup.c:993 #5 0x55f7e71c488d in parse_groups util/metricgroup.c:1045 #6 0x55f7e71c60a4 in metricgroup__parse_groups_test util/metricgroup.c:1087 #7 0x55f7e71235ae in __compute_metric tests/parse-metric.c:164 #8 0x55f7e7124650 in compute_metric tests/parse-metric.c:196 #9 0x55f7e7124650 in test_recursion_fail tests/parse-metric.c:318 #10 0x55f7e7124650 in test__parse_metric tests/parse-metric.c:356 #11 0x55f7e70be09b in run_test tests/builtin-test.c:410 #12 0x55f7e70be09b in test_and_print tests/builtin-test.c:440 #13 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661 #14 0x55f7e70c101a in cmd_test tests/builtin-test.c:807 #15 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312 #16 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364 #17 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408 #18 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538 #19 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308 Fixes: 83de0b7 ("perf metric: Collect referenced metrics in struct metric_ref_node") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20200915031819.386559-9-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This fix is for a failure that occurred in the DWARF unwind perf test. Stack unwinders may probe memory when looking for frames. Memory sanitizer will poison and track uninitialized memory on the stack, and on the heap if the value is copied to the heap. This can lead to false memory sanitizer failures for the use of an uninitialized value. Avoid this problem by removing the poison on the copied stack. The full msan failure with track origins looks like: ==2168==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x559ceb10755b in handle_cfi elfutils/libdwfl/frame_unwind.c:648:8 #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #23 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was stored to memory at #0 0x559ceb106acf in __libdwfl_frame_reg_set elfutils/libdwfl/frame_unwind.c:77:22 #1 0x559ceb106acf in handle_cfi elfutils/libdwfl/frame_unwind.c:627:13 #2 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #3 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #4 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #5 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #6 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #7 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #8 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #9 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #10 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #11 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #12 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #13 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #14 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #15 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #16 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #17 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #18 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #19 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #20 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #21 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #22 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #23 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #24 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was stored to memory at #0 0x559ceb106a54 in handle_cfi elfutils/libdwfl/frame_unwind.c:613:9 #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #23 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was stored to memory at #0 0x559ceaff8800 in memory_read tools/perf/util/unwind-libdw.c:156:10 #1 0x559ceb10f053 in expr_eval elfutils/libdwfl/frame_unwind.c:501:13 #2 0x559ceb1060cc in handle_cfi elfutils/libdwfl/frame_unwind.c:603:18 #3 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #4 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #5 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #6 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #7 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #8 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #9 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #10 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #11 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #12 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #13 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #14 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #15 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #16 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #17 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #18 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #19 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #20 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #21 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #22 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #23 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #24 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #25 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was stored to memory at #0 0x559cea9027d9 in __msan_memcpy llvm/llvm-project/compiler-rt/lib/msan/msan_interceptors.cpp:1558:3 #1 0x559cea9d2185 in sample_ustack tools/perf/arch/x86/tests/dwarf-unwind.c:41:2 #2 0x559cea9d202c in test__arch_unwind_sample tools/perf/arch/x86/tests/dwarf-unwind.c:72:9 #3 0x559ceabc9cbd in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:106:6 #4 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #5 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #6 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #7 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #8 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #9 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #10 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #11 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #12 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #13 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #14 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #15 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #16 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #17 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was created by an allocation of 'bf' in the stack frame of function 'perf_event__synthesize_mmap_events' #0 0x559ceafc5f60 in perf_event__synthesize_mmap_events tools/perf/util/synthetic-events.c:445 SUMMARY: MemorySanitizer: use-of-uninitialized-value elfutils/libdwfl/frame_unwind.c:648:8 in handle_cfi Signed-off-by: Ian Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: clang-built-linux@googlegroups.com Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sandeep Dasgupta <sdasgup@google.com> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20201113182053.754625-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Routes are currently processed from a workqueue whereas nexthop objects are processed in system call context. This can result in the driver not finding a suitable nexthop group for a route and issuing a warning [1]. Fix this by ignoring such routes earlier in the process. The subsequent deletion notification will be ignored as well. [1] WARNING: CPU: 2 PID: 7754 at drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:4853 mlxsw_sp_router_fib_event_work+0x1112/0x1e00 [mlxsw_spectrum] [...] CPU: 2 PID: 7754 Comm: kworker/u8:0 Not tainted 5.11.0-rc6-cq-20210207-1 #16 Hardware name: Mellanox Technologies Ltd. MSN2100/SA001390, BIOS 5.6.5 05/24/2018 Workqueue: mlxsw_core_ordered mlxsw_sp_router_fib_event_work [mlxsw_spectrum] RIP: 0010:mlxsw_sp_router_fib_event_work+0x1112/0x1e00 [mlxsw_spectrum] Fixes: cdd6cfc ("mlxsw: spectrum_router: Allow programming routes with nexthop objects") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reported-by: Alex Veber <alexve@nvidia.com> Tested-by: Alex Veber <alexve@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past multiple commits: * ae5e070 ("btrfs: qgroup: don't try to wait flushing if we're already holding a transaction") * 6f23277 ("btrfs: qgroup: don't commit transaction when we already hold the handle") Tried to solve various aspects of this but this was always a whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock scenario involving btrfs_delayed_node::mutex. Namely, one thread can call btrfs_dirty_inode as a result of reading a file and modifying its atime: PID: 6963 TASK: ffff8c7f3f94c000 CPU: 2 COMMAND: "test" #0 __schedule at ffffffffa529e07d #1 schedule at ffffffffa529e4ff #2 schedule_timeout at ffffffffa52a1bdd #3 wait_for_completion at ffffffffa529eeea <-- sleeps with delayed node mutex held #4 start_delalloc_inodes at ffffffffc0380db5 #5 btrfs_start_delalloc_snapshot at ffffffffc0393836 #6 try_flush_qgroup at ffffffffc03f04b2 #7 __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6 <-- tries to reserve space and starts delalloc inodes. #8 btrfs_delayed_update_inode at ffffffffc03e31aa <-- acquires delayed node mutex #9 btrfs_update_inode at ffffffffc0385ba8 #10 btrfs_dirty_inode at ffffffffc038627b <-- TRANSACTIION OPENED #11 touch_atime at ffffffffa4cf0000 #12 generic_file_read_iter at ffffffffa4c1f123 #13 new_sync_read at ffffffffa4ccdc8a #14 vfs_read at ffffffffa4cd0849 #15 ksys_read at ffffffffa4cd0bd1 #16 do_syscall_64 at ffffffffa4a052eb #17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c This will cause an asynchronous work to flush the delalloc inodes to happen which can try to acquire the same delayed_node mutex: PID: 455 TASK: ffff8c8085fa4000 CPU: 5 COMMAND: "kworker/u16:30" #0 __schedule at ffffffffa529e07d #1 schedule at ffffffffa529e4ff #2 schedule_preempt_disabled at ffffffffa529e80a #3 __mutex_lock at ffffffffa529fdcb <-- goes to sleep, never wakes up. #4 btrfs_delayed_update_inode at ffffffffc03e3143 <-- tries to acquire the mutex #5 btrfs_update_inode at ffffffffc0385ba8 <-- this is the same inode that pid 6963 is holding #6 cow_file_range_inline.constprop.78 at ffffffffc0386be7 #7 cow_file_range at ffffffffc03879c1 #8 btrfs_run_delalloc_range at ffffffffc038894c #9 writepage_delalloc at ffffffffc03a3c8f #10 __extent_writepage at ffffffffc03a4c01 #11 extent_write_cache_pages at ffffffffc03a500b #12 extent_writepages at ffffffffc03a6de2 #13 do_writepages at ffffffffa4c277eb #14 __filemap_fdatawrite_range at ffffffffa4c1e5bb #15 btrfs_run_delalloc_work at ffffffffc0380987 <-- starts running delayed nodes #16 normal_work_helper at ffffffffc03b706c #17 process_one_work at ffffffffa4aba4e4 #18 worker_thread at ffffffffa4aba6fd #19 kthread at ffffffffa4ac0a3d #20 ret_from_fork at ffffffffa54001ff To fully address those cases the complete fix is to never issue any flushing while holding the transaction or the delayed node lock. This patch achieves it by calling qgroup_reserve_meta directly which will either succeed without flushing or will fail and return -EDQUOT. In the latter case that return value is going to be propagated to btrfs_dirty_inode which will fallback to start a new transaction. That's fine as the majority of time we expect the inode will have BTRFS_DELAYED_NODE_INODE_DIRTY flag set which will result in directly copying the in-memory state. Fixes: c53e965 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The evlist and the cpu/thread maps should be released together. Otherwise following error was reported by Asan. Note that this test still has memory leaks in DSOs so it still fails even after this change. I'll take a look at that too. # perf test -v 26 26: Object code reading : --- start --- test child forked, pid 154184 Looking at the vmlinux_path (8 entries long) symsrc__init: build id mismatch for vmlinux. symsrc__init: cannot get elf header. Using /proc/kcore for kernel data Using /proc/kallsyms for symbols Parsing event 'cycles' mmap size 528384B ... ================================================================= ==154184==ERROR: LeakSanitizer: detected memory leaks Direct leak of 439 byte(s) in 1 object(s) allocated from: #0 0x7fcb66e77037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x55ad9b7e821e in dso__new_id util/dso.c:1256 #2 0x55ad9b8cfd4a in __machine__addnew_vdso util/vdso.c:132 #3 0x55ad9b8cfd4a in machine__findnew_vdso util/vdso.c:347 #4 0x55ad9b845b7e in map__new util/map.c:176 #5 0x55ad9b8415a2 in machine__process_mmap2_event util/machine.c:1787 #6 0x55ad9b8fab16 in perf_tool__process_synth_event util/synthetic-events.c:64 #7 0x55ad9b8fab16 in perf_event__synthesize_mmap_events util/synthetic-events.c:499 #8 0x55ad9b8fbfdf in __event__synthesize_thread util/synthetic-events.c:741 #9 0x55ad9b8ff3e3 in perf_event__synthesize_thread_map util/synthetic-events.c:833 #10 0x55ad9b738585 in do_test_code_reading tests/code-reading.c:608 #11 0x55ad9b73b25d in test__code_reading tests/code-reading.c:722 #12 0x55ad9b6f28fb in run_test tests/builtin-test.c:428 #13 0x55ad9b6f28fb in test_and_print tests/builtin-test.c:458 #14 0x55ad9b6f4a53 in __cmd_test tests/builtin-test.c:679 #15 0x55ad9b6f4a53 in cmd_test tests/builtin-test.c:825 #16 0x55ad9b760cc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #17 0x55ad9b5eaa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #18 0x55ad9b5eaa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #19 0x55ad9b5eaa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #20 0x7fcb669acd09 in __libc_start_main ../csu/libc-start.c:308 ... SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s). test child finished with 1 ---- end ---- Object code reading: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-6-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pablo Neira Ayuso says: ==================== netfilter: flowtable enhancements [ This is v2 that includes documentation enhancements, including existing limitations. This is a rebase on top on net-next. ] The following patchset augments the Netfilter flowtable fastpath to support for network topologies that combine IP forwarding, bridge, classic VLAN devices, bridge VLAN filtering, DSA and PPPoE. This includes support for the flowtable software and hardware datapaths. The following pictures provides an example scenario: fast path! .------------------------. / \ | IP forwarding | | / \ \/ | br0 wan ..... eth0 . / \ host C -> veth1 veth2 . switch/router . . eth0 host A The bridge master device 'br0' has an IP address and a DHCP server is also assumed to be running to provide connectivity to host A which reaches the Internet through 'br0' as default gateway. Then, packet enters the IP forwarding path and Netfilter is used to NAT the packets before they leave through the wan device. The general idea is to accelerate forwarding by building a fast path that takes packets from the ingress path of the bridge port and place them in the egress path of the wan device (and vice versa). Hence, skipping the classic bridge and IP stack paths. ** Patch from #1 to #6 add the infrastructure which describes the list of netdevice hops to reach a given destination MAC address in the local network topology. Patch #1 adds dev_fill_forward_path() and .ndo_fill_forward_path() to netdev_ops. Patch #2 adds .ndo_fill_forward_path for vlan devices, which provides the next device hop via vlan->real_dev, the vlan ID and the protocol. Patch #3 adds .ndo_fill_forward_path for bridge devices, which allows to make lookups to the FDB to locate the next device hop (bridge port) in the forwarding path. Patch #4 extends bridge .ndo_fill_forward_path to support for bridge VLAN filtering. Patch #5 adds .ndo_fill_forward_path for PPPoE devices. Patch #6 adds .ndo_fill_forward_path for DSA. Patches from #7 to #14 update the flowtable software datapath: Patch #7 adds the transmit path type field to the flow tuple. Two transmit paths are supported so far: the neighbour and the xfrm transmit paths. Patch #8 and #9 update the flowtable datapath to use dev_fill_forward_path() to obtain the real ingress/egress device for the flowtable datapath. This adds the new ethernet xmit direct path to the flowtable. Patch #10 adds native flowtable VLAN support (up to 2 VLAN tags) through dev_fill_forward_path(). The flowtable stores the VLAN id and protocol in the flow tuple. Patch #11 adds native flowtable bridge VLAN filter support through dev_fill_forward_path(). Patch #12 adds native flowtable bridge PPPoE through dev_fill_forward_path(). Patch #13 adds DSA support through dev_fill_forward_path(). Patch #14 extends flowtable selftests to cover for flowtable software datapath enhancements. ** Patches from #15 to #20 update the flowtable hardware offload datapath: Patch #15 extends the flowtable hardware offload to support for the direct ethernet xmit path. This also includes VLAN support. Patch #16 stores the egress real device in the flow tuple. The software flowtable datapath uses dev_hard_header() to transmit packets, hence it might refer to VLAN/DSA/PPPoE software device, not the real ethernet device. Patch #17 deals with switchdev PVID hardware offload to skip it on egress. Patch #18 adds FLOW_ACTION_PPPOE_PUSH to the flow_offload action API. Patch #19 extends the flowtable hardware offload to support for PPPoE Patch #20 adds TC_SETUP_FT support for DSA. ** Patches from #20 to #23: Felix Fietkau adds a new driver which support hardware offload for the mtk PPE engine through the existing flow offload API which supports for the flowtable enhancements coming in this batch. Patch #24 extends the documentation and describe existing limitations. Please, apply, thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
I got several memory leak reports from Asan with a simple command. It was because VDSO is not released due to the refcount. Like in __dsos_addnew_id(), it should put the refcount after adding to the list. $ perf record true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ] ================================================================= ==692599==ERROR: LeakSanitizer: detected memory leaks Direct leak of 439 byte(s) in 1 object(s) allocated from: #0 0x7fea52341037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x559bce4aa8ee in dso__new_id util/dso.c:1256 #2 0x559bce59245a in __machine__addnew_vdso util/vdso.c:132 #3 0x559bce59245a in machine__findnew_vdso util/vdso.c:347 #4 0x559bce50826c in map__new util/map.c:175 #5 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787 #6 0x559bce512f6b in machines__deliver_event util/session.c:1481 #7 0x559bce515107 in perf_session__deliver_event util/session.c:1551 #8 0x559bce51d4d2 in do_flush util/ordered-events.c:244 #9 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323 #10 0x559bce519bea in __perf_session__process_events util/session.c:2268 #11 0x559bce519bea in perf_session__process_events util/session.c:2297 #12 0x559bce2e7a52 in process_buildids /home/namhyung/project/linux/tools/perf/builtin-record.c:1017 #13 0x559bce2e7a52 in record__finish_output /home/namhyung/project/linux/tools/perf/builtin-record.c:1234 #14 0x559bce2ed4f6 in __cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2026 #15 0x559bce2ed4f6 in cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2858 #16 0x559bce422db4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #17 0x559bce2acac8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #18 0x559bce2acac8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #19 0x559bce2acac8 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #20 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308 Indirect leak of 32 byte(s) in 1 object(s) allocated from: #0 0x7fea52341037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x559bce520907 in nsinfo__copy util/namespaces.c:169 #2 0x559bce50821b in map__new util/map.c:168 #3 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787 #4 0x559bce512f6b in machines__deliver_event util/session.c:1481 #5 0x559bce515107 in perf_session__deliver_event util/session.c:1551 #6 0x559bce51d4d2 in do_flush util/ordered-events.c:244 #7 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323 #8 0x559bce519bea in __perf_session__process_events util/session.c:2268 #9 0x559bce519bea in perf_session__process_events util/session.c:2297 #10 0x559bce2e7a52 in process_buildids /home/namhyung/project/linux/tools/perf/builtin-record.c:1017 #11 0x559bce2e7a52 in record__finish_output /home/namhyung/project/linux/tools/perf/builtin-record.c:1234 #12 0x559bce2ed4f6 in __cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2026 #13 0x559bce2ed4f6 in cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2858 #14 0x559bce422db4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #15 0x559bce2acac8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #16 0x559bce2acac8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #17 0x559bce2acac8 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #18 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s). Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20210315045641.700430-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Similarly to bitmap functions, find_next_*_bit() users will benefit if we'll handle a case of bitmaps that fit into a single word inline. In the very best case, the compiler may replace a function call with a few instructions. This is the quite typical find_next_bit() user: unsigned int cpumask_next(int n, const struct cpumask *srcp) { /* -1 is a legal arg here. */ if (n != -1) cpumask_check(n); return find_next_bit(cpumask_bits(srcp), nr_cpumask_bits, n + 1); } EXPORT_SYMBOL(cpumask_next); Currently, on ARM64 the generated code looks like this: 0000000000000000 <cpumask_next>: 0: a9bf7bfd stp x29, x30, [sp, #-16]! 4: 11000402 add w2, w0, #0x1 8: aa0103e0 mov x0, x1 c: d2800401 mov x1, #0x40 // #64 10: 910003fd mov x29, sp 14: 93407c42 sxtw x2, w2 18: 94000000 bl 0 <find_next_bit> 1c: a8c17bfd ldp x29, x30, [sp], #16 20: d65f03c0 ret 24: d503201f nop After applying this patch: 0000000000000140 <cpumask_next>: 140: 11000400 add w0, w0, #0x1 144: 93407c00 sxtw x0, w0 148: f100fc1f cmp x0, #0x3f 14c: 54000168 b.hi 178 <cpumask_next+0x38> // b.pmore 150: f9400023 ldr x3, [x1] 154: 92800001 mov x1, #0xffffffffffffffff // #-1 158: 9ac02020 lsl x0, x1, x0 15c: 52800802 mov w2, #0x40 // #64 160: 8a030001 and x1, x0, x3 164: dac00020 rbit x0, x1 168: f100003f cmp x1, #0x0 16c: dac01000 clz x0, x0 170: 1a800040 csel w0, w2, w0, eq // eq = none 174: d65f03c0 ret 178: 52800800 mov w0, #0x40 // #64 17c: d65f03c0 ret find_next_bit() call is replaced with 6 instructions. find_next_bit() itself is 41 instructions plus function call overhead. Despite inlining, the scripts/bloat-o-meter report smaller .text size after applying the series: add/remove: 11/9 grow/shrink: 233/176 up/down: 5780/-6768 (-988) Link: https://lkml.kernel.org/r/20210401003153.97325-10-yury.norov@gmail.com Signed-off-by: Yury Norov <yury.norov@gmail.com> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Alexey Klimov <aklimov@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Sterba <dsterba@suse.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Jianpeng Ma <jianpeng.ma@intel.com> Cc: Joe Perches <joe@perches.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Rich Felker <dalias@libc.org> Cc: Stefano Brivio <sbrivio@redhat.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Wolfram Sang <wsa+renesas@sang-engineering.com> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calling the 'ql_sem_spinlock', the driver has already acquired the spin lock, so the driver should not call 'ssleep' in atomic context. This bug can be fixed by using 'mdelay' instead of 'ssleep'. The KASAN's log reveals it: [ 3.238124 ] BUG: scheduling while atomic: swapper/0/1/0x00000002 [ 3.238748 ] 2 locks held by swapper/0/1: [ 3.239151 ] #0: ffff88810177b240 (&dev->mutex){....}-{3:3}, at: __device_driver_lock+0x41/0x60 [ 3.240026 ] #1: ffff888107c60e28 (&qdev->hw_lock){....}-{2:2}, at: ql3xxx_probe+0x2aa/0xea0 [ 3.240873 ] Modules linked in: [ 3.241187 ] irq event stamp: 460854 [ 3.241541 ] hardirqs last enabled at (460853): [<ffffffff843051bf>] _raw_spin_unlock_irqrestore+0x4f/0x70 [ 3.242245 ] hardirqs last disabled at (460854): [<ffffffff843058ca>] _raw_spin_lock_irqsave+0x2a/0x70 [ 3.242245 ] softirqs last enabled at (446076): [<ffffffff846002e4>] __do_softirq+0x2e4/0x4b1 [ 3.242245 ] softirqs last disabled at (446069): [<ffffffff811ba5e0>] irq_exit_rcu+0x100/0x110 [ 3.242245 ] Preemption disabled at: [ 3.242245 ] [<ffffffff828ca5ba>] ql3xxx_probe+0x2aa/0xea0 [ 3.242245 ] Kernel panic - not syncing: scheduling while atomic [ 3.242245 ] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc1-00145 -gee7dc339169-dirty #16 [ 3.242245 ] Call Trace: [ 3.242245 ] dump_stack+0xba/0xf5 [ 3.242245 ] ? ql3xxx_probe+0x1f0/0xea0 [ 3.242245 ] panic+0x15a/0x3f2 [ 3.242245 ] ? vprintk+0x76/0x150 [ 3.242245 ] ? ql3xxx_probe+0x2aa/0xea0 [ 3.242245 ] __schedule_bug+0xae/0xe0 [ 3.242245 ] __schedule+0x72e/0xa00 [ 3.242245 ] schedule+0x43/0xf0 [ 3.242245 ] schedule_timeout+0x28b/0x500 [ 3.242245 ] ? del_timer_sync+0xf0/0xf0 [ 3.242245 ] ? msleep+0x2f/0x70 [ 3.242245 ] msleep+0x59/0x70 [ 3.242245 ] ql3xxx_probe+0x307/0xea0 [ 3.242245 ] ? _raw_spin_unlock_irqrestore+0x3a/0x70 [ 3.242245 ] ? pci_device_remove+0x110/0x110 [ 3.242245 ] local_pci_probe+0x45/0xa0 [ 3.242245 ] pci_device_probe+0x12b/0x1d0 [ 3.242245 ] really_probe+0x2a9/0x610 [ 3.242245 ] driver_probe_device+0x90/0x1d0 [ 3.242245 ] ? mutex_lock_nested+0x1b/0x20 [ 3.242245 ] device_driver_attach+0x68/0x70 [ 3.242245 ] __driver_attach+0x124/0x1b0 [ 3.242245 ] ? device_driver_attach+0x70/0x70 [ 3.242245 ] bus_for_each_dev+0xbb/0x110 [ 3.242245 ] ? rdinit_setup+0x45/0x45 [ 3.242245 ] driver_attach+0x27/0x30 [ 3.242245 ] bus_add_driver+0x1eb/0x2a0 [ 3.242245 ] driver_register+0xa9/0x180 [ 3.242245 ] __pci_register_driver+0x82/0x90 [ 3.242245 ] ? yellowfin_init+0x25/0x25 [ 3.242245 ] ql3xxx_driver_init+0x23/0x25 [ 3.242245 ] do_one_initcall+0x7f/0x3d0 [ 3.242245 ] ? rdinit_setup+0x45/0x45 [ 3.242245 ] ? rcu_read_lock_sched_held+0x4f/0x80 [ 3.242245 ] kernel_init_freeable+0x2aa/0x301 [ 3.242245 ] ? rest_init+0x2c0/0x2c0 [ 3.242245 ] kernel_init+0x18/0x190 [ 3.242245 ] ? rest_init+0x2c0/0x2c0 [ 3.242245 ] ? rest_init+0x2c0/0x2c0 [ 3.242245 ] ret_from_fork+0x1f/0x30 [ 3.242245 ] Dumping ftrace buffer: [ 3.242245 ] (ftrace buffer empty) [ 3.242245 ] Kernel Offset: disabled [ 3.242245 ] Rebooting in 1 seconds. Reported-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled. The reason is that nsfs tries to access ns->ops but the proc_ns_operations is not implemented in this case. [7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010 [7.670268] pgd = 32b54000 [7.670544] [00000010] *pgd=00000000 [7.671861] Internal error: Oops: 5 [#1] SMP ARM [7.672315] Modules linked in: [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16 [7.673309] Hardware name: Generic DT based system [7.673642] PC is at nsfs_evict+0x24/0x30 [7.674486] LR is at clear_inode+0x20/0x9c The same to tun SIOCGSKNS command. To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is disabled. Meanwhile move it to right place net/core/net_namespace.c. Signed-off-by: Changbin Du <changbin.du@gmail.com> Fixes: c62cce2 ("net: add an ioctl to get a socket network namespace") Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ASan reports a heap-buffer-overflow in elf_sec__is_text when using perf-top. The bug is caused by the fact that secstrs is built from runtime_ss, while shdr is built from syms_ss if shdr.sh_type != SHT_NOBITS. Therefore, they point to two different ELF files. This patch renames secstrs to secstrs_run and adds secstrs_sym, so that the correct secstrs is chosen depending on shdr.sh_type. $ ASAN_OPTIONS=abort_on_error=1:disable_coredump=0:unmap_shadow_on_exit=1 ./perf top ================================================================= ==363148==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61300009add6 at pc 0x00000049875c bp 0x7f4f56446440 sp 0x7f4f56445bf0 READ of size 1 at 0x61300009add6 thread T6 #0 0x49875b in StrstrCheck(void*, char*, char const*, char const*) (/home/user/linux/tools/perf/perf+0x49875b) #1 0x4d13a2 in strstr (/home/user/linux/tools/perf/perf+0x4d13a2) #2 0xacae36 in elf_sec__is_text /home/user/linux/tools/perf/util/symbol-elf.c:176:9 #3 0xac3ec9 in elf_sec__filter /home/user/linux/tools/perf/util/symbol-elf.c:187:9 #4 0xac2c3d in dso__load_sym /home/user/linux/tools/perf/util/symbol-elf.c:1254:20 #5 0x883981 in dso__load /home/user/linux/tools/perf/util/symbol.c:1897:9 #6 0x8e6248 in map__load /home/user/linux/tools/perf/util/map.c:332:7 #7 0x8e66e5 in map__find_symbol /home/user/linux/tools/perf/util/map.c:366:6 #8 0x7f8278 in machine__resolve /home/user/linux/tools/perf/util/event.c:707:13 #9 0x5f3d1a in perf_event__process_sample /home/user/linux/tools/perf/builtin-top.c:773:6 #10 0x5f30e4 in deliver_event /home/user/linux/tools/perf/builtin-top.c:1197:3 #11 0x908a72 in do_flush /home/user/linux/tools/perf/util/ordered-events.c:244:9 #12 0x905fae in __ordered_events__flush /home/user/linux/tools/perf/util/ordered-events.c:323:8 #13 0x9058db in ordered_events__flush /home/user/linux/tools/perf/util/ordered-events.c:341:9 #14 0x5f19b1 in process_thread /home/user/linux/tools/perf/builtin-top.c:1109:7 #15 0x7f4f6a21a298 in start_thread /usr/src/debug/glibc-2.33-16.fc34.x86_64/nptl/pthread_create.c:481:8 #16 0x7f4f697d0352 in clone ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 0x61300009add6 is located 10 bytes to the right of 332-byte region [0x61300009ac80,0x61300009adcc) allocated by thread T6 here: #0 0x4f3f7f in malloc (/home/user/linux/tools/perf/perf+0x4f3f7f) #1 0x7f4f6a0a88d9 (/lib64/libelf.so.1+0xa8d9) Thread T6 created by T0 here: #0 0x464856 in pthread_create (/home/user/linux/tools/perf/perf+0x464856) #1 0x5f06e0 in __cmd_top /home/user/linux/tools/perf/builtin-top.c:1309:6 #2 0x5ef19f in cmd_top /home/user/linux/tools/perf/builtin-top.c:1762:11 #3 0x7b28c0 in run_builtin /home/user/linux/tools/perf/perf.c:313:11 #4 0x7b119f in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8 #5 0x7b2423 in run_argv /home/user/linux/tools/perf/perf.c:409:2 #6 0x7b0c19 in main /home/user/linux/tools/perf/perf.c:539:3 #7 0x7f4f696f7b74 in __libc_start_main /usr/src/debug/glibc-2.33-16.fc34.x86_64/csu/../csu/libc-start.c:332:16 SUMMARY: AddressSanitizer: heap-buffer-overflow (/home/user/linux/tools/perf/perf+0x49875b) in StrstrCheck(void*, char*, char const*, char const*) Shadow bytes around the buggy address: 0x0c268000b560: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b570: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b580: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b590: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0c268000b5a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 =>0x0c268000b5b0: 00 00 00 00 00 00 00 00 00 04[fa]fa fa fa fa fa 0x0c268000b5c0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00 0x0c268000b5d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0c268000b5e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0c268000b5f0: 07 fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb Shadow gap: cc ==363148==ABORTING Suggested-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Riccardo Mancini <rickyman7@gmail.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Fabian Hemmer <copy@copy.sh> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Remi Bernon <rbernon@codeweavers.com> Link: http://lore.kernel.org/lkml/20210621222108.196219-1-rickyman7@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It's later supposed to be either a correct address or NULL. Without the initialization, it may contain an undefined value which results in the following segmentation fault: # perf top --sort comm -g --ignore-callees=do_idle terminates with: #0 0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6 #1 0x00007ffff55e3802 in strdup () from /lib64/libc.so.6 #2 0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489 #3 hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564 #4 0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420, sample_self=sample_self@entry=true) at util/hist.c:657 #5 0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0, sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288 #6 0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38) at util/hist.c:1056 #7 iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056 #8 0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0) at util/hist.c:1231 #9 0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0) at builtin-top.c:842 #10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202 #11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244 #12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323 #13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339 #14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341 #15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339 #16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114 #17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0 #18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6 If you look at the frame #2, the code is: 488 if (he->srcline) { 489 he->srcline = strdup(he->srcline); 490 if (he->srcline == NULL) 491 goto err_rawdata; 492 } If he->srcline is not NULL (it is not NULL if it is uninitialized rubbish), it gets strdupped and strdupping a rubbish random string causes the problem. Also, if you look at the commit 1fb7d06, it adds the srcline property into the struct, but not initializing it everywhere needed. Committer notes: Now I see, when using --ignore-callees=do_idle we end up here at line 2189 in add_callchain_ip(): 2181 if (al.sym != NULL) { 2182 if (perf_hpp_list.parent && !*parent && 2183 symbol__match_regex(al.sym, &parent_regex)) 2184 *parent = al.sym; 2185 else if (have_ignore_callees && root_al && 2186 symbol__match_regex(al.sym, &ignore_callees_regex)) { 2187 /* Treat this symbol as the root, 2188 forgetting its callees. */ 2189 *root_al = al; 2190 callchain_cursor_reset(cursor); 2191 } 2192 } And the al that doesn't have the ->srcline field initialized will be copied to the root_al, so then, back to: 1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al, 1212 int max_stack_depth, void *arg) 1213 { 1214 int err, err2; 1215 struct map *alm = NULL; 1216 1217 if (al) 1218 alm = map__get(al->map); 1219 1220 err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent, 1221 iter->evsel, al, max_stack_depth); 1222 if (err) { 1223 map__put(alm); 1224 return err; 1225 } 1226 1227 err = iter->ops->prepare_entry(iter, al); 1228 if (err) 1229 goto out; 1230 1231 err = iter->ops->add_single_entry(iter, al); 1232 if (err) 1233 goto out; 1234 That al at line 1221 is what hist_entry_iter__add() (called from sample__resolve_callchain()) saw as 'root_al', and then: iter->ops->add_single_entry(iter, al); will go on with al->srcline with a bogus value, I'll add the above sequence to the cset and apply, thanks! Signed-off-by: Michael Petlan <mpetlan@redhat.com> CC: Milian Wolff <milian.wolff@kdab.com> Cc: Jiri Olsa <jolsa@redhat.com> Fixes: 1fb7d06 ("perf report Use srcline from callchain for hist entries") Link: https //lore.kernel.org/r/20210719145332.29747-1-mpetlan@redhat.com Reported-by: Juri Lelli <jlelli@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
[BUG] There is a bug report that injected ENOMEM error could leave a tree block locked while we return to user-space: BTRFS info (device loop0): enabling ssd optimizations FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 0 CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106 fail_dump lib/fault-inject.c:52 [inline] should_fail+0x13c/0x160 lib/fault-inject.c:146 should_failslab+0x5/0x10 mm/slab_common.c:1328 slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494 slab_alloc_node mm/slub.c:3120 [inline] slab_alloc mm/slub.c:3214 [inline] kmem_cache_alloc+0x44/0x280 mm/slub.c:3219 btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline] btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833 __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415 btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570 btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768 btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905 btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530 btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783 lookup_open+0x660/0x780 fs/namei.c:3282 open_last_lookups fs/namei.c:3352 [inline] path_openat+0x465/0xe20 fs/namei.c:3557 do_filp_open+0xe3/0x170 fs/namei.c:3588 do_sys_openat2+0x357/0x4a0 fs/open.c:1200 do_sys_open+0x87/0xd0 fs/open.c:1216 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x46ae99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99 RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800 RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017 R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0 ================================================ WARNING: lock held when returning to user space! 5.15.0-rc1 #16 Not tainted ------------------------------------------------ syz-executor/7579 is leaving the kernel with locks still held! 1 lock held by syz-executor/7579: #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112 [CAUSE] In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new extent buffer @buf is locked, but if later operations like adding delayed tree ref fail, we just free @buf without unlocking it, resulting above warning. [FIX] Unlock @buf in out_free_buf: label. Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/ CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
This patch adds "-j" mode to test_progs, executing tests in multiple process. "-j" mode is optional, and works with all existing test selection mechanism, as well as "-v", "-l" etc. In "-j" mode, main process use UDS/SEQPACKET to communicate to each forked worker, commanding it to run tests and collect logs. After all tests are finished, a summary is printed. main process use multiple competing threads to dispatch work to worker, trying to keep them all busy. The test status will be printed as soon as it is finished, if there are error logs, it will be printed after the final summary line. By specifying "--debug", additional debug information on server/worker communication will be printed. Example output: > ./test_progs -n 15-20 -j [ 12.801730] bpf_testmod: loading out-of-tree module taints kernel. Launching 8 workers. #20 btf_split:OK #16 btf_endian:OK #18 btf_module:OK #17 btf_map_in_map:OK #19 btf_skc_cls_ingress:OK #15 btf_dump:OK Summary: 6/20 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Yucong Sun <sunyucong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211006185619.364369-2-fallentree@fb.com
This reverts commit f1a456f. WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0 CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S 5.15.0-04194-gd852503f7711 #16 RIP: 0010:skb_try_coalesce+0x78b/0x7e0 Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61 RSP: 0018:ffff88881f449688 EFLAGS: 00010282 RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0 RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0 R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0 FS: 00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> tcp_try_coalesce+0xeb/0x290 ? tcp_parse_options+0x610/0x610 ? mark_held_locks+0x79/0xa0 tcp_queue_rcv+0x69/0x2f0 tcp_rcv_established+0xa49/0xd40 ? tcp_data_queue+0x18a0/0x18a0 tcp_v6_do_rcv+0x1c9/0x880 ? rt6_mtu_change_route+0x100/0x100 tcp_v6_rcv+0x1624/0x1830 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Attempting to defragment a Btrfs file containing a transparent huge page immediately deadlocks with the following stack trace: #0 context_switch (kernel/sched/core.c:4940:2) #1 __schedule (kernel/sched/core.c:6287:8) #2 schedule (kernel/sched/core.c:6366:3) #3 io_schedule (kernel/sched/core.c:8389:2) #4 wait_on_page_bit_common (mm/filemap.c:1356:4) #5 __lock_page (mm/filemap.c:1648:2) #6 lock_page (./include/linux/pagemap.h:625:3) #7 pagecache_get_page (mm/filemap.c:1910:4) #8 find_or_create_page (./include/linux/pagemap.h:420:9) #9 defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9) #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14) #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9) #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9) #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9) #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10) #15 vfs_ioctl (fs/ioctl.c:51:10) #16 __do_sys_ioctl (fs/ioctl.c:874:11) #17 __se_sys_ioctl (fs/ioctl.c:860:1) #18 __x64_sys_ioctl (fs/ioctl.c:860:1) #19 do_syscall_x64 (arch/x86/entry/common.c:50:14) #20 do_syscall_64 (arch/x86/entry/common.c:80:7) #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113) A huge page is represented by a compound page, which consists of a struct page for each PAGE_SIZE page within the huge page. The first struct page is the "head page", and the remaining are "tail pages". Defragmentation attempts to lock each page in the range. However, lock_page() on a tail page actually locks the corresponding head page. So, if defragmentation tries to lock more than one struct page in a compound page, it tries to lock the same head page twice and deadlocks with itself. Ideally, we should be able to defragment transparent huge pages. However, THP for filesystems is currently read-only, so a lot of code is not ready to use huge pages for I/O. For now, let's just return ETXTBUSY. This can be reproduced with the following on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y: $ cat create_thp_file.c #include <fcntl.h> #include <stdbool.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> static const char zeroes[1024 * 1024]; static const size_t FILE_SIZE = 2 * 1024 * 1024; int main(int argc, char **argv) { if (argc != 2) { fprintf(stderr, "usage: %s PATH\n", argv[0]); return EXIT_FAILURE; } int fd = creat(argv[1], 0777); if (fd == -1) { perror("creat"); return EXIT_FAILURE; } size_t written = 0; while (written < FILE_SIZE) { ssize_t ret = write(fd, zeroes, sizeof(zeroes) < FILE_SIZE - written ? sizeof(zeroes) : FILE_SIZE - written); if (ret < 0) { perror("write"); return EXIT_FAILURE; } written += ret; } close(fd); fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("open"); return EXIT_FAILURE; } /* * Reserve some address space so that we can align the file mapping to * the huge page size. */ void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (placeholder_map == MAP_FAILED) { perror("mmap (placeholder)"); return EXIT_FAILURE; } void *aligned_address = (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1)); void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC, MAP_SHARED | MAP_FIXED, fd, 0); if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; } if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) { perror("madvise"); return EXIT_FAILURE; } char *line = NULL; size_t line_capacity = 0; FILE *smaps_file = fopen("/proc/self/smaps", "r"); if (!smaps_file) { perror("fopen"); return EXIT_FAILURE; } for (;;) { for (size_t off = 0; off < FILE_SIZE; off += 4096) ((volatile char *)map)[off]; ssize_t ret; bool this_mapping = false; while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) { unsigned long start, end, huge; if (sscanf(line, "%lx-%lx", &start, &end) == 2) { this_mapping = (start <= (uintptr_t)map && (uintptr_t)map < end); } else if (this_mapping && sscanf(line, "FilePmdMapped: %ld", &huge) == 1 && huge > 0) { return EXIT_SUCCESS; } } sleep(6); rewind(smaps_file); fflush(smaps_file); } } $ ./create_thp_file huge $ btrfs fi defrag -czstd ./huge Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The exit function fixes a memory leak with the src field as detected by leak sanitizer. An example of which is: Indirect leak of 25133184 byte(s) in 207 object(s) allocated from: #0 0x7f199ecfe987 in __interceptor_calloc libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x55defe638224 in annotated_source__alloc_histograms util/annotate.c:803 #2 0x55defe6397e4 in symbol__hists util/annotate.c:952 #3 0x55defe639908 in symbol__inc_addr_samples util/annotate.c:968 #4 0x55defe63aa29 in hist_entry__inc_addr_samples util/annotate.c:1119 #5 0x55defe499a79 in hist_iter__report_callback tools/perf/builtin-report.c:182 #6 0x55defe7a859d in hist_entry_iter__add util/hist.c:1236 #7 0x55defe49aa63 in process_sample_event tools/perf/builtin-report.c:315 #8 0x55defe731bc8 in evlist__deliver_sample util/session.c:1473 #9 0x55defe731e38 in machines__deliver_event util/session.c:1510 #10 0x55defe732a23 in perf_session__deliver_event util/session.c:1590 #11 0x55defe72951e in ordered_events__deliver_event util/session.c:183 #12 0x55defe740082 in do_flush util/ordered-events.c:244 #13 0x55defe7407cb in __ordered_events__flush util/ordered-events.c:323 #14 0x55defe740a61 in ordered_events__flush util/ordered-events.c:341 #15 0x55defe73837f in __perf_session__process_events util/session.c:2390 #16 0x55defe7385ff in perf_session__process_events util/session.c:2420 ... Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin Liška <mliska@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20211112035124.94327-3-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The following error is reported when running "./test_progs -t for_each" under arm64: bpf_jit: multi-func JIT bug 58 != 56 [...] JIT doesn't support bpf-to-bpf calls The root cause is the size of BPF_PSEUDO_FUNC instruction increases from 2 to 3 after the address of called bpf-function is settled and there are two bpf-to-bpf calls in test_pkt_access. The generated instructions are shown below: 0x48: 21 00 C0 D2 movz x1, #0x1, lsl #32 0x4c: 21 00 80 F2 movk x1, #0x1 0x48: E1 3F C0 92 movn x1, #0x1ff, lsl #32 0x4c: 41 FE A2 F2 movk x1, #0x17f2, lsl #16 0x50: 81 70 9F F2 movk x1, #0xfb84 Fixing it by using emit_addr_mov_i64() for BPF_PSEUDO_FUNC, so the size of jited image will not change. Fixes: 69c087b ("bpf: Add bpf_for_each_map_elem() helper") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211231151018.3781550-1-houtao1@huawei.com
arm32 uses software to simulate the instruction replaced by kprobe. some instructions may be simulated by constructing assembly functions. therefore, before executing instruction simulation, it is necessary to construct assembly function execution environment in C language through binding registers. after kasan is enabled, the register binding relationship will be destroyed, resulting in instruction simulation errors and causing kernel panic. the kprobe emulate instruction function is distributed in three files: actions-common.c actions-arm.c actions-thumb.c, so disable KASAN when compiling these files. for example, use kprobe insert on cap_capable+20 after kasan enabled, the cap_capable assembly code is as follows: <cap_capable>: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr} e1a05000 mov r5, r0 e280006c add r0, r0, #108 ; 0x6c e1a04001 mov r4, r1 e1a06002 mov r6, r2 e59fa090 ldr sl, [pc, #144] ; ebfc7bf8 bl c03aa4b4 <__asan_load4> e595706c ldr r7, [r5, #108] ; 0x6c e2859014 add r9, r5, #20 ...... The emulate_ldr assembly code after enabling kasan is as follows: c06f1384 <emulate_ldr>: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr} e282803c add r8, r2, #60 ; 0x3c e1a05000 mov r5, r0 e7e37855 ubfx r7, r5, #16, #4 e1a00008 mov r0, r8 e1a09001 mov r9, r1 e1a04002 mov r4, r2 ebf35462 bl c03c6530 <__asan_load4> e357000f cmp r7, #15 e7e36655 ubfx r6, r5, #12, #4 e205a00f and sl, r5, #15 0a000001 beq c06f13bc <emulate_ldr+0x38> e0840107 add r0, r4, r7, lsl #2 ebf3545c bl c03c6530 <__asan_load4> e084010a add r0, r4, sl, lsl #2 ebf3545a bl c03c6530 <__asan_load4> e2890010 add r0, r9, #16 ebf35458 bl c03c6530 <__asan_load4> e5990010 ldr r0, [r9, #16] e12fff30 blx r0 e356000f cm r6, #15 1a000014 bne c06f1430 <emulate_ldr+0xac> e1a06000 mov r6, r0 e2840040 add r0, r4, #64 ; 0x40 ...... when running in emulate_ldr to simulate the ldr instruction, panic occurred, and the log is as follows: Unable to handle kernel NULL pointer dereference at virtual address 00000090 pgd = ecb46400 [00000090] *pgd=2e0fa003, *pmd=00000000 Internal error: Oops: 206 [#1] SMP ARM PC is at cap_capable+0x14/0xb0 LR is at emulate_ldr+0x50/0xc0 psr: 600d0293 sp : ecd63af8 ip : 00000004 fp : c0a7c30c r10: 00000000 r9 : c30897f4 r8 : ecd63cd4 r7 : 0000000f r6 : 0000000a r5 : e59fa090 r4 : ecd63c98 r3 : c06ae294 r2 : 00000000 r1 : b7611300 r0 : bf4ec008 Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user Control: 32c5387d Table: 2d546400 DAC: 55555555 Process bash (pid: 1643, stack limit = 0xecd60190) (cap_capable) from (kprobe_handler+0x218/0x340) (kprobe_handler) from (kprobe_trap_handler+0x24/0x48) (kprobe_trap_handler) from (do_undefinstr+0x13c/0x364) (do_undefinstr) from (__und_svc_finish+0x0/0x30) (__und_svc_finish) from (cap_capable+0x18/0xb0) (cap_capable) from (cap_vm_enough_memory+0x38/0x48) (cap_vm_enough_memory) from (security_vm_enough_memory_mm+0x48/0x6c) (security_vm_enough_memory_mm) from (copy_process.constprop.5+0x16b4/0x25c8) (copy_process.constprop.5) from (_do_fork+0xe8/0x55c) (_do_fork) from (SyS_clone+0x1c/0x24) (SyS_clone) from (__sys_trace_return+0x0/0x10) Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7) Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support") Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM") Signed-off-by: huangshaobo <huangshaobo6@huawei.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
When bringing down the netdevice or system shutdown, a panic can be triggered while accessing the sysfs path because the device is already removed. [ 755.549084] mlx5_core 0000:12:00.1: Shutdown was called [ 756.404455] mlx5_core 0000:12:00.0: Shutdown was called ... [ 757.937260] BUG: unable to handle kernel NULL pointer dereference at (null) [ 758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280 crash> bt ... PID: 12649 TASK: ffff8924108f2100 CPU: 1 COMMAND: "amsd" ... #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778 [exception RIP: dma_pool_alloc+0x1ab] RIP: ffffffff8ee11acb RSP: ffff89240e1a3968 RFLAGS: 00010046 RAX: 0000000000000246 RBX: ffff89243d874100 RCX: 0000000000001000 RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff89243d874090 RBP: ffff89240e1a39c0 R8: 000000000001f080 R9: ffff8905ffc03c00 R10: ffffffffc04680d4 R11: ffffffff8edde9fd R12: 00000000000080d0 R13: ffff89243d874090 R14: ffff89243d874080 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core] #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core] #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core] #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core] #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core] #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core] #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core] #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46 #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208 #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3 #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596 #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10 #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5 #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92 crash> net_device.state ffff89443b0c0000 state = 0x5 (__LINK_STATE_START| __LINK_STATE_NOCARRIER) To prevent this scenario, we also make sure that the netdevice is present. Signed-off-by: suresh kumar <suresh2514@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In current async pagefault logic, when a page is ready, KVM relies on kvm_arch_can_dequeue_async_page_present() to determine whether to deliver a READY event to the Guest. This function test token value of struct kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a READY event is finished by Guest. If value is zero meaning that a READY event is done, so the KVM can deliver another. But the kvm_arch_setup_async_pf() may produce a valid token with zero value, which is confused with previous mention and may lead the loss of this READY event. This bug may cause task blocked forever in Guest: INFO: task stress:7532 blocked for more than 1254 seconds. Not tainted 5.10.0 #16 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack: 0 pid: 7532 ppid: 1409 flags:0x00000080 Call Trace: __schedule+0x1e7/0x650 schedule+0x46/0xb0 kvm_async_pf_task_wait_schedule+0xad/0xe0 ? exit_to_user_mode_prepare+0x60/0x70 __kvm_handle_async_pf+0x4f/0xb0 ? asm_exc_page_fault+0x8/0x30 exc_page_fault+0x6f/0x110 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x402d00 RSP: 002b:00007ffd31912500 EFLAGS: 00010206 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000 Signed-off-by: Liang Zhang <zhangliang5@huawei.com> Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In remove_phb_dynamic() we use &phb->io_resource, after we've called device_unregister(&host_bridge->dev). But the unregister may have freed phb, because pcibios_free_controller_deferred() is the release function for the host_bridge. If there are no outstanding references when we call device_unregister() then phb will be freed out from under us. This has gone mainly unnoticed, but with slub_debug and page_poison enabled it can lead to a crash: PID: 7574 TASK: c0000000d492cb80 CPU: 13 COMMAND: "drmgr" #0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc #1 [c0000000e4f075d0] oops_end at c000000000029608 #2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4 #3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8 #4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30 Data SLB Access [380] exception frame: R0: c000000000167250 R1: c0000000e4f07a00 R2: c000000002a46100 R3: c000000002b39ce8 R4: 00000000000000c0 R5: 00000000000000a9 R6: 3894674d000000c0 R7: 0000000000000000 R8: 00000000000000ff R9: 0000000000000100 R10: 6b6b6b6b6b6b6b6b R11: 0000000000008000 R12: c00000000023da80 R13: c0000009ffd38b00 R14: 0000000000000000 R15: 000000011c87f0f0 R16: 0000000000000006 R17: 0000000000000003 R18: 0000000000000002 R19: 0000000000000004 R20: 0000000000000005 R21: 000000011c87ede8 R22: 000000011c87c5a8 R23: 000000011c87d3a0 R24: 0000000000000000 R25: 0000000000000001 R26: c0000000e4f07cc8 R27: c00000004d1cc400 R28: c0080000031d00e8 R29: c00000004d23d800 R30: c00000004d1d2400 R31: c00000004d1d2540 NIP: c000000000167258 MSR: 8000000000009033 OR3: c000000000e9f474 CTR: 0000000000000000 LR: c000000000167250 XER: 0000000020040003 CCR: 0000000024088420 MQ: 0000000000000000 DAR: 6b6b6b6b6b6b6ba3 DSISR: c0000000e4f07920 Syscall Result: fffffffffffffff2 [NIP : release_resource+56] [LR : release_resource+48] #5 [c0000000e4f07a00] release_resource at c000000000167258 (unreliable) #6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648 #7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io] #8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io] #9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c #10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504 #11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868 #12 [c0000000e4f07c70] new_sync_write at c00000000054339c #13 [c0000000e4f07d10] vfs_write at c000000000546624 #14 [c0000000e4f07d60] ksys_write at c0000000005469f4 #15 [c0000000e4f07db0] system_call_exception at c000000000030840 #16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168 To avoid it, we can take a reference to the host_bridge->dev until we're done using phb. Then when we drop the reference the phb will be freed. Fixes: 2dd9c11 ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)") Reported-by: David Dai <zdai@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://lore.kernel.org/r/20220318034219.1188008-1-mpe@ellerman.id.au
The BPF STX/LDX instruction uses offset relative to the FP to address stack space. Since the BPF_FP locates at the top of the frame, the offset is usually a negative number. However, arm64 str/ldr immediate instruction requires that offset be a positive number. Therefore, this patch tries to convert the offsets. The method is to find the negative offset furthest from the FP firstly. Then add it to the FP, calculate a bottom position, called FPB, and then adjust the offsets in other STR/LDX instructions relative to FPB. FPB is saved using the callee-saved register x27 of arm64 which is not used yet. Before adjusting the offset, the patch checks every instruction to ensure that the FP does not change in run-time. If the FP may change, no offset is adjusted. For example, for the following bpftrace command: bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }' Without this patch, jited code(fragment): 0: bti c 4: stp x29, x30, [sp, #-16]! 8: mov x29, sp c: stp x19, x20, [sp, #-16]! 10: stp x21, x22, [sp, #-16]! 14: stp x25, x26, [sp, #-16]! 18: mov x25, sp 1c: mov x26, #0x0 // #0 20: bti j 24: sub sp, sp, #0x90 28: add x19, x0, #0x0 2c: mov x0, #0x0 // #0 30: mov x10, #0xffffffffffffff78 // #-136 34: str x0, [x25, x10] 38: mov x10, #0xffffffffffffff80 // #-128 3c: str x0, [x25, x10] 40: mov x10, #0xffffffffffffff88 // #-120 44: str x0, [x25, x10] 48: mov x10, #0xffffffffffffff90 // #-112 4c: str x0, [x25, x10] 50: mov x10, #0xffffffffffffff98 // #-104 54: str x0, [x25, x10] 58: mov x10, #0xffffffffffffffa0 // #-96 5c: str x0, [x25, x10] 60: mov x10, #0xffffffffffffffa8 // #-88 64: str x0, [x25, x10] 68: mov x10, #0xffffffffffffffb0 // #-80 6c: str x0, [x25, x10] 70: mov x10, #0xffffffffffffffb8 // #-72 74: str x0, [x25, x10] 78: mov x10, #0xffffffffffffffc0 // #-64 7c: str x0, [x25, x10] 80: mov x10, #0xffffffffffffffc8 // #-56 84: str x0, [x25, x10] 88: mov x10, #0xffffffffffffffd0 // #-48 8c: str x0, [x25, x10] 90: mov x10, #0xffffffffffffffd8 // #-40 94: str x0, [x25, x10] 98: mov x10, #0xffffffffffffffe0 // #-32 9c: str x0, [x25, x10] a0: mov x10, #0xffffffffffffffe8 // #-24 a4: str x0, [x25, x10] a8: mov x10, #0xfffffffffffffff0 // #-16 ac: str x0, [x25, x10] b0: mov x10, #0xfffffffffffffff8 // #-8 b4: str x0, [x25, x10] b8: mov x10, #0x8 // #8 bc: ldr x2, [x19, x10] [...] With this patch, jited code(fragment): 0: bti c 4: stp x29, x30, [sp, #-16]! 8: mov x29, sp c: stp x19, x20, [sp, #-16]! 10: stp x21, x22, [sp, #-16]! 14: stp x25, x26, [sp, #-16]! 18: stp x27, x28, [sp, #-16]! 1c: mov x25, sp 20: sub x27, x25, #0x88 24: mov x26, #0x0 // #0 28: bti j 2c: sub sp, sp, #0x90 30: add x19, x0, #0x0 34: mov x0, #0x0 // #0 38: str x0, [x27] 3c: str x0, [x27, #8] 40: str x0, [x27, #16] 44: str x0, [x27, #24] 48: str x0, [x27, #32] 4c: str x0, [x27, #40] 50: str x0, [x27, #48] 54: str x0, [x27, #56] 58: str x0, [x27, #64] 5c: str x0, [x27, #72] 60: str x0, [x27, #80] 64: str x0, [x27, #88] 68: str x0, [x27, #96] 6c: str x0, [x27, #104] 70: str x0, [x27, #112] 74: str x0, [x27, #120] 78: str x0, [x27, #128] 7c: ldr x2, [x19, #8] [...] Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220321152852.2334294-4-xukuohai@huawei.com
kernfs_remove supported NULL kernfs_node param to bail out but revent per-fs lock change introduced regression that dereferencing the param without NULL check so kernel goes crash. This patch checks the NULL kernfs_node in kernfs_remove and if so, just return. Quote from bug report by Jirka ``` The bug is triggered by running NAS Parallel benchmark suite on SuperMicro servers with 2x Xeon(R) Gold 6126 CPU. Here is the error log: [ 247.035564] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 247.036009] #PF: supervisor read access in kernel mode [ 247.036009] #PF: error_code(0x0000) - not-present page [ 247.036009] PGD 0 P4D 0 [ 247.036009] Oops: 0000 [#1] PREEMPT SMP PTI [ 247.058060] CPU: 1 PID: 6546 Comm: umount Not tainted 5.16.0393c3714081a53795bbff0e985d24146def6f57f+ #16 [ 247.058060] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 2.0b 03/07/2018 [ 247.058060] RIP: 0010:kernfs_remove+0x8/0x50 [ 247.058060] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 49 c7 c4 f4 ff ff ff eb b2 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 41 54 55 <48> 8b 47 08 48 89 fd 48 85 c0 48 0f 44 c7 4c 8b 60 50 49 83 c4 60 [ 247.058060] RSP: 0018:ffffbbfa48a27e48 EFLAGS: 00010246 [ 247.058060] RAX: 0000000000000001 RBX: ffffffff89e31f98 RCX: 0000000080200018 [ 247.058060] RDX: 0000000080200019 RSI: fffff6760786c900 RDI: 0000000000000000 [ 247.058060] RBP: ffffffff89e31f98 R08: ffff926b61b24d00 R09: 0000000080200018 [ 247.122048] R10: ffff926b61b24d00 R11: ffff926a8040c000 R12: ffff927bd09a2000 [ 247.122048] R13: ffffffff89e31fa0 R14: dead000000000122 R15: dead000000000100 [ 247.122048] FS: 00007f01be0a8c40(0000) GS:ffff926fa8e40000(0000) knlGS:0000000000000000 [ 247.122048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 247.122048] CR2: 0000000000000008 CR3: 00000001145c6003 CR4: 00000000007706e0 [ 247.122048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 247.122048] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 247.122048] PKRU: 55555554 [ 247.122048] Call Trace: [ 247.122048] <TASK> [ 247.122048] rdt_kill_sb+0x29d/0x350 [ 247.122048] deactivate_locked_super+0x36/0xa0 [ 247.122048] cleanup_mnt+0x131/0x190 [ 247.122048] task_work_run+0x5c/0x90 [ 247.122048] exit_to_user_mode_prepare+0x229/0x230 [ 247.122048] syscall_exit_to_user_mode+0x18/0x40 [ 247.122048] do_syscall_64+0x48/0x90 [ 247.122048] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 247.122048] RIP: 0033:0x7f01be2d735b ``` Link: https://bugzilla.kernel.org/show_bug.cgi?id=215696 Link: https://lore.kernel.org/lkml/CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com/ Fixes: 393c371 (kernfs: switch global kernfs_rwsem lock to per-fs lock) Cc: stable@vger.kernel.org Reported-by: Jirka Hladky <jhladky@redhat.com> Tested-by: Jirka Hladky <jhladky@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Minchan Kim <minchan@kernel.org> Link: https://lore.kernel.org/r/20220427172152.3505364-1-minchan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Today doing a BPF tail call after a BPF to BPF call, that is from a subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features requires support from JIT. Tail call count has to be tracked through BPF to BPF calls, as well as through BPF tail calls to prevent unbounded chains of tail calls. arm64 BPF JIT stores the tail call count (TCC) in a dedicated register (X26). This makes it easier to support bpf2bpf calls mixed with tail calls than on x86 platform. In order to keep the tail call count in tact throughout bpf2bpf calls, all we need to do is tweak the program prologue generator. When emitting prologue for a subprogram, we skip the block that initializes the tail call count and emits a jump pad for the tail call. With this change, a sample execution flow where a bpf2bpf call is followed by a tail call would look like so: int entry(struct __sk_buff *skb): 0xffffffc0090151d4: paciasp 0xffffffc0090151d8: stp x29, x30, [sp, #-16]! 0xffffffc0090151dc: mov x29, sp 0xffffffc0090151e0: stp x19, x20, [sp, #-16]! 0xffffffc0090151e4: stp x21, x22, [sp, #-16]! 0xffffffc0090151e8: stp x25, x26, [sp, #-16]! 0xffffffc0090151ec: stp x27, x28, [sp, #-16]! 0xffffffc0090151f0: mov x25, sp 0xffffffc0090151f4: mov x26, #0x0 // <- init TCC only 0xffffffc0090151f8: bti j // in main prog 0xffffffc0090151fc: sub x27, x25, #0x0 0xffffffc009015200: sub sp, sp, #0x10 0xffffffc009015204: mov w1, #0x0 0xffffffc009015208: mov x10, #0xffffffffffffffff 0xffffffc00901520c: strb w1, [x25, x10] 0xffffffc009015210: mov x10, #0xffffffffffffd25c 0xffffffc009015214: movk x10, #0x902, lsl #16 0xffffffc009015218: movk x10, #0xffc0, lsl #32 0xffffffc00901521c: blr x10 -------------------. // bpf2bpf call 0xffffffc009015220: add x7, x0, #0x0 <-------------. 0xffffffc009015224: add sp, sp, #0x10 | | 0xffffffc009015228: ldp x27, x28, [sp], #16 | | 0xffffffc00901522c: ldp x25, x26, [sp], #16 | | 0xffffffc009015230: ldp x21, x22, [sp], #16 | | 0xffffffc009015234: ldp x19, x20, [sp], #16 | | 0xffffffc009015238: ldp x29, x30, [sp], #16 | | 0xffffffc00901523c: add x0, x7, #0x0 | | 0xffffffc009015240: autiasp | | 0xffffffc009015244: ret | | | | int subprog_tail(struct __sk_buff *skb): | | 0xffffffc00902d25c: paciasp <----------------------' | 0xffffffc00902d260: stp x29, x30, [sp, #-16]! | 0xffffffc00902d264: mov x29, sp | 0xffffffc00902d268: stp x19, x20, [sp, #-16]! | 0xffffffc00902d26c: stp x21, x22, [sp, #-16]! | 0xffffffc00902d270: stp x25, x26, [sp, #-16]! | 0xffffffc00902d274: stp x27, x28, [sp, #-16]! | 0xffffffc00902d278: mov x25, sp | 0xffffffc00902d27c: sub x27, x25, #0x0 | 0xffffffc00902d280: sub sp, sp, #0x10 | // <- end of prologue, notice: 0xffffffc00902d284: add x19, x0, #0x0 | // 1) TCC not touched, and 0xffffffc00902d288: mov w0, #0x1 | // 2) no tail call jump pad 0xffffffc00902d28c: mov x10, #0xfffffffffffffffc | 0xffffffc00902d290: str w0, [x25, x10] | 0xffffffc00902d294: mov x20, #0xffffff80ffffffff | 0xffffffc00902d298: movk x20, #0xc033, lsl #16 | 0xffffffc00902d29c: movk x20, #0x4e00 | 0xffffffc00902d2a0: add x0, x19, #0x0 | 0xffffffc00902d2a4: add x1, x20, #0x0 | 0xffffffc00902d2a8: mov x2, #0x0 | 0xffffffc00902d2ac: mov w10, #0x24 | 0xffffffc00902d2b0: ldr w10, [x1, x10] | 0xffffffc00902d2b4: add w2, w2, #0x0 | 0xffffffc00902d2b8: cmp w2, w10 | 0xffffffc00902d2bc: b.cs 0xffffffc00902d2f8 | 0xffffffc00902d2c0: mov w10, #0x21 | 0xffffffc00902d2c4: cmp x26, x10 | // TCC >= MAX_TAIL_CALL_CNT? 0xffffffc00902d2c8: b.cs 0xffffffc00902d2f8 | 0xffffffc00902d2cc: add x26, x26, #0x1 | // TCC++ 0xffffffc00902d2d0: mov w10, #0x110 | 0xffffffc00902d2d4: add x10, x1, x10 | 0xffffffc00902d2d8: lsl x11, x2, #3 | 0xffffffc00902d2dc: ldr x11, [x10, x11] | 0xffffffc00902d2e0: cbz x11, 0xffffffc00902d2f8 | 0xffffffc00902d2e4: mov w10, #0x30 | 0xffffffc00902d2e8: ldr x10, [x11, x10] | 0xffffffc00902d2ec: add x10, x10, #0x24 | 0xffffffc00902d2f0: add sp, sp, #0x10 | // <- destroy just current 0xffffffc00902d2f4: br x10 ---------------------. | // BPF stack frame 0xffffffc00902d2f8: mov x10, #0xfffffffffffffffc | | // before the tail call 0xffffffc00902d2fc: ldr w7, [x25, x10] | | 0xffffffc00902d300: add sp, sp, #0x10 | | 0xffffffc00902d304: ldp x27, x28, [sp], #16 | | 0xffffffc00902d308: ldp x25, x26, [sp], #16 | | 0xffffffc00902d30c: ldp x21, x22, [sp], #16 | | 0xffffffc00902d310: ldp x19, x20, [sp], #16 | | 0xffffffc00902d314: ldp x29, x30, [sp], #16 | | 0xffffffc00902d318: add x0, x7, #0x0 | | 0xffffffc00902d31c: autiasp | | 0xffffffc00902d320: ret | | | | int classifier_0(struct __sk_buff *skb): | | 0xffffffc008ff5874: paciasp | | 0xffffffc008ff5878: stp x29, x30, [sp, #-16]! | | 0xffffffc008ff587c: mov x29, sp | | 0xffffffc008ff5880: stp x19, x20, [sp, #-16]! | | 0xffffffc008ff5884: stp x21, x22, [sp, #-16]! | | 0xffffffc008ff5888: stp x25, x26, [sp, #-16]! | | 0xffffffc008ff588c: stp x27, x28, [sp, #-16]! | | 0xffffffc008ff5890: mov x25, sp | | 0xffffffc008ff5894: mov x26, #0x0 | | 0xffffffc008ff5898: bti j <----------------------' | 0xffffffc008ff589c: sub x27, x25, #0x0 | 0xffffffc008ff58a0: sub sp, sp, #0x0 | 0xffffffc008ff58a4: mov x0, #0xffffffc0ffffffff | 0xffffffc008ff58a8: movk x0, #0x8fc, lsl #16 | 0xffffffc008ff58ac: movk x0, #0x6000 | 0xffffffc008ff58b0: mov w1, #0x1 | 0xffffffc008ff58b4: str w1, [x0] | 0xffffffc008ff58b8: mov w7, #0x0 | 0xffffffc008ff58bc: mov sp, sp | 0xffffffc008ff58c0: ldp x27, x28, [sp], #16 | 0xffffffc008ff58c4: ldp x25, x26, [sp], #16 | 0xffffffc008ff58c8: ldp x21, x22, [sp], #16 | 0xffffffc008ff58cc: ldp x19, x20, [sp], #16 | 0xffffffc008ff58d0: ldp x29, x30, [sp], #16 | 0xffffffc008ff58d4: add x0, x7, #0x0 | 0xffffffc008ff58d8: autiasp | 0xffffffc008ff58dc: ret -------------------------------' Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com
…g the sock There is a race condition in vxlan that when deleting a vxlan device during receiving packets, there is a possibility that the sock is released after getting vxlan_sock vs from sk_user_data. Then in later vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will got NULL pointer dereference. e.g. #0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757 #1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d #2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48 #3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b #4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb #5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542 #6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62 [exception RIP: vxlan_ecn_decapsulate+0x3b] RIP: ffffffffc1014e7b RSP: ffffa25ec6978cb0 RFLAGS: 00010246 RAX: 0000000000000008 RBX: ffff8aa000888000 RCX: 0000000000000000 RDX: 000000000000000e RSI: ffff8a9fc7ab803e RDI: ffff8a9fd1168700 RBP: ffff8a9fc7ab803e R8: 0000000000700000 R9: 00000000000010ae R10: ffff8a9fcb748980 R11: 0000000000000000 R12: ffff8a9fd1168700 R13: ffff8aa000888000 R14: 00000000002a0000 R15: 00000000000010ae ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan] #8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507 #9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45 #10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807 #11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951 #12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde #13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b #14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139 #15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a #16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3 #17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca #18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3 Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh Fix this by waiting for all sk_user_data reader to finish before releasing the sock. Reported-by: Jianlin Shi <jishi@redhat.com> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Fixes: 6a93cc9 ("udp-tunnel: Add a few more UDP tunnel APIs") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We use "unsigned long" to store a PFN in the kernel and phys_addr_t to store a physical address. On a 64bit system, both are 64bit wide. However, on a 32bit system, the latter might be 64bit wide. This is, for example, the case on x86 with PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans 32bit. The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses that case, and assumes that the maximum PFN is limited by an 32bit phys_addr_t. This implies, that SWP_PFN_BITS will currently only be able to cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong. Let's rely on the number of bits in phys_addr_t instead, but make sure to not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in is_pfn_swap_entry() unhappy. Note that swp_entry_t is effectively an unsigned long and the maximum swap offset shares that value with the swap type. For example, on an 8 GiB x86 PAE system with a kernel config based on Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently fail removing migration entries (remove_migration_ptes()), because mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as swp_offset_pfn() wrongly masks off PFN bits. For example, split_huge_page_to_list()->...->remap_page() will leave migration entries in place and continue to unlock the page. Later, when we stumble over these migration entries (e.g., via /proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these migration entries shouldn't exist anymore and the page was unlocked. [ 33.067591] kernel BUG at include/linux/swapops.h:497! [ 33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G E 6.1.0-rc8+ #16 [ 33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 [ 33.067606] EIP: pagemap_pmd_range+0x644/0x650 [ 33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31 [ 33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000 [ 33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68 [ 33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246 [ 33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0 [ 33.067625] Call Trace: [ 33.067628] ? madvise_free_pte_range+0x720/0x720 [ 33.067632] ? smaps_pte_range+0x4b0/0x4b0 [ 33.067634] walk_pgd_range+0x325/0x720 [ 33.067637] ? mt_find+0x1d6/0x3a0 [ 33.067641] ? mt_find+0x1d6/0x3a0 [ 33.067643] __walk_page_range+0x164/0x170 [ 33.067646] walk_page_range+0xf9/0x170 [ 33.067648] ? __kmem_cache_alloc_node+0x2a8/0x340 [ 33.067653] pagemap_read+0x124/0x280 [ 33.067658] ? default_llseek+0x101/0x160 [ 33.067662] ? smaps_account+0x1d0/0x1d0 [ 33.067664] vfs_read+0x90/0x290 [ 33.067667] ? do_madvise.part.0+0x24b/0x390 [ 33.067669] ? debug_smp_processor_id+0x12/0x20 [ 33.067673] ksys_pread64+0x58/0x90 [ 33.067675] __ia32_sys_ia32_pread64+0x1b/0x20 [ 33.067680] __do_fast_syscall_32+0x4c/0xc0 [ 33.067683] do_fast_syscall_32+0x29/0x60 [ 33.067686] do_SYSENTER_32+0x15/0x20 [ 33.067689] entry_SYSENTER_32+0x98/0xf1 Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it readable and consistent. [david@redhat.com: rely on sizeof(phys_addr_t) and min_t() instead] Link: https://lkml.kernel.org/r/20221206105737.69478-1-david@redhat.com [david@redhat.com: use "int" for comparison, as we're only comparing numbers < 64] Link: https://lkml.kernel.org/r/1f157500-2676-7cef-a84e-9224ed64e540@redhat.com Link: https://lkml.kernel.org/r/20221205150857.167583-1-david@redhat.com Fixes: 0d206b5 ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Set kprobe at 'jalr 1140(ra)' of vfs_write results in the following crash: [ 32.092235] Unable to handle kernel access to user memory without uaccess routines at virtual address 00aaaaaad77b1170 [ 32.093115] Oops [#1] [ 32.093251] Modules linked in: [ 32.093626] CPU: 0 PID: 135 Comm: ftracetest Not tainted 6.2.0-rc2-00013-gb0aa5e5df0cb-dirty #16 [ 32.093985] Hardware name: riscv-virtio,qemu (DT) [ 32.094280] epc : ksys_read+0x88/0xd6 [ 32.094855] ra : ksys_read+0xc0/0xd6 [ 32.095016] epc : ffffffff801cda80 ra : ffffffff801cdab8 sp : ff20000000d7bdc0 [ 32.095227] gp : ffffffff80f14000 tp : ff60000080f9cb40 t0 : ffffffff80f13e80 [ 32.095500] t1 : ffffffff8000c29c t2 : ffffffff800dbc54 s0 : ff20000000d7be60 [ 32.095716] s1 : 0000000000000000 a0 : ffffffff805a64ae a1 : ffffffff80a83708 [ 32.095921] a2 : ffffffff80f160a0 a3 : 0000000000000000 a4 : f229b0afdb165300 [ 32.096171] a5 : f229b0afdb165300 a6 : ffffffff80eeebd0 a7 : 00000000000003ff [ 32.096411] s2 : ff6000007ff76800 s3 : fffffffffffffff7 s4 : 00aaaaaad77b1170 [ 32.096638] s5 : ffffffff80f160a0 s6 : ff6000007ff76800 s7 : 0000000000000030 [ 32.096865] s8 : 00ffffffc3d97be0 s9 : 0000000000000007 s10: 00aaaaaad77c9410 [ 32.097092] s11: 0000000000000000 t3 : ffffffff80f13e48 t4 : ffffffff8000c29c [ 32.097317] t5 : ffffffff8000c29c t6 : ffffffff800dbc54 [ 32.097505] status: 0000000200000120 badaddr: 00aaaaaad77b1170 cause: 000000000000000d [ 32.098011] [<ffffffff801cdb72>] ksys_write+0x6c/0xd6 [ 32.098222] [<ffffffff801cdc06>] sys_write+0x2a/0x38 [ 32.098405] [<ffffffff80003c76>] ret_from_syscall+0x0/0x2 Since the rs1 and rd might be the same one, such as 'jalr 1140(ra)', hence it requires obtaining the target address from rs1 followed by updating rd. Fixes: c22b0bc ("riscv: Add kprobes supported") Signed-off-by: Liao Chang <liaochang1@huawei.com> Reviewed-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20230116064342.2092136-1-liaochang1@huawei.com [Palmer: Pick Guo's cleanup] Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Petr Machata says: ==================== bridge: Limit number of MDB entries per port, port-vlan The MDB maintained by the bridge is limited. When the bridge is configured for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its capacity. In SW datapath, the capacity is configurable through the IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a similar limit exists in the HW datapath for purposes of offloading. In order to prevent the issue of unilateral exhaustion of MDB resources, introduce two parameters in each of two contexts: - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled) per-port-VLAN number of MDB entries that the port is member in. - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled) per-port-VLAN maximum permitted number of MDB entries, or 0 for no limit. Per-port number of entries keeps track of the total number of MDB entries configured on a given port. The per-port-VLAN value then keeps track of the subset of MDB entries configured specifically for the given VLAN, on that port. The number is adjusted as port_groups are created and deleted, and therefore under multicast lock. A maximum value, if non-zero, then places a limit on the number of entries that can be configured in a given context. Attempts to add entries above the maximum are rejected. Rejection reason of netlink-based requests to add MDB entries is communicated through extack. This channel is unavailable for rejections triggered from the control path. To address this lack of visibility, the patchset adds a tracepoint, bridge:br_mdb_full: # perf record -e bridge:br_mdb_full & # [...] # perf script | cut -d: -f4- dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10 Another option to consume the tracepoint is e.g. through the bpftrace tool: # bpftrace -e ' tracepoint:bridge:br_mdb_full /args->af != 0/ { printf("dev %s src %s grp %s vid %u\n", str(args->dev), ntop(args->src), ntop(args->grp), args->vid); } tracepoint:bridge:br_mdb_full /args->af == 0/ { printf("dev %s grp %s vid %u\n", str(args->dev), macaddr(args->grpmac), args->vid); }' This tracepoint is triggered for mcast_hash_max exhaustions as well. The following is an example of how the feature is used. A more extensive example is available in patch #8: # bridge vlan set dev v1 vid 1 mcast_max_groups 1 # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1 # bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1 Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1. The patchset progresses as follows: - In patch #1, set strict_start_type at two bridge-related policies. The reason is we are adding a new attribute to one of these, and want the new attribute to be parsed strictly. The other was adjusted for completeness' sake. - In patches #2 to #5, br_mdb and br_multicast code is adjusted to make the following additions smoother. - In patch #6, add the tracepoint. - In patch #7, the code to maintain number of MDB entries is added as struct net_bridge_mcast_port::mdb_n_entries. The maximum is added, too, as struct net_bridge_mcast_port::mdb_max_entries, however at this point there is no way to set the value yet, and since 0 is treated as "no limit", the functionality doesn't change at this point. Note however, that mcast_hash_max violations already do trigger at this point. - In patch #8, netlink plumbing is added: reading of number of entries, and reading and writing of maximum. The per-port values are passed through RTM_NEWLINK / RTM_GETLINK messages in IFLA_BRPORT_MCAST_N_GROUPS and _MAX_GROUPS, inside IFLA_PROTINFO nest. The per-port-vlan values are passed through RTM_GETVLAN / RTM_NEWVLAN messages in BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS, _MAX_GROUPS, inside BRIDGE_VLANDB_ENTRY. The following patches deal with the selftest: - Patches #9 and #10 clean up and move around some selftest code. - Patches #11 to #14 add helpers and generalize the existing IGMP / MLD support to allow generating packets with configurable group addresses and varying source lists for (S,G) memberships. - Patch #15 adds code to generate IGMP leave and MLD done packets. - Patch #16 finally adds the selftest itself. v3: - Patch #7: - Access mdb_max_/_n_entries through READ_/WRITE_ONCE - Move extack setting to br_multicast_port_ngroups_inc_one(). Since we use NL_SET_ERR_MSG_FMT_MOD, the correct context (port / port-vlan) can be passed through an argument. This also removes the need for more READ/WRITE_ONCE's at the extack-setting site. - Patch #8: - Move the br_multicast_port_ctx_vlan_disabled() check out to the _vlan_ helpers callers. Thus these helpers cannot fail, which makes them very similar to the _port_ helpers. Have them take the MC context directly and unify them. v2: - Cover letter: - Add an example of a bpftrace-based probe script - Patch #6: - Report IPv4 as an IPv6-mapped address through the IPv6 buffer as well, to save ring buffer space. - Patch #7: - In br_multicast_port_ngroups_inc_one(), bounce if n>=max, not if n==max - Adjust extack messages to mention ngroups, now that the bounces appear when n>=max, not n==max - In __br_multicast_enable_port_ctx(), do not reset max to 0. Also do not count number of entries by going through _inc, as that would end up incorrectly bouncing the entries. - Patch #8: - Drop locks around accesses in br_multicast_{port,vlan}_ngroups_{get,set_max}(), - Drop bounces due to max<n in br_multicast_{port,vlan}_ngroups_set_max(). - Patch #12: - In the comment at payload_template_calc_checksum(), s/%#02x/%02x/, that's the mausezahn payload format. - Patch #16: - Adjust the tests that check setting max below n and reset of max on VLAN snooping enablement - Make test naming uniform - Enable testing of control path (IGMP/MLD) in mcast_vlan_snooping bridge - Reorganize the code so that test instances (per bridge type and configuration type) always come right after the test, in order of {d,q,qvs}{4,6}{cfg,ctl}. Then groups of selftests are at the end of the file. Similarly adjust invocation order of the tests. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
When a system with E810 with existing VFs gets rebooted the following hang may be observed. Pid 1 is hung in iavf_remove(), part of a network driver: PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow" #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930 #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf] #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513 #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429 #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4 #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice] #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice] #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice] #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1 #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386 #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870 #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6 #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159 #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169 #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7 RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90 R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005 R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000 ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b During reboot all drivers PM shutdown callbacks are invoked. In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE. In ice_shutdown() the call chain above is executed, which at some point calls iavf_remove(). However iavf_remove() expects the VF to be in one of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If that's not the case it sleeps forever. So if iavf_shutdown() gets invoked before iavf_remove() the system will hang indefinitely because the adapter is already in state __IAVF_REMOVE. Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE, as we already went through iavf_shutdown(). Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove") Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove") Reported-by: Marius Cornea <mcornea@redhat.com> Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
For a similar reason as commit f2c7ca8 ("drm/atomic-helper: Don't set deadline for modesets"), we need the crtc to be already active in order to compute a target vblank time for an async commit. Otherwise we get this splat reminding us that we are doing it wrong: ------------[ cut here ]------------ msm_dpu ae01000.mdp: drm_WARN_ON_ONCE(drm_drv_uses_atomic_modeset(dev)) WARNING: CPU: 7 PID: 1923 at drivers/gpu/drm/drm_vblank.c:728 drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x148/0x370 Modules linked in: snd_seq_dummy snd_seq snd_seq_device bridge stp llc tun vhost_vsock vhost vhost_iotlb vmw_vsock_virtio_transport_common vsock uinput rfcomm algif_hash algif_skcipher af_alg veth venus_dec venus_enc cros_ec_typec typec qcom_spmi_temp_alarm qcom_spmi_adc_tm5 qcom_spmi_adc5 xt_cgroup qcom_vadc_common qcom_stats hci_uart btqca xt_MASQUERADE venus_core 8021q coresight_tmc coresight_funnel coresight_etm4x coresight_replicator snd_soc_lpass_sc7180 coresight snd_soc_sc7180 ip6table_nat fuse ath10k_snoc ath10k_core ath mac80211 iio_trig_sysfs bluetooth cfg80211 cros_ec_sensors cros_ec_sensors_core ecdh_generic industrialio_triggered_buffer ecc kfifo_buf cros_ec_sensorhub r8153_ecm cdc_ether usbnet r8152 mii lzo_rle lzo_compress zram hid_vivaldi hid_google_hammer hid_vivaldi_common joydev CPU: 7 PID: 1923 Comm: DrmThread Not tainted 5.15.107-18853-g3be267609a0b-dirty #16 a1ffc1a66e79c21c3536d8c9a42e819236e39714 Hardware name: Google Wormdingler rev1+ BOE panel board (DT) pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x148/0x370 lr : drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x144/0x370 sp : ffffffc012e2b800 x29: ffffffc012e2b840 x28: ffffff8083676094 x27: ffffffc012e2bb28 x26: ffffff8084539800 x25: 0000000000000000 x24: ffffff8083676000 x23: ffffffd3c8cdc5a0 x22: ffffff80845b9d00 x21: ffffffc012e2b8b4 x20: ffffffc012e2b910 x19: 0000000000000001 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000010 x15: ffffffd3c8451a88 x14: 0000000000000003 x13: 0000000000000004 x12: 0000000000000001 x11: c0000000ffffdfff x10: ffffffd3c973ef58 x9 : 8ea3526b3cc95900 x8 : 8ea3526b3cc95900 x7 : 0000000000000000 x6 : 000000000000003a x5 : ffffffd3c99676cd x4 : 0000000000000000 x3 : ffffffc012e2b4b8 x2 : ffffffc012e2b4c0 x1 : 00000000ffffdfff x0 : 0000000000000000 Call trace: drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x148/0x370 drm_crtc_vblank_helper_get_vblank_timestamp+0x20/0x30 drm_crtc_get_last_vbltimestamp+0x68/0xb0 drm_crtc_next_vblank_start+0x5c/0xa8 msm_atomic_commit_tail+0x264/0x664 commit_tail+0xac/0x160 drm_atomic_helper_commit+0x160/0x168 drm_atomic_commit+0xfc/0x128 drm_atomic_helper_disable_plane+0x8c/0x110 __setplane_atomic+0x10c/0x138 drm_mode_cursor_common+0x3a8/0x410 drm_mode_cursor_ioctl+0x48/0x70 drm_ioctl_kernel+0xe0/0x158 drm_ioctl+0x25c/0x4d8 __arm64_sys_ioctl+0x98/0xd0 invoke_syscall+0x4c/0x100 el0_svc_common+0x98/0x104 do_el0_svc+0x30/0x90 el0_svc+0x20/0x50 el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x1a4/0x1a8 ---[ end trace a0f587e1ab9589e8 ]--- Fixes: 52ff0d3 ("drm/msm/atomic: Switch to vblank_start helper") Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com> Patchwork: https://patchwork.freedesktop.org/patch/532727/ Link: https://lore.kernel.org/r/20230418164158.549873-1-robdclark@gmail.com
The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, the per cpu upcall counters are allocated after the vport is created and inserted into the system. This could lead to the datapath accessing the counters before they are allocated resulting in a kernel Oops. Here is an example: PID: 59693 TASK: ffff0005f4f51500 CPU: 0 COMMAND: "ovs-vswitchd" #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4 #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60 #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58 #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388 #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68 #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch] ... PID: 58682 TASK: ffff0005b2f0bf00 CPU: 0 COMMAND: "kworker/0:3" #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758 #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994 #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8 #3 [ffff80000a5d3120] die at ffffb70f0628234c #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8 #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4 #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4 #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710 #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74 #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24 #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch] #13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch] #14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch] #15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch] #16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch] #17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90 We moved the per cpu upcall counter allocation to the existing vport alloc and free functions to solve this. Fixes: 95637d9 ("net: openvswitch: release vport resources on failure") Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
…hes-in-mlxsw' Petr Machata says: ==================== selftests: Preparations for out-of-order-operations patches in mlxsw The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. Over the course of the following several patchsets, mlxsw code is going to be adjusted to diminish the space of wrongly offloaded configurations. Ideally the offload state will reflect the actual state, regardless of the sequence of operation used to construct that state. Several selftests build configurations that will not be offloadable in the future on some systems. The reason is that what will get offloaded is the actual configuration, not the configuration steps. For example, when a port is added to a bridge that has an IP address, that bridge will get a RIF, which it would not have with the current code. But on Nvidia Spectrum-1 machines, MAC addresses of all RIFs need to have the same prefix, which the bridge will violate. The RIF thus couldn't be created, and the enslavement is therefore canceled, because it would lead to an unoffloadable configuration. This breaks some selftests. In this patchset, adjust selftests to avoid the configurations that mlxsw would be incapable of offloading, while maintaining relevance with regards to the feature that is being tested. There are generally two cases of fixes: - Disabling IPv6 autogen on bridges that do not participate in routing, either because of the abovementioned requirement to keep the same MAC prefix on all in-HW router interfaces, or, on 802.1ad bridges, because in-HW router interfaces are not supported at all. - Setting the bridge MAC address to what it will become after the first member port is attached, so that the in-HW router interface is created with a supported MAC address. The patchset is then split thus: - Patches #1-#7 adjust generic selftests - Patches #8-#16 adjust mlxsw-specific selftests ==================== Link: https://lore.kernel.org/r/cover.1687265905.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The buffer is used to save register mapping in a sample. Normally perf samples don't have any register so the string should be empty. But it missed to initialize the buffer when the size is 0. And it's passed to PyUnicode_FromString() with a garbage data. So it returns NULL due to invalid input (instead of an empty unicode string object) which causes a segfault like below: Thread 2.1 "perf" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7ffff7c83780 (LWP 193775)] 0x00007ffff6dbca2e in PyDict_SetItem () from /lib/x86_64-linux-gnu/libpython3.11.so.1.0 (gdb) bt #0 0x00007ffff6dbca2e in PyDict_SetItem () from /lib/x86_64-linux-gnu/libpython3.11.so.1.0 #1 0x00007ffff6dbf848 in PyDict_SetItemString () from /lib/x86_64-linux-gnu/libpython3.11.so.1.0 #2 0x000055555575824d in pydict_set_item_string_decref (val=0x0, key=0x5555557f96e3 "iregs", dict=0x7ffff5f7f780) at util/scripting-engines/trace-event-python.c:145 #3 set_regs_in_dict (evsel=0x555555efc370, sample=0x7fffffffb870, dict=0x7ffff5f7f780) at util/scripting-engines/trace-event-python.c:776 #4 get_perf_sample_dict (sample=sample@entry=0x7fffffffb870, evsel=evsel@entry=0x555555efc370, al=al@entry=0x7fffffffb2e0, addr_al=addr_al@entry=0x0, callchain=callchain@entry=0x7ffff63ef440) at util/scripting-engines/trace-event-python.c:923 #5 0x0000555555758ec1 in python_process_tracepoint (sample=0x7fffffffb870, evsel=0x555555efc370, al=0x7fffffffb2e0, addr_al=0x0) at util/scripting-engines/trace-event-python.c:1044 #6 0x00005555555c5db8 in process_sample_event (tool=<optimized out>, event=<optimized out>, sample=<optimized out>, evsel=0x555555efc370, machine=0x555555ef4d68) at builtin-script.c:2421 #7 0x00005555556b7793 in perf_session__deliver_event (session=0x555555ef4b60, event=0x7ffff62ff7d0, tool=0x7fffffffc150, file_offset=30672, file_path=0x555555efb8a0 "perf.data") at util/session.c:1639 #8 0x00005555556bc864 in do_flush (show_progress=true, oe=0x555555efb700) at util/ordered-events.c:245 #9 __ordered_events__flush (oe=oe@entry=0x555555efb700, how=how@entry=OE_FLUSH__FINAL, timestamp=timestamp@entry=0) at util/ordered-events.c:324 #10 0x00005555556bd06e in ordered_events__flush (oe=oe@entry=0x555555efb700, how=how@entry=OE_FLUSH__FINAL) at util/ordered-events.c:342 #11 0x00005555556b9d63 in __perf_session__process_events (session=0x555555ef4b60) at util/session.c:2465 #12 perf_session__process_events (session=0x555555ef4b60) at util/session.c:2627 #13 0x00005555555cb1d0 in __cmd_script (script=0x7fffffffc150) at builtin-script.c:2839 #14 cmd_script (argc=<optimized out>, argv=<optimized out>) at builtin-script.c:4365 #15 0x0000555555650811 in run_builtin (p=p@entry=0x555555ed8948 <commands+456>, argc=argc@entry=4, argv=argv@entry=0x7fffffffe240) at perf.c:323 #16 0x0000555555597eb3 in handle_internal_command (argv=0x7fffffffe240, argc=4) at perf.c:377 #17 run_argv (argv=<synthetic pointer>, argcp=<synthetic pointer>) at perf.c:421 #18 main (argc=4, argv=0x7fffffffe240) at perf.c:537 Fixes: 51cfe7a ("perf python: Avoid 2 leak sanitizer issues") Cc: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Petr Machata says: ==================== mlxsw: Permit enslavement to netdevices with uppers The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. Similarly, detaching a front panel port from a configured topology means unoffloading of this whole topology -- VLAN uppers, next hops, etc. Attaching the port back is then not permitted at all. If it were, it would not result in a working configuration, because much of mlxsw is written to react to changes in immediate configuration. There is nothing that would go visit netdevices in the attached-to topology and offload existing routes and VLAN memberships, for example. In this patchset, introduce a number of replays to be invoked so that this sort of post-hoc offload is supported. Then remove the vetoes that disallowed enslavement of front panel ports to other netdevices with uppers. The patchset progresses as follows: - In patch #1, fix an issue in the bridge driver. To my knowledge, the issue could not have resulted in a buggy behavior previously, and thus is packaged with this patchset instead of being sent separately to net. - In patch #2, add a new helper to the switchdev code. - In patch #3, drop mlxsw selftests that will not be relevant after this patchset anymore. - Patches #4, #5, #6, #7 and #8 prepare the codebase for smoother introduction of the rest of the code. - Patches #9, #10, #11, #12, #13 and #14 replay various aspects of upper configuration when a front panel port is introduced into a topology. Individual patches take care of bridge and LAG RIF memberships, switchdev replay, nexthop and neighbors replay, and MACVLAN offload. - Patches #15 and #16 introduce RIFs for newly-relevant netdevices when a front panel port is enslaved (in which case all uppers are newly relevant), or, respectively, deslaved (in which case the newly-relevant netdevice is the one being deslaved). - Up until this point, the introduced scaffolding was not really used, because mlxsw still forbids enslavement of mlxsw netdevices to uppers with uppers. In patch #17, this condition is finally relaxed. A sizable selftest suite is available to test all this new code. That will be sent in a separate patchset. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
The cited commit holds encap tbl lock unconditionally when setting up dests. But it may cause the following deadlock: PID: 1063722 TASK: ffffa062ca5d0000 CPU: 13 COMMAND: "handler8" #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91 #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528 #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core] #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core] #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core] #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core] #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core] #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core] #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core] #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core] #13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8 #14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower] #15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower] #16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0 [1031218.028143] wait_for_completion+0x24/0x30 [1031218.028589] mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core] [1031218.029256] mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core] [1031218.029885] process_one_work+0x24e/0x510 Actually no need to hold encap tbl lock if there is no encap action. Fix it by checking if encap action exists or not before holding encap tbl lock. Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
kernel test robot reported slab-out-of-bounds access in strlen(). [0] Commit 06d4c8a ("af_unix: Fix fortify_panic() in unix_bind_bsd().") removed unix_mkname_bsd() call in unix_bind_bsd(). If sunaddr->sun_path is not terminated by user and we don't enable CONFIG_INIT_STACK_ALL_ZERO=y, strlen() will do the out-of-bounds access during file creation. Let's go back to strlen()-with-sockaddr_storage way and pack all 108 trickiness into unix_mkname_bsd() with bold comments. [0]: BUG: KASAN: slab-out-of-bounds in strlen (lib/string.c:?) Read of size 1 at addr ffff000015492777 by task fortify_strlen_/168 CPU: 0 PID: 168 Comm: fortify_strlen_ Not tainted 6.5.0-rc1-00333-g3329b603ebba #16 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace (arch/arm64/kernel/stacktrace.c:235) show_stack (arch/arm64/kernel/stacktrace.c:242) dump_stack_lvl (lib/dump_stack.c:107) print_report (mm/kasan/report.c:365 mm/kasan/report.c:475) kasan_report (mm/kasan/report.c:590) __asan_report_load1_noabort (mm/kasan/report_generic.c:378) strlen (lib/string.c:?) getname_kernel (./include/linux/fortify-string.h:? fs/namei.c:226) kern_path_create (fs/namei.c:3926) unix_bind (net/unix/af_unix.c:1221 net/unix/af_unix.c:1324) __sys_bind (net/socket.c:1792) __arm64_sys_bind (net/socket.c:1801) invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52) el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147) do_el0_svc (arch/arm64/kernel/syscall.c:189) el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?) el0t_64_sync (arch/arm64/kernel/entry.S:591) Allocated by task 168: kasan_set_track (mm/kasan/common.c:45 mm/kasan/common.c:52) kasan_save_alloc_info (mm/kasan/generic.c:512) __kasan_kmalloc (mm/kasan/common.c:383) __kmalloc (mm/slab_common.c:? mm/slab_common.c:998) unix_bind (net/unix/af_unix.c:257 net/unix/af_unix.c:1213 net/unix/af_unix.c:1324) __sys_bind (net/socket.c:1792) __arm64_sys_bind (net/socket.c:1801) invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52) el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147) do_el0_svc (arch/arm64/kernel/syscall.c:189) el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?) el0t_64_sync (arch/arm64/kernel/entry.S:591) The buggy address belongs to the object at ffff000015492700 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 0 bytes to the right of allocated 119-byte region [ffff000015492700, ffff000015492777) The buggy address belongs to the physical page: page:00000000aeab52ba refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x55492 anon flags: 0x3fffc0000000200(slab|node=0|zone=0|lastcpupid=0xffff) page_type: 0xffffffff() raw: 03fffc0000000200 ffff0000084018c0 fffffc00003d0e00 0000000000000005 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff000015492600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff000015492680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff000015492700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 fc ^ ffff000015492780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff000015492800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 06d4c8a ("af_unix: Fix fortify_panic() in unix_bind_bsd().") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/netdev/202307262110.659e5e8-oliver.sang@intel.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230726190828.47874-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Christoph Biedl reported early OOM on recent kernels: swapper: page allocation failure: order:0, mode:0x100(__GFP_ZERO), nodemask=(null) CPU: 0 PID: 0 Comm: swapper Not tainted 6.3.0-rc4+ #16 Hardware name: 9000/785/C3600 Backtrace: [<10408594>] show_stack+0x48/0x5c [<10e152d8>] dump_stack_lvl+0x48/0x64 [<10e15318>] dump_stack+0x24/0x34 [<105cf7f8>] warn_alloc+0x10c/0x1c8 [<105d068c>] __alloc_pages+0xbbc/0xcf8 [<105d0e4c>] __get_free_pages+0x28/0x78 [<105ad10c>] __pte_alloc_kernel+0x30/0x98 [<10406934>] set_fixmap+0xec/0xf4 [<10411ad4>] patch_map.constprop.0+0xa8/0xdc [<10411bb0>] __patch_text_multiple+0xa8/0x208 [<10411d78>] patch_text+0x30/0x48 [<1041246c>] arch_jump_label_transform+0x90/0xcc [<1056f734>] jump_label_update+0xd4/0x184 [<1056fc9c>] static_key_enable_cpuslocked+0xc0/0x110 [<1056fd08>] static_key_enable+0x1c/0x2c [<1011362c>] init_mem_debugging_and_hardening+0xdc/0xf8 [<1010141c>] start_kernel+0x5f0/0xa98 [<10105da8>] start_parisc+0xb8/0xe4 Mem-Info: active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 slab_reclaimable:0 slab_unreclaimable:0 mapped:0 shmem:0 pagetables:0 sec_pagetables:0 bounce:0 kernel_misc_reclaimable:0 free:0 free_pcp:0 free_cma:0 Node 0 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB +writeback_tmp:0kB kernel_stack:0kB pagetables:0kB sec_pagetables:0kB all_unreclaimable? no Normal free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB +present:1048576kB managed:1039360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB 0 total pagecache pages 0 pages in swap cache Free swap = 0kB Total swap = 0kB 262144 pages RAM 0 pages HighMem/MovableOnly 2304 pages reserved Backtrace: [<10411d78>] patch_text+0x30/0x48 [<1041246c>] arch_jump_label_transform+0x90/0xcc [<1056f734>] jump_label_update+0xd4/0x184 [<1056fc9c>] static_key_enable_cpuslocked+0xc0/0x110 [<1056fd08>] static_key_enable+0x1c/0x2c [<1011362c>] init_mem_debugging_and_hardening+0xdc/0xf8 [<1010141c>] start_kernel+0x5f0/0xa98 [<10105da8>] start_parisc+0xb8/0xe4 Kernel Fault: Code=15 (Data TLB miss fault) at addr 0f7fe3c0 CPU: 0 PID: 0 Comm: swapper Not tainted 6.3.0-rc4+ #16 Hardware name: 9000/785/C3600 This happens because patching static key code temporarily maps it via fixmap and if it happens before page allocator is initialized set_fixmap() cannot allocate memory using pte_alloc_kernel(). Make sure that fixmap page tables are preallocated early so that pte_offset_kernel() in set_fixmap() never resorts to pte allocation. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Helge Deller <deller@gmx.de> Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Tested-by: John David Anglin <dave.anglin@bell.net> Cc: <stable@vger.kernel.org> # v6.4+
The following processes run into a deadlock. CPU 41 was waiting for CPU 29 to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29 was hung by that spinlock with IRQs disabled. PID: 17360 TASK: ffff95c1090c5c40 CPU: 41 COMMAND: "mrdiagd" !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0 !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0 !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0 # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0 # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0 # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0 # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0 # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0 # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0 # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0 #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0 #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0 #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0 #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0 #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0 #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0 #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0 #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0 #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0 #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0 #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0 #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0 #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0 #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0 #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0 #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0 #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0 #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0 PID: 17355 TASK: ffff95c1090c3d80 CPU: 29 COMMAND: "mrdiagd" !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0 !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0 # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0 # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0 # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0 # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0 # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0 # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0 # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0 # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0 #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0 #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0 #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0 #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0 #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0 #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0 #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0 #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0 The lock is used to synchronize different sysfs operations, it doesn't protect any resource that will be touched by an interrupt. Consequently it's not required to disable IRQs. Replace the spinlock with a mutex to fix the deadlock. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fix an error detected by memory sanitizer: ``` ==4033==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x55fb0fbedfc7 in read_alias_info tools/perf/util/pmu.c:457:6 #1 0x55fb0fbea339 in check_info_data tools/perf/util/pmu.c:1434:2 #2 0x55fb0fbea339 in perf_pmu__check_alias tools/perf/util/pmu.c:1504:9 #3 0x55fb0fbdca85 in parse_events_add_pmu tools/perf/util/parse-events.c:1429:32 #4 0x55fb0f965230 in parse_events_parse tools/perf/util/parse-events.y:299:6 #5 0x55fb0fbdf6b2 in parse_events__scanner tools/perf/util/parse-events.c:1822:8 #6 0x55fb0fbdf8c1 in __parse_events tools/perf/util/parse-events.c:2094:8 #7 0x55fb0fa8ffa9 in parse_events tools/perf/util/parse-events.h:41:9 #8 0x55fb0fa8ffa9 in test_event tools/perf/tests/parse-events.c:2393:8 #9 0x55fb0fa8f458 in test__pmu_events tools/perf/tests/parse-events.c:2551:15 #10 0x55fb0fa6d93f in run_test tools/perf/tests/builtin-test.c:242:9 #11 0x55fb0fa6d93f in test_and_print tools/perf/tests/builtin-test.c:271:8 #12 0x55fb0fa6d082 in __cmd_test tools/perf/tests/builtin-test.c:442:5 #13 0x55fb0fa6d082 in cmd_test tools/perf/tests/builtin-test.c:564:9 #14 0x55fb0f942720 in run_builtin tools/perf/perf.c:322:11 #15 0x55fb0f942486 in handle_internal_command tools/perf/perf.c:375:8 #16 0x55fb0f941dab in run_argv tools/perf/perf.c:419:2 #17 0x55fb0f941dab in main tools/perf/perf.c:535:3 ``` Fixes: 7b723db ("perf pmu: Be lazy about loading event info files from sysfs") Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20230914022425.1489035-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
…alid If smb2 request from client is invalid, The following kernel oops could happen. The patch e2b76ab: "ksmbd: add support for read compound" leads this issue. When request is invalid, It doesn't set anything in the response buffer. This patch add missing set invalid parameter error response. [ 673.085542] ksmbd: cli req too short, len 184 not 142. cmd:5 mid:109 [ 673.085580] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 673.085591] #PF: supervisor read access in kernel mode [ 673.085600] #PF: error_code(0x0000) - not-present page [ 673.085608] PGD 0 P4D 0 [ 673.085620] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 673.085631] CPU: 3 PID: 1039 Comm: kworker/3:0 Not tainted 6.6.0-rc2-tmt #16 [ 673.085643] Hardware name: AZW U59/U59, BIOS JTKT001 05/05/2022 [ 673.085651] Workqueue: ksmbd-io handle_ksmbd_work [ksmbd] [ 673.085719] RIP: 0010:ksmbd_conn_write+0x68/0xc0 [ksmbd] [ 673.085808] RAX: 0000000000000000 RBX: ffff88811ade4f00 RCX: 0000000000000000 [ 673.085817] RDX: 0000000000000000 RSI: ffff88810c2a9780 RDI: ffff88810c2a9ac0 [ 673.085826] RBP: ffffc900005e3e00 R08: 0000000000000000 R09: 0000000000000000 [ 673.085834] R10: ffffffffa3168160 R11: 63203a64626d736b R12: ffff8881057c8800 [ 673.085842] R13: ffff8881057c8820 R14: ffff8882781b2380 R15: ffff8881057c8800 [ 673.085852] FS: 0000000000000000(0000) GS:ffff888278180000(0000) knlGS:0000000000000000 [ 673.085864] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 673.085872] CR2: 0000000000000000 CR3: 000000015b63c000 CR4: 0000000000350ee0 [ 673.085883] Call Trace: [ 673.085890] <TASK> [ 673.085900] ? show_regs+0x6a/0x80 [ 673.085916] ? __die+0x25/0x70 [ 673.085926] ? page_fault_oops+0x154/0x4b0 [ 673.085938] ? tick_nohz_tick_stopped+0x18/0x50 [ 673.085954] ? __irq_work_queue_local+0xba/0x140 [ 673.085967] ? do_user_addr_fault+0x30f/0x6c0 [ 673.085979] ? exc_page_fault+0x79/0x180 [ 673.085992] ? asm_exc_page_fault+0x27/0x30 [ 673.086009] ? ksmbd_conn_write+0x68/0xc0 [ksmbd] [ 673.086067] ? ksmbd_conn_write+0x46/0xc0 [ksmbd] [ 673.086123] handle_ksmbd_work+0x28d/0x4b0 [ksmbd] [ 673.086177] process_one_work+0x178/0x350 [ 673.086193] ? __pfx_worker_thread+0x10/0x10 [ 673.086202] worker_thread+0x2f3/0x420 [ 673.086210] ? _raw_spin_unlock_irqrestore+0x27/0x50 [ 673.086222] ? __pfx_worker_thread+0x10/0x10 [ 673.086230] kthread+0x103/0x140 [ 673.086242] ? __pfx_kthread+0x10/0x10 [ 673.086253] ret_from_fork+0x39/0x60 [ 673.086263] ? __pfx_kthread+0x10/0x10 [ 673.086274] ret_from_fork_asm+0x1b/0x30 Fixes: e2b76ab ("ksmbd: add support for read compound") Reported-by: Tom Talpey <tom@talpey.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
The following call trace shows a deadlock issue due to recursive locking of mutex "device_mutex". First lock acquire is in target_for_each_device() and second in target_free_device(). PID: 148266 TASK: ffff8be21ffb5d00 CPU: 10 COMMAND: "iscsi_ttx" #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224 #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7 #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3 #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod] #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod] #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod] #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod] #13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod] #14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod] #15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod] #16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07 #17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod] #18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod] #19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080 #20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364 Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Andrii Nakryiko says: ==================== BPF register bounds logic and testing improvements This patch set adds a big set of manual and auto-generated test cases validating BPF verifier's register bounds tracking and deduction logic. See details in the last patch. We start with building a tester that validates existing <range> vs <scalar> verifier logic for range bounds. To make all this work, BPF verifier's logic needed a bunch of improvements to handle some cases that previously were not covered. This had no implications as to correctness of verifier logic, but it was incomplete enough to cause significant disagreements with alternative implementation of register bounds logic that tests in this patch set implement. So we need BPF verifier logic improvements to make all the tests pass. This is what we do in patches #3 through #9. The end goal of this work, though, is to extend BPF verifier range state tracking such as to allow to derive new range bounds when comparing non-const registers. There is some more investigative work required to investigate and fix existing potential issues with range tracking as part of ALU/ALU64 operations, so <range> x <range> part of v5 patch set ([0]) is dropped until these issues are sorted out. For now, we include preparatory refactorings and clean ups, that set up BPF verifier code base to extend the logic to <range> vs <range> logic in subsequent patch set. Patches #10-#16 perform preliminary refactorings without functionally changing anything. But they do clean up check_cond_jmp_op() logic and generalize a bunch of other pieces in is_branch_taken() logic. [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=797178&state=* v5->v6: - dropped <range> vs <range> patches (original patches #18 through #23) to add more register range sanity checks and fix preexisting issues; - comments improvements, addressing other feedback on first 17 patches (Eduard, Alexei); v4->v5: - added entirety of verifier reg bounds tracking changes, now handling <range> vs <range> cases (Alexei); - added way more comments trying to explain why deductions added are correct, hopefully they are useful and clarify things a bit (Daniel, Shung-Hsi); - added two preliminary selftests fixes necessary for RELEASE=1 build to work again, it keeps breaking. v3->v4: - improvements to reg_bounds tester (progress report, split 32-bit and 64-bit ranges, fix various verbosity output issues, etc); v2->v3: - fix a subtle little-endianness assumption inside parge_reg_state() (CI); v1->v2: - fix compilation when building selftests with llvm-16 toolchain (CI). ==================== Link: https://lore.kernel.org/r/20231102033759.2541186-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Petr Machata says: ==================== mlxsw: Support CFF flood mode The registers to configure to initialize a flood table differ between the controlled and CFF flood modes. In therefore needs to be an op. Add it, hook up the current init to the existing families, and invoke the op. PGT is an in-HW table that maps addresses to sets of ports. Then when some HW process needs a set of ports as an argument, instead of embedding the actual set in the dynamic configuration, what gets configured is the address referencing the set. The HW then works with the appropriate PGT entry. Among other allocations, the PGT currently contains two large blocks for bridge flooding: one for 802.1q and one for 802.1d. Within each of these blocks are three tables, for unknown-unicast, multicast and broadcast flooding: . . . | 802.1q | 802.1d | . . . | UC | MC | BC | UC | MC | BC | \______ _____/ \_____ ______/ v v FID flood vectors Thus each FID (which corresponds to an 802.1d bridge or one VLAN in an 802.1q bridge) uses three flood vectors spread across a fairly large region of PGT. This way of organizing the flood table (called "controlled") is not very flexible. E.g. to decrease a bridge scale and store more IP MC vectors, one would need to completely rewrite the bridge PGT blocks, or resort to hacks such as storing individual MC flood vectors into unused part of the bridge table. In order to address these shortcomings, Spectrum-2 and above support what is called CFF flood mode, for Compressed FID Flooding. In CFF flood mode, each FID has a little table of its own, with three entries adjacent to each other, one for unknown-UC, one for MC, one for BC. This allows for a much more fine-grained approach to PGT management, where bits of it are allocated on demand. . . . | FID | FID | FID | FID | FID | . . . |U|M|B|U|M|B|U|M|B|U|M|B|U|M|B| \_____________ _____________/ v FID flood vectors Besides the FID table organization, the CFF flood mode also impacts Router Subport (RSP) table. This table contains flood vectors for rFIDs, which are FIDs that reference front panel ports or LAGs. The RSP table contains two entries per front panel port and LAG, one for unknown-UC traffic, and one for everything else. Currently, the FW allocates and manages the table in its own part of PGT. rFIDs are marked with flood_rsp bit and managed specially. In CFF mode, rFIDs are managed as all other FIDs. The driver therefore has to allocate and maintain the flood vectors. Like with bridge FIDs, this is more work, but increases flexibility of the system. The FW currently supports both the controlled and CFF flood modes. To shed complexity, in the future it should only support CFF flood mode. Hence this patchset, which adds CFF flood mode support to mlxsw. Since mlxsw needs to maintain both the controlled mode as well as CFF mode support, we will keep the layout as compatible as possible. The bridge tables will stay in the same overall shape, just their inner organization will change from flood mode -> FID to FID -> flood mode. Likewise will RSP be kept as a contiguous block of PGT memory, as was the case when the FW maintained it. - The way FIDs get configured under the CFF flood mode differs from the currently used controlled mode. The simple approach of having several globally visible arrays for spectrum.c to statically choose from no longer works. Patch #1 thus privatizes all FID initialization and finalization logic, and exposes it as ops instead. - Patch #2 renames the ops that are specific to the controlled mode, to make room in the namespace for the CFF variants. Patch #3 extracts a helper to compute flood table base out of mlxsw_sp_fid_flood_table_mid(). - The op fid_setup configured fid_offset, i.e. the number of this FID within its family. For rFIDs in CFF mode, to determine this number, the driver will need to do fallible queries. Thus in patch #4, make the FID setup operation fallible as well. - Flood mode initialization routine differs between the controlled and CFF flood modes. The controlled mode needs to configure flood table layout, which the CFF mode does not need to do. In patch #5, move mlxsw_sp_fid_flood_table_init() up so that the following patch can make use of it. In patch #6, add an op to be invoked per table (if defined). - The current way of determining PGT allocation size depends on the number of FIDs and number of flood tables. RFIDs however have PGT footprint depending not on number of FIDs, but on number of ports and LAGs, because which ports an rFID should flood to does not depend on the FID itself, but on the port or LAG that it references. Therefore in patch #7, add FID family ops for determining PGT allocation size. - As elaborated above, layout of PGT will differ between controlled and CFF flood modes. In CFF mode, it will further differ between rFIDs and other FIDs (as described at previous patch). The way to pack the SFMR register to configure a FID will likewise differ from controlled to CFF. Thus in patches #8 and #9 add FID family ops to determine PGT base address for a FID and to pack SFMR. - Patches #10 and #11 add more bits for RSP support. In patch #10, add a new traffic type enumerator, for non-UC traffic. This is a combination of BC and MC traffic, but the way that mlxsw maps these mnemonic names to actual traffic type configurations requires that we have a new name to describe this class of traffic. Patch #11 then adds hooks necessary for RSP table maintenance. As ports come and go, and join and leave LAGs, it is necessary to update flood vectors that the rFIDs use. These new hooks will make that possible. - Patches #12, #13 and #14 introduce flood profiles. These have been implicit so far, but the way that CFF flood mode works with profile IDs requires that we make them explicit. Thus in patch #12, introduce flood profile objects as a set of flood tables that FID families then refer to. The FID code currently only uses a single flood profile. In patch #13, add a flood profile ID to flood profile objects. In patch #14, when in CFF mode, configure SFFP according to the existing flood profiles (or the one that exists as of that point). - Patches #15 and #16 add code to implement, respectively, bridge FIDs and RSP FIDs in CFF mode. - In patch #17, toggle flood_mode_prefer_cff on Spectrum-2 and above, which makes the newly-added code live. ==================== Link: https://lore.kernel.org/r/cover.1701183891.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ Upstream commit 19ecbe8 ] If komeda_pipeline_unbound_components() returns -EDEADLK, it means that a deadlock happened in the locking context. Currently, komeda is not dealing with the deadlock properly,producing the following output when CONFIG_DEBUG_WW_MUTEX_SLOWPATH is enabled: ------------[ cut here ]------------ [ 26.103984] WARNING: CPU: 2 PID: 345 at drivers/gpu/drm/arm/display/komeda/komeda_pipeline_state.c:1248 komeda_release_unclaimed_resources+0x13c/0x170 [ 26.117453] Modules linked in: [ 26.120511] CPU: 2 PID: 345 Comm: composer@2.1-se Kdump: loaded Tainted: G W 5.10.110-SE-SDK1.8-dirty #16 [ 26.131374] Hardware name: Siengine Se1000 Evaluation board (DT) [ 26.137379] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--) [ 26.143385] pc : komeda_release_unclaimed_resources+0x13c/0x170 [ 26.149301] lr : komeda_release_unclaimed_resources+0xbc/0x170 [ 26.155130] sp : ffff800017b8b8d0 [ 26.158442] pmr_save: 000000e0 [ 26.161493] x29: ffff800017b8b8d0 x28: ffff000cf2f96200 [ 26.166805] x27: ffff000c8f5a8800 x26: 0000000000000000 [ 26.172116] x25: 0000000000000038 x24: ffff8000116a0140 [ 26.177428] x23: 0000000000000038 x22: ffff000cf2f96200 [ 26.182739] x21: ffff000cfc300300 x20: ffff000c8ab77080 [ 26.188051] x19: 0000000000000003 x18: 0000000000000000 [ 26.193362] x17: 0000000000000000 x16: 0000000000000000 [ 26.198672] x15: b400e638f738ba38 x14: 0000000000000000 [ 26.203983] x13: 0000000106400a00 x12: 0000000000000000 [ 26.209294] x11: 0000000000000000 x10: 0000000000000000 [ 26.214604] x9 : ffff800012f80000 x8 : ffff000ca3308000 [ 26.219915] x7 : 0000000ff3000000 x6 : ffff80001084034c [ 26.225226] x5 : ffff800017b8bc40 x4 : 000000000000000f [ 26.230536] x3 : ffff000ca3308000 x2 : 0000000000000000 [ 26.235847] x1 : 0000000000000000 x0 : ffffffffffffffdd [ 26.241158] Call trace: [ 26.243604] komeda_release_unclaimed_resources+0x13c/0x170 [ 26.249175] komeda_crtc_atomic_check+0x68/0xf0 [ 26.253706] drm_atomic_helper_check_planes+0x138/0x1f4 [ 26.258929] komeda_kms_check+0x284/0x36c [ 26.262939] drm_atomic_check_only+0x40c/0x714 [ 26.267381] drm_atomic_nonblocking_commit+0x1c/0x60 [ 26.272344] drm_mode_atomic_ioctl+0xa3c/0xb8c [ 26.276787] drm_ioctl_kernel+0xc4/0x120 [ 26.280708] drm_ioctl+0x268/0x534 [ 26.284109] __arm64_sys_ioctl+0xa8/0xf0 [ 26.288030] el0_svc_common.constprop.0+0x80/0x240 [ 26.292817] do_el0_svc+0x24/0x90 [ 26.296132] el0_svc+0x20/0x30 [ 26.299185] el0_sync_handler+0xe8/0xf0 [ 26.303018] el0_sync+0x1a4/0x1c0 [ 26.306330] irq event stamp: 0 [ 26.309384] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 26.315650] hardirqs last disabled at (0): [<ffff800010056d34>] copy_process+0x5d0/0x183c [ 26.323825] softirqs last enabled at (0): [<ffff800010056d34>] copy_process+0x5d0/0x183c [ 26.331997] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 26.338261] ---[ end trace 20ae984fa860184a ]--- [ 26.343021] ------------[ cut here ]------------ [ 26.347646] WARNING: CPU: 3 PID: 345 at drivers/gpu/drm/drm_modeset_lock.c:228 drm_modeset_drop_locks+0x84/0x90 [ 26.357727] Modules linked in: [ 26.360783] CPU: 3 PID: 345 Comm: composer@2.1-se Kdump: loaded Tainted: G W 5.10.110-SE-SDK1.8-dirty #16 [ 26.371645] Hardware name: Siengine Se1000 Evaluation board (DT) [ 26.377647] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--) [ 26.383649] pc : drm_modeset_drop_locks+0x84/0x90 [ 26.388351] lr : drm_mode_atomic_ioctl+0x860/0xb8c [ 26.393137] sp : ffff800017b8bb10 [ 26.396447] pmr_save: 000000e0 [ 26.399497] x29: ffff800017b8bb10 x28: 0000000000000001 [ 26.404807] x27: 0000000000000038 x26: 0000000000000002 [ 26.410115] x25: ffff000cecbefa00 x24: ffff000cf2f96200 [ 26.415423] x23: 0000000000000001 x22: 0000000000000018 [ 26.420731] x21: 0000000000000001 x20: ffff800017b8bc10 [ 26.426039] x19: 0000000000000000 x18: 0000000000000000 [ 26.431347] x17: 0000000002e8bf2c x16: 0000000002e94c6b [ 26.436655] x15: 0000000002ea48b9 x14: ffff8000121f0300 [ 26.441963] x13: 0000000002ee2ca8 x12: ffff80001129cae0 [ 26.447272] x11: ffff800012435000 x10: ffff000ed46b5e88 [ 26.452580] x9 : ffff000c9935e600 x8 : 0000000000000000 [ 26.457888] x7 : 000000008020001e x6 : 000000008020001f [ 26.463196] x5 : ffff80001085fbe0 x4 : fffffe0033a59f20 [ 26.468504] x3 : 000000008020001e x2 : 0000000000000000 [ 26.473813] x1 : 0000000000000000 x0 : ffff000c8f596090 [ 26.479122] Call trace: [ 26.481566] drm_modeset_drop_locks+0x84/0x90 [ 26.485918] drm_mode_atomic_ioctl+0x860/0xb8c [ 26.490359] drm_ioctl_kernel+0xc4/0x120 [ 26.494278] drm_ioctl+0x268/0x534 [ 26.497677] __arm64_sys_ioctl+0xa8/0xf0 [ 26.501598] el0_svc_common.constprop.0+0x80/0x240 [ 26.506384] do_el0_svc+0x24/0x90 [ 26.509697] el0_svc+0x20/0x30 [ 26.512748] el0_sync_handler+0xe8/0xf0 [ 26.516580] el0_sync+0x1a4/0x1c0 [ 26.519891] irq event stamp: 0 [ 26.522943] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 26.529207] hardirqs last disabled at (0): [<ffff800010056d34>] copy_process+0x5d0/0x183c [ 26.537379] softirqs last enabled at (0): [<ffff800010056d34>] copy_process+0x5d0/0x183c [ 26.545550] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 26.551812] ---[ end trace 20ae984fa860184b ]--- According to the call trace information,it can be located to be WARN_ON(IS_ERR(c_st)) in the komeda_pipeline_unbound_components function; Then follow the function. komeda_pipeline_unbound_components -> komeda_component_get_state_and_set_user -> komeda_pipeline_get_state_and_set_crtc -> komeda_pipeline_get_state ->drm_atomic_get_private_obj_state -> drm_atomic_get_private_obj_state -> drm_modeset_lock komeda_pipeline_unbound_components -> komeda_component_get_state_and_set_user -> komeda_component_get_state -> drm_atomic_get_private_obj_state -> drm_modeset_lock ret = drm_modeset_lock(&obj->lock, state->acquire_ctx); if (ret) return ERR_PTR(ret); Here it return -EDEADLK. deal with the deadlock as suggested by [1], using the function drm_modeset_backoff(). [1] https://docs.kernel.org/gpu/drm-kms.html?highlight=kms#kms-locking Therefore, handling this problem can be solved by adding return -EDEADLK back to the drm_modeset_backoff processing flow in the drm_mode_atomic_ioctl function. Signed-off-by: baozhu.liu <lucas.liu@siengine.com> Signed-off-by: menghui.huang <menghui.huang@siengine.com> Reviewed-by: Liviu Dudau <liviu.dudau@arm.com> Signed-off-by: Liviu Dudau <liviu.dudau@arm.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230804013117.6870-1-menghui.huang@siengine.com Signed-off-by: Sasha Levin <sashal@kernel.org>
When creating ceq_0 during probing irdma, cqp.sc_cqp will be sent as a cqp_request to cqp->sc_cqp.sq_ring. If the request is pending when removing the irdma driver or unplugging its aux device, cqp.sc_cqp will be dereferenced as wrong struct in irdma_free_pending_cqp_request(). PID: 3669 TASK: ffff88aef892c000 CPU: 28 COMMAND: "kworker/28:0" #0 [fffffe0000549e38] crash_nmi_callback at ffffffff810e3a34 #1 [fffffe0000549e40] nmi_handle at ffffffff810788b2 #2 [fffffe0000549ea0] default_do_nmi at ffffffff8107938f #3 [fffffe0000549eb8] do_nmi at ffffffff81079582 #4 [fffffe0000549ef0] end_repeat_nmi at ffffffff82e016b4 [exception RIP: native_queued_spin_lock_slowpath+1291] RIP: ffffffff8127e72b RSP: ffff88aa841ef778 RFLAGS: 00000046 RAX: 0000000000000000 RBX: ffff88b01f849700 RCX: ffffffff8127e47e RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffff83857ec0 RBP: ffff88afe3e4efc8 R8: ffffed15fc7c9dfa R9: ffffed15fc7c9dfa R10: 0000000000000001 R11: ffffed15fc7c9df9 R12: 0000000000740000 R13: ffff88b01f849708 R14: 0000000000000003 R15: ffffed1603f092e1 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 -- <NMI exception stack> -- #5 [ffff88aa841ef778] native_queued_spin_lock_slowpath at ffffffff8127e72b #6 [ffff88aa841ef7b0] _raw_spin_lock_irqsave at ffffffff82c22aa4 #7 [ffff88aa841ef7c8] __wake_up_common_lock at ffffffff81257363 #8 [ffff88aa841ef888] irdma_free_pending_cqp_request at ffffffffa0ba12cc [irdma] #9 [ffff88aa841ef958] irdma_cleanup_pending_cqp_op at ffffffffa0ba1469 [irdma] #10 [ffff88aa841ef9c0] irdma_ctrl_deinit_hw at ffffffffa0b2989f [irdma] #11 [ffff88aa841efa28] irdma_remove at ffffffffa0b252df [irdma] #12 [ffff88aa841efae8] auxiliary_bus_remove at ffffffff8219afdb #13 [ffff88aa841efb00] device_release_driver_internal at ffffffff821882e6 #14 [ffff88aa841efb38] bus_remove_device at ffffffff82184278 #15 [ffff88aa841efb88] device_del at ffffffff82179d23 #16 [ffff88aa841efc48] ice_unplug_aux_dev at ffffffffa0eb1c14 [ice] #17 [ffff88aa841efc68] ice_service_task at ffffffffa0d88201 [ice] #18 [ffff88aa841efde8] process_one_work at ffffffff811c589a #19 [ffff88aa841efe60] worker_thread at ffffffff811c71ff #20 [ffff88aa841eff10] kthread at ffffffff811d87a0 #21 [ffff88aa841eff50] ret_from_fork at ffffffff82e0022f Fixes: 44d9e52 ("RDMA/irdma: Implement device initialization definitions") Link: https://lore.kernel.org/r/20231130081415.891006-1-lishifeng@sangfor.com.cn Suggested-by: "Ismail, Mustafa" <mustafa.ismail@intel.com> Signed-off-by: Shifeng Li <lishifeng@sangfor.com.cn> Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When processing a packed profile in unpack_profile() described like "profile :ns::samba-dcerpcd /usr/lib*/samba/{,samba/}samba-dcerpcd {...}" a string ":samba-dcerpcd" is unpacked as a fully-qualified name and then passed to aa_splitn_fqname(). aa_splitn_fqname() treats ":samba-dcerpcd" as only containing a namespace. Thus it returns NULL for tmpname, meanwhile tmpns is non-NULL. Later aa_alloc_profile() crashes as the new profile name is NULL now. general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 6 PID: 1657 Comm: apparmor_parser Not tainted 6.7.0-rc2-dirty #16 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 RIP: 0010:strlen+0x1e/0xa0 Call Trace: <TASK> ? strlen+0x1e/0xa0 aa_policy_init+0x1bb/0x230 aa_alloc_profile+0xb1/0x480 unpack_profile+0x3bc/0x4960 aa_unpack+0x309/0x15e0 aa_replace_profiles+0x213/0x33c0 policy_update+0x261/0x370 profile_replace+0x20e/0x2a0 vfs_write+0x2af/0xe00 ksys_write+0x126/0x250 do_syscall_64+0x46/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> ---[ end trace 0000000000000000 ]--- RIP: 0010:strlen+0x1e/0xa0 It seems such behaviour of aa_splitn_fqname() is expected and checked in other places where it is called (e.g. aa_remove_profiles). Well, there is an explicit comment "a ns name without a following profile is allowed" inside. AFAICS, nothing can prevent unpacked "name" to be in form like ":samba-dcerpcd" - it is passed from userspace. Deny the whole profile set replacement in such case and inform user with EPROTO and an explaining message. Found by Linux Verification Center (linuxtesting.org). Fixes: 04dc715 ("apparmor: audit policy ns specified in policy load") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: John Johansen <john.johansen@canonical.com>
vhost_worker will call tun call backs to receive packets. If too many illegal packets arrives, tun_do_read will keep dumping packet contents. When console is enabled, it will costs much more cpu time to dump packet and soft lockup will be detected. net_ratelimit mechanism can be used to limit the dumping rate. PID: 33036 TASK: ffff949da6f20000 CPU: 23 COMMAND: "vhost-32980" #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e #3 [fffffe00003fced0] do_nmi at ffffffff8922660d #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663 [exception RIP: io_serial_in+20] RIP: ffffffff89792594 RSP: ffffa655314979e8 RFLAGS: 00000002 RAX: ffffffff89792500 RBX: ffffffff8af428a0 RCX: 0000000000000000 RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff8af428a0 RBP: 0000000000002710 R8: 0000000000000004 R9: 000000000000000f R10: 0000000000000000 R11: ffffffff8acbf64f R12: 0000000000000020 R13: ffffffff8acbf698 R14: 0000000000000058 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07 #12 [ffffa65531497b68] printk at ffffffff89318306 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun] #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun] #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net] #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost] #18 [ffffa65531497f10] kthread at ffffffff892d2e72 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors") Signed-off-by: Lei Chen <lei.chen@smartx.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20240415020247.2207781-1-lei.chen@smartx.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it. Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region. The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture or TASK_SIZE is taken as default. The implementation is as follows: REG_AX = SRC_REG if(offset) REG_AX += offset; REG_AX >>= 32; if (REG_AX <= (uaddress_limit >> 32)) DST_REG = 0; else DST_REG = *(size *)(SRC_REG + offset); Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison. The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace. Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) { *(volatile long *)sk; return 0; } BPF Program before | BPF Program after ------------------ | ----------------- 0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0) ----------------------------------------------------------------------- 1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1 ----------------------------\ \ 2: (77) r11 >>= 32 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto pc+2 3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0) \ 5: (05) goto pc+1 \ 6: (b7) r1 = 0 \-------------------------------------- 7: (b7) r0 = 0 8: (95) exit As you can see from above, in the best case (off=0), 5 extra instructions are emitted. Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it. x86-64 JIT ========== JIT's Instrumentation (upstream) --------------------- 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 7: push %rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi --------------------------------- f: movabs $0x800000000000,%r11 19: cmp %r11,%rdi 1c: jb 0x000000000000002a 1e: mov %rdi,%r11 21: add $0x0,%r11 28: jae 0x000000000000002e 2a: xor %edi,%edi 2c: jmp 0x0000000000000032 2e: mov 0x0(%rdi),%rdi --------------------------------- 32: xor %eax,%eax 34: leave 35: ret The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT. ARM64 JIT ========= No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: add x9, x30, #0x0 0: add x9, x30, #0x0 4: nop 4: nop 8: paciasp 8: paciasp c: stp x29, x30, [sp, #-16]! c: stp x29, x30, [sp, #-16]! 10: mov x29, sp 10: mov x29, sp 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 24: mov x25, sp 24: mov x25, sp 28: mov x26, #0x0 28: mov x26, #0x0 2c: sub x27, x25, #0x0 2c: sub x27, x25, #0x0 30: sub sp, sp, #0x0 30: sub sp, sp, #0x0 34: ldr x0, [x0] 34: ldr x0, [x0] -------------------------------------------------------------------------------- 38: ldr x0, [x0] ----------\ 38: add x9, x0, #0x0 -----------------------------------\\ 3c: lsr x9, x9, #32 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, lsl #12 40: mov sp, sp \\ 44: b.ls 0x0000000000000050 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] 48: ldp x25, x26, [sp], #16 \ 4c: b 0x0000000000000054 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 50: ldp x19, x20, [sp], #16 \--------------------------------------- 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 58: add x0, x7, #0x0 58: mov sp, sp 5c: autiasp 5c: ldp x27, x28, [sp], #16 60: ret 60: ldp x25, x26, [sp], #16 64: nop 64: ldp x21, x22, [sp], #16 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, [sp], #16 6c: br x10 6c: ldp x29, x30, [sp], #16 70: add x0, x7, #0x0 74: autiasp 78: ret 7c: nop 80: ldr x10, 0x0000000000000088 84: br x10 There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0). RISC-V JIT (RISCV_ISA_C Disabled) ========== No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: nop 0: nop 4: nop 4: nop 8: li a6, 33 8: li a6, 33 c: addi sp, sp, -16 c: addi sp, sp, -16 10: sd s0, 8(sp) 10: sd s0, 8(sp) 14: addi s0, sp, 16 14: addi s0, sp, 16 18: ld a0, 0(a0) 18: ld a0, 0(a0) --------------------------------------------------------------- 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 --------------------------\ \ 20: srli t0, t0, 32 20: li a5, 0 \ \ 24: lui t1, 4096 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) 30: ret \ 34: j 8 \ 38: li a0, 0 \------------------------------ 3c: li a5, 0 40: ld s0, 8(sp) 44: addi sp, sp, 16 48: sext.w a0, a5 4c: ret There are 7 extra instructions added in RISC-V. Fixes: 8008342 ("bpf, arm64: Add BPF exception tables") Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We currently have the following code in tcp_check_req():
Can we instead, for mp_join child, in subflow_syn_recv_sock():
?
If not, document properly why (here and/or directly in the code)
Note: no need to call subflow_syn_recv_sock() / sock_rps_save_rxhash() in subflow_syn_recv_sock(), tcp_check_req() will do that, since we will still return child != NULL
The text was updated successfully, but these errors were encountered: