pty04 sporadically fails #674

Closed
jstancek opened this issue May 5, 2020 · 16 comments

@jstancek
Contributor

jstancek commented May 5, 2020

pty04 sporadically runs into TBROKs:

5.6.10-3dce3c1 x86_64 (J:4234886)

<<<test_start>>>
tag=pty04 stime=1588598661
cmdline="pty04"
contacts=""
analysis=exit
<<<test_output>>>
tst_test.c:1244: INFO: Timeout per run is 0h 05m 00s
pty04.c:95: INFO: PTS path is /dev/pts/0
pty04.c:149: INFO: Netdev is sl0
pty04.c:158: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:175: INFO: Bound netdev 2378 to socket 9
pty04.c:208: INFO: Reading from socket 9
pty04.c:123: PASS: Wrote PTY 1
pty04.c:133: PASS: Wrote PTY 2
pty04.c:213: PASS: Read netdev 1
tst_checkpoint.c:147: BROK: pty04.c:248: tst_checkpoint_wait(0, 10000): ETIMEDOUT (110)
tst_test.c:373: BROK: Reported by child (956159)
safe_macros.c:258: BROK: pty04.c:214: read(9,0x20d8410,8191) failed, returned -1: ENETDOWN (100)
pty04.c:138: PASS: Writing to PTY interrupted by hangup
tst_test.c:373: WARN: Reported by child (956157)

5.6.10-55754d7 aarch64 (J:4234757)

<<<test_start>>>
tag=pty04 stime=1588582886
cmdline="pty04"
contacts=""
analysis=exit
<<<test_output>>>
tst_test.c:1244: INFO: Timeout per run is 0h 05m 00s
pty04.c:95: INFO: PTS path is /dev/pts/0
pty04.c:149: INFO: Netdev is sl0
pty04.c:158: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:175: INFO: Bound netdev 894 to socket 9
pty04.c:123: PASS: Wrote PTY 1
pty04.c:208: INFO: Reading from socket 9
safe_macros.c:258: BROK: pty04.c:210: read(9,0x3fa97410,8191) failed, returned 90: SUCCESS (0)
pty04.c:133: PASS: Wrote PTY 2
tst_test.c:373: BROK: Reported by child (790085)
pty04.c:138: PASS: Writing to PTY interrupted by hangup
tst_checkpoint.c:147: BROK: pty04.c:248: tst_checkpoint_wait(0, 10000): ETIMEDOUT (110)
tst_test.c:373: WARN: Reported by child (790087)
@richiejp
Contributor

richiejp commented May 5, 2020

This looks like an issue I thought I had fixed by limiting the MTU to the internal chunk size. Probably the read is sleeping, waiting for an "unthrottle" which the PTY does not send with SLIP. I was debating whether this is a kernel bug and decided I didn't want to risk breaking anything in the kernel by implementing it (unthrottle is probably sent by TTY drivers for physical ports, just not by the PTY).

Probably I should use asynchronous reads instead to avoid sleeping in the kernel.
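
A minimal sketch of that idea (a hypothetical helper, not the pty04 code): guard the read with poll() and a timeout so a missing unthrottle cannot leave the reader sleeping in the kernel indefinitely.

#include <poll.h>
#include <unistd.h>
#include <errno.h>

/* Sketch only: wait up to timeout_ms for data, then read once.
 * If nothing arrives, fail with ETIMEDOUT instead of blocking forever. */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	if (ret == 0) {
		errno = ETIMEDOUT;
		return -1;
	}

	return read(fd, buf, len);
}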

@pevik
Member

pevik commented May 5, 2020

IMHO before merging 3c8f84e it failed on every second run (detected with -i 2), but now it fails on the first run. I reported it on the ML (https://patchwork.ozlabs.org/project/ltp/patch/20200505101625.25020-1-rpalethorpe@suse.com/).

@richiejp
Contributor

richiejp commented May 7, 2020

If I run 10 instances of pty04 -i 10 in parallel (with the slcan patch), I start to get some errors, like:

pty04.c:248: FAIL: can_id = 1600085855 != 1
pty04.c:254: FAIL: can_dlc = 95 != CAN_MAX_DLEN

and

safe_macros.c:260: BROK: pty04.c:306: read(11,0x7efe55704001,8191) failed, returned 16: SUCCESS (0)

and timeouts similar to what Jan reported. This looks like data corruption happening somewhere in the kernel, but I don't see how.

@richiejp
Contributor

It seems I have fixed the timeouts by retrying the reads. Even though the reads are blocking and the read size is the same as the packet size, read() can apparently return before the full packet has been read, which is probably what I should have assumed from the start.
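
Roughly, the retry idea looks like the following sketch (a hypothetical helper, not the actual LTP patch): loop until the expected number of bytes has arrived, since a blocking read() may return a short count.

#include <unistd.h>
#include <errno.h>

/* Sketch only: keep calling read() until 'len' bytes have been collected,
 * stopping early on EOF/hangup or on a real error such as ENETDOWN. */
static ssize_t read_full(int fd, char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t ret = read(fd, buf + off, len - off);

		if (ret < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (ret == 0)
			break;

		off += ret;
	}

	return off;
}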

However, it also seems I am getting SLIP packets on SLCAN-bound sockets, or vice versa.

[238/883] pty04.c:282: FAIL: Corrupt data (max 64 of 8191 bytes shown): data[0..8] = 01 00 00 00 08 00 00 00

It doesn't appear that netdevice indexes or PTS numbers are reused. Perhaps some line-discipline structure is reused? Or it has something to do with binding to raw sockets. At any rate, I will send another patch to the mailing list.

@richiejp
Contributor

I sent another patch which increases the timeout, amongst other things. This seems to fix it in the simple case.

It seems the kernel stalls, possibly while trying to allocate memory. Sometimes it even returns ENOBUFS and prints a stack trace for the allocation failure, followed by the OOM killer:

[  179.069553] rcu: INFO: rcu_sched self-detected stall on CPU
[  179.070313] rcu: 	3-....: (1 GPs behind) idle=abe/1/0x4000000000000002 softirq=6607/6608 fqs=6499
[  179.071383] 	(t=26001 jiffies g=18189 q=0)
[  179.071890] NMI backtrace for cpu 3
[  179.072359] CPU: 3 PID: 240 Comm: kworker/u8:5 Not tainted 5.7.0-rc4+ #49
[  179.073177] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt4
[  179.074531] Workqueue: events_unbound flush_to_ldisc
[  179.075151] Call Trace:
[  179.075466]  <IRQ>
[  179.075735]  dump_stack+0x97/0xe0
[  179.076151]  nmi_cpu_backtrace.cold+0x46/0x80
[  179.076689]  ? lapic_can_unplug_cpu.cold+0x46/0x46
[  179.077273]  nmi_trigger_cpumask_backtrace+0x125/0x154
[  179.077914]  rcu_dump_cpu_stacks+0x1a0/0x1ea
[  179.078467]  rcu_sched_clock_irq.cold+0x518/0x978
[  179.079083]  update_process_times+0x24/0x50
[  179.079618]  tick_sched_handle+0xf7/0x160
[  179.080122]  tick_sched_timer+0x38/0xf0
[  179.080598]  __hrtimer_run_queues+0x4ae/0xa90
[  179.081133]  ? tick_sched_do_timer+0x150/0x150
[  179.081694]  ? enqueue_hrtimer+0x2d0/0x2d0
[  179.082234]  hrtimer_interrupt+0x2bf/0x740
[  179.082765]  smp_apic_timer_interrupt+0x109/0x4e0
[  179.083327]  apic_timer_interrupt+0xf/0x20
[  179.083820]  </IRQ>
[  179.084085] RIP: 0010:trace_graph_entry+0x37b/0x720
[  179.084660] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 44 03 00 00 41 f6 46 18 01 0f 85 d9 00 00 00 4c 8d f
[  179.086781] RSP: 0018:ffffc90000427470 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  179.087660] RAX: 1ffffffff552ba7e RBX: ffffffffa70f05c0 RCX: 0000000000000000
[  179.088488] RDX: dffffc0000000000 RSI: 0000000000000002 RDI: ffffffffa98a7c58
[  179.089311] RBP: ffffc900004274e0 R08: ffffc900004274e8 R09: ffffffffab24f0cf
[  179.090135] R10: ffffffffab24f0d7 R11: ffffffffab24f0db R12: 0000000000000000
[  179.090960] R13: ffffffffaa382720 R14: 0000000000000246 R15: ffffffffaa382748
[  179.091910]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.092835]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.093746]  function_graph_enter+0x34b/0x520
[  179.094342]  ? ftrace_graph_stop+0x10/0x10
[  179.094885]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.095652]  ? ftrace_graph_caller+0x6b/0xa0
[  179.096236]  ? __trace_graph_entry+0x4e/0x1a0
[  179.096829]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.097599]  ? unwind_next_frame+0xea8/0x18a0
[  179.098178]  prepare_ftrace_return+0xa1/0xf0
[  179.098697]  ftrace_graph_caller+0x6b/0xa0
[  179.099186]  ? core_kernel_data+0x20/0x20
[  179.099671]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.100365]  ? deref_stack_reg+0x5/0xe0
[  179.100853]  deref_stack_reg+0x5/0xe0
[  179.101296]  ? ftrace_graph_caller+0xa0/0xa0
[  179.101803]  unwind_next_frame+0xea8/0x18a0
[  179.102302]  ? kmem_cache_alloc+0xce/0x250
[  179.102803]  ? deref_stack_reg+0xe0/0xe0
[  179.103279]  ? kmem_cache_alloc+0xce/0x250
[  179.103778]  ? profile_setup.cold+0xa1/0xa1
[  179.104279]  arch_stack_walk+0x7f/0xe0
[  179.104744]  ? __build_skb+0x1f/0x50
[  179.105183]  stack_trace_save+0x91/0xc0
[  179.105648]  ? stack_trace_consume_entry+0x160/0x160
[  179.106246]  ? stack_trace_save+0x5/0xc0
[  179.106723]  save_stack+0x1b/0x40
[  179.107124]  ? save_stack+0x1b/0x40
[  179.107546]  ? __kasan_kmalloc.constprop.0+0xc2/0xd0
[  179.108135]  ? kmem_cache_alloc+0xce/0x250
[  179.108621]  ? trace_buffer_lock_reserve+0x17/0x50
[  179.109193]  ? __trace_graph_entry+0x4e/0x1a0
[  179.109726]  ? kmem_cache_size+0x10/0x10
[  179.110201]  ? function_graph_enter+0x34b/0x520
[  179.110740]  ? ftrace_graph_stop+0x10/0x10
[  179.111227]  ? fs_reclaim_acquire.part.0+0x30/0x30
[  179.111797]  ? kmem_cache_size+0x10/0x10
[  179.112271]  ? ftrace_graph_stop+0x10/0x10
[  179.112763]  ? kmem_cache_size+0x10/0x10
[  179.113232]  ? kmem_cache_alloc+0x4c/0x250
[  179.113721]  ? prepare_ftrace_return+0xa1/0xf0
[  179.114250]  ? __build_skb+0x1f/0x50
[  179.114683]  ? ftrace_graph_caller+0x6b/0xa0
[  179.115221]  ? ftrace_graph_caller+0x6b/0xa0
[  179.115738]  ? kasan_unpoison_shadow+0x33/0x40
[  179.116267]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[  179.116838]  ? __build_skb+0x1f/0x50
[  179.117268]  kmem_cache_alloc+0xce/0x250
[  179.117743]  __build_skb+0x1f/0x50
[  179.118155]  __netdev_alloc_skb+0x158/0x2d0
[  179.118661]  slcan_receive_buf+0x6fd/0xbb0 [slcan]
[  179.119227]  ? slcan_transmit+0x3e0/0x3e0 [slcan]
[  179.119807]  ? slcan_transmit+0x3e0/0x3e0 [slcan]
[  179.120379]  ? slcan_transmit+0x3e0/0x3e0 [slcan]
[  179.120952]  tty_ldisc_receive_buf+0x100/0x1d0
[  179.121483]  ? tty_ldisc_receive_buf+0x5/0x1d0
[  179.122019]  tty_port_default_receive_buf+0x61/0x90
[  179.122603]  flush_to_ldisc+0x1a7/0x3e0
[  179.123082]  process_one_work+0x926/0x16d0
[  179.123591]  ? pwq_dec_nr_in_flight+0x290/0x290
[  179.124136]  ? ftrace_graph_caller+0x6b/0xa0
[  179.124654]  ? rwlock_bug.part.0+0x90/0x90
[  179.125158]  worker_thread+0x92/0xfa0
[  179.125622]  ? process_one_work+0x16d0/0x16d0
[  179.126145]  kthread+0x2ba/0x3b0
[  179.126542]  ? kthread_create_on_node+0xd0/0xd0
[  179.127086]  ret_from_fork+0x3a/0x50

@pevik pevik closed this as completed in 035336b May 14, 2020
@pevik
Member

pevik commented May 14, 2020

@richiejp Well, the previous issue has not been fixed, so feel free to reopen. But IMHO that needs a kernel fix, and at least the original TBROK report got fixed (thanks!).

@richiejp
Contributor

I sent in yet another patch to fix a bug in retry_read.

Also, it appears the test fails on PowerVM with this:

tst_test.c:1246: INFO: Timeout per run is 0h 05m 00s
pty04.c:130: INFO: PTS path is /dev/pts/2
pty04.c:207: INFO: Netdev is sl0
pty04.c:216: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:233: INFO: Bound netdev 325 to socket 5
tst_buffers.c:55: INFO: Test is using guarded buffers
tst_buffers.c:55: INFO: Test is using guarded buffers
pty04.c:324: INFO: Reading from socket 5
pty04.c:180: PASS: Wrote PTY N_SLIP 3 (1)
pty04.c:191: PASS: Wrote PTY N_SLIP 3 (2)
pty04.c:284: FAIL: Corrupt data (max 1024 of 8191 bytes shown): data[0..316] =
pty04.c:284: FAIL: 00 00 5e 00 01 03 46 11 eb 8d 86 ea 08 00 45 10
pty04.c:284: FAIL: 00 88 61 61 40 00 40 06 a0 e7 0a a1 19 3a 0a a3
pty04.c:284: FAIL: 09 9a 00 16 af 6a e4 9b e6 b6 6f ec 19 09 80 18
pty04.c:284: FAIL: 01 f5 00 00 00 00 01 01 08 0a 2c e1 61 03 66 bd
pty04.c:284: FAIL: 80 a8 67 b0 03 73 6a 04 1e a6 6a 04 1d 08 e0 9f
pty04.c:284: FAIL: 39 ae 30 18 00 40 01 55 6b 8e 7e a0 df 0d ae cd
pty04.c:284: FAIL: 56 7c 44 34 4a bd f6 4a be 2d cb 23 47 a6 b2 a9
pty04.c:284: FAIL: 2f 57 f8 74 98 65 a7 a2 cc ff 63 c1 24 ca 9c 1e
pty04.c:284: FAIL: b6 3b 55 0e da 84 c7 d6 15 7d c1 c6 22 a7 d7 04
pty04.c:284: FAIL: 38 3e ea 03 4b b0 00 00 5e 00 01 03 46 11 eb 8d
pty04.c:284: FAIL: 86 ea 08 00 45 10 00 98 61 62 40 00 40 06 a0 d6
pty04.c:284: FAIL: 0a a1 19 3a 0a a3 09 9a 00 16 af 6a e4 9b e7 0a
pty04.c:284: FAIL: 6f ec 19 09 80 18 01 f5 00 00 00 00 01 01 08 0a
pty04.c:284: FAIL: 2c e1 61 03 66 bd 80 a8 7a b0 03 4b af f6 06 24
pty04.c:284: FAIL: 94 99 fa 69 80 69 5e a3 a0 ab c3 3b 29 af 75 5c
pty04.c:284: FAIL: 49 8b 0a 87 4e 22 18 7d 8e 77 77 f2 e4 c8 ce a7
pty04.c:284: FAIL: ad d1 7f 3f d0 91 36 c9 97 5c 2d 4a b9 c9 3a 04
pty04.c:284: FAIL: 9f 27 1e 4b 85 01 af 29 a0 d1 b6 88 c4 36 41 5b
pty04.c:284: FAIL: a9 ce 22 90 67 a6 9c 71 c7 0c c2 5e 41 3a d8 49
pty04.c:284: FAIL: c2 22 67 fe 3c 5c a4 6f 8e fd e1 6d
pty04.c:288: INFO: Will continue test without data checking
pty04.c:328: PASS: Read netdev N_SLIP 5 (1)
pty04.c:332: PASS: Read netdev N_SLIP 5 (2)
pty04.c:338: PASS: Reading data from netdev interrupted by hangup
pty04.c:366: INFO: Sent hangup ioctl to PTS
pty04.c:196: PASS: Writing to PTY interrupted by hangup

Possibly that is a packet from another interface, or something else.

@richiejp
Contributor

Yup, it was a packet from another interface. By default, raw packet sockets receive traffic from all interfaces. I have sent another patch to the ML.
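
For reference, a sketch of restricting the socket to a single netdev (the interface name "sl0" and the helper are assumptions, not the exact LTP patch): bind the AF_PACKET socket to the interface index so packets arriving on other interfaces are filtered out.

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch only: open a raw packet socket and bind it to one interface,
 * e.g. "sl0", so it no longer sees traffic from every interface. */
static int bound_packet_socket(const char *ifname)
{
	int sk = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	struct sockaddr_ll addr = {
		.sll_family = AF_PACKET,
		.sll_protocol = htons(ETH_P_ALL),
		.sll_ifindex = if_nametoindex(ifname),
	};

	if (sk < 0)
		return -1;

	if (bind(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(sk);
		return -1;
	}

	return sk;
}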

@richiejp
Contributor

Still sporadically fails

@jstancek
Contributor Author

jstancek commented Jun 3, 2020

Just hit it too with 5.7.0-b23c477 on x86_64:

tst_test.c:1244: INFO: Timeout per run is 0h 05m 00s
pty04.c:130: INFO: PTS path is /dev/pts/0
pty04.c:207: INFO: Netdev is sl0
pty04.c:216: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:233: INFO: Bound netdev 3354 to socket 9
tst_buffers.c:55: INFO: Test is using guarded buffers
pty04.c:324: INFO: Reading from socket 9
tst_buffers.c:55: INFO: Test is using guarded buffers
pty04.c:180: PASS: Wrote PTY N_SLIP 6 (1)
pty04.c:191: PASS: Wrote PTY N_SLIP 6 (2)
pty04.c:328: PASS: Read netdev N_SLIP 9 (1)
tst_checkpoint.c:147: BROK: pty04.c:364: tst_checkpoint_wait(0, 100000): ETIMEDOUT (110)
tst_test.c:373: BROK: Reported by child (1178909)
pty04.c:306: BROK: Read 0 of 8191 bytes: ENETDOWN (100)
pty04.c:196: PASS: Writing to PTY interrupted by hangup
tst_test.c:373: WARN: Reported by child (1178907)

@jstancek jstancek reopened this Jun 3, 2020
@ywangwrs

Any update on this issue?

@richiejp
Contributor

Nope, sorry. It is on my backlog, but at low priority. Are you using SLIP or SLCAN, or is the issue just that the test randomly fails?

@ywangwrs

It is a random failure in our daily LTP test.

@richiejp
Contributor

On a 5.10 RC kernel I keep hitting a soft lockup inside slc_bump (slcan) if I run the test repeatedly. The stack traces point to different locations within slc_bump, and there is no clear infinite loop or anything like that, so this is probably a kernel issue. I will investigate further.

@richiejp
Contributor

It appears that sometimes the test sends a lot of data and the kernel needs a long time to process it. The test should work reasonably reliably, so I have sent a patch to the mailing list to fix that:
http://lists.linux.it/pipermail/ltp/2020-October/019501.html

I don't think there is an issue with the kernel other than it being slow, which is probably not a real-world problem for SLCAN.

@metan-ucw
Member

This should be fixed in e40bcd5.
