pty04 sporadically fails #674

Closed
jstancek opened this issue May 5, 2020 · 16 comments

@jstancek
Contributor

jstancek commented May 5, 2020

pty04 sporadically runs into TBROKs:

5.6.10-3dce3c1 x86_64 (J:4234886)

<<<test_start>>>
tag=pty04 stime=1588598661
cmdline="pty04"
contacts=""
analysis=exit
<<<test_output>>>
tst_test.c:1244: INFO: Timeout per run is 0h 05m 00s
pty04.c:95: INFO: PTS path is /dev/pts/0
pty04.c:149: INFO: Netdev is sl0
pty04.c:158: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:175: INFO: Bound netdev 2378 to socket 9
pty04.c:208: INFO: Reading from socket 9
pty04.c:123: PASS: Wrote PTY 1
pty04.c:133: PASS: Wrote PTY 2
pty04.c:213: PASS: Read netdev 1
tst_checkpoint.c:147: BROK: pty04.c:248: tst_checkpoint_wait(0, 10000): ETIMEDOUT (110)
tst_test.c:373: BROK: Reported by child (956159)
safe_macros.c:258: BROK: pty04.c:214: read(9,0x20d8410,8191) failed, returned -1: ENETDOWN (100)
pty04.c:138: PASS: Writing to PTY interrupted by hangup
tst_test.c:373: WARN: Reported by child (956157)

5.6.10-55754d7 aarch64 (J:4234757)

<<<test_start>>>
tag=pty04 stime=1588582886
cmdline="pty04"
contacts=""
analysis=exit
<<<test_output>>>
tst_test.c:1244: INFO: Timeout per run is 0h 05m 00s
pty04.c:95: INFO: PTS path is /dev/pts/0
pty04.c:149: INFO: Netdev is sl0
pty04.c:158: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:175: INFO: Bound netdev 894 to socket 9
pty04.c:123: PASS: Wrote PTY 1
pty04.c:208: INFO: Reading from socket 9
safe_macros.c:258: BROK: pty04.c:210: read(9,0x3fa97410,8191) failed, returned 90: SUCCESS (0)
pty04.c:133: PASS: Wrote PTY 2
tst_test.c:373: BROK: Reported by child (790085)
pty04.c:138: PASS: Writing to PTY interrupted by hangup
tst_checkpoint.c:147: BROK: pty04.c:248: tst_checkpoint_wait(0, 10000): ETIMEDOUT (110)
tst_test.c:373: WARN: Reported by child (790087)
@richiejp
Contributor

richiejp commented May 5, 2020

This looks like an issue I thought I had fixed by limiting the MTU to the internal chunk size. Probably the read is sleeping, waiting for an "unthrottle" which the PTY does not send with SLIP. I was debating whether this is a kernel bug and decided I didn't want to risk breaking anything in the kernel by implementing it (unthrottle is probably sent by TTY drivers for physical ports, just not by the PTY).

Probably I should use asynchronous reads instead to avoid sleeping in the kernel.
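
A minimal sketch of that idea (a hypothetical helper, not the pty04 code): guard the read with poll() and a timeout so a missing unthrottle cannot leave the reader sleeping in the kernel indefinitely.

#include <poll.h>
#include <unistd.h>
#include <errno.h>

/* Sketch only: wait up to timeout_ms for data, then read once.
 * If nothing arrives, fail with ETIMEDOUT instead of blocking forever. */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	if (ret == 0) {
		errno = ETIMEDOUT;
		return -1;
	}

	return read(fd, buf, len);
}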

@pevik
Member

pevik commented May 5, 2020

IMHO before merging 3c8f84e it failed on every second run (detected with -i 2), but now it fails on the first run. I reported it on the ML (https://patchwork.ozlabs.org/project/ltp/patch/20200505101625.25020-1-rpalethorpe@suse.com/).

@richiejp
Contributor

richiejp commented May 7, 2020

If I run 10 instances of pty04 -i 10 in parallel (with the slcan patch), I start to get some errors, like:

pty04.c:248: FAIL: can_id = 1600085855 != 1
pty04.c:254: FAIL: can_dlc = 95 != CAN_MAX_DLEN

and

safe_macros.c:260: BROK: pty04.c:306: read(11,0x7efe55704001,8191) failed, returned 16: SUCCESS (0)

and timeouts similar to what Jan reported. This looks like data corruption happening somewhere in the kernel, but I don't see how.

@richiejp
Contributor

It seems I have fixed the timeouts by retrying the reads. Even though the reads are blocking and the read size is the same as the packet size, read() can apparently return before the full packet has been read, which is probably what I should have assumed from the start.
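
Roughly, the retry idea looks like the following sketch (a hypothetical helper, not the actual LTP patch): loop until the expected number of bytes has arrived, since a blocking read() may return a short count.

#include <unistd.h>
#include <errno.h>

/* Sketch only: keep calling read() until 'len' bytes have been collected,
 * stopping early on EOF/hangup or on a real error such as ENETDOWN. */
static ssize_t read_full(int fd, char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t ret = read(fd, buf + off, len - off);

		if (ret < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (ret == 0)
			break;

		off += ret;
	}

	return off;
}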

However, it also seems I am getting SLIP packets on SLCAN-bound sockets, or vice versa.

[238/883] pty04.c:282: FAIL: Corrupt data (max 64 of 8191 bytes shown): data[0..8] = 01 00 00 00 08 00 00 00

It doesn't appear that netdevice indexes or PTS numbers are reused. Perhaps some line-discipline structure is reused? Or it has something to do with binding to raw sockets. At any rate, I will send another patch to the mailing list.

@richiejp
Contributor

I sent another patch which increases the timeout, amongst other things. This seems to fix it in the simple case.

It seems the kernel stalls, possibly while trying to allocate memory. Sometimes it even returns ENOBUFS and prints a stack trace for the allocation failure, followed by the OOM killer:

[  179.069553] rcu: INFO: rcu_sched self-detected stall on CPU
[  179.070313] rcu: 	3-....: (1 GPs behind) idle=abe/1/0x4000000000000002 softirq=6607/6608 fqs=6499
[  179.071383] 	(t=26001 jiffies g=18189 q=0)
[  179.071890] NMI backtrace for cpu 3
[  179.072359] CPU: 3 PID: 240 Comm: kworker/u8:5 Not tainted 5.7.0-rc4+ #49
[  179.073177] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt4
[  179.074531] Workqueue: events_unbound flush_to_ldisc
[  179.075151] Call Trace:
[  179.075466]  <IRQ>
[  179.075735]  dump_stack+0x97/0xe0
[  179.076151]  nmi_cpu_backtrace.cold+0x46/0x80
[  179.076689]  ? lapic_can_unplug_cpu.cold+0x46/0x46
[  179.077273]  nmi_trigger_cpumask_backtrace+0x125/0x154
[  179.077914]  rcu_dump_cpu_stacks+0x1a0/0x1ea
[  179.078467]  rcu_sched_clock_irq.cold+0x518/0x978
[  179.079083]  update_process_times+0x24/0x50
[  179.079618]  tick_sched_handle+0xf7/0x160
[  179.080122]  tick_sched_timer+0x38/0xf0
[  179.080598]  __hrtimer_run_queues+0x4ae/0xa90
[  179.081133]  ? tick_sched_do_timer+0x150/0x150
[  179.081694]  ? enqueue_hrtimer+0x2d0/0x2d0
[  179.082234]  hrtimer_interrupt+0x2bf/0x740
[  179.082765]  smp_apic_timer_interrupt+0x109/0x4e0
[  179.083327]  apic_timer_interrupt+0xf/0x20
[  179.083820]  </IRQ>
[  179.084085] RIP: 0010:trace_graph_entry+0x37b/0x720
[  179.084660] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 44 03 00 00 41 f6 46 18 01 0f 85 d9 00 00 00 4c 8d f
[  179.086781] RSP: 0018:ffffc90000427470 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  179.087660] RAX: 1ffffffff552ba7e RBX: ffffffffa70f05c0 RCX: 0000000000000000
[  179.088488] RDX: dffffc0000000000 RSI: 0000000000000002 RDI: ffffffffa98a7c58
[  179.089311] RBP: ffffc900004274e0 R08: ffffc900004274e8 R09: ffffffffab24f0cf
[  179.090135] R10: ffffffffab24f0d7 R11: ffffffffab24f0db R12: 0000000000000000
[  179.090960] R13: ffffffffaa382720 R14: 0000000000000246 R15: ffffffffaa382748
[  179.091910]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.092835]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.093746]  function_graph_enter+0x34b/0x520
[  179.094342]  ? ftrace_graph_stop+0x10/0x10
[  179.094885]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.095652]  ? ftrace_graph_caller+0x6b/0xa0
[  179.096236]  ? __trace_graph_entry+0x4e/0x1a0
[  179.096829]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.097599]  ? unwind_next_frame+0xea8/0x18a0
[  179.098178]  prepare_ftrace_return+0xa1/0xf0
[  179.098697]  ftrace_graph_caller+0x6b/0xa0
[  179.099186]  ? core_kernel_data+0x20/0x20
[  179.099671]  ? __read_once_size_nocheck.constprop.0+0x10/0x10
[  179.100365]  ? deref_stack_reg+0x5/0xe0
[  179.100853]  deref_stack_reg+0x5/0xe0
[  179.101296]  ? ftrace_graph_caller+0xa0/0xa0
[  179.101803]  unwind_next_frame+0xea8/0x18a0
[  179.102302]  ? kmem_cache_alloc+0xce/0x250
[  179.102803]  ? deref_stack_reg+0xe0/0xe0
[  179.103279]  ? kmem_cache_alloc+0xce/0x250
[  179.103778]  ? profile_setup.cold+0xa1/0xa1
[  179.104279]  arch_stack_walk+0x7f/0xe0
[  179.104744]  ? __build_skb+0x1f/0x50
[  179.105183]  stack_trace_save+0x91/0xc0
[  179.105648]  ? stack_trace_consume_entry+0x160/0x160
[  179.106246]  ? stack_trace_save+0x5/0xc0
[  179.106723]  save_stack+0x1b/0x40
[  179.107124]  ? save_stack+0x1b/0x40
[  179.107546]  ? __kasan_kmalloc.constprop.0+0xc2/0xd0
[  179.108135]  ? kmem_cache_alloc+0xce/0x250
[  179.108621]  ? trace_buffer_lock_reserve+0x17/0x50
[  179.109193]  ? __trace_graph_entry+0x4e/0x1a0
[  179.109726]  ? kmem_cache_size+0x10/0x10
[  179.110201]  ? function_graph_enter+0x34b/0x520
[  179.110740]  ? ftrace_graph_stop+0x10/0x10
[  179.111227]  ? fs_reclaim_acquire.part.0+0x30/0x30
[  179.111797]  ? kmem_cache_size+0x10/0x10
[  179.112271]  ? ftrace_graph_stop+0x10/0x10
[  179.112763]  ? kmem_cache_size+0x10/0x10
[  179.113232]  ? kmem_cache_alloc+0x4c/0x250
[  179.113721]  ? prepare_ftrace_return+0xa1/0xf0
[  179.114250]  ? __build_skb+0x1f/0x50
[  179.114683]  ? ftrace_graph_caller+0x6b/0xa0
[  179.115221]  ? ftrace_graph_caller+0x6b/0xa0
[  179.115738]  ? kasan_unpoison_shadow+0x33/0x40
[  179.116267]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[  179.116838]  ? __build_skb+0x1f/0x50
[  179.117268]  kmem_cache_alloc+0xce/0x250
[  179.117743]  __build_skb+0x1f/0x50
[  179.118155]  __netdev_alloc_skb+0x158/0x2d0
[  179.118661]  slcan_receive_buf+0x6fd/0xbb0 [slcan]
[  179.119227]  ? slcan_transmit+0x3e0/0x3e0 [slcan]
[  179.119807]  ? slcan_transmit+0x3e0/0x3e0 [slcan]
[  179.120379]  ? slcan_transmit+0x3e0/0x3e0 [slcan]
[  179.120952]  tty_ldisc_receive_buf+0x100/0x1d0
[  179.121483]  ? tty_ldisc_receive_buf+0x5/0x1d0
[  179.122019]  tty_port_default_receive_buf+0x61/0x90
[  179.122603]  flush_to_ldisc+0x1a7/0x3e0
[  179.123082]  process_one_work+0x926/0x16d0
[  179.123591]  ? pwq_dec_nr_in_flight+0x290/0x290
[  179.124136]  ? ftrace_graph_caller+0x6b/0xa0
[  179.124654]  ? rwlock_bug.part.0+0x90/0x90
[  179.125158]  worker_thread+0x92/0xfa0
[  179.125622]  ? process_one_work+0x16d0/0x16d0
[  179.126145]  kthread+0x2ba/0x3b0
[  179.126542]  ? kthread_create_on_node+0xd0/0xd0
[  179.127086]  ret_from_fork+0x3a/0x50

@pevik pevik closed this as completed in 035336b May 14, 2020
@pevik
Member

pevik commented May 14, 2020

@richiejp Well, the previous issue has not been fixed, so feel free to reopen. But IMHO that needs a kernel fix, and at least the original TBROK report got fixed (thanks!).

@richiejp
Contributor

I sent in yet another patch to fix a bug in retry_read.

Also, it appears the test fails on PowerVM with this:

tst_test.c:1246: INFO: Timeout per run is 0h 05m 00s
pty04.c:130: INFO: PTS path is /dev/pts/2
pty04.c:207: INFO: Netdev is sl0
pty04.c:216: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:233: INFO: Bound netdev 325 to socket 5
tst_buffers.c:55: INFO: Test is using guarded buffers
tst_buffers.c:55: INFO: Test is using guarded buffers
pty04.c:324: INFO: Reading from socket 5
pty04.c:180: PASS: Wrote PTY N_SLIP 3 (1)
pty04.c:191: PASS: Wrote PTY N_SLIP 3 (2)
pty04.c:284: FAIL: Corrupt data (max 1024 of 8191 bytes shown): data[0..316] =
pty04.c:284: FAIL: 00 00 5e 00 01 03 46 11 eb 8d 86 ea 08 00 45 10
pty04.c:284: FAIL: 00 88 61 61 40 00 40 06 a0 e7 0a a1 19 3a 0a a3
pty04.c:284: FAIL: 09 9a 00 16 af 6a e4 9b e6 b6 6f ec 19 09 80 18
pty04.c:284: FAIL: 01 f5 00 00 00 00 01 01 08 0a 2c e1 61 03 66 bd
pty04.c:284: FAIL: 80 a8 67 b0 03 73 6a 04 1e a6 6a 04 1d 08 e0 9f
pty04.c:284: FAIL: 39 ae 30 18 00 40 01 55 6b 8e 7e a0 df 0d ae cd
pty04.c:284: FAIL: 56 7c 44 34 4a bd f6 4a be 2d cb 23 47 a6 b2 a9
pty04.c:284: FAIL: 2f 57 f8 74 98 65 a7 a2 cc ff 63 c1 24 ca 9c 1e
pty04.c:284: FAIL: b6 3b 55 0e da 84 c7 d6 15 7d c1 c6 22 a7 d7 04
pty04.c:284: FAIL: 38 3e ea 03 4b b0 00 00 5e 00 01 03 46 11 eb 8d
pty04.c:284: FAIL: 86 ea 08 00 45 10 00 98 61 62 40 00 40 06 a0 d6
pty04.c:284: FAIL: 0a a1 19 3a 0a a3 09 9a 00 16 af 6a e4 9b e7 0a
pty04.c:284: FAIL: 6f ec 19 09 80 18 01 f5 00 00 00 00 01 01 08 0a
pty04.c:284: FAIL: 2c e1 61 03 66 bd 80 a8 7a b0 03 4b af f6 06 24
pty04.c:284: FAIL: 94 99 fa 69 80 69 5e a3 a0 ab c3 3b 29 af 75 5c
pty04.c:284: FAIL: 49 8b 0a 87 4e 22 18 7d 8e 77 77 f2 e4 c8 ce a7
pty04.c:284: FAIL: ad d1 7f 3f d0 91 36 c9 97 5c 2d 4a b9 c9 3a 04
pty04.c:284: FAIL: 9f 27 1e 4b 85 01 af 29 a0 d1 b6 88 c4 36 41 5b
pty04.c:284: FAIL: a9 ce 22 90 67 a6 9c 71 c7 0c c2 5e 41 3a d8 49
pty04.c:284: FAIL: c2 22 67 fe 3c 5c a4 6f 8e fd e1 6d
pty04.c:288: INFO: Will continue test without data checking
pty04.c:328: PASS: Read netdev N_SLIP 5 (1)
pty04.c:332: PASS: Read netdev N_SLIP 5 (2)
pty04.c:338: PASS: Reading data from netdev interrupted by hangup
pty04.c:366: INFO: Sent hangup ioctl to PTS
pty04.c:196: PASS: Writing to PTY interrupted by hangup

Possibly that is a packet from another interface, or something else.

@richiejp
Contributor

Yup, it was a packet from another interface. By default, raw packet sockets receive traffic from all interfaces. I have sent another patch to the ML.
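
For reference, a sketch of restricting the socket to a single netdev (the interface name "sl0" and the helper are assumptions, not the exact LTP patch): bind the AF_PACKET socket to the interface index so packets arriving on other interfaces are filtered out.

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch only: open a raw packet socket and bind it to one interface,
 * e.g. "sl0", so it no longer sees traffic from every interface. */
static int bound_packet_socket(const char *ifname)
{
	int sk = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	struct sockaddr_ll addr = {
		.sll_family = AF_PACKET,
		.sll_protocol = htons(ETH_P_ALL),
		.sll_ifindex = if_nametoindex(ifname),
	};

	if (sk < 0)
		return -1;

	if (bind(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(sk);
		return -1;
	}

	return sk;
}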

@richiejp
Contributor

Still sporadically fails

@jstancek
Contributor Author

jstancek commented Jun 3, 2020

Just hit it too with 5.7.0-b23c477 on x86_64:

tst_test.c:1244: INFO: Timeout per run is 0h 05m 00s
pty04.c:130: INFO: PTS path is /dev/pts/0
pty04.c:207: INFO: Netdev is sl0
pty04.c:216: INFO: Netdev MTU is 8192 (we set 8192)
pty04.c:233: INFO: Bound netdev 3354 to socket 9
tst_buffers.c:55: INFO: Test is using guarded buffers
pty04.c:324: INFO: Reading from socket 9
tst_buffers.c:55: INFO: Test is using guarded buffers
pty04.c:180: PASS: Wrote PTY N_SLIP 6 (1)
pty04.c:191: PASS: Wrote PTY N_SLIP 6 (2)
pty04.c:328: PASS: Read netdev N_SLIP 9 (1)
tst_checkpoint.c:147: BROK: pty04.c:364: tst_checkpoint_wait(0, 100000): ETIMEDOUT (110)
tst_test.c:373: BROK: Reported by child (1178909)
pty04.c:306: BROK: Read 0 of 8191 bytes: ENETDOWN (100)
pty04.c:196: PASS: Writing to PTY interrupted by hangup
tst_test.c:373: WARN: Reported by child (1178907)

@jstancek jstancek reopened this Jun 3, 2020
@ywangwrs

Any update on this issue?

@richiejp
Contributor

Nope, sorry. It is on my backlog, but at low priority. Are you using SLIP or SLCAN, or is the issue just that the test randomly fails?

@ywangwrs

It is a random failure in our daily LTP test.

@richiejp
Contributor

On a 5.10 RC kernel I keep hitting a soft lockup inside slc_bump (slcan) if I run the test repeatedly. The stack traces point to different locations within slc_bump, and there is no clear infinite loop or anything like that, so this is probably a kernel issue. I will investigate further.

@richiejp
Contributor

It appears that sometimes the test sends a lot of data and the kernel needs a long time to process it. The test should work reasonably reliably, so I have sent a patch to the mailing list to fix that:
http://lists.linux.it/pipermail/ltp/2020-October/019501.html

I don't think there is an issue with the kernel other than it being slow, which is probably not a real-world problem for SLCAN.

@metan-ucw
Member

This should be fixed in e40bcd5.
