Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

9.1.6 drbd_req_destroy: Logic BUG #27

Closed
koszik opened this issue Feb 17, 2022 · 8 comments
Closed

9.1.6 drbd_req_destroy: Logic BUG #27

koszik opened this issue Feb 17, 2022 · 8 comments

Comments

@koszik
Copy link

koszik commented Feb 17, 2022

I have two secondary nodes with disks (A, B) and one diskless node acting as primary, and I wanted to know what happens if the connection between C and A/B experiences a problem.

To test that, while sending requests to the device, I drop packets on node C going to A and then B as well, then re-enable the connections, like this:

fio &
iptables -I OUTPUT -d nodeA -j DROP
sleep 20
iptables -I OUTPUT -d nodeB -j DROP
sleep 20
iptables -D OUTPUT -d nodeA -j DROP
sleep 20
iptables -D OUTPUT -d nodeB -j DROP

Configuration:

common {
    options {
        on-no-data-accessible suspend-io;
        on-no-quorum suspend-io;
    }
    disk {
        c-plan-ahead 7;
        c-fill-target 10M;
        c-min-rate 4M;
        c-max-rate 3000M;
    }
    net {
        max-buffers             36k;
        sndbuf-size            1024k;
        rcvbuf-size            2048k;
        verify-alg crc32;
    }
}

resource u12215 {
    protocol C;
    device minor 854;
    disk /dev/nodeC/u12215;
    meta-disk /dev/nodeC/md-u12215;
    on nodeA { node-id 10; address ipv4 nodeA:30854; }
    on nodeB { node-id 11; address ipv4 nodeB:30854; }
    on nodeC { node-id 7; address ipv4 nodeC:30854; disk none; }
    connection-mesh { hosts nodeA nodeB nodeC; }
}


When sending read requests, the result is drbd854: IO ERROR: neither local nor remote data, sector xxx, and it returns the error to fio (even though the configuration forbids that).

Feb 18 00:20:16 nodeC kernel: drbd: initialized. Version: 9.1.6 (api:2/proto:110-121)
Feb 18 00:20:16 nodeC kernel: drbd: GIT-hash: f85adeb71f16a0aead1e875665e2fb68852d94eb build by root@nodeC, 2022-02-17 21:05:39
Feb 18 00:20:16 nodeC kernel: drbd: registered as block device major 147
Feb 18 00:20:23 nodeC kernel: drbd u12215: Starting worker thread (from drbdsetup [5840])
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: Starting sender thread (from drbdsetup [5847])
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbdsetup [5849])
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: conn( StandAlone -> Unconnected )
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: Starting receiver thread (from drbd_w_u12215 [5841])
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: conn( Unconnected -> Connecting )
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: conn( StandAlone -> Unconnected )
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: Starting receiver thread (from drbd_w_u12215 [5841])
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: Handshake to peer 10 successful: Agreed network protocol version 121
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: Starting ack_recv thread (from drbd_r_u12215 [5857])
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [5858])
Feb 18 00:20:23 nodeC kernel: drbd u12215: Preparing cluster-wide state change 447733307 (7->10 499/146)
Feb 18 00:20:23 nodeC kernel: drbd u12215/0 drbd854: size = 68 GB (71688192 KB)
Feb 18 00:20:23 nodeC kernel: drbd u12215: State change 447733307: primary_nodes=0, weak_nodes=0
Feb 18 00:20:23 nodeC kernel: drbd u12215: Committing cluster-wide state change 447733307 (25ms)
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeA: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 00:20:23 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
Feb 18 00:20:23 nodeC kernel: drbd u12215: Preparing cluster-wide state change 595176349 (7->11 499/146)
Feb 18 00:20:23 nodeC kernel: drbd u12215: State change 595176349: primary_nodes=0, weak_nodes=0
Feb 18 00:20:23 nodeC kernel: drbd u12215: Committing cluster-wide state change 595176349 (28ms)
Feb 18 00:20:23 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 00:20:23 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )

Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: sock was shut down by peer
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
Feb 18 00:20:44 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: ack_receiver terminated
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: Terminating ack_recv thread
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: Terminating sender thread
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: Starting sender thread (from drbd_r_u12215 [5857])
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: Connection closed
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: conn( BrokenPipe -> Unconnected )
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: Restarting receiver thread
Feb 18 00:20:44 nodeC kernel: drbd u12215 nodeA: conn( Unconnected -> Connecting )

Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: PingAck did not arrive in time.
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Feb 18 00:21:05 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: Terminating sender thread
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [5858])
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: Connection closed
Feb 18 00:21:05 nodeC kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 15032400+8
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: conn( NetworkFailure -> Unconnected )
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: Restarting receiver thread
Feb 18 00:21:05 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )

Feb 18 00:21:14 nodeC kernel: drbd u12215 nodeA: Handshake to peer 10 successful: Agreed network protocol version 121
Feb 18 00:21:14 nodeC kernel: drbd u12215 nodeA: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 00:21:14 nodeC kernel: drbd u12215 nodeA: Starting ack_recv thread (from drbd_r_u12215 [5857])
Feb 18 00:21:14 nodeC kernel: drbd u12215: Preparing cluster-wide state change 31959748 (7->10 499/146)
Feb 18 00:21:14 nodeC kernel: drbd u12215: State change 31959748: primary_nodes=0, weak_nodes=0
Feb 18 00:21:14 nodeC kernel: drbd u12215: Committing cluster-wide state change 31959748 (24ms)
Feb 18 00:21:14 nodeC kernel: drbd u12215 nodeA: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 00:21:14 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )

Feb 18 00:21:43 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 00:21:43 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 00:21:43 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [5858])
Feb 18 00:21:43 nodeC kernel: drbd u12215: Preparing cluster-wide state change 1385702553 (7->11 499/146)
Feb 18 00:21:43 nodeC kernel: drbd u12215: State change 1385702553: primary_nodes=0, weak_nodes=0
Feb 18 00:21:43 nodeC kernel: drbd u12215: Committing cluster-wide state change 1385702553 (28ms)
Feb 18 00:21:43 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 00:21:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )

Using write requests, the logic BUG appears:

Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: PingAck did not arrive in time.
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Feb 18 00:26:21 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: ack_receiver terminated
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: Terminating ack_recv thread
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: Terminating sender thread
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: Starting sender thread (from drbd_r_u12215 [5857])
Feb 18 00:26:21 nodeC kernel: drbd u12215/0 drbd854: sending new current UUID: DF9A66D16D3C4651
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: Connection closed
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: conn( NetworkFailure -> Unconnected )
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: Restarting receiver thread
Feb 18 00:26:21 nodeC kernel: drbd u12215 nodeA: conn( Unconnected -> Connecting )

Feb 18 00:26:31 nodeC kernel: drbd u12215 nodeB: Preparing remote state change 3466013136

Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: sock was shut down by peer
Feb 18 00:26:51 nodeC kernel: drbd u12215: susp-io( no -> no-disk)
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
Feb 18 00:26:51 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 00:26:51 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 00:26:51 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 00:26:51 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 00:26:51 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: Terminating sender thread
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [5858])
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: Aborting remote state change 3466013136 commit not possible
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: Connection closed
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: conn( BrokenPipe -> Unconnected )
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: Restarting receiver thread
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeA: Handshake to peer 10 successful: Agreed network protocol version 121
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeA: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeA: Starting ack_recv thread (from drbd_r_u12215 [5857])
Feb 18 00:26:51 nodeC kernel: drbd u12215: Preparing cluster-wide state change 2236164137 (7->10 499/145)
Feb 18 00:26:51 nodeC kernel: drbd u12215: State change 2236164137: primary_nodes=80, weak_nodes=FFFFFFFFFFFFFB7F
Feb 18 00:26:51 nodeC kernel: drbd u12215: Committing cluster-wide state change 2236164137 (48ms)
Feb 18 00:26:51 nodeC kernel: drbd u12215 nodeA: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 00:26:51 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( DUnknown -> Inconsistent ) repl( Off -> Established ) resync-susp( no -> peer )
Feb 18 00:26:51 nodeC kernel: drbd u12215: susp-io( no-disk -> no)

Feb 18 00:26:53 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( Inconsistent -> Outdated ) resync-susp( peer -> no )

Feb 18 00:27:01 nodeC kernel: drbd u12215: Two-phase commit 0 timeout

Feb 18 00:27:11 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 00:27:11 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 00:27:11 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [5858])
Feb 18 00:27:11 nodeC kernel: drbd u12215: Preparing cluster-wide state change 2438872345 (7->11 499/145)
Feb 18 00:27:11 nodeC kernel: drbd u12215: State change 2438872345: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
Feb 18 00:27:11 nodeC kernel: drbd u12215: Committing cluster-wide state change 2438872345 (23ms)
Feb 18 00:27:11 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 00:27:11 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated ) repl( Off -> Established )
Feb 18 00:27:11 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> UpToDate )

Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeB: Remote failed to finish a request within 87644ms > ko-count (7) * timeout (60 * 0.1s)
Feb 18 00:27:58 nodeC kernel: drbd u12215: susp-io( no -> no-disk)
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: conn( Connected -> Timeout ) peer( Secondary -> Unknown )
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Terminating sender thread
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [5858])
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Connection closed
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeA: Preparing remote state change 546097680
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: conn( Timeout -> Unconnected )
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Restarting receiver thread
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeA: Committing remote state change 546097680 (primary_nodes=80)
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated )
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [5858])
Feb 18 00:27:58 nodeC kernel: drbd u12215: Preparing cluster-wide state change 3267636032 (7->11 499/145)
Feb 18 00:27:58 nodeC kernel: drbd u12215: State change 3267636032: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
Feb 18 00:27:58 nodeC kernel: drbd u12215: Committing cluster-wide state change 3267636032 (28ms)
Feb 18 00:27:58 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeB: repl( Off -> Established )
Feb 18 00:27:58 nodeC kernel: drbd u12215: susp-io( no-disk -> no)
Feb 18 00:27:58 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> UpToDate )

Feb 18 00:28:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: Remote failed to finish a request within 132700ms > ko-count (7) * timeout (60 * 0.1s)
Feb 18 00:28:43 nodeC kernel: drbd u12215: susp-io( no -> no-disk)
Feb 18 00:28:43 nodeC kernel: drbd u12215 nodeB: conn( Connected -> Timeout ) peer( Secondary -> Unknown )
Feb 18 00:28:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 00:28:43 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 00:28:43 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 00:28:43 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 00:28:43 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 00:28:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 00:28:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 00:28:43 nodeC kernel: drbd u12215/0 drbd854: drbd_req_destroy: Logic BUG rq_state: 284000, completion_ref = 1

Feb 18 00:29:09 nodeC kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [drbd_s_u12215:7045]
Feb 18 00:29:09 nodeC kernel: Modules linked in: dm_mod vhost_net vhost vhost_iotlb tap tun drbd_transport_tcp(OE) drbd(OE) nfsv3 nfs_acl nfs lockd grace fscache netconsole ixgbe mdio dca mlx4_ib mlx4_en nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc mlx4_core nft_compat nft_counter nf_tables nfnetlink rpcrdma rdma_ucm
b_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad iw_cm ib_ipoib ib_cm ib_uverbs ib_core sunrpc vfat fat intel_rapl_msr iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct1
0dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel i2c_algo_bit rapl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops intel_cstate cdc_ether intel_uncore drm pcspkr usbnet joydev mii ipmi_ssif lpc_ich tg3 ipmi_si ipmi_devintf ipmi_msghandler mei_me mei wmi acpi_power_meter xfs libcrc32c raid1 sr_mod cdro
m sd_mod t10_pi sg ahci libahci crc32c_intel libata megaraid_sas
Feb 18 00:29:09 nodeC kernel: [last unloaded: drbd]
Feb 18 00:29:09 nodeC kernel: CPU: 0 PID: 7045 Comm: drbd_s_u12215 Tainted: G           OE    --------- -  - 4.18.0-348.12.2.el8_5.x86_64 #1
Feb 18 00:29:09 nodeC kernel: Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.13.0 05/14/2021
Feb 18 00:29:09 nodeC kernel: RIP: 0010:drbd_sender+0x208/0x3c0 [drbd]
Feb 18 00:29:09 nodeC kernel: Code: be 05 00 00 00 48 89 ef 48 8d 4c 24 10 e8 10 57 01 00 48 8b 53 10 b8 00 fe ff ff f0 0f c1 82 30 01 00 00 fb 66 0f 1f 44 00 00 <48> 83 7c 24 10 00 74 0d 48 8d 74 24 10 4c 89 ff e8 f3 55 01 00 48
Feb 18 00:29:09 nodeC kernel: RSP: 0018:ffffb97b59197e80 EFLAGS: 00000247 ORIG_RAX: ffffffffffffff13
Feb 18 00:29:09 nodeC kernel: RAX: 0000000000000200 RBX: ffffa12913dba000 RCX: 000000000000031c
Feb 18 00:29:09 nodeC kernel: RDX: ffffa12767670000 RSI: 0000000000284000 RDI: ffffa1268f5d7e78
Feb 18 00:29:09 nodeC kernel: RBP: ffffa1268f5d7cb8 R08: ffffa1268f5d7cce R09: 0000000000000000
Feb 18 00:29:09 nodeC kernel: R10: ffffa12913dba000 R11: ffffa12913dba050 R12: ffffa12913dba4f0
Feb 18 00:29:09 nodeC kernel: R13: ffffa1261f8aa000 R14: ffffa12913dba048 R15: ffffa127629da000
Feb 18 00:29:09 nodeC kernel: FS:  0000000000000000(0000) GS:ffffa1247f600000(0000) knlGS:0000000000000000
Feb 18 00:29:09 nodeC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 18 00:29:09 nodeC kernel: CR2: 000000c001da3010 CR3: 000000223b210003 CR4: 00000000003726f0
Feb 18 00:29:09 nodeC kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 18 00:29:09 nodeC kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 18 00:29:09 nodeC kernel: Call Trace:
Feb 18 00:29:09 nodeC kernel: ? __drbd_next_peer_device_ref+0x140/0x140 [drbd]
Feb 18 00:29:09 nodeC kernel: drbd_thread_setup+0x5e/0x160 [drbd]
Feb 18 00:29:09 nodeC kernel: ? __drbd_next_peer_device_ref+0x140/0x140 [drbd]
Feb 18 00:29:09 nodeC kernel: kthread+0x116/0x130
Feb 18 00:29:09 nodeC kernel: ? kthread_flush_work_fn+0x10/0x10
Feb 18 00:29:09 nodeC kernel: ret_from_fork+0x35/0x40


[root@nodeC log]# cat /sys/block/drbd854/inflight
       0        5


I can consistently reproduce these bugs.

@JoelColledge
Copy link
Contributor

Thanks for the report.

What are the results of this test with DRBD 9.1.5 and 9.0.32?

What load are you applying with fio?

@JoelColledge
Copy link
Contributor

For the record, rq_state: 284000 corresponds to RQ_WRITE|RQ_UNPLUG|RQ_COMPLETION_SUSP.

The disassembly around drbd_sender+0x208 (0x208==520) on this kernel is:

/drbd/drbd/drbd_sender.c:
3133			read_lock_irq(&connection->resource->state_rwlock);
   0x00000000000100fb <+459>:	mov    0x10(%rbx),%rax
   0x00000000000100ff <+463>:	lea    0x130(%rax),%rdi
   0x0000000000010106 <+470>:	callq  0x1010b <drbd_sender+475>

3134			__req_mod(req, SEND_CANCELED, peer_device, &m);
   0x000000000001010b <+475>:	mov    %r13,%rdx
   0x000000000001010e <+478>:	mov    $0x5,%esi
   0x0000000000010113 <+483>:	mov    %rbp,%rdi
   0x0000000000010116 <+486>:	lea    0x10(%rsp),%rcx
   0x000000000001011b <+491>:	callq  0x10120 <drbd_sender+496>

./include/asm-generic/qrwlock.h:
117
   0x0000000000010120 <+496>:	mov    0x10(%rbx),%rdx

./arch/x86/include/asm/atomic.h:
177
   0x0000000000010124 <+500>:	mov    $0xfffffe00,%eax
   0x0000000000010129 <+505>:	lock xadd %eax,0x130(%rdx)

./arch/x86/include/asm/paravirt.h:
804
   0x0000000000010131 <+513>:	callq  *0x0

/drbd/drbd/drbd_sender.c:
3136			if (m.bio)
   0x0000000000010138 <+520>:	cmpq   $0x0,0x10(%rsp)
   0x000000000001013e <+526>:	je     0x1014d <drbd_sender+541>

3137				complete_master_bio(device, &m);
   0x0000000000010140 <+528>:	lea    0x10(%rsp),%rsi
   0x0000000000010145 <+533>:	mov    %r15,%rdi
   0x0000000000010148 <+536>:	callq  0x1014d <drbd_sender+541>

@JoelColledge
Copy link
Contributor

One more test to perform would be blocking both directions with iptables. This is a particularly unusual test because the network connection is only broken in one direction. Try blocking both for comparison.

@koszik
Copy link
Author

koszik commented Feb 18, 2022

fio cmdline:

fio --ioengine=libaio --iodepth=4 --rw=randread --bs=4k --size=10g --numjobs=1 --direct=1 --sync=1 --name=x --filename=/dev/drbd854

(or randwrite, but that is going to ext4 inside a vm on this device)

I tried the test again with blocking both directions.

Read test resulted in the same outcome, but for the write test it's different, now there's an IO error reported to the application and a stuck request, but not the logic bug (this could be due to random chance, only tried it once):

Feb 18 14:18:06 nodeC kernel: drbd u12215 nodeA: PingAck did not arrive in time.
Feb 18 14:18:06 nodeC kernel: drbd u12215 nodeA: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Feb 18 14:18:06 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 14:18:06 nodeC kernel: drbd u12215 nodeA: ack_receiver terminated
Feb 18 14:18:06 nodeC kernel: drbd u12215 nodeA: Terminating ack_recv thread
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: Terminating sender thread
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: Starting sender thread (from drbd_r_u12215 [78220])
Feb 18 14:18:07 nodeC kernel: drbd u12215/0 drbd854: sending new current UUID: CF34C6AD6BADB163
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: Connection closed
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: conn( NetworkFailure -> Unconnected )
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: Restarting receiver thread
Feb 18 14:18:07 nodeC kernel: drbd u12215 nodeA: conn( Unconnected -> Connecting )

Feb 18 14:18:26 nodeC kernel: drbd u12215 nodeB: PingAck did not arrive in time.
Feb 18 14:18:26 nodeC kernel: drbd u12215: susp-io( no -> no-disk)
Feb 18 14:18:26 nodeC kernel: drbd u12215 nodeB: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Feb 18 14:18:26 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 14:18:26 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 14:18:26 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:18:26 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 14:18:26 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:18:26 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:18:26 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:18:26 nodeC kernel: drbd u12215 nodeB: Terminating sender thread
Feb 18 14:18:26 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [78221])
Feb 18 14:18:27 nodeC kernel: drbd u12215 nodeB: Connection closed
Feb 18 14:18:27 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
Feb 18 14:18:27 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 14:18:27 nodeC kernel: drbd u12215 nodeB: conn( NetworkFailure -> Unconnected )
Feb 18 14:18:27 nodeC kernel: drbd u12215 nodeB: Restarting receiver thread
Feb 18 14:18:27 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )

Feb 18 14:18:39 nodeC kernel: drbd u12215 nodeA: Handshake to peer 10 successful: Agreed network protocol version 121
Feb 18 14:18:39 nodeC kernel: drbd u12215 nodeA: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 14:18:39 nodeC kernel: drbd u12215 nodeA: Starting ack_recv thread (from drbd_r_u12215 [78220])
Feb 18 14:18:39 nodeC kernel: drbd u12215: Preparing cluster-wide state change 104547609 (7->10 499/145)
Feb 18 14:18:39 nodeC kernel: drbd u12215: State change 104547609: primary_nodes=80, weak_nodes=FFFFFFFFFFFFFB7F
Feb 18 14:18:39 nodeC kernel: drbd u12215: Committing cluster-wide state change 104547609 (24ms)
Feb 18 14:18:39 nodeC kernel: drbd u12215 nodeA: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 14:18:39 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( DUnknown -> Consistent ) repl( Off -> Established )
Feb 18 14:18:39 nodeC kernel: drbd u12215: susp-io( no-disk -> no)
Feb 18 14:18:39 nodeC kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 0+0
Feb 18 14:18:39 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( Consistent -> UpToDate )

Feb 18 14:18:59 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 14:18:59 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 14:18:59 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [78221])
Feb 18 14:19:00 nodeC kernel: drbd u12215: Preparing cluster-wide state change 1681449179 (7->11 499/145)
Feb 18 14:19:00 nodeC kernel: drbd u12215: State change 1681449179: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
Feb 18 14:19:00 nodeC kernel: drbd u12215: Committing cluster-wide state change 1681449179 (28ms)
Feb 18 14:19:00 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 14:19:00 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated ) repl( Off -> Established )
Feb 18 14:19:00 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> UpToDate )

Feb 18 14:19:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: Remote failed to finish a request within 87433ms > ko-count (7) * timeout (60 * 0.1s)
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: conn( Connected -> Timeout ) peer( Secondary -> Unknown )
Feb 18 14:19:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: Terminating sender thread
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [78221])
Feb 18 14:19:43 nodeC kernel: drbd u12215/0 drbd854: sending new current UUID: FE90848345F0258D
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: Connection closed
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeA: Preparing remote state change 2931506494
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: conn( Timeout -> Unconnected )
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: Restarting receiver thread
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )
Feb 18 14:19:43 nodeC kernel: drbd u12215 nodeA: Committing remote state change 2931506494 (primary_nodes=80)
Feb 18 14:19:43 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated )
Feb 18 14:19:44 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 14:19:44 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 14:19:44 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [78221])
Feb 18 14:19:44 nodeC kernel: drbd u12215: Preparing cluster-wide state change 3841230571 (7->11 499/145)
Feb 18 14:19:44 nodeC kernel: drbd u12215: State change 3841230571: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
Feb 18 14:19:44 nodeC kernel: drbd u12215: Committing cluster-wide state change 3841230571 (28ms)
Feb 18 14:19:44 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 14:19:44 nodeC kernel: drbd u12215/0 drbd854 nodeB: repl( Off -> Established )
Feb 18 14:19:44 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> Inconsistent ) resync-susp( no -> peer )
Feb 18 14:19:44 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( Inconsistent -> UpToDate ) resync-susp( peer -> no )

[root@nodeC ~]# cat /sys/block/drbd854/inflight
       1        0

Testing symmetric blocking with 9.1.5:

version: 9.1.5 (api:2/proto:110-121)
GIT-hash: f41bc23cec1e919932ce35947f1b4e67d9e6db74 build by mockbuild@, 2021-12-18 14:03:28
Transports (api:17): tcp (9.1.5)

Read: same as before, IO error as soon as the second connection times out.

Write: similar to 9.1.6, with IO error after nodeA reconnecting, but also the logic bug:

Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: PingAck did not arrive in time.
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Feb 18 14:44:33 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: ack_receiver terminated
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: Terminating ack_recv thread
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: Terminating sender thread
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: Starting sender thread (from drbd_r_u12215 [2970])
Feb 18 14:44:33 nodeC kernel: drbd u12215/0 drbd854: sending new current UUID: AB135B0DD32864CF
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: Connection closed
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: conn( NetworkFailure -> Unconnected )
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: Restarting receiver thread
Feb 18 14:44:33 nodeC kernel: drbd u12215 nodeA: conn( Unconnected -> Connecting )
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: PingAck did not arrive in time.
Feb 18 14:44:53 nodeC kernel: drbd u12215: susp-io( no -> no-disk)
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Feb 18 14:44:53 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )

Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 14:44:53 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 14:44:53 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:44:53 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:44:53 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: Terminating sender thread
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [2971])
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: Connection closed
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: conn( NetworkFailure -> Unconnected )
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: Restarting receiver thread
Feb 18 14:44:53 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )

Feb 18 14:45:03 nodeC kernel: drbd u12215 nodeA: Handshake to peer 10 successful: Agreed network protocol version 121
Feb 18 14:45:03 nodeC kernel: drbd u12215 nodeA: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 14:45:03 nodeC kernel: drbd u12215 nodeA: Starting ack_recv thread (from drbd_r_u12215 [2970])
Feb 18 14:45:03 nodeC kernel: drbd u12215: Preparing cluster-wide state change 2772152590 (7->10 499/145)
Feb 18 14:45:03 nodeC kernel: drbd u12215: State change 2772152590: primary_nodes=80, weak_nodes=FFFFFFFFFFFFFB7F
Feb 18 14:45:03 nodeC kernel: drbd u12215: Committing cluster-wide state change 2772152590 (76ms)
Feb 18 14:45:03 nodeC kernel: drbd u12215 nodeA: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 14:45:03 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( DUnknown -> Inconsistent ) repl( Off -> Established ) resync-susp( no -> peer )
Feb 18 14:45:03 nodeC kernel: drbd u12215: susp-io( no-disk -> no)
Feb 18 14:45:03 nodeC kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 4721080+24
Feb 18 14:45:03 nodeC kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 0+0

Feb 18 14:45:05 nodeC kernel: drbd u12215/0 drbd854 nodeA: pdsk( Inconsistent -> Outdated ) resync-susp( peer -> no )

Feb 18 14:45:25 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 14:45:25 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 14:45:25 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [2971])
Feb 18 14:45:25 nodeC kernel: drbd u12215: Preparing cluster-wide state change 1015297695 (7->11 499/145)
Feb 18 14:45:25 nodeC kernel: drbd u12215: State change 1015297695: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
Feb 18 14:45:25 nodeC kernel: drbd u12215: Committing cluster-wide state change 1015297695 (24ms)
Feb 18 14:45:25 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 14:45:25 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated ) repl( Off -> Established )
Feb 18 14:45:25 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> UpToDate )

Feb 18 14:46:11 nodeC kernel: drbd u12215/0 drbd854 nodeB: Remote failed to finish a request within 89175ms > ko-count (7) * timeout (60 * 0.1s)
Feb 18 14:46:11 nodeC kernel: drbd u12215: susp-io( no -> no-disk)
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: conn( Connected -> Timeout ) peer( Secondary -> Unknown )
Feb 18 14:46:11 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 14:46:11 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 14:46:11 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:46:11 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:46:11 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: Terminating sender thread
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [2971])
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: Connection closed
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeA: Preparing remote state change 3864336905
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: conn( Timeout -> Unconnected )
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: Restarting receiver thread
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )
Feb 18 14:46:11 nodeC kernel: drbd u12215 nodeA: Committing remote state change 3864336905 (primary_nodes=80)
Feb 18 14:46:11 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated )

Feb 18 14:46:12 nodeC kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
Feb 18 14:46:12 nodeC kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Feb 18 14:46:12 nodeC kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [2971])
Feb 18 14:46:12 nodeC kernel: drbd u12215: Preparing cluster-wide state change 3509072553 (7->11 499/145)
Feb 18 14:46:12 nodeC kernel: drbd u12215: State change 3509072553: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
Feb 18 14:46:12 nodeC kernel: drbd u12215: Committing cluster-wide state change 3509072553 (28ms)
Feb 18 14:46:12 nodeC kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 18 14:46:12 nodeC kernel: drbd u12215/0 drbd854 nodeB: repl( Off -> Established )
Feb 18 14:46:12 nodeC kernel: drbd u12215: susp-io( no-disk -> no)
Feb 18 14:46:12 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> UpToDate )

Feb 18 14:46:56 nodeC kernel: drbd u12215/0 drbd854 nodeB: Remote failed to finish a request within 134231ms > ko-count (7) * timeout (60 * 0.1s)
Feb 18 14:46:56 nodeC kernel: drbd u12215: susp-io( no -> no-disk)
Feb 18 14:46:56 nodeC kernel: drbd u12215 nodeB: conn( Connected -> Timeout ) peer( Secondary -> Unknown )
Feb 18 14:46:56 nodeC kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 18 14:46:56 nodeC kernel: drbd u12215 nodeB: ack_receiver terminated
Feb 18 14:46:56 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:46:56 nodeC kernel: drbd u12215 nodeB: Terminating ack_recv thread
Feb 18 14:46:56 nodeC kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:46:56 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
Feb 18 14:46:56 nodeC kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
Feb 18 14:46:56 nodeC kernel: drbd u12215/0 drbd854: drbd_req_destroy: Logic BUG rq_state: 284000, completion_ref = 1

Feb 18 14:47:24 nodeC kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [drbd_s_u12215:4007]
Feb 18 14:47:24 nodeC kernel: Modules linked in: dm_mod vhost_net vhost vhost_iotlb tap tun drbd_transport_tcp(OE) drbd(OE) nfsv3 nfs_acl nfs lockd grace fscache netconsole ixgbe mdio dca mlx4_ib mlx4_en nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc mlx4_core nft_compat nft_counter nf_tables nfnetlink rpcrdma rdma_ucm
b_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core sunrpc vfat fat intel_rapl_msr iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct1
0dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel i2c_algo_bit rapl drm_kms_helper intel_cstate syscopyarea intel_uncore sysfillrect cdc_ether sysimgblt pcspkr fb_sys_fops usbnet drm mii ipmi_ssif lpc_ich tg3 ipmi_si ipmi_devintf ipmi_msghandler mei_me mei wmi acpi_power_meter xfs libcrc32c raid1 sr_mod cdrom sd_mo
d t10_pi sg ahci libahci crc32c_intel libata megaraid_sas
Feb 18 14:47:24 nodeC kernel: [last unloaded: mlx4_core]
Feb 18 14:47:24 nodeC kernel: CPU: 0 PID: 4007 Comm: drbd_s_u12215 Tainted: G           OE    --------- -  - 4.18.0-348.12.2.el8_5.x86_64 #1
Feb 18 14:47:24 nodeC kernel: Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.13.0 05/14/2021
Feb 18 14:47:24 nodeC kernel: RIP: 0010:drbd_sender+0x208/0x3c0 [drbd]
Feb 18 14:47:24 nodeC kernel: Code: be 07 00 00 00 48 89 ef 48 8d 4c 24 10 e8 70 57 01 00 48 8b 53 10 b8 00 fe ff ff f0 0f c1 82 30 01 00 00 fb 66 0f 1f 44 00 00 <48> 83 7c 24 10 00 74 0d 48 8d 74 24 10 4c 89 ff e8 53 56 01 00 48
Feb 18 14:47:24 nodeC kernel: RSP: 0018:ffffb00f5a7ffe80 EFLAGS: 00000247 ORIG_RAX: ffffffffffffff13
Feb 18 14:47:24 nodeC kernel: RAX: 0000000000000200 RBX: ffffa0a6d63c9000 RCX: 000000000000031c
Feb 18 14:47:24 nodeC kernel: RDX: ffffa0a8f82fa800 RSI: 0000000000284000 RDI: ffffa0aaab9f9280
Feb 18 14:47:24 nodeC kernel: RBP: ffffa0aaab9f90c0 R08: ffffa0aaab9f90d6 R09: 0000000000000000
Feb 18 14:47:24 nodeC kernel: R10: ffffa0a6d63c9000 R11: ffffa0a6d63c9050 R12: ffffa0a6d63c94f0
Feb 18 14:47:24 nodeC kernel: R13: ffffa0a8f82f9000 R14: ffffa0a6d63c9048 R15: ffffa0a6d02cd800
Feb 18 14:47:24 nodeC kernel: FS:  0000000000000000(0000) GS:ffffa0a53f600000(0000) knlGS:0000000000000000
Feb 18 14:47:24 nodeC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 18 14:47:24 nodeC kernel: CR2: 000000c00218d000 CR3: 0000007ffb610006 CR4: 00000000003726f0
Feb 18 14:47:24 nodeC kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 18 14:47:24 nodeC kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 18 14:47:24 nodeC kernel: Call Trace:
Feb 18 14:47:24 nodeC kernel: ? __drbd_next_peer_device_ref+0x140/0x140 [drbd]
Feb 18 14:47:24 nodeC kernel: drbd_thread_setup+0x5e/0x160 [drbd]
Feb 18 14:47:24 nodeC kernel: ? __drbd_next_peer_device_ref+0x140/0x140 [drbd]
Feb 18 14:47:24 nodeC kernel: kthread+0x116/0x130
Feb 18 14:47:24 nodeC kernel: ? kthread_flush_work_fn+0x10/0x10
Feb 18 14:47:24 nodeC kernel: ret_from_fork+0x35/0x40

With 9.0.32 (only on nodeC, A/B are still 9.1.6):

Read: as before, IO error on second disconnect

Write: results in a kernel crash instead of complaining of logic bugs:

version: 9.0.32-1 (api:2/proto:86-121)
GIT-hash: 7d2933d5a3764fcc5e0bf54b71fd9cfb0363be1a build by root@nodeC, 2022-02-18 14:55:18
Transports (api:17): tcp (9.0.32-1)


2022-02-18 14:59:13 err kernel: drbd u12215 nodeA: PingAck did not arrive in time.
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
2022-02-18 14:59:13 info kernel: drbd u12215/0 drbd854 nodeA: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: ack_receiver terminated
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: Terminating ack_recv thread
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: Terminating sender thread
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: Starting sender thread (from drbd_r_u12215 [5386])
2022-02-18 14:59:13 info kernel: drbd u12215/0 drbd854: sending new current UUID: 762953691B1D87C3
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: Connection closed
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: helper command: /sbin/drbdadm disconnected exit code 0
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: conn( NetworkFailure -> Unconnected )
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: Restarting receiver thread
2022-02-18 14:59:13 info kernel: drbd u12215 nodeA: conn( Unconnected -> Connecting )

2022-02-18 14:59:33 err kernel: drbd u12215 nodeB: PingAck did not arrive in time.
2022-02-18 14:59:33 info kernel: drbd u12215: susp-io( no -> no-disk)
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
2022-02-18 14:59:33 info kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: ack_receiver terminated
2022-02-18 14:59:33 info kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: Terminating ack_recv thread
2022-02-18 14:59:33 info kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
2022-02-18 14:59:33 info kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
2022-02-18 14:59:33 info kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: Terminating sender thread
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [5387])
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: Connection closed
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: conn( NetworkFailure -> Unconnected )
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: Restarting receiver thread
2022-02-18 14:59:33 info kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )

2022-02-18 14:59:44 info kernel: drbd u12215 nodeA: Handshake to peer 10 successful: Agreed network protocol version 121
2022-02-18 14:59:44 info kernel: drbd u12215 nodeA: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
2022-02-18 14:59:44 info kernel: drbd u12215 nodeA: Starting ack_recv thread (from drbd_r_u12215 [5386])
2022-02-18 14:59:44 info kernel: drbd u12215: Preparing cluster-wide state change 2786458399 (7->10 499/145)
2022-02-18 14:59:44 info kernel: drbd u12215: State change 2786458399: primary_nodes=80, weak_nodes=FFFFFFFFFFFFFB7F
2022-02-18 14:59:44 info kernel: drbd u12215: Committing cluster-wide state change 2786458399 (93ms)
2022-02-18 14:59:44 info kernel: drbd u12215 nodeA: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
2022-02-18 14:59:44 info kernel: drbd u12215/0 drbd854 nodeA: pdsk( DUnknown -> Inconsistent ) repl( Off -> Established ) resync-susp( no -> peer )
2022-02-18 14:59:44 info kernel: drbd u12215: susp-io( no-disk -> no)
2022-02-18 14:59:44 err kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 4721080+24
2022-02-18 14:59:44 err kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 4721080+8
2022-02-18 14:59:44 err kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 829328+32
2022-02-18 14:59:44 err kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 829328+8
2022-02-18 14:59:44 err kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 19657200+16
2022-02-18 14:59:44 err kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 19657200+16

2022-02-18 14:59:45 info kernel: drbd u12215/0 drbd854 nodeA: pdsk( Inconsistent -> Outdated ) resync-susp( peer -> no )

2022-02-18 14:59:47 err kernel: drbd u12215/0 drbd854: IO ERROR: neither local nor remote data, sector 534299+3

2022-02-18 15:00:06 info kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
2022-02-18 15:00:06 info kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
2022-02-18 15:00:06 info kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [5387])
2022-02-18 15:00:06 info kernel: drbd u12215: Preparing cluster-wide state change 4096656875 (7->11 499/145)
2022-02-18 15:00:06 info kernel: drbd u12215: State change 4096656875: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
2022-02-18 15:00:06 info kernel: drbd u12215: Committing cluster-wide state change 4096656875 (24ms)
2022-02-18 15:00:06 info kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
2022-02-18 15:00:06 info kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated ) repl( Off -> Established )
2022-02-18 15:00:06 info kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> UpToDate )

2022-02-18 15:00:52 warning kernel: drbd u12215/0 drbd854: Remote failed to finish a request within 89864ms > ko-count (7) * timeout (60 * 0.1s)
2022-02-18 15:00:52 info kernel: drbd u12215: susp-io( no -> no-disk)
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: conn( Connected -> Timeout ) peer( Secondary -> Unknown )
2022-02-18 15:00:52 info kernel: drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: ack_receiver terminated
2022-02-18 15:00:52 info kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: Terminating ack_recv thread
2022-02-18 15:00:52 info kernel: drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
2022-02-18 15:00:52 info kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
2022-02-18 15:00:52 info kernel: drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: Terminating sender thread
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: Starting sender thread (from drbd_r_u12215 [5387])
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: Connection closed
2022-02-18 15:00:52 info kernel: drbd u12215 nodeA: Preparing remote state change 694803547
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: helper command: /sbin/drbdadm disconnected exit code 0
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: conn( Timeout -> Unconnected )
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: Restarting receiver thread
2022-02-18 15:00:52 info kernel: drbd u12215 nodeB: conn( Unconnected -> Connecting )
2022-02-18 15:00:52 info kernel: drbd u12215 nodeA: Committing remote state change 694803547 (primary_nodes=80)
2022-02-18 15:00:52 info kernel: drbd u12215/0 drbd854 nodeB: pdsk( DUnknown -> Outdated )

2022-02-18 15:00:53 info kernel: drbd u12215 nodeB: Handshake to peer 11 successful: Agreed network protocol version 121
2022-02-18 15:00:53 info kernel: drbd u12215 nodeB: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
2022-02-18 15:00:53 info kernel: drbd u12215 nodeB: Starting ack_recv thread (from drbd_r_u12215 [5387])
2022-02-18 15:00:53 info kernel: drbd u12215: Preparing cluster-wide state change 268496490 (7->11 499/145)
2022-02-18 15:00:53 info kernel: drbd u12215: State change 268496490: primary_nodes=80, weak_nodes=FFFFFFFFFFFFF37F
2022-02-18 15:00:53 info kernel: drbd u12215: Committing cluster-wide state change 268496490 (24ms)
2022-02-18 15:00:53 info kernel: drbd u12215 nodeB: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
2022-02-18 15:00:53 info kernel: drbd u12215/0 drbd854 nodeB: repl( Off -> Established )
2022-02-18 15:00:53 info kernel: drbd u12215: susp-io( no-disk -> no)
2022-02-18 15:00:53 info kernel: drbd u12215/0 drbd854 nodeB: pdsk( Outdated -> UpToDate )

2022-02-18 15:01:37 notice [  578.543260] drbd u12215/0 drbd854: Remote failed to finish a request within 134920ms > ko-count (7) * timeout (60 * 0.1s)
2022-02-18 15:01:37 notice [  578.544168] drbd u12215: susp-io( no -> no-disk)
2022-02-18 15:01:37 notice [  578.544720] drbd u12215 nodeB: conn( Connected -> Timeout ) peer( Secondary -> Unknown )
2022-02-18 15:01:37 notice [  578.545286] drbd u12215/0 drbd854 nodeB: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
2022-02-18 15:01:37 notice [  578.545912] drbd u12215 nodeB: ack_receiver terminated
2022-02-18 15:01:37 notice [  578.546170] drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr
2022-02-18 15:01:37 notice [  578.546542] drbd u12215 nodeB: Terminating ack_recv thread
2022-02-18 15:01:37 notice [  578.550764] drbd u12215/0 drbd854 nodeA: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
2022-02-18 15:01:37 notice [  578.551801] drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr
2022-02-18 15:01:37 notice [  578.554057] drbd u12215/0 drbd854 nodeB: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0
2022-02-18 15:01:37 notice [  578.565163] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
2022-02-18 15:01:37 notice [  578.565772] PGD 0 P4D 0
2022-02-18 15:01:37 notice [  578.566340] Oops: 0000 [#1] SMP PTI
2022-02-18 15:01:37 notice [  578.566908] CPU: 0 PID: 6261 Comm: drbd_s_u12215 Tainted: G           OE    --------- -  - 4.18.0-348.12.2.el8_5.x86_64 #1
2022-02-18 15:01:37 notice [  578.567564] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.13.0 05/14/2021
2022-02-18 15:01:37 notice [  578.568137] RIP: 0010:tl_next_request_for_connection+0x122/0x350 [drbd]
2022-02-18 15:01:37 notice [  578.568748] Code: 85 db 48 0f 44 dd 49 8b 44 24 10 48 8b 55 58 48 05 48 01 00 00 48 8d 6a a8 48 39 c2 0f 84 25 01 00 00 4c 8b 75 00 48 8b 3c 24 <49> 63 b6 80 00 00 00 e8 12 4a 1b d2 4c 63 88 84 00 00 00 49 89 c7
2022-02-18 15:01:37 notice [  578.569928] RSP: 0018:ffffb5eed9943db0 EFLAGS: 00010087
2022-02-18 15:01:37 notice [  578.570544] RAX: ffff8dbc49799148 RBX: ffff8db869166b50 RCX: 0000000000000000
2022-02-18 15:01:37 notice [  578.571106] RDX: ffff8db869166ba8 RSI: 0000000000000000 RDI: ffff8db819a07048
2022-02-18 15:01:37 notice [  578.571722] RBP: ffff8db869166b50 R08: ffff8db67f61ab80 R09: 000000000000000b
2022-02-18 15:01:37 notice [  578.572301] R10: 0000000000000000 R11: 0000000000008000 R12: ffff8db819a07000
2022-02-18 15:01:37 notice [  578.572889] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8db92aa38800
2022-02-18 15:01:37 notice [  578.573474] FS:  0000000000000000(0000) GS:ffff8db67f600000(0000) knlGS:0000000000000000
2022-02-18 15:01:37 notice [  578.574047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-02-18 15:01:37 notice [  578.574661] CR2: 0000000000000080 CR3: 000000b6c0810003 CR4: 00000000003726f0
2022-02-18 15:01:37 notice [  578.575216] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2022-02-18 15:01:37 notice [  578.575824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2022-02-18 15:01:37 notice [  578.576400] Call Trace:
2022-02-18 15:01:37 notice [  578.576976]  check_sender_todo+0x16/0xb0 [drbd]
2022-02-18 15:01:37 notice [  578.577621]  wait_for_sender_todo+0xde/0x290 [drbd]
2022-02-18 15:01:37 notice [  578.578171]  ? finish_wait+0x80/0x80
2022-02-18 15:01:37 notice [  578.578775]  drbd_sender+0x301/0x3f0 [drbd]
2022-02-18 15:01:37 notice [  578.579352]  ? __drbd_next_peer_device_ref+0x140/0x140 [drbd]
2022-02-18 15:01:37 notice [  578.579944]  drbd_thread_setup+0x5e/0x160 [drbd]
2022-02-18 15:01:37 notice [  578.580554]  ? __drbd_next_peer_device_ref+0x140/0x140 [drbd]
2022-02-18 15:01:37 notice [  578.581102]  kthread+0x116/0x130
2022-02-18 15:01:37 notice [  578.581709]  ? kthread_flush_work_fn+0x10/0x10
2022-02-18 15:01:37 notice [  578.582258]  ret_from_fork+0x35/0x40
2022-02-18 15:01:37 notice [  578.588274] CR2: 0000000000000080
2022-02-18 15:01:37 notice [  578.588909] ---[ end trace 0120c3e98d62a293 ]---
2022-02-18 15:01:37 notice [  578.632554] RIP: 0010:tl_next_request_for_connection+0x122/0x350 [drbd]
2022-02-18 15:01:37 notice [  578.633316] Code: 85 db 48 0f 44 dd 49 8b 44 24 10 48 8b 55 58 48 05 48 01 00 00 48 8d 6a a8 48 39 c2 0f 84 25 01 00 00 4c 8b 75 00 48 8b 3c 24 <49> 63 b6 80 00 00 00 e8 12 4a 1b d2 4c 63 88 84 00 00 00 49 89 c7
2022-02-18 15:01:37 notice [  578.634634] RSP: 0018:ffffb5eed9943db0 EFLAGS: 00010087
2022-02-18 15:01:37 notice [  578.635316] RAX: ffff8dbc49799148 RBX: ffff8db869166b50 RCX: 0000000000000000
2022-02-18 15:01:37 notice [  578.635974] RDX: ffff8db869166ba8 RSI: 0000000000000000 RDI: ffff8db819a07048
2022-02-18 15:01:37 notice [  578.636658] RBP: ffff8db869166b50 R08: ffff8db67f61ab80 R09: 000000000000000b
2022-02-18 15:01:37 notice [  578.637278] R10: 0000000000000000 R11: 0000000000008000 R12: ffff8db819a07000
2022-02-18 15:01:37 notice [  578.637933] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8db92aa38800
2022-02-18 15:01:37 notice [  578.638632] FS:  0000000000000000(0000) GS:ffff8db67f600000(0000) knlGS:0000000000000000
2022-02-18 15:01:37 notice [  578.639292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-02-18 15:01:37 notice [  578.639909] CR2: 0000000000000080 CR3: 000000b6c0810003 CR4: 00000000003726f0
2022-02-18 15:01:37 notice [  578.640587] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2022-02-18 15:01:37 notice [  578.641155] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2022-02-18 15:01:37 notice [  578.641766] Kernel panic - not syncing: Fatal exception
2022-02-18 15:01:37 notice [  578.734113] Kernel Offset: 0x11a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2022-02-18 15:01:37 notice [  578.777494] ---[ end Kernel panic - not syncing: Fatal exception ]---

@koszik
Copy link
Author

koszik commented Mar 4, 2022

I would like to finish evaluating diskless mode, but unfortunately haven't gotten past the first test. Do you need any help fixing the bugs?

@JoelColledge
Copy link
Contributor

Regarding the errors on read, it appears that on-no-data-accessible suspend-io only applies when the resource is Primary. It is possible to access DRBD read-only without promoting to Primary. Try the test again with an explicit drbdadm primary ... first.

Regarding the write errors, we have applied some fixes to the drbd-9.1 branch just recently that relate to the area of suspending and resuming IO. Please try the current drbd-9.1 branch.

Also, you probably want to enable quorum, since you are configuring on-no-quorum. Set the quorum option too.

@JoelColledge
Copy link
Contributor

@koszik Please test again with drbd-9.1.7. Does the bug still occur with the write test?

@JoelColledge
Copy link
Contributor

Given that there has been no information to the contrary, I am assuming that this is fixed in drbd-9.1.7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants