Correct handle broken logout session when iscsid restarted #339

wenchao-hao · 2022-04-13T12:47:20Z

Refer to #338 for more details.

wenchao-hao · 2022-04-14T07:59:06Z

I tested this with some changes in iscsid (stopping iscsid after session_unbind succeeds) and with a kernel that has these patches applied:

https://lore.kernel.org/linux-scsi/20220414014947.4168447-1-haowenchao@huawei.com/

The following is a brief log of the test:

If the session is in the UNBOUND state, iscsid shuts it down in iscsi_sync_session during startup.

root@fedora ~:# systemctl status iscsid 
× iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; disabled; vendor preset: disabled)
     Active: failed (Result: signal) since Thu 2022-04-14 04:28:10 EDT; 4min 12s ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
    Process: 10574 ExecStart=/sbin/iscsid -f (code=killed, signal=KILL)
   Main PID: 10574 (code=killed, signal=KILL)
     Status: "Ready to process requests"
        CPU: 45ms

Apr 14 04:28:04 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:05 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:06 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:07 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:08 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:09 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:10 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:10 fedora systemd[1]: iscsid.service: Main process exited, code=killed, status=9/KILL
Apr 14 04:28:10 fedora systemd[1]: iscsid.service: Failed with result 'signal'.
Apr 14 04:28:10 fedora systemd[1]: Stopped Open-iSCSI.
root@fedora ~:# systemctl start iscsid 
root@fedora ~:# systemctl status iscsid 
● iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; disabled; vendor preset: disabled)
     Active: active (running) since Thu 2022-04-14 04:32:33 EDT; 1s ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
   Main PID: 10659 (iscsid)
     Status: "Ready to process requests"
      Tasks: 1 (limit: 38062)
     Memory: 1.8M
        CPU: 21ms
     CGroup: /system.slice/iscsid.service
             └─10659 /sbin/iscsid -f

Apr 14 04:32:32 fedora systemd[1]: Starting Open-iSCSI...
Apr 14 04:32:32 fedora iscsid[10659]: iscsid: Shutdown unbound session3
Apr 14 04:32:32 fedora iscsid[10659]: iscsid: Connection3:0 to [target: iqn.2003-01.org.linux-iscsi.localhost.test:sn.12345678abcd, portal: 9.84.111.6,3260] through [iface: default] is shutdown.
Apr 14 04:32:33 fedora systemd[1]: Started Open-iSCSI.
root@fedora ~:# iscsiadm -m session
iscsiadm: No active sessions.
root@fedora ~:# ls /sys/class/iscsi_session/

mikechristie · 2022-04-20T16:09:03Z

usr/iscsid.c

+
+	if (!strcmp(session_state, "FREE")) {
+		log_error("Skip syncing a freed session%d", info->sid);
+		return 0;


The FREE state doesn't always mean we are freeing the session during removal. It means the session is no longer valid. This could be because it is being removed or the replacement/recovery timeout fired (see session_recovery_timedout in scsi_transport_iscsi.c). For the latter, if we re-login we will re-use the session struct. So if iscsid was restarted for the recovery timed out case we would still want to try to relogin/sync when iscsid starts up again.

I think you need to check if there is also a connection in sysfs (/sys/class/iscsi_connection). If there is no connection, then we were in FREE because it's being removed.
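For illustration only (not code from this PR), a minimal sketch of such a sysfs check could look like the following. The function names are hypothetical, and the "connection<sid>:<cid>" entry naming is an assumption based on the "connection1:0" log lines elsewhere in this thread.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if a directory entry name such as "connection3:0" belongs to
 * session `sid`, else 0. The "connection<sid>:<cid>" naming is an
 * assumption, not something confirmed by this thread.
 */
int conn_name_matches_sid(const char *name, int sid)
{
	char prefix[32];
	int len = snprintf(prefix, sizeof(prefix), "connection%d:", sid);

	if (len < 0 || (size_t)len >= sizeof(prefix))
		return 0;
	/* The trailing ':' keeps sid 3 from matching "connection31:0". */
	return strncmp(name, prefix, (size_t)len) == 0;
}

/*
 * Scans class_dir (for example "/sys/class/iscsi_connection") for any
 * connection that belongs to session `sid`.
 * Returns 1 if one is found, 0 if none, -1 if the directory cannot be opened.
 */
int session_has_conn(const char *class_dir, int sid)
{
	DIR *dir = opendir(class_dir);
	struct dirent *ent;
	int found = 0;

	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		if (conn_name_matches_sid(ent->d_name, sid)) {
			found = 1;
			break;
		}
	}
	closedir(dir);
	return found;
}
```

A caller would treat a FREE session with session_has_conn() == 0 as one that is being removed, and a FREE session with a connection as a candidate for relogin.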

The problem above is that we've looped over sysfs and gathered the session and connection data, so we would expect that we would only get to this point if there was a valid connection. However, iscsi_sysfs_get_sessioninfo_by_id looks like it has a bug where it assumes that if we have a session then there will always be a connection, so if those sysfs_get_str(ISCSI_CONN_SUBSYS) calls fail, we just ignore it more or less so we can end up running sync_session.

So I think we need 3 things:

1. In initiator.c iscsi_sync_session we could add some code that checks whether there is a connection and a session. If there is only a session and it's in FREE, then do like you did above and just return because it's being freed in the kernel.

2. If there is a connection and its state is DOWN, then something was removing the session/connection but iscsid failed when we were removing the connection. Don't sync. Just finish the removal. If the conn state is DOWN then stop conn with STOP_CONN_TERM was already called. So we just need to do destroy_conn then destroy_session.

3. Fix up iscsi_sysfs_get_sessioninfo_by_id so it handles the case where the connection has been removed but the session is still there. Since for the above case we want iscsid to clean up, we can't just return an error and ignore the session. I'm not sure what we want to do here. Maybe just make sure the address/port are set to invalid values so the caller can figure out what to do?
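The first two points can be sketched as a small decision function. This is an illustrative sketch only, not the PR's implementation: the enum, the function name, and the convention of passing NULL for "no connection" are all hypothetical, and the exact spelling of the connection state string in sysfs is an assumption (hence the case-insensitive compare).

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical outcomes for a session found at iscsid startup. */
enum sync_action {
	SYNC_SESSION,	/* normal path: sync and try to relogin */
	SKIP_SESSION,	/* the kernel is already freeing the session */
	FINISH_REMOVAL	/* destroy_conn, then destroy_session */
};

/*
 * session_state: session state string as read from sysfs, e.g. "FREE".
 * conn_state: connection state string, or NULL if the session has no
 * connection in sysfs.
 */
enum sync_action decide_sync_action(const char *session_state,
				    const char *conn_state)
{
	if (conn_state == NULL) {
		/* Only a session and it is FREE: it is being removed. */
		if (strcmp(session_state, "FREE") == 0)
			return SKIP_SESSION;
		/* Not covered by the comment above; left on the existing path. */
		return SYNC_SESSION;
	}

	/* stop_conn with STOP_CONN_TERM already ran: just finish removal. */
	if (strcasecmp(conn_state, "down") == 0)
		return FINISH_REMOVAL;

	/* Includes FREE with a live connection (recovery timed out). */
	return SYNC_SESSION;
}
```

Note that a FREE session that still has a connection falls through to SYNC_SESSION, which matches the recovery-timeout case described earlier in this review.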

mikechristie · 2022-04-20T16:17:29Z

usr/initiator.c

+
+	if (target_unbound) {
+		log_error("Shutdown unbound session%d", sid);
+		session_conn_shutdown(&session->conn[0], qtask, 0);


The conn->state will be ISCSI_CONN_STATE_CLEANUP_WAIT from sync_conn() right?

If so, the conn could have been in the logged in state when the unbind was running and iscsid stopped. So we still need to do a stop_conn call so the kernel can run libiscsi.c:iscsi_conn_stop.

I think we can check the conn kernel state to detect if it's in the DOWN state already (scsi_transport_iscsi.c:iscsi_stop_conn sets that state when userspace has called into the kernel to stop a conn for removal).

wenchao-hao · 2022-04-21

> The conn->state will be ISCSI_CONN_STATE_CLEANUP_WAIT from sync_conn() right?

Yes, it's now in the ISCSI_CONN_STATE_CLEANUP_WAIT state.

> If so, the conn could have been in the logged in state when the unbind was running and iscsid stopped. So we still need to do a stop_conn call so the kernel can run libiscsi.c:iscsi_conn_stop.

Yes, we need to check it and decide whether stop_conn is necessary.
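A sketch of that check, for illustration only: the struct and function names are hypothetical, and the sysfs spelling of the DOWN state is an assumption, so the comparison ignores case.

```c
#include <stddef.h>
#include <strings.h>

/* Hypothetical list of teardown steps for an unbound session. */
struct teardown_plan {
	int need_stop_conn;		/* stop_conn with STOP_CONN_TERM */
	int need_destroy_conn;
	int need_destroy_session;
};

/*
 * kernel_conn_state: the connection state reported by the kernel, or NULL
 * if it could not be read. If the kernel already reports DOWN, stop_conn
 * was called before iscsid died and only the destroy steps remain.
 */
struct teardown_plan plan_unbound_teardown(const char *kernel_conn_state)
{
	struct teardown_plan plan = { 1, 1, 1 };

	if (kernel_conn_state && strcasecmp(kernel_conn_state, "down") == 0)
		plan.need_stop_conn = 0;
	return plan;
}
```

When the state is unknown the sketch keeps the stop_conn step, on the reasoning that the connection may still have been logged in when iscsid stopped.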

mikechristie · 2022-04-20T16:22:09Z

usr/initiator.c

@@ -1149,6 +1149,9 @@ static void iscsi_stop(void *data)

 	iscsi_ev_context_put(ev_context);



Could you add a comment about how we would hit this? Since you are fixing it in the kernel so we don't send 2 events, if someone later looks at the userspace and current kernel code, they will be confused and think the check is not needed.

Just add something like:

When using async destroy session and older kernels we might get 2 session unbind events. If we are already in the IN_LOGOUT state then we know we've already got one, so we can ignore the second.

wenchao-hao · 2022-04-21

> Could you add a comment about how we would hit this? Since you are fixing it in the kernel so we don't send 2 events, if someone later looks at the userspace and current kernel code, they will be confused and think the check is not needed.
>
> Just add something like:
>
> When using async destroy session and older kernels we might get 2 session unbind events. If we are already in the IN_LOGOUT state then we know we've already got one, so we can ignore the second.

Of course, I will add it.

gonzoleeman · 2022-06-29T16:39:37Z

What's the status of this pull request? Sorry, I didn't take time to re-read the whole thread, but now that Mike has submitted a few changes upstream, is this still an issue?

mikechristie · 2022-06-29T17:23:57Z

It's not related to any patches upstream so it's still needed.

wenchao-hao · 2022-06-30T14:58:07Z

> What's the status of this pull request? Sorry, I didn't take time to re-read the whole thread, but now that Mike has submitted a few changes upstream, is this still an issue?

Apologies for my long absence. I was too busy to submit changes over the last two months. It looks like I will have time to work on it next month.

gonzoleeman · 2022-07-07T18:55:04Z

> What's the status of this pull request? Sorry, I didn't take time to re-read the whole thread, but now that Mike has submitted a few changes upstream, is this still an issue?
>
> Apologies for my long absence. I was too busy to submit changes over the last two months. It looks like I will have time to work on it next month.

No problem at all. I understand being busy. I just wanted to clean this request up if it was no longer needed.

wenchao-hao · 2022-08-01T14:40:19Z

@mikechristie @gonzoleeman Hi Mike and Lee, I found some other scenarios which may leave an iSCSI session in an invalid state, and I am trying to figure them out. This PR will be updated in the next few days.

wenchao-hao · 2022-08-01T17:17:05Z

> @mikechristie @gonzoleeman Hi Mike and Lee, I found some other scenarios which may leave an iSCSI session in an invalid state, and I am trying to figure them out. This PR will be updated in the next few days.

Updated.

wenchao-hao · 2022-08-14T16:53:26Z

I updated this PR to go with the kernel patch: https://lore.kernel.org/all/20220802034729.2566787-1-haowenchao@huawei.com/

On a kernel without the target_state patch, iscsid shows the following when it restarts:

root@fedora /code/open-iscsi:# ./usr/iscsid -f
iscsid: failed to open session's target state No such file or directory
iscsid: connection1:0 is operational after recovery (1 attempts)

On a kernel with the target_state patch applied, the output when iscsid restarts depends on the session's target_state.

If the target_state is UNBINDING, it shows:

root@fedora /code/open-iscsi:# ./usr/iscsid -f
Donot sync UNBINDING session3
iscsid: connection3:0 Could not send logout pdu(Invalid argument) from iscsi_stop. Dropping session
iscsid: Connection3:0 to [target: iqn.1994-05.com.redhat:e1d3c4cb3d4, portal: 192.168.1.12,3260] through [iface: default] is shutdown.

If the target_state is UNBOUND, it shows:

iscsid: Shutdown UNBOUND session1
iscsid: Connection1:0 to [target: iqn.1994-05.com.redhat:e1d3c4cb3d4, portal: 192.168.1.12,3260] through [iface: default] is shutdown.

mikechristie · 2022-08-15T15:55:52Z

Hey wenchao-hao,

Thanks for your work on this. I'm on vacation for a couple weeks. I'll be able to look at this and the kernel patch more when I get back.

Wenchao Hao · 2022-11-08 (commit messages)

Once ISCSI_KEVENT_UNBIND_SESSION is received, ctldev_handle() triggers an EV_CONN_STOP event, which calls iscsi_stop() to perform the connection stop operations. However, we cannot guarantee that only one ISCSI_KEVENT_UNBIND_SESSION is received (in the current mainline kernel design, the kernel always sends ISCSI_KEVENT_UNBIND_SESSION twice). So we must check the connection's state at the beginning of iscsi_stop() to avoid performing the stop operations more than once.

This issue only happens with async destroy session.

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
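The guard described in this commit message can be illustrated with a small idempotent handler. This is a sketch under stated assumptions, not the code from the commit: the types, the state names, and handle_conn_stop() are all hypothetical stand-ins for iscsid's real connection structures.

```c
/* Hypothetical subset of userspace connection states. */
enum conn_state {
	CONN_LOGGED_IN,
	CONN_IN_LOGOUT
};

struct conn {
	enum conn_state state;
	int stop_count;	/* how many times the stop work actually ran */
};

/*
 * Handles one "connection stop" event.
 * Returns 1 if the stop work ran, 0 if the event was a duplicate.
 */
int handle_conn_stop(struct conn *conn)
{
	/*
	 * With async session destroy on older kernels, two unbind events
	 * may arrive. If the connection is already in logout, the first
	 * event was handled, so the second one is ignored.
	 */
	if (conn->state == CONN_IN_LOGOUT)
		return 0;

	conn->state = CONN_IN_LOGOUT;
	conn->stop_count++;
	return 1;
}
```

Checking the state before doing any work makes the handler safe to call once per event, however many events the kernel delivers.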

A session might have no leadconn in the following scenarios:
- iscsid died after destroy_conn() was called during logout; the session state could now be LOGGED_IN or FREE
- iscsid died after create_session() was called during login; the session state would now be FREE

For the first scenario, the kernel is waiting for userspace's destroy_session request, so iscsid should perform this destroy operation when it restarts. For the second scenario, iscsi_sysfs_get_sessioninfo_by_id() would return an error for this session. If the session has no connection in this scenario, it would be left behind because iscsi_sync_session() is never called for it.

Whatever the root cause, it is an invalid session which should be destroyed when it reaches iscsi_sync_session().

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>

A connection would be in the DOWN state in the following scenarios:
- iscsid died after stop_conn() was called during logout
- iscsid died after create_conn() was called during login

For the first scenario, the connection was stopped with STOP_CONN_TERM, which is an unrecoverable flag. For drivers based on libiscsi, once stop_conn() is triggered, the connection cannot send PDUs any more. For the second scenario, iscsi_sysfs_get_sessioninfo_by_id() would return an error, so iscsi_sync_session() is not called and the session would be left behind.

In both scenarios the session would fail to recover, so we need to destroy it.

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>

When iscsid restarts and syncs a session, check the session's target_state. If the state is SCANNED or ALLOCATED, reopen the session; otherwise shut the session down.

The target state would be ALLOCATED if iscsid died before scanning the session. iscsid should sync this session and perform the scan operation if the session has a connection.

The target state would be UNBOUND or UNBINDING if iscsid died after session_unbind() was done. iscsid should destroy this session when it restarts.

Refer to https://lore.kernel.org/all/20221108014414.3510940-1-haowenchao@huawei.com/T/#u for more details.

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
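The target_state rule in this commit message reduces to a small mapping, sketched below for illustration. The enum and function name are hypothetical. Treating a missing attribute (NULL) as "reopen" is an assumption drawn from the log earlier in this thread, where iscsid on an unpatched kernel reports that it failed to open the target state and then recovers the connection.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcomes when iscsid finds a session at restart. */
enum restart_action {
	REOPEN_SESSION,
	SHUTDOWN_SESSION
};

/*
 * target_state: the session's target_state string, or NULL if the kernel
 * does not expose the attribute (kernels without the target_state patch).
 */
enum restart_action action_for_target_state(const char *target_state)
{
	/* Assumption: no attribute means the existing sync path is kept. */
	if (target_state == NULL)
		return REOPEN_SESSION;

	if (strcmp(target_state, "SCANNED") == 0 ||
	    strcmp(target_state, "ALLOCATED") == 0)
		return REOPEN_SESSION;

	/* UNBINDING, UNBOUND, and anything else: tear the session down. */
	return SHUTDOWN_SESSION;
}
```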

gonzoleeman · 2022-12-02T16:27:35Z

It looks like there are still 3 "unresolved conversations" here. I've set up open-iscsi so that they must be resolved before a merge is allowed.

@mikechristie are your issues resolved?

wenchao-hao · 2022-12-02T16:46:40Z

> It looks like there are still 3 "unresolved conversations" here. I've set up open-iscsi so that they must be resolved before a merge is allowed.
>
> @mikechristie are your issues resolved?

The last 3 commits depend on a kernel patch, which I have sent to the mailing list:
https://lore.kernel.org/linux-scsi/20221126010752.231917-1-haowenchao@huawei.com/

I think this PR could be merged after that patch is applied.

wenchao-hao · 2023-03-30T12:48:00Z

I will submit a new PR in the next few days, so I am closing this one.

wenchao-hao force-pushed the unbind_cleanup branch from acd09c7 to 605cdeb Compare April 18, 2022 03:27

mikechristie reviewed Apr 20, 2022

wenchao-hao force-pushed the unbind_cleanup branch 2 times, most recently from 6b63600 to dd3ee97 Compare August 1, 2022 16:36

violethaze74 previously approved these changes Aug 13, 2022

wenchao-hao dismissed violethaze74’s stale review via 4500276 August 14, 2022 16:47

wenchao-hao force-pushed the unbind_cleanup branch from dd3ee97 to 4500276 Compare August 14, 2022 16:47

wenchao-hao force-pushed the unbind_cleanup branch 2 times, most recently from 0399a04 to 0df5d7b Compare October 27, 2022 13:57

wenchao-hao force-pushed the unbind_cleanup branch from 0df5d7b to 70811a1 Compare November 7, 2022 12:40

Wenchao Hao added 4 commits November 8, 2022 11:28

wenchao-hao force-pushed the unbind_cleanup branch from 70811a1 to a27ed34 Compare November 8, 2022 03:30

wenchao-hao closed this Mar 30, 2023
