Correct handle broken logout session when iscsid restarted #339

Closed
wants to merge 4 commits into from

wenchao-hao
Contributor

Refer to #338 for more details.

@wenchao-hao
Contributor Author

wenchao-hao commented Apr 14, 2022

I tested this with some changes in iscsid (stopping iscsid after session_unbind succeeds) and with a kernel that has these patches applied:

https://lore.kernel.org/linux-scsi/20220414014947.4168447-1-haowenchao@huawei.com/

The following is a brief log of the test:

If a session is in the UNBOUND state, iscsid shuts it down in iscsi_sync_session during startup.

root@fedora ~:# systemctl status iscsid 
× iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; disabled; vendor preset: disabled)
     Active: failed (Result: signal) since Thu 2022-04-14 04:28:10 EDT; 4min 12s ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
    Process: 10574 ExecStart=/sbin/iscsid -f (code=killed, signal=KILL)
   Main PID: 10574 (code=killed, signal=KILL)
     Status: "Ready to process requests"
        CPU: 45ms

Apr 14 04:28:04 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:05 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:06 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:07 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:08 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:09 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:10 fedora iscsid[10574]: iscsid: sleep with session_unbind succeed
Apr 14 04:28:10 fedora systemd[1]: iscsid.service: Main process exited, code=killed, status=9/KILL
Apr 14 04:28:10 fedora systemd[1]: iscsid.service: Failed with result 'signal'.
Apr 14 04:28:10 fedora systemd[1]: Stopped Open-iSCSI.
root@fedora ~:# systemctl start iscsid 
root@fedora ~:# systemctl status iscsid 
● iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; disabled; vendor preset: disabled)
     Active: active (running) since Thu 2022-04-14 04:32:33 EDT; 1s ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
   Main PID: 10659 (iscsid)
     Status: "Ready to process requests"
      Tasks: 1 (limit: 38062)
     Memory: 1.8M
        CPU: 21ms
     CGroup: /system.slice/iscsid.service
             └─10659 /sbin/iscsid -f

Apr 14 04:32:32 fedora systemd[1]: Starting Open-iSCSI...
Apr 14 04:32:32 fedora iscsid[10659]: iscsid: Shutdown unbound session3
Apr 14 04:32:32 fedora iscsid[10659]: iscsid: Connection3:0 to [target: iqn.2003-01.org.linux-iscsi.localhost.test:sn.12345678abcd, portal: 9.84.111.6,3260] through [iface: default] is shutdown.
Apr 14 04:32:33 fedora systemd[1]: Started Open-iSCSI.
root@fedora ~:# iscsiadm -m session
iscsiadm: No active sessions.
root@fedora ~:# ls /sys/class/iscsi_session/

usr/iscsid.c Outdated

	if (!strcmp(session_state, "FREE")) {
		log_error("Skip syncing a freed session%d", info->sid);
		return 0;
Collaborator

The FREE state doesn't always mean we are freeing the session during removal. It means the session is no longer valid. This could be because it is being removed or the replacement/recovery timeout fired (see session_recovery_timedout in scsi_transport_iscsi.c). For the latter, if we re-login we will re-use the session struct. So if iscsid was restarted for the recovery timed out case we would still want to try to relogin/sync when iscsid starts up again.

I think you need to check if there is also a connection in sysfs (/sys/class/iscsi_connection). If there is no connection, then we were in FREE because it's being removed.

The problem above is that we've looped over sysfs and gathered the session and connection data, so we would expect to get to this point only if there was a valid connection. However, iscsi_sysfs_get_sessioninfo_by_id looks like it has a bug where it assumes that if we have a session then there will always be a connection, so if those sysfs_get_str(ISCSI_CONN_SUBSYS) calls fail, we more or less ignore the failure and can end up running sync_session.

So I think we need 3 things:

  1. In initiator.c iscsi_sync_session we could add some code that can check if there is a connection and session. If there is only a session and it's in FREE, then do like you did above and just return because it's being freed in the kernel.

  2. If there is a connection and its state is DOWN, then something was removing the session/connection but iscsid failed when we were removing the connection. Don't sync. Just finish the removal. If the conn state is DOWN then stop_conn with STOP_CONN_TERM was already called. So we just need to do destroy_conn then destroy_session.

  3. Fix up iscsi_sysfs_get_sessioninfo_by_id so it handles the case where the connection has been removed but the session is still there. Since for the above case we want iscsid to clean up, we can't just return an error and ignore the session. I'm not sure what we want to do here. Maybe just make sure the address/port are set to invalid values so the caller can figure out what to do?
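
The three points above could be sketched roughly as the decision function below. This is only an illustration under stated assumptions: `sync_action`, `decide_sync_action` and `state_is` are hypothetical names that do not exist in open-iscsi, the states are passed in as strings already read from sysfs, and a NULL connection state stands for "no entry in /sys/class/iscsi_connection". The exact state strings and their case are assumptions as well, so the comparison ignores case. The case of a session without a connection that is not in FREE is not covered by the comment, so the sketch reports it separately instead of guessing.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical outcome of inspecting a session found in sysfs at startup. */
enum sync_action {
	SYNC_ACTION_SYNC,           /* try to relogin/sync as usual */
	SYNC_ACTION_SKIP,           /* kernel is already freeing the session */
	SYNC_ACTION_FINISH_REMOVAL, /* destroy_conn, then destroy_session */
	SYNC_ACTION_UNSPECIFIED     /* case not covered by the review comment */
};

/* Case-insensitive equality; the case used by sysfs is an assumption. */
static int state_is(const char *state, const char *name)
{
	assert(state != NULL && name != NULL);
	while (*state && *name) {
		if (tolower((unsigned char)*state) !=
		    tolower((unsigned char)*name))
			return 0;
		state++;
		name++;
	}
	return *state == '\0' && *name == '\0';
}

/* conn_state == NULL means no connection was found for the session. */
static enum sync_action decide_sync_action(const char *session_state,
					   const char *conn_state)
{
	assert(session_state != NULL);

	if (conn_state == NULL) {
		/* Session only: FREE means it is being removed. */
		if (state_is(session_state, "FREE"))
			return SYNC_ACTION_SKIP;
		return SYNC_ACTION_UNSPECIFIED;
	}

	/* stop_conn with STOP_CONN_TERM already ran; finish the removal. */
	if (state_is(conn_state, "DOWN"))
		return SYNC_ACTION_FINISH_REMOVAL;

	/* Includes FREE after a recovery timeout, where the session
	 * struct is reused on relogin. */
	return SYNC_ACTION_SYNC;
}
```

Keeping the decision separate from the sysfs reads would let point 3 (what iscsi_sysfs_get_sessioninfo_by_id reports for a missing connection) be settled independently.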

usr/initiator.c Outdated

	if (target_unbound) {
		log_error("Shutdown unbound session%d", sid);
		session_conn_shutdown(&session->conn[0], qtask, 0);
Collaborator

The conn->state will be ISCSI_CONN_STATE_CLEANUP_WAIT from sync_conn() right?

If so, the conn could have been in the logged in state when the unbind was running and iscsid stopped. So we still need to do a stop_conn call so the kernel can run libiscsi.c:iscsi_conn_stop.

I think we can check the conn kernel state to detect if it's in the DOWN state already (scsi_transport_iscsi.c:iscsi_stop_conn sets that state when userspace has called into the kernel to stop a conn for removal).
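
A minimal sketch of that check, assuming hypothetical names (`teardown_step`, `plan_teardown`) that are not open-iscsi APIs. The caller is assumed to have read the kernel's connection state already and to pass it in as a flag; the sketch only shows that stop_conn is skipped when the kernel already reports DOWN, followed by destroy_conn and destroy_session.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the calls still needed to remove a session. */
enum teardown_step {
	STEP_STOP_CONN,
	STEP_DESTROY_CONN,
	STEP_DESTROY_SESSION
};

#define MAX_TEARDOWN_STEPS 3

/*
 * Fill steps[] with the remaining teardown calls, in order, and return
 * how many were written. conn_is_down reflects the kernel's connection
 * state, not iscsid's own conn->state (which is CLEANUP_WAIT after
 * sync_conn() either way).
 */
static size_t plan_teardown(int conn_is_down,
			    enum teardown_step steps[MAX_TEARDOWN_STEPS])
{
	size_t n = 0;

	assert(steps != NULL);

	/* The kernel has not run its stop path yet, so ask for it. */
	if (!conn_is_down)
		steps[n++] = STEP_STOP_CONN;

	steps[n++] = STEP_DESTROY_CONN;
	steps[n++] = STEP_DESTROY_SESSION;
	return n;
}
```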

Contributor Author

The conn->state will be ISCSI_CONN_STATE_CLEANUP_WAIT from sync_conn() right?

Yes, it's now ISCSI_CONN_STATE_CLEANUP_WAIT state.

If so, the conn could have been in the logged in state when the unbind was running and iscsid stopped. So we still need to do a stop_conn call so the kernel can run libiscsi.c:iscsi_conn_stop.

Yes, we need to check it to decide whether stop_conn is still necessary.

I think we can check the conn kernel state to detect if it's in the DOWN state already (scsi_transport_iscsi.c:iscsi_stop_conn sets that state when userspace has called into the kernel to stop a conn for removal).

@@ -1149,6 +1149,9 @@ static void iscsi_stop(void *data)

iscsi_ev_context_put(ev_context);

Collaborator

Could you add a comment about how we would hit this? Since you are fixing it in the kernel so we don't send 2 events, if someone later looks at the userspace and current kernel code, they will be confused and think the check is not needed.

Just add something like:

When using async destroy session and older kernels we might get 2 session unbind events. If we are already in the IN_LOGOUT state then we know we've already got one, so we can ignore the second.
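
For illustration, the guard with such a comment might look like the sketch below. The types and names (`demo_conn`, `demo_handle_unbind`, `DEMO_CONN_IN_LOGOUT`) are hypothetical stand-ins, not the real iscsid structures, and the counter exists only to make the effect observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the connection state iscsid tracks. */
enum demo_conn_state { DEMO_CONN_LOGGED_IN, DEMO_CONN_IN_LOGOUT };

struct demo_conn {
	enum demo_conn_state state;
	int stops_performed; /* how many times the stop work actually ran */
};

/* Returns 1 if the stop work ran, 0 if the event was a duplicate. */
static int demo_handle_unbind(struct demo_conn *conn)
{
	assert(conn != NULL);

	/*
	 * When using async destroy session and older kernels we might get
	 * 2 session unbind events. If we are already in the IN_LOGOUT
	 * state then we know we've already got one, so we can ignore the
	 * second.
	 */
	if (conn->state == DEMO_CONN_IN_LOGOUT)
		return 0;

	conn->state = DEMO_CONN_IN_LOGOUT;
	conn->stops_performed++;
	return 1;
}
```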

Contributor Author

Could you add a comment about how we would hit this? Since you are fixing it in the kernel so we don't send 2 events, if someone later looks at the userspace and current kernel code, they will be confused and think the check is not needed.

Just add something like:

When using async destroy session and older kernels we might get 2 session unbind events. If we are already in the IN_LOGOUT state then we know we've already got one, so we can ignore the second.

Of course, I will add it.

@gonzoleeman
Collaborator

What's the status of this pull request? Sorry, I didn't take time to re-read the whole thread, but now that Mike has submitted a few changes upstream, is this still an issue?

@mikechristie
Collaborator

It's not related to any patches upstream so it's still needed.

@wenchao-hao
Contributor Author

What's the status of this pull request? Sorry, I didn't take time to re-read the whole thread, but now that Mike has submitted a few changes upstream, is this still an issue?

Apologies for my long absence. I was too busy to submit changes in the last 2 months. It looks like I will have time to work on it next month.

@gonzoleeman
Collaborator

What's the status of this pull request? Sorry, I didn't take time to re-read the whole thread, but now that Mike has submitted a few changes upstream, is this still an issue?

Apologies for my long absence. I was too busy to submit changes in the last 2 months. It looks like I will have time to work on it next month.

No problem at all. I understand being busy. I just wanted to clean this request up if it was no longer needed.

@wenchao-hao
Contributor Author

@mikechristie @gonzoleeman Hi Mike and Lee, I found some other scenarios that may leave an iSCSI session in an invalid state, and I am trying to figure these out. This PR will be updated in the next few days.

wenchao-hao force-pushed the unbind_cleanup branch 2 times, most recently from 6b63600 to dd3ee97 on August 1, 2022 16:36
@wenchao-hao
Contributor Author

@mikechristie @gonzoleeman Hi Mike and Lee, I found some other scenarios that may leave an iSCSI session in an invalid state, and I am trying to figure these out. This PR will be updated in the next few days.

Updated

violethaze74 previously approved these changes Aug 13, 2022
@wenchao-hao
Contributor Author

Updated this PR along with a kernel patch: https://lore.kernel.org/all/20220802034729.2566787-1-haowenchao@huawei.com/

On a kernel without the target_state patch, iscsid shows the following when it restarts:

root@fedora /code/open-iscsi:# ./usr/iscsid -f
iscsid: failed to open session's target state No such file or directory
iscsid: connection1:0 is operational after recovery (1 attempts)

On a kernel with the target_state patch, when iscsid restarts:

  • If the target_state is UNBINDING, it shows:
root@fedora /code/open-iscsi:# ./usr/iscsid -f
Donot sync UNBINDING session3
iscsid: connection3:0 Could not send logout pdu(Invalid argument) from iscsi_stop. Dropping session
iscsid: Connection3:0 to [target: iqn.1994-05.com.redhat:e1d3c4cb3d4, portal: 192.168.1.12,3260] through [iface: default] is shutdown.
  • If the target_state is UNBOUND, it shows:
iscsid: Shutdown UNBOUND session1
iscsid: Connection1:0 to [target: iqn.1994-05.com.redhat:e1d3c4cb3d4, portal: 192.168.1.12,3260] through [iface: default] is shutdown.

@mikechristie
Collaborator

Hey wenchao-hao,

Thanks for your work on this. I'm on vacation for a couple weeks. I'll be able to look at this and the kernel patch more when I get back.

Wenchao Hao added 4 commits November 8, 2022 11:28
On receiving ISCSI_KEVENT_UNBIND_SESSION, ctldev_handle() triggers an
EV_CONN_STOP event, which calls iscsi_stop() to perform the connection
stop operations.

However, we cannot guarantee that only one ISCSI_KEVENT_UNBIND_SESSION
is received (in the current mainline kernel design, the kernel always
sends ISCSI_KEVENT_UNBIND_SESSION twice).

So we must check the connection's state at the beginning of iscsi_stop()
to avoid performing the stop operations more than once.

This issue only happens with async destroy session.

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
A session might have no leadconn in the following scenarios:

- iscsid died after destroy_conn() was called during logout;
  the session state could now be LOGGED_IN or FREE
- iscsid died after create_session() was called during login;
  the session state would now be FREE

For the first scenario, the kernel is waiting for userspace's
destroy_session request, so iscsid should perform the destroy
operation when it restarts.
For the second scenario, iscsi_sysfs_get_sessioninfo_by_id()
returns an error for this session. If the session has no connection
in this scenario, it is left behind because iscsi_sync_session()
is never called for it.

Whatever the root cause, it is an invalid session that should be
destroyed when it reaches iscsi_sync_session().

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
A connection can be in the down state in the following scenarios:

- iscsid died after stop_conn() was called during logout
- iscsid died after create_conn() was called during login

For the first scenario, the connection was stopped with STOP_CONN_TERM,
which is an unrecoverable flag. For drivers based on libiscsi, once
stop_conn() is triggered, the connection cannot send PDUs any more.

For the second scenario, iscsi_sysfs_get_sessioninfo_by_id()
returns an error, so iscsi_sync_session() is not called and the
session is left behind.

In both scenarios the session would fail to recover, so
we need to destroy it.

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
When iscsid restarts and syncs a session, check the session's target_state.
If the state is SCANNED or ALLOCATED, reopen the session; otherwise shut
the session down.

The target state is ALLOCATED if iscsid died before scanning the session;
iscsid should sync this session and perform the scan operation if the
session has a connection.

The target state is UNBOUND or UNBINDING if iscsid died after
session_unbind() was done; iscsid should destroy this session when it
restarts.

Refer to https://lore.kernel.org/all/20221108014414.3510940-1-haowenchao@huawei.com/T/#u
for more details.

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
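
The rule in the last commit message can be sketched as a small mapping from target_state to an action. This is an illustration only: `target_action` and `action_for_target_state` are hypothetical names, and the state strings are assumed to appear in sysfs exactly as spelled in the commit message. The case where the attribute is missing on a kernel without the target_state patch is not handled here; the caller would need to deal with it before calling this function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of checking a session's target_state at restart. */
enum target_action {
	TARGET_REOPEN,  /* sync the session and continue (scan if needed) */
	TARGET_SHUTDOWN /* destroy the session */
};

static enum target_action action_for_target_state(const char *target_state)
{
	assert(target_state != NULL);

	/* ALLOCATED: iscsid died before the scan; SCANNED: normal session. */
	if (!strcmp(target_state, "SCANNED") ||
	    !strcmp(target_state, "ALLOCATED"))
		return TARGET_REOPEN;

	/* Everything else, including UNBINDING and UNBOUND, is shut down. */
	return TARGET_SHUTDOWN;
}
```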
@gonzoleeman
Collaborator

It looks like there are still 3 "unresolved conversations" here. I've set up open-iscsi so that they must be resolved before a merge is allowed.

@mikechristie are your issues resolved?

@wenchao-hao
Contributor Author

It looks like there are still 3 "unresolved conversations" here. I've set up open-iscsi so that they must be resolved before a merge is allowed.

@mikechristie are your issues resolved?

The last 3 commits depend on a kernel patch, which I have sent to the mailing list:
https://lore.kernel.org/linux-scsi/20221126010752.231917-1-haowenchao@huawei.com/

I think this PR could be merged after that patch is applied.

@wenchao-hao
Contributor Author

I will submit a new PR in the next few days, so I am closing this one.
