New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct handle broken logout session when iscsid restarted #339
Conversation
I tested this with some changes(stop iscsid after session_unbind succeed) in iscsid and kernel with patches: https://lore.kernel.org/linux-scsi/20220414014947.4168447-1-haowenchao@huawei.com/ Following is a brief log of test: If session is in UNBOUND state, iscsid would be shutdown in iscsi_sync_session durning start.
|
acd09c7
to
605cdeb
Compare
usr/iscsid.c
Outdated
|
||
if (!strcmp(session_state, "FREE")) { | ||
log_error("Skip syncing a freed session%d", info->sid); | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FREE state doesn't always mean we are freeing the session during removal. It means the session is no longer valid. This could be because it is being removed or the replacement/recovery timeout fired (see session_recovery_timedout in scsi_transport_iscsi.c). For the latter, if we re-login we will re-use the session struct. So if iscsid was restarted for the recovery timed out case we would still want to try to relogin/sync when iscsid starts up again.
I think you need to check if there is also a connection in sysfs (/sys/class/iscsi_connection). If there is no connection, then we were in FREE because it's being removed.
The problem above is that we've looped over sysfs and gathered the session and connection data, so we would expect that we would only get to this point if there was a valid connection. However, iscsi_sysfs_get_sessioninfo_by_id looks like it has a bug where it assumes that if we have a session then there will always be a connection, so if those
sysfs_get_str(ISCSI_CONN_SUBSYS)
calls fail, we just ignore it more or less so we can end up running sync_session.
So I think we need 3 things:
-
In initiator.c iscsi_sync_session we could add some code that can check if there is a connection and session. If there is only a session and it's in FREE, then do like you did above and just return because it's being freed in the kernel.
-
If there is a connection and it's state is DOWN, then something was removing the session/connection but iscsid failed when we were removing the connection. Don't sync. Just finish the removal. If the conn state is DOWN then stop conn with STOP_CONN_TERM was already called. So we just need to do destroy_conn then destroy_session.
-
Fix up iscsi_sysfs_get_sessioninfo_by_id so it handles the case where the connection has been removed but the session is still there. Since for the above case, we want iscsid to cleanup we can't just return error and ignore the session. I'm not sure what we want to do here. Maybe just make sure the address/port are set to invalid values so the caller can figure out what to do?
usr/initiator.c
Outdated
|
||
if (target_unbound) { | ||
log_error("Shutdown unbound session%d", sid); | ||
session_conn_shutdown(&session->conn[0], qtask, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conn->state will be ISCSI_CONN_STATE_CLEANUP_WAIT from sync_conn() right?
If so, the conn could have been in the logged in state when the unbind was running and iscsid stopped. So we still need to do a stop_conn call so the kernel can run libiscsi.c:iscsi_conn_stop.
I think we can check the conn kernel state to detect if it's in the DOWN state already (scsi_transport_iscsi.c:iscsi_stop_conn sets that state when userspace has called into the kernel to stop a conn for removal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conn->state will be ISCSI_CONN_STATE_CLEANUP_WAIT from sync_conn() right?
Yes, it's now ISCSI_CONN_STATE_CLEANUP_WAIT state.
If so, the conn could have been in the logged in state when the unbind was running and iscsid stopped. So we still need to do a stop_conn call so the kernel can run libiscsi.c:iscsi_conn_stop.
Yes, we need to check it and judge if stop_conn is necessary now.
I think we can check the conn kernel state to detect if it's in the DOWN state already (scsi_transport_iscsi.c:iscsi_stop_conn sets that state when userspace has called into the kernel to stop a conn for removal).
@@ -1149,6 +1149,9 @@ static void iscsi_stop(void *data) | |||
|
|||
iscsi_ev_context_put(ev_context); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment about how we would hit this. Since you are fixing it in the kernel so we don't send 2 events, if someone later looks at the userspace and current kernel code, they will be confused and think the check is not needed.
Just add something like:
When using async destroy session and older kernels we might get 2 session unbind events. If we are already in the IN_LOGOUT state then we know we've already got one, so we can ignore the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment about how we would hit this. Since you are fixing it in the kernel so we don't send 2 events, if someone later looks at the userspace and current kernel code, they will be confused and think the check is not needed.
Just add something like:
When using async destroy session and older kernels we might get 2 session unbind events. If we are already in the IN_LOGOUT state then we know we've already got one, so we can ignore the second.
Of course, I would add it.
What's the status of this pull request? Sorry, I didn't take time to re-read the whole thread, but now that Mike has submitted a few changes upstream, is this still an issue? |
It's not related to any patches upstream so it's still needed. |
Apolgy for my long time absence. I was too busy to submit changes in last 2 month. It seems I would have time to make it in next month. |
No problem at all. I understand being busy. I just wanted to clean this request up if it was no longer needed. |
@mikechristie @gonzoleeman Hi mike and lee, I found some other scenario which may cause iscsi session in invalid state, and I am trying to figure these out. This PR would be updated in next few days |
6b63600
to
dd3ee97
Compare
Updated |
dd3ee97
to
4500276
Compare
Update this PR with kernel patch:https://lore.kernel.org/all/20220802034729.2566787-1-haowenchao@huawei.com/ For kernel which did not apply the target_state patch, when iscsid restart, it would show:
For kernel which applied the target_state patch, when iscsid restart:
|
Hey wenchao-hao, Thanks for your work on this. I'm on vacation for a couple weeks. I'll be able to look at this and the kernel patch more when I get back. |
0399a04
to
0df5d7b
Compare
0df5d7b
to
70811a1
Compare
Once receive ISCSI_KEVENT_UNBIND_SESSION, ctldev_handle() would trigger an EV_CONN_STOP event, this event would calling iscsi_stop() to perform connection stop operations. While we can not guarantee only one ISCSI_KEVENT_UNBIND_SESSION is received(actually in current mainline kernel design, kernel would always send ISCSI_KEVENT_UNBIND_SESSION twice). So we must check connection's state at the begining of iscsi_stop() to avoid the stop operations performed more than one time. This issue only happened with async destroy session. Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
A session might have no leadconn in next scenario: - iscsid died after destroy_conn() is called during logout now the session state could be LOGED_IN or FREE - iscsid died after create_session() is called during login now the session state would be FREE For the first scenario, kernel is waitting for userspace's destroy_session request, iscsid should perform this destroy operation when restart. While for the second scenario, iscsi_sysfs_get_sessioninfo_by_id() would return error on this session. If session has no connection in this scenario, the session would still be residue because it would not call in iscsi_sync_session(). No matter what is the root reason, it's an invalid session which should be destroyed when it comes to iscsi_sync_session(). Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
A connection would be in down state in next scenario: - iscsid died after stop_conn() is called during logout - iscsid died after create_conn() is called during login For the first scenario, the connection is stopped with STOP_CONN_TERM, it's a unrecoverable flag. For drivers which based on libiscsi, once stop_conn() is triggerer, the connection can not send pdu any more. While for the second scenario, iscsi_sysfs_get_sessioninfo_by_id() would returned error, so it would not call in iscsi_sync_session(), this session would still be residual. Both the above scenario the session would failed to recovery, so we need to destroy it. Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
When iscsid restart and sync a session, check this session's target_state. If the state is SCANNED or ALLOCATED, reopen this session; else the session would be shutdown. target state would be ALLOCATED if iscsid died before scan session, iscsid should sync this session and perform the scan operation if this session has a connection. targt state would be UNBOUND or UNBINDING if iscsid died after session_unbind() is done, iscsid should destroy this session when restart. Refer to https://lore.kernel.org/all/20221108014414.3510940-1-haowenchao@huawei.com/T/#u for more details. Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
70811a1
to
a27ed34
Compare
It looks like there are still 3 "unresolved conversations" here. I've set up open-iscsi so that they must be resolved before a merge is allowed. @mikechristie are your issues resolved? |
The last 3 commits are based a kernel patch, I have send the patch to maillist: I think this PR could be merged after that patch is applied. |
I would submit a new PR in next days, so close this one. |
Refer to #338 for more details.