iscsid failed to recover session if another error happened when iscsid is reopening the session. #387

Closed
wenchao-hao opened this issue Jan 3, 2023 · 2 comments · Fixed by #388

Comments

@wenchao-hao
Contributor

Here is the log:

Jan 03 03:57:25 fedora iscsid[40456]: iscsid: Kernel reported iSCSI connection 3:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jan 03 03:57:25 fedora kernel:  connection3:0: detected conn error (1020)
Jan 03 03:57:28 fedora kernel:  connection3:0: detected conn error (1020)
Jan 03 03:57:32 fedora iscsid[40456]: iscsid: Received iferror -107: Transport endpoint is not connected.
Jan 03 03:57:32 fedora iscsid[40456]: iscsid: can't set operational parameter 15 for connection 3:0, retcode -107 (0)
Jan 03 03:57:34 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)
Jan 03 03:57:36 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)
Jan 03 03:57:38 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)
Jan 03 03:57:40 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)
Jan 03 03:57:42 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)
Jan 03 03:57:44 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)
Jan 03 03:57:46 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)
Jan 03 03:57:48 fedora iscsid[40456]: iscsid: can't bind conn 3:0 to session 3, retcode -107 (115)

iscsid fails to recover the session if another error occurs while it is reopening that session.

@wenchao-hao
Contributor Author

The analysis is as follows:

time
 |
 |      iscsid                                 kernel                          target
 |  
 |                                                                         close socket
 |                                                                   <-----------------------
 |                                     iscsi_conn_error_event()
 |                                     with error type
 |                                     ISCSI_ERR_TCP_CONN_CLOSE
 |                                    <------------------------
 |  
 |  session_conn_reopen()
 |   -> ipc->stop_conn()    -->       would trigger kernel clear conn->flags'
 |                                    ISCSI_CLS_CONN_BIT_CLEANUP bit
 |   -> iscsi_conn_connect()
 |  session_conn_poll
 |   -> ipc->bind_conn()
 |                                                                         close socket
 |                                                                   <-----------------------
 |                                     iscsi_conn_error_event()
 |                                     would set kernel conn->flags'
 |                                     ISCSI_CLS_CONN_BIT_CLEANUP bit
 |                                     And call iscsi_stop_conn
 |                                     Now, in kernel connection's state
 |                                     is ISCSI_CONN_FAILED
 |                                    <------------------------
 |   -> iscsi_session_set_params()
 |        -> ipc->set_param
 |           failed with ENOTCONN
 |   -> iscsi_login_eh()
 |        -> session_conn_reopen()
 |
 v

At this point, in user space:
conn->state is ISCSI_CONN_STATE_XPT_WAIT
session->r_stage is R_STAGE_SESSION_REOPEN
In kernel space:
conn->flags has ISCSI_CLS_CONN_BIT_CLEANUP set

iscsid calls session_conn_reopen() to recover this connection, but it
does so with do_stop set to 0, so the kernel never clears the
ISCSI_CLS_CONN_BIT_CLEANUP bit in conn->flags.

iscsid then falls into an infinite loop, which looks like this:

iscsi_conn_connect -> bind_conn(failed with ENOTCONN)
        ^                     |
        |                     |
        |                     v
        \---------- session_conn_reopen

@wenchao-hao
Contributor Author

wenchao-hao commented Jan 3, 2023

The issue can be reproduced by adding a delay in iscsid (it also happens without the following change, but with low probability):

--- a/usr/initiator.c
+++ b/usr/initiator.c
@@ -1630,6 +1630,8 @@ static void session_conn_poll(void *data)
                        return;
                }
 
+               sleep(5);
+
                if (iscsi_session_set_params(conn)) {
                        iscsi_login_eh(conn, qtask, ISCSI_ERR_LOGIN);
                        return;

Steps:

  1. Use targetcli to create a target.
  2. Build open-iscsi with the above change and install it.
  3. Log in to the target created by targetcli.
  4. Restart the target service on the target host more than once.
