Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Override pldm response time out value" #181

Closed

Commits on Aug 24, 2021

  1. Revert "Override pldm response time out value"

    This reverts commit 574f6f8.
    
    It's unclear why this change was made. Nevertheless, it also causes
    bugs - the net effect is the response to a retried request races against
    the instance ID expiration interval because the retry interval is
    effectively equal to the instance ID expiration interval once we account
    for some timer slack.
    
    This is demonstrated by the following strace on pldmd, where we can see
    a retried request go out, followed by the report that the request
    failed, further followed by the response to the request coming in. Note
    the values are string-literal-escaped-octal, so the
    [ 0x80 0x00 0x03 ... ] byte encoding of the GetPLDMVersions request
    appears as "\200\0\3...":
    
    ```
    ...
    11:56:25.046173 socket(AF_UNIX, SOCK_SEQPACKET, 0) = 3
    ...
    11:56:25.183936 connect(3, {sa_family=AF_UNIX, sun_path=@"mctp-mux"}, 11) = 0
    11:56:25.190994 write(3, "\1", 1)       = 1
    ...
    11:56:25.195272 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\t\1", iov_len=2}, {iov_base="\200\0\3\0\0\0\0\1\0", iov_len=9}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, 0) = 11
    ...
    11:56:30.202298 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\t\1", iov_len=2}, {iov_base="\200\0\3\0\0\0\0\1\0", iov_len=9}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, 0) = 11
    11:56:30.202820 gettid()                = 1918
    11:56:30.203029 timerfd_settime64(6, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=3848, tv_nsec=61124978641398328}}, NULL) = 0
    11:56:30.203286 epoll_wait(4, [{EPOLLIN, {u32=14373240, u64=14373240}}], 14, 0) = 1
    11:56:30.203509 clock_gettime64(CLOCK_REALTIME, {tv_sec=1629806190, tv_nsec=203587376}) = 0
    11:56:30.203687 clock_gettime64(CLOCK_MONOTONIC, {tv_sec=3843, tv_nsec=523046301}) = 0
    11:56:30.203844 clock_gettime64(CLOCK_BOOTTIME, {tv_sec=3843, tv_nsec=523206110}) = 0
    11:56:30.204049 write(2, "Response not received for the re"..., 59) = 59
    11:56:30.204427 write(2, " EID = ", 7)  = 7
    11:56:30.204745 write(2, "9", 1)        = 1
    11:56:30.205047 write(2, " INSTANCE_ID = ", 15) = 15
    11:56:30.205389 write(2, "0", 1)        = 1
    11:56:30.205719 write(2, " TYPE = ", 8) = 8
    11:56:30.205997 write(2, "0", 1)        = 1
    11:56:30.206266 write(2, " COMMAND = ", 11) = 11
    11:56:30.206576 write(2, "3", 1)        = 1
    11:56:30.206893 write(2, "\n", 1)       = 1
    11:56:30.209402 write(2, "Failed to receive response for ", 31) = 31
    11:56:30.209814 write(2, "getPLDMVersion command, Host see"..., 46) = 46
    11:56:30.210969 gettid()                = 1918
    11:56:30.211171 timerfd_settime64(6, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=8549172174085160960}, it_value={tv_sec=0, tv_nsec=8566510441663037440}}, NULL) = 0
    11:56:30.211406 epoll_wait(4, [{EPOLLIN, {u32=14373240, u64=14373240}}], 14, 0) = 1
    11:56:30.211640 clock_gettime64(CLOCK_REALTIME, {tv_sec=1629806190, tv_nsec=211720512}) = 0
    11:56:30.211825 clock_gettime64(CLOCK_MONOTONIC, {tv_sec=3843, tv_nsec=531188829}) = 0
    11:56:30.211983 clock_gettime64(CLOCK_BOOTTIME, {tv_sec=3843, tv_nsec=531335706}) = 0
    11:56:30.212143 recv(3, NULL, 0, MSG_PEEK|MSG_TRUNC) = 15
    11:56:30.212366 recv(3, "\t\1\0\0\3\0\0\0\0\0\5\361\361\360\0", 15, 0) = 15
    ```
    
    That is, at 11:56:30.202298 we send out the retry for the request
    initiated at 11:56:25.195272 and the reply arrives back at
    11:56:30.212366, but in between we've already cancelled the request
    handler due to the instance ID interval timer expiring.
    
    Resolve this by removing the explicit configuration of the
    response-time-out build parameter setting the per-request response time
    to 4.8 seconds, setting its value back to the default of two seconds.
    Anecdotal testing of with the following shell script produced no
    failures (by inspection of the journal as the iterations executed):
    
    ```
    for i in `seq 1 30`; do echo $i; ( systemctl stop pldmd mctp-demux && echo 1e78902c.kcs > /sys/bus/platform/drivers/ast-kcs-bmc/unbind && sleep 1 && echo 1e78902c.kcs > /sys/bus/platform/drivers/ast-kcs-bmc/bind && systemctl start pldmd && sleep 15 ) || break; done
    ```
    
    Change-Id: Id2e308fa60c708a7c4321e025e1cf5a1056a9a2f
    Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
    amboar committed Aug 24, 2021
    Configuration menu
    Copy the full SHA
    20b1307 View commit details
    Browse the repository at this point in the history