Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In src/agent/dtls_mbedtls.cpp - issue with FD_ISSET() #87

Closed
DuaneEllis-TI opened this issue Aug 13, 2017 · 4 comments
Closed

In src/agent/dtls_mbedtls.cpp - issue with FD_ISSET() #87

DuaneEllis-TI opened this issue Aug 13, 2017 · 4 comments
Labels

Comments

@DuaneEllis-TI
Copy link
Contributor

To start with, I'm not a UDP socket guy - but I think I found a problem dealing with the dtls server in the border router agent, and the commissioning test app

My initial tests (with a fix) seem to be indicate I am correct, but I have not completed my testing.

Step 1: In the main loop, various 'pseudo' tasks are called to set their bits in a select( ... fd_set ... )

This occurs here, in the commissioner test app:
https://github.com/openthread/borderrouter/blob/master/tests/meshcop/commissioner.cpp#L655

In the AGENT, it occurs here:
https://github.com/openthread/borderrouter/blob/master/src/agent/main.cpp#L68

Step 2: The main loop then calls select, either select times out, or something is readable/writable, etc.

Step 3: The main loop then calls all of the 'pseudo-tasks' - asking them to examine their SelectBits.

I think I see a problem in in the MbedtlsServer::ProcessServer() function, the code does this:

SEE: https://github.com/openthread/borderrouter/blob/master/src/agent/dtls_mbedtls.cpp#L491
snippit below

void MbedtlsServer::ProcessServer( ... )
{
     /* .. snip ... */
     otbrError   error = OTBR_ERROR_ERRNO;
     /* .. snip ... */

    // I THINK THIS IS WRONG...
    VerifyOrExit(recvmsg(mSocket, &msghdr, MSG_PEEK) > 0);

    [ snip ... process the request ...]

exit:
    // abbreviated version...
    if( error ){
           close_socket()
           reopen_socket()
     }
}

I think this is wrong because:

SHORT VERSION

Step A - During the JOINING process, the JOINER sends a RELAY request that is eventually will end up back at the Commissioner

Step B - To do that, when the RELAY request is received, it is decrypted and forwarded -

example in the commissioner test application, here:
https://github.com/openthread/borderrouter/blob/master/tests/meshcop/commissioner.cpp#L212

In the agent, it occurs here (tx & rx are next to each other):
https://github.com/openthread/borderrouter/blob/master/src/agent/border_agent.cpp#L141

Step C - KEY POINT Depending upon timing, that message may or may-not be ready on the relay socket when select is called, due to this issue, the socket is closed & reopened thus the message may or may not be lost. (without a fix, I see the message lost quite often, but it is moody sometimes it works, sometimes it fails always ... Grr...)

LONG VERSION

  • Initial conditions, when SELECT returns in the main loop
    • DTLS message is ready on Socket (A) - the secure socket.
    • Nothing is ready on Socket (B) - the non secure socket.
  • The process loop runs, we come to the dtls pseudo-process.
    • The FD_ISSET() == true for Socket(A) -
    • The message is received and decrypted
    • The message is then forwarded to Socket (B)
  • Depending upon TIMING
    * Case (A) The forwarded message may be "in-transit"
    * Case (B) OR - the message has arrived at the socket quickly.
    * In any event, select() was not called again.
  • What would cause big timing differences? Why do I see this?
    • I think it is due to network traffic volume .. and this is only a SWAG...
    • In my environment, I only have my RaspBerryPi/BeagleBone + WifiRouter
      * Thus, my environment is super calm from a network point of view.
    • In the Travis Environment, I presume the data center has LOTS of network activity to sort
    • Other offices and/or test environments may be just as noisy
  • Eventually the Process Loop gets to the mbed code (above) that I think is wrong.
    * because FD_ISSET(socket B) is FALSE - the VerifyOrExit() goes to EXIT
    * At the exit: label, if an ERROR occurs - the socket is CLOSED
    * See: https://github.com/openthread/borderrouter/blob/master/src/agent/dtls_mbedtls.cpp#L538
  • Because the socket is closed
    * In case A - the forwarded message is NOT LOST because it is in transit and not yet delivered
    * In case B - the waiting message is LOST
    * Again - i'm not a UDP guy, but as I understand a closed socket drops all packets.
    * My reference is here: Reference: https://stackoverflow.com/questions/7843644/how-long-does-a-udp-packet-stay-at-a-socket
    * In case B - the message SHOULD be re transmitted after a timeout.
    * In case B - maybe the 2nd time is a charm and it works... or it is a random test failure
  • In case A (the message is "in-transit" and not lost)
    * In the working situation, the code loops around Select is called again
    * Again due to timing -
    * the socket is CLOSED and REOPENED
    * In the successful case, I assume this occurs quickly
    * ie: before the message is delivered by the kernel to the socket
    * The second time around, the FD_ISSET() is true - and the socket is not closed.

FYI: @pkanek and @srickardti

@DuaneEllis-TI
Copy link
Contributor Author

DuaneEllis-TI commented Aug 13, 2017

Update: After quite a bit of testing (30+ join attempts) - the ONLY type of JOIN_FAIL I have found is when the JOINING_ROUTER does not receive the initial discovery request.

What I cannot do right now is test with the ThreadGroup Android app because that app does not function in my home network... why I don't know... The Hurricane Electric Android App, found here:

https://play.google.com/store/apps/details?id=net.he.networktools&hl=en

The HE app can see the Bonjour entry for the Border Router no problems. But the ThreadGroup Android App - cannot see the mDNS entry. I believe this is a bug in the ThreadGroup Android app mDNS system. This only occurs in my home office environment. This is logged in the ThreadGroup Android App JIRA here: https://threadgroup.atlassian.net/browse/COMMAPP-141

Further testing will occur when I am in the office, but now commissioning with the test app is extremely reliable now 👍

@bukepo
Copy link
Member

bukepo commented Aug 14, 2017

@DuaneEllis-TI thanks for identifying this issue. After a DTLS session is established, the socket used for listening new DTLS requests is different from the socket bound with the existing session, so re-opening the listening socket should not affect existing sessions. However, the listening socket should not be re-open when it's not readable. Are you going to share your fix with a PR?

@DuaneEllis-TI
Copy link
Contributor Author

yes, I hope to push a PR very soon.

DuaneEllis-TI added a commit to DuaneEllis-TI/borderrouter that referenced this issue Aug 15, 2017
@DuaneEllis-TI
Copy link
Contributor Author

See: #90

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants