-
Notifications
You must be signed in to change notification settings - Fork 727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Can't communicate via Ethernet #3618
Comments
This issue does not occur when starting the firmware (2.3.4.ea61808) in a Docker container:
Log:
What stands out is that on the device it says
while when I run the firmware in Docker it says
|
OK, Ive spent a couple hours poking at this and I think I've got some answers on why this is happening, but I don't know the best approach for solving it. Taking a look at how the TCP API loop is implemented here: https://github.com/meshtastic/firmware/blob/master/src/mesh/api/ServerAPI.cpp#L48-L59 My understanding is this:
The issue of course, is a client connected to the TCP API will send a number of requests through the life of the connection, so the receive buffer will have data in it much longer than the 100ms it takes to start the loop over. When the loop repeats in the next 100ms, Line 48/49 will return true, because there is an established connection, likely with contents in the recv buffer. The connection we made a moment ago. Its about to be killed. This generally seems like an incomplete way to handle this. I went looking through old git commits and found #608 where the force-closing of the previous TCP connection was introduced. I get that we only want to permit one connection at a time, but outright killing the existing connection isn't a valid route either. This behavior makes some sense if you only consider long-running connections that have a 0-length receive buffer waiting from the client (such as the Android app idling, some other custom application watching for received messages, etc). However, if you're using the CLI via the TCP API, queries can take ~3 seconds to fully complete, and given the 100ms loop, there's no chance they'll finish before they get auto-destroyed the next time the loop runs. I've temporarily changed Hoping someone has some other ideas on ways to handle the various edge cases here? |
Thinking through some high level logic on an appropriate way to handle this, and I have an idea to propose. When a new connection comes in, the code currently checks to see if there are any existing clients (openAPI) and if so, it destroys them. I feel like the intention here was to only behave this way if the connection is a long-running, currently idle established connection. Perhaps on line 51, rather than blindly checking for an existing client, we should be evaluating the currently established client a bit further, and checking if its receive buffer is empty or not. If it is, allow it to be killed. This would allow a second long-running client to "take over" the single connection that's allwoed. If there is data in the recv buffer, dont destroy the client, but also dont attempt to create a new one. This would allow an existing client that's still actively asking for data to finish. I can see how this could be accomplished with the private functions within EthernetServer/EthernetClient, but not entirely sure if the public functions allow for this level of inspection. I'm not a C++ developer, so I'm just taking a shot here. Another thought of course, is if you have two clients (let's use the android app using IP connections as our example) both trying to connect to the same API, and the server doesn't want to allow more than one.. If you let the server kill an existing connection and give the new client the socket, it seems like this will actually just cause the two clients to fight while they repeatedly attempt to reconnect - ultimately making neither of them function. I dont know how many times they retry, but this doesn't seem viable either. Perhaps closing the existing connection shouldn't happen at all, and we should just reject new connections with an error if one already exists? |
Interestingly - I've updated to the latest python meshtastic v2.3.8 and noticed that there's a major change in it to use the 'modern' meshtastic protocol init flow. In the PR it mentions timeouts when using the TCP API: After this update, the CLI seems to work basically 100% of the time. It makes some sense, given that it can accomplish the task much faster, but its also entirely confusing given how i know the code mentioned above should work, so there must still be something about the TCP API loop that I dont understand. Will try to get a better understanding of how that loop works, but for now if you update the meshtastic CLI to 2.3.8, i suspect you'll also see that its now working reliably. Curious to hear your results. |
@daytonturner if I understand it correctly, meshtastic/python#560 only eliminates the symptom, not the issue. Before, the Python CLI did:
Because 3. is no longer needed, the Python CLI no longer runs into this issue. In the client I'm writing, I can also successfully establish a connection and receive the device information. However, as soon as I send another message to the device, the connection is dropped. |
It makes sense that this behaviour would mask the problem, rather than fixing the root of the problem.
I say this because the main ServerAPI loop checks to see if a new connection is waiting to be served by testing if there’s an established connection with bytes waiting in the receive buffer (ie the client hasn’t sent its request yet).
If the client makes only one request and is simply waiting for response to happen, then it won’t be a problem.
If after 100ms (the loop time) there’s still a client connected with anything in the receive buffer (like a second request) it will close it and open a new one.
Dayton
…________________________________
From: Michel Jung ***@***.***>
Sent: Sunday, May 5, 2024 1:01:38 AM
To: meshtastic/firmware ***@***.***>
Cc: daytonturner ***@***.***>; Mention ***@***.***>
Subject: Re: [meshtastic/firmware] [Bug]: Can't communicate via Ethernet (Issue #3618)
@daytonturner<https://github.com/daytonturner> they way I understand it, meshtastic/python#560<meshtastic/python#560> only eliminates the symptom, not the issue.
Before, the Python CLI did:
1. Establish a connection (works)
2. Request channels (failed, because of this bug)
Now, it only does 1., which works, and 2. is no longer needed.
In the client I'm writing, I can also successfully establish a connection and receive the device information. However, as soon as I send another message to the device, the connection is dropped.
—
Reply to this email directly, view it on GitHub<#3618 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACZEPSNSUDPMJBVBBR45GY3ZAXROFAVCNFSM6AAAAABGGD3OGGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJUGY3TGOJQGI>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Category
Other
Hardware
RAK4631
RAK13800
Firmware Version
2.3.4.ea61808
Description
I wrote a Software that successfully connects via Ethernet and retrieves device information. However, when trying to send a chat message, the device drops the connection.
To make sure the problem is not on my end, I tried the Python CLI and it seems to confirm the issue:
Relevant log output
The text was updated successfully, but these errors were encountered: