
[gardena] Fix handling of websocket connection losses that causes memory leaks #11825

Merged
merged 17 commits into openhab:main on Jun 14, 2022

Conversation

@Bruetti1991 (Contributor) commented Dec 20, 2021

Fixes #10516
Fixes #11474
Fixes #10455
Fixes #12481

Issue

  • Memory leaks were caused by not properly synchronizing the calls when a connection closes
  • The binding restarted a single websocket more than once
  • This led to performance issues in the whole openHAB system once the binding had been running for some time

Fix

  • Make use of the synchronized keyword to ensure a websocket is only restarted once
  • Only restart the websocket that was closed, instead of re-initializing the whole binding (a minimal sketch of the idea follows below)
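A minimal sketch of this idea, assuming a hypothetical supervisor class (illustrative only, not the binding's actual code):

    // Illustrative sketch only -- hypothetical names, not the binding's real classes.
    // Goal: restart exactly one websocket per connection loss, even if onWebSocketError
    // and onWebSocketClose both fire for the same loss.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class WebSocketSupervisor {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private boolean restartPending; // guarded by the object monitor ("this")

        /** Called from both the close and the error callback; only the first caller schedules a restart. */
        synchronized void onConnectionLost(String socketId) {
            if (restartPending) {
                return; // a restart for this connection loss is already scheduled
            }
            restartPending = true;
            // restart only the affected socket, after a short delay
            scheduler.schedule(() -> restart(socketId), 3, TimeUnit.SECONDS);
        }

        private synchronized void restart(String socketId) {
            restartPending = false;
            // ... re-open the single websocket identified by socketId ...
        }
    }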

…ory leaks

* The binding no longer restarts websockets more than once if the connection is lost
* Fixes openhab#10516

Signed-off-by: Nico Brüttner <n@bruettner.de>
@Bruetti1991 Bruetti1991 marked this pull request as ready for review December 20, 2021 23:56
@lolodomo lolodomo added the bug An unexpected problem or unintended behavior of an add-on label Dec 21, 2021
@kaikreuzer (Member) left a comment

lgtm, thank you! Just one small question below.
@gerrieg Would you also want to have a look?

logger.warn("Restarting GardenaSmart Webservice ({})", socket.getSocketID());
socket.stop();
// restart after 3 seconds
scheduler.schedule(() -> {
Member

Is it ok to run this asynchronously, i.e. not within the synchronized call anymore? Or would a simple Thread.sleep(3000) be a safer choice?

Contributor Author

Hi Kai,
sorry for the late answer.
I did it this way because the functions onWebSocketError and onWebSocketClose are both called at the same time if the socket connection closes unexpectedly. With only a Thread.sleep before restarting the socket, the socket would be restarted twice.
To overcome this problem, I used scheduler.schedule() to wait 3 seconds before restarting the socket. Because the restart runs in a new thread, the first call to restartWebsocket leaves the synchronized block immediately, so when the second call enters the synchronized block the socket is already closed but not yet restarted. This prevents a second restart of the websocket.

I think this is safe, and it has also been working fine in my openHAB installation for several months now.

But if this is not a good practice, there are other options to solve this problem. For example:

  • Only restart the socket when onWebSocketClose is called, and use onWebSocketError for logging purposes only. This also works fine for me, because both functions are called every time the connection closes unexpectedly. But I am not a Java developer, so I don't know whether it is guaranteed that onWebSocketClose is called every time an error occurs...
  • If we still want to be able to restart the socket on calls to onWebSocketError and/or onWebSocketClose, I could refactor the synchronized block to get rid of scheduler.schedule() and use Thread.sleep(3000) instead (a rough sketch follows at the end of this comment). I already tested this, and it works fine: the websocket is still only restarted once.

What do you prefer?
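For illustration, the Thread.sleep(3000) variant from the second bullet could be sketched roughly like this (hypothetical names and a simple time-based guard; not necessarily how the author implemented it):

    // Hypothetical sketch of the Thread.sleep(3000) alternative; holding the lock while
    // sleeping makes the second callback wait at the method entry until the restart is done.
    class RestartOnLoss {
        private long lastRestartMillis; // guarded by "this"

        /** Called from both onWebSocketError and onWebSocketClose when the connection drops. */
        synchronized void restartWebsocket() throws InterruptedException {
            if (System.currentTimeMillis() - lastRestartMillis < 10_000) {
                return; // the other callback already handled this connection loss
            }
            stopSocket();
            Thread.sleep(3000);  // the lock is held here, so the second callback blocks above
            startSocket();       // restart only this websocket, exactly once
            lastRestartMillis = System.currentTimeMillis();
        }

        private void stopSocket()  { /* close the broken connection */ }
        private void startSocket() { /* re-open the websocket */ }
    }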

Contributor

See my comments today about the logger level. Personally I would suggest you do something like the following:

    private @Nullable ScheduledFuture<?> restartTask;

    @Override
    public void onWebSocketClose() {
        // do your logging
    ...
        // do your other stuff
    ...
        scheduleRestart();
    }

    @Override
    public void onWebSocketError() {
        // do your logging
    ...
        // do your other stuff
    ...
        scheduleRestart();
    }

    private synchronized void scheduleRestart() {
        ScheduledFuture<?> restartTask = this.restartTask;
        if ((restartTask == null) || restartTask.isDone()) {
            // assign to the field (not only the local copy) so the next caller sees the pending task
            this.restartTask = scheduler.schedule(() -> restartWebsockets(), 3, TimeUnit.SECONDS);
        }
    }

    @Override
    public synchronized void restartWebsockets() {
        // your restart code here
        this.restartTask = null;
    }

Signed-off-by: Nico Brüttner <n@bruettner.de>
Signed-off-by: Nico Brüttner <n@bruettner.de>
@andrewfg (Contributor) commented Mar 22, 2022

#11825 (comment)
#11825 (comment)
#11825 (comment)

@Bruetti1991 gentle reminder: you promised three small fixes above, but have not yet committed them. => Could you please do this, and thereby close the respective open issues, so that the PR can be merged?

@andrewfg (Contributor)

PS @Bruetti1991 I wonder how many of the other issues on this binding may have been resolved by your fix? => Any thoughts?


@Bruetti1991 (Contributor Author) commented Mar 23, 2022

#11825 (comment)
#11825 (comment)
#11825 (comment)

@Bruetti1991 gentle reminder: you promised three small fixes above, but have not yet committed them. => Could you please do this, and thereby close the respective open issues, so that the PR can be merged?

@andrewfg Sorry for the delay. I just did not have much time for it in the last couple of weeks.
And I also wanted to confirm that the binding is working as intended when my gardena mower is actually doing something. But that was not possible during the winter months ;)
Currently I am on a business trip until April 3rd. As soon as I am back home, I will commit the fixes.

I will also have a look at the other issues as soon as possible.

Signed-off-by: Nico Brüttner <n@bruettner.de>
Signed-off-by: Nico Brüttner <n@bruettner.de>
…with HTTP 429 error (Too Many Requests)

Signed-off-by: Nico Brüttner <n@bruettner.de>
…HTTP errors (except 429)

Signed-off-by: Nico Brüttner <n@bruettner.de>
Signed-off-by: Nico Brüttner <n@bruettner.de>
@Bruetti1991 (Contributor Author)

@andrewfg I finally found the time to commit the requested fixes. I also changed two additional things:

  1. I revised how websockets are restarted. I think it is more robust now.
  2. I increased the reinitialization delay after an HTTP error occurs (see the sketch below):
    • Error 429 (Too Many Requests): 24 hour delay. This reduces the number of reconnection attempts if the cloud limit has been reached.
    • Any other error: 120 seconds. This might help to avoid reaching the cloud limit in the first place...
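Purely as an illustration of the delays described above (the constants and method name are not the binding's real ones), the choice amounts to something like:

    // Illustrative only: pick the reinitialization delay from the HTTP status code.
    private static final long LIMIT_EXCEEDED_DELAY_SECONDS = 24 * 60 * 60; // HTTP 429: back off for 24 hours
    private static final long DEFAULT_ERROR_DELAY_SECONDS = 120;           // any other HTTP error: 120 seconds

    private long reinitializeDelaySeconds(int httpStatus) {
        return httpStatus == 429 ? LIMIT_EXCEEDED_DELAY_SECONDS : DEFAULT_ERROR_DELAY_SECONDS;
    }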

@Bruetti1991 (Contributor Author)

PS @Bruetti1991 I wonder how many of the other issues on this binding may have been resolved by your fix? => Any thoughts?

[image]

@andrewfg (Contributor) commented Apr 8, 2022

Issues should be resolved with my fix
Issues are maybe also fixed because now there should be no "simultaneous logins" anymore

Many thanks for the feedback. And the work. I added the 'fixes' keyword so that those issues will 'auto-close' when your PR is merged.

This is not fixed.

Ok, I will have a look at those issues separately to see if I have some suggestions on how to fix them.

@andrewfg (Contributor) commented Apr 8, 2022

@Bruetti1991 fyi, I will build this version and test it for a few days on my own system.

Signed-off-by: Nico Brüttner <n@bruettner.de>
@andrewfg (Contributor) left a comment

LGTM

@andrewfg (Contributor) commented May 5, 2022

I will .. test it for a few days on my own system

@Bruetti1991 I have been testing it for about one month now. Unfortunately I have been noticing more disconnects on this build than I normally did on the official release version. However, it is not 100% clear to me why I am seeing these errors. I will keep observing & testing it.

@andrewfg (Contributor) commented May 6, 2022

@Bruetti1991 to be even more specific, I am getting the following errors in my log, repeated exactly every two hours..

2022-05-06 04:52:04.538 [WARN ] [rdena.internal.GardenaSmartWebSocket] - Connection to Gardena Webservice was closed (e7bdbc22aa-My Garden): code: 1001, reason: Going away
2022-05-06 04:52:04.540 [WARN ] [ng.gardena.internal.GardenaSmartImpl] - Restarting GardenaSmart Webservice (e7bdbc22aa-My Garden)

@Bruetti1991 Bruetti1991 requested a review from andrewfg May 27, 2022 17:20
@Bruetti1991 (Contributor Author)

@Bruetti1991 I am still seeing a lot of HTTP 429 'message limit exceeded' errors using a build of your code. But I don't know if this is due to the changes you made, or if it would have occurred also on the prior official release version of the code. ??

I am pretty sure that these errors are not related to my code changes. You would also get these errors with the official version of the binding. I think this is because of the many websocket disconnections other than 'Going away'...

@andrewfg (Contributor)

many websocket disconnections other than 'Going away'...

I don't know the API very well, but I wonder if we can eliminate the token-expired disconnections by renewing the token just before it expires?

@andrewfg (Contributor) commented Jun 3, 2022

@Bruetti1991 I think there is a fundamental problem in how the access token expiration is being handled.

Problem

IMHO sendKeepAlivePing() is -- WRONGLY -- closing the session if the access token has expired ...

[code screenshot]

... and that (therefore) WRONGLY triggers the going away log message here ...

[code screenshot]

Proposed Solution

I think that sendKeepAlivePing() should NOT close the session when the access token expires; instead it should POST a request to refresh the token, in a similar way to what the code snippet below does.

[code screenshot]

@Bruetti1991 @kaikreuzer => any thoughts on this?
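In rough outline, the proposal would look something like the sketch below (hypothetical names; TokenStore is an illustrative stand-in, not the binding's real API -- only the Jetty Session/RemoteEndpoint calls are real):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.eclipse.jetty.websocket.api.Session;

    // Hypothetical sketch: refresh the access token instead of closing the websocket session.
    class KeepAlive {
        interface TokenStore {
            boolean isAccessTokenExpired();
            void refreshAccessToken(); // e.g. POST grant_type=refresh_token to the auth endpoint
        }

        private final Session session;       // established Jetty websocket session
        private final TokenStore tokenStore; // illustrative token holder

        KeepAlive(Session session, TokenStore tokenStore) {
            this.session = session;
            this.tokenStore = tokenStore;
        }

        void sendKeepAlivePing() throws IOException {
            if (tokenStore.isAccessTokenExpired()) {
                tokenStore.refreshAccessToken(); // refresh instead of closing the session
            }
            session.getRemote().sendPing(ByteBuffer.allocate(0)); // keep the websocket alive
        }
    }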

@kaikreuzer (Member)

@andrewfg I didn't follow your discussion and the code changes, but if it is possible to refresh the access token through the established websocket connection, I'd fully agree with you that it should not be closed and only the token refreshed.

@andrewfg (Contributor) commented Jun 5, 2022

@Bruetti1991 / @kaikreuzer after doing some further digging, I discovered that it is NOT a token expiry problem. It seems that the Gardena remote server is unilaterally closing the WebSocket after 120 minutes. This seems rather odd, because the client (the binding) is sending regular WebSocket pings and getting regular pongs from the server. I have two hypotheses, below, about why the server might be unilaterally closing the WebSocket, but I would appreciate your thoughts too.

  • Current code sends client pings and receives server pongs; but perhaps it must send pongs in response to server pings too?
  • Maybe we need to set a keepAlive on the underlying TCP socket?

In the next few days, I shall do some tests to see whether either of the above solves the problem, and I will get back to you.

@Bruetti1991 while playing around with your code I also found some compiler warnings and a potential memory leak. Once I have finished the above-mentioned tests, I will come back with a further code review with suggested changes.

@andrewfg (Contributor) commented Jun 6, 2022

.. perhaps it must send pongs in response to server pings too?

The server does not send pings, so no client pongs are needed, so this is NOT the issue.

.. set a keepAlive on the underlying TCP socket?

This may indeed be the reason why the server "goes away". The reason I think this is that the server goes away after exactly 120 minutes, which is the default time limit for the TCP socket keep-alive mechanism. However, in the Jetty framework there seems to be no way to get at the underlying TCP socket, so I cannot test it.
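For comparison only: on a plain java.net.Socket the keep-alive option in question would be enabled as below; as noted above, Jetty's websocket client does not expose the underlying socket, so this cannot be applied directly in the binding:

    import java.net.Socket;
    import java.net.SocketException;

    class TcpKeepAliveExample {
        // Purely illustrative: SO_KEEPALIVE makes the OS send periodic probes on an idle
        // connection; the probe interval itself is an operating-system setting, not a Java one.
        static void enableKeepAlive(Socket socket) throws SocketException {
            socket.setKeepAlive(true);
        }
    }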

Therefore I think this issue of premature server "going away" is NOT something we can solve in this PR, and it may be something to look out for in the future. (I will open an Issue to mark it.) Instead, the 'solution' of not logging "going away" errors is probably the right / only solution. => I suggest simply using logger.debug() for all server socket close messages.

I will therefore just post a new review with some further change requests on your code, so we can close off this PR asap.

@andrewfg (Contributor) left a comment

Here is a new (hopefully the last) review.

Signed-off-by: Nico Brüttner <n@bruettner.de>
Signed-off-by: Nico Brüttner <n@bruettner.de>
…d or timed out

Signed-off-by: Nico Brüttner <n@bruettner.de>
Signed-off-by: Nico Brüttner <n@bruettner.de>
@Bruetti1991 (Contributor Author)

@andrewfg

after doing some further digging, I discovered that it is NOT a token expiry problem. It seems that the Gardena remote server is unilaterally closing the WebSocket after 120 minutes.

Right. I discovered the same thing. When getting the first token, the response contains "expires_in": 86399. That means the token should only expire after 86399 seconds (~24 h). I still tried refreshing the token every hour, but this didn't lead to any improvement.

I also took a look at the API for getting the websocket URLs. Along with the websocket URL you indeed get a validity value, but it only describes the "Time window for connection, starting from issuing this POST request, (seconds)". It is set to 10 seconds. So it is not useful for us.

I agree with you that this issue most likely has something to do with how Jetty handles the connection. So this should be fixed in another PR.

@Bruetti1991 Bruetti1991 requested a review from andrewfg June 9, 2022 20:24
@andrewfg (Contributor)

So this should be fixed in another PR.

Agreed. See #12896

Signed-off-by: Nico Brüttner <n@bruettner.de>
Signed-off-by: Nico Brüttner <n@bruettner.de>
@andrewfg (Contributor) left a comment

LGTM

@kaikreuzer (Member) left a comment

Many thanks for your work @Bruetti1991 and thanks for reviewing @andrewfg!

@kaikreuzer kaikreuzer merged commit fd9fa72 into openhab:main Jun 14, 2022
@kaikreuzer kaikreuzer added this to the 3.3 milestone Jun 14, 2022
@openhab-bot (Collaborator)

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/gardena-error-429-limit-exceeded/137419/22

leifbladt pushed a commit to leifbladt/openhab-addons that referenced this pull request Oct 15, 2022
andan67 pushed a commit to andan67/openhab-addons that referenced this pull request Nov 6, 2022
andrasU pushed a commit to andrasU/openhab-addons that referenced this pull request Nov 12, 2022
psmedley pushed a commit to psmedley/openhab-addons that referenced this pull request Feb 23, 2023
nemerdaud pushed a commit to nemerdaud/openhab-addons that referenced this pull request Feb 28, 2023