Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[netatmo] Bridge status out of sync #16518

Closed
jlaur opened this issue Mar 13, 2024 · 4 comments · Fixed by #16520
Closed

[netatmo] Bridge status out of sync #16518

jlaur opened this issue Mar 13, 2024 · 4 comments · Fixed by #16520
Labels
bug An unexpected problem or unintended behavior of an add-on

Comments

@jlaur
Copy link
Contributor

jlaur commented Mar 13, 2024

Two weather stations went offline:

2024-03-12 13:43:10.574 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:account:home' changed from ONLINE to OFFLINE (CONFIGURATION_ERROR): Configuration incomplete, please grant the binding to Netatmo Connect.
2024-03-12 13:43:10.576 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:weather-station:home:indoor' changed from ONLINE to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.577 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:weather-station:home:indoor_small_bathroom' changed from ONLINE to OFFLINE (BRIDGE_OFFLINE)

Because of service unavailability (HTTP code 503):

2024-03-12 13:43:05.216 [DEBUG] [mo.internal.handler.ApiBridgeHandler] - executeUri GET  https://api.netatmo.com/api/getstationsdata?device_id=station46&get_favorites=false 
2024-03-12 13:43:10.199 [ERROR] [oauth2client.internal.OAuthConnector] - grant type refresh_token to URL https://api.netatmo.com/oauth2/token failed with HTTP response code 503
2024-03-12 13:43:10.297 [DEBUG] [mo.internal.handler.ApiBridgeHandler] - Failed to load access token: Exception in oauth communication, grant type refresh_token: Bad http response, http code 503
2024-03-12 13:43:10.482 [INFO ] [etatmo.internal.servlet.GrantServlet] - Registered Netatmo servlet at '/netatmo/connect/xxx'
2024-03-12 13:43:10.548 [DEBUG] [tatmo.internal.handler.DeviceHandler] - bridgeStatusChanged for bridge netatmo:weather-station:home:indoor to OFFLINE (CONFIGURATION_ERROR): @text/conf-error-grant-needed
2024-03-12 13:43:10.552 [DEBUG] [tatmo.internal.handler.DeviceHandler] - bridgeStatusChanged for bridge netatmo:weather-station:home:indoor_small_bathroom to OFFLINE (CONFIGURATION_ERROR): @text/conf-error-grant-needed
2024-03-12 13:43:10.552 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor:indoor2 to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.553 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:outdoor:home:indoor:outdoor to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.552 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor:indoor3 to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.558 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor_small_bathroom:indoor_office to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.558 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:outdoor:home:indoor_small_bathroom:outdoor_shed to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.559 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor_small_bathroom:indoor_alva to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.559 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor_small_bathroom:indoor_linus to OFFLINE (BRIDGE_OFFLINE)
2024-03-12 13:43:10.577 [WARN ] [handler.capability.WeatherCapability] - Error retrieving weather data 'station46' : Not authenticated

The status for both weather stations is correct, but the description is debatable (service unavailability probably shouldn't result in configuration error).

After that follows the same - either HTTP status 500 or 503. I noticed also:

2024-03-12 14:28:13.292 [TRACE] [mo.internal.handler.ApiBridgeHandler] -  -returned: code [503 Service Unavailable] body <html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx</center>
</body>
</html>

2024-03-12 14:28:13.299 [WARN ] [handler.capability.WeatherCapability] - Error retrieving weather data 'station46' : Error deserializing error: Service Unavailable

Then suddenly:

2024-03-12 14:43:13.308 [DEBUG] [mo.internal.handler.ApiBridgeHandler] - executeUri GET  https://api.netatmo.com/api/getstationsdata?device_id=station46&get_favorites=false 
2024-03-12 14:43:13.312 [TRACE] [mo.internal.handler.ApiBridgeHandler] -  -with headers: Accept-Encoding: gzip, User-Agent: Jetty/9.4.52.v20230823, Authorization: Bearer xxx
2024-03-12 14:43:13.623 [TRACE] [mo.internal.handler.ApiBridgeHandler] -  -returned: code [200 OK] body CUT
2024-03-12 14:43:13.640 [DEBUG] [handler.capability.RefreshCapability] - Module refreshed, next one in 602s
2024-03-12 14:43:13.640 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor_small_bathroom:indoor_office to ONLINE
2024-03-12 14:43:13.642 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor_small_bathroom:indoor_alva to ONLINE
2024-03-12 14:43:13.642 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:outdoor:home:indoor_small_bathroom:outdoor_shed to ONLINE
2024-03-12 14:43:13.642 [DEBUG] [tatmo.internal.handler.ModuleHandler] - bridgeStatusChanged for thing netatmo:indoor:home:indoor_small_bathroom:indoor_linus to ONLINE

Event log:

2024-03-12 14:43:13.650 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:weather-station:home:indoor_small_bathroom' changed from OFFLINE (BRIDGE_OFFLINE) to ONLINE
2024-03-12 14:43:13.654 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:indoor:home:indoor_small_bathroom:indoor_office' changed from OFFLINE (BRIDGE_OFFLINE) to UNKNOWN
2024-03-12 14:43:13.657 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:indoor:home:indoor_small_bathroom:indoor_linus' changed from OFFLINE (BRIDGE_OFFLINE) to UNKNOWN
2024-03-12 14:43:13.659 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:outdoor:home:indoor_small_bathroom:outdoor_shed' changed from OFFLINE (BRIDGE_OFFLINE) to UNKNOWN
2024-03-12 14:43:13.668 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:indoor:home:indoor_small_bathroom:indoor_alva' changed from OFFLINE (BRIDGE_OFFLINE) to UNKNOWN

Shortly later the external modules came online as well:

2024-03-12 14:53:16.031 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:indoor:home:indoor_small_bathroom:indoor_office' changed from UNKNOWN to ONLINE
2024-03-12 14:53:16.059 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:outdoor:home:indoor_small_bathroom:outdoor_shed' changed from UNKNOWN to ONLINE
2024-03-12 14:53:16.080 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:indoor:home:indoor_small_bathroom:indoor_alva' changed from UNKNOWN to ONLINE
2024-03-12 14:53:16.135 [INFO ] [ab.event.ThingStatusInfoChangedEvent] - Thing 'netatmo:indoor:home:indoor_small_bathroom:indoor_linus' changed from UNKNOWN to ONLINE

The problem with the last two series of status updates is that one of the weather stations went offline while the API bridge was still offline:

image

This is inconsistent. And also, the other weather station never came online until I later restarted the binding.

Expected Behavior

When a weather station is online, its parent bridge should also be online.

Current Behavior

A weather station thing came be online and working without the bridge being online, and while another hierarchy remains offline altogether:

  • bridge: offline
    • weather station 1: online
      • module 1: online
      • module 2: online
    • weather station 2: offline (bridge offline)
      • module 1: offline
      • module 2: offline

Possible Solution

Account bridge must have been marked offline here:

updateStatus(ThingStatus.OFFLINE, ThingStatusDetail.CONFIGURATION_ERROR,
ConfigurationLevel.REFRESH_TOKEN_NEEDED.message);

and most likely triggered from executeUri:

if (!authenticate(null, null)) {
prepareReconnection(null, null);
throw new NetatmoException("Not authenticated");
}

In this case, will it ever come back online if the next call succeeds?

if (statusCode == Code.OK) {
return deserializer.deserialize(clazz, responseBody);
}

Perhaps what is missing is:

updateStatus(ThingStatus.ONLINE);

As a consequence, maybe the other weather station is kept offline:

} else if (!ThingStatus.ONLINE.equals(bridge.getStatus())) {
setThingStatus(ThingStatus.OFFLINE, ThingStatusDetail.BRIDGE_OFFLINE, null);

Steps to Reproduce

Context

This is part of general stability issues since OH 4.1. See also #16485.

Your Environment

  • Version used: 4.1.1
@jlaur jlaur added the bug An unexpected problem or unintended behavior of an add-on label Mar 13, 2024
@jlaur
Copy link
Contributor Author

jlaur commented Mar 13, 2024

I did a very primitive test, but did not manage yet to reproduce this from the suggestions above. In ApiBridgeHandler I created a sort of time bomb:

private final LocalTime startTime = LocalTime.now();
private boolean bugReproduced = false;

In beginning of executeUri:

            if (LocalTime.now().isAfter(startTime.plusMinutes(2)) && !bugReproduced) {
                bugReproduced = true;
                startAuthorizationFlow();
                prepareReconnection(null, null);
                throw new NetatmoException("Not authenticated");
            }

This simulated a single occurrence of a communication problem in authenticate. On the next occasion, this took all things offline. However, later everything came online again. I believe this is because I overlooked this in openConnection:

So I currently don't have a theory on how one weather station came back online while the account bridge stayed offline.

@jlaur
Copy link
Contributor Author

jlaur commented Mar 13, 2024

I missed that prepareReconnection schedules a one-shot task calling openConnection. So I need it to fail at least twice - second reproduction attempt:

    private final LocalTime startTime = LocalTime.now();
    private int bugReproducedCounter = 0;

and in the first lines of authenticate:

    private boolean authenticate(@Nullable String code, @Nullable String redirectUri) {
        if (LocalTime.now().isAfter(startTime.plusMinutes(2)) && bugReproducedCounter < 2) {
            bugReproducedCounter++;
            startAuthorizationFlow();
            return false;
        }

This time I was able to reproduce it.

I will run another test with a fix and provide a draft PR for that. It will take some time to verify because reconnection is scheduled to 15 minutes and openHAB restarts are needed because of undisposed jobs, so probably I won't have final results until tomorrow.

jlaur added a commit to jlaur/openhab-addons that referenced this issue Mar 13, 2024
Fixes openhab#16518

Signed-off-by: Jacob Laursen <jacob-github@vindvejr.dk>
@jlaur
Copy link
Contributor Author

jlaur commented Mar 14, 2024

The status for both weather stations is correct, but the description is debatable (service unavailability probably shouldn't result in configuration error).

This part I'm leaving for another issue, and it might not even be in this repository. I was quite puzzled about how it ended up being logged like this:

2024-03-12 13:43:10.297 [DEBUG] [mo.internal.handler.ApiBridgeHandler] - Failed to load access token: Exception in oauth communication, grant type refresh_token: Bad http response, http code 503

At first glance I would have expected this to have resulted in IOException, which would have been handled differently by the binding. The exception is thrown by this method:

https://github.com/openhab/openhab-core/blob/12a67421891bb28f1bece38e61495e49650a30c1/bundles/org.openhab.core.auth.oauth2client/src/main/java/org/openhab/core/auth/oauth2client/internal/OAuthConnector.java#L316-L355

So actually it is throwing OAuthException because of HTTP code 503:

throw new OAuthException("Bad http response, http code " + statusCode);

This is then caught, wrapped and rethrown just below:

        } catch (Exception e) {
            // Dont know what exception it is, wrap it up and throw it out
            throw new OAuthException(
                    "Exception in oauth communication, grant type " + grantType + ": " + e.getMessage(), e);
        }

It is still OAuthException though, so this rethrowing, even if suspicious, doesn't have any impact on how it can be handled on client side. I'm not sure if it's the correct type, I will check this later. For the time being I don't think the binding can easily know how to react in the most sensible way.

@jlaur
Copy link
Contributor Author

jlaur commented Mar 14, 2024

With the very small provided fix in the linked PR, the binding is now able to recover after multiple OAuth failures, i.e. the bridge will come back online and allow other child weather stations to come back online as well.

There are several other problems to look into as well. Especially I'm experiencing problems with jobs not being disposed and InterruptedException issues. Some are described in #16485. Anything else I will describe in separate issues. One step at a time.

lolodomo pushed a commit that referenced this issue Mar 14, 2024
Fixes #16518

Signed-off-by: Jacob Laursen <jacob-github@vindvejr.dk>
jlaur added a commit that referenced this issue Mar 15, 2024
Fixes #16518

Signed-off-by: Jacob Laursen <jacob-github@vindvejr.dk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An unexpected problem or unintended behavior of an add-on
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant