Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve connection error handling #49

Closed
zulufoxtrot opened this issue Feb 15, 2019 · 10 comments
Closed

Improve connection error handling #49

zulufoxtrot opened this issue Feb 15, 2019 · 10 comments

Comments

@zulufoxtrot
Copy link

zulufoxtrot commented Feb 15, 2019

One may face 2 types of connection errors:

  • Connection from the wrapper to the backend (via gRPC)
  • Connection from the backend to PX4 (via mavlink)

Right now I see several issues:

  • There is no clear distinction between connection errors
  • Clearly identifying connection errors is not trivial
  • Connect() does not return a result (success or failure)
  • There are no methods in place to know the state of 1) the connection to the backend and 2) the connection to PX4, or they are not implemented, or buggy

Here are some examples:

Backend unreachable (DNS resolution failure):

[   ERROR] <_Rendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Name resolution failure"
	debug_error_string = "{"created":"@1550220055.531945000","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2261,"referenced_errors":[{"created":"@1550220055.531943000","description":"Name resolution failure","file":"src/core/ext/filters/client_channel/request_routing.cc","file_line":163,"grpc_status":14}]}"
>
Traceback (most recent call last):
  File "/Users/docs/boulot/controleur/controleur/Autopilot.py", line 68, in connect
    autopilot_type = await Autopilot.vehicle.info.get_version()
  File "/Users/docs/boulot/controleur/controleur/Libs/DronecodeSDK-Python/dronecode_sdk/plugins/info.py", line 288, in get_version
    response = await self._stub.GetVersion(request)
  File "/Users/.local/share/virtualenvs/controleur-7At54XPt/lib/python3.6/site-packages/aiogrpc/channel.py", line 40, in __call__
    return await fut
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Name resolution failure"
	debug_error_string = "{"created":"@1550220055.531945000","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2261,"referenced_errors":[{"created":"@1550220055.531943000","description":"Name resolution failure","file":"src/core/ext/filters/client_channel/request_routing.cc","file_line":163,"grpc_status":14}]}"
>

Backend unreacheable (connection refused):

Traceback (most recent call last):
  File "/Users/docs/boulot/controleur/controleur/Autopilot.py", line 68, in connect
    autopilot_type = await Autopilot.vehicle.info.get_version()
  File "/Users/docs/boulot/controleur/controleur/Libs/DronecodeSDK-Python/dronecode_sdk/plugins/info.py", line 288, in get_version
    response = await self._stub.GetVersion(request)
  File "/Users/.local/share/virtualenvs/controleur-7At54XPt/lib/python3.6/site-packages/aiogrpc/channel.py", line 40, in __call__
    return await fut
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Connect Failed"
	debug_error_string = "{"created":"@1550221187.353538000","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2261,"referenced_errors":[{"created":"@1550221187.353357000","description":"Pick Cancelled","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":245,"referenced_errors":[{"created":"@1550221187.353329000","description":"Connect Failed","file":"src/core/ext/filters/client_channel/subchannel.cc","file_line":867,"grpc_status":14,"referenced_errors":[{"created":"@1550221187.353326000","description":"OS Error","errno":65,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":313,"os_error":"No route to host","syscall":"connect"}]}]}]}"

Identifying errors

If found how to correctly identify the type of error by getting grpc.RpcError._state.code:

        except grpc.RpcError as e:
            logger.exception("Exception in telemetry handler: ")

            if e._state.code == grpc.StatusCode.UNAVAILABLE:
                logger.error("DronecodeSDK backend is unreacheable.")

But that looks like bad practice. I don't want to have to dig so deep into the dependencies to do proper error handling. Plus I'm not even supposed to use grpc.RpcError._state (private attribute)

Backend reachable but not PX4

I have not found a clean way to verify the connection with PX4 succeeded. A workaround is to try to get data from PX4 and catch an exception. For instance, using get_version() and catching InfoError. The problem is: 1) there could be other reasons for this exception to be raised. 2) there is a delay (timeout?) for the exception to be raised.

INFORMATION_NOT_RECEIVED_YET: 'Information not received yet'; origin: get_version(); params: ()
Traceback (most recent call last):
  File "/Users/docs/boulot/controleur/controleur/Autopilot.py", line 68, in connect
    autopilot_type = await Autopilot.vehicle.info.get_version()
  File "/Users/docs/boulot/controleur/controleur/Libs/DronecodeSDK-Python/dronecode_sdk/plugins/info.py", line 294, in get_version
    raise InfoError(result, "get_version()")
dronecode_sdk.plugins.info.InfoError: INFORMATION_NOT_RECEIVED_YET: 'Information not received yet'; origin: get_version(); params: ()

Connect() result

The connect() function does not appear to return a result. So in order to know if the connection is established, one has to use other ways like catching telemetry exceptions.

connection_state

I’ve tried 2 things:

  • system.is_connected: documented, but System doesn’t appear to be implemented yet in the python wrapper
  • core.connection_state(): undocumented, is implemented, but raises attributeerror: 'NoneType' object has no attribute 'connection_state'
@JonasVautherin
Copy link
Collaborator

Interesting points! May I ask the use-case for making a difference between those two? Is it for debugging/setting up the system, or is it affecting, say, the UI flow (e.g. you want to tell the user where the issue comes from)?

Right now it has been almost exclusively used with the backend running on the same device as the language bindings, so the connection to the backend is always fine (unless the backend is not running). So we haven't seen the need to improve this. That's why I am interested in the use-case.

This said, there is a way with gRPC to know the state of the channel, so we could possibly do something there. But the thing is that you don't always communicate with the server. As a comparison, say now I am writing this message on my browser. I don't know if my connection to github is still fine; I will only realize once I send the message, right? It's the same with gRPC. When you connect() (maybe we should call it init()), you actually instantiate objects on your side, but you start talking to the backend only when you do something (say, action.arm()).

@zulufoxtrot
Copy link
Author

zulufoxtrot commented Feb 15, 2019

May I ask the use-case for making a difference between those two? Is it for debugging/setting up the system, or is it affecting, say, the UI flow (e.g. you want to tell the user where the issue comes from)?
Right now it has been almost exclusively used with the backend running on the same device as the language bindings, so the connection to the backend is always fine (unless the backend is not running). So we haven't seen the need to improve this. That's why I am interested in the use-case.

For debugging. I am running the wrapper and the backend on two different machines for performance/compatibility needs (I use MacOS but I've had issues building and running px4/gazebo/mavlink-router/... on anything that isn't Linux). So there is a potentially unreliable network in between. I agree this isn't the envisioned use case.

But I can see a real use case where the flight controller fails (ex: hardware failure) and the backend keeps running, and vice versa. In which case I think it is important to differentiate.

This said, there is a way with gRPC to know the state of the channel, so we could possibly do something there. But the thing is that you don't always communicate with the server. As a comparison, say now I am writing this message on my browser. I don't know if my connection to github is still fine; I will only realize once I send the message, right? It's the same with gRPC.

I understand.
In any case, if a state for gRPC is not implemented, is it possible to have a state for PX4 in the wrapper? I know it has a heartbeat mechanism.

@JonasVautherin
Copy link
Collaborator

I use MacOS but I've had issues building and running px4/gazebo/mavlink-router/... on anything that isn't Linux

I see. Note that you can for sure run both the wrapper and backend on macOS, and SITL on your Linux. I do that a lot with this docker image, but you could just run SITL on Linux and redirect MAVLink to your macOS (by using e.g. mavlink-router or by doing something similar to what I do in my docker image, which is to set -t <backend_ip> in the mavlink start line).

In any case, if a state for gRPC is not implemented, is it possible to have a state for PX4 in the wrapper? I know it has a heartbeat mechanism.

The connection_state gives you the MAVLink status. If it says is_connected == false, it means that the connection between the backend and the PX4 failed (i.e. that comes from the heartbeat). If the connection between the wrapper and the backend fails, you get another error (probably a gRPC error?).

Would that solve your problem for now? We could also think about a more elegant way to handle that, for sure! But if that's not critical to you, I would rather not put that at the top of my priorities right now (if you want to contribute on that, I can definitely give feedback, though!).

@zulufoxtrot
Copy link
Author

The connection_state gives you the MAVLink status. If it says is_connected == false, it means that the connection between the backend and the PX4 failed (i.e. that comes from the heartbeat). If the connection between the wrapper and the backend fails, you get another error (probably a gRPC error?).
Would that solve your problem for now?

Sure. However, System.is_connected is not implemented in the current version of the wrapper.

I'm also curious about the role of Core.connection_state ?

@JonasVautherin
Copy link
Collaborator

Core.connection_state tells you about the MAVLink connection, right? When a system is connected, you get an event that tells you that a new system got connected, together with its uuid.

So I believe it is what you expect with System.is_connected, isn't it?

@zulufoxtrot
Copy link
Author

Core.connection_state tells you about the MAVLink connection, right? When a system is connected, you get an event that tells you that a new system got connected, together with its uuid.

I can't confirm it, because it throws an error whether the connection is established or not:

Autopilot.py", line 268, in get_connection_status
async for connection_state in Autopilot.vehicle.core.connection_state():
AttributeError: 'NoneType' object has no attribute 'connection_state'

So I believe it is what you expect with System.is_connected, isn't it?

Yes. I've dug into the generated code and it appears that core.connection_state() returns an object with 2 attributes: the uuid and an is_connected boolean. Sounds good, but what I don't understand is that it does not match the model used in other modules. Each class of the API ref (Info, Telemetry, Action, Mission) has a class with the same name in the wrapper. Except System.

@JonasVautherin
Copy link
Collaborator

I can't confirm it, because it throws an error whether the connection is established or not

That may be a bug, let me check that :-)

Sounds good, but what I don't understand is that it does not match the model used in other modules.

You're right. The thing is that other modules are "plugins", and Core is not. Core is the one managing the MAVLink connection. And the API of that is a bit different from the underlying C++ code (whereas the other plugins are more similar). However, the goal is to get the C++ API to get closer to that in the future. The idea being that the API is defined here, independently from the language.

Does that make sense to you?

@zulufoxtrot
Copy link
Author

Does that make sense to you?

It does, thanks for the explanation 😄

@zulufoxtrot
Copy link
Author

That may be a bug, let me check that :-)

@JonasVautherin any progress? I don't mean to push you, there is no rush.

@JonasVautherin
Copy link
Collaborator

Not yet. Should not be a big issue, though. Let me open two issues actually:

  1. The bug
  2. We should have better handling for gRPC errors in general (not just the connection).

And I will close this one, as I believe the current behavior which the gRPC error is usable, just not super elegant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants