Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wamp.close.transport_lost [WAMP transport was lost without closing the session before] #514

Closed
egrigore opened this issue Oct 22, 2019 · 9 comments

Comments

@egrigore
Copy link

Hi,

We have multiple exporters running as services and they often fail (at least 2-3 times per day) with below error. Is there any specific setting/configuration that we can use to avoid these failures?

oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '2-1.8.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 1647), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 40191), ('busnum', 2), ('devnum', 23)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: 2019-10-22T16:38:06 started ser2net for /dev/ttyUSB28 on port 47206
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkSerialPort', 'params': OrderedDict([('extra', {'path': '/dev/ttyUSB28', 'proxy': 'workstation-name', 'proxy_required': False}), ('port', 47206), ('host', 'workstation-name.dummy.domain.net')])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.3.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 8137), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 297), ('busnum', 1), ('devnum', 33)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: 2019-10-22T16:38:06 started ser2net for /dev/ttyUSB18 on port 44315
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkSerialPort', 'params': OrderedDict([('extra', {'path': '/dev/ttyUSB18', 'proxy': 'workstation-name', 'proxy_required': False}), ('port', 44315), ('host', 'workstation-name.dummy.domain.net')])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.2.3'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 5538), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 118), ('busnum', 1), ('devnum', 61)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: 2019-10-22T16:38:06 started ser2net for /dev/ttyUSB16 on port 58230
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkSerialPort', 'params': OrderedDict([('extra', {'path': '/dev/ttyUSB16', 'proxy': 'workstation-name', 'proxy_required': False}), ('port', 58230), ('host', 'workstation-name.dummy.domain.net')])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.1.3'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 5538), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 128), ('busnum', 1), ('devnum', 93)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: 2019-10-22T16:38:06 started ser2net for /dev/ttyUSB17 on port 49932
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkSerialPort', 'params': OrderedDict([('extra', {'path': '/dev/ttyUSB17', 'proxy': 'workstation-name', 'proxy_required': False}), ('port', 49932), ('host', 'workstation-name.dummy.domain.net')])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.1.4.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 5538), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 125), ('busnum', 1), ('devnum', 90)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '2-1.8.4.3'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 8137), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 303), ('busnum', 2), ('devnum', 38)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': False, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', None), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', None), ('host', 'workstation-name.dummy.domain.net'), ('model_id', None), ('busnum', None), ('devnum', None)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.3.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 8137), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 297), ('busnum', 1), ('devnum', 36)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': False, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', None), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', None), ('host', 'workstation-name.dummy.domain.net'), ('model_id', None), ('busnum', None), ('devnum', None)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.3.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 1317), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 46244), ('busnum', 1), ('devnum', 37)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.3.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 1317), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 42149), ('busnum', 1), ('devnum', 38)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': False, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', None), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', None), ('host', 'workstation-name.dummy.domain.net'), ('model_id', None), ('busnum', None), ('devnum', None)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.3.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 1317), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 46244), ('busnum', 1), ('devnum', 39)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': False, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', None), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', None), ('host', 'workstation-name.dummy.domain.net'), ('model_id', None), ('busnum', None), ('devnum', None)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.2.4.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 8137), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 318), ('busnum', 1), ('devnum', 40)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: {'acquired': None, 'avail': True, 'cls': 'NetworkUniversalUpdateUtility', 'params': OrderedDict([('path', '1-1.2.4.2'), ('extra', {'proxy': 'workstation-name', 'proxy_required': False}), ('vendor_id', 1317), ('host', 'workstation-name.dummy.domain.net'), ('model_id', 42149), ('busnum', 1), ('devnum', 41)])}
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: 2019-10-22T19:50:56 dropping connection to peer tcp:<coordinator-ip>:20408 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: connection lost
oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: 2019-10-22T19:50:56 session closed with reason wamp.close.transport_lost [WAMP transport was lost without closing the session before]
oct 22 19:51:12 workstation-name systemd[1]: exporter.service: Main process exited, code=exited, status=100/n/a
oct 22 19:51:12 workstation-name systemd[1]: exporter.service: Unit entered failed state.
oct 22 19:51:12 workstation-name systemd[1]: exporter.service: Failed with result 'exit-code'.
oct 22 19:51:17 workstation-name systemd[1]: exporter.service: Service hold-off time over, scheduling restart.
oct 22 19:51:17 workstation-name systemd[1]: Stopped labgrid exporter service.

@Emantor
Copy link
Member

Emantor commented Oct 23, 2019

Which version of labgrid is your custom fork based on?

@egrigore
Copy link
Author

Thank you for the prompt response.
The last commit to which we rebased is from 28 august: b6312f8
b6312f8

@Emantor
Copy link
Member

Emantor commented Oct 23, 2019

Your log shows the exporter logfile, can you print the coordinator log file around the same timestamps? It would be interesting whether this is the coordinator dropping the exporter due to ping timeout. It would also be relevant what kind of workloads are run on your exporter, maybe we still have a bottleneck somewhere in the exporter implementation which prevents timely responses to the coordinator pings.

@egrigore
Copy link
Author

We have 6 exporters, only 4 were active at the time the log was gathered. I had to take the logs from today and replace the real workstations/exporters/coordinator name and ip with dummy ones for security reasons. So these are the new names after which you can search for:

4 exporters:
exporter-1-workstation.dummy.domain.net with <exporter-1-ip>
exporter-2-workstation.dummy.domain.net with <exporter-2-ip>
exporter-3-workstation.dummy.domain.net with <exporter-3-ip>
exporter-4-workstation.dummy.domain.net with <exporter-4-ip>

1 coordinator:
coordinator.workstation.com <coordinator-ip>

2 pdus
<pdu1-ip>
<pdu2-ip>

See attached the log with the exporter part at the beginning and the coordinator in the end. They are on different machines (I just concatenated the log).
ping-timeout-coordinator-and-exporter-log.txt

@Emantor
Copy link
Member

Emantor commented Oct 23, 2019

Your coordinator log is off by an hour. It might be easier to search for lines like:

Kicking exporter (<number>/<exporter-name>)

in the exporter log yourself.

@egrigore
Copy link
Author

yes, I forgot to mention that the coordinator machine is on a different timezone and that is why it is one our off.

I have seen "kicking exporter" on coordinator log only once, but it was 2 days ago and for another exporter

@jluebbe
Copy link
Member

jluebbe commented Jan 10, 2020

oct 22 19:50:56 workstation-name service_start_exporter.sh[21952]: 2019-10-22T19:50:56 dropping connection to peer tcp::20408 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)

This indicates that crossbar stopped responding to low-level websocket pings from the exporter. This could be caused by network issues or an unresponsive crossbar process. Note that the coordinator is not involved in answering these websocket pings from exporters.

We've merged some stability fixes for the communication between exporters and coordinator in PR #547. It doesn't look like your issues are exactly like the symptoms in that case, but it would be useful to test anyway.

If you still get WebSocket ping timeouts, it could be useful to check with wireshark whether the TCP connection still ACKs the WebSocket ping request or not...

@egrigore
Copy link
Author

Hi, we will try the patch at next rebase. For now, we have worked around the issue with a sleep time of 2 minutes into a loop that was checking a file from the exporter - this somehow solved the problem.

@Emantor
Copy link
Member

Emantor commented Oct 26, 2020

Closing, I expect this to be fixed either by the update or a local workaround. Ping me for a reopen if required.

@Emantor Emantor closed this as completed Oct 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants