This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Draft: Migrate to OpenTelemetry tracing #13400

Closed

Conversation

@MadLittleMods (Contributor) commented Jul 26, 2022

Migrate to OpenTelemetry tracing

Fix #11850

  • Now requires the latest Twisted (22.8.0+) in order for tracing to work
    • In the backend chapter sync last month, we discussed that it would probably be fine to require the latest Twisted for tracing, since it's mostly a tool for us.

Blockers

Dev notes

poetry run synapse_homeserver --config-path homeserver.yaml
$ poetry cache list
$ poetry cache clear --all pypi

via https://stackoverflow.com/a/70064450/796832

# Stop the current virtualenv if active
$ deactivate

# Find the venv for poetry
$ poetry env info -p
# Remove all the files of the current environment
$ rm -rf $(poetry env info -p)

# Reactivate Poetry shell
$ poetry shell
# Install everything
$ poetry install --extras all

$ psql synapse

-- Rename the columns for the OpenTelemetry migration:
ALTER TABLE device_lists_outbound_pokes RENAME COLUMN opentracing_context TO tracing_context;
ALTER TABLE device_lists_changes_in_room RENAME COLUMN opentracing_context TO tracing_context;

-- And to revert back:
ALTER TABLE device_lists_outbound_pokes RENAME COLUMN tracing_context TO opentracing_context;
ALTER TABLE device_lists_changes_in_room RENAME COLUMN tracing_context TO opentracing_context;

Typing

Type hints for optional imports:
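Roughly, the pattern looks like the sketch below (illustrative names, not the exact code in synapse/logging/tracing.py):

```python
from typing import TYPE_CHECKING, Optional

try:
    import opentelemetry
    import opentelemetry.trace
except ImportError:
    opentelemetry = None  # type: ignore[assignment]

if TYPE_CHECKING:
    # Only imported for type checking, so the runtime dependency stays optional.
    from opentelemetry.trace import Span


def get_active_span() -> "Optional[Span]":
    """Return the currently active span, or None when tracing is disabled."""
    if opentelemetry is None:
        return None
    return opentelemetry.trace.get_current_span()
```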

OpenTelemetry

Code:

OTEL code:

Sampling:
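For reference, a rough sketch of wiring a sampler into the OpenTelemetry SDK (the ratio and structure are illustrative, not what this PR configures):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new (root) traces, but always follow the parent's sampling
# decision for spans that join an existing trace.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```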

UDP on macOS (maybe Docker)

Error when using jaeger.thrift over Thrift-compact protocol with UDP on port 6831

homeserver.yaml

tracing:
    # ...
    jaeger_exporter_config:
      agent_host_name: localhost
      agent_port: 6831
2022-08-02 19:37:14,796 - opentelemetry.sdk.trace.export - 360 - ERROR - sentinel - Exception while exporting Span batch.
Traceback (most recent call last):
  File "/Users/eric/Documents/github/element/synapse/.venv/lib/python3.9/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 358, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File "/Users/eric/Documents/github/element/synapse/.venv/lib/python3.9/site-packages/opentelemetry/exporter/jaeger/thrift/__init__.py", line 219, in export
    self._agent_client.emit(batch)
  File "/Users/eric/Documents/github/element/synapse/.venv/lib/python3.9/site-packages/opentelemetry/exporter/jaeger/thrift/send.py", line 95, in emit
    udp_socket.sendto(buff, self.address)
OSError: [Errno 40] Message too long
2022-08-02 19:45:50,196 - opentelemetry.exporter.jaeger.thrift.send - 87 - WARNING - sentinel - Data exceeds the max UDP packet size; size 94357, max 65000

Fix: open-telemetry/opentelemetry-python#1061 (comment) -> https://www.jaegertracing.io/docs/1.19/client-libraries/#emsgsize-and-udp-buffer-limits -> jaegertracing/jaeger-client-node#124

$ sysctl net.inet.udp.maxdgram
net.inet.udp.maxdgram: 9216
$ sudo sysctl net.inet.udp.maxdgram=65536
net.inet.udp.maxdgram: 9216 -> 65536
$ sysctl net.inet.udp.maxdgram
net.inet.udp.maxdgram: 65536

To make that change stick between OS reboots, make sure to add it to /etc/sysctl.conf as well:

/etc/sysctl.conf

net.inet.udp.maxdgram=65536

You also have to set udp_split_oversized_batches so the batches are split up if they go over the 65k limit set in OTEL:
homeserver.yaml

tracing:
    # ...
    jaeger_exporter_config:
      agent_host_name: localhost
      agent_port: 6831
      # Split UDP packets (UDP_PACKET_MAX_LENGTH is set to 65k in OpenTelemetry)
      udp_split_oversized_batches: true

Using Thrift over UDP (port 6831) to communicate with the Jaeger agent doesn't seem to work from within a Complement test Docker container. I wonder why that is? I was seeing the same behavior with the Jaeger opentracing stuff. Is the UDP connection being over-saturated (1065 spans in one trace)? Can the Jaeger agent in Docker not keep up? We see some spans come over, but never the main overarching servlet span, which is probably the last to be exported. Using the HTTP Jaeger collector endpoint instead seems to work fine for getting the whole trace (collector_endpoint: "http://localhost:14268/api/traces?format=jaeger.thrift").
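For reference, a rough sketch of what those two transports look like when wiring the OpenTelemetry Jaeger Thrift exporter directly in Python (parameter names follow opentelemetry-exporter-jaeger-thrift; the hosts/ports are just the defaults used above):

```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Agent over UDP (Thrift compact, port 6831) -- the setup that struggled in Docker/macOS.
udp_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
    udp_split_oversized_batches=True,  # split batches above the 65k UDP limit
)

# Collector over HTTP (Thrift binary, port 14268) -- the setup that returned whole traces.
http_exporter = JaegerExporter(
    collector_endpoint="http://localhost:14268/api/traces?format=jaeger.thrift",
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(http_exporter))
trace.set_tracer_provider(provider)
```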

TODO

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

pyproject.toml Outdated
@@ -196,7 +196,7 @@ oidc = ["authlib"]
 systemd = ["systemd-python"]
 url_preview = ["lxml"]
 sentry = ["sentry-sdk"]
-opentracing = ["jaeger-client", "opentracing"]
+opentelemetry = ["opentelemetry-api", "opentelemetry-sdk"]
@MadLittleMods (Contributor, Author) commented:
I don't know what I am doing with the dependencies here. Please double check this stuff.

@MadLittleMods added the T-Task label (Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks) on Jul 27, 2022
synapse/app/_base.py Outdated (resolved)
```
Invalid type StreamToken for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
```

Had to add a few more logs to find this instance since the warning doesn't give much info about where I am setting this invalid attribute.
This was good enough to find it in the code.
```
BoundedAttributes __setitem__ key=since_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481272, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)

BoundedAttributes __setitem__ key=now_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481287, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)

BoundedAttributes __setitem__ key=token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481291, typing_key=0, receipt_key=340, account_data_key=1237, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)
```
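A minimal sketch of the kind of workaround this points at: stringify non-primitive values before attaching them as span attributes (set_attribute is the standard OpenTelemetry API; the helper name is made up):

```python
import opentelemetry.trace

ALLOWED_TYPES = (bool, str, bytes, int, float)


def set_attribute_safely(key: str, value: object) -> None:
    """Set a span attribute, stringifying values OpenTelemetry would reject."""
    span = opentelemetry.trace.get_current_span()
    if isinstance(value, ALLOWED_TYPES):
        span.set_attribute(key, value)
    else:
        # e.g. a StreamToken: serialize it instead of tripping the
        # "Invalid type ... for attribute value" warning.
        span.set_attribute(key, str(value))
```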
Fix error:
```
AttributeError: 'SpanContext' object has no attribute 'get'
```

`Context`:
```
{'current-span-1a226c96-a5db-4412-bcaa-1fdd34213c5c': _Span(name="sendToDevice", context=SpanContext(trace_id=0x5d2dcc3fdc8205046d60a5cd18672ac6, span_id=0x715c736ff5f4d208, trace_flags=0x01, trace_state=[], is_remote=False))}
```

`SpanContext`:
```
SpanContext(trace_id=0xf7cd9d058b7b76f364bdd649c4ba7b8a, span_id=0x287ce71bac31bfc4, trace_flags=0x01, trace_state=[], is_remote=False)
```
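The two objects are easy to mix up: a Context is the dict-like mapping that carries the active span (the first dump above), while a SpanContext only holds the ids and flags of a single span (the second dump). A rough sketch of bridging between them with the standard OpenTelemetry helpers (the ids are copied from the dumps above):

```python
from opentelemetry import trace
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags

# Rebuild a Context (the dict-like object) from a bare SpanContext, e.g. one
# extracted from incoming headers.
span_context = SpanContext(
    trace_id=0xF7CD9D058B7B76F364BDD649C4BA7B8A,
    span_id=0x287CE71BAC31BFC4,
    is_remote=True,
    trace_flags=TraceFlags(TraceFlags.SAMPLED),
)
context = trace.set_span_in_context(NonRecordingSpan(span_context))

# `context` can now be passed as the parent when starting spans; calling
# dict-style methods like `.get(...)` on the SpanContext itself is what raised
# the error above.
```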
The `incoming-federation-request` vs `process-federation-request` split was first introduced in #11870 (the OpenTelemetry equivalent is sketched after the list below).

 - Span for remote trace: `incoming-federation-request`
    - `child_of` reference: `origin_span_context`
    - `follows_from` reference: `servlet_span`
 - Span for local trace: `process-federation-request`
    - `child_of` reference: `servlet_span` (by the nature of it being active)
    - `follows_from` reference: `incoming-federation-request`
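OpenTelemetry has no follows_from reference, so the closest mapping is a parent (set via the context) plus a Link, roughly as sketched here; origin_span_context and servlet_span are illustrative stand-ins, not names from this PR:

```python
from opentelemetry import trace
from opentelemetry.trace import Link, NonRecordingSpan, SpanContext, TraceFlags

tracer = trace.get_tracer(__name__)

# Placeholder stand-ins: the span context extracted from the remote server's
# headers, and the locally active servlet span.
origin_span_context = SpanContext(
    trace_id=0x5D2DCC3FDC8205046D60A5CD18672AC6,
    span_id=0x715C736FF5F4D208,
    is_remote=True,
    trace_flags=TraceFlags(TraceFlags.SAMPLED),
)
servlet_span = tracer.start_span("servlet")

# Remote trace: parented on the origin server's context, linked to the servlet span.
incoming_span = tracer.start_span(
    "incoming-federation-request",
    context=trace.set_span_in_context(NonRecordingSpan(origin_span_context)),
    links=[Link(servlet_span.get_span_context())],
)

# Local trace: parented on the (active) servlet span, linked to the remote span.
with trace.use_span(servlet_span, end_on_exit=False):
    with tracer.start_as_current_span(
        "process-federation-request",
        links=[Link(incoming_span.get_span_context())],
    ):
        ...  # handle the request

incoming_span.end()
servlet_span.end()
```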
@@ -701,15 +701,15 @@ def new_transaction(
     exception_callbacks=exception_callbacks,
 )
 try:
-    with opentracing.start_active_span(
+    with tracing.start_active_span(
         "db.txn",
@MadLittleMods (Contributor, Author) commented Aug 9, 2022:
When viewing a trace for a servlet, we have the db.{desc} span, but the child database spans like db.connection, db.txn, and db.query are missing 🤔

Probably need to assert that the parents are set correctly in the tests as well (particularly with deferreds).
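A hedged sketch of how such an assertion could look using the SDK's in-memory exporter (not Synapse's actual test harness):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Collect finished spans in memory so a test can assert on parent/child links.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer(__name__)

with tracer.start_as_current_span("db.connection"):
    with tracer.start_as_current_span("db.txn"):
        pass

# Spans finish innermost-first: db.txn, then db.connection.
txn_span, conn_span = exporter.get_finished_spans()
assert txn_span.parent is not None
assert txn_span.parent.span_id == conn_span.context.span_id
```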

Conflicts:
	synapse/logging/opentracing.py
	tests/logging/test_opentracing.py
Conflicts:
	poetry.lock
	synapse/federation/federation_client.py
	synapse/federation/federation_server.py
	synapse/handlers/federation.py
	synapse/handlers/federation_event.py
	synapse/logging/opentracing.py
	synapse/rest/client/room.py
	synapse/storage/controllers/persist_events.py
	synapse/storage/controllers/state.py
Conflicts:
	poetry.lock
	synapse/api/auth.py
	synapse/federation/federation_client.py
	synapse/logging/opentracing.py
	synapse/rest/client/keys.py
	synapse/rest/client/sendtodevice.py
	synapse/storage/schema/__init__.py
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 512, in _async_render
    callback_return = await raw_callback_return
  File "/usr/local/lib/python3.9/site-packages/synapse/federation/transport/server/_base.py", line 357, in new_func
    remote_parent_span = create_non_recording_span()
  File "/usr/local/lib/python3.9/site-packages/synapse/logging/tracing.py", line 502, in create_non_recording_span
    return opentelemetry.trace.NonRecordingSpan(
AttributeError: 'NoneType' object has no attribute 'trace'
```
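The NoneType here is the optional opentelemetry import being unset; a rough sketch of the sort of guard that avoids it (the real create_non_recording_span lives in synapse/logging/tracing.py, so treat this body as illustrative):

```python
from typing import TYPE_CHECKING, Optional

try:
    import opentelemetry
    import opentelemetry.trace
except ImportError:
    opentelemetry = None  # type: ignore[assignment]

if TYPE_CHECKING:
    from opentelemetry.trace import NonRecordingSpan


def create_non_recording_span() -> "Optional[NonRecordingSpan]":
    """Create a no-op span, or None when OpenTelemetry isn't installed."""
    if opentelemetry is None:
        return None
    return opentelemetry.trace.NonRecordingSpan(
        opentelemetry.trace.INVALID_SPAN_CONTEXT
    )
```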
Hopefully fix:
```
  File "/home/runner/work/synapse/synapse/synapse/storage/controllers/persist_events.py", line 246, in add_to_queue
    links=[Link(end_item.tracing_span_context)],
builtins.TypeError: __init__() takes 1 positional argument but 2 were given
```
Conflicts:
	.github/workflows/tests.yml
	poetry.lock
	synapse/storage/schema/__init__.py
Conflicts:
	synapse/handlers/message.py
	synapse/logging/opentracing.py
Conflicts:
	poetry.lock (conflicts not fixed)
	synapse/handlers/message.py
	synapse/handlers/relations.py
	synapse/storage/databases/main/devices.py
	synapse/storage/schema/__init__.py
Conflicts:
	docs/usage/configuration/config_documentation.md
	poetry.lock
	synapse/handlers/message.py
	synapse/http/server.py
	synapse/logging/opentracing.py
	synapse/rest/client/keys.py
	synapse/rest/client/room.py
Comment on lines +566 to +584
span = start_span(
    name=name,
    context=context,
    kind=kind,
    attributes=attributes,
    links=links,
    start_time=start_time,
    record_exception=record_exception,
    set_status_on_exception=set_status_on_exception,
    tracer=tracer,
)

# Equivalent to `tracer.start_as_current_span`
return opentelemetry.trace.use_span(
    span,
    end_on_exit=end_on_exit,
    record_exception=record_exception,
    set_status_on_exception=set_status_on_exception,
)
@MadLittleMods (Contributor, Author) commented May 20, 2023:
Moving this over from #13440 (comment),

Occasionally, I see spans that have a super massive duration and show up as 213503982d 8h in the Jaeger UI. It obviously looks like some sort of max-duration bug.

The raw data for those spans is the following. Only these spans are messed up, and the end_time does not affect any further parent spans above _process_pulled_event.

db.get_partial_state_events

{
    "traceID": "563478e5d75db03e9fc028822fde0649",
    "spanID": "5b63b0172763c5cf",
    "flags": 1,
    "operationName": "db.get_partial_state_events",
    "references": [
        {
            "refType": "CHILD_OF",
            "traceID": "563478e5d75db03e9fc028822fde0649",
            "spanID": "10c9f2a2c50e286f"
        }
    ],
    "startTime": 1659994809436162,
    "duration": 18446744073709517944,
    "tags": [
        // ...
    ],
    "logs": [],
    "processID": "p1",
    "warnings": null
}

_process_pulled_event

{
    "traceID": "563478e5d75db03e9fc028822fde0649",
    "spanID": "6e9ee2608b03b542",
    "flags": 1,
    "operationName": "_process_pulled_event",
    "references": [
        {
            "refType": "CHILD_OF",
            "traceID": "563478e5d75db03e9fc028822fde0649",
            "spanID": "af2c9c8d08284134"
        }
    ],
    "startTime": 1659994809432356,
    "duration": 18446744073709532019,
    "tags": [
        // ...
    ],
    "logs": [],
    "processID": "p1",
    "warnings": null
}

Definitely close to the Python "max int size" (an unsigned 64-bit word): sys.maxsize * 2 + 1 (i.e. 2^64 - 1) -> 18446744073709551615 (via https://stackoverflow.com/questions/7604966/maximum-and-minimum-values-for-ints)

It seems like some negative value is getting turned into an unsigned integer, so it wraps to the end of the value range, but I don't see the math where that can happen yet.

start_time and end_time are calculated by time.time_ns() (_time_ns)

And the Jaeger exporter calculates startTime and duration with this small _nsec_to_usec_round function.

To see whether it's an endTime (0) - startTime (some positive number) = some negative number type of problem: if we take the Python max value minus the duration we see from _process_pulled_event, 18446744073709551615 - 18446744073709532019 -> 19596 microseconds. TODO: Is that a reasonable duration?
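Checking that hypothesis with plain Python arithmetic (the -19597 below is simply whatever value would wrap to the observed duration, not a number taken from the code):

```python
MAX_U64 = 2**64 - 1  # 18446744073709551615, the value quoted above

observed = 18446744073709532019  # duration reported for _process_pulled_event

# Distance from the top of the unsigned range:
print(MAX_U64 - observed)  # 19596

# A small negative duration stored as an unsigned 64-bit value wraps to
# almost exactly the observed number:
print(-19597 % 2**64)  # 18446744073709532019
```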

@MadLittleMods (Contributor, Author) commented:
Closing, as our existing OpenTracing setup seems to just work over UDP inside Docker (for Complement tests) on my Linux machine now, so I don't have a personal need for this anymore (it didn't work properly on macOS).

This is probably best tackled with a fresh PR using some of the learnings from here anyway.

Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

switch from jaeger-client to opentelemetry
4 participants