This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Draft: Migrate to OpenTelemetry tracing #13400

Closed
Changes from 53 commits
75 commits
0cc610e
Migrate to OpenTelemetry tracing
MadLittleMods Jul 26, 2022
2fe6911
Some shim and some new
MadLittleMods Jul 27, 2022
6984cef
Progress towards OTEL
MadLittleMods Jul 27, 2022
6406fd5
Server running
MadLittleMods Jul 27, 2022
2428172
Export to Jaeger (things are showing up)
MadLittleMods Jul 27, 2022
0d7a2b9
Revert changes to Sentry scopes (not OTEL)
MadLittleMods Jul 27, 2022
9e1de86
We use the config for the Jaeger exporter now
MadLittleMods Jul 27, 2022
f6c3b22
Fix some lints
MadLittleMods Jul 27, 2022
3a25996
Fixup some todos
MadLittleMods Jul 28, 2022
1b0840e
Fix some lints
MadLittleMods Jul 29, 2022
1d208fa
Fix invalid attribute type
MadLittleMods Jul 29, 2022
2011ac2
Fix using wrong type of context (`Context` vs `SpanContext`)
MadLittleMods Jul 29, 2022
19d20b5
Record exception
MadLittleMods Jul 29, 2022
786dd9b
Explain weird function
MadLittleMods Jul 29, 2022
7c135b9
Easier to follow local vs remote span tracing
MadLittleMods Jul 30, 2022
d29a4af
Move to start_active_span
MadLittleMods Jul 30, 2022
041acdf
Working second test although it's a bit pointless testing whether ope…
MadLittleMods Jul 30, 2022
d848156
Passing tests and context manager doesn't seem to be needed
MadLittleMods Jul 30, 2022
070195a
Use correct type for what start_as_current_span returns
MadLittleMods Jul 30, 2022
7772f50
Use HTTP_HOST attribute
MadLittleMods Jul 30, 2022
322da51
Fix some lints
MadLittleMods Aug 1, 2022
33fd24e
todos
MadLittleMods Aug 1, 2022
a9fb504
Implement start_active_span_from_edu for OTEL
MadLittleMods Aug 1, 2022
8e902b8
Remove what's left of scopemanager
MadLittleMods Aug 1, 2022
00be06c
Try to align read from edu content
MadLittleMods Aug 1, 2022
6255a1a
Fix tests and some lints
MadLittleMods Aug 2, 2022
b3cdbad
PoC force tracing
MadLittleMods Aug 2, 2022
d15fa45
Non-working try baggage to inherit force tracing/sampling
MadLittleMods Aug 2, 2022
6bb7cb7
Revert "Non-working try baggage to inherit force tracing/sampling"
MadLittleMods Aug 2, 2022
dbd9005
Revert crazy custom sampler and span process to try force tracing for…
MadLittleMods Aug 2, 2022
0f93ec8
Fix lints
MadLittleMods Aug 2, 2022
36d6648
Remove type ignore comments
MadLittleMods Aug 2, 2022
fb0e820
More clear method names
MadLittleMods Aug 2, 2022
b09651a
Always return config path for config error
MadLittleMods Aug 2, 2022
da396a2
Add test for what happens when side by side spans in with statement
MadLittleMods Aug 2, 2022
ad71bc3
End on exit is already the default expected behavior
MadLittleMods Aug 2, 2022
59facea
Restore logging current_context (not sure why removed
MadLittleMods Aug 2, 2022
9d6fcf3
Clean up some opentracing text references
MadLittleMods Aug 2, 2022
fcc4220
Update docs
MadLittleMods Aug 2, 2022
d72cacf
Add changelog
MadLittleMods Aug 2, 2022
ba4a46a
Seems to (see test_side_by_side_spans)
MadLittleMods Aug 2, 2022
72c718d
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Aug 2, 2022
c26fa2d
Move to 72 schema version
MadLittleMods Aug 2, 2022
5999132
Fix lints
MadLittleMods Aug 2, 2022
2491665
Fix remnant
MadLittleMods Aug 2, 2022
16d17f7
Fix table missing column
MadLittleMods Aug 3, 2022
b6f5665
Use latested Twisted from source to fix contextvar issues causing OTE…
MadLittleMods Aug 3, 2022
699dad0
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Aug 3, 2022
270db42
Update treq to match minimum Twisted Python versions
MadLittleMods Aug 3, 2022
f5da762
Revert "Update treq to match minimum Twisted Python versions"
MadLittleMods Aug 3, 2022
ccd4752
Fix tracing imports after merging in develop
MadLittleMods Aug 3, 2022
d7166a0
Update docs/tracing.md
MadLittleMods Aug 3, 2022
7566375
Try fix Twisted/treq problems
MadLittleMods Aug 4, 2022
7024d7b
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Aug 9, 2022
8def7e4
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Aug 18, 2022
50f0342
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Sep 9, 2022
f73bc59
Try to resolve poetry deps
MadLittleMods Sep 9, 2022
a15592d
Poetry install again
MadLittleMods Sep 9, 2022
32b9d16
poetry update
MadLittleMods Sep 9, 2022
6c40dfa
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Sep 12, 2022
ad3e324
Install otel deps from develop
MadLittleMods Sep 12, 2022
15e242e
OTEL install with DMR
MadLittleMods Sep 13, 2022
d730a46
Update Twisted to lastest
MadLittleMods Sep 13, 2022
ed11237
Remove linting from CI for now
MadLittleMods Sep 13, 2022
19c6f6e
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Sep 13, 2022
b77d49f
Hopefully fix problem when OTEL not installed with non recording span
MadLittleMods Sep 13, 2022
a027c6e
Maybe fix positional argument mismatch for DummyLink
MadLittleMods Sep 14, 2022
84f91e3
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Sep 14, 2022
b86869f
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Sep 20, 2022
e4b9898
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Sep 26, 2022
4a495ac
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Oct 1, 2022
7d70acd
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Oct 20, 2022
627951e
Fix poetry.lock conflicts
MadLittleMods Oct 20, 2022
d993cb0
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Nov 18, 2022
7acb365
Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry
MadLittleMods Nov 21, 2022
1 change: 1 addition & 0 deletions changelog.d/13400.feature
@@ -0,0 +1 @@
Migrate from OpenTracing to OpenTelemetry (config changes necessary).
Contributor Author
Any other changelogs to base this kind of change on?

2 changes: 1 addition & 1 deletion docs/SUMMARY.md
@@ -85,7 +85,7 @@
- [Git Usage](development/git.md)
- [Testing]()
- [Demo scripts](development/demo.md)
- [OpenTracing](opentracing.md)
- [Tracing](tracing.md)
- [Database Schemas](development/database_schema.md)
- [Experimental features](development/experimental_features.md)
- [Dependency management](development/dependencies.md)
93 changes: 1 addition & 92 deletions docs/opentracing.md
@@ -1,94 +1,3 @@
# OpenTracing

## Background

OpenTracing is a semi-standard being adopted by a number of distributed
tracing platforms. It is a common api for facilitating vendor-agnostic
tracing instrumentation. That is, we can use the OpenTracing api and
select one of a number of tracer implementations to do the heavy lifting
in the background. Our current selected implementation is Jaeger.

OpenTracing is a tool which gives an insight into the causal
relationship of work done in and between servers. The servers each track
events and report them to a centralised server - in Synapse's case:
Jaeger. The basic unit used to represent events is the span. The span
roughly represents a single piece of work that was done and the time at
which it occurred. A span can have child spans, meaning that the work of
the child had to be completed for the parent span to complete, or it can
have follow-on spans which represent work that is undertaken as a result
of the parent but is not depended on by the parent to in order to
finish.

Since this is undertaken in a distributed environment a request to
another server, such as an RPC or a simple GET, can be considered a span
(a unit or work) for the local server. This causal link is what
OpenTracing aims to capture and visualise. In order to do this metadata
about the local server's span, i.e the 'span context', needs to be
included with the request to the remote.

It is up to the remote server to decide what it does with the spans it
creates. This is called the sampling policy and it can be configured
through Jaeger's settings.

For OpenTracing concepts see
<https://opentracing.io/docs/overview/what-is-tracing/>.

For more information about Jaeger's implementation see
<https://www.jaegertracing.io/docs/>

## Setting up OpenTracing

To receive OpenTracing spans, start up a Jaeger server. This can be done
using docker like so:

```sh
docker run -d --name jaeger \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
jaegertracing/all-in-one:1
```

Latest documentation is probably at
https://www.jaegertracing.io/docs/latest/getting-started.

## Enable OpenTracing in Synapse

OpenTracing is not enabled by default. It must be enabled in the
homeserver config by adding the `opentracing` option to your config file. You can find
documentation about how to do this in the [config manual under the header 'Opentracing'](usage/configuration/config_documentation.md#opentracing).
See below for an example Opentracing configuration:

```yaml
opentracing:
enabled: true
homeserver_whitelist:
- "mytrustedhomeserver.org"
- "*.myotherhomeservers.com"
```

## Homeserver whitelisting

The homeserver whitelist is configured using regular expressions. A list
of regular expressions can be given and their union will be compared
when propagating any spans contexts to another homeserver.

Though it's mostly safe to send and receive span contexts to and from
untrusted users since span contexts are usually opaque ids it can lead
to two problems, namely:

- If the span context is marked as sampled by the sending homeserver
the receiver will sample it. Therefore two homeservers with wildly
different sampling policies could incur higher sampling counts than
intended.
- Sending servers can attach arbitrary data to spans, known as
'baggage'. For safety this has been disabled in Synapse but that
doesn't prevent another server sending you baggage which will be
logged to OpenTracing's logs.

## Configuring Jaeger

Sampling strategies can be set as in this document:
<https://www.jaegertracing.io/docs/latest/sampling/>.
Synapse now uses OpenTelemetry and the [documentation for tracing has moved](./tracing.md).
90 changes: 90 additions & 0 deletions docs/tracing.md
@@ -0,0 +1,90 @@
# Tracing

## Background

OpenTelemetry is a semi-standard being adopted by a number of distributed
tracing platforms. It is a common API for facilitating vendor-agnostic
tracing instrumentation.

Tracing is a tool which gives an insight into the causal
relationship of work done in and between servers. The servers each track
events and report them to a centralised server - in Synapse's case:
Jaeger. The basic unit used to represent events is the span. The span
roughly represents a single piece of work that was done and the time at
which it occurred. A span can have child spans, meaning that the work of
the child had to be completed for the parent span to complete, or it can
have follow-on spans which represent work that is undertaken as a result
of the parent but that the parent does not depend on in order to
finish.

Since this is undertaken in a distributed environment, a request to
another server, such as an RPC or a simple GET, can be considered a span
(a unit of work) for the local server. This causal link is what
tracing aims to capture and visualise. In order to do this, metadata
about the local server's span, i.e. the 'span context', needs to be
included with the request to the remote.
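
As a rough illustration of what "including the span context with the request" can look like, here is a minimal standard-library sketch. It assumes the W3C Trace Context `traceparent` header format (`00-<trace id>-<span id>-<flags>`), which is not stated in this document; `SpanContextSketch`, `inject` and `extract` are hypothetical names for illustration and are not Synapse's or OpenTelemetry's API.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional


class SpanContextSketch(NamedTuple):
    """Hypothetical stand-in for a span context (not the OpenTelemetry class)."""

    trace_id: str  # 32 lowercase hex characters
    span_id: str  # 16 lowercase hex characters
    sampled: bool


# Assumption: version "00" of the W3C `traceparent` header layout.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(ctx: SpanContextSketch, headers: Dict[str, str]) -> None:
    """Write the local span context into the outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContextSketch]:
    """Read a remote span context from incoming headers, if one is present."""
    match = _TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # The lowest bit of the flags byte carries the "sampled" decision.
    return SpanContextSketch(trace_id, span_id, bool(int(flags, 16) & 0x01))


# Local server: create a context and attach it to an outgoing request.
local = SpanContextSketch(secrets.token_hex(16), secrets.token_hex(8), sampled=True)
outgoing_headers: Dict[str, str] = {}
inject(local, outgoing_headers)

# Remote server: recover the same context and use it as the parent of its own span.
remote_parent = extract(outgoing_headers)
```

Note that the sampled flag travels with the context, so the receiving server learns the sender's sampling decision along with the IDs.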

It is up to the remote server to decide what it does with the spans it
creates. This is called the sampling policy. In Synapse it is controlled
by the `sample_rate` option in the `tracing` config section.

For OpenTelemetry concepts, see
<https://opentelemetry.io/docs/concepts/>.

For more information about the Python implementation of OpenTelemetry we're using, see
<https://opentelemetry.io/docs/instrumentation/python/>

For more information about Jaeger, see
<https://www.jaegertracing.io/docs/>

## Setting up tracing

To receive tracing spans, start up a Jaeger server. This can be done
using docker like so:

```sh
docker run -d --name jaeger \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
jaegertracing/all-in-one:1
```

Latest documentation is probably at
https://www.jaegertracing.io/docs/latest/getting-started.

## Enable tracing in Synapse

Tracing is not enabled by default. It must be enabled in the
homeserver config by adding the `tracing` option to your config file. You can find
documentation about how to do this in the [config manual under the header 'Tracing'](usage/configuration/config_documentation.md#tracing).
See below for an example tracing configuration:

```yaml
tracing:
  enabled: true
  homeserver_whitelist:
    - "mytrustedhomeserver.org"
    - "*.myotherhomeservers.com"
```

## Homeserver whitelisting

The homeserver whitelist is configured using regular expressions. A list
of regular expressions can be given, and a homeserver's name is checked against
their union when propagating any span contexts to another homeserver.
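
To make "the union of a list of regular expressions" concrete, here is a small sketch using Python's `re` module. It assumes the patterns are joined with `|` and that the whole server name must match; whether Synapse anchors the match this way is not stated here. `compile_whitelist` and `is_whitelisted` are hypothetical helper names, not Synapse functions.

```python
import re
from typing import List, Optional, Pattern


def compile_whitelist(patterns: List[str]) -> Optional[Pattern[str]]:
    """Combine the configured patterns into a single alternation (their union)."""
    if not patterns:
        # An empty whitelist matches no server at all.
        return None
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def is_whitelisted(whitelist: Optional[Pattern[str]], server_name: str) -> bool:
    """Assumption: the entire server name has to match one of the patterns."""
    return whitelist is not None and whitelist.fullmatch(server_name) is not None


whitelist = compile_whitelist(
    [r"mytrustedhomeserver\.org", r".*\.myotherhomeservers\.com"]
)

trusted = is_whitelisted(whitelist, "mytrustedhomeserver.org")
subdomain = is_whitelisted(whitelist, "matrix.myotherhomeservers.com")
stranger = is_whitelisted(whitelist, "untrusted.example")

# A glob-style pattern with a leading "*" is not a valid Python regex.
try:
    re.compile("*.myotherhomeservers.com")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False
```

Because the entries are regular expressions rather than globs, a wildcard subdomain is written as `.*\.example\.com`, with the literal dots escaped.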

Though it's mostly safe to send and receive span contexts to and from
untrusted users, since span contexts are usually opaque IDs, it can lead
to two problems, namely:

- If the span context is marked as sampled by the sending homeserver
the receiver will sample it. Therefore two homeservers with wildly
different sampling policies could incur higher sampling counts than
intended.
- Sending servers can attach arbitrary data to spans, known as
'baggage'. For safety this has been disabled in Synapse but that
doesn't prevent another server sending you baggage which will be
logged in the trace.
44 changes: 24 additions & 20 deletions docs/usage/configuration/config_documentation.md
@@ -72,6 +72,7 @@ apply if you want your config file to be read properly. A few helpful things to
In addition, each setting has an example of its usage, with the proper indentation
shown.


## Modules

Server admins can expand Synapse's functionality with external modules.
@@ -3525,47 +3526,50 @@ default_power_level_content_override:
```

---
## Opentracing ##
Configuration options related to Opentracing support.
## Tracing ##
Configuration options related to tracing support.

---
### `opentracing`
### `tracing`

These settings enable and configure opentracing, which implements distributed tracing.
This allows you to observe the causal chains of events across servers
including requests, key lookups etc., across any server running
synapse or any other services which support opentracing
(specifically those implemented with Jaeger).
These settings enable and configure tracing. This allows you to observe the
causal chains of events across servers including requests, key lookups etc.,
across any server running Synapse or any other services which support
OpenTelemetry.

Sub-options include:
* `enabled`: whether tracing is enabled. Set to true to enable. Disabled by default.
* `homeserver_whitelist`: The list of homeservers we wish to send and receive span contexts and span baggage.
See [here](../../opentracing.md) for more.
See [here](../../tracing.md#homeserver-whitelisting) for more.
This is a list of regexes which are matched against the `server_name` of the homeserver.
By default, it is empty, so no servers are matched.
* `force_tracing_for_users`: # A list of the matrix IDs of users whose requests will always be traced,
* `sample_rate`: The probability that a given span and subsequent child spans in the trace will be
recorded. This controls the number of spans that are recorded and exported from Synapse.
* `force_tracing_for_users`: A list of the matrix IDs of users whose requests will always be traced,
even if the tracing system would otherwise drop the traces due to probabilistic sampling.
By default, the list is empty.
* `jaeger_config`: Jaeger can be configured to sample traces at different rates.
All configuration options provided by Jaeger can be set here. Jaeger's configuration is
mostly related to trace sampling which is documented [here](https://www.jaegertracing.io/docs/latest/sampling/).
* `jaeger_exporter_config`: Configure authentication and where your Jaeger instance is located.
Full options available in the [`JaegerExporter` API docs](https://opentelemetry-python.readthedocs.io/en/latest/exporter/jaeger/jaeger.html#opentelemetry.exporter.jaeger.thrift.JaegerExporter).

Example configuration:
```yaml
opentracing:
tracing:
enabled: true
homeserver_whitelist:
- ".*"

sample_rate: 1
force_tracing_for_users:
- "@user1:server_name"
- "@user2:server_name"

jaeger_config:
sampler:
type: const
param: 1
logging:
false
jaeger_exporter_config:
agent_host_name: localhost
agent_port: 6831
# Split UDP packets so they fit within the limit (UDP_PACKET_MAX_LENGTH is set to 65k in OpenTelemetry)
udp_split_oversized_batches: true
# If you define a collector, it will communicate directly to the collector, bypassing the agent
#collector_endpoint: "http://localhost:14268/api/traces?format=jaeger.thrift"
```
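
The interaction between `sample_rate` and `force_tracing_for_users` described above can be sketched as follows. This is an illustrative model only: it assumes a deterministic trace-ID-ratio decision (comparing the low 64 bits of the trace ID against a bound derived from the rate), and `should_record` is a hypothetical function, not the code in this PR.

```python
from typing import Collection

# Assumption: only the low 64 bits of the trace ID take part in the decision.
_TRACE_ID_MASK = (1 << 64) - 1


def should_record(
    trace_id: int,
    user_id: str,
    sample_rate: float,
    force_tracing_for_users: Collection[str],
) -> bool:
    """Decide whether a trace is recorded under the configured options."""
    # Forced users are always traced, whatever the sample rate says.
    if user_id in force_tracing_for_users:
        return True
    # sample_rate=1 gives a bound of 2**64 (always record); 0 gives 0 (never).
    bound = round(sample_rate * (1 << 64))
    return (trace_id & _TRACE_ID_MASK) < bound


forced = {"@user1:server_name", "@user2:server_name"}
trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736

always = should_record(trace_id, "@other:server_name", 1, forced)
never = should_record(trace_id, "@other:server_name", 0, forced)
forced_anyway = should_record(trace_id, "@user1:server_name", 0, forced)
```

Deriving the decision from the trace ID, rather than from a fresh random number per span, means every span in a trace gets the same answer, which matches the description of `sample_rate` applying to "a given span and subsequent child spans".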
---
## Workers ##
3 changes: 0 additions & 3 deletions mypy.ini
@@ -164,9 +164,6 @@ ignore_missing_imports = True
[mypy-pympler.*]
ignore_missing_imports = True

[mypy-rust_python_jaeger_reporter.*]
ignore_missing_imports = True

[mypy-saml2.*]
ignore_missing_imports = True
