[2021 Theme Proposal] Better observability #74

## Theme description

When running a service in production, observability is a cornerstone of reliability and performance. go-ipfs has become a fairly complex piece of software, yet it is too often a black box for a node operator. It's often difficult to identify problems, let alone solve them. It's time to improve the situation, notably with better tracing.

## Hypothesis

- administering a black box is hard and leads to breakage of various shapes and forms
- observability helps everyone, with barely any downside

## Vision statement

If executed properly, node operators will be able to:
- better diagnose and resolve a wide range of issues
- provide better feedback to the development team
- rely less on PL to diagnose those issues, reducing the burden on the development team
- develop solutions to address those issues

Additionally, better observability greatly helps during development, both to verify correctness and to have actual numbers to base decisions on.

## Why focus this year

There is no major roadblock preventing this, just work that needs to be done or completed.

## Example workstreams

Observability consists of three pillars: logs, metrics, and traces. These are in different states at the moment in go-ipfs and will require different amounts of work.

#### Logs

Logs are in decent shape in go-ipfs. Most of the subsystems are instrumented, although not equally. However, they are somewhat difficult to exploit because only a single sink is possible (stdout), with a single global filter.

For reference, Infura uses a [custom plugin](https://github.com/INFURA/go-ipfs-datadog-plugin) to get those logs out of go-ipfs.

Possible work:
- develop an API to register a log sink, with dedicated filtering
- tag the logs with a request ID if available, which allows logs and traces to be matched later
- review the log instrumentation across subsystems to harmonize it
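
To make the first two items more concrete, here is a minimal sketch of what a sink registry with per-sink filtering could look like. It uses only the standard library, and every name in it (`Registry`, `Entry`, `Sink`, `Filter`, `MinLevel`) is hypothetical: none of this is an existing go-ipfs or go-log API.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Level is a hypothetical log severity; higher means more severe.
type Level int

const (
	Debug Level = iota
	Info
	Warn
	Error
)

// Entry is a hypothetical log record. RequestID is empty when the
// log line is not tied to a request.
type Entry struct {
	Subsystem string
	Level     Level
	RequestID string
	Message   string
}

// Sink receives the entries that passed its filter.
type Sink func(Entry)

// Filter decides whether a sink wants a given entry.
type Filter func(Entry) bool

type registration struct {
	filter Filter
	sink   Sink
}

// Registry fans each entry out to every sink whose filter accepts it.
type Registry struct {
	mu   sync.RWMutex
	regs []registration
}

// Register adds a sink together with its own dedicated filter.
func (r *Registry) Register(f Filter, s Sink) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.regs = append(r.regs, registration{filter: f, sink: s})
}

// Log delivers e to all sinks whose filter returns true.
// Sinks must not call Register from inside their callback.
func (r *Registry) Log(e Entry) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	for _, reg := range r.regs {
		if reg.filter(e) {
			reg.sink(e)
		}
	}
}

// MinLevel keeps entries of one subsystem (or of all subsystems when
// subsystem is "") at or above the given level.
func MinLevel(subsystem string, minLevel Level) Filter {
	return func(e Entry) bool {
		return (subsystem == "" || e.Subsystem == subsystem) && e.Level >= minLevel
	}
}

func main() {
	var reg Registry
	var collected []string // stands in for an external log backend

	// Sink 1: warnings and errors from every subsystem go to stdout.
	reg.Register(MinLevel("", Warn), func(e Entry) {
		fmt.Printf("[stdout] %s: %s\n", e.Subsystem, e.Message)
	})
	// Sink 2: everything from the DHT, tagged with the request ID.
	reg.Register(MinLevel("dht", Debug), func(e Entry) {
		collected = append(collected, e.RequestID+" "+e.Message)
	})

	reg.Log(Entry{Subsystem: "dht", Level: Debug, RequestID: "req-1", Message: "query started"})
	reg.Log(Entry{Subsystem: "bitswap", Level: Error, Message: "peer timeout"})
	fmt.Println(strings.Join(collected, "; "))
}
```

The point of the sketch is that each sink carries its own filter, so a verbose sink for one subsystem does not force the same verbosity on stdout.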

#### Metrics

go-ipfs exposes metrics in the Prometheus format. I don't have many complaints about it.

Possible work:
- review the metric instrumentation across subsystems to harmonize it
- identify missing metrics and implement them
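
As an illustration of what "harmonized" could mean, the sketch below builds every counter name from the same `ipfs_<subsystem>_<name>_total` pattern and renders the result in the Prometheus text exposition format. The naming pattern, the `Counters` type, and the metric names are assumptions made for this example. The sketch deliberately uses only the standard library and is not how go-ipfs registers metrics today.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Counters is a hypothetical in-memory set of monotonic counters.
type Counters struct {
	mu     sync.Mutex
	values map[string]uint64
}

// metricName applies one naming convention to every subsystem
// (the convention itself is an assumption of this sketch).
func metricName(subsystem, name string) string {
	return "ipfs_" + subsystem + "_" + name + "_total"
}

// Inc increments the counter for the given subsystem and name.
func (c *Counters) Inc(subsystem, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.values == nil {
		c.values = make(map[string]uint64)
	}
	c.values[metricName(subsystem, name)]++
}

// Render returns the counters in the Prometheus text exposition
// format, sorted by metric name so the output is stable.
func (c *Counters) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	names := make([]string, 0, len(c.values))
	for n := range c.values {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "# TYPE %s counter\n%s %d\n", n, n, c.values[n])
	}
	return b.String()
}

func main() {
	var c Counters
	c.Inc("bitswap", "blocks_received")
	c.Inc("bitswap", "blocks_received")
	c.Inc("dht", "queries")
	fmt.Print(c.Render())
}
```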

#### Tracing

This is the real meat of this proposal. Here go-ipfs is a black box: the best you can achieve is to know how long go-ipfs takes to handle a request, with no details about the internals. AFAIK, only the DHT is decently instrumented.

For reference, Infura uses a [`PluginTracer`](https://github.com/INFURA/go-ipfs-datadog-plugin) to export traces to an external system for analysis. However, this requires not only this plugin but also some custom code in our fork to get something meaningful. This is obviously not great.

Possible work:
- [add a go context in the data pipeline](https://github.com/ipfs/go-ipfs/issues/6803)
- add tracing in the data pipeline
- add tracing in other important subsystems (pinner, pubsub, connect to the DHT ...)
- support distributed tracing (match traces coming from another system and reaching go-ipfs)