
feat(instrumentation): add OpenTelemetry tracing and metrics with basic configurations #5175

Merged
merged 90 commits into master from feat-instrumentation-5155 on Oct 11, 2022

Conversation

girishc13
Contributor

@girishc13 (Contributor) commented Sep 15, 2022

Goals:

  • Resolves Add OpenTelemetry to increase observability #5155.
  • Integrate the OpenTelemetry API and SDK.
  • [ ] Provide environment variable configurations to enable tracing when required. Use the console exporter for now.
  • Trace gRPC requests within the Flow.
  • Add helpers for creating traces on request methods with default span attributes.
  • [ ] Convert the send_health_check_sync or is_ready method to async to prevent the grpc aio interceptor from throwing and capturing an exception.
  • Extract the tracing context from the server and make it available to the Executor methods in the kwargs list or arguments.
  • Check and update the documentation. See the guide and ask the team.

Sample Usage

Flow

jtype: Flow
version: '1'
with:
  protocol: grpc
  port: 54321
  tracing: true
  traces_exporter_host: '0.0.0.0'
  traces_exporter_port: 4317
executors:
  - uses: executor1/config.yml
    name: toyExecutor
  - uses: executor2/config.yml
    name: toyExecutor2
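
The same Flow can also be built with the Python API. The sketch below assumes the keys under `with:` map directly to Flow keyword arguments (as the Client example further down suggests); it is a sketch for orientation, not code taken from this PR.

from jina import Flow

f = Flow(
    protocol='grpc',
    port=54321,
    tracing=True,
    traces_exporter_host='0.0.0.0',
    traces_exporter_port=4317,
).add(uses='executor1/config.yml', name='toyExecutor').add(
    uses='executor2/config.yml', name='toyExecutor2'
)

with f:
    f.block()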

Executor

from jina import DocumentArray, Executor, requests
from opentelemetry.context.context import Context
from opentelemetry.semconv.trace import SpanAttributes
from opentelemetry.trace import Status, StatusCode


class MyExecutor(Executor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @requests
    def foo(self, docs: DocumentArray, tracing_context: Context, **kwargs):
        # create a custom span as a child of the request's tracing context
        with self.tracer.start_span("foo", context=tracing_context) as span:
            try:
                span.set_attribute("len_added_docs", len(docs))
                span.set_attribute(SpanAttributes.RPC_METHOD, 'foo')  # the request method name

                docs[0].text = 'hello, world!'
                docs[1].text = 'goodbye, world!'
                # only mark the span OK if no exception occurred
                span.set_status(Status(StatusCode.OK))
            except Exception as ex:
                # record the failure instead of overwriting it with OK
                span.set_status(Status(StatusCode.ERROR))
                span.record_exception(ex)
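
Note: per the tracing argument described further down, self.tracer is backed by the SDK tracer only when tracing is enabled; otherwise a no-op implementation is provided, so the span calls above become harmless no-ops.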

Client

from jina import Client, DocumentArray
import time

if __name__ == '__main__':
    c = Client(
        host='grpc://0.0.0.0:54321',
        tracing=True,
        traces_exporter_host='0.0.0.0',
        traces_exporter_port=4317,
    )

    da = c.post('/', DocumentArray.empty(4))
    print(da.texts)

    time.sleep(3)

Collecting Data

Please check the docker-compose.yml and otel-collector-config.yml under the folder tests/integration/instrumentation for running the OpenTelemetry collector and Jaeger UI locally.
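
For orientation, such a collector config generally looks like the sketch below: an OTLP receiver for the spans sent by the Flow and Client, and a Jaeger exporter pointing at the jaeger service from the compose file. This is a hand-written sketch and may differ from the actual file in the repository.

receivers:
  otlp:
    protocols:
      grpc:   # matches traces_exporter_port: 4317
      http:

exporters:
  jaeger:
    endpoint: jaeger:14250   # Jaeger collector gRPC port exposed in docker-compose.yml
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]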

github-actions bot added the size/S, area/core, area/entrypoint, area/helper, and area/setup labels on Sep 15, 2022

codecov bot commented Sep 16, 2022

Codecov Report

Merging #5175 (bb0b003) into master (bcf17c3) will increase coverage by 23.24%.
The diff coverage is 52.91%.

@@             Coverage Diff             @@
##           master    #5175       +/-   ##
===========================================
+ Coverage   51.99%   75.23%   +23.24%     
===========================================
  Files          95      100        +5     
  Lines        6145     6433      +288     
===========================================
+ Hits         3195     4840     +1645     
+ Misses       2950     1593     -1357     
| Flag | Coverage Δ |
| --- | --- |
| jina | 75.23% <52.91%> (+23.24%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| jina/clients/base/http.py | 91.89% <ø> (+2.70%) ⬆️ |
| jina/clients/base/websocket.py | 83.80% <ø> (+7.61%) ⬆️ |
| jina/orchestrate/flow/base.py | 90.15% <ø> (+29.81%) ⬆️ |
| jina/serve/instrumentation/_aio_server.py | 0.00% <0.00%> (ø) |
| jina/serve/runtimes/gateway/http/gateway.py | 87.30% <ø> (+63.49%) ⬆️ |
| jina/serve/runtimes/gateway/websocket/gateway.py | 85.71% <ø> (+58.92%) ⬆️ |
| jina/serve/runtimes/gateway/http/app.py | 38.19% <25.00%> (+29.62%) ⬆️ |
| jina/serve/runtimes/gateway/websocket/app.py | 29.03% <25.00%> (+20.42%) ⬆️ |
| jina/clients/base/grpc.py | 82.85% <33.33%> (+5.71%) ⬆️ |
| jina/serve/runtimes/asyncio.py | 69.02% <44.44%> (+20.06%) ⬆️ |

... and 76 more


@girishc13 (Contributor Author):

The Python OpenTelemetry SDK currently doesn't have support for grpc.aio.Server. There is an open pull request upstream which will make the implementation easier for us.
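
For context, the interceptor approach on the client side looks roughly like the sketch below. This is an illustrative, hand-written example (class and variable names are made up), not the code added in this PR: it opens a CLIENT span per unary call and injects the active trace context into the outgoing gRPC metadata.

import grpc
from opentelemetry import trace
from opentelemetry.propagate import inject


class UnaryTracingClientInterceptor(grpc.aio.UnaryUnaryClientInterceptor):
    def __init__(self, tracer: trace.Tracer):
        self._tracer = tracer

    async def intercept_unary_unary(self, continuation, client_call_details, request):
        with self._tracer.start_as_current_span(
            client_call_details.method, kind=trace.SpanKind.CLIENT
        ):
            # serialize the current span context into W3C trace-context headers
            carrier = {}
            inject(carrier)
            # copy the existing metadata and append the injected headers
            metadata = grpc.aio.Metadata(*(client_call_details.metadata or ()))
            for key, value in carrier.items():
                metadata.add(key, value)
            new_details = grpc.aio.ClientCallDetails(
                method=client_call_details.method,
                timeout=client_call_details.timeout,
                metadata=metadata,
                credentials=client_call_details.credentials,
                wait_for_ready=client_call_details.wait_for_ready,
            )
            return await continuation(new_details, request)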

Review threads (resolved) on jina/serve/instrumentation/__init__.py (outdated) and jina/serve/instrumentation/_aio_client.py.
@girishc13 closed this on Sep 21, 2022
@girishc13 reopened this on Sep 21, 2022
@samsja (Contributor) left a comment:

I think that we should merge the new metrics argument with the already existing monitoring one.

Great PR, I am looking forward to it!

| `tracing` | If set, the sdk implementation of the OpenTelemetry tracer will be available and will be enabled for automatic tracing of requests and custom span creation. Otherwise a no-op implementation will be provided. | `boolean` | `False` |
| `span_exporter_host` | If tracing is enabled, this hostname will be used to configure the trace exporter agent. | `string` | `None` |
| `span_exporter_port` | If tracing is enabled, this port will be used to configure the trace exporter agent. | `number` | `None` |
| `metrics` | If set, the sdk implementation of the OpenTelemetry metrics will be available for default monitoring and custom measurements. Otherwise a no-op implementation will be provided. | `boolean` | `False` |
Contributor:

I don't understand the sentence. Isn't it going to overlap with the monitoring?

Contributor Author:

Yes, my intention is to use the same terms as OpenTelemetry. If people read the OpenTelemetry documentation then the terms are aligned.

Contributor Author:

Will be renamed to traces_exporter_host?

Comment on lines 39 to 40
| `span_exporter_host` | If tracing is enabled, this hostname will be used to configure the trace exporter agent. | `string` | `None` |
| `span_exporter_port` | If tracing is enabled, this port will be used to configure the trace exporter agent. | `number` | `None` |
Contributor:

I know this is a small thing that I mentioned already, so sorry to be a PITA about this, but I really think we should switch these around to "host/port_span_exporter" to align them with the nomenclature of the Prometheus feature. It's the small things that make a good user experience, imo.

Contributor:

@JohannesMessner what name would you suggest?

Contributor Author:

The port_monitoring argument won't exist in the near future; there will only be the span_exporter attributes. I'm generally used to seeing and using _host as a suffix rather than as a prefix.

Contributor:

But then we might introduce a breaking change, right? We need to be careful.

Contributor:

We can deprecate an argument if needed, but this should be thought through ahead of time.

@girishc13 could you show here what the relevant arguments would be in this near future where port_monitoring does not exist?

Contributor Author:

You can also think of the naming in terms of the YAML configuration for the OpenTelemetry Collector. The hierarchy that I'm implicitly used to is: dependency -> service -> host, port, .... So this naturally follows the convention of service.host and service.port.

version: "3"
services:
  # Jaeger
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14250"

  otel-collector:
    image: otel/opentelemetry-collector:0.61.0
    command: [ "--config=/etc/otel-collector-config.yml" ]
    volumes:
      - ${PWD}/otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "1888:1888" # pprof extension
      - "8888:8888" # Prometheus metrics exposed by the collector
      - "8889:8889" # Prometheus exporter metrics
      - "13133:13133" # health_check extension
      - "55679:55679" # zpages extension
      - "4317:4317" # OTLP gRPC receiver
      - "4318:4318" # OTLP http receiver
    depends_on:
      - jaeger
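
Assuming both files sit in the same directory (as in tests/integration/instrumentation), running docker-compose up should bring up the Jaeger all-in-one container (UI on port 16686) and the collector listening for OTLP traffic on ports 4317/4318.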

Contributor:

I'm not a big fan of deprecating our current port_monitoring so quickly after it was introduced, but if it leads to a nicer and more unified experience moving forward, then we'll have to do it.

But apart from the argument naming, am I understanding correctly that, according to this plan, the user won't be able to use Prometheus to collect metrics anymore? Or will the setup on the user side remain the same, and we only change the way we expose these metrics from our internals? Because on the OTel Collector site I still see some Prometheus logos, but some of them are not connected to the system, so I am a bit lost.

If this is the case, then I don't think we should remove the current way users set up their metrics pipeline. This would be a huge breaking change.

Contributor Author:

> But apart from the argument naming, am I understanding correctly that, according to this plan, the user won't be able to use Prometheus to collect metrics anymore? Or will the setup on the user side remain the same, and we only change the way we expose these metrics from our internals?

The main concern, from my understanding, is introducing a breaking change for the metrics data which requires a new setup. Do we have data on how many users are using the Prometheus client for monitoring, apart from JCloud users? Also, the lack of interoperability between OpenTelemetry monitoring and Prometheus monitoring makes it a bit hard to just remove the current monitoring setup.

I can think of the following ways to tackle this:

  1. We can also choose to release only the tracing instrumentation and work on the metrics later if we get feedback from the users. I also believe that OpenTelemetry metrics do not provide features as rich as Prometheus, but it's still the direction to go early, to avoid users investing too much into the Prometheus-only solution.
  2. We deprecate Prometheus monitoring and continue supporting OpenTelemetry tracing and monitoring for users that want to work with OpenTelemetry. The decision is up to the user, and we might have some more work to maintain both.

Member:

I would declare the old metrics system as deprecated (TO BE REMOVED in a couple of minor releases) and go with the full OpenTelemetry approach.

Contributor Author:

The official Prometheus library already supports the OpenTelemetry APIs and SDKs. The OpenTelemetry Collector also supports scraping data from the existing Prometheus client. We might need some elaborate configuration for metrics and the OpenTelemetry Collector to support the existing mechanism, but OpenTelemetry is the way to go.
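
To make the last point concrete, the Collector's prometheus receiver can scrape an existing Prometheus client endpoint and feed it into a metrics pipeline. The snippet below is only an illustration (it is not part of this PR, and the job name and scrape target are hypothetical):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'jina-monitoring'      # hypothetical job name
          scrape_interval: 15s
          static_configs:
            - targets: ['gateway:9090']    # hypothetical existing Prometheus endpoint

exporters:
  prometheus:
    endpoint: '0.0.0.0:8889'   # the "Prometheus exporter metrics" port from the compose file

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]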

@github-actions

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.


Review threads (resolved, outdated) on jina/serve/gateway.py and jina/serve/runtimes/gateway/__init__.py.
async def async_run_forever(self):
"""Running method of the server."""
await self.gateway.run_server()
from .gateway import HTTPGateway
Contributor:

I probably missed this, but I believe it's still possible; it does not produce circular imports for other gateways.


@JoanFM closed this on Oct 11, 2022
@JoanFM reopened this on Oct 11, 2022

@github-actions

📝 Docs are deployed on https://feat-instrumentation-5155--jina-docs.netlify.app 🎉

@samsja (Contributor) left a comment:

LGTM

@JoanFM merged commit 107631e into master on Oct 11, 2022
@JoanFM deleted the feat-instrumentation-5155 branch on October 11, 2022 at 14:14
Labels

area/cli, area/core, area/docs, area/setup, area/testing, component/client, component/resource, size/L, size/S, size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add OpenTelemetry to increase observability
7 participants