Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start emitting disk usage metrics for domains #4989

Merged
merged 1 commit into from Feb 20, 2024

Conversation

lubosmj
Copy link
Member

@lubosmj lubosmj commented Jan 29, 2024

closes #4603

@lubosmj lubosmj force-pushed the 2-artifacts-size-domains-stored-4603 branch 5 times, most recently from d59fa80 to 724a6ec Compare January 29, 2024 18:08
@lubosmj
Copy link
Member Author

lubosmj commented Jan 29, 2024

I am attaching how the end-result looks in Grafana. Metrics are periodically collected on Prometheus.

In the graph, there are two domains considered -- default and OTEL. I created the OTEL domain manually. Then, I synced a repository within that domain with the file-large repository. I did the same for the default domain. In between, I removed the OTEL domain. Finally, I removed content from a repository in the default domain and ran the orphan cleanup. The graph correctly shows how the disk usage changes over time.
image

Below, I demonstrate the existence of the selected OTEL domain. Once a user removes the domain, the metrics are no longer exported.
image

@lubosmj
Copy link
Member Author

lubosmj commented Jan 29, 2024

@decko, @dkliban, @bmbouter, I am a bit concerned about the current implementation. Will you help me resolve some of the issues?

  1. It takes a long time to export and receive the metrics on the collector. Modifying the obvious OTEL_EXPORTER_OTLP_METRICS_TIMEOUT setting did not mitigate the problem.
    SOLUTION: Do not test the metrics in our CI.
  2. When a user restarts the pulpcore-api service, we stop emitting metrics for newly created domains since the initial process is not running anymore. Furthermore, it is not the best idea to initialize the metrics exporter from post_migrate hooks (I have noticed that exporters initialized inside _ensure_default_domain are being ignored).
    SOLUTION: Initialize metrics' instruments inside _util.startup_hook, like it is done for the default domain now.
  3. The number of workers determines the number of possible duplicates being sent to the collector (because of the initialization is done in _util.startup_hook). In oci-env, we have 2 workers; thus, it sends 2 metrics for the same domain. This is however easily digested by Prometheus and no duplicates are preserved in graphs.
    SOLUTION: We need to ensure that the initialization is run just from pulpcore-api and it is able to recover itself after the restart. Where is the correct place for doing this?

Any inputs?

@lubosmj lubosmj marked this pull request as ready for review January 29, 2024 18:57
@decko
Copy link
Member

decko commented Jan 31, 2024

  1. Have you tried OTEL_METRIC_EXPORT_TIMEOUT?

  2. I think I'm ok with it, but probably @gerrod3 is the best person to answer that.

  3. Do you think we can keep the "emitter" only on the API side?

  4. On the first graph I'm seeing that we got two metrics, one emitted by the API and other by the worker. I'm presuming that under the attribute "name" we have the domain, but on the URL it's pointing just for the default domain although the name is "OTEL".

@lubosmj
Copy link
Member Author

lubosmj commented Jan 31, 2024

  1. Have you tried OTEL_METRIC_EXPORT_TIMEOUT?

I have just tested it. It did not work either. Seeing the logs inside oci-env, the metrics are emitted every 15 seconds or so. Can you verify this on your side too, please?

  1. Do you think we can keep the "emitter" only on the API side?

If we keep it on the API side, outside of the domain.save() scope, we will need to have a hook somewhere which iterates over existing domains and initialize the emitter every time the process is started. The somewhere part is the catch. That is why I resolved to have it inside the AFTER_CREATE hook at the moment (this is the place where we initialize the emitter for newly created domains).

  1. On the first graph I'm seeing that we got two metrics, one emitted by the API and other by the worker. I'm presuming that under the attribute "name" we have the domain, but on the URL it's pointing just for the default domain although the name is "OTEL".

The URLs are different, pointing to respective domains. The exported job is equal to "worker" because information about the default domain is emitted from

def startup_hook():
(ensuring that if a worker fails or is restarted, we still emit data for the default domain; restarting api-app worker is not so common, but might happen in case of the overloaded system).

@lubosmj lubosmj marked this pull request as draft February 2, 2024 20:00
@lubosmj
Copy link
Member Author

lubosmj commented Feb 15, 2024

It appears that I can emit data about domains from wsgi.py (in the same way I do from the worker process inside the startup_hook function). This should be resilient to process restarts and the exporter is always emitting data from api-app, which is exactly what we want! With @dkliban, we also agreed on having the data emitted just from api-app but with a lower frequency (e.g., 5 minutes). I will work on this.

In the future (new PR), we may consider having the metrics emitted on a specific event as initially planned. On the other hand, we would need to find a balance between emitting data when a task finishes and serving on-demand content (we do not want to run heavy DB queries each time content is downloaded and streamed by content-app).

Another problem to deal with is the number of duplicated data. The number of api-app processes determines the number of duplicates. This can be resolved by introducing a new option to the entrypoint command (new PR). So, users will run pulpcore-api --emit-metrics to get the metrics exported for a main emitting process only.

@lubosmj lubosmj force-pushed the 2-artifacts-size-domains-stored-4603 branch 6 times, most recently from f6e51a9 to 36a4c91 Compare February 19, 2024 16:06
@lubosmj lubosmj marked this pull request as ready for review February 19, 2024 16:12
Comment on lines +1 to +9
Tech Previews
=============

The following features are currently being released as part of tech preview:

* Support for Open Telemetry
* Upstream replicas
* Domains - Multi-Tenancy
* Complex filtering
Copy link
Member Author

@lubosmj lubosmj Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pedro-psb, is it worth introducing the same page to the staging docs? If yes, where? How do we keep sync of newly added docs and staging docs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a good point. The docs workgroup has discussed this in the early stages of planning and we've agreed that "All new doc activity will focus on new-docs. Old documentation as is kept as is until process is done." (early planning post).

The reasons were to keep it simple and with low overhead for us. If this temporary outdated state is not acceptable we can discuss alternatives, the simple ones I can think of are:

  • duplicating the content manually until we switch
  • adding a banner to the actual website stating that only the new-docs is receiving updates

Copy link
Member Author

@lubosmj lubosmj Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in what content persona/category to put this?
I believe admin/learn/tech-preview.md seems fine for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for now" because the tech-preview listing happens in a lot of plugins, so we may think of a better general approach for this later. But this works

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we should have a tech-preview section. Some guides/tutorials can just mention that a feature is tech preview.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops. i misunderstood. let's make the page as was suggested above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. See the file staging_docs/admin/learn/tech-preview.md. I did not add more hyperlinks because the listed features are still located under the "triage-needed!" directory.

decko
decko previously approved these changes Feb 19, 2024
Copy link
Member

@decko decko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

decko
decko previously approved these changes Feb 20, 2024
Copy link
Member

@decko decko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀


The following features are currently being released as part of tech preview:

- [Support for Open Telemetry](architecture.md#telemetry-support)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URL should be (site:pulpcore/admin/learn/architecture/#telemetry-support)

Copy link
Member Author

@lubosmj lubosmj Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for me locally. Is it not working for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not tried it, but I believe that we want to express all our links using the notation that i provided.

At the moment, it is not possible to destroy instruments that send
metrics. Therefore, when a user removes a domain, there might be still
the metrics about it emitted. A temporary workaround is to restart the
pulpcore-api process to reload meters.

Ref: open-telemetry/opentelemetry-specification#2232

closes pulp#4603
Copy link
Member

@decko decko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT 🚀 🚀 🚀

@lubosmj lubosmj merged commit bfb32e6 into pulp:main Feb 20, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add additional metrics reporting size of artifacts stored
4 participants