Dynamic sampling in Collector #1562

Closed
ntyrewalla opened this issue Nov 12, 2020 · 27 comments
Comments

@ntyrewalla

At present, we have static head-based sampling for tracing. We would like to extend that and support dynamic sampling based on configuration passed down from the control plane. For example, customers should be able to change the sampling rate on the fly in response to an issue they are seeing in the production environment. Once the issue is resolved, customers should be able to revert to the default rate or reduce the rate as a measure of cost control.
Additionally, a communication mechanism from the collector to a control plane would be needed to configure sampling for individual workflows. For example, certain workflows, such as order processing systems, are more sensitive to errors and faults than others where retry mechanisms are in place.
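
A minimal sketch of the idea, not Collector code: a head-based ratio sampler whose default rate and per-workflow overrides can be changed while it is running. The names `DynamicSampler`, `set_default_rate`, `set_workflow_rate`, and `clear_workflow_rate` are hypothetical, and hashing the trace ID is just one assumed way to make the decision stable for a given trace and rate.

```python
import hashlib
import threading


class DynamicSampler:
    """Ratio sampler whose rates can be changed at runtime (hypothetical sketch)."""

    def __init__(self, default_rate):
        self._lock = threading.Lock()
        self._default_rate = self._check(default_rate)
        self._workflow_rates = {}  # workflow name -> override rate

    @staticmethod
    def _check(rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"sampling rate must be in [0, 1], got {rate}")
        return rate

    def set_default_rate(self, rate):
        with self._lock:
            self._default_rate = self._check(rate)

    def set_workflow_rate(self, workflow, rate):
        with self._lock:
            self._workflow_rates[workflow] = self._check(rate)

    def clear_workflow_rate(self, workflow):
        with self._lock:
            self._workflow_rates.pop(workflow, None)

    def should_sample(self, trace_id, workflow=""):
        with self._lock:
            rate = self._workflow_rates.get(workflow, self._default_rate)
        # Map the trace ID to a stable 64-bit value and compare it with a
        # threshold derived from the rate, so rate 1.0 keeps every trace
        # and rate 0.0 keeps none.
        digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        return value < int(rate * 2**64)
```

During an incident, a control plane could call `set_workflow_rate("orders", 1.0)` and later `clear_workflow_rate("orders")` to fall back to the default.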

@tigrannajaryan tigrannajaryan transferred this issue from open-telemetry/community Nov 12, 2020
@tigrannajaryan
Member

This is a Collector feature request, moved to the appropriate repo.

@tigrannajaryan
Member

This is a non-trivial feature that requires a design document. It is not a feature on our roadmap, so it is unlikely that maintainers will work on this. If a contributor submits a design document, we will review it and decide whether the feature is appropriate for the Collector.

@alolita
Member

alolita commented Nov 14, 2020

@tigrannajaryan Agree - this is non-trivial. We would like to discuss dynamic sampling features. It would be great to restart the sampling SIG to discuss. @tedsuo what is the process to restart the sampling SIG which used to be held on Fridays? Please advise. @ntyrewalla and other AWS engineers are interested in participating and running the SIG meeting as community members.

@paulosman
Member

> @tigrannajaryan Agree - this is non-trivial. We would like to discuss dynamic sampling features. It would be great to restart the sampling SIG to discuss. @tedsuo what is the process to restart the sampling SIG which used to be held on Fridays? Please advise. @ntyrewalla and other AWS engineers are interested in participating and running the SIG meeting as community members.

I'd be interested in helping with this as well if you're looking for more participation.

@alolita
Member

alolita commented Nov 19, 2020

@paulosman Great! Will ping you once I get the day and time set up for the SIG meetings.

@espower

espower commented Dec 2, 2020

Since the xray javaagent fails to trace queries that go through an ALB, we switched to the otel javaagent, and everything lights up rather nicely. Unfortunately, as stated above, there are no dynamic controls for ingestion like those in the xray agent, where adjusting the sampling rules lets us bump up the number of ingested traces or tune it back to little or none.

The biggest sell for this tool is that it can be implemented across the board without dropping $500K/yr on a third-party vendor. If we have no ability to curtail ingestion, the costs will be astronomical, and it would then be better to go with the third-party solution, where we can negotiate on price. We would happily drop $30-50K to get this kind of data, and my last company would have done the same in a heartbeat.

This really could be a big win for AWS. For now we will wait for the xray agent bug to be fixed.

@tigrannajaryan
Member

I have an internal design document for remote configurability of the Collector. Let me see if I can make it available publicly.

@alolita
Member

alolita commented Dec 3, 2020

Adding link to other issue with more details on Sampling SIG setup - open-telemetry/community#577 (comment)

@andrewhsu andrewhsu added enhancement New feature or request and removed feature request labels Jan 6, 2021
@alolita
Member

alolita commented Jan 13, 2021

@tigrannajaryan any progress on making the design doc for remote configurability of the Collector available? I'd like the Sampling SIG to be set up sooner rather than later and would like to add this area as one of the topics for discussion.

dyladan referenced this issue in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
Fixes #1562

Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>
@awsiv

awsiv commented May 19, 2021

Would be cool to get some traction on this - The ability to dynamically increase the sample ratio during an incident (currently using X-ray) is very useful.

FWIW, xray-receiver seems to be using the proxy server to achieve this: https://aws-otel.github.io/docs/components/x-ray-receiver#set-x-ray-reciever-configurations-related-to-the-local-tcp-proxy-server

punya referenced this issue in punya/opentelemetry-collector-contrib Jul 21, 2021
Fixes #1562

Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>
@awsiv

awsiv commented Nov 2, 2021

Related (AWS/X-Ray):

#4625

@github-actions
Contributor

github-actions bot commented Nov 7, 2022

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Nov 7, 2022
@jpkrohling jpkrohling removed the Stale label Nov 30, 2022
@github-actions
Contributor

github-actions bot commented Apr 3, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Apr 3, 2023
@atoulme atoulme added priority:p2 Medium and removed Stale priority:p3 Lowest labels Apr 24, 2023
@atoulme
Contributor

atoulme commented Apr 24, 2023

Going to increase the priority for this. I would like to know how you would send the updated sampling ratio to the collector. Would you have the collector read the value from a remote endpoint? Or send the value to the collector in some fashion, such as addressing the collector over a REST call? Please let us know what you think is best.
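
To make the first option concrete, here is a rough sketch of a pull model in which the sampling component periodically asks a remote endpoint for the current ratio and keeps the last known value if the fetch fails. `RatioPoller` and the injected `fetch` callable are hypothetical; no real Collector or control-plane API is implied, and the transport is deliberately left out.

```python
from typing import Callable


class RatioPoller:
    """Pulls the sampling ratio from a remote source (hypothetical sketch)."""

    def __init__(self, fetch: Callable[[], float], initial_ratio: float) -> None:
        self._fetch = fetch  # stands in for a call to the remote endpoint
        self.ratio = initial_ratio

    def poll_once(self) -> bool:
        """Fetch the ratio once; return True if the stored ratio changed."""
        try:
            new_ratio = float(self._fetch())
        except Exception:
            # Endpoint unreachable or returned garbage: keep the last known ratio.
            return False
        if not 0.0 <= new_ratio <= 1.0:
            # Reject out-of-range values (this also rejects NaN).
            return False
        changed = new_ratio != self.ratio
        self.ratio = new_ratio
        return changed
```

Calling `poll_once()` on a timer gives the pull behavior; a push model would instead expose an endpoint on the collector that performs the same validation before updating `ratio`.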

@atoulme
Contributor

atoulme commented Apr 24, 2023

Alternatively, as @tigrannajaryan is hinting here, is this something that can be resolved using OpAMP?

@bputt-e

bputt-e commented Apr 24, 2023

I'd prefer a single agent that sits in an enclave and pulls from a REST endpoint, and then either pushes updates to the other agents or lets them pull updates from it.

@atoulme
Contributor

atoulme commented Apr 24, 2023

Isn't that a SPOF?

@bputt-e

bputt-e commented Apr 24, 2023

> Isn't that a SPOF?

Technically yes; it depends on how it can run. I'd probably have replicas in k8s, but that might get tricky once you add logic for primary/read-only replicas.

@tigrannajaryan
Member

tigrannajaryan commented Apr 25, 2023

> Alternatively, as @tigrannajaryan is hinting here, is this something that can be resolved using OpAMP?

Yes, assuming the ratio is a setting in the Collector config file (of the processor that does the sampling). No different than any other dynamic config update done via OpAMP.
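
As a rough illustration of that assumption, the sketch below treats the ratio as one value inside the parsed Collector config and checks whether a newly delivered config changes it. The `probabilistic_sampler` processor and its `sampling_percentage` setting are used as the example; the helper names `sampling_ratio` and `ratio_change` are hypothetical, the dicts stand in for parsed YAML, and nothing here is an OpAMP client.

```python
def sampling_ratio(config, processor="probabilistic_sampler"):
    """Return the processor's sampling ratio in [0, 1] from a parsed config mapping."""
    percentage = config["processors"][processor]["sampling_percentage"]
    if not 0 <= percentage <= 100:
        raise ValueError(f"sampling_percentage must be in [0, 100], got {percentage}")
    return percentage / 100.0


def ratio_change(old_config, new_config):
    """Return (old, new) ratios if the update changes the ratio, else None."""
    old = sampling_ratio(old_config)
    new = sampling_ratio(new_config)
    return None if old == new else (old, new)


# Stand-ins for the config before and after a remote update.
baseline = {"processors": {"probabilistic_sampler": {"sampling_percentage": 5}}}
incident = {"processors": {"probabilistic_sampler": {"sampling_percentage": 50}}}
```

Under this view, raising the rate during an incident is just another config update: the management server delivers `incident`, and reverting means delivering `baseline` again.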

@github-actions github-actions bot added the Stale label Jun 26, 2023
@atoulme
Contributor

atoulme commented Jun 28, 2023

I tend to give the nod to OpAMP as the best way forward for making dynamic configuration updates to the collector.

@github-actions github-actions bot removed the Stale label Jun 28, 2023
@github-actions github-actions bot added the Stale label Aug 28, 2023
@github-actions
Copy link
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 27, 2023
@liwanxue

@atoulme any plan for this feature?

@atoulme
Contributor

atoulme commented Feb 28, 2024

See OpAMP for more info.

@tigrannajaryan
Member

Some OpAMP links that may be useful:
