Dynamic sampling in Collector #1562
Comments
This is a Collector feature request, moved to the appropriate repo. |
This is a non-trivial feature that requires a design document. It is not on our roadmap, so it is unlikely that maintainers will work on it. If a contributor submits a design document, we will review it and decide whether the feature is appropriate for the Collector. |
@tigrannajaryan Agree - this is non-trivial. We would like to discuss dynamic sampling features. It would be great to restart the sampling SIG to discuss. @tedsuo what is the process to restart the sampling SIG which used to be held on Fridays? Please advise. @ntyrewalla and other AWS engineers are interested in participating and running the SIG meeting as community members. |
I'd be interested in helping with this as well if you're looking for more participation. |
@paulosman Great! Will ping you once I get the day and time set up for the SIG meetings. |
Since the xray javaagent fails to trace queries that go through an ALB, we switched to the otel javaagent, and everything lights up rather nicely. Unfortunately, as stated above, there are no dynamic controls for ingestion like those the xray agent provides, where adjusting the sampling rules lets us bump up the number of ingested traces or tune it back to little or none. The biggest selling point of this tool is that it can be rolled out across the board without paying $500K/yr to a third-party vendor. If we cannot curtail ingestion, the costs will be astronomical, and it would then be better to go with the third-party solution, where we can negotiate on price. We would happily spend $30-50K to get this kind of data, and my last company would have done the same in a heartbeat. This really could be a big win for AWS. For now we will wait for the xray agent bug to be fixed. |
I have an internal design document for remote configurability of the Collector. Let me see if I can make it available publicly. |
Adding link to other issue with more details on Sampling SIG setup - open-telemetry/community#577 (comment) |
@tigrannajaryan any progress on making the design doc for remote configurability of the Collector publicly available? I'd like the Sampling SIG to be set up sooner rather than later and would like to add this area as one of the topics for discussion. |
Fixes #1562 Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>
Would be cool to get some traction on this - The ability to dynamically increase the sample ratio during an incident (currently using X-ray) is very useful. FWIW, xray-receiver seems to be using the proxy server to achieve this: https://aws-otel.github.io/docs/components/x-ray-receiver#set-x-ray-reciever-configurations-related-to-the-local-tcp-proxy-server |
Related (AWS/X-Ray): |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping |
I am going to increase the priority of this. How would you deliver the updated sampling ratio to the Collector? Would the Collector read the value from a remote endpoint, or would you push the value to the Collector, for example through a REST call addressed to it? Please let us know which you think is best. |
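To make the "read the value from a remote endpoint" option concrete, here is a minimal Go sketch of a pull-based fetch. Everything in it is an assumption for illustration: the JSON payload shape, the `sampling_ratio` field name, and the `fetchRatio`/`parseRatio` helpers are hypothetical and not part of the Collector or of any existing API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// ratioResponse is a hypothetical payload shape; the Collector defines no such type.
type ratioResponse struct {
	SamplingRatio float64 `json:"sampling_ratio"`
}

// parseRatio decodes the payload and rejects ratios outside [0, 1].
func parseRatio(data []byte) (float64, error) {
	var body ratioResponse
	if err := json.Unmarshal(data, &body); err != nil {
		return 0, err
	}
	if body.SamplingRatio < 0 || body.SamplingRatio > 1 {
		return 0, fmt.Errorf("ratio %v out of range [0, 1]", body.SamplingRatio)
	}
	return body.SamplingRatio, nil
}

// fetchRatio performs one GET against the (hypothetical) control endpoint.
// A real poller would call this on a timer and keep the last good value on error.
func fetchRatio(client *http.Client, url string) (float64, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	return parseRatio(data)
}

func main() {
	// Local stand-in for the remote endpoint so the sketch runs without a network.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"sampling_ratio": 0.25}`)
	}))
	defer srv.Close()

	ratio, err := fetchRatio(srv.Client(), srv.URL)
	fmt.Println(ratio, err)
}
```

The push variant would invert this: the Collector would expose an endpoint and validate incoming values the same way before applying them.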
Alternatively, as @tigrannajaryan is hinting here, is this something that can be resolved using OpAMP? |
I'd prefer a single agent that sits in an enclave and pulls from a REST endpoint; it would then either push updates to the other agents, or the agents would pull updates from it. |
Isn't that a SPOF? |
Technically yes, but it depends on how it runs. I'd probably run replicas in k8s, though that might get tricky once you add logic for primary/read-only replicas. |
Yes, assuming the ratio is a setting in the Collector config file (of the processor that does the sampling). No different than any other dynamic config update done via OpAMP. |
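As a rough illustration of what "the ratio is a setting that gets updated dynamically" could mean inside a sampling processor, here is a Go sketch of a ratio-based sampler whose ratio can be swapped at runtime without locks. The `DynamicSampler` type, its methods, and the 56-bit threshold comparison are assumptions chosen for this example; this is not the Collector's actual processor code, and how the new value arrives (OpAMP, config reload, or something else) is left out.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"sync/atomic"
)

// randomBits is how many low-order trace ID bits are treated as random.
const randomBits = 56

// DynamicSampler is a hypothetical sampler whose ratio can change while it runs.
type DynamicSampler struct {
	ratioBits atomic.Uint64 // float64 ratio stored as its bit pattern
}

func NewDynamicSampler(ratio float64) *DynamicSampler {
	s := &DynamicSampler{}
	s.SetRatio(ratio)
	return s
}

// SetRatio clamps the ratio to [0, 1] and publishes it atomically.
func (s *DynamicSampler) SetRatio(ratio float64) {
	ratio = math.Max(0, math.Min(1, ratio))
	s.ratioBits.Store(math.Float64bits(ratio))
}

// Ratio returns the ratio currently in effect.
func (s *DynamicSampler) Ratio() float64 {
	return math.Float64frombits(s.ratioBits.Load())
}

// ShouldSample keeps a trace when the low 56 bits of its ID fall below
// ratio * 2^56, so the decision is deterministic per trace ID.
func (s *DynamicSampler) ShouldSample(traceID [16]byte) bool {
	v := binary.BigEndian.Uint64(traceID[8:]) & (1<<randomBits - 1)
	threshold := uint64(s.Ratio() * float64(uint64(1)<<randomBits))
	return v < threshold
}

func main() {
	var id [16]byte
	id[9] = 0x80 // low 56 bits equal 2^55, i.e. the midpoint of the range

	s := NewDynamicSampler(0.1)
	fmt.Println("ratio", s.Ratio(), "sampled:", s.ShouldSample(id))

	// Simulate a dynamic config update during an incident.
	s.SetRatio(1.0)
	fmt.Println("ratio", s.Ratio(), "sampled:", s.ShouldSample(id))
}
```

Storing the ratio in an atomic value means the hot path never blocks on a config update, which matters if the update can arrive at any time.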
I lean toward OpAMP as the best way forward for making dynamic configuration updates to the Collector. |
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
@atoulme any plan for this feature? |
See OpAMP for more info |
Some OpAMP links that may be useful:
At present, we have static head-based sampling for tracing. We would like to extend that to support dynamic sampling based on configuration passed down from the control plane. For example, customers should be able to change the sampling rate on the fly in response to an issue they are seeing in production. Once the issue is resolved, they should be able to revert to the default rate, or reduce the rate as a measure of cost control.
Additionally, a communication mechanism between the Collector and a control plane would be needed to configure sampling for individual workflows. For example, workflows such as order processing are more sensitive to errors and faults than others where retry mechanisms are in place.
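To illustrate the per-workflow part of the request, here is a minimal Go sketch of a rate table with a default rate, per-workflow overrides, and a way to revert an override once an incident is over. The `WorkflowRates` type and the workflow names are hypothetical; how the control plane would deliver these values is not specified in this issue and is not modeled here.

```go
package main

import (
	"fmt"
	"sync"
)

// WorkflowRates is a hypothetical table of sampling rates keyed by workflow name.
type WorkflowRates struct {
	mu          sync.RWMutex
	defaultRate float64
	overrides   map[string]float64
}

func NewWorkflowRates(defaultRate float64) *WorkflowRates {
	return &WorkflowRates{
		defaultRate: defaultRate,
		overrides:   make(map[string]float64),
	}
}

// Set installs or replaces the override for one workflow.
func (w *WorkflowRates) Set(workflow string, rate float64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.overrides[workflow] = rate
}

// Revert removes the override so the workflow falls back to the default rate.
func (w *WorkflowRates) Revert(workflow string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.overrides, workflow)
}

// RateFor returns the override for a workflow if present, else the default.
func (w *WorkflowRates) RateFor(workflow string) float64 {
	w.mu.RLock()
	defer w.mu.RUnlock()
	if rate, ok := w.overrides[workflow]; ok {
		return rate
	}
	return w.defaultRate
}

func main() {
	rates := NewWorkflowRates(0.05)

	// During an incident, sample every order-processing trace.
	rates.Set("order-processing", 1.0)
	fmt.Println(rates.RateFor("order-processing"), rates.RateFor("search"))

	// Once resolved, go back to the default.
	rates.Revert("order-processing")
	fmt.Println(rates.RateFor("order-processing"))
}
```

A read-write mutex is used here on the assumption that lookups happen per span while updates are rare; other designs, such as swapping an immutable map atomically, would work as well.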