Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sampler.param shows defaultSamplingProbability instead of samplingRate #2171

Closed
ndascanio-asapp opened this issue Apr 6, 2020 · 12 comments · Fixed by #2230
Closed

sampler.param shows defaultSamplingProbability instead of samplingRate #2171

ndascanio-asapp opened this issue Apr 6, 2020 · 12 comments · Fixed by #2230
Labels
area/sampling bug help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@ndascanio-asapp
Copy link

ndascanio-asapp commented Apr 6, 2020

Apply different probability per service

This issue may be related with Issue #696

Giving this strategy configuration:

     {
       "service_strategies": [
         {
           "service": "ServiceA",
           "type": "probabilistic",
           "param": 1.0
         }
       ],
       "default_strategy": {
         "type": "probabilistic",
         "param": 0.2,
         "operation_strategies": [
           {
             "operation": "/health",
             "type": "probabilistic",
             "param": 0.0
           }
         ]
       }
     }

The client correctly receives this configuration:

{
  "strategyType": "PROBABILISTIC",
  "probabilisticSampling": {
    "samplingRate": 1
  },
  "operationSampling": {
    "defaultSamplingProbability": 0.2,
    "defaultLowerBoundTracesPerSecond": 0,
    "perOperationStrategies": [
     {
      "operation": "/health",
      "probabilisticSampling": {
        "samplingRate": 0
       }
     }
   ],
  "defaultUpperBoundTracesPerSecond": 0
  }
}

However, the traces for ServiceA have this tag:

sampler.param: 0.2  
sampler.type: "probabilistic"

The same happens if I change the strategy of serviceA to ratelimit. I don't know if it's just the tag that it's wrong or if it's using the default strategy instead of the correct one.

I'm using this docker image: jaegertracing/all-in-one:1.17.1

@ghost ghost added the needs-triage label Apr 6, 2020
@yurishkuro
Copy link
Member

What is the implementation language of ServiceA?

@ndascanio-asapp
Copy link
Author

ServiceA is in go

This is serviceA:

package main

import (
	"net/http"

	"github.com/opentracing/opentracing-go"

	"github.com/uber/jaeger-client-go"
	tConfig "github.com/uber/jaeger-client-go/config"
)

func aaaHandler(w http.ResponseWriter, r *http.Request) {
	span := opentracing.GlobalTracer().StartSpan("aaa")
	defer span.Finish()
	w.Write([]byte("OK"))
}
func bbbHandler(w http.ResponseWriter, r *http.Request) {
	span := opentracing.GlobalTracer().StartSpan("bbb")
	defer span.Finish()
	w.Write([]byte("OK"))
}
func cccHandler(w http.ResponseWriter, r *http.Request) {
	span := opentracing.GlobalTracer().StartSpan("ccc")
	defer span.Finish()
	w.Write([]byte("OK"))
}

func main() {
	cfg, err := tConfig.FromEnv()
	cfg.Reporter.LogSpans = true
	if err != nil {
		panic(err)
	}

	tracer, closer, err := cfg.NewTracer(
		tConfig.Logger(jaeger.StdLogger),
	)
	opentracing.SetGlobalTracer(tracer)

	if err != nil {
		panic(err)
	}
	defer closer.Close()

	http.HandleFunc("/aaa", aaaHandler)
	http.HandleFunc("/bbb", bbbHandler)
	http.HandleFunc("/ccc", cccHandler)
	http.ListenAndServe(":8080", nil)
}

This is go.mod:

module servicea

go 1.12

require (
	github.com/codahale/hdrhistogram v0.0.0-20161010025455-3a0bb77429bd // indirect
	github.com/opentracing/opentracing-go v1.1.0
	github.com/pkg/errors v0.9.1 // indirect
	github.com/stretchr/testify v1.5.1 // indirect
	github.com/uber/jaeger-client-go v2.22.1+incompatible
	github.com/uber/jaeger-lib v2.2.0+incompatible // indirect
	go.uber.org/atomic v1.6.0 // indirect
)

strategies is the one in the original question.

I hit endpoints aaa, bbb and ccc 10 times each:

for i in {1..10}; do
	curl http://localhost:8080/aaa
	curl http://localhost:8080/bbb
	curl http://localhost:8080/ccc
done

What I expect: 30 traces, 10 for each service
What I get: 7 traces: 2 for aaa, 2 for bbb and 3 for ccc (it's using the probabilistic 20% instead of the service strategy).

The trace has sampler.param: 0.2.

Changing just the part of the service strategy to

       "service_strategies": [
         {
           "service": "ServiceA",
           "type": "ratelimit",
           "param": 1.0
         }

The service gets this:

$ curl http://127.0.0.1:5778/sampling?service=ServiceA | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   303  100   303    0     0  48557      0 --:--:-- --:--:-- --:--:-- 50500
{
  "strategyType": "RATE_LIMITING",
  "rateLimitingSampling": {
    "maxTracesPerSecond": 1
  },
  "operationSampling": {
    "defaultSamplingProbability": 0.2,
    "defaultLowerBoundTracesPerSecond": 0,
    "perOperationStrategies": [
      {
        "operation": "/health",
        "probabilisticSampling": {
          "samplingRate": 0
        }
      }
    ],
    "defaultUpperBoundTracesPerSecond": 0
  }
}

I get the same result (7-8 traces) instead of 30.
sampler.type: "probabilistic" and sampler.param: 0.2.

Finally, only declaring a service_strategy and no default strategy:

{
  "service_strategies": [
   {
     "service": "ServiceA",
     "type": "ratelimiting",
     "param": 1.0
   }
  ]
}

In this case I do get what I expect (though the ratelimiting applies globaly instead of per operation, but I think that this is how it works)

@yurishkuro
Copy link
Member

per-operation strategies do not work with rate limiting, only probabilistic sampling.

@ndascanio-asapp
Copy link
Author

Any update on this?

@yurishkuro
Copy link
Member

Going back to your question.

$ cat sampling.json
   {
       "service_strategies": [
         {
           "service": "ServiceA",
           "type": "probabilistic",
           "param": 1.0
         }
       ],
       "default_strategy": {
         "type": "probabilistic",
         "param": 0.2,
         "operation_strategies": [
           {
             "operation": "/health",
             "type": "probabilistic",
             "param": 0.0
           }
         ]
       }
     }
$ SPAN_STORAGE_TYPE=memory go run ./cmd/collector --sampling.strategies-file sampling.json

$ curl 'http://localhost:14268/api/sampling?service=ServiceA' | jq
{
  "strategyType": "PROBABILISTIC",
  "probabilisticSampling": {
    "samplingRate": 1
  },
  "operationSampling": {
    "defaultSamplingProbability": 0.2,
    "defaultLowerBoundTracesPerSecond": 0,
    "perOperationStrategies": [
      {
        "operation": "/health",
        "probabilisticSampling": {
          "samplingRate": 0
        }
      }
    ]
  }
}

$ curl 'http://localhost:14268/api/sampling?service=ServiceB' | jq
{
  "strategyType": "PROBABILISTIC",
  "probabilisticSampling": {
    "samplingRate": 0.2
  },
  "operationSampling": {
    "defaultSamplingProbability": 0.2,
    "defaultLowerBoundTracesPerSecond": 0,
    "perOperationStrategies": [
      {
        "operation": "/health",
        "probabilisticSampling": {
          "samplingRate": 0
        }
      }
    ]
  }
}

While the returned values do reflect the difference between ServiceA and ServiceB, the overall format seems to be incorrect, because the presence of operationSampling entry overrides the normal selection via strategyType and probabilisticSampling. So the fact that the client keeps using 0.2 probability is expected.

I think there are two issues here:

  1. strategyType and probabilisticSampling should not be returned at all if operationSampling is present.
  2. when reading the config file, the specification of ServiceA (rate=1) is not taking precedence over the default_strategy.

@yurishkuro yurishkuro added bug help wanted Features that maintainers are willing to accept but do not have cycles to implement and removed needs-triage labels May 4, 2020
@yurishkuro
Copy link
Member

cc people who recently worked on this, in case someone wants to pick this up: @defool @rutgerbrf

@defool
Copy link
Contributor

defool commented May 5, 2020

I can reproduce the problem(or feature).
If a service defined in service_strategies without specifying a operation_strategies, default_strategy will take precedence.

In order to provide the ability to define the default_strategy for one service, I prefer this solution:
If a service has defined in service_strategies, even though there's no any operation_strategies, jaeger collector API will not returns the default_strategy to clients.
And a bool flag --sampling.disable-merge-strategies should be introduced to solve the compatibility issue.

@yurishkuro
Copy link
Member

If a service is defined in service_strategies, even though there's no any operation_strategies, jaeger collector API will not returns the default_strategy to clients for this service.

Almost agree, however, if default_strategy defines a single policy for /health endpoint, I think it should still apply and not fall back onto the top-level service's entry.

@rutgerbrf
Copy link
Contributor

rutgerbrf commented May 5, 2020

As @defool noted, when service strategy is defined, it by default inherits its operationSampling from the default strategy, unless it defines its own. Only the per-operation strategies are merged. So unless a service defines its own operation sampling, the default defaultSamplingProbability and other properties operationSampling will be set without regard to the SamplingStrategyResponse itself. This is caused by this line:

newStore.serviceStrategies[s.Service].OperationSampling = newStore.defaultStrategy.OperationSampling

As you can see here:

a simple nil check is used to determine if the default operationSampling should be used.

In the case that user has defined the service strategy, I think its sampling rate should take precedence over the default strategy, because being able to have a custom sampling rate for a single service is one of the reasons that someone would want to define a service strategy. I'm not familiar with the Thrift transport, but it does look like strategyType is a required field for the SamplingStrategyResponse, as are the defaultSamplingProbability, defaultLowerBoundTracesPerSecond and perOperationStrategies for the PerOperationSamplingStrategies struct (https://github.com/jaegertracing/jaeger-idl/blob/d4063e359bc52eca57cdbeb92a179fd3aa0250d6/thrift/sampling.thrift).

Perhaps it would be best to simply set the fields (defaultSamplingProbability, defaultLowerBoundTracesPerSecond and defaultUpperBoundTracesPerSecond) of the operationSampling object to the values present in the SamplingStrategyResponse, as defined for the service by the user. I'm not sure if this change would break compatibility, but it would be what I as a user expect as the behavior.

@defool
Copy link
Contributor

defool commented May 6, 2020

If a service is defined in service_strategies, even though there's no any operation_strategies, jaeger collector API will not returns the default_strategy to clients for this service.

Almost agree, however, if default_strategy defines a single policy for /health endpoint, I think it should still apply and not fall back onto the top-level service's entry.

There are two types of precedence:

  1. service-operation > service > default-operation > default
  2. service-operation (config in service is useless) > default-operation > default

2 is the current's logic.
1 is the PR's logic. I think it is more reasonable and provides the real ability to define the default strategy for specify service.

@yurishkuro
Copy link
Member

What about 3) service-operation > default-operation > service > default?

With 1), as soon as you define a service level, default-operation becomes irrelevant,

@defool
Copy link
Contributor

defool commented May 7, 2020

What about 3) service-operation > default-operation > service > default?

With 1), as soon as you define a service level, default-operation becomes irrelevant,

This is okay, default-operation will takes affect in all service. The solution should be to set operationSampling based on service's config as @rutgerbrf mentioned.
I think I can open another PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sampling bug help wanted Features that maintainers are willing to accept but do not have cycles to implement
Projects
None yet
5 participants