Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTTP handler for metrics querying #3095

Merged
merged 14 commits into from
Jun 21, 2021

Conversation

albertteoh
Copy link
Contributor

Signed-off-by: albertteoh albert.teoh@logz.io

Which problem is this PR solving?

Short description of the changes

  • Adds the HTTP handlers for metrics querying.

Signed-off-by: albertteoh <albert.teoh@logz.io>
cmd/query/app/handler_archive_test.go Show resolved Hide resolved
cmd/query/app/http_handler.go Outdated Show resolved Hide resolved
cmd/query/app/http_handler.go Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jun 16, 2021

Codecov Report

Merging #3095 (fddc433) into master (13885e5) will increase coverage by 0.09%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3095      +/-   ##
==========================================
+ Coverage   95.86%   95.96%   +0.09%     
==========================================
  Files         235      236       +1     
  Lines       10143    10252     +109     
==========================================
+ Hits         9724     9838     +114     
+ Misses        349      344       -5     
  Partials       70       70              
Impacted Files Coverage Δ
cmd/query/app/handler_options.go 100.00% <100.00%> (ø)
cmd/query/app/http_handler.go 100.00% <100.00%> (ø)
cmd/query/app/json_marshaler.go 100.00% <100.00%> (ø)
cmd/query/app/query_parser.go 100.00% <100.00%> (ø)
cmd/query/app/server.go 97.22% <100.00%> (+0.01%) ⬆️
plugin/storage/integration/integration.go 77.90% <0.00%> (+0.55%) ⬆️
...lugin/sampling/strategystore/adaptive/processor.go 100.00% <0.00%> (+0.92%) ⬆️
pkg/config/tlscfg/cert_watcher.go 94.80% <0.00%> (+2.59%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 13885e5...fddc433. Read the comment docs.

@albertteoh albertteoh marked this pull request as ready for review June 16, 2021 13:18
@albertteoh albertteoh requested a review from a team as a code owner June 16, 2021 13:18
@albertteoh albertteoh requested a review from vprithvi June 16, 2021 13:18
@@ -115,6 +127,10 @@ func (aH *APIHandler) RegisterRoutes(router *mux.Router) {
// TODO - remove this when UI catches up
aH.handleFunc(router, aH.getOperationsLegacy, "/services/{%s}/operations", serviceParam).Methods(http.MethodGet)
aH.handleFunc(router, aH.dependencies, "/dependencies").Methods(http.MethodGet)
aH.handleFunc(router, aH.latencies, "/metrics/latencies/{%s}", servicesParam).Methods(http.MethodGet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you particularly set on passing service in the URL? We tried to move away from that (see L126)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I'm not particularly attached to passing service as a URL path segment. Just thought that having services in the URL path is an affordance to suggest it is a mandatory parameter.

What was the reason behind moving away from it? I only found this comment, but couldn't find the discussion on why there was final agreement on moving in this direction: #52 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason behind moving away from it?

  1. part of my holy war on high cardinality REST URLs (when params are encoded in the URL we can't use URL as the span name)
  2. when the service name is part of the URL people sometimes tend to think it can go there unencoded, which breaks when the service name contains a slash

in the end though - not very strong reasons

albertteoh added 3 commits June 17, 2021 23:08
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
cmd/query/app/http_handler.go Show resolved Hide resolved
cmd/query/app/http_handler.go Show resolved Hide resolved
service := r.FormValue(serviceParam)
operation := r.FormValue(operationParam)

startTime, err := p.parseTime(startTimeParam, r)
startTime, err := p.parseTime(r, startTimeParam, time.Microsecond)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the search endpoint defines times in microseconds since epoch (why?), but other endpoints such as dependencies and metrics use milliseconds since epoch, since those are the time units JS (i.e. Jaeger UI) use e.g. Date.now(), so we need to pass the units to the parseTime function to support both cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add this to comments in the code. The search operates on span latencies, which are expressed as microseconds in the data model, so it makes sense to support high accuracy. The search UI itself does not insist on exact units because it supports string like 1ms. We had a debate over whether units should be handled by the UI instead of the backend service, but here we are, since Go makes parsing 1ms very easy.

The dependencies API does not operate on the latency space, instead its timestamps are just time range selections, and the typical backend granularity of those is on the order of 15min or more. So microseconds aren't really that useful in this domain, although I certainly would've preferred having a consistent time representation.

Metrics is a new domain, you need to decide which representation makes more sense. I think it's closer to dependencies, where you're asking about wall clock time, not the latency domain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The search operates on span latencies, which are expressed as microseconds in the data model

Out of curiosity, why micros and not posix nanos which I think also fit into an int64? Were there some limitations in storage or clients to handle posix nanos?

Metrics is a new domain, you need to decide which representation makes more sense. I think it's closer to dependencies, where you're asking about wall clock time, not the latency domain.

I think milliseconds are most appropriate (I've documented these reasons in comments within queryParser):

  • The main client in mind is Jaeger UI. Being a react.js app, its built-in time precision is milliseconds so it simplifies usage for Jaeger UI.
  • The min step size is 1ms for the Prometheus-compliant storage backends I've tested.
  • It also follows that data point timestamps are reported with a millisecond precision, at least from my tests so far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why micros and not posix nanos which I think also fit into an int64? Were there some limitations in storage or clients to handle posix nanos?

legacy from zipkin origins, which still only supports micros, whereas our domain model uses Go's time.Time which has nanos. But our Thrift API uses micros.

cmd/query/app/query_parser.go Show resolved Hide resolved

// If no units are supplied, assume parsing of duration strings like 5ms.
case units == nil:
if d, err = time.ParseDuration(formValue); err == nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the search endpoint defines durations as duration strings, presumably to support human-friendly clients like curl, etc. so this maintains support for such duration parameter values.

albertteoh added 3 commits June 19, 2021 20:43
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
Copy link
Contributor Author

@albertteoh albertteoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tidied a few things up:

  • use encapsulation for duration parsing
  • move query param parsing logic for each endpoint into queryParser
  • fix up comments
  • unexport unnecessarily exported var
  • wrap errors

PTAL.

cmd/query/app/query_parser.go Outdated Show resolved Hide resolved
cmd/query/app/http_handler.go Outdated Show resolved Hide resolved
cmd/query/app/query_parser.go Outdated Show resolved Hide resolved
service := r.FormValue(serviceParam)
operation := r.FormValue(operationParam)

startTime, err := p.parseTime(startTimeParam, r)
startTime, err := p.parseTime(r, startTimeParam, time.Microsecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add this to comments in the code. The search operates on span latencies, which are expressed as microseconds in the data model, so it makes sense to support high accuracy. The search UI itself does not insist on exact units because it supports string like 1ms. We had a debate over whether units should be handled by the UI instead of the backend service, but here we are, since Go makes parsing 1ms very easy.

The dependencies API does not operate on the latency space, instead its timestamps are just time range selections, and the typical backend granularity of those is on the order of 15min or more. So microseconds aren't really that useful in this domain, although I certainly would've preferred having a consistent time representation.

Metrics is a new domain, you need to decide which representation makes more sense. I think it's closer to dependencies, where you're asking about wall clock time, not the latency domain.

albertteoh added 2 commits June 20, 2021 11:23
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
// parse takes a request and constructs a model of parameters
// parseTraceQueryParams takes a request and constructs a model of parameters.
//
// Why start/end parameters are expressed in microseconds:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro I've added your comments here for posterity. Please advise if they would be better placed else where.


// parseMetricsQueryParams takes a request and constructs a model of metrics query parameters.
//
// Why the API is designed using an end time (endTs) and lookback:
Copy link
Contributor Author

@albertteoh albertteoh Jun 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added reasons for various parameter decisions here. Again, if you can think of a better place to put these, I can move them.

spanstore.OperationQueryParameters{ServiceName: "abc/trifle"},
spanstore.OperationQueryParameters{
ServiceName: "abc/trifle",
SpanKind: "server",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the assertion to catch my earlier breaking API change.

albertteoh added 2 commits June 20, 2021 18:21
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
@@ -76,15 +121,16 @@ type traceQueryParameters struct {
// key := strValue
// keyValue := strValue ':' strValue
// tags :== 'tags=' jsonMap
func (p *queryParser) parse(r *http.Request) (*traceQueryParameters, error) {
func (p *queryParser) parseTraceQueryParams(r *http.Request) (*traceQueryParameters, error) {
dp := durationStringParser{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declared too far from usage

lookback time.Duration
}

durationParser interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embrace functional programming - this is just a function, and the two parsers can be easily implemented to return functions, without using any structs

// Valid input span kinds are the string representations from the OpenTelemetry model/proto/metrics/otelspankind.proto.
// For example:
// - "SPAN_KIND_SERVER"
// - "SPAN_KIND_CLIENT"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this consistent with the strings we used in the Operations API? I thought we used strings client/server, as defined in OpenTracing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good pickup, I couldn't find formal docs in comments but found evidence where "server" and "client" are used in the operations API unit tests.

Added a mapping from jaeger/opentracing span kinds to OTEL span kinds.

Signed-off-by: albertteoh <albert.teoh@logz.io>
"github.com/gogo/protobuf/proto"
)

type jsonMarshaler = func(writer io.Writer, response interface{}) error
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to follow your advice and embrace functional programming here as well; it looks much more concise :)

// Valid input span kinds are the string representations from the OpenTelemetry model/proto/metrics/otelspankind.proto.
// For example:
// - "SPAN_KIND_SERVER"
// - "SPAN_KIND_CLIENT"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good pickup, I couldn't find formal docs in comments but found evidence where "server" and "client" are used in the operations API unit tests.

Added a mapping from jaeger/opentracing span kinds to OTEL span kinds.

lookback time.Duration
}

durationParser = func(s string) (time.Duration, error)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

converted the duration parsers to functions. I did initially try to adopt functional programming but was thrown by the fact that the unit duration parser needed to maintain state (the duration units), but hadn't thought that we can get around this via closures.

Thanks for pointing this out, @yurishkuro, it looks a lot neater, and I've learnt something new! :)

cmd/query/app/query_parser_test.go Outdated Show resolved Hide resolved
cmd/query/app/query_parser.go Outdated Show resolved Hide resolved
cmd/query/app/query_parser.go Outdated Show resolved Hide resolved
Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
@yurishkuro yurishkuro merged commit 681dd68 into jaegertracing:master Jun 21, 2021
@albertteoh albertteoh deleted the 2954-http-handler branch June 21, 2021 04:37
@albertteoh
Copy link
Contributor Author

Thanks for the review, Yuri!

@jpkrohling jpkrohling added this to the Release 1.24.0 milestone Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants