Add circular in-memory exemplars storage #6635
Conversation
The one concern I have with a separate ExemplarAppender interface is that it expects that exemplars are stored outside of the actual time series. Given that exemplars are emitted alongside a value in a time series, I think it's beneficial to make the decision of how to persist them a concern of the storage. By augmenting the Appender interface with something like #6844 (minimally intrusive), the storage can decide how it would like to deal with exemplars, by implementing the Appender accordingly. A storage can ignore the exemplar, store both exemplar and value, or store just the exemplars, like the in-memory storage you have in this PR.
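The augmented interface could look roughly like the following; a minimal runnable sketch, where `Labels` and `Exemplar` are simplified stand-ins for the real Prometheus types and `noopExemplarAppender` is a hypothetical storage that chooses to drop exemplars:

```go
package main

import "fmt"

// Labels and Exemplar are simplified stand-ins for Prometheus' real types
// (pkg/labels and the exemplar package), for illustration only.
type Labels string

type Exemplar struct {
	Labels Labels
	Value  float64
	Ts     int64
}

// Appender sketches the augmented storage.Appender: the storage itself decides
// what to do with an exemplar -- ignore it, store it alongside the sample, or
// keep it in a separate in-memory buffer.
type Appender interface {
	Add(l Labels, t int64, v float64) error
	AddExemplar(l Labels, t int64, e Exemplar) error
}

// noopExemplarAppender stores samples but ignores exemplars, which a storage
// is free to do under this interface.
type noopExemplarAppender struct{}

func (noopExemplarAppender) Add(l Labels, t int64, v float64) error {
	fmt.Printf("sample %s t=%d v=%g\n", l, t, v)
	return nil
}

func (noopExemplarAppender) AddExemplar(l Labels, t int64, e Exemplar) error {
	return nil // this storage chooses not to persist exemplars
}

func main() {
	var app Appender = noopExemplarAppender{}
	app.Add("http_requests_total", 1600096945479, 6)
	app.AddExemplar("http_requests_total", 1600096945479,
		Exemplar{Labels: `traceID="EpTxMJ40fUus7aGY"`, Value: 6})
}
```

The in-memory storage in this PR would then be one possible implementation of `AddExemplar`, rather than a separate appender type.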
cmd/prometheus/main.go
Outdated
@@ -224,6 +225,9 @@ func main() {
	a.Flag("storage.remote.read-max-bytes-in-frame", "Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note that client might have limit on frame size as well. 1MB as recommended by protobuf by default.").
		Default("1048576").IntVar(&cfg.web.RemoteReadBytesInFrame)

	a.Flag("storage.exemplars.exemplars-per-series-limit", "Maximum number of exemplars to store in in-memory exemplar storage per series.").
		Default("10").IntVar(&cfg.ExemplarsLimit)
If we assume about 20% of series have exemplars (and have filled all 10 of their exemplar slots), then based on the math in the design doc this would result in roughly a 10% memory increase.
Since I'll be picking up this work again soon, I've rebased off master, fixed the resulting issues, and squashed the commits down to one. The failing test is in

@bwplotka I'm struggling to understand how to use the new Queriers. The exemplar API requires a set of
|
The error you are getting is related to concurrent selects, really. You cannot run multiple Selects if you have already called Next on any result from the same querier's Select. Just schedule all Selects first, then iterate: this allows the Selects to run concurrently.
@bwplotka Hmm, okay, so this was because I was calling the exemplar querier's Select nested within the querier's Select. Since with remote read the Select call can now be streaming, is that where the requirement comes from?
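The "schedule all Selects first, then iterate" pattern can be illustrated with a toy querier; the types here are simplified stand-ins, not the real storage.Querier API:

```go
package main

import "fmt"

// seriesSet is a minimal stand-in for storage.SeriesSet.
type seriesSet struct {
	data []string
	i    int
}

func (s *seriesSet) Next() bool { s.i++; return s.i <= len(s.data) }
func (s *seriesSet) At() string { return s.data[s.i-1] }

// querier is a toy stand-in for storage.Querier; a real streaming querier may
// not permit a new Select once Next has been called on an earlier result.
type querier struct{}

func (q *querier) Select(matcher string) *seriesSet {
	return &seriesSet{data: []string{matcher + "-a", matcher + "-b"}}
}

func main() {
	q := &querier{}
	matchers := []string{"up", "go_goroutines"}

	// Schedule all Selects first...
	sets := make([]*seriesSet, 0, len(matchers))
	for _, m := range matchers {
		sets = append(sets, q.Select(m))
	}

	// ...then iterate, so no Select is issued after a Next on the same querier.
	for _, s := range sets {
		for s.Next() {
			fmt.Println(s.At())
		}
	}
}
```

Nesting the exemplar querier's Select inside the iteration loop of another Select is exactly what this ordering avoids.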
docs/querying/api.md
Outdated
- `start=<rfc3339 | unix_timestamp>`: Start timestamp.
- `end=<rfc3339 | unix_timestamp>`: End timestamp.

Exemplar storage is not part of Prometheus' TSDB, so this endpoint doesn't allow filtering based on time.
Then you probably want to remove the start and end params?
I should just update this comment instead. I added the start and end params a few days later, after realizing exemplar queries are most likely going to be executed alongside a query_range query associated with a Grafana dashboard. If additional use cases come up for other query endpoint options/implementations we can discuss those, but I've been building the current endpoint implementation on the assumption that the dashboard panel -> exemplars query is the most likely path.
This is still unaddressed, then.
@cstyan I looked at the design doc: https://docs.google.com/document/d/1ymZlc9yuTj8GvZyKz1r3KDRrhaOjZ1W1qZVW_5Gj7gA/edit But I have no clue, sorry; I am just a user. I would really love this feature, though: jumping from a Grafana graph to traces would be magic. Thank you for working on this!
I have only briefly glanced over the TSDB changes and I still have to take a detailed look at the WAL implementation, but we need some changes there.
- tsdb/wal.go is deprecated IIUC; you should be making the changes in tsdb/wal/wal.go. (I am not aware of how the old WAL repair is set up, so that might require some changes too, just to translate everything else in the wal package properly.)
- I don't feel comfortable with the storage (i.e. the circular queue) being outside the TSDB while we are also writing into the WAL. Should we have the exemplar storage in the TSDB too? (And maybe in the Head; you have to touch it for the WAL anyway.)
- Are we restoring the exemplars from the WAL back into the circular queue on startup? (I could not find it; I might have missed it.)
tsdb/exemplar.go
Outdated
lastTs := ce.exemplars[idx].exemplar.Ts

for {
	if idx == -1 || string(ce.exemplars[idx].seriesLabels.Bytes(buf)) != string(l.Bytes(buf)) {
If idx could be -1, would that not already panic above, at `lastTs := ce.exemplars[idx].exemplar.Ts`? (Update: I see that if it was `ok`, then it won't be `-1` at the start. Can you add a comment for that?)

Plus, are you sure using the same `buf` on both sides of `!=` is not going to cause any problems?
I will add some comments to clarify the checks here; they're not nice, but the implementation of the circular buffer can still be cleaned up.

I am not sure whether using the same `buf` will cause any problems. I think it should be fine (unless my understanding of how bytes.Buffer uses the underlying slice is wrong), since we create a new `bytes.Buffer` from `buf` in `Labels()` and return `Buffer.Bytes`, which is only valid until the next `Write`/`Read`/etc. call on that same `bytes.Buffer`; but we create a new one on the right-hand side of `!=`. I can always change this to `l.String()` if necessary.
Is this clearer now?
tsdb/exemplar.go
Outdated
lastTs = ce.exemplars[idx].exemplar.Ts
// Prepend since this exemplar came before the last one we appended chronologically.
if e.Ts >= hints.Start && e.Ts <= hints.End {
	ret = append([]exemplar.Exemplar{e}, ret...)
Might be a micro-optimisation, but we could preallocate the slice here; otherwise there is some unnecessary allocation when there are many exemplars.
Any suggestions for a size to preallocate? There could theoretically be as many exemplars as the max size of the overall circular buffer, if only one series has exemplars.
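One way to sidestep the per-iteration allocation from prepending is to preallocate, append newest-first while walking the buffer backwards, and reverse once at the end. A sketch with a simplified exemplar type (the real code would use exemplar.Exemplar and cap to the buffer size):

```go
package main

import "fmt"

// exemplarT is a simplified stand-in for exemplar.Exemplar.
type exemplarT struct{ Ts int64 }

// collectOldestFirst takes exemplars in the newest-first order produced by
// walking the circular buffer backwards, and returns them oldest-first.
// Appending into a preallocated slice and reversing once avoids the repeated
// allocation of ret = append([]exemplarT{e}, ret...) on every iteration.
func collectOldestFirst(newestFirst []exemplarT) []exemplarT {
	ret := make([]exemplarT, 0, len(newestFirst)) // preallocate once
	ret = append(ret, newestFirst...)
	for i, j := 0, len(ret)-1; i < j; i, j = i+1, j-1 {
		ret[i], ret[j] = ret[j], ret[i]
	}
	return ret
}

func main() {
	fmt.Println(collectOldestFirst([]exemplarT{{3}, {2}, {1}}))
}
```

A conservative preallocation (e.g. the per-series limit) with append growing past it as needed would also cut most of the allocations without reserving the whole buffer size.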
tsdb/exemplar.go
Outdated
	return storage.ErrOutOfOrderExemplar
}
ce.exemplars[ce.nextIndex] = circularBufferEntry{exemplar: e, seriesLabels: l, prev: idx}
ce.index[l.String()] = ce.nextIndex
If we will have >1 exemplar per series in general, would string interning help here for memory?
Probably, I'll look at using the interner.
The head symbol table is basically that already; it'll have all the sample labels.
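For reference, the interning idea is roughly the following; a toy sketch of the concept, not the actual interner used elsewhere in Prometheus (which also reference-counts entries so they can be released):

```go
package main

import "fmt"

// interner deduplicates label-set strings so every map key for the same series
// shares a single canonical copy, instead of each AddExemplar call holding its
// own allocation of l.String().
type interner struct {
	pool map[string]string
}

func newInterner() *interner { return &interner{pool: map[string]string{}} }

func (in *interner) intern(s string) string {
	if c, ok := in.pool[s]; ok {
		return c // reuse the canonical copy
	}
	in.pool[s] = s
	return s
}

func main() {
	in := newInterner()
	a := in.intern(`series{job="api"}`)
	b := in.intern("series" + `{job="api"}`) // same content, separate allocation
	fmt.Println(a == b, len(in.pool))
}
```

Whether this wins anything over keying on the head's symbol table depends on how long the exemplar storage holds its keys relative to the series lifecycle.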
Just a quick glance over.
docs/querying/api.md
Outdated
},
"value": 6,
"timestamp": 1600096945479,
"hasTimestamp": true
Why do we need this field? If there's no timestamp, return no timestamp.
Right now I'm just returning JSON-encoded data from the `exemplar.Exemplar` struct. I don't actually understand why the field exists on that struct in the first place; not sure if it's part of the OpenMetrics spec or if @shreyassrivatsan had another idea there. Maybe to indicate whether the timestamp was exported by the source, instead of it being the scrape timestamp?

I can remove the field from the API output, though.
The exemplar timestamp is optional in the exposition. It is a different thing from the scrape/sample timestamp.
Thanks for clarifying about the exposition.

Using the scrape/sample timestamp when an exported timestamp was not found allows users to query for exemplars over the same time range as their existing PromQL query. Both timestamps could be stored as separate fields, though.
Would you send on a duplicate for every scrape, or just once?
Like you mentioned in the other thread regarding the duplicate error (or not error), it's likely the exported exemplar for a series might not change from one scrape to the next, so for each subsequent scrape we'd skip storing the exemplar again until we see a new one.
This would likely be handled as part of the same JSON marshalling as in the previous comment thread?
Yes, that'd all be handled together.
docs/querying/api.md
Outdated
"labels": {
	"traceID": "EpTxMJ40fUus7aGY"
},
"value": 6,
Like all float values in our APIs, this should be a string; +Inf is in theory valid.
I'll look at how other endpoints are marshalling data then.
Any suggestion for where to make this change? Looks like we just marshal the promql `Result` type:

```go
// Result holds the resulting value of an execution or an error
// if any occurred.
type Result struct {
	Err      error
	Value    parser.Value
	Warnings storage.Warnings
}
```

where `Value` is:

```go
// Value is a generic interface for values resulting from a query evaluation.
type Value interface {
	Type() ValueType
	String() string
}
```

and the valid types are:

```go
// The valid value types.
const (
	ValueTypeNone   ValueType = "none"
	ValueTypeVector ValueType = "vector"
	ValueTypeScalar ValueType = "scalar"
	ValueTypeMatrix ValueType = "matrix"
	ValueTypeString ValueType = "string"
)
```

The exemplar API endpoint can always return a matrix of exemplars. Would you suggest something like a new `ExemplarMatrix` type, such as:

```go
// Exemplar represents a single data point for a given timestamp.
type Exemplar struct {
	Labels labels.Labels `json:"labels"`
	T      int64
	V      float64
}

// ExemplarSeries is a stream of exemplars belonging to a metric series.
type ExemplarSeries struct {
	Metric    labels.Labels `json:"metric"`
	Exemplars []Exemplar    `json:"exemplars"`
}

// ExemplarMatrix is a slice of ExemplarSeries that implements sort.Interface
// and has a String method.
type ExemplarMatrix []ExemplarSeries
```
Line 1621 in c8f8100
func marshalPointJSON(ptr unsafe.Pointer, stream *jsoniter.Stream) {

To do this right this time, I'd suggest actually making it a string and not tying our internal representation to the external one.
Not sure if you're suggesting modifying/creating a new struct type, or registering something like `jsoniter.RegisterTypeEncoderFunc("exemplar.Exemplar", marshalExemplarJSON, marshalExemplarJSONIsEmpty)` that uses something like `Sprintf("%f", exemplar.Value)`.
I'm suggesting modifying the type, RegisterTypeEncoderFunc is a hack for performance reasons and should be avoided.
What is the status on this?
The doc needs to be updated.
scrape/scrape.go
Outdated
}
// todo: This smells funny.
if hasExemplar := p.Exemplar(&e); hasExemplar {
	if err := exemplarApp.AddExemplar(ce.lset, t, e); err != nil {
Should we have sanity checks on the exemplar timestamp, like we do for scraping generally?
Yes, I will look at how we're doing that. My assumption was that the majority of the time an exemplar would have the same timestamp as a scraped sample, so I haven't put time into validating the timestamp here yet.
It'd be very rare for it to have the same timestamp; usually it'll be in the past (if there is a timestamp), and it may even appear a bit in the future.
I think this can be marked as resolved now, with the checks in exemplar.go's `addExemplar` and the `ErrOutOfOrderExemplar`. Let me know if you disagree.
No, it's not right. The check should be that it's not more than 10m in the future.
I'm not seeing where we enforce that same "not more than 10m in the future" rule for regular samples. I just see `minValidTime` in head.go.
Line 223 in 275f7e7
const maxAheadTime = 10 * time.Minute
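A hedged sketch of what the corresponding check for exemplars could look like, mirroring that `maxAheadTime` constant; the function and error names here are illustrative, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// maxAheadTime mirrors the scrape-side limit: reject timestamps more than
// 10 minutes ahead of the current scrape time.
const maxAheadTime = 10 * time.Minute

var errTooFarInFuture = errors.New("exemplar timestamp too far in the future")

// validateExemplarTs rejects an exemplar timestamp (milliseconds) that is more
// than maxAheadTime ahead of now (also milliseconds). Timestamps in the past
// are left to the out-of-order check in addExemplar.
func validateExemplarTs(ts, now int64) error {
	if ts > now+maxAheadTime.Milliseconds() {
		return errTooFarInFuture
	}
	return nil
}

func main() {
	now := time.Now().UnixMilli()
	fmt.Println(validateExemplarTs(now, now))
	fmt.Println(validateExemplarTs(now+11*60*1000, now))
}
```

Such a check would sit in the scrape loop before `AddExemplar`, alongside the existing out-of-order handling.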
Thanks for the reviews so far @brian-brazil @codesome. I hope to get to the rest of your comments and some of my own todo list this week still.
I did try a variation of this PR, and I was surprised that for a query like 10e3 * histogram_quantile(foo) the exemplar value was not modified by 10e3. I don't know how easy that would be, but when switching to "explore" with the TNS default dashboard it was confusing to me.
I doubt it's practical, given how complex PromQL can be.
tsdb/exemplar.go
Outdated
lastTs = ce.exemplars[idx].se.scrapeTimestamp
// Prepend since this exemplar came before the last one we appended chronologically.
if e.Ts >= start && e.Ts <= end {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the exemplar doesn't have a timestamp?
I think you still haven't fully worked out the right semantics here for exemplars and timestamps.
Yes, exemplars having their own optional timestamp, separate from when they were scraped, makes this a bit weird. IMO checking only the scrape timestamp is the simplest way forward, given that the current use case is range queries that will most likely be issued alongside a metrics range query, and whose results will be matched up with timestamped samples.
That'd return false positives when the timestamp of the exemplar is before the relevant range, which is not desirable. It's expected that the majority of exemplars will have explicit timestamps.
Have not taken a deep look yet; the overall structure looks fine.
tsdb/exemplar.go
Outdated
ce.nextIndex++
if ce.nextIndex >= cap(ce.exemplars) {
	ce.nextIndex = 0
}
How about `ce.nextIndex = (ce.nextIndex + 1) % len(ce.exemplars)`?
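For illustration, the suggested modulo form advances a circular-buffer index the same way as the increment-and-wrap version; note it uses `len` rather than `cap`, which is equivalent here only because the buffer slice is preallocated to its full length:

```go
package main

import "fmt"

// nextIndex wraps a circular-buffer index using the modulo form suggested
// above; n is the (fixed) length of the preallocated buffer.
func nextIndex(i, n int) int { return (i + 1) % n }

func main() {
	idx := 0
	for step := 0; step < 6; step++ {
		idx = nextIndex(idx, 4)
		fmt.Print(idx, " ")
	}
	fmt.Println() // indices cycle 1 2 3 0 1 2 for a buffer of length 4
}
```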
@tomwilkie found some issues running this with the TNS demo, and given the last-moment rush, let's not force it into 2.25 :) Lots of testing still to do.
I'm happy that we've not been able to reproduce these now; must have been my machine. A few nits and then we should be good to go.
This PR implements a basic in-memory storage of exemplars via a large circular buffer. This means there's an overall limit (configurable by users) on the number of exemplars stored total, but no limit (other than that overall limit) on the number of exemplars stored per series.
The public design doc is here.
Signed-off-by: Callum Styan callumstyan@gmail.com