
Add circular in-memory exemplars storage #6635

Merged
merged 3 commits into from
Mar 16, 2021
Conversation

cstyan
Member

@cstyan cstyan commented Jan 15, 2020

This PR implements a basic in-memory storage of exemplars via a large circular buffer. This means there's an overall limit (configurable by users) on the number of exemplars stored total, but no limit (other than that overall limit) on the number of exemplars stored per series.
The public design doc is here.

Signed-off-by: Callum Styan callumstyan@gmail.com
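As a rough illustration of the approach (a minimal sketch with invented names, not the code in this PR; the real implementation in tsdb/exemplar.go also chains per-series entries and handles eviction):

```go
package main

import "fmt"

// Exemplar is a simplified stand-in for Prometheus' exemplar type.
type Exemplar struct {
	SeriesLabels string // flattened series labels, used as the index key
	Value        float64
	Ts           int64
}

// CircularExemplars keeps the newest N exemplars across all series in
// one fixed-size ring: a global cap, but no per-series cap. Once the
// buffer wraps, the oldest entries are overwritten.
type CircularExemplars struct {
	exemplars []Exemplar
	index     map[string]int // series labels -> slot of newest exemplar
	nextIndex int
}

func NewCircularExemplars(size int) *CircularExemplars {
	return &CircularExemplars{
		exemplars: make([]Exemplar, size),
		index:     make(map[string]int),
	}
}

// Add overwrites the oldest slot with e and records it as the newest
// exemplar for its series.
func (ce *CircularExemplars) Add(e Exemplar) {
	ce.exemplars[ce.nextIndex] = e
	ce.index[e.SeriesLabels] = ce.nextIndex
	ce.nextIndex = (ce.nextIndex + 1) % len(ce.exemplars)
}

// Latest returns the most recently added exemplar for a series.
func (ce *CircularExemplars) Latest(seriesLabels string) (Exemplar, bool) {
	idx, ok := ce.index[seriesLabels]
	if !ok {
		return Exemplar{}, false
	}
	return ce.exemplars[idx], true
}

func main() {
	ce := NewCircularExemplars(3)
	ce.Add(Exemplar{SeriesLabels: `{job="app"}`, Value: 1, Ts: 100})
	ce.Add(Exemplar{SeriesLabels: `{job="app"}`, Value: 2, Ts: 200})
	e, _ := ce.Latest(`{job="app"}`)
	fmt.Println(e.Value) // 2
}
```

This sketch omits what the real code must handle: invalidating index entries whose slot has since been overwritten, and a per-entry prev pointer so all exemplars of one series can be walked backwards through the ring.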

tsdb/exemplar.go Outdated
tsdb/exemplar.go Outdated
tsdb/querier.go Outdated
tsdb/querier.go Outdated
tsdb/exemplar.go Outdated
tsdb/exemplar.go Outdated
@shreyassrivatsan
Contributor

shreyassrivatsan commented Feb 21, 2020

The one concern I have with a separate ExemplarAppender interface is that it expects that exemplars are stored outside of the actual time series. Given that exemplars are emitted alongside a value in a time series, I think it's beneficial to make the decision of how to persist them a concern of the storage.

By augmenting the appender interface with something like #6844 (which is minimally intrusive), the storage can decide how it would like to deal with exemplars by implementing the Appender accordingly. A storage can ignore the exemplar, store both the exemplar and the value, or store just the exemplars, like the in-memory storage in this PR.

@@ -224,6 +225,9 @@ func main() {
a.Flag("storage.remote.read-max-bytes-in-frame", "Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note that client might have limit on frame size as well. 1MB as recommended by protobuf by default.").
Default("1048576").IntVar(&cfg.web.RemoteReadBytesInFrame)

a.Flag("storage.exemplars.exemplars-per-series-limit", "Maximum number of exemplars to store in in-memory exemplar storage per series.").
Default("10").IntVar(&cfg.ExemplarsLimit)
Member Author

If we assume about 20% of series have exemplars (and have filled all 10 of those exemplar slots in their storage), then this would result in about a 10% memory increase, based on the math in the design doc.

@stale stale bot added the stale label Jun 13, 2020
@stale stale bot removed the stale label Aug 12, 2020
@cstyan
Member Author

cstyan commented Aug 13, 2020

Since I'll be picking up this work again soon I've rebased off master, fixed issues from that, and squashed commits down to one. The failing test is in TestEndpoints for remote storage querying for exemplars.

@bwplotka I'm struggling to understand how to use the new Queriers. The exemplar API requires a set of SeriesSet that exist for the query (the API accepts PromQL), and I think the panic I am getting is related to remote read now being streaming. Any suggestions?

--- FAIL: TestEndpoints/remote (0.15s)
        --- FAIL: TestEndpoints/remote/run_47_queryExemplars_"query=test_metric3%7Bcluster%3D%22abc%22%7D+-+test_metric4%7Bcluster%3D%22abc%22%7D" (0.00s)
            --- FAIL: TestEndpoints/remote/run_47_queryExemplars_"query=test_metric3%7Bcluster%3D%22abc%22%7D+-+test_metric4%7Bcluster%3D%22abc%22%7D"/GET (0.00s)
panic: secondaryQuerier: Select invoked after first Next of any returned SeriesSet was done [recovered]
	panic: secondaryQuerier: Select invoked after first Next of any returned SeriesSet was done

web/api/v1/api.go Outdated
web/api/v1/api.go Outdated
web/api/v1/api.go Outdated
tsdb/exemplar.go Outdated
@bwplotka
Member

The error you are getting is really about concurrent selects: you cannot issue further Selects once you have called Next on any SeriesSet returned by the same querier.

Just schedule all the selects first, then iterate; this also allows the selects to run concurrently.
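The scheduling pattern described here can be sketched with a toy querier that enforces the same constraint; the types are simplified stand-ins, not the real storage interfaces:

```go
package main

import "fmt"

// seriesSet is a minimal stand-in for storage.SeriesSet.
type seriesSet struct {
	q    *strictQuerier
	data []string
	i    int
}

func (s *seriesSet) Next() bool {
	s.q.iterating = true // after this, the querier rejects new Selects
	s.i++
	return s.i <= len(s.data)
}

func (s *seriesSet) At() string { return s.data[s.i-1] }

// strictQuerier mimics the secondary querier's constraint: Select
// panics once Next has been called on any set it returned.
type strictQuerier struct{ iterating bool }

func (q *strictQuerier) Select(match string) *seriesSet {
	if q.iterating {
		panic("Select invoked after first Next of any returned SeriesSet was done")
	}
	return &seriesSet{q: q, data: []string{match + "-a", match + "-b"}}
}

func main() {
	q := &strictQuerier{}

	// Schedule every Select up front...
	sets := []*seriesSet{q.Select("x"), q.Select("y")}

	// ...and only then iterate. Interleaving another Select with the
	// Next calls below would trigger the panic seen in the test failure.
	for _, ss := range sets {
		for ss.Next() {
			fmt.Println(ss.At())
		}
	}
}
```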

@cstyan
Member Author

cstyan commented Aug 14, 2020

@bwplotka Hmm, okay, so this happened because I was calling the exemplar querier's Select nested within the querier's Select, since with remote read the Select call can now be streaming? Is that where the requirement

You cannot run multiple Selects if you touch any Next from the same querier's select already.

comes from?

@cstyan cstyan changed the title WIP: MVP exemplar storage separate from Appender interface. MVP in-memory exemplar storage + exemplars in WAL/remote write Sep 24, 2020
@cstyan cstyan marked this pull request as ready for review September 24, 2020 22:17
- `start=<rfc3339 | unix_timestamp>`: Start timestamp.
- `end=<rfc3339 | unix_timestamp>`: End timestamp.

Exemplar storage is not part of Prometheus' TSDB, so this endpoint doesn't allow filtering based on time.
Member

Then you probably want to remove the start and end params?

Member Author

I should just update this comment instead. I added the start and end params a few days later, after realizing that exemplar queries are most likely going to be executed alongside a query_range query for a Grafana dashboard. If additional use cases come up for other query endpoint options/implementations we can discuss those, but I've been building the current endpoint implementation on the assumption that the dashboard panel -> exemplars query is the most likely path.

Member

This is still unaddressed then

@codesome codesome self-assigned this Sep 29, 2020
@guettli

guettli commented Oct 1, 2020

@cstyan I looked at the design docs: https://docs.google.com/document/d/1ymZlc9yuTj8GvZyKz1r3KDRrhaOjZ1W1qZVW_5Gj7gA/edit

But I have no clue, sorry. I am just a user.

But I would really love this feature: Jump from a grafana graph to traces. This would be magic.

Thank you for working on this!

Member

@codesome codesome left a comment

I have only briefly glanced over the TSDB changes and I still have to take a detailed look at the WAL implementation, but we need some changes there.

  1. tsdb/wal.go is deprecated IIUC; you should be making the changes in tsdb/wal/wal.go instead. (I am not aware of how the old WAL repair is set up, so that might require some changes too, but everything else belongs in the wal package.)
  2. I don't feel comfortable with the storage (i.e. the circular queue) being outside TSDB while we are also writing into the WAL. Should we have the exemplar storage in the TSDB too? (And maybe in the Head; you have to touch it for the WAL anyway.)
  3. Are we restoring the exemplars from the WAL into the circular queue on startup? (I could not find it; I might have missed it.)

tsdb/exemplar.go Outdated
lastTs := ce.exemplars[idx].exemplar.Ts

for {
if idx == -1 || string(ce.exemplars[idx].seriesLabels.Bytes(buf)) != string(l.Bytes(buf)) {
Member

If idx could be -1, would that not already panic above at lastTs := ce.exemplars[idx].exemplar.Ts? (Update: I see, if the ok check passed then it won't be -1 at the start. Can you add a comment for that?)

Plus, are you sure using the same buf on both sides of != is not going to cause any problems?

Member Author

I will add some comments to clarify the checks here, they're not nice but the implementation of the circular buffer can still be cleaned up.

I am not sure. I think it should be fine (unless my understanding of how bytes.Buffer uses the underlying slice is wrong): we create a new bytes.Buffer from buf in Labels(), and Buffer.Bytes is only valid until the next Write/Read/etc. call on that same bytes.Buffer, but we create a new one on the right-hand side of !=. I can always change this to l.String() if necessary.

Member Author

is this clearer now?

tsdb/exemplar.go Outdated
lastTs = ce.exemplars[idx].exemplar.Ts
// Prepend since this exemplar came before the last one we appended chronologically.
if e.Ts >= hints.Start && e.Ts <= hints.End {
ret = append([]exemplar.Exemplar{e}, ret...)
Member

Might be a micro-optimisation, but we could preallocate the slice here; otherwise there is unnecessary allocation when there are many exemplars.

Member Author

Any suggestions for a size to preallocate? There could theoretically be as many exemplars as the max size of the overall circular buffer, if only one series has exemplars.
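Since the number of matching exemplars isn't known up front, one alternative to sizing a preallocation is to append in reverse-iteration order and reverse once at the end, which avoids the per-element slice allocation of prepending. A hedged sketch with an invented helper, using ints for brevity:

```go
package main

import "fmt"

// collectChronological takes values in reverse-chronological order
// (as produced by walking the ring backwards from the newest entry)
// and returns them oldest-first, using one allocation plus an
// in-place reversal instead of prepending on every iteration.
func collectChronological(reversed []int) []int {
	out := make([]int, 0, len(reversed)) // single allocation
	out = append(out, reversed...)
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	fmt.Println(collectChronological([]int{3, 2, 1})) // [1 2 3]
}
```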

tsdb/exemplar.go Outdated
tsdb/exemplar.go Outdated
return storage.ErrOutOfOrderExemplar
}
ce.exemplars[ce.nextIndex] = circularBufferEntry{exemplar: e, seriesLabels: l, prev: idx}
ce.index[l.String()] = ce.nextIndex
Member

If we will have >1 exemplar per series in general, would string interning help here for memory?

Member Author

Probably, I'll look at using the interner.

Contributor

The head symbol table is basically that already; it'll have all the sample labels.

tsdb/exemplar.go Outdated
tsdb/exemplar.go Outdated
tsdb/exemplar.go Outdated
Contributor

@brian-brazil brian-brazil left a comment

Just a quick glance over.

cmd/prometheus/main.go Outdated
docs/querying/api.md
},
"value": 6,
"timestamp": 1600096945479,
"hasTimestamp": true
Contributor

Why do we need this field? If there's no timestamp, return no timestamp.

Member Author

Right now I'm just returning JSON-encoded data from the exemplar.Exemplar struct. I don't actually understand why the field exists on that struct in the first place; not sure if it's part of the OM spec or if @shreyassrivatsan had another idea there? Maybe to indicate whether the timestamp was exported by the source, instead of being the scrape timestamp?

I can remove the field from the output for the API though.

Contributor

The exemplar timestamp is optional in the exposition. It is a different thing from the scrape/sample timestamp.

Member Author

Thanks for clarifying about the exposition.

Using the scrape/sample timestamp when an exported timestamp was not found allows users to query for exemplars during the same time range as their existing PromQL query. Both timestamps could be stored as a separate fields though.

Contributor

Would you send on a duplicate for every scrape, or just once?

Member Author

Like you mentioned in the other thread regarding the duplicate error (or not error), it's likely the exported exemplar for a series might not change from one scrape to the next, so for each subsequent scrape we'd skip storing the exemplar again until we see a new one.
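The "skip storing an unchanged exemplar on each subsequent scrape" behaviour described here can be sketched as follows; the Exemplar type and appendIfNew are invented stand-ins, not the PR's code:

```go
package main

import "fmt"

// Exemplar is a simplified stand-in for the real exemplar type.
type Exemplar struct {
	TraceID string
	Value   float64
	Ts      int64
}

// appendIfNew stores e only when it differs from the last stored
// exemplar, so an unchanged exposition does not duplicate the same
// exemplar on every scrape. It returns the (possibly extended) slice
// and whether e was stored.
func appendIfNew(stored []Exemplar, e Exemplar) ([]Exemplar, bool) {
	if n := len(stored); n > 0 && stored[n-1] == e {
		return stored, false
	}
	return append(stored, e), true
}

func main() {
	var stored []Exemplar
	e := Exemplar{TraceID: "abc", Value: 6, Ts: 1600096945479}
	stored, _ = appendIfNew(stored, e)   // stored
	stored, ok := appendIfNew(stored, e) // skipped: unchanged since last scrape
	fmt.Println(len(stored), ok)         // 1 false
}
```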

Member Author

This would likely be a part of the same JSON marshal as in the previous comment thread?

Contributor

Yes, that'd all be handled together.

"labels": {
"traceID": "EpTxMJ40fUus7aGY"
},
"value": 6,
Contributor

Like all float values in our APIs, this should be a string; +Inf is in theory valid.

Member Author

I'll look at how other endpoints are marshalling data then.

Member Author

Any suggestion for where to make this change? Looks like we just marshal the promql Result type:

// Result holds the resulting value of an execution or an error
// if any occurred.
type Result struct {
	Err      error
	Value    parser.Value
	Warnings storage.Warnings
}

where Value is:

// Value is a generic interface for values resulting from a query evaluation.
type Value interface {
	Type() ValueType
	String() string
}

and types are:

// The valid value types.
const (
	ValueTypeNone   ValueType = "none"
	ValueTypeVector ValueType = "vector"
	ValueTypeScalar ValueType = "scalar"
	ValueTypeMatrix ValueType = "matrix"
	ValueTypeString ValueType = "string"
)

The exemplar API endpoint can always return a matrix of exemplars.

Would you suggest something like a new ExemplarMatrix type such as:

// Exemplar represents a single data point for a given timestamp.
type Exemplar struct {
	Labels labels.Labels `json:"labels"`
	T      int64
	V      float64
}

// ExemplarSeries is a stream of exemplars belonging to a metric series.
type ExemplarSeries struct {
	Metric    labels.Labels `json:"metric"`
	Exemplars []Exemplar    `json:"exemplars"`
}

// ExemplarMatrix is a slice of ExemplarSeries that implements sort.Interface and
// has a String method.
type ExemplarMatrix []ExemplarSeries

Contributor

func marshalPointJSON(ptr unsafe.Pointer, stream *jsoniter.Stream) is the main thing to work off, though that's partially a workaround for our structs not matching the JSON output.
To do this right this time, I'd suggest actually making it a string and not tying our internal representations to the external ones.

Member Author

Not sure if you're suggesting modifying or creating a new struct type, or registering something like jsoniter.RegisterTypeEncoderFunc("exemplar.Exemplar", marshalExemplarJSON, marshalExemplarJSONIsEmpty) that uses something like Sprintf("%f", exemplar.Value).

Contributor

I'm suggesting modifying the type, RegisterTypeEncoderFunc is a hack for performance reasons and should be avoided.

Member

What is the status on this?

Member

Doc needs to be updated

prompb/types.proto Outdated
scrape/scrape.go Outdated
}
// todo: This smells funny.
if hasExemplar := p.Exemplar(&e); hasExemplar {
if err := exemplarApp.AddExemplar(ce.lset, t, e); err != nil {
Contributor

Should we have sanity checks on the exemplar timestamp, like we do for scraping generally?

Member Author

Yes, I will look at how we're doing that. My assumption was that the majority of the time an exemplar would have the same timestamp as the scraped sample, so I didn't put time into validating the timestamp here yet.

Contributor

@brian-brazil brian-brazil Oct 6, 2020

It'd be very rare that it'd have the same timestamp; usually it'll be in the past (if there is a timestamp), and it may even appear a bit in the future.

Member Author

I think this can be marked as resolved now with the checks in exemplar.go addExemplar and the ErrOutOfOrderExemplar. Let me know if you disagree.

Contributor

No, it's not right. The check should be that it's not more than 10m in the future.

Member Author

I'm not seeing where we enforce that "not more than 10m in the future" rule for regular samples; I just see minValidTime in head.go.

Contributor

const maxAheadTime = 10 * time.Minute
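The sanity check being asked for could look roughly like this; validateExemplarTs is an invented name, and the real scrape-path check may differ:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxAheadTime = 10 * time.Minute

var errTooFarInFuture = errors.New("exemplar timestamp too far in the future")

// validateExemplarTs rejects exemplar timestamps (in milliseconds, as
// Prometheus stores them) more than maxAheadTime ahead of the current
// time.
func validateExemplarTs(ts, now int64) error {
	if ts > now+maxAheadTime.Milliseconds() {
		return errTooFarInFuture
	}
	return nil
}

func main() {
	now := time.Now().UnixMilli()
	fmt.Println(validateExemplarTs(now, now) == nil)            // true
	fmt.Println(validateExemplarTs(now+11*60*1000, now) == nil) // false
}
```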

web/api/v1/api.go Outdated
web/api/v1/api.go Outdated
web/api/v1/api.go Outdated
tsdb/wal.go Outdated
web/api/v1/api.go Outdated
@cstyan
Member Author

cstyan commented Oct 7, 2020

Thanks for the reviews so far, @brian-brazil and @codesome. I hope to get to the rest of your comments and some of my own todo list this week still.

@roidelapluie
Member

I did try a variation of this PR, and I was surprised that for a query like 10e3 * histogram_quantile(foo) the exemplar value was not multiplied by 10e3. I don't know how easy that would be to implement, but when switching to "explore" with the default TNS dashboard it was confusing to me.

@brian-brazil
Contributor

I don't know how easy that would be however,

I doubt it's practical, given how complex PromQL can be.

@cstyan cstyan changed the title MVP in-memory exemplar storage + exemplars in WAL/remote write Add circular in-memory exemplars storage Dec 16, 2020

tsdb/exemplar.go Outdated
web/api/v1/api.go Outdated
web/api/v1/api.go Outdated
tsdb/exemplar.go
docs/querying/api.md Outdated

promql/engine.go Outdated
promql/parser/ast.go Outdated
scrape/scrape.go Outdated

scrape/scrape.go Outdated
scrape/scrape.go Outdated
tsdb/exemplar.go Outdated

lastTs = ce.exemplars[idx].se.scrapeTimestamp
// Prepend since this exemplar came before the last one we appended chronologically.
if e.Ts >= start && e.Ts <= end {
Contributor

What if the exemplar doesn't have a timestamp?

I think you still haven't fully worked out the right semantics here for exemplars and timestamps.

Member Author

Yes, exemplars having their own optional timestamp separate from when they were scraped makes this a bit weird. IMO checking only the scrape timestamp is the simplest way forward given the current use case is range queries that will most likely be used alongside a metrics range query, and whose results will be matched up with timestamped samples.

Contributor

That'd return false positives though, when the timestamp of the exemplar is before the relevant range, which is not desirable.

It's expected that the majority of exemplars will have explicit timestamps.

Member

@codesome codesome left a comment

Have not taken a deep look yet; the overall structure looks fine.

tsdb/exemplar.go Outdated
tsdb/exemplar.go Outdated
Comment on lines 185 to 193
ce.nextIndex++
if ce.nextIndex >= cap(ce.exemplars) {
ce.nextIndex = 0
}
Member

How about ce.nextIndex = (ce.nextIndex+1) % len(ce.exemplars)?
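A small check that the modulo form is equivalent to the branch form in the diff hunk above, assuming the slice is fully allocated so len equals cap (advance and size are illustrative names):

```go
package main

import "fmt"

const size = 5

// advance wraps a ring index using the modulo form suggested above.
func advance(idx int) int {
	return (idx + 1) % size
}

func main() {
	idx := 0
	for i := 0; i < 12; i++ {
		// Branch form, as in the original diff hunk:
		want := idx + 1
		if want >= size {
			want = 0
		}
		idx = advance(idx)
		if idx != want {
			panic("forms disagree")
		}
	}
	fmt.Println(idx) // 2
}
```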

tsdb/exemplar.go Outdated
@codesome
Member

@tomwilkie found some issues running this with the TNS demo, and given the last-minute rush, let's not force it into 2.25 :) Lots of testing still to do.

@roidelapluie ^

@codesome codesome changed the base branch from release-2.25 to master February 12, 2021 13:49
@codesome codesome mentioned this pull request Feb 12, 2021
scrape/helpers_test.go Outdated
storage/interface.go Outdated
storage/interface.go Outdated
@tomwilkie
Member

@tomwilkie found some issues running this with the TNS demo, and given the last moment rush, let's not force it into 2.25 :) lots of testing still to do

I'm happy that we've not been able to reproduce these now; it must have been my machine. A few nits and then we should be good to go.

Base automatically changed from master to main February 23, 2021 19:36
scrape/scrape.go Outdated
storage/interface.go Outdated
tsdb/head.go Outdated
tsdb/head.go Outdated