
Add circular in-memory exemplars storage #6635

Merged: 3 commits merged into prometheus:main on Mar 16, 2021

Conversation

@cstyan (Member) commented Jan 15, 2020

This PR implements basic in-memory storage of exemplars via a large circular buffer. There is a user-configurable overall limit on the total number of exemplars stored, but no per-series limit beyond that overall cap.
The public design doc is here.
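For readers skimming the thread, here is a rough, self-contained sketch of the structure described above (names and types are illustrative, not the PR's actual tsdb/exemplar.go code):

```go
package exemplars

// Exemplar is a simplified stand-in for Prometheus' exemplar type.
type Exemplar struct {
	Labels map[string]string // e.g. {"traceID": "EpTxMJ40fUus7aGY"}
	Value  float64
	Ts     int64 // milliseconds
}

// entry is one slot in the ring; prev threads together all exemplars of one series.
type entry struct {
	exemplar     Exemplar
	seriesLabels string // series identity, e.g. a serialised label set
	prev         int    // slot of the previous exemplar for this series, -1 if none
}

// CircularExemplars is a fixed-size ring shared by all series: the total number
// of stored exemplars is capped, but no single series has its own cap.
type CircularExemplars struct {
	buf   []entry
	index map[string]int // series -> slot of its newest exemplar
	next  int            // next slot to overwrite
}

func NewCircularExemplars(size int) *CircularExemplars {
	return &CircularExemplars{buf: make([]entry, size), index: map[string]int{}}
}

// Add overwrites the oldest slot and threads the new exemplar onto its series'
// chain. (The PR additionally rejects out-of-order timestamps per series with
// ErrOutOfOrderExemplar; that check is assumed here and omitted for brevity.)
func (c *CircularExemplars) Add(series string, e Exemplar) {
	prev := -1
	if i, ok := c.index[series]; ok {
		prev = i
	}
	c.buf[c.next] = entry{exemplar: e, seriesLabels: series, prev: prev}
	c.index[series] = c.next
	c.next = (c.next + 1) % len(c.buf)
}

// Select walks one series' chain newest-to-oldest, stopping when a slot has
// been recycled for another series or when it wraps back to a newer entry.
func (c *CircularExemplars) Select(series string) []Exemplar {
	var out []Exemplar
	idx, ok := c.index[series]
	if !ok {
		return out
	}
	lastTs := c.buf[idx].exemplar.Ts
	for idx != -1 && c.buf[idx].seriesLabels == series {
		e := c.buf[idx].exemplar
		if e.Ts > lastTs { // wrapped around the ring
			break
		}
		out = append(out, e)
		lastTs = e.Ts
		idx = c.buf[idx].prev
	}
	return out
}
```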

Signed-off-by: Callum Styan callumstyan@gmail.com

Resolved review threads (outdated): tsdb/exemplar.go, tsdb/querier.go
@shreyassrivatsan (Contributor) commented Feb 21, 2020

The one concern I have with a separate ExemplarAppender interface is that it assumes exemplars are stored outside of the actual time series. Given that exemplars are emitted alongside a value in a time series, I think it's beneficial to make the decision of how to persist them a concern of the storage.

By augmenting the Appender interface with something like #6844 (minimally intrusive), the storage gets to decide how it wants to deal with exemplars by implementing the Appender accordingly. A storage can ignore the exemplar, store both the exemplar and the value, or store just the exemplars, as the in-memory storage in this PR does.

@@ -224,6 +225,9 @@ func main() {
a.Flag("storage.remote.read-max-bytes-in-frame", "Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note that client might have limit on frame size as well. 1MB as recommended by protobuf by default.").
Default("1048576").IntVar(&cfg.web.RemoteReadBytesInFrame)

a.Flag("storage.exemplars.exemplars-per-series-limit", "Maximum number of exemplars to store in in-memory exemplar storage per series.").
Default("10").IntVar(&cfg.ExemplarsLimit)
Member Author:

If we assume about 20% of series have exemplars (and have filled all 10 of their exemplar slots), then this would result in roughly a 10% memory increase, based on the math in the design doc.
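To make that arithmetic concrete with purely hypothetical sizes (illustrative placeholders, not numbers from the design doc): if a head series costs roughly 3 KB and a stored exemplar roughly 150 B, then on average

$$0.2 \times 10 \times 150\,\mathrm{B} = 300\,\mathrm{B} \approx 10\%\ \text{of}\ 3\,\mathrm{KB\ per\ series.}$$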

@cstyan force-pushed the exemplar-storage-separate-mvp branch from 955d655 to 0fceff3 on April 14, 2020 00:35
The stale bot added the stale label on Jun 13, 2020
@cstyan force-pushed the exemplar-storage-separate-mvp branch from 0fceff3 to fd3236a on August 12, 2020 22:45
The stale bot removed the stale label on Aug 12, 2020
@cstyan force-pushed the exemplar-storage-separate-mvp branch from fd3236a to 5f012d7 on August 13, 2020 00:12
@cstyan (Member Author) commented Aug 13, 2020

Since I'll be picking up this work again soon, I've rebased off master, fixed the resulting issues, and squashed the commits down to one. The failing test is TestEndpoints, for remote-storage exemplar querying.

@bwplotka I'm struggling to understand how to use the new Queriers. The exemplar API needs the SeriesSets that exist for the query (the API accepts PromQL), and I think the panic I'm getting is related to remote read now being streaming. Any suggestions?

--- FAIL: TestEndpoints/remote (0.15s)
        --- FAIL: TestEndpoints/remote/run_47_queryExemplars_"query=test_metric3%7Bcluster%3D%22abc%22%7D+-+test_metric4%7Bcluster%3D%22abc%22%7D" (0.00s)
            --- FAIL: TestEndpoints/remote/run_47_queryExemplars_"query=test_metric3%7Bcluster%3D%22abc%22%7D+-+test_metric4%7Bcluster%3D%22abc%22%7D"/GET (0.00s)
panic: secondaryQuerier: Select invoked after first Next of any returned SeriesSet was done [recovered]
	panic: secondaryQuerier: Select invoked after first Next of any returned SeriesSet was done

Resolved review threads (outdated): web/api/v1/api.go, tsdb/exemplar.go
@bwplotka (Member):

The error you're getting is really about concurrent selects: you cannot issue further Selects once you've called Next on any SeriesSet returned by the same querier.

Just schedule all the Selects first, then iterate the returned sets; that is what allows the selects to run concurrently.
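A minimal sketch of that pattern, using simplified stand-in interfaces rather than the real storage.Querier/SeriesSet signatures:

```go
package selectdemo

// SeriesSet and Querier are simplified stand-ins for the storage interfaces;
// the real Querier also takes hints and label matchers.
type SeriesSet interface {
	Next() bool
	At() string // the real SeriesSet returns a Series; a string is enough here
}

type Querier interface {
	Select(matchers ...string) SeriesSet
}

// collect schedules every Select up front and only then starts calling Next.
// Interleaving them (Select, drain with Next, Select again on the same
// querier) is what triggers the secondaryQuerier panic above.
func collect(q Querier, groups [][]string) [][]string {
	sets := make([]SeriesSet, 0, len(groups))
	for _, matchers := range groups {
		sets = append(sets, q.Select(matchers...)) // schedule all Selects first
	}
	out := make([][]string, 0, len(sets))
	for _, ss := range sets { // only now iterate the returned sets
		var names []string
		for ss.Next() {
			names = append(names, ss.At())
		}
		out = append(out, names)
	}
	return out
}
```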

@cstyan (Member Author) commented Aug 14, 2020

@bwplotka Hmm, okay, so this happened because I was calling the exemplar querier's Select nested inside the other querier's Select, and with remote read the Select call can now be streaming? Is that where the requirement

you cannot issue further Selects once you've called Next on any SeriesSet returned by the same querier

comes from?

@cstyan force-pushed the exemplar-storage-separate-mvp branch from 7a37935 to d40a0f6 on September 9, 2020 17:25
@cstyan force-pushed the exemplar-storage-separate-mvp branch from 81009e4 to 6ed2466 on September 18, 2020 00:41
@cstyan changed the title from "WIP: MVP exemplar storage separate from Appender interface." to "MVP in-memory exemplar storage + exemplars in WAL/remote write" on Sep 24, 2020
@cstyan force-pushed the exemplar-storage-separate-mvp branch from 6ed2466 to ba76dfa on September 24, 2020 21:59
@cstyan marked this pull request as ready for review on September 24, 2020 22:17
- `start=<rfc3339 | unix_timestamp>`: Start timestamp.
- `end=<rfc3339 | unix_timestamp>`: End timestamp.

Exemplar storage is not part of Prometheus' TSDB, so this endpoint doesn't allow filtering based on time.
Member:

Then you probably want to remove the start and end params?

Member Author:

I should update that doc comment instead. I added the start and end params a few days later, after realizing that exemplar queries will most likely be executed alongside a query_range query for a Grafana dashboard. If additional use cases come up for other query endpoint options/implementations we can discuss those, but I've been building the current endpoint implementation on the assumption that the dashboard panel -> exemplars query is the most likely path.

Member:

This is still unaddressed, then.

@codesome self-assigned this on Sep 29, 2020
@guettli commented Oct 1, 2020

@cstyan I looked at the design doc: https://docs.google.com/document/d/1ymZlc9yuTj8GvZyKz1r3KDRrhaOjZ1W1qZVW_5Gj7gA/edit

But I have no clue, sorry. I am just a user.

I would really love this feature, though: jumping from a Grafana graph to traces would be magic.

Thank you for working on this!

@codesome (Member) left a comment:

I have only briefly glanced over the TSDB changes and I still have to take a detailed look at the WAL implementation, but we need some changes there.

  1. tsdb/wal.go is deprecated IIUC; you should be making the changes in tsdb/wal/wal.go (and in everything else in the wal package). I'm not aware of how the old WAL repair is set up, so that might require some changes there too.
  2. I don't feel comfortable with the storage (i.e. the circular queue) being outside the TSDB while we are also writing into the WAL. Should we have the exemplar storage in the TSDB too? (Maybe in the Head; you have to touch it for the WAL anyway.)
  3. Are we restoring the exemplars from the WAL into the circular queue on startup? (I could not find it; I might have missed it.)

tsdb/exemplar.go Outdated
lastTs := ce.exemplars[idx].exemplar.Ts

for {
if idx == -1 || string(ce.exemplars[idx].seriesLabels.Bytes(buf)) != string(l.Bytes(buf)) {
Member:

If idx could be -1, wouldn't that already panic above at lastTs := ce.exemplars[idx].exemplar.Ts? (Update: I see now: if the earlier if was ok, then it won't be -1 at the start. Can you add a comment for that?)

Plus, are you sure using the same buf on both sides of != is not going to cause any problems?

Member Author:

I will add some comments to clarify the checks here; they're not pretty, but the circular buffer implementation can still be cleaned up.

I think using the same buf should be fine (unless my understanding of how bytes.Buffer uses the underlying slice is wrong): we create a new bytes.Buffer from buf in Labels(), and the returned Buffer.Bytes is only valid until the next Write/Read/etc. call on that same bytes.Buffer, but the right-hand side of != creates its own new Buffer. I can always change this to l.String() if necessary.
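For anyone following along, the general hazard with helpers that return a slice of a caller-provided scratch buffer is holding on to two such results at once: the second call overwrites the first. Converting to a string copies the bytes and sidesteps that. A standalone illustration (generic Go, not the labels API):

```go
package main

import "fmt"

// encode writes s into buf and returns a slice of buf's backing array,
// mimicking helpers that reuse a caller-provided scratch buffer.
func encode(s string, buf []byte) []byte {
	return append(buf[:0], s...)
}

func main() {
	buf := make([]byte, 0, 64)

	// Hazard: keeping both []byte results; the second call overwrites the first.
	a := encode("foo", buf)
	b := encode("bar", buf)
	fmt.Println(string(a) == string(b)) // true: both now view the same bytes ("bar")

	// Safe: convert to string (a copy) before the buffer is reused.
	x := string(encode("foo", buf))
	y := string(encode("bar", buf))
	fmt.Println(x == y) // false
}
```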

Member Author:

is this clearer now?

tsdb/exemplar.go Outdated
lastTs = ce.exemplars[idx].exemplar.Ts
// Prepend since this exemplar came before the last one we appended chronologically.
if e.Ts >= hints.Start && e.Ts <= hints.End {
ret = append([]exemplar.Exemplar{e}, ret...)
Member:

Might be a micro-optimisation, but we could preallocate the slice here; otherwise there is some unnecessary allocation when there are many exemplars.

Member Author:

Any suggestions for a size to preallocate? There could theoretically be as many exemplars as the max size of the overall circular buffer, if only one series has exemplars.
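One way to sidestep both the prepend cost and the sizing question is to append while walking newest-to-oldest and reverse once at the end. A generic sketch, reusing the Exemplar stand-in from the earlier sketch:

```go
package exemplars

// filterRange keeps exemplars whose timestamp falls within [start, end].
// newestFirst is visited in reverse-chronological order (as the circular
// buffer chain is walked), so we append and reverse once at the end instead
// of prepending, which reallocates and copies on every insert.
func filterRange(newestFirst []Exemplar, start, end int64) []Exemplar {
	out := make([]Exemplar, 0, len(newestFirst)) // preallocate: worst case keeps everything
	for _, e := range newestFirst {
		if e.Ts >= start && e.Ts <= end {
			out = append(out, e)
		}
	}
	// Reverse into chronological order.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}
```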

Resolved review thread (outdated): tsdb/exemplar.go
tsdb/exemplar.go Outdated
return storage.ErrOutOfOrderExemplar
}
ce.exemplars[ce.nextIndex] = circularBufferEntry{exemplar: e, seriesLabels: l, prev: idx}
ce.index[l.String()] = ce.nextIndex
Member:

If we're generally going to have >1 exemplar per series, would string interning help with memory here?

Member Author:

Probably, I'll look at using the interner.

Contributor:

The head symbol table is basically that already; it'll have all the sample labels.
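For reference, a minimal interner of the kind being discussed (a generic sketch; reference counting and eviction are omitted, and this is not Prometheus' actual interner or the head symbol table):

```go
package interning

import "sync"

// Interner returns a canonical copy of each distinct string, so many stored
// entries keyed by the same series share one backing allocation.
type Interner struct {
	mtx  sync.Mutex
	pool map[string]string
}

func NewInterner() *Interner {
	return &Interner{pool: map[string]string{}}
}

// Intern returns the canonical instance of s, remembering it on first use.
func (in *Interner) Intern(s string) string {
	in.mtx.Lock()
	defer in.mtx.Unlock()
	if canonical, ok := in.pool[s]; ok {
		return canonical
	}
	in.pool[s] = s
	return s
}
```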

Resolved review threads (outdated): tsdb/exemplar.go
@brian-brazil (Contributor) left a comment:

Just a quick glance over.

Resolved review threads: cmd/prometheus/main.go (outdated), docs/querying/api.md
},
"value": 6,
"timestamp": 1600096945479,
"hasTimestamp": true
Contributor:

Why do we need this field? If there's no timestamp, return no timestamp.

Member Author:

Right now I'm just returning JSON-encoded data from the exemplar.Exemplar struct. I don't actually understand why the field exists on that struct in the first place; I'm not sure if it's part of the OpenMetrics spec or if @shreyassrivatsan had another idea there. Maybe to indicate whether the timestamp was exported by the source, instead of being the scrape timestamp?

I can remove the field from the output for the API though.

Contributor:

The exemplar timestamp is optional in the exposition. It is a different thing from the scrape/sample timestamp.

Member Author:

Thanks for clarifying about the exposition.

Using the scrape/sample timestamp when an exported timestamp was not found allows users to query for exemplars over the same time range as their existing PromQL query. Both timestamps could be stored as separate fields, though.

Contributor:

Would you send on a duplicate for every scrape, or just once?

Member Author:

Like you mentioned in the other thread regarding the duplicate error (or non-error), it's likely that the exported exemplar for a series won't change from one scrape to the next, so on each subsequent scrape we'd skip storing the exemplar again until we see a new one.

Member Author:

This would likely be part of the same JSON marshalling discussed in the previous comment thread?

Contributor:

Yes, that'd all be handled together.

"labels": {
"traceID": "EpTxMJ40fUus7aGY"
},
"value": 6,
Contributor:

Like all float values in our APIs, this should be a string; +Inf is in theory valid.

Member Author:

I'll look at how other endpoints are marshalling data then.

Member Author:

Any suggestion for where to make this change? Looks like we just marshal the promql Result type:

// Result holds the resulting value of an execution or an error
// if any occurred.
type Result struct {
	Err      error
	Value    parser.Value
	Warnings storage.Warnings
}

where Value is:

// Value is a generic interface for values resulting from a query evaluation.
type Value interface {
	Type() ValueType
	String() string
}

and types are:

// The valid value types.
const (
	ValueTypeNone   ValueType = "none"
	ValueTypeVector ValueType = "vector"
	ValueTypeScalar ValueType = "scalar"
	ValueTypeMatrix ValueType = "matrix"
	ValueTypeString ValueType = "string"
)

The exemplar API endpoint can always return a matrix of exemplars.

Would you suggest something like a new ExemplarMatrix type such as:

// Exemplar represents a single data point for a given timestamp.
type Exemplar struct {
	Labels labels.Labels `json:"labels"`
	T      int64
	V      float64
}

// ExemplarSeries is a stream of exemplars belonging to a metric series.
type ExemplarSeries struct {
	Metric    labels.Labels `json:"metric"`
	Exemplars []Exemplar    `json:"exemplars"`
}

// ExemplarMatrix is a slice of ExemplarSeries that implements sort.Interface and
// has a String method.
type ExemplarMatrix []ExemplarSeries

Contributor:

func marshalPointJSON(ptr unsafe.Pointer, stream *jsoniter.Stream) is the main thing to work off, but that's partially a workaround for our structs not matching the JSON output.
To do it right this time, I'd suggest actually making it a string and not tying our internal representations to the external representations.

Member Author:

I'm not sure if you're suggesting modifying/creating a new struct type, or registering something like jsoniter.RegisterTypeEncoderFunc("exemplar.Exemplar", marshalExemplarJSON, marshalExemplarJSONIsEmpty) that uses something like Sprintf("%f", exemplar.Value).

Contributor:

I'm suggesting modifying the type; RegisterTypeEncoderFunc is a hack for performance reasons and should be avoided.
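A sketch of the kind of API-only type being suggested, with the value rendered as a string so +Inf, -Inf and NaN survive JSON encoding (field names and the seconds-based timestamp are assumptions for illustration, separate from the internal exemplar.Exemplar):

```go
package apiexemplar

import (
	"encoding/json"
	"strconv"
)

// Exemplar is the external (API) representation only; the internal storage
// type keeps the value as a float64.
type Exemplar struct {
	Labels    map[string]string `json:"labels"`
	Value     string            `json:"value"`     // e.g. "6", "+Inf"
	Timestamp float64           `json:"timestamp"` // seconds, assumed to match other API timestamps
}

// fromInternal converts an internal exemplar (float value, millisecond ts)
// into the string-valued API form.
func fromInternal(lbls map[string]string, value float64, tsMillis int64) Exemplar {
	return Exemplar{
		Labels:    lbls,
		Value:     strconv.FormatFloat(value, 'f', -1, 64), // "+Inf" instead of a JSON encoding error
		Timestamp: float64(tsMillis) / 1000,
	}
}

func renderJSON(e Exemplar) ([]byte, error) {
	return json.Marshal(e)
}
```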

Member:

What is the status on this?

Member:

Doc needs to be updated

Resolved review thread (outdated): prompb/types.proto
scrape/scrape.go Outdated
}
// todo: This smells funny.
if hasExemplar := p.Exemplar(&e); hasExemplar {
if err := exemplarApp.AddExemplar(ce.lset, t, e); err != nil {
Contributor:

Should we have sanity checks on the exemplar timestamp, like we do for scraping generally?

Member Author:

Yes, I will look at how we're doing that. My assumption was that the majority of the time an exemplar would have the same timestamp as a scraped sample, so I didn't put time into validating the timestamp here yet.

Contributor:

It'd be very rare for it to have the same timestamp; usually it'll be in the past (if there is a timestamp), and it may even appear a bit in the future.

Member Author:

I think this can be marked as resolved now with the checks in exemplar.go addExemplar and the ErrOutOfOrderExemplar. Let me know if you disagree.

Contributor:

No, it's not right. The check should be that it's not more than 10m in the future.

Member Author:

I'm not seeing where we enforce that same not-more-than-10m-in-the-future rule for regular samples; I just see minValidTime in head.go.

Contributor:

const maxAheadTime = 10 * time.Minute
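For context, here is roughly what such a bound looks like when applied to exemplar timestamps (an illustrative sketch; the names are not from the PR):

```go
package validate

import (
	"errors"
	"time"
)

const maxAheadTime = 10 * time.Minute

var errTooFarInFuture = errors.New("exemplar timestamp too far in the future")

// checkExemplarTimestamp rejects exemplars whose timestamp (ms) is more than
// maxAheadTime ahead of the scrape time (ms), mirroring the bound used for
// regular samples.
func checkExemplarTimestamp(exemplarTsMs, scrapeTsMs int64) error {
	if exemplarTsMs > scrapeTsMs+maxAheadTime.Milliseconds() {
		return errTooFarInFuture
	}
	return nil
}
```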

Resolved review threads (outdated): web/api/v1/api.go, tsdb/wal.go
@cstyan (Member Author) commented Oct 7, 2020

Thanks for the reviews so far, @brian-brazil and @codesome. I hope to get to the rest of your comments, and some of my own todo list, this week.

Resolved review thread (outdated): storage/interface.go
@tomwilkie (Member):

> @tomwilkie found some issues running this with the TNS demo, and given the last-moment rush, let's not force it into 2.25 :) lots of testing still to do

I'm happy that we've not been able to reproduce these now; it must have been my machine. A few nits and then we should be good to go.

Base automatically changed from master to main on February 23, 2021 19:36
@cstyan force-pushed the exemplar-storage-separate-mvp branch from e715f3e to 98cafce on February 24, 2021 08:54
Resolved review threads (outdated): scrape/scrape.go, storage/interface.go, tsdb/head.go, tsdb/wal/checkpoint.go
@tomwilkie (Member):

Super close now. Dug into the implementation of AppendExemplar and added a few comments.

Resolved review thread (outdated): tsdb/head.go
// labels: <labels>,
// value: "<string>",
// timestamp: <float>
// }
Contributor:

Can I ask why the hasTs field is not included here? It would be useful for Thanos's proxy implementation.

Contributor:

The Thanos proxy gets exemplars from Prometheus by calling the /query_exemplars API from the sidecar. If the hasTs field is not included in the API, is there a way to tell whether an exemplar has a timestamp or not?

IIUC, if the exemplar doesn't have a ts, the timestamp is 0? @cstyan

Member:

HasTs is an internal implementation detail. From the API you will always get a timestamp.

Member Author:

Yeah, as Julien said, HasTs is an internal implementation detail. An exemplar can optionally have a timestamp when it is exported, the same as a regular sample. If the exemplar didn't have its own timestamp set, HasTs will be false and the internal storage will assign the scrape timestamp as the exemplar's timestamp. Any query to Prometheus for exemplars will always return exemplars with timestamps.
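A sketch of that fallback (field names as discussed in the thread; the helper itself is illustrative):

```go
package scrapefallback

// Exemplar mirrors the fields discussed above (value and labels omitted).
type Exemplar struct {
	Ts    int64 // the exemplar's own timestamp (ms), if the exposition provided one
	HasTs bool
}

// effectiveTimestamp returns the timestamp an exemplar is stored and queried
// under: its own if it had one, otherwise the scrape timestamp.
func effectiveTimestamp(e Exemplar, scrapeTsMs int64) int64 {
	if e.HasTs {
		return e.Ts
	}
	return scrapeTsMs
}
```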

Contributor:

Thanks, it is clear now.

@cstyan force-pushed the exemplar-storage-separate-mvp branch from c89e943 to 4e3f37f on March 8, 2021 17:47
@cstyan (Member Author) commented Mar 8, 2021

Thanks to @mdisibio @codesome and @tomwilkie for keeping this going while I was out last week.

I've pushed two commits to simplify AppendExemplar. As others have pointed out, we'd already removed some series-label checks from AppendExemplar since it is always called after a call to Append; I've now removed the rest, and also added a call to strip any empty labels from the exemplar's labels.

@cstyan force-pushed the exemplar-storage-separate-mvp branch 3 times, most recently from fe89ab5 to 525874d, on March 10, 2021 17:46
cstyan and others added 3 commits March 12, 2021 14:53
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Martin Disibio <mdisibio@gmail.com>

Co-authored-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com>
Co-authored-by: Martin Disibio <mdisibio@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan force-pushed the exemplar-storage-separate-mvp branch from bb001fa to 4f9b42f on March 12, 2021 22:53
@cstyan (Member Author) commented Mar 15, 2021

@tomwilkie @codesome @roidelapluie I feel like we're good to merge here. @mdisibio and I cleaned things up based on Tom's last review, and I've made a small fix to the way the query results array is instantiated, so that an empty query result returns [] rather than null, as Julien noted on the second PR.
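The []-versus-null behaviour comes down to how encoding/json treats nil versus empty slices:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var nilResults []string    // nil slice: not yet instantiated
	emptyResults := []string{} // empty, non-nil slice

	a, _ := json.Marshal(nilResults)
	b, _ := json.Marshal(emptyResults)
	fmt.Println(string(a)) // null
	fmt.Println(string(b)) // []
}
```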

@codesome (Member) left a comment:

Thanks for all your work!

@codesome merged commit 289ba11 into prometheus:main on Mar 16, 2021 (10 checks passed)
@roidelapluie (Member):

💯 thanks


10 participants