
Support scrolling for elasticsearch #696

Merged: 6 commits merged into jaegertracing:master on Apr 12, 2018

Conversation

@sramakr (Contributor) commented Feb 13, 2018

Resolves #668

This commit adds support for search_after when fetching traces from Elasticsearch. The steps are:

  1. Do a MultiRead, sorted by startTime, starting with a startTime of 0.
  2. If TotalHits > 10K:
  3. Find the last span in the results.
  4. Use the startTime from that last span.
  5. Repeat from step 1.

(A simplified sketch of this loop is shown after this description.)

Signed-off-by: Sriram Ramakrishnan sramakr@gmail.com
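
For illustration, here is a minimal sketch of that loop, assuming the olivere/elastic Go client; the import path, the helper name fetchAllSpans, and the flat traceID query are placeholders, and the actual reader.go assembles per-trace requests with MultiSearch rather than a single Search.

```go
package spanstore

import (
	"context"

	"github.com/olivere/elastic"
)

// batchSize mirrors Elasticsearch's default index.max_result_window.
const batchSize = 10000

// fetchAllSpans pages through all spans of one trace using search_after.
func fetchAllSpans(ctx context.Context, client *elastic.Client, indices []string, traceID string) ([]*elastic.SearchHit, error) {
	var all []*elastic.SearchHit
	var searchAfter []interface{} // sort values of the last hit from the previous page
	for {
		svc := client.Search().
			Index(indices...).
			Query(elastic.NewTermQuery("traceID", traceID)).
			Size(batchSize).
			Sort("startTime", true) // ascending, so search_after can resume
		// Note: a production implementation would add a tiebreaker sort field
		// to avoid skipping spans that share the same startTime.
		if searchAfter != nil {
			svc = svc.SearchAfter(searchAfter...)
		}
		res, err := svc.Do(ctx)
		if err != nil {
			return nil, err
		}
		hits := res.Hits.Hits
		all = append(all, hits...)
		if len(hits) < batchSize {
			break // a short page means this was the last one
		}
		// resume after the startTime of the last span in this page
		searchAfter = hits[len(hits)-1].Sort
	}
	return all, nil
}
```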

@yurishkuro (Member) commented

what problem is this solving?

@sramakr (Contributor, Author) commented Feb 13, 2018

@yurishkuro Solving #668

@mabn commented Feb 13, 2018

A few remarks:

  • I've had performance problems with scrolling in ES 5.x in the past, and search_after performed much better. Use of the scroll API also affects indexing: Lucene segments are kept around longer while a scroll is open. (A sketch contrasting the two approaches follows after this comment.)

  • It adds an additional count query, which could be avoided by using search_after.

  • It reads all the spans for a given trace, which is a problem when a trace has 10 million spans; it would be better to get just the first 10k.

  • Right now search fetches whole traces (not great), but won't it now fetch them without any limit? So, after this change, search results might try to fetch JSON with millions of spans, even though I'm interested in a small trace that was also returned in the response. That would be an issue.

IMO it would be great if some kind of paging were possible on the trace view page, e.g. see 10k spans ordered by time and be able to go to the next/previous 10k, or something like that. Fetching everything will create new problems.
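
For contrast with search_after, below is a minimal sketch of the scroll-style fetch that the first remark argues against, again assuming the olivere/elastic client; scrollAllSpans and the flat traceID query are hypothetical.

```go
package spanstore

import (
	"context"
	"io"

	"github.com/olivere/elastic"
)

// scrollAllSpans pages through all spans of one trace with the scroll API.
// The scroll holds a server-side cursor (and the underlying Lucene segments)
// open between calls, which is the indexing impact mentioned above.
func scrollAllSpans(ctx context.Context, client *elastic.Client, indices []string, traceID string) ([]*elastic.SearchHit, error) {
	var all []*elastic.SearchHit
	scroll := client.Scroll(indices...).
		Query(elastic.NewTermQuery("traceID", traceID)).
		Size(10000)
	defer scroll.Clear(ctx) // release the scroll context when done
	for {
		res, err := scroll.Do(ctx)
		if err == io.EOF {
			break // cursor exhausted
		}
		if err != nil {
			return nil, err
		}
		all = append(all, res.Hits.Hits...)
	}
	return all, nil
}
```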

@sramakr (Contributor, Author) commented Feb 13, 2018

@mabn Thanks for the comments. I should have raised this concern earlier; you are totally correct. Instead of a 10K limit, would it help to increase the limit to something manageable and possibly use search_after? I am not familiar with the UI or how to do paging from it, so I'll have to dig in more to see how everything fits together.

@yurishkuro (Member) commented

Building a sensible DAG without knowing all the spans seems pretty hard. A trace with 1MM spans is an interesting theoretical use case, but I doubt any existing tracing system could deal with those today. On the other hand, traces with >10k spans are fairly common, so it would be great to solve this for ES (not a problem in Cassandra, btw).

As for super large traces, I still think the solution is to try to load the complete trace into memory, but to employ some form of compression. A lot of span data is repeated, like operation names, tags, etc. Also, in order to construct a DAG, most of this data is not needed, and the spans can be compressed to just id/parent + timing + name. Anyway, this is a problem for another day (after I retire and won't care).
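
As a rough illustration of that idea, a trimmed-down span record might look like the hypothetical struct below; the names are assumptions, not Jaeger's actual model.

```go
package spanstore

// compactSpan is a hypothetical minimal representation: just enough to build
// the trace DAG and a timeline, with tags, logs, and process info stripped.
type compactSpan struct {
	SpanID        uint64
	ParentSpanID  uint64 // zero for the root span
	OperationName string
	StartTime     uint64 // epoch microseconds
	Duration      uint64 // microseconds
}
```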

If we ignore the possibility of not loading the complete trace, is search_after a reasonable solution? From the help page it seems like it will retrieve just as much data as needed when used to retrieve spans for a given trace ID.

@mabn commented Feb 14, 2018

I'm only concerned about the search page: if such a 1MM-span trace is matched by the search criteria (and if it's still happening, most likely some of its spans will match), then I'd like to show something rather than explode on memory or wait 30s for the JSON to download. And it happens in practice :( due to bugs and poor design.

@sramakr (Contributor, Author) commented Feb 14, 2018

@mabn Even if fetching more than 10K spans works in Cassandra, won't it be a problem in Cassandra too? I think it really becomes a practical limitation on the number of spans we can fetch for the UI.

@mabn commented Feb 14, 2018

I haven't run Jaeger with Cassandra in a while, but it might be a "problem" there as well.
I opened #178, but at some point I noticed that traces are limited to 10k spans, and that kind of solved the issue.
Though that probably only happened because I started using ES.

So, might it make sense to fetch everything from ES and open a separate issue to limit it on the search results page?

@sramakr (Contributor, Author) commented Feb 14, 2018

Yes, I can modify this PR to use search_after, and we can open a separate issue to support pagination in the UI.

@sramakr force-pushed the support-scroll branch 10 times, most recently from ceca6a6 to 037fd07 on February 21, 2018 14:14
@coveralls commented Feb 21, 2018

Coverage Status

Coverage remained the same at 100.0% when pulling bd345cb on sramakr:support-scroll into 135bed0 on jaegertracing:master.

@sramakr force-pushed the support-scroll branch 4 times, most recently from 83aef4d to dcc0563 on February 22, 2018 19:02
Made search requests in reader.go simpler
Fixed unit tests

Signed-off-by: sramakr <sramakr@gmail.com>
@pavolloffay (Member) commented

@kevinearls, how hard would it be to run the perf tests against this PR?

I think we don't use jaeger-query in the perf tests; the tests read ES directly.

@kevinearls (Contributor) commented

@pavolloffay we don't use jaeger-query now but I've been working on this. I hope to have some information tomorrow.

@kevinearls (Contributor) commented

@pavolloffay @jpkrohling I wrote a test where I created 100,000 traces, and then retrieved them multiple times in chunks of 20,000. When run (just on my laptop with nothing else running) against the Jaeger master, the average time to retrieve 20,000 traces was 5727 milliseconds.

When run using this PR the average dropped to 3909 milliseconds.

This looks to be a decent performance improvement. I can run something additional if you think it's necessary.

@jpkrohling (Contributor) commented

Would it be possible to run the test a couple of times more, just to make sure the second run didn't perform better because of better temporal conditions? Ideally, this would run on a perf server, to avoid desktop-specific issues, but if you have consistent results after a few runs, I'm happy with it :)

@kevinearls (Contributor) commented

@jpkrohling I did run it multiple times and got fairly consistent results. Let me see if I can set up something to run on the CNCF CI Jenkins in the next couple of days.

@ledor473 (Member) commented Apr 5, 2018

Any update on this issue?

@sramakr (Contributor, Author) commented Apr 5, 2018

@ledor473 Are you referring to the issue of "Span graph is not showing anymore"? I didn't get a chance to look into it. Is that a UI issue?

@ledor473 (Member) commented Apr 5, 2018

I meant the overall state of that PR. Seems like @kevinearls conducted the load testing needed.

@sramakr (Contributor, Author) commented Apr 5, 2018

OK, cool @ledor473. I am still waiting for a 👍 from the reviewers.

@jpkrohling (Contributor) left a comment

I'm giving my +1, as the code looks sane to me, but I guess we'll know about the perf improvements only once this change is delivered.

@ghost ghost assigned pavolloffay Apr 6, 2018
@ghost ghost added the review label Apr 6, 2018
if err != nil {
return nil, err
}
var nextTime uint64
A reviewer (Contributor) commented

can this be initialized to startTime?

@sramakr (Contributor, Author) replied

I am not sure, because nextTime is derived from the results of the query on line 232.

@sramakr (Contributor, Author) added

Since nextTime is derived from the results of the previous 10K, I don't think we can initialize it.

A reviewer (Contributor) commented

Doesn't this mean that the first-ever query always queries from 0 (searchAfterTime isn't initialized), even though we're only interested in things after startTime? (I guess since we're looking at a subset of indices, this isn't too bad performance-wise.)

@sramakr (Contributor, Author) replied

@black-adder Thought about this a bit... I agree, we should initialize next-time to start-time; I don't see a problem there. I am going to modify this and update the PR.
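
A small sketch of the agreed change, using Jaeger's model helper and hypothetical surrounding names: the search_after cursor is seeded from the query's start time instead of defaulting to zero.

```go
package spanstore

import (
	"time"

	"github.com/jaegertracing/jaeger/model"
)

// initialSearchAfter is a hypothetical helper illustrating the fix: the first
// page starts from the requested start time (spans are indexed with startTime
// in epoch microseconds), rather than from 0.
func initialSearchAfter(startTime time.Time) uint64 {
	return model.TimeAsEpochMicroseconds(startTime)
}
```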

}
// set traceIDs to empty
traceIDs = nil
results, err := s.client.MultiSearch().Add(searchRequests...).Index(indices...).Do(s.ctx)
A reviewer (Contributor) commented

Is the 10000 limit for the entire query? I.e., if we search for 10 traces and the first trace takes up all 10000 spans, will this return a response for only the first trace?

@sramakr (Contributor, Author) replied

No, it's per trace. Obviously, there is the downside that if there are 2 traces, and the 1st trace gets only 100 spans while the next has 50K spans, the entire query waits to fetch all 50,100 spans before it returns.
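
To illustrate why the limit applies per trace, here is a hedged sketch of how the multi-search could be assembled, one SearchRequest per trace ID, each with its own size; the helper and variable names are assumptions modeled on the excerpt above.

```go
package spanstore

import (
	"context"

	"github.com/olivere/elastic"
)

const defaultDocCount = 10000 // per-request limit, i.e. per trace

// multiReadSketch builds one SearchRequest per trace ID, so the 10000-span cap
// applies to each trace independently rather than to the multi-search as a whole.
func multiReadSketch(ctx context.Context, client *elastic.Client, indices []string, traceIDs []string) (*elastic.MultiSearchResult, error) {
	var searchRequests []*elastic.SearchRequest
	for _, traceID := range traceIDs {
		source := elastic.NewSearchSource().
			Query(elastic.NewTermQuery("traceID", traceID)).
			Size(defaultDocCount).
			Sort("startTime", true)
		searchRequests = append(searchRequests, elastic.NewSearchRequest().Source(source))
	}
	return client.MultiSearch().Add(searchRequests...).Index(indices...).Do(ctx)
}
```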

@sramakr force-pushed the support-scroll branch 2 times, most recently from 240cdfa to 58c9ba3 on April 11, 2018 20:21
Initialize nextTime with startTime, so the query will start from startTime instead of 0.

Signed-off-by: sramak001 <sriram_ramakrishnan@cable.comcast.com>
@ghost ghost assigned black-adder Apr 12, 2018
@black-adder black-adder merged commit 59cc364 into jaegertracing:master Apr 12, 2018
@ghost ghost removed the review label Apr 12, 2018
@black-adder (Contributor) commented

Thanks!
