Close traps when testing concurrent search #275

mikemccand opened this issue Jun 20, 2024 · 10 comments (Open)
@mikemccand (Owner)

Spinoff from apache/lucene#13472 where we thought we saw massive gains from an improvement to how Lucene implements intra-query concurrency, but, at least some of the gains were due to luceneutil trappiness. Some ideas:

  • Rename Competitor's numThreads to numConcurrentQueries (or maybe numQueriesInFlight?) to make it clear this controls how many concurrent queries are sent to IndexSearcher
  • If searchConcurrency > 1, maybe we should not allow numThreads > 1? This is trappy because the reported QPS gain from running concurrently is divided by numThreads (see details in this comment)
  • Maybe we could add a true red-line (capacity) test to luceneutil? This would drive enough threads (concurrent queries or worker thread pool) to saturate CPU and/or IO. This is a tricky change though ... we should open a separate spinoff. E.g. the precise set of tasks matters, their mix and order matters, reproducible arrival (Poisson-like) times matter, etc.
  • Let's fix the default thread counts so that luceneutil runs enough concurrent things (queries or work units) that the tests run as efficiently as possible, but do not bump up against false contention on the box (i.e. do NOT saturate CPU or IO), which would mess up the results
  • Maybe luceneutil should not even report QPS but rather latency? It's confusing because it is an "effective QPS" (1/avg-of-median-query-wall-clock-time-after-discarding-warmup-and-outliers, see the sketch at the end of this comment), not an actual "red-line capacity".

Also, if searchConcurrency > 1, we oddly group the tasks so that all tasks within a single category are run at the same time (across all threads). This is weird: e.g. the Browse* tasks are single-threaded even if you ask for concurrency, so top will show only 100% CPU used if you are running with six threads ...
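
To make the "effective QPS" point above concrete, here is a minimal sketch (not luceneutil's actual code) of how such a number is derived from per-query wall-clock times; the median-then-average order and the assumption that warmup runs and outliers have already been discarded are inferred from the description above:

  import java.util.Arrays;

  public class EffectiveQps {
    // perQueryMsec[i] holds wall-clock times (msec) of the repeated runs of query i,
    // after warmup runs and outliers have already been discarded.
    static double effectiveQps(double[][] perQueryMsec) {
      double sumOfMedians = 0;
      for (double[] times : perQueryMsec) {
        double[] sorted = times.clone();
        Arrays.sort(sorted);
        sumOfMedians += sorted[sorted.length / 2];   // median wall-clock time for this query
      }
      double avgMedianMsec = sumOfMedians / perQueryMsec.length;
      return 1000.0 / avgMedianMsec;                 // inverse latency, not measured throughput
    }

    public static void main(String[] args) {
      // two queries in the category, three timed runs each
      double[][] times = { {10.1, 9.9, 10.4}, {20.3, 19.8, 20.0} };
      System.out.println(effectiveQps(times));       // ~66 "QPS" from a ~15 msec average median
    }
  }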

@mikemccand (Owner, Author)

It's the last boolean argument to the LocalTaskSource ctor which groups all tasks by category together (running them all sequentially within each thread) when concurrency is enabled: 87a8063
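
For reference, the call site in SearchPerfTest looks like this (it also appears in the diff further down this thread); the final boolean argument is the grouping flag, and today it is simply true whenever searchConcurrency != 0:

  tasks = new LocalTaskSource(indexState, tasksFile, taskParserFactory.getTaskParser(),
                              staticRandom, random, numTaskPerCat, taskRepeatCount,
                              doPKLookup, searchConcurrency != 0);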

@msokolov (Collaborator)

Grouping was added here: #56 because without it the QPS measurement per task was meaningless

@mikemccand (Owner, Author)

> Grouping was added here: #56 because without it the QPS measurement per task was meaningless

OH! Thanks for digging @msokolov ... hmm I think this is why we have the staticSeed: it would force the same subset of tasks to be selected for each run (if I'm understanding the original issue correctly).

QPS is such a tricky metric ...

@msokolov (Collaborator)

> OH! Thanks for digging @msokolov ... hmm I think this is why we have the staticSeed: it would force the same subset of tasks to be selected for each run (if I'm understanding the original issue correctly).

It's been a while, but I think the issue was that we want to measure wall clock time per task. This works OK when we run single-threaded, since the times in the log represent time spent working on that single task. However, when we run multiple tasks concurrently, the wall clock times represent work done on all the tasks we are running, which can be of mixed types. So this grouping change ensures that we run only tasks of the same type at the same time, enabling us to attribute the wall clock times in the logs to that task unambiguously.
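
A minimal sketch of the grouping idea being described (not the actual LocalTaskSource code): reorder the task list so that all tasks of one category run back-to-back, so the wall-clock times logged while that category is draining can be attributed to that category alone:

  import java.util.ArrayList;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.function.Function;

  public class GroupTasks {
    // Reorder a shuffled task list so that all tasks of one category run back-to-back.
    static <T> List<T> groupByCategory(List<T> tasks, Function<T, String> categoryOf) {
      Map<String, List<T>> byCat = new LinkedHashMap<>();
      for (T task : tasks) {
        byCat.computeIfAbsent(categoryOf.apply(task), k -> new ArrayList<>()).add(task);
      }
      List<T> grouped = new ArrayList<>();
      byCat.values().forEach(grouped::addAll);   // all HighTerm, then all HighPhrase, ...
      return grouped;
    }
  }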

@msokolov (Collaborator)

I guess we could switch from measuring wall clock time to measuring CPU time using Java's JMX API. If we did that we wouldn't need to group tasks this way, and would possibly get a more realistic task mix in order to simulate cross-task effects in a mixed query load.
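
A minimal sketch of that idea using the standard ThreadMXBean (the runTask() hook is a hypothetical stand-in for executing one benchmark task):

  import java.lang.management.ManagementFactory;
  import java.lang.management.ThreadMXBean;

  public class CpuTimedTask {
    public static void main(String[] args) {
      ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
      if (threadBean.isThreadCpuTimeSupported()) {
        threadBean.setThreadCpuTimeEnabled(true);
      }
      long startCpuNs = threadBean.getCurrentThreadCpuTime();
      long startWallNs = System.nanoTime();

      runTask();   // hypothetical stand-in for running one benchmark task

      long cpuNs = threadBean.getCurrentThreadCpuTime() - startCpuNs;
      long wallNs = System.nanoTime() - startWallNs;
      // CPU time is charged only to this thread's own work, so it stays meaningful
      // even when other tasks are running concurrently on other threads.
      System.out.printf("cpu=%.3f ms, wall=%.3f ms%n", cpuNs / 1e6, wallNs / 1e6);
    }

    static void runTask() { /* execute one query here */ }
  }

One wrinkle: getCurrentThreadCpuTime() only covers the calling thread, so with intra-query concurrency each worker thread in the IndexSearcher pool would need to report its own CPU time, and the results would have to be aggregated per task.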

@mikemccand (Owner, Author)

> I guess we could switch from measuring wall clock time to measuring CPU time using Java's JMX API. If we did that we wouldn't need to group tasks this way, and would possibly get a more realistic task mix in order to simulate cross-task effects in a mixed query load.

+1 to explore this! This would be better than elapsed wall clock time ... e.g. it would've prevented the whole confusion around lower QPS when multiple queries are running concurrently (intra- and inter-query concurrency) ...

@mikemccand (Owner, Author)

The thing that caused me to dig into this particular issue was watching the java process running SearchPerfTest with six worker threads spin at 100% CPU for quite a while (10 seconds maybe) before then jumping up to that (correct, saturated) 600% CPU ... when I pulled the thread stacks I could see that all threads at the start were working on the same single-threaded Browse facet counting tasks. So I was baffled why all threads were working on just that one task for all the initial tasks ...

@mikemccand (Owner, Author)

Hmm OK I tried reverting that change, so we no longer group by task category:

diff --git a/src/main/perf/SearchPerfTest.java b/src/main/perf/SearchPerfTest.java
index e77ceec..9983092 100755
--- a/src/main/perf/SearchPerfTest.java
+++ b/src/main/perf/SearchPerfTest.java
@@ -560,7 +560,7 @@ public class SearchPerfTest {
       final int taskRepeatCount = args.getInt("-taskRepeatCount");
       final int numTaskPerCat = args.getInt("-tasksPerCat");
       tasks = new LocalTaskSource(indexState, tasksFile, taskParserFactory.getTaskParser(), staticRandom, random,
-              numTaskPerCat, taskRepeatCount, doPKLookup, searchConcurrency != 0);
+              numTaskPerCat, taskRepeatCount, doPKLookup, false);
       System.out.println("Task repeat count " + taskRepeatCount);
       System.out.println("Tasks file " + tasksFile);
       System.out.println("Num task per cat " + numTaskPerCat);

Then I ran this perf.py (6 worker threads for intra-query concurrency, only one query in flight at once, picking 20 examples of each category (i.e. 20 HighTerm queries, each of which will have very different qps)):

  sourceData = competition.sourceData('wikimediumall')
  comp = competition.Competition(taskRepeatCount=20, taskCountPerCat=5, verifyCounts=False)
  comp.addNotTaskPattern('BrowseDateSSDVFacets')
  # comp.addTaskPattern('BrowseDayOfYearTaxoFacets')
  # comp.addTaskPattern('OrHighHigh')
  comp.addTaskPattern('HighTerm')
  comp.addTaskPattern('HighPhrase')

  checkout = 'trunk'

  index = comp.newIndex(checkout, sourceData, numThreads=12, addDVFields=True,
                        grouping=False, useCMS=True,
                        analyzer = 'StandardAnalyzerNoStopWords',
                        facets = (('taxonomy:Date', 'Date'),
                                  ('taxonomy:Month', 'Month'),
                                  ('taxonomy:DayOfYear', 'DayOfYear'),
                                  ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                                  ('sortedset:Month', 'Month'),
                                  ('sortedset:DayOfYear', 'DayOfYear'),
                                  ('sortedset:RandomLabel.sortedset', 'RandomLabel')))
  comp.competitor('base1', checkout, index=index, pk=False, searchConcurrency=6, numThreads=1)
  comp.competitor('base2', checkout, index=index, searchConcurrency=6, pk=False, numThreads=1)
  comp.benchmark('atoa')

And got these results after 10 iterations:

                            Task   QPS base1      StdDev   QPS base2      StdDev                Pct diff p-value
                      HighPhrase      131.04      (4.1%)      130.13      (3.9%)   -0.7% (  -8% -    7%) 0.697
            HighTermTitleBDVSort        6.57      (5.2%)        6.60      (3.7%)    0.5% (  -7% -    9%) 0.817
                        HighTerm      188.17      (6.2%)      190.00      (6.6%)    1.0% ( -11% -   14%) 0.735
               HighTermMonthSort      898.48      (5.4%)      908.32      (2.4%)    1.1% (  -6% -    9%) 0.556
           HighTermDayOfYearSort      242.70      (4.6%)      246.32      (5.2%)    1.5% (  -7% -   11%) 0.495
               HighTermTitleSort       43.92      (6.5%)       44.73      (9.4%)    1.8% ( -13% -   18%) 0.612

I ran it again and got different QPS for each category because luceneutil by default will pick a different subset of 20 tasks for each category, unless you pass a fixed staticSeed:

                            Task   QPS base1      StdDev   QPS base2      StdDev                Pct diff p-value
            HighTermTitleBDVSort        5.31      (9.4%)        5.10      (6.7%)   -4.0% ( -18% -   13%) 0.280
                        HighTerm      202.40      (4.5%)      194.93      (5.7%)   -3.7% ( -13% -    6%) 0.110
               HighTermTitleSort       35.86      (6.2%)       35.76      (6.1%)   -0.3% ( -11% -   12%) 0.917
           HighTermDayOfYearSort      251.88      (3.0%)      251.36      (3.6%)   -0.2% (  -6% -    6%) 0.889
               HighTermMonthSort      920.78      (2.4%)      919.35      (1.3%)   -0.2% (  -3% -    3%) 0.857
                      HighPhrase      166.13      (4.5%)      168.74      (2.9%)    1.6% (  -5% -    9%) 0.352

@mikemccand (Owner, Author)

> So this grouping change ensures that we run only tasks of the same type at the same time, enabling us to attribute the wall clock times in the logs to that task unambiguously.

Maybe this is also fixed by forcing numThreads=1 (number of in-flight queries at once) whenever searchConcurrency > 0? I.e. the issue was that the N worker threads were executing work units from different task categories?
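
If we went that way, the guard could be as simple as something like this (hypothetical sketch of a check in the benchmark driver, not the committed change):

  // Hypothetical sanity check: refuse to combine intra-query concurrency
  // (searchConcurrency) with multiple in-flight queries (numThreads), since
  // mixing them makes the reported "effective QPS" misleading.
  static void checkConcurrencySettings(int searchConcurrency, int numThreads) {
    if (searchConcurrency > 0 && numThreads > 1) {
      throw new IllegalArgumentException(
          "use either searchConcurrency > 0 (intra-query) or numThreads > 1 (inter-query), not both");
    }
  }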

mikemccand added a commit that referenced this issue Jun 23, 2024
…from 2 to max(2, <cpu-cores>/3); don't run with both intra- and inter-query concurrency
@mikemccand (Owner, Author)

OK I had wondered why the nightly benchmarks didn't show the (trappy) "effective QPS" slowdown when we enabled searchConcurrency=1 and it turns out ... it did! (It just had not successfully run since we enabled single threaded intra-query concurrent search): https://home.apache.org/~mikemccand/lucenebench/Term.html

I'll change the nightly bench to use higher intra-query concurrency, with only one query in flight at once, so it doesn't take forever to run. But the reported effective QPS will be wonky (too high).

I think we really need to switch to the true per-thread CPU time aggregation from the JMX bean. I'll open a spinoff for that ...
