
Add command "tuning" (#508) #515

Closed · wants to merge 3 commits

Conversation

@chishui (Contributor) commented Apr 16, 2024

Description

When ingesting data into OpenSearch using the bulk API, different variables can result in different ingestion performance: for example, the number of documents per bulk request, how many OpenSearch clients are used to send requests, the batch size (a variable for batch ingestion, opensearch-project/OpenSearch#12457), etc. It's not easy for users to experiment with all combinations of these variables to find the option that leads to optimal ingestion performance.

This tool addresses the pain of tuning the variables that can impact ingestion performance and automatically finds their optimal combination. It utilizes OpenSearch-Benchmark: it runs benchmarks with different variable combinations, collects their outputs, and analyzes and visualizes the results.
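
For illustration, a minimal sketch of the sweep this command performs, assuming the run_benchmark helper shown in the diff below; the workload name and workload-parameter keys are placeholders, since the actual parameters depend on the workload being run:

import itertools

# Candidate values for the variables under test (illustrative values only).
bulk_sizes = [100, 200]
batch_sizes = [50, 100]
client_counts = [1]

results = {}
for bulk, batch, clients in itertools.product(bulk_sizes, batch_sizes, client_counts):
    # "--workload" and "--workload-params" are existing OSB CLI flags;
    # the workload name and parameter keys below are hypothetical.
    params = {
        "--workload": "my-workload",
        "--workload-params": f"bulk_size:{bulk},batch_size:{batch},clients:{clients}",
    }
    success, stderr = run_benchmark(params)
    results[(bulk, batch, clients)] = success

Each run's metrics would then be compared (e.g., by mean bulk throughput) to pick the optimal combination.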

Issues Resolved

#508

Testing

  • New functionality includes testing

The first version of this PR only demonstrates the idea; tests will be added later.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

@IanHoang (Collaborator) commented Apr 22, 2024

To help community members better understand the intent of this newly suggested subcommand, we recommend that contributors add an example output from this feature.

Comment on lines 68 to 91
import subprocess  # imported at the top of the module in the actual file

def run_benchmark(params):
    # Build the OSB invocation: each key is a CLI flag, each truthy value its argument.
    commands = ["opensearch-benchmark", "execute-test"]
    for k, v in params.items():
        commands.append(k)
        if v:
            commands.append(v)

    proc = None
    try:
        proc = subprocess.Popen(
            commands,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)

        # Wait for OSB to exit and capture its stderr for diagnostics.
        _, stderr = proc.communicate()
        return proc.returncode == 0, stderr.decode('ascii')
    except KeyboardInterrupt as e:
        if proc is not None:  # Popen may have failed before proc was assigned
            proc.terminate()
        print("Process is terminated!")
        raise e
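
For context, a hedged example of how this helper could be invoked; "--workload", "--pipeline", and "--test-mode" are existing OSB CLI options, while the workload name here is only a placeholder:

success, err = run_benchmark({
    "--workload": "geonames",        # placeholder workload name
    "--pipeline": "benchmark-only",  # target an already-running cluster
    "--test-mode": None,             # valueless flag; run_benchmark skips None values
})
if not success:
    print(err)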

Collaborator:
I understand the convenience of this, but at the same time I'm not sure it's best practice. It makes this feel more like a wrapper around OSB rather than a built-in OSB feature.

@chishui (Contributor, Author):
It's similar to the compare command, which also utilizes OSB and could potentially reside outside OSB, but which is strongly coupled with OSB since it always has to work together with it. I feel it would just bring unnecessary churn to users if this command lived in its own package, since users would have to install both to make it work.

Collaborator:

The compare subcommand is a lightweight operation that pulls from the test execution store (whether local or an external metrics datastore) and diffs the results. This, essentially, is still a recommendation tool that wraps OSB.

@chishui (Contributor, Author) commented Apr 24, 2024

> To help community members better understand the intent of this newly suggested subcommand, we recommend that contributors add an example output from this feature.

@IanHoang, here is example output from the console:

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] There will be 4 tests to run with 2 bulk sizes, 2 batch sizes, 1 client numbers.
[INFO] Running benchmark with: bulk size: 100, number of clients: 1, batch size: 50
[INFO] Running benchmark with: bulk size: 200, number of clients: 1, batch size: 50
[INFO] Running benchmark with: bulk size: 100, number of clients: 1, batch size: 100
[INFO] Running benchmark with: bulk size: 200, number of clients: 1, batch size: 100
[INFO] The optimal variable combination is: bulk size: 200, batch size: 50, number of clients: 1

---------------------------------
[INFO] SUCCESS (took 235 seconds)
---------------------------------

Also, I added support for writing the aggregated intermediate benchmark results to a file. If the file is in markdown format, the data in the file would look like this:

| Metric | Task | bulk size: 100, batch size: 50, number of clients: 1 | bulk size: 200, batch size: 50, number of clients: 1 | bulk size: 100, batch size: 100, number of clients: 1 | bulk size: 200, batch size: 100, number of clients: 1 | Unit |
| --- | --- | --- | --- | --- | --- | --- |
| Cumulative indexing time of primary shards | | 0.117533 | 0.11755 | 0.117567 | 0.1176 | min |
| Min cumulative indexing time across primary shards | | 0 | 0 | 0 | 0 | min |
| Median cumulative indexing time across primary shards | | 0 | 0.0 | 0 | 0.0 | min |
| Max cumulative indexing time across primary shards | | 0.00906667 | 0.009066666666666667 | 0.00906667 | 0.009066666666666667 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | 0.0 | 0 | 0.0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | 0 | min |
| Cumulative merge time of primary shards | | 0.00693333 | 0.00705 | 0.00705 | 0.00705 | min |
| Cumulative merge count of primary shards | | 39 | 40 | 40 | 40 | |
| Min cumulative merge time across primary shards | | 0 | 0 | 0 | 0 | min |
| Median cumulative merge time across primary shards | | 0 | 0.0 | 0 | 0.0 | min |
| Max cumulative merge time across primary shards | | 0.00408333 | 0.004083333333333333 | 0.00408333 | 0.004083333333333333 | min |
| Cumulative merge throttle time of primary shards | | 0 | 0 | 0 | 0 | min |
| Min cumulative merge throttle time across primary shards | | 0 | 0 | 0 | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | 0.0 | 0 | 0.0 | min |
| Max cumulative merge throttle time across primary shards | | 0 | 0 | 0 | 0 | min |
| Cumulative refresh time of primary shards | | 0.0706167 | 0.07096666666666666 | 0.0711833 | 0.07139999999999999 | min |
| Cumulative refresh count of primary shards | | 762 | 766 | 770 | 774 | |
| Min cumulative refresh time across primary shards | | 0 | 0 | 0 | 0 | min |
| Median cumulative refresh time across primary shards | | 0 | 0.0 | 0 | 0.0 | min |
| Max cumulative refresh time across primary shards | | 0.0252167 | 0.025216666666666665 | 0.0252167 | 0.025216666666666665 | min |
| Cumulative flush time of primary shards | | 0.00853333 | 0.008533333333333334 | 0.00853333 | 0.008533333333333334 | min |
| Cumulative flush count of primary shards | | 126 | 126 | 126 | 126 | |
| Min cumulative flush time across primary shards | | 0 | 0 | 0 | 0 | min |
| Median cumulative flush time across primary shards | | 0 | 0.0 | 0 | 0.0 | min |
| Max cumulative flush time across primary shards | | 0.001 | 0.001 | 0.001 | 0.001 | min |
| Total Young Gen GC time | | 0 | 0 | 0 | 0 | s |
| Total Young Gen GC count | | 0 | 0 | 0 | 0 | |
| Total Old Gen GC time | | 0 | 0 | 0 | 0 | s |
| Total Old Gen GC count | | 0 | 0 | 0 | 0 | |
| Store size | | 0.155355 | 0.15530386846512556 | 0.155333 | 0.15536146704107523 | GB |
| Translog size | | 1.37556e-05 | 1.5400350093841553e-05 | 1.70432e-05 | 1.8687918782234192e-05 | GB |
| Heap used for segments | | 0 | 0 | 0 | 0 | MB |
| Heap used for doc values | | 0 | 0 | 0 | 0 | MB |
| Heap used for terms | | 0 | 0 | 0 | 0 | MB |
| Heap used for norms | | 0 | 0 | 0 | 0 | MB |
| Heap used for points | | 0 | 0 | 0 | 0 | MB |
| Heap used for stored fields | | 0 | 0 | 0 | 0 | MB |
| Segment count | | 159 | 152 | 154 | 156 | |
| Min Throughput | bulk | 289.93 | 563.50 | 394.78 | 563.82 | docs/s |
| Mean Throughput | bulk | 289.93 | 563.50 | 394.78 | 563.82 | docs/s |
| Median Throughput | bulk | 289.93 | 563.50 | 394.78 | 563.82 | docs/s |
| Max Throughput | bulk | 289.93 | 563.50 | 394.78 | 563.82 | docs/s |
| 50th percentile latency | bulk | 108.905 | 117.17062500247266 | 104.839 | 105.04941700492054 | ms |
| 90th percentile latency | bulk | 225.736 | 302.013 | | | ms |
| 100th percentile latency | bulk | 408.461 | 416.53016599593684 | 396.162 | 419.2513330053771 | ms |
| 50th percentile service time | bulk | 108.905 | 117.17062500247266 | 104.839 | 105.04941700492054 | ms |
| 90th percentile service time | bulk | 225.736 | 302.013 | | | ms |
| 100th percentile service time | bulk | 408.461 | 416.53016599593684 | 396.162 | 419.2513330053771 | ms |
| error rate | bulk | 0 | 0.00 | 0 | 0.00 | % |
| Total time | | 53 | 53 | 53 | 53 | s |
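
For illustration, a minimal sketch of how per-run metrics could be aggregated into a markdown file like the one above; this is a hedged sketch, and the write_markdown name and data shapes are assumptions rather than the PR's actual implementation:

def write_markdown(metric_rows, runs, path):
    """metric_rows: list of (metric, task, unit) tuples;
    runs: dict mapping a run label (the variable combination) to {(metric, task): value}."""
    labels = list(runs)
    with open(path, "w") as f:
        # Header: one column per variable combination, plus Metric, Task, and Unit.
        f.write("| Metric | Task | " + " | ".join(labels) + " | Unit |\n")
        f.write("|" + " --- |" * (len(labels) + 3) + "\n")
        for metric, task, unit in metric_rows:
            # Leave a cell blank when a run did not report that metric.
            values = [str(runs[label].get((metric, task), "")) for label in labels]
            f.write(f"| {metric} | {task} | " + " | ".join(values) + f" | {unit} |\n")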

@chishui (Contributor, Author) commented May 6, 2024

@IanHoang any more comments on this PR?

@IanHoang (Collaborator) commented May 14, 2024

@chishui I provided some input on the issue here: #508 (comment)

Regarding this PR, I'm still concerned about how this subcommand invokes OSB's command-line execute-test subcommand. Although it simplifies the implementation, it doesn't seem to abide by design best practices and feels like pure scripting. Since it spawns OSB subprocesses, it also makes debugging and logging more difficult.

@IanHoang (Collaborator) left a review comment:

Left some comments yesterday regarding this.

@IanHoang (Collaborator) commented May 24, 2024

@chishui please feel free to reopen this. Closing for now due to inactivity. We can work together to find a suitable solution for your needs!

IanHoang closed this on May 24, 2024.