Node collector split #162

adityaj1107 · 2020-07-30T22:14:46Z

Issue #, if available: #160

Description of changes: This PR splits the node stats collector in two: one collector for metrics that are required for every shard on the node, and another for metrics that can be collected for a limited number of shards per iteration.
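To make the split concrete, here is a minimal sketch of the two collection paths. It is not the code in this PR: `CollectorSplitSketch`, `collectAllShards`, and `collectFixedShards` are hypothetical names, and plain strings stand in for shard objects. The only behaviour it aims to mirror is that the all-shards path ignores `shardsPerCollection`, while the fixed-shards path visits at most that many shards per iteration and resumes where it left off.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration only; not the actual Performance Analyzer collectors. */
final class CollectorSplitSketch {
    private final List<String> shards;   // stand-in for the shards on this node
    private Iterator<String> cursor;     // remembers where the last batch stopped

    CollectorSplitSketch(List<String> shards) {
        this.shards = new ArrayList<>(shards);
        this.cursor = this.shards.iterator();
    }

    /** "All shards" path: every shard, every iteration, regardless of shardsPerCollection. */
    List<String> collectAllShards() {
        return new ArrayList<>(shards);
    }

    /** "Fixed shards" path: at most shardsPerCollection shards per iteration. */
    List<String> collectFixedShards(int shardsPerCollection) {
        List<String> batch = new ArrayList<>();
        // Cap the batch at the shard count so one batch never repeats a shard.
        int limit = Math.min(shardsPerCollection, shards.size());
        for (int i = 0; i < limit; i++) {
            if (!cursor.hasNext()) {
                cursor = shards.iterator(); // wrap around to the first shard
            }
            batch.add(cursor.next());
        }
        return batch;
    }
}
```

With `shardsPerCollection` set to zero, `collectFixedShards` returns an empty batch while `collectAllShards` still returns every shard, which is the situation exercised in the testing notes.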

Testing: Built the JAR from this patch and deployed it on an AES cluster. The cache-related metrics, which should be collected for all shards irrespective of the shardsPerCollection value, are being collected.

Tested with this parameter (shardsPerCollection) set to zero. The metrics reported in the metrics DB file are:

{"Cache_Query_Hit":0,"Cache_Query_Miss":0,"Cache_Query_Size":0,"Cache_FieldData_Eviction":0,"Cache_FieldData_Size":0,"Cache_Request_Hit":0,"Cache_Request_Miss":0,"Cache_Request_Eviction":0,"Cache_Request_Size":0}

Exactly the ones added in the split.

The entries for the cache metrics (Cache_Request_Size) are as follows:

geonames|661|0.0|0.0|0.0|0.0
geonames|662|0.0|0.0|0.0|0.0
geonames|663|0.0|0.0|0.0|0.0
geonames|664|0.0|0.0|0.0|0.0

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

...mazon/opendistro/elasticsearch/performanceanalyzer/collectors/NodeStatsMetricsCollector.java

...ndistro/elasticsearch/performanceanalyzer/collectors/NodeStatsAllShardsMetricsCollector.java

...istro/elasticsearch/performanceanalyzer/collectors/NodeStatsFixedShardsMetricsCollector.java

src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/util/Utils.java

vigyasharma · 2020-08-03T22:39:26Z

...ndistro/elasticsearch/performanceanalyzer/collectors/NodeStatsAllShardsMetricsCollector.java

+                                CommonStatsFlags.Flag.RequestCache));
+                for (ShardStats shardStats : currentIndexShardStats.getShards()) {
+                    StringBuilder value = new StringBuilder();
+


nit: remove empty line

vigyasharma · 2020-08-03T22:48:58Z

src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/util/Utils.java

+    }
+
+    public static String getUniqueShardIdKey(ShardId shardId) {
+        return "[" + shardId.hashCode() + "][" + shardId.getId() + "]";


I think this will be the same for both primary and replica shards. You will have to add that info here if you want the keys to be unique.

ShardId's hashCode is generated from its UUID. Is that sufficient to differentiate between primary and replica? @vigyasharma

No, the hashCode just uses index.hashCode and the int shardId. In general, ES maintains no difference between primary and replica shards. We generally need ShardRouting.primary() when the two have to be told apart. We first need to figure out whether we need to differentiate between primary and replica here. That depends on the target consumers of this util function.

The node stats metrics don't have a ShardRole dimension. I am not sure if that was intentional. They only have ShardID and IndexName as dimensions. Currently we only emit these metrics for the shards present on a node. Since no node holds both the primary and a replica of the same shard, there is no conflict, and the metric is emitted for both the primary and the replica.
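For readers following the thread, a small sketch of the collision being discussed. It assumes, as stated above, that the key is derived only from the index and the shard number. `ShardKeySketch`, `keyWithoutRole`, and `keyWithRole` are hypothetical helpers, and a plain string stands in for the index identity, so nothing here is the actual `Utils.getUniqueShardIdKey` or the ES `ShardId` class.

```java
/** Hypothetical illustration of the key discussion; not the code in Utils.java. */
final class ShardKeySketch {

    /** Same shape as the key in the diff: a hash of the index plus the shard number. */
    static String keyWithoutRole(String indexUuid, int shardNumber) {
        return "[" + indexUuid.hashCode() + "][" + shardNumber + "]";
    }

    /**
     * Variant that appends the shard role. In ES the boolean would come from
     * something like ShardRouting.primary(); here it is simply a parameter.
     */
    static String keyWithRole(String indexUuid, int shardNumber, boolean primary) {
        return keyWithoutRole(indexUuid, shardNumber) + "[" + (primary ? "p" : "r") + "]";
    }
}
```

Without the role, the primary and the replica of `geonames` shard 661 map to the same key; with it, they differ. Whether that distinction is needed depends on the consumers of the key, as noted above.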

rguo-aws · 2020-08-03T22:44:38Z

src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/util/Utils.java

+    }
+
+    public static HashMap<String, IndexShard> getShards() {
+        HashMap<String, IndexShard> shards =  new HashMap<>();


Let's change this to HashMap<ShardId, IndexShard>.

Yes, this would remove the need to generate the UniqueShardIdKey altogether. It would be better, though, to first decide whether to add the shardRole to the dimensions, because then both tasks can be taken up together in a separate PR.
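A rough sketch of why keying the map by a value object removes the need for a generated string key. `ShardRef` is a hypothetical stand-in for the ES `ShardId` class (which is assumed here to implement `equals` and `hashCode` over the index and shard number), and the `String` values stand in for `IndexShard` instances.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration; ShardRef stands in for ES's ShardId. */
final class ShardIdKeySketch {

    static final class ShardRef {
        private final String indexName;
        private final int shardNumber;

        ShardRef(String indexName, int shardNumber) {
            this.indexName = Objects.requireNonNull(indexName);
            this.shardNumber = shardNumber;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof ShardRef)) {
                return false;
            }
            ShardRef other = (ShardRef) o;
            return shardNumber == other.shardNumber && indexName.equals(other.indexName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(indexName, shardNumber);
        }
    }

    /** Builds a map keyed by the value object itself; the values are placeholders. */
    static Map<ShardRef, String> indexShards() {
        Map<ShardRef, String> shards = new HashMap<>();
        shards.put(new ShardRef("geonames", 661), "shard-object-661");
        shards.put(new ShardRef("geonames", 662), "shard-object-662");
        return shards;
    }
}
```

A lookup with a freshly constructed, equal `ShardRef` finds the entry. Because `HashMap` falls back to `equals` when hash codes match, two different shards whose hashes happen to coincide still get separate entries, which a key built from the hash code alone cannot guarantee.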

rguo-aws · 2020-08-03T22:52:47Z

...ndistro/elasticsearch/performanceanalyzer/collectors/NodeStatsAllShardsMetricsCollector.java

+        }
+
+        try {
+            populateCurrentShards();


We are populating this shard HashMap twice per collection (once per collector). Is there any specific reason to run the two collectors in separate threads? Can we combine the two collectors to reduce the memory footprint? We should be conservative here, as PA itself needs to be lightweight.

This just populates the shard IDs from the built-in ES shard iterator. But I agree we should save memory wherever possible. This can be taken up as a separate task.
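One possible shape for that follow-up, sketched under assumptions: both collectors read from a single shared snapshot that is rebuilt at most once per interval, instead of each collector populating its own map. `SharedShardSnapshotSketch`, its `loader`, and the `maxAgeMillis` parameter are hypothetical and do not exist in this PR; the map of strings stands in for the shard map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical illustration of sharing one shard map between two collectors. */
final class SharedShardSnapshotSketch {
    private final Supplier<Map<String, String>> loader; // e.g. walks the shard iterator
    private final long maxAgeMillis;                    // how long a snapshot stays valid
    private Map<String, String> snapshot = Collections.emptyMap();
    private long loadedAtMillis;
    private boolean loaded;
    private int loads;

    SharedShardSnapshotSketch(Supplier<Map<String, String>> loader, long maxAgeMillis) {
        this.loader = loader;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Returns the current snapshot, rebuilding it only if it is missing or too old. */
    synchronized Map<String, String> get(long nowMillis) {
        if (!loaded || nowMillis - loadedAtMillis >= maxAgeMillis) {
            // Copy defensively and expose a read-only view to both collectors.
            snapshot = Collections.unmodifiableMap(new HashMap<>(loader.get()));
            loadedAtMillis = nowMillis;
            loaded = true;
            loads++;
        }
        return snapshot;
    }

    /** Number of times the snapshot has been rebuilt; useful for checking reuse. */
    synchronized int loadCount() {
        return loads;
    }
}
```

The method is synchronized because the collectors run in separate threads. Two calls within the same interval return the same map instance, so the shard map exists once rather than once per collector.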

Node Collector Fix

e167382

adityaj1107 requested review from khushbr and rguo-aws July 30, 2020 22:16

khushbr reviewed Jul 30, 2020

Aditya Jindal added 4 commits August 1, 2020 06:51

Splitting the Node Collector in 2 Parts

053b000

Renaming Node Stats Util File

e7a16e9

BUild

890a7ec

Adding Unit Tests

dec28c8

khushbr reviewed Aug 3, 2020

Addressing PR Comments

7698296

khushbr approved these changes Aug 3, 2020

rguo-aws reviewed Aug 3, 2020

Aditya Jindal and others added 2 commits August 3, 2020 15:40

Addressing PR Comments

16fbf70

Merge branch 'master' into collector

b51a51d

vigyasharma approved these changes Aug 3, 2020

rguo-aws suggested changes Aug 3, 2020

Removing Unused Imports

f55e224

adityaj1107 requested a review from rguo-aws August 3, 2020 22:57

adityaj1107 merged commit 67a1257 into master Aug 3, 2020

ktkrg mentioned this pull request Aug 5, 2020

Use the correct ctor for NodeDetailsCollector #166

Merged
