Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Stats collection #8

Merged

Conversation

jmazanec15
Copy link
Member

#4

Adding stats collection and api to the anomaly detection plugin to give users ability to monitor plugin health and performance.

The following node/cluster level stats will be tracked:

Node Level Stats

  • ad_execute_request_count — the total number of detection requests that that node receives. Counter does not automatically reset and is stored in memory.
  • ad_execute_failure_count — the total number of requests for detection that failed. Same counter as ad_execute_request_count
  • models — information about the models being hosted on that node
    • detector_id
    • model_id
    • model_type

Cluster Level Stats

  • anomaly_detectors_index_status — status of anomaly_detectors index (green, yellow, red, non-existent)
  • anomaly_results_index_status — status of index that “opendistro-anomaly-results” alias currently points to (green, yellow, red, non-existent)
  • model_checkpoint_index_status — status of model checkpoint index (green, yellow, red, non-existent)
  • detector_count — number of detectors in the domain

User has the ability to retrieve statistics from select nodes and to retrieve only select statistics.

GET "_opendistro/_anomaly_detection/{nodeId},/stats/"
GET "_opendistro/_anomaly_detection/{nodeId},/stats/{stat},"
GET "_opendistro/_anomaly_detection/stats/"
GET "_opendistro/_anomaly_detection/stats/{stat},"

Sample call:

GET "_opendistro/_anomaly_detection/stats"

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "multi-node-run",
  "anomaly_detectors_index_status" : "green",
  "detector_count" : 1,
  "models_checkpoint_index_status" : "green",
  "anomaly_results_index_status" : "green",
  "nodes" : {
    "QXa4zxCmRbao9AbX-9RP4w" : {
      "ad_execute_request_count" : 0,
      "ad_execute_failure_count" : 0,
      "models" : [
        {
          "detector_id" : "kgYS7G4BILu2LrPi5LQH",
          "model_type" : "rcf",
          "model_id" : "kgYS7G4BILu2LrPi5LQH_model_rcf_1"
        }
      ]
    },
    "PQIuCasiQuShjmUUybF1OQ" : {
      "ad_execute_request_count" : 8,
      "models" : [
        {
          "detector_id" : "kgYS7G4BILu2LrPi5LQH",
          "model_type" : "rcf",
          "model_id" : "kgYS7G4BILu2LrPi5LQH_model_rcf_2"
        },
        {
          "detector_id" : "kgYS7G4BILu2LrPi5LQH",
          "model_type" : "threshold",
          "model_id" : "kgYS7G4BILu2LrPi5LQH_model_threshold"
        }
      ],
      "ad_execute_failure_count" : 7
    },
    "jJDUccumQXioZhdpVyxsnw" : {
      "ad_execute_request_count" : 0,
      "ad_execute_failure_count" : 0,
      "models" : [
        {
          "detector_id" : "kgYS7G4BILu2LrPi5LQH",
          "model_type" : "rcf",
          "model_id" : "kgYS7G4BILu2LrPi5LQH_model_rcf_0"
        }
      ]
    }
  }
}

Comment on lines 376 to 389
final CountDownLatch responseLatch = new CountDownLatch(1);
AtomicReference<Long> count = new AtomicReference<>();
LatchedActionListener<SearchResponse> searchListener = new LatchedActionListener<>(
new DocumentCountListener(count), responseLatch);

if (response.isPresent()) {
return response.get().getHits().getTotalHits().value;
} else {
return 0L;
client.search(searchRequest, searchListener);

try {
responseLatch.await(DOCUMENT_COUNT_SEARCH_TIMEOUT_S, TimeUnit.SECONDS);
return count.get();
} catch (InterruptedException e) {
logger.error("Interrupted Exception", e);
return -1L;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why use the CountDownLatch here which is a blocking operation too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is blocking. It will timeout after 3 seconds. Will address blocking issue in a later PR.

}

private class DocumentCountListener implements ActionListener<SearchResponse> {
private AtomicReference<Long> count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't use AtomicLong here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I will update to use AtomicLong

build.gradle Outdated
@@ -246,6 +246,8 @@ List<String> jacocoExclusions = [
'com.amazon.opendistroforelasticsearch.ad.resthandler.RestGetAnomalyResultAction',
'com.amazon.opendistroforelasticsearch.ad.metrics.MetricFactory',
'com.amazon.opendistroforelasticsearch.ad.indices.AnomalyDetectionIndices',
'com.amazon.opendistroforelasticsearch.ad.indices.AnomalyDetectionIndices.DocumentCountListener',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we have enough coverage on these 2 classes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add coverage for these two.

public String getIndexHealthStatus(String indexName) {
if (!clusterService.state().getRoutingTable().hasIndex(indexName)) {
return "nonexistent";
public String getIndexHealthStatus(String indexOrAliasName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a general function. Could you put it to some util class so that other components can use it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I will do.

searchRequest.source(sourceBuilder);

Optional<SearchResponse> response = requestUtil.timedRequest(searchRequest, logger, client::search);
final CountDownLatch responseLatch = new CountDownLatch(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a general function. Could you put it to some util class so that other components can use it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will add it to IndexUtills class

Comment on lines 63 to 64
public static final String RCF_TYPE_VALUE = "rcf";
public static final String THRESHOLD_TYPE_VALUE = "threshold";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using enum as it is a better approach than Strings. They are type safe and comparing them is faster than comparing Strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For only 2 values, do you think that this is worthwhile?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will make the update!

@ylwu-amzn ylwu-amzn added this to Review in progress in Ongoing Dec 13, 2019
Comment on lines 151 to 177
public Map<String, ADStat<?>> getNodeStats() {
Map<String, ADStat<?>> nodeStats = new HashMap<>();

for (Map.Entry<String, ADStat<?>> entry : stats.entrySet()) {
if (!entry.getValue().isClusterLevel()) {
nodeStats.put(entry.getKey(), entry.getValue());
}
}
return nodeStats;
}

/**
* Get a map of the stats that are kept at the cluster level
*
* @return Map of stats kept at the cluster level
*/
public Map<String, ADStat<?>> getClusterStats() {
Map<String, ADStat<?>> clusterStats = new HashMap<>();

for (Map.Entry<String, ADStat<?>> entry : stats.entrySet()) {
if (entry.getValue().isClusterLevel()) {
clusterStats.put(entry.getKey(), entry.getValue());
}
}
return clusterStats;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good idea to combine these two functions since they are only different in terms of the true/false of the same if condition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I will combine these two to remove redundant code.

if (indexMetaDataList.size() == 0) {
return ALIAS_EXISTS_NO_INDICES_STATUS;
} else {
indexOrAliasName = indexMetaDataList.get(0).getIndex().getName();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if index at pos 1 is yellow and index at pos 0 is green? I meant why do we just consider index at pos 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only consider the index at position 0 because our aliases will only point to one index at a time. The indexMetaDataList for our purposes should never have a size greater than 1. I will add a comment to explain this and then add an else if condition that checks if the alias points to multiple indices.

Copy link
Member

@wnbts wnbts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general comments across this change. Otherwise, looks good overall.

All interfaces (public methods, fields, classes) should be documented. This is one opportunity/reason to design and select which should be visible to customers and which should be behind abstraction. Access to classes, methods, fields (public, private, package, protected) reflects the design.

Implementation-wise, I'd encourage using more stream api when appropriate.

* Get the ModelInformation for all hosted models
*
* @return modelsInformation list of model information for all hosted models
* getAllModels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. Standard java doc is needed for readers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make update thanks!

private Instant lastUsedTime;
private Instant lastCheckpointTime;

/**
* Constructor.
*
* @param model ML model
* @param modelId Id of Model this model partition is a part of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. Model id is the id of the partitioned model. There is no abstract/logical id for the collective model for a detector.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it thanks I will update this

+ " and individual stats");
} else {
Set<String> invalidStats = new TreeSet<>();
adStatsRequest.clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. This doesn't seem needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, all stats are set to be retrieved. So, I needed to clear them here. However, after more consideration, I believe that I should set all stats to not be retrieved by default. Therefore, it is more clear about what stats are being retrieved.

/**
* Initialize the map that keeps track of all of the stats
*/
private void initStats() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style suggestion. I would treat this as configuration (data) rather than the initialization code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. I will modify this so that the map can be passed in and move the creation of this map to the createComponents method of anomalyDetectorPlugin.


void increment();

void add(long n);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. Is add and reset needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No they are not. I will remove.

/**
* ADStatsNodeRequest to get a nodes stat
*/
class ADStatsNodeRequest extends BaseNodeRequest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. This looks like a public class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix thanks.


public static final String ALL_STATS_KEY = "_all";

private Map<String, Boolean> statsRetrievalMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question. Why not using a set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A set does make more sense here. Will update thanks!

public class ADStatsTransportAction extends TransportNodesAction<ADStatsRequest, ADStatsResponse,
ADStatsNodeRequest, ADStatsNodeResponse> {

ADStats adStats;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. Should this be private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should. Good catch thanks!

@@ -207,16 +210,20 @@ protected void doExecute(Task task, ActionRequest actionRequest, ActionListener<
AnomalyResultRequest request = AnomalyResultRequest.fromActionRequest(actionRequest);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question. Is it possible to collect metrics using the listener so it's easier and doesn't miss? something in the line of overriding the onFailure in a wrapper.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, that would make the code significantly cleaner and prevent misses. I will add that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might go even further to replace the input listener with the wrapped listener.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is possible. doExecute is called outside the AD code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant using listener again to refer to the wrapped listener. It might be controversial but it reduces mistakes and simplifies overall for this case, something I would probably do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay yes I think that will reduce potential for mistakes and clarify code. Will make updates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, because the original listener is called in the lambda, it is not possible to refactor in this way unless I create a copy of the original listener to use in the lambda. Do you think creating this copy is worth it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the original can be reused this way

final ActionListener original = listener
listener = wrap(original)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense. Will make change

sourceBuilder.size(0);
searchRequest.source(sourceBuilder);
Optional<SearchResponse> response = clientUtil.timedRequest(searchRequest, logger, client::search);
return response.map(r -> r.getHits().getTotalHits().value).orElse(-1L);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. The output value -1 is not consistent with output value 0. The behaviour needs to be documented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the index does not exist, I thought it makes sense to return 0 because there are 0 documents in a non-existent index. I thought it makes sense to return -1 on request failure because there are an unknown number of documents in the index. I will document this behavior.

Ongoing automation moved this from Review in progress to Reviewer approved Dec 26, 2019
@jmazanec15 jmazanec15 removed the request for review from ylwu-amzn December 30, 2019 16:33
@@ -26,13 +26,26 @@
private IndexUtils indexUtils;
private String indexName;

private final String UNABLE_TO_RETRIEVE_HEALTH_MESSAGE = "unable to retrieve health";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. In general when something is used in output, it's part of interface and should be visible to public, unless the value really doesn't matter to any.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Will make update.

@@ -207,16 +210,20 @@ protected void doExecute(Task task, ActionRequest actionRequest, ActionListener<
AnomalyResultRequest request = AnomalyResultRequest.fromActionRequest(actionRequest);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might go even further to replace the input listener with the wrapped listener.

Copy link
Contributor

@jngz-es jngz-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

for (Map.Entry<String, Boolean> entry : statsRetrievalMap.entrySet()) {
out.writeBoolean(entry.getValue());
}
out.writeStringCollection(validStats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are validStats the same all the time? If yes, do we need to record them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I can take validStats out of here and move it to the RestStatsAnomalyDetectorAction. This way we do not have to pass it around.

@jmazanec15 jmazanec15 merged commit cb2acd6 into opendistro-for-elasticsearch:development Jan 11, 2020
Ongoing automation moved this from Reviewer approved to Done Jan 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
Ongoing
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants