Occasional missing data even though heatmap cell indicates anomalies #119
Comments
OpenSearch version 1.1.0.0. In OpenSearch version 1.0.0.0, the entity name appeared in the anomaly/confidence graph and, as far as I can remember, it showed the anomaly occurrences, but it never showed a confidence/anomaly graph. We only get a graph if we explicitly do not use a categorical field and instead use a data filter for exactly one entity. Then we can see a confidence graph and anomaly detection. |
@prpaluch hello, thanks for bringing this issue up. Regarding the differences between 1.1.0.0 and 1.0.0.0: there were some changes in 1.1.0.0 that reformatted the layout of the charts shown below the heatmap, but they should not have affected whether the charts show populated anomalies. Additionally, both versions show the same content (anomaly results table, anomaly results chart, and feature breakdown). If you are seeing "Click on an anomaly entity to view data", the chart isn't recognizing that a heatmap cell has been selected. If a cell is selected but there are no results, you should see an error like the one in the screenshot above ("There are no anomalies currently."). If you are consistently seeing the former when clicking on a heatmap cell, that may be because of cluster load (the charts taking a long time to update) or some separate issue. If so, can you open a separate issue describing how to reproduce that bug? I would be happy to help and deep dive into that problem, thanks. |
Hi @ohltyler, is there an option to debug this behavior, or to find out the cluster load or whether the cluster is overloaded? The detector runs on several indices using a wildcard (*) pattern. Each index can grow to 1-2 TB of data. The data for a specific index can be delayed by 1-2 days, meaning data can be written to an index that is 2 days old. Currently the whole configuration runs as a POC on 3 nodes with a total of 264 GB of memory, to see how anomaly detection performs with our data. At the moment nothing happens when we click on a heatmap square except that all other cells are greyed out; the chart stays the same even after waiting 1 minute. |
One note: via GET _plugins/_anomaly_detection/stats and POST _plugins/_anomaly_detection/detectors/results/_search, data seems to be available for each detector. |
@prpaluch thanks for those details. When a cell is selected, the plugin makes a call to fetch the individual anomaly results for the detector, filtered on the cell's time range. My assumption is that either the request is timing out due to cluster load or there is a bug in displaying the results. To check for errors, can you try selecting a heatmap cell and waiting a few minutes while monitoring the console output using Chrome's dev tools (or another browser's equivalent)? That should give you more information on whether the request is timing out or any errors are occurring. You can check real-time cluster-level stats using the node stats API ( |
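For anyone who wants to run the equivalent of that per-cell fetch by hand, here is a minimal sketch of a request body for `POST _plugins/_anomaly_detection/detectors/results/_search`. The helper name `buildCellResultsQuery`, the `CellRange` type, and the page size are hypothetical, and the field names (`detector_id`, `data_end_time`, `anomaly_grade`) are assumptions about the results index mapping; this is not the plugin's actual query.

```typescript
// Hypothetical helper, not taken from the plugin source.
// Builds a query DSL body that filters anomaly results to one heatmap cell.
interface CellRange {
  startMs: number; // cell start, epoch milliseconds
  endMs: number;   // cell end, epoch milliseconds
}

function buildCellResultsQuery(detectorId: string, cell: CellRange, minGrade = 0) {
  return {
    size: 100, // assumed page size
    query: {
      bool: {
        filter: [
          // Assumed field names in the anomaly results index.
          { term: { detector_id: detectorId } },
          { range: { data_end_time: { gte: cell.startMs, lte: cell.endMs } } },
          // Keep only actual anomalies (grade above the threshold).
          { range: { anomaly_grade: { gt: minGrade } } },
        ],
      },
    },
    sort: [{ data_end_time: { order: "desc" } }],
  };
}

// Example: body for a one-hour cell of a detector with a placeholder id.
const body = buildCellResultsQuery("my-detector-id", { startMs: 0, endMs: 3_600_000 });
console.log(JSON.stringify(body, null, 2));
```

Pasting the printed body into dev tools and timing the response should show whether the query itself is slow, independent of the dashboard.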
Hi @ohltyler, these errors show up in the Chrome dev console while loading the detector and clicking on a heatmap cell. The first exception is thrown when I visit the page of our configured detector: And next, when I click on a cell, this exception is thrown in the console: a_nomalyDetectionDashboards.chunk.2.js:1 Uncaught TypeError: entityListAsString.split is not a function |
@prpaluch great, this is really helpful. The first exception (missing alerting config) is okay and will not interrupt the plugin from working. The second exception, when clicking on a cell, seems to be where the issue is. Might be due to an NPE on |
Looks to be an NPE happening on this line, which means If the heatmap chart is populated with anomalies and if there are values on the y-axis, those should be persisted in the chart data, where Apologies for all of the requested info. I'm unable to reproduce this myself and am trying to figure out what data could lead to this error being thrown. To help unblock you as well, you may look into upgrading to 1.2.0.0. A lot of the logic around the heatmap cells has been refactored, and the function that is causing the NPE here has actually been removed. Regardless, this is helpful to try to fix this bug in order to provide a patch for users using 1.1.0.0 (and potentially 1.0.0.0). Thanks! |
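As a side note for anyone patching 1.1.0.0 locally: in JavaScript, `x.split is not a function` is raised when `x` is defined but is not a string (for example a number), whereas a null or undefined value produces a different TypeError. That would be consistent with the numeric y-axis values discussed later in this thread. Below is a minimal defensive sketch; `toEntityList` and its separator default are hypothetical and are not the plugin's actual code.

```typescript
// Hypothetical normalizer: turns whatever the chart hands back for a cell's
// y-value into a list of entity names, without assuming it is a string.
function toEntityList(value: unknown, separator = ","): string[] {
  if (value === null || value === undefined) {
    return []; // nothing selected
  }
  if (Array.isArray(value)) {
    return value.map((v) => String(v));
  }
  if (typeof value === "string") {
    return value
      .split(separator)
      .map((s) => s.trim())
      .filter((s) => s.length > 0);
  }
  // Numbers, booleans, etc.: treat as a single entity name.
  return [String(value)];
}

console.log(toEntityList(163));          // a numeric y-value no longer throws
console.log(toEntityList("162, 163"));
console.log(toEntityList(undefined));
```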
Checked version 1.2.0 and our problems are gone, it seems that the refactoring in version 1.2.0 helped a lot to make the heatmap work. Now it is possible to click on a square and you get the anomalies and also the confidence graph as it should. |
Thanks for the detailed information! I'm glad that 1.2 is working better for you, but it sounds like there are still issues with the y-axis scaling. Your assumption that the overlapping y-axes cause an empty selectedEntityList in 1.1 is probably right: I'm assuming the aggregated data within the cell is corrupted because of this, which leads to the errors (and those aren't handled properly in 1.1). If you don't mind, could you provide some more details on the index and detector configuration? I'd like to try to reproduce locally to find the root cause of the bug. Some useful info would be:
|
Hi, yes of course. We are analysing HAProxy logs. Here is the index template for the log data. It contains the settings and mappings. You can set it up via dev tools. POST /_template/haproxy_example And here is a snippet of what the data looks like; all xxxxxxx are string values "hits" : [ Here is the detector configuration GET _plugins/_anomaly_detection/detectors/g61xN30Bju0JQ98M1BaN If you need more information, I am glad to help. |
@prpaluch I'm able to reproduce the issue on 1.2 after setting up a similar local environment (similar indices, detector config). Will respond back once I can dive deeper into the root cause. I have a small assumption that it's some conflict with the values of the y-axis, where it's read in the plotly heatmap as numerical rather than strictly string/keyword, since the y-axis values always seem to be organized in descending order no matter how it's filtered (by severity/occurrence, top 10/20/30, etc). Also, they seem to be spaced evenly based on their values - for example, the gaps are larger between numbers that are numerically farther apart (see 84 & 115), and the gaps are small between numbers that are numerically close (see 162 and 163 which are overlapping): |
Found out Plotly will automatically try to determine the data type based on the axis data given. In this case, it looks like it is labeling it as I'll work on a patch for this and will update this issue with the latest progress. |
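For reference, Plotly lets you override this auto-detection by setting the axis `type` explicitly in the layout. The sketch below only shows the general technique; it is not necessarily what the eventual patch does, and `buildHeatmapFigure` and `HeatmapInput` are hypothetical names. It builds plain objects, so it runs without Plotly installed.

```typescript
// Hypothetical input shape for an entity-by-time heatmap.
interface HeatmapInput {
  entities: Array<string | number>; // y-axis values, e.g. 84, 115, 162, 163
  times: string[];                  // x-axis labels
  grades: number[][];               // z[i][j]: grade for entities[i] at times[j]
}

function buildHeatmapFigure(input: HeatmapInput) {
  return {
    data: [
      {
        type: "heatmap" as const,
        x: input.times,
        // Stringify entity names. On its own this is not enough, because
        // numeric-looking strings can still be auto-typed as numbers.
        y: input.entities.map((e) => String(e)),
        z: input.grades,
      },
    ],
    layout: {
      // Force a categorical axis so entities are evenly spaced and keep the
      // order they are given in, instead of being placed by numeric value.
      yaxis: { type: "category" as const },
    },
  };
}

const fig = buildHeatmapFigure({
  entities: [162, 163, 84],
  times: ["t0", "t1"],
  grades: [[0, 0.4], [0.9, 0], [0, 0]],
});
console.log(fig.layout.yaxis.type, fig.data[0].y);
```

The resulting `data` and `layout` objects can be passed to `Plotly.newPlot` or the equivalent React component props.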
Heatmap chart axes fixed as part of #167. Will still leave this issue open for tracking the root cause of occasional empty/missing data in the cells. |
Occasionally, a heatmap cell summary will indicate an anomaly present, but when clicked on, shows 0 available anomalies.
The anomaly summaries and the anomaly data are fetched in 2 different calls, so likely the issue has to do with the time bounds being different between the two queries. If an anomaly is on the edge, it may be getting included in the summary, but not included in the raw results, leading to the discrepancy.
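One way to rule out this kind of edge discrepancy is to derive the cell bounds once and apply the same half-open interval `[start, end)` in both the summary and the raw-results code paths. The following is only a sketch of that idea under assumed data shapes; `Anomaly`, `Cell`, `cellFor`, `summarize`, and `details` are hypothetical names, not the plugin's actual functions.

```typescript
interface Anomaly { timeMs: number; grade: number; }
interface Cell { startMs: number; endMs: number; }

// Compute the cell containing timeMs, given a grid origin and cell width.
function cellFor(timeMs: number, originMs: number, widthMs: number): Cell {
  const index = Math.floor((timeMs - originMs) / widthMs);
  const startMs = originMs + index * widthMs;
  return { startMs, endMs: startMs + widthMs };
}

// Single source of truth for membership: start inclusive, end exclusive.
function inCell(a: Anomaly, cell: Cell): boolean {
  return a.timeMs >= cell.startMs && a.timeMs < cell.endMs;
}

// "Summary" path: what the heatmap cell would display.
function summarize(anomalies: Anomaly[], cell: Cell) {
  const hits = anomalies.filter((a) => inCell(a, cell) && a.grade > 0);
  return {
    count: hits.length,
    maxGrade: hits.reduce((m, a) => Math.max(m, a.grade), 0),
  };
}

// "Details" path: what is listed after the cell is clicked.
function details(anomalies: Anomaly[], cell: Cell): Anomaly[] {
  return anomalies.filter((a) => inCell(a, cell) && a.grade > 0);
}

// An anomaly exactly on the boundary between two one-hour cells.
const HOUR = 3_600_000;
const sample: Anomaly[] = [{ timeMs: HOUR, grade: 0.7 }];
const first = cellFor(0, 0, HOUR);
const second = cellFor(HOUR, 0, HOUR);
console.log(summarize(sample, first).count, details(sample, first).length);
console.log(summarize(sample, second).count, details(sample, second).length);
```

Because both paths share `inCell`, a boundary anomaly is attributed to exactly one cell, and the summary count always matches the number of detail rows.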
Screenshot of the error: