Bias in "streaming" datasets by API offering minValue,maxValue from base.py #350
Comments
CC @ctrl-z-9000-times, what do you think? We'll fix this for community/NAB
I don't like this change.
That's true. On the other hand, "optimizing" the bounds exactly for each dataset is a bias. As a compromise to what you're saying: make the min/max known, but compute them from ALL datasets combined.
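The compromise proposed above could be sketched as follows. This is a hypothetical helper, not NAB code; the file pattern and column name are illustrative assumptions:

```python
# Hypothetical sketch of the compromise above: compute ONE global (min, max)
# across every dataset file, rather than per-dataset bounds, so that no
# single stream gets individually "optimized" limits.
import csv
import glob

def global_bounds(pattern="data/*/*.csv", column="value"):
    """Scan all dataset files and return a single (min, max) for all of them."""
    lo, hi = float("inf"), float("-inf")
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                v = float(row[column])
                lo, hi = min(lo, v), max(hi, v)
    return lo, hi
```

Every detector would then receive the same shared bounds, removing the per-dataset tuning advantage while keeping the API unchanged.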
The way I see it, each dataset was probably recorded using a physical device (the sensor), which has well-known limitations on the range of values it can record. Although the NAB data sets did not save the min/max values of the sensor hardware, they can safely be inferred.
In real projects, many sensors that provide data for anomaly detection give values that are computed from a variety of indicators. Therefore, it is impossible to know the range of possible values, since there is no physical equipment behind them. Sorry for my intervention :)
Minimum and maximum: in software systems for detecting anomalies, the architecture defines how data is obtained. In some systems, the maximum and minimum are known from the start. In others, you can use the first N samples so that the system decides which range of values is valid and optimizes the detector settings internally. In yet others, even that is unacceptable. Sorry for my English.
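The "first N samples" approach mentioned above could look like the following sketch. This is an illustrative class, not part of NAB; the class name, `n_calibration`, and `headroom` parameters are assumptions:

```python
# Hypothetical sketch of calibrating bounds from the first N samples:
# buffer an initial window, derive min/max from it (optionally widened
# by a headroom fraction), then fix the bounds for the rest of the stream.

class CalibratingDetector:
    def __init__(self, n_calibration=100, headroom=0.25):
        self.n = n_calibration
        self.headroom = headroom   # widen bounds by this fraction of the span
        self.buffer = []
        self.bounds = None         # (min, max), fixed once calibration ends

    def handle(self, value):
        if self.bounds is None:
            self.buffer.append(value)
            if len(self.buffer) >= self.n:
                lo, hi = min(self.buffer), max(self.buffer)
                pad = (hi - lo) * self.headroom
                self.bounds = (lo - pad, hi + pad)
            return None            # no output during calibration
        lo, hi = self.bounds
        # clamp into the fixed range before passing to an encoder/detector
        return min(max(value, lo), hi)
```

The trade-off is exactly the one the comment names: the calibration window is unscored, and streams whose range drifts past the calibrated bounds get clamped.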
Any ideas on how to solve this? It does not make sense for streaming data, even for stock market time series. I could update the model every day with a new max and min, but that's against the philosophy of online ML. Does anybody have an idea how to solve this?
It's a valid point, and one we discussed quite a bit. We chose the current method due to the very high, but known, dynamic range of some of the streams. We found in practical applications this assumption (knowing the min/max) was valid in most cases. Dynamically figuring out the min/max is a hard task, and beyond the scope of the NAB dataset; maybe it's something that could be addressed in a future version. For something like stock data, I would suggest picking a large max upfront, say 2X or 4X the current max value. That should work fine. Keep in mind that raw stock market price data is inherently very unpredictable, so it is not a good dataset for any anomaly detection algorithm that I know of; I wouldn't expect good results no matter what.
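The "large max upfront" suggestion above is a one-liner in practice. The function name and signature here are illustrative, not from NAB:

```python
# Hypothetical sketch of the suggestion above: fix encoder bounds from the
# values seen so far, scaling the max by a headroom factor (2x-4x, per the
# comment) so future growth rarely escapes the range.

def headroom_bounds(values, factor=4.0):
    """Return a fixed (min, max) with the observed max scaled by `factor`."""
    return min(values), max(values) * factor
```

The cost of the extra headroom is coarser encoder resolution over the range actually used, which is usually an acceptable trade for never having to re-fit the bounds.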
The abstract class AnomalyDetector (from base.py) dictates min/max bounds of the dataset in its API: https://github.com/numenta/NAB/blob/master/nab/detectors/base.py#L47
This is a bias, as it makes it easier for encoders to choose optimal settings. In real life, on streaming datasets, such values are not known, and encoders have to deal with that fact (by using an encoder that does not require fixed bounds, such as RDSE, or by setting the bounds large enough).
I think this is a bug in NAB API design and the information should be removed.
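To illustrate the pattern the issue objects to, here is a minimal sketch (see the linked base.py for the real class; the method and parameter names below are illustrative assumptions): the base detector pre-computes the dataset's exact min and max, which an encoder can then consume as perfectly tuned bounds.

```python
# Illustrative sketch of the API pattern the issue describes (not the real
# NAB base.py): exact per-dataset bounds are computed up front and exposed
# to every detector, information that a true streaming detector cannot have.

class AnomalyDetector:
    def __init__(self, dataset_values):
        # Exact per-dataset bounds: unavailable in a real streaming setting.
        self.inputMin = min(dataset_values)
        self.inputMax = max(dataset_values)

    def get_scalar_encoder_params(self, n=400, w=21):
        # An encoder configured with these bounds is "optimally" tuned to the
        # dataset, which is the bias the issue describes.
        return {"n": n, "w": w,
                "minval": self.inputMin, "maxval": self.inputMax}
```

Removing `inputMin`/`inputMax` from the API, as the issue proposes, would force every detector to handle unknown ranges the way a deployed streaming system must.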