
Bias in "streaming" datasets by API offering minValue,maxValue from base.py #350

Open
breznak opened this issue Jul 25, 2019 · 8 comments

breznak commented Jul 25, 2019

The abstract class AnomalyDetector (from base.py) exposes the dataset's min/max bounds in its API.
https://github.com/numenta/NAB/blob/master/nab/detectors/base.py#L47

This is a bias, as it makes it easier for encoders to choose optimal settings. In real life, such values are not known for streaming datasets, and encoders have to deal with that fact (either by using an encoder that does not require fixed bounds, such as RDSE, or by setting the bounds generously large).

I think this is a bug in the NAB API design, and the information should be removed.
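
For reference, the relevant part of base.py looks roughly like this (a paraphrased sketch, not the verbatim source; see the link above for the exact code):

```python
# Paraphrased sketch of the AnomalyDetector base class in
# nab/detectors/base.py (see the linked source for the real thing).
class AnomalyDetector(object):

    def __init__(self, dataSet, probationaryPercent):
        self.dataSet = dataSet
        # The bias under discussion: every detector is handed the exact
        # min/max of the full dataset before it sees a single record.
        self.inputMin = self.dataSet.data["value"].min()
        self.inputMax = self.dataSet.data["value"].max()
```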


breznak commented Jul 25, 2019

CC @ctrl-z-9000-times, what do you think? We'll fix this for community/NAB.

@ctrl-z-9000-times

I don't like this change.

  • It is not unreasonable to know the physical limits of the sensor.
  • It will break the code.

@breznak
Copy link
Member Author

breznak commented Jul 25, 2019

It is not unreasonable to know the physical limits of the sensor.

This is true. On the other hand, "optimizing" bounds exactly for a dataset is a bias.

A compromise with what you're saying: keep min/max known, but compute them from ALL datasets (a sketch follows below).

  • that more precisely reflects what you describe: a fixed sensor limit rather than a per-dataset optimum
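
A minimal sketch of that compromise, assuming the standard NAB layout of CSV files under data/<category>/ with a "value" column:

```python
# Minimal sketch: compute one global min/max over ALL NAB data files
# instead of per-dataset bounds. Assumes the usual repository layout
# data/<category>/<file>.csv with a "value" column.
import glob

import pandas as pd

global_min = float("inf")
global_max = float("-inf")
for path in glob.glob("data/*/*.csv"):
    values = pd.read_csv(path)["value"]
    global_min = min(global_min, values.min())
    global_max = max(global_max, values.max())

# These two numbers would then be passed to every detector.
print(global_min, global_max)
```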

@ctrl-z-9000-times

"optimizing" bounds exactly for a dataset is a bias.

The way I see it is that each dataset was probably recorded using a physical device (the sensor) which has well-known limits on the range of values it can record. Although the NAB data sets did not save the min/max values of the sensor hardware, those values can safely be inferred.


smirmik commented Jul 26, 2019

The way I see it is that each dataset was probably recorded using a physical device (the sensor) which has well-known limits on the range of values it can record. Although the NAB data sets did not save the min/max values of the sensor hardware, those values can safely be inferred.

In real projects, many sensors that provide data for anomaly detection give values that are calculated from a variety of indicators. Therefore, it is impossible to know the range of possible values, since the source is not physical equipment.

Sorry for intervening :)


smirmik commented Jul 26, 2019

Minimum and maximum.

In software systems for anomaly detection, the architecture defines how the data is obtained. In some systems, the minimum and maximum can be known up front. In others, the first N samples can be used so that the system decides which range of values is valid and internally optimizes the detector settings (a sketch of this approach follows below). In some systems, even that is unacceptable.
NAB is a universal benchmark for testing arbitrary detectors. The fact that it provides a minimum and maximum does not oblige a detector to use them.
On the other hand, the availability of the initial minimum and maximum has led NAB to contain only detectors that use these values, since this makes it easier to optimize detector results. And that contradicts the concept of an "ideal detector".
I think the more correct approach is to remove the minimum and maximum and let the detectors solve this problem for themselves. But that would require redesigning all the detectors currently in NAB.
An alternative solution is to keep two results tables: one for detectors that receive the minimum and maximum from outside, and another for those that do not use these values.
This is a question of concept; there is no single right answer from the point of view of logic.

Sorry for my English.
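
To make the "first N samples" variant above concrete, here is a hypothetical sketch (the function name, window size, and padding factor are all assumptions, not anything NAB defines):

```python
# Hypothetical sketch of the "first N samples" approach: infer a
# working range from an initial probationary window, padded for headroom.
def estimate_bounds(window, padding=0.2):
    """Return (min, max) inferred from an initial window of samples."""
    lo, hi = min(window), max(window)
    span = (hi - lo) or 1.0  # guard against a constant window
    return lo - padding * span, hi + padding * span

# Example usage: calibrate on the first 500 records, then run the detector.
# input_min, input_max = estimate_bounds(stream[:500])
```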

@ankitnayan

Any ideas on how to solve this? It does not make sense for streaming data, even for stock market time-series data. I could update the model every day with a new max and min, but that is against the philosophy of online ML.


subutai commented Jun 19, 2020

It's a valid point, and one we discussed quite a bit. We chose the current method due to the very high dynamic, but known, range of some of the streams. We found in practical applications this assumption (knowing the min/max) was valid in most cases. Dynamically figuring out min/max is a hard task, and beyond the scope of the NAB dataset. Maybe it's something that could be addressed in a future version.

For something like stock prices, I would suggest picking a large max upfront, say 2X or 4X the current max value. That should work fine. Keep in mind that raw stock market price data is inherently very unpredictable, so it's not a good dataset for any anomaly detection algorithm that I know of. I wouldn't expect good results no matter what.
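
As a hypothetical illustration of that suggestion (the factor and the zero floor are assumptions for price-like data, not part of NAB):

```python
# Hypothetical sketch: pick a generous max upfront by padding the max
# observed so far; prices cannot go below zero, so the min is fixed at 0.
def padded_bounds(history, factor=4.0):
    return 0.0, factor * max(history)

# Example: input_min, input_max = padded_bounds(prices_so_far, factor=4.0)
```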
