
Anomaly detection tutorial #223

Merged · 12 commits merged into main from anomaly-detection-tutorial on Aug 21, 2023
Conversation

ianspektor
Collaborator

No description provided.

Collaborator

It is great to see Temporian in action :).

I cannot comment in the file itself, so let me add the comments here.

"but it means that the timestamp column has no actual semantic meaning, and should not be used for feature engineering at all."

We can use the timestamps, but we should not assume they are Unix epochs and apply calendar features. However, we can still use the timestamps to compute distances between observations (within each index).
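For instance, a minimal sketch of that idea with pandas (the column names "machine_id", "timestamp" and "value" are illustrative, not the tutorial's actual schema):

```python
import pandas as pd

# Illustrative data: one row per observation, grouped by an index column.
df = pd.DataFrame({
    "machine_id": ["m1", "m1", "m1", "m2", "m2"],
    "timestamp": [0.0, 10.0, 25.0, 5.0, 7.0],
    "value": [1.0, 1.2, 0.9, 3.1, 3.0],
})

# Time elapsed since the previous observation, computed per index:
# a valid feature even when the timestamps are not real Unix epochs.
df["time_since_prev"] = df.groupby("machine_id")["timestamp"].diff()
```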

Feature names

What about calling the features "f1", "f2", etc. instead of "0", "1", etc.?

It would be especially useful for interpreting augmented features, e.g. "avg_20_13" => "avg_20_f13".
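A minimal sketch of the renaming, assuming the raw data is loaded into a pandas DataFrame before building the EventSet (the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv("sensors.csv")  # illustrative file name

# Prefix the numeric column names so derived features stay readable,
# e.g. "avg_20_13" becomes "avg_20_f13".
df = df.rename(columns={c: f"f{c}" for c in df.columns if c.isdigit()})
```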

"last_train_timestamp = int(len(X.get_arbitrary_index_data()) * 0.8)"

I would add a note saying that this only works if all the time series have the same length.
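A minimal sketch of a per-index cutoff that also works with unequal lengths, again using illustrative pandas column names rather than the tutorial's API:

```python
import pandas as pd

# Illustrative long-format data: one row per (machine_id, timestamp).
df = pd.DataFrame({
    "machine_id": ["m1"] * 10 + ["m2"] * 4,
    "timestamp": list(range(10)) + list(range(4)),
})

# One cutoff timestamp per index keeps the split at 80/20 even when the
# time series have different lengths.
cutoffs = df.groupby("machine_id")["timestamp"].quantile(0.8)
is_train = df["timestamp"] <= df["machine_id"].map(cutoffs)
train_df, test_df = df[is_train], df[~is_train]
```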

As we noticed previously, the dataset is very unbalanced.

Example weighting is generally used when the training algorithm is not able to train on unbalanced data, or to speed up training.
But if the dataset is small enough and the algorithm supports unbalanced data, it does not make much sense.

The issue is that accuracy is not a good metric for this problem. Instead, I would report ROC AUC (or even better, an AMOC [activity monitoring operating characteristic] curve).

An AMOC is better suited here because, when there is an anomaly, a technician is generally called. Therefore, two anomalies detected at almost the same time generally count as "one" (in terms of cost).
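A minimal sketch of reporting ROC AUC with scikit-learn; `model`, `X_test` and `y_test` are assumed to be the tutorial's fitted classifier and test split:

```python
from sklearn.metrics import roc_auc_score

# ROC AUC ranks the predicted anomaly probabilities instead of hard 0/1
# labels, so it is far less sensitive to class imbalance than accuracy.
y_score = model.predict_proba(X_test)[:, 1]  # probability of the anomaly class
print("ROC AUC:", roc_auc_score(y_test, y_score))
```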

Alongside the raw predictions and labels, it could be interesting to also plot the number of predictions / labels in the last N (e.g., N=100) timestamps.
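A minimal sketch of that rolling count, assuming `y_test` and `y_pred` are aligned 1D arrays of 0/1 labels and predictions:

```python
import pandas as pd
import matplotlib.pyplot as plt

N = 100  # window size suggested above

# Trailing count of positives over the last N timestamps.
labels_last_n = pd.Series(y_test).rolling(N, min_periods=1).sum()
preds_last_n = pd.Series(y_pred).rolling(N, min_periods=1).sum()

plt.plot(labels_last_n, label=f"labels in last {N}")
plt.plot(preds_last_n, label=f"predictions in last {N}")
plt.legend()
plt.show()
```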

Plots

39 plots is a bit too much. If possible, I would only plot a subset of features (e.g., 5).

Conclusion

Each time a new accuracy (or other metric) is computed, it would be great to compare it to previous runs.
Maybe we could have a simple utility function like:

all_metrics = {}

def save_metric(experiment_name, metric):
    # Store the metric and print everything collected so far.
    all_metrics[experiment_name] = metric
    for name, value in all_metrics.items():
        print(f"{name}: {value}")
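For example, each evaluation cell could end with something like save_metric("raw", balanced_accuracy), so every new result is printed next to the previous ones.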

Abnormality detection

The two common challenges of abnormality detection are that anomalies are generally rare, and that we generally don't know in advance what they look like (and two different anomalies rarely look the same).

You commented about that in the last section, but the fact that you use the training labels for abnormality detection is a bit counter-intuitive. It is unlikely you will get the same quality of results, but it would be interesting to actually also demonstrate this version (in a different colab, of course).

Collaborator Author

Thanks for all the feedback!

  • Improved comment about timestamps having no semantic meaning
  • Renamed features to f1, f2, etc.
  • Added comment about the cutoff strategy only working when the time series in each index have equal lengths
  • Added roc_auc score since it's easily available inside sklearn.
    • I'd argue that balanced accuracy is not a useless metric in this scenario, so I'd keep it. Note that I'm using the sample weights only for this purpose, not for training (MLPClassifier doesn't support sample weights like some tree-based models do); see the sketch after this list. Wdyt?
  • Reduced number of plots everywhere
  • Printed all metrics every time we get a new one
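For reference, a minimal sketch of how the sample weights can stay evaluation-only, with `y_test` and `y_pred` assumed from the tutorial:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.utils.class_weight import compute_sample_weight

# The balanced weights are derived from the labels and passed to the metric
# only; the classifier itself is trained without any sample weights.
weights = compute_sample_weight(class_weight="balanced", y=y_test)
print("weighted accuracy:", accuracy_score(y_test, y_pred, sample_weight=weights))
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
```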

And finally: I know training a supervised model for anomaly detection isn't the most common approach, but I felt it would still be a good example of handling this kind of data + getting us traffic from people searching for those terms. I gave the IsolationForest model a try with another dataset some days ago and the results were terrible, but I can definitely give it a try again with this one next week and create a new tutorial / a second part to this one!
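A minimal sketch of what that unsupervised variant could look like with scikit-learn's IsolationForest; `X_train` and `X_test` are assumed from the tutorial, and the contamination value is illustrative:

```python
from sklearn.ensemble import IsolationForest

# Fit on features only: no labels are used during training.
iso = IsolationForest(contamination=0.01, random_state=0)
iso.fit(X_train)

# predict() returns -1 for anomalies and 1 for normal points; map to 0/1.
y_pred_unsup = (iso.predict(X_test) == -1).astype(int)
```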

Collaborator

Thanks.

LGTM and a few more comments:

Unify string quoting: " or '.

classes = y_train.unique()
We know the classes are 0 and 1. Maybe we don't need to complicate the evaluation code to handle multiclass.

Note for another PR: I think the memory usage is not printed anymore when calling print(eventset).

"eval" is a reserved python function :).

Apart from the "raw" call, "eval" does not correctly show the name of the experiment.
Also, maybe the experiment names (e.g. "raw", "stats") could be more descriptive.

+1 for a classical abnormality detection in another tutorial

@ianspektor ianspektor merged commit 802d669 into main Aug 21, 2023
5 of 13 checks passed
@ianspektor ianspektor deleted the anomaly-detection-tutorial branch August 21, 2023 17:47