
Anomaly detection tutorial #223

Merged · 12 commits merged into main from anomaly-detection-tutorial on Aug 21, 2023
Conversation

ianspektor
Collaborator

No description provided.

Collaborator

It is great to see Temporian in action :).

I cannot comment in the file itself, so let me add the comments here.

"but it means that the timestamp column has no actual semantic meaning, and should not be used for feature engineering at all."

We can use the timestamps, but we should not assume they are Unix epochs and apply calendar features. However, we can still use the timestamps to compute distances between observations (within each index).
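For instance, a minimal sketch of that idea with pandas (the column names "machine_id", "timestamp" and "value" are illustrative, not the tutorial's actual schema):

```python
import pandas as pd

# Illustrative data: one row per observation, grouped by an index column.
df = pd.DataFrame({
    "machine_id": ["m1", "m1", "m1", "m2", "m2"],
    "timestamp": [0.0, 10.0, 25.0, 5.0, 7.0],
    "value": [1.0, 1.2, 0.9, 3.1, 3.0],
})

# Time elapsed since the previous observation, computed per index:
# a valid feature even when the timestamps are not real Unix epochs.
df["time_since_prev"] = df.groupby("machine_id")["timestamp"].diff()
```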

Feature names

What about calling the features "f1", "f2", etc. instead of "0", "1", etc.?

It would be especially useful for interpreting augmented features, e.g. "avg_20_13" => "avg_20_f13".
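A minimal sketch of the renaming, assuming the raw data is loaded into a pandas DataFrame before building the EventSet (the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv("sensors.csv")  # illustrative file name

# Prefix the numeric column names so derived features stay readable,
# e.g. "avg_20_13" becomes "avg_20_f13".
df = df.rename(columns={c: f"f{c}" for c in df.columns if c.isdigit()})
```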

"last_train_timestamp = int(len(X.get_arbitrary_index_data()) * 0.8)"

I would add a note saying that this only works if all the time series have the same length.
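A minimal sketch of a per-index cutoff that also works with unequal lengths, again using illustrative pandas column names rather than the tutorial's API:

```python
import pandas as pd

# Illustrative long-format data: one row per (machine_id, timestamp).
df = pd.DataFrame({
    "machine_id": ["m1"] * 10 + ["m2"] * 4,
    "timestamp": list(range(10)) + list(range(4)),
})

# One cutoff timestamp per index keeps the split at 80/20 even when the
# time series have different lengths.
cutoffs = df.groupby("machine_id")["timestamp"].quantile(0.8)
is_train = df["timestamp"] <= df["machine_id"].map(cutoffs)
train_df, test_df = df[is_train], df[~is_train]
```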

As we noticed previously, the dataset is very unbalanced.

Example weighting is generally used when the training algorithm is not able to train on unbalanced data, or to speed up training.
But if the dataset is small enough and the algorithm supports unbalanced data, it does not make much sense.

The issue is that accuracy is not a good metric for this problem. Instead, I would report ROC AUC (or even better, an AMOC [activity monitoring operating characteristic] curve).

An AMOC is better suited here because, when there is an anomaly, a technician is generally called. Therefore, two anomalies detected at almost the same time generally count as "one" (in terms of cost).
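A minimal sketch of reporting ROC AUC with scikit-learn; `model`, `X_test` and `y_test` are assumed to be the tutorial's fitted classifier and test split:

```python
from sklearn.metrics import roc_auc_score

# ROC AUC ranks the predicted anomaly probabilities instead of hard 0/1
# labels, so it is far less sensitive to class imbalance than accuracy.
y_score = model.predict_proba(X_test)[:, 1]  # probability of the anomaly class
print("ROC AUC:", roc_auc_score(y_test, y_score))
```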

Alongside the raw predictions and labels, it could be interesting to also plot the number of predictions / labels in the last N (e.g., N=100) timestamps.
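A minimal sketch of that rolling count, assuming `y_test` and `y_pred` are aligned 1D arrays of 0/1 labels and predictions:

```python
import pandas as pd
import matplotlib.pyplot as plt

N = 100  # window size suggested above

# Trailing count of positives over the last N timestamps.
labels_last_n = pd.Series(y_test).rolling(N, min_periods=1).sum()
preds_last_n = pd.Series(y_pred).rolling(N, min_periods=1).sum()

plt.plot(labels_last_n, label=f"labels in last {N}")
plt.plot(preds_last_n, label=f"predictions in last {N}")
plt.legend()
plt.show()
```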

Plots

39 plots is a bit too much. If possible, I would only plot a subset of features (e.g., 5).

Conclusion

Each time a new accuracy (or other metric) is computed, it would be great to compare it to previous runs.
Maybe we could have a simple utility function like:

all_metrics = {}

def save_metric(experiment_name, metric):
    # Store the metric and print everything collected so far.
    all_metrics[experiment_name] = metric
    for name, value in all_metrics.items():
        print(f"{name}: {value}")
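For example, each evaluation cell could end with something like save_metric("raw", balanced_accuracy), so every new result is printed next to the previous ones.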

Abnormality detection

The two common challenges of abnormality detection are that anomalies are generally rare, and that we generally don't know in advance what they look like (and two different anomalies rarely look the same).

You commented about that in the last section, but the fact that you use the training labels for abnormality detection is a bit counter-intuitive. It is unlikely you will get the same quality of results, but it would be interesting to actually also demonstrate this version (in a different colab, of course).

Collaborator Author

Thanks for all the feedback!

  • Improved comment about timestamps having no semantic meaning
  • Renamed features to f1, f2, etc.
  • Added comment about the cutoff strategy only working when the time series in each index have equal lengths
  • Added roc_auc score since it's easily available inside sklearn.
    • I'd argue that balanced accuracy is not a useless metric in this scenario, so I'd keep it. Note that I'm using the sample weights only for this purpose, not for training (MLPClassifier doesn't support sample weights like some tree-based models do); see the sketch after this list. Wdyt?
  • Reduced number of plots everywhere
  • Printed all metrics every time we get a new one
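For reference, a minimal sketch of how the sample weights can stay evaluation-only, with `y_test` and `y_pred` assumed from the tutorial:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.utils.class_weight import compute_sample_weight

# The balanced weights are derived from the labels and passed to the metric
# only; the classifier itself is trained without any sample weights.
weights = compute_sample_weight(class_weight="balanced", y=y_test)
print("weighted accuracy:", accuracy_score(y_test, y_pred, sample_weight=weights))
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
```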

And finally: I know training a supervised model for anomaly detection isn't the most common approach, but I felt it would still be a good example of handling this kind of data + getting us traffic from people searching for those terms. I gave the IsolationForest model a try with another dataset some days ago and the results were terrible, but I can definitely give it a try again with this one next week and create a new tutorial / a second part to this one!
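A minimal sketch of what that unsupervised variant could look like with scikit-learn's IsolationForest; `X_train` and `X_test` are assumed from the tutorial, and the contamination value is illustrative:

```python
from sklearn.ensemble import IsolationForest

# Fit on features only: no labels are used during training.
iso = IsolationForest(contamination=0.01, random_state=0)
iso.fit(X_train)

# predict() returns -1 for anomalies and 1 for normal points; map to 0/1.
y_pred_unsup = (iso.predict(X_test) == -1).astype(int)
```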

Collaborator

Thanks.

LGTM and a few more comments:

Unify string quoting: " or '.

classes = y_train.unique()
We know the classes are 0 and 1. Maybe we don't need to complicate the evaluation code to handle multiclass.

Note for another PR: I think the memory usage is not printed anymore when calling print(eventset).

"eval" is a reserved python function :).

Apart from the "raw" call, "eval" does not correctly show the name of the experiment.
Also, maybe the experiment names (e.g. "raw", "stats") could be more descriptive.

+1 for a classical abnormality detection in another tutorial

@ianspektor ianspektor merged commit 802d669 into main Aug 21, 2023
5 of 13 checks passed
@ianspektor ianspektor deleted the anomaly-detection-tutorial branch August 21, 2023 17:47