detection of outliers (anomaly detection) using umap - robust dimension reduction #42

Open
den-run-ai opened this issue Feb 13, 2018 · 5 comments

Comments

@den-run-ai

Can I use umap for anomaly detection? Is the dimensionality reduction tolerant of outliers in the dataset, or do they totally screw up the results?

More generally, I'm looking for a generalization of robust PCA to nonlinear cases:

https://en.wikipedia.org/wiki/Robust_principal_component_analysis

@lmcinnes
Owner

UMAP will tend to pull outliers in. It will find extreme outliers, but this is probably not the approach you want. I think the 'outlier' notions in this gist are more what you are after. Ultimately this is a sort of co-UMAP (reverse the arrows) for clustering, and dual co-UMAP for outlier detection. I haven't written code to do all of this efficiently yet, but it is on my todo list.

@den-run-ai
Author

@lmcinnes so essentially first pre-filter with hdbscan and then apply umap?

@lmcinnes
Owner

I think it really depends on what you are trying to do, but yes, something like that would bear similarities to Robust PCA. Then again, I think you really want some sort of regularized UMAP to do that properly. I would have to think about what that would mean/look like -- certainly an intriguing problem. Thanks for the ideas!

@den-run-ai
Author

Sometimes the outliers are so bad that it is hard to regularize them; just excluding them is easier. For example, that's why I like RANSAC regression more than regularizers for linear regression.
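
To illustrate the RANSAC point, here is a minimal sketch using scikit-learn. The synthetic data, the choice of `Ridge` as the stand-in "regularizer", and the `residual_threshold` value are all assumptions made for the example, not something taken from this thread.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, Ridge

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2x + 1 with small noise.
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.1, size=200)

# Corrupt every point with x > 8 so the outliers are gross, not mild.
is_outlier = X[:, 0] > 8.0
y[is_outlier] = -30.0

# A regularized fit still uses every point, so the outliers drag the slope.
ridge = Ridge(alpha=1.0).fit(X, y)

# RANSAC fits on random minimal subsets, keeps the largest consensus set,
# and refits on those inliers only; the outliers are simply excluded.
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0).fit(X, y)

ridge_slope = ridge.coef_[0]
ransac_slope = ransac.estimator_.coef_[0]
print(f"true slope 2.0 | ridge {ridge_slope:.2f} | ransac {ransac_slope:.2f}")
print("points kept by RANSAC:", int(ransac.inlier_mask_.sum()), "of", len(y))
```

The boolean `inlier_mask_` is the "just exclude them" step: it can be reused to drop the flagged rows before any later processing.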

@lmcinnes
Owner

That makes a lot of sense -- it certainly does depend on the data and your use case. In that case, filtering things out with hdbscan would probably work well.


2 participants