# Handling outliers

This document introduces the concept of outliers and a selection of methods for detecting them.

## Outlier overview
- What is an outlier? Broadly speaking, outliers are the data points that **do not** behave like the rest of the data.
- Examples of outlier detection usage:
  - **Outlier:**  Given a dataset (say data matrix $X$), find all observations ($x_i$'s) that do not belong.
  - **Novelty:**  Given a new observation $x'_i$, determine if it belongs to the known dataset.
- Applications:
  - Fraud detection
  - Symptomless disease detection
  - Epidemic onset detection
  - Sports analytics
  - Quality control
  - Network intrusion
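
The difference between the outlier and novelty tasks above can be illustrated with a minimal sketch. It assumes one-dimensional data and a robust z-score built from the median and the median absolute deviation (MAD); the toy values, the helper names (`fit_median_mad`, `robust_z`, `is_novel`), and the cut-off of 3.5 are illustrative assumptions, not part of the cited references.

```python
import statistics

THRESHOLD = 3.5  # assumed cut-off on the robust z-score


def fit_median_mad(values):
    """Return the median and the MAD scaled by 1.4826
    (comparable to a standard deviation for Gaussian data)."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    return center, 1.4826 * mad


def robust_z(value, center, scale):
    """Distance from the center in units of the scaled MAD."""
    return abs(value - center) / scale


# Outlier detection: the dataset itself may be contaminated,
# and we look for observations inside it that do not belong.
X = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 10.1, 9.7]
center, scale = fit_median_mad(X)
outliers = [x for x in X if robust_z(x, center, scale) > THRESHOLD]

# Novelty detection: the known dataset is assumed clean,
# and we ask whether a *new* observation belongs to it.
known = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7]
known_center, known_scale = fit_median_mad(known)


def is_novel(x_new):
    """True if x_new lies far from the known (clean) data."""
    return robust_z(x_new, known_center, known_scale) > THRESHOLD


print("outliers in X:", outliers)
print("is 10.3 novel?", is_novel(10.3))
print("is 12.0 novel?", is_novel(12.0))
```

The same scoring rule serves both tasks; what changes is whether the reference statistics are estimated from data that may contain anomalies (outlier detection) or from data assumed to be clean (novelty detection).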

### Outlier category
- Local anomalies: a data point that does not fit its local neighborhood
- Global anomalies: out-of-range / different from "normal" data
- Group anomalies: a cluster of points not belonging to the "normal" data
- *Dependency anomalies*

```{figure} ../img/anomalies.png
---
width: 90%
name: anomalies
---
Reproduced from Figure 2 of {cite:t}`lee2021gen`.
```
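
To make the local vs. global distinction concrete, here is a rough sketch on one-dimensional toy data. It assumes a k-nearest-neighbor distance as the "global" score and a simplified density ratio (loosely inspired by the Local Outlier Factor, but not the full algorithm) as the "local" score. The data, the function names, and both thresholds are hypothetical choices made for illustration, and the sketch assumes there are no duplicate points.

```python
def knn(data, i, k):
    """Indices of the k points nearest to data[i], excluding i itself."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: abs(data[j] - data[i]))[:k]


def knn_distance(data, i, k):
    """Mean distance from data[i] to its k nearest neighbors."""
    return sum(abs(data[j] - data[i]) for j in knn(data, i, k)) / k


def local_score(data, i, k):
    """Ratio of a point's k-NN distance to the average k-NN distance
    of its neighbors; values well above 1 suggest a local anomaly."""
    neighbor_avg = sum(knn_distance(data, j, k) for j in knn(data, i, k)) / k
    return knn_distance(data, i, k) / neighbor_avg


dense = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]           # tightly packed cluster
sparse = [100.0, 105.0, 110.0, 115.0, 120.0]     # loosely packed cluster
data = dense + sparse + [2.0, 500.0]             # 2.0: local, 500.0: global
K = 2

# Global view: absolute k-NN distance (threshold of 50 is an assumption).
global_flags = [x for i, x in enumerate(data) if knn_distance(data, i, K) > 50]

# Local view: distance relative to the neighborhood (threshold of 3 is an assumption).
local_flags = [x for i, x in enumerate(data) if local_score(data, i, K) > 3]

print("flagged by global score:", global_flags)
print("flagged by local score: ", local_flags)
```

In this toy setup the point at 2.0 is closer to its neighbors than any point of the sparse cluster is to its own, so an absolute distance threshold misses it; it only stands out once its distance is compared with the spacing of the dense cluster next to it.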

### Method category
Three general categories for outlier detection methods {cite:p}`han2022adbench`:
- Supervised: training data for both normal and abnormal cases are available.
  - i.e., a binary classification problem
  - e.g., readmission, fraud, simulation failure
- Semi-supervised: training data for only normal (or abnormal) cases are available.
  - i.e., learn what is "normal", and decide if unseen cases belong.
- Unsupervised: there is no knowledge of normal vs. abnormal cases.
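
The three settings above differ mainly in what information is available at training time. The sketch below shows this on one-dimensional toy data using deliberately simple detectors: a midpoint threshold for the supervised case, a 3-sigma rule for the semi-supervised case, and a median/MAD rule for the unsupervised case. These detectors, the data, and the thresholds are assumptions chosen for brevity; they are not the methods benchmarked in the cited work.

```python
import statistics

normal = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 10.3]
abnormal = [14.5, 15.2, 16.0]

# Supervised: labeled normal AND abnormal examples are available,
# so the task is binary classification (here: a midpoint threshold).
midpoint = (statistics.mean(normal) + statistics.mean(abnormal)) / 2


def supervised_predict(x):
    """True if x is classified as abnormal."""
    return x > midpoint


# Semi-supervised: only normal examples are available,
# so we learn what "normal" looks like and test membership.
mu = statistics.mean(normal)
sigma = statistics.stdev(normal)


def semi_supervised_predict(x):
    """True if x lies more than 3 standard deviations from the normal mean."""
    return abs(x - mu) > 3 * sigma


# Unsupervised: no labels at all; normal and abnormal cases are mixed,
# so we rely on statistics that tolerate contamination (median and MAD).
unlabeled = normal + abnormal
med = statistics.median(unlabeled)
mad = 1.4826 * statistics.median([abs(v - med) for v in unlabeled])
unsupervised_flags = [x for x in unlabeled if abs(x - med) / mad > 3.5]

print("supervised:     ", supervised_predict(13.0), supervised_predict(10.4))
print("semi-supervised:", semi_supervised_predict(13.0), semi_supervised_predict(10.4))
print("unsupervised:   ", unsupervised_flags)
```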

```{figure} ../img/adbench-methods.png
---
width: 90%
name: adbench
---
Reproduced from Figure 2 of {cite:t}`han2022adbench`.
```

### Key takeaways from {cite:t}`han2022adbench` (https://github.com/Minqi824/ADBench)
1. surprisingly, none of the benchmarked unsupervised algorithms is statistically better than others, emphasizing the importance of algorithm selection;
2. with merely 1% labeled anomalies, most semi-supervised methods can outperform the best unsupervised method, justifying the importance of supervision;
3. in controlled environments, we observe that best unsupervised methods for specific types of anomalies are even better than semi- and fully-supervised methods, revealing the necessity of understanding data characteristics;
4. semi-supervised methods show potential in achieving robustness in noisy and corrupted data, possibly due to their efficiency in using labels and feature selection.

Many more findings are reported in the paper.