# 1. Statistical approaches
An outlier is a data point that differs significantly from other observations. Outliers can cause serious problems in statistical analysis. Detecting outliers is more of an art than a science, so you need both quantitative and qualitative methods to identify them.

There is no single best rule for handling outliers. You need to ask yourself *Why are they outliers?* and *How can they affect your analysis?* In this section, we discuss how to detect and handle outliers using Pandas.

In [None]:
import numpy as np
import pandas as pd

#### Using z-score
Given a vector $x$, we calculate the z-score (denoted $z$) with the following formula:

$$z = \frac{x-\mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

This method eliminates values with $z<-3$ or $z>3$. Equivalently, you can remove values with $x<\mu-3\sigma$ or $x>\mu+3\sigma$, which gives the same result. The threshold of 3 can be changed to, for example, 2.5 or 3.5, depending on the problem.
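
As a quick check of the equivalence stated above, here is a minimal sketch on toy data. The values (the integers 1 to 20 plus a single extreme value of 200) and the names `mask_z` and `mask_x` are assumptions made for illustration only, not part of the wine dataset.

```python
import numpy as np

# Toy data (assumed for illustration): 1..20 plus one extreme value.
x = np.array(list(range(1, 21)) + [200], dtype=float)

mu = x.mean()
sigma = x.std()          # population standard deviation (ddof=0)
z = (x - mu) / sigma

# Rule expressed on the z-scores.
mask_z = np.abs(z) > 3
# Same rule expressed on the original scale.
mask_x = (x < mu - 3 * sigma) | (x > mu + 3 * sigma)

print(np.array_equal(mask_z, mask_x))
print(x[mask_z])
```

Both masks select the same single point, the value 200. Keep in mind that in very small samples no point can reach a large z-score, so the threshold of 3 is only meaningful when there are enough observations.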

In [None]:
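def outliers_zscore(array, z):
    'Return a new float array with the outliers replaced by NaN.'
    # np.array copies the input, so the original data is left unchanged
    array = np.array(array, dtype=float)
    # NaN-aware statistics: existing missing values would otherwise
    # turn both bounds into NaN and no outlier would be flagged
    mean = np.nanmean(array)
    std = np.nanstd(array)
    lower = mean - z*std
    upper = mean + z*std
    array[(array < lower) | (array > upper)] = np.nan
    return array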

In [None]:
wine = pd.read_excel(r'data\wine_quality.xlsx')
wine.head()

In [None]:
# handling outliers for all columns
for i in wine.columns:
    wine[i] = outliers_zscore(wine[i], z=3)

In [None]:
pd.DataFrame({
    'removed_count': wine.isna().sum(),
    'removed_rate': (wine.isna().sum() / wine.shape[0]).apply(lambda x: f'{x:.2%}')
})
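
Once outliers have been replaced with NaN, they still have to be handled. The sketch below shows two common options on a small hypothetical DataFrame (the columns `a` and `b` and their values are made up for illustration): dropping every row that contains a NaN, or filling each NaN with the column median. Which one is appropriate depends on why the values were outliers in the first place.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which outliers were already replaced with NaN.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [10.0, 20.0, np.nan, 40.0],
})

# Option 1: drop every row that contains at least one NaN.
dropped = df.dropna()

# Option 2: keep all rows and fill NaN with the column median
# (DataFrame.median skips NaN by default).
imputed = df.fillna(df.median())

print(dropped)
print(imputed)
```

Dropping rows loses the valid values in the other columns of the same row, while imputing keeps the sample size but shrinks the spread of the column.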

#### Using interquartile range
This method eliminates values that are lower than $Q_1-1.5\times \mbox{IQR}$ or higher than $Q_3+1.5\times \mbox{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles and $\mbox{IQR}=Q_3-Q_1$ is the interquartile range.
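
To make the two bounds concrete, here is a small worked sketch on a toy Series. The data (the integers 1 to 20 plus one value of 200) and the variable names are assumptions for illustration; the quartiles use the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Toy data (assumed for illustration): 1..20 plus one extreme value.
s = pd.Series(list(range(1, 21)) + [200], dtype=float)

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] are treated as outliers.
flagged = s[(s < lower) | (s > upper)]

print(q1, q3, iqr)
print(lower, upper)
print(flagged.tolist())
```

Here $Q_1=6$ and $Q_3=16$, so $\mbox{IQR}=10$ and the bounds are $-9$ and $31$; only the value 200 falls outside them. Because the quartiles barely move when an extreme value is added, these bounds are less sensitive to the outliers themselves than the mean and standard deviation are.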

In [None]:
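def outliers_iqr(array):
    'Return a new float array with the outliers replaced by NaN.'
    # np.array copies the input, so the original data is left unchanged
    array = np.array(array, dtype=float)
    # NaN-aware quartiles, so existing missing values do not break the bounds
    Q1, Q3 = np.nanquantile(array, [0.25, 0.75])
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    array[(array < lower) | (array > upper)] = np.nan
    return array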

In [None]:
wine = pd.read_excel(r'data\wine_quality.xlsx')
wine.head()

In [None]:
# handling outliers for all columns
for i in wine.columns:
    wine[i] = outliers_iqr(wine[i])

In [None]:
pd.DataFrame({
    'removed_count': wine.isna().sum(),
    'removed_rate': (wine.isna().sum() / wine.shape[0]).apply(lambda x: f'{x:.2%}')
})

#### Logarithmic transformation
Another strategy for handling outliers is to apply a log transformation to the data, which dampens the effect of outliers.

In [None]:
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 10, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style='whitegrid')

fig, ax = plt.subplots(ncols=2, sharex=True, figsize=(15,4))
sns.regplot(x=x, y=y, ax=ax[0]).set_title('Effect of outliers')
sns.regplot(x=x, y=np.log(y), ax=ax[1]).set_title('Effect of log-transformed outliers')
plt.axis('equal')
plt.show()
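
The dampening effect can also be seen numerically. The following sketch uses assumed toy data (the integers 1 to 20 plus one value of 200) and a hypothetical helper `zscores` to compare the z-score of the extreme point before and after taking the natural logarithm. Note that the logarithm is only defined for strictly positive values; for data containing zeros, `np.log1p` is a common substitute.

```python
import numpy as np

def zscores(values):
    'Hypothetical helper: z-scores using the population standard deviation.'
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Toy data (assumed for illustration): 1..20 plus one extreme value.
raw = np.array(list(range(1, 21)) + [200], dtype=float)

z_raw = zscores(raw)
z_log = zscores(np.log(raw))   # requires raw > 0

# z-score of the extreme point (the last element) on both scales
print(z_raw[-1])
print(z_log[-1])
```

The extreme point stays the largest value after the transformation, but its z-score is clearly smaller on the log scale, so it pulls the mean and any fitted line less strongly.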

# 2. Machine Learning methods

## 2.1. DBSCAN

## 2.2. Isolation Forest

## 2.3. Local Outlier Factor

## 2.4. Minimum Covariance Determinant

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*