# Handling outliers

Sources:
- [Detecting and Handling Outliers Properly](https://medium.com/analytics-vidhya/detect-and-handling-outliers-53723d8ec17a)
- [Outliers in Data: How to Find and Deal with Them in Statistics](https://cxl.com/blog/outliers/)
- [Detecting and Treating Outliers | Treating the odd one out!](https://www.analyticsvidhya.com/blog/2021/05/detecting-and-treating-outliers-treating-the-odd-one-out/)

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

**An outlier is a data point that is significantly different from the remaining data.** 
- **Statistical parameters** such as the mean and variance are **sensitive** to outliers. 
- Outliers may also affect the performance of some machine learning models, such as **linear regression or AdaBoost**.
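
To make the sensitivity of the mean and variance concrete, here is a minimal sketch on a made-up sample (the numbers are hypothetical and not taken from any dataset used in this notebook). It compares the mean, median, and sample variance before and after a single extreme value is appended.

```python
import numpy as np

# Hypothetical sample: five typical values, then the same sample plus one extreme value
values = np.array([10.0, 11.0, 9.0, 10.0, 10.0])
with_outlier = np.append(values, 100.0)

# The mean and variance use every value, so one extreme point shifts them a lot
mean_clean, mean_out = values.mean(), with_outlier.mean()
var_clean, var_out = values.var(ddof=1), with_outlier.var(ddof=1)

# The median depends only on the middle of the sorted sample
median_clean, median_out = np.median(values), np.median(with_outlier)

print(f"mean:     {mean_clean:.2f} -> {mean_out:.2f}")
print(f"median:   {median_clean:.2f} -> {median_out:.2f}")
print(f"variance: {var_clean:.2f} -> {var_out:.2f}")
```

In this sample the mean moves from 10 to 25 while the median stays at 10, which is why rank-based statistics are often preferred when outliers are suspected.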



## Trimming outliers from the dataset

**Trimming, or truncating, is the process of removing the observations that contain outliers in one
or more variables of the dataset.**

There are three commonly used methods to set the boundaries beyond which a value can be considered an outlier:
- If the variable is **normally distributed**, the boundaries are the mean plus or minus three times the standard deviation, since approximately 99.7% of the data lie within those limits.
- For **any distribution, normal or not**, the boundaries can be set with the inter-quartile range (IQR) proximity rule: the 25th percentile minus, and the 75th percentile plus, a multiple (typically 1.5) of the IQR.
- Alternatively, again for any distribution, the boundaries can be set directly at the 5th and 95th percentiles.
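
The three rules can be compared side by side. The following is a minimal sketch on synthetic, roughly normal data; the series `s`, its parameters, and all variable names are assumptions made for illustration, not part of the dataset used below.

```python
import numpy as np
import pandas as pd

# Synthetic, roughly normal sample (hypothetical; not the Boston data)
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=6.0, scale=0.7, size=500), name="x")

# Rule 1: mean plus or minus three standard deviations
mean, std = s.mean(), s.std()
gauss_lower, gauss_upper = mean - 3 * std, mean + 3 * std

# Rule 2: inter-quartile range proximity rule with a distance of 1.5
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_lower, iqr_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rule 3: the 5th and 95th percentiles used directly as limits
pct_lower, pct_upper = s.quantile(0.05), s.quantile(0.95)

for name, lo, hi in [("mean/std", gauss_lower, gauss_upper),
                     ("IQR", iqr_lower, iqr_upper),
                     ("percentile", pct_lower, pct_upper)]:
    share = s.between(lo, hi).mean()  # fraction of values inside the limits
    print(f"{name:>10}: [{lo:.2f}, {hi:.2f}] keeps {share:.1%} of the values")
```

Note that the percentile rule always flags about 10% of the observations by construction, whether or not they are unusual, so it is the most aggressive of the three.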

In [None]:
boston = pd.read_csv("data/boston.csv")
boston.head()

In [None]:
# The outliers are the points drawn beyond the whiskers; the whiskers mark the limits
# of the inter-quartile range proximity rule (1.5 x IQR by default):
sns.boxplot(x=boston['RM'])
plt.show()

In [None]:
def find_boundaries(df, variable, distance, rule="inter-quartile"):
    """Return (upper_boundary, lower_boundary) for df[variable] under the given rule."""
    if rule == "inter-quartile":
        # IQR proximity rule: quartiles minus/plus distance * IQR
        q1, q3 = df[variable].quantile(0.25), df[variable].quantile(0.75)
        IQR = q3 - q1
        lower_boundary = q1 - (IQR * distance)
        upper_boundary = q3 + (IQR * distance)
    elif rule == "meanstddev":
        # Gaussian rule: mean minus/plus distance * standard deviation
        mean, std = df[variable].mean(), df[variable].std()
        lower_boundary = mean - (std * distance)
        upper_boundary = mean + (std * distance)
    else:
        raise ValueError(f"Unknown rule: {rule!r}")
    # Note the return order: upper boundary first, then lower
    return upper_boundary, lower_boundary

In [None]:
RM_upper_limit, RM_lower_limit = find_boundaries(boston, 'RM', 1.5)

In [None]:
# Let's print those limits beyond which we will consider a value an outlier:
RM_upper_limit, RM_lower_limit

In [None]:
# Flag the rows whose RM value falls outside the limits, then keep the remaining rows:
outliers_RM = (boston['RM'] > RM_upper_limit) | (boston['RM'] < RM_lower_limit)
boston_trimmed = boston.loc[~outliers_RM]

In [None]:
# Trimming changes the quartiles, so a few points may still appear beyond the new whiskers:
sns.boxplot(x=boston_trimmed['RM'])
plt.show()

In [None]:
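# The same limits under the mean and standard deviation rule (this overwrites the IQR-based limits):
RM_upper_limit, RM_lower_limit = find_boundaries(boston, 'RM', 3, rule="meanstddev")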
RM_upper_limit, RM_lower_limit
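
Limits from the mean and standard deviation rule can be used for trimming in the same way as the IQR-based ones. The sketch below is self-contained and runs on a synthetic stand-in column named `RM` (an assumption for illustration: 200 roughly normal values plus two injected extremes, not the Boston data), so the row counts it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a numeric column (NOT the Boston data):
# 200 roughly normal values plus two injected extreme values
rng = np.random.default_rng(0)
df = pd.DataFrame({"RM": np.append(rng.normal(6.0, 0.7, size=200), [20.0, -10.0])})

# Mean and standard deviation rule with a distance of 3
mean, std = df["RM"].mean(), df["RM"].std()
lower_limit, upper_limit = mean - 3 * std, mean + 3 * std

# Keep only the rows whose value lies inside the limits (inclusive)
inside = df["RM"].between(lower_limit, upper_limit)
df_trimmed = df.loc[inside]

print(f"limits: [{lower_limit:.2f}, {upper_limit:.2f}]")
print(f"rows before: {len(df)}, rows after: {len(df_trimmed)}")
```

One caveat with this rule is that extreme values inflate the standard deviation themselves, which widens the limits; on small samples this can hide the very outliers being looked for.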