Often machine learning modeling and model skill in general can be improved by understanding and even removing these outlier values.

You will discover more about outliers and two statistical methods that you can use to identify and filter outliers from your dataset.

- An outlier is an unlikely observation in a dataset and may have one of many causes.
- Standard deviation can be used to identify outliers in Gaussian or Gaussian-like data.
- The interquartile range can be used to identify outliers in data regardless of the distribution.

Outliers can have many causes, such as:
- Measurement or input error.
- Data corruption.
- True outlier observation (e.g. Michael Jordan in basketball).

There is no precise way to define and identify outliers in general because of the specifics of each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not.

We can use statistical methods to identify observations that appear to be rare or unlikely given the available data. This does not mean that the values identified are necessarily outliers that should be removed. But the tools described in this tutorial can be helpful in shedding light on rare events that may require a second look.


## Standard Deviation Method

We can calculate the mean and standard deviation of a given sample, then set the cut-off for identifying outliers at more than 3 standard deviations from the mean. For Gaussian data, about 99.7% of values fall within 3 standard deviations of the mean.

The complete example is listed below.

```python
# identify outliers with standard deviation
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate summary statistics
data_mean, data_std = mean(data), std(data)
# calculate the outlier cut-off
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
```

```
Identified outliers: 29
Non-outlier observations: 9971
```


Running the example first prints the number of identified outliers and then the number of observations that are not outliers, demonstrating how to identify and filter out outliers respectively.
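As a rough sanity check on that count, we can work out how many values a perfectly Gaussian sample of the same size would be expected to place beyond 3 standard deviations. The sketch below is only illustrative: it assumes an ideal Gaussian and uses the standard identity that the fraction within k standard deviations is erf(k / sqrt(2)). The variable names are chosen for this example and are not part of the listing above.

```python
# expected number of values beyond 3 standard deviations (ideal Gaussian assumed)
from math import erf
from math import sqrt

n_samples = 10000  # same sample size as the example above
n_std = 3.0        # cut-off in standard deviations

# fraction of a Gaussian that lies within n_std standard deviations of the mean
inside = erf(n_std / sqrt(2.0))
# expected count that falls outside the cut-off
expected_outliers = n_samples * (1.0 - inside)

print('Fraction inside: %.4f' % inside)
print('Expected outliers: %.1f' % expected_outliers)
```

This should give an expected count of roughly 27, so the 29 values flagged in the example are about what a 3 standard deviation cut-off would produce on clean Gaussian data. The flagged values are therefore candidates for review rather than confirmed errors.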

## Interquartile Range Method

A good statistic for summarizing a sample of data that does not have a Gaussian distribution is the Interquartile Range, or IQR for short.

The IQR is calculated as the difference between the 75th and the 25th percentiles of the data and defines the box in a box and whisker plot.

The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers or "far outs" when described in the context of box and whisker plots.
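To make the role of the factor k concrete, here is a minimal sketch on a small hand-made sample. The helper name `iqr_bounds` and the sample values are assumptions made up for this illustration. The sketch relies on the default linear interpolation of `numpy.percentile`.

```python
# compare IQR limits for k=1.5 and k=3 on a small, made-up sample
from numpy import array
from numpy import percentile

def iqr_bounds(values, k=1.5):
    # limits that sit k * IQR below the 25th and above the 75th percentile
    q25, q75 = percentile(values, [25, 75])
    cut_off = k * (q75 - q25)
    return q25 - cut_off, q75 + cut_off

# hypothetical observations with one mild and one extreme value
sample = array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40])

for k in (1.5, 3.0):
    lower, upper = iqr_bounds(sample, k)
    # keep only the values that fall outside the limits
    flagged = sample[(sample < lower) | (sample > upper)]
    print('k=%.1f: limits=(%.1f, %.1f), flagged=%s' % (k, lower, upper, flagged.tolist()))
```

With this sample the 25th and 75th percentiles are 3.5 and 8.5, so the IQR is 5. A factor of 1.5 gives limits of -4 and 16 and flags both 20 and 40, whereas a factor of 3 widens the limits to -11.5 and 23.5 and flags only 40.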

The complete example is listed below.

```python
# identify outliers with interquartile range
from numpy.random import seed
from numpy.random import randn
from numpy import percentile
# seed the random number generator
seed(3)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))
# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
```

```
Percentiles: 25th=46.501, 75th=53.232, IQR=6.731
Identified outliers: 71
Non-outlier observations: 9929
```


Running the example first prints the identified 25th and 75th percentiles and the calculated IQR. The number of outliers identified is printed, followed by the number of non-outlier observations.
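The IQR method flags more values here than the 3 standard deviation cut-off did, and it can help to see why. The sketch below assumes an ideal Gaussian and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) to express the k=1.5 limits in units of standard deviation. The variable names are illustrative only.

```python
# express the k=1.5 IQR limits in standard deviations (ideal Gaussian assumed)
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
k = 1.5

# 75th percentile of the standard Gaussian; the 25th is its negative by symmetry
q75 = std_normal.inv_cdf(0.75)
iqr = 2.0 * q75
# upper limit measured in standard deviations from the mean
upper = q75 + k * iqr
# fraction expected beyond either limit
outside = 2.0 * (1.0 - std_normal.cdf(upper))

print('Upper limit: %.3f standard deviations' % upper)
print('Expected fraction outside: %.4f' % outside)
print('Expected outliers in 10,000: %.0f' % (10000 * outside))
```

For Gaussian data the k=1.5 limits sit at roughly 2.7 standard deviations from the mean, which is tighter than a cut-off of 3, so about 0.7% of values are expected to fall outside. That is in line with the 71 values flagged above.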

The approach can be used for multivariate data by calculating the limits on each variable in the dataset in turn, and taking outliers as observations that fall outside of the rectangle or hyper-rectangle.
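A minimal sketch of the multivariate case might look like the following. The dataset is made up for illustration: three Gaussian variables on different scales, with one row deliberately made extreme in its second variable. It uses NumPy's `default_rng` generator and array operations instead of the list comprehensions used earlier.

```python
# IQR limits per variable; a row is an outlier if any variable is outside its limits
import numpy as np

# hypothetical dataset: 1,000 rows, 3 variables with different means and spreads
rng = np.random.default_rng(1)
data = rng.normal(loc=[50.0, 0.0, 100.0], scale=[5.0, 1.0, 20.0], size=(1000, 3))
# plant one row that is extreme in the second variable only
data[0] = [50.0, 10.0, 100.0]

# calculate the 25th and 75th percentiles of each column
q25, q75 = np.percentile(data, [25, 75], axis=0)
cut_off = 1.5 * (q75 - q25)
lower, upper = q25 - cut_off, q75 + cut_off

# a row is inside the hyper-rectangle only if every variable is within its limits
inside = np.all((data >= lower) & (data <= upper), axis=1)
outliers = data[~inside]
outliers_removed = data[inside]

print('Identified outliers: %d' % len(outliers))
print('Non-outlier observations: %d' % len(outliers_removed))
```

Note that a row is dropped as soon as one of its variables is out of range, so the share of flagged rows tends to grow with the number of variables.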