
# **IDENTIFY OUTLIERS IN DATA**

Phuong Nguyen @ [DataScience.vn](https://datascience.vn)


---

In [0]:
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
from numpy import percentile

In [0]:
# Seed the random number generator so the results are reproducible
seed(1)
# randn(n) returns n samples from the standard normal distribution N(0, 1)
# For samples from N(mu, sigma^2), use: sigma * randn(n) + mu
# Here: 10,000 samples with mean 50 and standard deviation 5
data = 5 * randn(10000) + 50
data
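
The scaling rule in the comment above (`sigma * randn(n) + mu`) can also be written with NumPy's newer `Generator` interface. This is a minimal sketch and not part of the original notebook: the names `rng`, `data_alt` and `data_scaled` are illustrative, and because `default_rng` uses a different bit generator than the legacy `seed`/`randn` functions, it produces different numbers from the cell above even with the same seed.

```python
import numpy as np

# Hypothetical alternative to seed(1) + randn(): a local, seeded generator
rng = np.random.default_rng(1)

mu, sigma = 50, 5

# normal(loc, scale, size) draws directly from N(mu, sigma^2)
data_alt = rng.normal(loc=mu, scale=sigma, size=10000)

# Same distribution, built by scaling and shifting standard normal samples
data_scaled = sigma * rng.standard_normal(10000) + mu

print(data_alt.shape, data_scaled.shape)
```

A local generator keeps the random state out of global scope, which may be easier to reason about when several cells draw random numbers.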

In [0]:
# Calculate summary statistics (mean and standard deviation)
# In the format string, each % marks a placeholder; .3f formats the value as a float rounded to 3 decimal places.
# String formatting operations: https://docs.python.org/2/library/stdtypes.html#string-formatting
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))
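
For comparison, the same `.3f` format specification also works inside an f-string (Python 3.6+). The sketch below uses a small made-up array, `values`, purely for illustration; it is not the notebook's data.

```python
import numpy as np

# Illustrative data only; any numeric sequence would do
values = np.array([48.2, 51.7, 50.1, 49.4])
m, s = np.mean(values), np.std(values)

# printf-style formatting, as used in the notebook
old_style = 'mean=%.3f stdv=%.3f' % (m, s)

# Equivalent f-string: the format spec follows the colon
new_style = f'mean={m:.3f} stdv={s:.3f}'

print(old_style)
print(new_style)
```

Both lines should print the same text; which style to use is largely a matter of preference.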

# Standard Deviation Method

In [0]:
# Calculate the outlier cut-offs: 3 standard deviations on either side of the mean
data_mean, data_std = mean(data), std(data)
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

In [0]:
# Identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))

In [0]:
# Remove outliers
outliers_removed = [x for x in data if lower <= x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
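
The list comprehensions above can also be expressed with NumPy boolean masks, which avoids a Python-level loop over the array. The following is a sketch under stated assumptions rather than the notebook's own method: the helper `std_bounds` and its parameter `k` are hypothetical names, and the sample is regenerated with a local `RandomState` so the block is self-contained.

```python
import numpy as np

def std_bounds(values, k=3.0):
    """Return (lower, upper) cut-offs at k standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    centre = values.mean()
    cut_off = k * values.std()
    return centre - cut_off, centre + cut_off

# Regenerate a Gaussian sample (mean 50, standard deviation 5)
rng = np.random.RandomState(1)
data = 5 * rng.randn(10000) + 50

lower, upper = std_bounds(data)

# Boolean mask: True where a value falls outside the cut-offs
is_outlier = (data < lower) | (data > upper)
outliers = data[is_outlier]
outliers_removed = data[~is_outlier]

print('Identified outliers: %d' % outliers.size)
print('Non-outlier observations: %d' % outliers_removed.size)
```

For Gaussian data, about 0.27% of values are expected to lie more than 3 standard deviations from the mean, so a sample of 10,000 should yield a count somewhere in the neighbourhood of 27.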

# Interquartile Range (IQR) Method

In [0]:
# Calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))

In [0]:
# Calculate the outlier cut-offs: 1.5 times the IQR below the 25th and above the 75th percentile
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
# Identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))

In [0]:
# Remove outliers
outliers_removed = [x for x in data if lower <= x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
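
The IQR steps above can be wrapped in a small reusable function. This is a minimal sketch, not the notebook's own code: `iqr_bounds` and its parameter `k` are hypothetical names, and the sample is regenerated locally so the block runs on its own.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) cut-offs at k * IQR beyond the quartiles."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = k * (q75 - q25)
    return q25 - cut_off, q75 + cut_off

# Regenerate a Gaussian sample (mean 50, standard deviation 5)
rng = np.random.RandomState(1)
data = 5 * rng.randn(10000) + 50

lower, upper = iqr_bounds(data)

# Boolean mask: True where a value lies inside the cut-offs
is_inlier = (data >= lower) & (data <= upper)
outliers = data[~is_inlier]
outliers_removed = data[is_inlier]

print('Identified outliers: %d' % outliers.size)
print('Non-outlier observations: %d' % outliers_removed.size)
```

For Gaussian data, cut-offs at 1.5 times the IQR sit at roughly 2.7 standard deviations from the mean, so this method will usually flag somewhat more observations than the 3-standard-deviation rule. Unlike the mean and standard deviation, the quartiles are not pulled by extreme values, which is one reason the IQR method is often preferred for non-Gaussian data.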

---
**Reference**

Brownlee, J. (2018). How to Use Statistics to Identify Outliers in Data. *Machine Learning Mastery*, April 25, 2018. URL: https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/