# What is an outlier?
An outlier is a data point that lies far from the other observations in a dataset, outside its overall distribution.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# What are the impacts of having outliers in a dataset?

1. Outliers can distort statistical analyses and the conclusions drawn from them

2. They can shift the mean and inflate the standard deviation considerably, while the median is much less affected

# Various ways of finding outliers

1. Using scatter plots
2. Using box plots
3. Using the Z score
4. Using the interquartile range (IQR)

In [23]:
dataset= [11,10,12,14,12,15,14,13,15,100,120,140,17,19,107, 10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]
len(dataset)

34
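
The first two methods in the list above, the scatter plot and the box plot, are not demonstrated elsewhere in this notebook. A minimal sketch, assuming the same `dataset` (repeated here so the cell runs on its own); the helper name `plot_outlier_views` and the figure layout are illustrative choices, not part of the original:

```python
import matplotlib.pyplot as plt

# Same values as the notebook's dataset, repeated so this cell is self-contained
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 100, 120, 140, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

def plot_outlier_views(data):
    """Draw a scatter plot (value vs. position) and a box plot side by side."""
    fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Scatter plot: points far above the main band of values stand out visually
    ax_scatter.scatter(range(len(data)), data)
    ax_scatter.set_xlabel("index")
    ax_scatter.set_ylabel("value")
    ax_scatter.set_title("Scatter plot")

    # Box plot: by default matplotlib draws points beyond 1.5 * IQR from the box as fliers
    ax_box.boxplot(data)
    ax_box.set_ylabel("value")
    ax_box.set_title("Box plot")
    return fig

fig = plot_outlier_views(dataset)
```

Both views are only visual checks; the Z score and IQR sections below turn the same idea into explicit numeric rules.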

In [24]:
np.mean(dataset)

28.0

In [25]:
np.median(dataset)

13.5

In [26]:
np.std(dataset)

36.57948637596504
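
To see how much the few large values drive these statistics, the same measures can be recomputed without them. This is only a sketch: the cut-off of 100 used to set the large values aside is an assumption chosen by eye for this dataset, not a general rule.

```python
import numpy as np

# Same values as the notebook's dataset, repeated so this cell is self-contained
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 100, 120, 140, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

# Assumption: treat values of 100 or more as the suspicious points (chosen by eye)
trimmed = [x for x in dataset if x < 100]

for label, data in [("all values", dataset), ("without values >= 100", trimmed)]:
    print(f"{label}: mean={np.mean(data):.2f}, "
          f"median={np.median(data):.2f}, std={np.std(data):.2f}")
```

The mean drops from 28.0 to 13.0 once the five large values are set aside, while the median only moves from 13.5 to 13.0.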

# Detecting outlier using Z score

The Z score, also called the standard score, tells whether a data value is above or below the mean and how far from the mean it is. More specifically, it measures how many standard deviations a data point lies from the mean.

Z score = (x - mean) / standard deviation

For normally distributed data, approximately

68% of the data points lie within 1 standard deviation of the mean.

95% of the data points lie within 2 standard deviations of the mean.

99.7% of the data points lie within 3 standard deviations of the mean.
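
These percentages can be checked empirically. The sketch below draws simulated standard normal samples (the sample size and seed are arbitrary choices, not part of the original) and measures the fraction that falls within 1, 2 and 3 standard deviations:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Standardize the samples, then count how many fall within k standard deviations
z = (samples - samples.mean()) / samples.std()
fractions = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}

for k, frac in fractions.items():
    print(f"within {k} standard deviation(s): {frac:.1%}")
```

The rule only describes data that is roughly normal; a dataset with a cluster of extreme values, like the one in this notebook, will not follow it closely.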

In symbols, for an observation x, mean μ and standard deviation σ:

z = (x - μ) / σ
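
As a worked example, the formula can be applied by hand to a few values, using the mean and standard deviation computed for `dataset` earlier in the notebook. The helper name `z_score` is illustrative:

```python
# Mean and standard deviation taken from the np.mean / np.std outputs above
mean = 28.0
std = 36.57948637596504

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

for x in (10, 120, 140):
    print(f"x={x}: z={z_score(x, mean, std):.2f}")
```

The value 140 lands just over 3 standard deviations above the mean, 120 lands at roughly 2.5, and 10 lies slightly below the mean with a negative score.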

In [27]:
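def detect_outliers(data, threshold=3):
    """Return the values whose absolute Z score exceeds `threshold`."""
    # Build the list inside the function so repeated calls do not accumulate results
    outliers = []
    mean = np.mean(data)
    std = np.std(data)

    for value in data:
        z_score = (value - mean) / std
        # Flag points more than `threshold` standard deviations from the mean
        if np.abs(z_score) > threshold:
            outliers.append(value)
    return outliers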

In [28]:
outlier_pt=detect_outliers(dataset)
outlier_pt

[140]

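# Interquartile Range

The interquartile range (IQR) is the difference between the 75th percentile (third quartile, q3) and the 25th percentile (first quartile, q1) of a dataset.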

Steps

1. Arrange the data in increasing order

2. Calculate the first quartile (q1) and the third quartile (q3)

3. Find the interquartile range: IQR = q3 - q1

4. Find the lower bound: q1 - 1.5 * IQR

5. Find the upper bound: q3 + 1.5 * IQR

Any value that lies below the lower bound or above the upper bound is treated as an outlier.
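
The steps above can be collected into one small helper. This is a sketch rather than the notebook's own code: the name `iqr_bounds` and the parameter `k` (the 1.5 multiplier) are illustrative. No explicit sort is needed because `np.percentile` orders the data internally.

```python
import numpy as np

def iqr_bounds(data, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule."""
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1                           # interquartile range
    return q1 - k * iqr, q3 + k * iqr

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 100, 120, 140, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

lower, upper = iqr_bounds(dataset)
print(lower, upper)
```

The cells that follow walk through the same calculation one step at a time.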

In [29]:
## Perform all the steps of IQR
#1. Arrange data in increasing order
sorted(dataset)

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 100,
 107,
 108,
 120,
 140]

In [30]:
#2.Calculate q1 and q3
quantile1, quantile3= np.percentile(dataset,[25,75])
print(quantile1, quantile3)

12.0 15.0


In [31]:
## 3. Find the IQR

iqr_value=quantile3-quantile1
print(iqr_value)

3.0


In [32]:
## 4 and 5. Find the lower bound and the upper bound

lower_bound_val = quantile1 -(1.5 * iqr_value) 
upper_bound_val = quantile3 +(1.5 * iqr_value)
print(lower_bound_val,upper_bound_val)

7.5 19.5


In [33]:
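# Locate the outliers; convert to an array first, since a list cannot be compared with a number
data_arr = np.asarray(dataset)
outlier_idx = np.where((data_arr < lower_bound_val) | (data_arr > upper_bound_val))
outlier_idx  # index positions of the outliers, not the values themselves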

(array([ 9, 10, 11, 14, 20], dtype=int64),)
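
The `np.where` result holds positions in the dataset, not the outlying values. A short sketch of how the values, and a copy of the data without them, might be recovered with a boolean mask; the bounds are hard-coded from the output above and the variable names are illustrative:

```python
import numpy as np

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 100, 120, 140, 17, 19, 107, 10, 13,
           12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]
lower_bound_val, upper_bound_val = 7.5, 19.5  # bounds printed in the earlier cell

data_arr = np.asarray(dataset)
is_outlier = (data_arr < lower_bound_val) | (data_arr > upper_bound_val)

outlier_values = data_arr[is_outlier]   # the flagged values, in original order
cleaned = data_arr[~is_outlier]         # everything inside the bounds

print(outlier_values)
print(len(cleaned))
```

Unlike the Z score check, the IQR rule flags all five large values (100, 120, 140, 107, 108), because the quartiles are barely affected by them.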