# Identify Outliers
An outlier is a data point that lies far away from the other values in the dataset.

# Criteria to identify the outliers
    1. A data point falls more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
    2. A data point falls more than 3 standard deviations from the mean, i.e. its z-score has an absolute value greater than 3. A lower cutoff of 2 is sometimes used instead.

# Reasons for existence of outliers
    1. Variability in data
    2. Error in measurement

# Ways to find outliers
    1. Using scatter plot
    2. Using box plot
    3. Using z-score
    4. Using the IQR (interquartile range)

# 1. Using scatter plot
![](dg_scatter_plot.png)
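
The notebook does not show how the figure above was produced. A minimal sketch of a comparable plot, assuming matplotlib is installed and reusing the dataset built in the z-score section; plotting each value against its array index is an assumption about the axes, not something the notebook states:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same dataset as in the z-score section: 100 values in [900, 1000) plus four small values
np.random.seed(13)
dataset = np.append(np.array([1, 2, 3, 6]), np.random.randint(900, 1000, 100))

# Plot each value against its position in the array (assumed axes)
fig, ax = plt.subplots()
ax.scatter(np.arange(len(dataset)), dataset)
ax.set_xlabel("index")
ax.set_ylabel("value")
ax.set_title("Scatter plot of the dataset")
```

In a notebook the figure is rendered at the end of the cell. Points that sit far from the main cloud, here the four values near zero, are the candidates for outliers.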

# 2. Using box plot
![](box_plot.png)
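
A box plot similar to the one above could be drawn as follows. This is a sketch assuming matplotlib is installed; it reuses the dataset from the z-score section and relies on the default whisker length of 1.5 times the IQR, so the points drawn beyond the whiskers ("fliers") are the outlier candidates:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same dataset as in the z-score section
np.random.seed(13)
dataset = np.append(np.array([1, 2, 3, 6]), np.random.randint(900, 1000, 100))

fig, ax = plt.subplots()
result = ax.boxplot(dataset)  # default whis=1.5, i.e. whiskers reach at most 1.5 * IQR
ax.set_ylabel("value")

# The points drawn beyond the whiskers are available from the returned dict
flier_values = sorted(int(v) for v in result["fliers"][0].get_ydata())
print(flier_values)
```

Reading the fliers back from the plot gives the same candidates that the IQR rule identifies numerically.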

# 3. Using z-score
    Ref: https://www.math.ubc.ca/~pwalls/math-python/jupyter/latex/
    Formula: z-score = (Observation - mean)/standard deviation
$$z = \frac{(X_i - \mu)}{\sigma}$$


In [2]:
import numpy as np

In [44]:
np.random.seed(13)
dataset = np.random.randint(900,1000,100)
dataset = np.append(np.array([1,2,3,6]),dataset)

# Finding the Outliers



In [46]:
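def find_outliers(data):
    # Create the result list inside the function so repeated calls do not accumulate values
    outliers = []
    threshold = 3  # flag points more than 3 standard deviations from the mean
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (NumPy default, ddof=0)

    for n in data:
        z_score = (n - mean) / std
        #print(n,z_score,sep=':')
        if np.abs(z_score) > threshold:
            outliers.append(n)
    return outliers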
    

In [47]:
#dataset = np.array([2,25,26,27,28,29,34,35])
find_outliers(dataset)

[1, 2, 3, 6]
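
The same check can be written without an explicit loop by using a NumPy boolean mask. A minimal sketch, where `find_outliers_vectorized` is a hypothetical name and the threshold of 3 mirrors the function above:

```python
import numpy as np

def find_outliers_vectorized(data, threshold=3):
    """Return the values whose absolute z-score exceeds `threshold`."""
    data = np.asarray(data)
    # Population standard deviation (ddof=0), the same default np.std uses
    z_scores = (data - data.mean()) / data.std()
    return data[np.abs(z_scores) > threshold]

np.random.seed(13)
dataset = np.append(np.array([1, 2, 3, 6]), np.random.randint(900, 1000, 100))
print(find_outliers_vectorized(dataset))  # [1 2 3 6]
```

Taking the threshold as a parameter makes it easy to try a lower cutoff such as 2 without editing the function body.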

# 4. Interquartile range
The interquartile range (IQR) is the difference between the 75th percentile (3rd quartile) and the 25th percentile (1st quartile) of the dataset.
Steps involved:
    1. Sort the dataset in increasing order.
    2. Calculate 1st and 3rd quartile.
    3. Find interquartile range: iqr = q3-q1
    4. Find lower bound = q1 - (iqr*1.5)
    5. Find upper bound = q3 + (iqr*1.5)
    
    Anything that lies outside the lower and upper bound range is an outlier.
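
Written in the same style as the z-score formula, with $Q_1$ and $Q_3$ denoting the 1st and 3rd quartiles, the rule can be summarised as:

$$\text{IQR} = Q_3 - Q_1, \qquad \text{lower bound} = Q_1 - 1.5 \times \text{IQR}, \qquad \text{upper bound} = Q_3 + 1.5 \times \text{IQR}$$

A value $x$ is flagged as an outlier when $x < \text{lower bound}$ or $x > \text{upper bound}$.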

In [48]:
#Sorting the dataset
sorted(dataset)

[1,
 2,
 3,
 6,
 900,
 901,
 902,
 902,
 904,
 905,
 905,
 905,
 906,
 907,
 910,
 911,
 911,
 913,
 916,
 916,
 918,
 919,
 919,
 919,
 920,
 921,
 922,
 922,
 922,
 924,
 924,
 924,
 925,
 925,
 926,
 926,
 926,
 928,
 928,
 932,
 934,
 935,
 936,
 936,
 936,
 937,
 938,
 940,
 941,
 945,
 945,
 946,
 947,
 947,
 948,
 948,
 949,
 950,
 951,
 952,
 952,
 953,
 954,
 957,
 958,
 958,
 959,
 959,
 959,
 960,
 961,
 962,
 962,
 965,
 966,
 967,
 971,
 971,
 972,
 973,
 974,
 974,
 975,
 975,
 976,
 977,
 979,
 981,
 982,
 983,
 985,
 985,
 986,
 986,
 988,
 989,
 991,
 994,
 997,
 998,
 999,
 999,
 999,
 999]

In [57]:
#Calculate 1st and 3rd quartile
q1,q3 = np.percentile(dataset,[25,75])
q1,q3

(921.75, 971.25)
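
The quartiles are fractional even though the data are integers because `np.percentile` interpolates linearly between neighbouring sorted values by default. A sketch that reproduces q1 by hand under that assumption (the names `pos`, `lower` and `q1_manual` are illustrative):

```python
import numpy as np

np.random.seed(13)
dataset = np.append(np.array([1, 2, 3, 6]), np.random.randint(900, 1000, 100))

values = np.sort(dataset)
pos = 0.25 * (len(values) - 1)   # fractional position of the 25th percentile (0-based)
lower = int(np.floor(pos))       # index of the sorted value just below that position
frac = pos - lower               # how far to move towards the next sorted value

# Linear interpolation between the two neighbouring sorted values
q1_manual = values[lower] + frac * (values[lower + 1] - values[lower])
print(q1_manual, np.percentile(dataset, 25))
```

For the 104 sorted values listed above, the position is 25.75; the values at positions 25 and 26 are 921 and 922, so q1 = 921 + 0.75 × (922 - 921) = 921.75.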

In [59]:
#Find interquartile range
iqr = q3-q1
iqr

49.5

In [64]:
low_bound_value = q1 - (iqr*1.5)
upp_bound_value = q3 + (iqr*1.5)
low_bound_value,upp_bound_value
# Any value outside the range [low_bound_value, upp_bound_value] is an outlier.

(847.5, 1045.5)
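
The bounds still have to be applied to the data to list the outliers. A minimal sketch that wraps the steps into one function; `find_outliers_iqr` is a hypothetical name and the multiplier `k=1.5` follows the rule used in this section:

```python
import numpy as np

def find_outliers_iqr(data, k=1.5):
    """Return the values lying outside [q1 - k*iqr, q3 + k*iqr]."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    low_bound = q1 - k * iqr
    upp_bound = q3 + k * iqr
    # Keep only the values below the lower bound or above the upper bound
    mask = (data < low_bound) | (data > upp_bound)
    return data[mask]

np.random.seed(13)
dataset = np.append(np.array([1, 2, 3, 6]), np.random.randint(900, 1000, 100))
print(find_outliers_iqr(dataset))  # [1 2 3 6]
```

With the bounds (847.5, 1045.5), only the values 1, 2, 3 and 6 fall outside the range, which matches the result of the z-score method.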