### Outlier
#### An outlier is a data point that lies far from the other observations
#### Outliers cause problems in statistical analysis: they strongly affect the mean and standard deviation, but have little effect on the median and mode
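To make the claim about the mean, standard deviation, and median concrete, here is a minimal sketch. The two small samples are hypothetical values chosen only for illustration, and it uses the standard library's `statistics` module rather than NumPy so that it runs on its own.

```python
import statistics

# Hypothetical sample, chosen only for illustration
clean = [10, 12, 13, 14, 15]
with_outlier = clean + [100]  # same sample with one extreme value appended

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        name,
        "| mean =", round(statistics.mean(data), 2),
        "| median =", statistics.median(data),
        "| std =", round(statistics.pstdev(data), 2),
    )

# How far each statistic moves when the extreme value is added
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)
print("mean shift:", round(mean_shift, 2), "| median shift:", median_shift)
```

A single extreme value moves the mean and standard deviation a long way, while the median barely changes.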

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#### Criteria to identify an outlier
##### IQR (box plot): data points less than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR
##### Z-score (scatter plot): data points more than three standard deviations from the mean

In [3]:
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

#### Detecting outliers using Z-score (Z-score = (Observation - Mean) / Standard Deviation)
##### The Z-score converts a Gaussian distribution (mean, S.D.) into the standard normal distribution (mean = 0, S.D. = 1). About 68% of the data lies within one standard deviation of the mean, about 95% within two, and about 99.7% within three
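The 68%, 95%, and 99.7% figures can be checked numerically. The sketch below assumes an exactly normal distribution, for which the fraction within k standard deviations of the mean is erf(k / √2). The helper name `fraction_within` is made up for this illustration.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

This is why a threshold of 3 is a common choice: under normality, only about 0.3% of points are expected to fall beyond it.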

In [4]:
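def detect_outliers(data):
    """Return the values whose absolute Z-score exceeds the threshold."""
    outliers = []  # local list, so repeated calls do not accumulate results
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers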

In [5]:
outlier_pt = detect_outliers(dataset)
outlier_pt

[102, 107, 108]
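As a cross-check of the result above, the sketch below standardizes the whole dataset and confirms that the Z-scores have mean 0 and standard deviation 1. It uses the standard library's `statistics` module instead of NumPy. `pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
import statistics

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

mean = statistics.mean(dataset)
std = statistics.pstdev(dataset)  # population SD, matching np.std's default

# Standardize every observation
z_scores = [(x - mean) / std for x in dataset]

# After standardizing, the mean is ~0 and the SD is ~1 (up to rounding error)
print(abs(statistics.mean(z_scores)) < 1e-9, abs(statistics.pstdev(z_scores) - 1) < 1e-9)

# Values more than 3 standard deviations from the mean
flagged = [x for x, z in zip(dataset, z_scores) if abs(z) > 3]
print(flagged)
```

The flagged values match those returned by `detect_outliers`.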

#### Detecting outliers using IQR

In [6]:
## Detecting outliers using IQR
## A box plot uses five values: lower bound, Q1, median, Q3, upper bound
## IQR = Q3 - Q1
sorted(dataset)  ## Sort the dataset to inspect it; np.percentile does not require sorted input

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

In [7]:
## Find the values of Q1 and Q3
quantile1, quantile3 = np.percentile(dataset, [25, 75])  # request the 25th and 75th percentile values
print(quantile1, quantile3)

12.0 15.0


In [8]:
## Compute the IQR
IQR = quantile3 - quantile1
IQR

3.0

In [9]:
## Find the lower and upper bounds
## Values less than lower_bound or greater than upper_bound are considered outliers
lower_bound = quantile1 - (1.5 * IQR)
upper_bound = quantile3 + (1.5 * IQR)
print(lower_bound, upper_bound)

7.5 19.5


In [10]:
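def detect_outliers(data):
    """Return the values outside [lower_bound, upper_bound] computed above."""
    outliers = []  # local list, so repeated calls do not accumulate results
    for i in data:
        if i < lower_bound or i > upper_bound:
            outliers.append(i)
    return outliers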

In [11]:
outlier_present = detect_outliers(dataset)
outlier_present

[102, 107, 108]
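The IQR steps above can also be collected into one self-contained function that does not rely on variables from earlier cells. This is a sketch, not a replacement for the cells above. The function name `iqr_outliers` and its `k` parameter are made up for this illustration. It uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly and should agree with `np.percentile`'s default on this data.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    # quantiles() sorts internally; n=4 yields the three quartile cut points
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return lower, upper, [x for x in data if x < lower or x > upper]

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

lower, upper, found = iqr_outliers(dataset)
print(lower, upper, found)
# 7.5 19.5 [102, 107, 108]
```

Passing the multiplier as `k` makes it easy to try a stricter or looser rule without editing the function body.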