# Outliers
An outlier is a data point that lies far from the other observations in a dataset, outside the overall distribution of the data.

For example, suppose we have a 2D plot of height against weight. From domain knowledge, we know these two attributes are positively correlated. If a point has a very low weight but an extremely high height, it is an outlier. Such data points are rare. Whether outliers are useful depends on the type of data and the problem at hand. In credit card fraud detection, for instance, the outliers are often exactly the cases of interest. In a height and weight dataset they are not useful; instead, they call the credibility of the data into question.

### Import Libraries

In [48]:
import numpy as np
import matplotlib.pyplot as plt

# What are the criteria for identifying an outlier?


*   **Z-Score**: Data points that lie more than three standard deviations from the mean are considered outliers.
*   **IQR (Interquartile Range)**: Data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers, where Q1 and Q3 are the 25th and 75th percentiles and IQR = Q3 - Q1.


# What are the reasons for an outlier to exist in a dataset?


*   Variability in the data.
*   An experimental measurement error.


# What is the impact of having outliers in a dataset?


*   They can lead to a wrong interpretation of the data.
*   They distort the data by shifting measures of central tendency, especially the mean.
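
To make the effect on central tendency concrete, here is a minimal sketch comparing the mean and median with and without the extreme values. The sample list is the same one used later in this notebook. The variable names and the cutoff of 100 used to separate the extreme values are assumptions made only for this illustration.

```python
import numpy as np

# Same sample data as the dataset used later in this notebook
data = np.array([11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
                 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
                 14, 13, 15, 10])

# Illustrative assumption: treat values of 100 or more as the extreme points
trimmed = data[data < 100]

mean_all, median_all = np.mean(data), np.median(data)
mean_trimmed, median_trimmed = np.mean(trimmed), np.median(trimmed)

print("With extreme values:    mean =", round(mean_all, 2), "median =", median_all)
print("Without extreme values: mean =", round(mean_trimmed, 2), "median =", median_trimmed)
```

The mean falls from about 21.18 to 13.0 once the three extreme values are removed, while the median stays at 13.0 in both cases. This is why the median is usually considered more robust to outliers than the mean.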


# Ways to find outliers


*   Scatter Plot
*   Box Plot
*   Using Z-Score
*   Using the IQR

In [49]:
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]
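
As a rough sketch of the box plot approach listed above, the snippet below draws a box plot of the same data with matplotlib and reads back the points it marks as fliers. The `Agg` backend and the output file name `boxplot.png` are assumptions so the snippet can run as a plain script; in a notebook the figure would normally render inline.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

fig, ax = plt.subplots()
# With the default whis=1.5, points more than 1.5 * IQR beyond the quartiles
# are drawn as individual "flier" markers
box = ax.boxplot(dataset)
ax.set_title("Box plot of dataset")
ax.set_ylabel("Value")

# Read back the values that the box plot drew as fliers
fliers = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(fliers)

fig.savefig("boxplot.png")  # hypothetical output file name
```

The fliers sit far above the box and whiskers, so the extreme values are visible at a glance before any threshold is computed by hand.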

# Detecting Outliers using the Z-Score

The formula for the Z-score is

$$Z = \frac{x - \mu}{\sigma}$$

where

*   $x$ = data point
*   $\mu$ = mean
*   $\sigma$ = standard deviation

In [50]:
def detect_outliers(data):
    # Collect results in a local list so repeated calls do not accumulate old values
    outliers = []
    threshold = 3  # number of standard deviations allowed on either side of the mean
    mean = np.mean(data)
    print("Mean: " + str(mean))
    std = np.std(data)  # population standard deviation (ddof=0)
    print("Std: " + str(std))

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

In [51]:
outliers_pt = detect_outliers(dataset)

Mean: 21.176470588235293
Std: 26.37230118696876


In [52]:
outliers_pt

[102, 107, 108]
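
The same check can also be written without an explicit loop. The following is a minimal sketch rather than a definitive implementation; the function name `zscore_outliers` and the `threshold` parameter are hypothetical, and the guard for a zero standard deviation is an added assumption about how constant data should be handled.

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the values whose absolute Z-score exceeds threshold."""
    values = np.asarray(data, dtype=float)
    std = values.std()  # population standard deviation, the same default as np.std
    if std == 0:
        return []  # all values are identical, so nothing can be flagged
    z_scores = (values - values.mean()) / std
    return values[np.abs(z_scores) > threshold].tolist()

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

vectorized_outliers = zscore_outliers(dataset)
print(vectorized_outliers)
```

Because the list of results is built inside the function, calling it several times returns the same answer each time.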

# InterQuartile Range (IQR)

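The IQR method flags data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the 25th and 75th percentiles.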

## Steps

**1. Sort the data in increasing order.**

**2. Calculate the first quartile (Q1) and the third quartile (Q3).**

**3. Find the interquartile range, IQR = Q3 - Q1.**

**4. Find the lower bound, Q1 - 1.5 × IQR.**

**5. Find the upper bound, Q3 + 1.5 × IQR.**


In [54]:
#step 1
#sort data
dataset.sort()
print(dataset)

[10, 10, 10, 10, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 17, 19, 102, 107, 108]


In [55]:
quartile1, quartile3 = np.percentile(dataset,[25,75])

In [56]:
print(quartile1, quartile3)

12.0 15.0


In [57]:
iqr_value = quartile3 - quartile1
print(iqr_value)

3.0


In [60]:
lower_bound_value = quartile1 - (1.5 * iqr_value)
upper_bound_value = quartile3 + (1.5 * iqr_value)

In [61]:
print(lower_bound_value, upper_bound_value)

7.5 19.5


In [64]:
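def detectOutliersIQR(data):
    # Local list, so repeated calls do not accumulate results from earlier runs
    outliers2 = []
    for i in data:
        # Flag points outside the bounds computed in the cells above
        if i < lower_bound_value or i > upper_bound_value:
            outliers2.append(i)
    return outliers2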

In [65]:
outliersIQR = detectOutliersIQR(dataset)

In [66]:
print(outliersIQR)

[102, 107, 108]
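
The IQR steps can also be gathered into one self-contained function that computes its own bounds instead of relying on variables from earlier cells. This is only a sketch under stated assumptions: the name `iqr_outliers` and the multiplier parameter `k` are hypothetical, and the quartiles come from `np.percentile` with its default linear interpolation, as in the cells above.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return (outliers, (lower_bound, upper_bound)) using the IQR rule."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep only the points that fall outside the two bounds
    mask = (values < lower) | (values > upper)
    return values[mask].tolist(), (float(lower), float(upper))

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

found, bounds = iqr_outliers(dataset)
print(found, bounds)
```

Sorting is not required here because `np.percentile` handles unsorted input. Passing the multiplier as a parameter makes it easy to try a stricter or looser rule than 1.5.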
