# **Outlier Detection**

### **What is an Outlier?**

An outlier is an observation that is numerically distant from the rest of the data; in a nutshell, it is a value that falls well outside the range of the other observations. Let’s take an example and compare a dataset without outliers to the same dataset with one.

| **Metric**            | **Data without Outliers** | **Data with Outliers** |
|------------------------|---------------------------|-------------------------|
| **Data**              | 1, 2, 3, 3, 4, 5, 4      | 1, 2, 3, 3, 4, 5, 400  |
| **Mean**              | 3.143                     | 59.714                 |
| **Median**            | 3                         | 3                      |
| **Standard Deviation (sample)** | 1.345           | 150.057                |

As you can see, a single outlier drastically changes the mean and the standard deviation, while the median stays the same. Without the outlier, we would report an average of 3.14; with it, the average climbs to 59.71, a value that is larger than six of the seven observations. Any estimate based on the mean would be completely distorted.
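The figures in the table can be reproduced with a few lines of NumPy. This is a minimal sketch; the helper name `summarize` is only illustrative, and it assumes the table reports the sample standard deviation (`ddof=1`), which is what matches the values shown.

```python
import numpy as np

clean = np.array([1, 2, 3, 3, 4, 5, 4])
with_outlier = np.array([1, 2, 3, 3, 4, 5, 400])

def summarize(data):
    """Return mean, median, and sample standard deviation (ddof=1)."""
    return {
        "mean": float(np.mean(data)),
        "median": float(np.median(data)),
        "std": float(np.std(data, ddof=1)),
    }

for label, data in [("without outlier", clean), ("with outlier", with_outlier)]:
    summary = summarize(data)
    # Round only for display; the underlying values keep full precision.
    print(label, {name: round(value, 3) for name, value in summary.items()})
```

Note how the median is identical for both arrays, which is why it is often preferred as a robust measure of the center.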

### **Causes of Outliers**

- **Data Entry Errors:** Human errors made during data collection, recording, or entry can cause outliers in data.
- **Measurement Errors:** This is the most common source of outliers. It occurs when the measurement instrument used turns out to be faulty.
- **Natural Outliers:** When an outlier is not artificial (that is, not due to an error), it is a natural outlier. Many outliers in real-world data belong to this category.

### **Outlier Detection in Machine Learning using Hypothesis Testing**

In this section, I will use Python to detect outliers with a hypothesis test.

An outlier can be of two types:
- Univariate: an extreme value in a single variable.
- Multivariate: an unusual combination of values across two or more variables.

**Hypothesis testing** is a common technique for detecting outliers in machine learning. It is a method of testing a claim, or hypothesis, about a population parameter using data measured in a sample.

The purpose of a hypothesis test is to decide whether the sample data are consistent with a claim about a population parameter, such as the mean. There are four steps in a hypothesis test:

- State the null and alternative hypotheses.
- Define the criteria for a decision.
- Calculate the test statistic.
- Make a decision.
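The four steps above can be sketched with a generic one-sample t-test from SciPy before moving on to outliers. This is only an illustration of the workflow: the hypothesized mean of 15 and the significance level of 0.05 are assumptions chosen for the example, not values from a real study.

```python
import numpy as np
from scipy import stats

sample = np.array([12, 13, 14, 19, 21, 23])

# Step 1: state the hypotheses.
# H0: the population mean equals 15 (hypothetical claim); H1: it does not.
hypothesized_mean = 15.0

# Step 2: define the criterion for a decision (significance level).
alpha = 0.05

# Step 3: calculate the test statistic and its two-sided p-value.
t_stat, p_value = stats.ttest_1samp(sample, popmean=hypothesized_mean)

# Step 4: make a decision by comparing the p-value with alpha.
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

An outlier test follows the same pattern; only the hypotheses and the test statistic change.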

Now let’s see how to implement this in Python. The example uses Grubbs' test, which assumes approximately normal data and tests the null hypothesis that the sample contains no outliers against the alternative that it contains at least one.
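For reference, the `grubbs_test` function defined below follows the two-sided Grubbs' test. In its usual formulation, with $\bar{x}$ the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $\alpha$ the significance level (0.05 in the code), the test statistic and critical value are:

$$
G = \frac{\max_i \lvert x_i - \bar{x} \rvert}{s},
\qquad
G_{\text{crit}} = \frac{n - 1}{\sqrt{n}} \sqrt{\frac{t^2}{n - 2 + t^2}}
$$

Here $t$ is the upper $\alpha / (2n)$ critical value of the t distribution with $n - 2$ degrees of freedom. The null hypothesis of no outliers is rejected when $G > G_{\text{crit}}$.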

```python
import numpy as np
import scipy.stats as stats

# x has no obvious extreme value; y is x with one suspicious value (45) appended.
x = np.array([12, 13, 14, 19, 21, 23])
y = np.array([12, 13, 14, 19, 21, 23, 45])
```

```python
def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier at significance level alpha."""
    n = len(x)
    mean_x = np.mean(x)
    # Grubbs' statistic is defined with the sample standard deviation (ddof=1).
    sd_x = np.std(x, ddof=1)
    g_calculated = np.max(np.abs(x - mean_x)) / sd_x
    print(f"Grubbs Calculated Value: {g_calculated:.3f}")

    # Critical value derived from the t distribution with n - 2 degrees of freedom.
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) / np.sqrt(n)) * np.sqrt(t_value**2 / (n - 2 + t_value**2))
    print(f"Grubbs Critical Value: {g_critical:.3f}")

    if g_critical > g_calculated:
        print("The calculated value is less than the critical value: fail to reject the null hypothesis and conclude that there are no outliers.\n")
    else:
        print("The calculated value is greater than the critical value: reject the null hypothesis and conclude that there is an outlier.\n")
```

```python
grubbs_test(x)
grubbs_test(y)
```

```
Grubbs Calculated Value: 1.303
Grubbs Critical Value: 1.887
The calculated value is less than the critical value: fail to reject the null hypothesis and conclude that there are no outliers.

Grubbs Calculated Value: 2.108
Grubbs Critical Value: 2.020
The calculated value is greater than the critical value: reject the null hypothesis and conclude that there is an outlier.
```
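The test only reports whether an outlier is present. The observation it refers to is the one farthest from the mean, so it can be located with a short follow-up. This is a minimal sketch; `y` is redefined so the snippet runs on its own, and the names `suspect_index` and `suspect_value` are illustrative.

```python
import numpy as np

y = np.array([12, 13, 14, 19, 21, 23, 45])

# Index of the observation with the largest absolute deviation from the mean.
suspect_index = int(np.argmax(np.abs(y - np.mean(y))))
suspect_value = int(y[suspect_index])

print(f"Most extreme observation: y[{suspect_index}] = {suspect_value}")
# prints: Most extreme observation: y[6] = 45
```

Whether to remove, correct, or keep that value depends on its cause, as discussed in the section on causes of outliers.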

