
# Outlier: a data point that deviates significantly from the rest of the data in a Series or DataFrame column

In [9]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

**Impacts of having outliers in the dataset**
1. They can distort statistical analysis and the conclusions drawn from it.
2. They can shift the mean and inflate the standard deviation substantially, while the median is far less affected.

**Various ways of finding outliers**
1. Scatter plot
2. Box plot
3. Z-score
4. Interquartile range (IQR)
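
The first two approaches are visual. Here is a minimal sketch of both, assuming Matplotlib's default box-plot whiskers (1.5 × IQR). The name `values` and the plot labels are illustrative. The list repeats the dataset used in the rest of this notebook so the cell runs on its own.

```python
import matplotlib.pyplot as plt

# Same values as the notebook's dataset, repeated so this cell is self-contained.
values = [12, 5, 1, 13, 16, 17, 18, 19, 20, 14,
          22, 1, 23, 2, 5, 6, 7, 0, 3, 3,
          33, 5, 6, 37, 38, 40, 10, 10, 10, 15,
          12, 1, 10, 550, 400]

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: value against position in the list; outliers sit far from the cluster.
ax_scatter.scatter(range(len(values)), values)
ax_scatter.set_xlabel("Index")
ax_scatter.set_ylabel("Value")
ax_scatter.set_title("Scatter plot")

# Box plot: points beyond the whiskers (default 1.5 * IQR) are drawn as fliers.
box = ax_box.boxplot(values)
ax_box.set_ylabel("Value")
ax_box.set_title("Box plot")

# The flier markers carry the values that Matplotlib draws beyond the whiskers.
flier_values = sorted(float(v) for v in box["fliers"][0].get_ydata())
print("Points drawn beyond the whiskers:", flier_values)
```

In both panels the two extreme values stand well apart from the remaining points, which is the visual cue these plots are used for.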

In [25]:
# dataset
dataset = [12, 5, 1, 13, 16, 17, 18, 19, 20, 14,
        22, 1, 23, 2, 5, 6, 7, 0, 3, 3,
        33, 5, 6, 37, 38, 40, 10, 10, 10, 15,
        12, 1, 10, 550,400]
len(dataset)

35

In [26]:
print("Mean")
print(np.mean(dataset))

print("Median")
print(np.median(dataset))

print("Standard deviation")
print(np.std(dataset))  # population standard deviation (NumPy default, ddof=0)

Mean
39.542857142857144
Median
12.0
Standard deviation
109.20710151088235
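
To see how much of these statistics is driven by the extreme values, the same numbers can be recomputed after dropping them. This is a minimal sketch: the cutoff of 100 is an arbitrary, illustrative choice that happens to remove 550 and 400 from this dataset. It is not a general rule.

```python
import numpy as np

# Same values as the notebook's dataset, repeated so this cell is self-contained.
values = [12, 5, 1, 13, 16, 17, 18, 19, 20, 14,
          22, 1, 23, 2, 5, 6, 7, 0, 3, 3,
          33, 5, 6, 37, 38, 40, 10, 10, 10, 15,
          12, 1, 10, 550, 400]

# Illustrative cutoff (an assumption): keeps everything below 100.
trimmed = [v for v in values if v < 100]

for label, data in (("with extremes", values), ("without extremes", trimmed)):
    print(f"{label}: mean={np.mean(data):.2f}, "
          f"median={np.median(data):.1f}, std={np.std(data):.2f}")
```

The median barely moves (12 to 10), while the mean falls from about 39.5 to about 13.2 and the standard deviation shrinks sharply. The median is robust to the two extreme points; the mean and standard deviation are not.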


# Detecting outliers using the Z-score
The Z-score, also called the standard score, shows whether a data value is greater or smaller than the mean and how far away from it the value lies. More specifically, it tells how many standard deviations a data point is from the mean.

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the observation, $\mu$ is the mean, and $\sigma$ is the standard deviation.

In a normal distribution, approximately

68% of the data points lie within ±1 standard deviation of the mean.

95% of the data points lie within ±2 standard deviations of the mean.

99.7% of the data points lie within ±3 standard deviations of the mean.

A common rule of thumb therefore treats any point with |z| > 3 as an outlier.
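
The percentages above can be checked empirically. The sketch below draws simulated samples from a standard normal distribution and measures the fraction within 1, 2, and 3 standard deviations. The sample size and seed are arbitrary assumptions, chosen only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, fixed for repeatability
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Standardize using the sample's own mean and standard deviation.
z = (samples - samples.mean()) / samples.std()

# Fraction of points whose absolute Z-score is at most k, for k = 1, 2, 3.
coverage = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}

for k, fraction in coverage.items():
    print(f"within ±{k} standard deviations: {fraction:.2%}")
```

The measured fractions should land close to 68%, 95%, and 99.7%, with small differences due to sampling noise. Real datasets that are not approximately normal can deviate from these figures considerably.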

In [27]:
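def detect_outlier(data, threshold=3):
  # Return the values whose absolute Z-score exceeds the threshold.
  outlier = []  # created on every call so results do not accumulate across calls
  mean = np.mean(data)
  std = np.std(data)
  for i in data:
    z_score = (i - mean) / std
    if np.abs(z_score) > threshold:
      outlier.append(i)
  return outlier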



In [28]:
outlier_pt = detect_outlier(dataset)
outlier_pt

[550, 400]
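
The same check can be written without an explicit loop by using NumPy array operations. This is a sketch of one possible alternative. `detect_outlier_vectorized` is a hypothetical helper name, and it assumes the data has a non-zero standard deviation.

```python
import numpy as np

def detect_outlier_vectorized(data, threshold=3):
    """Return values whose absolute Z-score exceeds threshold (hypothetical helper)."""
    values = np.asarray(data, dtype=float)
    # Assumes values.std() is non-zero; a constant dataset would divide by zero.
    z_scores = (values - values.mean()) / values.std()
    return values[np.abs(z_scores) > threshold].tolist()

# Same values as the notebook's dataset, repeated so this cell is self-contained.
values = [12, 5, 1, 13, 16, 17, 18, 19, 20, 14,
          22, 1, 23, 2, 5, 6, 7, 0, 3, 3,
          33, 5, 6, 37, 38, 40, 10, 10, 10, 15,
          12, 1, 10, 550, 400]

print(detect_outlier_vectorized(values))  # [550.0, 400.0]
```

Because the array is converted to floats, the result contains `550.0` and `400.0` rather than the original integers.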

# Interquartile range
The interquartile range (IQR) is the difference between the 75th and 25th percentiles of a dataset.

Steps
1. Arrange the data in increasing order
2. Calculate the first quartile (q1) and the third quartile (q3)
3. Find the interquartile range: IQR = q3 - q1
4. Find the lower bound: q1 - 1.5 * IQR
5. Find the upper bound: q3 + 1.5 * IQR

Anything that lies outside the lower and upper bounds is an outlier.

In [30]:
def detect_outliers_iqr(data):
    Q1 = np.percentile(data, 25)  # First quartile (25th percentile)
    Q3 = np.percentile(data, 75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
    print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return outliers

In [31]:
outlier_pts = detect_outliers_iqr(dataset)
print("\nDetected Outliers:", outlier_pts)

Q1: 5.0, Q3: 19.5, IQR: 14.5
Lower Bound: -16.75, Upper Bound: 41.25

Detected Outliers: [550, 400]
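
Since the definition at the top refers to a Series or DataFrame column, here is a minimal sketch of the same IQR rule applied to a pandas Series. The Series name `value` and the variable names are illustrative. It relies on `Series.quantile` using linear interpolation by default, which matches `np.percentile` here.

```python
import pandas as pd

# Same values as the notebook's dataset, repeated so this cell is self-contained.
values = [12, 5, 1, 13, 16, 17, 18, 19, 20, 14,
          22, 1, 23, 2, 5, 6, 7, 0, 3, 3,
          33, 5, 6, 37, 38, 40, 10, 10, 10, 15,
          12, 1, 10, 550, 400]

s = pd.Series(values, name="value")

q1 = s.quantile(0.25)  # first quartile
q3 = s.quantile(0.75)  # third quartile
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# between() is inclusive by default, so its negation selects strictly outside values.
outliers = s[~s.between(lower_bound, upper_bound)]
print(outliers.tolist())  # [550, 400]
```

Indexing with the boolean mask keeps the original index labels, so `outliers.index` also shows where in the Series each flagged value sits.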
