# Outlier removal using the IQR method 

One robust method for labeling outliers is the IQR (interquartile range) method, developed by John Tukey, the pioneer of exploratory data analysis. Tukey worked in the days of calculation and plotting by hand, so the datasets involved were typically small, and the emphasis was on understanding the story the data told. If you’ve seen a box-and-whisker plot (also a Tukey contribution), you’ve seen this method in action.

A box-and-whisker plot uses quartiles (points that divide the sorted data into four groups of equal size) to show the shape of the data. The edges of the box mark the first and third quartiles, which are equal to the 25th and 75th percentiles. The line inside the box marks the second quartile, which is the median.

The interquartile range, which gives this method of outlier detection its name, is the distance between the first and third quartiles (the edges of the box). Tukey labeled any data point more than 1.5 times the IQR below the first quartile, or more than 1.5 times the IQR above the third quartile, as “outside”; points more than 3 times the IQR beyond a quartile he called “far out”. In a classic box-and-whisker plot, the whiskers extend to the last data point that is not “outside”.
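As a small worked example of the rule described above, the sketch below computes the quartiles, the IQR, and the 1.5 × IQR fences for a made-up sample. The values in `data` are hypothetical and chosen only for illustration; `np.percentile` uses linear interpolation by default, so other quartile conventions can give slightly different fences.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration.
data = np.array([2, 4, 5, 6, 7, 8, 9, 30])

# Quartiles: 25th, 50th (median), and 75th percentiles.
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range and the 1.5 * IQR fences.
iqr = q3 - q1
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Points beyond either fence are "outside" in Tukey's terminology.
outside = data[(data < fence_low) | (data > fence_high)]

# The whiskers end at the most extreme points that are not "outside".
inside = data[(data >= fence_low) & (data <= fence_high)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)      # 4.75 6.5 8.25 3.5
print(fence_low, fence_high)    # -0.5 13.5
print(outside)                  # [30]
```

Here only the value 30 lies beyond the upper fence, so the upper whisker would stop at 9.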

![Box-and-whisker plot](http://colingorrie.github.io/images/galton-boxplot.png)

In [59]:
import numpy as np
import pandas as pd

In [77]:
# Read the source file and, as an example, keep just two numeric columns.
# Note: the column names in the CSV have trailing spaces.

df = pd.read_csv('Life Expectancy Data.csv')

df = df.loc[:, ['Measles ', 'Diphtheria ']]


In [78]:
# Rename the columns to drop the trailing spaces
df.columns = ['Measles','Diphtheria']

In [63]:
df.isnull().sum()

Measles        0
Diphtheria    19
dtype: int64

In [64]:
df.head()

   Measles  Diphtheria
0     1154        65.0
1      492        62.0
2      430        64.0
3     2787        67.0
4     3013        68.0


## Function to remove outliers

### Pass the dataframe and a column name as input; the function returns a dataframe with the outlier records removed

In [65]:
# Remove outliers from one column using the 1.5 * IQR fences
#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)  # first quartile (25th percentile)
    q3 = df_in[col_name].quantile(0.75)  # third quartile (75th percentile)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    # Keep rows strictly inside the fences. Comparisons with NaN are False,
    # so rows where col_name is missing are dropped as well.
    in_range = (df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)
    df_out = df_in.loc[in_range]
    return df_out
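The function above uses strict comparisons (`>` and `<`), so a value lying exactly on a fence is removed. Under the rule described earlier, only points beyond a fence count as outliers, so an inclusive comparison is a reasonable alternative. The sketch below is one way to write that variant; the name `remove_outlier_inclusive`, the parameter `k`, and the toy data are hypothetical and not part of the original notebook. It relies on `Series.between`, which includes both endpoints by default.

```python
import pandas as pd

def remove_outlier_inclusive(df_in, col_name, k=1.5):
    """Hypothetical variant: keep rows on or inside the k * IQR fences."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    # Series.between includes both endpoints by default, so values
    # exactly on a fence are kept. NaN values still evaluate to False.
    keep = df_in[col_name].between(q1 - k * iqr, q3 + k * iqr)
    return df_in.loc[keep]

# Toy data: q1 = 4.5, q3 = 7.5, IQR = 3, so the fences are 0 and 12.
toy = pd.DataFrame({"x": [0, 4, 5, 6, 7, 8, 14]})

cleaned = remove_outlier_inclusive(toy, "x")
print(cleaned["x"].tolist())  # [0, 4, 5, 6, 7, 8]
```

In this toy column the value 0 sits exactly on the lower fence: the inclusive variant keeps it, whereas strict comparisons would drop it along with 14.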

In [67]:
# Count before applying the function
df.count()

Measles       2938
Diphtheria    2919
dtype: int64

In [51]:
df_1 = remove_outlier(df, 'Diphtheria')

In [68]:
# Count after applying the function

df_1.count()

Measles       2621
Diphtheria    2621
dtype: int64
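The counts above combine two effects. Measles has no missing values, so its count dropping from 2938 to 2621 means 317 rows were removed. The Diphtheria count drops by only 298 (2919 to 2621), so the other 19 removed rows are the ones where Diphtheria is missing: a comparison against NaN is always False, and the filter discards those rows too. The sketch below separates the two effects on a toy frame; the helper `iqr_mask` and the toy values are hypothetical, not taken from the dataset.

```python
import numpy as np
import pandas as pd

def iqr_mask(series, k=1.5):
    """Hypothetical helper: True where a value lies strictly inside the fences."""
    q1 = series.quantile(0.25)  # quantile() skips NaN by default
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series > q1 - k * iqr) & (series < q3 + k * iqr)

# Toy frame with one extreme value and one missing value.
toy = pd.DataFrame({
    "Measles": [10, 20, 30, 40, 50, 60, 70],
    "Diphtheria": [90.0, 92.0, 94.0, 95.0, 97.0, 5.0, np.nan],
})

inside = iqr_mask(toy["Diphtheria"])
missing = toy["Diphtheria"].isna()

n_missing = int(missing.sum())               # rows dropped only because of NaN
n_outside = int((~inside & ~missing).sum())  # rows dropped as true outliers

# To remove outliers but keep rows with missing values, OR the masks.
kept_with_na = toy.loc[inside | missing]

print(n_missing, n_outside)                  # 1 1
print(len(toy.loc[inside]), len(kept_with_na))  # 5 6
```

Whether missing rows should be dropped, kept, or imputed depends on the analysis; the point is only that the filter makes that choice implicitly.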