## NumPy Statistical Functions - Outlier IQR Method

#### Interquartile Range (IQR)

###### Formula: IQR = Q3 - Q1 <br>
Q1: First Quartile = Median of the Lower Half of the sorted data <br>
Q3: Third Quartile = Median of the Upper Half of the sorted data <br>

In [1]:
import numpy as np

1D Array (Odd number of values)

In [2]:
y = np.array([15, 18, 7, 11, 4, 16, 1, 3, 24, 6, 8])
y

array([15, 18,  7, 11,  4, 16,  1,  3, 24,  6,  8])

In [3]:
# sort the data
y = np.sort(y)
y

array([ 1,  3,  4,  6,  7,  8, 11, 15, 16, 18, 24])

In [4]:
#determine position to cut into halves
n = len(y)
size_of_half=int(n/2)
size_of_half

5

In [5]:
#compute lower half and Q1
lower_half = y[:size_of_half]
print('lower_half:',lower_half)
q1 = np.median(lower_half)
print('Q1:',q1)

lower_half: [1 3 4 6 7]
Q1: 4.0


In [6]:
#compute upper half and Q3

#note: for an odd number of values,
#the middle value (the median) is excluded when splitting into halves
upper_half = y[size_of_half+1:]
print('upper_half:',upper_half)
q3 = np.median(upper_half)
print('Q3:',q3)

upper_half: [11 15 16 18 24]
Q3: 16.0


In [7]:
#compute IQR
iqr = q3 - q1
print('IQR:',iqr)

IQR: 12.0
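
The median-of-halves rule used above is only one of several quartile conventions. As a hedged cross-check (not part of the original notebook), the sketch below computes the quartiles of the same array with `np.percentile`, which by default interpolates linearly between sorted values, so its result need not match the 4.0 and 16.0 obtained by hand.

```python
import numpy as np

y = np.array([15, 18, 7, 11, 4, 16, 1, 3, 24, 6, 8])

# np.percentile sorts internally and, by default, interpolates linearly
q1_interp, q3_interp = np.percentile(y, [25, 75])
iqr_interp = q3_interp - q1_interp

print('Q1:', q1_interp)
print('Q3:', q3_interp)
print('IQR:', iqr_interp)
```

For this array the interpolated quartiles are 5.0 and 15.5, giving an IQR of 10.5 rather than 12.0. Neither value is wrong; the two results simply follow different quartile definitions, so it is worth stating which one is in use.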


1D Array (Even number of values)

In [8]:
y = np.array([12, 4, 8, 19, 6, 28, 22, 9, 21, 18])
y

array([12,  4,  8, 19,  6, 28, 22,  9, 21, 18])

In [9]:
# sort the data
y = np.sort(y)
y

array([ 4,  6,  8,  9, 12, 18, 19, 21, 22, 28])

In [10]:
#determine position to cut into halves
n = len(y)
size_of_half=int(n/2)
size_of_half

5

In [11]:
#compute lower half and Q1
lower_half = y[:size_of_half]
q1 = np.median(lower_half)
print('Q1:',q1)

Q1: 8.0


In [12]:
#compute upper half and Q3
#note: for an even number of values, the two middle values are kept:
#one ends the lower half and the other starts the upper half
upper_half = y[size_of_half:]
q3 = np.median(upper_half)
print('Q3:',q3)

Q3: 21.0


In [13]:
#compute IQR
iqr = q3 - q1
print('IQR:',iqr)

IQR: 13.0


In [15]:
def compute_quartiles_1D(arr):
    #STEP 1: sort the data first
    arr = np.sort(arr)
    #size of array
    n = len(arr)
    
    #STEP 2: divide dataset into 2 halves
    #compute size of each half
    #for an even count, this is exactly half of the array size
    #for an odd count, this is the half size with the median excluded
    size_of_half=int(n/2)
    #slice lower half of the array
    lower_half=arr[:size_of_half]
    if (n % 2) == 0:
        upper_half=arr[size_of_half:]
    else:
        #exclude median in case of odd number of values
        upper_half=arr[size_of_half+1:]
    
    #STEP 3: Compute Q1 and Q3
    #Q1 is median of lower half and Q3 is median of upper half
    q1 = np.median(lower_half)
    q3 = np.median(upper_half)
    
    print('q1:',q1)
    print('q3:',q3)
        
    return q1, q3  

def compute_iqr_1D(arr):
    q1, q3 = compute_quartiles_1D(arr)
    iqr = q3 - q1
    return iqr

In [16]:
y = np.array([15, 18, 7, 11, 4, 16, 1, 3, 24, 6, 8])
iqr = compute_iqr_1D(y)
print('IQR for odd numbers:',iqr)

q1: 4.0
q3: 16.0
IQR for odd numbers: 12.0


In [17]:
y = np.array([12, 4, 8, 19, 6, 28, 22, 9, 21, 18])
iqr = compute_iqr_1D(y)
print('IQR for even numbers:',iqr)

q1: 8.0
q3: 21.0
IQR for even numbers: 13.0
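
The odd/even branch in `compute_quartiles_1D` can also be folded into the slice bounds. The following is a minimal sketch under that assumption; `quartiles_1d_compact` is a hypothetical name used only for illustration, and it omits the diagnostic prints.

```python
import numpy as np

def quartiles_1d_compact(arr):
    """Median-of-halves quartiles for a 1D array (hypothetical helper, no printing)."""
    arr = np.sort(arr)
    n = len(arr)
    # (n + 1) // 2 equals n // 2 for even n and n // 2 + 1 for odd n,
    # so the middle value is skipped only when n is odd
    lower_half = arr[: n // 2]
    upper_half = arr[(n + 1) // 2 :]
    return np.median(lower_half), np.median(upper_half)

odd = np.array([15, 18, 7, 11, 4, 16, 1, 3, 24, 6, 8])
even = np.array([12, 4, 8, 19, 6, 28, 22, 9, 21, 18])

q1_odd, q3_odd = quartiles_1d_compact(odd)
q1_even, q3_even = quartiles_1d_compact(even)
print('odd case  Q1, Q3:', q1_odd, q3_odd)
print('even case Q1, Q3:', q1_even, q3_even)
```

Both calls reproduce the quartiles found step by step above (4.0 and 16.0 for the odd case, 8.0 and 21.0 for the even case).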


2D Array

In [18]:
np.random.seed(5)
x= np.random.randint(low=15,high=100,size=(7,4))
x

array([[93, 76, 31, 88],
       [23, 77, 42, 45],
       [95, 22, 91, 30],
       [68, 95, 42, 59],
       [92, 90, 80, 62],
       [45, 99, 33, 24],
       [56, 77, 16, 97]])

In [19]:
def compute_quartiles(arr,axis=None):
    #STEP 1: sort the data first
    arr = np.sort(arr,axis=axis)
    #size of array
    n = np.size(arr,axis)
    
    #STEP 2: divide dataset into 2 halves
    #compute size of each half
    #for an even count, this is exactly half of the array size
    #for an odd count, this is the half size with the median excluded
    size_of_half=int(n/2)
    print('Size of half:',size_of_half)
    #slice lower half of the array
    lower_half = arr.take(indices=range(0, size_of_half), axis=axis)
    print('Lower Half\n',lower_half) 
    
    if (n % 2) == 0:
        upper_half = arr.take(indices=range(size_of_half, n), axis=axis)
    else:
        #exclude median in case of odd number of values
        upper_half = arr.take(indices=range(size_of_half+1, n), axis=axis)
    
    print('Upper Half\n',upper_half)    
    
    #STEP 3: Compute Q1 and Q3
    #Q1 is median of lower half and Q3 is median of upper half
    q1 = np.median(lower_half,axis=axis)
    q3 = np.median(upper_half,axis=axis)
    
    print('q1:',q1)
    print('q3:',q3)
        
    return q1,q3  

def compute_iqr(arr,axis=None):
    q1, q3 = compute_quartiles(arr,axis)
    iqr = q3 - q1
    return iqr

In [20]:
iqr = compute_iqr(x,axis=0)
print('\nIQR for axis=0:',iqr)

Size of half: 3
Lower Half
 [[23 22 16 24]
 [45 76 31 30]
 [56 77 33 45]]
Upper Half
 [[92 90 42 62]
 [93 95 80 88]
 [95 99 91 97]]
q1: [45. 76. 31. 30.]
q3: [93. 95. 80. 88.]

IQR for axis=0: [48. 19. 49. 58.]


In [21]:
iqr = compute_iqr(x,axis=1)
print('\nIQR for axis=1:',iqr)

Size of half: 2
Lower Half
 [[31 76]
 [23 42]
 [22 30]
 [42 59]
 [62 80]
 [24 33]
 [16 56]]
Upper Half
 [[88 93]
 [45 77]
 [91 95]
 [68 95]
 [90 92]
 [45 99]
 [77 97]]
q1: [53.5 32.5 26.  50.5 71.  28.5 36. ]
q3: [90.5 61.  93.  81.5 91.  72.  87. ]

IQR for axis=1: [37.  28.5 67.  31.  20.  43.5 51. ]


In [22]:
#iqr for axis=None
iqr = compute_iqr(x)
print('\nIQR for axis=None:',iqr)

Size of half: 14
Lower Half
 [16 22 23 24 30 31 33 42 42 45 45 56 59 62]
Upper Half
 [68 76 77 77 80 88 90 91 92 93 95 95 97 99]
q1: 37.5
q3: 90.5

IQR for axis=None: 53.0


In [23]:
#check that the new function also works for a 1D array
print('data:',y)
print()
print('\nIQR for 1D array:',compute_iqr(y))

data: [12  4  8 19  6 28 22  9 21 18]

Size of half: 5
Lower Half
 [ 4  6  8  9 12]
Upper Half
 [18 19 21 22 28]
q1: 8.0
q3: 21.0

IQR for 1D array: 13.0


In [1]:
import numpy as np

In [2]:
def compute_quartiles(arr,axis=None):
    #STEP 1: sort the data first
    arr = np.sort(arr,axis=axis)
    #size of array
    n = np.size(arr,axis)
    
    #STEP 2: divide dataset into 2 halves
    #compute size of each half
    #for an even count, this is exactly half of the array size
    #for an odd count, this is the half size with the median excluded
    size_of_half=int(n/2)
    print('Size of half:',size_of_half)
    #slice lower half of the array
    lower_half = arr.take(indices=range(0, size_of_half), axis=axis)
    print('Lower Half\n',lower_half) 
    
    if (n % 2) == 0:
        upper_half = arr.take(indices=range(size_of_half, n), axis=axis)
    else:
        #exclude median in case of odd number of values
        upper_half = arr.take(indices=range(size_of_half+1, n), axis=axis)
    
    print('Upper Half\n',upper_half)    
    
    #STEP 3: Compute Q1 and Q3
    #Q1 is median of lower half and Q3 is median of upper half
    q1 = np.median(lower_half,axis=axis)
    q3 = np.median(upper_half,axis=axis)
    
    print('q1:',q1)
    print('q3:',q3)
        
    return q1,q3  

#this version of compute_iqr also returns q1 and q3 along with the IQR
def compute_iqr(arr,axis=None):
    q1, q3 = compute_quartiles(arr,axis)
    iqr = q3 - q1
    return q1,q3,iqr

<u>IQR Method</u> 

Lower Bound: Q1 - N * IQR <br>
Upper Bound: Q3 + N * IQR <br>
Outlier: x < Lower Bound (or) x > Upper Bound <br>
N = 1.5 for mild outliers (inner fence) <br>
N = 3.0 for extreme outliers (outer fence) <br>
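
As a worked example of these formulas, the sketch below plugs in Q1 = 54 and Q3 = 72. These are illustrative inputs chosen here; they match the quartiles of the sample analysed further down.

```python
# Illustrative quartiles (assumed inputs for this worked example)
q1, q3 = 54.0, 72.0
iqr = q3 - q1  # 18.0

fences = {}
for N in (1.5, 3.0):
    # lower bound = Q1 - N * IQR, upper bound = Q3 + N * IQR
    fences[N] = (q1 - N * iqr, q3 + N * iqr)
    print(f'N={N}: lower bound = {fences[N][0]}, upper bound = {fences[N][1]}')
```

With N = 1.5 the fences are 27.0 and 99.0; with N = 3.0 they widen to 0.0 and 126.0, so fewer points are flagged.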

In [3]:
def outlier_by_iqr(arr,N):
    q1,q3,iqr = compute_iqr(arr)
    print('******** Outlier Check ************')
    
    #compute lower bound and upper bound of IQR Method
    lower_bound = q1 - N * iqr
    upper_bound = q3 + N * iqr
    
    print('IQR:',iqr)
    print('Lower Bound:',lower_bound)
    print('Upper Bound:',upper_bound)
    
    #select outliers using numpy boolean indexing with "or" condition
    outliers = arr[(arr < lower_bound) | (arr > upper_bound)  ]
    return outliers

In [4]:
data = np.array([25,47,49,54,57,59,61,63,64,67,71,72,73,79, 225])

In [5]:
#outlier detection using the inner fence, i.e. N=1.5
outliers = outlier_by_iqr(data,N=1.5)
print('\nOUTLIERS:',outliers)

Size of half: 7
Lower Half
 [25 47 49 54 57 59 61]
Upper Half
 [ 64  67  71  72  73  79 225]
q1: 54.0
q3: 72.0
******** Outlier Check ************
IQR: 18.0
Lower Bound: 27.0
Upper Bound: 99.0

OUTLIERS: [ 25 225]
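
`outlier_by_iqr` returns the flagged values themselves. When the goal is to drop them, a boolean mask is often more convenient. The following is a minimal sketch, assuming a hypothetical helper `iqr_outlier_mask` (not part of the notebook) that repeats the median-of-halves quartile rule without the diagnostic prints.

```python
import numpy as np

def iqr_outlier_mask(arr, N=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences.

    Hypothetical helper for illustration; quartiles use the median-of-halves rule.
    """
    arr = np.asarray(arr)
    s = np.sort(arr, axis=None)          # flattened, sorted copy
    n = s.size
    q1 = np.median(s[: n // 2])          # median of the lower half
    q3 = np.median(s[(n + 1) // 2 :])    # median of the upper half
    iqr = q3 - q1
    return (arr < q1 - N * iqr) | (arr > q3 + N * iqr)

data = np.array([25, 47, 49, 54, 57, 59, 61, 63, 64, 67, 71, 72, 73, 79, 225])
mask = iqr_outlier_mask(data, N=1.5)

outliers = data[mask]   # values outside the fences
cleaned = data[~mask]   # values inside the fences
print('outliers:', outliers)
print('cleaned:', cleaned)
```

Here the mask selects 25 and 225, and `cleaned` keeps the remaining 13 values in their original order.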


In [6]:
#outlier detection using the outer fence, i.e. N=3.0
outliers = outlier_by_iqr(data,N=3.0)
print('\nOUTLIERS:',outliers)

Size of half: 7
Lower Half
 [25 47 49 54 57 59 61]
Upper Half
 [ 64  67  71  72  73  79 225]
q1: 54.0
q3: 72.0
******** Outlier Check ************
IQR: 18.0
Lower Bound: 0.0
Upper Bound: 126.0

OUTLIERS: [225]
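
`outlier_by_iqr` calls `compute_iqr` with the default `axis=None`, so a 2D input would be flattened and checked against a single pair of fences. A per-column check is one possible extension. The sketch below illustrates that idea and is not the notebook's method: the helper name `column_outlier_mask` is hypothetical, and the array is the 7x4 sample from the 2D Array section, copied in by hand.

```python
import numpy as np

def column_outlier_mask(arr2d, N=1.5):
    """Flag values outside the IQR fences of their own column (hypothetical helper)."""
    s = np.sort(arr2d, axis=0)                  # sort each column independently
    n = s.shape[0]
    q1 = np.median(s[: n // 2], axis=0)         # per-column median of the lower half
    q3 = np.median(s[(n + 1) // 2 :], axis=0)   # per-column median of the upper half
    iqr = q3 - q1
    # the bounds have one entry per column and broadcast across the rows
    return (arr2d < q1 - N * iqr) | (arr2d > q3 + N * iqr)

x = np.array([[93, 76, 31, 88],
              [23, 77, 42, 45],
              [95, 22, 91, 30],
              [68, 95, 42, 59],
              [92, 90, 80, 62],
              [45, 99, 33, 24],
              [56, 77, 16, 97]])

mask = column_outlier_mask(x, N=1.5)
print('positions (row, column):', np.argwhere(mask).tolist())
print('values:', x[mask])
```

With N = 1.5, only the value 22 at row 2, column 1 falls outside its column's fences (47.5 and 123.5); the other three columns have no flagged values.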


Example 2 (Impact of data size)

In [7]:
#create array with size of 100 and value=5
data2 = np.full(100,5)
data2

array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

In [8]:
#overwrite some elements with other values
data2[96] = 15
data2[97] = 67
data2[98]= -32
data2[99] = 150

data2[35:45] =7
data2[57:77] =8
data2[80:85] =10

data2

array([  5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,   5,   5,   5,   5,   7,   7,   7,   7,
         7,   7,   7,   7,   7,   7,   5,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,   8,   8,   8,   8,   8,   8,   8,   8,
         8,   8,   8,   8,   8,   8,   8,   8,   8,   8,   8,   8,   5,
         5,   5,  10,  10,  10,  10,  10,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,  15,  67, -32, 150])

In [9]:
#outlier detection using the inner fence, i.e. N=1.5
outliers = outlier_by_iqr(data2,N=1.5)
print('\nOUTLIERS:',outliers)

Size of half: 50
Lower Half
 [-32   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5
   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5
   5   5   5   5   5   5   5   5   5   5   5   5   5   5]
Upper Half
 [  5   5   5   5   5   5   5   5   5   5   5   5   7   7   7   7   7   7
   7   7   7   7   8   8   8   8   8   8   8   8   8   8   8   8   8   8
   8   8   8   8   8   8  10  10  10  10  10  15  67 150]
q1: 5.0
q3: 8.0
******** Outlier Check ************
IQR: 3.0
Lower Bound: 0.5
Upper Bound: 12.5

OUTLIERS: [ 15  67 -32 150]


In [10]:
#outlier detection using the outer fence, i.e. N=3.0
outliers = outlier_by_iqr(data2,N=3.0)
print('\nOUTLIERS:',outliers)

Size of half: 50
Lower Half
 [-32   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5
   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5
   5   5   5   5   5   5   5   5   5   5   5   5   5   5]
Upper Half
 [  5   5   5   5   5   5   5   5   5   5   5   5   7   7   7   7   7   7
   7   7   7   7   8   8   8   8   8   8   8   8   8   8   8   8   8   8
   8   8   8   8   8   8  10  10  10  10  10  15  67 150]
q1: 5.0
q3: 8.0
******** Outlier Check ************
IQR: 3.0
Lower Bound: -4.0
Upper Bound: 17.0

OUTLIERS: [ 67 -32 150]


Now keep only a few data points (i.e. only a few repetitions) from the input.

In [11]:
#reduce the data from 100 values to just 11 by keeping at most 2 repetitions of each value
data3  = np.array([-32,5,5,7,7,8,8,10,15,67,150])

In [12]:
#outlier detection using the inner fence, i.e. N=1.5
outliers = outlier_by_iqr(data3,N=1.5)
print('\nOUTLIERS:',outliers)

Size of half: 5
Lower Half
 [-32   5   5   7   7]
Upper Half
 [  8  10  15  67 150]
q1: 5.0
q3: 15.0
******** Outlier Check ************
IQR: 10.0
Lower Bound: -10.0
Upper Bound: 30.0

OUTLIERS: [-32  67 150]


In [13]:
#outlier detection using the outer fence, i.e. N=3.0
outliers = outlier_by_iqr(data3,N=3.0)
print('\nOUTLIERS:',outliers)

Size of half: 5
Lower Half
 [-32   5   5   7   7]
Upper Half
 [  8  10  15  67 150]
q1: 5.0
q3: 15.0
******** Outlier Check ************
IQR: 10.0
Lower Bound: -25.0
Upper Bound: 45.0

OUTLIERS: [-32  67 150]


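With only 11 data points, the IQR method still flags the extreme values -32, 67 and 150 at both N=1.5 and N=3.0. The fences do widen (the IQR grows from 3.0 to 10.0), so 15 is no longer flagged as a mild outlier. This suggests that the IQR method can work even with a small data size.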
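
To compare the two samples side by side, the sketch below separates mild outliers (outside the inner fence but inside the outer fence) from extreme ones (outside the outer fence). `classify_outliers` is a hypothetical helper written for this comparison, not part of the notebook. It uses the same median-of-halves quartiles, and the 100-point sample is rebuilt with the same assignments as above.

```python
import numpy as np

def classify_outliers(arr):
    """Split outliers into (mild, extreme) using the N=1.5 and N=3.0 fences."""
    arr = np.asarray(arr)
    s = np.sort(arr, axis=None)
    n = s.size
    q1 = np.median(s[: n // 2])
    q3 = np.median(s[(n + 1) // 2 :])
    iqr = q3 - q1
    extreme = (arr < q1 - 3.0 * iqr) | (arr > q3 + 3.0 * iqr)
    beyond_inner = (arr < q1 - 1.5 * iqr) | (arr > q3 + 1.5 * iqr)
    mild = beyond_inner & ~extreme
    return arr[mild], arr[extreme]

# rebuild the 100-point sample used above
data2 = np.full(100, 5)
data2[96:100] = [15, 67, -32, 150]
data2[35:45] = 7
data2[57:77] = 8
data2[80:85] = 10

# the reduced 11-point sample
data3 = np.array([-32, 5, 5, 7, 7, 8, 8, 10, 15, 67, 150])

mild2, extreme2 = classify_outliers(data2)
mild3, extreme3 = classify_outliers(data3)
print('100 points -> mild:', mild2, 'extreme:', extreme2)
print(' 11 points -> mild:', mild3, 'extreme:', extreme3)
```

In the 100-point sample, 15 is the only mild outlier and 67, -32 and 150 are extreme. In the 11-point sample nothing is mild, while -32, 67 and 150 remain extreme.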