## NumPy Statistical Functions - Z Score

### <center>Z Score</center>
<center>$$ z = \frac{x - \mu}{\sigma} $$</center>
<center><i> $x$ : value of the observation</i></center>
<center><i> $\mu$ : Mean</i></center>
<center><i> $\sigma$ : Standard Deviation</i></center>

In [1]:
import numpy as np
data = np.array([25,47,49,54,57,59,61,63,64,67,71,72,73,79, 230])
z_scores = (data - np.mean(data) ) / np.std(data)
z_scores

array([-1.04839729, -0.55131237, -0.50612283, -0.39314898, -0.32536468,
       -0.28017514, -0.2349856 , -0.18979606, -0.16720129, -0.09941698,
       -0.00903791,  0.01355686,  0.03615163,  0.17172025,  3.58353039])

In [2]:
from scipy import stats
z_scores = stats.zscore(data)
z_scores

array([-1.04839729, -0.55131237, -0.50612283, -0.39314898, -0.32536468,
       -0.28017514, -0.2349856 , -0.18979606, -0.16720129, -0.09941698,
       -0.00903791,  0.01355686,  0.03615163,  0.17172025,  3.58353039])
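The two cells above agree because both `np.std` and `stats.zscore` default to the population standard deviation (`ddof=0`). The following is a minimal sketch that checks this and shows how switching to the sample standard deviation (`ddof=1`) shrinks the scores slightly. The variable names `manual`, `same_as_scipy`, and `sample_scores` are illustrative only.

```python
import numpy as np
from scipy import stats

data = np.array([25, 47, 49, 54, 57, 59, 61, 63, 64, 67, 71, 72, 73, 79, 230])

# Manual z-scores using the population standard deviation (ddof=0, NumPy's default)
manual = (data - data.mean()) / data.std()

# stats.zscore also defaults to ddof=0, so the two results should agree
same_as_scipy = np.allclose(manual, stats.zscore(data))
print(same_as_scipy)

# With the sample standard deviation (ddof=1) the divisor is larger,
# so every score moves slightly toward zero
sample_scores = stats.zscore(data, ddof=1)
print(sample_scores[-1] < manual[-1])
```

Which `ddof` is appropriate depends on whether the data are treated as a full population or a sample; the rest of this notebook keeps the default.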

In [4]:
threshold = 3.0
z_scores_outliers = z_scores[(z_scores < -threshold) | (z_scores > threshold)]
print('z scores of outliers:',z_scores_outliers)
outliers = data[(z_scores < -threshold) | (z_scores > threshold)]
print('OUTLIERS:',outliers)

z scores of outliers: [3.58353039]
OUTLIERS: [230]
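The thresholding step above can be wrapped in a small helper so the mask is computed only once. This is a sketch, not part of NumPy or SciPy: the function name `zscore_outliers` and its handling of constant input are assumptions made for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return (outliers, z_scores_of_outliers) for a 1-D array (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values identical: z-scores are undefined, so report no outliers
        return values[:0], values[:0]
    z = (values - values.mean()) / std
    # One mask covers both tails: |z| > threshold
    mask = np.abs(z) > threshold
    return values[mask], z[mask]

data = np.array([25, 47, 49, 54, 57, 59, 61, 63, 64, 67, 71, 72, 73, 79, 230])
outliers, scores = zscore_outliers(data, threshold=3.0)
print('z scores of outliers:', scores)
print('OUTLIERS:', outliers)
```

Because the helper converts its input to `float`, the outliers are returned as floats rather than integers.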


Example 2 (Impact of data size)

In [5]:
# create an array of 100 elements, all equal to 5
data2 = np.full(100,5)
data2

array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

In [6]:
# overwrite some elements with other values, including a few extremes
data2[96] = 15
data2[97] = 67
data2[98]= -32
data2[99] = 150

data2[35:45] =7
data2[57:77] =8
data2[80:85] =10

data2

array([  5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,   5,   5,   5,   5,   7,   7,   7,   7,
         7,   7,   7,   7,   7,   7,   5,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,   8,   8,   8,   8,   8,   8,   8,   8,
         8,   8,   8,   8,   8,   8,   8,   8,   8,   8,   8,   8,   5,
         5,   5,  10,  10,  10,  10,  10,   5,   5,   5,   5,   5,   5,
         5,   5,   5,   5,   5,  15,  67, -32, 150])

In [7]:
z_scores = stats.zscore(data2)
z_scores

array([-0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.05284628, -0.05284628, -0.05284628, -0.05284628, -0.05284628,
       -0.05284628, -0.05284628, -0.05284628, -0.05284628, -0.05284628,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047, -0.17719047, -0.17719047, -0.17719047,
       -0.17719047, -0.17719047,  0.00932581,  0.00932581,  0.00932581,
        0.00932581,  0.00932581,  0.00932581,  0.00932581,  0.00932581,
        0.00932581,  0.00932581,  0.00932581,  0.00932581,  0.00

In [8]:
threshold =2.0

In [9]:
z_scores_outliers = z_scores[(z_scores < -threshold) | (z_scores > threshold)]
print('z scores of outliers:',z_scores_outliers)
outliers = data2[(z_scores < -threshold) | (z_scores > threshold)]
print('OUTLIERS:',outliers)

z scores of outliers: [ 3.67747932 -2.47755792  8.83776307]
OUTLIERS: [ 67 -32 150]


With a threshold of 2.0, the z-score method detects 3 outliers: [67, -32, 150].

Now suppose we keep only a few data points from the input (that is, only a few repetitions of each value).

In [10]:
# reduce the data from 100 points to just 11 by keeping at most 2 repetitions of each value
data3  = np.array([-32,5,5,7,7,8,8,10,15,67,150])

In [11]:
z_scores = stats.zscore(data3)
z_scores

array([-1.1986088 , -0.38825368, -0.38825368, -0.3444507 , -0.3444507 ,
       -0.32254921, -0.32254921, -0.27874623, -0.16923878,  0.96963868,
        2.78746232])

In [12]:
threshold =2.0

In [13]:
z_scores_outliers = z_scores[(z_scores < -threshold) | (z_scores > threshold)]
print('z scores of outliers:',z_scores_outliers)
outliers = data3[(z_scores < -threshold) | (z_scores > threshold)]
print('OUTLIERS:',outliers)

z scores of outliers: [2.78746232]
OUTLIERS: [150]


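With only a few data points, the same threshold detects just one outlier, [150]. In the smaller sample the extreme values make up a larger share of the data, so they pull the mean toward themselves and inflate the standard deviation, which shrinks every z-score. The z-score method therefore needs a reasonably large sample to flag outliers reliably.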
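To see why the smaller sample hides two of the outliers, it helps to compare the mean and standard deviation of both arrays. The sketch below rebuilds `data2` and `data3` from the cells above and also prints `sqrt(n - 1)`, which is the largest |z| any single point can reach among `n` values when the population standard deviation is used (Samuelson's inequality). The name `max_abs_z` is illustrative.

```python
import numpy as np

# The same extreme values, embedded in 100 points and in 11 points
data2 = np.full(100, 5)
data2[35:45] = 7
data2[57:77] = 8
data2[80:85] = 10
data2[96:100] = [15, 67, -32, 150]
data3 = np.array([-32, 5, 5, 7, 7, 8, 8, 10, 15, 67, 150])

for name, arr in [("data2", data2), ("data3", data3)]:
    n = arr.size
    # Upper bound on |z| for any single point with ddof=0
    max_abs_z = np.sqrt(n - 1)
    print(f"{name}: n={n}, mean={arr.mean():.2f}, std={arr.std():.2f}, "
          f"max possible |z|={max_abs_z:.2f}")
```

Going from 100 points to 11, the mean rises from about 7.85 to about 22.7 and the standard deviation from about 16.1 to about 45.7, so 67 and -32 no longer clear the threshold. With 11 points the bound is about 3.16, which means a threshold of 3.0 would be almost unreachable no matter how extreme a value is.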