In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("seaborn-v0_8")  # this style was named "seaborn" before matplotlib 3.6


In [2]:
np.exp([-1, 0, 1, 2])

array([0.36787944, 1.        , 2.71828183, 7.3890561 ])

In [3]:
np.log(np.exp([-1, 0, 1, 2]))  # the natural logarithm is the inverse of exp

array([-1.,  0.,  1.,  2.])

## Standardisation

In [4]:
heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
heights[-5:] # preview

array([157. , 167.4, 159.6, 168.5, 147.8])

In [5]:
np.mean(heights), np.std(heights)

(160.13679222932953, 7.062021850008261)

Standardising a vector $(x_1,\dots,x_n)$ consists of subtracting the sample arithmetic mean from each element (which we call centring) and then dividing the result by the standard deviation.

In [6]:
heights_std = (heights-np.mean(heights))/np.std(heights)
heights_std[-5:] # preview

# heights_std now holds the z-scores of the heights

array([-0.44417764,  1.02848843, -0.07601113,  1.18425119, -1.74692071])

* a z-score of 0 corresponds to an observation equal to the sample mean (perfectly average);
* a z-score of 1 is obtained for a datum 1 standard deviation above the mean;
* a z-score of -2 means that the value is 2 standard deviations below the mean.

In [7]:
np.mean(heights_std), np.std(heights_std)

(1.8920872660373198e-15, 1.0)
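
The mean reported above is not exactly 0 only because of floating-point round-off. The following is a minimal sketch, not part of the original notebook, that wraps standardisation in a helper and undoes it. The function name `standardise` and the sample data are made up for illustration.

```python
import numpy as np

def standardise(x):
    """Return the z-scores of x plus the mean and standard deviation used."""
    x = np.asarray(x, dtype=float)
    m, s = np.mean(x), np.std(x)  # np.std uses ddof=0 by default
    return (x - m) / s, m, s

# hypothetical data, for illustration only (not the NHANES heights)
x = np.array([150.0, 160.0, 165.0, 170.0, 180.0])
z, m, s = standardise(x)

# compare with a tolerance instead of testing for exact equality
print(np.isclose(np.mean(z), 0.0), np.isclose(np.std(z), 1.0))

# multiplying by s and adding m recovers the original values
x_back = z * s + m
print(np.allclose(x_back, x))
```

Returning `m` and `s` alongside the z-scores makes it possible to map a z-score back to the original units later.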

## Min-Max Scaling and Clipping

A less frequent, but still noteworthy, transformation is called min-max scaling and involves subtracting the minimum and then dividing by the range.

In [8]:
x = np.array([-1.5, 0.5, 3.5, -1.33, 0.25, 0.8])
(x - np.min(x))/(np.max(x)-np.min(x))

array([0.   , 0.4  , 1.   , 0.034, 0.35 , 0.46 ])

Here, the smallest value is mapped to 0 and the largest one to 1.

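Note that a scaled value of 0.5 marks the midpoint of the range, i.e., the average of the minimum and the maximum. It does not mean that the value is equal to the mean (unless we are very lucky!).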
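
As a quick check of the remark about 0.5, the sketch below applies the same scaling to the midpoint of the range and to the sample mean of `x`. The helper name `minmax` is hypothetical. It is an illustration under these assumptions rather than a general-purpose implementation (for instance, it does not handle a constant vector, whose range is 0).

```python
import numpy as np

x = np.array([-1.5, 0.5, 3.5, -1.33, 0.25, 0.8])
lo, hi = np.min(x), np.max(x)

def minmax(v):
    """Scale v using the minimum and the range of x (hypothetical helper)."""
    return (v - lo) / (hi - lo)

midpoint = (lo + hi) / 2
print(minmax(midpoint))    # the midpoint of the range is mapped to 0.5
print(minmax(np.mean(x)))  # the mean is mapped elsewhere (about 0.374 here)
```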

Also, clipping can be used to replace all values less than 0 with 0 and those greater than 1 with 1.

In [9]:
np.clip(x, 0, 1)

array([0.  , 0.5 , 1.  , 0.  , 0.25, 0.8 ])

The function is, of course, flexible; another popular choice is clipping to [-1, 1].

Clipping can also be implemented manually by means of the vectorised pairwise minimum and maximum functions.

In [10]:
np.minimum(1, np.maximum(0, x))

array([0.  , 0.5 , 1.  , 0.  , 0.25, 0.8 ])
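
The same two approaches work for other intervals. As a small sketch, here is clipping to [-1, 1], once with `np.clip` and once with the pairwise functions. The variable names are chosen for illustration.

```python
import numpy as np

x = np.array([-1.5, 0.5, 3.5, -1.33, 0.25, 0.8])

# values below -1 become -1, values above 1 become 1
clipped = np.clip(x, -1, 1)

# the same result via pairwise maximum (lower bound) and minimum (upper bound)
clipped_manual = np.minimum(1, np.maximum(-1, x))

print(clipped)
print(np.array_equal(clipped, clipped_manual))
```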

## Normalisation

Normalisation is the scaling of a given vector so that it is of unit length. Usually, by length we mean the square root of the sum of squares, i.e., the Euclidean norm whose special case for $n=2$ we know well from high school: the length of a vector $(a,b)$ is $\sqrt{a^2+b^2}$, e.g., $\|(1, 2)\| = \sqrt{5} \simeq 2.236$.

In [11]:
x = np.array([1, 5, -4, 2, 2.5])
x/np.sqrt(np.sum(x**2))  # x divided by the Euclidean norm of x

array([ 0.13834289,  0.69171446, -0.55337157,  0.27668579,  0.34585723])

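Note that normalisation is quite similar to standardisation if the data are already centred (i.e., the mean has been subtracted). In fact, we can obtain one from the other by scaling by $\sqrt{n}$ when the standard deviation is computed with the default `np.std` (denominator $n$), or by $\sqrt{n-1}$ when the denominator $n-1$ is used (`ddof=1`).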
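
The relationship between the two transformations can be checked numerically. The sketch below centres `x`, normalises it, and compares the result with the standardised vector. Which scaling factor applies depends on the `ddof` argument passed to `np.std`. The variable names are chosen for illustration.

```python
import numpy as np

x = np.array([1, 5, -4, 2, 2.5])
n = len(x)

centred = x - np.mean(x)
normalised = centred / np.sqrt(np.sum(centred**2))  # unit Euclidean length

# default np.std divides by n, so the factor is sqrt(n)
standardised = centred / np.std(x)
print(np.allclose(standardised, np.sqrt(n) * normalised))

# with ddof=1 the divisor is n-1, so the factor is sqrt(n-1)
standardised_ddof1 = centred / np.std(x, ddof=1)
print(np.allclose(standardised_ddof1, np.sqrt(n - 1) * normalised))
```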

At other times, by length we mean the Manhattan norm, i.e., the sum of absolute values.

In [12]:
x / np.sum(np.abs(x))

array([ 0.06896552,  0.34482759, -0.27586207,  0.13793103,  0.17241379])
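
Both norms are also available through `np.linalg.norm`, which for a one-dimensional array computes the Euclidean norm by default and the Manhattan norm when the order is 1. A brief sketch, with illustrative variable names:

```python
import numpy as np

x = np.array([1, 5, -4, 2, 2.5])

x_l2 = x / np.linalg.norm(x)     # Euclidean norm (the default for vectors)
x_l1 = x / np.linalg.norm(x, 1)  # Manhattan norm: sum of absolute values

# each result has unit length with respect to its own norm
print(np.isclose(np.sqrt(np.sum(x_l2**2)), 1.0))
print(np.isclose(np.sum(np.abs(x_l1)), 1.0))
```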

This is frequently applied to vectors of nonnegative values, whose normalised versions can be interpreted as probabilities: values between 0 and 1 that additionally add up to 1 (or, equivalently, 100%). In particular, this is useful for binned data:

In [13]:
c, b = np.histogram(heights, [-np.inf, 150, 160, 170, np.inf])
print(c)  # counts

[ 306 1776 1773  366]


We can now convert the counts to empirical probabilities:

In [14]:
p = c/np.sum(c)
print(p)

# We did not apply numpy.abs, because the values were already nonnegative.

[0.07249467 0.42075338 0.42004264 0.08670931]
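
As a sanity check, the sketch below recomputes the proportions from the four counts printed above (typed in by hand so that the example is self-contained) and confirms that they behave like probabilities.

```python
import numpy as np

c = np.array([306, 1776, 1773, 366])  # the bin counts printed above
p = c / np.sum(c)

print(np.isclose(np.sum(p), 1.0))    # the proportions add up to 1
print(np.all((p >= 0) & (p <= 1)))   # each one lies between 0 and 1
print(np.round(100 * p, 1))          # the same values as percentages
```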
