This Jupyter notebook is part of the supplementary material for the book "Materials Data Science" (Stefan Sandfeld, Springer, 2024, DOI 10.1007/978-3-031-46565-9). For further details, please refer to the accompanying webpage https://mds-book.org.

## 8.4 Sampling Strategies

### 8.4.1 Simple Random Sampling
The simplest approach to sampling is simple random sampling (SRS). It assigns
the same probability to each item that could be chosen during sampling. The items
contained in the sample are then chosen without replacement
(also see Chapter 6).

Here is an example where we compute the mean value of a population and of a
random sample. We start by importing NumPy and creating a population that consists
of the numbers from 1 to 99:

In [12]:
import numpy as np
population = np.arange(1, 100)
population.sum() / population.size

50.0

Next, a random number generator rng is created. It comes with the convenient
method choice, which draws a number of items (here: 10) from the given population;
the argument replace=False ensures that this is done without replacement:

In [13]:
rng = np.random.default_rng()
sample = rng.choice(population, size=10, replace=False)
sample

array([ 7, 36, 89, 42,  3,  4, 83, 63, 34, 91])
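
Since the generator above is created without a seed, the sample will differ on every run. As a minimal sketch, assuming that reproducible results are wanted, a seed can be passed to default_rng. The seed value 42 and the names rng_a, rng_b, sample_a, and sample_b are arbitrary choices for this illustration and are not part of the original notebook:

```python
import numpy as np

population = np.arange(1, 100)

# Two generators created with the same (arbitrarily chosen) seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.choice(population, size=10, replace=False)
sample_b = rng_b.choice(population, size=10, replace=False)

# Identical seeds give identical samples
print(np.array_equal(sample_a, sample_b))
# Without replacement, no item occurs twice within a sample
print(np.unique(sample_a).size == sample_a.size)
```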

Now we can compute the mean value of the sample, which is reasonably close to the population mean of 50:

In [14]:
sample.sum() / sample.size

45.2

If more than one sample is to be created, then each sample is chosen independently
of the other samples, i.e., for each sample the items are chosen from the full set of
items. Thus, the same item can occur in more than one sample, but no more than
once in a single sample. Based on the above code, we could create three samples
s1, s2, and s3 by

In [15]:
n = 10
s1 = rng.choice(population, size=n, replace=False)
s2 = rng.choice(population, size=n, replace=False)
s3 = rng.choice(population, size=n, replace=False)
print("mean values:", s1.sum() / n, s2.sum() / n, s3.sum() / n)

mean values: 35.4 56.7 66.3


Each of the three samples has its own mean value; each of these mean values is an estimate of the population mean.
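
To illustrate how such estimates scatter around the population mean, the following sketch repeats the sampling many times. The number of repetitions num_samples, the seed 0, and the name sample_means are assumptions made for this illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
population = np.arange(1, 100)
n = 10
num_samples = 1000  # hypothetical number of repetitions

# Each sample is drawn independently from the full population
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(num_samples)
])

print("population mean:", population.mean())
print("average of the sample means:", sample_means.mean())
print("standard deviation of the sample means:", sample_means.std())
```

Individual sample means can deviate noticeably from 50, whereas their average should lie much closer to the population mean.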

### 8.4.2 Systematic Sampling

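Systematic sampling or linear sampling is a very simple methodology where, from an
ordered list of 𝑁 elements, the first element of the sample is chosen randomly.
Subsequently, every 𝑘-th element is chosen. Once the end of the list is reached, one
can return to the beginning of the list.

We start by creating a population in which the list is repeated once, so that sampling can continue from the beginning after the last element: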

In [16]:
import numpy as np
rng = np.random.default_rng()

N = 25
population = np.array(2 * list(range(N)))  # copy the list so that it starts again after element 24
population

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24,  0,  1,  2,  3,  4,  5,  6,  7,  8,
        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])

Next, we compute the sampling interval as the integer part of N / n (see the MDS book for more explanations):

In [17]:
n = 6
k = N // n  # integer division
k

4

Finally, we pick every $k$-th element, beginning at the random index start; the second
pair of square brackets keeps only the first n elements.

In [18]:
start = rng.integers(0, N)
population[start::k][:n]

array([ 4,  8, 12, 16, 20, 24])
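
As an alternative to duplicating the list, the wrap-around can be sketched with the modulo operator. The function name systematic_sample and its signature are hypothetical and only serve this illustration; the interval is computed in the same way as above:

```python
import numpy as np

def systematic_sample(items, n, rng):
    """Pick n items at a fixed interval k, wrapping around at the end."""
    items = np.asarray(items)
    N = len(items)
    k = N // n                                # sampling interval
    start = rng.integers(0, N)                # random start index in [0, N)
    indices = (start + k * np.arange(n)) % N  # modulo wraps around the end
    return items[indices]

rng = np.random.default_rng()
population = np.arange(25)
sample = systematic_sample(population, n=6, rng=rng)
print(sample)
```

For N = 25 and n = 6 this selects the same elements as slicing the duplicated list, without storing the population twice.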

### 8.4.5 Stratified Sampling

Stratification denotes the process of splitting the whole population into “subpopulations”, the so-called strata (plural of stratum). The partitioning has to be non-overlapping, i.e., each element is in exactly one of the strata. Subsequently, simple random sampling (Section 8.4.1) is applied to each stratum separately.

We start by creating a population:

In [19]:
import numpy as np
rng = np.random.default_rng()
N = 25
population = np.arange(N)
population

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

This is then split into four strata:

In [20]:
strata = np.array_split(population, 4)
strata

[array([0, 1, 2, 3, 4, 5, 6]),
 array([ 7,  8,  9, 10, 11, 12]),
 array([13, 14, 15, 16, 17, 18]),
 array([19, 20, 21, 22, 23, 24])]
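
The requirement that each element is in exactly one stratum can be checked with a short sketch. The names sizes and combined are introduced here for illustration only:

```python
import numpy as np

population = np.arange(25)
strata = np.array_split(population, 4)

# 25 is not divisible by 4, so the first stratum gets one extra element
sizes = [len(stratum) for stratum in strata]
print(sizes)  # [7, 6, 6, 6]

# Joining all strata must give back every element exactly once
combined = np.concatenate(strata)
print(np.array_equal(np.sort(combined), population))  # True
```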

In a compact approach, we create a list that contains one random sample per stratum and then join these samples into a single array:

In [21]:
s = [rng.choice(stratum, size=2, replace=False) for stratum in strata]
sample = np.concatenate(s)
sample

array([ 2,  0,  8,  9, 13, 14, 22, 20])

This “inline for loop” is called a list comprehension: for each stratum visited by the
for loop, the result of rng.choice is added to the list.
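
For comparison, the following sketch writes the same step as an explicit for loop. Two generators with the same, arbitrarily chosen seed are used so that both variants draw identical numbers; the names rng_loop, rng_comp, s_loop, and s_comp are assumptions made for this illustration:

```python
import numpy as np

population = np.arange(25)
strata = np.array_split(population, 4)

# Same seed for both generators so that both variants draw identical numbers
rng_loop = np.random.default_rng(1)
rng_comp = np.random.default_rng(1)

# Variant 1: explicit for loop that appends to a list
s_loop = []
for stratum in strata:
    s_loop.append(rng_loop.choice(stratum, size=2, replace=False))

# Variant 2: list comprehension
s_comp = [rng_comp.choice(stratum, size=2, replace=False) for stratum in strata]

# Both variants yield the same stratified sample
print(np.array_equal(np.concatenate(s_loop), np.concatenate(s_comp)))
```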