# Python Programming: Systematic Sampling

<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

In [None]:
# Importing Numpy
#
import numpy as np

## Examples 

```text

# The idea in systematic sampling is that, given population units numbered from 1 to N,
# we compute the sampling interval k = N/n, where n is the number of units needed for the sample.
# We then choose a random start: a number between 1 and k.
# The unit at the random start is the first sample unit; the second unit is obtained by adding
# the sampling interval to the random start, and so on.
# There are two types of systematic sampling: linear and circular.
# Circular systematic sampling treats the population units numbered from 1 to N as a circle,
# so that if a step lands beyond unit N, say on N+2,
# the sample unit is the 2nd element in the population, and so on.
# The code that we will be working with can be used for both linear and circular sampling.
# It does not satisfy every rule of linear systematic sampling (for example, the case where
# k is not a whole number), but we can always extend it to a more general function.
# Note that the code indexes the data from 0, so index 1 refers to the 2nd unit.

```
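
The difference between the linear and circular variants described above can be sketched with two small helpers. This is only an illustration under stated assumptions: the names `linear_units` and `circular_units` are hypothetical, units are numbered from 1 to N as in the text, and k is taken as the whole-number part of N/n.

```python
def linear_units(N, n, r):
    """1-based unit numbers for a linear systematic sample (assumes 1 <= r <= k)."""
    k = N // n  # sampling interval, whole-number part of N / n
    return [r + i * k for i in range(n)]


def circular_units(N, n, r):
    """1-based unit numbers for a circular systematic sample (assumes 1 <= r <= N)."""
    k = N // n
    # Wrap around the end of the population: unit N + 2 maps back to unit 2.
    return [(r - 1 + i * k) % N + 1 for i in range(n)]


print(linear_units(N=16, n=4, r=2))     # [2, 6, 10, 14]
print(circular_units(N=16, n=4, r=11))  # [11, 15, 3, 7]
```

In the circular call the third step would land on unit 19, which is N + 3, so it wraps to unit 3.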

In [None]:
# Example 1
# ---
# ---
# Question: Perform systematic sampling given the following dataset:
# ---
#

# The data
sal_dat = np.array([25, 15, 20, 25, 18, 12, 24, 30, 15, 20, 10, 10, 11, 14, 22, 16])
salary = sal_dat * 1000

# Function for systematic sampling
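def sys_sample(df, r, n):
    # Sampling interval: whole-number part of N / n
    k = df.shape[0] // n

    # b collects the sampled positions (zero-based); a is the current position
    b = [None] * n
    a = r
    b[0] = a

    for i in np.arange(1, n):
        a = a + k

        # Circular case: wrap around once we step past the last position
        if a >= df.shape[0]:
            a = a - df.shape[0]

        b[i] = a

    return {"Data" : df[b], "Index" : b, "K" : k}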

# Do the sampling for start position,
# r = 1, and number of samples, n = 8
sys_sample(salary, r = 1, n = 8)

{'Data': array([15000, 25000, 12000, 30000, 20000, 10000, 14000, 16000]),
 'Index': [1, 3, 5, 7, 9, 11, 13, 15],
 'K': 2}
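
The example above fixes the start by hand. A minimal sketch of drawing the start at random instead is shown here, assuming zero-based positions (so the start lies between 0 and k-1 rather than between 1 and k). The names `rng`, `start`, and `positions` are illustrative, and the seed is arbitrary and only there for reproducibility.

```python
import numpy as np

# Same salary data as in Example 1
salary = np.array([25, 15, 20, 25, 18, 12, 24, 30, 15, 20, 10, 10, 11, 14, 22, 16]) * 1000

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)

n = 8                                # sample size
k = salary.shape[0] // n             # sampling interval
start = int(rng.integers(0, k))      # random start in 0..k-1 (upper bound excluded)

# Every k-th position from the random start, n positions in total
positions = np.arange(start, start + n * k, k)
sample = salary[positions]

print(start, positions, sample)
```

Because the start is below k and k is the whole-number part of N/n, the last position never exceeds N-1, so no wrap-around is needed in this linear case.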

## <font color="green">Challenges</font>

In [None]:
# Challenge 1
# ---
# Question: Perform systematic sampling given the following dataset
# ---
# Dataset = [33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
#            56 57 58 59 60 61 62 63 64 65 66 67 68 69]
# ---
# 
Dataset = np.array([33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
                    56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69])

l = len(Dataset)

sys_sample(Dataset, 1, 15)

{'Data': array([34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62]),
 'Index': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
 'K': 2}

In [None]:
# Challenge 2
# ---
# Question: You're given data ranging from 175 to 1000. Select a sample from this data by performing systematic sampling.
# ---
# 
OUR CODE GOES HERE

In [None]:
# Challenge 3
# ---
# Question: There are 19 students in this class. Let’s choose a 1-in-3 systematic sample from the 19 students in the class.
# ---
# 
OUR CODE GOES HERE

In [None]:
# Challenge 4
# ---
# Question: Select a sample of n = 12 members from a population of size N = 287.
# ---
# 
OUR CODE GOES HERE

In [17]:
# Challenge 5
# ---
# Question: You work for the Olympics Data Analytics in Geneva and would like to perform a study on the performance of the top
# Olympic marathon athletes. For reasons beyond your control, you resort to systematic sampling from the given Boston 2017 marathon dataset.
# ---
# Dataset url = http://bit.ly/BostonMarathonDataset
# ---
# 
import pandas as pd
import numpy as np

In [3]:

url  = "http://bit.ly/BostonMarathonDataset"

df = pd.read_csv(url)

In [47]:
# Sampling interval k for a sample of n = 100 rows
df.shape[0] // 100

264

In [26]:
m = [None] * 100
len(m)

100

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,5K,10K,15K,20K,Half,25K,30K,35K,40K,Pace,Proj Time,Official Time,Overall,Gender,Division
0,0,11,"Kirui, Geoffrey",24,M,Keringet,,KEN,,,0:15:25,0:30:28,0:45:44,1:01:15,1:04:35,1:16:59,1:33:01,1:48:19,2:02:53,0:04:57,-,2:09:37,1,1,1
1,1,17,"Rupp, Galen",30,M,Portland,OR,USA,,,0:15:24,0:30:27,0:45:44,1:01:15,1:04:35,1:16:59,1:33:01,1:48:19,2:03:14,0:04:58,-,2:09:58,2,2,2
2,2,23,"Osako, Suguru",25,M,Machida-City,,JPN,,,0:15:25,0:30:29,0:45:44,1:01:16,1:04:36,1:17:00,1:33:01,1:48:31,2:03:38,0:04:59,-,2:10:28,3,3,3
3,3,21,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,,,0:15:25,0:30:29,0:45:44,1:01:19,1:04:45,1:17:00,1:33:01,1:48:58,2:04:35,0:05:03,-,2:12:08,4,4,4
4,4,9,"Chebet, Wilson",31,M,Marakwet,,KEN,,,0:15:25,0:30:28,0:45:44,1:01:15,1:04:35,1:16:59,1:33:01,1:48:41,2:05:00,0:05:04,-,2:12:35,5,5,5


In [40]:

df = pd.DataFrame(df)

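# Function for systematic sampling (returns the sampled rows of a DataFrame)
def sys_sample(df, r, n):

    # Sampling interval: whole-number part of N / n
    k = df.shape[0] // n

    # b collects the sampled row positions (zero-based)
    b = [None] * n
    b[0] = r

    for i in np.arange(1, n):
        r = r + k

        # checks and balances: wrap around once we step past the last row
        if r >= df.shape[0]:
            r = r - df.shape[0]

        b[i] = r

    return df.iloc[b, :]



new_df = sys_sample(df, 1, 100)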

In [41]:
new_df = pd.DataFrame(new_df)

In [48]:
new_df.shape

(100, 25)
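
For the linear case, the same selection can also be written with pandas step slicing. The sketch below uses a small hypothetical frame named `toy` as a stand-in for the marathon data, with a zero-based start `r` assumed to satisfy 0 <= r < k. It is an alternative way of writing the idea, not the function used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the marathon data: 26 rows, one column
toy = pd.DataFrame({"Bib": np.arange(100, 126)})

n = 5                  # sample size
k = len(toy) // n      # sampling interval: 26 // 5 = 5
r = 0                  # zero-based start (assumed 0 <= r < k)

# Every k-th row starting at r; head(n) caps the result at n rows,
# because the slice alone can return one extra row when N is not a multiple of n
sample = toy.iloc[r::k].head(n)

print(sample.index.tolist())  # [0, 5, 10, 15, 20]
```

Without `head(n)` the slice here would return six rows (positions 0 through 25 in steps of 5), one more than requested.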

In [51]:
new_df["Citizen"]

1        NaN
265      NaN
529      NaN
793      NaN
1057     NaN
        ... 
25081    NaN
25345    NaN
25609    NaN
25873    NaN
26137    NaN
Name: Citizen, Length: 100, dtype: object