## Selection of primary sampling units (PSUs) <a name="section1"></a>

In the sections below, we draw primary sampling units (PSUs) using probability proportional to size (PPS) sampling techniques implemented in the *SampleSelection* class. The *SampleSelection* class has two main methods: *inclusion_probs()* and *select()*. The method *inclusion_probs()* computes the probabilities of selection and *select()* draws the random sample.

The following illustrates the use of *samplics* for sample selection. For the illustration,
- We consider a stratified cluster design.
- We decide a priori how many PSUs to sample from each stratum.
- For the cluster selection, we demonstrate PPS methods.

This example is not meant to be exhaustive. There are many use cases that are not covered in this tutorial. For example, some PSUs may be segmented due to their size, with segments selected in a subsequent step. Segment selection can be done with *samplics* in the same way as PSU selection, with PPS or SRS, after the segments have been created by the user.

First, let us import the Python packages necessary to run the tutorial.

In [1]:
import numpy as np
import pandas as pd

import samplics
from samplics.datasets import PSUFrame
from samplics.sampling import SampleSelection

### Sample Dataset <a name="section10"></a>

The file *sample_frame.csv* - shown below - contains synthetic data of 100 clusters classified by region (East, North, South and West). Each cluster represents a group of households. In the file, each cluster has an associated number of households (number_households_census) and a status variable (cluster_status) indicating whether the cluster is in scope or not.

This synthetic data represents a simplified version of the enumeration area (EA) frames found in many countries and used by major household survey programs such as the Demographic and Health Surveys (DHS), the Population-based HIV Impact Assessment (PHIA) surveys and the Multiple Indicator Cluster Surveys (MICS).

In [2]:
psu_frame_cls = PSUFrame()
psu_frame_cls.load_data()

psu_frame = psu_frame_cls.data
psu_frame.head(25)

Unnamed: 0,cluster,region,number_households_census,cluster_status,comment
0,1,North,105,1,
1,2,North,85,1,
2,3,North,95,1,
3,4,North,75,1,
4,5,North,120,1,
5,6,North,90,1,
6,7,North,130,1,
7,8,North,55,1,
8,9,North,30,1,
9,10,North,600,1,due to a large building


Often, sampling frames are not available for the sampling units of interest. For example, most countries do not have a list of all households or people living in the country. Even if such frames exist, it may not be operationally and financially feasible to directly select sampling units without any form of clustering. 

Hence, multistage sampling is a common strategy used by large national household surveys for selecting samples of households and people. At the first stage, geographic or administrative clusters of households are selected. At the second stage, a frame of households is created from the selected clusters and a sample of households is selected. At the third stage (if applicable), a sample of people is selected from the households in the sample. This is a high-level description of the process; actual implementations are usually much less straightforward and may require many adjustments to address complexities.

### PSU Probability of Selection <a name="section11"></a>

At the first stage, we use the probability proportional to size (pps) method to select a random sample of clusters. The measure of size is the number of households (number_households_census) as provided in the PSU sampling frame. The sample is stratified by region. The probabilities for stratified pps are obtained as follows: \begin{equation} p_{hi} = \frac{n_h M_{hi}}{\sum_{i=1}^{N_h}{M_{hi}}} \end{equation} where $p_{hi}$ is the probability of selection for unit $i$ from stratum $h$, $M_{hi}$ is its measure of size (mos), and $n_h$ and $N_h$ are the sample size and the total number of clusters in stratum $h$, respectively.
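As a quick check of the formula, the short sketch below applies it by hand to the ten North clusters listed in the frame above, with $n_h = 2$. It uses plain Python rather than *samplics*; the variable names are chosen for this illustration only.

```python
# Worked example of p_hi = n_h * M_hi / sum_i(M_hi) for the North stratum.
north_sizes = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]  # M_hi, from the frame
n_north = 2  # n_h, number of clusters to select in North

total_size = sum(north_sizes)  # denominator of the formula
north_probs = [n_north * size / total_size for size in north_sizes]

for cluster, prob in enumerate(north_probs, start=1):
    print(f"cluster {cluster:>2}: {prob:.6f}")

# Within a stratum, the inclusion probabilities add up to the sample size n_h.
print(f"sum of probabilities: {sum(north_probs):.1f}")
```

The first cluster gets 2 × 105 / 1385 ≈ 0.151625 and the very large cluster 10 gets 2 × 600 / 1385 ≈ 0.866426, which agrees with the psu_prob values computed by *inclusion_probs()* later in this tutorial.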

**Important.** The pps method is used in many surveys, not just in multistage household surveys. For example, in business surveys, establishments can vary greatly in size; hence pps methods are often used to select samples. Similarly, facility-based surveys can benefit from pps methods when frames with measures of size are available.

### PSU Sample size

For a stratified sampling design, the sample size is provided using a Python dictionary. Python dictionaries allow us to pair the strata with the sample sizes. Let's say that we want to select 3 clusters from stratum *East*, 2 from *West*, 2 from *North* and 3 from *South*. The snippet of code below demonstrates how to create the Python dictionary. Note that the keys of the dictionary must be spelled exactly like the values of the stratum variable (in our case, *region*).

In [3]:
psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}

print(f"\nThe sample size per domain is: {psu_sample_size}\n")


The sample size per domain is: {'East': 3, 'West': 2, 'North': 2, 'South': 3}
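Because the dictionary keys must match the stratum labels exactly, it can help to compare the two before selecting the sample. The sketch below is one possible check, not part of *samplics*: `check_strata_keys` is a hypothetical helper, and the `regions` list stands in for the distinct values of `psu_frame["region"]`.

```python
def check_strata_keys(sample_size, strata):
    """Return (missing, unknown): strata with no sample size, and keys matching no stratum."""
    strata_labels = set(strata)
    missing = strata_labels - set(sample_size)
    unknown = set(sample_size) - strata_labels
    return missing, unknown


# Stand-in for the distinct values of psu_frame["region"] (assumption for this sketch)
regions = ["North", "South", "East", "West"]

good = {"East": 3, "West": 2, "North": 2, "South": 3}
bad = {"east": 3, "West": 2, "North": 2, "South": 3}  # "east" is misspelled on purpose

print(check_strata_keys(good, regions))  # (set(), set())
print(check_strata_keys(bad, regions))   # ({'East'}, {'east'})
```

Two empty sets mean that every stratum has a sample size and that no key is misspelled.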



The function *array_to_dict()* converts an array to a dictionary by pairing the values of the array with their frequencies. We can use this function to calculate the number of clusters per stratum and store the result in a Python dictionary. Then, we modify the values of the dictionary to create the sample size dictionary.

If some of the clusters are certainties, then an exception will be raised. Hence, the user will have to handle the certainties manually. Better handling of certainties is planned for future versions of the library *samplics*.

In [4]:
from samplics import array_to_dict

frame_size = array_to_dict(psu_frame["region"])
print(f"\nThe number of clusters per stratum is: {frame_size}")

psu_sample_size = frame_size.copy()
psu_sample_size["East"] = 3
psu_sample_size["North"] = 2
psu_sample_size["South"] = 3
psu_sample_size["West"] = 2
print(f"\nThe sample size per stratum is: {psu_sample_size}\n")


The number of clusters per stratum is: {'East': 25, 'North': 10, 'South': 20, 'West': 45}

The sample size per stratum is: {'East': 3, 'North': 2, 'South': 3, 'West': 2}
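Regarding the certainties mentioned above: under the usual definition, a cluster is a certainty when its measure of size is so large that $n_h M_{hi} / \sum_i M_{hi} \geq 1$. The sketch below flags such units in plain Python before any selection is attempted. `find_certainties` is a hypothetical helper written for this illustration, and the exact condition that *samplics* checks internally is not shown in this tutorial.

```python
def find_certainties(sizes, n):
    """Return the 0-based positions of units whose PPS probability n * M / sum(M) is >= 1."""
    total = sum(sizes)
    return [i for i, size in enumerate(sizes) if n * size / total >= 1]


north_sizes = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]  # North stratum of the frame

# With 2 clusters, the largest probability is 2 * 600 / 1385 < 1: no certainty.
print(find_certainties(north_sizes, 2))  # []

# With 3 clusters, 3 * 600 / 1385 > 1: the last unit (cluster 10) is a certainty.
print(find_certainties(north_sizes, 3))  # [9]
```

A common way to handle flagged units is to set them aside as selected with probability 1 and to apply pps to the remaining clusters with a reduced sample size.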



In [5]:
stage1_design = SampleSelection(method="pps-sys", stratification=True, with_replacement=False)

psu_frame["psu_prob"] = stage1_design.inclusion_probs(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"],
    psu_frame["number_households_census"],
    )

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the PSU frame \n")
psu_frame.head(nb_obs)


First 15 observations of the PSU frame 



Unnamed: 0,cluster,region,number_households_census,cluster_status,comment,psu_prob
0,1,North,105,1,,0.151625
1,2,North,85,1,,0.122744
2,3,North,95,1,,0.137184
3,4,North,75,1,,0.108303
4,5,North,120,1,,0.173285
5,6,North,90,1,,0.129964
6,7,North,130,1,,0.187726
7,8,North,55,1,,0.079422
8,9,North,30,1,,0.043321
9,10,North,600,1,due to a large building,0.866426


### PSU Selection <a name="section12"></a>

In this section, we select a sample of PSUs using pps methods. In the section above, we calculated the probabilities of selection. That step is not necessary when using *samplics*. We can use the method *select()* to calculate the probabilities of selection and select the sample in one run. As shown below, the *select()* method returns a tuple of three arrays.
* The first array indicates the selected units (i.e. psu_sample = 1 if selected, and 0 if not selected).
* The second array provides the number of hits, which is useful when the sample is selected with replacement.
* The third array provides the probabilities of selection.

NB: *np.random.seed()* fixes the random seed to allow us to reproduce the random selection. 

In [6]:
np.random.seed(23)

psu_frame["psu_sample"], psu_frame["psu_hits"], psu_frame["psu_probs"] = stage1_design.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"]
    )

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the PSU frame with the sampling information \n")
psu_frame.head(nb_obs)


First 15 observations of the PSU frame with the sampling information 



Unnamed: 0,cluster,region,number_households_census,cluster_status,comment,psu_prob,psu_sample,psu_hits,psu_probs
0,1,North,105,1,,0.151625,0,0,0.151625
1,2,North,85,1,,0.122744,0,0,0.122744
2,3,North,95,1,,0.137184,0,0,0.137184
3,4,North,75,1,,0.108303,0,0,0.108303
4,5,North,120,1,,0.173285,0,0,0.173285
5,6,North,90,1,,0.129964,0,0,0.129964
6,7,North,130,1,,0.187726,1,1,0.187726
7,8,North,55,1,,0.079422,0,0,0.079422
8,9,North,30,1,,0.043321,0,0,0.043321
9,10,North,600,1,due to a large building,0.866426,1,1,0.866426


The default setting ```sample_only=False``` returns the entire frame. We can easily reduce the output data to the sample by filtering, i.e. ```psu_sample == 1```. However, if we are only interested in the sample, we can use ```sample_only=True``` when calling *select()*. This reduces the output data to the sampled units, and ```to_dataframe=True``` converts the output to a pandas dataframe (pd.DataFrame). Note that the columns in the dataframe will be reduced to the minimum.

In [7]:
np.random.seed(23)

psu_sample = stage1_design.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    to_dataframe = True,
    sample_only = True
    )

print("\nPSU sample without the non-sampled units\n")
psu_sample


PSU sample without the non-sampled units



Unnamed: 0,_samp_unit,_stratum,_mos,_sample,_hits,_probs
0,7,North,130,1,1,0.187726
1,10,North,600,1,1,0.866426
2,16,South,190,1,1,0.209174
3,24,South,75,1,1,0.082569
4,29,South,200,1,1,0.220183
5,34,East,305,1,1,0.210587
6,45,East,450,1,1,0.310702
7,52,East,700,1,1,0.483314
8,64,West,300,1,1,0.091673
9,86,West,280,1,1,0.085561


The systematic selection method can be implemented with or without replacement. The other *samplics* algorithms for selecting samples with unequal probabilities of selection are the Brewer, Hanurav-Vijayan (hv), Murphy, and Rao-Sampford (rs) methods. As shown below, all these sampling techniques can be specified when instantiating the *SampleSelection* class; then call *select()* to draw samples.

```python 
SampleSelection(method="pps-sys", with_replacement=True)
SampleSelection(method="pps-sys", with_replacement=False)
SampleSelection(method="pps-brewer", with_replacement=False)
SampleSelection(method="pps-hv", with_replacement=False) # Hanurav-Vijayan method
SampleSelection(method="pps-murphy", with_replacement=False)
SampleSelection(method="pps-rs", with_replacement=False) # Rao-Sampford method
```
For example, if we wanted to select the sample using the Rao-Sampford method, we could use the following snippet of code.

In [8]:
np.random.seed(23)

stage1_sampford = SampleSelection(method="pps-rs", stratification=True, with_replacement=False)

psu_sample_sampford = stage1_sampford.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    to_dataframe=True,
    sample_only=False
    )

psu_sample_sampford

Unnamed: 0,_samp_unit,_stratum,_mos,_sample,_hits,_probs
0,1,North,105,0,0,0.151625
1,2,North,85,0,0,0.122744
2,3,North,95,1,1,0.137184
3,4,North,75,0,0,0.108303
4,5,North,120,0,0,0.173285
...,...,...,...,...,...,...
95,96,West,95,1,1,0.029030
96,97,West,40,0,0,0.012223
97,98,West,105,0,0,0.032086
98,99,West,320,0,0,0.097785
