# Week 7: Task 14-15
#### Name: Kai Ferragallo-Hawkins
#### Date: 1.3.2023

## Task 14: What is Survey Sampling?

For this task, our aim is to give a summary of the topics covered throughout the past 7 weeks of classes. I feel that what we learned can be broken down into three main categories, which will be covered later in greater depth:
- **Survey Design**, which covers the fundamental basics of surveys such as the selection of a population, the undercoverage and overcoverage of the frame, and weighting to account for disparities.
- **Survey Methods**, which describes the tools used to select a population for your survey, such as simple random (SRS), systematic (SYS) and cluster sampling (CLU).
- **Survey Accuracy**, where descriptors (i.e., mean, standard deviation) and estimators (i.e. design variance) can help you understand your sample and isolate more efficient survey methods.

### Survey Design
When first designing a survey, it is important to understand:
- What is the **target population**, or population of interest, you are aiming to look at?
*i.e., When looking at unemployment, the population of interest would be laborers and unemployed.*
- What is the **sampling frame**, or the database from where the sample is drawn?
*i.e., You can use Province91, a dataset created from the official statistics of Finland that breaks down unemployment information by Municipality. You can see how that can be important into Python below.*


In [30]:
import pandas as pd

### Province 91
## Importing and Displaying Province91 Data
province91 = pd.read_csv("files/assignment1/province91.txt", delim_whitespace=True)
display(province91)

Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85
0,1,1,1,Jyväskylä,67200,33786,4123,26881,1
1,1,2,2,Jämsä,12907,6016,666,4663,1
2,1,2,3,Jämsänkoski,8118,3818,528,3019,1
3,1,2,4,Keuruu,12707,5919,760,4896,1
4,1,3,5,Saarijärvi,10774,4930,721,3730,1
5,1,5,6,Suolahti,6159,3022,457,2389,1
6,1,3,7,Äänekoski,11595,5823,767,4264,1
7,2,5,8,Hankasalmi,6080,2594,391,2179,0
8,2,6,9,Joutsa,4594,2069,194,1823,0
9,2,7,10,Jyväskmlk,29349,13727,1623,9230,0


- Potential for errors, like measurement, non-response, sampling, and coverage errors - and from that, how you would then **weight** your sample to better account for the disparities.
*i.e., say I have the very specific sample below. If I wanted to get its values directly in relation to the actual survey, I would need to take a **design weight**. This is a 4, or that the specific sample must be multiplied by 4 to get to the same relative weight as the original sample. Relatively simplistic example but does show what design weight is. Auxiliary data can be used to improve this process.*

In [31]:
### Weighting
## Specific Sample
# Here, I get a list of specific municipalities, then pick those out of the total province 91 dataframe. 
municipalities_to_select = ['Jyväskylä', 'Saarijärvi', 'Joutsa', 'Kinnula', 'Korpilahti', 'Leivonmäki', 'Petäjävesi', 'Säynätsalo']
province91_spec = province91[province91['Municipality'].isin(municipalities_to_select)]
display(province91_spec)

# Getting the dataframe's weight.
prov91_weight = len(province91)/len(province91_spec)
print(f"Prov91 Weight: {prov91_weight}")

Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85
0,1,1,1,Jyväskylä,67200,33786,4123,26881,1
4,1,3,5,Saarijärvi,10774,4930,721,3730,1
8,2,6,9,Joutsa,4594,2069,194,1823,0
12,2,8,13,Kinnula,2324,927,129,675,0
16,2,1,17,Korpilahti,5181,2144,239,1793,0
20,2,6,21,Leivonmäki,1370,573,61,545,0
24,2,7,25,Petäjävesi,3800,1737,262,1352,0
28,2,1,29,Säynätsalo,3628,1615,166,1226,0


Prov91 Weight: 4.0


### Survey Methods
When you are handling massive populations, it is unreasonable to go and speak or monitor every single individual. There are a number of useful ways to handle sampling surveys in order to try and reduce the load while still getting an accurate sample and result. These methods are:

- A **Simple Random Sample (SRS)**, which is when the values in a larger dataset are randomly selected to some new n in order to get a sample. There are two types of SRS samples - **Without Replacement (SRSWOR)**, or when the values are simply chosen out from the original dataset at random without changing the original dataset during the selection process, and **With Replacement (SRSWR)**, which is when the sample element is removed from the dataset after selecting one random value, so that it is not chosen again.
*i.e., Below is an example of how you would do a basic simple random sampling in Python on a pandas dataframe.*

In [32]:
### Simple Random Sampling
## Without Replacement
# I am setting a random_state here again because .sample comes from pandas usage of numpy, rather than numpy itself. Numpy's random state, therefore, does not automatically apply to pandas.
prov91_SRSWOR = province91.sample(n=8, replace=False, random_state=123)
print("Province 91 Simple Random Without Replacement (SRSWOR)")
display(prov91_SRSWOR)

## With Replacement
prov91_SRSWR = province91.sample(n=8, replace=True, random_state=123)
print("Province 91 Simple Random With Replacement (SRSWR)")
display(prov91_SRSWOR)

Province 91 Simple Random Without Replacement (SRSWOR)


Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85
7,2,5,8,Hankasalmi,6080,2594,391,2179,0
31,2,8,32,Viitasaari,8641,4011,568,3119,0
5,1,5,6,Suolahti,6159,3022,457,2389,1
26,2,4,27,Pylkönmäki,1266,545,98,473,0
8,2,6,9,Joutsa,4594,2069,194,1823,0
27,2,3,28,Sumiainen,1426,617,79,485,0
12,2,8,13,Kinnula,2324,927,129,675,0
21,2,6,22,Luhanka,1153,522,54,435,0


Province 91 Simple Random With Replacement (SRSWR)


Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85
7,2,5,8,Hankasalmi,6080,2594,391,2179,0
31,2,8,32,Viitasaari,8641,4011,568,3119,0
5,1,5,6,Suolahti,6159,3022,457,2389,1
26,2,4,27,Pylkönmäki,1266,545,98,473,0
8,2,6,9,Joutsa,4594,2069,194,1823,0
27,2,3,28,Sumiainen,1426,617,79,485,0
12,2,8,13,Kinnula,2324,927,129,675,0
21,2,6,22,Luhanka,1153,522,54,435,0


- **Systematic Sampling (SYS)**, a commonly used form of sampling where you select every successive x element, where x is defined as N/n, or the original sample size to your new desired sample size.
*i.e., Below is a basic function for systematic sampling that will start at the 0 index and step based on a given parameter throughout the function.*

In [33]:
import numpy as np

### Systematic Sampling
def systematic_sampling(df, step):
    """Preforms basic systematic sampling on a python dataset, utilizing a dataframe and step value. Indexes the values with numpy.arrange, which takes evenly spaced values in an interval, and then locates those relevant indexes into a new dataframe."""
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

## Creating the desired sample
province91_sys = systematic_sampling(province91, step=4)
print("Province 91 Systematic Sample (SYS)")
display(province91_sys)


Province 91 Systematic Sample (SYS)


Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85
0,1,1,1,Jyväskylä,67200,33786,4123,26881,1
4,1,3,5,Saarijärvi,10774,4930,721,3730,1
8,2,6,9,Joutsa,4594,2069,194,1823,0
12,2,8,13,Kinnula,2324,927,129,675,0
16,2,1,17,Korpilahti,5181,2144,239,1793,0
20,2,6,21,Leivonmäki,1370,573,61,545,0
24,2,7,25,Petäjävesi,3800,1737,262,1352,0
28,2,1,29,Säynätsalo,3628,1615,166,1226,0


- **Stratified Sampling (SYS)**, or when you select specific subpopulations from within the sample. There are two types of stratified sampling allocations. **Equal Stratum Sampling** is when your new stratum(s) are chosen from an equal amount, regardless of their relative individual size. **Proportional Stratum Sampling** is when you choose based on their relative individual size.
*i.e., below I take a new stratum based on whether the province's unemployed population is above the sample mean. This is only 25% of provinces. if I do an equal allocation, then there would be 4 above the mean and 4 below. If I do proportional, then there would only be 2 above the mean and 6 from below.*
- **Proportional Sampling (PPS)**, or an unequal form of sampling built off of stratified sampling where the probability of choosing an element is proportional to its size.
*i.e., if one province had a population 10x that of another, it would be more likely to be selected from the sample. The size variable must therefore be known before sampling. This is not shown below, but my stratified sample follows a similar mindset, with a greater selection of more populated municipalities.*

In [34]:
### Stratified Sampling
# Mean notably above the median for UE91, so creating new stratum that separates UE91 by the mean.
province91['UE91 Above Mean'] = (province91['UE91'] > province91['UE91'].mean()).astype(int)

## Stratum Sampling by UE91 Mean Stratum - Proportional Allocation
prov91_prop_ue91mean = province91.groupby("UE91 Above Mean", group_keys = False).apply(lambda x: x.sample(frac=0.25, random_state=123))
print("Province 91 Proportional Stratum Sample (SYS-P)")
display(prov91_prop_ue91mean)

## Stratum Sampling by UE91 Mean Stratum - Equal Allocation
prov91_equal_ue91mean = province91.groupby("UE91 Above Mean", group_keys = False).apply(lambda x: x.sample(4, random_state=123))
print("Province 91 Equal Allocation Stratum Sample (SYS-E)")
display(prov91_equal_ue91mean)

Province 91 Proportional Stratum Sample (SYS-P)


Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85,UE91 Above Mean
12,2,8,13,Kinnula,2324,927,129,675,0,0
27,2,3,28,Sumiainen,1426,617,79,485,0,0
28,2,1,29,Säynätsalo,3628,1615,166,1226,0,0
26,2,4,27,Pylkönmäki,1266,545,98,473,0,0
22,2,7,23,Multia,2375,1059,119,925,0,0
15,2,5,16,Konnevesi,3453,1557,201,1215,0,0
19,2,5,20,Laukaa,16042,7218,874,4952,0,1
0,1,1,1,Jyväskylä,67200,33786,4123,26881,1,1


Province 91 Equal Allocation Stratum Sample (SYS-E)


Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85,UE91 Above Mean
12,2,8,13,Kinnula,2324,927,129,675,0,0
27,2,3,28,Sumiainen,1426,617,79,485,0,0
28,2,1,29,Säynätsalo,3628,1615,166,1226,0,0
26,2,4,27,Pylkönmäki,1266,545,98,473,0,0
19,2,5,20,Laukaa,16042,7218,874,4952,0,1
0,1,1,1,Jyväskylä,67200,33786,4123,26881,1,1
6,1,3,7,Äänekoski,11595,5823,767,4264,1,1
9,2,7,10,Jyväskmlk,29349,13727,1623,9230,0,1


- **Cluster Sampling (CLU)**, or when you divide a population into separate clusters and then randomly select one of those clusters - or multiple clusters or additional sampling methods on top of those clusters, depending on your approach - to be your sample.
*i.e., below, I get a random list of 1-4 based on the length of province 91, then shuffle that list of numbers and apply it to create cluster samples. You can see the division between the clusters in the display below.*

In [35]:
### Cluster Sampling
# Randomizing cluster values from 1 to 4
np.random.seed(123)
randomized_task = [1, 2, 3, 4]*(len(province91)//4)
np.random.shuffle(randomized_task)

# Creating cluster list in code
province91['Cluster_Sampling'] = randomized_task

# Showing that the clusters were successful.
print("Province 91 Cluster Sample (CLU)")
cluster_counts = province91['Cluster_Sampling'].value_counts()
print(cluster_counts)

Province 91 Cluster Sample (CLU)
Cluster_Sampling
4    8
2    8
3    8
1    8
Name: count, dtype: int64


### Survey Accuracy
It is not enough to simply grab a sample for your survey. It is also important to understand the information behind it, and to ensure that it is accurate. There are some other tools to do this:

- The estimators of **total**, **ratio**, **median**, and other descriptors. These are important for understanding the basics of your sample, and can be used in a variety of other ways.
*i.e., Getting the total can be based off of your weighted values. The example below does this with the simple random sampling with replacement. As one can see, the total value is different from the predicted total value simply due to the random nature of SRS. Creating an accurate survey method is an important part of survey sampling.*


In [36]:
### Weighted Total
prov91_weight = len(province91)/len(prov91_SRSWR)
total_ue91 = prov91_weight*prov91_SRSWR['UE91'].sum()
print(f"SRSWR's prediced total value: {total_ue91}")
print(f"Actual total value: {province91['UE91'].sum()}")

SRSWR's prediced total value: 10968.0
Actual total value: 15098


*Getting all descriptors, however, can prove more useful than just the total, such as standard deviation and the expected 25% and 75%tile values. Below is an example function made to show all the different clusters descriptors in a pandas dataframe.*


In [37]:
### Descriptors
def cluster_descr(df, cluster_column, sorting_column):
    """Given a Pandas dataframe, a cluster column, and a sorting column, find descriptors for all clusters. Implement random and limited cluster selection and working regardless of what value you consider a cluster in the future?"""
    cluster_descriptors = pd.DataFrame()
    clusters = sorted(df[cluster_column].unique())
    for cluster_num in range(1, len(clusters)+1):
        cluster_stats = df.loc[df[cluster_column] == cluster_num, sorting_column].describe()
        cluster_descriptors[f'Cluster {clusters[cluster_num-1]}'] = cluster_stats
    return cluster_descriptors

# Getting Cluster descriptors
cluster_descriptors = cluster_descr(province91, "Cluster_Sampling", "UE91")
print("Cluster Descriptors")
display(cluster_descriptors)

Cluster Descriptors


Unnamed: 0,Cluster 1,Cluster 2,Cluster 3,Cluster 4
count,8.0,8.0,8.0,8.0
mean,320.625,332.875,290.25,943.5
std,274.370416,270.117137,244.243292,1378.191465
min,61.0,54.0,98.0,94.0
25%,138.5,160.0,125.0,177.75
50%,202.5,275.0,177.0,359.5
75%,436.0,425.25,364.0,905.25
max,767.0,874.0,760.0,4123.0


- **Variance** is the measure of the dispersion, or spread, in your data. **Sample Variance** is the dispersion in your sample, while **Population Variance** would be the variance for the dataset it comes from. This can be useful in understanding how well-knit your information is. Comparing the variances can get you the **design effect (deff)**, which relates how accurate one sample is relative to another.
*i.e., below is an example of the variance for the systematic sample, for the simple random sample, and the design effect of 5.92. Design effect is typically compared with SRS. Since the value is greater than 1, the SRS has less variance than the systematic sampling. A better way to check this would be to take samples with these methods a significant number of times, and then compare how the design effect looks afterward.*

In [38]:
### Variance and Design Effect

def system_var (main, sample, sample_key, is_sqrt = False):
    """"Calculates the system variance by manually calculating the information for design variance based on population sizes, given two pandas dataframes of the original information and the sample. Population variance calculated through Pandas var."""
    sqr = 0.5 if is_sqrt else 1
    s_var = ((len(main)**2)*(1-len(sample)/len(main))*(1/len(sample))*sample[sample_key].var(ddof=1))**sqr
    return s_var

## SRS Sample
s_var91_sys = system_var(province91, province91_spec, "UE91", is_sqrt = True)
s_var91_srs = system_var(province91, prov91_SRSWR, "UE91", is_sqrt = True)

## deff Calculation
s91_deff = s_var91_sys/s_var91_srs

## Display
display(pd.DataFrame({'Year': ["1991"],'SYS Var': [s_var91_sys],'SRS Var': [s_var91_srs],'Deff': [s91_deff]}))


Unnamed: 0,Year,SYS Var,SRS Var,Deff
0,1991,13549.356569,2285.459629,5.928504


## Task 15: Survey Science, Yesterday, Today, Tomorrow

For this task, we were to watch the videos at the BNU 2022 August Workshop and give a brief summary of what they were about and what it evokes after the course. Some points of discussion were:
- Randomized surveys continue to be important today, but suffer increasing rates of nonresponse and high cost, so questions over the validity of these statistics exist, as people move over to more real-time and less costly forms of data.
- Multiple data sources, or some shift in the approach to survey sampling, may be necessary to account for these changes.
- Sampling's public acceptance only emerged in the early-to-mid 20th century, and has generally stayed consistent since then. New methods have emerged without rejecting older ones. 1970s helped create greater cumulative change.

This brings to me the thought that survey sampling today may benefit from utilizing multiple survey analysis methods, like the machine learning approach that I watched last week, to better isolate a potentially complex but effective way of handling increasing nonresponse within surveys. There will very likely continue to be a use for direct surveys - they remain generally more effective at understanding public opinion then other, less costly methods - but there can be adjustments for the modern era.