# Week 4: Task 9-10
#### Name: Kai Ferragallo-Hawkins
#### Date: 12.2.2023

## Setup

I spent a few hours trying to understand the samplics package in order to do systematic sampling. My conclusion is that, while the package is interesting (especially for weighted sampling methods), the tasks assigned here are a bit too simplistic to benefit from it. I therefore wrote my own function for systematic sampling in Pandas, which can be seen below.



In [66]:
import pandas as pd
import numpy as np
import samplics as smp
import seaborn as sns
import SurveySamplingFunctions as ssf

### Functions
def systematic_sampling(df, step):
    """Performs basic systematic sampling on a pandas DataFrame, given the DataFrame and a step value. Builds the row positions with numpy.arange, which returns evenly spaced values in an interval, and then selects those positions into a new DataFrame."""
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

def system_var(main, sample, sample_key, Is_Sqrt = False):
    """Estimates the design variance of the estimated total with the SRS (without replacement) formula N^2 * (1 - n/N) * s^2 / n, given two pandas DataFrames: the full population (main) and the sample. The sample variance s^2 is calculated through Pandas var with ddof=1. With Is_Sqrt=True, the square root (the standard error) is returned instead."""
    sqr = 0.5 if Is_Sqrt else 1
    s_var = ((len(main)**2)*(1-len(sample)/len(main))*(1/len(sample))*sample[sample_key].var(ddof=1))**sqr
    return s_var

### Province 91
## Importing Province91 Data
province91 = pd.read_csv("files/assignment1/province91.txt", delim_whitespace=True)

### Province 17
## Importing Province17 Data
province17 = pd.read_csv("files/assignment1/province17.txt", delimiter = '\t', encoding='latin-1')

## Task 9: Systematic Sampling & UE Totals

For me, systematic sampling resulted in a higher adjusted total for UE (Province91: 16464 -> 23580; Province17: 9625 -> 12175.6). This places the new Province91 total above the actual value, and the Province17 total below the actual value. The result also depends on the starting point: the function I wrote always starts at the first row (index 0), so the sample would change with a different start (e.g. rows 0, 4, ... vs 1, 5, ... vs 2, 6, ... vs 3, 7, ...).
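To illustrate how the sample depends on the starting row, here is a minimal sketch of a variant that takes a start offset. The name `systematic_sampling_from` and its `start` parameter are hypothetical additions for illustration and are not part of the function used in this notebook; the toy data is made up.

```python
import numpy as np
import pandas as pd

def systematic_sampling_from(df, step, start=0):
    """Hypothetical variant: take every `step`-th row, beginning at position `start`."""
    if not 0 <= start < step:
        raise ValueError("start must satisfy 0 <= start < step")
    positions = np.arange(start, len(df), step)
    return df.iloc[positions]

# Toy data (made up for illustration): 10 rows holding the values 0..9
toy = pd.DataFrame({"value": range(10)})

# Each possible start selects a different set of rows
samples = {
    start: systematic_sampling_from(toy, step=4, start=start)["value"].tolist()
    for start in range(4)
}
for start, values in samples.items():
    print(start, values)
```

With a step of 4 there are only four possible samples, and they do not all have the same size when the number of rows is not a multiple of the step. A random start could be drawn with something like `np.random.default_rng().integers(0, step)`.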

In [67]:
### Province 91
## Systematic Sampling
province91_sys = systematic_sampling(province91, step=4)
print("Province 91 Sample")
display(province91_sys)

## Calculating UE Total
prov91_weight = len(province91)/len(province91_sys)
total_ue91 = prov91_weight*province91_sys["UE91"].sum()
      
### Province 17
## Systematic Sampling
province17_sys = systematic_sampling(province17, step=3)
print("Province 17 Sample")
display(province17_sys)

## Calculating UE Total
prov17_weight = len(province17)/len(province17_sys)
total_ue17 = prov17_weight*province17_sys["UE17"].sum()
      
### Both
display(pd.DataFrame({'Year': ["1991", "2017"],'Total Unemployed, Absolute': [total_ue91, total_ue17]}))

Province 91 Sample


Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85
0,1,1,1,Jyväskylä,67200,33786,4123,26881,1
4,1,3,5,Saarijärvi,10774,4930,721,3730,1
8,2,6,9,Joutsa,4594,2069,194,1823,0
12,2,8,13,Kinnula,2324,927,129,675,0
16,2,1,17,Korpilahti,5181,2144,239,1793,0
20,2,6,21,Leivonmäki,1370,573,61,545,0
24,2,7,25,Petäjävesi,3800,1737,262,1352,0
28,2,1,29,Säynätsalo,3628,1615,166,1226,0


Province 17 Sample


Unnamed: 0,Municipality,id,POP17,LAB17,UE17,HOU17,URB17_new
0,Hankasalmi,1,5019,1829,222,2376,0
3,Jämsä,4,20877,7989,1175,10647,1
6,Keuruu,7,9919,3517,493,4970,1
9,Konnevesi,10,2748,1022,134,1339,0
12,Laukaa,13,18978,8082,827,7819,1
15,Muurame,16,10097,4749,419,4144,1
18,Saarijärvi,19,9589,3368,631,4738,1
21,Viitasaari,22,6411,2259,334,3273,1


Unnamed: 0,Year,"Total Unemployed, Absolute"
0,1991,23580.0
1,2017,12175.625
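
As a quick check, both totals can be reproduced from the rows displayed above. The population sizes N = 32 and N = 23 are not printed anywhere in this notebook; they are inferred here from the weights implied by the output (23580 / 5895 = 4 and 12175.625 / 4235 = 2.875, with n = 8 in both cases), so treat them as assumptions.

```python
# UE values copied from the two displayed systematic samples
ue91_sample = [4123, 721, 194, 129, 239, 61, 262, 166]
ue17_sample = [222, 1175, 493, 134, 827, 419, 631, 334]

def expansion_total(sample_values, population_size):
    """Estimate a population total as (N / n) * sum of the sample values."""
    weight = population_size / len(sample_values)
    return weight * sum(sample_values)

# N = 32 and N = 23 are inferred from the output, not stated in the source
total_ue91 = expansion_total(ue91_sample, population_size=32)
total_ue17 = expansion_total(ue17_sample, population_size=23)
print(total_ue91, total_ue17)
```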


## Task 10: Statistical Selection

I've realized that my variance calculations may be off: when I calculated the variance for the SYS sample, I got the value that the example gives for the SRS sample. I'd be interested in knowing why that is the case.

However, if the deff value is correct (2.74), it shows that my SRS sample has a significantly lower variance than the SYS sample, and therefore that the implicit stratification does not make the variance estimator more precise.
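
One possible explanation for the matching value, offered as an assumption since the example itself is not shown here: `system_var` applies the SRS without-replacement formula N^2 * (1 - n/N) * s^2 / n to whatever sample it receives, regardless of how that sample was drawn. The sketch below reproduces the calculation for the eight municipalities of the systematic sample using only the standard library. N = 32 is inferred from the weight of 4 in Task 9 and is not stated in the source. Note that `statistics.variance` uses the n - 1 denominator, like `var(ddof=1)` in Pandas, which is the sample variance s^2 that this formula is written in terms of.

```python
import math
import statistics

# UE91 values of the eight municipalities in the systematic sample
ue91_sys = [4123, 721, 194, 129, 239, 61, 262, 166]

N = 32             # population size, inferred from the Task 9 weight (assumption)
n = len(ue91_sys)  # sample size

# Sample variance with the n - 1 denominator
s2 = statistics.variance(ue91_sys)

# SRS without-replacement variance of the estimated total, and its square root
var_total = N**2 * (1 - n / N) * s2 / n
se_total = math.sqrt(var_total)

print(round(se_total, 2))  # about 13549.36
```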

In [68]:
### Province 91
## Specific Sample
municipalities_to_select = ['Jyväskylä', 'Saarijärvi', 'Joutsa', 'Kinnula', 'Korpilahti', 'Leivonmäki', 'Petäjävesi', 'Säynätsalo']
province91_spec = province91[province91['Municipality'].isin(municipalities_to_select)]
display(province91_spec)

## Calculating the System Variance of UE91 SYS (sqrt)
# Done by manually calculating the design-variance factors and multiplying them by the variance as given by Pandas.
# ddof=0 would give the population variance (divide by n), while ddof=1 gives the sample variance (divide by n - 1).
# ddof=1 works for the example, so it is used. Look more into the reason?
s_var91_sys = system_var(province91, province91_spec, "UE91", Is_Sqrt = True)

## SRS Sample
prov91_sample = province91.sample(n=8, replace=False, random_state=123456)
s_var91_srs = system_var(province91, prov91_sample, "UE91", Is_Sqrt = True)

## deff Calculation
s91_deff = s_var91_sys/s_var91_srs

## Display
display(pd.DataFrame({'Year': ["1991"],'SYS Var': [s_var91_sys],'SRS Var': [s_var91_srs],'Deff': [s91_deff]}))

Unnamed: 0,Stratum,Cluster,Id,Municipality,POP91,LAB91,UE91,HOU85,URB85
0,1,1,1,Jyväskylä,67200,33786,4123,26881,1
4,1,3,5,Saarijärvi,10774,4930,721,3730,1
8,2,6,9,Joutsa,4594,2069,194,1823,0
12,2,8,13,Kinnula,2324,927,129,675,0
16,2,1,17,Korpilahti,5181,2144,239,1793,0
20,2,6,21,Leivonmäki,1370,573,61,545,0
24,2,7,25,Petäjävesi,3800,1737,262,1352,0
28,2,1,29,Säynätsalo,3628,1615,166,1226,0


Unnamed: 0,Year,SYS Var,SRS Var,Deff
0,1991,13549.356569,4950.960426,2.736713
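
One thing worth checking: both calls above pass `Is_Sqrt = True`, so the "SYS Var" and "SRS Var" columns hold square roots of the variance estimates, i.e. standard errors. The design effect (deff) is conventionally defined as a ratio of variances, while the ratio of standard errors is usually called deft. Which of the two the assignment expects is not stated here. A small sketch of the difference, using the values from the table above:

```python
# Standard errors copied from the output table above
se_sys = 13549.356569
se_srs = 4950.960426

# Ratio of standard errors (this is what the notebook reports as Deff)
deft = se_sys / se_srs

# Ratio of variances, which equals deft squared
deff = se_sys**2 / se_srs**2

print(round(deft, 2), round(deff, 2))
```

Either way the ratio is well above 1, so the direction of the conclusion in the text does not change, only its size.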
