# JCOIN Stigma Survey Protocol 2: Strata and PSU inputs provided by Amerispeak



- Exploring the PSU and strata variables provided by Amerispeak for variance estimation

1.	You will find that PSU = 1, 2, and some other low numbers appear in many strata. The file is set up this way because SAS is lenient about repeated PSU ids. I don't know whether samplics will be, so you would be better off making the ids unique with `df['ultimate_psu'] = df.groupby(['vstrata', 'vpsu']).ngroup()` or `df['ultimate_psu'] = df['vstrata']*1000 + df['vpsu']` (the second form assumes fewer than 1000 PSUs per stratum).
2.	You will almost inevitably find that some strata or some PSUs have only one observation. When standard errors are calculated, these result in 0/0 (the within-stratum residual sum of squares divided by n-1), which will likely show up as NULLs or NAs in the output, and/or as error messages about a singleton PSU, one PSU per stratum, or something like that (unless Mamadou Diallo decided to invent his own terminology for this problem, as he did with some other concepts). The standard hack is to combine strata so that the one-PSU-per-stratum case never happens. What I would do is identify all the strata with a single PSU and combine them into one fake stratum. This is a conservative step that increases the standard errors by a tiny amount: you would probably have 10 to 20 cases like that out of your 6K, so their impact is on the order of 20/6000. That is better than having missing standard errors.
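
The two suggestions above can be sketched on a toy frame. This is a minimal illustration, not the survey's actual processing: the data values are made up, the column names `vstrata` and `vpsu` follow the advice above, and the `-1` code for the fake stratum and the name `vstrata_collapsed` are arbitrary choices for the example.

```python
import pandas as pd

# Toy design frame (made-up values): PSU ids restart at 1 within each stratum.
toy = pd.DataFrame({
    "caseid":  [1, 2, 3, 4, 5, 6, 7],
    "vstrata": [10, 10, 10, 20, 20, 30, 40],
    "vpsu":    [1, 2, 2, 1, 2, 1, 1],
})

# Step 1: send every stratum that holds a single PSU to one fake stratum,
# coded -1 here (an arbitrary label chosen for this sketch).
psus_per_stratum = toy.groupby("vstrata")["vpsu"].transform("nunique")
toy["vstrata_collapsed"] = toy["vstrata"].where(psus_per_stratum > 1, other=-1)

# Step 2: build PSU ids that are unique across the whole file. Grouping on the
# ORIGINAL stratum keeps PSUs that came from different strata distinct, so the
# fake stratum ends up with several PSUs rather than one merged PSU.
toy["ultimate_psu"] = toy.groupby(["vstrata", "vpsu"]).ngroup()

print(toy)
```

Strata 30 and 40 each hold one PSU, so both land in the fake stratum, and because their PSU ids stay distinct the fake stratum has two PSUs and a usable n-1.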


In [None]:
# import packages
import pandas as pd
import numpy as np
import pyreadstat
from ydata_profiling import ProfileReport

In [None]:
DATAPATH = "P:/3652/Common/HEAL/y3-task-c-collaborative-projects/jcoin-stigma/analyses/data/protocol2/"

DATA_FILE = DATAPATH+"3645_JCOIN_HEAL Initiative 2021_NORC_Jan2022_1.sav"
STRATA_FILE = DATAPATH+"VSTRAT_VPSU_Survey_2039_HEAL_MAIN_21_05_14.csv"

In [None]:
strata_df = pd.read_csv(STRATA_FILE)
strata_df.columns = strata_df.columns.str.lower()
# preview indexed by caseid; not assigned, so caseid stays a column for the counts below
strata_df.set_index("caseid")

In [None]:
# number of one-observation strata -- these need to be combined into one stratum
strata_df.groupby(["vstrat32"])["caseid"].count().pipe(lambda s: s == 1).agg(["sum", "count"])

In [None]:
strata_df["vstrat32"].value_counts()>1

In [None]:
oneobs = strata_df.vstrat32.value_counts().loc[lambda s:s==1]
strata_df["vstrat32_corrected"] = strata_df["vstrat32"].where(cond=lambda s:~s.isin(oneobs.index),other=-1)

In [None]:
strata_df["vstrat32_corrected"].pipe(lambda s:s==-1).sum()

In [None]:
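# PSU ids unique across the file; group on the original stratum so that PSUs from
# different collapsed strata that share a vpsu32 value are not merged into one PSU
strata_df["vpsu32_corrected"] = strata_df.groupby(["vstrat32", "vpsu32"]).ngroup()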
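
A quick sanity check on the corrected design is to list any strata still left with fewer than two distinct PSUs. The sketch below runs on a made-up toy frame that reuses the column names from the cells above. The helper name `single_psu_strata` is hypothetical and not part of pandas or samplics.

```python
import pandas as pd

def single_psu_strata(design, strata_col, psu_col):
    """Return the strata that contain fewer than two distinct PSUs."""
    n_psu = design.groupby(strata_col)[psu_col].nunique()
    return n_psu[n_psu < 2].index.tolist()

# Toy frame (made-up values) with the corrected column names used above.
toy = pd.DataFrame({
    "vstrat32_corrected": [10, 10, 20, 20, -1, -1],
    "vpsu32_corrected":   [0, 1, 2, 2, 3, 4],
})

remaining = single_psu_strata(toy, "vstrat32_corrected", "vpsu32_corrected")
print(remaining)
```

In this toy frame stratum 20 has two observations but only one PSU, so it is still flagged. Collapsing on single-observation strata alone would miss a case like this, which is why the check counts PSUs rather than rows.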

In [None]:
# import data and metadata (data dictionaries)
df, meta = pyreadstat.read_sav(DATA_FILE,apply_value_formats=True)
df.columns = df.columns.str.lower()


In [None]:
# sub_df_1 is not created in the cells above; it is assumed to be a subset of df defined elsewhere
ProfileReport(
    sub_df_1.filter(regex="^strata"),
    sensitive=True,
    samples=None,
    correlations=None,
    missing_diagrams=None,
    duplicates=None,
    interactions=None,)