# JCOIN Stigma Survey Protocol 2: Strata and Cluster Exploration

The goal is to identify the sampling strata needed as inputs for variance calculations, preferably via bootstrap estimation with [samplics](https://samplics-org.github.io/samplics/pages/weight_replicates.html).


1. Sampling frame summary
2. Sample weighting methodology provided by AmeriSpeak (two-step process)
3. Identifying stratum variables in dataset


## Sampling frame

Our sampling frame will be a random sample of AmeriSpeak panelists with three components: a nationally representative sample (n = 1,000 completes each wave, or 70% of 1,428 eligible participants); an oversample in states with at least one JCOIN study site (n = 200 per state; 16 states total, excluding Puerto Rico); and six non-JCOIN matched comparison states (n = 200 per state). The state samples are used to develop state-level estimates, making sure to include states from the South and Pacific Northwest. The total sample is projected to be 5,400 completed surveys.
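
The sample-size arithmetic in this description can be checked quickly. This is a minimal sketch that uses only the figures quoted above; the variable names are illustrative and not taken from the dataset.

```python
# Figures quoted in the sampling-frame description (variable names are illustrative)
national_completes = 1_000
eligible_national = 1_428
jcoin_states = 16          # states with at least one JCOIN site, excluding Puerto Rico
comparison_states = 6      # non-JCOIN matched comparison states
completes_per_state = 200

completion_rate = national_completes / eligible_national
total_completes = (
    national_completes
    + (jcoin_states + comparison_states) * completes_per_state
)

print(f"national completion rate: {completion_rate:.1%}")
print(f"projected total completes: {total_completes:,}")
```

The national sample plus 22 state samples of 200 reproduces the projected total of 5,400.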

## Sample weighting methodology (from Project Report PDF)

### Statistical Weighting
Statistical weights for the study-eligible respondents were calculated starting from the **panel base sampling weights**.

**Panel base sampling weights** for all sampled housing units are computed as the inverse of probability of selection 
from the NORC National Frame (the sampling frame that is used to sample housing units for AmeriSpeak) 
or address-based sample. The sample design and recruitment protocol for the AmeriSpeak Panel involves 
subsampling of initial non-respondent housing units. These subsampled non-respondent housing units are 
selected for an in-person follow-up. The subsample of housing units that are selected for the nonresponse 
follow-up (NRFU) have their panel base sampling weights inflated by the inverse of the subsampling rate. 
The base sampling weights are further adjusted to account for unknown eligibility and nonresponse among 
eligible housing units. The household-level nonresponse adjusted weights are then post-stratified to external 
counts for number of households obtained from the Current Population Survey. Then, these household-level 
post-stratified weights are assigned to each eligible adult in every recruited household. Furthermore, a 
person-level nonresponse adjustment accounts for nonresponding adults within a recruited household. 
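
The first two steps above (inverse-probability base weights, then inflation for the NRFU subsample) can be sketched as follows. All numbers, column names, and the subsampling rate are made up for illustration; the actual NORC values are not available in this notebook.

```python
import pandas as pd

# Hypothetical housing units: probabilities and NRFU flags are invented
units = pd.DataFrame({
    "prob_selection": [0.001, 0.001, 0.002, 0.002],
    "in_nrfu_subsample": [False, True, False, True],
})
nrfu_subsampling_rate = 0.5  # assumed rate, for illustration only

# Base weight is the inverse of the probability of selection
units["base_weight"] = 1 / units["prob_selection"]

# Units in the NRFU subsample are inflated by the inverse of the subsampling rate;
# all other units keep their base weight
units["adj_weight"] = units["base_weight"].where(
    ~units["in_nrfu_subsample"],
    units["base_weight"] / nrfu_subsampling_rate,
)

print(units)
```

The later adjustments (unknown eligibility, nonresponse, post-stratification) would each multiply these weights by further factors and are not shown here.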


Finally, panel weights are raked to external population totals associated with age, sex, education, 
race/Hispanic ethnicity, housing tenure, telephone status, and Census Division. The external population 
totals are obtained from the Current Population Survey. The weights adjusted to the external population 
totals are the final panel weights.

#### Panel Weighting Variables & the Variable Categories 
**Age**: 18-24, 25-29, 30-39, 40-49, 50-59, 60-64, and 65+  
**Gender**: Male and Female  
**Census Division**: New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, and Pacific  
**Race/Ethnicity**: Non-Hispanic White, Non-Hispanic Black, Hispanic, and Non-Hispanic Other  
**Education**: Less than High School, High School/GED, Some College, and BA and Above  
**Housing Tenure**: Home Owner and Other  
**Household phone status**: Cell Phone-only, Dual User, and Landline-only/Phoneless
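
To make the raking step concrete, here is a minimal sketch of iterative proportional fitting on toy data. It uses only two of the weighting variables, and the respondents and target totals are invented; the `rake` helper is hypothetical and is not the procedure NORC ran.

```python
import pandas as pd

def rake(df, weight_col, targets, max_iter=50, tol=1e-8):
    """Scale weights repeatedly until each weighted margin matches its target totals."""
    w = df[weight_col].astype(float).copy()
    for _ in range(max_iter):
        max_change = 0.0
        for var, totals in targets.items():
            current = w.groupby(df[var]).sum()              # current weighted margin
            factors = df[var].map(pd.Series(totals) / current)
            max_change = max(max_change, (factors - 1).abs().max())
            w = w * factors
        if max_change < tol:                                # all margins already match
            break
    return w

# Toy respondents with equal starting weights (made-up data)
respondents = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "tenure": ["Home Owner", "Home Owner", "Other", "Other", "Other", "Home Owner"],
    "weight": [1.0] * 6,
})
# Made-up population totals for each margin
targets = {
    "gender": {"Male": 48.0, "Female": 52.0},
    "tenure": {"Home Owner": 65.0, "Other": 35.0},
}

respondents["raked_weight"] = rake(respondents, "weight", targets)
print(respondents.groupby("gender")["raked_weight"].sum())
print(respondents.groupby("tenure")["raked_weight"].sum())
```

After raking, the weighted totals for both variables match the targets even though neither was matched before.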

**Study-specific base sampling weights** are derived using a combination of the final panel weight and the probability 
of selection associated with the sampled panel member. Since not all sampled panel members respond to the 
survey interview, an adjustment is needed to account for survey nonrespondents. This 
adjustment decreases potential nonresponse bias associated with sampled panel members who did not 
complete the survey interview for the study. The nonresponse-adjusted survey weights for the study are then 
adjusted via a raking ratio method to population totals for the general population aged 18 and older, using 
the following topline socio-demographic characteristics: age, sex, education, race/Hispanic ethnicity, and 
Census Division, and the following socio-demographic interactions: age x gender, age x race/ethnicity, and 
race/ethnicity x gender.

#### Study-Specific Post-Stratification Weighting Variables & the Variable Categories 
**Age**: 18-24, 25-29, 30-39, 40-49, 50-59, 60-64, and 65+  
**Gender**: Male and Female  
**State Group**: Northeast, Midwest, South, West, Arizona, California, Colorado, Florida, Georgia, 
Illinois, Indiana, Kentucky, Massachusetts, Maryland, Minnesota, North Carolina, New Jersey, 
New York, Oregon, Pennsylvania, Texas, Virginia, Washington, Wisconsin  
**Education**: Less than High School, High School/GED, Some College, and BA and Above  
**Race/Ethnicity**: Non-Hispanic White, Non-Hispanic Black, Hispanic, and Non-Hispanic Other  
**Age x Gender**: 18-34 Male, 18-34 Female, 35-49 Male, 35-49 Female, 50-64 Male, 50-64 Female,
65+ Male, and 65+ Female  
**Age x Race/Ethnicity**: 18-34 Non-Hispanic White, 18-34 All Other, 35-49 Non-Hispanic 
White, 35-49 All Other, 50-64 Non-Hispanic White, 50-64 All Other, 65+ Non-Hispanic White, and 65+ 
All Other  
**Race/Ethnicity x Gender**: Non-Hispanic White Male, Non-Hispanic White Female, All Other Male, 
and All Other Female

The weights adjusted to the external population totals are the final study weights.

At the final stage of weighting, any extreme weights were trimmed based on a criterion of minimizing the 
mean squared error associated with key survey estimates, and the weights were then re-raked to the same population 
totals.
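
As a rough illustration of trimming, the sketch below caps weights at an upper quantile and rescales so the total is unchanged. This is a deliberately simpler rule than the mean-squared-error criterion described in the report, which is not specified in enough detail to reproduce; the `trim_weights` helper and the example weights are hypothetical.

```python
import numpy as np

def trim_weights(weights, upper_quantile=0.95):
    """Cap weights at a quantile, then rescale so the sum of weights is preserved."""
    w = np.asarray(weights, dtype=float)
    cap = np.quantile(w, upper_quantile)
    capped = np.minimum(w, cap)
    # Rescaling restores the original total; it can lift weights slightly above the cap
    return capped * w.sum() / capped.sum()

# Made-up weights with one extreme value
example_weights = [1, 1, 1, 1, 1, 1, 1, 1, 1, 20]
trimmed = trim_weights(example_weights)

print(trimmed.round(2))
```

In practice the trimmed weights would be re-raked to the population totals afterwards, as the report describes.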

Raking and re-raking are done during the weighting process so that the weighted demographic distribution 
of the survey completes resembles the demographic distribution in the target population. The assumption is 
that the key survey items are related to the demographics. Therefore, by aligning the survey respondent 
demographics with the target population, the key survey items should also be in closer alignment with the 
target population.

In [None]:
# import packages
import pandas as pd
import numpy as np
import pyreadstat
from ydata_profiling import ProfileReport

In [None]:
DATA_FILE = (
    "P:/3645/Common/Protocol 2 Custom Survey/"
    "Analysis/Data File/"
    "3645_JCOIN_HEAL Initiative 2021_NORC_Jan2022_1.sav"
)

In [None]:
# import data and metadata (data dictionaries)
df, meta = pyreadstat.read_sav(DATA_FILE, apply_value_formats=True)


In [None]:
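# standardize column names to lowercase
df.columns = df.columns.str.lower()

# NOTE: the step that narrows the dataset to variables collected at every
# time point is not shown in this notebook. The cells below expect a frame
# named sub_df_1, so the full dataset is copied here as a placeholder (assumption).
sub_df_1 = df.copy()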

### Identifying sampling variables from dataset

In [None]:
# gender
# gender1 in metadata:
# "What sex were you assigned at birth, on your original birth certificate?"
# This is sex rather than gender, but it appears to be what was used for sampling.
# Unclear what gender_re is (possibly added for convenience).

# USE: gender1
# parentheses give alternation; square brackets would be a character class
sub_df_1.filter(regex="gender(1|2|_re)").value_counts()
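
When selecting columns by regex, note that square brackets define a character class (any single character from the set), while parentheses with `|` define alternation. The sketch below compares the two on a few candidate names; `gender2`, `gendere`, and `gender|x` are hypothetical names included only to show the difference.

```python
import re

# Candidate column names; the last two are invented to expose the difference
candidates = ["gender1", "gender2", "gender_re", "gendere", "gender|x"]

char_class = re.compile(r"gender[12|_re]")    # "gender" + ONE of: 1 2 | _ r e
alternation = re.compile(r"gender(1|2|_re)")  # "gender" + "1", "2", or "_re"

# DataFrame.filter(regex=...) keeps labels for which re.search finds a match
class_hits = [name for name in candidates if char_class.search(name)]
alt_hits = [name for name in candidates if alternation.search(name)]

print("character class:", class_hits)
print("alternation:    ", alt_hits)
```

The character class also accepts the two unintended names, whereas the alternation keeps only the three intended ones.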

In [None]:
# education
# The sampling methodology doc lists 4 categories
# (educ4 has no description in the metadata; only the 5-category variable does)

# USE: educ4
sub_df_1.filter(regex="educ4|educ5").value_counts()

In [None]:
# race/ethnicity
# only racethnicity is in the codebook, but 4 categories are used in the sampling methodology

# USE: race_4cat

sub_df_1.filter(regex="racethnicity|race_4cat").value_counts()

In [None]:
# age

# USE: age7 (7 categories in the sampling methodology)
sub_df_1.filter(regex="^age[47]").value_counts()

In [None]:
# In the sampling report:
# State Group: Northeast, Midwest, South, West, Arizona, California, Colorado, Florida, Georgia,
# Illinois, Indiana, Kentucky, Massachusetts, Maryland, Minnesota, North Carolina, New Jersey,
# New York, Oregon, Pennsylvania, Texas, Virginia, Washington, Wisconsin

# The mix of census regions and states may reflect the combination of the
# state oversample (state) and the national sample (census region)

# raw string so that \d is passed to the regex engine unchanged
sub_df_1.filter(regex=r"state$|region4|weight\d").drop_duplicates().head()

In [None]:
# create strata variables
# state/region stratum: region4 where weight1 is missing, otherwise state
sub_df_1["strata_state_or_census"] = np.where(
    sub_df_1["weight1"].isna(),
    sub_df_1["region4"],
    sub_df_1["state"],
)
sub_df_1["strata_race"] = sub_df_1["race_4cat"]
sub_df_1["strata_age"] = sub_df_1["age7"]
sub_df_1["strata_gender"] = sub_df_1["gender1"]

# interaction strata
# NOTE: the sampling report's interaction terms use collapsed categories
# (age 18-34/35-49/50-64/65+ and Non-Hispanic White vs. All Other);
# the full age7 and race_4cat categories are used here.
sub_df_1["strata_agexgender"] = sub_df_1["age7"].astype(str) + "_x_" + sub_df_1["gender1"].astype(str)
sub_df_1["strata_agexrace"] = sub_df_1["age7"].astype(str) + "_x_" + sub_df_1["race_4cat"].astype(str)
sub_df_1["strata_racexgender"] = sub_df_1["race_4cat"].astype(str) + "_x_" + sub_df_1["gender1"].astype(str)
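
Before treating interaction columns like these as strata, it may help to check cell sizes, since a stratum with a single observation cannot contribute a within-stratum variance estimate. The sketch below does this on a toy frame; the rows are invented and only the column naming mirrors the cell above.

```python
import pandas as pd

# Toy frame (made-up rows) using the same naming pattern as the strata columns
toy = pd.DataFrame({
    "strata_age": ["18-24", "18-24", "25-29", "65+", "65+", "65+"],
    "strata_gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
})
toy["strata_agexgender"] = toy["strata_age"] + "_x_" + toy["strata_gender"]

# Count respondents per interaction cell and flag cells with fewer than 2
cell_sizes = toy["strata_agexgender"].value_counts()
sparse_cells = cell_sizes[cell_sizes < 2].index.tolist()

print(cell_sizes)
print("cells with fewer than 2 respondents:", sorted(sparse_cells))
```

Sparse cells would typically be collapsed with neighbouring categories before variance estimation.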

In [None]:
# profile only the strata variables; optional report sections are switched off
ProfileReport(
    sub_df_1.filter(regex="^strata"),
    sensitive=True,
    samples=None,
    correlations=None,
    missing_diagrams=None,
    duplicates=None,
    interactions=None,
)
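
Once a stratum variable and a final weight are chosen, bootstrap replicate weights are the input needed for the variance calculations mentioned at the top. The sketch below is a hand-rolled version of the Rao-Wu rescaled bootstrap in NumPy, not the samplics API. It assumes each respondent is its own sampling unit within a stratum, and the `bootstrap_replicate_weights` helper and all data are hypothetical.

```python
import numpy as np

def bootstrap_replicate_weights(weights, strata, n_reps=200, seed=0):
    """Rao-Wu rescaled bootstrap, treating each row as its own PSU (assumption)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    strata = np.asarray(strata)
    reps = np.zeros((len(weights), n_reps))
    for stratum in np.unique(strata):
        idx = np.flatnonzero(strata == stratum)
        n_h = len(idx)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 units")
        for r in range(n_reps):
            # Draw n_h - 1 units with replacement and count how often each is drawn
            draws = rng.integers(0, n_h, size=n_h - 1)
            counts = np.bincount(draws, minlength=n_h)
            reps[idx, r] = weights[idx] * counts * n_h / (n_h - 1)
    return reps

# Toy data (made up): two strata, a binary outcome, and final weights
strata = np.array(["A", "A", "A", "B", "B", "B", "B"])
weights = np.array([2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0])
outcome = np.array([1, 0, 1, 0, 0, 1, 1])

reps = bootstrap_replicate_weights(weights, strata)

# Weighted mean from the full weights and from each replicate
full_estimate = np.sum(weights * outcome) / np.sum(weights)
rep_estimates = (reps * outcome[:, None]).sum(axis=0) / reps.sum(axis=0)
bootstrap_variance = np.mean((rep_estimates - full_estimate) ** 2)

print(f"estimate: {full_estimate:.4f}, bootstrap variance: {bootstrap_variance:.4f}")
```

With the real data, `weights` and `strata` would come from the final study weight and whichever `strata_*` column is selected; both choices remain open in this notebook.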