# Strata and Cluster Exploration

Bootstrap weighting


## Sample weighting methodology (from Project Report PDF)

### Statistical Weighting
Statistical weights for the study-eligible respondents were calculated starting from the panel base sampling weights. 

**Panel base sampling weights** for all sampled housing units are computed as the inverse of probability of selection 
from the NORC National Frame (the sampling frame that is used to sample housing units for AmeriSpeak) 
or address-based sample. The sample design and recruitment protocol for the AmeriSpeak Panel involves 
subsampling of initial non-respondent housing units. These subsampled non-respondent housing units are 
selected for an in-person follow-up. The subsample of housing units that are selected for the nonresponse 
follow-up (NRFU) have their panel base sampling weights inflated by the inverse of the subsampling rate. 
The base sampling weights are further adjusted to account for unknown eligibility and nonresponse among 
eligible housing units. The household-level nonresponse adjusted weights are then post-stratified to external 
counts for number of households obtained from the Current Population Survey. Then, these household-level 
post-stratified weights are assigned to each eligible adult in every recruited household. Furthermore, a 
person-level nonresponse adjustment accounts for nonresponding adults within a recruited household. 
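
The first two steps above (inverse-probability base weights and the NRFU inflation) can be sketched as follows. This is a minimal illustration, not NORC's implementation: the `units` table, its column names, the selection probabilities, and the subsampling rate are all made-up assumptions.

```python
import pandas as pd

# Hypothetical housing units; all values below are invented for illustration.
units = pd.DataFrame({
    "unit_id": [1, 2, 3, 4],
    "p_select": [0.001, 0.001, 0.002, 0.002],  # probability of selection from the frame
    "in_nrfu": [False, True, False, True],     # selected for nonresponse follow-up
})
NRFU_SUBSAMPLING_RATE = 0.25  # assumed: 1 in 4 initial nonrespondents is followed up

# Base weight is the inverse of the probability of selection.
units["base_weight"] = 1.0 / units["p_select"]

# NRFU units are inflated by the inverse of the subsampling rate;
# all other units keep their base weight.
units["adj_weight"] = units["base_weight"].where(
    ~units["in_nrfu"], units["base_weight"] / NRFU_SUBSAMPLING_RATE
)
print(units)
```

The eligibility, nonresponse, and post-stratification adjustments described above would be applied on top of `adj_weight`; they are omitted here.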


Finally, panel weights are raked to external population totals associated with age, sex, education, 
race/Hispanic ethnicity, housing tenure, telephone status, and Census Division. The external population 
totals are obtained from the Current Population Survey. The weights adjusted to the external population 
totals are the final panel weights.

#### Panel Weighting Variables & the Variable Categories 
**Age**: 18-24, 25-29, 30-39, 40-49, 50-59, 60-64, and 65+  
**Gender**: Male and Female  
**Census Division**: New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, and Pacific  
**Race/Ethnicity**: Non-Hispanic White, Non-Hispanic Black, Hispanic, and Non-Hispanic Other  
**Education**: Less than High School, High School/GED, Some College, and BA and Above  
**Housing Tenure**: Home Owner and Other  
**Household phone status**: Cell Phone-only, Dual User, and Landline-only/Phoneless

**Study-specific base sampling weights** are derived using a combination of the final panel weight and the probability 
of selection associated with the sampled panel member. Since not all sampled panel members respond to the 
survey interview, an adjustment is needed to account for survey nonrespondents. This 
adjustment decreases potential nonresponse bias associated with sampled panel members who did not 
complete the survey interview for the study. The nonresponse-adjusted survey weights for the study are then 
adjusted via a raking ratio method to population totals for the general population age 18 and older, associated with 
the following topline socio-demographic characteristics: age, sex, education, race/Hispanic ethnicity, and 
Census Division, and the following socio-demographic interactions: age x gender, age x race/ethnicity, and 
race/ethnicity x gender.

#### Study-Specific Post-Stratification Weighting Variables & the Variable Categories 
**Age**: 18-24, 25-29, 30-39, 40-49, 50-59, 60-64, and 65+  
**Gender**: Male and Female  
**State Group**: Northeast, Midwest, South, West, Arizona, California, Colorado, Florida, Georgia, 
Illinois, Indiana, Kentucky, Massachusetts, Maryland, Minnesota, North Carolina, New Jersey, 
New York, Oregon, Pennsylvania, Texas, Virginia, Washington, Wisconsin  
**Education**: Less than High School, High School/GED, Some College, and BA and Above  
**Race/Ethnicity**: Non-Hispanic White, Non-Hispanic Black, Hispanic, and Non-Hispanic Other  
**Age x Gender**: 18-34 Male, 18-34 Female, 35-49 Male, 35-49 Female, 50-64 Male, 50-64 Female,
65+ Male, and 65+ Female  
**Age x Race/Ethnicity**: 18-34 Non-Hispanic White, 18-34 All Other, 35-49 Non-Hispanic 
White, 35-49 All Other, 50-64 Non-Hispanic White, 50-64 All Other, 65+ Non-Hispanic White, and 65+ 
All Other  
**Race/Ethnicity x Gender**: Non-Hispanic White Male, Non-Hispanic White Female, All Other Male, 
and All Other Female

The weights adjusted to the external population totals are the final study weights.
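
To make the raking ratio method concrete, here is a minimal sketch of iterative proportional fitting on two margins. It is an illustration under stated assumptions, not the procedure NORC ran: the `rake` helper, the toy `respondents` table, and the population totals are all hypothetical.

```python
import pandas as pd

def rake(df, weight_col, targets, max_iter=100, tol=1e-8):
    """Scale weights margin by margin until each weighted margin matches its target.

    targets maps a column name to {category: population total}.
    """
    w = df[weight_col].astype(float).copy()
    for _ in range(max_iter):
        max_change = 0.0
        for col, totals in targets.items():
            current = w.groupby(df[col]).sum()          # weighted total per category
            ratio = pd.Series(totals) / current         # adjustment factor per category
            factors = df[col].map(ratio.to_dict()).astype(float)
            max_change = max(max_change, (factors - 1).abs().max())
            w = w * factors
        if max_change < tol:                            # all margins already match
            break
    return w

# Toy respondents with made-up targets (each margin sums to 100).
respondents = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female"],
    "age": ["18-34", "35+", "18-34", "35+", "35+"],
    "w": [1.0] * 5,
})
targets = {
    "gender": {"Male": 48.0, "Female": 52.0},
    "age": {"18-34": 30.0, "35+": 70.0},
}
respondents["raked_w"] = rake(respondents, "w", targets)
print(respondents.groupby("gender")["raked_w"].sum())
print(respondents.groupby("age")["raked_w"].sum())
```

Interaction margins such as age x gender can be raked the same way by adding a combined column to `targets`.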

At the final stage of weighting, any extreme weights were trimmed based on a criterion of minimizing the 
mean squared error associated with key survey estimates, and the weights were then re-raked to the same population 
totals.
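
The report does not give the details of the mean-squared-error criterion, so the sketch below substitutes a much simpler rule: cap weights at a multiple of the median and rescale to preserve the weight total. The `trim_weights` helper, the cap multiple, and the example weights are assumptions chosen only to show the mechanics.

```python
import numpy as np

def trim_weights(weights, cap_multiple=5.0):
    """Cap weights at cap_multiple * median, then rescale to preserve the weight total."""
    w = np.asarray(weights, dtype=float)
    cap = cap_multiple * np.median(w)
    trimmed = np.minimum(w, cap)
    # Rescaling restores the original total; it can push capped values back above the cap.
    return trimmed * (w.sum() / trimmed.sum())

weights = np.array([0.5, 0.8, 1.0, 1.2, 12.0])  # made-up weights with one extreme value
trimmed = trim_weights(weights)
print(trimmed, trimmed.max() / trimmed.min())
```

Because trimming disturbs the weighted margins, the trimmed weights would then be raked again to the same population totals, as described above.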

Raking and re-raking are done during the weighting process so that the weighted demographic distribution 
of the survey completes resembles the demographic distribution in the target population. The assumption is 
that the key survey items are related to the demographics. Therefore, by aligning the survey respondent 
demographics with the target population, the key survey items should also be in closer alignment with the 
target population.

In [None]:
%matplotlib inline

In [None]:
# import packages
import os
import json
from pathlib import Path
import pandas as pd
import numpy as np
import pyreadstat
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from utils import *
# from ydata_profiling import ProfileReport
pd.set_option('mode.chained_assignment', None)

### Data cleaning/pre-processing

In [None]:
# inputs
STATE_ABBREVIATIONS = "state_abbrev_mappings.json"
DATA_FILE = (
    "P:/3645/Common/Protocol 2 Custom Survey/"
    "Analysis/Data File/"
    "3645_JCOIN_HEAL Initiative 2021_NORC_Jan2022_1.sav"
)

In [None]:
# import data and metadata (data dictionaries)
df, meta = pyreadstat.read_sav(DATA_FILE, apply_value_formats=True)


In [None]:
# standardize column names to lowercase so they can be referenced consistently
# (no variables are dropped in this cell)
df.columns = df.columns.str.lower()

In [None]:
vars_of_interest = [
    'p_over', 'weight1', 'weight2', 'stigma_scale_score', 'expanded_10item_stigma',
    'state', 'age4', 'racethnicity', 'educ5',
    'personaluse_ever', 'familyuse_ever',
    'personalcrimjust_ever', 'familycrimjust_ever',
]
categorical_vars = ['p_over','state','age4','racethnicity','educ5',
    'personaluse_ever','familyuse_ever',
    'personalcrimjust_ever','familycrimjust_ever']

In [None]:
# The subset to a few interesting (and relatively clean, straightforward) variables is
# disabled for now, so the full dataset is used. Missing values are handled in later cells.
# sub_df_1 = df[vars_of_interest]
sub_df_1 = df

In [None]:

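# harmonize the "ever" variables to plain Yes/No labels
# (assign the result back instead of using inplace=True on a column, which may not update the frame)
sub_df_1["familycrimjust_ever"] = sub_df_1["familycrimjust_ever"].replace({0: "No", 1: "Yes"})
sub_df_1["familyuse_ever"] = sub_df_1["familyuse_ever"].replace({" No": "No"})
sub_df_1["personalcrimjust_ever"] = sub_df_1["personalcrimjust_ever"].replace(
    {"Yes, ever arrested or incarcerated": "Yes", "No, never arrested or incarcerated": "No"}
)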


In [None]:
# impute missing stigma scale scores with the overall median
# (the per-time-point version is kept commented out below; imputing personaluse_ever
# with its mode, "No", is not done in this cell)
#sub_df_1['stigma_scale_score'].fillna(sub_df_1.groupby('time-point')['stigma_scale_score'].transform('median'),inplace=True)
sub_df_1['stigma_scale_score'] = sub_df_1['stigma_scale_score'].fillna(sub_df_1['stigma_scale_score'].median())
sub_df_1['expanded_10item_stigma'] = sub_df_1['expanded_10item_stigma'].fillna(sub_df_1['expanded_10item_stigma'].median())

In [None]:
# add df column with state 2 letter code
# https://pythonfix.com/code/us-states-abbrev.py/
# state name to two letter code dictionary
us_state_to_abbrev = json.loads(Path(STATE_ABBREVIATIONS).read_text())
state_cd = sub_df_1.state.replace(us_state_to_abbrev)
sub_df_1.insert(6,"state_cd",state_cd,True)

In [None]:
# Add jcoin information
jcoin_json = json.loads(Path("jcoin_states.json").read_text())

jcoin_df = (pd.DataFrame(jcoin_json)
    .assign(hub_types=lambda df:df["hub"]+"("+df["type"]+")")
    .groupby('states')
    # make a list of the name and type of hub/study and how many hubs are in that state
    .agg({"hub_types":lambda s:",".join(s),"hub":"count"})
    .reset_index()
    .rename(
        columns={"states":"state_cd",
        "hub":"jcoin_hub_count",
        "hub_types":"jcoin_hub_types"})
)

In [None]:
sub_df_1 = sub_df_1.merge(jcoin_df, on="state_cd", how="left")
# states without a JCOIN hub get a placeholder label and a hub count of 0
sub_df_1["jcoin_hub_types"] = sub_df_1["jcoin_hub_types"].fillna("not JCOIN")
sub_df_1["jcoin_hub_count"] = sub_df_1["jcoin_hub_count"].fillna(0)
sub_df_1["is_jcoin_hub"] = np.where(sub_df_1["jcoin_hub_types"] == "not JCOIN", "No", "Yes")

### Identifying sampling variables from dataset

In [None]:
# gender
# gender1 in metadata:
# What sex were you assigned at birth, on your original birth certificate?
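# This measures sex rather than gender, but it appears to be what was used for sampling
# not sure what gender_re is (I think it was added for convenience)

# USE gender1
# note: [12|_re] is a character class, so use a group to match gender1, gender2, or gender_re
sub_df_1.filter(regex=r"gender(?:[12]|_re)").value_counts()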

In [None]:
# education 
# the sampling methodology doc uses 4 categories
# (but educ4 has no description in the metadata -- only the 5-category variable does)

# USE: educ4
sub_df_1.filter(regex="educ4|educ5").value_counts()

In [None]:
# race/ethnicity
# only racethnicity is in the code book, but 4 categories are used in the sampling methodology

# USE: race_4cat

sub_df_1.filter(regex="racethnicity|race_4cat").value_counts()

In [None]:
# age

# USE age7 (7 categories in sampling methodology)
sub_df_1.filter(regex="^age[47]").value_counts()

In [None]:
# In sampling report:
#State Group: Northeast, Midwest, South, West, Arizona, California, Colorado, Florida, Georgia, 
#Illinois, Indiana, Kentucky, Massachusetts, Maryland, Minnesota, North Carolina, New Jersey, 
#New York, Oregon, Pennsylvania, Texas, Virginia, Washington, Wisconsin

# These are not mutually exclusive (eg Midwest and Illinois)...

sub_df_1.filter(regex="state$|region4")
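
One way to read the State Group list so that the groups become mutually exclusive is to treat the listed states as their own groups and assign every other state to its region. That reading is an assumption, not something the report states. The sketch below uses a toy frame; the `make_state_group` helper and the `strata_stategroup` column are hypothetical, and it assumes `state` holds full state names and `region4` holds the four region labels.

```python
import pandas as pd

# States listed individually in the report's State Group variable.
STANDALONE_STATES = {
    "Arizona", "California", "Colorado", "Florida", "Georgia", "Illinois",
    "Indiana", "Kentucky", "Massachusetts", "Maryland", "Minnesota",
    "North Carolina", "New Jersey", "New York", "Oregon", "Pennsylvania",
    "Texas", "Virginia", "Washington", "Wisconsin",
}

def make_state_group(frame, state_col="state", region_col="region4"):
    """Return the state name for standalone states, otherwise the region."""
    is_standalone = frame[state_col].isin(STANDALONE_STATES)
    return frame[state_col].astype(str).where(is_standalone, frame[region_col].astype(str))

# Toy rows standing in for sub_df_1.
toy = pd.DataFrame({
    "state": ["Illinois", "Ohio", "Texas", "Vermont"],
    "region4": ["Midwest", "Midwest", "South", "Northeast"],
})
toy["strata_stategroup"] = make_state_group(toy)
print(toy)
```

Under this reading, "Midwest" means the Midwest states other than those listed individually, so Illinois and Midwest no longer overlap.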

In [None]:
# create strata variables (single variables and two-way interactions)
sub_df_1["strata_age"] = sub_df_1["age7"]
sub_df_1["strata_gender"] = sub_df_1["gender1"]
sub_df_1["strata_agexgender"] = sub_df_1["age7"].astype(str) + "_x_" + sub_df_1["gender1"].astype(str)
sub_df_1["strata_agexrace"] = sub_df_1["age7"].astype(str) + "_x_" + sub_df_1["race_4cat"].astype(str)
sub_df_1["strata_racexgender"] = sub_df_1["race_4cat"].astype(str) + "_x_" + sub_df_1["gender1"].astype(str)
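
The report's interaction margins use collapsed categories (for example, Non-Hispanic White vs. All Other), whereas the cell above crosses the full `age7` and `race_4cat` levels. The sketch below shows how the race/ethnicity x gender cells could be collapsed to match the report and how to check cell sizes. It uses a toy frame and assumes the category labels match the report's wording, which may not hold in the actual data file. The report's 18-34 and 35-49 age groups cannot be rebuilt from `age7` alone, because its 30-39 category spans both.

```python
import pandas as pd

# Toy rows standing in for sub_df_1; labels are assumed to match the report.
toy = pd.DataFrame({
    "race_4cat": ["Non-Hispanic White", "Hispanic", "Non-Hispanic Black",
                  "Non-Hispanic Other", "Non-Hispanic White"],
    "gender1": ["Male", "Female", "Female", "Male", "Female"],
})

# Collapse race/ethnicity to the two levels used in the report's interaction margins.
race_2cat = toy["race_4cat"].astype(str).where(
    toy["race_4cat"] == "Non-Hispanic White", "All Other"
)
toy["strata_racexgender"] = race_2cat + "_x_" + toy["gender1"].astype(str)

# Small or empty cells make raking and bootstrap resampling within strata unstable.
cell_counts = toy["strata_racexgender"].value_counts()
print(cell_counts)
```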