The goal of estweight is to estimate sampling weights for convenience samples using information from a representative sample. The estimated weights are applied in various applications to obtain more generalizable inference. First, estimated sampling weights are incorporated into weighted generalized linear models to estimate associations for a target population and analytic standard error estimates are reported. Second, estimated weights are used to isolate causal effects with propensity scores for a target population. This package provides analytic standard error (SE) estimates in both contexts that account for uncertainty from estimating the sampling weights and when using propensity scores, the SE estimates accounts for uncertainty from estimating the propensity scores.
You can install the development version of estweight from GitHub with:
# install.packages("devtools")
devtools::install_github("oliviabern/estweight")
Suppose you want to estimate a model η(Y) = Xβ where Y is the response, X is a matrix containing covariates and η is a link function. However, Y and X were only collected in a convenience sample. Fitting the model directly in the convenience sample may lead to bias from the unrepresentative sampling. Additionally, supppose you have a second dataset that is representative of the target population you want to obtain inference about, but it does not contain both Y and X. The representative sample can be used to estimate sampling weights for the convenience sample and these derived weights can be used to weight the outcome model of interest. This package uses a representative sample to estimate weights for the convenience sample and weight the outcome model of interest using the convenience sample.
This is a basic example which shows you how to use a representative sample to estimate propensity weights and fit a weighted GLM for a biased convenience sample. First we will estimate a sample drawn from our reference population.
library(estweight)
set.seed(10403)
# simulate population
n.pop = 50000
X1 = rbinom(n.pop, size = 1, prob = 0.5)
X2 = rbinom(n.pop, size = 1, prob = .3)
X3 = rbinom(n.pop, size = 1, prob = .3)
X4 = rbinom(n.pop, size = 1, prob = .1)
X5 = rbinom(n.pop, size = 1, prob = .05)
X6 = rbinom(n.pop, size = 1, prob = .1)
X7 = rbinom(n.pop, size = 1, prob = .5)
X8 = rnorm(n.pop, mean = 65, sd = 5)
repsample = data.frame(X1, X2, X3, X4,
X5, X6, X7, X8)
# create helper function
expit = function(x){exp(x)/(1+exp(x))}
Next we generate the known sampling probabilities and corresponding true propensity weight. We also generate the response Y. Our goal is to estimate the outcome model of interest, η(Y) = β0 + β1X4 + β2X5 + β3X6. In this simulation, we know Y for subjects in both samples, but in practice we assume Y was only collected in the convenience sample.
# Calculate probability of being oversampled and true HT weight
sampprob = expit(.5*(repsample$X1*.3 + repsample$X2*.4 + repsample$X3*.8 +
repsample$X4*2.7 + repsample$X5*.9 + (repsample$X3)*(repsample$X5)*2 +
repsample$X6*.1 + repsample$X6*repsample$X7*1.5 +
-.002*repsample$X8^2 + 8))
repsample$sampprob = sampprob
repsample$weight_true = (1/sampprob)/sum(1/sampprob)
# Calculation E(Y) where Y is the response
repsample$Ey = expit(1 + log(2)*(repsample$X4) + -log(3)*(repsample$X5) + log(2.5)*(repsample$X6) +
-log(2)*sampprob + log(4)*(repsample$X4)*sampprob +
log(4)*(repsample$X5)*sampprob + -log(3)*(repsample$X6)*sampprob)
Now, we draw a biased convenience sample where each subject is sampled according to their sampling probability and a representative sample where each subject has equal probability of being sampled. Each sample contains nsamp = 500 observations.
n.samp = 500
# draw biased sample
b.samp.id = sample(1:n.pop, n.samp, prob = repsample$sampprob)
b.samp = repsample[b.samp.id, ]
# draw representative sample
r.samp.id = sample(1:n.pop, n.samp)
r.samp = repsample[r.samp.id, ]
Next, we simulate response Y for each subject and create an indicator biased which is 1 for subjects in the biased sample and 0 otherwise.
# simulate response
b.samp$y = rbinom(n.samp,1,b.samp$Ey)
r.samp$y = rbinom(n.samp,1,r.samp$Ey)
# biased sample indicator
b.samp$biased = 1; r.samp$biased = 0
We need to manipulate the data into the form required for the convGLMfunction. We need to create a stacked dataset that combines the biased and representative samples. They should be stacked vertically and contain a unique subject identifier, ID}, in the first column, the indicator, biased}, should be in the last column, and any covariates collected in both datasets to be used for matching should be in the middle columns. All strings and factors should be of class factor}. Finally, we need to create a response matrix that contains 2 columns, ID, and the response, y.
# Combine bisaed and representative samples
Xcomb = data.frame(ID = 1:(2*n.samp),
rbind(b.samp, r.samp))
# Dataframe to pass to function
Xfit = Xcomb[,c("ID", paste0("X",1:8), "biased")]
# convert strings and indicators to factors
convert2factor = paste0("X",1:7)
for(i in 1:length(convert2factor)){
Xfit[,convert2factor[i]] = as.factor(Xfit[,convert2factor[i]])
}
# get response
response = Xcomb[Xcomb$biased==1,c("ID","y")]
Use the convGLM function to fit an propensity-weighted generalized linear model that utilizes logistic regression estimated weights.
# formulas for final scientific model
form.outcome = as.formula(y ~ X4 + X5 + X6)
# fit weighted model using a logistic propensity weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
weight_model = "logistic", outcome_family = "quasibinomial")
#> coef analyticSE
#> (Intercept) 0.3695812 0.1111605
#> X41 1.5897785 0.4258854
#> X51 1.1122235 0.4653186
#> X61 0.3366221 0.3161996
Use a random forest propensity weight estimation method instead.
# fit weighted model using a random forest propensity weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
weight_model = "randomForest", outcome_family = "quasibinomial")
#> coef analyticSE
#> (Intercept) 0.50048753 0.1477161
#> X41 1.70037738 0.4721582
#> X51 0.63636903 0.5422092
#> X61 0.03364452 0.3847607
Now try a covariate balancing propensity score weight estimation method.
# fit weighted model using a random forest propensity weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
weight_model = "CBPS", outcome_family = "quasibinomial")
#> [1] "Finding ATT with T=0 as the treatment. Set ATT=1 to find ATT with T=1 as the treatment"
#> coef analyticSE
#> (Intercept) 0.3738329 0.1117208
#> X41 1.5960229 0.4198840
#> X51 1.0953650 0.4677181
#> X61 0.3098333 0.3172941
Finally, try a entropy balancing weight estimation method.
# fit weighted model using a random forest propensity weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
weight_model = "entbal", outcome_family = "quasibinomial")
#> final value -0.049378
#> converged
#> coef analyticSE
#> (Intercept) 0.3619440 0.1122514
#> X41 1.6440069 0.4228829
#> X51 1.1122146 0.4694176
#> X61 0.3191363 0.3193139
If you want to estimate sampling weights and use them in your own downstream analysis, you can use the estweight function. It will return sampling weights that you can use in any model that allows for sampling weights such as functions from the survey package. Note that the standard errors returned from these methods won’t account for uncertainty from estimating the sampling weights. To illustrate estimating sampling weights, we will use the simulated data from above.
weights = estweight(data = Xfit, weight_model = "logistic")
head(weights)
#> ID htweight
#> 46267 1 0.0007810555
#> 2230 2 0.0014069895
#> 13812 3 0.0005241307
#> 16025 4 0.0014827449
#> 46457 5 0.0011350264
#> 20468 6 0.0007226848
Example of using sampling weights to isolate causal effects with propensity adjustment: the convPS() function
Now suppose that you want to use your convenience sample to estimate propensity scores and estimate a propensity score adjusted estimate of a treatment effect. We can again use a representative sample to estimate sampling weights, but the sampling weights should be used to (1) estimate the propensity score and (2) estimate the propensity-adjusted causal effect.
Let’s simulation a super population where the sampling is dependent on a variable K, but we have not measured K. Instead, we have measured a proxy for K, X_1. Both K and X_1 are binary covariates, but they are functions of continuous latent variables which have a correlation of 0.90.
library(MASS)
n.pop = 10000; n = 2000
Sigma = matrix(c(1,.9,.9,1),nrow = 2)
cont = mvrnorm(n.pop, mu = c(0,0), Sigma = Sigma)
K.full = ifelse(cont[,1] > 0, 1, 0)
x1.full = ifelse(cont[,2] > 0, 1, 0)
prob.sample = .6*K.full + .2
We start by drawing a convenience sample according to the sampling probability:
CSamp = sample(1:n.pop, size = n, prob = prob.sample)
x1.c = x1.full[CSamp]
K.c = K.full[CSamp]
Then draw a simple random sample (SRS):
SRS = sample(1:n.pop, size = n)
x1.s = x1.full[SRS]
K.s = K.full[SRS]
Next, we simulate more variables for the convenience sample including potential confounders, the probability of being in the treatment group, and the treatment indicators. Notice that the probability of receiving treatment is differential for individuals with K = 1 or K = 0.
x2.c = rnorm(n,0,2)
x3.c = rnorm(n,0,2)
pi.c = expit((log(1.3)*x2.c + log(.4)*x3.c)*(1-K.c) +
(log(2)*x2.c + log(1.5)*x3.c)*(K.c))
T.c = rbinom(n,1,pi.c)
Lastly, we simulate a response with treatment effect modification by K:
x4.c = rnorm(n,0,1)
x5.c = rnorm(n,0,1)
y.c = rnorm(n, mean = (0 + 1*T.c + 3*T.c*K.c + 1.5*x2.c - 2*x3.c^2 - 1*x4.c + 1.5*x5.c^3), sd = 1)
We can then format the convenience and representative samples to use the convPS() function:
convSamp = data.frame(x1 = x1.c, x2 = x2.c, x3 = x3.c,
x4 = x4.c, x5 = x5.c, Tx = T.c, y = y.c)
repSamp = data.frame(x1 = x1.s)
Now, we get estimate the propensity adjusted causal effect and account for the sampling scheme:
convPS(convSamp = convSamp, repSamp = repSamp,
sampwt_vars = "x1", PS_vars = paste0("x",1:4),
treatment_var = "Tx", response_var = "y")
#> $Treatment_effect_est
#> [1] 2.531667
#>
#> $Treatment_effect_SE
#> [1] 0.6746462
Which returns the estimated causal effect and corresponding standard error that accounts for uncertainty from estimating the sampling weights and propensity score.