estweight

The goal of estweight is to estimate sampling weights for convenience samples using information from a representative sample. The estimated weights can then be applied to obtain more generalizable inference. First, estimated sampling weights are incorporated into weighted generalized linear models to estimate associations for a target population, with analytic standard error estimates reported. Second, estimated weights are used with propensity scores to isolate causal effects for a target population. In both contexts, the package provides analytic standard error (SE) estimates that account for the uncertainty from estimating the sampling weights; when propensity scores are used, the SE estimates also account for the uncertainty from estimating the propensity scores.
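As a toy illustration of the underlying idea (base R only, not using the package), inverse-probability (Horvitz-Thompson) weighting can recover a population quantity from a biased sample when the sampling probabilities are known:

```r
set.seed(1)

# simulate a population and a selection mechanism that oversamples large x
x = rnorm(100000, mean = 50, sd = 10)
p = plogis((x - 50) / 10)            # sampling probability increases with x
s = which(runif(100000) < p)         # biased sample
samp = x[s]
w = 1 / p[s]                         # Horvitz-Thompson weights

# the naive sample mean is biased upward; the weighted mean is close to
# the population mean
c(pop = mean(x), naive = mean(samp), weighted = weighted.mean(samp, w))
```

In practice the sampling probabilities are unknown, which is why estweight estimates them from a representative sample.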

Installation

You can install the development version of estweight from GitHub with:

# install.packages("devtools")
devtools::install_github("oliviabern/estweight")

Example of weighting a GLM for generalizable inference: the convGLM() function

Suppose you want to estimate a model η(Y) = Xβ, where Y is the response, X is a matrix of covariates, and η is a link function. However, Y and X were only collected in a convenience sample, and fitting the model directly in that sample may lead to bias from the unrepresentative sampling. Additionally, suppose you have a second dataset that is representative of the target population you want to draw inference about, but that does not contain both Y and X. This package uses the representative sample to estimate sampling weights for the convenience sample, and then uses those weights to fit a weighted version of the outcome model of interest in the convenience sample.

This basic example shows how to use a representative sample to estimate propensity weights and fit a weighted GLM for a biased convenience sample. First we simulate our reference population.

library(estweight)

set.seed(10403)

# simulate population
n.pop = 50000
X1 = rbinom(n.pop, size = 1, prob = 0.5)
X2 = rbinom(n.pop, size = 1, prob = .3)
X3 = rbinom(n.pop, size = 1, prob = .3)
X4 = rbinom(n.pop, size = 1, prob = .1)
X5 = rbinom(n.pop, size = 1, prob = .05)
X6 = rbinom(n.pop, size = 1, prob = .1)
X7 = rbinom(n.pop, size = 1, prob = .5)
X8 = rnorm(n.pop, mean = 65, sd = 5)

repsample = data.frame(X1, X2, X3, X4,
                       X5, X6, X7, X8)

# create helper function
expit = function(x){exp(x)/(1+exp(x))}

Next we generate the known sampling probabilities and corresponding true propensity weight. We also generate the response Y. Our goal is to estimate the outcome model of interest, η(Y) = β0 + β1X4 + β2X5 + β3X6. In this simulation, we know Y for subjects in both samples, but in practice we assume Y was only collected in the convenience sample.

# Calculate probability of being oversampled and true HT weight
sampprob = expit(.5*(repsample$X1*.3 + repsample$X2*.4 + repsample$X3*.8 +
                       repsample$X4*2.7 + repsample$X5*.9 + (repsample$X3)*(repsample$X5)*2 +
                       repsample$X6*.1 + repsample$X6*repsample$X7*1.5 +
                       -.002*repsample$X8^2 + 8))
repsample$sampprob = sampprob
repsample$weight_true = (1/sampprob)/sum(1/sampprob)

# Calculate E(Y) where Y is the response
repsample$Ey = expit(1 + log(2)*(repsample$X4) + -log(3)*(repsample$X5) + log(2.5)*(repsample$X6) +
                        -log(2)*sampprob + log(4)*(repsample$X4)*sampprob +
                        log(4)*(repsample$X5)*sampprob + -log(3)*(repsample$X6)*sampprob)

Now, we draw a biased convenience sample where each subject is sampled according to their sampling probability and a representative sample where each subject has equal probability of being sampled. Each sample contains n.samp = 500 observations.

n.samp = 500

# draw biased sample
b.samp.id = sample(1:n.pop, n.samp, prob = repsample$sampprob)
b.samp = repsample[b.samp.id, ]

# draw representative sample
r.samp.id = sample(1:n.pop, n.samp)
r.samp = repsample[r.samp.id, ]

Next, we simulate response Y for each subject and create an indicator biased which is 1 for subjects in the biased sample and 0 otherwise.

# simulate response
b.samp$y = rbinom(n.samp,1,b.samp$Ey)
r.samp$y = rbinom(n.samp,1,r.samp$Ey)

# biased sample indicator
b.samp$biased = 1; r.samp$biased = 0

We need to manipulate the data into the form required by the convGLM function. We create a stacked dataset that combines the biased and representative samples. The two samples are stacked vertically, with a unique subject identifier, ID, in the first column, the indicator biased in the last column, and any covariates collected in both datasets to be used for matching in the middle columns. All strings and indicators should be of class factor. Finally, we create a response matrix with 2 columns: ID and the response y.

# Combine biased and representative samples
Xcomb = data.frame(ID = 1:(2*n.samp),
                   rbind(b.samp, r.samp))

# Dataframe to pass to function
Xfit = Xcomb[,c("ID", paste0("X",1:8), "biased")]

# convert strings and indicators to factors
convert2factor = paste0("X",1:7)
for(i in 1:length(convert2factor)){
  Xfit[,convert2factor[i]] = as.factor(Xfit[,convert2factor[i]])
}

# get response
response = Xcomb[Xcomb$biased==1,c("ID","y")]

Use the convGLM function to fit a propensity-weighted generalized linear model using weights estimated via logistic regression.

# formulas for final scientific model
form.outcome = as.formula(y ~ X4 + X5 + X6)

# fit weighted model using a logistic propensity weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
        weight_model = "logistic", outcome_family = "quasibinomial")
#>                  coef analyticSE
#> (Intercept) 0.3695812  0.1111605
#> X41         1.5897785  0.4258854
#> X51         1.1122235  0.4653186
#> X61         0.3366221  0.3161996

Use a random forest propensity weight estimation method instead.

# fit weighted model using a random forest propensity weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
        weight_model = "randomForest", outcome_family = "quasibinomial")
#>                   coef analyticSE
#> (Intercept) 0.50048753  0.1477161
#> X41         1.70037738  0.4721582
#> X51         0.63636903  0.5422092
#> X61         0.03364452  0.3847607

Now try a covariate balancing propensity score weight estimation method.

# fit weighted model using a covariate balancing propensity score weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
        weight_model = "CBPS", outcome_family = "quasibinomial")
#> [1] "Finding ATT with T=0 as the treatment.  Set ATT=1 to find ATT with T=1 as the treatment"
#>                  coef analyticSE
#> (Intercept) 0.3738329  0.1117208
#> X41         1.5960229  0.4198840
#> X51         1.0953650  0.4677181
#> X61         0.3098333  0.3172941

Finally, try an entropy balancing weight estimation method.

# fit weighted model using an entropy balancing weight estimation method
convGLM(data = Xfit, outcome_formula = form.outcome, response = response,
        weight_model = "entbal", outcome_family = "quasibinomial")
#> final  value -0.049378 
#> converged
#>                  coef analyticSE
#> (Intercept) 0.3619440  0.1122514
#> X41         1.6440069  0.4228829
#> X51         1.1122146  0.4694176
#> X61         0.3191363  0.3193139

Estimating sampling weights only: the estweight() function

If you want to estimate sampling weights and use them in your own downstream analysis, you can use the estweight function. It will return sampling weights that you can use in any model that allows for sampling weights such as functions from the survey package. Note that the standard errors returned from these methods won’t account for uncertainty from estimating the sampling weights. To illustrate estimating sampling weights, we will use the simulated data from above.

weights = estweight(data = Xfit, weight_model = "logistic")
head(weights)
#>       ID     htweight
#> 46267  1 0.0007810555
#> 2230   2 0.0014069895
#> 13812  3 0.0005241307
#> 16025  4 0.0014827449
#> 46457  5 0.0011350264
#> 20468  6 0.0007226848
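The returned weights can then be passed to any weighted modeling routine, such as svydesign()/svyglm() from the survey package or the weights argument of glm(). A minimal base-R sketch with placeholder data (dat, w, x, and y are illustrative names, not package objects); note again that the SEs from such a fit do not account for the uncertainty from estimating the weights:

```r
set.seed(1)

# toy data standing in for a convenience sample with estimated sampling weights
dat = data.frame(y = rbinom(200, 1, 0.4),
                 x = rnorm(200),
                 w = runif(200, 0.5, 2))   # placeholder estimated weights

# weighted GLM; quasibinomial avoids non-integer successes warnings
fit = glm(y ~ x, data = dat, weights = w, family = quasibinomial())
coef(fit)
```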

Example of using sampling weights to isolate causal effects with propensity adjustment: the convPS() function

Now suppose that you want to use your convenience sample to obtain a propensity-score-adjusted estimate of a treatment effect. We can again use a representative sample to estimate sampling weights, but now the sampling weights should be used both to (1) estimate the propensity score and to (2) estimate the propensity-adjusted causal effect.
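Conceptually, the two uses of the sampling weights can be sketched as follows (a base-R illustration with simulated data, not the package's internal code; sw here is a placeholder for estimated sampling weights, and the SEs from this sketch do not account for weight or propensity estimation):

```r
set.seed(2)
n = 2000
sw = runif(n, 0.5, 2)                          # placeholder sampling weights
x = rnorm(n)                                   # confounder
Tx = rbinom(n, 1, plogis(x))                   # treatment depends on x
y = 2 * Tx + x + rnorm(n)                      # true treatment effect of 2

# (1) sampling-weighted propensity score model
ps = fitted(glm(Tx ~ x, weights = sw, family = quasibinomial()))

# (2) combine the sampling weight with the inverse-probability-of-treatment
#     weight in the outcome model
iptw = ifelse(Tx == 1, 1 / ps, 1 / (1 - ps))
fit = lm(y ~ Tx, weights = sw * iptw)
coef(fit)[["Tx"]]                              # close to the true effect of 2
```

The convPS() function performs this kind of two-step weighting and additionally provides an SE that propagates the estimation uncertainty.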

Let’s simulate a super-population where sampling depends on a variable K that we have not measured. Instead, we have measured a proxy for K, X1. Both K and X1 are binary covariates, but they are functions of continuous latent variables with a correlation of 0.90.

library(MASS)

n.pop = 10000; n = 2000

Sigma = matrix(c(1,.9,.9,1),nrow = 2)
cont = mvrnorm(n.pop, mu = c(0,0), Sigma = Sigma)

K.full = ifelse(cont[,1] > 0, 1, 0)
x1.full = ifelse(cont[,2] > 0, 1, 0)

prob.sample = .6*K.full + .2

We start by drawing a convenience sample according to the sampling probability:

CSamp = sample(1:n.pop, size = n, prob = prob.sample)
x1.c = x1.full[CSamp]
K.c = K.full[CSamp]

Then draw a simple random sample (SRS):

SRS = sample(1:n.pop, size = n)
x1.s = x1.full[SRS]
K.s = K.full[SRS]

Next, we simulate more variables for the convenience sample including potential confounders, the probability of being in the treatment group, and the treatment indicators. Notice that the probability of receiving treatment is differential for individuals with K = 1 or K = 0.

x2.c = rnorm(n,0,2)
x3.c = rnorm(n,0,2)
pi.c = expit((log(1.3)*x2.c + log(.4)*x3.c)*(1 - K.c) +
             (log(2)*x2.c + log(1.5)*x3.c)*K.c)
T.c = rbinom(n,1,pi.c)

Lastly, we simulate a response with treatment effect modification by K:

x4.c = rnorm(n,0,1)
x5.c = rnorm(n,0,1)
y.c = rnorm(n, mean = (0 + 1*T.c + 3*T.c*K.c + 1.5*x2.c - 2*x3.c^2 - 1*x4.c + 1.5*x5.c^3), sd = 1)

We can then format the convenience and representative samples to use the convPS() function:

convSamp = data.frame(x1 = x1.c, x2 = x2.c, x3 = x3.c,
                      x4 = x4.c, x5 = x5.c, Tx = T.c, y = y.c)
 
repSamp = data.frame(x1 = x1.s)

Now, we estimate the propensity-adjusted causal effect while accounting for the sampling scheme:

convPS(convSamp = convSamp, repSamp = repSamp,
       sampwt_vars = "x1", PS_vars = paste0("x",1:4),
       treatment_var = "Tx", response_var = "y")
#> $Treatment_effect_est
#> [1] 2.531667
#> 
#> $Treatment_effect_SE
#> [1] 0.6746462

This returns the estimated causal effect and a corresponding standard error that accounts for the uncertainty from estimating both the sampling weights and the propensity scores.
