Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better treatment indicator handling, NA values in match_on.formula #124

Closed
markmfredrickson opened this issue Feb 21, 2017 · 20 comments
Closed
Assignees

Comments

@markmfredrickson
Copy link
Owner

markmfredrickson commented Feb 21, 2017

Colin Hubbard sent in a bug report based on the following:

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> pairmatch(dose ~ Pred, data = tmp)
Error in toZ(z) : Treatment indicator must have exactly 2 levels not 1

At least as a work around, it would make sense to be able to set up an indicator variable, but that fails with NA values in the distances:

> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in s$solution : $ operator is invalid for atomic vectors
> match_on(dose2 ~ Pred)
Error in eval(expr, envir, enclos) : object 'dose2' not found
> match_on(dose2 ~ Pred, data = tmp)
An object of class "DenseMatrix"
         control
treatment  1  2  3  4
        5 NA NA NA NA
Slot "call":
match_on(x = dose2 ~ Pred, data = tmp)

I conjecture that the NA is due to the degenerate case of having only one treated unit, which makes the SD calculation fail when computing the SD of the treated group, but this is worth checking out in more depth.

@markmfredrickson
Copy link
Owner Author

I believe my conjecture was right. Including a second treated unit is sufficient to get a valid match. Then the workaround gets the right answer:

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3 # WORKAROUND
> pairmatch(dose2 ~ Pred, data = tmp)
   1    2    3    4    5 
 1.2 <NA>  1.1  1.1  1.2 

@benthestatistician
Copy link
Collaborator

Interesting. Thanks to Colin Hubbard for the tip.

I did a recover() on the example, here's what I'm seeing inside of makedist(z, data, compute_mahalanobis, within), the immediate parent of the failing call to toZ():

Browse[1]> ls()
[1] "data"       "distancefn" "within"     "z"         
Browse[1]> z
   1    2    3    4    5 
TRUE TRUE TRUE TRUE TRUE 

This seemed to me to suggests the error may be somewhere different from where @markmfredrickson was thinking. (But then Mark's last post popped up 15 seconds ago, so maybe I'm just not following...)

@josherrickson
Copy link
Collaborator

josherrickson commented Feb 21, 2017

> as.logical(c(2, 3))
[1] TRUE TRUE
setMethod("toZ", "numeric", function(x) as.logical(x))

Edit: Oops just looked at Mark's commit which already solved it!

@markmfredrickson
Copy link
Owner Author

@josherrickson is correct. See 4955bb9 for my changes to that line of code.

@josherrickson
Copy link
Collaborator

@markmfredrickson Might it not be better to enforce that numeric treatment vectors be 0/1? Either that or make it clear in the documentation that with other values we assume the lower is control.

@benthestatistician
Copy link
Collaborator

Interesting point by @josherrickson . If Colin Hubbard is watching, I'd appreciate hearing about his usage scenario.

Ordinarily I'd think that if some level of dose other than 0 were to serve as a control condition it would be best to have the function insist that that level be explicitly specified. Otherwise you might find that adding or removing a single observation totally changes the matching structure in unexpected ways, if the addition or removal of that obs resulted in a change to the lowest observed value of the treatment variable.

Perhaps I'm missing another usage scenario. Absent new info to that effect, I'm inclined to think Josh has a good point.

@benthestatistician
Copy link
Collaborator

I agree with @josherrickson, with the amendments that:

  1. I'd propose that the coercion mechanism applied to numeric Z's be function(x) {x>0}, rather than as.logical.
  2. The condition that we enforce be that there has to be at least one <= 0 entry and at least one >= 0.

@benthestatistician
Copy link
Collaborator

Not clearly germane to the issue at hand, but if anyone understood why this errored out

> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in s$solution : $ operator is invalid for atomic vectors   

but this did not --

> match_on(dose2 ~ Pred, data = tmp)

-- I'd be interested to know. (Also your thoughts about fixability, if any.)

@josherrickson
Copy link
Collaborator

I don't see that error.

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
   1    2    3    4    5 
 1.2 <NA>  1.1  1.1  1.2

@josherrickson
Copy link
Collaborator

@benthestatistician Are you implying that c(0, 1, 2) is a valid treatment vector indicating that observations 2 and 3 are treatment? Or are you saying that the treatment vector is limited to two unique non-NA values, one of which must be 0 and the other must be > 0?

@benthestatistician
Copy link
Collaborator

Weird, I'm certainly getting the error.

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                  Pred = c(0.08494142, -0.02784262,
+                           -.05284574, -0.07428686,
+                           0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in pairmatch(dose2 ~ Pred, data = tmp) : 
  could not find function "pairmatch"
> packageVersion("optmatch")
[1] ‘0.9.7’
> R.version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.0       

@josherrickson are you using optmatch 0.9-7 or development optmatch? Maybe the problem already got fixed. (If so, good to note so that we don't re-break it.)

Getting back to the main issue, I think that c(0,1,2) should be a valid treatment vector indicating that observations 1 and 2 are treatment while observation 0 is a control. The use case I have in mind is that there are multiple flavors of treatment (perhaps different dosages) and a single type of control. The analysis stage might separate out the treatment flavors (dosages), but for matching we're willing to glom them all together.

@josherrickson
Copy link
Collaborator

josherrickson commented Jul 11, 2017

I'm going to push back a bit on this. My primary view is that we should accept two (and only two) versions of the treatment vector: 0/1/NA and TRUE/FALSE/NA. (I'm willing to concede accepting "0"/"1"/"NA" as well, though I think we should not).

My rationale is that by only accepting a very strict set of treatment vectors, we force users to be explicit in their handling of treatment and hopefully avoid surprises. While c(0, 1, 2) may make sense as a treatment vector in certain contexts, I would hypothesize that in most contexts, it indicates an error in the data (or the user's understanding of the data)[1]. By forcing the user who wants to use c(0, 1, 2) to define treatment2 = treatment > 0, we add a trivial amount of work, but protect all other users.

[1] The most common example I see of this is from survey data, where 1 = "no", 2 = "yes", 3 = "did not respond". If a client does the common trick of treatment = qualtricsdata - 1, they'll end up with a vector with some 2's, and perhaps never realize it, thus bundling "yes" and "did not respond" into the treatment group.

@benthestatistician
Copy link
Collaborator

Excellent points, @josherrickson . I'm persuaded.

In effect, this is an API change, so let's be sure to provide those users who'll all of a sudden be getting new errors with a clear trail of crumbs toward a solution,
either through the text of the error message, additional explanation in a help page that the error message points to, or both.

@benthestatistician
Copy link
Collaborator

It occurred to me that we should update makeOptmatch() so that the contrast.group attribute being set here isn't just TRUE's and FALSE's but also, where appropriate, NAs. @josherrickson do you agree?

Not a heavy lift, I think...

@josherrickson
Copy link
Collaborator

josherrickson commented Jul 24, 2017

I agree. Having trouble compiling right now so I can't address it at the moment.

Something like this should work:

cg <- rep(NA, length(names(optmatch.obj)))
cg[names(optmatch.obj) %in% treated] <- 1
cg[names(optmatch.obj) %in% colnames(distance)] <- 0
attr(optmatch.obj, "contrast.group") <- as.logical(cg)

subproblem should be fine, but it should be double-checked as well.

I will take a look when I figure out why my computer is refusing to compile optmatch (I think it is an Rcpp issue:

Error in dyn.load(dllfile) :
  unable to load shared object '/Users/josh/repositories/r/optmatch/src/optmatch.so':
  dlopen(/Users/josh/repositories/r/optmatch/src/optmatch.so, 6): Symbol not found: _optmatch_ismOps
  Referenced from: /Users/josh/repositories/r/optmatch/src/optmatch.so
  Expected in: flat namespace
 in /Users/josh/repositories/r/optmatch/src/optmatch.so
Calls: <Anonymous> -> load_all -> load_dll -> library.dynam2 -> dyn.load

but I'm not sure. I feel like everytime I try and build a package these days, Rcpp modifies R/RCppExports.R and src/RcppExports.cpp without my say-so.), but feel free to make the changes sooner if you're trying to get a new version pushed out.

@benthestatistician
Copy link
Collaborator

I agree on each point, @josherrickson . Re compile problems, I think @markmfredrickson reported having similar trouble, then after some effort pushing through it. Mark, is this correct? Can you share some wisdom?

@josherrickson
Copy link
Collaborator

My laptop still builds fine. I'll have this updated shortly.

@josherrickson
Copy link
Collaborator

Updated in b4908b3. subproblem just needed a proper naming.

@josherrickson
Copy link
Collaborator

This has been completed I believe.

@benthestatistician
Copy link
Collaborator

Thanks @josherrickson!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants