Better treatment indicator handling, NA values in `match_on.formula` #124

markmfredrickson · 2017-02-21T15:00:15Z

Colin Hubbard sent in a bug report based on the following:

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> pairmatch(dose ~ Pred, data = tmp)
Error in toZ(z) : Treatment indicator must have exactly 2 levels not 1

At least as a work around, it would make sense to be able to set up an indicator variable, but that fails with NA values in the distances:

> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in s$solution : $ operator is invalid for atomic vectors
> match_on(dose2 ~ Pred)
Error in eval(expr, envir, enclos) : object 'dose2' not found
> match_on(dose2 ~ Pred, data = tmp)
An object of class "DenseMatrix"
         control
treatment  1  2  3  4
        5 NA NA NA NA
Slot "call":
match_on(x = dose2 ~ Pred, data = tmp)

I conjecture that the NA is due to the degenerate case of having only one treated unit, which makes the SD calculation fail when computing the SD of the treated group, but this is worth checking out in more depth.

The text was updated successfully, but these errors were encountered:

markmfredrickson · 2017-02-21T17:58:30Z

I believe my conjecture was right. Including a second treated unit is sufficient to get a valid match. Then the workaround gets the right answer:

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3 # WORKAROUND
> pairmatch(dose2 ~ Pred, data = tmp)
   1    2    3    4    5 
 1.2 <NA>  1.1  1.1  1.2

benthestatistician · 2017-02-21T17:59:35Z

Interesting. Thanks to Colin Hubbard for the tip.

I did a recover() on the example, here's what I'm seeing inside of makedist(z, data, compute_mahalanobis, within), the immediate parent of the failing call to toZ():

Browse[1]> ls()
[1] "data"       "distancefn" "within"     "z"         
Browse[1]> z
   1    2    3    4    5 
TRUE TRUE TRUE TRUE TRUE

This seemed to me to suggests the error may be somewhere different from where @markmfredrickson was thinking. (But then Mark's last post popped up 15 seconds ago, so maybe I'm just not following...)

josherrickson · 2017-02-21T18:24:47Z

> as.logical(c(2, 3))
[1] TRUE TRUE

setMethod("toZ", "numeric", function(x) as.logical(x))

Edit: Oops just looked at Mark's commit which already solved it!

markmfredrickson · 2017-02-21T18:26:08Z

@josherrickson is correct. See 4955bb9 for my changes to that line of code.

josherrickson · 2017-02-21T18:28:39Z

@markmfredrickson Might it not be better to enforce that numeric treatment vectors be 0/1? Either that or make it clear in the documentation that with other values we assume the lower is control.

benthestatistician · 2017-02-21T20:20:37Z

Interesting point by @josherrickson . If Colin Hubbard is watching, I'd appreciate hearing about his usage scenario.

Ordinarily I'd think that if some level of dose other than 0 were to serve as a control condition it would be best to have the function insist that that level be explicitly specified. Otherwise you might find that adding or removing a single observation totally changes the matching structure in unexpected ways, if the addition or removal of that obs resulted in a change to the lowest observed value of the treatment variable.

Perhaps I'm missing another usage scenario. Absent new info to that effect, I'm inclined to think Josh has a good point.

benthestatistician · 2017-07-11T13:38:27Z

I agree with @josherrickson, with the amendments that:

I'd propose that the coercion mechanism applied to numeric Z's be function(x) {x>0}, rather than as.logical.
The condition that we enforce be that there has to be at least one <= 0 entry and at least one >= 0.

benthestatistician · 2017-07-11T13:45:27Z

Not clearly germane to the issue at hand, but if anyone understood why this errored out

> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in s$solution : $ operator is invalid for atomic vectors

but this did not --

> match_on(dose2 ~ Pred, data = tmp)

-- I'd be interested to know. (Also your thoughts about fixability, if any.)

josherrickson · 2017-07-11T13:51:30Z

I don't see that error.

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
   1    2    3    4    5 
 1.2 <NA>  1.1  1.1  1.2

josherrickson · 2017-07-11T13:51:37Z

@benthestatistician Are you implying that c(0, 1, 2) is a valid treatment vector indicating that observations 2 and 3 are treatment? Or are you saying that the treatment vector is limited to two unique non-NA values, one of which must be 0 and the other must be > 0?

benthestatistician · 2017-07-11T14:50:05Z

Weird, I'm certainly getting the error.

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                  Pred = c(0.08494142, -0.02784262,
+                           -.05284574, -0.07428686,
+                           0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in pairmatch(dose2 ~ Pred, data = tmp) : 
  could not find function "pairmatch"
> packageVersion("optmatch")
[1] ‘0.9.7’
> R.version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.0

@josherrickson are you using optmatch 0.9-7 or development optmatch? Maybe the problem already got fixed. (If so, good to note so that we don't re-break it.)

Getting back to the main issue, I think that c(0,1,2) should be a valid treatment vector indicating that observations 1 and 2 are treatment while observation 0 is a control. The use case I have in mind is that there are multiple flavors of treatment (perhaps different dosages) and a single type of control. The analysis stage might separate out the treatment flavors (dosages), but for matching we're willing to glom them all together.

josherrickson · 2017-07-11T15:15:05Z

I'm going to push back a bit on this. My primary view is that we should accept two (and only two) versions of the treatment vector: 0/1/NA and TRUE/FALSE/NA. (I'm willing to concede accepting "0"/"1"/"NA" as well, though I think we should not).

My rationale is that by only accepting a very strict set of treatment vectors, we force users to be explicit in their handling of treatment and hopefully avoid surprises. While c(0, 1, 2) may make sense as a treatment vector in certain contexts, I would hypothesize that in most contexts, it indicates an error in the data (or the user's understanding of the data)[1]. By forcing the user who wants to use c(0, 1, 2) to define treatment2 = treatment > 0, we add a trivial amount of work, but protect all other users.

[1] The most common example I see of this is from survey data, where 1 = "no", 2 = "yes", 3 = "did not respond". If a client does the common trick of treatment = qualtricsdata - 1, they'll end up with a vector with some 2's, and perhaps never realize it, thus bundling "yes" and "did not respond" into the treatment group.

benthestatistician · 2017-07-11T16:51:49Z

Excellent points, @josherrickson . I'm persuaded.

In effect, this is an API change, so let's be sure to provide those users who'll all of a sudden be getting new errors with a clear trail of crumbs toward a solution,
either through the text of the error message, additional explanation in a help page that the error message points to, or both.

benthestatistician · 2017-07-21T13:18:56Z

It occurred to me that we should update makeOptmatch() so that the contrast.group attribute being set here isn't just TRUE's and FALSE's but also, where appropriate, NAs. @josherrickson do you agree?

Not a heavy lift, I think...

josherrickson · 2017-07-24T13:46:05Z

I agree. Having trouble compiling right now so I can't address it at the moment.

Something like this should work:

cg <- rep(NA, length(names(optmatch.obj)))
cg[names(optmatch.obj) %in% treated] <- 1
cg[names(optmatch.obj) %in% colnames(distance)] <- 0
attr(optmatch.obj, "contrast.group") <- as.logical(cg)

subproblem should be fine, but it should be double-checked as well.

I will take a look when I figure out why my computer is refusing to compile optmatch (I think it is an Rcpp issue:

Error in dyn.load(dllfile) :
  unable to load shared object '/Users/josh/repositories/r/optmatch/src/optmatch.so':
  dlopen(/Users/josh/repositories/r/optmatch/src/optmatch.so, 6): Symbol not found: _optmatch_ismOps
  Referenced from: /Users/josh/repositories/r/optmatch/src/optmatch.so
  Expected in: flat namespace
 in /Users/josh/repositories/r/optmatch/src/optmatch.so
Calls: <Anonymous> -> load_all -> load_dll -> library.dynam2 -> dyn.load

but I'm not sure. I feel like everytime I try and build a package these days, Rcpp modifies R/RCppExports.R and src/RcppExports.cpp without my say-so.), but feel free to make the changes sooner if you're trying to get a new version pushed out.

benthestatistician · 2017-07-24T14:09:41Z

I agree on each point, @josherrickson . Re compile problems, I think @markmfredrickson reported having similar trouble, then after some effort pushing through it. Mark, is this correct? Can you share some wisdom?

josherrickson · 2017-07-24T17:44:19Z

My laptop still builds fine. I'll have this updated shortly.

josherrickson · 2017-07-24T17:52:57Z

Updated in b4908b3. subproblem just needed a proper naming.

josherrickson · 2017-08-03T12:20:43Z

This has been completed I believe.

benthestatistician · 2017-08-03T15:09:34Z

Thanks @josherrickson!

markmfredrickson closed this as completed in 4955bb9 Feb 21, 2017

josherrickson mentioned this issue Jul 11, 2017

NAs not allowed in treatment indicator #123

Closed

benthestatistician assigned benthestatistician and josherrickson Jul 21, 2017

benthestatistician reopened this Jul 21, 2017

josherrickson mentioned this issue Jul 31, 2017

register native routines #127

Closed

josherrickson closed this as completed Aug 3, 2017

benthestatistician added a commit that referenced this issue Aug 3, 2017

Dropping some breadcrumbs for users inconvenienced by #124

149e42b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better treatment indicator handling, NA values in `match_on.formula` #124

Better treatment indicator handling, NA values in `match_on.formula` #124

markmfredrickson commented Feb 21, 2017 •

edited

markmfredrickson commented Feb 21, 2017

benthestatistician commented Feb 21, 2017

josherrickson commented Feb 21, 2017 •

edited

markmfredrickson commented Feb 21, 2017

josherrickson commented Feb 21, 2017

benthestatistician commented Feb 21, 2017

benthestatistician commented Jul 11, 2017

benthestatistician commented Jul 11, 2017

josherrickson commented Jul 11, 2017

josherrickson commented Jul 11, 2017

benthestatistician commented Jul 11, 2017

josherrickson commented Jul 11, 2017 •

edited

benthestatistician commented Jul 11, 2017

benthestatistician commented Jul 21, 2017

josherrickson commented Jul 24, 2017 •

edited

benthestatistician commented Jul 24, 2017

josherrickson commented Jul 24, 2017

josherrickson commented Jul 24, 2017

josherrickson commented Aug 3, 2017

benthestatistician commented Aug 3, 2017

Better treatment indicator handling, NA values in match_on.formula #124

Better treatment indicator handling, NA values in match_on.formula #124

Comments

markmfredrickson commented Feb 21, 2017 • edited

markmfredrickson commented Feb 21, 2017

benthestatistician commented Feb 21, 2017

josherrickson commented Feb 21, 2017 • edited

markmfredrickson commented Feb 21, 2017

josherrickson commented Feb 21, 2017

benthestatistician commented Feb 21, 2017

benthestatistician commented Jul 11, 2017

benthestatistician commented Jul 11, 2017

josherrickson commented Jul 11, 2017

josherrickson commented Jul 11, 2017

benthestatistician commented Jul 11, 2017

josherrickson commented Jul 11, 2017 • edited

benthestatistician commented Jul 11, 2017

benthestatistician commented Jul 21, 2017

josherrickson commented Jul 24, 2017 • edited

benthestatistician commented Jul 24, 2017

josherrickson commented Jul 24, 2017

josherrickson commented Jul 24, 2017

josherrickson commented Aug 3, 2017

benthestatistician commented Aug 3, 2017

Better treatment indicator handling, NA values in `match_on.formula` #124

Better treatment indicator handling, NA values in `match_on.formula` #124

markmfredrickson commented Feb 21, 2017 •

edited

josherrickson commented Feb 21, 2017 •

edited

josherrickson commented Jul 11, 2017 •

edited

josherrickson commented Jul 24, 2017 •

edited