{{ message }}

# better operators for interval-to-interval relations#6

Closed
opened this issue Dec 6, 2016 · 11 comments
Closed

# better operators for interval-to-interval relations#6

opened this issue Dec 6, 2016 · 11 comments
Labels

### mgacc0 commented Dec 6, 2016

It would be really great to implement some new operators for interval-to-interval overlapping.

Instead of having the `%[o]%` operator for checking the overlapping between the closed intervals [a, b] and [x, y]
expressed as `c(a, b) %[o]% c(x, y)`,
it would be great having the `%[]o[]%` operator
expressed as `c(a, b) %[]o[]% c(x, y)`.

And then, for example:

``````c(1, 2) %[]o[]% c(2, 3)
# TRUE

c(2, 3) %[]o[]% c(1, 2)
# TRUE

c(1, 2) %()o()% c(2, 3)
# FALSE

c(2, 3) %()o()% c(1, 2)
# FALSE
``````

Consequently, that would imply creating 32 operators (for all the combinations of open/closed endpoints):

interval-to-interval operator
`%[]o[]%`
`%[]o[)%`
`%[]o(]%`
`%[]o()%`
`%[)o[]%`
`%[)o[)%`
`%[)o(]%`
`%[)o()%`
... (continues)

### psolymos commented Dec 7, 2016

 It crossed my mind to implement these, however, here is why I did not in the 1st go round: I found the `%[]o[]%` notation too long (almost looks like a TIE fighter), but it is kind of unavoidable. It is a lot of functions (4x4=16) that needs a lot of unit-tests. I did not see an immediate use case. Out of these, point 3 is the most important, the other 2 are just stylistic/technical details. Do you have any suggestions on point 3? The symmetric cases could be shortened, e.g. `%[]o[]%`=`%[o]%`, `%()o()%`=`%(o)%`.

### mgacc0 commented Dec 7, 2016 • edited

 The notation would be a bit long but, anyway, is intuitive (and quite equivalent to the mathematical notation). So that would make it easy to adopt. They are 16 functions. (I don't know why I wrote 32 previously...) In administrative data processing we use a lot of datetime ranges. And it's very usual to write them as: "from 01.12.2016 to 08.12.2016", meaning the semiopen interval ["2016-12-01 00:00:00", "2016-12-08 00:00:00"). It would be a bit tedious: having to write ["2016-12-01 00:00:00", "2016-12-07 23:59:59"] or having to make two comparisons: one for overlapping with closed ranges and the next to check for the coincidence of endpoints

### mgacc0 commented Dec 7, 2016 • edited

 I propose something like this: ``````.intrval3 <- function(interval1, type1, interval2, type2) { ab <- .get_intrval(interval1) cd <- .get_intrval(interval2) type1 <- match.arg(type1, c("[]", "[)", "(]", "()")) type2 <- match.arg(type2, c("[]", "[)", "(]", "()")) if (ab\$a > cd\$a) { tmp <- ab ab <- cd cd <- tmp tmp <- type1 type1 <- type2 type2 <- tmp } ! .lssthan(ab\$b, cd, ifelse(substr(type1, 2L, 2L)=="]" & substr(type2, 1L, 1L)=="[", "[", "(")) } .intrval3(c(1, 2), "()", c(2, 3), "()") # [1] FALSE .intrval3(c(1, 2), "[]", c(2, 3), "[)") # [1] TRUE ``````

### psolymos commented Dec 7, 2016

 OK, I will make a branch and start development with your proposal.

### psolymos commented Dec 8, 2016

 @mgacc0 see here what I have so far in `overlap-operators` branch: https://github.com/psolymos/intrval/blob/overlap-operators/extras/playground.R

### psolymos commented Dec 8, 2016

 @mgacc0 I could test your function and it passed all the comparisons I came up with. The only issue right now is that it works for simple case but not yet vectorized (i.e. cannot accept matrices, lists because of the `if (iv1\$a > iv2\$a) {...}` statement).

### mgacc0 commented Dec 8, 2016

 Firstly, a necessary correction: checking for empty intervals. ``````is_empty_interval <- function(interval, type) { interval[[1]]==interval[[2]] & type=="()" } overlaps <- function (interval1, interval2, type1, type2) { stopifnot(is.vector(interval1) && length(interval1)<=2) stopifnot(is.vector(interval2) && length(interval2)<=2) iv1 <- .get_intrval(interval1) iv2 <- .get_intrval(interval2) type1 <- match.arg(type1, c("[]", "[)", "(]", "()")) type2 <- match.arg(type2, c("[]", "[)", "(]", "()")) if (is_empty_interval(iv1, type1) | is_empty_interval(iv2, type2)) { return(FALSE) } if (iv1\$a > iv2\$a) { tmp <- iv1 iv1 <- iv2 iv2 <- tmp tmp <- type1 type1 <- type2 type2 <- tmp } !.lssthan(iv1\$b, iv2, ifelse(substr(type1, 2L, 2L)=="]" && substr(type2, 1L, 1L)=="[", "[", "(")) } `````` I would rename the `.intrval3` function to `overlaps` and make it visible. For easier testing, I would recommend the testthat package: ``````library(testthat) expect_false(overlaps(c(2, 4), c(2, 2), "[]", "()")) expect_false(overlaps(c(2, 2), c(2, 4), "()", "[]")) `````` For calling over dataframes, you could do something like this: ``````library(dplyr, purrr) mm <- expand.grid( iv1_a=c(1, 2, 3, 4, 5), iv1_b=c(1, 2, 3, 4, 5), iv2_a=c(1, 2, 3, 4, 5), iv2_b=c(1, 2, 3, 4, 5), type1=c("[]", "[)", "(]", "()"), type2=c("[]", "[)", "(]", "()"), stringsAsFactors=FALSE) nrow(mm) # [1] 10000 mm %>% sample_n(4) # iv1_a iv1_b iv2_a iv2_b type1 type2 # 694 4 4 3 1 [) [] # 612 2 3 5 5 [] [] # 1320 5 4 3 1 (] [] # 6273 3 5 1 1 (] (] mm %>% sample_n(10) %>% by_row(~ overlaps(c(.[[1]], .[[2]]), c(.[[3]], .[[4]]), .[[5]], .[[6]]), .collate="cols") # # A tibble: 10 × 7 # iv1_a iv1_b iv2_a iv2_b type1 type2 .out # # 1 5 2 2 2 (] [] TRUE # 2 1 4 4 5 () () FALSE # 3 3 5 2 5 () [] TRUE # 4 4 2 4 3 (] (] TRUE # 5 5 3 2 1 (] (] FALSE # 6 2 3 2 4 (] [] TRUE # 7 2 1 4 2 () [) FALSE # 8 5 2 1 4 (] (] TRUE # 9 4 1 2 5 [) [] TRUE # 10 4 2 3 5 (] [) TRUE mm %>% .[, 1:4] %>% sample_n(10) %>% by_row(~ c(.[[1]], .[[2]]) %[]o[]% c(.[[3]], .[[4]]), .collate="cols") # # A tibble: 10 × 5 # iv1_a iv1_b iv2_a iv2_b .out # # 1 5 3 3 4 TRUE # 2 3 1 2 2 TRUE # 3 3 2 5 4 FALSE # 4 3 2 2 4 TRUE # 5 2 3 4 2 TRUE # 6 3 4 2 5 TRUE # 7 3 3 3 3 TRUE # 8 3 3 4 1 TRUE # 9 1 5 2 3 TRUE # 10 2 5 2 4 TRUE `````` The notation for single vectors (intervals) is intuitive. But I think that the `intrval` library doesn't need to provide an operator for data.frames. You could place an example (on the README) of using the function `purrr::by_row` and `overlaps`. Anyway, if you want to know how a function for data.frames would be: ``````intrval4 <- function (df1) { stopifnot(is.data.frame(df1)) stopifnot(nrow(df1)==1) iv1 <- .get_intrval(c(df1\$iv1_a, df1\$iv1_b)) iv2 <- .get_intrval(c(df1\$iv2_a, df1\$iv2_b)) type1 <- match.arg(df1\$type1, c("[]", "[)", "(]", "()")) type2 <- match.arg(df1\$type2, c("[]", "[)", "(]", "()")) if (is_empty_interval(iv1, type1) | is_empty_interval(iv1, type1)) { return (FALSE) } if (iv1\$a > iv2\$a) { tmp <- iv1 iv1 <- iv2 iv2 <- tmp tmp <- type1 type1 <- type2 type2 <- tmp } !.lssthan(iv1\$b, iv2, ifelse(substr(type1, 2L, 2L)=="]" && substr(type2, 1L, 1L)=="[", "[", "(")) } mm %>% sample_n(20) %>% by_row(intrval4, .collate=c("cols")) ``````

### psolymos commented Dec 10, 2016

 Thanks. I just wanted to make the operators consistent with the `%[o]%` cases that can accept different formats on both ends and recycle values if needed. I agree it is not absolutely necessary for all use cases, but still nice to have. I like your testing approach, but want to avoid too much dependencies. Even if I ignore `/tests` for CRAN versions, CI will complain about missing `Suggests` declarations.

### psolymos commented Dec 10, 2016

 This function seems to do the job: ```.intrval3 <- function(interval1, interval2, type1, type2) { iv1 <- .get_intrval(interval1) iv2 <- .get_intrval(interval2) type1 <- match.arg(type1, c("[]", "[)", "(]", "()")) type2 <- match.arg(type2, c("[]", "[)", "(]", "()")) b1 <- ifelse(iv1\$a < iv2\$a, iv1\$b, iv2\$b) a2 <- ifelse(iv1\$a < iv2\$a, iv2\$a, iv1\$a) type1v <- ifelse(iv1\$a < iv2\$a, substr(type1, 2L, 2L), substr(type2, 2L, 2L)) type2v <- ifelse(iv1\$a < iv2\$a, substr(type2, 1L, 1L), substr(type1, 1L, 1L)) ifelse(type1v == "]" & type2v == "[", b1 >= a2, b1 > a2) }``` The idea is similar to your proposal: sort intervals based on lower endpoints (a1 < a2) compare b1 to a2 based on open/closed endpoints.

### mgacc0 commented Dec 10, 2016

 Could you have a look at ../pull/13 ?

### psolymos commented Dec 22, 2016

 @mgacc0 I merged the overlap-operators branch, see comments on #15 Next steps include more testing and improving code coverage (it dropped because right now only internals are tested).

closed this Jan 22, 2017