Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better operators for interval-to-interval relations #6

Closed
mgacc0 opened this issue Dec 6, 2016 · 11 comments
Closed

better operators for interval-to-interval relations #6

mgacc0 opened this issue Dec 6, 2016 · 11 comments

Comments

@mgacc0
Copy link
Contributor

mgacc0 commented Dec 6, 2016

It would be really great to implement some new operators for interval-to-interval overlapping.

Instead of having the %[o]% operator for checking the overlapping between the closed intervals [a, b] and [x, y]
expressed as c(a, b) %[o]% c(x, y),
it would be great having the %[]o[]% operator
expressed as c(a, b) %[]o[]% c(x, y).

And then, for example:

c(1, 2) %[]o[]% c(2, 3)
# TRUE

c(2, 3) %[]o[]% c(1, 2)
# TRUE

c(1, 2) %()o()% c(2, 3)
# FALSE

c(2, 3) %()o()% c(1, 2)
# FALSE

Consequently, that would imply creating 32 operators (for all the combinations of open/closed endpoints):

interval-to-interval operator
%[]o[]%
%[]o[)%
%[]o(]%
%[]o()%
%[)o[]%
%[)o[)%
%[)o(]%
%[)o()%
... (continues)
@psolymos
Copy link
Owner

psolymos commented Dec 7, 2016

It crossed my mind to implement these, however, here is why I did not in the 1st go round:

  1. I found the %[]o[]% notation too long (almost looks like a TIE fighter), but it is kind of unavoidable.
  2. It is a lot of functions (4x4=16) that needs a lot of unit-tests.
  3. I did not see an immediate use case.

Out of these, point 3 is the most important, the other 2 are just stylistic/technical details. Do you have any suggestions on point 3?

The symmetric cases could be shortened, e.g. %[]o[]%=%[o]%, %()o()%=%(o)%.

@mgacc0
Copy link
Contributor Author

mgacc0 commented Dec 7, 2016

  1. The notation would be a bit long but, anyway, is intuitive (and quite equivalent to the mathematical notation).
    So that would make it easy to adopt.

  2. They are 16 functions. (I don't know why I wrote 32 previously...)

  3. In administrative data processing we use a lot of datetime ranges.
    And it's very usual to write them as:
    "from 01.12.2016 to 08.12.2016",
    meaning the semiopen interval ["2016-12-01 00:00:00", "2016-12-08 00:00:00").

    It would be a bit tedious:

    • having to write ["2016-12-01 00:00:00", "2016-12-07 23:59:59"]
    • or having to make two comparisons: one for overlapping with closed ranges and the next to check for the coincidence of endpoints

@mgacc0
Copy link
Contributor Author

mgacc0 commented Dec 7, 2016

I propose something like this:

.intrval3 <-
  function(interval1, type1, interval2, type2)
  {
    ab <- .get_intrval(interval1)
    cd <- .get_intrval(interval2)
    
    type1 <- match.arg(type1, c("[]", "[)", "(]", "()"))
    type2 <- match.arg(type2, c("[]", "[)", "(]", "()"))

    if (ab$a > cd$a) {
      tmp <- ab
      ab <- cd
      cd <- tmp

      tmp <- type1
      type1 <- type2
      type2 <- tmp
    }

    ! .lssthan(ab$b, cd,
               ifelse(substr(type1, 2L, 2L)=="]" & substr(type2, 1L, 1L)=="[",
                      "[", "("))
  }

.intrval3(c(1, 2), "()", c(2, 3), "()")
#  [1] FALSE

.intrval3(c(1, 2), "[]", c(2, 3), "[)")
#  [1] TRUE

@psolymos
Copy link
Owner

psolymos commented Dec 7, 2016

OK, I will make a branch and start development with your proposal.

@psolymos
Copy link
Owner

psolymos commented Dec 8, 2016

@mgacc0 see here what I have so far in overlap-operators branch: https://github.com/psolymos/intrval/blob/overlap-operators/extras/playground.R

@psolymos
Copy link
Owner

psolymos commented Dec 8, 2016

@mgacc0 I could test your function and it passed all the comparisons I came up with. The only issue right now is that it works for simple case but not yet vectorized (i.e. cannot accept matrices, lists because of the if (iv1$a > iv2$a) {...} statement).

@mgacc0
Copy link
Contributor Author

mgacc0 commented Dec 8, 2016

Firstly, a necessary correction: checking for empty intervals.

is_empty_interval <- function(interval, type) {
  interval[[1]]==interval[[2]] & type=="()"
}

overlaps <- function (interval1, interval2, type1, type2)
{
    stopifnot(is.vector(interval1) && length(interval1)<=2)
    stopifnot(is.vector(interval2) && length(interval2)<=2)

    iv1 <- .get_intrval(interval1)
    iv2 <- .get_intrval(interval2)

    type1 <- match.arg(type1, c("[]", "[)", "(]", "()"))
    type2 <- match.arg(type2, c("[]", "[)", "(]", "()"))

    if (is_empty_interval(iv1, type1) | is_empty_interval(iv2, type2)) {
      return(FALSE)
    }

    
    if (iv1$a > iv2$a) {
        tmp <- iv1
        iv1 <- iv2
        iv2 <- tmp

        tmp <- type1
        type1 <- type2
        type2 <- tmp
    }

    !.lssthan(iv1$b, iv2,
        ifelse(substr(type1, 2L, 2L)=="]" && substr(type2, 1L, 1L)=="[",
        "[", "("))
}

I would rename the .intrval3 function to overlaps and make it visible.


For easier testing, I would recommend the testthat package:

library(testthat)
expect_false(overlaps(c(2, 4), c(2, 2), "[]", "()"))
expect_false(overlaps(c(2, 2), c(2, 4), "()", "[]"))

For calling over dataframes, you could do something like this:

library(dplyr, purrr)

mm <- expand.grid(
  iv1_a=c(1, 2, 3, 4, 5),
  iv1_b=c(1, 2, 3, 4, 5),
  iv2_a=c(1, 2, 3, 4, 5),
  iv2_b=c(1, 2, 3, 4, 5),
  type1=c("[]", "[)", "(]", "()"),
  type2=c("[]", "[)", "(]", "()"),
  stringsAsFactors=FALSE)
nrow(mm)
# [1] 10000

mm %>%
  sample_n(4)
#      iv1_a iv1_b iv2_a iv2_b type1 type2
# 694      4     4     3     1    [)    []
# 612      2     3     5     5    []    []
# 1320     5     4     3     1    (]    []
# 6273     3     5     1     1    (]    (]

mm %>%
  sample_n(10) %>%
  by_row(~ overlaps(c(.[[1]], .[[2]]), c(.[[3]], .[[4]]),
                    .[[5]], .[[6]]), .collate="cols")
# # A tibble: 10 × 7
#    iv1_a iv1_b iv2_a iv2_b type1 type2  .out
#    <dbl> <dbl> <dbl> <dbl> <chr> <chr> <lgl>
# 1      5     2     2     2    (]    []  TRUE
# 2      1     4     4     5    ()    () FALSE
# 3      3     5     2     5    ()    []  TRUE
# 4      4     2     4     3    (]    (]  TRUE
# 5      5     3     2     1    (]    (] FALSE
# 6      2     3     2     4    (]    []  TRUE
# 7      2     1     4     2    ()    [) FALSE
# 8      5     2     1     4    (]    (]  TRUE
# 9      4     1     2     5    [)    []  TRUE
# 10     4     2     3     5    (]    [)  TRUE

mm %>%
  .[, 1:4] %>%
  sample_n(10) %>%
  by_row(~ c(.[[1]], .[[2]]) %[]o[]% c(.[[3]], .[[4]]), .collate="cols")
# # A tibble: 10 × 5
#    iv1_a iv1_b iv2_a iv2_b  .out
#    <dbl> <dbl> <dbl> <dbl> <lgl>
# 1      5     3     3     4  TRUE
# 2      3     1     2     2  TRUE
# 3      3     2     5     4 FALSE
# 4      3     2     2     4  TRUE
# 5      2     3     4     2  TRUE
# 6      3     4     2     5  TRUE
# 7      3     3     3     3  TRUE
# 8      3     3     4     1  TRUE
# 9      1     5     2     3  TRUE
# 10     2     5     2     4  TRUE

The notation for single vectors (intervals) is intuitive.
But I think that the intrval library doesn't need to provide an operator for data.frames.
You could place an example (on the README) of using the function purrr::by_row and overlaps.


Anyway, if you want to know how a function for data.frames would be:

intrval4 <- function (df1)
{
  stopifnot(is.data.frame(df1))
  stopifnot(nrow(df1)==1)

  iv1 <- .get_intrval(c(df1$iv1_a, df1$iv1_b))
  iv2 <- .get_intrval(c(df1$iv2_a, df1$iv2_b))

  type1 <- match.arg(df1$type1, c("[]", "[)", "(]", "()"))
  type2 <- match.arg(df1$type2, c("[]", "[)", "(]", "()"))

  if (is_empty_interval(iv1, type1) | is_empty_interval(iv1, type1)) {
    return (FALSE)
  }

  if (iv1$a > iv2$a) {
    tmp <- iv1
    iv1 <- iv2
    iv2 <- tmp

    tmp <- type1
    type1 <- type2
    type2 <- tmp
  }

  !.lssthan(iv1$b, iv2,
            ifelse(substr(type1, 2L, 2L)=="]" && substr(type2, 1L, 1L)=="[",
                   "[", "("))
}

mm %>%
  sample_n(20) %>%
  by_row(intrval4, .collate=c("cols"))

@psolymos
Copy link
Owner

Thanks. I just wanted to make the operators consistent with the %[o]% cases that can accept different formats on both ends and recycle values if needed. I agree it is not absolutely necessary for all use cases, but still nice to have.

I like your testing approach, but want to avoid too much dependencies. Even if I ignore /tests for CRAN versions, CI will complain about missing Suggests declarations.

@psolymos
Copy link
Owner

This function seems to do the job:

.intrval3 <-
function(interval1, interval2, type1, type2)
{
    iv1 <- .get_intrval(interval1)
    iv2 <- .get_intrval(interval2)

    type1 <- match.arg(type1, c("[]", "[)", "(]", "()"))
    type2 <- match.arg(type2, c("[]", "[)", "(]", "()"))

    b1 <- ifelse(iv1$a < iv2$a, iv1$b, iv2$b)
    a2 <- ifelse(iv1$a < iv2$a, iv2$a, iv1$a)
    type1v <- ifelse(iv1$a < iv2$a, substr(type1, 2L, 2L), substr(type2, 2L, 2L))
    type2v <- ifelse(iv1$a < iv2$a, substr(type2, 1L, 1L), substr(type1, 1L, 1L))

    ifelse(type1v == "]" & type2v == "[",
        b1 >= a2,
        b1 > a2)
}

The idea is similar to your proposal:

  1. sort intervals based on lower endpoints (a1 < a2)
  2. compare b1 to a2 based on open/closed endpoints.

@mgacc0
Copy link
Contributor Author

mgacc0 commented Dec 10, 2016

Could you have a look at ../pull/13 ?

@psolymos
Copy link
Owner

@mgacc0 I merged the overlap-operators branch, see comments on #15

Next steps include more testing and improving code coverage (it dropped because right now only internals are tested).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants