New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for window function #100

Merged
merged 12 commits into from May 3, 2018

Conversation

Projects
None yet
2 participants
@corwinjoy
Contributor

corwinjoy commented Jun 3, 2015

I propose that a window function be added to xts so that users can have binary lookup over a range of dates.
This function is the same as in the base package zoo, that is window.zoo, however we use binary search over the start and end date.

corwinjoy added some commits Jun 3, 2015

Add support for window function
I propose that a window function be added to xts so that users can have binary lookup over a range of dates.
This function is the same as in the base package zoo, that is window.zoo, however we use binary search over the start and end date.
Correct timezone for window function
Also, window_bin probably can't be easily reused in .subset.xts - too much overhead relative to a two line binary search.
If I wasn't lazy - it could actually be done as follows:
1. Have window_bin take a .index parameter like window.zoo and return a set of index numbers (possibly NULL for no match).
Then .subset.xts could re-use.
Even better, subset.xts should be rewritten to do binary search when it is passed a set of dates.  How?  Actually, R has a nice built-in function for this findInterval.  This uses binary search, but even better, it starts at the last matched value on subsequent searches.  This ends up being O(n) on the number of dates to lookup.  Much faster than O(n*N) with the current match logic.
Use .index in window function
We need to make sure we are accessing the internal POSIXct index, so switch the call to .index.  The index function (potentially) converts to an end-user index.
@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich Jun 5, 2015

Owner

Thank you very much for taking the initiative on this! A couple things I would like to see done (either by you or me; let me know your preference):

  1. Import stats::window, rename window_bin to window.xts and register it as a S3 method
  2. Remove all the test code from R/xts.methods.R
    • xts does have RUnit-based tests, but they've not been run in ages. I would like to to get them running again.
  3. Write man/window.xts.Rd
  4. Add index. argument to match window.zoo definition (since xts extends zoo)
  5. Possibly be more defensive about potential values for start and end
    • I'm thinking about situations where start or end are already POSIXct, but have a different timezone. Or if they're some other time-based class (e.g. timeDate).
Owner

joshuaulrich commented Jun 5, 2015

Thank you very much for taking the initiative on this! A couple things I would like to see done (either by you or me; let me know your preference):

  1. Import stats::window, rename window_bin to window.xts and register it as a S3 method
  2. Remove all the test code from R/xts.methods.R
    • xts does have RUnit-based tests, but they've not been run in ages. I would like to to get them running again.
  3. Write man/window.xts.Rd
  4. Add index. argument to match window.zoo definition (since xts extends zoo)
  5. Possibly be more defensive about potential values for start and end
    • I'm thinking about situations where start or end are already POSIXct, but have a different timezone. Or if they're some other time-based class (e.g. timeDate).
@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy Jun 5, 2015

Contributor

@joshuaulrich wrote a number of nice comments as reproduced below. Basically I'm open to all of these ideas but would like to iron out specifics. Here are my thoughts:

  1. Import stats::window, rename window_bin to window.xts and register it as a S3 method.
    • The renaming makes sense. In terms of registering an S3 method, I'm not sure what is needed. zoo already defines window as a generic - does window.xts need to be added to the package export list?
  2. Remove all the test code from R/xts.methods.R
    xts does have RUnit-based tests, but they've not been run in ages. I would like to to get them running again.
    • Will do - let me get closer to the final form before I rip this out since I want to be careful and not make a mistake here. I do like mini-tests to help make sure the obvious cases are covered.
  3. Write man/window.xts.Rd
    • Can I pass the buck here? I think this is mostly a case of copying the zoo docs.
  4. Add index. argument to match window.zoo definition (since xts extends zoo)
    • Maybe? I'd like some feedback on how this would work since I don't have a deep understanding of zoo and xts. What does it mean if you pass in an index that is a different time index than what is used in xts? Does it have to follow the rules for the index in the xts constructor. (Unique, ordered, time-based class). If we get away from this life gets hard (duplicates should return duplicates?), (non-ordered means no binary search / or do we sort?) I suppose xts has conversion functions to the native index. For a first pass I left it out since I was afraid of getting this wrong.
  5. Possibly be more defensive about potential values for start and end
    I'm thinking about situations where start or end are already POSIXct, but have a different timezone. Or if they're some other time-based class (e.g. timeDate).
    • This is worth checking carefully. If we were in C the function could take only POSIXct dates and all would be well. R users are a bit spoiled. I know that I often like to use window with a pair of string dates to inspect a series. window.zoo is pretty loose and says only that the index must be comparable to the start and end dates (which may be character if the index support character comparisons). I went for a standard more in line with the subset operator and said the dates needed to be covertible to POSIXct.

The .subset.xts function has the following code:

if (timeBased(i)) {
      if(inherits(i, "POSIXct")) {
        i <- which(!is.na(match(.index(x), i)))
      } else if(inherits(i, "Date")) {
        i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
      } else {
        # force all other time classes to be POSIXct
        i <- which(!is.na(match(.index(x), as.POSIXct(i,tz=indexTZ(x)))))
      }
      i[is.na(i)] <- 0
    }

Instead, I simply ask for a conversion to POSIXct and fail if there is not an unambiguous conversion.
start <- as.POSIXct(start, tz=indexTZ(x)) # If start is already POSIXct this does nothing.

I think this is cleaner, but let's look at the cases. First, imagine they pass in a POSIXct date with a different time zone. I think this is OK because the internal representation will be the same. For example:

x <- as.POSIXct("2015-06-05", tz = "GMT")
y <- x
attr(y, "tzone") = ""

# The timezone changes the pretty printing
> x
[1] "2015-06-05 GMT"

# My default time zone is PDT
> y
[1] "2015-06-04 17:00:00 PDT"

# The time zone attribute does not change the underlying representation because this is always in UTC.
# From the docs for POSIXct:
# Class "POSIXct" represents the (signed) number of seconds since the beginning of 1970 (in the UTC timezone) as a numeric vector. 

> unclass(x)
[1] 1433462400
attr(,"tzone")
[1] "GMT"

> unclass(y)
[1] 1433462400
attr(,"tzone")
[1] ""

Conversion from a Date and conversion from other time based classes tries to trust the conversion operator. Here I think

start <- as.POSIXct(start, tz=indexTZ(x))

is safer than

 else if(inherits(i, "Date")) {
        i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
      } 

What if as.character(i) converts the date to one time zone (the default) then as.POSIXct reads this as another? Then I think we get the wrong date.

Really this time zone, indexTZ, may only be used if start is a string.

Finally, another option would be to emulate window.zoo and call the .binsearch function. Looking at the code this compares index > key, index < key and index != key. This would be a lower standard, but I don't like it. These comparisons may be slow and if we can't get an unambiguous conversion to our index type I think we should make the user clarify.

Contributor

corwinjoy commented Jun 5, 2015

@joshuaulrich wrote a number of nice comments as reproduced below. Basically I'm open to all of these ideas but would like to iron out specifics. Here are my thoughts:

  1. Import stats::window, rename window_bin to window.xts and register it as a S3 method.
    • The renaming makes sense. In terms of registering an S3 method, I'm not sure what is needed. zoo already defines window as a generic - does window.xts need to be added to the package export list?
  2. Remove all the test code from R/xts.methods.R
    xts does have RUnit-based tests, but they've not been run in ages. I would like to to get them running again.
    • Will do - let me get closer to the final form before I rip this out since I want to be careful and not make a mistake here. I do like mini-tests to help make sure the obvious cases are covered.
  3. Write man/window.xts.Rd
    • Can I pass the buck here? I think this is mostly a case of copying the zoo docs.
  4. Add index. argument to match window.zoo definition (since xts extends zoo)
    • Maybe? I'd like some feedback on how this would work since I don't have a deep understanding of zoo and xts. What does it mean if you pass in an index that is a different time index than what is used in xts? Does it have to follow the rules for the index in the xts constructor. (Unique, ordered, time-based class). If we get away from this life gets hard (duplicates should return duplicates?), (non-ordered means no binary search / or do we sort?) I suppose xts has conversion functions to the native index. For a first pass I left it out since I was afraid of getting this wrong.
  5. Possibly be more defensive about potential values for start and end
    I'm thinking about situations where start or end are already POSIXct, but have a different timezone. Or if they're some other time-based class (e.g. timeDate).
    • This is worth checking carefully. If we were in C the function could take only POSIXct dates and all would be well. R users are a bit spoiled. I know that I often like to use window with a pair of string dates to inspect a series. window.zoo is pretty loose and says only that the index must be comparable to the start and end dates (which may be character if the index support character comparisons). I went for a standard more in line with the subset operator and said the dates needed to be covertible to POSIXct.

The .subset.xts function has the following code:

if (timeBased(i)) {
      if(inherits(i, "POSIXct")) {
        i <- which(!is.na(match(.index(x), i)))
      } else if(inherits(i, "Date")) {
        i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
      } else {
        # force all other time classes to be POSIXct
        i <- which(!is.na(match(.index(x), as.POSIXct(i,tz=indexTZ(x)))))
      }
      i[is.na(i)] <- 0
    }

Instead, I simply ask for a conversion to POSIXct and fail if there is not an unambiguous conversion.
start <- as.POSIXct(start, tz=indexTZ(x)) # If start is already POSIXct this does nothing.

I think this is cleaner, but let's look at the cases. First, imagine they pass in a POSIXct date with a different time zone. I think this is OK because the internal representation will be the same. For example:

x <- as.POSIXct("2015-06-05", tz = "GMT")
y <- x
attr(y, "tzone") = ""

# The timezone changes the pretty printing
> x
[1] "2015-06-05 GMT"

# My default time zone is PDT
> y
[1] "2015-06-04 17:00:00 PDT"

# The time zone attribute does not change the underlying representation because this is always in UTC.
# From the docs for POSIXct:
# Class "POSIXct" represents the (signed) number of seconds since the beginning of 1970 (in the UTC timezone) as a numeric vector. 

> unclass(x)
[1] 1433462400
attr(,"tzone")
[1] "GMT"

> unclass(y)
[1] 1433462400
attr(,"tzone")
[1] ""

Conversion from a Date and conversion from other time based classes tries to trust the conversion operator. Here I think

start <- as.POSIXct(start, tz=indexTZ(x))

is safer than

 else if(inherits(i, "Date")) {
        i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
      } 

What if as.character(i) converts the date to one time zone (the default) then as.POSIXct reads this as another? Then I think we get the wrong date.

Really this time zone, indexTZ, may only be used if start is a string.

Finally, another option would be to emulate window.zoo and call the .binsearch function. Looking at the code this compares index > key, index < key and index != key. This would be a lower standard, but I don't like it. These comparisons may be slow and if we can't get an unambiguous conversion to our index type I think we should make the user clarify.

corwinjoy added some commits Jun 10, 2015

1. Add support for .index parameter to window.xts. Upgrade subset to use
window.xts. Make .index efficient by using findInterval.
2. Get unit tests running again. Had to disable the reclass tests for now
because the attributes don't match in checkIdentical. Disabled with the
prefix no_test.
3. Fixed class order for POSIXct in timeBasedSeq.R so the result matches
POSIXct (and passes unit tests).
4. Added window.xts to the namespace.
1. Fix a bug in window.xts. Make sure that .index parameter still works
when the xts series has repeated dates in series.
2. Add support for dates of the form I(character dates).  The default
conversion did not work.
3. Get a clean R CMD check result.  To do this I had to fix the line
endings for the source files (via dos2unix) and makefiles.  Check also
complained about missing Author and Maintainer fields in the DESCRIPTION file.
4. Add documentation for window.xts
@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy Jun 11, 2015

Contributor

Josh,
As requested I've gone ahead and made these changes. I think the new
window function is ready for you to review. Here is what I have done:

  1. Add support for the .index parameter and make subset use window for sets
    of dates.
  2. Add unit tests for window.xts and subset.xts. Rename the old
    runit.xts.methods.R to runit.bind.R.
  3. Get all the unit tests working again. For the moment, I had to disable
    the reclass tests since they seem to be broken.
  4. Get R CMD check to run cleanly. This involved fixing line endings to
    be Unix style and adding a couple missing fields to the DESCRIPTION file.
  5. Write man/window.xts.Rd.

I think the window code is pretty solid now and ready for you to review
when you get a chance. Also, in the process of working on this I noticed a
feature of subset that surprised me a bit. That is, returned subsets are
always forced to be in ascending order, not the order given by the user.

> x <- xts(1:1000, as.Date("2000-01-01")+1:1000)

> x[c(3,2,1),]  # Result is sorted
           [,1]
2000-01-02    1
2000-01-03    2
2000-01-04    3

> x[c(1,1,2,3),]  # Duplicates are allowed
           [,1]
2000-01-02    1
2000-01-02    1
2000-01-03    2
2000-01-04    3

Now, you are probably thinking "Of course, you idiot, the returned subset
must be an xts series and therefore the dates must be sorted.". That's
fair, but this was still a bit of a surprise to me since it differs from
normal subsetting. Anyway, I might not be the only idiot out there and I'm
wondering if we should add something explicit to the subset documentation
about this.

Thanks again. I ended up touching a bit more code than I planned on for
this feature, but I wanted a clean build and I think the changes make sense.

Corwin

Contributor

corwinjoy commented Jun 11, 2015

Josh,
As requested I've gone ahead and made these changes. I think the new
window function is ready for you to review. Here is what I have done:

  1. Add support for the .index parameter and make subset use window for sets
    of dates.
  2. Add unit tests for window.xts and subset.xts. Rename the old
    runit.xts.methods.R to runit.bind.R.
  3. Get all the unit tests working again. For the moment, I had to disable
    the reclass tests since they seem to be broken.
  4. Get R CMD check to run cleanly. This involved fixing line endings to
    be Unix style and adding a couple missing fields to the DESCRIPTION file.
  5. Write man/window.xts.Rd.

I think the window code is pretty solid now and ready for you to review
when you get a chance. Also, in the process of working on this I noticed a
feature of subset that surprised me a bit. That is, returned subsets are
always forced to be in ascending order, not the order given by the user.

> x <- xts(1:1000, as.Date("2000-01-01")+1:1000)

> x[c(3,2,1),]  # Result is sorted
           [,1]
2000-01-02    1
2000-01-03    2
2000-01-04    3

> x[c(1,1,2,3),]  # Duplicates are allowed
           [,1]
2000-01-02    1
2000-01-02    1
2000-01-03    2
2000-01-04    3

Now, you are probably thinking "Of course, you idiot, the returned subset
must be an xts series and therefore the dates must be sorted.". That's
fair, but this was still a bit of a surprise to me since it differs from
normal subsetting. Anyway, I might not be the only idiot out there and I'm
wondering if we should add something explicit to the subset documentation
about this.

Thanks again. I ended up touching a bit more code than I planned on for
this feature, but I wanted a clean build and I think the changes make sense.

Corwin

@joshuaulrich joshuaulrich changed the base branch from develop to master Apr 20, 2018

@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich Apr 20, 2018

Owner

Ugh, I think I closed this because I was working on merging it... then got distracted and completely lost track of it. Reopening because these are useful contributions.

Owner

joshuaulrich commented Apr 20, 2018

Ugh, I think I closed this because I was working on merging it... then got distracted and completely lost track of it. Reopening because these are useful contributions.

@joshuaulrich joshuaulrich reopened this Apr 20, 2018

@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy Apr 20, 2018

Contributor
Contributor

corwinjoy commented Apr 20, 2018

joshuaulrich added a commit that referenced this pull request Apr 22, 2018

Add subset tests to prep for binsearch refactor
Lines 62-74, 94-104, and 114-116 had no test coverage, and even basic
coverage is better than nothing when refactoring. These tests pass on
the current code, so they should indicate any serious errors with
Corwin's binary search/subset refactor.

See #100.

joshuaulrich added a commit that referenced this pull request Apr 22, 2018

Merge branch 'corwinjoy/patch-1' (window.xts)
There were conflicts in basically every file, largely due to the time
between the last branch commit and the merge. The branch also contained
LF (Windows) line endings, plus the archival of the 'its' package, and
the xts PR that ensured all headers had "GPL 2" contributed to number
of conflicts.

I also reverted all the changes to unit tests that were made to stop
them from running. The unit tests were not being run at all when Corwin
started his patch, but are running successfully now.

Other than what's mentioned above, I merged the branch as-is, so this
merge commit will be followed by some cleanup commits.

Fixes #100.
@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich Apr 23, 2018

Owner

Bummer... looks like there's performance regression in at least one subsetting case. It looks like the bulk of it is from not having an integer binary search branch (i.e. the as.double() calls).

I will investigate and patch, unless anyone else wants to get to it quickly.

I would also appreciate any contributed benchmark cases (not just for subsetting)!

> cat xts/inst/benchmarks/benchmark.subset.R
x <- xts::.xts(1:2e7, 1.0*seq(86400*365*20, 86400*365*40, length.out = 2e7))
# warmup
for(i in 1:2) {
  x["1990/2000"]
}
# profile
Rprof()
for(i in 1:25) {
  x["1990/2000"]
}
Rprof(NULL)
(srp <- summaryRprof())

> # Using current master (1716103af1d3c7ede69e6abda986daf2bc74e90c)
> Rscript xts/inst/benchmarks/benchmark.subset.R
$by.self
           self.time self.pct total.time total.pct
".Call"         1.86    47.69       1.86     47.69
"c"             1.68    43.08       1.68     43.08
"seq.int"       0.34     8.72       0.34      8.72
"strptime"      0.02     0.51       0.02      0.51

$by.total
                total.time total.pct self.time self.pct
"["                   3.90    100.00      0.00     0.00
"[.xts"               3.90    100.00      0.00     0.00
".Call"               1.86     47.69      1.86    47.69
"c"                   1.68     43.08      1.68    43.08
"isOrdered"           0.76     19.49      0.00     0.00
"seq.int"             0.34      8.72      0.34     8.72
"strptime"            0.02      0.51      0.02     0.51
"<Anonymous>"         0.02      0.51      0.00     0.00
"as.POSIXct"          0.02      0.51      0.00     0.00
"as.POSIXlt"          0.02      0.51      0.00     0.00
"do.call"             0.02      0.51      0.00     0.00
"ISOdatetime"         0.02      0.51      0.00     0.00
".parseISO8601"       0.02      0.51      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 3.9

> # Using current window-xts HEAD (bfedeb00cc5c35e31fb7f71519c840a36fd85437)
> Rscript xts/inst/benchmarks/benchmark.subset.R
            self.time self.pct total.time total.pct
"as.double"      3.00    34.97       3.00     34.97
"[.xts"          2.90    33.80       8.58    100.00
".Call"          1.46    17.02       1.46     17.02
"c"              0.80     9.32       0.80      9.32
"seq.int"        0.38     4.43       0.38      4.43
"attr"           0.02     0.23       0.02      0.23
"gsub"           0.02     0.23       0.02      0.23

$by.total
                     total.time total.pct self.time self.pct
"[.xts"                    8.58    100.00      2.90    33.80
"["                        8.58    100.00      0.00     0.00
"index_bsearch"            3.38     39.39      0.00     0.00
"as.double"                3.00     34.97      3.00    34.97
"binsearch"                3.00     34.97      0.00     0.00
".Call"                    1.46     17.02      1.46    17.02
"c"                        0.80      9.32      0.80     9.32
"seq.int"                  0.38      4.43      0.38     4.43
"isOrdered"                0.38      4.43      0.00     0.00
".parseISO8601"            0.04      0.47      0.00     0.00
"attr"                     0.02      0.23      0.02     0.23
"gsub"                     0.02      0.23      0.02     0.23
"as.list"                  0.02      0.23      0.00     0.00
"as_numeric"               0.02      0.23      0.00     0.00
"as.POSIXct"               0.02      0.23      0.00     0.00
"as.POSIXct.POSIXlt"       0.02      0.23      0.00     0.00
"as.POSIXlt"               0.02      0.23      0.00     0.00
"do.call"                  0.02      0.23      0.00     0.00
"parse.side"               0.02      0.23      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 8.58
Owner

joshuaulrich commented Apr 23, 2018

Bummer... looks like there's performance regression in at least one subsetting case. It looks like the bulk of it is from not having an integer binary search branch (i.e. the as.double() calls).

I will investigate and patch, unless anyone else wants to get to it quickly.

I would also appreciate any contributed benchmark cases (not just for subsetting)!

> cat xts/inst/benchmarks/benchmark.subset.R
x <- xts::.xts(1:2e7, 1.0*seq(86400*365*20, 86400*365*40, length.out = 2e7))
# warmup
for(i in 1:2) {
  x["1990/2000"]
}
# profile
Rprof()
for(i in 1:25) {
  x["1990/2000"]
}
Rprof(NULL)
(srp <- summaryRprof())

> # Using current master (1716103af1d3c7ede69e6abda986daf2bc74e90c)
> Rscript xts/inst/benchmarks/benchmark.subset.R
$by.self
           self.time self.pct total.time total.pct
".Call"         1.86    47.69       1.86     47.69
"c"             1.68    43.08       1.68     43.08
"seq.int"       0.34     8.72       0.34      8.72
"strptime"      0.02     0.51       0.02      0.51

$by.total
                total.time total.pct self.time self.pct
"["                   3.90    100.00      0.00     0.00
"[.xts"               3.90    100.00      0.00     0.00
".Call"               1.86     47.69      1.86    47.69
"c"                   1.68     43.08      1.68    43.08
"isOrdered"           0.76     19.49      0.00     0.00
"seq.int"             0.34      8.72      0.34     8.72
"strptime"            0.02      0.51      0.02     0.51
"<Anonymous>"         0.02      0.51      0.00     0.00
"as.POSIXct"          0.02      0.51      0.00     0.00
"as.POSIXlt"          0.02      0.51      0.00     0.00
"do.call"             0.02      0.51      0.00     0.00
"ISOdatetime"         0.02      0.51      0.00     0.00
".parseISO8601"       0.02      0.51      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 3.9

> # Using current window-xts HEAD (bfedeb00cc5c35e31fb7f71519c840a36fd85437)
> Rscript xts/inst/benchmarks/benchmark.subset.R
            self.time self.pct total.time total.pct
"as.double"      3.00    34.97       3.00     34.97
"[.xts"          2.90    33.80       8.58    100.00
".Call"          1.46    17.02       1.46     17.02
"c"              0.80     9.32       0.80      9.32
"seq.int"        0.38     4.43       0.38      4.43
"attr"           0.02     0.23       0.02      0.23
"gsub"           0.02     0.23       0.02      0.23

$by.total
                     total.time total.pct self.time self.pct
"[.xts"                    8.58    100.00      2.90    33.80
"["                        8.58    100.00      0.00     0.00
"index_bsearch"            3.38     39.39      0.00     0.00
"as.double"                3.00     34.97      3.00    34.97
"binsearch"                3.00     34.97      0.00     0.00
".Call"                    1.46     17.02      1.46    17.02
"c"                        0.80      9.32      0.80     9.32
"seq.int"                  0.38      4.43      0.38     4.43
"isOrdered"                0.38      4.43      0.00     0.00
".parseISO8601"            0.04      0.47      0.00     0.00
"attr"                     0.02      0.23      0.02     0.23
"gsub"                     0.02      0.23      0.02     0.23
"as.list"                  0.02      0.23      0.00     0.00
"as_numeric"               0.02      0.23      0.00     0.00
"as.POSIXct"               0.02      0.23      0.00     0.00
"as.POSIXct.POSIXlt"       0.02      0.23      0.00     0.00
"as.POSIXlt"               0.02      0.23      0.00     0.00
"do.call"                  0.02      0.23      0.00     0.00
"parse.side"               0.02      0.23      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 8.58
@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy Apr 23, 2018

Contributor
Contributor

corwinjoy commented Apr 23, 2018

@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich Apr 24, 2018

Owner

Appears we were both wrong. Removing the as.double() call and adding an integer branch only took ~2s off the run time. The main culprit was always calling i <- i[i != 0] instead of only calling it if there was at least one zero in i.

Owner

joshuaulrich commented Apr 24, 2018

Appears we were both wrong. Removing the as.double() call and adding an integer branch only took ~2s off the run time. The main culprit was always calling i <- i[i != 0] instead of only calling it if there was at least one zero in i.

@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy Apr 24, 2018

Contributor
Contributor

corwinjoy commented Apr 24, 2018

joshuaulrich added a commit that referenced this pull request Apr 29, 2018

Add int branch to binsearch
The binsearch C function now branches on double and integer types, so
no conversion is done if 'key' and 'vec' are the same type. Cast both
to double if they're not the same type.

Create a struct that contains int and double fields, so we can pass
it to a generic compare function. Create 4 compare functions, one for
each type/start combination. All these components are static inline,
with the hope the compiler can optimize them.

Removing the as.double() call means ISO-8601 range-based subsetting
is nearly twice as fast for integer vectors and keys.

See #100.

joshuaulrich added a commit that referenced this pull request Apr 29, 2018

Add int branch to binsearch
The binsearch C function now branches on double and integer types, so
no conversion is done if 'key' and 'vec' are the same type. Cast both
to double if they're not the same type.

Remove the bound() function and call lower_bound() and upper_bound()
directly from binsearch(). Also move the edge case checks into the
respective *_bound() functions, since we need to do the checks inside
the switch() statement.

Removing the as.double() call means ISO-8601 range-based subsetting
is nearly twice as fast for integer vectors and keys.

See #100.

joshuaulrich added a commit that referenced this pull request Apr 29, 2018

Restore zero-index check in [.xts
Commit 988d513 removed this check.
It isn't strictly necessary, but it's considerably faster than always
trying to remove 'i' equal to zero at any location in 'i'.

Also add benchmark for subsetting by ISO8601 range. The benchmark
results before this commit are:

  Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
   x[rng, ] 371.8408 384.7454 409.4091 399.3384 426.9196 486.1258    10

After restoring the check, the benchmark results are:

  Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
   x[rng, ] 280.5653 287.1048 304.1365 303.6579 316.7202 333.2147    10

See #100.

joshuaulrich added a commit that referenced this pull request Apr 30, 2018

Add int branch to binsearch
The binsearch C function now branches on double and integer types, so
no conversion is done if 'key' and 'vec' are the same type. Cast both
to double if they're not the same type.

Create a struct that contains int and double fields, so we can pass
it to a generic compare function. Create 4 compare functions, one for
each type/start combination. All these components are static inline,
with the hope the compiler can optimize them.

Removing the as.double() call means ISO-8601 range-based subsetting
is nearly twice as fast for integer vectors and keys.

See #100.
@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich Apr 30, 2018

Owner

Progress update. Restoring the zero-search before trying to remove i equal to zero gets the run time down to ~2.5 seconds.

> # from commit 7463da7a636e87f0958add9e64ef27f6c929d077
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
            self.time self.pct total.time total.pct
"as.double"      1.42    56.35       1.42     56.35
".Call"          0.60    23.81       0.60     23.81
"c"              0.34    13.49       0.34     13.49
"seq.int"        0.16     6.35       0.16      6.35

$by.total
                total.time total.pct self.time self.pct
"["                   2.52    100.00      0.00     0.00
"[.xts"               2.52    100.00      0.00     0.00
"as.double"           1.42     56.35      1.42    56.35
"binsearch"           1.42     56.35      0.00     0.00
"index_bsearch"       1.30     51.59      0.00     0.00
".Call"               0.60     23.81      0.60    23.81
"c"                   0.34     13.49      0.34    13.49
"seq.int"             0.16      6.35      0.16     6.35
"isOrdered"           0.14      5.56      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.52

Adding the integer branch gets it down to ~1s.

> # from branch 'one' or 'two' (results are similar)
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
             self.time self.pct total.time total.pct
".Call"           0.58    53.70       0.58     53.70
"c"               0.34    31.48       0.34     31.48
"seq.int"         0.14    12.96       0.14     12.96
"as.numeric"      0.02     1.85       0.02      1.85

$by.total
                total.time total.pct self.time self.pct
"["                   1.08    100.00      0.00     0.00
"[.xts"               1.08    100.00      0.00     0.00
".Call"               0.58     53.70      0.58    53.70
"c"                   0.34     31.48      0.34    31.48
"isOrdered"           0.16     14.81      0.00     0.00
"seq.int"             0.14     12.96      0.14    12.96
"index_bsearch"       0.14     12.96      0.00     0.00
"as.numeric"          0.02      1.85      0.02     1.85
"as.list"             0.02      1.85      0.00     0.00
"as_numeric"          0.02      1.85      0.00     0.00
"as.POSIXlt"          0.02      1.85      0.00     0.00
"do.call"             0.02      1.85      0.00     0.00
".parseISO8601"       0.02      1.85      0.00     0.00
"parse.side"          0.02      1.85      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.08

I have implemented the integer branch two different ways. One took @corwinjoy's code and essentially moved stuff around so the integer/double logic could live in one place. The other is a much heavier refactor, but I wanted to try and reduce the code duplication inside the switch() statements and while() loops.

I would appreciate any reviews, thoughts, comments, etc.

Owner

joshuaulrich commented Apr 30, 2018

Progress update. Restoring the zero-search before trying to remove i equal to zero gets the run time down to ~2.5 seconds.

> # from commit 7463da7a636e87f0958add9e64ef27f6c929d077
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
            self.time self.pct total.time total.pct
"as.double"      1.42    56.35       1.42     56.35
".Call"          0.60    23.81       0.60     23.81
"c"              0.34    13.49       0.34     13.49
"seq.int"        0.16     6.35       0.16      6.35

$by.total
                total.time total.pct self.time self.pct
"["                   2.52    100.00      0.00     0.00
"[.xts"               2.52    100.00      0.00     0.00
"as.double"           1.42     56.35      1.42    56.35
"binsearch"           1.42     56.35      0.00     0.00
"index_bsearch"       1.30     51.59      0.00     0.00
".Call"               0.60     23.81      0.60    23.81
"c"                   0.34     13.49      0.34    13.49
"seq.int"             0.16      6.35      0.16     6.35
"isOrdered"           0.14      5.56      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.52

Adding the integer branch gets it down to ~1s.

> # from branch 'one' or 'two' (results are similar)
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
             self.time self.pct total.time total.pct
".Call"           0.58    53.70       0.58     53.70
"c"               0.34    31.48       0.34     31.48
"seq.int"         0.14    12.96       0.14     12.96
"as.numeric"      0.02     1.85       0.02      1.85

$by.total
                total.time total.pct self.time self.pct
"["                   1.08    100.00      0.00     0.00
"[.xts"               1.08    100.00      0.00     0.00
".Call"               0.58     53.70      0.58    53.70
"c"                   0.34     31.48      0.34    31.48
"isOrdered"           0.16     14.81      0.00     0.00
"seq.int"             0.14     12.96      0.14    12.96
"index_bsearch"       0.14     12.96      0.00     0.00
"as.numeric"          0.02      1.85      0.02     1.85
"as.list"             0.02      1.85      0.00     0.00
"as_numeric"          0.02      1.85      0.00     0.00
"as.POSIXlt"          0.02      1.85      0.00     0.00
"do.call"             0.02      1.85      0.00     0.00
".parseISO8601"       0.02      1.85      0.00     0.00
"parse.side"          0.02      1.85      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.08

I have implemented the integer branch two different ways. One took @corwinjoy's code and essentially moved stuff around so the integer/double logic could live in one place. The other is a much heavier refactor, but I wanted to try and reduce the code duplication inside the switch() statements and while() loops.

I would appreciate any reviews, thoughts, comments, etc.

@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy Apr 30, 2018

Contributor
Contributor

corwinjoy commented Apr 30, 2018

@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich May 1, 2018

Owner

Thanks for the feedback! I agree that section was quite terse and not particularly easy to follow. I've incorporated your changes and comments. I also forgot to include a bunch of specific binsearch unit tests that I wrote. I'm going to amend the commit and add them, so here's the diff for posterity:

diff --git a/src/binsearch.c b/src/binsearch.c
index 67c0632..cb3b2eb 100644
--- a/src/binsearch.c
+++ b/src/binsearch.c
@@ -84,23 +91,35 @@ SEXP binsearch(SEXP key, SEXP vec, SEXP start)
     }
   }

-  /* handle edge cases where item may be at the lo/hi end of array */
+  /* 'lo' contains the smallest index where cmp_func() is true, but we need
+   * to handle edge cases where 'lo' is at the max/min end of the vector.
+   */
   if (use_start) {
-    /* cmp_func is: vec[lo] >= k */
-    if (cmp_func(data, lo)) {
-      lo++;
-    } else {
-      lo = NA_INTEGER;
+    /* cmp_func() := vector[index] >= key when start == true, and we need
+     * to return the smallest index subject to vector[index] >= key.
+     */
+    if (!cmp_func(data, length(vec)-1)) {
+      /* entire vector < key */
+      return ScalarInteger(NA_INTEGER);
     }
   } else {
-    if (lo > 0 && cmp_func(data, lo)) lo--;
-    /* cmp_func is: vec[lo] > k */
+    /* cmp_func() := vector[index] > key when start == false, and we need
+     * to return the largest index subject to vector[index] <= key.
+     */
     if (cmp_func(data, lo)) {
-      lo = NA_INTEGER;
-    } else {
-      lo++;
+      /* previous index value must satisfy vector[index] <= key, unless
+       * current index value is zero.
+       */
+      lo--;
+      if (lo < 0) {
+        /* entire vector > key */
+        return ScalarInteger(NA_INTEGER);
+      }
     }
   }

+  /* Convert from 0-based index to 1-based index */
+  lo++;
+
   return ScalarInteger(lo);
 }
Owner

joshuaulrich commented May 1, 2018

Thanks for the feedback! I agree that section was quite terse and not particularly easy to follow. I've incorporated your changes and comments. I also forgot to include a bunch of specific binsearch unit tests that I wrote. I'm going to amend the commit and add them, so here's the diff for posterity:

diff --git a/src/binsearch.c b/src/binsearch.c
index 67c0632..cb3b2eb 100644
--- a/src/binsearch.c
+++ b/src/binsearch.c
@@ -84,23 +91,35 @@ SEXP binsearch(SEXP key, SEXP vec, SEXP start)
     }
   }

-  /* handle edge cases where item may be at the lo/hi end of array */
+  /* 'lo' contains the smallest index where cmp_func() is true, but we need
+   * to handle edge cases where 'lo' is at the max/min end of the vector.
+   */
   if (use_start) {
-    /* cmp_func is: vec[lo] >= k */
-    if (cmp_func(data, lo)) {
-      lo++;
-    } else {
-      lo = NA_INTEGER;
+    /* cmp_func() := vector[index] >= key when start == true, and we need
+     * to return the smallest index subject to vector[index] >= key.
+     */
+    if (!cmp_func(data, length(vec)-1)) {
+      /* entire vector < key */
+      return ScalarInteger(NA_INTEGER);
     }
   } else {
-    if (lo > 0 && cmp_func(data, lo)) lo--;
-    /* cmp_func is: vec[lo] > k */
+    /* cmp_func() := vector[index] > key when start == false, and we need
+     * to return the largest index subject to vector[index] <= key.
+     */
     if (cmp_func(data, lo)) {
-      lo = NA_INTEGER;
-    } else {
-      lo++;
+      /* previous index value must satisfy vector[index] <= key, unless
+       * current index value is zero.
+       */
+      lo--;
+      if (lo < 0) {
+        /* entire vector > key */
+        return ScalarInteger(NA_INTEGER);
+      }
     }
   }

+  /* Convert from 0-based index to 1-based index */
+  lo++;
+
   return ScalarInteger(lo);
 }
@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy May 1, 2018

Contributor
Contributor

corwinjoy commented May 1, 2018

joshuaulrich added a commit that referenced this pull request May 1, 2018

Add integer branch to binsearch
The binsearch C function now branches on double and integer types, so
no conversion is done if 'key' and 'vec' are the same type. Cast both
to double if they're not the same type.

Create a struct that contains int and double fields, so we can pass
it to a generic compare function. Create 4 compare functions, one for
each type/start combination. All these components are static inline,
with the hope the compiler can optimize them.

Removing the as.double() call means ISO-8601 range-based subsetting
is nearly twice as fast for integer vectors and keys. And the integer
branch is faster still.

See #100.
@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich May 1, 2018

Owner

I waffled back and forth between those two alternatives. I think the former is clearer for that specific case, but the latter is more general. I couldn't think of a test that would fail the former and pass the latter. I may end up changing it back anyway.

Speaking of tests, quite a few of the binsearch unit tests are commented because the new function doesn't do what the old function did in various edge cases. I'm more inclined to the new function's results, because the old results seem unusual and/or wrong. Any feedback you have there would be phenomenal, as would be any tests you add.

There is no need to hope that your changes have improved things. I've already seen at least one case (ISO8601 subsetting) where this has improved performance. And your thoughts and review are very valuable! It's extremely hard to write good code without input from others (at least for me).

Owner

joshuaulrich commented May 1, 2018

I waffled back and forth between those two alternatives. I think the former is clearer for that specific case, but the latter is more general. I couldn't think of a test that would fail the former and pass the latter. I may end up changing it back anyway.

Speaking of tests, quite a few of the binsearch unit tests are commented because the new function doesn't do what the old function did in various edge cases. I'm more inclined to the new function's results, because the old results seem unusual and/or wrong. Any feedback you have there would be phenomenal, as would be any tests you add.

There is no need to hope that your changes have improved things. I've already seen at least one case (ISO8601 subsetting) where this has improved performance. And your thoughts and review are very valuable! It's extremely hard to write good code without input from others (at least for me).

@corwinjoy

This comment has been minimized.

Show comment
Hide comment
@corwinjoy

corwinjoy May 1, 2018

Contributor
Contributor

corwinjoy commented May 1, 2018

@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich May 1, 2018

Owner

The latest is in the window-xts branch, commit 6ef8014. The tests are in inst/unitTests/runit.binsearch.R.

All the cases where start = NULL are commented, since the new function doesn't support that value.
Best I can understand, start = NULL was used to force binsearch() to return NA if an exact match couldn't be found.

Some cases where the key is outside the vector are also commented, depending on the value of start. The original binsearch() would return an index value outside the bounds of the search vector, which seems very undesirable. I think returning NA makes more sense.

There are several tests with FIXME / TODO because it isn't clear what the correct / best behavior is (e.g. what should happen if either the key or vector are zero-length?).

One thing to keep in mind is that binsearch() is unexported and is likely to remain that way, so we only need to worry about our internal use-cases.

Owner

joshuaulrich commented May 1, 2018

The latest is in the window-xts branch, commit 6ef8014. The tests are in inst/unitTests/runit.binsearch.R.

All the cases where start = NULL are commented, since the new function doesn't support that value.
Best I can understand, start = NULL was used to force binsearch() to return NA if an exact match couldn't be found.

Some cases where the key is outside the vector are also commented, depending on the value of start. The original binsearch() would return an index value outside the bounds of the search vector, which seems very undesirable. I think returning NA makes more sense.

There are several tests with FIXME / TODO because it isn't clear what the correct / best behavior is (e.g. what should happen if either the key or vector are zero-length?).

One thing to keep in mind is that binsearch() is unexported and is likely to remain that way, so we only need to worry about our internal use-cases.

joshuaulrich added a commit to corwinjoy/xts that referenced this pull request May 3, 2018

squash! Fix up edge cases in binsearch unit tests
Fix up edge cases in binsearch unit tests

Remove the start = NULL test cases, since it will no longer be
supported. Always return NA if key is NA or zero-length. Fix compiler
warning in binsearch.c ('static' is unnecessary as the struct is
already local to the file).

See #100.

joshuaulrich added a commit to corwinjoy/xts that referenced this pull request May 3, 2018

Fix up edge cases in binsearch unit tests
Remove the start = NULL test cases, since it will no longer be
supported. Always return NA if key is NA or zero-length. Fix compiler
warning in binsearch.c ('static' is unnecessary as the struct is
already local to the file).

See #100.
@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich May 3, 2018

Owner

Thanks for the fixes in #240. Now that it's merged, I'm going to run reverse dependency checks to ensure nothing downstream is broken. I don't usually do this until preparing for a CRAN release, but these changes are fairly significant. I'd prefer to address any issues now, while the code is still fresh in my mind. Assuming those reverse dependency checks go well, I think this is ready to merge.

Owner

joshuaulrich commented May 3, 2018

Thanks for the fixes in #240. Now that it's merged, I'm going to run reverse dependency checks to ensure nothing downstream is broken. I don't usually do this until preparing for a CRAN release, but these changes are fairly significant. I'd prefer to address any issues now, while the code is still fresh in my mind. Assuming those reverse dependency checks go well, I think this is ready to merge.

@joshuaulrich joshuaulrich merged commit 508d679 into joshuaulrich:master May 3, 2018

@joshuaulrich

This comment has been minimized.

Show comment
Hide comment
@joshuaulrich

joshuaulrich May 3, 2018

Owner

And merged! You win the award for most patient PR ever. ;-)

Owner

joshuaulrich commented May 3, 2018

And merged! You win the award for most patient PR ever. ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment