Multi-interval set operators #179

Merged 96 commits on May 31, 2022

Conversation

haberdashPI
Contributor

@haberdashPI haberdashPI commented Oct 21, 2021

This defines a set of functions for easily computing intersections, unions, etc. of a set of intervals (passed as an array).

cc: @omus, @ericphanson, @kleinschmidt

This is based off of a closed PR to TimeSpans (beacon-biosignals/TimeSpans.jl#11)

@haberdashPI haberdashPI changed the title wip: initial implementation wip: initial set implementations Oct 22, 2021
@haberdashPI haberdashPI marked this pull request as ready for review October 22, 2021 18:14
@haberdashPI haberdashPI changed the title wip: initial set implementations Multi-interval set operators Oct 22, 2021
@omus
Collaborator

omus commented Nov 9, 2021

Closing/re-opening to trigger CI

@omus omus closed this Nov 9, 2021
@omus omus reopened this Nov 9, 2021
Collaborator

@omus omus left a comment

I didn't quite have time for a full review. I do suspect the half-open concept doesn't need to be a special type and could just be a simple ishalfopen function call. I'll try to dive deep into this tomorrow. I'll mention that the tests for this package are pretty extensive, so be sure to lean on them.

src/endpoint.jl Outdated
struct DirectionBound{T} end
const LeftClosed = DirectionBound{:LeftClosed}()
const RightClosed = DirectionBound{:RightClosed}()
struct HalfOpenEndpoint{T, B} <: AbstractEndpoint{T}
Collaborator

Intervals are half-open or closed, not their endpoints. It seems to me that what you may want is a HalfOpenInterval type. I'll need to take a closer look at how this is used, though.

Contributor Author

@haberdashPI haberdashPI Nov 10, 2021

The goal is as follows: during the process of "unbunching" intervals into endpoints, and then "bunching" them back into intervals, I want to preserve the fact that the original array has all left-closed or all right-closed intervals. That's because:

1.) said form leads to type stable arrays
2.) said form can ignore the checks for edge types in mergesets (hence the track_endpoint flag).
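
The "unbunch, then bunch" idea can be illustrated with a minimal Base-only sketch. It is not the code from this PR: the names `unbunch` and `bunch` are hypothetical, intervals are plain `(first, last)` tuples, and every interval is assumed to be left-closed and right-open, which is the case where no per-endpoint closedness needs to be tracked.

```julia
# Hypothetical sketch: intervals are (lo, hi) tuples, all assumed to be
# left-closed and right-open, so endpoint closedness is never inspected.

# "Unbunch": turn each interval into a start event (+1) and a stop event (-1).
function unbunch(intervals)
    events = Tuple{Int,Int}[]
    for (lo, hi) in intervals
        push!(events, (lo, +1))
        push!(events, (hi, -1))
    end
    # Sort by position; at ties, starts come before stops so that touching
    # intervals such as [0, 1) and [1, 2) end up merged.
    return sort!(events; by = e -> (e[1], -e[2]))
end

# "Bunch": sweep the events and emit an interval each time the nesting
# depth returns to zero.
function bunch(events)
    result = Tuple{Int,Int}[]
    depth = 0
    start = 0
    for (pos, delta) in events
        if depth == 0 && delta == +1
            start = pos
        end
        depth += delta
        if depth == 0
            push!(result, (start, pos))
        end
    end
    return result
end

merged = bunch(unbunch([(0, 1), (1, 2), (5, 7), (6, 9)]))
```

Because every element of `events` and `result` has the same concrete type, the arrays stay type stable, which is the first of the two motivations listed above.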

Maybe there's a better name for this. I did struggle to find a way to describe what it is.

I guess another approach could be to have a flag and/or special container that denotes the "all closed in one direction" property, and use the existing Endpoint type in all cases. It's not obvious to me, at the moment, how to write that in a type stable way though.

Collaborator

The problem with the design as I see it is that endpoints are dealing with the left/right side of the interval without knowing anything about the other endpoint in the interval. I'll be able to speak better to alternatives once I read through the rest of the code.

Collaborator

It's not obvious to me, at the moment, how to write that in a type stable way though.

I'd suggest putting code readability over type stability to start with as otherwise you can optimize yourself into a corner

Contributor Author

The problem with the design as I see it is that endpoints are dealing with the left/right side of the interval without knowing anything about the other endpoint in the interval

My idea was to create a subtype of endpoint that is only used in those cases where you do know something about the other endpoints. I do also feel that there is probably a better design, coming back to this now (after some time away). I'll think it over some more and wait for whatever further comments you have.

src/interval.jl Outdated
Comment on lines 532 to 534
#=@show=# t = first_is_less(x, y) ? first(x) : first(y)
#=@show=# x_isless = first_is_less(x, y)
#=@show=# x_equal = first_is_equal(x, y)
Collaborator

Why not just call iterate or Iterators.peel and then do these checks?

Contributor Author

Sometimes one of the iterables is empty; I find this form more readable than introducing the branches to check for nothing.

Contributor Author

I guess the other point to make here is that we only want to peel the values when they are "first" (i.e. less than or equal to the value in the other list), so you don't want to peel before that.
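
A rough Base-only sketch of this "peek first, pop only from the side that goes first" pattern follows. The helper `takes_from_x` and the function `merge_sorted` are hypothetical names for illustration, not the helpers used in the PR; the point is only that an empty input is handled inside the comparison rather than by branching on `nothing`.

```julia
# Hypothetical helper: true when the head of x should be consumed next.
# An empty x never goes first; a non-empty x always beats an empty y.
takes_from_x(x, y) = !isempty(x) && (isempty(y) || first(x) <= first(y))

function merge_sorted(x::Vector{Int}, y::Vector{Int})
    x, y = copy(x), copy(y)   # avoid mutating the caller's vectors
    out = Int[]
    while !isempty(x) || !isempty(y)
        # Peek at both heads; pop only from the side that is "first".
        if takes_from_x(x, y)
            push!(out, popfirst!(x))
        else
            push!(out, popfirst!(y))
        end
    end
    return out
end

merged_values = merge_sorted([1, 4, 6], [2, 3, 7])
```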

@haberdashPI
Contributor Author

I'll try to dive deep into this tomorrow. I'll mention that the tests for this package are pretty extensive, so be sure to lean on them.

I'm still working out a few bits with the larger unit test (outside the tests for sets). Might make sense to wait to review until that happens. I'll mark it as draft again and mark it ready for review when that's resolved.

@haberdashPI haberdashPI marked this pull request as draft November 10, 2021 13:59
@omus
Collaborator

omus commented Nov 10, 2021

I'm still working out a few bits with the larger unit test (outside the tests for sets). Might make sense to wait to review until that happens. I'll mark it as draft again and mark it ready for review when that's resolved.

Thanks for letting me know. I'll try to keep an eye on this as I think some early course correction could save a bunch of work.

@codecov

codecov bot commented Nov 12, 2021

Codecov Report

Merging #179 (525d3ae) into master (bb193ee) will increase coverage by 2.90%.
The diff coverage is 95.76%.

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
+ Coverage   81.73%   84.63%   +2.90%     
==========================================
  Files          11       12       +1     
  Lines         624      794     +170     
==========================================
+ Hits          510      672     +162     
- Misses        114      122       +8     
Impacted Files Coverage Δ
src/Intervals.jl 100.00% <ø> (ø)
src/endpoint.jl 98.11% <ø> (ø)
src/interval.jl 96.01% <ø> (-0.35%) ⬇️
src/interval_sets.jl 95.76% <95.76%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@haberdashPI haberdashPI marked this pull request as ready for review November 12, 2021 15:59
@haberdashPI
Contributor Author

Looks like there are a few bits I need to add test coverage for, but it should be in good shape to review.

@omus omus mentioned this pull request Nov 15, 2021
@haberdashPI
Contributor Author

haberdashPI commented Dec 15, 2021

@omus: Can you help me understand the remaining errors I'm getting in "Julia 1" (I think that is 1.7). It looks like there's something broken about printing that is unrelated to the changes I've made.

In other news, I think this is otherwise ready for review.

@omus
Collaborator

omus commented Dec 15, 2021

@omus: Can you help me understand the remaining errors I'm getting in "Julia 1" (I think that is 1.7). It looks like there's something broken about printing that is unrelated to the changes I've made.

There was an unrelated CI failure for "Julia 1" so I restarted the CI jobs

In other news, I think this is otherwise ready for review.

Thanks for letting me know. I don't quite have time to review such a big PR this week but I'll make time next week.

@haberdashPI
Contributor Author

I don't quite have time to review such a big PR this week but I'll make time next week.

Thanks! I understand, it's a bit large. I don't think there's any great urgency on my end; I'm making use of the branch directly where I need it in some analyses

@haberdashPI
Contributor Author

Just a small bump here: just wanted to check in about when a review would be possible and/or if I should do something to make reviewing this easier (break it up?).

@omus
Collaborator

omus commented Jan 13, 2022

@haberdashPI breaking up a PR would make it easier to review but it may not be necessary. I've been rather swamped lately but I did have this on the agenda for next week. Sorry about the delay on review

Collaborator

@omus omus left a comment

Haven't quite finished reviewing yet but I can already see this code needs additional iteration. In its current state it'll be rather difficult to maintain. I'll do more review soon.

@haberdashPI haberdashPI marked this pull request as draft February 8, 2022 16:30
@haberdashPI haberdashPI marked this pull request as ready for review February 15, 2022 20:57
@haberdashPI
Contributor Author

haberdashPI commented May 23, 2022

Alright! I think that addresses the outstanding issues.

The one open question is whether you buy my arguments above @omus for switching some of the tests to use issetequal

@haberdashPI haberdashPI requested a review from omus May 23, 2022 19:19
@omus
Collaborator

omus commented May 24, 2022

However, your tests assume that union([0,1), [2, 3]) returns [0, 3]

Which tests assume this? This would definitely be incorrect. The code at the moment should only combine these intervals if you do a union of [0, 2) and [2, 3]. In your particular example 1 is not included, so these wouldn't be combined. Even in the case where we union [0, 1] and [2, 3], these intervals would not be merged into one: although no integer exists between 1 and 2, detecting this for various types (e.g. Float64) is challenging, so we treat them as disjoint.

I think this test on line 594 of comparisons.jl is not quite right... There are a number of similar issues in subsequent lines. I've changed these to always use closed bounds on both sides, where appropriate.

You are definitely correct that I made a mistake in this case. I'll review the other corrections.
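
The merging rule described in this comment can be sketched with a toy type. Everything here is hypothetical and Base-only (`Span` and `can_merge` are not part of Intervals.jl); the sketch assumes the first span starts no later than the second, and merges two spans only when they overlap or share an endpoint that at least one of them includes.

```julia
# Hypothetical toy type for illustration; not the Intervals.jl Interval type.
struct Span
    lo::Float64
    hi::Float64
    lo_closed::Bool
    hi_closed::Bool
end

# Assumes a.lo <= b.lo. Spans merge if they overlap, or if they meet at a
# shared endpoint that at least one side includes. A gap is never bridged,
# even if no integer lies inside it.
function can_merge(a::Span, b::Span)
    a.hi > b.lo && return true
    return a.hi == b.lo && (a.hi_closed || b.lo_closed)
end

touching   = can_merge(Span(0, 2, true, false), Span(2, 3, true, true))  # [0, 2) and [2, 3]
open_gap   = can_merge(Span(0, 1, true, false), Span(2, 3, true, true))  # [0, 1) and [2, 3]
closed_gap = can_merge(Span(0, 1, true, true),  Span(2, 3, true, true))  # [0, 1] and [2, 3]
```

Only the first pair is mergeable under this rule; the other two are treated as disjoint, matching the behaviour described for `union` above.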

@haberdashPI
Contributor Author

However, your tests assume that union([0,1), [2, 3]) returns [0, 3]

Which tests assume this? This would definitely be incorrect

My bad: I meant something different

However, your tests assume that union([0,1), [1, 2]) returns [0, 2], while symdiff([0,1), [1, 2]) returns {[0, 1), [1,2]}, even though these two results are setequal. Arguably the right answer in both cases is [0, 2]

@omus
Collaborator

omus commented May 26, 2022

However, your tests assume that union([0,1), [1, 2]) returns [0, 2], while symdiff([0,1), [1, 2]) returns {[0, 1), [1,2]}, even though these two results are setequal. Arguably the right answer in both cases is [0, 2]

Thanks for the clarification. I agree with you that both answers are valid but I think for our implementation we should ideally return an interval set where all intervals in the returned vector are disjoint.

In the future I can see us making an IntervalSet collection which works like Set but makes use of the special properties intervals provide. For this collection I'd want it to only include disjoint intervals.

As the current state of this PR returns the ideal interval set I'm inclined to update the comparison tests to use == rather than issetequal as this implicitly captures the length aspect of the tests.
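
As a small illustration of why `==` is the stricter check, here is how the two comparisons differ in Base Julia on plain integer vectors. This deliberately uses integers rather than intervals, so it shows only the generic Base semantics and says nothing about any interval-aware `issetequal` method from this PR.

```julia
a = [1, 2, 2]
b = [2, 1]

# issetequal ignores order and duplicates, so it cannot detect that the two
# vectors differ in length.
same_elements = issetequal(a, b)

# == compares element by element, so length and order both matter.
same_vector = a == b
```

Here `same_elements` is `true` while `same_vector` is `false`, which is the "length aspect" that a test written with `==` captures implicitly.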

Collaborator

@omus omus left a comment

Last changes

@haberdashPI haberdashPI requested a review from omus May 26, 2022 20:26
@haberdashPI
Contributor Author

Alright, placed some unions on the appropriate expected_xor's and that seems to have cleared up the remaining failed tests.

Collaborator

@omus omus left a comment

Thanks @haberdashPI for sticking with this PR. It definitely took some time and thought to get here but I think we're now in a good place with the functional interface. I do think there are some internal improvements to be made here yet but now we're in a good place to make those changes without having to worry about deprecations.

@omus
Collaborator

omus commented May 27, 2022

I'll wait until Monday to squash and merge this PR. This should allow anyone who was waiting for the dust to settle on this PR the chance to do a final review before we merge this and release version 1.7.

@haberdashPI
Contributor Author

I do think there are some internal improvements to be made here yet

Agreed! I have had a few thoughts in that respect, that I've been holding back on until this is merged.

@omus omus merged commit d9923d8 into invenia:master May 31, 2022
@omus
Collaborator

omus commented May 31, 2022

Registration PR: JuliaRegistries/General#61415

@rofinn
Member

rofinn commented May 31, 2022

NOTE: This release actively breaks our code base, which relies heavily on the current union and intersect behaviour. Since this PR intentionally changes the behaviour of those methods operating on existing types (i.e. vector of intervals), I'm rolling this change back and it should be made as a breaking change. @omus If you're okay with that I'll plan to just yank this release from the registry while we try to move these changes into a 2.0 release.

@rofinn
Member

rofinn commented Jun 1, 2022

To clarify the specific change that's breaking is that you could previously perform a normal Base.intersect over a vector of intervals, like so.

julia> intersect([1..2, 2..3, 3..4, 4..5], [2..3, 3..4])
2-element Vector{Interval{Int64, Closed, Closed}}:
 Interval{Int64, Closed, Closed}(2, 3)
 Interval{Int64, Closed, Closed}(3, 4)

and this would behave as expected

Now when you make the same call you get

julia> intersect([1..2, 2..3, 3..4, 4..5], [2..3, 3..4])
1-element Vector{Interval{Int64, Closed, Closed}}:
 Interval{Int64, Closed, Closed}(2, 4)

To clarify, the latter is probably more correct, but since we don't distinguish an IntervalSet from a Vector{<:Interval} there is no way for us to distinguish this ambiguous case. This becomes particularly problematic when you start dealing with vectors of HourEnding intervals which have very different behaviour depending on whether we're using a regular date or interval type.

Assuming we don't want to dedicate a type for it, then I think we should make these separate functions, Base.intersect and Intervals.intersect. This allows the end user to decide what behaviour they want.

@haberdashPI
Contributor Author

haberdashPI commented Jun 1, 2022

Hi @rofinn,

First off, sounds like this was pretty disruptive; so sorry for that. A few thoughts on how to move forward:

but since we don't distinguish an IntervalSet from a Vector{<:Interval} there is no way for us to distinguish this ambiguous case

I think this is not quite right. You can pass the objects as Sets to get the old behavior: e.g.

julia> intersect(Set([1..2, 2..3, 3..4, 4..5]), Set([2..3, 3..4]))
Set{Interval{Int64, Closed, Closed}} with 2 elements:
  Interval{Int64, Closed, Closed}(2, 3)
  Interval{Int64, Closed, Closed}(3, 4)

So perhaps, in concept, what is needed here is a deprecation like this?

@deprecate intersect(a::AbstractArray, b::AbstractArray) intersect(Set(a), Set(b))

I guess a potential issue there would be that there might be a performance hit for first converting to Set objects?

In the meantime, perhaps the new behavior can be made available by creating a simple version of the IntervalSet object and dispatching Base.intersect (and friends) over that.
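
A minimal sketch of that idea, using only Base Julia, might look like the following. The wrapper `ToyIntervalSet` and the helper `merge_closed` are hypothetical names, and intervals are modelled as closed `(lo, hi)` integer tuples rather than Intervals.jl types. The assumption is simply that a dedicated wrapper type opts into the point-set behaviour, while plain vectors keep Base's element-wise `intersect`.

```julia
# Hypothetical wrapper type; closed intervals are modelled as (lo, hi) tuples.
struct ToyIntervalSet
    items::Vector{Tuple{Int,Int}}
end

# Merge closed intervals that overlap or share an endpoint.
function merge_closed(items)
    out = Tuple{Int,Int}[]
    for (lo, hi) in sort(items)
        if !isempty(out) && lo <= out[end][2]
            out[end] = (out[end][1], max(out[end][2], hi))
        else
            push!(out, (lo, hi))
        end
    end
    return out
end

# Point-set intersection, defined only for the wrapper type so that
# intersect on plain vectors is left untouched.
function Base.intersect(a::ToyIntervalSet, b::ToyIntervalSet)
    pieces = Tuple{Int,Int}[]
    for (alo, ahi) in a.items, (blo, bhi) in b.items
        lo, hi = max(alo, blo), min(ahi, bhi)
        lo <= hi && push!(pieces, (lo, hi))
    end
    return ToyIntervalSet(merge_closed(pieces))
end

xs = [(1, 2), (2, 3), (3, 4), (4, 5)]
ys = [(2, 3), (3, 4)]
elementwise = intersect(xs, ys)                                      # Base behaviour on vectors
pointwise   = intersect(ToyIntervalSet(xs), ToyIntervalSet(ys)).items
```

With the inputs from the example earlier in this thread, `elementwise` keeps the two matching elements while `pointwise` collapses to the single span from 2 to 4, so both behaviours remain available without changing what `intersect` means for a plain vector.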

@rofinn
Member

rofinn commented Jun 1, 2022

Yeah, I think at minimum we should either define a minimal IntervalSet or just use the package namespace to distinguish the behaviour. We can always talk about adding a deprecation in a future minor release, but would need to leave making this the default behaviour with the base function until a major release. I'll revert this MR and tag to minimize impact to the ecosystem. I'll then start a new MR with your changes where we can flesh out how to make it non-breaking, but still usable for your use cases.

rofinn added a commit that referenced this pull request Jun 1, 2022
@haberdashPI
Contributor Author

haberdashPI commented Jun 1, 2022

Also, mostly as a note to myself: would be good to add some tests in here for your use case as well.

@omus
Collaborator

omus commented Jun 1, 2022

@omus If you're okay with that I'll plan to just yank this release from the registry while we try to move these changes into a 2.0 release.

Very reasonable. I apologize for missing this breaking behaviour change. Sounds like we have some additional tests that need to be added to the Intervals.jl tests to ensure we don't break this expected behaviour again. Is there a better way to communicate when large changes like this are going to be released? Seems like the typical people-who-care-are-watching isn't working

@omus
Collaborator

omus commented Jun 1, 2022

Yeah, I think at minimum we should either define a minimal IntervalSet or just use the package namespace to distinguish the behaviour.

I'm in favour of making a minimal IntervalSet type

@rofinn
Member

rofinn commented Jun 1, 2022

would be good to add some tests in here for your use case as well

Yeah, this is a bit of a weird edge case where we'd need to test for a behaviour that wasn't defined here. I've become rather skittish of extending functions if there's concern that the new method doesn't align with the original concept/definition in Base.

Seems like the typical people-who-care-are-watching isn't working

Yeah, I don't think anyone was ever assigned to keep an eye on this repo after you left. I guess for now it'd probably be best to just assign myself or @fchorney to larger reviews, as we're the next biggest contributors. I'll raise this maintenance concern more broadly to see if there's a more sustainable option.

I'm in favour of making a minimal IntervalSet type

I was hoping that namespacing would be the simpler solution, but given how many things we already extend in base I think you're probably correct. Moving things from the Base to Intervals namespaces would likely be more breaking.
