Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vv_intersect dim-check error #4

Closed
mohawk2 opened this issue Mar 12, 2022 · 16 comments
Closed

vv_intersect dim-check error #4

mohawk2 opened this issue Mar 12, 2022 · 16 comments

Comments

@mohawk2
Copy link
Collaborator

mohawk2 commented Mar 12, 2022

As shown in https://stackoverflow.com/a/71446817/3857002, there is what I believe is a logic error in the PMCode for the otherwise-excellent vv_intersect; the "dimension mismatch" error comes from the two dim(-2) being different, while it should just be dim(0) being compared. This error comes to light when broadcasting (formerly known as "threading").

There is also arguably an error in that the first element being returned should be an empty ndarray (vectorlength x 0) on no intersection, while the current implementation does a slice which gets this wrong. Also using $nc->sclr will cause an error with current PDL if broadcasting, since it will be a multi-element ndarray.

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 12, 2022

I believe the solution for the second paragraph there is to slice on "0:".($nc->maxover-1), and you'd want to mv 1, not -1.

@moocow-the-bovine
Copy link
Owner

moocow-the-bovine commented Mar 12, 2022

oof. Thanks @mohawk2 for the report (and for your stackoverflow answer!) -- this does indeed seem to be a bug, but I can't get my head around implicit threading/broadcasting for binary operations like vv_intersect which only need to match on a single dimension (signaturea(M,NA); b(M,NB); [o]c(M,NC); int [o]nc() --> dimension-matching only required for M).

Perhaps I can ask a stupid question to help my understanding: where are implicit thread dimensions created in the dimension list (are they pushed onto the back or unshifted onto the front)? I think my use of dim(-2) was assuming that implicit thread dimensions would be unshifted onto the front, but your statement that "it should just be dim(0) being compared" seems to suggest otherwise.

Taking the "needle+haystack" example, I can imagine implicit threading being useful for searching for multiple needle-vectors in a single haystack-set, or to search for a single needle-vector in multiple haystack-sets, or even both. So given a $needle(M) and a $haystack(M,N) with $dummy=$needle->slice(",*1"), I had been expecting something like following:

  • 1 needle, 1 haystack: vv_intersect($dummy, $haystack) --> $c(M,NC)
  • K needles, 1 haystack:: vv_intersect($dummy->slice("*$k,,,"), $haystack-slice("*1,,,")) --> $c(K,M,NC)
  • 1 needle, K haystacks: vv_intersect($dummy->slice("*1,,,"), $haystack->slice("*$k,,,")) --> $c(K,M,NC)
  • K needles, K haystacks (aligned): vv_intersect($dummy->slice("*$k,,,"), $haystack->slice("*$k,,,")) --> $c(K,M,NC)

... but that doesn't work (and may well never have worked, probably due to my misunderstanding). Am I correct in assuming now that the proper (PDL-ish) signatures for this sort of thing would arise from adding the implicit thread dimension (K) to the end of the dimension-lists like so:

  • K needles, 1 haystack:: vv_intersect($dummy->slice(",,,*$k"), $haystack) --> $c(M,NC,K)
  • 1 needle, K haystacks: vv_intersect($dummy, $haystack->slice(",,,*$k")) --> $c(M,NC,K)
  • K needles, K haystacks (aligned): vv_intersect($dummy->slice(",,,*$k"), $haystack->slice(",,,*$k")) --> $c(M,NC,K)

?

arguably an error in that the first element being returned should be an empty ndarray

Yep, $c should definitely should be empty in that case.

Note to self: this may also affect vv_setdiff, and maybe even v_intersect and v_setdiff.

@moocow-the-bovine
Copy link
Owner

@mohawk2 based on the assumptions above (= implicitly thread over final dimensions rather than initial ones), I've overhauled the vv_* set operations, added some tests, & uploaded to CPAN as PDL::VectorValued v1.0.16. Please feel free to re-open this issue if the problem persists.

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 12, 2022

Lots of changes! I'm not sure we're there yet, but please see if you agree (I added # comment):

pdl> use PDL::VectorValued::Utils
pdl> p $toto = pdl([1,2,3], [4,5,6])
[
 [1 2 3]
 [4 5 6]
]
pdl> p $titi = pdl(1,2,3)
[1 2 3]
pdl> p $notin = pdl(7,8,9)
[7 8 9]
pdl> p $titi->dummy(1)
[
 [1 2 3]
]
pdl> p $notin->dummy(1)
[
 [7 8 9]
]
pdl> p vv_intersect($titi, $toto)
Empty[3x0] [0] # wrong - the built-in broadcasting engine would insert a dim of 1, you need to mimic that
pdl> p vv_intersect($notin, $toto)
Empty[3x0] [0] # right
pdl> p vv_intersect($titi->dummy(1), $toto)
[
 [1 2 3]
]
 [1] # right
pdl> p vv_intersect($notin->dummy(1), $toto)
[
 [0 0 0] # this should be empty
]
 [0]

The non-dummied results aren't right; the possibility of no dim being there needs to be accounted for by adding a "1" dim. The non-intersecting dummied result should give an empty as indicated.

I am sure that my previous point about using maxover not max was correct; I see you went with max, which collapses the entire ndarray rather than just the 0-th dim, which if one had a 5-dimensional thing would be obviously wrong; you're aiming to find the largest quantity along exactly and only the 0-th dim.

I found the broadcasting thing a bit head-bending when I last-minute decided to add to that answer with the "multi" thing, so it's no surprise it's taking us a couple of goes. Your note above only allows for K needles and K haystacks, instead of L haystacks, which in my experience shows one is not there yet (I've battled this broadcasting thing quite a bit, and carry the psychological scars ;-). Here is my concept of this, hope it makes sense. My notation uses * to indicate an inserted dummy dim. The vector-length will be universally 3, and that quantity will be otherwise avoided to minimise the risk of errors:

needle0_in: 3
needle0_notin: 3
needle_in: 3x1
needle_notin: 3x1
needles: 3x2 # a "cat" of both needle0
haystack: 3x2
# these dimensional outputs reflect the "result"
# the "nc" is appended with a ","
# nc is notated not with dims but as an ndarray with [], collapsing the first two dims to 1
needle0_in /\ haystack: 3x1, [1] # broadcasted the input up to 3x1
needle0_notin /\ haystack: 3x0, [0]
needle_in /\ haystack: 3x1, [1]
needle_notin /\ haystack: 3x0, [0]
needles /\ haystack: 3x1, [1]
# note the "nc" results here are probably unnecessarily inflated with an extra dim
# I believe if you just pass a `null` like in my SO answer the broadcasting engine will agree

# now we want to know whether each needle is "in" one by one, not really
# a normal intersect, so we insert a dummy in haystack in order to broadcast
# the "nc" needs to come back as a 4x2
needles8: 3x4x2
haystack: 3x2
# need to manipulate above into suitable inputs for intersect to get right output
needles8': 3x*x4x2 # dummy of size 1 inserted in dim 1
# haystack no changes needed; don't need same number of dims, broadcast engine will add dummy/1s at top
needles8' /\ haystack: 3x1x4x2, (a 1x4x2 result - as noted above, collapses first two dims of "results" into a 1, probably unnecessarily inflated as noted above)

I would observe that the tests you added appear superficially very thorough, but that using lots of slices and glues for both the inputs and the "expected" data is a flawed approach because one risks copying a flawed implementation into flawed tests for a false positive. For instance, around line 114 of the setops test, I honestly don't know what the slice/glue expression would produce. I'd suggest instead using explicit test data at least for outputs.

Final observation; using PMCode to mangle input dimensions is very difficult, and it's easier in RedoDimsCode, especially with the 2.075 change that lets you access $SIZE(x) directly rather than all the conditional-operators-to allow-for-insufficient-dims stuff. Mangling of outputs is probably easiest in PMCode, although it ought to be made easier.

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 12, 2022

Please note that I am unable to re-open the issue. I assume the repo settings are set to forbid that.

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 13, 2022

Thanks for reopening! (I gather it's a long-complained-about feature of GH that if the maintainer closes an issue, the OP cannot reopen it, only the maintainer).

I thought of an actual (albeit dumb) application for the multi, "needles" with dummying: if the vector is a 3-vector of a colour, you can produce a mask of an image, of pixels with exactly those values.

@moocow-the-bovine
Copy link
Owner

moocow-the-bovine commented Mar 13, 2022

thanks for the illustrative reply! I'm not sure I've understood everything yet, so please be patient:

pdl> p vv_intersect($titi, $toto)
Empty[3x0] [0] # wrong - the built-in broadcasting engine would insert a dim of 1, you need to mimic that

I think I've got this kinda working locally with RedoDimsCode (thanks for the hint!)


pdl> p vv_intersect($notin->dummy(1), $toto)
[
 [0 0 0] # this should be empty
]

I believe this was already fixed in v1.0.16.


I am sure that my previous point about using maxover not max was correct;

I see the logic of your suggestion and agree that it's correct, but I don't see how to make work both efficiently & correctly without replacing the slice() with a dice_axis() or similar. My brain is also still reeling from using PDLs to specify shapes in the dim-list. Say we've got:

$needles = pdl([ [[1,2,3],[-1,-1,-1]],
                           [[-2,-2,-2],[-3,-3,-3]] ]);
# 3x2x2

$haystack = pdl([[1,2,3],[4,5,6]]);
# 3x2

($c, $nc) = vv_intersect($needles, $haystack);
# 3x2x2, 2, with $nc = pdl(1,0): so far, so good

The scalar-context convenience slicing in PMCode would reshape that using maxover as $c->mv(1,0)->slice("0:".($nc->maxover-1))->mv(0,1), and return a $c of shape 3x1x2:

[
 [[1 2 3]],
 [[0 0 0]],
]

... so the second (threaded) logical intersection of $needles->slice(",,(1)") (=[[-2-2-2],[-3-3-3]]) with $haystack would appear in the trimmed $c as an all-zero 3x1 ndarray rather than an empty 3x0 vector-set. I don't believe that there's any good way around this in PDL because each ndarray needs a constant shape ... that's why $nc is there in the first place. The implicit slicing to $nc->max-1 (or $nc->maxover-1) really only makes sense to me in the base case where we're intersecting exactly 2 sets and not threading over multiple intersections in a single call.

Using maxover also breaks the trimming slice expression in higher-order cases, e.g.:

$needles4 = $needles->slice(",,,*4");
($c4,$nc4) = vv_intersect($needles4, $haystack);
# 3x2x2x4, 2x4

print $nc4->maxover;
[1 1 1 1]

... but we can't slice("0:".($nc4->maxover-1)) because [0 0 0 0] isn't a slice-expression. My only idea for a generalized variant would be to use a temporary indx-ndarray somehow derived from $nc4->maxover and perform the convenience-trimming with dice_axis, but (a) that would likely be a large and klunky temporary, (b) I don't see how to derive it in the general case, (c) it doesn't solve the problem of variable-sized result sets, and (e) I fervently wish to believe that anyone who finds a real use-case for this can either live with the (overgeneralized) $nc->max trimming or go ahead and get what they need from $nc itself.

To invoke some Perl dogma: the "easy things" (here, intersecting exactly 2 sets) should be easy (as is the case with $nc->max and slice()), and the "harder things" (like threaded multi-intersections) should be possible (e.g. by giving the user the option to refer to $nc directly).


I still have to look deeper into it; just some thoughts for now...

@moocow-the-bovine
Copy link
Owner

I would observe that the tests you added appear superficially very thorough, but that using lots of slices and glues for both the inputs and the "expected" data is a flawed approach ...

Yep, you're right, it is, I'm just lazy.

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 13, 2022

For the intersection between a 3x2 and a 3x2x2x4, the $nc will be a 2x4. Each of the values will be ranged from 0..2 (here, 2 because it's the lower (EDIT not higher, for an intersection) of the needle's dim(1) and the haystack's dim(1)).

I think you may be right that actually what's needed to find the size of the $result (which would be 3x$Ax2x4) would be a "global" max. You are quite right that ndarrays have to be rectangular, which would mean that some of the results that were shorter than others would indeed have zeroes, but that's fine because the $nc entry for that result would tell you how many to take into account.

I believe this was already fixed in v1.0.16

The result I showed was from 1.0.16. Please try it ;-)

If you think this stuff is confusing, I've just (FINALLY) fixed the gremlins in the TriD demos (PDLPorters/pdl#337). I decided I'd make the code generating the graphs' axes use ndarrays rather than Perl scalars and for-loops. Using some creative slicing, clumping, and dataflow, it all looked like it would work great. Segfault (the reverse dataflow destination has a NULL data pointer for some reason). More debugging! (luckily it's 100% reproducible)

@moocow-the-bovine
Copy link
Owner

moocow-the-bovine commented Mar 13, 2022

Thanks for your patience :-) ... I think pretty much everything (with the exception of max vs. maxover) is now working in d84f43b .

With respect to:

pdl> p vv_intersect($notin->dummy(1), $toto)
[
 [0 0 0] # this should be empty
]
[0]

I believe this was already fixed in v1.0.16

The result I showed was from 1.0.16. Please try it ;-)

I had (and I just did again) ;-) ... I think there may be a misunderstanding. The implicit ("convenience") trimming code which returns an Empty[3x0] for this call is context-sensitive (only evaluated in scalar context), as it says in the docs. In list context, vv_intersect always returns an un-trimmed pair ($c, $nc). This is intentional, and I'm disinclined to change it (-> why bother with max & slice if the user wants/needs to access $nc directly anyways?). Current master [EDIT: really in master now] behaves the same way with respect to context-sensitivity. What I meant when I said I believed it to be "fixed" in v1.0.16 is that in scalar context, vv_intersect does indeed return an Empty[3x0] ndarray for this call there:

pdl> p scalar vv_intersect($notin->dummy(1), $toto)
Empty[3x0]

If that's returning non-empty on your setup in scalar context, there's likely something else going on. The 3x1 shape of the first element returned in list context is expected behavior.

@moocow-the-bovine
Copy link
Owner

I won't have much time for this next week, so I've uploaded the current status quo to CPAN as PDL-VectorValued v1.0.17. Leaving the issue open (and omitting sensitive keywords from commit messages ;-)

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 13, 2022

It's a pity the slice only happens in scalar (EDIT not list) context. append doesn't make such a distinction. If you feel my SO answer can actually be rewritten based on an updated CPAN version (and not have to use the _*_int function), please either update it yourself, or put here a sketch of the updated version and I'll do so :-)

@moocow-the-bovine
Copy link
Owner

moocow-the-bovine commented Mar 14, 2022

It's a pity the slice only happens in list context.

(presumably a typo): only in scalar context, not in list context.

append doesn't make such a distinction.

True, but append also doesn't have to deal with variable-sized result sets, so it doesn't have (or need) a second return value like $nc.

If you feel my SO answer can actually be rewritten based on an updated CPAN version

I think that v1,0.17 ought to work without the _int function:

sub vv_in {
  require PDL::VectorValued::Utils;
  my ($needle, $haystack) = @_;
  my ($c, $nc) = $needle->vv_intersect($haystack);
  $nc;
}
pdl> p $titi = pdl(1,2,3)
[1 2 3]
pdl> p $toto = pdl([1,2,3], [4,5,6])
[
 [1 2 3]
 [4 5 6]
]
pdl> p $notin = pdl(7,8,9)
[7 8 9]
pdl> p vv_in($titi, $toto)
1
pdl> p vv_in($notin, $toto)
0

[EDIT for posterity: corrected broken example code to something which actually works as intended]

As I added to the (3-year-old?) SO thread yesterday, I think a "better" way to do it would be to use vsearchvec, which also handles multiple searches more gracefully, and doesn't require $needle to be lexicographically sorted (nb vsearchvec is what PDL::CCS mostly uses under the hood):

sub vv_in_vsearch {
  require PDL::VectorValued::Utils;
  my ($needle, $haystack) = @_;
  return ($haystack->dice_axis(1, $needle->vsearchvec($haystack)) == $needle)->bandover;
}
pdl> p vv_in_vsearch($titi, $toto)
[1]
pdl> p vv_in_vsearch($notin, $toto)
[0]
pdl> p vv_in_vsearch($toto, $toto)
[1 1]
pdl> p vv_in_vsearch($notin->cat($titi), $toto)
[0 1]

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 15, 2022

Thank you! I think your SO answer is better than mine, so if you feel this matter is resolved, please feel free to close this issue :-)

@moocow-the-bovine
Copy link
Owner

Thanks again for the report (and for the upvote :-); I've thrown out the redundant mv(1,0)->...->mv(0,1) from the trimming expressions in v1.0.18 (which I think was your suggestion, but can't find it above), and closing.

@mohawk2
Copy link
Collaborator Author

mohawk2 commented Mar 16, 2022

Thanks again for the report (and for the upvote :-); I've thrown out the redundant mv(1,0)->...->mv(0,1) from the trimming expressions in v1.0.18 (which I think was your suggestion, but can't find it above), and closing.

I added a comment on the GH commit, since that was a convenient place to do so as I was reading it with interest. Thanks for your updates!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants