Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a method for HashJoin pipes #1

Merged
merged 1 commit into from
Aug 20, 2012
Merged

Conversation

blinsay
Copy link

@blinsay blinsay commented Aug 20, 2012

I added Assembly#hash_join like I threatened on twitter. Tests are included, and (if it's worth anything) I smoke tested locally.

The only thing that kept this from being a 10 minute change was a bit of foolery with scopes - I was expecting the outgoing Scope generated by the HashJoin to be identical to the one generated by CoGroup, but this didn't end up being the case. That makes sense in hindsight, since CoGroup is doing grouping and HashJoin isn't. If that makes sense to you, commit away. If not, holler back.

@travisbot
Copy link

This pull request passes (merged 29b5a79 into 5dfb77e).

@mrwalker mrwalker merged commit 29b5a79 into mrwalker:master Aug 20, 2012
@mrwalker
Copy link
Owner

blinsay, thanks very much for this addition, in particular your attention to testing. That rocks.

The only thing I notice that makes me the slightest bit nervous is the way in which you delegate to apply_aggregations. That was meant to handle all the cases of join, group_by, and union for applying a block of Everies and potentially replacing the entire structure with a composite AggregateBy, if possible. One component of determining if it's possible to use the AggregateBy optimization is to ask the tail pipe is_group_by -- one of your notes mentions that HashJoin is implemented as a Group, so I'd like to look into this to make sure the AggregateBy optimization isn't accidentally being tripped off. To avoid all this, we can simply route around apply_aggregations entirely (since hash_join can't take a block and should never apply aggregations).

But take all that with a grain of salt since I haven't looked into HashJoin's implementation and am just speculating about possible concerns.

Thanks again for the patch!

@blinsay
Copy link
Author

blinsay commented Aug 20, 2012

I was just using apply_aggregations to make sure I didn't miss anything. :P I think it was part of realizing that the output grouping fields were different in HashJoin than they are in CoGroup.

Good concern though. I just dug into Splice.java, and it looks like HashJoin is a Kind.Join after all.

/* cascading.pipe.Splice.java - Line 812 */
  private void setKind()
    {
    if( this instanceof GroupBy )
      kind = Kind.GroupBy;
    else if( this instanceof CoGroup )
      kind = Kind.CoGroup;
    else if( this instanceof Merge )
      kind = Kind.Merge;
    else
      kind = Kind.Join;
    }

And since is_group_by just returns kind == Kind.GroupBy, this should be safe.

Speaking of testing, do I owe you a Spec too?

@mrwalker
Copy link
Owner

Ah, perfect. Then can_aggregate_by? will be false and we avoid having to worry about swapping in a composite AggregateBy.

No, definitely not. The specs are older tests built around the time I added field propagation using Scope. I've been attempting to migrate off them onto Test::Unit because they swallow exceptions in a very annoying way which makes writing new tests difficult. At the moment, I keep them around because they have some actual input/output coverage on running jobs that I haven't ported into the unit tests yet.

@mrwalker
Copy link
Owner

I tweaked the hash_join tests ever so slightly to make them line up side-by-side with the join tests. I also replaced the comment about HashJoin's impl with basically a WTF? I do not currently understand why it should report only the lhs grouping fields despite being sent the same array of fields from all inputs that we give to CoGroup.

There is likely a reason for this, but I haven't dug into it enough to understand. In the meantime, if it works, this is not really a big deal, but you may have uncovered a bug to submit to Cascading =)

@blinsay
Copy link
Author

blinsay commented Aug 20, 2012

It sure looks like it works. I've been screwing around with it in Local mode and things look ok. I'd expect a failure like that to pop out during the planning phase, too.

It really doesn't look like it should be different, but it's coming straight from the Java and not the Ruby. Like I mentioned in my note, I can rationalize the difference if you make me - a HashJoin doesn't do any grouping and a CoGroup does, but they're both Splices, so I can see just specifying one of the join keys as the group field. It's a part of code I'm sure Chris didn't think would see the light of day. :P I didn't manage to find a reason for it in the Java, but I didn't look super hard.

Maybe it's time to take this to the mailing list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants