Add Lead function which is similar to Oracle's same name window function #88

Closed
wants to merge 1 commit into from

coderplay

Lead is an analytic function modeled on Oracle's LEAD function. It provides access to more than one tuple of a bag at the same time without a self join. Given a bag of tuples returned from a query, Lead provides access to the tuple at a given physical offset beyond the current position. It generates a pair for each item in the bag.

If you do not specify an offset, it defaults to 1. Null is returned if the offset goes beyond the end of the bag.

Example 1:

   register ba-pig-0.1.jar

   define Lead datafu.pig.bags.Lead('2');

   -- INPUT: ({(1),(2),(3),(4)})
   data = LOAD 'input' AS (data: bag {T: tuple(v:INT)});
   describe data;

   -- OUTPUT:  ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)})
   -- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: int),elem2: (v: int))}}
   data2 = FOREACH data GENERATE Lead(data);
   describe data2;
   DUMP data2;

Example 2:

   register ba-pig-0.1.jar

   define Lead datafu.pig.bags.Lead();

   -- INPUT: ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})})
   data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: tuple(v2:INT)})});
   --describe data;

   -- OUTPUT: ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)})
   data2 = FOREACH data GENERATE Lead(data);
   --describe data2;
   DUMP data2;

@matthayes
Contributor

It looks like this differs from how Lead behaves in Oracle. For example in one of your test cases, with Lead('2'):

({(1),(2),(3),(4)})

becomes:

((1),(2),(3))
((2),(3),(4))
((3),(4),)
((4),,)

Should it not produce this instead?

((1),(3))
((2),(4))
((3),)
((4),)
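
The difference between the two outputs can be sketched outside of Pig. The following is only an illustrative Python model, not the UDF's Java implementation; the function names `lead_window` and `lead_oracle` are hypothetical and `None` stands in for a Pig null. `lead_window` mimics what the submitted patch produces (each item together with all of its next `offset` successors), while `lead_oracle` mimics the Oracle-style result proposed above (each item paired only with the item `offset` positions ahead).

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def lead_window(items: Sequence[T], offset: int = 1) -> List[Tuple[Optional[T], ...]]:
    """Hypothetical model of the patch as submitted.

    Each item is emitted with every one of its next `offset` successors,
    padding with None once the end of the bag is passed.
    """
    n = len(items)
    return [
        tuple(items[i + k] if i + k < n else None for k in range(offset + 1))
        for i in range(n)
    ]


def lead_oracle(items: Sequence[T], offset: int = 1) -> List[Tuple[T, Optional[T]]]:
    """Hypothetical model of the Oracle-style behavior.

    Each item is paired only with the item exactly `offset` positions
    ahead, or None when that position is beyond the end of the bag.
    """
    n = len(items)
    return [
        (items[i], items[i + offset] if i + offset < n else None)
        for i in range(n)
    ]


if __name__ == "__main__":
    bag = [1, 2, 3, 4]
    print(lead_window(bag, 2))  # the patch's output for Lead('2')
    print(lead_oracle(bag, 2))  # the output suggested in this comment
```

With an offset of 1 the two models return the same pairs, which is why the discrepancy only shows up in the Lead('2') test case.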

@coderplay
Author

You are right. I will resubmit a patch.

@matthayes
Contributor

By the way, DataFu has been accepted into Apache Incubator. Can you file a JIRA at https://issues.apache.org/jira/browse/DATAFU and submit the patches through JIRA instead? We also have a review board set up here: https://reviews.apache.org/groups/DataFu/ . Also please update license headers as documented here: http://www.apache.org/legal/src-headers.html .

There are two ways to go about creating the patches:

  1. Clone the new repo from git://git.apache.org/incubator-datafu.git and add the files there. Create the patch using git's format-patch command (tips: http://ariejan.net/2009/10/26/how-to-create-and-apply-a-patch-with-git/ ).

  2. Alternatively you can generate the patch from your existing repo like so:

git remote add linkedin https://github.com/linkedin/datafu.git
git fetch linkedin
git pull linkedin master
git format-patch linkedin/master

@matthayes
Contributor

I've filed this JIRA: https://issues.apache.org/jira/browse/DATAFU-12

Please submit a patch there, thanks!

@matthayes closed this on Jan 17, 2014