Quantile and StreamingQuantile don't work - 'can't work with argument null' #27

Closed
rjurney opened this issue Jan 29, 2013 · 6 comments


@rjurney

rjurney commented Jan 29, 2013

<file ./topics.pig, line 31, column 62> Failed to generate logical plan. Nested exception: java.lang.RuntimeException: could not instantiate 'datafu.pig.stats.Quantile' with arguments 'null'

When:

quantiles = foreach (group token_counts all) generate FLATTEN(datafu.pig.stats.Quantile('0.10', '0.90')) as (low_ten, high_ten);

@matthayes
Contributor

What do you expect to happen? Shouldn't you define Quantile first with '0.10', '0.90'?

@rjurney
Author

rjurney commented Jan 29, 2013

That code was junk. Now I'm trying this, but I can't seem to use the quantiles to filter. I am trying to implement your suggestion of removing the top and bottom 10%, but I can't figure out how to access the quantile values.

emails = load '/me/Data/test_mbox' using AvroStorage();
just_id_body = foreach emails generate message_id, body;

token_records = foreach just_id_body generate message_id, FLATTEN(TokenizeText(body)) as token;
token_counts = foreach (group token_records by token) generate (chararray)group as token:chararray, COUNT_STAR(token_records) as total;
quantiles = foreach (group token_counts all) generate FLATTEN(Quantile(token_counts.total)) as (low_filter, high_filter);
token_filter = filter token_counts by total > quantiles::low_filter and total < quantiles::high_filter;

I get this:

<line 11, column 46> Invalid field projection. Projected field [quantiles::low_filter] does not exist in schema: token:chararray,total:long.

or

<file ./topics.pig, line 6, column 15> Invalid scalar projection: token_records : A column needs to be projected from a relation for it to be used as a scalar
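
A note on the first error, offered as a sketch and not a confirmed fix for this script: in Pig, `::` only disambiguates field names inside one relation's schema, so `quantiles::low_filter` is looked up in `token_counts` and not found. Reading a value from a separate single-row relation is normally written with a dot (`relation.field`). The snippet below assumes `quantiles` contains exactly one tuple, which holds here because it is built from a `group ... all`, and that its fields are named `low_filter` and `high_filter` as in the script above. The `(double)` casts are an assumption to keep the comparison types explicit.

```pig
-- Sketch only: assumes `quantiles` is a single-tuple relation with
-- fields low_filter and high_filter, as generated in the script above.
-- Scalar projection uses relation.field, not relation::field.
token_filter = FILTER token_counts
               BY (double)total > quantiles.low_filter
              AND (double)total < quantiles.high_filter;
```

If `quantiles` ever held more than one tuple, Pig would fail at runtime on the scalar lookup, so this pattern relies on the `group ... all` upstream.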


@rjurney
Author

rjurney commented Jan 29, 2013

/* Data Fu */
REGISTER /me/Software/datafu/dist/datafu-0.0.9-SNAPSHOT.jar
REGISTER /me/Software/datafu/lib/*.jar /* */

DEFINE Quantile datafu.pig.stats.Quantile('0.11','0.89');

set default_parallel 5
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

rmf /tmp/tf_idf_scores.txt

import 'tfidf.macro';

emails = load '/me/Data/test_mbox' using AvroStorage();
just_id_body = foreach emails generate message_id, body;

token_records_a = foreach just_id_body generate message_id, FLATTEN(TokenizeText(body)) as token;
token_counts = foreach (group token_records_a by token) generate (chararray)group as token:chararray,
COUNT_STAR(token_records_a) as total;
quantiles = foreach (group token_counts all) generate FLATTEN(Quantile(token_counts.total)) as (low_filter, high_filter);

This returns 1.0,1.0... it is confusing.

@matthayes
Contributor

Hmm seems like you are trying to get the distribution of token counts, right? Shouldn't you do a GROUP ALL and then pass in the total as a bag to Quantile? Also make sure you sort the totals before passing into Quantile.
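
For readers landing here later, the suggestion above might look roughly like the following. This is a minimal sketch, not the exact code that ended up working in this thread: it reuses the relation and field names from the script above (`token_counts`, `total`) and the quantile values from its `DEFINE` line, while the nested aliases `totals` and `sorted` are names made up for illustration.

```pig
-- The quantile levels are constructor arguments, so they go in DEFINE,
-- not in the call inside GENERATE.
DEFINE Quantile datafu.pig.stats.Quantile('0.11', '0.89');

quantiles = FOREACH (GROUP token_counts ALL) {
    -- Keep only the counts, then sort them: Quantile expects a sorted bag.
    totals = token_counts.total;   -- hypothetical alias
    sorted = ORDER totals BY total; -- hypothetical alias
    GENERATE FLATTEN(Quantile(sorted)) AS (low_filter, high_filter);
};
```

Sorting inside the nested block keeps the ordering local to the single group produced by `GROUP ... ALL`, which is the shape Quantile is documented to work on.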

@rjurney
Author

rjurney commented Jan 29, 2013

Thanks, I did that and it works!


@matthayes
Contributor

Great!
