notes on pigstorefunc, need to fix outputformat so the options arent namespaced with "wonderdog" in any way
Jacob Perkins committed Jan 24, 2011
1 parent 302f2e4 commit b25d04c
Showing 2 changed files with 46 additions and 0 deletions.
45 changes: 45 additions & 0 deletions notes/pigstorefunc.pig
@@ -0,0 +1,45 @@
--
-- Doesn't work at the moment, just some notes on how the storefunc might look.
--


--
-- Right now the ElasticSearchOutputFormat gets all its options from the
-- Job object. We can use the call to setStoreLocation in the storefunc
-- to set the required parameters. Need to make sure the following are
-- set:
--
-- wonderdog.index.name - should be set by the storefunc constructor
-- wonderdog.bulk.size - should be set by the storefunc constructor
-- wonderdog.field.names - should be set by the call to checkSchema
-- wonderdog.id.field - should be set by the storefunc constructor
-- wonderdog.object.type - should be set by the storefunc constructor
-- wonderdog.plugins.dir - should be set by call to setStoreLocation
-- wonderdog.config - should be set by call to setStoreLocation
--
-- FIXME: options used in the ElasticSearchOutputFormat should NOT be
-- namespaced with 'wonderdog'

%default INDEX 'es_index'
%default OBJ 'text_obj'


records = LOAD '$DATA' AS (text_field:chararray);
records_with_id = LOAD '$IDDATA' AS (id_field:int, text_field:chararray);

-- Here we would use the elasticsearch index name as the uri, pass in a
-- comma separated list of field names as the first arg, the id field
-- as the second arg and the bulk size as the third.
--
-- and so on.
STORE records INTO '$INDEX/$OBJ' USING ElasticSearchStorage('my_text_field', '-1', '1000');


-- but it would be really nice to duplicate what's in WonderDog.java in that,
-- should a bulk request fail, the failed records are written to hdfs. The
-- user should have some control of this. Also, it should be possible to generate
-- the field names directly from the pig schema? (We'd have to be VERY explicit in the
-- docs about this as it would be a point of headscratching/swearing...) In this
-- case we might have something like:
named_records = FOREACH records GENERATE text_field AS text_field_name;
STORE named_records INTO '/path/to/failed_requests' USING ElasticSearchStorage('$INDEX/$OBJ', '-1', '1000');
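
The notes above assign each option to a storefunc hook. A minimal Java sketch of the first STORE example (index and object type taken from the location, the three constructor arguments as written) might look like the following. The class name ElasticSearchStorageSketch is hypothetical, it does not extend Pig's StoreFunc, and a plain Map stands in for the Hadoop job configuration so the sketch runs without Pig or Hadoop. The notes list the index name as set by the constructor, while the STORE example passes it in the location; this sketch follows the STORE example. wonderdog.plugins.dir and wonderdog.config are left out because the notes do not say where their values come from.

```java
import java.util.Map;

// Hypothetical stand-in for the storefunc sketched in the notes.
// Assumption: a Map<String, String> plays the role of the job configuration.
class ElasticSearchStorageSketch {
    private final String fieldNames; // comma-separated list, first constructor arg
    private final String idField;    // second arg; the notes pass '-1' without defining it
    private final String bulkSize;   // third arg

    ElasticSearchStorageSketch(String fieldNames, String idField, String bulkSize) {
        this.fieldNames = fieldNames;
        this.idField = idField;
        this.bulkSize = bulkSize;
    }

    // Mirrors the role the notes give to setStoreLocation: split the
    // "index/type" location and copy every known option into the job config.
    void setStoreLocation(String location, Map<String, String> jobConf) {
        String[] parts = location.split("/", 2);
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("expected 'index/type', got: " + location);
        }
        jobConf.put("wonderdog.index.name", parts[0]);
        jobConf.put("wonderdog.object.type", parts[1]);
        jobConf.put("wonderdog.field.names", fieldNames);
        jobConf.put("wonderdog.id.field", idField);
        jobConf.put("wonderdog.bulk.size", bulkSize);
    }
}
```

Calling setStoreLocation("es_index/text_obj", conf) on an instance built with ("text_field", "-1", "1000") fills conf with the five wonderdog.* keys listed in the notes. A location without a slash is rejected early instead of producing a half-configured job.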
@@ -130,6 +130,7 @@ public RecordWriter<NullWritable, MapWritable> getRecordWriter(final TaskAttempt
        return new ElasticSearchRecordWriter();
    }

    // FIXME: options used here should NOT be namespaced with 'wonderdog'
    public void setConf(Configuration conf) {
        this.indexName = conf.get("wonderdog.index.name");
        this.bulkSize = Integer.parseInt(conf.get("wonderdog.bulk.size"));
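
One way the FIXME could be addressed without breaking existing jobs is to read an un-namespaced key first and fall back to the legacy "wonderdog." key. The sketch below is an assumption, not the project's code: the OptionLookup class, the un-namespaced key names ("index.name", "bulk.size"), and the default values passed by callers are all hypothetical, and a Map stands in for the Hadoop Configuration. It also guards the integer option, since Integer.parseInt throws NumberFormatException when handed a null string, which is what setConf as shown would hit if the bulk size were never set.

```java
import java.util.Map;

// Hypothetical helper; a Map stands in for org.apache.hadoop.conf.Configuration.
class OptionLookup {
    static final String LEGACY_PREFIX = "wonderdog.";

    // Prefer the un-namespaced key, then the legacy "wonderdog."-prefixed key,
    // then the caller's default.
    static String get(Map<String, String> conf, String key, String defaultValue) {
        String value = conf.get(key);
        if (value == null) {
            value = conf.get(LEGACY_PREFIX + key);
        }
        return value != null ? value : defaultValue;
    }

    // Same lookup order for integer options. A missing value yields the
    // default; a malformed value fails with a message naming the key.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String raw = get(conf, key, null);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "option '" + key + "' is not an integer: " + raw, e);
        }
    }
}
```

With this shape, setConf could read OptionLookup.get(conf, "index.name", null) and OptionLookup.getInt(conf, "bulk.size", 1000), where 1000 is borrowed from the bulk size used in the Pig notes rather than from any documented default.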