Added a bulkload example

commit 3ee02813ca4e7f9e935492e6b01cb1864915d70e 1 parent fd54065
Philip (flip) Kromer authored
Showing with 173 additions and 0 deletions.
  1. +60 −0 Gemfile.lock
  2. +43 −0 README.md
  3. +70 −0 examples/bulkload_wp_pageviews.pig
60 Gemfile.lock
@@ -0,0 +1,60 @@
+PATH
+  remote: .
+  specs:
+    wonderdog (0.1.0)
+      wukong-hadoop (= 0.1.0)
+
+GEM
+  remote: http://rubygems.org/
+  specs:
+    configliere (0.4.18)
+      highline (>= 1.5.2)
+      multi_json (>= 1.1)
+    diff-lcs (1.2.1)
+    eventmachine (1.0.1)
+    forgery (0.5.0)
+    gorillib (0.4.2)
+      configliere (>= 0.4.13)
+      json
+      multi_json (>= 1.1)
+    highline (1.6.15)
+    json (1.7.7)
+    log4r (1.1.10)
+    multi_json (1.6.1)
+    rake (0.9.6)
+    redcarpet (2.2.2)
+    rspec (2.13.0)
+      rspec-core (~> 2.13.0)
+      rspec-expectations (~> 2.13.0)
+      rspec-mocks (~> 2.13.0)
+    rspec-core (2.13.0)
+    rspec-expectations (2.13.0)
+      diff-lcs (>= 1.1.3, < 2.0)
+    rspec-mocks (2.13.0)
+    uuidtools (2.1.3)
+    vayacondios-client (0.1.2)
+      configliere (>= 0.4.16)
+      gorillib (~> 0.4.2)
+      multi_json (~> 1.1)
+    wukong (3.0.0)
+      configliere (>= 0.4.18)
+      eventmachine
+      forgery
+      gorillib (>= 0.4.2)
+      log4r
+      multi_json (>= 1.3.6)
+      uuidtools
+      vayacondios-client (>= 0.1.2)
+    wukong-hadoop (0.1.0)
+      wukong (= 3.0.0)
+    yard (0.8.5.2)
+
+PLATFORMS
+  ruby
+
+DEPENDENCIES
+  rake (~> 0.9)
+  redcarpet
+  rspec (~> 2)
+  wonderdog!
+  yard
43 README.md
@@ -172,3 +172,46 @@ bin/estool snapshot -c <elasticsearch_host> --index <index_name>
```
bin/estool delete -c <elasticsearch_host> --index <index_name>
```
+
+
+## Bulk Loading Tips for the Risk-seeking Dangermouse
+
+The file `examples/bulkload_wp_pageviews.pig` shows an example of bulk loading elasticsearch, including preparing the index.
+
+### Elasticsearch Setup
+
+Some tips for an industrial-strength cluster, assuming exclusive use of machines and no read load during the job (a combined sketch of these settings follows the list):
+
+* Use multiple machines with a fair bit of RAM (7+GB). Heap doesn't help too much for loading, so you don't have to go nuts: we do fine with Amazon m1.large instances.
+* Allocate a sizeable heap, setting min and max equal, and
+  - turn `bootstrap.mlockall` on, and run `ulimit -l unlimited`.
+  - For example, for a 3GB heap: `-Xmx3000m -Xms3000m -Delasticsearch.bootstrap.mlockall=true`
+  - Never use a heap above 12GB or so; it's dangerous (stop-the-world compaction timeouts).
+  - You've succeeded if the full heap size is resident on startup: that is, in htop both the VIRT and RES columns show 3000 MB or so.
+* Temporarily increase the `index_buffer_size` to, say, 40%.
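+
+As a rough sketch, assuming a layout where the startup script honors `ES_JAVA_OPTS` and `elasticsearch.yml` lives in `/etc/elasticsearch` (both are assumptions; adjust for your install), the node settings above come together like so:
+
+```
+# let elasticsearch lock its heap into RAM
+ulimit -l unlimited
+
+# 3GB heap, min == max, mlockall on (assumption: your init script passes ES_JAVA_OPTS through)
+export ES_JAVA_OPTS="-Xmx3000m -Xms3000m -Delasticsearch.bootstrap.mlockall=true"
+
+# in /etc/elasticsearch/elasticsearch.yml (node-level settings, applied on restart):
+#   bootstrap.mlockall: true
+#   indices.memory.index_buffer_size: 40%
+
+# sanity check after startup: resident set size should be ~3000 MB
+ps -o rss=,vsz= -p $(pgrep -f elasticsearch)
+```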
+
+### Temporary Bulk-load settings for an index
+
+To prepare an index for bulk loading, the following settings may help. They are
+*EXTREMELY* aggressive, and include dropping the index to zero replicas (a single
+copy of each shard). One false step and you've destroyed Tokyo.
+
+Actually, you know what? Never mind. Don't apply these; they're too crazy.
+
+    curl -XPUT 'localhost:9200/wikistats/_settings?pretty=true' -d '{"index": {
+      "number_of_replicas": 0, "refresh_interval": -1, "gateway.snapshot_interval": -1,
+      "translog": { "flush_threshold_ops": 50000, "flush_threshold_size": "200mb", "flush_threshold_period": "300s" },
+      "merge.policy": { "max_merge_at_once": 30, "segments_per_tier": 30, "floor_segment": "10mb" },
+      "store.compress": { "stored": true, "tv": true } } }'
+
+To restore your settings, in case you didn't destroy Tokyo:
+
+    curl -XPUT 'localhost:9200/wikistats/_settings?pretty=true' -d '{"index": {
+      "number_of_replicas": 2, "refresh_interval": "60s", "gateway.snapshot_interval": "3600s",
+      "translog": { "flush_threshold_ops": 5000, "flush_threshold_size": "200mb", "flush_threshold_period": "300s" },
+      "merge.policy": { "max_merge_at_once": 10, "segments_per_tier": 10, "floor_segment": "10mb" },
+      "store.compress": { "stored": true, "tv": true } } }'
+
+If you did destroy your database, please send your resume to jobs@infochimps.com as you begin your
+job hunt. It's the reformed sinner that makes the best missionary.
+
70 examples/bulkload_wp_pageviews.pig
@@ -0,0 +1,70 @@
+SET mapred.map.tasks.speculative.execution false;
+
+-- path to wikipedia pageviews data
+%default PAGEVIEWS 's3n://bigdata.chimpy.us/data/results/wikipedia/full/pageviews/2008/03'
+-- the target elasticsearch index and mapping ("type"). These will be created
+-- automatically if they don't exist, but you should create them yourself first,
+-- as shown below.
+%default INDEX 'pageviews'
+%default OBJ 'pagehour'
+-- path to elasticsearch jars
+%default ES_JAR_DIR '/usr/local/share/elasticsearch/lib'
+-- Batch size for loading
+%default BATCHSIZE '10000'
+
+-- Example of bulk loading. This will easily load more than a billion documents
+-- into a large cluster. We recommend using Ironfan to set your junk up.
+--
+-- Preparation:
+--
+-- Create the index:
+--
+-- curl -XPUT 'http://projectes-elasticsearch-0.test.chimpy.us:9200/pageviews' -d '{"settings": { "index": {
+-- "number_of_shards": 12, "number_of_replicas": 0, "store.compress": { "stored": true, "tv": true } } }}'
+--
+-- Define the elasticsearch mapping (type):
+--
+-- curl -XPUT 'http://projectes-elasticsearch-0.test.chimpy.us:9200/pageviews/pagehour/_mapping' -d '{
+-- "pagehour": {
+-- "_source": { "enabled" : true },
+-- "properties" : {
+-- "page_id" : { "type": "long", "store": "yes" },
+-- "namespace": { "type": "integer", "store": "yes" },
+-- "title": { "type": "string", "store": "yes" },
+-- "num_visitors": { "type": "long", "store": "yes" },
+-- "date": { "type": "integer", "store": "yes" },
+-- "time": { "type": "long", "store": "yes" },
+-- "ts": { "type": "date", "store": "yes" },
+-- "day_of_week": { "type": "integer", "store": "yes" } } }}'
+--
+-- For best results, see the 'Tips for Bulk Loading' in the README.
+--
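+-- A hypothetical invocation, showing how the %default parameters above can be
+-- overridden on the command line (the bucket path is an assumption; substitute
+-- your own data location):
+--
+--   pig -p PAGEVIEWS=s3n://your-bucket/wikipedia/pageviews/2008/03 \
+--       -p INDEX=pageviews -p OBJ=pagehour \
+--       -p ES_JAR_DIR=/usr/local/share/elasticsearch/lib \
+--       examples/bulkload_wp_pageviews.pig
+--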
+
+-- Always disable speculative execution when loading into a database
+set mapred.map.tasks.speculative.execution false;
+-- Don't re-use JVMs: logging gets angry
+set mapred.job.reuse.jvm.num.tasks 1;
+-- Use large split sizes: per-task setup/teardown overhead costs more than
+-- running some map tasks non-locally
+set mapred.min.split.size 3000MB;
+set pig.maxCombinedSplitSize 2000MB;
+set pig.splitCombination true;
+
+register ./target/wonderdog*.jar;
+register $ES_JAR_DIR/*.jar;
+
+pageviews = LOAD '$PAGEVIEWS' AS (
+ page_id:long, namespace:int, title:chararray,
+ num_visitors:long, date:int, time:long,
+ epoch_time:long, day_of_week:int);
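+-- elasticsearch date fields take epoch milliseconds, so scale epoch_time (seconds) up for ts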
+pageviews_fixed = FOREACH pageviews GENERATE
+ page_id, namespace, title,
+ num_visitors, date, time,
+ epoch_time * 1000L AS ts, day_of_week;
+
+STORE pageviews_fixed INTO 'es://$INDEX/$OBJ?json=false&size=$BATCHSIZE' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
+
+-- -- To instead dump the JSON data to disk (needs Pig 0.10+)
+-- set dfs.replication 2
+-- %default OUTDUMP '$PAGEVIEWS.json'
+-- rmf $OUTDUMP
+-- STORE pageviews_fixed INTO '$OUTDUMP' USING JsonStorage();
