# Wonderdog

Wonderdog is a Hadoop interface to Elasticsearch. While it is specifically intended for use with Apache Pig, it includes all the necessary Hadoop input and output formats for Elasticsearch, so it's possible to skip Pig entirely and write custom Hadoop jobs if you prefer.

## Requirements

## Usage

### Using ElasticSearchStorage for Apache Pig

The most up-to-date (and simplest) way to store data into elasticsearch with hadoop is to use the Pig Store Function. You can write both delimited and json data to elasticsearch, as well as read data back out.

#### Storing tabular data:

This allows you to store tabular data (e.g. tsv, csv) into elasticsearch.

```pig
%default ES_JAR_DIR '/usr/local/share/elasticsearch/lib'
%default INDEX 'ufo_sightings'
%default OBJ 'sighting'

register target/wonderdog*.jar;
register $ES_JAR_DIR/*.jar;

ufo_sightings = LOAD '/data/domestic/aliens/ufo_awesome.tsv' AS (sighted_at:long, reported_at:long, location:chararray, shape:chararray, duration:chararray, description:chararray);
STORE ufo_sightings INTO 'es://$INDEX/$OBJ?json=false&size=1000' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
```

Here the fields that you set in Pig (e.g. 'sighted_at') are used as the field names when creating json records for elasticsearch.
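
For illustration, with the schema above each row becomes a json document whose keys are the Pig field names; the values below are made up:

```
{
  "sighted_at": 19950615,
  "reported_at": 19950617,
  "location": "Des Moines, IA",
  "shape": "disk",
  "duration": "5 minutes",
  "description": "Bright object moving quickly across the sky"
}
```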

#### Storing json data:

You can store json data just as easily.

```pig
ufo_sightings = LOAD '/data/domestic/aliens/ufo_awesome.tsv.json' AS (json_record:chararray);
STORE ufo_sightings INTO 'es://$INDEX/$OBJ?json=true&size=1000' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
```

#### Reading data:

Easy too.

```pig
-- dump some of the ufo sightings index based on free text query
alien_sightings = LOAD 'es://ufo_sightings/ufo_sightings?q=alien' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage() AS (doc_id:chararray, contents:chararray);
DUMP alien_sightings;
```

#### ElasticSearchStorage Constructor

The constructor to the UDF can take two arguments (in the following order; see the sketch below):

* ```esConfig``` - The full path to where elasticsearch.yml lives on the machine launching the hadoop job
* ```esPlugins``` - The full path to where the elasticsearch plugins directory lives on the machine launching the hadoop job

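For example, a minimal sketch passing both arguments (the paths are placeholders; use wherever these files live on the machine launching the job):

```pig
-- sketch only: point the storefunc at a local elasticsearch.yml and plugins directory
STORE ufo_sightings INTO 'es://$INDEX/$OBJ?json=false&size=1000'
    USING com.infochimps.elasticsearch.pig.ElasticSearchStorage('/etc/elasticsearch/elasticsearch.yml', '/usr/local/share/elasticsearch/plugins');
```
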
#### Query Parameters

There are a few query parameters available (an example follows the list):

* ```json``` - (STORE only) When 'true' indicates to the StoreFunc that pre-rendered json records are being indexed. Default is false.
* ```size``` - When storing, this is used as the bulk request size (the number of records to stack up before indexing to elasticsearch). When loading, this is the number of records to fetch per request. Default 1000.
* ```q``` - (LOAD only) A free text query determining which records to load. If empty, matches all documents in the index.
* ```id``` - (STORE only) The name of the field to use as a document id. If blank (or -1) the documents are assumed to have no id and are assigned one by elasticsearch.
* ```tasks``` - (LOAD only) The number of map tasks to launch. Default 100.

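As a sketch, here's how a few of those parameters look in the store and load URIs (the parameter values are illustrative):

```pig
-- store pre-rendered json records, 5000 per bulk request
STORE ufo_sightings INTO 'es://$INDEX/$OBJ?json=true&size=5000' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();

-- load only documents matching a free text query, across 10 map tasks
alien_sightings = LOAD 'es://$INDEX/$OBJ?q=alien&tasks=10' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage() AS (doc_id:chararray, contents:chararray);
```
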
Note that elasticsearch.yml and the plugins directory are distributed to every machine in the cluster automatically via hadoop's distributed cache mechanism.

### Native Hadoop TSV Loader

**Note**: the tsv loader is deprecated. Instead, use the ElasticSearchOutputFormat coupled with either of the Apache Pig storefuncs (ElasticSearchIndex or ElasticSearchJsonIndex).

Once you've got a working setup you should be ready to launch your bulkload process. The best way to explain is with an example. Say you've got a tsv file of user records (name,login,email,description) and you want to index all the fields. Assume you're going to write to an index called ```users``` with objects of type ```user``` (elasticsearch will create this object automatically the first time you upload one). The workflow is as follows:

* Create the ```users``` index:

```
bin/estool create --index users
```

* Upload the data

```
# Will only work if the hadoop elasticsearch processes can discover the running elasticsearch cluster
bin/wonderdog --rm --index_name=users --bulk_size=4096 --object_type=user --field_names=name,login,email,description --id_field=1 /hdfs/path/to/users.tsv /tmp/failed_records/users
```

Notice the output path. When the bulk indexing job runs it is possible for index requests to fail for various reasons (too much load, etc). In this case the documents that failed are simply written to HDFS so they can be retried in a later job.

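One way to handle those failures is to point a follow-up run at the failed-records directory; this is a sketch that assumes the failed records are written back in the same tsv layout as the input:

```
# sketch: retry the records that failed in the first pass, writing any new failures elsewhere
bin/wonderdog --rm --index_name=users --bulk_size=4096 --object_type=user --field_names=name,login,email,description --id_field=1 /tmp/failed_records/users /tmp/failed_records/users_retry
```
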
* Refresh Index

After the bulk load is finished you'll want to refresh the index so your documents will actually be searchable:

```
bin/estool refresh --index users
```

* Snapshot Index

You'll definitely want to do this after the bulk load finishes so you don't lose any data in case of cluster failure:

```
bin/estool snapshot --index users
```

* Bump the replicas for the index up to at least one.

```
bin/estool set_replication --index users --replicas=1
```

This will take a while to finish and the cluster health will show yellow until it does.

* Optimize the index

```
bin/estool optimize --index users -s 3
```

This will also take a while to finish.

#### TSV loader command-line options

* ```index_name``` - Index to write data to. It does not have to exist ahead of time
* ```object_type``` - Type of object to index. The mapping for this object does not have to exist ahead of time. Fields will be updated dynamically by elasticsearch.
* ```field_names``` - A comma separated list of field names describing the tsv record input
* ```id_field``` - Index of the field to use as the object id (counting from 0; default 1); use -1 if there is no id field
* ```bulk_size``` - Number of records per bulk request sent to the elasticsearch cluster
* ```es_home``` - Path to the elasticsearch installation; read from the ES_HOME environment variable if it's set
* ```es_config``` - Path to the elasticsearch config file (```elasticsearch.yml```)
* ```rm``` - Remove existing output? (true, or leave blank)
* ```hadoop_home``` - Path to the hadoop installation; read from the HADOOP_HOME environment variable if it's set
* ```min_split_size``` - Minimum split size for maps

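Putting a few of these together (the paths and values are illustrative, not canonical):

```
# sketch: let elasticsearch assign document ids, and point at a specific config file
bin/wonderdog --rm --index_name=users --object_type=user \
  --field_names=name,login,email,description --id_field=-1 \
  --bulk_size=1000 --es_config=/etc/elasticsearch/elasticsearch.yml \
  /hdfs/path/to/users.tsv /tmp/failed_records/users
```
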
## Admin

There are a number of convenience commands in ```bin/estool```. Most of the common REST API operations have been mapped. Enumerating a few:

* Print status of all indices as a json hash to the terminal

```
# See everything (tmi)
bin/estool -c <elasticsearch_host> status
```

* Check cluster health (red, green, yellow, relocated shards, etc.)

```
bin/estool -c <elasticsearch_host> health
```

* Set replicas for an index

```
bin/estool set_replication -c <elasticsearch_host> --index <index_name> --replicas <num_replicas>
```

* Optimize an index

```
bin/estool optimize -c <elasticsearch_host> --index <index_name>
```

* Snapshot an index

```
bin/estool snapshot -c <elasticsearch_host> --index <index_name>
```

* Delete an index

```
bin/estool delete -c <elasticsearch_host> --index <index_name>
```

## Bulk Loading Tips for the Risk-seeking Dangermouse

The file examples/bulkload_pageviews.pig shows an example of bulk loading elasticsearch, including preparing the index.

### Elasticsearch Setup

Some tips for an industrial-strength cluster, assuming exclusive use of machines and no read load during the job (a startup sketch follows the list):

* Use multiple machines with a fair bit of RAM (7+GB). Heap doesn't help too much for loading though, so you don't have to go nuts: we do fine with Amazon m1.large instances.
* Allocate a sizeable heap, setting min and max equal, and
    - turn `bootstrap.mlockall` on, and run `ulimit -l unlimited`.
    - For example, for a 3GB heap: `-Xmx3000m -Xms3000m -Delasticsearch.bootstrap.mlockall=true`
    - Never use a heap above 12GB or so; it's dangerous (STW compaction timeouts).
    - You've succeeded if the full heap size is resident on startup: that is, in htop both the VMEM and RSS are 3000 MB or so.
* Temporarily increase the `index_buffer_size`, to say 40%.

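As a rough sketch of those heap settings in practice (the exact startup flags and the `ES_JAVA_OPTS` hook may differ between elasticsearch versions; treat this as an assumption to verify against yours):

```
# raise the mlock limit so the locked heap is allowed
ulimit -l unlimited

# start elasticsearch with min == max heap and mlockall on (flags from the list above)
ES_JAVA_OPTS="-Xms3000m -Xmx3000m -Delasticsearch.bootstrap.mlockall=true" bin/elasticsearch

# the index buffer bump typically lives in elasticsearch.yml as:
#   indices.memory.index_buffer_size: 40%
```
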
### Temporary Bulk-load settings for an index

To prepare a database for bulk loading, the following settings may help. They are *EXTREMELY* aggressive, and include knocking the replication factor back to 1 (zero replicas). One false step and you've destroyed Tokyo.

Actually, you know what? Never mind. Don't apply these, they're too crazy.

    curl -XPUT 'localhost:9200/wikistats/_settings?pretty=true' -d '{"index": {
      "number_of_replicas": 0, "refresh_interval": -1, "gateway.snapshot_interval": -1,
      "translog": { "flush_threshold_ops": 50000, "flush_threshold_size": "200mb", "flush_threshold_period": "300s" },
      "merge.policy": { "max_merge_at_once": 30, "segments_per_tier": 30, "floor_segment": "10mb" },
      "store.compress": { "stored": true, "tv": true } } }'

To restore your settings, in case you didn't destroy Tokyo:

    curl -XPUT 'localhost:9200/wikistats/_settings?pretty=true' -d '{"index": {
      "number_of_replicas": 2, "refresh_interval": "60s", "gateway.snapshot_interval": "3600s",
      "translog": { "flush_threshold_ops": 5000, "flush_threshold_size": "200mb", "flush_threshold_period": "300s" },
      "merge.policy": { "max_merge_at_once": 10, "segments_per_tier": 10, "floor_segment": "10mb" },
      "store.compress": { "stored": true, "tv": true } } }'

If you did destroy your database, please send your resume to jobs@infochimps.com as you begin your job hunt. It's the reformed sinner that makes the best missionary.