- Validate segments generated by realtime against raw data in HDFS.
- Rebuild segments that have discrepancies from raw data in HDFS.
- Collapse existing segments into lower granularity segments.
```
$ bin/dumbo
You must supply -s PATH!
Usage: bin/dumbo (options)
    -d, --database PATH              path to database config, defaults to "database.json"
    -D, --debug                      Enable debug output
    -N, --dryrun                     do not submit tasks to overlord (dry-run)
    -e, --environment ENVIRONMENT    Set the daemon environment
        --force                      force segment generation regardless of state
    -i, --interval INTERVAL          force an explicit interval
    -l, --limit LIMIT                limit the number of tasks to spawn (defaults to unlimited)
    -m, --mode MODE                  mode to perform (verify, merge, compact)
        --name NAME                  Process name
    -n, --namenodes LIST             HDFS namenodes (comma separated), defaults to "localhost"
    -f, --offset HOURS               offset from now used as interval end, defaults to 2 hours
    -o, --overlord HOST[:PORT]       overlord hostname and port, defaults to "localhost:8090"
    -r, --reverse BOOL               run jobs in reverse order
    -s, --sources PATH               path to sources config (required)
    -t, --topics LIST                Topics to process (comma separated), defaults to all in sources.json
    -w, --window HOURS               scan window in hours, defaults to 24 hours
    -z, --zookeeper URI              zookeeper URI, defaults to "localhost:2181/druid"
        --zookeeper-path PATH        druid's discovery path within zookeeper, defaults to "/discovery"
    -h, --help                       Show this message
```
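A typical invocation might look like this (hostnames and file paths below are illustrative):

```shell
# Dry-run verify over the default 24-hour window, without submitting
# any tasks to the overlord:
bin/dumbo -m verify -s sources.json -N \
  -n namenode1,namenode2 \
  -o overlord.example.com:8090
```

Dropping `-N` submits the resulting tasks to the overlord for real.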
The repo contains examples for database.json and sources.json.
Assumptions / Notes
- HDFS contains data as gzipped files in Gobblin-style folders
- database.json is used to initialize Sequel
- sources.json uses keys in the format "service/dataSource", as established in ruby-druid
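A hypothetical sources.json fragment illustrating the "service/dataSource" key format (all field values here are made up; the example file bundled in the repo is authoritative):

```json
{
  "events/app_events": {
    "input": {
      "epoc": "2020-01-01T00:00:00Z"
    }
  }
}
```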
Verify uses Gobblin counters in HDFS to compare the total number of events in HDFS against the total in Druid. To do this, it relies on a hard-coded count aggregation named "events".
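The comparison can be pictured as follows (a minimal Ruby sketch; `events_match?`, the counter key, and the in-memory inputs are stand-ins for the real Gobblin counter files and Druid query results):

```ruby
# Illustrative only: Gobblin records per-folder counters in HDFS; a
# verify pass sums them and compares the total against the "events"
# count aggregation returned by Druid for the same interval.
def events_match?(gobblin_counters, druid_rows)
  # Total records Gobblin reported writing for each input folder.
  hdfs_total = gobblin_counters.sum { |c| c["RECORDS_WRITTEN"] }

  # Each Druid result row is assumed to carry the "events" count.
  druid_total = druid_rows.sum { |row| row["events"] }

  hdfs_total == druid_total
end

counters = [{ "RECORDS_WRITTEN" => 100 }, { "RECORDS_WRITTEN" => 50 }]
rows     = [{ "events" => 150 }]
events_match?(counters, rows) # => true
```

A mismatch is what flags a segment for rebuilding from the raw HDFS data.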
If source['input']['epoc'] is set, verify will not extend the interval beyond this point. This is useful if you know your HDFS data is incomplete and want to keep the existing segments.
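The clamping itself is simple; a hedged sketch (the method name `clamp_interval` is made up, and plain `Time` values stand in for the real interval handling):

```ruby
require "time"

# Illustrative only: if the source config declares an "epoc", the
# verify interval must not start before it, so incomplete HDFS data
# older than the epoc never triggers a rebuild of existing segments.
def clamp_interval(start_time, end_time, epoc)
  start_time = epoc if epoc && start_time < epoc
  [start_time, end_time]
end

epoc  = Time.parse("2020-01-01T00:00:00Z")
start = Time.parse("2019-12-31T00:00:00Z")
stop  = Time.parse("2020-01-02T00:00:00Z")
clamp_interval(start, stop, epoc) # start is moved forward to the epoc
```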
Compacting verifies segmentGranularity and schema.
All tasks are spawned as Druid 0.17 native tasks.
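For orientation, a Druid 0.17 native (parallel) ingestion task spec has roughly the following shape; the dataSource, dimensions, and paths below are invented, and note the "events" count aggregation the verify mode depends on:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "app_events",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "device"] },
      "metricsSpec": [
        { "type": "count", "name": "events" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE"
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "hdfs",
        "paths": "/data/gobblin/app_events/2020/01/01"
      },
      "inputFormat": { "type": "json" }
    }
  }
}
```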
- Fork it
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request
Based on remerge/dumbo (a from-scratch rewrite of druid-dumbo v1).