Embulk is a plugin-based parallel bulk data loader that helps transfer data between various storage systems, databases, NoSQL stores, and cloud services.
You can release plugins to share your work on data cleaning, error handling, transaction control, and retrying. Packaging that work into plugins brings OSS-style development to data scripts, which tend to be one-time ad hoc scripts.
Embulk, an open-source plugin-based parallel bulk data loader at Slideshare
The single-file package is the simplest way to try Embulk. You can download the latest embulk-VERSION.jar from the releases page and run it with java.
The following 4 commands install Embulk to your home directory:
curl --create-dirs -o ~/.embulk/bin/embulk -L https://bintray.com/artifact/download/embulk/maven/embulk-0.4.8.jar
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
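The download step above hardcodes version 0.4.8. A minimal sketch of parameterizing it follows. It assumes other releases follow the same URL pattern, which is inferred only from the single URL shown here. The helper names embulk_jar_url and install_embulk are hypothetical and are not part of Embulk.

```shell
#!/bin/sh
# Hypothetical helper: build the download URL for a given Embulk version.
# Assumes every release follows the URL pattern of the 0.4.8 jar above.
embulk_jar_url() {
    printf 'https://bintray.com/artifact/download/embulk/maven/embulk-%s.jar\n' "$1"
}

# Hypothetical helper: download the jar to ~/.embulk/bin/embulk
# and make it executable, as the first two commands above do.
install_embulk() {
    version="$1"
    curl --create-dirs -o "$HOME/.embulk/bin/embulk" -L "$(embulk_jar_url "$version")" &&
        chmod +x "$HOME/.embulk/bin/embulk"
}

# Usage: install_embulk 0.4.8
```

The PATH setup in ~/.bashrc is left out of the sketch on purpose, so that running it twice does not append the same export line again.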
On Windows, you can treat the jar file as a .bat file:
curl -o embulk.bat -L https://bintray.com/artifact/download/embulk/maven/embulk-0.4.8.jar
Let's load a CSV file, for example. The embulk example subcommand generates a CSV file and a config file for you.
embulk example ./try1
embulk guess ./try1/example.yml -o config.yml
embulk preview config.yml
embulk run config.yml
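The four steps above can be chained so that a failure in one step stops the rest. This is only a sketch: the function name embulk_try and its default arguments are hypothetical, and it uses nothing beyond the subcommands already shown.

```shell
#!/bin/sh
# Hypothetical wrapper: run the trial steps in order,
# stopping at the first step that fails.
embulk_try() {
    dir="${1:-./try1}"          # where the example CSV and example.yml are generated
    config="${2:-config.yml}"   # where the guessed configuration is written

    embulk example "$dir" &&
        embulk guess "$dir/example.yml" -o "$config" &&
        embulk preview "$config" &&
        embulk run "$config"
}

# Usage: embulk_try ./try1 config.yml
```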
You can use plugins to load data from and to various systems and file formats. An example is the embulk-output-postgres-json plugin. It outputs data into a PostgreSQL server using the "json" column type.
embulk gem install embulk-output-postgres-json
embulk gem list
You can find plugins on RubyGems by searching for "embulk".
The embulk bundle subcommand creates a plugin bundle directory, or updates it if it already exists.
You can use the bundle with the -b <bundle_dir> option. embulk bundle also generates some example plugins in the <bundle_dir>/embulk/*.rb directory.
See the generated <bundle_dir>/Gemfile file for how plugin bundles work.
embulk bundle ./embulk_bundle
embulk guess -b ./embulk_bundle ...
embulk run -b ./embulk_bundle ...
TODO: documents
embulk-plugin-xyz
Embulk supports resuming failed transactions.
To enable resuming, you need to start the transaction with the -r PATH option:
embulk run config.yml -r resume-state.yml
If the transaction fails, Embulk stores its state in the YAML file. You can retry the transaction using exactly the same command:
embulk run config.yml -r resume-state.yml
If you give up on resuming the transaction, you can use the embulk cleanup subcommand to delete intermediate data:
embulk cleanup config.yml -r resume-state.yml
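The resume workflow above can be sketched as a retry loop. The function name run_with_resume, the attempt limit, and the policy of cleaning up after the last failed attempt are all assumptions of this sketch, not Embulk behavior. It relies only on the run and cleanup commands shown above.

```shell
#!/bin/sh
# Hypothetical helper: retry a resumable run, then clean up if we give up.
# The attempt limit and cleanup-on-give-up policy are assumptions of this sketch.
run_with_resume() {
    config="$1"
    state="$2"
    max_attempts="${3:-3}"
    attempt=1

    while [ "$attempt" -le "$max_attempts" ]; do
        # The same command both starts the transaction and resumes a failed one.
        if embulk run "$config" -r "$state"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done

    # Giving up: delete the intermediate data left by the failed transaction.
    embulk cleanup "$config" -r "$state"
    return 1
}

# Usage: run_with_resume config.yml resume-state.yml 3
```

Blind retries only help with transient failures. If the run fails because of a configuration error, it is better to fix the config before resuming.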
./gradlew cli # creates pkg/embulk-VERSION.jar
./gradlew gem # creates pkg/embulk-VERSION.gem
You can see JaCoCo's test coverage report at ${project}/build/reports/tests/index.html
You can see the FindBugs report at ${project}/build/reports/findbug/main.html
# FIXME coverage information is not included somehow
You can run the classpath task to use ./bin/embulk for development:
./gradlew classpath # -x test: skip test
./bin/embulk
To deploy artifacts to your local maven repository at ~/.m2/repository/:
./gradlew install
To compile only the source code of the embulk-core project:
./gradlew :embulk-core:compileJava
The dependencies task shows the dependency tree of the embulk-core project:
./gradlew :embulk-core:dependencies
Embulk uses Sphinx, YARD (Ruby API) and JavaDoc (Java API) for document generation.
brew install python
pip install sphinx
gem install yard
./gradlew site
# documents are: embulk-docs/build/html
You need to add your Bintray account information to ~/.gradle/gradle.properties:
bintray_user=(bintray user name)
bintray_api_key=(bintray api key)
Run the following commands and follow their instructions:
./gradlew set_version -Pto=$VERSION
./gradlew releaseCheck
./gradlew release
git commit -am v$VERSION
git tag v$VERSION
See also: