tika-spark-example

This repository gives a minimal example of using Tika in Spark on EMR. Please read the corresponding article for a full walkthrough. The tika-core and tika-parsers JARs supplied here were built with Apache commons-compress shaded, to avoid dependency conflicts between Spark and Tika. This issue and the solution used to be documented here. To run the example as quickly as possible, run deploy.sh and supply the necessary S3 paths as documented in the script.
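For orientation, a Tika-in-Spark job can be sketched roughly as follows. This is not the code in src/main/scala; the object name TikaExtractSketch, the argument order, and the tab-separated output format are assumptions made for illustration. It reads each input file as a binary stream, extracts its text with the Tika facade, and writes one line per document.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.tika.Tika

// Hypothetical sketch, not the repository's actual job.
// Usage (assumed): TikaExtractSketch <inputPath> <outputPath>
object TikaExtractSketch {
  def main(args: Array[String]): Unit = {
    require(args.length >= 2, "expected <inputPath> <outputPath>")
    val inputPath = args(0)
    val outputPath = args(1)

    val spark = SparkSession.builder().appName("tika-extract-sketch").getOrCreate()

    // binaryFiles yields (path, PortableDataStream) pairs, one per file.
    val extracted = spark.sparkContext.binaryFiles(inputPath).mapPartitions { files =>
      // Tika is not Serializable, so build one instance per partition
      // on the executor instead of capturing one from the driver.
      val tika = new Tika()
      files.map { case (path, stream) =>
        val in = stream.open()
        try {
          // Collapse whitespace so each document stays on a single output line.
          val text = tika.parseToString(in).replaceAll("\\s+", " ")
          s"$path\t$text"
        } finally {
          in.close()
        }
      }
    }

    extracted.saveAsTextFile(outputPath)
    spark.stop()
  }
}
```

A document that Tika cannot parse would fail its task in this sketch; a real job would probably wrap the parse call in scala.util.Try and log or skip the failures.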

Build

To build the project, ensure you have sbt installed, then run sbt assembly. The resulting JAR can be found in the target directory, and is equivalent to tika-spark-assembly-1.0.jar supplied here.
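The JARs in lib are already shaded, so the build needs no extra relocation step. If you build against unshaded Tika artifacts instead, sbt-assembly can apply a comparable relocation at assembly time. The fragment below is a sketch of that alternative, not the contents of this repository's build.sbt. The target package prefix shaded.commons.compress is an arbitrary name chosen for illustration.

```scala
// build.sbt fragment (assumes the sbt-assembly plugin is enabled in plugins.sbt).
// Older sbt-assembly versions spell the key as `assemblyShadeRules in assembly`.
assembly / assemblyShadeRules := Seq(
  // Relocate commons-compress classes and rewrite references to them in all JARs
  // bundled into the assembly, so they cannot clash with the version Spark ships.
  ShadeRule
    .rename("org.apache.commons.compress.**" -> "shaded.commons.compress.@1")
    .inAll
)
```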

Run

To run the code on AWS, ensure that you have a local key with full permissions for Elastic MapReduce (EMR). You also need an S3 bucket containing the documents whose text you would like to extract (labeled input-bucket below), a bucket where the extracted text will be written (output-bucket), and a bucket where JARs and other resources will be stored (resource-bucket). These can also be different subdirectories within a single S3 bucket. Given all this, run sh deploy.sh s3://input-bucket s3://output-bucket s3://resource-bucket. You should see the cluster appear in the EMR console. The job will execute, and the cluster will be destroyed upon completion.
