mkreisel/tika-spark-example

tika-spark-example

This repository gives a minimal example of using Apache Tika in Spark on EMR. Please read the corresponding article for a full walkthrough. The tika-core and tika-parsers JARs supplied here were built with Apache commons-compress shaded, to avoid dependency conflicts between Spark and Tika. This issue and its solution used to be documented here. If you want to run the example as quickly as possible, run deploy.sh and supply the necessary S3 paths as documented in the script.

Build

To build the project, ensure you have sbt installed, then run sbt assembly. The resulting JAR is written to the target directory and is equivalent to the tika-spark-assembly-1.0.jar supplied here.
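
As a rough sketch of the build step, the shell function below locates the assembly JAR after sbt assembly has run. The function name find_assembly_jar is hypothetical, and the *assembly*.jar pattern is an assumption based on the tika-spark-assembly-1.0.jar name; the exact subdirectory under target depends on the Scala version sbt builds against.

```shell
# Print the path of an assembly JAR found under a build directory.
# Assumption: the build leaves a file matching *assembly*.jar somewhere
# below the directory given as $1 (default: target).
find_assembly_jar() {
  build_dir="${1:-target}"
  # Take the first match; errors from a missing directory are silenced
  # so that the function reports them through its exit status instead.
  jar="$(find "$build_dir" -type f -name '*assembly*.jar' 2>/dev/null | head -n 1)"
  if [ -z "$jar" ]; then
    echo "no assembly JAR under $build_dir; run 'sbt assembly' first" >&2
    return 1
  fi
  printf '%s\n' "$jar"
}
```

After a build, find_assembly_jar target prints the JAR path. If nothing has been built yet, it prints a hint to stderr and returns a non-zero status.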

Run

To run the code on AWS, ensure that you have a local key with full permissions for Elastic MapReduce (EMR). You also need three S3 locations: a bucket containing the documents whose text you would like to extract (labeled input-bucket below), a bucket where the extracted text will be written (output-bucket), and a bucket where JARs and other resources will be stored (resource-bucket). These can also be different subdirectories within a single S3 bucket. Given all this, run sh deploy.sh s3://input-bucket s3://output-bucket s3://resource-bucket. You should see the cluster appear in the EMR console. The job will execute and the cluster will be destroyed upon completion.
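
Because a mistyped path only shows up once the cluster is already starting, it can help to check the three arguments first. The wrapper below is a minimal sketch of that idea and is not part of this repository: is_s3_uri, run_deploy and the DRY_RUN variable are hypothetical names, and the only thing assumed about deploy.sh is the argument order shown above (input, output, resources).

```shell
# Succeed only if the argument looks like an S3 URI: s3:// plus a name.
is_s3_uri() {
  case "$1" in
    s3://?*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Validate the three locations, then pass them to deploy.sh unchanged.
# DRY_RUN=1 prints the command instead of running it; this switch exists
# only in this sketch and is not a feature of deploy.sh.
run_deploy() {
  if [ "$#" -ne 3 ]; then
    echo "usage: run_deploy s3://input s3://output s3://resources" >&2
    return 2
  fi
  for uri in "$@"; do
    if ! is_s3_uri "$uri"; then
      echo "not an S3 URI: $uri" >&2
      return 2
    fi
  done
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sh deploy.sh $1 $2 $3"
  else
    sh deploy.sh "$1" "$2" "$3"
  fi
}
```

Running DRY_RUN=1 run_deploy s3://input-bucket s3://output-bucket s3://resource-bucket shows the command that would be executed without creating a cluster. The check is purely syntactic: it does not confirm that the buckets exist or that your key can access them.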
