Rika is a JRuby wrapper for the Apache Tika Java library, which extracts text and metadata from files and resources of many different formats.

Caution: This gem only works with JRuby.
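Because the gem depends on JRuby, a script that may run under other interpreters can check the engine before loading it. The following is a minimal sketch; `require_rika_if_supported` is a hypothetical helper name and is not part of Rika.

```ruby
# Hypothetical guard, not part of Rika: load the gem only under JRuby.
# Returns true if rika was required, false if the engine is unsupported.
def require_rika_if_supported(engine = RUBY_ENGINE)
  unless engine == 'jruby'
    warn "Rika needs JRuby; current engine is #{engine}."
    return false
  end
  require 'rika'
  true
end
```

Calling `require_rika_if_supported` at the top of a script lets it exit or fall back cleanly instead of failing when the gem is loaded.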

Rika currently supports some basic and commonly used functions of Tika. Future development may add Ruby support for more Tika functionality, and perhaps a command line interface as well. See the Other Tika Resources section for alternatives to Rika that may suit more demanding needs.


Usage

For a quick start, Rika provides convenience functions that handle the simplest use cases in a single call:

require 'rika'

content           = Rika.parse_content('x.pdf')    # string containing all content text
metadata          = Rika.parse_metadata('x.pdf')   # hash containing the document metadata
content, metadata = Rika.parse_content_and_metadata('x.pdf')   # both of the above

A URL can be used instead of a filespec wherever a data source is specified:

content, metadata = Rika.parse_content_and_metadata('')
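When several files or URLs need parsing, the single-call functions can be wrapped in a small batch helper. This is only a sketch: `parse_all` is a hypothetical name, and the parsing step is passed in as a callable so the helper itself does not depend on Rika. Under JRuby you would presumably pass `Rika.method(:parse_content_and_metadata)`.

```ruby
# Hypothetical helper, not part of Rika: parse several sources and collect
# the results, recording Ruby-level failures instead of aborting the batch.
# `parse` is any callable that takes a source and returns [content, metadata].
def parse_all(sources, parse)
  sources.each_with_object({}) do |source, results|
    begin
      content, metadata = parse.call(source)
      results[source] = { content: content, metadata: metadata }
    rescue StandardError => e
      # Keep going; remember why this source failed.
      results[source] = { error: e.message }
    end
  end
end

# Possible usage under JRuby (assumes rika is installed):
#   results = parse_all(%w[a.pdf b.docx], Rika.method(:parse_content_and_metadata))
```

Only `StandardError` is rescued here; Java exceptions raised by Tika may need separate handling.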

For other use cases and finer control, you can work directly with the Rika::Parser object:

require 'rika'

parser = Rika::Parser.new('x.pdf')

# Return the content of the document:
parser.content

# Return the metadata of the document:
parser.metadata

# Return the media type for the document, e.g. "application/pdf":
parser.media_type

# Return only the first 10000 chars of the content:
parser = Rika::Parser.new('x.pdf', 10000)
parser.content # 10000 first chars returned

# Return content from URL
parser = Rika::Parser.new('', 200)

# Return the language for the content
parser = Rika::Parser.new('german-document.pdf')
parser.language # => "de"

# Check whether the language identification is certain enough to be trusted
parser.language_is_reasonably_certain?
Simple Command Line Use

Since Ruby supports the -r option to require a library, and the -e option to evaluate a string of code, you can easily do simple parsing on the command line, such as:

ruby -r rika -e 'puts Rika.parse_content("x.pdf")'

You could also parse the metadata and output it as JSON as follows:

ruby -r rika -r json -e 'puts Rika.parse_metadata("x.pdf").to_json'

If you want to get both content and metadata in JSON format, this would do that:

ruby -r rika -r json -e 'c,m = Rika.parse_content_and_metadata("x.pdf"); puts({ c: c, m: m }.to_json)'

Using the rexe gem, that can be made much more concise:

rexe -r rika -oj 'c,m = Rika.parse_content_and_metadata("x.pdf"); { c: c, m: m }'

...and changing the -oj option gives you access to other output formats such as "Pretty JSON", YAML, and AwesomePrint (a very human readable format).
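If rexe is not available, similar output formats can be produced with Ruby's standard `json` and `yaml` libraries. This is a rough sketch rather than a description of how rexe works; `render_document` is a hypothetical helper, and the `c` and `m` keys simply mirror the one-liners above.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper, not part of Rika or rexe: render content and metadata
# as compact JSON, pretty JSON, or YAML.
def render_document(content, metadata, format = :json)
  document = { 'c' => content, 'm' => metadata }
  case format
  when :json        then JSON.generate(document)
  when :pretty_json then JSON.pretty_generate(document)
  when :yaml        then document.to_yaml
  else raise ArgumentError, "unsupported format: #{format.inspect}"
  end
end

# Possible usage under JRuby (assumes rika is installed):
#   c, m = Rika.parse_content_and_metadata('x.pdf')
#   puts render_document(c, m, :yaml)
```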

Installation

Add this line to your application's Gemfile. Use gem or jgem depending on your JRuby installation:

gem 'rika' # or: jgem 'rika'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rika  # or: jgem install rika

Other Tika Resources

Credits

Richard Nyström (@ricn) is the original author of Rika, but has not been able to maintain it since 2015. In July 2020, Richard transferred the project to Keith Bennett (@keithrbennett), who had made some contributions back in 2013.

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

