Demonstration of searching PDF document with Solr, Tika, and Tesseract

PDF.JS End-to-End Search / Snippets / Highlighting

This is a demonstration project that leverages the Solr Payload Component and the Offset Highlighter Component to make PDF documents both searchable and able to highlight matched text in context.
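As a rough sketch of how per-word position data can ride along with indexed tokens, Solr's DelimitedPayloadTokenFilter accepts field values in `token|payload` form. The coordinate encoding below is illustrative only, not this project's actual schema:

```javascript
// Sketch: encoding per-word page/coordinate data as delimited payloads,
// the "token|payload" format Solr's DelimitedPayloadTokenFilter expects.
// The word/bbox object shape is a made-up example, not the demo's schema.
function toPayloadField(words) {
  return words
    .map(w => `${w.text}|${w.page},${w.x0},${w.y0},${w.x1},${w.y1}`)
    .join(' ');
}

const field = toPayloadField([
  { text: 'invoice', page: 1, x0: 100, y0: 200, x1: 180, y1: 220 },
  { text: 'total',   page: 1, x0: 190, y0: 200, x1: 230, y1: 220 },
]);
// field: "invoice|1,100,200,180,220 total|1,190,200,230,220"
```

At query time, the payload for a matched term can then be read back to place a highlight box over the PDF image.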


Just run docker-compose up and then browse to http://localhost:8080. You will need to wait until the init process finishes loading all of the Solr documents before the website works properly.

Solr is running on http://localhost:8983, with a CORS version at http://localhost:8000, and PDF images are served up on http://localhost:8443.
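When highlighting is enabled (hl=true), Solr returns a highlighting section keyed by document id. The sketch below pulls snippets out of that shape; the response literal is a minimal hand-written example, not captured from this demo's Solr instance:

```javascript
// Sketch: extracting snippets from a Solr JSON response with hl=true.
// The response object is an illustrative minimal shape.
const response = {
  response: { docs: [{ id: 'bcreg20090424a1' }] },
  highlighting: {
    bcreg20090424a1: { text: ['... the <em>liquor</em> licence ...'] },
  },
};

function snippetsById(res) {
  const out = {};
  for (const [id, fields] of Object.entries(res.highlighting || {})) {
    // Flatten all highlighted fields into one snippet list per document.
    out[id] = Object.values(fields).flat();
  }
  return out;
}

const snips = snippetsById(response);
// snips.bcreg20090424a1 → ['... the <em>liquor</em> licence ...']
```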

If you have already run the demo, you may need to run docker-compose down -v first to clear the old volumes.

You can also access this demo at

I want to learn how this works!

There are several things you can learn from this project. They are written up on the wiki:

  1. Using Tika/Tesseract standalone, outside of Solr.
  2. Using Tika/Tesseract as an API exposed by Solr via the ExtractingRequestHandler.
  3. Parsing Tika/Tesseract output inside Solr via the StatelessScriptUpdateProcessorFactory.
  4. Tesseract 3 and Tika.
  5. Storing binary data in Solr and serving it up like an object store!


Vue Component

cd into the pdf-viewer directory.


npm install
npm run serve

To build a new version of the standalone PDF viewer module, run the build script from inside the pdf-viewer directory:


The script will build and copy the dist directory contents into /app/pdfviewer/.

Text Extraction and OCR process

From the ./ocr/ directory, there are some PowerShell ( ;-) ) scripts you can use to recreate the files.

  1. Make sure you have Tesseract installed. brew install tesseract on OSX.

  2. Check the configuration file under ./tika-properties/.../ and make sure it points to your Tesseract setup.

  3. Run the extraction process, creating the working docs in the /extracts directory from the PDFs in /files.

pwsh extract-directory.ps1 ./files
  4. Create Solr documents.
pwsh create-solr-docs.ps1 ./extracts ./files ./docs_for_solr/

Interested in manually extracting content from Tika Server?

From the ./ocr/ directory run:

curl -T files/bcreg20090424a1.pdf --header "X-Tika-OCRLanguage: eng" --header "X-Tika-PDFOcrStrategy: ocr_and_text_extraction" --header "X-Tika-OCRoutputType: hocr"
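With X-Tika-OCRoutputType set to hocr, Tika returns hOCR markup in which each recognized word sits in an ocrx_word span whose title attribute carries "bbox x0 y0 x1 y1". A minimal sketch of recovering those boxes (the hOCR string below is a hand-written fragment, not actual output from this demo's files):

```javascript
// Sketch: pulling word bounding boxes out of an hOCR fragment.
// hOCR encodes each word as <span class='ocrx_word' title='bbox x0 y0 x1 y1; ...'>word</span>.
const hocr = `<span class='ocrx_word' title='bbox 393 300 548 339; x_wconf 91'>British</span>
<span class='ocrx_word' title='bbox 560 300 721 339; x_wconf 90'>Columbia</span>`;

function wordBoxes(html) {
  const re = /class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]+)</g;
  const words = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    words.push({ text: m[5], bbox: m.slice(1, 5).map(Number) });
  }
  return words;
}

const boxes = wordBoxes(hocr);
// boxes[0] → { text: 'British', bbox: [393, 300, 548, 339] }
```

A regex is enough for a sketch; a real pipeline would use a proper HTML parser.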

Run the Demo using Docker

docker-compose down -v
docker-compose build
docker-compose up

And then browse to http://localhost:8080/

To see payloads in action in Solr, run:


Building Docker images

Build the docker images from scratch via:

docker-compose build

Deploy to our private Docker registry

docker login

docker tag pdf-discovery-demo_solr-proxy
docker tag pdf-discovery-demo_app
docker tag pdf-discovery-demo_solr
docker tag pdf-discovery-demo_init

docker push
docker push
docker push
docker push