PDF.JS End-to-End Search / Snippets / Highlighting

This demonstration project uses the Solr Payload Component (https://github.com/o19s/payload-component), the Offset Highlighter Component (https://github.com/o19s/offset-hl-formatter), and pdf.js to make PDF documents searchable and to highlight matches in context within the PDF itself.

Online Demo!

Check out how amazing this is at http://pdf-discovery-demo.dev.o19s.com/ ;-)

Quickstart!

Just run docker-compose up --build and then browse to http://localhost:8080. The website will not work properly until the init process has finished loading all of the Solr documents, so give it some time.

Solr is running on http://localhost:8983, and PDF images are served up on http://localhost:8443.
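
Waiting for the init process can be scripted rather than guessed at. The sketch below polls the documents core on the Solr port and returns once at least a given number of documents is indexed. The function names num_found and wait_for_docs, the SOLR_URL variable, and the threshold you pass in are assumptions for illustration; this README does not say how many documents the init process loads.

```shell
#!/bin/sh
# Hypothetical helpers; neither function ships with this project.

# Read a Solr JSON response on stdin and print its numFound value.
num_found() {
  sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Poll Solr every 5 seconds until at least $1 documents are indexed.
wait_for_docs() {
  min_docs=$1
  url="${SOLR_URL:-http://localhost:8983}/solr/documents/select?q=*:*&rows=0&wt=json"
  while :; do
    count=$(curl -s "$url" | num_found)
    # An empty count (Solr not up yet) is treated as 0.
    if [ "${count:-0}" -ge "$min_docs" ]; then
      echo "Solr has $count documents"
      return 0
    fi
    sleep 5
  done
}
```

For example, wait_for_docs 100 returns once Solr reports 100 or more documents; pick a threshold that matches your data set.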

If you have already run the demo, you may need to run docker-compose down -v first to remove the old volumes.

I want to learn how this works!

There are actually a couple of things that you can learn from this project. They are written up on the wiki.

Development

Vue Component

cd into the pdf-viewer directory.

Development

npm install
npm run serve

To build a new version of the PDF viewer standalone module

From inside the pdf-viewer directory:

./build.sh

The script builds the module and copies the contents of the dist directory into /app/pdfviewer/.

Text Extraction and OCR process

The ./ocr/ directory contains some PowerShell ( ;-) ) scripts that you can use to recreate the extracted files.

cd ./ocr
Make sure you have Tesseract installed; on macOS, run brew install tesseract. Alternatively, check whether the extract.ps1 script is pointing at the hosted pdf-discovery-demo version of Tika ;-) If it is, that's okay.
Check the ./tika-properties/.../TesseractOCRConfig.properties file and make sure it points to your Tesseract setup.
Run the extraction process, which creates the working docs in the ./ocr/extracts directory from the PDFs in ./ocr/files. We already follow a pattern of paired directories: input filesN and output extractsN.

pwsh extract-directory.ps1 ./files ./extracts

Create Solr documents.

The output will end up in a docs_for_solrN directory.

pwsh create-solr-docs.ps1 ./extracts ./files ./docs_for_solr/

Update the scripts for any new docs_for_solrN folder:
Add it to the ./ocr/init/Dockerfile COPY command.
You will also need to add it to the ./app/Dockerfile COPY command.
Update ./ocr/init/init.sh to load the files.
Now stand up the app with docker-compose up --build.
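
The extraction and document-creation commands above follow the filesN / extractsN / docs_for_solrN naming pattern, so the pair of commands for a given N can be generated rather than typed. This is a minimal sketch: ocr_commands is a hypothetical helper, and it assumes the suffix is appended directly to each directory name, which is how the pattern is described above.

```shell
# Hypothetical helper: print the two pwsh commands for directory suffix $1.
# Assumes directories named files$1, extracts$1 and docs_for_solr$1.
ocr_commands() {
  n=$1
  echo "pwsh extract-directory.ps1 ./files$n ./extracts$n"
  echo "pwsh create-solr-docs.ps1 ./extracts$n ./files$n ./docs_for_solr$n/"
}
```

Run ocr_commands 2 from the ./ocr directory to review the commands, or ocr_commands 2 | sh to execute them. The Dockerfile and init.sh updates still have to be made by hand.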

Interested in manually extracting content from Tika Server?

From the ./ocr/ directory run:

curl -T files/bcreg20090424a1.pdf http://pdf-discovery-demo.dev.o19s.com:9998/rmeta --header "X-Tika-OCRLanguage: eng" --header "X-Tika-PDFOcrStrategy: ocr_and_text_extraction" --header "X-Tika-OCRoutputType: hocr"

Using Solr Packages 101

Make sure your solrconfig.xml has the name com.o19s.hl.OffsetFormatter instead of the old com.o19s.labs.OffsetFormatter:

<formatter name="html"
           default="true"
           class="com.o19s.hl.OffsetFormatter">
</formatter>
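
The class-name check can be automated before a deploy. The sketch below is an assumption-laden convenience, not part of the project: check_formatter_class is a hypothetical function name, and the path to solrconfig.xml is whatever you pass in. It only looks for the two class names mentioned above as plain strings; it does not parse the XML.

```shell
# Hypothetical helper: verify which OffsetFormatter class a solrconfig.xml uses.
# Prints "ok" and returns 0 for the new class, 1 for the old class, 2 if neither appears.
check_formatter_class() {
  config=$1
  if grep -qF 'com.o19s.labs.OffsetFormatter' "$config"; then
    echo "old class name com.o19s.labs.OffsetFormatter found in $config" >&2
    return 1
  fi
  if grep -qF 'com.o19s.hl.OffsetFormatter' "$config"; then
    echo "ok"
    return 0
  fi
  echo "no OffsetFormatter configured in $config" >&2
  return 2
}
```

Call it as check_formatter_class path/to/solrconfig.xml and use the exit status in a build script.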

Delete the old offset-hl-formatter-1.0.1-solr7.1.0-SNAPSHOT.jar and solr-payloads-1.0.3-solr7.1.0-SNAPSHOT.jar jars from the deployment process. We have nice shiny packages now!

Make sure Solr is package-enabled on startup, which needs another parameter: -Denable.packages=true. (Let's verify the install script.)

Run the Demo using Docker

docker-compose down -v
docker-compose build
docker-compose up

And then browse to http://localhost:8080/

To see payloads in action in Solr, run this query:

http://localhost:8983/solr/documents/select?fl=id,content,path,page_dimensions&hl=on&hl.snippets=10&hl.fl=content&indent=on&q=taxes&wt=json&pl=on&echoParams=all
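
The query above can be wrapped so that only the search term changes. In this sketch, solr_highlight_url and the SOLR_URL variable are hypothetical names; the parameters are copied unchanged from the URL above. The term is not URL-encoded, so this only suits simple single-word queries.

```shell
# Hypothetical helper: print the highlighting/payload query URL for search term $1.
# SOLR_URL is an assumed override; it defaults to the local Solr from this README.
solr_highlight_url() {
  term=$1
  base=${SOLR_URL:-http://localhost:8983}
  printf '%s/solr/documents/select?fl=id,content,path,page_dimensions&hl=on&hl.snippets=10&hl.fl=content&indent=on&q=%s&wt=json&pl=on&echoParams=all\n' "$base" "$term"
}
```

With the demo running, curl -s "$(solr_highlight_url taxes)" fetches the same response as opening the URL in a browser.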

Building Docker images

Build the Docker images from scratch via:

docker-compose build

Deploy to our private Docker registry at http://harbor.dev.o19s.com:

docker login harbor.dev.o19s.com

docker tag pdf-discovery-demo-app harbor.dev.o19s.com/pdf-discovery-demo/app
docker tag pdf-discovery-demo-solr harbor.dev.o19s.com/pdf-discovery-demo/solr
docker tag pdf-discovery-demo-init harbor.dev.o19s.com/pdf-discovery-demo/init

docker push harbor.dev.o19s.com/pdf-discovery-demo/app
docker push harbor.dev.o19s.com/pdf-discovery-demo/solr
docker push harbor.dev.o19s.com/pdf-discovery-demo/init
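
The tag and push commands repeat the same pattern for app, solr and init, so they can be generated in a loop. This is a sketch only: publish_commands is a hypothetical helper, and it prints the commands instead of running them so that you can review them first. Unlike the listing above, it pairs each tag with its push rather than tagging all three images first.

```shell
# Hypothetical helper: print docker tag/push commands for the three demo images.
# $1 optionally overrides the registry host (defaults to the one used above).
publish_commands() {
  registry=${1:-harbor.dev.o19s.com}
  for name in app solr init; do
    echo "docker tag pdf-discovery-demo-$name $registry/pdf-discovery-demo/$name"
    echo "docker push $registry/pdf-discovery-demo/$name"
  done
}
```

After docker login, run publish_commands | sh to execute the generated commands.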
