Erik Körner edited this page Dec 13, 2021 · 1 revision

Building images

Build the default Heritrix docker image for version 3.4.0-20210923 as follows:

docker build --build-arg version=3.4.0-20210923 -t iipc/heritrix .

To use the heritrix-contrib release, build the image with the following command:

docker build --build-arg version=3.4.0-20210923 --build-arg java=8-jre -t iipc/heritrix:contrib -f Dockerfile.contrib .

Note that iipc/heritrix:contrib currently runs only with Java 8, not Java 11 (JRE/JDK).

Build args

To be supplied with --build-arg key=value:

Name     Description
version  Heritrix Maven release version
java     Java Docker base image: 11-jre for heritrix, 8-jre for heritrix-contrib
user     Custom user that runs Heritrix in the container (default: heritrix)
userid   ID of the heritrix user (default: 1000)
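
As a minimal sketch of how the build args listed above combine, the snippet below assembles a single `docker build` call that sets all four. The user name `crawler` and the id `1001` are placeholder values chosen for illustration, not project defaults. The command is only printed, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: compose a `docker build` call from the documented build args.
# "crawler" and "1001" are placeholder values, not project defaults.
version="3.4.0-20210923"
java="11-jre"        # use 8-jre for the heritrix-contrib image
user="crawler"
userid="1001"

build_cmd="docker build \
--build-arg version=${version} \
--build-arg java=${java} \
--build-arg user=${user} \
--build-arg userid=${userid} \
-t iipc/heritrix ."

# Print the command for review instead of running it directly.
echo "$build_cmd"
```

Once the printed command looks right, run it from the folder that contains the Dockerfile, for example with `eval "$build_cmd"`.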

Building images using the Makefile

The docker/ folder contains a Makefile that wraps the common build steps. Build the images with:

make (image|image-contrib|image-all) [version=3.4.0-20210923]
# e.g. basic latest release image:
make image

Supply a specific Heritrix version with version=3.4.0-20210923.

Publish the built images for version 3.4.0-20210923 to the iipc user with the following command:

make (publish|publish-contrib|publish-all) version=3.4.0-20210923 repo=iipc/

Build multiple releases with:

make image-all-version

Test run your images with:

make (run|run-contrib) [version=3.4.0-20210923] [repo=iipc/]

Running Heritrix under Docker

See the Heritrix Documentation about running the Docker images.

Basic run commands:

# run it in foreground
# -it required for clean stopping with Ctrl+C
# --rm removes the container and its anonymous volumes afterwards
docker run --rm -it iipc/heritrix

# run it in background
# --name is optional but makes the container easier to find for stopping
docker run -d --name heritrix_container iipc/heritrix
# logs
docker logs heritrix_container
# stop it
docker stop heritrix_container

Configuring it for real™ usage:

# --init : use tini init wrapper
# --rm : remove container after exit
# -it : runs docker interactively (pseudo TTY)
# -d : detach, run container in background
# -p : map public api port of 8443 (host) to 8443 (container port)
# -e : set environment variables for user/pass of REST API
#      JAVA_OPTS=-Xmx1024M (to restrict heritrix memory usage)
# -v : mount local folder into container (to persist job results)
#      on windows/WSL[2] volume mounts might not work (container files are not in local folder?)
#      heritrix is installed at /opt/heritrix
#      heritrix jobs are at   /opt/heritrix/jobs
docker run --init --rm -d \
    --name heritrix_container \
    -p 8443:8443 \
    -e "USERNAME=admin" -e "PASSWORD=admin" \
    -v "$(pwd)/jobs:/opt/heritrix/jobs" \
    iipc/heritrix
# or mount a credentials file into the container (docker secrets?)
echo "admin:admin" > "$(pwd)/creds"
docker run --init --rm -d \
    --name heritrix_container \
    -p 8443:8443 \
    -e "CREDSFILE=/opt/heritrix/creds.txt" \
    -v "$(pwd)/creds:/opt/heritrix/creds.txt" \
    -v "$(pwd)/jobs:/opt/heritrix/jobs" \
    iipc/heritrix
# switch `-d` with `-it` to run it interactively (see log, quit with Ctrl+C)
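
The option notes above mention `JAVA_OPTS=-Xmx1024M` for restricting memory usage, but the run commands do not set it. The sketch below adds it to the same command. The wrapper name `run_heritrix_limited` is hypothetical, and the 1024M default is just the example value from the notes.

```shell
#!/bin/sh
# Sketch: the run command from above with a JVM heap limit added.
# run_heritrix_limited is a hypothetical wrapper, not part of the image.
run_heritrix_limited() {
    heap="${1:-1024M}"   # heap limit passed to -Xmx, defaults to 1024M
    docker run --init --rm -d \
        --name heritrix_container \
        -p 8443:8443 \
        -e "USERNAME=admin" -e "PASSWORD=admin" \
        -e "JAVA_OPTS=-Xmx${heap}" \
        -v "$(pwd)/jobs:/opt/heritrix/jobs" \
        iipc/heritrix
}
```

Call `run_heritrix_limited` for the default limit, or pass a different value such as `run_heritrix_limited 2G`.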

Run a single job:

# docker options the same as above
# * specify with -e "JOBNAME=<jobname>" the job that should be run
# * mount the job folder with the crawler-beans.cxml to the <jobname>
#   folder within the container
# * the crawl will start immediately
docker run --init --rm -d \
    --name heritrix_container \
    -p 8443:8443 \
    -e "USERNAME=admin" -e "PASSWORD=admin" -e "JOBNAME=myjob" \
    -v "$(pwd)/myjob:/opt/heritrix/jobs/myjob" \
    iipc/heritrix
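
Because the crawl starts immediately, it can help to confirm that the job folder really contains a crawler-beans.cxml before starting the container. The following is a minimal sketch of such a check. `check_job_dir` and `run_job` are hypothetical helper names, not part of Heritrix or the image.

```shell
#!/bin/sh
# Sketch: verify a job folder before handing it to the container.
# check_job_dir and run_job are hypothetical helpers for illustration.
check_job_dir() {
    job_dir="$1"
    if [ ! -d "$job_dir" ]; then
        echo "job folder not found: $job_dir" >&2
        return 1
    fi
    if [ ! -f "$job_dir/crawler-beans.cxml" ]; then
        echo "missing crawler-beans.cxml in: $job_dir" >&2
        return 1
    fi
    return 0
}

# Start the single-job container only if the job folder looks usable.
run_job() {
    jobname="$1"
    check_job_dir "$(pwd)/$jobname" || return 1
    docker run --init --rm -d \
        --name heritrix_container \
        -p 8443:8443 \
        -e "USERNAME=admin" -e "PASSWORD=admin" -e "JOBNAME=$jobname" \
        -v "$(pwd)/$jobname:/opt/heritrix/jobs/$jobname" \
        iipc/heritrix
}
```

Run it from the folder that contains the job folder, for example `run_job myjob`.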

Run other tools, e.g. the hoppath.pl script:

# the last two lines are the relative path to the job log
# (in the container) as well as the URI_PREFIX
docker run -it --rm \
    -v "$(pwd)/myjob:/opt/heritrix/jobs/myjob" \
    --entrypoint bin/hoppath.pl \
    iipc/heritrix \
    jobs/myjob/latest/logs/crawl.log \
    https://
