Zimit is a scraper that creates a ZIM file from any website.
This version of Zimit runs a single-site, headless-Chrome-based crawl in a Docker container and produces a ZIM of the crawled content.
`zimit.py` is the entrypoint for the system.
After the crawl is done, `warc2zim` is used to write a ZIM to the `/output` directory, which can be mounted as a volume. Using the `--keep` flag, the crawled WARCs will also be kept in a temp directory inside `/output`.
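The pipeline above can be sketched end to end as follows. The host directory, URL, and ZIM name are illustrative, and the exact name of the WARC temp directory is assumed, not specified here:

```shell
# Crawl a site, keep the WARCs, and mount ./output as the container's /output.
docker run -v "$PWD/output:/output" --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
  --shm-size=1gb openzim/zimit zimit --url https://example.com/ \
  --name example --keep

# warc2zim writes the ZIM into the mounted directory; with --keep,
# a temp directory of WARCs remains alongside it.
ls -lh output/
```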
Zimit is intended to be run in Docker. To build locally, run:

```
docker build -t openzim/zimit .
```
The image accepts the following parameters:
- `--url URL` - the URL to be crawled (required)
- `--workers N` - number of crawl workers to run in parallel
- `--waitUntil` - Puppeteer setting for how long to wait for page load. See the `page.goto` waitUntil options. The default is `load`, but for static sites, `--waitUntil domcontentloaded` may be used to speed up the crawl (for example, to avoid waiting for ads to load).
- `--name` - name of the ZIM file (defaults to the hostname of the URL)
- `--output` - output directory (defaults to `/output`)
- `--limit U` - limit the capture to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex. Can be specified multiple times.
- `--scroll [N]` - if set, activates a simple auto-scroll behavior on each page, scrolling for up to N seconds
- `--keep` - if set, keeps the WARC files in a temp directory inside the output directory
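The `--exclude` flag takes a regular expression matched against candidate URLs. The `grep -E` matching below is only an illustration of how such a pattern behaves, not zimit's exact regex engine:

```shell
# A pattern that skips URLs ending in .pdf or .zip:
pattern='\.(pdf|zip)$'

# URLs matching the pattern would be excluded from the crawl:
echo 'https://example.com/files/report.pdf' | grep -qE "$pattern" && echo "excluded"

# Other URLs would still be crawled:
echo 'https://example.com/index.html' | grep -qE "$pattern" || echo "crawled"
```

With a pattern like this, you would add `--exclude '\.(pdf|zip)$'` to the crawl invocation.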
The following is an example usage. The `--cap-add` and `--shm-size` flags are needed to run Chrome in Docker:
```
docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
  --shm-size=1gb openzim/zimit zimit --url URL --name myzimfile \
  --workers 2 --waitUntil domcontentloaded
```
`puppeteer-cluster` provides monitoring output, which is enabled by default and prints the crawl status to the Docker log.
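If the container is run detached, the same monitoring output can be followed with `docker logs` (the container name below is just an illustration):

```shell
# Start the crawl in the background under a known container name.
docker run -d --name zimit-crawl -v /output:/output --cap-add=SYS_ADMIN \
  --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit zimit \
  --url https://example.com/ --name example

# Follow the crawl status as it is written to the Docker log.
docker logs -f zimit-crawl
```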
A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.
That version is now considered outdated and archived in