All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.2.0).
- Adapt to new warc2zim code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5
- New --build parameter (optional) to specify the directory holding Browsertrix files; if not set, the --output directory is used. zimit creates one subdir of this folder per invocation to isolate datasets; the subdir is kept only if --keep is set.
- --collection parameter was not working (#252)
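
The per-invocation directory behaviour described for --build can be sketched as follows. This is a minimal illustration, not zimit's actual code: the function names `make_run_dir` and `cleanup_run_dir` and the `.tmp` prefix are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def make_run_dir(output: Path, build: Optional[Path] = None) -> Path:
    """Create a fresh per-invocation subdirectory (hypothetical helper)."""
    # Fall back to the output directory when no build directory is given
    parent = build if build is not None else output
    parent.mkdir(parents=True, exist_ok=True)
    # mkdtemp picks a unique name, so separate runs never share a dataset
    return Path(tempfile.mkdtemp(dir=parent, prefix=".tmp"))


def cleanup_run_dir(run_dir: Path, keep: bool) -> None:
    """Remove the per-invocation directory unless it should be kept."""
    if not keep:
        shutil.rmtree(run_dir, ignore_errors=True)
```

Calling `make_run_dir(Path("output"))` twice would yield two distinct subdirectories of `output`, and `cleanup_run_dir(run_dir, keep=False)` removes one of them once the run is over.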
- Using browsertrix-crawler 0.12.3
- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)
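
The '0' fix above (#245) concerns a common pitfall: testing an option with a plain truthiness check drops legitimate values such as 0. The sketch below shows one way to avoid it; `build_crawler_args` and the option names in the usage line are hypothetical and do not reflect zimit's actual implementation.

```python
from typing import Any, Dict, List


def build_crawler_args(options: Dict[str, Any]) -> List[str]:
    """Turn parsed options into a crawler argument list (hypothetical helper)."""
    args: List[str] = []
    for name, value in options.items():
        # Identity checks, not truthiness: 0 and "0" must be passed through
        if value is None or value is False:
            continue  # option not set
        if value is True:
            args.append(f"--{name}")  # boolean flag, no value
            continue
        args.extend([f"--{name}", str(value)])
    return args
```

With this approach, `build_crawler_args({"depth": 0, "delay": None})` returns `["--depth", "0"]`, whereas a check like `if value:` would have silently skipped the depth option.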
- Using browsertrix-crawler 0.12.1
- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0
- Using browsertrix-crawler 0.11.2
- Using browsertrix-crawler 0.11.1
- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4
- --long-description param
- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3
- --title to set ZIM title
- --description to set ZIM description
- New crawler options: --maxPageLimit, --delay, --diskUtilization
- --zim-lang param to set warc2zim's --lang (ISO-639-3)
- Using browsertrix-crawler 0.10.2
- Default and accepted values for --waitUntil from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- --failOnFailedSeed used unconditionally
- --lang now passed to crawler (ISO-639-1)
- --newContext from crawler's update
- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2
- Initial URL check normalizes homepage redirects to standard ports (80/443) (#137)
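
To illustrate the port normalization mentioned above (#137), here is a minimal sketch of treating an explicit default port as equivalent to no port. It uses only the standard library; `strip_default_port` is a hypothetical name and this is not necessarily how zimit implements the check.

```python
from urllib.parse import urlsplit, urlunsplit

# Default port for each scheme handled here
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Drop an explicit port when it is the scheme's default (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        host = parts.hostname or ""
        if ":" in host:  # IPv6 literal needs its brackets back
            host = f"[{host}]"
        # Keep any user:password@ prefix that was part of the netloc
        userinfo = parts.netloc.rpartition("@")[0]
        netloc = f"{userinfo}@{host}" if userinfo else host
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)
```

Under this sketch, `https://example.com:443/` and `https://example.com/` normalize to the same string, while a non-default port such as 8443 is left untouched.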
- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed --allowHashUrls being a boolean param
- Increased check_url timeout (12s to connect, 27s to read) instead of 10s
- --urlFile browsertrix crawler parameter
- --depth browsertrix crawler parameter
- --extraHops parameter
- --collection browsertrix crawler parameter
- --allowHashUrls browsertrix crawler parameter
- --userAgentSuffix browsertrix crawler parameter
- --behaviors parameter
- --behaviorTimeout browsertrix crawler parameter
- --profile browsertrix crawler parameter
- --sizeLimit browsertrix crawler parameter
- --timeLimit browsertrix crawler parameter
- --healthCheckPort parameter
- --overwrite parameter
- using browsertrix-crawler 0.6.0 and warc2zim 1.4.2
- default WARC location after crawl changed from collections/capture-*/archive/ to collections/crawl-*/archive/
- --scroll browsertrix crawler parameter (see --behaviors)
- --scope browsertrix crawler parameter (see --scopeType, --include and --exclude)
- using crawler 0.3.2 and warc2zim 1.3.6
- Defaults to load,networkidle0 for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- Warc to zim now written to {temp_root_dir}/collections/capture-*/archive/ where capture-* is dynamic and includes the datetime (from browsertrix-crawler)
- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- statsFilename now informs whether limit was hit or not
- added support for --custom-css
- added domains block list (default)
- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets
- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3