Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.2.0).

[Unreleased]

[1.6.3] - 2024-01-18

Changed

  • Adapt to new warc2zim code structure
  • Using browsertrix-crawler 0.12.4
  • Using warc2zim 1.5.5

Added

  • New optional --build parameter to specify the directory holding Browsertrix files; if not set, the --output directory is used. zimit creates one subdirectory of this folder per invocation to isolate datasets; the subdirectory is kept only if --keep is set.
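
The directory handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not zimit's actual code: the function name `run_in_build_dir`, its parameters, the `.tmp` prefix, and the placeholder file are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_build_dir(output, build=None, keep=False):
    """Create a per-invocation working directory; remove it unless keep is set.

    Hypothetical helper for illustration only.
    """
    # Fall back to the output directory when no build directory is given.
    parent = Path(build) if build else Path(output)
    parent.mkdir(parents=True, exist_ok=True)

    # One fresh, uniquely named subdirectory per invocation isolates datasets.
    workdir = Path(tempfile.mkdtemp(dir=parent, prefix=".tmp"))
    try:
        # Stand-in for the crawl writing its files (assumption).
        (workdir / "placeholder.txt").write_text("crawl data")
        return workdir
    finally:
        # The subdirectory survives only when keep is requested.
        if not keep:
            shutil.rmtree(workdir, ignore_errors=True)
```

Using a uniquely named subdirectory means two invocations sharing the same parent folder do not overwrite each other's files.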

Fixed

  • --collection parameter was not working (#252)

[1.6.2] - 2023-11-17

Changed

  • Using browsertrix-crawler 0.12.3

Fixed

  • Fix logic passing args to crawler to support value '0' (#245)
  • Fix documentation about Chrome and headless (#248)
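
The '0' fix (#245) concerns a common pitfall when forwarding optional arguments: a truthiness check silently drops legitimate falsy values such as 0. A minimal sketch of the safer pattern, assuming a plain dictionary of options; `build_crawler_args` is a hypothetical helper, not zimit's actual code:

```python
def build_crawler_args(options):
    """Turn a dict of option names and values into a flat CLI argument list.

    Hypothetical helper for illustration only.
    """
    args = []
    for name, value in options.items():
        # Skip options that were not provided. Comparing against None
        # (rather than `if value:`) keeps legitimate falsy values such as 0.
        if value is None:
            continue
        if value is True:
            # Boolean flag: emitted without a value.
            args.append(f"--{name}")
        elif value is not False:
            # Identity checks are used because 0 == False in Python.
            args.extend([f"--{name}", str(value)])
    return args
```

With this approach, an option set to 0 is still forwarded as `--name 0`, while unset options and disabled flags are omitted.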

[1.6.1] - 2023-11-06

Changed

  • Using browsertrix-crawler 0.12.1

[1.6.0] - 2023-11-02

Changed

  • Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
  • User-Agent now has a default value (#228)
  • Manipulation of spaces with UA suffix and adminEmail has been modified
  • Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
  • Using browsertrix-crawler 0.12.0

[1.5.3] - 2023-10-02

Changed

  • Using browsertrix-crawler 0.11.2

[1.5.2] - 2023-09-19

Changed

  • Using browsertrix-crawler 0.11.1

[1.5.1] - 2023-09-18

Changed

  • Using browsertrix-crawler 0.11.0
  • Scraper stat file is not created empty (#211)
  • Crawler statistics are not available anymore (#213)
  • Using warc2zim 1.5.4

[1.5.0] - 2023-08-23

Added

  • --long-description param

[1.4.1] - 2023-08-23

Changed

  • Using browsertrix-crawler 0.10.4
  • Using warc2zim 1.5.3

[1.4.0] - 2023-08-02

Added

  • --title to set ZIM title
  • --description to set ZIM description
  • New crawler options: --maxPageLimit, --delay, --diskUtilization
  • --zim-lang param to set warc2zim's --lang (ISO-639-3)

Changed

  • Using browsertrix-crawler 0.10.2
  • Default and accepted values for --waitUntil from crawler's update
  • Using warc2zim 1.5.2
  • Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
  • --failOnFailedSeed used unconditionally
  • --lang now passed to crawler (ISO-639-1)

Removed

  • --newContext from crawler's update

[1.3.1] - 2023-02-06

Changed

  • Using browsertrix-crawler 0.8.0
  • Using warc2zim version 1.5.1 with wabac.js 2.15.2

[1.3.0] - 2023-02-02

Added

  • Initial URL check normalizes homepage redirects to standard ports – 80/443 (#137)

Changed

  • Using warc2zim version 1.5.0 with scope conflict fix and videos fix
  • Using browsertrix-crawler 0.8.0-beta.1
  • Fixed --allowHashUrls being a boolean param
  • Increased check_url timeout from 10s to 12s (connect) and 27s (read)

[1.2.0] - 2022-06-21

Added

  • --urlFile browsertrix crawler parameter
  • --depth browsertrix crawler parameter
  • --extraHops parameter
  • --collection browsertrix crawler parameter
  • --allowHashUrls browsertrix crawler parameter
  • --userAgentSuffix browsertrix crawler parameter
  • --behaviors parameter
  • --behaviorTimeout browsertrix crawler parameter
  • --profile browsertrix crawler parameter
  • --sizeLimit browsertrix crawler parameter
  • --timeLimit browsertrix crawler parameter
  • --healthCheckPort parameter
  • --overwrite parameter

Changed

  • using browsertrix-crawler 0.6.0 and warc2zim 1.4.2
  • default WARC location after crawl changed from collections/capture-*/archive/ to collections/crawl-*/archive/

Removed

  • --scroll browsertrix crawler parameter (see --behaviors)
  • --scope browsertrix crawler parameter (see --scopeType, --include and --exclude)

[1.1.5]

  • using crawler 0.3.2 and warc2zim 1.3.6

[1.1.4]

  • Defaults to load,networkidle0 for waitUntil param (same as crawler)
  • Allows setting combinations of values for waitUntil param
  • Updated warc2zim to 1.3.5
  • Updated browsertrix-crawler to 0.3.1
  • WARC files to convert to ZIM are now written to {temp_root_dir}/collections/capture-*/archive/, where capture-* is dynamic and includes the datetime (from browsertrix-crawler)

[1.1.3]

  • allows same first-level-domain redirects
  • fixed redirects to URL in scope
  • updated crawler to 0.2.0
  • statsFilename now informs whether limit was hit or not

[1.1.2]

  • added support for --custom-css
  • added domains block list (default)

[1.1.1]

  • updated browsertrix-crawler to 0.1.4
    • autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets

[1.0]

  • initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3