Release Notes - Heritrix 3.4.0-20220727

Alex Osborne edited this page Jul 30, 2022 · 15 revisions

Summary of changes since Heritrix 3.4.0-20210923. See the full changelog for more details.

Additions

  • JDK 18 compatibility
  • DNS over HTTPS support
  • SOCKS5 proxy support
  • robotsTxtOnly robots policy, which obeys robots.txt rules but ignores HTML robots meta tags
  • CandidatesProcessor gained a seedsRedirectNewSeedsAllowTLDs option, which can be used to stop bare top-level domains from being added as additional seeds when an initial seed redirects to them #461
  • Configurable URL matching and extraction for sitemaps #441

Changes

  • Dependencies updated:
    • Spring Framework 5.3.20
    • JSch 0.1.52
    • gson 2.8.9

Removals

  • Dependencies removed:
    • JNA (symlink creation is now done using the standard Java API)
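For readers wondering what "the standard Java API" refers to here, the following is a minimal sketch of creating a symlink with java.nio.file.Files.createSymbolicLink, with no native library involved. It is not the Heritrix code: the class name SymlinkSketch, the helper names, and the file names "job-20220727" and "latest" are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration class; not part of Heritrix.
final class SymlinkSketch {

    // Points `link` at `target`, replacing any existing entry at `link`.
    // deleteIfExists removes the link itself, never the file it points to.
    static Path replaceSymlink(Path link, Path target) {
        try {
            Files.deleteIfExists(link);
            return Files.createSymbolicLink(link, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a link in a fresh temp directory and reports whether it
    // resolves back to the intended target.
    static boolean linkRoundTrips() {
        try {
            Path dir = Files.createTempDirectory("symlink-sketch");
            Path target = Files.createFile(dir.resolve("job-20220727"));
            Path link = replaceSymlink(dir.resolve("latest"), target);
            return Files.isSymbolicLink(link)
                    && Files.readSymbolicLink(link).equals(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("symlink round trip ok: " + linkRoundTrips());
    }
}
```

Note that some platforms (for example Windows without the required privilege) may refuse to create symbolic links, in which case createSymbolicLink throws an exception.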

Bugfixes

  • Fixed HTML srcset attribute only matching in lowercase #477
  • Fixed sitemap links (M) being considered transclusions when limiting hop depth #469
  • Fixed "java.lang.NoClassDefFoundError: Could not initialize class org.archive.util.CLibrary" on Apple Silicon #467
  • Fixed Heritrix crashing on unexpected characters in the Content-Length header #449
  • Fixed StringIndexOutOfBoundsException on exact major Java versions like the first JDK 18 release #439
  • Fixed dnsjava NIO selector thread consuming 100% CPU after terminating job #425
  • Fixed <link href=...> tags being treated as embed (E) links for rel values where they shouldn't be #263
  • Fixed "RIS already open for ToeThread..." exception when crawling https pages via a proxy #191
  • Fixed setting maxLogFileSize in BDBModule #464
  • Fixed "group id is too big" error when building on some systems #448
  • Fixed CrawlURI.hashCode() NullPointerException sometimes breaking Browse Beans #488
  • Fixed ownership conflict over /tmp/Crashpad when running the Chrome Extractor under different users
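
As an illustration of the kind of failure behind #439, the sketch below parses the major version out of a java.version string without assuming it contains a dot, so a bare "18" is handled as well as "1.8.0_292" or "11.0.2". This is an assumption-laden example, not the actual Heritrix fix: the class JavaVersionSketch and the method majorVersion are hypothetical names.

```java
// Hypothetical illustration class; not part of Heritrix.
final class JavaVersionSketch {

    // Returns the major Java version encoded in a java.version style string.
    static int majorVersion(String version) {
        // Legacy scheme: "1.8.0_292" means Java 8, so drop the "1." prefix.
        if (version.startsWith("1.")) {
            version = version.substring(2);
        }
        // Take the leading run of digits; stop at '.', '-', '_' or end of string.
        int end = 0;
        while (end < version.length() && Character.isDigit(version.charAt(end))) {
            end++;
        }
        if (end == 0) {
            throw new IllegalArgumentException("No major version in: " + version);
        }
        return Integer.parseInt(version.substring(0, end));
    }

    public static void main(String[] args) {
        String running = System.getProperty("java.version");
        System.out.println(running + " -> major " + majorVersion(running));
    }
}
```

On Java 10 and later, Runtime.version().feature() returns the same number directly; the manual parsing above is only needed when older runtimes must also be supported.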