Release Notes Heritrix 3.4.0 20190207
Andy Jackson edited this page Feb 19, 2019 · 1 revision
For an overview of this release, see A New Release of Heritrix 3.
The following summarises the changes since late 2017.
- Fix ToeThread death when using HighestUriPrecedenceProvider
- Add checks to guard against server sending 304 in error
- Ensure scope logs have different names per job
- HBase fixes: #222, #243, #224
- Do not checkpoint if crawl job has not started
- Resolved significant BDBFrontier thread safety issues
- Fix HTML extractor not handling the base href correctly when it is relative
- Catch exceptions scoping outlinks to stop them from derailing process
- Fix for test failures in a workspace on NFS-mounted filesystem
- Enforce robots.txt character limit per char not per line
- Treat a failed fetch (e.g. socket timeout) of robots.txt the same way as HTTP errors
- Allow JavaDNS to be disabled as part of resolving outstanding build and test issues
- WARCLimitEnforcer.java - Add support for multiple WARC writers.
- By default, only execute decide rules if they might change the outcome. NOTE: this is potentially a breaking change if decide rules have side effects you rely on!
- Can now deploy to Maven Central
- Add a simple way to launch the latest checkpoint
- Add parameter to allow even distribution for parallel queues.
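
The note above about decide rules with side effects can be illustrated with a minimal sketch. It is not Heritrix's actual code: the `Rule` interface, `CountingAcceptRule`, and `evaluate` are hypothetical names invented for this example, and it assumes a simple model in which the last non-`NONE` decision wins and a rule is skipped when the only decision it can produce equals the current one.

```java
import java.util.List;

public class DecideRuleSketch {
    enum Decision { ACCEPT, REJECT, NONE }

    /** Hypothetical rule interface for illustration; not the Heritrix DecideRule API. */
    interface Rule {
        /** The only decision this rule can produce, apart from NONE. */
        Decision onlyDecision();
        Decision decide(String uri);
    }

    /** Accept-only rule with a side effect: it counts every URI it is asked about. */
    static class CountingAcceptRule implements Rule {
        int seen = 0;
        public Decision onlyDecision() { return Decision.ACCEPT; }
        public Decision decide(String uri) {
            seen++; // side effect that is lost whenever the rule is skipped
            return uri.startsWith("http") ? Decision.ACCEPT : Decision.NONE;
        }
    }

    /** Applies rules in order; the last non-NONE decision wins. */
    static Decision evaluate(List<Rule> rules, String uri, boolean skipRedundant) {
        Decision current = Decision.NONE;
        for (Rule rule : rules) {
            // Skip a rule whose only possible decision matches the current one:
            // running it could not change the outcome.
            if (skipRedundant && rule.onlyDecision() == current) {
                continue;
            }
            Decision d = rule.decide(uri);
            if (d != Decision.NONE) {
                current = d;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        CountingAcceptRule first = new CountingAcceptRule();
        CountingAcceptRule second = new CountingAcceptRule();
        Decision result = evaluate(List.of(first, second), "http://example.org/", true);
        // prints "ACCEPT first=1 second=0"
        System.out.println(result + " first=" + first.seen + " second=" + second.seen);
    }
}
```

In this model the final decision is the same with or without skipping, but the second rule's counter only advances when skipping is disabled. Any rule that logs, counts, or mutates state while deciding is affected in the same way.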