Bdb frontier access #289

Merged

@csrster csrster commented Dec 10, 2019

First, I should explain that this pull request builds on the existing pull request #281.

The request adds four new methods to the Frontier - both interface and implementation.

The methods are all accessors of various kinds, and none of them is called from within Heritrix. Their purpose is to expose Frontier internals to querying and manipulation by external third-party tools via the scripting console, so there should be very little risk that the changes will in themselves cause any instability.

The only non-trivial method of the four is exportPendingUris(). It writes all pending URIs to a file cache, where a suitable tool can browse and search them to determine which, if any, one might want to delete manually (after pausing the job).
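
As a rough illustration of that export-then-inspect workflow, here is a minimal Java sketch. It does not use the real Heritrix classes: `PendingUriSource`, `exportToFile` and `findCandidates` are hypothetical names invented for this example, and the method shape shown for `exportPendingUris` is an assumption, not the signature added by this pull request.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PendingUriExportSketch {

    /** Hypothetical stand-in for the frontier; NOT the actual Heritrix interface. */
    public interface PendingUriSource {
        /** Writes one pending URI per line and returns how many were written. */
        long exportPendingUris(PrintWriter writer);
    }

    /** Dumps all pending URIs from the source into a plain-text file. */
    public static long exportToFile(PendingUriSource source, Path file) throws IOException {
        try (PrintWriter writer = new PrintWriter(
                Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            return source.exportPendingUris(writer);
        }
    }

    /** Searches the exported file for URIs containing the given substring. */
    public static List<String> findCandidates(Path file, String needle) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(needle))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up URIs standing in for the frontier's pending queue.
        List<String> pending = List.of(
                "http://example.org/",
                "http://example.org/calendar?month=1",
                "http://example.org/calendar?month=2");

        PendingUriSource fakeFrontier = writer -> {
            pending.forEach(writer::println);
            return pending.size();
        };

        Path file = Files.createTempFile("pending-uris", ".txt");
        long exported = exportToFile(fakeFrontier, file);
        List<String> candidates = findCandidates(file, "calendar");
        System.out.println(exported + " exported, " + candidates.size() + " deletion candidates");
        Files.delete(file);
    }
}
```

In practice the export would be triggered from the scripting console against the running job's frontier, and any deletion would happen separately after the job has been paused.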

These methods are taken directly from the NetarchiveSuite fork of Heritrix which means that the code has been tested at scale (with je 5.0.104) at both Netarkivet in Denmark and BnF.

csrster commented Dec 11, 2019

The reason why we need the bdb/je upgrade is the bug described under point 2 here: https://download.oracle.com/otndocs/products/berkeleydb/html/je/je-5.0.104_changelog.html

I believe that in older versions of je the cursor used to write the frontier data to disk is not transactionally isolated from changes made to the frontier by ongoing harvesting, and this causes errors in bdb that can't be handled by the Java code.

@anjackson
Collaborator

This LGTM and these localised changes shouldn't impact elsewhere.

@anjackson anjackson merged commit e0c82f7 into internetarchive:master Mar 4, 2020