Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add XMLparser recover option for client; keep default behavior same: recover=False #39

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sdm7g
Copy link

@sdm7g sdm7g commented Apr 12, 2019

Adds an option in Client and BaseClient to use recover=True option on etree.XMLParser, so that OAI harvesting won't fail on a bad metadata payload.
Default is recover=False, so that it doesn't change any existing behavior ( Making the conservative choice here, just in case anyone is relying on catching failures to validate feeds. )

Initially, I tried parsing first with recover=False, catching errors and then retrying with recover=True, but I discovered that was unnecessary.

  1. I though parsing with recover=True might be slower, but when there are no parser errors to recover from, I could not measure any time penalty for using that option.
  2. I wanted to complete the harvest, but I also wanted to catch and log errors so that I could notify upstream maintainers of the feed that there were issues to be fixed. I discovered that self.XMLParser.error_log will contain any errors if they occur, and that lxml has a facility to pass lxml errors to the Python logging error log module by calling tree.use_global_python_log(). Logging should be configured first before using that facility, so to enable, use something like this in the calling program:
from lxml import etree
import logging
etree.use_global_python_log(etree.PyErrorLog())
logging.basicConfig()

That will output a line like this, for example, on an unescaped ampersand:
CRITICAL:root:<string>:47:219:FATAL:PARSER:ERR_ENTITYREF_SEMICOL_MISSING: EntityRef: expecting ';'

lxml documents how that message can be changed or filtered, if for example, you want to change those 'CRITICAL' to 'WARNING' , but I didn't think that was worth adding to the base code if it's not already using logging module.

@bloomonkey
Copy link

👍
I'm the author and maintainer of oai-harvest and would love to see this PR merged, to avoid having to maintain a fork of the Client class in that project 🙏

Copy link
Collaborator

@wetneb wetneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Merging soon unless there are any concerns.

@interrogator
Copy link

interrogator commented Apr 24, 2020

Looks good to me! Mergeable in my opinion @jascoul, would love to bring down the use of forks with patches for ppl like @bloomonkey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants