Need way to handle XML with entity references #5

Closed
sarneaud opened this issue Apr 18, 2018 · 3 comments

sarneaud commented Apr 18, 2018

Hi, I have an XML document that happens to contain references to entities in its DTD. For my use case, I don't care about interpreting them, but the references are still there. I get the tradeoff dxml makes in not supporting the DTD, but currently I can't use dxml to process this document at all because an XMLParsingException gets thrown.

It would be useful to have a way to work around this case.

How about supporting a hook like immutable(ElementType!R)[] translateEntityRef(ref R reference) (i.e., returns a string for a char range, wstring for wchar range, etc.)? The hook either returns the value of the reference, or null if the reference isn't supported. The idea is that the same function implementation could be used in the config for parseXML and as a hook for normalize (of course, separate implementations could be used if needed for performance reasons) and existing functions like parseStdEntityRef could be adapted to fit the same interface.
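
To make the proposal concrete, here is a minimal sketch of what such a hook might look like. It is an illustration only, not dxml API: the name translateEntityRef comes from the proposal above, but this version is simplified to take a complete reference as a string instead of consuming a range by ref, and the &company; entity and its value are made up for the example.

```d
// Hypothetical hook, specialised to string input for brevity.
// Returns the replacement text, or null if the reference is not supported.
string translateEntityRef(string reference)
{
    switch (reference)
    {
        // The five predefined XML entity references.
        case "&amp;":  return "&";
        case "&lt;":   return "<";
        case "&gt;":   return ">";
        case "&apos;": return "'";
        case "&quot;": return "\"";
        // A made-up, document-specific entity that would normally
        // be declared in the DTD.
        case "&company;": return "Example Corp";
        // Anything else is unsupported; the caller decides what to do.
        default: return null;
    }
}

void main()
{
    assert(translateEntityRef("&amp;") == "&");
    assert(translateEntityRef("&company;") == "Example Corp");
    assert(translateEntityRef("&unknown;") is null);
}
```

The null return is what would let the same function serve both as a parser config hook and as a normalize hook: each caller can decide whether an unsupported reference is an error or is left as-is.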

I don't mind submitting a PR, but I'd like to get your feedback first.

Author

sarneaud commented Apr 18, 2018

Thinking a little more, a better return type would be Nullable!R2 where R2 is a range with the same element type as R.

Owner

jmdavis commented Apr 18, 2018

I didn't add support for it, because it wasn't clear to me from reading the XML spec that it was even possible to guarantee that skipping an entity when parsing it would result in a valid XML document (e.g. if it inserted a start tag but not an end tag). After some discussions about it in D.Announce, I think that it's guaranteed that any such entity has to be complete enough that skipping it won't screw up the rest of the document. And as such, what I'm probably going to do is add an option to Config where you can tell it to treat entities as normal text. That way, by default, it would still throw, but anyone who wanted to let unparsed entities be ignored would be able to do so. However, I will probably have it still throw in the case where the entity is clearly invalid (not as in undeclared but as in contains characters that clearly make it so that it could never be a valid entity).

It's my intention to tackle this after I've finished the writer support, since that's almost done.

jmdavis added the enhancement label Apr 18, 2018

Owner

jmdavis commented Apr 18, 2018

Either way, I don't see much point in adding support for trying to actually process entity references. If I made it possible to skip the entities, then in principle, a parser could parse the DTD, then use dxml to parse the rest of the document, and then process the entities in the document itself, but if you're going that far, you might as well just write the full parser rather than using dxml. Given that dxml doesn't parse the DTD, I think that the only options that make sense are to either throw when it encounters an entity reference (like it does now) or to just skip them and let the program using dxml either ignore them or try to do something on its own to handle them if it really wants to. And I'm fine with making the second possible so long as it's not going to result in treating invalid XML documents as valid due to the fact that the entities weren't replaced with whatever they were supposed to be replaced with.

jmdavis added a commit that referenced this issue Apr 19, 2018

Added Config.throwOnEntityRef.
#5

If throwOnEntityRef == ThrowOnEntityRef.yes, then when EntityRange
encounters non-standard entity references, it throws as before. However,
when throwOnEntityRef == ThrowOnEntityRef.no, then it only throws when
the entity reference is syntactically invalid. Otherwise, it's treated
as normal text just like the five predefined entity references are.
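
The rule described in this commit message can be sketched roughly as follows. This is a standalone illustration, not dxml's implementation: ThrowOnEntityRef is declared locally as a stand-in for the flag of the same name, isWellFormedRef, isStdEntityRef, and checkRef are hypothetical helpers, and the name check is simplified to ASCII (the XML Name production allows many more characters).

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isHexDigit;
import std.exception : assertNotThrown, assertThrown;
import std.typecons : Flag;

// Local stand-in for the flag named in the commit message.
alias ThrowOnEntityRef = Flag!"ThrowOnEntityRef";

// Simplified, ASCII-only syntax check for "&name;", "&#123;" or "&#x1F;".
bool isWellFormedRef(string r)
{
    if (r.length < 3 || r[0] != '&' || r[$ - 1] != ';')
        return false;
    string inner = r[1 .. $ - 1]; // at least one character

    if (inner[0] == '#')
    {
        // Character reference: decimal or hexadecimal digits.
        bool hex = inner.length > 1 && inner[1] == 'x';
        string digits = inner[(hex ? 2 : 1) .. $];
        if (digits.length == 0)
            return false;
        foreach (char c; digits)
            if (hex ? !isHexDigit(c) : !isDigit(c))
                return false;
        return true;
    }

    // Entity reference: a name start character, then name characters.
    if (!(isAlpha(inner[0]) || inner[0] == '_' || inner[0] == ':'))
        return false;
    foreach (char c; inner[1 .. $])
        if (!(isAlphaNum(c) || c == '_' || c == ':' || c == '-' || c == '.'))
            return false;
    return true;
}

// True for the five predefined entity references.
bool isStdEntityRef(string r)
{
    switch (r)
    {
        case "&amp;", "&lt;", "&gt;", "&apos;", "&quot;": return true;
        default: return false;
    }
}

// Returns normally if the reference would be kept as text, throws otherwise.
void checkRef(string r, ThrowOnEntityRef throwOnEntityRef)
{
    if (!isWellFormedRef(r))
        throw new Exception("syntactically invalid reference: " ~ r);
    if (throwOnEntityRef == ThrowOnEntityRef.yes
        && r[1] != '#' && !isStdEntityRef(r))
        throw new Exception("non-standard entity reference: " ~ r);
}

void main()
{
    assertNotThrown(checkRef("&amp;", ThrowOnEntityRef.yes));
    assertThrown(checkRef("&foobar;", ThrowOnEntityRef.yes));
    assertNotThrown(checkRef("&foobar;", ThrowOnEntityRef.no));
    assertThrown(checkRef("&--;", ThrowOnEntityRef.no)); // never a valid name
}
```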

jmdavis added the 0.3 label Apr 19, 2018

jmdavis closed this Apr 19, 2018
