Entities in Document header #58
Comments
OK, substitutions in Asciidoc headers are different:
That means that The misrendering as
Oooh AsciiDoc characteristics. I wonder if a better defined format helps 😉
😄 One product at a time, Ronald! Yeah, I understand the header was one of your major concerns, and the Asciidoc substitutions are idiosyncratic. (My solution to entities in attributes, btw, was to expand out all entities using HTMLEntities, and let Nokogiri re-encode them on output.) I would say in retrospect, btw, that given how nasty Nokogiri XML is about entities, it was more trouble than it was worth to encode the XML using Nokogiri (as opposed to validating it after the event). One of the major pushes behind RFC2XML, I'm seeing from the RFC Format FAQ, is to permit non-ASCII characters in RFCs. Dealing with HTML entities has resulted in me dealing with those too: the XML is now encoded as ASCII rather than UTF-8, because who knows what you're going to find downstream, but non-ASCII characters are encoded as entities, so we are now addressing that non-ASCII requirement safely. The entities are decimal, not hex, because that's what Nokogiri does out of the box. I am less of a Nokogiri fan now than I was six months ago...
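The expand-then-re-encode approach described in this comment can be sketched with only the Ruby standard library. This is a stand-in for illustration, not the project's code: the real tool uses HTMLEntities and Nokogiri, and the `NAMED_ENTITIES` table and both method names below are hypothetical.

```ruby
# Hypothetical, deliberately tiny entity table. A real tool would rely on a
# complete table, such as the one provided by the HTMLEntities gem.
NAMED_ENTITIES = {
  "nbsp"   => "\u00A0",
  "mdash"  => "\u2014",
  "eacute" => "\u00E9",
  "amp"    => "&",
  "lt"     => "<",
  "gt"     => ">"
}.freeze

# Step 1: expand named, decimal and hex references into literal characters.
def expand_entities(text)
  text.gsub(/&(#[xX]\h+|#\d+|\w+);/) do |match|
    ref = Regexp.last_match(1)
    if ref.start_with?("#x", "#X")
      [ref[2..-1].hex].pack("U")
    elsif ref.start_with?("#")
      [ref[1..-1].to_i].pack("U")
    else
      NAMED_ENTITIES.fetch(ref, match) # unknown names are left untouched
    end
  end
end

# Step 2: on output, escape XML-significant characters and write every
# non-ASCII character as a decimal numeric character reference.
def encode_ascii(text)
  escaped = text.gsub(/[&<>]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;")
  escaped.gsub(/[^[:ascii:]]/) { |char| format("&#%d;", char.ord) }
end

puts encode_ascii(expand_entities("Caf&eacute; &mdash; R&amp;D"))
# prints: Caf&#233; &#8212; R&amp;D
```

Expanding everything first means only one escaping pass happens at output time, so already-escaped input such as `&amp;` is not escaped twice.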
I went through the code we have now and it's quite confusing how we switch back and forth between just "nokogiri" and "nokogiri-generated text to be inserted back to nokogiri". Don't you think everything will be cleaner if we just stick to the plain "nokogiri"? 😉 That will help us take care of the UTF-8 issues too.
On the other hand, doesn't the entity issue stem from RFC XML's usage of it? XML isn't supposed to work with HTML Character Entities.
On the other hand doesn't the entity issue stem from RFC XML's usage of it? And yet, the v1 RFC XML documents had Paolo was migrating the code from text templating to nokogiri; the migration is probably not complete, and I can look at it. Again, I now think migrating to nokogiri was in fact a mistake, because of the hassles around entities. I'm going to give priority still to the issues you found in #59. |
Yes, the Character Entity problem is an RFC XML problem. They should not have allowed HTML Entities inside XML. But in any case, we can still deal with them using Nokogiri. I still think using Nokogiri was the correct way to go, since we're just writing Entities, not reading Entities. We just need to make sure that when we write, we generate Entities the RFC XML way, which will only involve handling text nodes -- but we might not even need to do this? In fact, I don't think XML2RFC relies on Character Entities -- in the #59 document I have gotten rid of all character entities, and the characters generated are identical to the original ones.
Oh, the output will be the same. My concern is that, if we are making the tool widely available, we cannot guarantee that people won't use The noko() routine is consistently treating the document fragments it builds as XHTML not XML. That is what takes care of reading entities. Outputting entities is taken care of by encoding the XML as ASCII; we can leave it as UTF-8, but even in 2017, I don't think it's safe to.
@opoudjis but people most likely won't use HTML Character Entities in the AsciiDoc format as input, right? I don't think we should use the We should also use UTF-8 for v3 output but only "US-ASCII" for v2 output. Only at the end we should call
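The proposal to emit UTF-8 for v3 and US-ASCII for v2, applied only at the final output step, might look roughly like this. It is a minimal sketch in plain Ruby: the method name `serialize_text` and the integer `version` argument are assumptions made for illustration, not part of the existing code.

```ruby
# Hypothetical helper: choose the output encoding from the target RFC XML
# version, and only do so at the very end of processing.
def serialize_text(text, version)
  utf8 = text.encode("UTF-8")
  case version
  when 3
    utf8 # v3 output keeps non-ASCII characters as UTF-8
  when 2
    # v2 output is restricted to US-ASCII, so non-ASCII characters become
    # decimal numeric character references.
    utf8.gsub(/[^[:ascii:]]/) { |char| format("&#%d;", char.ord) }
        .encode("US-ASCII")
  else
    raise ArgumentError, "unsupported RFC XML version: #{version.inspect}"
  end
end
```

Keeping the text as UTF-8 internally and deferring the version-specific step means the rest of the pipeline never has to reason about entities.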
Well, up to you. I made the noko() routine XHTML to deal with in the samples; it was XML before. I can pass the XML document model around, but the XHTML/XML choice for dealing with HTML samples would still need to be made. So what you want is XML, not XHTML; do not accept any HTML entities; and pass the XML document model instead of using an external builder. Right? This was @paolobrasolin's framework, so I'd like to hear from him too.
Child of #56
Any use of entities in document header attributes is being mangled:
is being converted through
into