Entities in Document header #58

opoudjis · 2017-10-25T08:52:20Z

Child of #56

Any use of entities in document header attributes is being mangled:

:area: Operations & Management Area

is being converted through

      def area(node, xml)
        node.attr("area")&.split(/, ?/)&.each do |ar|
          xml.area ar
        end
      end

into

<area>Operations &amp;amp; Management Area</area>


opoudjis · 2017-10-25T09:08:25Z

OK, substitutions in Asciidoc headers are different:

Applied:
- Special characters
- Attributes

But not:
- Quotes
- Replacements
- Macros
- Post Replacement

That means that & converts to &amp; (special characters), but otherwise HTML and XML entities are NOT recognised (replacements). So it is a characteristic of Asciidoc that &amp; and &nbsp; cannot appear in the header.

The misrendering as &amp;amp; is fixed by replacing xml.area text with xml.area { |a| a << text }
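A rough sketch of why the doubled escape shows up, using only Ruby's standard `cgi` library as a stand-in for both Asciidoctor's special-characters substitution and the builder's text escaping. Neither library is actually called here, and the variable names are made up for illustration:

```ruby
require "cgi"

# Header attribute value as it arrives after the "special characters"
# substitution: the bare & has already become &amp;.
substituted = CGI.escapeHTML("Operations & Management Area")

# Treating that string as plain text escapes it a second time.
as_text = "<area>#{CGI.escapeHTML(substituted)}</area>"

# Treating it as already-escaped markup leaves it alone.
as_markup = "<area>#{substituted}</area>"

puts as_text
puts as_markup
```

With the Nokogiri builder, the `as_markup` case corresponds to appending the string with `<<` rather than passing it as text content.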

ronaldtse · 2017-10-26T02:25:30Z

Oooh AsciiDoc characteristics. I wonder if a better defined format helps 😉

opoudjis · 2017-10-26T05:00:16Z

😄 One product at a time, Ronald!

Yeah, I understand the header was one of your major concerns; and the Asciidoc substitutions are idiosyncratic. (My solution to entities in attributes, btw, was to expand out all entities using HTMLEntities, and let Nokogiri reencode them on output.)
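As an illustration of the expand-then-reencode idea, here is a minimal sketch that assumes a tiny hand-written entity table in place of the HTMLEntities gem. The `NAMED` table and `expand_entities` are hypothetical names, not code from the project:

```ruby
require "cgi"

# Hypothetical, deliberately tiny table; a real entity library covers
# the full HTML set.
NAMED = { "nbsp" => "\u00A0", "mdash" => "\u2014" }.freeze

# Expand entities to literal characters so that the XML serialiser can
# decide how to encode them on output.
def expand_entities(text)
  # Replace the named entities we know about; leave anything else as is.
  named_done = text.gsub(/&(\w+);/) { |m| NAMED.fetch(Regexp.last_match(1), m) }
  # CGI.unescapeHTML then handles the XML built-ins (&amp;, &lt;, ...)
  # and numeric character references.
  CGI.unescapeHTML(named_done)
end

puts expand_entities("Operations &amp; Management&nbsp;Area")
```

Unknown names such as `&unknown;` pass through unchanged in this sketch, which is one place where a complete entity table matters.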

I would say in retrospect, btw, that given how nasty Nokogiri XML is about entities, it was more trouble than it was worth to encode the XML using Nokogiri (as opposed to validating it after the event).

One of the major pushes behind RFC2XML, I'm seeing from the RFC Format FAQ, is to permit non-ASCII characters in RFC. Dealing with HTML entities has resulted in me dealing with those too; the XML is now not in UTF-8 but ASCII, because who knows what you're going to find downstream; but non-ASCII is being encoded in entities, and we are now addressing that non-ASCII requirement safely.

Decimal not Hex entities, because that's what Nokogiri does out of the box. I am less of a Nokogiri fan now than I was six months ago...
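For reference, a plain-Ruby sketch of what "ASCII output with decimal entities" amounts to. This is not how Nokogiri does it internally, and `ascii_with_decimal_entities` is a made-up helper name:

```ruby
# Replace every non-ASCII character with a decimal numeric character
# reference, then tag the result as US-ASCII.
def ascii_with_decimal_entities(text)
  text.gsub(/[^[:ascii:]]/) { |ch| format("&#%d;", ch.ord) }
      .encode("US-ASCII")
end

puts ascii_with_decimal_entities("caf\u00E9\u00A0bar")
```

The result is safe for downstream tools that assume ASCII, while any XML parser will still read the references back as the original characters.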

ronaldtse · 2017-10-26T07:22:29Z

I went through the code we have now and it's quite confusing how we switch back and forth between just "nokogiri" and "nokogiri-generated text to be inserted back to nokogiri".

Don't you think everything will be cleaner if we just stick to the plain "nokogiri"? 😉 That will help us take care of the UTF-8 issues too.

ronaldtse · 2017-10-26T07:28:33Z

On the other hand doesn't the entity issue stem from RFC XML's usage of it? XML isn't supposed to work with HTML Character Entities.

opoudjis · 2017-10-27T02:25:04Z

> On the other hand doesn't the entity issue stem from RFC XML's usage of it? XML isn't supposed to work with HTML Character Entities.

And yet, the v1 RFC XML documents had &nbsp; all over them. And people will use HTML entities whether we want them to or not; now, at least, we can deal with them.

Paolo was migrating the code from text templating to nokogiri; the migration is probably not complete, and I can look at it. Again, I now think migrating to nokogiri was in fact a mistake, because of the hassles around entities.

I'm going to give priority still to the issues you found in #59.

ronaldtse · 2017-10-27T02:39:33Z

Yes, the Character Entity problem is an RFC XML problem. They should not have allowed HTML Entities inside XML. But in any case, we can still deal with them using Nokogiri.

I still think using Nokogiri was the correct way to go, since we're just writing Entities, not reading Entities. We just need to make sure when we write we generate Entities the RFC XML way and will only involve handling text nodes -- but we might not even need to do this?

In fact, I don't think XML2RFC relies on Character Entities -- in the #59 document I have gotten rid of all character entities, and the characters generated are identical to the original ones.

opoudjis · 2017-10-27T08:57:02Z

Oh, the output will be the same. My concern is that, if we are making the tool widely available, we cannot guarantee that people won't use &nbsp;, and I'd rather we not constrain it if we don't have to. In fact, the RFC XML spec doesn't say anything about HTML entities, and certainly doesn't rely on them; but if only because the v1 templates did use them, better safe than sorry.

The noko() routine is consistently treating the document fragments it builds as XHTML, not XML. That is what takes care of reading entities. Outputting entities is taken care of by encoding the XML as ASCII; we could leave it as UTF-8, but even in 2017, I don't think it's safe to.

ronaldtse · 2017-10-27T09:08:50Z

@opoudjis but people most likely won't use HTML Character Entities in the AsciiDoc format as input, right?

I don't think we should use the noko() routine; we should instead pass the XML document model around directly to add nodes/attributes. The noko() routine is treating fragments as XHTML because that's what was specified in our code.

We should also use UTF-8 for v3 output but only "US-ASCII" for v2 output. Only at the end we should call to_xml, once.

opoudjis · 2017-10-27T13:36:12Z

Well, up to you. I made the noko() routine XHTML to deal with &nbsp; in the samples; it was XML before. I can pass the XML document model around, but the XHTML/XML choice for dealing with HTML samples would still need to be made. So what you want is XML not XHTML; do not accept any HTML entities; and pass the XML document model instead of using an external builder. Right?

This was @paolobrasolin 's framework, so I'd like to hear from him too.

opoudjis self-assigned this Oct 25, 2017

opoudjis closed this as completed Oct 25, 2017
