This repository has been archived by the owner on Dec 23, 2019. It is now read-only.

Entities in Document header #58

Closed
opoudjis opened this issue Oct 25, 2017 · 10 comments

@opoudjis
Contributor

Child of #56

Any use of entities in document header attributes is being mangled:

:area: Operations & Management Area

is being converted through

      def area(node, xml)
        node.attr("area")&.split(/, ?/)&.each do |ar|
          xml.area ar
        end
      end

into

<area>Operations &amp;amp; Management Area</area>

opoudjis self-assigned this Oct 25, 2017

@opoudjis
Contributor Author

OK, substitutions in Asciidoc headers are different:

  • Special characters
  • Attributes

But not:

  • Quotes
  • Replacements
  • Macros
  • Post Replacement

That means that & converts to &amp; (special characters), but otherwise, HTML and XML entities are NOT recognised (replacements). So it is a characteristic of Asciidoc that & and &nbsp; cannot appear in the header.
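
A rough model of that difference, as a sketch only: the helper names below (`sub_specialchars`, `restore_entities`, `header_subs`, `body_subs`) are hypothetical and are not Asciidoctor's own code. It assumes the "replacements" step is what turns an escaped entity reference back into an entity.

```ruby
# Simplified model of two Asciidoc substitution steps; not Asciidoctor's own code.
SPECIAL_CHARS = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# "Special characters": escape every &, < and >.
def sub_specialchars(text)
  text.gsub(/[&<>]/, SPECIAL_CHARS)
end

# Part of "replacements": turn an escaped entity reference back into an entity.
def restore_entities(text)
  text.gsub(/&amp;((?:[A-Za-z][A-Za-z0-9]*|#\d+|#x[0-9A-Fa-f]+));/, '&\1;')
end

# Header attributes get special characters but not replacements.
def header_subs(text)
  sub_specialchars(text)
end

# Body text gets both, so entity references survive.
def body_subs(text)
  restore_entities(sub_specialchars(text))
end

puts header_subs("Operations &nbsp; Management") # Operations &amp;nbsp; Management
puts body_subs("Operations &nbsp; Management")   # Operations &nbsp; Management
```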

The misrendering as &amp;amp; is fixed by replacing xml.area text with xml.area { |a| a << text }, which appends the value as raw markup instead of escaping it a second time.
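
The double escaping can be reproduced without Nokogiri. This is a minimal sketch: `escape_text`, `area_as_text` and `area_as_raw` are hypothetical stand-ins for what a builder does when a value is added as text versus appended as raw markup.

```ruby
# Hypothetical stand-in for the escaping a builder applies to text nodes.
TEXT_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

def escape_text(value)
  value.gsub(/[&<>]/, TEXT_ESCAPES)
end

# Value added as text: it is escaped, even if it is already escaped.
def area_as_text(value)
  "<area>#{escape_text(value)}</area>"
end

# Value appended as raw markup: it is trusted and copied through unchanged.
def area_as_raw(value)
  "<area>#{value}</area>"
end

# The attribute value as it arrives after header substitutions.
header_value = "Operations &amp; Management Area"

puts area_as_text(header_value) # <area>Operations &amp;amp; Management Area</area>
puts area_as_raw(header_value)  # <area>Operations &amp; Management Area</area>
```

The raw form is only safe when the value is already well-formed markup, which is what the header substitutions are assumed to guarantee here.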

@ronaldtse
Contributor

Oooh AsciiDoc characteristics. I wonder if a better defined format helps 😉

@opoudjis
Contributor Author

opoudjis commented Oct 26, 2017

😄 One product at a time, Ronald!

Yeah, I understand the header was one of your major concerns; and the Asciidoc substitutions are idiosyncratic. (My solution to entities in attributes, btw, was to expand out all entities using HTMLEntities, and let Nokogiri reencode them on output.)
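
The "expand out all entities" step might look roughly like the sketch below. The project used the HTMLEntities gem for this; the small `NAMED_ENTITIES` table and the `expand_entities` helper here are hypothetical stand-ins so the example runs on plain Ruby.

```ruby
# Tiny illustrative table; a real decoder (e.g. the HTMLEntities gem) knows them all.
NAMED_ENTITIES = {
  "amp" => 0x26, "lt" => 0x3C, "gt" => 0x3E,
  "nbsp" => 0xA0, "eacute" => 0xE9
}.freeze

# Matches hex references, decimal references, and named entities.
ENTITY_PATTERN = /&(?:#x([0-9A-Fa-f]+)|#(\d+)|([A-Za-z][A-Za-z0-9]*));/

# Replace each entity reference with the character it stands for.
# Unknown names are left untouched.
def expand_entities(text)
  text.gsub(ENTITY_PATTERN) do |ref|
    hex, dec, name = Regexp.last_match.captures
    codepoint = if hex then hex.to_i(16)
                elsif dec then dec.to_i
                else NAMED_ENTITIES[name]
                end
    codepoint ? [codepoint].pack("U") : ref
  end
end

puts expand_entities("Operations &amp; Management Area") # Operations & Management Area
```

Once the text holds plain characters only, the serialiser can escape everything exactly once on output.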

I would say in retrospect btw that, given how nasty Nokogiri XML is about entities, it was more trouble than it was worth to encode the XML using Nokogiri (as opposed to validating it after the event).

One of the major pushes behind RFC2XML, I'm seeing from the RFC Format FAQ, is to permit non-ASCII characters in RFCs. Dealing with HTML entities has resulted in me dealing with those too: the XML is now in ASCII rather than UTF-8, because who knows what you're going to find downstream, but non-ASCII characters are encoded as entities, so we are now addressing that non-ASCII requirement safely.

Decimal not Hex entities, because that's what Nokogiri does out of the box. I am less of a Nokogiri fan now than I was six months ago...
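
For illustration, ASCII-only output with decimal character references can be sketched in plain Ruby as below. `ascii_xml_text` is a hypothetical helper, not the gem's code, and it only approximates what Nokogiri emits when asked for ASCII output.

```ruby
MARKUP_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Escape markup characters, then write every non-ASCII character
# as a decimal character reference such as &#233;.
def ascii_xml_text(text)
  text.gsub(/[&<>]/, MARKUP_ESCAPES)
      .gsub(/[^[:ascii:]]/) { |ch| format("&#%d;", ch.ord) }
end

puts ascii_xml_text("caf\u00E9 & cr\u00E8me") # caf&#233; &amp; cr&#232;me
```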

@ronaldtse
Contributor

ronaldtse commented Oct 26, 2017

I went through the code we have now and it's quite confusing how we switch back and forth between just "nokogiri" and "nokogiri-generated text to be inserted back to nokogiri".

Don't you think everything will be cleaner if we just stick to the plain "nokogiri"? 😉 That will help us take care of the UTF-8 issues too.

@ronaldtse
Contributor

On the other hand doesn't the entity issue stem from RFC XML's usage of it? XML isn't supposed to work with HTML Character Entities.

@opoudjis
Contributor Author

> XML isn't supposed to work with HTML Character Entities.
>
> On the other hand doesn't the entity issue stem from RFC XML's usage of it?

And yet, the v1 RFC XML documents had &nbsp; all over them. And people will use HTML entities whether we want them to or not; now, at least, we can deal with them.

Paolo was migrating the code from text templating to nokogiri; the migration is probably not complete, and I can look at it. Again, I now think migrating to nokogiri was in fact a mistake, because of the hassles around entities.

I'm going to give priority still to the issues you found in #59.

@ronaldtse
Contributor

Yes, the Character Entity problem is an RFC XML problem. They should not have allowed HTML Entities inside XML. But in any case, we can still deal with them using Nokogiri.

I still think using Nokogiri was the correct way to go, since we're just writing Entities, not reading Entities. We just need to make sure when we write we generate Entities the RFC XML way and will only involve handling text nodes -- but we might not even need to do this?

In fact, I don't think XML2RFC relies on Character Entities -- in the #59 document I have gotten rid of all character entities, and the characters generated are identical to the original ones.

@opoudjis
Contributor Author

Oh, the output will be the same. My concern is that, if we are making the tool widely available, we cannot guarantee that people won't use &nbsp;, and I'd rather we not constrain it if we don't have to. In fact, the RFC XML spec doesn't say anything about HTML entities, and certainly doesn't rely on them; but if only because the v1 templates did use them, better safe than sorry.

The noko() routine is consistently treating the document fragments it builds as XHTML not XML. That is what takes care of reading entities. Outputting entities is taken care of by encoding the XML as ASCII; we can leave it as UTF-8, but even in 2017, I don't think it's safe to.

@ronaldtse
Contributor

@opoudjis but people most likely won't use HTML Character Entities in the AsciiDoc format as input, right?

I don't think we should use the noko() routine but directly pass the XML document model around to add nodes/attributes. The noko() routine is treating fragments as XHTML because that's what was specified in our code.

We should also use UTF-8 for v3 output but only "US-ASCII" for v2 output. Only at the end we should call to_xml, once.
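
A minimal sketch of that proposal, with a toy element model standing in for the real XML document model: `Element`, `element`, `area` and `serialize` are all hypothetical names. The tree is passed around and extended in place, and it is turned into text only once at the end, with the escaping chosen by output version.

```ruby
# Hypothetical minimal element model: a name, attributes, and children
# (child elements or plain strings).
Element = Struct.new(:name, :attrs, :children) do
  def add(child)
    children << child
    self
  end
end

def element(name, attrs = {})
  Element.new(name, attrs, [])
end

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape markup characters; for ASCII output also write non-ASCII
# characters as decimal character references.
def escape(text, ascii_only)
  out = text.gsub(/[&<>"]/, XML_ESCAPES)
  ascii_only ? out.gsub(/[^[:ascii:]]/) { |ch| format("&#%d;", ch.ord) } : out
end

# The single serialisation step: v2 output is ASCII only, v3 keeps UTF-8.
def serialize(node, version)
  ascii_only = version < 3
  return escape(node, ascii_only) if node.is_a?(String)

  attrs = node.attrs.map { |k, v| %( #{k}="#{escape(v.to_s, ascii_only)}") }.join
  body = node.children.map { |child| serialize(child, version) }.join
  "<#{node.name}#{attrs}>#{body}</#{node.name}>"
end

# Handlers receive the parent node and add plain, unescaped text to it.
def area(value, parent)
  value.split(/, ?/).each { |ar| parent.add(element("area").add(ar)) }
end

front = element("front")
area("Operations & Management Area", front)
puts serialize(front, 2) # <front><area>Operations &amp; Management Area</area></front>
```

Because handlers only ever store plain text, escaping happens in one place and cannot be applied twice.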

@opoudjis
Contributor Author

Well, up to you. I made the noko() routine XHTML to deal with &nbsp; in the samples; it was XML before. I can pass the XML document model around, but the XHTML/XML choice for dealing with HTML samples would still need to be made. So what you want is XML not XHTML; do not accept any HTML entities; and pass the XML document model instead of using an external builder. Right?

This was @paolobrasolin 's framework, so I'd like to hear from him too.
