Entities #442
Comments
Some experiments along these lines in the entities branch of jgm/cmark. Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).
Looks good so far!
Hmm. Isn't a "link title" in CommonMark just a fancy way to write the attribute value literal that ends up in the, well, And given that CommonMark does not even look at, let alone does any conversion or replacement for attribute value literals in "HTML tags" anyway (and nor does Markdown in general, IIRC): leaving the "link title" string alone (maybe apart from checking for literal

Following are some thoughts of mine on this matter.

Lexis

Regarding syntactically recognizing entity and character references, the spec should spell out that references of the "usual" form are recognized. The following is basically copied from the XML syntax, except for the

Note that XML requires the terminating
The character class Digit simply comprises the ten decimal digits, while the name start character and name character classes differ among versions of HTML, XML, etc. Using the XML definition restricted to ISO 646 (which is what CommonMark currently, implicitly, but incompletely does—eg, it disallows
Here Letter would just be the basic 52 upper and lower case letters of the ISO 646 repertoire. In my opinion, a good argument could be made for allowing to omit the terminating
is equivalent to
Or because it allows "joining lines" (exploiting the "lazy continuation line" rule, of course):
is equivalent to
If one defines an entity
is equivalent (after replacement, using
This is what ISO 8879 SGML has always supported (even in "Minimal SGML Documents"), and I tend to find it useful. But it might be too much for authors accustomed to HTML/XML rules … The insane decision in the HTML5 "syntax" to allow omitting

Processing

I agree that the spec should not require (but should indeed allow) replacing entity references with (which? whatever?) replacement texts. And, as I have argued, it seems wise to also forbid replacing numeric character references (at least for the ISO 646 repertoire), to preserve the distinction between e.g.,

As far as the spec talks about the parsing result in terms of an AST (or, perhaps equivalently, its representation as a CommonMark-DTD-valid XML document instance), some "entity reference" node type would suffice for unreplaced entity references, similar to your

However, it might be useful to include an optional character number in case "resolution" of character entity references (in the parser) is desired. The pre-defined XML entities
I find placing the entity name in a

If the parser would (be allowed to) replace entity references with something other than a Unicode character (that is, really handle general entities, not just character entities), then the replacement text would directly be inserted (without delimiters or its own node) into the regular character data content, that is: into the

And similarly for character references (lumping numeric and hex together, for this distinction is IMO negligible):

One could possibly unite the
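Let me remind you that there are more such contexts:
* Link title (included for the sake of completeness here)
* Link destination (see Example 308)
* Image ALT string (usually rendered differently from links; also note the difference in handling of nested versus non-nested image)
* Info string in code fence line (see Example 309)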
+++ Martin Mitáš [Dec 04 16 13:52 ]:
> Let me remind you that there are more such contexts:
> * Link title (included for the sake of completeness here)
> * Link destination (see Example 308)
> * Image ALT string (usually rendered differently from links; also note the difference in handling of nested versus non-nested image)
> * Info string in code fence line (see Example 309)

Actually not the Image ALT string (or, as we call it, the link description), since this is represented in cmark as a list of inlines, and we can just use ENTITY nodes there. The problem really only arises for the other three contexts, where we just have a raw string.
We no longer use the HTML5 entity list. Instead, we recognize any potential character entity of length 1-32 letters. Entities are carried through unchanged to HTML rather than being converted to UTF-8. Entities in URLs are also left unchanged rather than being URL-encoded. See https://talk.commonmark.org/t/spec-issues-character-entity-references/2306 and #442
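A rough sketch of the behaviour this commit message describes, not the actual cmark code: anything of the form `&`, then 1 to 32 ASCII letters, then `;` is treated as a potential entity and copied to the HTML output unchanged, while every other `&` is escaped. The regex, the names `render_text_html` and `_escape`, and the restriction to named (not numeric) references are assumptions made for illustration.

```python
import re

# Assumption: a "potential character entity" is '&', 1-32 ASCII letters, ';'.
# Numeric references (&#...;) are deliberately left out of this sketch.
ENTITY_RE = re.compile(r"&[A-Za-z]{1,32};")


def _escape(s: str) -> str:
    """Escape the characters that are special in HTML text and attributes."""
    return (
        s.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


def render_text_html(text: str) -> str:
    """Escape text for HTML, passing potential entities through unchanged."""
    out = []
    pos = 0
    for m in ENTITY_RE.finditer(text):
        out.append(_escape(text[pos:m.start()]))  # ordinary text before the entity
        out.append(m.group(0))                    # entity is carried through verbatim
        pos = m.end()
    out.append(_escape(text[pos:]))               # trailing ordinary text
    return "".join(out)
```

Note that no entity table is consulted: an unknown name such as `&nosuchentity;` passes through just like `&copy;`, which is the simplification gained by dropping the list.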
As noted in this thread, it might be desirable to change what the spec says about entities.
Arguably the spec should not require that entities be replaced (in the parsing phase) by Unicode characters. A replacement will be necessary for some output formats, but there is no reason why an implementation that only targets HTML should do the replacement at all, and even an implementation that targets multiple formats might choose to handle entities in the renderer, or in an intermediate AST filter. And some implementations might want to preserve entities in the output.
Currently the spec requires replacement for entities in a certain list. It would also simplify things not to have such a list.
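One way to picture handling entities after the parsing phase is a small AST filter. This is only a sketch under assumed names: the `Text` and `Entity` node classes, `resolve_entities`, and `render_html` are hypothetical and not part of cmark or the spec. Python's standard `html.unescape` stands in for whatever entity table an implementation would choose.

```python
from dataclasses import dataclass
from html import escape, unescape
from typing import List, Union


@dataclass
class Text:
    literal: str


@dataclass
class Entity:
    source: str  # the reference exactly as written, e.g. "&copy;"


Inline = Union[Text, Entity]


def resolve_entities(inlines: List[Inline]) -> List[Inline]:
    """Filter for non-HTML targets: turn known entities into Text nodes."""
    result: List[Inline] = []
    for node in inlines:
        if isinstance(node, Entity):
            resolved = unescape(node.source)
            # html.unescape leaves unknown references untouched;
            # those stay as Entity nodes for the renderer to deal with.
            result.append(Text(resolved) if resolved != node.source else node)
        else:
            result.append(node)
    return result


def render_html(inlines: List[Inline]) -> str:
    """HTML renderer: entities need no replacement and are emitted as written."""
    parts = []
    for node in inlines:
        if isinstance(node, Entity):
            parts.append(node.source)
        else:
            parts.append(escape(node.literal, quote=False))
    return "".join(parts)
```

With this split, an HTML-only implementation would call `render_html` directly on the parser output, while one that targets other formats would run `resolve_entities` first. The parser itself never needs an entity list.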