Support entities (anyway, please!) #44
i think supporting inserting characters by codepoint is a good thing—especially with invisible or confusable characters it can be useful. i think HTML entity names are not so good; many of them are essentially legacy and the coverage is not necessarily complete or well‐thought‐out. i don’t like the XML/HTML entity reference syntax because it makes the decimal form of codepoints why not extend the emoji syntax to allow arbitrary characters by unicode codepoint, like |
As there are escapes already, why not add unicode escapes as supported in many programming languages? Along the lines of |
If so Lua 5.3 style with braces |
This said I think |
I do not care about the syntax here, but I would like to point out that entities are essential for comfortable writing of mixed-language texts, e.g. when mixing right-to-left and left-to-right languages, as is common in the United Arab Emirates, Qatar, etc. So any solution you come up with here has to be readable (and comfortable to write) for characters that change the text direction and the like.
Can you explain a bit more why entities help with this? (E.g. give an example?) |
Is the purpose for supporting entities to let you put in unicode characters when you're unable to insert the actual unicode character into your source? (That is, you know the character you want but cannot copy/paste it into your content file? Is it common to know the codepoint but not be able to copy/paste the character in?)
I didn't realize that the list of djot-supported emojis was so large. Seems like adding 10 or 20 commonly-used readable unicode char names like |
@uvtc the bigger concern is invisible characters, for example variation selectors, right‐to‐left and left‐to‐right marks, ligation marks (zero‐width joiner and zero‐width non‐joiner), characters which allow breaks (zero‐width space) and prevent them (word joiner), “shy” hyphens, etc…… in some text editors it may be possible to inspect whether these characters are present (CotEditor for example is very good), but in others it may not, and regardless simply having those characters written out in the text is often much easier to handle. as an example, the codepoint similar arguments extend to things like wanting to type as for having to remember the unicode codepoints as opposed to the names, i think many people probably would prefer writing |
@jgm sorry for the delay - yes, the intent is mostly what @marrus-sh wrote above. Namely to make visible all those characters (incl. future ones) which change or influence overall "style", "form", "layout", "paragraphing", etc. |
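To illustrate the point about making such characters visible, here is a minimal sketch of a helper that rewrites a handful of invisible characters as readable tags. It is not part of djot or of any proposal in this thread: the name `reveal`, the `<U+XXXX>` output form and the selection of code points are assumptions made for the example, and it needs Lua 5.3 or later for the `utf8` library.

```lua
-- Hypothetical helper, not part of djot: show selected invisible
-- characters as <U+XXXX> tags. Requires Lua 5.3+ (utf8 library).
local invisible = {
  [0x00AD] = true, -- soft ("shy") hyphen
  [0x200B] = true, -- zero-width space
  [0x200C] = true, -- zero-width non-joiner
  [0x200D] = true, -- zero-width joiner
  [0x200E] = true, -- left-to-right mark
  [0x200F] = true, -- right-to-left mark
  [0x2060] = true, -- word joiner
}

local function is_invisible(cp)
  -- U+FE00..U+FE0F are the sixteen basic variation selectors
  return invisible[cp] or (cp >= 0xFE00 and cp <= 0xFE0F)
end

local function reveal(text)
  local out = {}
  -- utf8.codes raises an error if text is not valid UTF-8
  for _, cp in utf8.codes(text) do
    if is_invisible(cp) then
      out[#out + 1] = string.format('<U+%04X>', cp)
    else
      out[#out + 1] = utf8.char(cp)
    end
  end
  return table.concat(out)
end

print(reveal('a\u{200D}b')) --> a<U+200D>b
```

A table lookup keeps the list easy to extend, which matters if future invisible characters are to be covered as well.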
See my idea in #112 of generalizing the syntax currently used for emojis.
If you use emojis, you can use this syntax for them with a filter that inserts the emoji character proper to the alias. But you could just as easily use a different filter to associate whatever unicode string you like. |
I have written a simple Pandoc filter which replaces codepoint escapes like

Gotcha: a literal colon (:) next to a digit must be escaped as

```lua
local char = utf8.char
-- Matches :id: where id is a run of alphanumerics; captures the whole
-- match and the id.
local pat = '(%:(%w+)%:)'

local function subst (match, id)
  -- If we can numify it, it is probably a codepoint!
  local cp = tonumber(id)
  if cp then
    -- If the codepoint is out of range, char throws a scarcely helpful error!
    local ok, res = pcall(char, cp)
    if ok then
      return res
    end
    error("Failed to convert " .. tostring(match) .. " to a character:\n\t" .. tostring(res))
  end
  -- Not a number (e.g. an emoji alias): leave the text as it is.
  return match
end

function Str (str)
  -- Parentheses drop gsub's second return value (the match count).
  return pandoc.Str((str.text:gsub(pat, subst)))
end
```

It could easily be ported to a djot filter using my pure-Lua char function from #44 (comment)
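The substitution logic of the filter above can be tried outside Pandoc, which also makes the colon gotcha concrete. This is only a sketch: the wrapper name `expand` is made up for the example, it assumes Lua 5.3 or later for `utf8.char`, and unlike the filter it leaves an out-of-range number unchanged instead of raising an error.

```lua
-- Standalone sketch of the :codepoint: substitution (Lua 5.3+).
local function subst(match, id)
  local cp = tonumber(id)            -- accepts decimal and 0x-prefixed hex
  if cp then
    local ok, res = pcall(utf8.char, cp)
    if ok then return res end        -- valid code point: emit the character
  end
  return match                       -- otherwise keep the text unchanged
end

-- Hypothetical wrapper name; parentheses drop gsub's match count.
local function expand(text)
  return (text:gsub('(:(%w+):)', subst))
end

print(expand('so:0x14b:g'))  --> soŋg
print(expand(':smile:'))     --> :smile:

-- The gotcha: ":30:" in a time of day looks like an escape and is
-- replaced by the control character U+001E.
assert(expand('at 10:30:00') == 'at 10' .. string.char(30) .. '00')
```

Because plain digits between colons are ambiguous with ordinary text such as times, a real syntax would probably want a prefix or a paired delimiter to mark codepoints.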
I know it's said that HTML-style entities are not supported because djot is not to favor any target format, but I wonder if it wouldn't be a good idea to have a mechanism for including characters which are hard to type, and entities are a well-known syntax for that, which I would say is good enough.[1] I can share a Lua table mapping HTML 5 entity names to UTF-8 characters, but supporting only numeric entities would be a reasonable limitation, since djot would only borrow the syntax. Those can be handled very effectively in Lua, e.g.

where char can be either utf8.char or this:

Footnotes
[1] I would prefer a paired delimiter. My string interpolation DSL uses @(...) where the parentheses may contain one or more of (1) a decimal code point like 331, (2) a hex codepoint like 0x14b, (3) an entity name like eng, or (4) a Unicode name in angle brackets like <Latin small letter eng> (in the Perl implementation).
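As an illustration of the numeric-only entity idea in the last comment, here is a minimal sketch. It is not the code that comment originally showed: both the encoder `char` and the function name `expand_entities` are written for this example. The encoder is pure Lua, so it does not depend on the `utf8` library.

```lua
-- Illustrative pure-Lua UTF-8 encoder (not the function from the thread).
local function char(cp)
  if cp < 0 or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
    error('code point out of range: ' .. tostring(cp))
  elseif cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- Hypothetical name: expands &#331; and &#x14b; style references only.
local function expand_entities(text)
  return (text:gsub('&#([xX]?)(%x+);', function(x, digits)
    local base = (x ~= '') and 16 or 10
    -- tonumber yields nil for e.g. "&#1a;"; pcall turns that and any
    -- out-of-range value into "leave the match unchanged" (nil result).
    local ok, res = pcall(char, tonumber(digits, base))
    return ok and res or nil
  end))
end

print(expand_entities('so&#331;g and so&#x14b;g')) --> soŋg and soŋg
```

Named entities such as `&amp;` do not match the pattern and pass through untouched, which is in keeping with borrowing only the numeric syntax.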