feat(markdown): support CJK and emoji #3026
66c5efa to 3f8b4df
src/util.js (outdated)
"$1$2"
)
)
.split(/(\s+)/g)
/\s/.test("\u3000"); // true
`\s` matches U+3000 (full-width ideographic space), because in JavaScript `\s` is `[ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]`.
The current implementation always replaces a full-width space (U+3000) with an ordinary space (U+0020).
The two are used for different purposes in Japanese, and they render differently.
Example:
A　　　B (three full-width spaces)
A   B (three spaces)
For more details, see Unicode spaces.
Conclusion and my proposal:
- Treating the full-width space as a space character is OK.
- The full-width space (U+3000) should not be replaced with an ordinary space.
FYI: Treating the full-width space (U+3000) as whitespace in JavaScript source code is OK, because ECMAScript treats the "Zs" category, which includes U+3000, as whitespace.
Related:
Many implementations use full-width ideographic space (cl-14) for the one em space appended after dividing punctuation marks (cl-04).
-- https://www.w3.org/TR/2012/NOTE-jlreq-20120403/#en-subheading2_1_6
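To make the concern above concrete, here is a minimal sketch of the difference between collapsing every `\s` match and collapsing everything except U+3000. Both helper names are hypothetical and are not taken from `src/util.js`; they only illustrate the behaviour being discussed.

```javascript
// U+3000 is written as an escape so the character survives copy/paste.
const FULL_WIDTH_SPACE = "\u3000";

// \s includes U+3000, so this logs true.
console.log(/\s/.test(FULL_WIDTH_SPACE));

// Hypothetical helper: collapses every run of whitespace, including U+3000,
// into a single ordinary space. This is the behaviour the comment objects to.
function collapseAllWhitespace(text) {
  return text.replace(/\s+/g, " ");
}

// Hypothetical helper: collapses whitespace but leaves U+3000 untouched.
// [^\S\u3000] reads as "not (non-whitespace or U+3000)", i.e. \s minus U+3000.
function collapseExceptFullWidth(text) {
  return text.replace(/[^\S\u3000]+/g, " ");
}

const sample = "A\u3000\u3000\u3000B";
console.log(collapseAllWhitespace(sample) === "A B"); // full-width spaces are lost
console.log(collapseExceptFullWidth(sample) === sample); // full-width spaces survive
```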
I think we can just treat full-width whitespace as CJK punctuation.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| abc | def | ghi | | ||
| ------ | ------ | ------ | | ||
| 👍👍👍 | 👍👍👍 | 👍👍👍 | |
It depends on the font: if you use a monospace font that supports CJK characters, you'll see:
Can we just turn splitting on and remove the option? If you don't want your text to wrap you can use
I'm fine with it but most of the markdown converters do not support CJK splitting, e.g. GitHub.
* disallow leading/trailing full-width whitespace
src/util.js (outdated)
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
// `\s` but exclude full-width whitespace (`\u3000`)
.split(
/([ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\ufeff]+)/g
This regex can be written as `/([^\S\u3000]+)/`. Not sure if that's better. (It can even be written as `/((?:(?!\u3000)\s)+)/`, but that's definitely not better!)
(Also note that the `g` flag is not needed for `.split()`.)
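As a quick check of the suggestion above, the sketch below splits one sample string with all three spellings of the pattern and compares the results. The sample string and variable names are made up for illustration. The comparison only shows agreement on this input, although the three patterns should match the same set of characters in engines where `\s` is the class quoted earlier.

```javascript
// The explicit class from the diff (with the g flag, as committed).
const explicit =
  /([ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\ufeff]+)/g;
// "Whitespace except U+3000" written as a negated class.
const negated = /([^\S\u3000]+)/;
// The same idea with a negative lookahead; correct but harder to read.
const lookahead = /((?:(?!\u3000)\s)+)/;

// Sample input: two spaces, one full-width space, then newline + tab.
const text = "foo  bar\u3000baz\n\tqux";

// String.prototype.split finds every separator regardless of the g flag,
// and the capture group keeps the separators in the output.
const parts = [explicit, negated, lookahead].map((re) => text.split(re));

console.log(JSON.stringify(parts[0]));
console.log(JSON.stringify(parts[0]) === JSON.stringify(parts[1]));
console.log(JSON.stringify(parts[1]) === JSON.stringify(parts[2]));
```

In every case the full-width space stays inside the `bar\u3000baz` piece rather than becoming a separator.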
@azu any further comments? Otherwise I'd like to merge this tomorrow.
@ikatyang Looks good to me.
I'm not completely up-to-speed with the nuances of CJK chars but if the tests are passing I'm happy!
Oh, I thought you said @azz 😆 |
azz vs azu 😆 |
Fixes #3013
Rules:
- `line` (`" "` or `"\n"`) between `non-cjk` and `cjk-character`
- `softline` (`""` or `"\n"`) between `cjk` and `cjk`
- no break between `non-cjk` and `cjk-punctuation`, i.e. they're considered not breakable
Question: Is `--split-cjk-text` a suitable option name? (Update: removed the option and enabled it by default.)
cc @azu as you might be interested in this.
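The rules in this description can be read as a small decision table. The sketch below is one hypothetical way to express it and is not the code from this PR. The kind labels and the `separatorBetween` helper are invented for illustration, and it assumes that "cjk" in the second rule covers both CJK characters and CJK punctuation.

```javascript
// Hypothetical kind labels, mirroring the names used in the rules.
const NON_CJK = "non-cjk";
const CJK_CHARACTER = "cjk-character";
const CJK_PUNCTUATION = "cjk-punctuation";

// Hypothetical helper: decides which separator goes between two adjacent pieces.
// Returns "line", "softline", "none", or null when the rules do not cover the pair.
function separatorBetween(left, right) {
  const kinds = [left, right];
  const has = (kind) => kinds.includes(kind);

  // non-cjk next to cjk-punctuation: considered not breakable.
  if (has(NON_CJK) && has(CJK_PUNCTUATION)) return "none";
  // non-cjk next to a cjk character: " " or "\n".
  if (has(NON_CJK) && has(CJK_CHARACTER)) return "line";
  // cjk next to cjk: "" or "\n" (assumption: both CJK kinds count as "cjk").
  if (!has(NON_CJK)) return "softline";
  // non-cjk next to non-cjk is outside the scope of these rules.
  return null;
}

console.log(separatorBetween(NON_CJK, CJK_CHARACTER));
console.log(separatorBetween(CJK_CHARACTER, CJK_CHARACTER));
console.log(separatorBetween(CJK_PUNCTUATION, NON_CJK));
```

The not-breakable case is checked first so that punctuation adjacent to non-CJK text never receives a break opportunity.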