UTF-8 headings not detected (patch included) #7
Comments
This takes care of most trivial cases, but there are still a couple of cases where it would fail. For example:
$ cat test.md
👨👩👦 Family
=========
$ ./smu test.md
<p>👨👩👦 Family
=========</p>
I think it's best to do what Hiltjo suggested in here and simply treat
P.S.: Also, for comparison, both lowdown and md4c are able to deal with the above case:
$ md2html test.md
<h1>👨👩👦 Family</h1>
$ lowdown test.md
<h1 id="family">👨👩👦 Family</h1>
That fails on my example above...
I'm not sure. All writing systems have a single codepoint for their letters in Unicode. I don't consider emoji made of several multi-byte codepoints not working to be a failure, when all of the other writing systems otherwise work just fine. I think the best would be to entirely forget comparing with IMHO,
I'm not well versed enough in Unicode to know whether that's correct or not. But some languages can have multiple codepoints which fuse together into a single user-visible character. An example in Bengali: মৌ মৌ
====
Ah yes, my bad. Doing
মৌ
=
Probably not worth it to try and deal with such cases. But I just wanted to point it out.
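To make the point about fused codepoints concrete, here is a minimal sketch in C that counts codepoints by skipping UTF-8 continuation bytes. The names `utf8_codepoints` and `bengali_mau` are made up for illustration and are not from smu. It assumes valid UTF-8 input and that the syllable is encoded as U+09AE followed by U+09CC.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of smu): count Unicode codepoints in a
 * NUL-terminated UTF-8 string by skipping continuation bytes, i.e. bytes
 * of the form 10xxxxxx. Assumes the input is valid UTF-8.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * The Bengali syllable from the comment above, assumed here to be
 * U+09AE followed by U+09CC, written as explicit UTF-8 bytes so the
 * example does not depend on the source file's encoding.
 */
const char bengali_mau[] = "\xE0\xA6\xAE\xE0\xA7\x8C";
```

With these definitions, `sizeof(bengali_mau) - 1` is 6 bytes and `utf8_codepoints(bengali_mau)` is 2, while a reader sees a single character. So a codepoint count still asks for a longer underline than the visible title, which is the case described above.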
This avoids problems with counting unicode glyphs and negative side-effects seem very unlikely. Closes #7
I like this approach. If users have titles with fewer than three glyphs, they can just put three characters in the underline to work around it. The PR is #9.
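For readers following along, the idea of not counting glyphs at all could look roughly like the sketch below. It assumes the rule is "an underline is a run of at least three `=` or `-` characters, whatever the title length", which is inferred from the comment above. The actual change in the PR is not shown here, and `underline_level` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical check (not taken from the smu source): decide whether
 * `line` is a heading underline without looking at the title at all.
 * Returns 1 for a run of '=', 2 for a run of '-', and 0 otherwise.
 * The minimum run length of three is an assumption.
 */
int
underline_level(const char *line)
{
	size_t n = 0;
	char c;

	assert(line != NULL);
	c = line[0];
	if (c != '=' && c != '-')
		return 0;
	while (line[n] == c)
		n++;
	if (n < 3)
		return 0;
	/* Allow one trailing newline, nothing else, after the run. */
	if (line[n] == '\n')
		n++;
	if (line[n] != '\0')
		return 0;
	return c == '=' ? 1 : 2;
}
```

Because the title's length never enters the check, byte counts, codepoint counts and glyph counts all stop mattering.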
Hi,
This code does not work if there are UTF-8 characters in the headings. The reason is that it expects at least as many `=` or `-` characters in the next line as the heading's length in bytes. This is only true for English headings encoded as ASCII. With UTF-8, you can't use the number of bytes; you have to use the number of characters instead. For example:
This patch fixes this, while keeping backward compatibility.
This code uses `l` for the length (number of bytes, just as the original), but it also counts the number of multi-byte characters in `k`, and then compares `j` with the latter.

Cheers,
bzt
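The patch itself is not reproduced in this thread. The sketch below only illustrates the general idea described above: compare the underline length against the heading's character count rather than its byte count. The names `heading_chars` and `underline_matches` are hypothetical, and valid UTF-8 input is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: number of characters (codepoints) in a UTF-8
 * heading, computed as total bytes minus continuation bytes (10xxxxxx).
 */
static size_t
heading_chars(const char *s)
{
	size_t bytes, cont = 0, i;

	assert(s != NULL);
	bytes = strlen(s);
	for (i = 0; i < bytes; i++)
		if (((unsigned char)s[i] & 0xC0) == 0x80)
			cont++;
	return bytes - cont;
}

/*
 * Returns 1 if `underline` consists only of '=' or only of '-' and is
 * at least as long as the heading measured in characters, 0 otherwise.
 */
int
underline_matches(const char *heading, const char *underline)
{
	size_t n, i;

	assert(heading != NULL && underline != NULL);
	n = strlen(underline);
	if (n == 0 || (underline[0] != '=' && underline[0] != '-'))
		return 0;
	for (i = 1; i < n; i++)
		if (underline[i] != underline[0])
			return 0;
	return n >= heading_chars(heading);
}
```

For ASCII headings the byte count and the character count are equal, so existing documents would be treated the same as before. A heading such as "Árvíz" (5 characters, 7 bytes) would then accept a 5-character underline.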