UTF-8 headings not detected (patch included) #7

Closed
bztsrc opened this issue Jul 11, 2022 · 4 comments

bztsrc commented Jul 11, 2022

Hi,

This code does not work if there are UTF-8 characters in the headings. The reason is that it expects the next line to contain at least as many = or - characters as the heading's length in bytes. That only holds for headings encoded as plain ASCII (typically English). With UTF-8, you can't use the number of bytes; you have to use the number of characters instead.

For example:

abc          (l=3)
===          (j=3, so correctly detected as a heading)

ábc          (l=4)
===          (j=3, so incorrectly NOT detected as a heading)
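
The mismatch can be reproduced outside smu with a small C sketch. The helper names `line_bytes` and `underline_run` are hypothetical and only mirror the two loops in `dounderline` that compute l and j; they are not part of smu.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the line starting at p,
 * like the loop that computes l in dounderline(). */
static size_t
line_bytes(const char *p, const char *end) {
	size_t l = 0;

	while (p + l != end && p[l] != '\n')
		l++;
	return l;
}

/* Hypothetical helper: length of the run of c starting at p,
 * like the loop that computes j in dounderline(). */
static size_t
underline_run(const char *p, const char *end, char c) {
	size_t j = 0;

	while (p + j < end && p[j] != '\n' && p[j] == c)
		j++;
	return j;
}
```

For "abc" both values are 3. For "ábc", where á is encoded as the two bytes 0xC3 0xA1, `line_bytes` gives 4 while `underline_run` on "===" still gives 3, so the `j >= l` test fails.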

This patch fixes this, while keeping backward compatibility.

diff --git a/smu.c b/smu.c
index fea4fbd..9272855 100644
--- a/smu.c
+++ b/smu.c
@@ -589,19 +589,20 @@ dosurround(const char *begin, const char *end, int newblock) {
 
 int
 dounderline(const char *begin, const char *end, int newblock) {
-       unsigned int i, j, l;
+       unsigned int i, j, l, k;
        const char *p;
 
        if (!newblock)
                return 0;
        p = begin;
-       for (l = 0; p + l != end && p[l] != '\n'; l++);
+       for(l = k = 0; p + l != end && p[l] != '\n'; l++)
+               if(p[l] > 0 || ((unsigned char)p[l] & 0xC0) == 0xC0) k++;
        p += l + 1;
        if (l == 0)
                return 0;
        for (i = 0; i < LENGTH(underline); i++) {
                for (j = 0; p + j < end && p[j] != '\n' && p[j] == underline[i].search[0]; j++);
-               if (j >= l) {
+               if (j >= k) {
                        fputs(underline[i].before, stdout);
                        if (underline[i].process)
                                process(begin, begin + l, 0);

This code still uses l for the length in bytes (just as the original), but it also counts the number of characters, multi-byte ones included, in k, and then compares j with k instead of l.
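
One caveat with the `p[l] > 0` test in the patch: whether plain `char` is signed is implementation-defined in C, so on a platform where `char` is unsigned that condition is true for every non-NUL byte and k ends up equal to l. A sketch of the same counting idea that does not depend on signedness, by skipping UTF-8 continuation bytes (those of the form 10xxxxxx), could look like this. `count_codepoints` is a hypothetical name, not a function in smu, and it assumes well-formed UTF-8 input.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in the n bytes at s,
 * assuming well-formed UTF-8. Every byte that is not a
 * continuation byte (10xxxxxx) starts a new code point. */
static size_t
count_codepoints(const char *s, size_t n) {
	size_t i, k = 0;

	for (i = 0; i < n; i++)
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			k++;
	return k;
}
```

With this, the four bytes of "ábc" count as three code points, matching an underline of three = characters.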

Cheers,
bzt

N-R-K (Collaborator) commented Jul 11, 2022

This takes care of most trivial cases, but there are still a couple of cases where this would fail. For example:

$ cat test.md
👨‍👩‍👦 Family
=========
$ ./smu test.md
<p>👨‍👩‍👦 Family
=========</p>

I think it's best to do what Hiltjo suggested in here and simply treat j > 3 as a heading.


P.S: Also for comparison, both lowdown and md4c are able to deal with the above case:

$ md2html test.md
<h1>👨‍👩‍👦 Family</h1>
$ lowdown test.md
<h1 id="family">👨‍👩‍👦 Family</h1>

bztsrc (Author) commented Jul 11, 2022

I think it's best to do what Hiltjo suggested in here and simply treat j > 3 as a heading.

That fails on my example above...

This takes care of most trivial cases, but there are still a couple of cases where this would fail.

I'm not sure. All writing systems have a single codepoint for their letters in Unicode. I don't consider multi-multi-byte emojis not working a failure when all of the other writing systems otherwise work just fine.

I think the best would be to entirely forget comparing with l (Hiltjo's patch has that problem too), and for simplicity just check j >= 3.

IMHO,
bzt

N-R-K (Collaborator) commented Jul 11, 2022

All writing systems have a single codepoint for their letters in Unicode. I don't consider multi-multi-byte emojis not working a failure when all of the other writing systems otherwise work just fine.

I'm not well versed enough in Unicode to know whether that's correct or not. But some languages can have multiple codepoints which fuse together into a single user-visible character. An example in Bengali:

মৌ মৌ
====

মৌ here consists of two codepoints.
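
To make the codepoint counts concrete, the two problem cases from this thread can be written out as codepoint arrays. This is only an illustrative sketch: the array names are hypothetical, the `LENGTH` macro is defined here on the assumption that it works like the one used in the diff, and the emoji is assumed to be the usual man, zero-width joiner, woman, zero-width joiner, boy sequence.

```c
#include <stddef.h>
#include <stdint.h>

#define LENGTH(x) (sizeof(x) / sizeof((x)[0]))

/* One visible character: BENGALI LETTER MA + BENGALI VOWEL SIGN AU. */
static const uint32_t bengali_mau[] = { 0x09AE, 0x09CC };

/* One visible emoji (assumed sequence): MAN, ZWJ, WOMAN, ZWJ, BOY. */
static const uint32_t family[] = {
	0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F466
};
```

Counting codepoints, the Bengali heading above is 2 + 1 + 2 = 5 against an underline of 4, and the family heading is 5 + 1 + 6 = 12 against an underline of 9, so a per-codepoint comparison rejects both.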

That fails on my example above...

Ah yes, my bad. Doing j >= 3 would take care of most practical use-cases. Though the following would still fail:

মৌ
=

Probably not worth it to try and deal with such cases. But just wanted to point it out.
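
A sketch of the `j >= 3` rule discussed here, detached from smu's `underline` table. The function name `is_underline` is hypothetical, and this is not claimed to be the code from any actual commit.

```c
#include <stddef.h>

/* Hypothetical check: the line at p counts as a heading underline
 * when it starts with at least three repetitions of c, regardless
 * of how long the heading text above it is. */
static int
is_underline(const char *p, const char *end, char c) {
	size_t j = 0;

	while (p + j < end && p[j] != '\n' && p[j] == c)
		j++;
	return j >= 3;
}
```

Under this rule "===" and "=========" both qualify whatever the heading contains, while a single "=" does not, which is the remaining failure case mentioned above.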

karlb added a commit that referenced this issue Jul 17, 2022
This avoids problems with counting unicode glyphs and negative
side-effects seem very unlikely.

Closes #7
karlb (Owner) commented Jul 17, 2022

I think it's best to do what Hiltjo suggested in Gottox#9 (comment) and simply treat j > 3 as a heading.

I like this approach. If users have titles with fewer than three glyphs, they can just put three characters in the underline to work around it.

The PR is #9.

karlb added a commit that referenced this issue Jul 23, 2022
karlb closed this as completed in 7d256e5 Jul 23, 2022