UTF-8 headings not detected (patch included) #7

bztsrc · 2022-07-11T05:40:50Z

Hi,

This code does not work if there are non-ASCII UTF-8 characters in the headings. The reason is that it expects the next line to contain at least as many = or - characters as the heading's length in bytes. That only holds for ASCII-only headings, such as plain English ones. With UTF-8, you can't use the number of bytes; you have to use the number of characters instead.

For example:

abc          (l=3)
===          (j=3, so correctly detected as a heading)

ábc          (l=4)
===          (j=3, so incorrectly NOT detected as a heading)
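
The byte counts in this example can be reproduced outside smu. What follows is a minimal sketch, not code from smu.c: `line_bytes` is a hypothetical helper that mirrors the loop measuring l, and the accented heading is spelled with explicit byte escapes so the result does not depend on the encoding of the source file.

```c
#include <stddef.h>

/* Hypothetical helper (not part of smu.c): number of bytes before the
 * first '\n' or the terminating NUL, i.e. the value the loop stores in l. */
size_t
line_bytes(const char *s) {
    size_t l;

    for (l = 0; s[l] != '\0' && s[l] != '\n'; l++);
    return l;
}
```

Here `line_bytes("abc\n===\n")` gives 3, while `line_bytes("\xC3\xA1" "bc\n===\n")` gives 4, because U+00E1 is encoded as the two bytes 0xC3 0xA1. The literal is split in two so that `bc` is not parsed as part of the hex escape.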

This patch fixes this, while keeping backward compatibility.

diff --git a/smu.c b/smu.c
index fea4fbd..9272855 100644
--- a/smu.c
+++ b/smu.c
@@ -589,19 +589,20 @@ dosurround(const char *begin, const char *end, int newblock) {
 
 int
 dounderline(const char *begin, const char *end, int newblock) {
-       unsigned int i, j, l;
+       unsigned int i, j, l, k;
        const char *p;
 
        if (!newblock)
                return 0;
        p = begin;
-       for (l = 0; p + l != end && p[l] != '\n'; l++);
+       for(l = k = 0; p + l != end && p[l] != '\n'; l++)
+               if(p[l] > 0 || ((unsigned char)p[l] & 0xC0) == 0xC0) k++;
        p += l + 1;
        if (l == 0)
                return 0;
        for (i = 0; i < LENGTH(underline); i++) {
                for (j = 0; p + j < end && p[j] != '\n' && p[j] == underline[i].search[0]; j++);
-               if (j >= l) {
+               if (j >= k) {
                        fputs(underline[i].before, stdout);
                        if (underline[i].process)
                                process(begin, begin + l, 0);

This code still uses l for the length in bytes, just as the original does, but it also counts the number of characters in k (each multi-byte sequence counts once), and then compares j with k instead of l.
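
One portability caveat: the `p[l] > 0` test only filters out continuation bytes where plain `char` is signed. Whether `char` is signed is implementation-defined in C, and on targets where it is unsigned every non-NUL byte satisfies `p[l] > 0`, so k would end up equal to l. A possible signedness-independent variant is sketched below. `utf8_count` is a hypothetical name, and the sketch assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in the first n bytes of s by
 * skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
 * Assumes s holds valid UTF-8; the cast makes the test independent of
 * whether plain char is signed. */
size_t
utf8_count(const char *s, size_t n) {
    size_t i, k;

    for (i = k = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            k++;
    return k;
}
```

For ASCII input the result equals the byte count, so existing ASCII headings behave as before.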

Cheers,
bzt

N-R-K · 2022-07-11T06:29:08Z

This takes care of most trivial cases, but there are still a couple of cases where this would fail. For example:

$ cat test.md
👨‍👩‍👦 Family
=========
$ ./smu test.md
<p>👨‍👩‍👦 Family
=========</p>

I think it's best to do what Hiltjo suggested in here and simply treat j > 3 as a heading.

P.S: Also for comparison, both lowdown and md4c are able to deal with the above case:

$ md2html test.md
<h1>👨‍👩‍👦 Family</h1>
$ lowdown test.md
<h1 id="family">👨‍👩‍👦 Family</h1>
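
To make the failure concrete: the family emoji is three emoji joined by two zero-width joiners (U+200D), so it is five code points even though it renders as one glyph. The sketch below is an illustration under that reading, not code from smu, lowdown or md4c; `line_codepoints` and `FAMILY_HEADING` are hypothetical names, and the heading is written as raw UTF-8 bytes.

```c
#include <stddef.h>

/* The heading from test.md spelled as UTF-8 bytes:
 * U+1F468, U+200D, U+1F469, U+200D, U+1F466, then " Family". */
#define FAMILY_HEADING \
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D" \
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D" \
    "\xF0\x9F\x91\xA6" " Family"

/* Hypothetical helper: code points before the first '\n' or NUL,
 * counted by skipping continuation bytes. Assumes valid UTF-8. */
size_t
line_codepoints(const char *s) {
    size_t i, k;

    for (i = k = 0; s[i] != '\0' && s[i] != '\n'; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            k++;
    return k;
}
```

The heading is 25 bytes and 12 code points, while the underline has 9 characters, so the comparison fails whether it uses the byte count or the code point count.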

bztsrc · 2022-07-11T08:51:09Z

> I think it's best to do what Hiltjo suggested in here and simply treat j > 3 as a heading.

That fails on my example above...

> This takes care of most trivial cases, but there's still a couple cases where this would fail.

I'm not sure. All writing systems have a single codepoint for their letters in Unicode. I don't consider multi-codepoint emoji not working a failure when all the other writing systems work just fine.

I think the best option would be to drop the comparison with l entirely (Hiltjo's patch has that problem too) and, for simplicity, just check j >= 3.

IMHO,
bzt

N-R-K · 2022-07-11T14:53:24Z

> All writing systems have a single codepoint for their letters in UNICODE. I don't consider multi-multi-byte emoji's not working as a failure, when all of the other writing system otherwise work just fine.

I'm not well versed enough in Unicode to know whether that's correct. But some languages can have multiple codepoints which fuse together into a single user-visible character. An example in Bengali:

মৌ মৌ
====

মৌ here consists of two codepoints.

> That fails on my example above...

Ah yes, my bad. Doing j >= 3 would take care of most practical use-cases. Though the following would still fail:

মৌ
=

Probably not worth it to try and deal with such cases. But just wanted to point it out.
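
The j >= 3 rule discussed above can be sketched as follows. This is an illustration only, not the code of any commit: `is_underlined` and `MIN_UNDERLINE` are hypothetical names, and the function checks a single underline character passed in by the caller instead of walking smu's `underline` table.

```c
#include <stddef.h>

#define MIN_UNDERLINE 3 /* threshold proposed in this thread */

/* Hypothetical helper: returns 1 if the line following the first '\n'
 * in s starts with at least MIN_UNDERLINE copies of mark, regardless
 * of how long the heading is. mark must not be '\0'. */
int
is_underlined(const char *s, char mark) {
    size_t l, j;

    for (l = 0; s[l] != '\0' && s[l] != '\n'; l++);
    if (l == 0 || s[l] == '\0')
        return 0; /* empty heading, or no second line */
    s += l + 1;
    for (j = 0; s[j] == mark; j++);
    return j >= MIN_UNDERLINE;
}
```

Because the heading length no longer enters the comparison, no character counting is needed at all. A heading underlined with fewer than three characters is still rejected, which is the remaining case pointed out above.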

karlb · 2022-07-17T09:46:44Z

> I think it's best to do what Hiltjo suggested Gottox#9 (comment) and simply treat j > 3 as a heading.

I like this approach. If users have titles with fewer than three glyphs, they can just put three characters in the underline to work around it.

The PR is #9.

karlb added a commit that referenced this issue Jul 17, 2022

Add test case unicode headings

a757a94

For #7

karlb added a commit that referenced this issue Jul 17, 2022

Allow sloppy underlines (>=3 characters)

a3466c0

This avoids problems with counting unicode glyphs and negative side-effects seem very unlikely. Closes #7

karlb added a commit that referenced this issue Jul 17, 2022

Add test case unicode headings

41357ce

For #7

karlb added a commit that referenced this issue Jul 17, 2022

Allow sloppy underlines (>=3 characters)

60a24ed

This avoids problems with counting unicode glyphs and negative side-effects seem very unlikely. Closes #7

karlb added a commit that referenced this issue Jul 23, 2022

Add test case unicode headings

c4b4c01

For #7

karlb closed this as completed in 7d256e5 Jul 23, 2022
