Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken meta.date_formatted output #560

Closed
HuidaeCho opened this issue Aug 8, 2020 · 5 comments
Closed

Broken meta.date_formatted output #560

HuidaeCho opened this issue Aug 8, 2020 · 5 comments

Comments

@HuidaeCho
Copy link

HuidaeCho commented Aug 8, 2020

Hello,

I set my date_format in config/config.yml to %Y년 %m월 %d일 in ko_KR.UTF-8 (Korean). When I print {{ meta.date_formatted }}, its output is broken (2019ë 12ì 07ì¼ instead of 2019년 12월 07일) because of line 1,522 in vendor/picocms/pico/lib/Pico.php:

$meta['date_formatted'] = $meta['time'] ? utf8_encode(strftime($dateFormat, $meta['time'])) : '';

I was able to fix this issue by removing utf8_encode():

$meta['date_formatted'] = $meta['time'] ? strftime($dateFormat, $meta['time']) : '';

This issue is due to double UTF-8 encodings. It can be easily replicated:

<?php
$a = "2019년 12월 07일";
echo $a."\n";              # OK
echo utf8_encode($a)."\n"; # BAD; $a is already in UTF-8; don't utf8_encode it again!

outputs

2019년 12월 07일
2019ë 12ì 07ì¼

Please remove utf8_encode() from the above line.

Thanks!

@PhrozenByte
Copy link
Collaborator

We really should implement i18n and not rely on strftime() 😒

Anyway, since strftime() relies on the system's encoding (which might get changed by setlocale()) we must encode the value on systems using a non-UTF-8 default charset. I'd assume basically all Linux systems to use UTF-8, but there are OS vendors out there which are a few decades behind (Windows, obviously 😒), so... Since you use a UTF-8 encoded locale you shouldn't experience any problems with the solution below. However, on a side note for other users, this won't work for people using e.g. ko_KR.EUC-KR. That's because the static parts of your date string are encoded in UTF-8, while the strftime() placeholders would use EUC-KR, resulting in mixed encodings. If you use non-ASCII static chars in your date format you must use a UTF-8 encoded locale.

The funny thing is that PHP includes special charset detection lists (available via mb_detect_order()) for Japanese, Korean, Traditional and Simplified Chinese, Russian, Armenian, Turkish and Ukrainian, so that these locales should work out of the box with the solution below. However, the fallback charset detection list (called "neutral", which I assume is a bit more common…) doesn't include Windows-1252 (which is the most common non-cryllic charset after ISO-8859-1). So we need yet another hack just for PHP... 😒 So, on yet another side note for other users, if you experience encoding issues with the formatted date string, set mbstring.detect_order in your php.ini to ASCII, UTF-8, <your system's charset>.

Can you please check whether the following works for you? (it should, just wanna be sure)

if ($meta['time']) {
    $encodingList = (mb_detect_order() === [ 'ASCII', 'UTF-8' ]) ? [ 'ASCII', 'UTF-8', 'CP1252' ] : mb_detect_order();
    $meta['date_formatted'] = mb_convert_encoding(strftime($dateFormat, $meta['time']), 'UTF-8', $encodingList);
} else {
    $meta['date_formatted'] = '';
}

Still a hack... As I said, relying on strftime() is bad... But hopefully a better working hack, even though it still has some issues.

@HuidaeCho
Copy link
Author

I see. I replaced the original if (empty($meta['date_formatted'])) block with

if (empty($meta['date_formatted']) && $meta['time']) {
    $encodingList = (mb_detect_order() === [ 'ASCII', 'UTF-8' ]) ? [ 'ASCII', 'UTF-8', 'CP1252' ] : mb_detect_order();
    $dateFormat = $this->getConfig('date_format');
    $meta['date_formatted'] = mb_convert_encoding(strftime($dateFormat, $meta['time']), 'UTF-8', $encodingList);
}

and it works! Thanks!

@HuidaeCho
Copy link
Author

HuidaeCho commented Aug 9, 2020

However, on a side note for other users, this won't work for people using e.g. ko_KR.EUC-KR. That's because the static parts of your date string are encoded in UTF-8, while the strftime() placeholders would use EUC-KR, resulting in mixed encodings.

Here, do you mean that the encoding of config/config.yml can be different from that of the system so that the date_format setting may be in UTF-8 while the system locale may be EUC-KR? In this case, strftime() is passed a UTF-8 string and outputs a EUC-KR string? It may not even work properly, I guess.

Or do you mean that both the encodings of config/config.yml and the system may be in EUC-KR, but the Pico site setting is set to produce UTF-8 and NOT converting EUC-KR strftime() output to UTF-8 can be problematic?

Just curious. Thanks!

@HuidaeCho
Copy link
Author

And is this hack going to be part of Pico or should I keep patching it myself?

@PhrozenByte
Copy link
Collaborator

Exactly. If you save config/config.yml in UTF-8 (what is default) and add static chars to date_format, the static chars are encoded in UTF-8. They are then passed to strftime() which replaces placeholders with chars encoded in the system's default charset. If the placeholders are replaced by ASCII chars only anyway (like %Y, the four digit representation for the year) it wouldn't make any difference. But for placeholders which yield non-ASCII chars (like %B, the full month name, based on the locale) the replacement chars are encoded in the system's charset, resulting in a string with some chars being encoded in UTF-8 and some in the system's charset. Pico always uses UTF-8.

And is this hack going to be part of Pico or should I keep patching it myself?

Will be released with v2.1.4 👍

PhrozenByte added a commit that referenced this issue Aug 29, 2020
```
* [Changed] Silence PHP errors in Parsedown
* [Fixed] #560: Improve charset guessing for formatted date strings using
          `strftime()` (Pico always uses UTF-8, but `strftime()` might not)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants