Man writer doesn't use UTF-8 encoding but escapes all non-Latin letters. #8507

Closed
van-de-bugger opened this issue Dec 28, 2022 · 1 comment

van-de-bugger commented Dec 28, 2022

Consider an example:

$ cat test.md
Ελληνικά
========

српски հայերեն

The source markdown file includes Greek, Cyrillic, and Armenian letters.

$ pandoc -s -t man < test.md > test.man

$ man -P cat ./test.man
()                                                           ()

Ελληνικά
       српски հայերեն

Pandoc converted the markdown to a man page, and it renders correctly. However, let's look at the content of the .man file:

$ cat test.man
.\" Automatically generated by Pandoc 2.14.0.3
.\"
.TH "" "" "" "" ""
.hy
.SH \[*E]\[*l]\[*l]\[*y]\[*n]\[*i]\[*k]\[u03AC]
.PP
\[u0441]\[u0440]\[u043F]\[u0441]\[u043A]\[u0438]
\[u0570]\[u0561]\[u0575]\[u0565]\[u0580]\[u0565]\[u0576]

Look, all the non-Latin characters are represented as escape sequences. It is not a showstopper, since the rendered man page looks good, but every non-Latin character takes 5 bytes (Greek letters with a named glyph, such as \[*E]) or 8 bytes (Cyrillic, Armenian, and accented Greek such as \[u03AC]). Unescaped, each of them would occupy only 2 bytes in UTF-8. It is just a waste of space.
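To make the size comparison concrete, here is a small Python sketch that measures a few characters from the example against the escapes shown in the generated file. The helper name `byte_sizes` and the sample table are illustrative only and are not part of pandoc or groff.

```python
# Compare raw UTF-8 size with the size of the groff escape pandoc emitted.
# The escapes are copied from the generated test.man shown above.
SAMPLES = {
    "Greek, named glyph": ("Ε", r"\[*E]"),
    "Greek, accented": ("ά", r"\[u03AC]"),
    "Cyrillic": ("с", r"\[u0441]"),
    "Armenian": ("հ", r"\[u0570]"),
}


def byte_sizes(char: str, escape: str) -> tuple[int, int]:
    """Return (bytes as raw UTF-8, bytes as groff escape)."""
    return len(char.encode("utf-8")), len(escape.encode("ascii"))


for label, (char, escape) in SAMPLES.items():
    raw, escaped = byte_sizes(char, escape)
    print(f"{label}: {raw} bytes raw, {escaped} bytes escaped")
```

Each of these characters lies below U+0800, which is why the raw UTF-8 form takes two bytes.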

Modern groff allows using UTF-8 encoding in source files:

$ cat test.man
.\" Automatically generated by Pandoc 2.14.0.3
.\"
.TH "" "" "" "" ""
.hy
.SH Ελληνικά
.PP
српски հայերեն

$ groff -D utf8 -m man -T utf8 < test.man
()                                                           ()
Ελληνικά
       српски հայերեն
                                                             ()

Thus, I request that the man writer output non-Latin characters as-is, without converting them to escape sequences.

Pandoc version:

$ pandoc --version
pandoc 2.14.0.3
Compiled with pandoc-types 1.22.1, texmath 0.12.3.3, skylighting 0.10.5.2,
citeproc 0.4.0.1, ipynb 0.1.0.1
User data directory: /home/vdb/.local/share/pandoc
Copyright (C) 2006-2021 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

It is not the latest available version. However, I scanned the pandoc release notes for releases after 2.14.0.3, and it seems there were no changes to the man writer.

BTW, in Fedora 37, man pages in languages with non-Latin writing systems do not use escape sequences. For example, Serbian:

$ cat /usr/share/man/sr/man1/cat.1.gz | gunzip | head -n20
.\" -*- coding: UTF-8 -*-
.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.48.5.
.\"*******************************************************************
.\"
.\" This file was generated with po4a. Translate the source file.
.\"
.\"*******************************************************************
.TH CAT 1 "Августа 2022" "ГНУ coreutils 9.1" "Корисничке наредбе"
.SH НАЗИВ
cat \- concatenate files and print on the standard output
.SH УВОД
\fBcat\fP [\fI\,ОПЦИЈА\/\fP]... [\fI\,ДАТОТЕКА\/\fP]...
.SH ОПИС
.\" Add any additional description here
.PP
Надовежите ДАТОТЕКУ(Е) на стандардни излаз.
.PP
Без ДАТОТЕКЕ, или када је ДАТОТЕКА \-, чита стандардни улаз.
.TP 
\fB\-A\fP, \fB\-\-show\-all\fP

Or Japanese:

$ cat /usr/share/man/ja/man1/cat.1.gz | gunzip | head -n20
.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.47.13.
.TH CAT "1" "2021年5月" "GNU coreutils" "ユーザーコマンド"
.SH 名前
cat \- ファイルの内容を連結して標準出力に出力する
.SH 書式
.B cat
[\fI\,オプション\/\fR]... [\fI\,ファイル\/\fR]...
.SH 説明
.\" Add any additional description here
.PP
ファイル (複数可) の内容を結合して標準出力に出力します。
.PP
ファイルの指定がない場合や FILE が \- の場合, 標準入力から読み込みを行います。
.HP
\fB\-A\fR, \fB\-\-show\-all\fR           \fB\-vET\fR と同じ
.TP
\fB\-b\fR, \fB\-\-number\-nonblank\fR
空行以外に行番号を付ける。\-n より優先される
.HP
\fB\-e\fR                       \fB\-vE\fR と同じ

I don't know about other distros, though.

Owner

jgm commented Dec 29, 2022

It used to be that UTF-8 in man pages was not reliably supported.
Perhaps that situation has changed and we can revisit this. In any case, we could keep the present behavior when the --ascii option is used.
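The behavior proposed here could be sketched roughly as follows, in Python for illustration. This is not pandoc's implementation (pandoc is written in Haskell), and the function name `to_man_text` and its `ascii_only` parameter are hypothetical. The sketch assumes that non-ASCII text passes through unchanged by default and is escaped only when an ASCII-only mode is requested. It also uses the generic `\[uXXXX]` form for every character, whereas the actual output shown earlier uses named glyphs such as `\[*E]` for some Greek letters, and it ignores the escaping of roff-special characters such as backslashes.

```python
def to_man_text(text: str, ascii_only: bool = False) -> str:
    """Render text for a man page (hypothetical helper, not pandoc's API).

    By default, characters pass through unchanged. With ascii_only=True,
    every non-ASCII character becomes a groff \\[uXXXX] escape with at
    least four uppercase hexadecimal digits.
    """
    if not ascii_only:
        return text
    return "".join(
        ch if ord(ch) < 128 else f"\\[u{ord(ch):04X}]"
        for ch in text
    )


print(to_man_text("српски"))
print(to_man_text("српски", ascii_only=True))
```

With `ascii_only=True`, the Cyrillic word produces the same escape sequence that appears in the generated test.man above.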

jgm closed this as completed in ce7d1d1 on Dec 29, 2022