Add mdbook-i18n-normalize to convert existing PO files #46

Merged: 3 commits, Aug 14, 2023

@mgeisler (Collaborator) commented Aug 13, 2023

The new binary will take an old PO file with existing translations as input. It runs extract_messages on every message, matches up the new messages, and outputs a normalized PO file. The normalized file maintains the old translations when possible.
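
The matching step can be pictured with a rough Python sketch. It is not the tool's Rust implementation: `split_messages` is a hypothetical stand-in for `extract_messages` that merely splits on blank lines, the sample strings are made up, and keeping an entry whole when the piece counts differ is an assumption made only for this illustration.

```python
def split_messages(text):
    """Hypothetical stand-in for extract_messages: split on blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def normalize_entry(msgid, msgstr):
    """Return (msgid, msgstr) pairs for one old PO entry."""
    source_parts = split_messages(msgid)
    translated_parts = split_messages(msgstr)
    if len(source_parts) != len(translated_parts):
        # Assumption for this sketch: without a one-to-one match,
        # keep the old entry whole instead of guessing a pairing.
        return [(msgid, msgstr)]
    return list(zip(source_parts, translated_parts))


# Made-up entry with two paragraphs in both source and translation.
old_msgid = "# Welcome\n\nThis is a course."
old_msgstr = "# [ko] Welcome\n\n[ko] This is a course."
for source, translation in normalize_entry(old_msgid, old_msgstr):
    print(repr(source), "->", repr(translation))
```

Each pair would become its own entry in the normalized PO file, so the old translation survives as long as the pieces line up.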

How is this tested? I used the Korean translation of Comprehensive Rust as a starting point since it is the most complete translation at the moment. I ran the following command

MDBOOK_OUTPUT='{"markdown": {}}' MDBOOK_BOOK__LANGUAGE=ko mdbook build -d markdown-ko

to generate Markdown using the current translation. I then ran

mdbook-i18n-normalize po/ko.po po/ko-norm.po
MDBOOK_OUTPUT='{"markdown": {}}' MDBOOK_BOOK__LANGUAGE=ko-norm mdbook build -d markdown-ko-norm

to generate Markdown using the normalized file. Before comparing the two directories, I made all links inline links in the old translation (comrak from https://github.com/kivikakk/comrak happens to do this) and I formatted all files as well:

for f in markdown-*/**/*.md; do
    comrak "$f" --to commonmark --gfm > "$f.tmp" && mv "$f.tmp" "$f"
    sed -i 's/<!-- end list -->//g' "$f"
done
dprint fmt 'markdown-ko-*/**/*.md'

After all this, diff markdown-ko markdown-ko-norm shows differences in 9 files. Some of the differences are changes where we now invalidate less of the file when the original English text changes. An example can be seen in “Running the Course”, which has a large list. We now translate list items one by one:

diff --color --unified markdown-ko/running-the-course.md markdown-ko-norm/running-the-course.md
--- markdown-ko/running-the-course.md	2023-08-13 16:34:58.367680394 +0200
+++ markdown-ko-norm/running-the-course.md	2023-08-13 16:34:58.371680346 +0200
@@ -1,4 +1,4 @@
-# 강의 진행 방식
+# 강의 진행

 > 강사를 위한 안내 페이지입니다.

@@ -6,11 +6,10 @@

 강의를 실행하기 위한 준비:

-1. Make yourself familiar with the course material. We've included speaker notes
-   to help highlight the key points (please help us by contributing more speaker
-   notes\!). When presenting, you should make sure to open the speaker notes in
-   a popup (click the link with a little arrow next to "Speaker Notes"). This
-   way you have a clean screen to present to the class.
+1. 강의 자료를 숙지합니다. 주요 요점을 강조하기 위해 강의 참조 노트를
+   포함하였습니다. (추가적인 노트를 작성하여 제공해 주시면 감사하겠습니다.) 강의
+   참조 노트의 링크를 누르면 별도의 팝업으로 분리가 되며, 메인 화면에서는
+   사라집니다. 깔끔한 화면으로 강의를 진행할 수 있습니다.

 2. Decide on the dates. Since the course takes at least three full days, we
    recommend that you schedule the days over two weeks. Course participants have

This new fine-grained extraction technique allows us to keep the translation for the first list item, even though something has changed in the later items.
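
The effect on a changed list can be sketched in Python as follows. The catalogs, the shortened English strings, and the `[ko]` placeholder translations are all made up for illustration; the real lookup happens inside the mdbook-i18n-helpers tooling and is not shown here.

```python
def translate_whole_list(items, catalog):
    """Coarse-grained: the entire list is one message, so any change invalidates it."""
    message = "\n".join(items)
    return catalog.get(message, message).split("\n")


def translate_per_item(items, catalog):
    """Fine-grained: each item is its own message with an English fallback."""
    return [catalog.get(item, item) for item in items]


# The second English item was edited after the translation was written.
items = ["Read the material.", "Decide on the dates (updated)."]
coarse_catalog = {
    "Read the material.\nDecide on the dates.": "[ko] Read the material.\n[ko] Decide on the dates.",
}
fine_catalog = {
    "Read the material.": "[ko] Read the material.",
    "Decide on the dates.": "[ko] Decide on the dates.",
}

print(translate_whole_list(items, coarse_catalog))  # whole list falls back to English
print(translate_per_item(items, fine_catalog))      # first item stays translated
```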

The next step is to make a new release of mdbook-i18n-helpers and then normalize all existing translations to the new, more fine-grained format. This will make life easier for translators because it removes the direct dependency on the formatting of the Markdown (# foo and ## foo are both extracted into just "foo").
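
As a small illustration of the heading case, the following Python sketch strips the leading `#` markers with a regular expression. The function name is hypothetical and the regex is only a stand-in; how `extract_messages` handles headings internally is not shown in this PR description.

```python
import re


def strip_heading_markup(line):
    """Drop leading '#' markers from an ATX heading; other lines pass through."""
    return re.sub(r"^#{1,6}\s+", "", line)


print(strip_heading_markup("# foo"))   # foo
print(strip_heading_markup("## foo"))  # foo
```

Because both headings reduce to the same message, changing a heading level in the English source would no longer invalidate the translation.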

Fixes #33.

@mgeisler mgeisler requested a review from djmitche August 13, 2023 15:18
@mgeisler mgeisler force-pushed the normalize branch 2 times, most recently from 05954e3 to f7bf78c Compare August 13, 2023 15:23
These links would normally end up as-is in the output text. This is a
problem since it decouples them from the link definitions, which in
turn makes it hard to translate since the message with the link is far
removed from the message that defines the link. Inlining everything
ensures that each message is self-contained.
Extracting reference links without the link definition results in a
broken link. Such links are turned into an escaped form in the output
of `extract_messages`.

See #33.
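
The inlining described in these commit messages can be illustrated with a simplified Python sketch. It is an assumption-laden approximation, not the project's code: it uses regular expressions rather than a Markdown parser, handles only full (`[text][label]`) and collapsed (`[text][]`) reference links, ignores link titles, and uses a made-up example URL.

```python
import re

# A link definition on its own line, e.g. "[1]: https://example.com".
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$", re.MULTILINE)
# A full or collapsed reference link, e.g. "[text][1]" or "[text][]".
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def inline_reference_links(markdown):
    """Rewrite [text][label] as [text](url) and drop the link definitions."""
    # Collect definitions, keyed by lowercase label.
    definitions = {label.lower(): url for label, url in DEFINITION.findall(markdown)}

    def replace(match):
        text, label = match.group(1), match.group(2) or match.group(1)
        url = definitions.get(label.lower())
        # Leave the link untouched when its definition is missing.
        return f"[{text}]({url})" if url else match.group(0)

    body = DEFINITION.sub("", markdown)
    return REFERENCE.sub(replace, body).strip()


doc = "See the [book][1] for details.\n\n[1]: https://example.com/book"
print(inline_reference_links(doc))
```

After inlining, the sentence carries its own URL, so it can be extracted and translated as a self-contained message.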

@djmitche (Collaborator) left a comment

I didn't run this, but the code looks good and sensible. I can take a deeper look if you'd like, but it seems like your experimentation demonstrates that it's fit for purpose.

@mgeisler (Collaborator, Author) commented

> I didn't run this, but the code looks good and sensible. I can take a deeper look if you'd like, but it seems like your experimentation demonstrates that it's fit for purpose.

Thanks for skimming it — that's all I can ask for! 🙂

@mgeisler mgeisler merged commit 22ed587 into main Aug 14, 2023
5 checks passed
@mgeisler mgeisler deleted the normalize branch August 14, 2023 17:53
Successfully merging this pull request may close these issues.

Reference links are escaped