
Write tool which can convert translated files back to PO #30

Open · mgeisler opened this issue May 12, 2023 · 9 comments

@mgeisler (Collaborator)

This idea is from rust-embedded/book#326: we should write a converter tool which takes two Markdown files as input and outputs a PO file.

More concretely, the tool should take an en/foo.md and a xx/foo.md file and output a xx.po file. The tool will call extract_messages on both files and line up the results. It will use the messages from en/foo.md as the msgid and the corresponding message from xx/foo.md as the msgstr.

The output is marked fuzzy to ensure that a human translator double-checks it all before publication.
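
A minimal sketch of what such a tool could look like, assuming extract_messages returns (line number, message) pairs as in mdbook-i18n-helpers, and pairing messages purely by position (the `{:?}` formatting is a stand-in for real PO string escaping):

```rust
use std::fs;

use mdbook_i18n_helpers::extract_messages;

fn main() {
    // Hypothetical usage: md2po en/foo.md xx/foo.md > xx.po
    let args: Vec<String> = std::env::args().collect();
    let source = fs::read_to_string(&args[1]).expect("cannot read source file");
    let translation = fs::read_to_string(&args[2]).expect("cannot read translated file");

    // Extract messages from both files; assumed to return
    // (line number, message) pairs.
    let msgids = extract_messages(&source);
    let msgstrs = extract_messages(&translation);

    // Naive positional pairing: this breaks down as soon as the two
    // documents diverge structurally.
    for ((_, msgid), (_, msgstr)) in msgids.iter().zip(msgstrs.iter()) {
        // Mark every entry fuzzy so a human translator reviews it.
        println!("#, fuzzy");
        println!("msgid {:?}", msgid);
        println!("msgstr {:?}", msgstr);
        println!();
    }
}
```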

mgeisler changed the title from "Write tool which can convert back to PO" to "Write tool which can convert translated files back to PO" on May 19, 2023
mgeisler added the good first issue label on Jun 13, 2023
mgeisler added the enhancement label on Aug 14, 2023
mgeisler added the help wanted label on Aug 24, 2023
@memark commented Sep 1, 2023

I'd be happy to take a look at this!

It would be a separate bin crate then?

Are there any good example .md files, or should I just generate them myself from some already translated mdbook? (Update: I've generated some example files myself, with po4a, to work with.)

@mgeisler (Collaborator, Author) commented Sep 4, 2023

Hejsa! Thanks for looking at this!

I was thinking of using some of the Rust book translations to start with, e.g., pick one here which is somewhat current and which you can read 🙂

The Rust Embedded book is also translated into a few languages — that's where the idea came from.

You might be able to reuse some of the logic in mdbook-i18n-normalize. That tool takes a PO file and looks at each (msgid, msgstr) pair. It then runs extract_messages on both to get two lists:

```rust
let msgids = ["msgid_1", "msgid_2", "msgid_1"];
let msgstrs = ["msgstr_1", "msgstr_2", "msgstr_1"];
```

Finally, it zips those lists and outputs the ("msgid_N", "msgstr_N") pairs to the PO file. Having written this, it occurs to me that this is what your tool will have to do as well.
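
In code, that per-pair step might look roughly like this (a hypothetical sketch, not the actual mdbook-i18n-normalize code; extract_messages is assumed to return (line number, message) pairs):

```rust
use mdbook_i18n_helpers::extract_messages;

/// Hypothetical sketch: re-extract both sides of one PO entry with the
/// current extraction logic and pair the results up by position.
fn normalize_pair(msgid: &str, msgstr: &str) -> Vec<(String, String)> {
    let msgids = extract_messages(msgid);
    let msgstrs = extract_messages(msgstr);
    // Note: zip silently drops any unmatched tail, so a real tool needs
    // the length-mismatch handling discussed later in this thread.
    msgids
        .into_iter()
        .zip(msgstrs)
        .map(|((_, id), (_, s))| (id, s))
        .collect()
}
```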

One feature your tool could have would be to (attempt to) synchronize the translations at various points. So if the source text is

```markdown
# The Foo Project

Welcome to the Foo project...

## Getting Started

To install Foo, do ...
```

and the translated document is

```markdown
# Foo-projektet

Velkommen til Foo-projektet...

## Komme godt igang

For at installere Foo, ...
```

then the conversion tool could align on the # and ## headings. This would make it robust against extra paragraphs being inserted in either language.

Similarly with lists and block quotes: they could perhaps also be synchronization points. This is just an idea; I'm not sure it's useful 😄
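
As a hypothetical sketch of that idea (the function name is made up), the extracted messages could be grouped into sections at each heading, so that positional pairing only happens within a section rather than across the whole document:

```rust
/// Split a flat list of extracted messages into sections, starting a
/// new section at every heading. Pairing sections by index (instead of
/// pairing individual messages) tolerates an extra paragraph being
/// inserted inside a section in either language.
fn split_at_headings(messages: &[String]) -> Vec<Vec<String>> {
    let mut sections: Vec<Vec<String>> = vec![Vec::new()];
    for message in messages {
        if message.starts_with('#') {
            // A heading acts as a synchronization point.
            sections.push(Vec::new());
        }
        sections.last_mut().unwrap().push(message.clone());
    }
    sections
}
```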

@mgeisler (Collaborator, Author)

> It would be a separate bin crate then?

I missed this part. I think it could just be a new binary next to the others to start with. If people like the tool but somehow don't want to see mdbook-* binaries on their system, then we could consider moving this binary to its own crate. Luckily, mdbook-i18n-helpers is already a library crate, so it's easy to depend on it from other crates.

@memark commented Sep 25, 2023

> I think it could just be a new binary next to the others to start with.

Yes, that's what I mean by "separate bin crate". I agree that we might do a separate "package" later on if needed.

Additionally, I'm a bit held up at the moment with other work, but I'll keep working on this whenever I have the time.

> I was thinking of using some of the Rust book translations to start with

From that link, how do I get the actual translated .md file? It seems that all that's available is either .po or .html files.

@mgeisler (Collaborator, Author)

Hejsa!

> I think it could just be a new binary next to the others to start with.

> Yes, that's what I mean by "separate bin crate". I agree that we might do a separate "package" later on if needed.

I see, great!

> Additionally, I'm a bit held up at the moment with other work, but I'll keep working on this whenever I have the time.

That's completely fine, I know the feeling 😄

> I was thinking of using some of the Rust book translations to start with

> From that link, how do I get the actual translated .md file? It seems that all that's available is either .po or .html files.

Clicking through to the Swedish translation (to keep a Scandinavian theme — I'm from Denmark and I guess you're from Sweden) leads me to https://github.com/sebras/book/tree/master/src where I see a bunch of translated files. So I would use these two files as a starting point:

I'm not sure if the idea actually works... will 10% of the paragraphs line up, or 80%? Just from skimming those two files, it looks like a lot of text lines up very nicely.

@memark commented May 9, 2024

Hejsan!

I've (finally) started looking into this. The code required for reading, parsing, and extracting is pretty straightforward. The hard part is the syncing. Having looked at a couple of different translations of the book, they can differ a lot, a little, or not at all.

Some examples (filename: number of extracted messages):

```
ch00-00-introduction.en.md: 45
ch00-00-introduction.sv.md: 46

ch01-02-hello-world.en.md: 45
ch01-02-hello-world.sv.md: 42

ch02-00-guessing-game-tutorial.en.md: 177
ch02-00-guessing-game-tutorial.pt.md: 252
```

Trying to line them up might work in really simple scenarios where the structure is still intact. But in the PT case above, the EN version is really far ahead. The main problem I see is that there is a lack of context in the MD files. If a single heading (or other sync point of choice) is left out of either file, the remaining message pairs will be garbage. Even with a small difference like 42 vs 45, there is no way to tell whether the diff was at message 1 or message 40.

So what is "good enough" here? Should we fail as soon as something appears to be off, or greedily try to do what we can? What's the threshold? Strict or lax? I need some guidance here.

@memark commented May 9, 2024

Btw, having run my own tool (which calls extract_messages()) a number of times and then closed down my editor and my terminal, I'm left with a number of zombie processes that linger, consuming CPU.
(I only noticed this because it is the first time ever that the fan on my M3 has started.)
Is this expected?

[Screenshot: a process list showing several lingering group_events-... processes consuming CPU.]

@mgeisler (Collaborator, Author)

> Trying to line them up might work in really simple scenarios where the structure is still intact. But in the PT case above, the EN version is really far ahead.

That is a very good point and I don't think there is a clear right answer here.

Our mdbook-i18n-normalize tool has a similar problem: it normalizes a .po file by re-applying the extraction logic. The idea is that we've improved the way we extract messages over time, so this tool lets you migrate an old xx.po file to benefit from the latest improvements. The biggest example of this: we used to extract a Markdown table as a single unit; now we extract the text cell by cell.

When we normalize each (msgid, msgstr) pair, we run into similar problems to what you have here: we could easily end up with a different number of extracted messages from the msgid and from the msgstr. The normalization tool handles this in a very primitive way here:

```rust
match new_msgids.len().cmp(&new_msgstrs.len()) {
    std::cmp::Ordering::Less => {
        // Treat left-over translations as separate paragraphs.
        // This makes normalization stable.
        let tail = new_msgstrs[new_msgids.len() - 1..]
            .iter()
            .map(|(_, msgstr)| msgstr.as_str())
            .collect::<Vec<_>>()
            .join("\n\n");
        new_msgstrs.truncate(new_msgids.len() - 1);
        new_msgstrs.push((0, tail));
    }
    std::cmp::Ordering::Greater => {
        // Set missing msgstr entries to "".
        new_msgstrs.resize(new_msgids.len(), (0, String::new()));
    }
    _ => {}
}
```

If there are extra messages in the msgstr field, it accumulates them in the final msgstr entry. This is confusing in its own way and indirectly led to #125.

> The main problem I see is that there is a lack of context in the MD files. If a single heading (or other sync point of choice) is left out of either file, the remaining message pairs will be garbage. Even with a small difference like 42 vs 45, there is no way to tell whether the diff was at message 1 or message 40.

You mention a "sync point"... which is something we haven't looked into before! For the use case of generating a .po file from two .md files, I think establishing some sync points could be very helpful. These could be Markdown features which we hope are stable over time:

  • headings of different levels
  • lists
  • tables
  • block quotes

and perhaps others?

I haven't thought this through, but I could imagine using a longest common subsequence or diffing algorithm to find common parts of the Markdown AST between the two documents.

The flow would be something like this:

  1. Extract the AST for each document

  2. Normalize all text so that you're left with a normalized outline. Something like

    # Heading 1
    ## Heading 1.1
    
    Paragraph 1 in 1.1
    
    Paragraph 2 in 1.1
    
    - Item 1 in 1.1
    - Item 2 in 1.1
    
    Paragraph 3 in 1.1

    Here I'm thinking that it'll be good to include the heading numbers in the normalized paragraph and list item text to give the diffing algorithm something to "attach to"?

  3. Diff the two normalized documents. Everything that matches gives you a mapping from msgid (from the first document) to msgstr (from the second document). Things that are extra in the second document can be dropped; things that are missing from the second document just get a msgstr equal to "".

  4. Go back from the normalized outlines to the original documents and output a .po file.

At the end of the day, one document will be the source. So regardless of what the translated document contains, the source will win: if there are 10 paragraphs in the source document, then there should also be 10 paragraphs in the translation. This is kind of a built-in limitation of the way we attempt to chop up the source document when we do translations.
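
To make step 3 concrete, here is a rough, hypothetical sketch using the similar crate for the diff. It reduces each message to a structural tag so the alignment depends only on document shape, not on the translated text; a real implementation would tag nodes from the Markdown AST instead of matching string prefixes, and all names here are illustrative:

```rust
use similar::{ChangeTag, TextDiff};

/// Reduce each message to a crude structural tag so the diff aligns on
/// document shape rather than on untranslatable text.
fn outline(messages: &[String]) -> String {
    messages
        .iter()
        .map(|m| {
            if m.starts_with("##") {
                "h2"
            } else if m.starts_with('#') {
                "h1"
            } else if m.starts_with('-') || m.starts_with('*') {
                "list"
            } else if m.starts_with('>') {
                "quote"
            } else {
                "p"
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Pair up the message indices whose structural tags the diff reports
/// as unchanged. Extra blocks in the translation are dropped; blocks
/// missing from it will later get an empty msgstr.
fn align(source: &[String], translation: &[String]) -> Vec<(usize, usize)> {
    let (a, b) = (outline(source), outline(translation));
    let diff = TextDiff::from_lines(&a, &b);
    let mut pairs = Vec::new();
    let (mut i, mut j) = (0, 0);
    for change in diff.iter_all_changes() {
        match change.tag() {
            ChangeTag::Equal => {
                pairs.push((i, j));
                i += 1;
                j += 1;
            }
            ChangeTag::Delete => i += 1, // block only in the source
            ChangeTag::Insert => j += 1, // block only in the translation
        }
    }
    pairs
}
```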

@mgeisler (Collaborator, Author)

> Btw, having run my own tool (which calls extract_messages()) a number of times and then closed down my editor and my terminal, I'm left with a number of zombie processes that linger, consuming CPU.
> (I only noticed this because it is the first time ever that the fan on my M3 has started.)
> Is this expected?

Nope, I would not expect this — everything in this library should be completely in-process and not spawn anything. It only does text-to-text transformations.

The group_events-... name is interesting: we don't have a binary with that name, so I guess it comes from somewhere else?
