Rewrite the po4a-gettextize documentation

mquinson · Aug 6, 2022 · f2bddad · f2bddad
1 parent 8463b1c
commit f2bddad
Showing 2 changed files with 160 additions and 118 deletions.
diff --git a/NEWS b/NEWS
@@ -12,6 +12,7 @@ __   __/ _ \ / /_  ( _ )
 Project organization:
  * Add a deprecation warning to po4a-translate and po4a-updatepo stating
    that po4a is the preferred interface.
+ * Rewrite the po4a-gettextize documentation (Debian's #1016695).
 
 Translations:
  * Updated: German, thanks Helge Kreutzmann.

diff --git a/po4a-gettextize b/po4a-gettextize
@@ -28,15 +28,17 @@ the classical gettext tools. The main feature of po4a is that it decouples the
 translation of content from its document structure.  Please refer to the page
 L<po4a(7)> for a gentle introduction to this project.
 
-The B<po4a-gettextize> script is in charge of converting documentation files into
-PO files. You only need it to setup your translation project with po4a, never afterward.
-
-If you start from scratch, B<po4a-gettextize> will extract the translatable
-strings from the documentation and write a POT file. If you provide a previously
-existing translated file with the B<-l> flag, B<po4a-gettextize> will try to use
-the translations that it contains in the produced PO file. This process remains
-tedious and manual, as explained in Section 'Converting a manual translation to
-po4a' below.
+The B<po4a-gettextize> script helps you convert your previously existing
+translations into a po4a-based workflow. This is only to be done once to salvage
+an existing translation while converting to po4a, not on a regular basis after
+the conversion of your project. This tedious process is explained in detail in
+Section 'Converting a manual translation to po4a' below.
+
+You must provide both a master file (e.g., the source in English) and an
+existing translated file (e.g., a previous translation attempt without po4a). If
+you provide more than one master or translation file, they will be used in
+sequence, but it may be easier to gettextize each page or chapter separately and
+then use B<msgmerge> to merge all produced PO files. As you wish.
 
 If the master document has non-ASCII characters, the new generated PO file will
 be in UTF-8. Else (if the master document is completely in ASCII), the generated
@@ -79,8 +81,8 @@ catalog will be written to the standard output.
 =item B<-o>, B<--option>
 
 Extra option(s) to pass to the format plugin. See the documentation of each
-plugin for more information about the valid options and their meanings. For 
-example, you could pass '-o tablecells' to the AsciiDoc parser, while the 
+plugin for more information about the valid options and their meanings. For
+example, you could pass '-o tablecells' to the AsciiDoc parser, while the
 text parser would accept '-o tabs=split'.
 
 =item B<-h>, B<--help>
@@ -91,9 +93,9 @@ Show a short help message.
 
 List the documentation formats understood by po4a.
 
-= item B<-k> B<--keep-temps>
+=item B<-k>, B<--keep-temps>
 
-Keep the temporary master and localized POT files built before merging. 
+Keep the temporary master and localized POT files built before merging.
 This can be useful to understand why these files get desynchronized, leading to gettextization problems.
 
 =item B<-V>, B<--version>
@@ -130,19 +132,24 @@ Set the package version for the POT header. The default is "VERSION".
 
 =head2 Converting a manual translation to po4a
 
-B<po4a-gettextize> will try to extract the content of any provided translation
-file, and use this content as msgstr in the produced PO file. Be warned that
-this process is very fragile: the Nth string of the translated file is supposed
-to be the translation of the Nth string in the original. This will naturally not
-work unless both files share exactly the same structure.
+B<po4a-gettextize> synchronizes the master and localized files to extract their
+content into a PO file. The content of the master file gives the B<msgid> while
+the content of the localized file gives the B<msgstr>. This process is somewhat
+fragile: the Nth string of the translated file is supposed to be the translation
+of the Nth string in the original.
+
+Gettextization works best if you manage to retrieve the exact version of the
+original document that was used for translation. Even so, you may need to fiddle
+with both master and localized files to align their structure if it was changed
+by the original translator, so working on files' copies is advised.
 
 Internally, each po4a parser reports the syntactical type of each extracted
 string. This is how desynchronizations are detected during the gettextization.
-For example, if the files have the following structure, it is very unlikely that
-the 4th string in translation (of type 'chapter') is the translation of the 4th
-string in original (of type 'paragraph'). It is more likely that a new
-paragraph was added to the original, or that two original paragraphs were merged
-together in the translation.
+In the example depicted below, it is very unlikely that the 4th string in the
+translation (of type 'chapter') is the translation of the 4th string in the
+original (of type 'paragraph'). It is more likely that a new paragraph was added
+to the original, or that two original paragraphs were merged together in the
+translation.
 
     Original         Translation
 
@@ -153,73 +160,57 @@ together in the translation.
   chapter              paragraph
     paragraph          paragraph
 
-B<po4a-gettextize> will verbosely diagnose any detected structure
-desynchronization. When this happens, you should manually edit the files (this
-probably requires that you have some notions of the target language). You must
-add fake paragraphs or remove some content in one of the documents (or both) to
-fix the reported disparities, until the structure of both documents perfectly
-match. Some tricks are given in the next section.
-
-Even when the document is successfully processed, undetected disparities and
-silent errors are still possible. That is why any translation associated
-automatically by po4a-gettextize is marked as I<fuzzy> to require an manual
-inspection by humans. One has to check that each retrieved msgstr is actually
-the translation of the associated msgid, and not the string before or after.
-
-As you can see, the key here is to have the exact same structure in the
-translated document and in the original one. The best is to do the
-gettextization on the exact version of F<master.doc> that was used for the
-translation, and only update the PO file against the latest master file once the
-gettextization was successful.
-
-If you are lucky enough to have a a perfect match in the file structures,
-building a correct PO file is a matter of seconds. Otherwise, you will soon
-understand why this process has such an ugly name :) But remember that this
-grunt work is the price to pay to get the comfort of po4a afterward. Once
-converted, the synchronization between master documents and translations will
-always be fully automatic.
-
-Even when things go wrong, gettextization often remains faster than translating
-everything again. I was able to gettextize the existing French translation of
-the whole Perl documentation in one day, even though the structure of many
-documents were desynchronized. That was more than two megabytes of original text
-(2 millions of characters): restarting the translation from scratch would have
-required several months of work.
-
-=head2 Hints and tricks for the gettextization process
-
-The gettextization stops as soon as a desynchronization is detected. In theory,
-it should probably be possible resynchronize the gettextization later in the
-documents using e.g. the same algorithm than the L<diff(1)> utility. But a manual
-intervention would still be mandatory to manually match the elements that
-couldn't be automatically matched, explaining why automatic resynchronization is
-not implemented (yet?).
-
-When this happens, the whole game comes down to the alignment of these damn
-files' structures again through manual edits. B<po4a-gettextize> is rather
-verbose about what went wrong when it happens. It reports the strings that don't
-match, their positions in the text, and the type of each of them. Moreover, the
-PO file generated so far is dumped as F<gettextization.failed.po> for further
-inspection.
-
-Here are some other tricks to help you in this tedious process:
+B<po4a-gettextize> will verbosely diagnose any structure desynchronization. When
+this happens, you should manually edit the files to add fake paragraphs or
+remove some content here and there until the structures of both files actually
+match. Some tricks are given below to salvage as much as possible of the existing
+translation while doing so.
+
+If you are lucky enough to have a perfect match in the file structures out of
+the box, building a correct PO file is a matter of seconds. Otherwise, you will
+soon understand why this process has such an ugly name :) Even so,
+gettextization often remains faster than translating everything again. I
+gettextized the French translation of the whole Perl documentation in one day
+despite the I<many> synchronization issues. Given the amount of text (2 MB of
+original text), restarting the translation without first salvaging the old
+translations would have required several months of work. In addition, this grunt
+work is the price to pay to get the comfort of po4a. Once converted, the
+synchronization between master documents and translations will always be fully
+automatic.
+
+After a successful gettextization, the produced documents should be manually
+checked for undetected disparities and silent errors, as explained below.
+
+=head3 Hints and tricks for the gettextization process
+
+The gettextization stops as soon as a desynchronization is detected. When this
+happens, you need to edit the files as much as needed to re-align the files'
+structures. B<po4a-gettextize> is rather verbose when things go wrong. It
+reports the strings that don't match, their positions in the text, and the type
+of each of them. Moreover, the PO file generated so far is dumped as
+F<gettextization.failed.po> for further inspection.
+
+Here are some tricks to help you in this tedious process and ensure that you
+salvage as much of the previous translation as possible:
 
 =over
 
 =item
 
 Remove all extra content of the translations, such as the section giving credits
-to the translators. You can add them back in po4a afterward, using an addenda
-(see L<po4a(7)>).
+to the translators. They should be added separately to B<po4a> as addenda (see
+L<po4a(7)>).
 
 =item
 
-If you need to edit the files to align their structures, you should prefer
-editing the translation if possible. Indeed, if the changes to the original are
-too intrusive, the old and new versions will not be matched during the PO
-update, and the corresponding translation will be dumped anyway. But do not
-hesitate to also edit the original document if required: the important thing is
-to get a first PO file to start with.
+When editing the files to align their structures, prefer editing the translation
+if possible. Indeed, if the changes to the original are too intrusive, the old
+and new versions will not be matched during the first po4a run after
+gettextization (see below). Any unmatched translation will be dumped anyway.
+That being said, you still want to edit the original document if it's too hard
+to get the gettextization to proceed otherwise, even if it means that one
+paragraph of the translation is dumped. The important thing is to get a first PO
+file to start with.
 
 =item
 
@@ -230,9 +221,11 @@ when synchronizing the PO file with the document.
 =item
 
 You should probably inform the original author of any structural change in the
-translation that seems justified. Issues in the original document should reported
-to the author. Fixing them in your translation only fixes them for a part of the
-community. Plus, it is impossible to do so when using po4a ;)
+translation that seems justified. Issues in the original document should be
+reported to the author. Fixing them in your translation only fixes them for a
+part of the community. Plus, it is impossible to do so when using po4a ;) But
+you probably want to wait until the end of the conversion to B<po4a> before
+changing the original files.
 
 =item
 
@@ -252,45 +245,86 @@ line and the content of the item.
 Sometimes, the desynchronization message seems odd because the translation is
 attached to the wrong original paragraph. It is the sign of an undetected issue
 earlier in the process. Search for the actual desynchronization point by
-inspecting F<gettextization.failed.po>, and fix the problem where it really is.
-
-=item
-
-In some case, po4a adds a space at the end of either the original or the
-translated strings. This is because every string must be deduplicated during the
-gettextize process. Imagine that a string appearing several times unmodified in
-the original, but is translated in differing way, or that different paragraphs
-are translated in the exact same way.
-
-Without deduplication, such case would break the gettexization algorithm, as it
-is a simple one to one pairing between the msgids of both the master and the
-localized files. Since one of the PO files would miss an entry (that would be
-reported as duplicate, with two references), the pairing would fail.
-
-Since po4a uses the entry type ("title" or "plain paragraph", etc) to detect
-whether the parsing streams got desynchronized, similar issues could occur if
-two identical entries (same content but differing type) of the master file are
-translated in the exact same way in the localized file. po4a would detect a fake
-desyncronization in such case.
-
-In most cases, the extra space added by po4a to deduplicate the strings has no
-impact on the formatting. Strings are fuzzied anyway, and msgmerge will probably
-match the strings accordingly afterward.
+inspecting the file F<gettextization.failed.po> that was produced, and fix the
+problem where it really is.
 
 =item
 
-As a final note, do not be too surprised if the first synchronization of your PO
-file takes a long time. This is because most of the msgid of the PO file
-resulting from the gettextization don't match exactly any element of the POT
-file built from the recent master files. This forces gettext to search for the
-closest one using a costly string proximity algorithm.
-
-For example, the first B<po4a-updatepo> of the Perl documentation's French
-translation (5.5 MB PO file) took about 48 hours (yes, two days) while the
-subsequent ones only take a dozen of seconds.
+Other issues may come from duplicated strings in either the original or
+translation. Duplicated strings are merged in PO files, with two references.
+This constitutes a difficulty for the gettextization algorithm, which is a simple
+one-to-one pairing between the B<msgid>s of both the master and the localized
+files.
+
+If the exact same string is used several times in the master file, but with
+differing translations in the localized file, the translation will contain more
+B<msgid>s than the master. One of the master's B<msgid>s would have several references
+to compensate. Conversely, if different original paragraphs are translated in
+the exact same way, the original will have more B<msgid>s than the translation.
+
+To avoid such spurious desynchronization, po4a deduplicates all strings in both
+the master and localized files by appending spaces to them. This way, PO files
+will not merge any entry and the gettextization can proceed. In most cases, the
+extra space added by po4a to deduplicate the strings has no impact on the
+formatting. Strings are fuzzied anyway, and msgmerge will probably match the
+strings accordingly afterward.
 
 =back
 
+=head2 Reviewing files produced by B<po4a-gettextize>
+
+Any file produced by B<po4a-gettextize> should be manually reviewed, even when
+the script terminates successfully. You should skim over the PO file, ensuring
+that the B<msgid> and B<msgstr> actually match. It is not necessary to ensure
+that the translation is perfectly correct yet, as all entries are marked as
+fuzzy translations anyway. You only need to check for obvious matching issues
+because badly matched translations will be dumped in subsequent steps while you
+want to salvage them.
+
+Fortunately, this step does not require mastering the target languages, as you
+only want to recognize similar elements in each B<msgid> and its corresponding
+B<msgstr>. As a speaker of French, English, and some German myself, I can do
+this for all European languages at least, even if I cannot say one word of most
+of these languages. I sometimes manage to detect matching issues in non-Latin
+languages by looking at string length, phrase structures (does the number of
+question marks match?) and other clues, but I prefer when someone else can
+review those languages.
+
+If you detect a mismatch, edit the original and translation files as if
+B<po4a-gettextize> reported an error, and try again. Once you have a decent PO
+file for your previous translation, back it up until you get po4a working
+correctly.
+
+=head2 Running B<po4a> for the first time
+
+The easiest way to set up po4a is to write a B<po4a.conf> configuration file, and
+use the integrated po4a program (B<po4a-updatepo> and B<po4a-translate> are
+deprecated). Please check the "CONFIGURATION FILE" Section in L<po4a(1)>
+documentation for more details.
+
+When B<po4a> runs for the first time, the current version of the master
+documents will be used to update the PO files containing the old translations
+that you salvaged through gettextization. This can take quite a long time,
+because many of the B<msgid>s resulting from the gettextization do not exactly match
+the elements of the POT file built from the recent master files. This forces
+gettext to search for the closest one using a costly string proximity algorithm.
+For example, the first run over the Perl documentation's French translation (5.5
+MB PO file) took about 48 hours (yes, two days) while the subsequent ones only
+take seconds.
+
+=head2 Moving your translations to production
+
+After this first run, the PO files are ready to be reviewed by translators. All
+entries were marked as fuzzy in the PO file by B<po4a-gettextize>, forcing
+their careful review before use. Translators should take each entry to verify
+that the salvaged translation actually matches the current original text, update
+the translation if needed, and remove the fuzzy markers.
+
+Once enough fuzzy markers are removed, B<po4a> will start generating the
+translation files on disk, and you're ready to move your translation workflow to
+production. Some projects find it useful to rely on Weblate to coordinate
+between translators and maintainers, but that's beyond B<po4a>'s scope.
+
 =head1 SEE ALSO
 
 L<po4a(1)>,
@@ -307,7 +341,7 @@ L<po4a(7)>.
 
 =head1 COPYRIGHT AND LICENSE
 
-Copyright 2002-2020 by SPI, inc.
+Copyright 2002-2022 by SPI, inc.
 
 This program is free software; you may redistribute it and/or modify it
 under the terms of GPL (see the COPYING file).
@@ -386,6 +420,13 @@ foreach (@options) {
     }
 }
 
+if (scalar @locfile == 0) {
+    die wrap_msg(gettext("You must provide the same number of master files and localized files to synchronize them, ".
+                         "as po4a-gettextize is intended to synchronize master files and previously existing translations. ".
+                         "If you just want to extract POT files from your master files, please use po4a-updatepo. ".
+                         "Please note that the most convenient way of using po4a is to write a po4a.conf file and use the integrated po4a(1) program."))
+}
+
 # Check file existence
 foreach my $file ( @masterfile, @locfile ) {
     $file eq '-' || -e $file || die wrap_msg( gettext("File %s does not exist."), $file );
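
For readers following the new "Running B<po4a> for the first time" section added by this commit, a minimal po4a.conf might look like the sketch below. The language code, directory layout, and file names are assumptions chosen for illustration only and do not come from the commit; the "CONFIGURATION FILE" section of po4a(1) remains the authoritative description of the syntax.

```text
# Hypothetical po4a.conf: every path and the language list are assumptions.

# Languages to handle (here only French).
[po4a_langs] fr

# Where the POT file and the per-language PO files live.
# po/fr.po would be the PO file salvaged with po4a-gettextize.
[po4a_paths] po/manual.pot $lang:po/$lang.po

# One master document in POD format and the location of its translations.
[type: pod] doc/manual.pod $lang:doc/manual.$lang.pod
```

Under these assumptions, the gettextized PO file would be copied to po/fr.po and the first run would be started with `po4a po4a.conf`, which updates that PO file against the current master document.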