
Fix MediaWiki reader on internalLink for JCK languages #8525

Merged 2 commits into jgm:main on Jan 13, 2023
Conversation

@liruqi (Contributor) commented on Jan 5, 2023

This PR fixes the wrong link text produced for Chinese and other CJK languages, whose words are not separated by spaces. Could anyone explain the purpose of the linktrail variable?

For example:

是[[台灣|台式]]的傳統[[月餅]]

would be parsed as below, which is obviously unreasonable:

        {
          "t": "Str",
          "c": "是"
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "台式的傳統"
              }
            ],
            [
              "台灣",
              "wikilink"
            ]
          ]
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "月餅"
              }
            ],
            [
              "月餅",
              "wikilink"
            ]
          ]
        },

After this fix, the JSON output would be:

        {
          "t": "Str",
          "c": "是"
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "台式"
              }
            ],
            [
              "台灣",
              "wikilink"
            ]
          ]
        },
        {
          "t": "Str",
          "c": "的傳統"
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "月餅"
              }
            ],
            [
              "月餅",
              "wikilink"
            ]
          ]
        },

Original data:
chinapedia/mediawiki-to-gfm@40b17e7

Json output with fix:
chinapedia/mediawiki-to-gfm@63add8c

@liruqi (Contributor, Author) commented on Jan 5, 2023

I checked Wikipedia's rendering rules for English; see, for example, https://en.wikipedia.org/wiki/100-year_flood

[[drainage basin]]s

is equivalent to

[[drainage basins|drainage basin]]

which would be rendered to

<a href="/wiki/Drainage_basin" title="Drainage basin">drainage basins</a>

This is a special treatment in Wikipedia's wikitext parser, which applies to the English (and other Latin-script) Wikipedias, but not to CJK languages. We probably should not apply this treatment when pandoc is not aware of the input language.
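To illustrate the behavior being discussed (this is a hedged Python sketch, not pandoc's actual Haskell code), English-style blending can be modeled as moving a run of trailing lowercase letters from the surrounding text into the link text:

```python
import re

# Sketch of English-style wikilink "blending": letters immediately
# following a wikilink are absorbed into the link text.
# For English, the link trail is roughly a run of lowercase a-z.
LINK_TRAIL = re.compile(r"^[a-z]+")

def blend(link_text: str, following: str) -> tuple[str, str]:
    """Return (blended link text, remaining surrounding text)."""
    m = LINK_TRAIL.match(following)
    if not m:
        return link_text, following
    trail = m.group(0)
    return link_text + trail, following[len(trail):]

# [[drainage basin]]s -> link text "drainage basins"
print(blend("drainage basin", "s of the river"))
# CJK text has no a-z trail, so nothing is blended:
print(blend("台灣", "的傳統"))
```

Under such a rule, CJK text after a link would never blend only if the trail pattern excludes CJK characters, which is exactly the question this PR raises.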

@liruqi liruqi changed the title Fix MediaWiki reader on internalLink Fix MediaWiki reader on internalLink for JCK languages Jan 6, 2023
@jgm (Owner) commented on Jan 13, 2023

Here's the link to their syntax description for "blending" links:

https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

Note the exception, which we currently don't implement:

Exception: a trailing apostrophe (') and any characters following the apostrophe are not blended.

@jgm (Owner) commented on Jan 13, 2023

But I did some experiments in Wikipedia's sandbox, and their actual implementation of blending seems to be much simpler: only ASCII letters are taken to be part of the "link trail." The blended content stops as soon as it hits a punctuation mark or a non-ASCII character, even an accented Latin character.

I think we should try to match this behavior -- and it will handle your case well.
Do you want to update your PR?
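The ASCII-letters-only behavior described above can be sketched as follows (illustrative Python, not the actual Haskell change in the PR):

```python
def link_trail_ascii(following: str) -> str:
    """Collect the ASCII-letter run that would blend into a preceding link.

    Stops at the first punctuation mark or non-ASCII character,
    even an accented Latin letter.
    """
    trail = []
    for ch in following:
        if ch.isascii() and ch.isalpha():
            trail.append(ch)
        else:
            break
    return "".join(trail)

print(link_trail_ascii("s, then"))  # plain ASCII letter blends
print(link_trail_ascii("és"))       # accented é stops blending immediately
print(link_trail_ascii("的傳統"))    # CJK stops blending immediately
```

Because every CJK character is non-ASCII, this rule makes the link trail empty for CJK text, which is why it handles the case in this PR.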

@liruqi (Contributor, Author) commented on Jan 13, 2023

Yes, that looks like an appropriate solution.

@liruqi (Contributor, Author) commented on Jan 13, 2023

Hi @jgm, please review my code changes.

@jgm (Owner) commented on Jan 13, 2023

Looks great!

@jgm jgm merged commit fdfa9fc into jgm:main Jan 13, 2023
@waldyrious (Contributor) commented:

> I did some experiments on Wikipedia's sandbox, and their actual implementation of blending seems to be much simpler. Only ASCII letters are taken to be part of the "link trail." The blended content stops as soon as you get a punctuation mark or a non-ASCII character, even an accented latin character.

For the record, the MediaWiki documentation indicates that the definition of which content after a wikilink is considered part of it is indeed language-specific. In practice, the specification is encoded in each language's MessagesXx.php file, in a variable called $linkTrail. For example, here's the English configuration file (which should match the behavior of the English Wikipedia Sandbox), and here's the Chinese configuration file (which should match the Chinese Wikipedia Sandbox).

I'm not 100% sure, but the issue this PR raised (that CJK languages don't separate words with whitespace) may be addressed in MediaWiki's segmentByWord() method —which, according to the docs, is (re)implemented in the zh_hans, yue, and ja configuration files— in addition to any language-specific $linkTrail configuration.
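The per-language $linkTrail mechanism described above can be sketched like this (the regexes below are my approximations of the MessagesXx.php values, not copies of them; the "zh" entry in particular is an assumption that Chinese disables trail blending):

```python
import re

# Approximate per-language $linkTrail patterns, in the spirit of
# MediaWiki's MessagesXx.php files (values here are illustrative):
LINK_TRAILS = {
    "en": re.compile(r"^([a-z]+)(.*)$", re.S),
    "es": re.compile(r"^([a-záéíóúñ]+)(.*)$", re.S),  # assumed Spanish set
    "zh": None,  # assumed: no trail blending for Chinese
}

def split_trail(lang: str, following: str) -> tuple[str, str]:
    """Split text after a wikilink into (blended trail, remainder)."""
    pat = LINK_TRAILS.get(lang)
    if pat is None:
        return "", following
    m = pat.match(following)
    return (m.group(1), m.group(2)) if m else ("", following)

print(split_trail("en", "s remaining"))
print(split_trail("es", "es restantes"))
print(split_trail("zh", "的傳統"))
```

This is the "rules for each sub-dialect of Wikitext" approach discussed below; its cost is maintaining a table of per-language patterns.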

liruqi added a commit to chinapedia/pandoc that referenced this pull request Mar 3, 2023
The rules for "blending" characters outside a link into the link are
described here: https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

These pose a problem for CJK languages, which generally don't have
spaces after links.

However, it turns out that the blending behavior, as implemented on
Wikipedia, is (contrary to the documentation) only for ASCII letters.
This commit implements that restriction, which fixes the problem for
CJK.  (jgm#8525)
@lwolfsonkin (Contributor) commented:

This issue has just come to my attention. While it is laudable that this fixed an issue parsing documents from Chinese Wikipedia, it seems to have also broken tons of other languages, as partially pointed out by @waldyrious. That is, this likely fixed links for languages using non-segmented scripts, such as Chinese, Japanese, and Thai, while simultaneously breaking similar cases in every other language (except English, which was specifically tested and exempted).

While on English Wikipedia, a through z is the acceptable "link trail", on Spanish Wikipedia this also includes the letters of the Spanish alphabet that can carry diacritics, namely á, é, í, ó, ú, and ñ. Similar cases exist on the Italian, French, Portuguese, and German Wikipedias. This is even more egregious in languages using a segmented script that isn't Latin, such as on the Arabic, Russian, Hebrew, Hindi, Tamil, and Urdu Wikipedias.

In short, while it seems to me that the "proper" solution would be to have rules for each sub-dialect of Wikitext, a much simpler and likely sufficient solution, which would fix nearly all cases, would be to allow this blended-link behavior for all trailing characters by default, unless the trailing characters belong to one of the unsegmented scripts (Chinese characters, Thai, Lao, etc.). This can be easily tested by checking the script of the following characters (each character in Unicode belongs to a script, and scripts are defined by ISO 15924). While I'm not sure off-hand how to access this information in Haskell, if one can do this easily via some package, this would likely be the easiest and most inclusive solution (with the caveat that more scripts should be added to the list of unsegmented scripts, such as Khmer and Balinese, for example).

I hope this is helpful and hope we can see this solved. Thank you!

Pinging @liruqi and also @jgm.

@jgm (Owner) commented on Jul 5, 2023

@waldyrious @lwolfsonkin thanks for bringing this to my attention. Could you open a new issue about this, so we don't lose track?

jgm added a commit that referenced this pull request Jul 6, 2023
Previously we only included ASCII letters. That is correct for
English but not for, e.g., Spanish (see comment in #8525).
A safer approach is to include all letters except those in the
CJK unified ideograph ranges.
@jgm (Owner) commented on Jul 6, 2023

I've added a commit along the lines of @lwolfsonkin's suggestion. No need to open a new issue if this is adequate.
