About the internationalization of documentation #56301

Open
kozo2 opened this issue Dec 2, 2023 · 21 comments
Labels
Docs Needs Discussion Requires discussion from core team before further action

Comments

@kozo2

kozo2 commented Dec 2, 2023

I want to contribute to the internationalization of pandas documentation.

I created the POT files corresponding to the Sphinx source of pandas and submitted a pull request #56296 .

This PR was made following a suggestion I received from @steppi in the Scientific Python Discord.

If you have any thoughts, please let me know.

@rhshadrach
Member

Thanks for opening this up. Can you give a summary of what these POT files are, how they are to be used, and why they need to be added to pandas.

@rhshadrach rhshadrach added the Docs label Dec 3, 2023
@steppi

steppi commented Dec 3, 2023

Hi @rhshadrach. I'm working on a CZI supported project to help add translations to the websites for major packages in the Scientific Python ecosystem. You may have noticed the language drop-down at https://numpy.org/. I planned to reach out to maintainers of all of the major projects shortly to ask if they'd like help setting something like this up. @kozo2 is very enthusiastic about the translation efforts and started working on the set-up independently. I'm very busy finishing a grant application at the moment, but will be able to fill you in on the details sometime next week.

@steppi

steppi commented Jan 8, 2024

Sorry for taking so long to get back to this. I've had a lot on my plate, and there were some delays getting a free enterprise organization set up for Scientific Python on the localization management platform Crowdin.

I work for Quansight Labs and am helping with a CZI-supported project to translate and localize the brochure websites of core Scientific Python projects. The goal is to translate the websites of 8 core Scientific Python projects into at least 3 widely used languages. I'd like to gauge how interested the Pandas maintainers are in participating, and will give an overview of what would be involved.

For any given project

  • I will need to invite someone with admin privileges on the repo containing the code for the brochure website to the Scientific Python Crowdin organization. For Pandas, this repo seems to be the main Pandas repo itself.
  • A configuration file, crowdin.yml, needs to be added to this repo. The configuration file controls which files will be targeted for translation, along with various other settings. You can see numpy.org's here. For Pandas, I think a restructuring of the content files may be needed. Also, the PR "Add pot files for translating pandas documentation" (#56296) attempts to add pot files for enabling translation of all of Pandas' documentation, but this is extremely ambitious and outside the scope of this CZI-backed project. It is not only a very large set of material, but documentation can change frequently. I recommend starting with the content of the brochure website. As far as I can tell, this is all in markdown for Pandas, which is very well supported by Crowdin, so there would be no need to bother with pot files. Some other projects use rst for their brochure websites and will need to use pot files though.
  • One or more colleagues at Quansight and I will help find interested and qualified translators, and there will likely be a substantial overlap in translators between projects.
  • The person mentioned above with admin privileges will need to sync Crowdin with GitHub. Crowdin will automatically segment the content of interest into translatable strings and autogenerate a PR where new translations will be pushed. Changes and additions to the target content on the repo will automatically be propagated to Crowdin's UI for translation. There are some quirks in the Crowdin workflow that I've found workarounds for which I can explain later.
  • Depending on the static site generator used, changes may be needed in order to add a drop down for selecting the language on the page like the one at numpy.org. I know the Scientific Python hugo theme makes this easy, but it appears that Pandas has its own custom static site generator. I imagine it would not be too difficult to make the necessary enhancements though.

Please let me know if you have any questions. I have likely missed some important information, but will read through this again to see if there's anything I should add. I will start reaching out to other projects this coming week, but starting here since @kozo2 has already got the ball rolling.

@kozo2
Author

kozo2 commented Jan 8, 2024

Thanks for opening this up. Can you give a summary of what these POT files are, how they are to be used, and why they need to be added to pandas.

@rhshadrach Sorry for the late reply.

These POT files serve as template files for creating multilingual translations of the Sphinx rst files.
PO files are created by adding translations in languages other than English to these POT files.
And the PO files are used to generate rst files that are translations of the original English documents.
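The POT-to-PO relationship described above can be illustrated with a deliberately simplified sketch. It assumes a hypothetical two-entry catalog and handles only single-line `msgid`/`msgstr` pairs; real PO files are normally processed by gettext tooling rather than a hand-written parser.

```python
import re

# A tiny, hypothetical PO catalog. Each msgid is an English source string
# copied from the POT template; msgstr holds the translation, and an empty
# msgstr means the entry has not been translated yet.
PO_TEXT = '''
msgid "Getting started"
msgstr "Primeiros passos"

msgid "User Guide"
msgstr ""
'''

# Matches one single-line msgid immediately followed by its msgstr.
ENTRY = re.compile(r'msgid "(.*)"\nmsgstr "(.*)"')


def load_catalog(po_text: str) -> dict[str, str]:
    """Map each msgid to its msgstr (single-line entries only)."""
    return dict(ENTRY.findall(po_text))


def translate(catalog: dict[str, str], source: str) -> str:
    """Return the translation, falling back to the English source string."""
    return catalog.get(source) or source


catalog = load_catalog(PO_TEXT)
print(translate(catalog, "Getting started"))
print(translate(catalog, "User Guide"))
```

Untranslated or unknown strings fall back to English, which is why a partially translated PO file can still produce a complete page.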

Since these POT files need to exist in the same repository as the original English documents to work, I sent this PR.

However, I understand that these POT files are numerous and their mechanism is difficult to understand, making them challenging to merge.

Therefore, I would be grateful if you could consider starting the multilingual documentation on a small scale without using the POT files, as @steppi suggests.
And for the time being, it's okay to ignore (the translations that use) these POT files.

@rhshadrach
Member

rhshadrach commented Jan 8, 2024

Thanks @steppi for filling in some of the details!

For Pandas, I think a restructuring of the content files may be needed.

Can you elaborate here?

I recommend starting with the content of the brochure website.

What is meant by "brochure website"? Searching the term gives me definitions such as "A brochure website is an informational website that is designed to look and feel like a printed brochure" but that doesn't make too much sense in this context I think.

The person mentioned above with admin privileges will need to connect sync Crowdin with GitHub.

Is this the connect sync you're referring to?

Crowdin will automatically segment the content of interest into translatable strings and autogenerate a PR where new translations will be pushed.

What repository are these PRs being put up in?

@kozo2

Therefore, I would be grateful if you could consider starting the multilingual documentation on a small scale without using the POT files, as @steppi suggests.
And for the time being, it's okay to ignore (the translations that use) these POT files.

From this, it sounds like there will be a desire to eventually consider the POT files though - is that correct? If that's the case, then I think they need to be considered upfront. I do not think pandas should start going down this path without a clear understanding of the endpoint.

cc @datapythonista

@steppi

steppi commented Jan 9, 2024

Good questions @rhshadrach.

Can you elaborate here?

Certainly. The changes would be minor. The content files for what I'm calling Pandas' brochure website are at https://github.com/pandas-dev/pandas/tree/main/web/pandas. It will simplify the Crowdin config to put the English content in something like web/pandas/content/en, and then translated content can go in web/pandas/content/ja, web/pandas/content/pt, etc., where en, ja, pt are two-letter ISO 639-1 language codes. I think it would also work to put the English content in web/pandas/en and then put translated content in web/pandas/ja etc. The important thing is to have parallel folders for each language. (I know it probably seems obvious, but you'll also want to avoid putting folders which should not be targets for translation together with the content. That's why I suggested having a separate content folder that things like static/*, robots.txt, and versions.json would sit outside of.)
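As a rough illustration of the parallel-folder idea, the sketch below maps an English content file to its counterpart for another language. The `web/pandas/content/<lang>` root follows the suggestion above; the file name `about/team.md` is only an example, and none of this is an existing pandas or Crowdin API.

```python
from pathlib import PurePosixPath

# Assumed layout from the suggestion above: web/pandas/content/<lang>/...
CONTENT_ROOT = PurePosixPath("web/pandas/content")
SOURCE_LANG = "en"


def translated_path(source_file: str, lang: str) -> PurePosixPath:
    """Return the parallel path of an English content file for `lang`.

    Raises ValueError if the file is not under the English content folder,
    which keeps non-content files (static assets, robots.txt) out of scope.
    """
    relative = PurePosixPath(source_file).relative_to(CONTENT_ROOT / SOURCE_LANG)
    return CONTENT_ROOT / lang / relative


for lang in ("ja", "pt"):
    print(translated_path("web/pandas/content/en/about/team.md", lang))
```

Because every language shares the same relative path, a single pattern in the translation platform's configuration can cover all target languages.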

What is meant by "brochure website"? Searching the term gives me definitions such as "A brochure website is an informational website that is designed to look and feel like a printed brochure" but that doesn't make too much sense in this context I think.

I'm just repeating the term that was used when this task was explained to me. "brochure website" here is meant to stand in contrast to things like API documentation, tutorials, etc. Another term could be "core project website". For Pandas, this is https://pandas.pydata.org/. For NumPy, it is https://numpy.org. You may want to click around there to see the extent of the content which has been translated. A sample of other Scientific Python brochure websites includes https://scipy.org, https://scikit-learn.org/, https://matplotlib.org/, https://jupyter.org/, https://www.sympy.org.

Is this the connect sync you're referring to?

Crowdin will automatically segment the content of interest into translatable strings and autogenerate a PR where new translations will be pushed.

Yes. I had meant to include a link, and to only use the word sync. "connect sync" is a consequence of the sentence being left in an inconsistent state after editing. Sorry, I should probably compose long messages like this separately in my editor, instead of directly in the GitHub UI.

What repository are these PRs being put up in?

There will be a single PR to the repository hosting the content files for the "brochure website". Currently for Pandas, this would be to the Pandas repo itself.

From this, it sounds like there will be a desire to eventually consider the POT files though - is that correct? If that's the case, then I think they need to be considered upfront. I do not think pandas should start going down this path without a clear understanding of the endpoint.

I think localization of the "brochure website" and localization of the Sphinx documentation can be considered separately. As a SciPy maintainer, I can tell you that there is little will to create and maintain official and up-to-date translations of the large and oft-changing API documentation and tutorials for SciPy. Though these translations would be nice to have, they would involve considerable effort, and the collective bandwidth would be difficult to secure. You'll have to decide what extent of translation you all would find feasible to create and maintain. From here, the scope of the project I'm helping with is:

One such improvement is translation and localization. Development takes place in English, as reflected by project websites and documentation. While many contributors are comfortable with English as a first, second, or even third language, the language barrier excludes especially users that are very young, are new to the community, have learning disabilities, or are from the Global South—all potential future contributors and leaders in the scientific Python community! We will therefore translate key pages of core project websites, and provide translation infrastructure for the web themes.

This is the endpoint I have in mind at least.

@rhshadrach
Member

Thanks - I don't personally have a good understanding of the security in place when integrating Crowdin with GitHub, but the fact that other projects like NumPy already use it is certainly promising. I do think this should be better understood (at least, better than my current understanding) assuming we are to move forward with it.

cc @pandas-dev/pandas-core

@rhshadrach rhshadrach added the Needs Discussion Requires discussion from core team before further action label Jan 9, 2024
@Dr-Irv
Contributor

Dr-Irv commented Jan 9, 2024

I'm not sure of the value of doing this, given the work it may put on the pandas team. If someone needs a translated version of the pandas web site, they can use Google Translate. I'm not sure how good those translations are, but you then have automatic translation for many languages.

E.g., here's the pandas web site translated to German:
https://pandas-pydata-org.translate.goog/?_x_tr_sch=http&_x_tr_sl=en&_x_tr_tl=de&_x_tr_hl=en&_x_tr_pto=wapp

@steppi what is the value of what you are proposing as opposed to telling people to use Google Translate? Is it that the auto-translations are inaccurate? Or something else?

@steppi

steppi commented Jan 10, 2024

That's a good question @Dr-Irv. We wouldn't want to put effort into this if it's not actually helpful. From what we've seen so far, automatic translation services aren't always adequate for translating the kind of jargon-laden technical language involved. Machine translation is being used in the translation process, but it is very helpful to have human editors with subject expertise review and sign off on the machine translated content. @melissawm helped write the proposal for this grant and worked on the Portuguese translation of numpy.org. I'm sure she'd have more to share about the value of these translation efforts.

@steppi

steppi commented Jan 10, 2024

Thanks - I don't personally have a good understanding of the security in place when integrating Crowdin with GitHub, but the fact that other projects like NumPy already use it is certainly promising. I do think this should be better understood (at least, better than my current understanding) assuming we are to move forward with it.

I agree that this is important to consider. I wasn't involved in this project yet when Crowdin was first synced with the numpy.org repo, and so wasn't there for the discussions around security. I'll ask around and investigate a little and get back with a clear report on the security implications.

@rgommers
Contributor

Quick thought re auto-translation: it also forces the use of Google Chrome, which may not be something that's perfectly aligned with the openness of community open source. It's a trade-off of course, so just one of multiple things to take into account. Hopefully Firefox and other browsers will get there, but it looks like a long way off still.

@Dr-Irv
Contributor

Dr-Irv commented Jan 10, 2024

Quick thought re auto-translation: it also forces the use of Google Chrome, which may not be something that's perfectly aligned with the openness of community open source. It's a trade-off of course, so just one of multiple things to take into account. Hopefully Firefox and other browsers will get there, but it looks like a long way off still.

Just did a French translation with Google Translate (the web site) in Firefox. This URL should work:
https://pandas-pydata-org.translate.goog/?_x_tr_sch=http&_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp

@melissawm
Contributor

Thanks for the ping @steppi - I'd be happy to share my experience although I'm sure your mileage may vary considering how each project is organized 😄

For NumPy, it was pretty natural to keep the "brochure" home page translated but not the main docs, since these two sets of pages live in separate repos. I think that's the best approach - translating the entire API docs doesn't seem feasible in light of the need for human reviewers, as mentioned above. As a non-native speaker (my mother tongue is Brazilian Portuguese), I highly appreciate translation efforts, since machine translations (at this point) really don't come close to human translations, especially for specialized jargon or domain-specific terms.

For the translated content, it is recommended to have 2 people - one translator, one reviewer - for each language you aim to translate. This is not strictly required though, and it is also why collaborating with groups such as Scientific Python may help, since we could have a group of translators that can act across different projects. For maintainers, this would mean signing off on translations (i.e. merging them) once in a while. It is really up to you to decide on a process for this, if you require a maintainer to speak that language or if you can trust the translators to do a good job. In my opinion, this is where machine translation can help - verifying that the human translation is reasonably close to the original.

I'm happy to chat more if you have questions!

@rgommers
Contributor

Just did a French translation with Google translate (the web site) in Firefox. This URL should work: https://pandas-pydata-org.translate.goog/?_x_tr_sch=http&_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp

Sorry, not what I meant. You're still using a Google product there; you can't avoid that AFAIK. Firefox has a built-in "Translate page" feature, but it's never worked for me. For the few supported languages it hangs or produces little, and other common languages aren't supported at all yet. E.g., here it is reporting that it cannot translate Japanese at all yet:

[Image: screenshot of Firefox reporting that it cannot translate Japanese]

Anyway, @melissawm's point is the more pertinent one: quality of machine translations is fairly low.

@datapythonista
Member

I guess it's difficult to tell how many people will benefit from having our website and docs in languages other than English (with proper human made translations), but I guess in general we'll all agree this would be awesome if we don't consider the cost to implement it.

To me personally this only makes sense if we can manage to have all the translations completely outside of the pandas core repo. I didn't work in localization recently, but the last time I did the Python community used transifex. How this would work is that in the pandas repo (e.g. the website) we would "simply" identify which texts need to be translated. In general, in Python code this would require changing from "foo" to _("foo"), in jinja templates a {% trans "foo" %} would be used instead of a literal, and for other technologies like markdown or yaml I don't know but I guess something similar can be done.

Then all the translations could be managed externally: from Transifex or another system, we would fetch the pandas repo, extract all the texts to translate, translate them into the languages of interest, and generate the .po/.pot files. We could then fetch these files during the website/docs build.
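The extraction step in this workflow can be sketched as follows. This is only an illustration of the idea, using a hypothetical source snippet; in practice an established extractor such as xgettext or Babel would be used, and this version does not escape quotes or handle multi-line strings.

```python
import ast

# Hypothetical source file content with two strings marked for translation.
SOURCE = '''
title = _("Getting started")
label = "not marked"
footer = _("Community")
'''


def extract_msgids(source: str) -> list[str]:
    """Collect string literals passed as the first argument to _()."""
    msgids = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "_"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            msgids.append(node.args[0].value)
    return msgids


def to_pot(msgids: list[str]) -> str:
    """Render msgids as POT entries with empty msgstr placeholders."""
    return "\n\n".join(f'msgid "{m}"\nmsgstr ""' for m in msgids)


print(to_pot(extract_msgids(SOURCE)))
```

Only strings wrapped in `_()` end up in the template, which is what lets the translation work happen entirely outside the repository that holds the source text.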

The idea is that pandas devs would have to manage the identification of what needs to be translated, but if someone is translating a text to Spanish, Arabic or Mandarin, the pandas repo and the core devs wouldn't be affected. I think anything that requires core devs for specific translations is going to have a huge negative impact on this project: we have hundreds of thousands of texts to translate, and there are hundreds of languages in the world.

I guess the first step would be to write a PDEP, see if the core team is on board with our website and docs i18n, and if we are, start with a PoC with maybe just the website and a couple of languages to see what this looks like in practice.

Does this make sense?

@steppi

steppi commented Jan 11, 2024

I guess it's difficult to tell how many people will benefit from having our website and docs in languages other than English (with proper human made translations), but I guess in general we'll all agree this would be awesome if we don't consider the cost to implement it.

Thanks @datapythonista! There will be work involved, but the hope is that with the support from CZI, Quansight will be able to take on much of the burden and make things manageable for maintainers of core projects like pandas.

To me personally this only makes sense if we can manage to have all the translations completely outside of the pandas core repo.

I agree with this assessment. The NumPy website has its own repo, https://github.com/numpy/numpy.org, which ensures this kind of separation. In my previous message I described the vanilla Crowdin integration which is used for numpy.org, but other workflows are possible. It's possible to use Crowdin without GitHub integration as described here using pretty much the same workflow you described for transifex above. I think the GitHub integrations are nice: they automate the upload of strings to Crowdin for translation when there are changes to the website content, and the open PR provides a running backup. But the integrations aren't necessary. Also, Crowdin could be set up to make the PR to a separate repo where translated content is managed, giving the benefit of backups without having to touch the core repo.

As an aside, to explain the choice of Crowdin instead of transifex or some other platform: I wasn't around when Crowdin was chosen for numpy.org, but my understanding is that it was chosen for the quality of its translation UI and ease of set up. Since then, Crowdin has generously offered a free enterprise organization account with support; their support thus far has been excellent. Since there will likely be heavy overlap between the teams of translators working on the different core project websites, it will streamline things if translators have a common UI and workflow across projects. It will definitely streamline things for me to have only one platform to work with when supporting translation infrastructure set up for the different Scientific Python packages.

I guess the first step would be to write a PDEP, see if the core team is on board with our website and docs i18n, and if we are, start with a PoC with maybe just the website and a couple of languages to see what this looks like in practice.

Does this make sense?

Sounds good, makes sense to me! Please let me know if you have any questions or anything I can help with.

Also, thanks @melissawm and @rgommers for helping explain why we think it's important to have a human in the loop for the translations!

@datapythonista
Member

Excellent, all sounds good to me. I wasn't advocating for Transifex over Crowdin; I used it as an example since it's the one I've used, but that was probably 15 years ago, so it may well not be the standard or the best option anymore.

I think I'm also fine with having GitHub integrations, as long as translations don't create PRs on the main pandas repo.

Do you want to open a PDEP with the proposal yourself? I can help if you need it.

@steppi

steppi commented Jan 12, 2024

Great! I’d be happy to open the PDEP and will start on it Monday. Thanks @datapythonista!

@WillAyd
Member

WillAyd commented Jan 12, 2024

Looking forward to the PDEP and thanks for the input so far @steppi. I would also be curious how we prevent sabotage to translations like what happened with Ubuntu 23.10.

@datapythonista
Member

Looking forward to the PDEP and thanks for the input so far @steppi. I would also be curious how we prevent sabotage to translations like what happened with Ubuntu 23.10.

What Django used to do in the past (and probably still does) is to have admins for each language. Not sure what happened with Ubuntu, but I guess if the original translator, or a pandas contributor, is the one who approves the translations, the chances of anyone spamming the translations are quite small.

@steppi

steppi commented Jan 13, 2024

What Django used to do in the past (and probably still does) is to have admins for each language. Not sure what happened with Ubuntu, but I guess if the original translator, or a pandas contributor, is the one who approves the translations, the chances of anyone spamming the translations are quite small.

Yes, that's the plan. We want to have trusted people who can give the final say before translations are published. Also, translators will need an invite to Crowdin, will have to let us know their real identity, and will go through some kind of vetting. Anyone who spams inflammatory or low-quality translations can be banned and will damage their reputation.


8 participants