
Fix for T274511 #22

Merged
merged 5 commits into from Apr 5, 2023

Conversation

pamputt
Contributor

@pamputt pamputt commented Jan 29, 2023

No description provided.

@hugolpz
Member

hugolpz commented Jan 29, 2023

Hello @pamputt, you are requesting a code review, right?

I read on phabricator:

I have created a pull request on Github and I have already run the script on Toolserver for kuwiktionary.

So (1) the code of this PR has already been tested live on Kurdish, right?
And (2) you now want approval to integrate it into the whole bot, so it runs on all existing languages?

Here is the thing I am wondering about with such a tool. Did you anticipate side effects? For French, when there are 20+ recordings for a word, doesn't this risk force-injecting the "missing audios"? Or does the language-specific Python script set limits for those? (Note: my understanding of the bot is low, so I need to ask basic questions; I am doing my best to still provide a review and think about everything.)

I would recommend making a short test run with French, our largest language, to check whether something goes wrong. A few dozen or a few hundred words? Is that possible?

@hugolpz hugolpz self-assigned this Jan 29, 2023
@hugolpz hugolpz self-requested a review January 29, 2023 12:03
@pamputt
Contributor Author

pamputt commented Jan 30, 2023

So (1) the code of this PR has already been tested live on Kurdish, right?

Yes, it has been tested only on kuwiktionary for now.

And (2) you now want approval to integrate it into the whole bot, so it runs on all existing languages?

Actually this is a script, so the idea is just to run it manually. It is not part of Lingua Libre Bot itself; it is a tool that makes using Lingua Libre Bot easier.

Here is the thing I am wondering about with such a tool. Did you anticipate side effects? For French, when there are 20+ recordings for a word, doesn't this risk force-injecting the "missing audios"? Or does the language-specific Python script set limits for those? (Note: my understanding of the bot is low, so I need to ask basic questions; I am doing my best to still provide a review and think about everything.)

The maximum number of recordings is managed wiki by wiki, depending on each wiki's rules (frwiktionary may have a different policy than kuwiktionary). So no side effects are expected.
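Such a per-wiki cap could be expressed as a small policy table. A minimal sketch under stated assumptions: the name `MAX_AUDIO_PER_WORD`, the wiki keys, and the numeric limits are all hypothetical illustrations, not the bot's actual configuration.

```python
# Hypothetical per-wiki policy: how many recordings a single entry may
# carry before the bot stops adding new ones. Values are illustrative.
MAX_AUDIO_PER_WORD = {
    "frwiktionary": 3,
    "kuwiktionary": 5,
}

def may_add_audio(wiki: str, current_count: int, default_cap: int = 3) -> bool:
    """Return True if the wiki's policy still allows one more recording."""
    return current_count < MAX_AUDIO_PER_WORD.get(wiki, default_cap)
```

Because the limit is looked up per wiki, a stricter or looser policy on one Wiktionary never affects the others.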

I would recommend making a short test run with French, our largest language, to check whether something goes wrong. A few dozen or a few hundred words? Is that possible?

Once I have finished on kuwiktionary and checked that everything looks fine, I will run the script on all the wikis together to add all the missing pronunciations that were not added during the bot outage.

you are requesting a code review, right?

More or less. First, I wanted to host the code somewhere visible to everyone, i.e. here. That way we can discuss it and report any issue that is detected. In the end, we may decide to merge it if we think it is useful.

@hugolpz
Member

hugolpz commented Jan 30, 2023

So it's a convenience script built around Lingua Libre Bot, a part of the toolkit, kept in the same repository to occasionally serve a specific purpose. Seems pretty cool!
Side effects have been considered and are under observation with ku:wikt.

You lead the way. I can't review the Python code much, but here are my few "reviewer" remarks.

On code:

  1. Maybe this seems better: #23.

  2. The if condition seems unnecessary to me. A few additional languages are reaching the 60~80k recordings milestone. For the sake of resiliency and code simplicity, shouldn't we apply this yearly split to all languages?

Around it:

  1. Test with French

I will run the script on all the wikis together

The final run may involve 100,000 edits. I guess you plan to closely monitor the first edits of each language?

  1. Document its usage (Clarify usage #24)
    In the script itself, the file's header contains a description, great! Adding a short how-to guide would be healthy. (Just one variable to change.)

  2. Put it on the map
    Add the script's name and purpose somewhere in the general Readme.md, likely before the See also section, in a new section?

Also, if it is run by hand and has no expected side effect, we can merge into master when you feel ready, within this week. ✌🏼

@hugolpz
Member

hugolpz commented Feb 9, 2023

Hello @pamputt, I corrected a minor typo per the commit above.

Second, I'm not sure sleep 5 is of any use there, between llbot.py calls. I believe such a 5-second pause should sit between specific edits, therefore inside llbot.py. If the junction of two llbot.py calls needs this 5-second pause, then it should be added within llbot.py, at its beginning or, better, at its end.

Third, the conditional switch between large languages requiring time-fragmented SPARQL calls and other languages requiring a single SPARQL call seems superfluous. Let's time-fragment them all; it only makes our code more resilient.

Fourth, IMHO and optionally, this time-fragmenting code could be moved into llbot.py.

Fifth, since we now have avenues to provide language + number of records (see hugolpz/Sparql2Data#4), we could add to llbot.py a parameter and a conditional switch such as `if [[ ?records >= 50000 ]]; then ... else ... fi`. But as suggested in (3), it is easier and more resilient to time-fragment all queries.

@pamputt
Contributor Author

pamputt commented Feb 9, 2023

The 5-second sleep is there because otherwise the Lingua Libre SPARQL endpoint returns a 429 error. It happens when I query the list of all recordings; I did not get this error elsewhere. The sleep could probably be smaller (2-3 seconds), but this is not a big deal.
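The pause-and-retry behaviour around the rate-limited endpoint can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the `RateLimited` exception and the `run_query` callable are hypothetical stand-ins for the real HTTP 429 handling.

```python
import time

class RateLimited(Exception):
    """Hypothetical stand-in for the SPARQL endpoint's HTTP 429 response."""

def query_with_pause(run_query, pause=5.0, retries=3, sleep=time.sleep):
    """Call run_query(); on a rate-limit error, wait `pause` seconds and retry.

    `run_query` wraps the actual SPARQL request. `sleep` is injectable so the
    backoff can be observed in tests without real waiting.
    """
    for attempt in range(retries):
        try:
            return run_query()
        except RateLimited:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(pause)
```

With this shape, lowering the pause to the 2-3 seconds mentioned above is a one-argument change.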

About your third point, why not; it is just a bit slower because of this sleep 5 between every call.

About the fourth point, I do not think it is a good idea to put that in llbot.py (more precisely in lili.py), because we cannot know what will happen in the future. For example, if the Lingua Libre SPARQL server becomes more robust, splitting the request might no longer be required. And how do we choose the granularity of the splitting? Is 6 months correct, too small, too large? It may not fit if 100k recordings are made within a given 6-month period.

@hugolpz
Member

hugolpz commented Feb 9, 2023

Ok for me as for most points. Your lead. 👍🏼

About the fourth point, I do not think it is a good idea to put that in llbot.py (more precisely in lili.py), because we cannot know what will happen in the future. For example, if the Lingua Libre SPARQL server becomes more robust, splitting the request might no longer be required. And how do we choose the granularity of the splitting? Is 6 months correct, too small, too large? It may not fit if 100k recordings are made within a given 6-month period.

Your observation encourages splitting more, not less. Current trends suggest our Blazegraph is getting slower as the number of records per language grows. Python is doing the math, so there is no cost to splitting all languages into 3-month chunks, whereas not splitting will result in a crash when a new language suddenly becomes large and we have no developer to hard-code it into the list of exceptions.
I don't see the point of having an IF switch between complex and simple calls when complex calls would work for all languages at no cost.
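The uniform 3-month fragmentation suggested here can be sketched like this. The function name and the idea of feeding each window into a date-filtered SPARQL query are assumptions for illustration, not the repository's actual code.

```python
from datetime import date

def quarter_chunks(start: date, end: date):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) in 3-month steps."""
    cur = start
    while cur < end:
        # Advance to the first day of the month three months ahead,
        # wrapping the year when the month index overflows December.
        month = cur.month + 3
        nxt = date(cur.year + (month - 1) // 12, (month - 1) % 12 + 1, 1)
        yield cur, min(nxt, end)
        cur = nxt
```

Each pair would then parameterise a date filter in the SPARQL query, so every language is queried the same way regardless of how many recordings it has.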

@pamputt
Contributor Author

pamputt commented Feb 9, 2023

Yes, we could split more, but what is the right granularity (every year, every month, every week, every day, ...)? We cannot guess it, and that is why it is not a good idea to hard-code it in lili.py: it would not let the user adapt the splitting to their needs.

About your last sentence: currently it is the sleep 5 that slows down the script if we do too many sleeps. But as I already said, it is not a big deal; we can implement the splitting for all languages, even if I do not see any advantage in doing it.

@hugolpz
Member

hugolpz commented Feb 9, 2023

I am currently doing my best to explore and learn how this Python bot distributes tasks and works. This allows me to raise the previous points to you, but I am still blind and not confident enough in my understanding to do more. You have considered those points, so we are good to go. We follow your lead, Pamputt 🌻 👍🏼

Member

@hugolpz hugolpz left a comment


I agree with the code. I let Pamputt do the merge.

@hugolpz hugolpz merged commit fe25f52 into master Apr 5, 2023