
Fix for T274511 #22

Merged
merged 5 commits into from Apr 5, 2023

Conversation

pamputt
Contributor

@pamputt pamputt commented Jan 29, 2023

No description provided.

@hugolpz
Member

hugolpz commented Jan 29, 2023

Hello @pamputt, you are requesting a code review, right?

I read on phabricator:

I have created a pull request on Github and I have already run the script on Toolserver for kuwiktionary.

So (1) the code of this PR has already been tested live on Kurdish, right?
And (2) you now want approval to integrate it into the whole bot, so it runs on all existing languages?

Here is the thing I am wondering about with such a tool. Did you anticipate side effects? For French, when there are 20+ recordings for a word, doesn't this risk force-injecting the "missing audios"? Or does the language-specific Python script set limits for those? (Note: my understanding of the bot is low, so I need to ask basic questions; I am doing my best to still provide a review and think about everything.)

I would recommend making a short test run with French, our largest language, to check whether something goes wrong. A few dozen or a few hundred words? Is that possible?

@hugolpz hugolpz self-assigned this Jan 29, 2023
@hugolpz hugolpz self-requested a review January 29, 2023 12:03
@pamputt
Contributor Author

pamputt commented Jan 30, 2023

So (1) the code of this PR has already been tested live on Kurdish, right?

Yes, it has been tested only on kuwiktionary for now.

And (2) you now want approval to integrate it into the whole bot, so it runs on all existing languages?

Actually this is a script, so the idea is just to run it manually. It is not part of Lingua Libre Bot itself; it is a tool that makes using Lingua Libre Bot easier.

Here is the thing I am wondering about with such a tool. Did you anticipate side effects? For French, when there are 20+ recordings for a word, doesn't this risk force-injecting the "missing audios"? Or does the language-specific Python script set limits for those? (Note: my understanding of the bot is low, so I need to ask basic questions; I am doing my best to still provide a review and think about everything.)

The maximum number of recordings is managed wiki by wiki, depending on each wiki's rules (frwiktionary may have a different policy than kuwiktionary). So no side effects are expected.
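Such a per-wiki cap could be expressed as a small policy table. A minimal sketch under stated assumptions: the name `MAX_AUDIO_PER_WORD`, the wiki keys, and the numeric limits are all hypothetical illustrations, not the bot's actual configuration.

```python
# Hypothetical per-wiki policy: how many recordings a single entry may
# carry before the bot stops adding new ones. Values are illustrative.
MAX_AUDIO_PER_WORD = {
    "frwiktionary": 3,
    "kuwiktionary": 5,
}

def may_add_audio(wiki: str, current_count: int, default_cap: int = 3) -> bool:
    """Return True if the wiki's policy still allows one more recording."""
    return current_count < MAX_AUDIO_PER_WORD.get(wiki, default_cap)
```

Because the limit is looked up per wiki, a stricter or looser policy on one Wiktionary never affects the others.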

I would recommend making a short test run with French, our largest language, to check whether something goes wrong. A few dozen or a few hundred words? Is that possible?

Once I have finished on kuwiktionary and checked that everything looks fine, I will run the script on all the wikis together to add all the missing pronunciations that were not added during the bot outage.

you are requesting a code review, right?

More or less. First, I wanted to host the code somewhere visible to everyone, i.e. here. That way we can discuss it and report any issue that is detected. In the end, we may decide to merge it if we think it is useful.

@hugolpz
Member

hugolpz commented Jan 30, 2023

So it's a convenience script built around Lingua Libre Bot, a part of the toolkit, kept in the same repository to occasionally serve a specific purpose. Seems pretty cool!
Side effects have been considered and are under observation with ku:wikt.

You lead the way. I can't review the Python code much, but here are my few "reviewer" remarks.

On code:

  1. Maybe this seems better: #23.

  2. The if condition seems unnecessary to me. A few additional languages are reaching the 60~80k recordings milestone. For the sake of resiliency and code simplicity, shouldn't we apply this yearly split to all languages?

Around it:

  1. Test with French

I will run the script on all the wikis together

The final run may involve 100,000 edits. I guess you plan to closely monitor the first edits of each language?

  1. Document its usage (Clarify usage #24)
    In the script itself, the file's header contains a description, great! Adding a short how-to guide would be healthy. (Just one variable to change.)

  2. Put it on the map
    Add the script's name and purpose somewhere in the general Readme.md, likely before the See also section, in a new section?

Also, if it is run by hand and has no expected side effect, we can merge into master when you feel ready, within this week. ✌🏼

@hugolpz
Member

hugolpz commented Feb 9, 2023

Hello @pamputt, I corrected a minor typo per the commit above.

Second, I'm not sure sleep 5 is of any use there, between llbot.py calls. I believe such a 5-second pause should sit between specific edits, therefore inside llbot.py. If the junction of two llbot.py calls needs this 5-second pause, then it should be added within llbot.py, at its beginning or, better, at its end.

Third, the conditional switch between large languages requiring time-fragmented SPARQL calls and other languages requiring a single SPARQL call seems superfluous. Let's time-fragment them all; it only makes our code more resilient.

Fourth, IMHO and optionally, this time-fragmenting code could be moved into llbot.py.

Fifth, since we now have avenues to provide language + number of records (see hugolpz/Sparql2Data#4), we could add to llbot.py a parameter and a conditional switch such as `if [[ ?records >= 50000 ]]; then ... else ... fi`. But as suggested in (3), it is easier and more resilient to time-fragment all queries.

@pamputt
Contributor Author

pamputt commented Feb 9, 2023

The 5-second sleep is there because otherwise the Lingua Libre SPARQL endpoint returns a 429 error. It happens when I query the list of all recordings; I did not get this error elsewhere. The sleep could probably be smaller (2-3 seconds), but this is not a big deal.
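The pause-and-retry behaviour around the rate-limited endpoint can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the `RateLimited` exception and the `run_query` callable are hypothetical stand-ins for the real HTTP 429 handling.

```python
import time

class RateLimited(Exception):
    """Hypothetical stand-in for the SPARQL endpoint's HTTP 429 response."""

def query_with_pause(run_query, pause=5.0, retries=3, sleep=time.sleep):
    """Call run_query(); on a rate-limit error, wait `pause` seconds and retry.

    `run_query` wraps the actual SPARQL request. `sleep` is injectable so the
    backoff can be observed in tests without real waiting.
    """
    for attempt in range(retries):
        try:
            return run_query()
        except RateLimited:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(pause)
```

With this shape, lowering the pause to the 2-3 seconds mentioned above is a one-argument change.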

About your third point, why not; it is just a bit slower because of this sleep 5 between every call.

About the fourth point, I do not think it is a good idea to put that in llbot.py (more precisely in lili.py), because we cannot know what will happen in the future. For example, if the Lingua Libre SPARQL server becomes more robust, splitting the request might no longer be required. And how do we choose the granularity of the splitting? Is 6 months correct, too small, too large? It may not fit if 100k recordings are made within a given 6-month period.

@hugolpz
Member

hugolpz commented Feb 9, 2023

Ok for me as for most points. Your lead. 👍🏼

About the fourth point, I do not think it is a good idea to put that in llbot.py (more precisely in lili.py), because we cannot know what will happen in the future. For example, if the Lingua Libre SPARQL server becomes more robust, splitting the request might no longer be required. And how do we choose the granularity of the splitting? Is 6 months correct, too small, too large? It may not fit if 100k recordings are made within a given 6-month period.

Your observation encourages splitting more, not less. Current trends suggest our Blazegraph is getting slower as the number of records per language grows. Python is doing the math, so there is no cost to splitting all languages into 3-month chunks, whereas not splitting will result in a crash when a new language suddenly becomes large and we have no developer to hard-code it into the list of exceptions.
I don't see the point of having an IF switch between complex and simple calls when complex calls would work for all languages at no cost.
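The uniform 3-month fragmentation suggested here can be sketched like this. The function name and the idea of feeding each window into a date-filtered SPARQL query are assumptions for illustration, not the repository's actual code.

```python
from datetime import date

def quarter_chunks(start: date, end: date):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) in 3-month steps."""
    cur = start
    while cur < end:
        # Advance to the first day of the month three months ahead,
        # wrapping the year when the month index overflows December.
        month = cur.month + 3
        nxt = date(cur.year + (month - 1) // 12, (month - 1) % 12 + 1, 1)
        yield cur, min(nxt, end)
        cur = nxt
```

Each pair would then parameterise a date filter in the SPARQL query, so every language is queried the same way regardless of how many recordings it has.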

@pamputt
Contributor Author

pamputt commented Feb 9, 2023

Yes, we could split more, but what is the right granularity (every year, every month, every week, every day, ...)? We cannot guess it, and that is why it is not a good idea to hard-code it in lili.py: it would not let the user adapt the splitting to their needs.

About your last sentence: currently it is the sleep 5 that slows down the script if we do too many sleeps. But as I already said, it is not a big deal; we can implement the splitting for all languages, even if I do not see any advantage in doing it.

@hugolpz
Member

hugolpz commented Feb 9, 2023

I am currently doing my best to explore and learn how this Python bot distributes tasks and works. This allows me to raise the previous points to you, but I am still blind and not confident enough in my understanding to do more. You have considered those points, so we are good to go. We follow your lead, Pamputt 🌻 👍🏼

Member

@hugolpz hugolpz left a comment


I agree with the code. I let Pamputt do the merge.

@hugolpz hugolpz merged commit fe25f52 into master Apr 5, 2023