Report more clearly in the log when no ZIM is produced on purpose + produce the ZIM even if some error occurred #79

kelson42 · 2022-05-28T06:43:09Z

https://farm.openzim.org/pipeline/9d1f72ebc98ea2bced85e826

benoit74 · 2022-05-29T17:45:39Z

Looking at the scraper logs, the termination of the process seems quite "abrupt".
Is there any logic in the Zimfarm to stop the process once the reported progress is 100%? (The task logs seem to hint at something like this.) The scraper reports 100% once everything has been scraped, but there is still a bit of work to do, including finishing the ZIM production process properly and, in this scraper, reporting some URLs for debugging.

benoit74 · 2022-05-29T17:51:20Z

Oh no, I missed the log line stating "Finishing ZIM file" + "Finished Zim ifixit_es_all_2022-05.zim in /output"
@rgaudin: could you have a look at what happened in the farm more precisely? I see no obvious problem in the scraper logs.

benoit74 · 2022-05-29T17:52:24Z

Looks like the process finished properly; there are just some strange "phantom" logs about images being scraped, with a weird number of images.

benoit74 · 2022-05-29T18:04:27Z

Whoa, I looked at another scrape I ran on my side and it looks like I have the same issue. The logs state that the final ZIM file has been produced, but it is not there.

rgaudin · 2022-05-29T19:52:06Z

OK, seems similar; let me know if there's anything I can check on my end.

benoit74 · 2022-05-29T21:09:15Z

I had a look again at the scraper logs and noticed the following.

When the scraper finishes, I usually have these logs:

[MainThread::2022-05-29 22:58:23,865] INFO:Finishing ZIM file
T:8; A:107; RA:3; CA:41; UA:63; C:0; CC:0; UC:0; WC:0
T:8; ResolveRedirectIndexes
Resolve redirect
T:8; Set entry indexes
set index
T:8; Resolve mimetype
T:8; Waiting for workers
T:9; 112 title index created
T:9; 2 clusters created
T:9; write zimfile :
T:9;  write mimetype list
T:9;  write directory entries
T:9;  write url prt list
T:9;  write cluster offset list
T:9;  write header
T:9;  write checksum
T:9; rename tmpfile to final one.
T:9; finish
[MainThread::2022-05-29 22:58:24,333] INFO:Finished Zim ifixit_fr_selection_2022-05.zim in /Users/benoit/Repos/ifixit2zim/output

From my understanding, all logs starting with "T:" are indeed logs from the zimscraperlib / libzim.

What I observed (from only a few logs, so it may not generalize) is that:

- when no ZIM file is produced, all the final lines starting with "T:" are missing; we move directly from "Finishing ..." to "Finished ..."
- when a ZIM file is produced, I have the lines starting with "T:" as above
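
The observation above can be turned into a quick log check. This is only a sketch: it assumes the two marker strings and the "T:<number>;" prefix look exactly as in the excerpt above, and the function name is made up for illustration.

```python
import re

# Lines emitted during finalization in the excerpt above start with "T:<n>;"
T_LINE = re.compile(r"^T:\d+;")


def libzim_finalization_seen(log_lines):
    """Return True if a "T:" line appears between the two scraper markers."""
    inside = False
    for line in log_lines:
        if "INFO:Finishing ZIM file" in line:
            inside = True
            continue
        if "INFO:Finished Zim" in line:
            # Reached the end marker without seeing any "T:" line
            return False
        if inside and T_LINE.match(line):
            return True
    return False
```

Feeding it the lines of a scraper log would return False for the suspicious runs described above, where "Finished ..." directly follows "Finishing ...".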

@rgaudin : any idea what this could mean?

kelson42 · 2022-05-29T21:28:29Z

@mgautierfr I wonder if this is somehow linked to openzim/libzim#666

benoit74 · 2022-05-29T21:31:55Z

@kelson42 same hint on my side; it looks like something was calling finishZimCreation and it is no longer called now.
By the way @rgaudin, it is unclear to me what is supposed to call finishZimCreation from zimscraperlib, and why zimscraperlib is not calling it explicitly in the 'finish' method.

benoit74 · 2022-05-29T21:33:44Z

(pull requests with ID #666 should probably be refused anyway ^^)

mgautierfr · 2022-05-29T21:43:45Z

It is python-scraperlib's Creator.finish() which calls finishZimCreation. But the actual call may be skipped if self.can_finish is False (https://github.com/openzim/python-scraperlib/blob/master/src/zimscraperlib/zim/creator.py#L225-L232)

And this is possible if there is an exception during add_item: https://github.com/openzim/python-scraperlib/blob/master/src/zimscraperlib/zim/creator.py#L203

This may be the cause if some specific content makes add_item fail (and it would explain why it works with some content and not with others).
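
To make the mechanism concrete, here is a toy model of the behavior described above. It is a sketch only: ToyCreator is a hypothetical stand-in written for this illustration, not the actual zimscraperlib code, and the duplicate-path error is just an assumed way to make add_item fail.

```python
class ToyCreator:
    """Illustrative stand-in; not the real zimscraperlib Creator."""

    def __init__(self):
        self.can_finish = True   # flips to False after a failed add_item
        self.finalized = False   # True once the (simulated) ZIM is written
        self._paths = set()

    def add_item(self, path):
        try:
            if path in self._paths:
                # Simulate the kind of error the ZIM writer could raise
                raise RuntimeError(f"duplicate entry: {path}")
            self._paths.add(path)
        except Exception:
            self.can_finish = False
            raise  # the exception still reaches the caller

    def finish(self):
        if not self.can_finish:
            return  # finalization silently skipped: no ZIM file
        self.finalized = True  # real code would call finishZimCreation here


creator = ToyCreator()
creator.add_item("home")
try:
    creator.add_item("home")  # second add fails
except RuntimeError:
    pass  # a scraper that swallows the error keeps running
creator.finish()
print(creator.finalized)  # False
```

In this model, a caller that catches the exception and carries on ends up calling finish() on a creator that will quietly do nothing, which matches the missing "T:" lines.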

rgaudin · 2022-05-30T08:17:38Z

@mgautierfr's right. Because finish() would lead to the creation of a .zim file, which itself would trigger the upload of the ZIM, we added this workaround so that an exception in the scraper would prevent the call to finish and thus save us from uploading an invalid ZIM.

The workaround is ON by default, but it just sets this variable and re-raises the exception, so it wouldn't prevent your code from seeing it. From the logs, it seems no exception is raised up to the scraper's run() method, which is confirmed by the Finished Zim log line and the fact that your scraper returns 0.

It seems that scraper_generic.py's scrape_items() is catching all downstream exceptions and printing the traceback (which we can see in the log). In my opinion, this should raise (or halt the scraper) on RuntimeError, as those are most likely raised by pylibzim and thus not recoverable. Or maybe you're fine with the duplicate entries; in this case, I'd suggest you disable the workaround (workaround_nocancel=False passed to Creator()).
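
The first suggestion could look roughly like the sketch below. It does not reproduce the scraper's actual scrape_items(); the signature, the process callback and the logger name are assumptions made for the example.

```python
import logging

logger = logging.getLogger("scraper-sketch")  # hypothetical logger name


def scrape_items(items, process):
    """Process items; skip ordinary failures but let RuntimeError propagate."""
    failed = []
    for item in items:
        try:
            process(item)
        except RuntimeError:
            # Assumed to come from the ZIM writer: not recoverable, so halt
            raise
        except Exception:
            # Any other failure is logged with its traceback and skipped
            logger.exception("Skipping %s", item)
            failed.append(item)
    return failed
```

The order of the except clauses matters: RuntimeError is a subclass of Exception, so it has to be handled first to escape the catch-all.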

benoit74 · 2022-05-30T19:04:37Z

Thank you all for those explanations.

I confirm that there were some errors during processing, which led to can_finish being set to False.

The current logic is to catch those exceptions scraper-side, count how many occurred, and stop processing / ZIM file creation only if too many errors occur (threshold to be assessed). Maybe this is not the most appropriate technique, but I assumed it was better to have a fresher ZIM with a few items missing due to errors than no new ZIM at all. But I understand this assumption might be wrong.
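
As an illustration of that counting logic, here is a small sketch. ErrorBudget and process_all are hypothetical names, and the threshold is an arbitrary parameter since the real value is still to be assessed.

```python
class ErrorBudget:
    """Count tolerated errors and abort once a threshold is exceeded."""

    def __init__(self, max_errors):
        self.max_errors = max_errors
        self.count = 0

    def record(self, exc):
        self.count += 1
        if self.count > self.max_errors:
            raise RuntimeError(
                f"{self.count} errors exceed the limit of {self.max_errors}"
            ) from exc


def process_all(items, handler, budget):
    """Run handler on each item; return how many items succeeded."""
    done = 0
    for item in items:
        try:
            handler(item)
            done += 1
        except Exception as exc:
            budget.record(exc)  # raises once the budget is exhausted
    return done
```

With ErrorBudget(1), a single failing item is tolerated and the run continues, while a second failure aborts the whole run.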

I will add a check of the "can_finish" status and display it clearly in the log, so that we can better understand the situation next time. The current logs are clearly misleading.
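
A possible shape for that check is sketched below. It assumes only that the creator object exposes a can_finish attribute, as discussed earlier in this thread; the function name, logger name and messages are made up for the example.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("scraper-sketch")  # hypothetical logger name


def report_finish_status(creator, zim_name):
    """Log whether the ZIM was really produced; return True if it was."""
    if creator.can_finish:
        logger.info("Finished Zim %s", zim_name)
        return True
    logger.error(
        "No ZIM produced for %s: can_finish is False "
        "(an error occurred while adding an item)",
        zim_name,
    )
    return False


# Usage with a stand-in object exposing only the attribute we read
stub = SimpleNamespace(can_finish=False)
produced = report_finish_status(stub, "example.zim")
```

The return value could also drive the scraper's exit code, so that a run without a ZIM is not reported as a success.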

benoit74 · 2022-05-30T19:05:10Z

I also opened another issue for those errors, which are indeed not really expected.

kelson42 added the bug and question labels May 28, 2022

kelson42 added this to the 0.2.1 milestone May 28, 2022

rgaudin changed the title ~~Success in Zimfarm but no zim file for WikiHow ES~~ Success in Zimfarm but no zim file for iFixIt ES May 29, 2022

kelson42 assigned benoit74 May 29, 2022

benoit74 mentioned this issue May 30, 2022

Fixes for 0 2 1 #83

Merged

benoit74 changed the title ~~Success in Zimfarm but no zim file for iFixIt ES~~ Report more clearly in the log when no ZIM is produced on purpose + produce the ZIM even if some error occurred May 30, 2022

benoit74 closed this as completed in #83 May 30, 2022
