Report more clearly in the log when no ZIM is produced on purpose + produce the ZIM even if some error occurred #79

Closed
kelson42 opened this issue May 28, 2022 · 13 comments · Fixed by #83

Comments

@kelson42
Contributor

https://farm.openzim.org/pipeline/9d1f72ebc98ea2bced85e826

@kelson42 kelson42 added bug Something isn't working question Further information is requested labels May 28, 2022
@kelson42 kelson42 added this to the 0.2.1 milestone May 28, 2022
@benoit74
Collaborator

Looking at the scraper logs, the termination of the process seems quite "abrupt".
Is there any logic in the Zimfarm that stops the process once the reported progress reaches 100%? The task logs seem to hint at something like this. The scraper reports 100% once everything has been scraped, but there is still a bit of work left: finishing the ZIM production process properly and, in this scraper, reporting some URLs for debugging.

@benoit74
Collaborator

Oh no, I missed the log lines stating "Finishing ZIM file" and "Finished Zim ifixit_es_all_2022-05.zim in /output".
@rgaudin: could you have a look at what happened in the farm more precisely? I see no obvious problem in the scraper logs.

@benoit74
Collaborator

Looks like the process finished properly; there are just some strange "phantom" logs about images being scraped, with a weird number of images.

@benoit74
Collaborator

Whoa, I looked at another scrape I ran on my side and it looks like I have the same issue. The logs state that the final ZIM file has been produced, but it is not there.

@rgaudin
Member

rgaudin commented May 29, 2022

OK, seems similar; let me know if there's anything I can check on my end.

@rgaudin rgaudin changed the title Success in Zimfarm but no zim file for WikiHow ES Success in Zimfarm but no zim file for iFixIt ES May 29, 2022
@benoit74
Collaborator

I had a look again at the scraper logs and noticed the following.

When the scraper finishes, I usually have these logs:

[MainThread::2022-05-29 22:58:23,865] INFO:Finishing ZIM file
T:8; A:107; RA:3; CA:41; UA:63; C:0; CC:0; UC:0; WC:0
T:8; ResolveRedirectIndexes
Resolve redirect
T:8; Set entry indexes
set index
T:8; Resolve mimetype
T:8; Waiting for workers
T:9; 112 title index created
T:9; 2 clusters created
T:9; write zimfile :
T:9;  write mimetype list
T:9;  write directory entries
T:9;  write url prt list
T:9;  write cluster offset list
T:9;  write header
T:9;  write checksum
T:9; rename tmpfile to final one.
T:9; finish
[MainThread::2022-05-29 22:58:24,333] INFO:Finished Zim ifixit_fr_selection_2022-05.zim in /Users/benoit/Repos/ifixit2zim/output

From my understanding, all logs starting with "T:" are indeed logs from the zimscraperlib / libzim.

What I observed (from a few logs, so it may not generalize well) is that:

  • when no ZIM file is produced, all final lines starting with "T:" are missing; we are moving directly from "Finishing ..." to "Finished ..."
  • when a ZIM file is produced, I have the lines starting with "T:" as above
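
The observation in the two bullets above can be checked mechanically. A minimal sketch, assuming the log markers shown in the excerpt ("Finishing ZIM file", "Finished Zim", and libzim lines starting with "T:"); the function name zim_writer_ran is hypothetical and not part of the scraper, and the sample logs are abbreviated from the excerpt:

```python
def zim_writer_ran(log_text: str) -> bool:
    """Return True if "T:" lines appear between the Finishing/Finished markers."""
    inside = False
    for line in log_text.splitlines():
        if "INFO:Finishing ZIM file" in line:
            inside = True
        elif "INFO:Finished Zim" in line:
            inside = False
        elif inside and line.startswith("T:"):
            return True
    return False


# Abbreviated from the log excerpt above.
good_log = """\
[MainThread::2022-05-29 22:58:23,865] INFO:Finishing ZIM file
T:8; Waiting for workers
T:9; finish
[MainThread::2022-05-29 22:58:24,333] INFO:Finished Zim ifixit_fr_selection_2022-05.zim in /Users/benoit/Repos/ifixit2zim/output
"""

# Same markers, but without any "T:" line in between.
suspicious_log = """\
[MainThread::2022-05-29 22:58:23,865] INFO:Finishing ZIM file
[MainThread::2022-05-29 22:58:24,333] INFO:Finished Zim ifixit_fr_selection_2022-05.zim in /Users/benoit/Repos/ifixit2zim/output
"""

print(zim_writer_ran(good_log))        # True
print(zim_writer_ran(suspicious_log))  # False
```

This only reflects the pattern seen in a few logs, so a False result is a hint to check the output directory rather than proof that no ZIM was written.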

@rgaudin: any idea what this could mean?

@kelson42
Contributor Author

@mgautierfr I wonder if this is somehow linked to openzim/libzim#666

@benoit74
Collaborator

@kelson42 same hunch on my side: it looks like something used to call finishZimCreation and no longer does.
By the way @rgaudin, it is unclear to me what is supposed to call finishZimCreation from zimscraperlib, and why zimscraperlib does not call it explicitly in the 'finish' method.

@benoit74
Collaborator

(pull requests with ID #666 should probably be refused anyway ^^)

@mgautierfr

It is python-scraperlib's Creator.finish() which calls finishZimCreation. But the actual call may be skipped if self.can_finish is False (https://github.com/openzim/python-scraperlib/blob/master/src/zimscraperlib/zim/creator.py#L225-L232)

And this is possible if there is an exception during add_item: https://github.com/openzim/python-scraperlib/blob/master/src/zimscraperlib/zim/creator.py#L203

This may be the cause if some specific content makes add_item fail (and it would explain why it works with some content and not with others).
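
The mechanism described above can be modelled in a few lines. This is a simplified sketch, not the actual python-scraperlib code: SketchCreator is a hypothetical stand-in, only the names can_finish, add_item and finish come from the discussion, and the failure condition is invented for illustration.

```python
class SketchCreator:
    """Hypothetical, simplified model of the behaviour described above."""

    def __init__(self):
        self.can_finish = True
        self.items = []
        self.finished = False

    def add_item(self, item):
        try:
            if item is None:  # invented stand-in for content that makes add_item fail
                raise RuntimeError("could not add item")
            self.items.append(item)
        except Exception:
            # Remember the failure, then let the caller see the exception.
            self.can_finish = False
            raise

    def finish(self):
        if not self.can_finish:
            return  # the ZIM is silently not written
        self.finished = True  # stand-in for the real finalization


creator = SketchCreator()
creator.add_item("guide-1")
try:
    creator.add_item(None)
except RuntimeError:
    pass  # a caller that swallows the exception, as the scraper appears to do

creator.finish()
print(creator.can_finish, creator.finished)  # False False
```

In this model a single swallowed exception is enough for finish() to return without producing anything, which matches the "Finishing" then "Finished" log lines with no "T:" lines in between.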

@rgaudin
Member

rgaudin commented May 30, 2022

@mgautierfr's right. Because finish() would lead to the creation of a .zim file (which itself would trigger the upload of the ZIM), we added this workaround so that an Exception in the scraper would prevent the call to finish and thus save us from uploading an invalid ZIM.

The workaround is ON by default, but it just sets this variable and re-raises the Exception, so it wouldn't prevent your code from seeing it. From the logs, it seems no exception is raised up to the scraper's run() method, which is confirmed by the Finished Zim log line and the fact that your scraper returns 0.

It seems that scraper_generic.py's scrape_items() is catching all exceptions raised further down and printing the traceback (which we can see in the log). In my opinion, this should raise (or halt the scraper) on RuntimeError, as those are most likely raised by pylibzim and thus not recoverable. Or maybe you're fine with the duplicate entries; in this case, I'd suggest you disable the workaround (workaround_nocancel=False passed to Creator()).
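
The first suggestion can be sketched as a loop that still logs ordinary errors but lets RuntimeError propagate. This is an illustration only: the scrape_items below is a hypothetical simplification and not the code in scraper_generic.py, and the process callback and its failure conditions are invented.

```python
import logging

logger = logging.getLogger("sketch")


def scrape_items(items, process):
    """Process each item; log ordinary errors, but let RuntimeError halt the run."""
    failed = []
    for item in items:
        try:
            process(item)
        except RuntimeError:
            # Assumed non-recoverable (most likely raised by pylibzim): re-raise.
            raise
        except Exception:
            logger.exception("Error while processing %s", item)
            failed.append(item)
    return failed


def process(item):
    # Invented failure conditions, for illustration only.
    if item == "bad":
        raise ValueError("cannot parse item")
    if item == "fatal":
        raise RuntimeError("error from the ZIM writer")


failed = scrape_items(["a", "bad", "b"], process)
print(failed)  # ['bad']

halted = False
try:
    scrape_items(["a", "fatal", "b"], process)
except RuntimeError:
    halted = True
print(halted)  # True
```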

@benoit74
Collaborator

Thank you all for those explanations.

I confirm that there were some errors during processing, which led to can_finish being set to False.

The current logic is to catch those exceptions scraper-side, count how many occurred, and stop processing / ZIM file creation only if too many errors occur (threshold to be assessed). Maybe this is not the most appropriate technique, but I assumed it was better to have a fresher ZIM with a few items missing due to errors than no new ZIM at all. But I understand this assumption might be wrong.
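
A minimal sketch of such an error threshold, assuming a fixed maximum number of tolerated errors. ErrorBudget, TooManyErrors and the value 2 are hypothetical names and numbers for illustration, not the scraper's actual implementation:

```python
class TooManyErrors(Exception):
    """Raised when more errors were recorded than the budget allows."""


class ErrorBudget:
    def __init__(self, max_errors: int):
        self.max_errors = max_errors
        self.count = 0

    def record(self, exc: Exception) -> None:
        """Count one error; abort once the count exceeds max_errors."""
        self.count += 1
        if self.count > self.max_errors:
            raise TooManyErrors(
                f"{self.count} errors (max {self.max_errors})"
            ) from exc


budget = ErrorBudget(max_errors=2)
processed = []
aborted = False
try:
    for item in ["a", "bad", "b", "bad", "c", "bad", "d"]:
        try:
            if item == "bad":  # invented failure condition
                raise ValueError("cannot process item")
            processed.append(item)
        except ValueError as exc:
            budget.record(exc)
except TooManyErrors:
    aborted = True

print(processed, budget.count, aborted)  # ['a', 'b', 'c'] 3 True
```

The first two failures are tolerated and the items around them are still processed; the third failure exceeds the budget and stops the run.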

I will add a log line that explicitly checks the "can_finish" status and displays it, so that we can better understand the situation next time. The current logs are clearly misleading.
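
One possible shape for that check, as a sketch only: finish_and_report is a hypothetical helper, the log messages are invented, and the creator objects are stubs that only mimic the can_finish attribute and finish() method discussed above.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("sketch")


def finish_and_report(creator) -> bool:
    """Finish the ZIM if allowed; otherwise log clearly that none is produced."""
    if not getattr(creator, "can_finish", True):
        logger.error(
            "No ZIM produced on purpose: can_finish is False "
            "(an error occurred while adding items)"
        )
        return False
    logger.info("Finishing ZIM file")
    creator.finish()
    logger.info("Finished ZIM file")
    return True


# Stub creators standing in for the real one.
calls = []
ok_creator = SimpleNamespace(can_finish=True, finish=lambda: calls.append("ok"))
blocked_creator = SimpleNamespace(can_finish=False, finish=lambda: calls.append("blocked"))

ok_result = finish_and_report(ok_creator)
blocked_result = finish_and_report(blocked_creator)
print(ok_result, blocked_result, calls)  # True False ['ok']
```

Returning a boolean also lets the caller choose a non-zero exit code when no ZIM was written, instead of reporting success.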

@benoit74
Collaborator

benoit74 commented May 30, 2022

I also opened another issue for those errors, which are indeed not really expected.

@benoit74 benoit74 changed the title Success in Zimfarm but no zim file for iFixIt ES Report more clearly in the log when no ZIM is produced on purpose + produce the ZIM even if some error occurred May 30, 2022

4 participants