Change the flow of how we scrape (option 3) #1100
Comments
I don't really see the relation to the problem we're trying to fix. I don't really understand what the architectural problem with the current solution is.
We sometimes get stuck in the middle of the scraping process. This way, if that happens, we'll get feedback that a problematic worker is depleted and can kill it.
Regarding the current issue, the problem is that we perform the work inside the main loop. We can handle errors, crashes, and wrong responses, but we cannot handle "nothing". There's no process outside the loop yet from which we could react when the working loop goes away.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Related to #1043
(option 3)
We're trying to gather clues for #1043. Here is my guess at another approach.
A) Reorder the logical flow from

[mw metadata, articles[], deps[], files[]]

to

[mw metadata, articles[json, deps[], files[]]]

So the records in a ZIM will no longer be

[article1, article2, ... css1, js1, ... gif1, jpeg1...]

but

[article1, css1, js1, gif1, gif2 <end of article1> ... article2, css2, js2, video1, png1... ]

See below what this is required for.
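The reordering in (A) can be sketched as a small transformation. This is only an illustration of the record ordering, not mwoffliner's actual data model; `Article`, `oldOrder`, and `newOrder` are hypothetical names.

```typescript
// Hypothetical shape: an article plus the dependencies (css, js, media) it needs.
type Article = { id: string; deps: string[] };

// Current flow: all articles first, then all dependencies grouped afterwards.
function oldOrder(articles: Article[]): string[] {
  return [...articles.map((a) => a.id), ...articles.flatMap((a) => a.deps)];
}

// Proposed flow (A): each article immediately followed by its own dependencies,
// so a complete article "chunk" is self-contained in the ZIM record stream.
function newOrder(articles: Article[]): string[] {
  return articles.flatMap((a) => [a.id, ...a.deps]);
}

const sample: Article[] = [
  { id: "article1", deps: ["css1", "js1", "gif1"] },
  { id: "article2", deps: ["css2", "video1"] },
];
```

With this ordering, `newOrder(sample)` yields article1 followed by its css/js/gif before article2 starts, which is what makes per-article resume in (B) and (D) possible.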
B) Based on A, keep the list of articles that we've succeeded with (or the list of remaining ones) in Redis. Of course, keep RPS throttling respected here.
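A minimal sketch of (B), assuming an in-memory `Set` as a stand-in for Redis (the real thing would use e.g. `SADD`/`SISMEMBER` on a key, as noted in comments). `markDone`, `remaining`, and `makeThrottle` are illustrative names, not mwoffliner's API.

```typescript
// Stand-in for a Redis set of completed article ids.
const done = new Set<string>();

// Record that an article was fully written (Redis: SADD scrape:done <id>).
function markDone(articleId: string): void {
  done.add(articleId);
}

// Compute what is left to scrape, so a restarted run can resume
// (Redis: filter by SISMEMBER scrape:done <id>).
function remaining(all: string[]): string[] {
  return all.filter((id) => !done.has(id));
}

// Simple RPS throttle: each call waits so that at most `rps`
// requests start per second, even across restarts of the work loop.
function makeThrottle(rps: number): () => Promise<void> {
  let next = 0;
  return async () => {
    const now = Date.now();
    const wait = Math.max(0, next - now);
    next = Math.max(now, next) + 1000 / rps;
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
  };
}
```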
C) Turn the function/closure that actually does the work into a self-sustaining, idempotent chunk of logic (in place) that could be considered unreliable. Not sure how to describe that in English; the closest analogy is a Docker container that could be destroyed or lost and re-spawned quickly with exactly the same parameters.
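One way to read (C) is the sketch below: a unit of work that is safe to kill and re-run with the same parameters because it first checks whether its result already exists. `processArticle`, `fetchArticle`, and `alreadyDone` are hypothetical names for illustration.

```typescript
type WorkResult = { id: string; ok: boolean };

// An idempotent, "unreliable by design" unit of work: if it dies or is
// killed, the supervisor simply re-spawns it with the same article id.
async function processArticle(
  id: string,
  fetchArticle: (id: string) => Promise<string>,
  alreadyDone: (id: string) => boolean,
): Promise<WorkResult> {
  // Idempotency: work already recorded as done (e.g. in Redis, per B)
  // is skipped, so re-running after a crash causes no duplicates.
  if (alreadyDone(id)) return { id, ok: true };
  try {
    await fetchArticle(id);
    return { id, ok: true };
  } catch {
    // Report failure and let the caller decide to retry or give up;
    // the worker itself never needs to be trusted to recover.
    return { id, ok: false };
  }
}
```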
D) Based on A, B and C, carry on with the resume logic (we had a try before in #782, and we have a commented-out stub in the source code). That sounds not like a fix but like a workaround for this particular bug, yet it would be of value by itself. The idea is to keep Redis running inside the Zimfarm task, but rerun mwoffliner as needed, for example once the log is "depleted", meaning no more log records appear in stdout/stderr within a predefined time window. I already implemented that in one of my previous projects. We should check whether we can handle a previously opened / non-finalized ZIM and continue appending records to it; probably that would need to be implemented in node-libzim first.
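The "log depleted" detection in (D) can be sketched as a resettable watchdog timer: every stdout/stderr line re-arms it, and if no line arrives within the window it fires, letting the supervisor kill and rerun mwoffliner. This is a minimal illustration; `makeWatchdog` and its callbacks are hypothetical names.

```typescript
// Watchdog: fires `onDepleted` if no log line is seen for `windowMs`.
function makeWatchdog(windowMs: number, onDepleted: () => void) {
  let timer: ReturnType<typeof setTimeout> | null = null;
  const arm = () => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(onDepleted, windowMs);
  };
  arm(); // start the window immediately
  return {
    // Call on every stdout/stderr line from the child to reset the window.
    onLogLine: arm,
    // Call when the child exits normally, so the watchdog doesn't fire.
    stop: () => {
      if (timer) clearTimeout(timer);
      timer = null;
    },
  };
}
```

A supervisor would pipe the mwoffliner child process's output lines to `onLogLine()` and, when `onDepleted` fires, kill the child and re-spawn it with exactly the same parameters, relying on (B) and (C) to resume where it left off.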
Would this make sense?