Change the flow of how we scrape (option 3) #1100

Closed
midik opened this issue May 8, 2020 · 5 comments

midik (Contributor) commented May 8, 2020

Related to #1043

(option 3)
We're trying to gather clues for #1043. Here is my suggestion for another approach.

A) Reorder the logical flow from
[mw metadata, articles[], deps[], files[]] to
[mw metadata, articles[json, deps[], files[]]].
That way the records in a ZIM would no longer be
[article1, article2, ... css1, js1, ... gif1, jpeg1...] but
[article1, css1, js1, gif1, gif2 <end of article1> ... article2, css2, js2, video1, png1...].
See below (B-D) for why this is required.
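
To illustrate, a minimal sketch of the per-article write order; all names here (`ZimRecord`, `addRecordToZim`, `fetchDependencies`) are hypothetical, not the actual mwoffliner or node-libzim API:

```ts
// Hypothetical sketch: write each article together with all of its
// dependencies before moving on to the next one, so the ZIM ends up
// ordered as [article1, deps1..., article2, deps2...].
interface ZimRecord {
  path: string;
  data: Buffer;
}

async function writeArticleWithDeps(
  article: ZimRecord,
  fetchDependencies: (a: ZimRecord) => Promise<ZimRecord[]>,
  addRecordToZim: (r: ZimRecord) => Promise<void>,
): Promise<void> {
  await addRecordToZim(article); // the article itself first
  for (const dep of await fetchDependencies(article)) {
    await addRecordToZim(dep); // then its css/js/images/videos, immediately after
  }
}
```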

B) Based on A, keep the list of articles we have successfully scraped (or the list of remaining ones) in Redis. RPS throttling must, of course, still be respected here.
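
A sketch of that bookkeeping, assuming the node-redis v4 client; the set name `mwoffliner:done` is invented for this example:

```ts
import { createClient } from 'redis';

// Track which articles have been fully written, so a restarted scraper
// can skip them. The set name 'mwoffliner:done' is made up for this sketch.
const redis = createClient();
await redis.connect();

async function markDone(articleId: string): Promise<void> {
  await redis.sAdd('mwoffliner:done', articleId);
}

async function isDone(articleId: string): Promise<boolean> {
  return redis.sIsMember('mwoffliner:done', articleId);
}
```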

C) Turn the function/closure that actually does the work into a self-sustaining, idempotent chunk of logic (in place) that can be treated as unreliable. I'm not sure how to best describe that in English; the closest analogy is a Docker container that can be destroyed or lost and quickly re-spawned with exactly the same parameters.
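
Shaped roughly like this, reusing the Redis helpers sketched under B; `downloadArticleWithDeps` is another hypothetical helper:

```ts
// An idempotent unit of work: safe to destroy and re-run with the same
// parameters, like a container. isDone/markDone are the Redis helpers
// sketched above; downloadArticleWithDeps is hypothetical.
declare function downloadArticleWithDeps(id: string): Promise<void>;

async function processArticle(articleId: string): Promise<void> {
  if (await isDone(articleId)) return;       // already handled on a previous run
  await downloadArticleWithDeps(articleId);  // writes the article + its deps
  await markDone(articleId);                 // record success only at the very end
}
```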

D) Based on A, B and C, carry on with the resume logic (we made an attempt before in #782, and the commented-out stub is still in the source code). That sounds less like a fix and more like a workaround for this particular bug, but it would be valuable in itself. The idea is to keep Redis running inside the Zimfarm task but rerun mwoffliner as needed, for example once the log is "depleted", i.e. no new log records appear in stdout/stderr within a predefined time window. I already implemented that in one of my previous projects. We should also check whether we can handle a previously opened / non-finalized ZIM and continue appending records to it; that would probably have to be implemented in node-libzim first.
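
A sketch of such a "depleted log" watchdog in Node; the idle window and CLI arguments are placeholders, and a real version would also cap the number of retries:

```ts
import { spawn } from 'node:child_process';

// Arbitrary: consider the log depleted after 10 minutes of silence.
const IDLE_MS = 10 * 60 * 1000;

// Run mwoffliner once, killing it if stdout/stderr stay silent for IDLE_MS.
function runOnce(args: string[]): Promise<number> {
  return new Promise((resolve) => {
    const child = spawn('mwoffliner', args);
    let timer = setTimeout(() => child.kill('SIGKILL'), IDLE_MS);
    const poke = () => {
      clearTimeout(timer);
      timer = setTimeout(() => child.kill('SIGKILL'), IDLE_MS);
    };
    child.stdout?.on('data', poke);
    child.stderr?.on('data', poke);
    child.on('exit', (code) => {
      clearTimeout(timer);
      resolve(code ?? 1); // null exit code (killed by signal) counts as failure
    });
  });
}

// Re-spawn with exactly the same parameters until a clean exit; the Redis
// bookkeeping from B ensures completed articles are skipped on each rerun.
async function runWithResume(args: string[]): Promise<void> {
  while ((await runOnce(args)) !== 0) {
    console.error('mwoffliner stalled or died, re-spawning');
  }
}
```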

Would this make sense?

kelson42 (Collaborator) commented May 8, 2020

I don't really see the relation to the problem we are trying to fix, and I don't really understand what the architectural problem with the current solution is.

midik (Contributor, Author) commented May 8, 2020

We sometimes get stuck in the middle of the scraping process. With this approach, if that happens, we would get feedback that a problematic worker's log is depleted, and could kill and re-spawn it.

midik (Contributor, Author) commented May 8, 2020

> what is the architecture problem

Regarding the current issue, the problem is that we perform the work inside the main loop. We can handle errors, crashes, and wrong responses, but we cannot handle "nothing". There is no supervising "process" outside the loop yet in which we could react to the working loop going away.
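
For instance, the only way to surface a silent hang is to race each unit of work against a timeout from the outside; a sketch, not existing mwoffliner code:

```ts
// A hang produces "nothing", so it can only be detected from outside the
// work itself: silence beyond `ms` becomes a normal, catchable error.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    work,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`no progress within ${ms} ms`)), ms),
    ),
  ]);
}

// Usage sketch: await withTimeout(processArticle(id), 60_000);
```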

stale bot commented Jul 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale bot added the stale label Jul 7, 2020
stale bot removed the stale label Nov 14, 2020
stale bot commented Jan 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale bot added the stale label Jan 14, 2021
kelson42 closed this as not planned Jan 8, 2023
kelson42 self-assigned this Jan 8, 2023
kelson42 added this to the 1.12.0 milestone Jan 8, 2023