Change the flow of how we scrape (option 3) #1100

Closed
midik opened this issue May 8, 2020 · 5 comments

midik (Contributor) commented May 8, 2020

Related to #1043

(option 3)
We're trying to gather clues for #1043. Here is my suggestion for another approach.

A) Reorder the logical flow from
[mw metadata, articles[], deps[], files[]] to
[mw metadata, articles[json, deps[], files[]]].
That way the records in a ZIM would no longer be
[article1, article2, ... css1, js1, ... gif1, jpeg1...] but
[article1, css1, js1, gif1, gif2 <end of article1> ... article2, css2, js2, video1, png1...].
See below (B-D) for why this is required.
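
To illustrate, a minimal sketch of the per-article write order; all names here (`ZimRecord`, `addRecordToZim`, `fetchDependencies`) are hypothetical, not the actual mwoffliner or node-libzim API:

```ts
// Hypothetical sketch: write each article together with all of its
// dependencies before moving on to the next one, so the ZIM ends up
// ordered as [article1, deps1..., article2, deps2...].
interface ZimRecord {
  path: string;
  data: Buffer;
}

async function writeArticleWithDeps(
  article: ZimRecord,
  fetchDependencies: (a: ZimRecord) => Promise<ZimRecord[]>,
  addRecordToZim: (r: ZimRecord) => Promise<void>,
): Promise<void> {
  await addRecordToZim(article); // the article itself first
  for (const dep of await fetchDependencies(article)) {
    await addRecordToZim(dep); // then its css/js/images/videos, immediately after
  }
}
```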

B) Based on A, keep the list of articles we have successfully scraped (or the list of remaining ones) in Redis. RPS throttling must, of course, still be respected here.
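
A sketch of that bookkeeping, assuming the node-redis v4 client; the set name `mwoffliner:done` is invented for this example:

```ts
import { createClient } from 'redis';

// Track which articles have been fully written, so a restarted scraper
// can skip them. The set name 'mwoffliner:done' is made up for this sketch.
const redis = createClient();
await redis.connect();

async function markDone(articleId: string): Promise<void> {
  await redis.sAdd('mwoffliner:done', articleId);
}

async function isDone(articleId: string): Promise<boolean> {
  return redis.sIsMember('mwoffliner:done', articleId);
}
```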

C) Turn the function/closure that actually does the work into a self-sustaining, idempotent chunk of logic (in place) that can be treated as unreliable. I'm not sure how to best describe that in English; the closest analogy is a Docker container that can be destroyed or lost and quickly re-spawned with exactly the same parameters.
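
Shaped roughly like this, reusing the Redis helpers sketched under B; `downloadArticleWithDeps` is another hypothetical helper:

```ts
// An idempotent unit of work: safe to destroy and re-run with the same
// parameters, like a container. isDone/markDone are the Redis helpers
// sketched above; downloadArticleWithDeps is hypothetical.
declare function downloadArticleWithDeps(id: string): Promise<void>;

async function processArticle(articleId: string): Promise<void> {
  if (await isDone(articleId)) return;       // already handled on a previous run
  await downloadArticleWithDeps(articleId);  // writes the article + its deps
  await markDone(articleId);                 // record success only at the very end
}
```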

D) Based on A, B and C, carry on with the resume logic (we made an attempt before in #782, and the commented-out stub is still in the source code). That sounds less like a fix and more like a workaround for this particular bug, but it would be valuable in itself. The idea is to keep Redis running inside the Zimfarm task but rerun mwoffliner as needed, for example once the log is "depleted", i.e. no new log records appear in stdout/stderr within a predefined time window. I already implemented that in one of my previous projects. We should also check whether we can handle a previously opened / non-finalized ZIM and continue appending records to it; that would probably have to be implemented in node-libzim first.
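
A sketch of such a "depleted log" watchdog in Node; the idle window and CLI arguments are placeholders, and a real version would also cap the number of retries:

```ts
import { spawn } from 'node:child_process';

// Arbitrary: consider the log depleted after 10 minutes of silence.
const IDLE_MS = 10 * 60 * 1000;

// Run mwoffliner once, killing it if stdout/stderr stay silent for IDLE_MS.
function runOnce(args: string[]): Promise<number> {
  return new Promise((resolve) => {
    const child = spawn('mwoffliner', args);
    let timer = setTimeout(() => child.kill('SIGKILL'), IDLE_MS);
    const poke = () => {
      clearTimeout(timer);
      timer = setTimeout(() => child.kill('SIGKILL'), IDLE_MS);
    };
    child.stdout?.on('data', poke);
    child.stderr?.on('data', poke);
    child.on('exit', (code) => {
      clearTimeout(timer);
      resolve(code ?? 1); // null exit code (killed by signal) counts as failure
    });
  });
}

// Re-spawn with exactly the same parameters until a clean exit; the Redis
// bookkeeping from B ensures completed articles are skipped on each rerun.
async function runWithResume(args: string[]): Promise<void> {
  while ((await runOnce(args)) !== 0) {
    console.error('mwoffliner stalled or died, re-spawning');
  }
}
```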

Would this make sense?

kelson42 (Collaborator) commented May 8, 2020

I don't really see the relation to the problem we are trying to fix, and I don't really understand what the architectural problem with the current solution is.

midik (Contributor, Author) commented May 8, 2020

We sometimes get stuck in the middle of the scraping process. With this approach, if that happens, we would get feedback that a problematic worker's log is depleted, and could kill and re-spawn it.

midik (Contributor, Author) commented May 8, 2020

> what is the architecture problem

Regarding the current issue, the problem is that we perform the work inside the main loop. We can handle errors, crashes, and wrong responses, but we cannot handle "nothing". There is no supervising "process" outside the loop yet in which we could react to the working loop going away.
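
For instance, the only way to surface a silent hang is to race each unit of work against a timeout from the outside; a sketch, not existing mwoffliner code:

```ts
// A hang produces "nothing", so it can only be detected from outside the
// work itself: silence beyond `ms` becomes a normal, catchable error.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    work,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`no progress within ${ms} ms`)), ms),
    ),
  ]);
}

// Usage sketch: await withTimeout(processArticle(id), 60_000);
```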

stale bot commented Jul 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale bot added the stale label Jul 7, 2020
stale bot removed the stale label Nov 14, 2020
stale bot commented Jan 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale bot added the stale label Jan 14, 2021
kelson42 closed this as not planned Jan 8, 2023
kelson42 self-assigned this Jan 8, 2023
kelson42 added this to the 1.12.0 milestone Jan 8, 2023