
[Question] Squidwarc Frontier Management and long scalable crawls #5

Open
N0taN3rd opened this issue Aug 20, 2018 · 3 comments

@N0taN3rd (Contributor)

One of the use cases I have wanted to support in Squidwarc is multiple worker crawlers populating and pulling from a single master frontier.

Another is a move from the current in-memory frontier to a more scalable frontier scheme.

Since warcworker is light years ahead in this regard 😍 (i.e. a frontend for Squidwarc with multiple crawler workers and the potential to expand into managing long crawls), I thought it best to see if warcworker has any interest in this functionality and, if so, to coordinate development 😃
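
As a rough illustration of the single-master-frontier idea, here is a minimal sketch assuming a Redis-backed queue plus a seen-set; the key names, the ioredis client, and the `crawlPage` callback are hypothetical stand-ins, not anything either project currently implements:

```ts
// Hypothetical master frontier shared by several crawler workers:
// a Redis list holds the queue of URLs to visit and a Redis set records
// URLs already enqueued, so workers both pull from and feed the same frontier.
import Redis from "ioredis";

const redis = new Redis(); // assumes Redis on localhost:6379

// Enqueue a URL only if no worker has seen it before.
async function enqueue(url: string): Promise<void> {
  const added = await redis.sadd("frontier:seen", url);
  if (added === 1) {
    await redis.lpush("frontier:queue", url);
  }
}

// Block until the master frontier yields the next URL.
async function nextUrl(): Promise<string> {
  const reply = await redis.brpop("frontier:queue", 0);
  return reply![1]; // timeout 0 blocks until an item arrives
}

// Each worker runs this loop; crawlPage stands in for the real crawler and
// returns the outlinks discovered on the page.
async function workerLoop(crawlPage: (url: string) => Promise<string[]>) {
  for (;;) {
    const url = await nextUrl();
    const outlinks = await crawlPage(url);
    await Promise.all(outlinks.map(enqueue));
  }
}
```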

@peterk (Owner) commented Aug 28, 2018

Sorry for the late reply @N0taN3rd! I would love to support further development, although I fear Squidwarc's code quality is way ahead of warcworker's right now. Please share any ideas you have. I am also looking at building a collection front end (or adapting to SFM later).

@N0taN3rd (Contributor, Author) commented Sep 5, 2018

A starting idea for this is a continuous crawl mode rather than the current one-off crawls, i.e. start a Squidwarc crawl and, once it completes, have it wait for another config to be sent to it or until it is killed.

Not sure yet about the best way to communicate with Squidwarc. Thinking websockets to keep the deps light(ish) for starters.
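
A minimal sketch of what that continuous mode could look like, assuming the ws package for the control channel; `CrawlConfig`, `runCrawl`, and the port are hypothetical stand-ins for whatever Squidwarc would actually expose:

```ts
// Sketch of a continuous crawl mode: the process stays alive and runs one
// crawl per config received over a websocket instead of exiting after a
// single one-off crawl.
import { WebSocketServer } from "ws";

interface CrawlConfig {
  seeds: string[];
  depth: number;
}

// Stand-in for the real Squidwarc crawl entry point.
async function runCrawl(config: CrawlConfig): Promise<void> {
  console.log(`would crawl ${config.seeds.length} seed(s) to depth ${config.depth}`);
}

const wss = new WebSocketServer({ port: 8787 }); // port chosen arbitrarily

wss.on("connection", (socket) => {
  socket.on("message", async (data) => {
    try {
      const config = JSON.parse(data.toString()) as CrawlConfig;
      await runCrawl(config); // run this crawl to completion, then wait for the next
      socket.send(JSON.stringify({ status: "done" }));
    } catch (err) {
      socket.send(JSON.stringify({ status: "error", message: String(err) }));
    }
  });
});
```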

@peterk (Owner) commented Sep 21, 2018

Right now the worker waits for the next item in the queue, so I guess it already works that way (but isolated from Squidwarc)? Or are you thinking of something else?

Possibly related: @Segerberg was interested in looking at settings to deduplicate and append to WARCs. Maybe there should be some kind of "set" concept, where a single crawler is responsible for deduplication and WARC appending for that set?
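
One rough way to picture that "set" idea, with the class name, digest scheme, and WARC path purely illustrative (this is not how either project currently works):

```ts
// Sketch of per-"set" dedup: each set keeps its own index of payload digests,
// and the single crawler responsible for the set checks that index before
// appending a full response record to the set's WARC (a duplicate could get
// a revisit record instead).
import { createHash } from "crypto";

class CrawlSet {
  private seenDigests = new Set<string>();

  constructor(public readonly id: string, public readonly warcPath: string) {}

  // Returns true the first time a payload is seen within this set.
  shouldWriteFullRecord(payload: Buffer): boolean {
    const digest = createHash("sha256").update(payload).digest("hex");
    if (this.seenDigests.has(digest)) {
      return false; // duplicate payload within the set
    }
    this.seenDigests.add(digest);
    return true;
  }
}

// Usage: one crawler owns the set, so the dedup state stays consistent.
const newsSet = new CrawlSet("news-sites", "collections/news-sites.warc.gz");
const body = Buffer.from("<html>example</html>");
console.log(newsSet.shouldWriteFullRecord(body)); // true  (first occurrence)
console.log(newsSet.shouldWriteFullRecord(body)); // false (duplicate)
```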
