Support for fully customizable parsing #3

Closed
let4be opened this issue Jun 16, 2021 · 5 comments

@let4be
Owner

let4be commented Jun 16, 2021

Right now we use select for HTML parsing. It's nice, but there are different use cases.

For low-volume crawling it might be a very good fit, but for broad crawling I'm considering switching to something lower level. So crusty-core should support configurable HTML parsers (and properly propagate the parser's document type to task_expanders via generics).
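A minimal sketch of what "configurable parser, propagated via generics" could look like. Every name here (HtmlParser, TaskExpander, HrefScanner, AbsoluteLinks, Crawler) is hypothetical and chosen for illustration; this is not crusty-core's actual API. The parser exposes its document type as an associated type, and the expanders are typed against that document.

```rust
/// Hypothetical trait: anything that can turn raw HTML into some document type.
pub trait HtmlParser {
    type Document;
    fn parse(&self, html: &str) -> Self::Document;
}

/// Hypothetical trait: an expander reads a parsed document and emits follow-up URLs.
pub trait TaskExpander<D> {
    fn expand(&self, doc: &D) -> Vec<String>;
}

/// A deliberately low-level "parser" that only collects href="..." values.
pub struct HrefScanner;

pub struct HrefDocument {
    pub hrefs: Vec<String>,
}

impl HtmlParser for HrefScanner {
    type Document = HrefDocument;

    fn parse(&self, html: &str) -> HrefDocument {
        const NEEDLE: &str = "href=\"";
        let mut hrefs = Vec::new();
        let mut rest = html;
        while let Some(pos) = rest.find(NEEDLE) {
            rest = &rest[pos + NEEDLE.len()..];
            match rest.find('"') {
                Some(end) => {
                    hrefs.push(rest[..end].to_string());
                    rest = &rest[end + 1..];
                }
                None => break, // unterminated attribute: stop scanning
            }
        }
        HrefDocument { hrefs }
    }
}

/// Example expander: keeps only absolute http(s) links.
pub struct AbsoluteLinks;

impl TaskExpander<HrefDocument> for AbsoluteLinks {
    fn expand(&self, doc: &HrefDocument) -> Vec<String> {
        doc.hrefs
            .iter()
            .filter(|h| h.starts_with("http"))
            .cloned()
            .collect()
    }
}

/// The crawler is generic over the parser; the parser's document type
/// determines which expanders can be registered.
pub struct Crawler<P: HtmlParser> {
    parser: P,
    expanders: Vec<Box<dyn TaskExpander<P::Document>>>,
}

impl<P: HtmlParser> Crawler<P> {
    pub fn new(parser: P) -> Self {
        Self { parser, expanders: Vec::new() }
    }

    pub fn with_expander(mut self, expander: Box<dyn TaskExpander<P::Document>>) -> Self {
        self.expanders.push(expander);
        self
    }

    pub fn process(&self, html: &str) -> Vec<String> {
        let doc = self.parser.parse(html);
        self.expanders.iter().flat_map(|e| e.expand(&doc)).collect()
    }
}
```

With this shape, swapping the parser changes P::Document, and the compiler rejects expanders written for a different document type.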

let4be added the enhancement (New feature or request) label Jun 16, 2021
let4be self-assigned this Jun 16, 2021
@artemyarulin

Hi, I recently found your project and am trying to get familiar with it. Great project, and thanks for making it open source!

Related to this issue - I've noticed that HTML parsing happens in its own thread, and I wonder why. With a crawler we don't care about latency, only about the throughput of the whole system. Wouldn't we get the same result if we gave the parsing thread back to the crawler and parsed right away in all the threads?

It would block IO, that's for sure, but the throughput of the whole system should end up at the same level, no? Assuming that we have some sort of back pressure from the parsing thread.

@let4be
Owner Author

let4be commented Jun 17, 2021

Hi, thanks for the input!

Parsing in all threads right in place (inside TaskProcessor) is certainly possible, and this is how it was implemented in the beginning. However, it means unpredictably blocking the current thread, which may have other async code pending on it.

As you noticed, we don't care much about latency, but I like keeping it manageable. We also have DB connections (ClickHouse/Redis, in the scope of Crusty), and if those happen to land on a misbehaving thread (say, one kept busy by weirdly constructed content that takes far longer to process than a typical HTML page), we could get disconnects.
We could also see more disconnects overall from the websites we crawl, so we'd probably have to raise connect timeouts.

Also, when parsing runs in its own thread pool, we can control how much work is queued and put a cap on it (the channel buffer).
In broad web crawling (Crusty) I see the following situation all the time: average HTML page complexity spikes, either because pages actually get more complex or because they get larger, and the buffer starts to fill up. We do not slow down crawling when this happens, in the hope that the situation is temporary and will resolve itself.
=> The buffer helps even out the load and use the CPU more effectively. I find it's much easier to saturate hardware when there are always jobs in the parser buffer to keep all parser threads busy 99.99% of the time.
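The capped-buffer idea can be sketched with the standard library alone. This is an illustration under assumptions, not Crusty's implementation: it uses plain OS threads and std::sync::mpsc::sync_channel instead of an async runtime, and run_parser_pool and parse are hypothetical names. The bounded channel absorbs short spikes; only when it is completely full does the producer block, which is where back pressure would kick in.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for real HTML parsing: counts '<' characters.
fn parse(html: &str) -> usize {
    html.matches('<').count()
}

/// Feed `pages` to `workers` parser threads through a channel that holds
/// at most `buffer` pending jobs.
/// Returns (sum of parse results, how often the buffer was found full).
pub fn run_parser_pool(pages: Vec<String>, workers: usize, buffer: usize) -> (usize, usize) {
    let (job_tx, job_rx) = sync_channel::<String>(buffer);
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = channel::<usize>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so the lock is held while receiving but not while parsing.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok(html) => res_tx.send(parse(&html)).unwrap(),
                    Err(_) => break, // sender dropped: no more work
                }
            })
        })
        .collect();
    drop(res_tx); // keep only the workers' clones alive

    let mut buffer_full = 0;
    for page in pages {
        match job_tx.try_send(page) {
            Ok(()) => {}
            Err(TrySendError::Full(page)) => {
                // Buffer is at its cap: record it, then wait for a free slot.
                buffer_full += 1;
                job_tx.send(page).unwrap();
            }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    drop(job_tx); // signals the workers to shut down once the buffer drains

    for handle in handles {
        handle.join().unwrap();
    }
    (res_rx.iter().sum(), buffer_full)
}
```

The second return value is a rough signal of how often the cap was hit; a real crawler might export something similar as a metric before deciding whether to slow down.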

@let4be
Owner Author

let4be commented Jun 17, 2021

Probably will also need to put select parsing under a feature flag
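As a rough illustration of gating a parser backend behind a Cargo feature: the feature name select_rs and the module names below are assumptions made up for this sketch, not taken from the project's manifest. The feature would also need to be declared under [features] in Cargo.toml, typically tied to the optional dependency.

```rust
/// Compiled only when the (hypothetical) `select_rs` feature is enabled.
#[cfg(feature = "select_rs")]
pub mod select_backend {
    pub const NAME: &str = "select";
}

/// Always available: a placeholder for a lower-level or user-supplied parser.
pub mod custom_backend {
    pub const NAME: &str = "custom";
}

/// Lists the parser backends compiled into this build.
pub fn enabled_backends() -> Vec<&'static str> {
    let mut names = Vec::new();
    names.push(custom_backend::NAME);
    // This statement is removed entirely when the feature is off.
    #[cfg(feature = "select_rs")]
    names.push(select_backend::NAME);
    names
}
```

Built without the feature, the select dependency and its glue code are not compiled at all, which is the point of the flag for users who bring their own parser.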

@artemyarulin

Thanks for such a detailed response!

I wonder if https://github.com/servo/html5ever has the same problem with parsing taking an unpredictable amount of time. It uses callbacks, so essentially every time one fires we can decide whether to continue or give the event loop a spin, to avoid blocking the thread for a long time.
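To make the "decide in the callback whether to continue or yield" idea concrete, here is a toy sketch. It does not use html5ever or its API; IncrementalScanner, Control and parse_in_slices are hypothetical names, and the scanner merely counts '<' bytes. What it shows is the control flow: the scanner keeps its position, the callback returns a verdict, and the driver resumes in slices.

```rust
/// What the callback tells the scanner after each tag.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Control {
    Continue,
    Yield,
}

/// Hypothetical incremental scanner: remembers its position so work can resume later.
pub struct IncrementalScanner<'a> {
    input: &'a [u8],
    pos: usize,
    pub tags_seen: usize,
}

impl<'a> IncrementalScanner<'a> {
    pub fn new(input: &'a str) -> Self {
        Self { input: input.as_bytes(), pos: 0, tags_seen: 0 }
    }

    /// Scans until the input ends or the callback asks to yield.
    /// Returns true once the whole input has been consumed.
    pub fn resume(&mut self, mut on_tag: impl FnMut(usize) -> Control) -> bool {
        while self.pos < self.input.len() {
            let byte = self.input[self.pos];
            self.pos += 1;
            if byte == b'<' {
                self.tags_seen += 1;
                if on_tag(self.tags_seen) == Control::Yield {
                    return self.pos == self.input.len();
                }
            }
        }
        true
    }
}

/// Drives the scanner with a budget of `tags_per_slice` tags per slice.
/// Returns (tags seen, number of slices used).
pub fn parse_in_slices(html: &str, tags_per_slice: usize) -> (usize, usize) {
    let mut scanner = IncrementalScanner::new(html);
    let mut slices = 0;
    loop {
        slices += 1;
        let mut budget = tags_per_slice.max(1); // guard against a zero budget
        let done = scanner.resume(|_| {
            budget -= 1;
            if budget == 0 { Control::Yield } else { Control::Continue }
        });
        if done {
            return (scanner.tags_seen, slices);
        }
        // In async code, this is the point where the task would hand
        // control back to the executor before resuming.
    }
}
```

A time-based budget (checking elapsed time in the callback) would follow the same pattern; the tag count is used here only to keep the example deterministic.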

let4be closed this as completed in 851d63c Jun 17, 2021
@let4be
Owner Author

let4be commented Jun 17, 2021

Feel free to open a separate issue, it might be worth considering:

  • whether it's possible to abstract away thread pool vs. in-place parsing
  • whether it's worth switching to in-place parsing: do we waste any resources by sending tasks to a dedicated thread pool, and if so how much? Does that warrant a switch to in-place parsing, or do we get more benefits from the dedicated thread pool (like possibly more even resource saturation under some configurations)?
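One possible shape for the first point, sketched with std threads only. ParseExecutor, InPlace, Dedicated and count_tags are hypothetical names, and the dedicated variant uses a single worker thread for brevity; a real version would likely be async and pooled.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// A parsing job: any closure that produces a result and can move across threads.
pub type Job = Box<dyn FnOnce() -> usize + Send + 'static>;

/// Hypothetical abstraction: callers submit a job and get the result back,
/// without knowing where it ran.
pub trait ParseExecutor {
    fn execute(&self, job: Job) -> usize;
}

/// Runs the job on the calling thread (in-place parsing).
pub struct InPlace;

impl ParseExecutor for InPlace {
    fn execute(&self, job: Job) -> usize {
        job()
    }
}

/// Sends the job to a dedicated thread and waits for the reply.
pub struct Dedicated {
    jobs: Sender<(Job, Sender<usize>)>,
}

impl Dedicated {
    pub fn new() -> Self {
        let (jobs, rx) = channel::<(Job, Sender<usize>)>();
        thread::spawn(move || {
            // The loop ends when the `Dedicated` handle (the sender) is dropped.
            for (job, reply) in rx {
                let _ = reply.send(job());
            }
        });
        Self { jobs }
    }
}

impl ParseExecutor for Dedicated {
    fn execute(&self, job: Job) -> usize {
        let (reply_tx, reply_rx) = channel();
        self.jobs.send((job, reply_tx)).expect("parser thread is alive");
        reply_rx.recv().expect("parser thread replied")
    }
}

/// Calling code is written once against the trait.
pub fn count_tags<E: ParseExecutor>(executor: &E, html: String) -> usize {
    executor.execute(Box::new(move || html.matches('<').count()))
}
```

With both strategies behind one trait, the second question (how much the hand-off to a dedicated pool costs) could be answered by benchmarking the same workload against each implementation.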
