Support for fully customizable parsing #3

let4be · 2021-06-16T11:42:27Z

Right now we use select for HTML parsing. It works well, but it does not fit every use case.

For low-volume crawling it might be a very good fit, but for broad crawling I'm considering switching to something lower level. So crusty-core should support configurable HTML parsers (and properly propagate the parser type to task_expanders via generics).
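A minimal sketch of what "configurable parsers propagated via generics" could look like. This is not crusty-core's actual API: `DocumentParser`, `TaskExpander`, `RawTextParser`, `HrefExpander` and `process` are hypothetical names invented for illustration, and the naive substring scan only stands in for a real lower-level parser.

```rust
/// Hypothetical trait: turns a raw response body into a parsed document.
/// The associated type lets each parser choose its own document type.
pub trait DocumentParser {
    type Document;
    fn parse(&self, body: &[u8]) -> Self::Document;
}

/// Hypothetical expander trait, generic over the document type it consumes.
pub trait TaskExpander<D> {
    fn expand(&self, doc: &D) -> Vec<String>;
}

/// A deliberately low-level "parser" that only decodes the body as text.
pub struct RawTextParser;

/// Document type produced by `RawTextParser`.
pub struct RawText(pub String);

impl DocumentParser for RawTextParser {
    type Document = RawText;

    fn parse(&self, body: &[u8]) -> RawText {
        RawText(String::from_utf8_lossy(body).into_owned())
    }
}

/// Extracts `href="..."` values by plain substring scanning (no DOM).
pub struct HrefExpander;

impl TaskExpander<RawText> for HrefExpander {
    fn expand(&self, doc: &RawText) -> Vec<String> {
        const NEEDLE: &str = "href=\"";
        let mut links = Vec::new();
        let mut rest = doc.0.as_str();
        while let Some(start) = rest.find(NEEDLE) {
            rest = &rest[start + NEEDLE.len()..];
            match rest.find('"') {
                Some(end) => {
                    links.push(rest[..end].to_string());
                    rest = &rest[end + 1..];
                }
                None => break, // unterminated attribute, stop scanning
            }
        }
        links
    }
}

/// The parser's document type flows to the expander through the generics:
/// an expander written for a different document type will not compile here.
pub fn process<P, E>(parser: &P, expander: &E, body: &[u8]) -> Vec<String>
where
    P: DocumentParser,
    E: TaskExpander<P::Document>,
{
    let doc = parser.parse(body);
    expander.expand(&doc)
}
```

Calling `process(&RawTextParser, &HrefExpander, body)` wires the two together. Swapping in a DOM-based parser would mean implementing `DocumentParser` with a different `Document` type and writing expanders against that type.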

artemyarulin · 2021-06-17T07:08:10Z

Hi, I recently found your project and am trying to get familiar with it. Great project, and thanks for making it open source!

Related to this issue: I've noticed that HTML parsing happens in its own thread, and I wonder why. With a crawler we don't care about latency, only about the throughput of the whole system. Wouldn't we get the same result if we gave the parsing thread back to the crawler and parsed right away in all the threads?

It would block IO, that's for sure, but the throughput of the whole system should stay at the same level, no? Assuming we have some sort of backpressure from the parsing thread.

let4be · 2021-06-17T08:00:36Z

Hi, thanks for the input!

Parsing in place on all threads (inside TaskProcessor) is certainly possible, and this is how it was implemented in the beginning. However, it means unpredictably blocking the current thread, which may have other async code pending on it.

As you noticed, we don't care much about latency, but I like keeping it manageable. After all, we also have DB connections (clickhouse/redis, in the scope of Crusty), and if those happen to land on a misbehaving thread (say, one kept busy by some weirdly constructed content that takes far longer to process than a typical HTML page) we could get disconnects.
There's also the problem that we could get more disconnects overall from the websites we crawl, so we'd probably have to raise connect timeouts.

Also, when parsing is done in its own threadpool we can control how much work is queued and put a cap on it (the channel buffer).
In broad web crawling (Crusty) I see the following situation all the time: we get a spike in average HTML page complexity, either due to an actual complexity increase or due to a size increase, which causes the buffer to start filling up. We do not slow down crawling when this happens, in the hope that the situation is temporary and will resolve itself.
=> The buffer helps even out the load and utilize the CPU more effectively. I find it's way easier to saturate the hardware when there are always jobs in the parser buffer to keep all parser threads busy 99.99% of the time.
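The dedicated parser pool with a capped buffer described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Crusty's implementation: `parse_page` is a hypothetical stand-in for real HTML parsing, `parse_with_pool` is an invented name, and a real async crawler would feed the pool through an async-aware channel instead of a blocking `send`.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for real HTML parsing: counts '<' characters.
fn parse_page(body: &str) -> usize {
    body.matches('<').count()
}

/// Parses `pages` on `workers` dedicated threads. Jobs travel through a
/// bounded channel that holds at most `buffer` pending pages, so the
/// buffer absorbs short spikes and caps the amount of queued work.
/// Returns the sum of the per-page results.
pub fn parse_with_pool(pages: Vec<String>, workers: usize, buffer: usize) -> usize {
    assert!(workers > 0, "at least one parser thread is required");

    let (job_tx, job_rx) = sync_channel::<String>(buffer);
    // std's Receiver is single-consumer, so the workers share it via a mutex.
    let job_rx: Arc<Mutex<Receiver<String>>> = Arc::new(Mutex::new(job_rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            thread::spawn(move || {
                let mut total = 0;
                loop {
                    // The lock guard is dropped at the end of this statement,
                    // so the lock is not held while the page is parsed.
                    let job = job_rx.lock().unwrap().recv();
                    match job {
                        Ok(body) => total += parse_page(&body),
                        Err(_) => break, // sender dropped: no more jobs
                    }
                }
                total
            })
        })
        .collect();

    for page in pages {
        // Blocks once `buffer` jobs are pending: this is the backpressure.
        job_tx.send(page).unwrap();
    }
    drop(job_tx); // closing the channel lets the workers exit their loops

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

A larger `buffer` lets the producer run further ahead during a spike of heavy pages, at the cost of more memory held in queued bodies. A smaller one pushes back on the producer sooner.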

let4be · 2021-06-17T09:25:58Z

Probably will also need to put select parsing under a feature flag

artemyarulin · 2021-06-17T09:57:37Z

Thanks for such a detailed response!

I wonder if https://github.com/servo/html5ever has the same problem of parsing taking an unpredictable amount of time. It uses callbacks, so essentially every time one fires we can decide whether to continue or to give the event loop a spin, to avoid blocking the thread for a long time.

let4be · 2021-06-17T16:09:21Z

Feel free to open a separate issue, it might be worth considering:

Whether it's possible to abstract away threadpool vs. in-place parsing.
Whether it's worth switching to in-place parsing: do we waste any resources by sending tasks to a dedicated threadpool, and if so, how much? Does that warrant a switch to in-place parsing, or do we get more benefits from the threadpool (like possibly more even resource saturation under some configurations)?
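One way the first question could be approached is a small trait that hides where a parsing job runs. This is only a sketch: `ParseExecutor`, `InPlace`, `DedicatedThread` and `count_links` are hypothetical names, and spawning one thread per job is a simplification standing in for a real threadpool.

```rust
use std::thread;

/// Hypothetical abstraction over *where* a parsing job runs.
pub trait ParseExecutor {
    fn run<F, T>(&self, job: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static;
}

/// Runs the job on the calling thread (in-place parsing).
pub struct InPlace;

impl ParseExecutor for InPlace {
    fn run<F, T>(&self, job: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        job()
    }
}

/// Runs the job on a separate thread and waits for the result.
/// One thread per job is a stand-in for handing the job to a pool.
pub struct DedicatedThread;

impl ParseExecutor for DedicatedThread {
    fn run<F, T>(&self, job: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        thread::spawn(job).join().expect("parser thread panicked")
    }
}

/// Caller code stays the same regardless of the chosen strategy.
pub fn count_links<E: ParseExecutor>(exec: &E, body: String) -> usize {
    exec.run(move || body.matches("href=").count())
}
```

With both strategies behind one trait, `count_links(&InPlace, body)` and `count_links(&DedicatedThread, body)` can be benchmarked against each other to answer the second question with measurements.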

let4be added the enhancement New feature or request label Jun 16, 2021

let4be self-assigned this Jun 16, 2021

let4be added a commit that referenced this issue Jun 17, 2021

initial support for customizable parsing, misc API cleanup, #3

086d30a

let4be closed this as completed in 851d63c Jun 17, 2021
