Support for fully customizable parsing #3
Comments
Hi, I recently found your project and am trying to get familiar with it. Great project, and thanks for making it open source! Related to this issue: I've noticed that HTML parsing happens in its own thread, and I wonder why. With a crawler we don't care about latency, only about the throughput of the whole system. Wouldn't we get the same result if we moved parsing back into the crawler and parsed right away in all the threads? That would block IO for sure, but the throughput of the whole system should stay at the same level, no? Assuming that we have some sort of back pressure from the parsing thread.
Hi, thanks for the input! Doing parsing in all threads right in place (inside As you noticed, we don't care much about latency, but I like keeping it manageable; after all, we also have DB connections (clickhouse/redis, in the scope of Also, when done in its own thread pool, we can control how much work there is and put a cap on it (channel buffer).
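To illustrate the cap described above, here is a minimal sketch of a dedicated parsing pool fed through a bounded channel. It uses only the standard library rather than whatever the project actually uses, and `parse_links` and `parse_with_backpressure` are hypothetical names invented for this example. Once `buffer` pages are queued, the producer blocks on `send`, which is the back pressure.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for real HTML parsing: counts `<a ` occurrences.
fn parse_links(html: &str) -> usize {
    html.matches("<a ").count()
}

/// Feeds `pages` to `workers` parser threads through a channel that holds at
/// most `buffer` pending pages, and returns the total number of links found.
/// Assumes `workers >= 1`; with zero workers the producer would block forever.
fn parse_with_backpressure(pages: Vec<String>, workers: usize, buffer: usize) -> usize {
    // Bounded queue: `send` blocks once `buffer` pages are waiting.
    let (job_tx, job_rx) = sync_channel::<String>(buffer);
    let job_rx = Arc::new(Mutex::new(job_rx));
    // Results go through an unbounded channel so workers never stall on it.
    let (res_tx, res_rx) = channel::<usize>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so parsing itself runs without holding the lock.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok(html) => res_tx.send(parse_links(&html)).unwrap(),
                    Err(_) => break, // queue closed and drained
                }
            })
        })
        .collect();
    drop(res_tx);

    for page in pages {
        job_tx.send(page).unwrap(); // blocks while the queue is full
    }
    drop(job_tx); // signal workers that no more pages are coming

    for handle in handles {
        handle.join().unwrap();
    }
    res_rx.iter().sum()
}
```

The `buffer` argument bounds how much unparsed work can pile up, and `workers` bounds how many cores parsing can occupy, independently of how many crawler threads are doing IO.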
Probably will also need to put |
Thanks for such a detailed response! I wonder if https://github.com/servo/html5ever has problems like that, with parsing taking an unpredictable amount of time. It uses callbacks, so essentially every time one fires we can decide whether to continue or give the event loop a spin, to avoid blocking the thread for a long time.
Feel free to open a separate issue, might be worth considering
Right now we use `select` for HTML parsing. It's nice and everything, but there are different use cases. For low-volume crawling it might be a very good fit, but for broad crawling I'm considering switching to something lower level. So `crusty-core` should support configurable HTML parsers (and properly propagate them to `task_expanders` via generics).
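One possible shape for this, as a rough sketch only: a parser trait with an associated document type, and expanders that are generic over that type. None of the names below (`HtmlParser`, `TaskExpander`, `HrefScanner`, `Hrefs`, `AbsoluteLinks`, `process`) are taken from `crusty-core`; they are hypothetical and exist only to show how the document type could flow from the parser to the expanders through generics.

```rust
/// Hypothetical trait: any HTML parser that turns a page body into a document.
trait HtmlParser {
    type Document;
    fn parse(&self, html: &str) -> Self::Document;
}

/// Hypothetical expander, generic over the document type a parser produces.
trait TaskExpander<D> {
    fn expand(&self, doc: &D) -> Vec<String>;
}

/// A deliberately low-level "parser" that only scans for `href="..."` values.
struct HrefScanner;

/// The document type produced by `HrefScanner`.
struct Hrefs(Vec<String>);

impl HtmlParser for HrefScanner {
    type Document = Hrefs;

    fn parse(&self, html: &str) -> Hrefs {
        let links = html
            .split("href=\"")
            .skip(1) // everything before the first attribute is not a link
            .filter_map(|rest| rest.split('"').next())
            .map(str::to_string)
            .collect();
        Hrefs(links)
    }
}

/// An expander that keeps only absolute links as follow-up tasks.
struct AbsoluteLinks;

impl TaskExpander<Hrefs> for AbsoluteLinks {
    fn expand(&self, doc: &Hrefs) -> Vec<String> {
        doc.0
            .iter()
            .filter(|link| link.starts_with("http"))
            .cloned()
            .collect()
    }
}

/// The crawler core stays generic: the expander must accept whatever
/// document type the chosen parser produces.
fn process<P, E>(parser: &P, expander: &E, html: &str) -> Vec<String>
where
    P: HtmlParser,
    E: TaskExpander<P::Document>,
{
    let doc = parser.parse(html);
    expander.expand(&doc)
}
```

With this layout, swapping in a different parser means implementing `HtmlParser` for it and writing expanders against its `Document` type. A mismatch between parser and expander is then a compile-time error rather than a runtime one.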