
Improve concurrency with streams #330

Merged: 43 commits into master on Dec 1, 2021
Conversation

mre
Member

@mre mre commented Sep 14, 2021

The goal of this effort is to avoid blocking the runtime as much as possible.
This is a multi-step approach:

  • Change input::get_contents from returning Vec<InputContent> to Stream<Item=Result<InputContent>>.
  • Change Collector::collect_links from returning Result<HashSet<Request>> to Stream<Item=Result<Request>>.
  • Remove deadpool from the dependencies in lychee-lib and only work with streams throughout the call-stack.
    Add back connection pool #355
  • Remove ClientPool as reqwest already has a built-in pool.
  • Simplify main.rs by removing mpsc channels and iterating over streams directly.
  • Optional: Move the link cache out of the collector. Use a XOR filter instead of a HashSet. It uses less memory.

See #193, #21 (comment)

@TimoFreiberg fyi

Previously we collected all inputs in one vector
before checking the links, which is not ideal.
Especially when reading many inputs (e.g. by using a glob pattern),
this could cause issues like running out of file handles.

By moving to streams we avoid that scenario. This is also the first
step towards improving performance for many inputs.
@TimoFreiberg
Contributor

Making Collector::collect_links return a Stream is pretty straightforward: TimoFreiberg@a05400b.
I'm tempted to add the progressbar changes from my previous PR, do you want them here or not?

I'm not 100% sure what the rework of main.rs will look like. I think we'll still need at least one Tokio task?
Because changing one of the Tasks spawned in main.rs to handle Results is something I would like to discuss with you beforehand 😄

@mre
Member Author

mre commented Sep 30, 2021

@TimoFreiberg your changes look good. I wonder how to proceed. Shall I just merge in the changes from your branch into this PR or shall we continue working on your fork?

@mre
Member Author

mre commented Sep 30, 2021

The progressbar changes can be added here, too yes.

@TimoFreiberg
Contributor

TimoFreiberg commented Sep 30, 2021 via email

Timo Freiberg added 2 commits October 6, 2021 14:15
Because we can't know the amount of links without blocking
To stay as close to the pre-stream behaviour, we want to stop processing
as soon as an Err value appears in the stream. This is easiest when the
stream is consumed in the main thread.
Previously, the stream was consumed in a tokio task and the main thread
waited for responses.
Now, a tokio task waits for responses (and displays them/registers
response stats) and the main thread sends links to the ClientPool.
To ensure that the main thread waits for all responses to have arrived
before finishing the ProgressBar and printing the stats, it waits for
the show_results_task to finish.
@TimoFreiberg
Contributor

Alright, I have a proposal for a minimal change in the tokio tasks here: 98cdfba

@mre
Member Author

mre commented Oct 6, 2021

This is great. I'll merge your changes into my branch when I find a few minutes. 👍

@mre
Member Author

mre commented Oct 7, 2021

@TimoFreiberg I merged all changes from your master branch into this one.
For some reason it gets stuck at "Extracting links" on a bigger test case, and when I use it on my blog it's really fast but then never exits. It's stuck on the last link. Weird.
Do you observe the same, or does it work for you? It probably has nothing to do with your changes and might have been broken before.

@TimoFreiberg
Contributor

TimoFreiberg commented Oct 7, 2021

Yeah, I'm pretty sure that's caused by my changes 😬. Re-organizing tasks and adding a join is pretty dangerous, I guess. I think I can debug it tomorrow.

Hmm, weird. It looks a lot like the Stream returned from collect_links doesn't return None after the last link is yielded, so the while let Some() loop here blocks endlessly.

@mre
Member Author

mre commented Oct 8, 2021

Yeah that might be the case!

@TimoFreiberg
Contributor

TimoFreiberg commented Oct 9, 2021

Oh. The progressbar covered the last eprintln!, which misled me into thinking the stream wasn't closed somehow. The stream is fine, but the send_req channel needed to be closed explicitly before waiting for the show_results_task.
The tests run fine now; they're approximately as fast as on master.
feel free to merge TimoFreiberg@28ad683

When testing with cargo run --release -- README.md though, it seems like there's no concurrency in the stream branch.
I'm not completely convinced yet, but I rather fear that this approach can't be truly concurrent, no matter what Client does internally. Each future is sequentially awaited until it produces a finished Response, so I think it's impossible for the Client internals to make this concurrent.
I guess something like this could fix it (seemed to work locally)

```rust
futures::StreamExt::for_each_concurrent(
    ReceiverStream::new(recv_req),
    max_concurrency,
    |req| async {
        let resp = client.check(req).await.unwrap();
        send_resp.send(resp).await.unwrap();
    },
)
.await;
```

What do you think?

@mre
Member Author

mre commented Oct 10, 2021

feel free to merge TimoFreiberg/lychee@28ad683

Worked nicely on my machine as well. Thanks for fixing that!

What do you think?

Great idea. In the meantime we added the ClientPool back in #355 because we ran into the same issues. I'll benchmark your approach, to see how it compares to the ClientPool. My hope is that it's equally fast so that we can remove the pool to simplify the code and remove dependencies. 👍

I also merged a couple of commits for basic directory support. I want to land that together with streams, because it's way easier and more elegant to implement with yield instead of collecting all files into a Vector before returning (which would rather defeat the point of walking files, heh).

@untitaker
Collaborator

@mre since you're extending the API, which command line should I replace this with? I assume that "directory is now an input" means I can add it as a positional argument, and that the absence of globs may affect performance

lychee --offline -b . '**/*.htm*'

@untitaker
Collaborator

untitaker commented Nov 29, 2021

time lychee --offline -n . from this branch on sentry-docs takes

real    0m2.888s
user    0m12.182s
sys     0m0.564s

while time lychee --offline -b . '**/*htm*' -n from master takes

real    0m5.379s
user    0m19.369s
sys     0m5.381s

time hyperlink --check-anchors . gets

real    0m1.282s
user    0m3.692s
sys     0m0.180s

this is a great improvement, and generally lychee seems to be in the same ballpark as hyperlink (somewhere between muffet/liche and hyperlink), but for filesystem-only checking I think there are bigger areas of improvement for lychee than anything related to parallelism.

See this flamegraph. This is an interactive SVG with embedded JS generated via cargo flamegraph, for some reason you need to download it from github and open it as local file to get the most readable output. Maybe Content-Security-Policy? Who knows.

Two big blocks:

  • indicatif continues to be really slow. In fact, the above test ran closer to 1 minute of real time before I added -n, which brought it down to the 3-5 seconds you see above. I think this is not a good default experience. I don't know if this is low-hanging fruit on the technical side, but it's too big of a perf improvement to.. uh.. be left hanging on the tree. IMO

  • html parsing is slow. I know that using html5ever's Tokenizer more than doubled hyperlink's runtime, which is why I didn't manage to move hyperlink away from quick-xml for a long time. I see that you're constructing an entire tree in lychee instead of iterating over tokens. Shamelessly plugging my own html parser at this point, but swapping out html5ever with literally anything else, or replacing the document parser with just the Tokenizer will give you some boost. Quickly skimming over what you do with the element tree, you don't really need anything but tokens.

Note that I really have no clue which thread that flamegraph comes from, I hope all of them are represented. I sort of just ran flamegraph -o lol.svg lychee ... and hoped for the best.

@MichaIng
Member

I tested our two sites and here is the execution time release => stream branch:

  • 8s (613 links) => 7s (607 links)
  • 18s (3347 links) => 12s (3313 links)

So yep, it's faster 👍.

As can be seen, the number of checked links also decreased somewhat, while the checked inputs didn't change. I know a cross-file cache wasn't implemented, but it has been changed from a link cache to a request cache, if I see it right? Could it be that some links which were previously seen as distinct are now, after processing, seen as identical requests?

@mre
Member Author

mre commented Nov 30, 2021

If we can somehow get https://users.rust-lang.org/t/async-stream-with-item-result/68066 resolved then I can push my local changes which give me a cool 2x improvement.
Using only the tokenizer is a great idea. We can then return a token stream from the extractor as well. Actually we could even par_iter on the tokens (with rayon) for another potential boost.
I remember that someone posted a link to a discussion about indicatif's performance issues. Something about locking but I forgot. I'll post it here if I can dig it up again.

@mre
Member Author

mre commented Nov 30, 2021

@untitaker this should also work now:

lychee --offline -b . .

(not tested)

Note that this will include markdown files as well at the moment. I still need to add the overwrite option for that. (Something like --extensions html,htm could work in the future.)

@MichaIng
Member

About HTML parsing: at least for some feature requests it is required, either on the inputs or even on the target HTML document.

But it can probably be done conditionally, only if an option/flag requires it.

@untitaker
Collaborator

untitaker commented Nov 30, 2021 via email

@MichaIng
Member

MichaIng commented Nov 30, 2021

I don't think it's required for anchors

What's meant is to also check that the URL https://foo.bar#fragment not only responds well but also that the fragment is valid, i.e. that an element with the ID fragment exists in the target HTML. For this, the whole link target needs to be downloaded (at least when it is content-type: text/html) and parsed for a matching HTML element. Sounds heavy, but it is indeed useful when e.g. checking internal links in a documentation with large pages, where you definitely want to link to a specific heading instead of the start of the page.

@untitaker
Collaborator

For this, the whole link target needs to be downloaded (at least when it is content-type: text/html) and parsed for a matching HTML element.

I understand, but that requires parsing more documents, not any particular kind of HTML parsing that a tokenizer wouldn't support. You need to find the start tag that contains the right id; any start tag will do, you don't even need to identify which element it is. You don't need to understand the hierarchy of elements, for example.

@mre
Member Author

mre commented Nov 30, 2021

Okay I think we're done here.

❯❯❯ hyperfine "lychee --base '/Users/mendler/Downloads/sentry-docs/public' '/Users/mendler/Downloads/sentry-docs/public' --offline --dump --no-progress"
Benchmark 1: lychee --base '/Users/mendler/Downloads/sentry-docs/public' '/Users/mendler/Downloads/sentry-docs/public' --offline --dump --no-progress
  Time (mean ± σ):      4.589 s ±  0.379 s    [User: 23.370 s, System: 7.124 s]
  Range (min … max):    3.995 s …  5.167 s    10 runs
Summary
  'lychee --base '/Users/mendler/Downloads/sentry-docs/public' '/Users/mendler/Downloads/sentry-docs/public/**/*.html' --offline --dump' ran
    1.35 ± 0.11 times faster than 'lychee-old --base '/Users/mendler/Downloads/sentry-docs/public' '/Users/mendler/Downloads/sentry-docs/public/**/*.html' --offline --dump'

Using jwalk for directory traversal:

Summary
  'lychee --base '/Users/mendler/Downloads/sentry-docs/public' '/Users/mendler/Downloads/sentry-docs/public' --offline --dump' ran
    1.38 ± 0.16 times faster than 'lychee-old --base '/Users/mendler/Downloads/sentry-docs/public' '/Users/mendler/Downloads/sentry-docs/public/**/*.html' --offline --dump'

So in summary we're around 35% to 50% faster with the stream-based version. I honestly expected a bit more, but at least the architecture is more promising now, and with many files the cores get fully used from the beginning. This way the startup time is much shorter, and it shouldn't get stuck from running out of memory anymore.

I'm planning to merge this tomorrow so this is the "final comment period" for the changeset. Everything else that isn't covered yet (optimizing the extractor, caching) will be handled by a separate PR. Thanks all for your support!

@mre mre mentioned this pull request Dec 1, 2021
3 tasks
@mre mre marked this pull request as ready for review December 1, 2021 17:03
@mre mre merged commit 3d51356 into master Dec 1, 2021
@mre mre deleted the stream branch December 1, 2021 17:25
@mre
Member Author

mre commented Dec 1, 2021

🥳 🥳 🥳
