Use tokenizer for extraction; add benchmark #424

Merged: 77 commits merged into master from extractor on Dec 16, 2021
Conversation

mre (Member) commented Dec 15, 2021
This avoids creating a DOM tree for link extraction and instead uses a TokenSink for on-the-fly extraction.
In my hyperfine benchmarks it was about 10-25% faster than master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.
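For illustration, here is a minimal sketch of what on-the-fly extraction with an html5ever `TokenSink` can look like. It assumes html5ever ~0.25 (where `process_token` takes `&mut self`); the `LinkSink` type and the handful of element/attribute pairs are illustrative, not lychee's actual implementation:

```rust
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Tag, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

#[derive(Default)]
struct LinkSink {
    links: Vec<String>, // stand-in for a richer RawUri type
}

impl TokenSink for LinkSink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
        // Only start tags can carry link attributes; everything else is skipped,
        // so no DOM tree is ever built.
        if let Token::TagToken(Tag { kind: TagKind::StartTag, name, attrs, .. }) = token {
            let elem: &str = &name;
            for attr in attrs {
                let attr_name: &str = &attr.name.local;
                // Only a couple of element/attribute pairs shown here;
                // the real extractor covers many more per the HTML spec.
                let is_link = matches!(
                    (elem, attr_name),
                    ("a", "href") | ("link", "href") | ("img", "src")
                );
                if is_link {
                    self.links.push(attr.value.to_string());
                }
            }
        }
        TokenSinkResult::Continue
    }
}

fn extract_links(html: &str) -> Vec<String> {
    let mut queue = BufferQueue::new();
    queue.push_back(StrTendril::from_slice(html));

    let mut tokenizer = Tokenizer::new(LinkSink::default(), TokenizerOpts::default());
    let _ = tokenizer.feed(&mut queue);
    tokenizer.end();
    tokenizer.sink.links
}
```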

I also added a few more element/attribute pairs which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

Furthermore, I tried to clean up a lot of papercuts around our types. We now differentiate between a RawUri (a stringly-typed representation) and a Uri, which is a properly parsed URI type.
The extractor now only deals with extracting RawUris, while the collector creates the request objects.
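A rough sketch of that distinction (field names are illustrative, not necessarily the exact definitions in this PR): the extractor hands out unvalidated strings, and parsing only happens later in the collector.

```rust
use url::Url;

/// A link as found in the document: just text plus where it came from.
#[derive(Debug, Clone)]
struct RawUri {
    text: String,
    element: Option<String>,   // e.g. "a"
    attribute: Option<String>, // e.g. "href"
}

/// A fully parsed URI, guaranteed to be syntactically valid.
#[derive(Debug, Clone)]
struct Uri {
    url: Url,
}

impl TryFrom<RawUri> for Uri {
    type Error = url::ParseError;

    fn try_from(raw: RawUri) -> Result<Self, Self::Error> {
        // Parsing (and error handling) lives in the collector, not the extractor.
        Ok(Uri { url: Url::parse(&raw.text)? })
    }
}
```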

If someone wants to look into the code, I'd be happy for a review.
(I'll squash the commits before I merge of course. ^^)

mre and others added 30 commits September 13, 2021 19:57

- Previously, we collected all inputs in one vector before checking the links, which is not ideal. Especially when reading many inputs (e.g. by using a glob pattern), this could cause issues like running out of file handles. By moving to streams we avoid that scenario. This is also the first step towards improving performance for many inputs.
- Because we can't know the number of links without blocking
- To stay as close as possible to the pre-stream behaviour, we want to stop processing as soon as an Err value appears in the stream. This is easiest when the stream is consumed in the main thread. Previously, the stream was consumed in a tokio task and the main thread waited for responses. Now, a tokio task waits for responses (and displays them/registers response stats) while the main thread sends links to the ClientPool. To ensure that the main thread waits for all responses to have arrived before finishing the ProgressBar and printing the stats, it waits for the show_results_task to finish. (See the sketch after this list.)
- Replaced with `futures::StreamExt::for_each_concurrent`.
- Tendril is not Send by default
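The following is a simplified sketch of that streaming pipeline, not lychee's actual API: links are consumed as a stream and checked concurrently via `for_each_concurrent`, while a separate tokio task collects the responses and is joined before results are printed. It assumes the `futures` and `tokio` crates.

```rust
use futures::StreamExt;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(64);

    // Task that waits for responses and records stats; the main task joins it
    // at the end so all responses are accounted for before printing results.
    let show_results_task = tokio::spawn(async move {
        let mut checked = 0usize;
        while let Some(response) = rx.recv().await {
            checked += 1;
            println!("checked: {response}");
        }
        checked
    });

    // Placeholder stream of links; in reality this comes from the extractor.
    let links = futures::stream::iter(vec![
        "https://example.com".to_string(),
        "https://example.org".to_string(),
    ]);

    // Check up to 8 links at a time without collecting them into a Vec first.
    links
        .for_each_concurrent(8, |link| {
            let tx = tx.clone();
            async move {
                // A real implementation would perform the request here.
                tx.send(link).await.ok();
            }
        })
        .await;

    drop(tx); // close the channel so the results task can finish
    let total = show_results_task.await.unwrap();
    println!("total checked: {total}");
}
```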
mre (Member, Author) commented Dec 16, 2021

The flamegraph shows a much flatter call stack. (It's interactive, but you have to download it and open it in a browser.)

mre merged commit 166c86c into master on Dec 16, 2021
mre deleted the extractor branch on December 16, 2021 at 17:45