-
I want to download one website and all the web pages linked from it. I have only one seed, "avcr.cz", and the settings are as follows: In this case, however, the crawler only downloads surrounding web pages from a few sites, particularly YouTube and Spotify; web pages from ordinary websites are almost completely missing. For example, the "scope" log contains this entry:
2021-05-04T11:35:57.143Z 0 RejectDecideRule REJECT http://academia.cz/
The academia.cz page is linked directly from the homepage of avcr.cz (which is where the crawler ends up after two redirects), so based on the maxTransHops rule I would expect the crawler to accept it. It looks to me as if the crawler is only downloading embedded pages, but I don't know that for sure. Is there a setting that makes the crawler download all web pages linked from the downloaded website?
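For context, maxTransHops is a property of the TransclusionDecideRule bean in the scope section of Heritrix's crawler-beans.cxml; it sits alongside TooManyHopsDecideRule, which governs ordinary link hops. A typical fragment looks roughly like this (the values here are illustrative, not the poster's actual settings):

```xml
<!-- Scope decide rules in crawler-beans.cxml (illustrative values) -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <!-- maximum ordinary link hops from a seed before a URI is rejected -->
  <property name="maxHops" value="20" />
</bean>
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <!-- maximum "transclusion" hops (embeds, redirects), not ordinary links -->
  <property name="maxTransHops" value="2" />
</bean>
```

Note that maxTransHops only extends the scope for transcluded resources, which is why it does not pull in normally linked pages like academia.cz.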
Replies: 2 comments
-
My understanding is the "trans" in maxTransHops is transclusion (which people often call "embed"), so it applies to stylesheets, images, and other embedded resources, not ordinary links.
I believe setting the alsoCheckVia property to true on the acceptSurts (SurtPrefixedDecideRule) bean does this. The documentation describes it as:
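In crawler-beans.cxml this is a Spring property on the SurtPrefixedDecideRule bean; a minimal sketch, assuming the default bean id acceptSurts mentioned above:

```xml
<bean id="acceptSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <property name="decision" value="ACCEPT" />
  <!-- also accept a URI when its "via" (the URI it was discovered from)
       matches one of the configured SURT prefixes -->
  <property name="alsoCheckVia" value="true" />
</bean>
```

With alsoCheckVia enabled, a page like academia.cz that is discovered via a SURT-matching page (here, avcr.cz) should be accepted even though its own SURT does not match.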
-
Thanks a lot!