-
I want to download one website and all the web pages linked from it. I have only one seed, "avcr.cz", and the settings are as follows: In this case, however, the crawler only downloads surrounding web pages from a few sites, particularly YouTube and Spotify; web pages from ordinary websites are almost completely missing. For example, the "scope" log contains this entry:
2021-05-04T11:35:57.143Z 0 RejectDecideRule REJECT http://academia.cz/
The academia.cz page is linked directly from the homepage of avcr.cz (which is where the crawler ends up after two redirects), so based on the maxTransHops rule I would expect the crawler to accept it. It looks to me as if the crawler is only downloading embedded pages, but I don't know that for sure. Is there a setting that makes the crawler download all web pages linked from the downloaded website?
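For context, maxTransHops is a property of the TransclusionDecideRule bean in the scope section of Heritrix's crawler-beans.cxml; it sits alongside TooManyHopsDecideRule, which governs ordinary link hops. A typical fragment looks roughly like this (the values here are illustrative, not the poster's actual settings):

```xml
<!-- Scope decide rules in crawler-beans.cxml (illustrative values) -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <!-- maximum ordinary link hops from a seed before a URI is rejected -->
  <property name="maxHops" value="20" />
</bean>
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <!-- maximum "transclusion" hops (embeds, redirects), not ordinary links -->
  <property name="maxTransHops" value="2" />
</bean>
```

Note that maxTransHops only extends the scope for transcluded resources, which is why it does not pull in normally linked pages like academia.cz.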
Replies: 2 comments
-
My understanding is the "trans" in maxTransHops is transclusion (which people often call "embed"), so it applies to stylesheets, images, and other embedded resources, not ordinary links.
I believe setting the alsoCheckVia property to true on the acceptSurts (SurtPrefixedDecideRule) bean does this. The documentation describes it as:
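In crawler-beans.cxml this is a Spring property on the SurtPrefixedDecideRule bean; a minimal sketch, assuming the default bean id acceptSurts mentioned above:

```xml
<bean id="acceptSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <property name="decision" value="ACCEPT" />
  <!-- also accept a URI when its "via" (the URI it was discovered from)
       matches one of the configured SURT prefixes -->
  <property name="alsoCheckVia" value="true" />
</bean>
```

With alsoCheckVia enabled, a page like academia.cz that is discovered via a SURT-matching page (here, avcr.cz) should be accepted even though its own SURT does not match.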
-
Thanks a lot!