How to crawl only one page #571

Answered by ldko
nk-yuta asked this question in Q&A

Hi @nk-yuta,
To crawl only one page and its embedded content, try setting maxHops to 0. The crawler should then download your seed but not follow any navigational links from it:

  <bean id="scopeMaxHops" class="org.archive.modules.deciderules.TooManyHopsDecideRule">
    <property name="maxHops" value="0" />
  </bean>

Then use TransclusionDecideRule to indicate that you also want embedded content. For example, you could set:

  <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
    <property name="maxTransHops" value="2" />
    <property name="maxSpeculativeHops" value="1" />
  </bean>
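
For context, here is a rough sketch of where these two rules might sit inside the scope bean of crawler-beans.cxml. It assumes the stock Heritrix 3 layout, where the scope is a DecideRuleSequence and a later rule's decision overrides an earlier one. The surrounding rules and their order are an assumption based on the default profile, not something stated in this thread, and the other default rules are left out for brevity:

```xml
<!-- Sketch only: assumes the default Heritrix 3 scope layout; other default rules omitted -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- Start by rejecting everything; later rules accept selectively -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <!-- Accept URIs that fall under the seed-derived SURT prefixes -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule" />
      <!-- Reject anything more than 0 hops away from the seed -->
      <bean id="scopeMaxHops" class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="0" />
      </bean>
      <!-- Placed after the hop limit so embedded resources are accepted again -->
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
        <property name="maxTransHops" value="2" />
        <property name="maxSpeculativeHops" value="1" />
      </bean>
    </list>
  </property>
</bean>
```

The order matters in this sketch: if the TransclusionDecideRule came before the TooManyHopsDecideRule, the hop limit would have the last word and the embeds would likely be rejected along with the navigational links.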

Those are the main settings for getting one page with embeds. I recommend keeping maxDocumentsDow…

Replies: 3 comments

Answer selected by ato
2 participants