How to crawl only one page #571

nk-yuta · 2023-11-16T09:31:47Z

Hello, I am new to Heritrix and not yet familiar with its configuration.

I would like to have Heritrix crawl only one web page and the embedded content it contains.
How can I configure the settings to achieve this?

What I have tried so far:

I set maxHops in TooManyHopsDecideRule to 1. → Did not work as desired.
I set maxDocumentsDownload in the crawlLimitEnforcer to 2. → Only one page could be crawled, but CSS etc. were not downloaded.

I used machine translation for this post.

Answered by ldko

ldko · 2023-11-16T15:16:06Z

Hi @nk-yuta ,
To crawl only one page and its embedded content, try setting maxHops to 0, which should download your seed but not follow any navigational links from that seed:

<bean id="scopeMaxHops" class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <property name="maxHops" value="0" />
</bean>

Then use TransclusionDecideRule to indicate you want to get embedded content. For example you could set:

<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <property name="maxTransHops" value="2" />
  <property name="maxSpeculativeHops" value="1" />
</bean>

Those are the main settings for getting one page with embeds. I recommend keeping maxDocumentsDownload at the default of 0, which means no limit. Otherwise you are telling Heritrix to download only 2 URIs, so it will stop the crawl before fetching all of the URIs for the embedded content.
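
As a rough sketch of how these pieces might sit together, the fragment below assumes the stock crawler-beans.cxml profile, where the scope is a DecideRuleSequence with id "scope" and the limit enforcer is a CrawlLimitEnforcer with id "crawlLimiter". Those ids, and the placement of the two rules as inner beans of the scope list, are assumptions about your profile rather than something stated in this thread, so compare against your own file before copying anything.

```xml
<!-- Sketch only: assumes the default crawler-beans.cxml layout. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- ...keep the rules your profile lists before this point... -->

      <!-- Reject anything reached by following a navigational link from the seed. -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="0" />
      </bean>

      <!-- Accept embedded resources (images, CSS, scripts) of the seed page. -->
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
        <property name="maxTransHops" value="2" />
        <property name="maxSpeculativeHops" value="1" />
      </bean>

      <!-- ...keep the rules your profile lists after this point... -->
    </list>
  </property>
</bean>

<!-- Leave the document limit at 0 (no limit) so embeds are not cut off. -->
<bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <property name="maxDocumentsDownload" value="0" />
</bean>
```

The order matters here: in a DecideRuleSequence a later rule's decision overrides an earlier one, so the transclusion rule needs to come after the hops rule for embedded content to be accepted.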

nk-yuta · 2023-11-20T10:29:48Z

Author

Thanks for the advice!
I will try crawling with that setting.

nk-yuta · 2024-01-30T03:00:27Z

Author

Thanks to your advice, I was able to submit my graduation thesis!
The subject of my thesis is the creation of a new web archiving system, so without this thread I could not have completed my graduation research.
Unfortunately, I cannot share my graduation thesis, but I added your name to the acknowledgments.
I'm really thankful to you!
