Question about the size of the 'state' directory #498

cgr71ii · 2022-09-07T13:29:30Z

Hi!

I've been crawling with Heritrix for 5 hours and I've noticed that the state directory is even bigger than my downloaded WARCs. My configuration, briefly:

- Seeds: ~1200 domains
- Downloading only text
- Downloading from all subdomains of the provided domains
- 600 threads
- 100 GiB heap size
- According to the statistics (which are not correct, see #455, "Crawl job stats and reports misleading when excluding PDF-Files (follow up to issue #453)"):
  - 3,369,654 downloaded + 166,045,099 queued = 169,415,353 total
  - 389 GiB crawled (389 GiB novel, 0 B dupByHash, 0 B notModified)
  - 153.46 URIs/sec (185.63 avg); 24,187 KB/sec (22,499 avg)

Sizes:

113G    build_1662469305/heritrix-3.4.0-SNAPSHOT/jobs/paracrawl9_experiment_without_classifier/state
107G    build_1662469305/heritrix-3.4.0-SNAPSHOT/jobs/paracrawl9_experiment_without_classifier/20220907073359/warcs

The state directory contains approximately 12,500 files.

Is this the size one would expect for the state directory when crawling with a configuration like mine? Is this state directory needed? What is the purpose of this directory and its files? Can it be optimized in order to decrease the size?

Thank you!

anjackson · 2022-09-07T14:02:33Z

Hi @cgr71ii,

From the information you gave, I can see your crawler currently has 166,045,099 URLs queued for download. This data, also called the crawl frontier, is what is taking up most of the crawler state folder.

If I've got my maths right, 113 GB / 166,045,099 ≈ 680 bytes for each URL. This seems pretty reasonable to me, given that various bits of metadata are also held along with the URL.
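The arithmetic above can be checked with a minimal sketch. The function name is made up for illustration, and the second calculation assumes the `113G` figure is in binary units (GiB), which the thread does not confirm:

```python
def bytes_per_url(state_bytes: float, queued_urls: int) -> float:
    """Average on-disk bytes in the state directory per queued URL."""
    return state_bytes / queued_urls


QUEUED_URLS = 166_045_099  # "queued" figure from the crawl statistics

# Reading 113G as decimal gigabytes, as in the estimate above.
decimal_estimate = bytes_per_url(113 * 10**9, QUEUED_URLS)

# Assumption: reading 113G as binary gibibytes instead.
binary_estimate = bytes_per_url(113 * 2**30, QUEUED_URLS)

print(f"decimal GB: {decimal_estimate:.1f} bytes/URL")
print(f"binary GiB: {binary_estimate:.1f} bytes/URL")
```

Either reading lands in the same range of several hundred bytes per URL, so the conclusion does not depend on the unit.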

So, yes, this is what I'd expect to see, given the size of your crawl frontier.

Note that if checkpointing is being used, the state folder just gets larger and larger because all previous versions of the frontier are kept by default. In this situation, you can delete older checkpoints manually.

HTH,
Andy Jackson

ato · 2022-09-07T14:12:56Z

What is the purpose of this directory and its files?

The state directory is where Heritrix keeps the BDB databases tracking things like the state of the crawl, i.e. the set of URLs it has already seen (so it knows not to visit them again) and the queue of URLs to visit in future. 680 bytes per discovered URL doesn't seem unreasonable given the overheads of BDB and the generic serialization mechanism Heritrix uses. Looking at a recent large crawl here I'm seeing about 792 bytes per URL discovered.

Can it be optimized in order to decrease the size?

Reducing the scope of the crawl should prevent the queue from growing so large. If you're really tight on space and can't reduce the scope you could divide up your seeds and crawl them in separate jobs, deleting the state directory of the previous job before starting the next one.
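As a rough sketch of the "divide up your seeds" idea, the helper below splits a seed list into batches that could each go into a separate job. The function name, the round-robin strategy and the example domains are illustrative assumptions, not part of Heritrix:

```python
from typing import List


def split_seeds(seeds: List[str], n_jobs: int) -> List[List[str]]:
    """Distribute seeds round-robin into at most n_jobs non-empty batches."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    batches = [seeds[i::n_jobs] for i in range(n_jobs)]
    # Drop empty batches, which occur when n_jobs exceeds the seed count.
    return [batch for batch in batches if batch]


# Hypothetical seed list of about the size mentioned in the question.
seeds = [f"https://site{i}.example.com/" for i in range(1200)]
batches = split_seeds(seeds, 4)

for number, batch in enumerate(batches, start=1):
    print(f"job {number}: {len(batch)} seeds, first = {batch[0]}")
```

Each batch would then be written to the seeds file of its own job, with the previous job's state directory deleted before the next job starts.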

It would likely be a lot of work, but there are almost certainly ways the code could be modified to use more efficient serialization (the extreme end of that would be swapping out BDB for a more space-efficient database like RocksDB), although there'll be a hard limit eventually. My gut feeling is it'd be somewhere around 100 bytes per URL without losing information/features.
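To put that gut-feeling figure in context, here is a back-of-the-envelope sketch of what the state directory might weigh for this crawl's queue at different per-URL costs. The function name is made up, and the 100 bytes/URL value is only the rough guess above, not a measured result:

```python
def projected_state_gb(urls: int, bytes_per_url: float) -> float:
    """Projected state size in decimal gigabytes for a given per-URL cost."""
    return urls * bytes_per_url / 1e9


QUEUED_URLS = 166_045_099  # queued URLs reported in the question

current = projected_state_gb(QUEUED_URLS, 680)    # roughly what was observed
optimistic = projected_state_gb(QUEUED_URLS, 100)  # hypothetical lower bound

print(f"at 680 bytes/URL: {current:.1f} GB")
print(f"at 100 bytes/URL: {optimistic:.1f} GB")
```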

Is this state directory needed?

During the crawl, yes. You don't need to keep it after the crawl unless you're using the deduplication feature.

cgr71ii · 2022-09-07T19:08:12Z

Thank you both for the explanation and the alternatives! :)

anjackson closed this as completed Sep 7, 2022

internetarchive locked and limited conversation to collaborators Sep 30, 2022

ato converted this issue into discussion #533 Sep 30, 2022
