This issue was moved to a discussion.

Question about the size of the 'state' directory #498

Closed
cgr71ii opened this issue Sep 7, 2022 · 3 comments

Comments

@cgr71ii

cgr71ii commented Sep 7, 2022

Hi!

I've been crawling with Heritrix for 5 hours and I've noticed that the state directory is even bigger than my downloaded WARCs. My configuration, briefly:

Sizes:

113G    build_1662469305/heritrix-3.4.0-SNAPSHOT/jobs/paracrawl9_experiment_without_classifier/state
107G    build_1662469305/heritrix-3.4.0-SNAPSHOT/jobs/paracrawl9_experiment_without_classifier/20220907073359/warcs

In the state directory there are, approximately, 12500 files.

Is the size of the state directory what one would expect when crawling with a configuration like the one I'm using? Is this state directory needed? What is the purpose of this directory and its files? Can it be optimized in order to decrease the size?

Thank you!

@anjackson
Collaborator

Hi @cgr71ii,

From the information you gave, I can see your crawler currently has 166,045,099 URLs queued for download. This data, also called the crawl frontier, is what is taking up most of the crawler state folder.

If I've got my maths right, 113 GB / 166,045,099 ≈ 680 bytes for each URL. This seems pretty reasonable to me, given that various bits of metadata are also held along with the URL.
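The per-URL figure can be reproduced with a short calculation. This is only a sketch using the two numbers quoted in this thread; the function name is made up for illustration, and whether "113G" means decimal gigabytes or binary gibibytes is an assumption (the reply above appears to use decimal), so both readings are shown.

```python
# Figures quoted in this thread.
state_dir_size = 113          # "113G" as reported for the state directory
queued_urls = 166_045_099     # URLs queued in the crawl frontier


def bytes_per_url(size_bytes: int, url_count: int) -> float:
    """Average on-disk bytes per queued URL (hypothetical helper)."""
    return size_bytes / url_count


# Assumption: "G" read as decimal gigabytes (10**9 bytes).
decimal_reading = bytes_per_url(state_dir_size * 10**9, queued_urls)

# Assumption: "G" read as binary gibibytes (2**30 bytes).
binary_reading = bytes_per_url(state_dir_size * 2**30, queued_urls)

print(f"decimal GB reading: {decimal_reading:.1f} bytes/URL")
print(f"binary GiB reading: {binary_reading:.1f} bytes/URL")
```

The decimal reading gives about 680 bytes per URL and the binary reading about 731. Either way the result is in the same range, so the conclusion does not depend on the unit.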

So, yes, this is what I'd expect to see, given the size of your crawl frontier.

Note that if checkpointing is being used, the state folder just gets larger and larger, because all previous versions of the frontier are kept by default. In this situation, you can delete older checkpoints manually.

HTH,
Andy Jackson

@ato
Collaborator

ato commented Sep 7, 2022

What is the purpose of this directory and its files?

The state directory is where Heritrix keeps the BDB (Berkeley DB) databases that track the state of the crawl, such as the set of URLs it has already seen (so it knows not to visit them again) and the queue of URLs to visit in future. 680 bytes per discovered URL doesn't seem unreasonable given the overheads of BDB and the generic serialization mechanism Heritrix uses. Looking at a recent large crawl here, I'm seeing about 792 bytes per URL discovered.

Can it be optimized in order to decrease the size?

Reducing the scope of the crawl should prevent the queue from growing so large. If you're really tight on space and can't reduce the scope you could divide up your seeds and crawl them in separate jobs, deleting the state directory of the previous job before starting the next one.
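As a rough illustration of dividing up the seeds, the sketch below splits a seed list into contiguous batches, one per job. The function name and the example URLs are hypothetical, and how each batch is fed to a Heritrix job is left out because it depends on your job configuration.

```python
from typing import Iterator, List


def split_seeds(seeds: List[str], batches: int) -> Iterator[List[str]]:
    """Yield `batches` contiguous slices of `seeds`, as equal in size as possible."""
    if batches < 1:
        raise ValueError("batches must be at least 1")
    size, extra = divmod(len(seeds), batches)
    start = 0
    for i in range(batches):
        # The first `extra` batches take one additional seed each.
        end = start + size + (1 if i < extra else 0)
        yield seeds[start:end]
        start = end


# Hypothetical seed URLs, for illustration only.
example_seeds = [
    "https://example.org/",
    "https://example.com/",
    "https://example.net/",
    "https://example.edu/",
    "https://example.info/",
]

for number, batch in enumerate(split_seeds(example_seeds, 2), start=1):
    print(f"job {number}: {len(batch)} seeds")
```

One trade-off to keep in mind: because the set of already-seen URLs lives in the state directory, deleting it between jobs presumably means a later job cannot know what an earlier one fetched, so pages reachable from more than one batch may be downloaded again.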

It would likely be a lot of work, but there are almost certainly ways the code could be modified to use more efficient serialization (the extreme end of that would be swapping out BDB for a more space-efficient database like RocksDB), although there'll be a hard limit eventually. My gut feeling is it'd be somewhere around 100 bytes per URL without losing information/features.
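To put that floor in perspective, here is a back-of-the-envelope projection of the state directory size for the frontier in this thread at different per-URL costs. It is a sketch only: the function name is hypothetical, sizes are in decimal gigabytes, the 100-byte figure is the gut-feeling estimate above rather than a measurement, and the 792-byte figure was quoted per discovered URL in a different crawl.

```python
queued_urls = 166_045_099  # frontier size quoted earlier in this thread


def projected_state_gb(url_count: int, bytes_per_url: int) -> float:
    """Projected state size in decimal GB (hypothetical helper)."""
    return url_count * bytes_per_url / 10**9


scenarios = [
    ("this crawl (about 680 bytes/URL)", 680),
    ("other large crawl (about 792 bytes/URL)", 792),
    ("estimated floor (about 100 bytes/URL)", 100),
]

for label, cost in scenarios:
    print(f"{label}: {projected_state_gb(queued_urls, cost):.1f} GB")
```

At around 100 bytes per URL, the same frontier would need on the order of 17 GB instead of more than 100 GB.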

Is this state directory needed?

During the crawl, yes. You don't need to keep it after the crawl unless you're using the deduplication feature.

@cgr71ii
Author

cgr71ii commented Sep 7, 2022

Thank you both for the explanation and the alternatives! :)

@internetarchive internetarchive locked and limited conversation to collaborators Sep 30, 2022
@ato ato converted this issue into discussion #533 Sep 30, 2022
