Add --warc-digest-algorithm #449

Merged
NGTmeaty merged 3 commits into main from warc-digest
Aug 26, 2025


Collaborator

@CorentinB CorentinB commented Aug 26, 2025

This pull request introduces support for selecting the digest algorithm used for WARC record digests, allowing users to choose between sha1, sha256, and blake3. It updates command-line flags, configuration, and internal logic to handle this new option, and ensures the selected algorithm is validated and passed through to the WARC writer. Additionally, it updates dependencies to support the new functionality.

Digest Algorithm Selection and Validation:

  • Added a new persistent flag --warc-digest-algorithm (default: sha1) to the get command, allowing users to specify the digest algorithm for WARC records. Supported values are sha1, sha256, and blake3.
  • Extended the Config struct to include a WARCDigestAlgorithm field, and added validation in GenerateCrawlConfig() to ensure the specified algorithm is supported.
  • Updated the WARC writer initialization to pass the selected digest algorithm to the underlying WARC library by calling warc.GetDigestFromPrefix.
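
For illustration, the validation step described above could look roughly like the sketch below. It uses only the standard library; `validateDigestAlgorithm` and `supportedDigestAlgorithms` are hypothetical names, not the code merged in this PR, and the real check lives in GenerateCrawlConfig() and may be shaped differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// supportedDigestAlgorithms holds the three values named in this PR.
// The variable itself is hypothetical, introduced only for this sketch.
var supportedDigestAlgorithms = map[string]struct{}{
	"sha1":   {},
	"sha256": {},
	"blake3": {},
}

// validateDigestAlgorithm is a hypothetical helper: it returns nil when
// name is one of the supported values and a descriptive error otherwise.
func validateDigestAlgorithm(name string) error {
	if _, ok := supportedDigestAlgorithms[name]; ok {
		return nil
	}
	// Sort the names so the error message is stable across runs.
	names := make([]string, 0, len(supportedDigestAlgorithms))
	for n := range supportedDigestAlgorithms {
		names = append(names, n)
	}
	sort.Strings(names)
	return fmt.Errorf("unsupported WARC digest algorithm %q (supported: %s)",
		name, strings.Join(names, ", "))
}

func main() {
	for _, name := range []string{"sha1", "md5"} {
		if err := validateDigestAlgorithm(name); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", name)
	}
}
```

Rejecting an unknown value at configuration time means a typo in the flag fails before any WARC file is opened, rather than partway through a crawl.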

Dependency Updates:

  • Upgraded github.com/internetarchive/gowarc to v0.8.87 to support additional digest algorithms, and added indirect dependencies for github.com/zeebo/blake3 and github.com/klauspost/cpuid/v2.

Minor Fixes:

  • Fixed a typo in the WARC rotator settings struct field name from WarcSize to WARCSize.
  • Added import of the gowarc package in the config file to access digest-related helpers.

PS: Copilot wrote that. Not that bad?

@CorentinB CorentinB requested review from NGTmeaty and Copilot August 26, 2025 13:29
@CorentinB CorentinB self-assigned this Aug 26, 2025
@CorentinB CorentinB added the enhancement New feature or request label Aug 26, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds support for configurable digest algorithms in WARC files by introducing the --warc-digest-algorithm command-line flag. The change allows users to specify different digest algorithms (sha1, sha256, blake3) for WARC record block and payload digests instead of being limited to the default sha1.

  • Added WARCDigestAlgorithm configuration field with validation
  • Updated WARC writer to use the configured digest algorithm
  • Upgraded gowarc dependency to support new digest functionality
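
As a rough illustration of what a labelled record digest is, the sketch below hashes a payload and prefixes the result with the algorithm name. It is a standard-library approximation, not gowarc's implementation: `newHasher` and `labelledDigest` are hypothetical names, the Base32 encoding is an assumption borrowed from the style of the WARC specification's examples, and blake3 is left out because it needs a third-party package.

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"hash"
)

// newHasher is a hypothetical helper mapping an algorithm name to a
// standard-library hash. blake3 is not covered by this sketch.
func newHasher(algorithm string) (hash.Hash, error) {
	switch algorithm {
	case "sha1":
		return sha1.New(), nil
	case "sha256":
		return sha256.New(), nil
	default:
		return nil, fmt.Errorf("algorithm %q is not available in this sketch", algorithm)
	}
}

// labelledDigest hashes data and returns "<algorithm>:<encoded digest>".
// Base32 is an assumption here; the library may encode differently.
func labelledDigest(algorithm string, data []byte) (string, error) {
	h, err := newHasher(algorithm)
	if err != nil {
		return "", err
	}
	h.Write(data) // hash.Hash.Write never returns an error
	return algorithm + ":" + base32.StdEncoding.EncodeToString(h.Sum(nil)), nil
}

func main() {
	payload := []byte("example payload")
	for _, algorithm := range []string{"sha1", "sha256"} {
		digest, err := labelledDigest(algorithm, payload)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(digest)
	}
}
```

The algorithm prefix is what lets a reader of the WARC file tell which hash produced the value, which matters once more than one algorithm can be configured.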

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| internal/pkg/config/config.go | Added WARCDigestAlgorithm field and validation logic |
| internal/pkg/archiver/warc.go | Updated WARC writer to use configured digest algorithm |
| go.mod | Upgraded gowarc dependency and added new transitive dependencies |
| cmd/get.go | Added --warc-digest-algorithm command-line flag |


Comment thread internal/pkg/config/config.go Outdated
Comment thread internal/pkg/config/config.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
NGTmeaty previously approved these changes Aug 26, 2025
Collaborator

@NGTmeaty NGTmeaty left a comment


Looks good besides one comment!

Comment thread internal/pkg/config/config.go
@codecov-commenter

Codecov Report

❌ Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.43%. Comparing base (84e1d92) to head (96365c8).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/pkg/config/config.go | 0.00% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #449      +/-   ##
==========================================
+ Coverage   55.45%   56.43%   +0.97%     
==========================================
  Files         120      128       +8     
  Lines        7364     7972     +608     
==========================================
+ Hits         4084     4499     +415     
- Misses       2956     3101     +145     
- Partials      324      372      +48     

| Flag | Coverage Δ |
| --- | --- |
| e2etests | 39.44% <50.00%> (+2.27%) ⬆️ |
| unittests | 29.61% <0.00%> (-2.30%) ⬇️ |


@NGTmeaty NGTmeaty merged commit 2ae11a9 into main Aug 26, 2025
5 checks passed
@NGTmeaty NGTmeaty deleted the warc-digest branch August 26, 2025 22:57
CorentinB added a commit that referenced this pull request Aug 27, 2025
* add: --warc-digest-algorithm

* fix: simplify WARC digest algorithm choice validation

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jake L <NGTmeaty@users.noreply.github.com>
