
More versatile HQ client init#442

Merged
willmhowes merged 1 commit intointernetarchive:mainfrom
vbanos:hq-config
Aug 28, 2025
Conversation

@vbanos
Collaborator

@vbanos vbanos commented Aug 19, 2025

Pass HQKey, HQSecret, HQProject, and HQAddress as parameters to HQ.New instead of loading them via config.Get() inside HQ.Start.

This way, we can have multiple HQ client instances with different parameters.

This is necessary to create a second HQ client instance that connects to an "outlinks" HQ project and sends outlinks there.
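
As a rough illustration of the idea (not the actual hq package code), the sketch below shows how taking connection parameters in the constructor lets two independent clients coexist. The Client type, its fields, the Describe method, and all values are hypothetical stand-ins chosen for the example.

```go
package main

import "fmt"

// Client is a hypothetical stand-in for the HQ client; the real type in
// internal/pkg/source/hq has more fields and behavior.
type Client struct {
	key, secret, project, address string
}

// New takes the connection parameters explicitly instead of reading a
// global config, so callers can build several independent clients.
func New(key, secret, project, address string) *Client {
	return &Client{key: key, secret: secret, project: project, address: address}
}

// Describe reports which project and address a client targets.
func (c *Client) Describe() string {
	return fmt.Sprintf("project=%s address=%s", c.project, c.address)
}

func main() {
	// Placeholder values, not real credentials or endpoints.
	crawl := New("key", "secret", "crawl", "https://hq.example.org")
	outlinks := New("key", "secret", "crawl-outlinks", "https://hq.example.org")
	fmt.Println(crawl.Describe())
	fmt.Println(outlinks.Describe())
}
```

With config.Get() called inside Start, both instances would have read the same global HQProject value, which is why the parameters move to the constructor.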

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.68%. Comparing base (b0c17fa) to head (14f7a8c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines               Patch %   Lines
internal/pkg/source/hq/hq.go           0.00%     8 Missing ⚠️
internal/pkg/controler/pipeline.go     0.00%     1 Missing ⚠️
internal/pkg/source/hq/websocket.go    0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #442      +/-   ##
==========================================
- Coverage   54.82%   54.68%   -0.14%     
==========================================
  Files         120      120              
  Lines        7333     7351      +18     
==========================================
  Hits         4020     4020              
- Misses       2987     3005      +18     
  Partials      326      326              
Flag Coverage Δ
e2etests 37.22% <0.00%> (-0.10%) ⬇️
unittests 30.83% <0.00%> (-0.08%) ⬇️


@CorentinB
Collaborator

Hi! Why is this necessary, compared to modifying the HQ client to accept a new (optional) parameter (like HQOutlinksProject) that would be used instead of HQProject if specified?

@vbanos
Collaborator Author

vbanos commented Aug 19, 2025

I can explain my plan:

In startPipeline() we will add the following:

    hqOutlinksFinishChan := makeStageChannel(config.Get().WorkersCount)
    hqOutlinksProduceChan := makeStageChannel(config.Get().WorkersCount)

    // Optional 2nd HQ instance just to gather outlinks to a different project
    if config.Get().UseHQ && config.Get().HQOutlinksProject != "" && config.Get().HQOutlinksHopLimit > 0 {
        hqOutlinks := hq.New(config.Get().HQKey, config.Get().HQSecret, config.Get().HQOutlinksProject, config.Get().HQAddress)
        hqOutlinks.Start(hqOutlinksFinishChan, hqOutlinksProduceChan)
    }

Also, the post processor will change from postprocessor.Start(inputChan, outputChan) to postprocessor.Start(inputChan, outputChan, hqOutlinksProduceChan) to include the new channel that will send outlinks to the different project.

Then, in postprocessor.go, we will add the following function:

// If UseHQ, HQOutlinksProject and HQOutlinksHopLimit are all set, send outlinks
// at or beyond the hop limit to a different HQ project and don't return them
// for further processing.
func (p *postprocessor) sendToHQOutlinks(outlinks []*models.Item) []*models.Item {
    cfg := config.Get()
    if !cfg.UseHQ || cfg.HQOutlinksProject == "" || cfg.HQOutlinksHopLimit <= 0 {
        return outlinks
    }
    var filtered []*models.Item
    for _, outlink := range outlinks {
        if outlink.GetURL().GetHops() >= cfg.HQOutlinksHopLimit {
            p.hqOutlinksProduceCh <- outlink
        } else {
            filtered = append(filtered, outlink)
        }
    }
    return filtered
}

And finally, in the postprocessor worker(), we will use it after we extract all outlinks:

outlinks := postprocess(workerID, seed)
outlinks = p.sendToHQOutlinks(outlinks)

And we are done. The required changes are minimal.
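
To make the routing step concrete, here is a minimal, self-contained sketch of the same split outside Zeno. The Item type, the splitByHops function, and the URLs are hypothetical simplifications; in Zeno the items are *models.Item and the hop count comes from GetURL().GetHops().

```go
package main

import "fmt"

// Item is a simplified stand-in for models.Item; only the hop count matters here.
type Item struct {
	URL  string
	Hops int
}

// splitByHops sends items at or beyond hopLimit to outCh and returns the rest.
// A hopLimit of zero or less disables routing and returns the input unchanged.
func splitByHops(items []*Item, hopLimit int, outCh chan<- *Item) []*Item {
	if hopLimit <= 0 {
		return items
	}
	var kept []*Item
	for _, it := range items {
		if it.Hops >= hopLimit {
			outCh <- it
		} else {
			kept = append(kept, it)
		}
	}
	return kept
}

func main() {
	items := []*Item{
		{"https://a.example/", 1},
		{"https://b.example/", 2},
		{"https://c.example/", 3},
	}
	// Buffered so the sends do not block without a consumer in this demo.
	outCh := make(chan *Item, len(items))
	kept := splitByHops(items, 3, outCh)
	close(outCh)

	fmt.Println("kept:", len(kept))
	for it := range outCh {
		fmt.Println("routed:", it.URL)
	}
}
```

In a real pipeline the channel would be drained by the second HQ client's producer, so an unbuffered or small-buffered channel applies backpressure to the postprocessor rather than dropping outlinks.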

IMHO the current hq package is very nice and well tested, so we shouldn't touch it; we might introduce bugs.

If we tried adding HQOutlinksProject to hq, we would have to over-complicate it by adding a 2nd client, wg, producer() function for outlinks and even more (I haven't considered everything yet).

@yzqzss
Collaborator

yzqzss commented Aug 19, 2025

A possibly easier way is to implement this on the HQ server side, not on the Zeno side.

For example, modify the POST /api/projects/:project/urls handler to ignore a magic project prefix, such as "outlinks_". (Of course, we could implement smarter, live-configurable routing logic on the HQ server.)
This way, when Zeno is started with --hq-project outlinks_abcd, it will pull URLs from outlinks_abcd, but any newly discovered URLs will be redirected and fed to the abcd project.

Otherwise we would need to handle communication with multiple HQs in Zeno, which is a bit complicated IMO.
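
The prefix rule could look roughly like the sketch below. HQ is not open source, so this does not reflect its actual handler; targetProject and the outlinksPrefix constant are hypothetical names used only to show the mapping from the project Zeno pulls from to the project that receives newly discovered URLs.

```go
package main

import (
	"fmt"
	"strings"
)

// outlinksPrefix is the "magic" project prefix from the suggestion.
const outlinksPrefix = "outlinks_"

// targetProject returns the project that newly discovered URLs would be
// stored in. A project with the magic prefix has it stripped; any other
// project name is returned unchanged.
func targetProject(project string) string {
	return strings.TrimPrefix(project, outlinksPrefix)
}

func main() {
	fmt.Println(targetProject("outlinks_abcd"))
	fmt.Println(targetProject("abcd"))
}
```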

@vbanos
Collaborator Author

vbanos commented Aug 19, 2025

@yzqzss there is an extra parameter that isn't covered by your suggested solution.
HQOutlinksHopLimit is an integer limit.
You can start a new crawling job with max hops = 3 and use HQOutlinksHopLimit=3. This means that outlinks at hops 1 and 2 stay in the current project and outlinks at hop 3 are sent to the other project.
Besides that, it could work.
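
The hop arithmetic for that example can be checked with a small sketch. The destination function and the project names "current" and "outlinks" are hypothetical; the comparison assumes the same "at or beyond the limit" rule as the planned sendToHQOutlinks function.

```go
package main

import "fmt"

// destination picks the project for an outlink discovered at the given hop.
// Outlinks at or beyond hopLimit go to outlinksProject; a limit of zero or
// less disables routing, so everything stays in the current project.
func destination(hops, hopLimit int, current, outlinksProject string) string {
	if hopLimit > 0 && hops >= hopLimit {
		return outlinksProject
	}
	return current
}

func main() {
	// HQOutlinksHopLimit=3 with max hops = 3, as in the example.
	for hops := 1; hops <= 3; hops++ {
		fmt.Printf("hop %d -> %s\n", hops, destination(hops, 3, "current", "outlinks"))
	}
}
```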

IMHO my solution is not complicated and requires few changes.

Anyway, it's not my place to decide Zeno architecture.

@CorentinB added the enhancement (New feature or request) label Aug 20, 2025
@CorentinB
Collaborator

CorentinB commented Aug 20, 2025

Anyway, it's not my place to decide Zeno architecture.

It totally is! :)

Although it's clearly not mine when it comes to HQ, because I don't work at the Internet Archive anymore. That's more @NGTmeaty's call!
Just please be careful not to make any drastic changes to the root of Zeno to fit the hq package, because we have our own source package instead of hq and it may break that. Always happy to discuss, of course.

@willmhowes
Collaborator

@vbanos For the time being, let's proceed with your proposal that keeps the logic in Zeno. Given that HQ isn't open source at this time, I'd rather the functionality be in Zeno, where all users can take advantage of it.

@willmhowes willmhowes merged commit e3b4d1e into internetarchive:main Aug 28, 2025
2 checks passed
CorentinB pushed a commit that referenced this pull request Aug 29, 2025