Skip to content

fix: WARCWritingClient nil ptr dereference#581

Open
AltayAkkus wants to merge 1 commit intointernetarchive:mainfrom
AltayAkkus:fix-warcwritingclient
Open

fix: WARCWritingClient nil ptr dereference#581
AltayAkkus wants to merge 1 commit intointernetarchive:mainfrom
AltayAkkus:fix-warcwritingclient

Conversation

@AltayAkkus
Copy link
Copy Markdown
Contributor

See #580

clients = append(clients, c)
}
globalArchiver.Client.Timeout = config.Get().HTTPTimeout
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed because it used nowhere, and with one .Client we won't need any GetClients() shorthand soon

@AltayAkkus
Copy link
Copy Markdown
Contributor Author

I checked if the Proxy is actually used by adding ipinfo.io/json to the list, works4me

altay@Altays-MBP zeno % zeno --config-file jobs/zeno-config.yaml get list jobs/list.txt 
Using config file: jobs/zeno-config.yaml
2026/03/20 20:32:11 INFO User-Agent set to user-agent="Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/dfedbc7 warc/v0.8.98"
2026/03/20 20:32:11 INFO max content length is set, payload over X MiB would be discarded X=10
2026-03-20T20:32:11+01:00 INFO  reactor.go:57       | started component=reactor
2026-03-20T20:32:11+01:00 INFO  preprocessor.go:66  | started component=preprocessor
2026-03-20T20:32:11+01:00 INFO  worker.go:70        | bucket manager started component=archiver
2026-03-20T20:32:11+01:00 INFO  worker.go:92        | started component=archiver
2026-03-20T20:32:11+01:00 INFO  postprocessor.go:50 | started component=postprocessor
2026-03-20T20:32:11+01:00 INFO  pipeline.go:135     | starting source component=controler.StartPipeline source=Local Queue
2026-03-20T20:32:11+01:00 INFO  lq.go:63            | started component=lq
2026-03-20T20:32:11+01:00 INFO  finisher.go:55      | started component=finisher
2026-03-20T20:32:12+01:00 INFO  archiver.go:180     | url archived component=archiver.general.archive depth=0 hops=0 item_id=972e0 seed_id=972e0 url=https://example.com/ status=200
2026-03-20T20:32:12+01:00 INFO  archiver.go:180     | url archived component=archiver.general.archive depth=0 hops=0 item_id=c65e3 seed_id=c65e3 url=https://altayakkus.dev/ status=200
2026-03-20T20:32:12+01:00 INFO  archiver.go:180     | url archived component=archiver.general.archive depth=0 hops=0 item_id=9fbf6 seed_id=9fbf6 url=https://ipinfo.io/json status=200
2026-03-20T20:32:12+01:00 INFO  consumer.go:228     | crawl finished: no URLs in queue and no active work in reactor, triggering graceful shutdown component=lq.consumerFetcher
2026-03-20T20:32:12+01:00 INFO  signal.go:27        | received shutdown signal, stopping services... component=controler.signalWatcher
2026-03-20T20:32:12+01:00 INFO  reactor.go:86       | frozen component=reactor
2026-03-20T20:32:12+01:00 INFO  preprocessor.go:81  | stopped component=preprocessor
2026-03-20T20:32:14+01:00 INFO  worker.go:130       | stopped component=archiver
2026-03-20T20:32:14+01:00 INFO  worker.go:140       | closed bucket manager component=archiver
2026-03-20T20:32:14+01:00 INFO  worker.go:146       | stopped config related contexts component=archiver
2026-03-20T20:32:14+01:00 INFO  postprocessor.go:64 | stopped component=postprocessor
2026-03-20T20:32:14+01:00 INFO  finisher.go:73      | stopped component=finisher
2026-03-20T20:32:15+01:00 INFO  lq.go:91            | stopped component=lq
2026-03-20T20:32:15+01:00 INFO  reactor.go:77       | stopped component=reactor
2026-03-20T20:32:15+01:00 INFO  pipeline.go:207     | done, logs are flushing and will be closed component=controler.stopPipeline

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 69.56522% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.67%. Comparing base (dfedbc7) to head (01d1c77).

Files with missing lines Patch % Lines
internal/pkg/archiver/warc.go 65.00% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #581      +/-   ##
==========================================
+ Coverage   56.46%   56.67%   +0.20%     
==========================================
  Files         133      133              
  Lines        6742     6712      -30     
==========================================
- Hits         3807     3804       -3     
+ Misses       2556     2533      -23     
+ Partials      379      375       -4     
Flag Coverage Δ
e2etests 41.98% <69.56%> (+0.14%) ⬆️
unittests 29.28% <0.00%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@NGTmeaty
Copy link
Copy Markdown
Collaborator

This is very interesting - there was once a concept of proxied requests and non-proxied requests in one Zeno client, the idea being we could specifically route some traffic over a proxy and some not based on simple domain rules, I suppose over some of our rewrites, this got a little lost, so this is technically the correct fix but misses the initial intentions behind ClientWithProxy.

Let me talk with @willmhowes to see if we can recommend something specific here.

@AltayAkkus
Copy link
Copy Markdown
Contributor Author

That makes sense!
But I think that the pattern of creating two separate WARCWritingClient, one proxied and one unproxied should be left behind.
#362 proposed having one .Client and one ClientWithProxy, and statically route items to one of these clients via a list of domains in Config.Get().BypassProxy. Essentially like this
image

#136 proposed mapping tlds -> proxies (!). If I understood Corentin right, that would mean that you can map

"de" -> german.proxy.global:1337 
"fr" -> france.proxy.global:1337
...

I would throw another wish into the pot, I would like to use a proxy to retry items that failed with 429 Too many requests or 451 Unavailable For Legal Reasons

We could all crawl in peace & happiness if we were able to define a Proxy on a per-request level with gowarc's Client.

This sunday I wasted about 6 hours of my life on trying to implement HTTP/2 in gowarc, to no avail.
I will take a shot at implementing this

@AltayAkkus
Copy link
Copy Markdown
Contributor Author

AltayAkkus commented Mar 24, 2026

Implemented per-request proxy definition with gowarc's HTTP client in internetarchive/gowarc#186

After due diligence & review we MAY continue here.
My first idea would be a proxyConf.json that is passed via Config, where you define rules that map items -> proxy

https://github.com/Icheka/go-rules-engine

{
  "condition": {
    "all": [
      {
        "identifier": "tld",
        "operator": "=",
        "value": "de"
      }
    ],
    "priority": 10
  },
  "event": {
    "type": "proxy",
    "payload": {
      "proxy":"user:pass@german.myproxies.eu:1337"
    }
  }
}
{
  "condition": {
    "all": [
      {
        "identifier": "retries",
        "operator": ">",
        "value": "3"
      }
    ],
    "priority": 5
  },
  "event": {
    "type": "proxy",
    "payload": {
      "proxy":"user:pass@global.myproxies.eu:1337"
    }
  }
}

Or we build something else on our own.

The rules are evaluated in the

func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) {

for retry := 0; retry <= config.Get().MaxRetry; retry++ {
// This is unused unless there is an error
retrySleepTime := time.Second * time.Duration(retry*2)
// Get and measure request time
getStartTime := time.Now()
// If WARC writing is asynchronous, we don't need a feedback channel
if !config.Get().WARCWriteAsync {
feedbackChan = make(chan struct{}, 1)
// Add the feedback channel to the request context
req = req.WithContext(warc.WithFeedbackChannel(req.Context(), feedbackChan))
}
wrappedConnChan = make(chan *warc.CustomConnection, 1)
req = req.WithContext(warc.WithWrappedConnection(req.Context(), wrappedConnChan))
resp, err = client.Do(req)

Depending on the outcome of the evaluation, we add the ContextProxyKey before Doing the request via gowarc

tl;dr

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants