
Inserting a sleep between each fetch request #71

Open
Derek-Jones opened this issue Jan 11, 2024 · 5 comments

@Derek-Jones

I would like to be nice to the Wayback Machine and space out my requests.

An option to insert a delay of x seconds between each page fetch would allow me to reduce the load.

It looks like the Wayback Machine does have a rate limiter, which causes the current non-delayed fetch to grind to a halt.

@jasonkarns

I would love to leverage tools like `watch`, but waybackpack doesn't exit with a proper status code, so it's difficult to script with `sleep`, `watch`, or other utilities.

When waybackpack encounters an error, it still exits with a successful status code (0) instead of an error status. (Which is a bug, IMO.)
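For example, a correct exit status would make a polling wrapper like this hypothetical sketch possible (the URL and options are illustrative; nothing like this exists yet, and it only helps if waybackpack actually exits non-zero on failure):

```python
# Hypothetical wrapper: rerun waybackpack periodically, stopping on failure.
import subprocess
import sys
import time

while True:
    result = subprocess.run(
        ["waybackpack", "http://www.bsdstats.org/bt/cpus.html",
         "-d", "tway", "--no-clobber"]
    )
    if result.returncode != 0:
        sys.exit(result.returncode)  # surface the failure to the caller
    time.sleep(600)  # wait 10 minutes between runs
```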

@Derek-Jones
Author

I was thinking more along the lines of, say, calling `time.sleep(NICE_INTERVAL)` at the end of the for loop in the function `download_to`.
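Roughly this shape, as a sketch rather than waybackpack's actual code (`NICE_INTERVAL` is just a name for illustration):

```python
import time

NICE_INTERVAL = 5  # seconds between fetches; value is illustrative

def download_assets(assets, fetch):
    # Same shape as the loop in download_to: fetch, write, then pause.
    for asset in assets:
        content = fetch(asset)
        # ... write `content` to disk, as download_to already does ...
        time.sleep(NICE_INTERVAL)  # be nice to the Wayback Machine
```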

jsvine added a commit that referenced this issue Jan 17, 2024
@jsvine
Owner

jsvine commented Jan 17, 2024

Hi @Derek-Jones, and thanks for the suggestion. I've now added `--delay X` (in the CLI, and `delay=X` in `download_to`), available in v0.6.0. This adds a pause of X seconds between fetches. Let me know if it works for you.
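For example, assuming the `Pack` usage from the README (the URL and delay value here are just illustrative):

```python
from waybackpack import Pack

# Pause 5 seconds between fetches via the new delay parameter (v0.6.0+).
pack = Pack("http://www.bsdstats.org/bt/cpus.html")
pack.download_to("tway", delay=5)
```

The CLI equivalent would be `waybackpack http://www.bsdstats.org/bt/cpus.html -d tway --delay 5`.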

And thanks for the note @jasonkarns. To clarify, are you saying that if waybackpack itself fails (i.e., throws a Python error), you still get exit=0? That'd surprise me, and require one kind of debugging.

Or are you saying that when an asset fails to fetch, waybackpack ultimately completes with exit=0? If so, that seems, at least from my perspective, to be more of a user-expectations question. With voluminous fetches, the Wayback Machine can be expected to fail occasionally, and I wouldn't necessarily want to call the whole process a failure. But perhaps this could be configurable, so that if you did want any failed fetch to lead to exit=1, you could specify that.
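Sketched very roughly, and purely hypothetically (no `fail_on_error` option exists in waybackpack today), the configurable version might look like:

```python
# Hypothetical: track failed fetches and only exit non-zero when the
# user asks for strict behavior.
import sys
import urllib.request

def fetch_all(urls, fail_on_error=False):
    failed = []
    for url in urls:
        try:
            urllib.request.urlopen(url, timeout=30).read()
        except OSError:  # covers URLError and socket-level failures
            failed.append(url)
    if failed and fail_on_error:
        sys.exit(1)  # any failed fetch marks the whole run as failed
```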

@Derek-Jones
Author

Thanks for implementing this suggestion, and doing it so quickly.

If I kick off waybackpack (see below), the 20th fetch appears to hang, and after some delay a variety of Python tracebacks appear.

Waiting, say, 10 minutes and rerunning produces the same behavior after fewer fetches (the `--no-clobber` option means that only new, later pages are fetched).

>waybackpack http://www.bsdstats.org/bt/cpus.html -d tway --max-retries 5 --no-clobber
INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20080813080244
INFO:waybackpack.pack: Writing to tway/20080813080244/www.bsdstats.org/bt/cpus.html
# ... lines deleted
INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20110725111836
INFO:waybackpack.pack: Writing to tway/20110725111836/www.bsdstats.org/bt/cpus.html

INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20110911091237
Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 179, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20110911091237/http://www.bsdstats.org/bt/cpus.html (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/bin/waybackpack", line 8, in <module>
    sys.exit(main())
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/cli.py", line 144, in main
    pack.download_to(
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/pack.py", line 99, in download_to
    content = asset.fetch(session=self.session, raw=raw, root=root)
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/asset.py", line 53, in fetch
    res = session.get(url)
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/session.py", line 29, in get
    res = requests.get(
  File "/home/derek/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/adapters.py", line 553, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20110911091237/http://www.bsdstats.org/bt/cpus.html (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)'))
/home/web/blog/bsdstats>


@jsvine
Owner

jsvine commented Jan 17, 2024

Ah, thanks for flagging. Looks like we need to handle `ConnectTimeout` (instead of just `ConnectionError`). Attempted fix now pushed in v0.6.1.
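For the record, the general shape of the handling, sketched under my own assumptions rather than as the actual v0.6.1 code:

```python
# Catch connection timeouts as well as plain connection errors, backing
# off and retrying rather than crashing. (ConnectTimeout is listed
# explicitly for clarity.)
import time
import requests

def fetch_with_retries(url, max_retries=5, delay=5):
    for _ in range(max_retries):
        try:
            return requests.get(url)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ConnectTimeout):
            time.sleep(delay)  # pause before retrying
    return None  # all retries exhausted
```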
