Behaviour of the `/submit/` endpoint #22

antonalekseev · 2020-05-01T18:18:06Z

The behaviour of the https://archive.md/submit/ endpoint has changed recently. It now returns a WIP page in the Refresh header (https://archive.md/wip/Z6uhm) that shows page capture progress, and it expects the client to retry until the page is captured and the proper memento URL (https://archive.md/Z6uhm) is returned via Location. As a result, archiveis.capture() always returns the URL of the WIP page.

This can be fixed either by retrying until the proper URL is available (and somehow handling errors if it is not), or by just stripping /wip/ from the URL and hoping for the best.

>>> archive_url = archiveis.capture("https://example.com")
DEBUG:archiveis.api:Requesting https://archive.md/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "GET / HTTP/1.1" 200 4997
DEBUG:archiveis.api:Unique identifier: QxbCURgTX9qqOlJsvO7Qnp6OpwoRYUx3YErVZz1eLx4aUht3+iuOB+6Ili4WD2Y2
DEBUG:archiveis.api:Requesting https://archive.md/submit/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "POST /submit/ HTTP/1.1" 200 244
DEBUG:archiveis.api:Memento from Refresh header: https://archive.md/wip/Z6uhm
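For illustration, the "strip /wip/" option could look roughly like the sketch below. `strip_wip` is a hypothetical helper, not part of archiveis, and it assumes the final memento URL always equals the WIP URL with the leading `/wip` path segment removed, as in the two example URLs above.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_wip(url: str) -> str:
    """Hypothetical helper: map .../wip/<id> to .../<id>.

    Assumes the memento URL is the WIP URL minus the leading "/wip"
    path segment. URLs without that prefix are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path.startswith("/wip/"):
        # Drop "/wip" but keep the slash that precedes the identifier.
        parts = parts._replace(path=parts.path[len("/wip"):])
    return urlunsplit(parts)
```

This does not check that the capture ever finishes, so the returned link may still lead to a page that was never archived.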


palewire · 2020-05-05T15:39:23Z

Do you think stripping the /wip/ will work reliably?

antonalekseev · 2020-05-05T20:57:46Z

I reckon it will be no less reliable than the current archiveis code was with the old-style (pre-WIP-page) handling on the server side. The Refresh: header was available as soon as the Loading... page was, and archiveis.capture() returned it immediately and unconditionally. So unsuccessful archivals (Error: time out., Error: Network error., and an infinite Loading...) were not handled anyway, and the resulting link ultimately yielded a 404. Stripping /wip/ should work the same way.

On the one hand, bluntly ignoring errors is not an ideal approach; on the other hand, waiting up to 3-5 minutes on each call is not an option for many use cases. Maybe it makes sense to introduce some kind of archiveis.capture(..., strict=False) parameter that defaults to the shortcut (existing) behaviour, plus an optional strict=True mode that parses the WIP page for all kinds of errors and raises exceptions?
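A rough sketch of what such a switch might look like, under several assumptions: `resolve_memento`, `CaptureError`, and the `fetch` callable are hypothetical names and not part of archiveis. The sketch also assumes that the finished capture is signalled by a Location header on the WIP page response, and that the error strings quoted above appear verbatim in the page body. `fetch` stands in for whatever HTTP call the library makes and returns a `(headers, body)` pair.

```python
import time
from typing import Callable, Mapping, Tuple

# Error strings quoted in this thread; matching them as plain substrings
# of the WIP page body is an assumption.
WIP_ERRORS = ("Error: time out.", "Error: Network error.")

# Hypothetical transport: takes a URL, returns (response headers, body text).
Fetch = Callable[[str], Tuple[Mapping[str, str], str]]


class CaptureError(Exception):
    """Hypothetical exception raised when a capture fails or never finishes."""


def resolve_memento(wip_url: str, fetch: Fetch, strict: bool = False,
                    timeout: float = 300.0, interval: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    if not strict:
        # Shortcut behaviour: strip /wip/ and do not wait for the capture.
        return wip_url.replace("/wip/", "/", 1)

    deadline = time.monotonic() + timeout
    while True:
        headers, body = fetch(wip_url)
        if "Location" in headers:
            # Assumed success signal: the final memento URL.
            return headers["Location"]
        for message in WIP_ERRORS:
            if message in body:
                raise CaptureError(message)
        if time.monotonic() >= deadline:
            raise CaptureError("capture did not finish within %s seconds" % timeout)
        sleep(interval)
```

With `strict=False` the call returns immediately and `fetch` is never used. With `strict=True` it blocks for up to `timeout` seconds, which is the trade-off described above.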

palewire · 2020-12-27T23:52:49Z

Do you have any idea how we could implement this in Python?
