Behaviour of the /submit/ endpoint #22

Open
antonalekseev opened this issue May 1, 2020 · 3 comments

Comments

@antonalekseev

The behaviour of the https://archive.md/submit/ endpoint has changed recently. It now returns a WIP page in the Refresh header (https://archive.md/wip/Z6uhm), which shows the capture progress, and expects the client to retry until the page is captured and the proper memento URL (https://archive.md/Z6uhm) is returned via Location. As a result, archiveis.capture() always returns the URL of the WIP page.

This can be fixed either by retrying until the proper URL is available (and somehow handling errors if it never is) or by just stripping /wip/ from the URL and hoping for the best.
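For illustration, the second option (stripping /wip/) might look roughly like the sketch below. The helper name strip_wip is hypothetical and not part of archiveis. It assumes the WIP URL always has the form https://archive.md/wip/<id>, as in the example above.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_wip(url: str) -> str:
    """Turn a WIP URL into the presumed memento URL.

    Hypothetical helper: assumes the WIP page lives under a leading
    "/wip/" path segment and the memento under the same id without it.
    """
    parts = urlsplit(url)
    prefix = "/wip/"
    if parts.path.startswith(prefix):
        # Keep the leading slash, drop only the "wip/" segment.
        parts = parts._replace(path="/" + parts.path[len(prefix):])
    return urlunsplit(parts)
```

Called on https://archive.md/wip/Z6uhm, this returns https://archive.md/Z6uhm. A URL without the /wip/ prefix is returned unchanged. Nothing here checks that the capture actually succeeded.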

>>> archive_url = archiveis.capture("https://example.com")
DEBUG:archiveis.api:Requesting https://archive.md/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "GET / HTTP/1.1" 200 4997
DEBUG:archiveis.api:Unique identifier: QxbCURgTX9qqOlJsvO7Qnp6OpwoRYUx3YErVZz1eLx4aUht3+iuOB+6Ili4WD2Y2
DEBUG:archiveis.api:Requesting https://archive.md/submit/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "POST /submit/ HTTP/1.1" 200 244
DEBUG:archiveis.api:Memento from Refresh header: https://archive.md/wip/Z6uhm
@palewire
Owner

palewire commented May 5, 2020

Do you think stripping the /wip/ will work reliably?

@antonalekseev
Author

I reckon it will be no less reliable than the current archiveis code was with the old-style (pre-WIP-page) handling on the server side. The Refresh header was available as soon as the Loading... page was, and archiveis.capture() returned it immediately and unconditionally. So unsuccessful archivals (Error: time out., Error: Network error. and an infinite Loading...) were not handled anyway, and the resulting link ultimately yielded a 404. Stripping /wip/ should work the same way.

On the one hand, bluntly ignoring errors is not an ideal approach; on the other hand, waiting up to 3-5 minutes on each call is not an option for many use cases. Maybe it makes sense to introduce some kind of archiveis.capture(..., strict=False) parameter, which defaults to the shortcut (existing) behaviour, plus an optional strict=True mode that parses the WIP page for all kinds of errors and raises exceptions?
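A rough sketch of how such a strict switch could work is below. It is not the archiveis API: resolve_memento, fetch_wip and CaptureError are all hypothetical names. It assumes that a finished capture answers a GET on the WIP URL with a Location header, and that the error strings quoted above appear verbatim in the WIP page body. It uses only the standard library so that it is self-contained.

```python
import http.client
import time
from urllib.parse import urlsplit

# Error strings quoted in this thread; that they appear verbatim in the
# WIP page body is an assumption.
WIP_ERRORS = ("Error: time out.", "Error: Network error.")


class CaptureError(Exception):
    """Hypothetical exception raised when a capture fails or never finishes."""


def fetch_wip(url):
    """Poll the WIP page once; return (Location header or None, body text)."""
    parts = urlsplit(url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=30)
    try:
        # http.client never follows redirects, so Location stays visible.
        conn.request("GET", parts.path)
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", "replace")
        return resp.getheader("Location"), body
    finally:
        conn.close()


def resolve_memento(wip_url, strict=False, fetch=fetch_wip,
                    max_wait=300.0, interval=5.0, sleep=time.sleep):
    """Map a WIP URL to a memento URL.

    strict=False: shortcut, strip /wip/ and return immediately.
    strict=True: poll until Location appears, an error string shows up,
    or max_wait seconds have passed.
    """
    if not strict:
        return wip_url.replace("/wip/", "/", 1)
    deadline = time.monotonic() + max_wait
    while True:
        location, body = fetch(wip_url)
        if location:
            return location
        for message in WIP_ERRORS:
            if message in body:
                raise CaptureError(message)
        if time.monotonic() >= deadline:
            raise CaptureError("capture still in progress after %s seconds" % max_wait)
        sleep(interval)
```

The fetch and sleep arguments are there only so the polling loop can be exercised without network access. The default max_wait of 300 seconds reflects the 3-5 minute worst case mentioned above.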

@palewire
Owner

Do you have any idea how we could implement this in Python?
