Problems with pushing mementos into Internet Archive #43

Open
shawnmjones opened this issue Apr 1, 2020 · 6 comments

@shawnmjones
Member

I noticed this when I was using ArchiveNow this morning.

# archivenow www.foxnews.com
Error (The Internet Archive): 445 Client Error:  for url: https://web.archive.org/save/www.foxnews.com

If I add a user agent to the arguments of the requests.get call on line 15 of archivenow/archivenow/handlers/ia_handler.py, then it works.

r = requests.get(uri, timeout=120, allow_redirects=True)
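A minimal sketch of that change, assuming the call keeps its current arguments and only gains a headers argument. The helper name save_kwargs and the agent string "mytool/1.0" are hypothetical and not part of ArchiveNow.

```python
def save_kwargs(user_agent=None):
    """Build keyword arguments for requests.get, adding a User-Agent if given."""
    # These two values mirror the existing call in ia_handler.py.
    kwargs = {"timeout": 120, "allow_redirects": True}
    if user_agent:
        kwargs["headers"] = {"User-Agent": user_agent}
    return kwargs


# Intended use (needs the requests package and network access):
#   r = requests.get(uri, **save_kwargs("mytool/1.0"))
kwargs = save_kwargs("mytool/1.0")
```

Without a user agent the helper returns exactly the arguments of the current call, so existing behavior would be unchanged.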

I'm uncertain as to how you want to handle the user specifying their own user agent. The existing --agent argument appears to be for specifying which tool the user desires to employ for creating WARCs. Also, there doesn't appear to be a way to submit changes to any of the request headers in archivenow/archivenow.py.

As I'm calling ArchiveNow within Python code, I would prefer an available parameter to the push function on line 129 of archivenow/archivenow.py.

def push(URI, arc_id, p_args={}):
    global handlers
    global res_uris
    try:
        # push to all possible archives
        res_uris_idx = str(uuid.uuid4())
        res_uris[res_uris_idx] = []
        ### if arc_id == 'all':
            ### for handler in handlers:
                ### if (handlers[handler].api_required):
                    # pass args like key API
                    ### res.append(handlers[handler].push(str(URI), p_args))
                ### else:
                    ### res.append(handlers[handler].push(str(URI)))
        ### else:
            # push to the chosen archives
        threads = []
        for handler in handlers:
            if (arc_id == handler) or (arc_id == 'all'):
            ### if (arc_id == handler): ### and (handlers[handler].api_required):
                #res.append(handlers[handler].push(str(URI), p_args))
                #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
                threads.append(Thread(target=push_proxy, args=(handlers[handler],str(URI), p_args, res_uris_idx,)))
            ### elif (arc_id == handler):
                ### res.append(handlers[handler].push(str(URI)))
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        res = res_uris[res_uris_idx]
        del res_uris[res_uris_idx]
        return res
    except:
        del res_uris[res_uris_idx]
        pass
    return ["bad request"]

For example, we could have:

def push(URI, arc_id, p_args={}, headers={}):

where the user can override any of the request headers by assigning them as a dictionary to the headers parameter. This dictionary would then have to be passed along, through the code on line 154, to the function executed via multithreading.
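A rough sketch of how the headers dictionary might travel through the threaded call. EchoHandler, the extra push_proxy parameter, and the handler push signature are assumptions made for illustration; the real handlers and push_proxy would each need a matching change.

```python
import uuid
from threading import Thread

res_uris = {}


class EchoHandler:
    """Hypothetical stand-in for an archive handler; it does no network I/O."""

    def push(self, uri, p_args, headers):
        return "pushed %s with UA %s" % (uri, headers.get("User-Agent"))


handlers = {"ia": EchoHandler()}


def push_proxy(handler, uri, p_args, headers, idx):
    # Runs inside a worker thread and stores the handler's result.
    res_uris[idx].append(handler.push(uri, p_args, headers))


def push(URI, arc_id, p_args=None, headers=None):
    # None defaults avoid sharing one mutable dictionary between calls.
    p_args = p_args or {}
    headers = headers or {}
    idx = str(uuid.uuid4())
    res_uris[idx] = []
    threads = [
        Thread(target=push_proxy, args=(h, str(URI), p_args, headers, idx))
        for name, h in handlers.items()
        if arc_id in (name, "all")
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return res_uris.pop(idx)


result = push("http://example.com", "ia", headers={"User-Agent": "mytool/1.0"})
```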

I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.

@shawnmjones changed the title from "Issue pushing mementos into Internet Archive" to "Problems with pushing mementos into Internet Archive" on Apr 1, 2020
@maturban
Member

maturban commented Apr 2, 2020

Thanks for providing details about the problem.

Do you have any suggestions for how the user could provide headers? For example:

archivenow http://www.example.com --header='{"User-Agent": "Mozilla/5.0 (Windows NT 6.1)", "Accept-Charset": "utf-8"}'

@maturban
Member

maturban commented Apr 2, 2020

The user-agent is hard coded in the Internet Archive handler (i.e., archivenow/archivenow/handlers/ia_handler.py) for now.

@machawk1
Member

machawk1 commented Apr 2, 2020

@maturban MemGator has logic that lets users specify the user agent on the command line. I think simply accepting a string through a semantic CLI flag (e.g., MemGator's --agent/-a) would make specifying this value more straightforward for users.

@ibnesayeed might have an opinion on this as well.

@shawnmjones
Member Author

Here are my suggestions after thinking about it this morning.

For command-line users

For the command line utility, something like this should suffice:

archivenow http://www.example.com --user-agent "mytool/1.0"

For comparison, wget has -U, --user-agent=AGENT and curl has -A, --user-agent <name>.

We don't have a use case for allowing command-line users to change all request headers, just user-agent.
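One way the flag could be wired up, sketched with argparse. The positional URI argument and the parser layout are assumptions here and do not reflect ArchiveNow's actual argument parser.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument("URI", help="URI to push into the archives")
    parser.add_argument(
        "--user-agent",
        dest="user_agent",
        default=None,
        help="User-Agent string to send with archive requests",
    )
    return parser


# Parse the example command line shown above.
args = build_parser().parse_args(
    ["http://www.example.com", "--user-agent", "mytool/1.0"]
)
```

Leaving the default as None lets each handler fall back to whatever agent it uses today when the flag is absent.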

For programmers (me, other WS-DL folks, and the world)

Programmers, on the other hand, may need to modify request headers. This is why I was suggesting that we alter def push in archivenow/archivenow.py to be something more like:

def push(URI, arc_id, p_args={}, headers={}):

and have the headers dictionary propagate to the appropriate handler and the requests.get call that it makes.

I have an even better idea.

Because ArchiveNow employs the requests library, you could allow the programmer to set up a session object and send the session object as an argument.

If no session object is specified, the argument can default to a new one. Like this:

def push(URI, arc_id, p_args={}, session=requests.Session()):

This way, the programmer can set up the session once in their own code and just pass it. They may have changed the session object to include caching, timeouts, user-agents, request headers, etc, and ArchiveNow does not need to care what changes were made. It just calls session.get when the time comes.

You can even re-use this session object solution when changing the user-agent string while adding the user-agent argument for command-line users.
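A sketch of the session idea under a few assumptions. RecordingSession is a hypothetical stand-in with the same get() shape as requests.Session so the example runs offline, and ia_push is a simplified, made-up handler. One deliberate difference from the signature above: the default is None and the session is created inside the function, because Python evaluates a default argument once at definition time, so session=requests.Session() would share a single session across every call.

```python
class RecordingSession:
    """Hypothetical stand-in with the same get() shape as requests.Session."""

    def __init__(self):
        self.headers = {}
        self.calls = []

    def get(self, url, **kwargs):
        # Record the call instead of touching the network.
        self.calls.append((url, dict(self.headers), kwargs))
        return "response for " + url


def ia_push(uri, session):
    """Hypothetical handler: it only ever calls session.get."""
    return session.get(
        "https://web.archive.org/save/" + uri,
        timeout=120,
        allow_redirects=True,
    )


def push(URI, arc_id, p_args=None, session=None):
    if session is None:
        # Build the default per call; this branch needs the requests package.
        import requests
        session = requests.Session()
    # arc_id and p_args are ignored in this sketch; only the session matters.
    return [ia_push(str(URI), session)]


session = RecordingSession()
session.headers["User-Agent"] = "mytool/1.0"
result = push("www.example.com", "ia", session=session)
```

With the real library, the caller would pass a requests.Session() whose headers were updated in the same way.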

@ibnesayeed
Member

MemGator CLI's user agent works as follows:

  • By default, every request to each upstream archive includes a User-Agent header in the MemGator/{Version} <{CONTACT}> format where the value of the {Version} is the version of the MG binary used and the default value of {CONTACT} is set to the URI of the MG repo.
  • A user can specify contact information or any other descriptive string to identify themselves to upstream archives by providing a value using the --contact CLI parameter. This value will be placed in the default UA template.
  • A user may choose to overwrite the whole UA string to not include the name and version of the binary by supplying a custom string using the --agent CLI parameter.
  • A user may choose to use the --spoof flag, which will cause MG to use a random browser UA in each request. There are currently only three spoofing agents defined in the repo, but they can be expanded to have more to choose from.
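The rules above could be approximated in Python as follows. This is only an illustration, not MemGator's code: the function name, the placeholder contact URI, the three browser strings, and the choice to let spoofing take precedence over a custom agent are all assumptions.

```python
import random

# Placeholder for the MemGator repo URI (assumption, not the real value).
DEFAULT_CONTACT = "https://example.org/memgator-repo"

# Illustrative browser strings; MemGator's actual spoof list may differ.
SPOOF_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4)",
]


def build_user_agent(version, contact=DEFAULT_CONTACT, agent=None, spoof=False):
    if spoof:
        # Pick a random browser UA for this request.
        return random.choice(SPOOF_AGENTS)
    if agent:
        # A custom agent replaces the whole template.
        return agent
    # Default template: MemGator/{Version} <{CONTACT}>
    return "MemGator/%s <%s>" % (version, contact)


default_ua = build_user_agent("1.0")
contact_ua = build_user_agent("1.0", contact="@someone")
custom_ua = build_user_agent("1.0", agent="mytool/1.0")
```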

@ibnesayeed
Member

I would not suggest supplying Python dictionaries from the CLI, as CLIs should be language independent.

If you want to allow specifying generic request headers from the CLI (apart from a dedicated flag for the UA), you can use the append action (which will allow repetition of the same argument) to a CLI parameter like --header which accepts a value of the form "header-name: value" (this is how it is done in cURL).
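A sketch of the repeatable --header parameter using argparse's append action. The parse_headers helper and the positional URI argument are hypothetical; only the "header-name: value" form comes from the suggestion above.

```python
import argparse


def parse_headers(items):
    """Turn ["Name: value", ...] into a dictionary of request headers."""
    headers = {}
    for item in items or []:
        # Split on the first colon only, so values may contain colons.
        name, sep, value = item.partition(":")
        if not sep or not name.strip():
            raise ValueError("expected 'header-name: value', got %r" % item)
        headers[name.strip()] = value.strip()
    return headers


parser = argparse.ArgumentParser(prog="archivenow")
parser.add_argument("URI")
# action="append" collects every occurrence of --header into a list.
parser.add_argument("--header", action="append", metavar="HEADER")

args = parser.parse_args([
    "http://www.example.com",
    "--header", "User-Agent: Mozilla/5.0 (Windows NT 6.1)",
    "--header", "Accept-Charset: utf-8",
])
headers = parse_headers(args.header)
```

The resulting dictionary has the shape that the requests library accepts for its headers argument.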

As far as the internal API is concerned, I would certainly suggest taking @shawnmjones' advice on supporting a custom session object. In addition to that, I would suggest you use wildcard keyword arguments (that start with **) to ensure the API signature does not change each time you include support for one more feature; this also allows forwarding arguments to internal function calls. Depending on the situation, you may introduce some sort of convention in the argument names to group them automatically (e.g., all the arguments received through the wildcard **kwargs may have certain prefixes so they are treated one way or the other).
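To make the prefix convention concrete, here is a small sketch. The prefixes request_ and handler_, the split_kwargs helper, and the dictionary returned by push are invented for illustration; the comment above does not prescribe any particular names.

```python
def split_kwargs(kwargs, prefixes=("request_", "handler_")):
    """Group keyword arguments by name prefix, stripping the prefix."""
    groups = {prefix: {} for prefix in prefixes}
    groups[""] = {}  # arguments that match no prefix
    for key, value in kwargs.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                groups[prefix][key[len(prefix):]] = value
                break
        else:
            groups[""][key] = value
    return groups


def push(URI, arc_id, **kwargs):
    groups = split_kwargs(kwargs)
    # A real implementation would forward each group to the matching call;
    # this sketch just reports how the arguments were grouped.
    return {
        "uri": str(URI),
        "archive": arc_id,
        "request": groups["request_"],
        "handler": groups["handler_"],
        "other": groups[""],
    }


result = push(
    "http://example.com", "ia",
    request_timeout=30, handler_api_key="secret", verbose=True,
)
```

New options can then be added later without touching the signature of push.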
