Unable to upload crawler-beans.cxml with curl #282

Closed
lpla opened this issue Aug 27, 2019 · 4 comments

@lpla

lpla commented Aug 27, 2019

Hi. We are trying to integrate Heritrix into Bitextor and are having issues with REST API commands sent via curl. We want to automate the whole crawling process, but after creating a job, uploading the config file crawler-beans.cxml does not work. Here are the command and the response:

me@me:~$ curl -v -T /home/lpla/crawler-beans.cxml -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob/jobdir/crawler-beans.cxml
*   Trying 127.0.0.1...  
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8443 (#0)
* ALPN, offering h2   
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs         
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):            
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):                                                                    
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol                                                                                                                                                   
* Server certificate:                           
*  subject: CN=Heritrix Ad-Hoc HTTPS Certificate
*  start date: Aug 27 09:51:13 2019 GMT
*  expire date: Aug 24 09:51:13 2029 GMT
*  issuer: CN=Heritrix Ad-Hoc HTTPS Certificate
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
> PUT /engine/job/myjob/jobdir/crawler-beans.cxml HTTP/1.1
> Host: localhost:8443
> User-Agent: curl/7.58.0
> Accept: */*
> Content-Length: 30831
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 401 Unauthorized
< Content-Type: text/html;charset=utf-8
< Date: Tue, 27 Aug 2019 12:03:33 GMT
< Server: Restlet-Framework/2.4.0
< WWW-Authenticate: Digest realm="Authentication Required", domain="/", nonce="MTU2NjkwNzQxMzI2ODphZGJlNzhiODY2YTU4ODczNzJhYWIzMDgwY2UxYzBlOQ==", algorithm=MD5, qop="auth"
< Content-Length: 424
<
* Ignoring the response-body
* Connection #0 to host localhost left intact
* Issue another request to this URL: 'https://localhost:8443/engine/job/myjob/jobdir/crawler-beans.cxml'
* Found bundle for host localhost: 0x561d73c16a50 [can pipeline]
* Re-using existing connection! (#0) with host localhost
* Connected to localhost (127.0.0.1) port 8443 (#0)
* Server auth using Digest with user 'admin'
> PUT /engine/job/myjob/jobdir/crawler-beans.cxml HTTP/1.1
> Host: localhost:8443
> Authorization: Digest username="admin", realm="Authentication Required", nonce="MTU2NjkwNzQxMzI2ODphZGJlNzhiODY2YTU4ODczNzJhYWIzMDgwY2UxYzBlOQ==", uri="/engine/job/myjob/jobdir/crawler-beans.cxml", cnonce="MmFiZTJlOTI2NzU0ZTFjMDFiZWFlODdjNTVhM2IyZGU=", nc=00000001, qop=auth, response="db0481c4cb2a6a4293dc41fd84dbf0d8", algorithm="MD5"
> User-Agent: curl/7.58.0
> Accept: */*
> Content-Length: 30831
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 405 Method Not Allowed
< Content-Type: text/html;charset=utf-8
< Date: Tue, 27 Aug 2019 12:03:33 GMT
< Accept-Ranges: bytes
< Server: Restlet-Framework/2.4.0
< Vary: Accept-Charset, Accept-Encoding, Accept-Language, Accept
< Content-Length: 141
<
<h1>An error occurred</h1>
You may be able to recover and try something else by going <a href='javascript:history.back();void(0);'>back</a>.
* Connection #0 to host localhost left intact
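
A note for readers following the log: the first 401 is the normal Digest challenge that curl answers when run with --anyauth, so the actual failure is the 405 on the authenticated retry. As a rough sketch of how the response field in the Authorization header is derived under RFC 2617 (MD5, qop=auth), the following may help. The class and method names are hypothetical, and the password is assumed to be "admin" as passed via -u; this is not curl's or Heritrix's code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of the RFC 2617 Digest "response" value (MD5, qop=auth).
// Names are hypothetical; this is not taken from curl, Restlet, or Heritrix.
class DigestSketch {

    // Lower-case hex MD5 of a string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.ISO_8859_1));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // response = MD5(HA1:nonce:nc:cnonce:qop:HA2)
    // HA1 = MD5(user:realm:password), HA2 = MD5(method:uri)
    static String response(String user, String realm, String password,
                           String method, String uri,
                           String nonce, String nc, String cnonce, String qop) {
        String ha1 = md5Hex(user + ":" + realm + ":" + password);
        String ha2 = md5Hex(method + ":" + uri);
        return md5Hex(ha1 + ":" + nonce + ":" + nc + ":" + cnonce + ":" + qop + ":" + ha2);
    }

    public static void main(String[] args) {
        // Values copied from the second PUT in the log; the password is an assumption.
        String nonce = "MTU2NjkwNzQxMzI2ODphZGJlNzhiODY2YTU4ODczNzJhYWIzMDgwY2UxYzBlOQ==";
        String cnonce = "MmFiZTJlOTI2NzU0ZTFjMDFiZWFlODdjNTVhM2IyZGU=";
        System.out.println(response("admin", "Authentication Required", "admin",
                "PUT", "/engine/job/myjob/jobdir/crawler-beans.cxml",
                nonce, "00000001", cnonce, "auth"));
    }
}
```

The printed value can be compared against the response field in the logged Authorization header. Since the server moved on from 401 to 405, the credentials were evidently accepted.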
@ato
Collaborator

ato commented Aug 27, 2019

Thanks for the bug report. Confirming I can reproduce this. It's most likely a regression from the Restlet 2 upgrade (#276). I will debug it and aim to have a fix shortly.

@ato
Collaborator

ato commented Aug 27, 2019

I have a partial fix for the Method Not Allowed error in 5612fa2. Unfortunately, that in turn exposed another problem with PUT under Restlet 2. Surprisingly (to me at least), it appears Restlet actually removes the file extension from the filename and then replaces it:

https://github.com/restlet/restlet-framework-java/blob/2.3/modules/org.restlet/src/org/restlet/engine/local/FileClientHelper.java#L287-L358

This means PUT crawler-beans.cxml ends up creating a file incorrectly named crawler-beans.xml. I will continue working on this tomorrow as it's rather late here now.
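
To make the renaming concrete, here is a deliberately simplified model of the behaviour described above. It is not Restlet's code: the class, the method, and the media-type table are all hypothetical, and the thread does not say which media type Restlet inferred for this upload, so "application/xml" is only an assumption chosen to reproduce the reported result.

```java
import java.util.Map;

// Simplified model of the renaming described in this comment.
// NOT Restlet's implementation; every name and table entry here is hypothetical.
class ExtensionRewriteSketch {

    // Hypothetical media type -> preferred extension table.
    static final Map<String, String> EXTENSION_FOR_TYPE = Map.of(
            "application/xml", "xml",
            "text/plain", "txt",
            "application/octet-stream", "bin");

    // Drop the last extension of the requested name, then append the extension
    // associated with the media type. Unknown media types leave the name as-is.
    static String storedName(String requestedName, String mediaType) {
        String ext = EXTENSION_FOR_TYPE.get(mediaType);
        if (ext == null) {
            return requestedName;
        }
        int dot = requestedName.lastIndexOf('.');
        String base = dot > 0 ? requestedName.substring(0, dot) : requestedName;
        return base + "." + ext;
    }

    public static void main(String[] args) {
        // Assumed media type; prints the name the file would end up with in this model.
        System.out.println(storedName("crawler-beans.cxml", "application/xml"));
    }
}
```

In this model the name in the URL is only a hint, which is why a client asking for crawler-beans.cxml can end up with a differently named file on disk.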

@lpla, in the meantime I suggest using the 3.4.0-20190418 release rather than master, as it predates the Restlet 2 upgrade.

ato self-assigned this on Aug 27, 2019
ato added the bug label on Aug 27, 2019
@nlevitt
Contributor

nlevitt commented Aug 27, 2019

This is very reminiscent of https://webarchive.jira.com/browse/HER-1907, filed in 2011:
"files uploaded to action directory with http have .bin extension added, causing heritrix to ignore them". At the time I wrote:

Restlet sets the file extension based on the content-type, in com.noelios.restlet.local.FileClientHelper.handleFilePut(). This is pretty well baked into the code, and appears to be a philosophical choice that hasn't really been questioned.
The Restlet-ful way around this would be to register media types for each of the file types whose upload we need to handle. Then clients would need to specify the right content-type when they do the PUT. E.g. in the org.archive.crawler.restlet.EngineApplication constructor:
getMetadataService().addExtension("seeds", MediaType.register("application/x-heritrix-seeds", "Heritrix seeds list"));

ato mentioned this issue on Aug 28, 2019
@ato
Collaborator

ato commented Aug 28, 2019

I guess Restlet's philosophy kind of makes sense for updating static website content in multiple languages and such, although even for that I'm sceptical. It's also inconsistent in that other methods like DELETE don't manipulate the file extension. For Heritrix's use case of providing API access to the jobs directory, I can't imagine any scenario where we'd want it to do this. So I'm proposing we override PUT to disable the file extension manipulation entirely.
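
As an illustration of what "no extension manipulation" could look like, here is a minimal sketch that stores the request body under exactly the name given in the URL. It uses plain java.nio rather than any Restlet API, the class and method names are hypothetical, and it is not the change that was proposed for or made in Heritrix.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: store an uploaded body under the exact requested name.
// Not Heritrix or Restlet code.
class ExactNamePutSketch {

    // Writes the body to jobDir/relativeName, leaving the name (and extension) untouched.
    static Path put(Path jobDir, String relativeName, InputStream body) {
        Path root = jobDir.toAbsolutePath().normalize();
        Path target = root.resolve(relativeName).normalize();
        // Refuse names such as "../x" that would land outside the job directory.
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("path escapes job directory: " + relativeName);
        }
        try {
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path jobDir = Files.createTempDirectory("myjob");
        InputStream body = new ByteArrayInputStream("<beans/>".getBytes(StandardCharsets.UTF_8));
        Path written = put(jobDir, "crawler-beans.cxml", body);
        System.out.println(written.getFileName()); // name is kept exactly as requested
    }
}
```

The only processing applied to the name is the containment check, which is worth keeping in any variant because the name comes straight from the request URL.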
