
Working with very strange redirects #35

Closed
briatte opened this issue Aug 4, 2015 · 8 comments

briatte commented Aug 4, 2015

I have a very strange use case that I would like to submit for consideration because I get different results with curl_download than I do with curl from the shell. Please feel free to close the issue straight away if the problem is not related enough to curl.

I'm starting with redirect URLs like this one:

http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350

By using httr, I get the URL it redirects to:

http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17

Now, if I navigate to this page, I sometimes get a Web page for a male Italian MP, but I also sometimes get an empty page with content type application/json that simply says "format json not implemented yet". Here's the output with httr:

> GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17")
Response [http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17]
  Date: 2015-08-04 12:22
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 31 B
> GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17") %>% content
Error in parse_string(txt, bigint_as_char) : 
  lexical error: invalid string in json text.
                                       format json not implemented yet
                     (right here) ------^

So I turn to curl, thinking that it might save me from that very strange issue. Why? Because every time that I try curl -v on that last URL in the shell, I get HTML instead of the "JSON" thing:

curl -v "http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17"

My issue is that when I use curl_download, it does not get the HTML like curl in-the-shell does: it gets the wrong "JSON" thing.

The question, therefore, is: how do I emulate curl in-the-shell's settings with curl in R, in order to get HTML instead of "JSON" through these awful links?

The output of curl -v for the above example is copied in this Gist. My curl in-the-shell version is copied below.

curl 7.30.0 (x86_64-apple-darwin13.0) libcurl/7.30.0 SecureTransport zlib/1.2.5
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smtp smtps telnet tftp 
Features: AsynchDNS GSS-Negotiate IPv6 Largefile NTLM NTLM_WB SSL libz 
hadley (Collaborator) commented Aug 4, 2015

You probably need to set the Accept header to prioritise HTML over JSON.
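In shell terms, that suggestion would look roughly like the following (a minimal sketch that only prints the command rather than running it against the server; the URL is the one from the thread, and the q-values simply rank text/html above application/json):

```shell
# Sketch: ask for HTML before JSON via the Accept request header.
url="http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17"
# q-values rank the media types; text/html (implicit q=1.0) outranks
# application/json at q=0.5, so a well-behaved server prefers HTML.
cmd="curl -H \"Accept: text/html,application/json;q=0.5\" \"$url\""
echo "$cmd"
```

Note the quotes around the URL: without them, the `&` characters would background the command in the shell.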

briatte (Author) commented Aug 4, 2015

That unfortunately did not work:

GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17", accept("text/html"))

still returns the "JSON" version of the page.

jeroen (Owner) commented Aug 4, 2015

They are running a misconfigured cache server, so you are getting false hits. Try this:

req <- GET("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350", 
  add_headers(Accept = "text/html", "Cache-control" = "max-age=0"))
content(req, "text")

Sometimes it helps if you just add an arbitrary parameter to the URL to bypass the cache:

url <- paste0("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350&_random=", runif(1))
req <- GET(url, accept("text/html"))
content(req, "text")

briatte (Author) commented Aug 4, 2015

Still no luck: I get a false hit, whatever the URL used.

On top of that, I've just discovered that curl -v always returns HTML, but the content of the page is often faulty ("page temporarily inaccessible, return later").

jeroen (Owner) commented Aug 4, 2015

I think this works:

library(httr)
url <- "http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17"
url <- paste0(url, "&_random=", rnorm(1))
req <- GET(url, accept("text/html"))
stopifnot(req$headers[["x-cache"]] == "MISS")
stopifnot(req$headers$age == "0")
content(req, "text")

Their server is really poorly configured: not only does it serve false hits, but it also ignores the Cache-Control: no-cache request header. Slightly changing the URL, however, usually forces the cache server to fetch a new copy.

briatte (Author) commented Aug 4, 2015

It seems to work indeed!

Thank you very much to both of you for your help.

How did you come up with the &_random= part?

jeroen (Owner) commented Aug 4, 2015

It's just something arbitrary that you add to the URL in order to trick the cache server into thinking that you are fetching a different page, so it cannot serve you a cached copy. It's a common trick to force bypassing any cache.
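The same trick is easy to sketch in plain shell (a hedged example; the base URL is the redirect link from the top of the thread, and the `_random` parameter name is arbitrary, since the server simply ignores query parameters it does not recognise):

```shell
# Cache-busting sketch: append an effectively unique query parameter so
# the cache key never matches a stored entry and the origin is hit anew.
base_url="http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350"
# Unix timestamp plus the shell's PID gives a value that is unique enough
# per request without needing bash-specific features like $RANDOM.
cache_buster="${base_url}&_random=$(date +%s)$$"
echo "$cache_buster"
# The page would then be fetched with something like:
#   curl -L -H "Accept: text/html" "$cache_buster"
```

Any parameter name works as long as it does not collide with one the server actually uses.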

jeroen closed this as completed Aug 4, 2015

briatte (Author) commented Aug 4, 2015

Excellent. Thanks again and enjoy your days.
