
Working with very strange redirects #35

Closed
briatte opened this issue Aug 4, 2015 · 8 comments

briatte commented Aug 4, 2015

I have a very strange use case that I would like to submit for consideration because I get different results with curl_download than I do with curl from the shell. Please feel free to close the issue straight away if the problem is not related enough to curl.

I'm starting with redirect URLs like this one:

http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350

By using httr, I get the URL it redirects to:

http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17

Now, if I navigate to this page, I sometimes get a Web page for a male Italian MP, but I also sometimes get an empty page with content type application/json that simply says "format json not implemented yet". Here's the output with httr:

> GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17")
Response [http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17]
  Date: 2015-08-04 12:22
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 31 B
> GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17") %>% content
Error in parse_string(txt, bigint_as_char) : 
  lexical error: invalid string in json text.
                                       format json not implemented yet
                     (right here) ------^

So I turn to curl, thinking that it might save me from that very strange issue. Why? Because every time that I try curl -v on that last URL in the shell, I get HTML instead of the "JSON" thing:

curl -v "http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17"

My issue is that when I use curl_download, it does not get the HTML like curl in-the-shell does: it gets the wrong "JSON" thing.

The question, therefore, is: how do I emulate curl in-the-shell's settings with curl in R, in order to get HTML instead of "JSON" through these awful links?

The output of curl -v for the above example is copied in this Gist. My curl in-the-shell version is copied below.

curl 7.30.0 (x86_64-apple-darwin13.0) libcurl/7.30.0 SecureTransport zlib/1.2.5
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smtp smtps telnet tftp 
Features: AsynchDNS GSS-Negotiate IPv6 Largefile NTLM NTLM_WB SSL libz 
hadley (Collaborator) commented Aug 4, 2015

You probably need to set the Accept header to prioritise HTML over JSON.
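In shell terms, that suggestion would look roughly like the following (a minimal sketch that only prints the command rather than running it against the server; the URL is the one from the thread, and the q-values simply rank text/html above application/json):

```shell
# Sketch: ask for HTML before JSON via the Accept request header.
url="http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=30350&idLegislatura=17"
# q-values rank the media types; text/html (implicit q=1.0) outranks
# application/json at q=0.5, so a well-behaved server prefers HTML.
cmd="curl -H \"Accept: text/html,application/json;q=0.5\" \"$url\""
echo "$cmd"
```

Note the quotes around the URL: without them, the `&` characters would background the command in the shell.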

briatte (Author) commented Aug 4, 2015

That unfortunately did not work:

GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17", accept("text/html"))

still returns the "JSON" version of the page.

jeroen (Owner) commented Aug 4, 2015

They are running a misconfigured cache server, so you are getting false hits. Try this:

req <- GET("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350", 
  add_headers(Accept = "text/html", "Cache-control" = "max-age=0"))
content(req, "text")

Sometimes it helps if you just add an arbitrary parameter to the URL to bypass the cache:

url <- paste0("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350&_random=", runif(1))
req <- GET(url, accept("text/html"))
content(req, "text")

briatte (Author) commented Aug 4, 2015

Still no luck: I get a false hit, whatever the URL used.

On top of that, I've just discovered that curl -v always returns HTML, but the content of the page is often faulty ("page temporarily inaccessible, return later").

jeroen (Owner) commented Aug 4, 2015

I think this works:

library(httr)
url <- "http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17"
url <- paste0(url, "&_random=", rnorm(1))
req <- GET(url, accept("text/html"))
stopifnot(req$headers[["x-cache"]] == "MISS")
stopifnot(req$headers$age == "0")
content(req, "text")

Their server is really poorly configured: not only does it serve false hits, but it also ignores the Cache-Control: no-cache request header. Slightly changing the URL, however, usually forces the cache server to fetch a new copy.

briatte (Author) commented Aug 4, 2015

It seems to work indeed!

Thank you very much to both of you for your help.

How did you come up with the &_random= part?

jeroen (Owner) commented Aug 4, 2015

It's just something arbitrary that you add to the URL in order to trick the cache server into thinking that you are fetching a different page, so it cannot serve you a cached copy. It's a common trick to force bypassing any cache.
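The same trick is easy to sketch in plain shell (a hedged example; the base URL is the redirect link from the top of the thread, and the `_random` parameter name is arbitrary, since the server simply ignores query parameters it does not recognise):

```shell
# Cache-busting sketch: append an effectively unique query parameter so
# the cache key never matches a stored entry and the origin is hit anew.
base_url="http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350"
# Unix timestamp plus the shell's PID gives a value that is unique enough
# per request without needing bash-specific features like $RANDOM.
cache_buster="${base_url}&_random=$(date +%s)$$"
echo "$cache_buster"
# The page would then be fetched with something like:
#   curl -L -H "Accept: text/html" "$cache_buster"
```

Any parameter name works as long as it does not collide with one the server actually uses.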

jeroen closed this as completed Aug 4, 2015

briatte (Author) commented Aug 4, 2015

Excellent. Thanks again and enjoy your days.
