Working with very strange redirects #35
Comments
You probably need to set the |
That unfortunately did not work: GET("http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17", accept("text/html")) still returns the "JSON" version of the page. |
They are running a misconfigured cache server, so you are getting false hits. Try this:
req <- GET("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350",
  add_headers(Accept = "text/html", "Cache-control" = "max-age=0"))
content(req, "text")
Sometimes it helps if you just add an arbitrary parameter to the URL to bypass the cache:
url <- paste0("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350&_random=", runif(1))
req <- GET(url, accept("text/html"))
content(req, "text") |
Still no luck: I get a false hit, whatever the URL used. On top of that, I've just discovered that |
I think this works:
library(httr)
url <- "http://www.camera.it/leg17/29?tipoAttivita=&tipoVisAtt=&tipoPersona=&shadow_deputato=300453&idLegislatura=17"
url <- paste0(url, "&_random=", rnorm(1))
req <- GET(url, accept("text/html"))
stopifnot(req$headers[["x-cache"]] == "MISS")
stopifnot(req$headers$age == "0")
content(req, "text")
Their server is really poorly configured: not only does it give false hits, but it ignores the |
It seems to work indeed! Thank you very much to both of you for your help. How did you come up with the |
It's just something arbitrary that you add to the URL in order to trick the cache server into thinking that you are fetching a different page, so it cannot serve you a cached copy. It's a common trick to force bypassing any cache. |
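The cache-busting trick described above can be wrapped in a small helper. This is only a minimal sketch: `bust_cache` and its `param` argument are hypothetical names, not part of httr or curl, and it assumes the server ignores unknown query parameters, as the URLs in this thread appear to.

```r
# Hypothetical helper (not part of httr or curl): append a throwaway query
# parameter so that a caching proxy sees a URL it has not stored yet.
bust_cache <- function(url, param = "_random") {
  # Use "?" if the URL has no query string yet, "&" otherwise.
  sep <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
  # Twelve random alphanumeric characters are very unlikely to repeat.
  token <- paste(sample(c(0:9, letters), 12, replace = TRUE), collapse = "")
  paste0(url, sep, param, "=", token)
}

url <- bust_cache("http://www.senato.it/loc/link.asp?tipodoc=CAM.DEP&leg=17&id=30350")
# The request itself would then be made as in the comments above, e.g.:
# req <- httr::GET(url, httr::accept("text/html"))
```

An alphanumeric token is used here instead of runif(1) or rnorm(1) simply to avoid dots and minus signs in the URL; either approach changes the URL on every call, which is all the trick needs.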
Excellent. Thanks again and enjoy your days. |
I have a very strange use case that I would like to submit for consideration because I get different results with curl_download than I do with curl from the shell. Please feel free to close the issue straight away if the problem is not related enough to curl.
I'm starting with redirect URLs like this one:
By using httr, I get the URL it redirects to:
Now, if I navigate to this page, I sometimes get a Web page for a male Italian MP, but I also sometimes get an empty page with content type application/json that simply says "format json not implemented yet". Here's the output with httr:
So I turn to curl, thinking that it might save me from that very strange issue. Why? Because every time that I try curl -v on that last URL in the shell, I get HTML instead of the "JSON" thing:
My issue is, when I use curl_download, it does not get the HTML like curl in-the-shell does: it gets the wrong "JSON" thing.
Question, therefore, is: how do I emulate curl in-the-shell's settings with curl in R, in order to get HTML instead of "JSON" through these awful links?
The output of curl -v for the above example is copied in this Gist. My curl in-the-shell version is copied below.