Error 404 when scraping page accessible through browser. #64
Comments
Thanks for spotting this @ugran, I've investigated a bit and it looks like the server is rejecting us because it does not like our User-Agent string.
When you use a browser, you get redirected to http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html
Now, if I try to open that final URL with open-uri, the server responds with 404. But, if I tell open-uri to pass a User-Agent string of "Mozilla", the server responds with 200 OK:
➔ irb
2.0.0-p451 :001 > require 'open-uri'
=> true
2.0.0-p451 :002 > url = "http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html"
=> "http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html"
2.0.0-p451 :003 > open(url)
OpenURI::HTTPError: 404 Not Found
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:353:in `open_http'
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:689:in `open'
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:34:in `open'
from (irb):3
from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/bin/irb:12:in `<main>'
2.0.0-p451 :004 > open(url, "User-Agent" => "Mozilla")
=> #<Tempfile:/var/folders/jk/lwvr7vm954x62vzq6jcyv2gr0000gp/T/open-uri20140422-36432-1l4nnwl>
This will be fixed when we merge this pull request by @ainformatico, which will let us specify User-Agent strings.
As a workaround, you could get the contents of the page with open-uri and pass them to MetaInspector using the
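A minimal sketch of the first half of that workaround, fetching the page with open-uri while sending a browser-like User-Agent. The helper name `fetch_html` is hypothetical and not part of MetaInspector, and `URI.open` is assumed here because the bare `open(url)` call from the transcript no longer handles URLs on Ruby 3. Handing the returned HTML on to MetaInspector is not shown.

```ruby
require 'open-uri'

# Hypothetical helper (not part of MetaInspector): returns the page body as a
# String, or nil if the server answers with an HTTP error status.
def fetch_html(url, user_agent: 'Mozilla')
  # String keys in the options hash are sent as request headers by open-uri.
  URI.open(url, 'User-Agent' => user_agent, &:read)
rescue OpenURI::HTTPError => e
  # e.message looks like "404 Not Found", as in the transcript above.
  warn "Request for #{url} failed: #{e.message}"
  nil
end
```

Calling `fetch_html(url)` with the redirected URL from the transcript would send `User-Agent: Mozilla`; whether the server still accepts that value is something to check against the live site.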
Got it! thanks :)
@ugran this should be fixed on 2.2.0, which has just been released -- you can now pass custom headers. Please reopen this issue if needed!
Thanks! I'll update it right away :)
Gives me an error:
Sorry! This is fixed now in 2.2.1
Hi, I am having the same problem. I am scraping an HTML site and getting the data properly, but after I get the data an error fires saying 404. Do you know why this could be? Thank you.
@TGots7 thanks for the feedback, but to investigate this please provide more details: what MetaInspector version you're using, what URL you're scraping, what you expected, and what you got instead.
Hi,
I've run into a new kind of problem.
Several days ago I could scrape pages from Zara's website, but now when I try, it gives me an open-uri 404 error. There is also a problem when I try to pin the URL on Pinterest, but Facebook handles it with ease.
For example:
http://www.zara.com/us/en/woman/coats/loose-fit-trench-coat-c367501p1933522.html