Error 404 when scraping a page accessible through the browser. #64

Closed
ugran opened this issue Apr 22, 2014 · 8 comments

Comments

@ugran

ugran commented Apr 22, 2014

Hi,

I've run into a new kind of problem.
Several days ago I could scrape pages from Zara's website, but now when I try, I get an open-uri 404 error. Pinning the URL on Pinterest also fails, but Facebook handles it with ease.

For example:
http://www.zara.com/us/en/woman/coats/loose-fit-trench-coat-c367501p1933522.html

@jaimeiniesta
Owner

Thanks for spotting this @ugran,

I've investigated a bit and it looks like the server is rejecting us because it does not like our User-Agent string.

When you use a browser, you get redirected to http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html

Now, if I try to open that final URL with open-uri, the server responds with 404. But, if I tell open-uri to pass a User-Agent string of "Mozilla", the server responds with 200 OK:

➔ irb
2.0.0-p451 :001 > require 'open-uri'
 => true
2.0.0-p451 :002 > url = "http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html"
 => "http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html"
2.0.0-p451 :003 > open(url)
OpenURI::HTTPError: 404 Not Found
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:353:in `open_http'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:689:in `open'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:34:in `open'
    from (irb):3
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/bin/irb:12:in `<main>'
2.0.0-p451 :004 > open(url, "User-Agent" => "Mozilla")
 => #<Tempfile:/var/folders/jk/lwvr7vm954x62vzq6jcyv2gr0000gp/T/open-uri20140422-36432-1l4nnwl>

This will be fixed when we merge this pull request by @ainformatico, which will let us specify User-Agent strings:

#63

As a workaround, you could get the contents of the page with open-uri and pass them to MetaInspector using the :document option:

https://github.com/jaimeiniesta/metainspector#usage
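
A rough sketch of that workaround, in case it helps. It assumes Ruby 2.5 or later, where open-uri provides `URI.open` (the irb session above uses the older `open(url, ...)` form). `fetch_html` and its `opener` parameter are hypothetical names made up for this example, not part of MetaInspector, and the `:document` call in the comment follows the README linked above, so check it against the version you have installed:

```ruby
require 'open-uri'

# Hypothetical helper (not part of MetaInspector): downloads a page while
# sending a browser-like User-Agent header and returns the body as a String.
# `opener` is any callable taking (url, headers) and returning an IO-like
# object; it defaults to URI.open so it can be swapped out in tests.
def fetch_html(url, user_agent: "Mozilla", opener: URI.method(:open))
  opener.call(url, { "User-Agent" => user_agent }).read
end

# Usage sketch (needs network access and the metainspector gem):
#
#   require 'metainspector'
#   url  = "http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html"
#   page = MetaInspector.new(url, :document => fetch_html(url))
#   page.title
```

Keeping the download separate from the parsing means the User-Agent can be changed without touching the MetaInspector call, and the helper can be dropped once custom headers are supported directly.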

@ugran
Author

ugran commented Apr 22, 2014

Got it! thanks :)

@jaimeiniesta
Owner

@ugran this should be fixed in 2.2.0, which has just been released -- you can now pass custom headers.

Please reopen this issue if needed!

@ugran
Author

ugran commented Apr 30, 2014

Thanks!

I'll update it right away :)

@ugran
Author

ugran commented Apr 30, 2014

Gives me an error:
uninitialized constant MetaInspector::VERSION

@jaimeiniesta
Owner

Sorry! This is fixed now in 2.2.1

@TGots7

TGots7 commented Jul 5, 2017

Hi, I am having the same problem. I am scraping an HTML site and getting the data properly, but then an error fires after I get the data, saying 404. Do you know why this could be? Thank you.

@jaimeiniesta
Owner

@TGots7 thanks for the feedback, but to investigate this please provide more details: what MetaInspector version you're using, what URL you're scraping, what you expected and what you got instead.
