Error 404 when scraping page accessible through browser. #64

ugran · 2014-04-22T00:00:33Z

Hi,

I've run into a new kind of problem.
Several days ago I could scrape pages from Zara's website, but now when I try, I get an open-uri 404 error. The same problem occurs when I try to pin the URL on Pinterest, but Facebook handles it with ease.

For example:
http://www.zara.com/us/en/woman/coats/loose-fit-trench-coat-c367501p1933522.html

jaimeiniesta · 2014-04-22T09:04:22Z

Thanks for spotting this @ugran,

I've investigated a bit and it looks like the server is rejecting us because it does not like our User-Agent string.

When you use a browser, you get redirected to http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html

Now, if I try to open that final URL with open-uri, the server responds with 404. But, if I tell open-uri to pass a User-Agent string of "Mozilla", the server responds with 200 OK:

➔ irb
2.0.0-p451 :001 > require 'open-uri'
 => true
2.0.0-p451 :002 > url = "http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html"
 => "http://www.zara.com/?go=http%3A//www.zara.com/share/woman/coats/loose-fit-trench-coat-c367501p1933522.html"
2.0.0-p451 :003 > open(url)
OpenURI::HTTPError: 404 Not Found
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:353:in `open_http'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:689:in `open'
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/open-uri.rb:34:in `open'
    from (irb):3
    from /Users/jaime/.rvm/rubies/ruby-2.0.0-p451/bin/irb:12:in `<main>'
2.0.0-p451 :004 > open(url, "User-Agent" => "Mozilla")
 => #<Tempfile:/var/folders/jk/lwvr7vm954x62vzq6jcyv2gr0000gp/T/open-uri20140422-36432-1l4nnwl>

This will be fixed when we merge this pull request by @ainformatico, which will let us specify User-Agent strings:

#63

As a workaround, you could get the contents of the page with open-uri and pass them to MetaInspector using the :document option:

https://github.com/jaimeiniesta/metainspector#usage
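
A minimal sketch of that workaround might look like the following. The `fetch_html` helper and its `opener` parameter are hypothetical names introduced here for illustration, not part of MetaInspector or open-uri. It assumes Ruby 2.5 or later, where `URI.open` is available; on the 2.0.0 used in the session above, `open(url, headers)` plays the same role.

```ruby
require 'open-uri'

# Hypothetical helper, not part of MetaInspector or open-uri.
# Fetches the raw HTML for a URL while sending custom request headers,
# defaulting to the "Mozilla" User-Agent that the server accepted above.
# The opener is injectable so the helper can be exercised without network
# access; by default it delegates to URI.open from open-uri.
def fetch_html(url, headers = { "User-Agent" => "Mozilla" }, opener: URI.method(:open))
  io = opener.call(url, headers)
  begin
    io.read
  ensure
    # open-uri returns a Tempfile or StringIO; close it once the body is read.
    io.close if io.respond_to?(:close)
  end
end
```

With the HTML in hand, it could then be passed along as `MetaInspector.new(url, :document => fetch_html(url))`, assuming the `:document` option described in the README linked above.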

ugran · 2014-04-22T09:07:52Z

Got it! thanks :)

jaimeiniesta · 2014-04-30T12:41:47Z

@ugran this should be fixed in 2.2.0, which has just been released -- you can now pass custom headers.

Please reopen this issue if needed!
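
As a rough sketch of passing custom headers: the `:headers` option name below is an assumption based on the MetaInspector README and should be checked against the README of the installed version. The `browser_like_options` helper is a hypothetical name used only for illustration.

```ruby
# Hypothetical helper, not part of MetaInspector.
# Builds an options hash that overrides the default User-Agent.
# Assumption: MetaInspector accepts a :headers option, as its README describes.
def browser_like_options(user_agent = "Mozilla")
  { :headers => { "User-Agent" => user_agent } }
end

# Usage (requires the metainspector gem and network access):
#   require 'metainspector'
#   page = MetaInspector.new(url, browser_like_options)
#   page.title
```

Keeping the headers in one place makes it easy to swap in a different User-Agent string if a server starts rejecting the current one.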

ugran · 2014-04-30T15:58:17Z

Thanks!

I'll update it right away :)

ugran · 2014-04-30T16:07:32Z

Gives me an error:
uninitialized constant MetaInspector::VERSION

jaimeiniesta · 2014-04-30T17:24:53Z

Sorry! This is fixed now in 2.2.1

TGots7 · 2017-07-05T14:26:44Z

Hi, I am having the same problem. I am opening an HTML site and scraping it, and I get the data properly, but then an error fires after I get the data, saying 404 error. Do you know why this could be? Thank you.

jaimeiniesta · 2017-07-05T17:17:37Z

@TGots7 thanks for the feedback, but to investigate this please provide more details: what MetaInspector version you're using, what URL you're scraping, what you expected and what you got instead.

jaimeiniesta closed this as completed Apr 30, 2014
