
Web::Scraper->scrape should honor 'base URL' for local files; Test Suite failures #6

Closed
ccurtis0 opened this issue Jun 15, 2012 · 2 comments


@ccurtis0

[Followup from email: the 'base URL' is ignored for URI-type scrape arguments]

To start, I cloned the main Web::Scraper repository and attempted to run all tests, but the "live" tests are failing:

$ git diff
$ TEST_ALL=1 prove -l t
t/00_compile.t .......... ok
[...]
t/07-live.t ............. 1/1
Failed test at t/07-live.t line 21.
Structures begin differing at:
$got->{url} = 'http://d.hatena.ne.jp/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE'
$expected->{url} = 'http://d.hatena.ne.jp/keyword/%ba%b0%cc%ee%a4%a2%a4%b5%c8%fe'

[...]
t/18_http_response.t .... 2/2
Failed test 'Absolute URI' at t/18_http_response.t line 27.
got: 'http://b.hatena.ne.jp/images/title_hotentry_curvebox-header.gif'
expected: 'http://b.hatena.ne.jp/images/logo1.gif'
Looks like you failed 1 test of 2.

[...]
t/19_decode_content.t ... 2/2
Failed test 'Absolute URI' at t/19_decode_content.t line 28.
got: 'http://b.hatena.ne.jp/images/title_hotentry_curvebox-header.gif'
expected: 'http://b.hatena.ne.jp/images/logo1.gif'
Looks like you failed 1 test of 2.

I updated the expected values in the tests and then made my change. After the change I reran the tests, and they all continued to pass. However, there does not seem to be an existing test covering my change, and writing one is a bit beyond my current skill set, so here is the situation from my email:

I've downloaded an HTML file from somewhere and cached it. The links within the file are all relative. When I ->scrape() the file, the links are converted into file:/// URIs, even though I am calling ->scrape( $file, 'http://example.org/source' );

The documentation says that the second argument is applied to relative links, but only if the first argument is text (as opposed to a URI). I could wrap every scraper call to 'fetch' the URL myself, but this seems like a reasonable thing for Web::Scraper to support natively.
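For concreteness, here is a minimal sketch of the workaround the documentation implies: slurp the cached file into a string yourself, then pass the base URL as the second argument (the file name, selector, and use of Path::Tiny are illustrative assumptions, not part of my actual code):

use Web::Scraper;
use Path::Tiny qw(path);

my $links = scraper {
    process 'a', 'urls[]' => '@href';
};

# Read the cached HTML as a string; with string input, Web::Scraper
# applies the second argument as the base URL for relative links.
my $html = path('cached.html')->slurp_utf8;
my $res  = $links->scrape($html, 'http://example.org/source');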

For example, given

->scrape( GET('http://example.net/'), 'http://example.com:8888/test/' );

I think it is reasonable to expect this to resolve a link to '/foo' into 'http://example.com:8888/foo'. The current implementation simply discards the second argument silently.
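For reference, this is the standard RFC 3986 resolution that Perl's URI module performs (a quick sketch; note a leading slash resolves against the host root, not the base path):

use URI;

my $base = 'http://example.com:8888/test/';
print URI->new('/foo')->abs($base), "\n";  # http://example.com:8888/foo
print URI->new('foo')->abs($base),  "\n";  # http://example.com:8888/test/foo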

I'll attach the patch or paste it into a follow-up message.

@miyagawa (Owner)

You can ignore these test failures; the live tests are not meant to be run in users' environments.

@miyagawa (Owner)

This is an artifact of using GET (from HTTP::Request::Common, I guess?), which returns an HTTP::Request object; in that case we retrieve the base URL from the request object. As you already figured out, you can turn it into a string and it will work.
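A short sketch of that workaround, assuming LWP::UserAgent for the fetch (the scraper definition here is illustrative, not from the original report):

use Web::Scraper;
use LWP::UserAgent;

my $s = scraper {
    process 'a', 'urls[]' => '@href';
};

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://example.net/');

# Passing the body as a plain string (rather than an HTTP::Request)
# lets the second argument act as the base URL for relative links.
my $result = $s->scrape($res->decoded_content, 'http://example.com:8888/test/');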
