
Web::Scraper->scrape should honor 'base URL' for local files; Test Suite failures #6

Closed
ccurtis0 opened this issue Jun 15, 2012 · 2 comments


@ccurtis0

[Followup from email: the 'base URL' is ignored for URI-type scrape arguments]

To start, I cloned the main Web::Scraper repository and attempted to run all tests, but the "live" tests are failing:

$ git diff
$ TEST_ALL=1 prove -l t
t/00_compile.t .......... ok
[...]
t/07-live.t ............. 1/1
Failed test at t/07-live.t line 21.
Structures begin differing at:
$got->{url} = 'http://d.hatena.ne.jp/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE'
$expected->{url} = 'http://d.hatena.ne.jp/keyword/%ba%b0%cc%ee%a4%a2%a4%b5%c8%fe'

[...]
t/18_http_response.t .... 2/2
Failed test 'Absolute URI' at t/18_http_response.t line 27.
got: 'http://b.hatena.ne.jp/images/title_hotentry_curvebox-header.gif'
expected: 'http://b.hatena.ne.jp/images/logo1.gif'
Looks like you failed 1 test of 2.

[...]
t/19_decode_content.t ... 2/2
Failed test 'Absolute URI' at t/19_decode_content.t line 28.
got: 'http://b.hatena.ne.jp/images/title_hotentry_curvebox-header.gif'
expected: 'http://b.hatena.ne.jp/images/logo1.gif'
Looks like you failed 1 test of 2.

I updated the expected values in the tests and then made my change. After the change I reran the tests, and they all continued to pass. However, there does not seem to be an existing test covering my change, and writing one is a bit beyond my current skill set, so here is the situation from my email:

I've downloaded an HTML file from somewhere and cached it. The links within the file are all relative. When I ->scrape() the file, the links are converted into file:/// URIs, even though I am calling ->scrape( $file, 'http://example.org/source' );

The documentation says that the second argument is applied to relative links, but only if the first argument is text (as opposed to a URI). I could wrap every scraper call to 'fetch' the URL myself, but this seems like a reasonable thing for Web::Scraper to support natively.
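For concreteness, here is a minimal sketch of the workaround the documentation implies: slurp the cached file into a string yourself, then pass the base URL as the second argument (the file name, selector, and use of Path::Tiny are illustrative assumptions, not part of my actual code):

use Web::Scraper;
use Path::Tiny qw(path);

my $links = scraper {
    process 'a', 'urls[]' => '@href';
};

# Read the cached HTML as a string; with string input, Web::Scraper
# applies the second argument as the base URL for relative links.
my $html = path('cached.html')->slurp_utf8;
my $res  = $links->scrape($html, 'http://example.org/source');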

For example, given

->scrape( GET('http://example.net/'), 'http://example.com:8888/test/' );

I think it is reasonable to expect this to resolve a link to '/foo' into 'http://example.com:8888/foo'. The current implementation simply discards the second argument silently.
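For reference, this is the standard RFC 3986 resolution that Perl's URI module performs (a quick sketch; note a leading slash resolves against the host root, not the base path):

use URI;

my $base = 'http://example.com:8888/test/';
print URI->new('/foo')->abs($base), "\n";  # http://example.com:8888/foo
print URI->new('foo')->abs($base),  "\n";  # http://example.com:8888/test/foo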

I'll attach the patch or paste it into a follow-up message.

@miyagawa (Owner)

You can ignore these test failures; the live tests are not meant to be run in users' environments.

@miyagawa (Owner)

This is an artifact of using GET (from HTTP::Request::Common, I guess?), which returns an HTTP::Request object; in that case we retrieve the base URL from the request object. As you already figured out, you can turn it into a string and it will work.
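A short sketch of that workaround, assuming LWP::UserAgent for the fetch (the scraper definition here is illustrative, not from the original report):

use Web::Scraper;
use LWP::UserAgent;

my $s = scraper {
    process 'a', 'urls[]' => '@href';
};

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://example.net/');

# Passing the body as a plain string (rather than an HTTP::Request)
# lets the second argument act as the base URL for relative links.
my $result = $s->scrape($res->decoded_content, 'http://example.com:8888/test/');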
