Spidering pages with no content-type header #32

Closed
bcobb opened this Issue May 11, 2012 · 4 comments

bcobb commented May 11, 2012

We ran into a scenario where we tried to spider a customer's site for certain keywords -- keywords that were present when we viewed the site in a browser -- but could not locate any of them using some flavor of page.search('//body').text.include?(keyword). Ultimately, page.search('//body') returned an empty array because this customer's web server does not return a Content-Type header, so the response body is never parsed as HTML or XML.

What are your thoughts on attempting to parse pages which have no content-type header as HTML? This matches the behavior of current web browsers, and at first glance makes this spider more intuitive to use. I'm happy to work on it, but I may be missing a compelling reason to simply ignore such pages.
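
The proposed behavior could be sketched roughly as follows. This is a minimal illustration, not Spidr's actual implementation: the `treat_as_html?` helper, the `HTML_TYPES` list, and the plain Hash of headers are all assumptions made for the example. It treats a response as HTML when the Content-Type header is missing or empty, and otherwise defers to the declared media type.

```ruby
# Hypothetical helper (not part of Spidr): decide whether a response
# should be handed to an HTML parser, given its headers as a Hash.
HTML_TYPES = ['text/html', 'application/xhtml+xml'].freeze

def treat_as_html?(headers)
  # Header names are case-insensitive, so normalize keys before lookup.
  normalized = headers.each_with_object({}) do |(name, value), acc|
    acc[name.to_s.downcase] = value.to_s
  end

  content_type = normalized['content-type'].to_s.strip

  # Missing or empty Content-Type: fall back to HTML, as proposed above.
  return true if content_type.empty?

  # Otherwise compare only the media type, ignoring parameters like charset.
  media_type = content_type.split(';').first.to_s.strip.downcase
  HTML_TYPES.include?(media_type)
end
```

With this sketch, a response that only carries a Server header would be parsed as HTML, while one that declares image/png would still be skipped.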

postmodern (Owner) commented May 12, 2012

This is interesting. Do you have a URI I can test, and did you check that page.body contained the HTML? The browser could be testing for a DOCTYPE when Content-Type is missing. The other possibility is that the web server is returning empty responses because Spidr does not set a User-Agent by default.
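
The DOCTYPE idea could look something like the sketch below. The `looks_like_html?` helper and its `sniff_bytes` parameter are hypothetical, and real browser content sniffing is considerably more involved than this; the sketch only checks the start of the body for a DOCTYPE or an opening html tag.

```ruby
# Hypothetical helper: guess whether a body is HTML by looking for a
# DOCTYPE or an opening <html> tag near the start of the document.
def looks_like_html?(body, sniff_bytes = 1024)
  return false if body.nil? || body.empty?

  # Only inspect the first chunk. Slicing by bytes can cut a multi-byte
  # character in half, so scrub invalid bytes before matching.
  head = body.byteslice(0, sniff_bytes).scrub('')

  !!(head =~ /<!doctype\s+html|<html[\s>]/i)
end
```

A check like this could be combined with the missing-header case, so that only bodies that actually resemble HTML are handed to the parser.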

bcobb commented May 14, 2012

The URI that exposed the issue was http://offfurn.com. curl -I http://offfurn.com shows:

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: fwww=a142366863a0d28237078f5fc6c3894262f4834bfc675afc7a85356742ff9705; Path=/
Date: Mon, 14 May 2012 16:14:23 GMT
Xonnection: close

I have no idea why they send Xonnection instead of Connection, but that's a discussion for someone else's issue tracker 😃
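
As a quick way to confirm the missing header programmatically, the response head above can be parsed into a Hash. This is only a sketch: `parse_header_lines` is a hypothetical helper written for this example, and the raw text is copied from the curl output above.

```ruby
# Raw response head as reported by `curl -I` in this thread.
RAW_HEAD = <<~HEAD
  HTTP/1.1 200 OK
  Server: Apache-Coyote/1.1
  Set-Cookie: fwww=a142366863a0d28237078f5fc6c3894262f4834bfc675afc7a85356742ff9705; Path=/
  Date: Mon, 14 May 2012 16:14:23 GMT
  Xonnection: close
HEAD

# Hypothetical helper: turn "Name: value" lines into a Hash keyed by
# lowercased header name, skipping the status line.
def parse_header_lines(raw)
  raw.each_line.drop(1).each_with_object({}) do |line, headers|
    name, value = line.chomp.split(':', 2)
    next if value.nil? # ignore lines that are not "Name: value"
    headers[name.strip.downcase] = value.strip
  end
end

headers = parse_header_lines(RAW_HEAD)
puts headers.key?('content-type') # prints false
puts headers.keys.inspect         # prints ["server", "set-cookie", "date", "xonnection"]
```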

We do specify User-Agent when we spider sites, so I don't think it's an issue of missing a user agent:

# UA is our application's User-Agent string
>> size = 0
=> 0
>> Spidr.site('http://offfurn.com', :hosts => [/.*offfurn.com.*/], :user_agent => UA) do |spidr| 
     spidr.every_page { |page| size = page.body.size }
   end
=> #<Spidr::Agent:...>
>> size
=> 31885

page.body above matches the expected HTML, too. Its size is not quite equal to the output of curl, but the two counts differ by only 15 (31885 vs. 31900):

% curl http://offfurn.com/ | wc
> 329    1866   31900

That should give a better idea of what we're seeing.
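
One possible explanation for the small size mismatch, not verified against this site: wc reports bytes in its third column, while String#size in Ruby 1.9 and later counts characters, so any multi-byte characters in the page would make the two numbers differ. A minimal illustration with a made-up string:

```ruby
# "café": 4 characters, but the last one takes 2 bytes in UTF-8.
text = "caf\u00e9"

puts text.size     # character count, prints 4
puts text.bytesize # byte count (what `wc` reports), prints 5
```

Comparing page.body.bytesize against the curl output would be a closer like-for-like check.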

bcobb commented May 14, 2012

And, for what it's worth, I see those same headers in the Chrome inspector.

postmodern (Owner) commented Jun 1, 2012

If the server is returning non-compliant headers, I think the server is broken. :(

postmodern closed this Jun 1, 2012
