Fix broken scrapers-ca scrapers #40

Closed · 8 tasks done
jpmckinney opened this issue Jan 15, 2014 · 10 comments

@jpmckinney (Member)

Work through the broken scrapers at http://scrapers.herokuapp.com/

  • ca_qc_mercier
  • ca_qc_montreal_est

Ignore:

  • jurisdictions ending in _municipalities
  • ca_on_toronto (website frequently fails)

Previous issue content

Reserved for Matthew Leon:

  • ca_mb

Priority (already in Represent):

  • ca_nb_frederiction
  • ca_nb_moncton
  • ca_on_richmond_hill
  • ca_qc_brossard
  • ca_qc_senneville
jpmckinney added this to the Priority milestone Feb 10, 2014
jpmckinney removed this from the Priority milestone Feb 26, 2014
jpmckinney mentioned this issue Mar 21, 2014
jpmckinney added this to the Priority milestone Apr 1, 2014
@matthewleon (Contributor)

FYI, I am unable to reproduce the "list index out of range" errors on a number of these scrapers: ca_on_guelph, ca_on_markham, ca_on_richmond_hill, and ca_qc_saint_jerome all complete without errors on my machine.

@matthewleon (Contributor)

The ca_qc_mercier page returns 403 to the scraper but loads fine in a browser. Maybe some kind of user-agent sniffing is going on? I will try to check later.

@matthewleon (Contributor)

Same with ca_qc_montreal_est.

@jpmckinney (Member, Author)

Yeah, there are four scrapers that only fail on Heroku, as noted in the issue description. When you call lxmlize, you can pass a user_agent string; ca_pe_stratford uses an IE10 string.
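
A minimal standalone sketch of that workaround (outside the scraper helper), assuming the site only sniffs the User-Agent header; the URL is a placeholder and the exact string ca_pe_stratford uses may differ:

```python
# Fetch a page with a browser-like User-Agent so a server that 403s
# unknown clients serves the page. URL is a placeholder, not the real
# ca_qc_mercier council page.
import lxml.html
import requests

URL = 'http://example.com/council'  # placeholder
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'  # an IE10-style string

response = requests.get(URL, headers={'User-Agent': USER_AGENT})
response.raise_for_status()
page = lxml.html.fromstring(response.content)
```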

@jpmckinney (Member, Author)

I fixed all the Heroku-only failures. They were mostly caused by positional predicates like [2] in XPath. ca_nb, for example, was picking the wrong image: on Heroku, //img[2] means "the second img within the same parent," while locally it's interpreted as "the second img anywhere in the document."
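
For anyone hitting this later, the two readings are easy to compare directly in lxml; (//img)[2] is the unambiguous way to ask for the second img in the document:

```python
# Demonstrates the two readings of //img[2] discussed above.
import lxml.html

doc = lxml.html.fromstring("""
<div>
  <p><img src="a.png"/></p>
  <p><img src="b.png"/><img src="c.png"/></p>
</div>
""")

# The [2] predicate binds to the child::img step, selecting every img
# that is the second img child of its parent:
print([img.get('src') for img in doc.xpath('//img[2]')])    # ['c.png']

# Parenthesizing first collects all imgs in document order, then takes
# the second one overall:
print([img.get('src') for img in doc.xpath('(//img)[2]')])  # ['b.png']
```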

@matthewleon (Contributor)

Why is there this difference in interpretation? Is there a different version of lxml running on Heroku?

@jpmckinney (Member, Author)

The Python package is the same version; maybe the underlying C library (libxml2) differs? I assume only one of the two interpretations is correct, though.

@matthewleon (Contributor)

Indeed. This is very strange.

@Menerve (Contributor) commented May 23, 2014

The local interpretation is the correct one according to http://www.w3.org/TR/1999/REC-xpath-19991116/
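
For reference, the spec's Abbreviated Syntax section defines the expansion that distinguishes the two forms; a quick check with lxml:

```python
# Per XPath 1.0 Abbreviated Syntax, //img[2] expands to
# /descendant-or-self::node()/child::img[2], with the predicate bound
# to the final child step; (//img)[2] indexes document order instead.
import lxml.html

doc = lxml.html.fromstring('<div><p><img src="a"/></p><p><img src="b"/><img src="c"/></p></div>')
assert doc.xpath('//img[2]') == doc.xpath('/descendant-or-self::node()/child::img[2]')
assert doc.xpath('(//img)[2]') != doc.xpath('//img[2]')
```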

@jpmckinney (Member, Author)

@matthewleon I added user agent strings for ca_qc_mercier and ca_qc_montreal_est. The scrapers now fail for a different reason (pupa.scrape.base.ScrapeError: no objects returned from people scrape), likely because the selectors no longer match.
