Fix broken scrapers-ca scrapers #40

Closed · 8 tasks done
jpmckinney opened this issue Jan 15, 2014 · 10 comments

@jpmckinney (Member)

Work through the broken scrapers at http://scrapers.herokuapp.com/

  • ca_qc_mercier
  • ca_qc_montreal_est

Ignore:

  • jurisdictions ending in _municipalities
  • ca_on_toronto (website frequently fails)

Previous issue content

Reserved for Matthew Leon:

  • ca_mb

Priority (already in Represent):

  • ca_nb_frederiction
  • ca_nb_moncton
  • ca_on_richmond_hill
  • ca_qc_brossard
  • ca_qc_senneville
jpmckinney added this to the Priority milestone Feb 10, 2014
jpmckinney removed this from the Priority milestone Feb 26, 2014
jpmckinney mentioned this issue Mar 21, 2014
jpmckinney added this to the Priority milestone Apr 1, 2014
@matthewleon (Contributor)

FYI, I am unable to reproduce the "list index out of range" errors on a number of these scrapers: ca_on_guelph, ca_on_markham, ca_on_richmond_hill, and ca_qc_saint_jerome all complete without errors on my machine.

@matthewleon (Contributor)

The ca_qc_mercier page returns 403 to the scraper but loads fine in a browser. Maybe some kind of user-agent sniffing is going on? I will try to check later.

@matthewleon (Contributor)

Same with ca_qc_montreal_est.

@jpmckinney (Member, Author)

Yeah, there are four scrapers that only fail on Heroku, as noted in the issue description. When you call lxmlize, you can pass a user_agent string; ca_pe_stratford uses an IE10 string.
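
A minimal standalone sketch of that workaround (outside the scraper helper), assuming the site only sniffs the User-Agent header; the URL is a placeholder and the exact string ca_pe_stratford uses may differ:

```python
# Fetch a page with a browser-like User-Agent so a server that 403s
# unknown clients serves the page. URL is a placeholder, not the real
# ca_qc_mercier council page.
import lxml.html
import requests

URL = 'http://example.com/council'  # placeholder
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'  # an IE10-style string

response = requests.get(URL, headers={'User-Agent': USER_AGENT})
response.raise_for_status()
page = lxml.html.fromstring(response.content)
```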

@jpmckinney (Member, Author)

I fixed all the Heroku-only failures. They were mostly caused by positional predicates like [2] in XPath. ca_nb, for example, was picking the wrong image: on Heroku, //img[2] means "the second img within the same parent," while locally it's interpreted as "the second img anywhere in the document."
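
For anyone hitting this later, the two readings are easy to compare directly in lxml; (//img)[2] is the unambiguous way to ask for the second img in the document:

```python
# Demonstrates the two readings of //img[2] discussed above.
import lxml.html

doc = lxml.html.fromstring("""
<div>
  <p><img src="a.png"/></p>
  <p><img src="b.png"/><img src="c.png"/></p>
</div>
""")

# The [2] predicate binds to the child::img step, selecting every img
# that is the second img child of its parent:
print([img.get('src') for img in doc.xpath('//img[2]')])    # ['c.png']

# Parenthesizing first collects all imgs in document order, then takes
# the second one overall:
print([img.get('src') for img in doc.xpath('(//img)[2]')])  # ['b.png']
```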

@matthewleon (Contributor)

Why is there this difference in interpretation? Is there a different version of lxml running on Heroku?

@jpmckinney (Member, Author)

The Python package is the same version; maybe the underlying C library (libxml2) differs? I assume only one of the two interpretations is correct, though.

@matthewleon (Contributor)

Indeed. This is very strange.

@Menerve (Contributor) commented May 23, 2014

The local interpretation is the correct one according to http://www.w3.org/TR/1999/REC-xpath-19991116/
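
For reference, the spec's Abbreviated Syntax section defines the expansion that distinguishes the two forms; a quick check with lxml:

```python
# Per XPath 1.0 Abbreviated Syntax, //img[2] expands to
# /descendant-or-self::node()/child::img[2], with the predicate bound
# to the final child step; (//img)[2] indexes document order instead.
import lxml.html

doc = lxml.html.fromstring('<div><p><img src="a"/></p><p><img src="b"/><img src="c"/></p></div>')
assert doc.xpath('//img[2]') == doc.xpath('/descendant-or-self::node()/child::img[2]')
assert doc.xpath('(//img)[2]') != doc.xpath('//img[2]')
```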

@jpmckinney (Member, Author)

@matthewleon I added user agent strings for ca_qc_mercier and ca_qc_montreal_est. The scrapers now fail for a different reason (pupa.scrape.base.ScrapeError: no objects returned from people scrape), likely because the selectors no longer match.
