Wrong encoding detection #15

Open
onilton opened this issue Jul 11, 2014 · 0 comments

onilton commented Jul 11, 2014

I'm using PyQuery, and I get wrong encoding detection for this page:

http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html

The problem is that the HTML has this meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

But the page is actually encoded in UTF-8.

I get this info from the response headers:

Connection:close
Content-Length:29187
Content-Type:text/html;charset=UTF-8
Date:Fri, 11 Jul 2014 23:21:04 GMT
Last-Modified:Fri, 11 Jul 2014 23:21:05 GMT
Server:OpenCms/7.5.4

That's how the browser (Chrome) is able to pick the right encoding and display the page correctly: the charset in the Content-Type response header takes precedence over the one in the meta tag. I work at a place that has to deal with many different kinds of pages, and I can tell this is far from a rare case (especially on Brazilian Portuguese websites), so it would be nice to fix this in crawley.
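To make the header-first idea concrete, here is a minimal sketch using only the standard library. The function names `charset_from_content_type` and `decode_body` are hypothetical and are not part of crawley or PyQuery; the UTF-8 fallback is also an assumption of this sketch.

```python
from email.message import Message
from typing import Optional, Tuple


def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header (lowercased), or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(body: bytes,
                content_type: Optional[str],
                meta_charset: Optional[str] = None,
                fallback: str = "utf-8") -> Tuple[str, str]:
    """Decode body, trying the header charset first, then the meta charset.

    Returns the decoded text and the encoding that was used.
    """
    candidates = [charset_from_content_type(content_type), meta_charset, fallback]
    for encoding in candidates:
        if not encoding:
            continue
        try:
            return body.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset label: try the next candidate
    # Nothing decoded cleanly: decode anyway, with replacement characters.
    return body.decode(fallback, errors="replace"), fallback


# Same situation as the page above: header says UTF-8, meta tag says iso-8859-1.
body = "São Paulo".encode("utf-8")
text, used = decode_body(body, "text/html;charset=UTF-8", "iso-8859-1")
```

Note that iso-8859-1 can decode any byte sequence without raising, so if the meta charset were tried first the result would be silently garbled text rather than an error. That is why the order of the candidates matters here.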

So far I have seen two solutions, as proposed in this answer on SO: using the chardet module or UnicodeDammit (from BeautifulSoup).

I've implemented both alternatives locally and tested them with PyQuery; both seem to fix the problem.
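For illustration only, a rough sketch of what the chardet route could look like (this is not the exact code tested locally). The helper name `sniff_and_decode` is hypothetical, trying strict UTF-8 before asking chardet is an assumption of this sketch, and chardet is treated as optional so the snippet still runs without it.

```python
from typing import Tuple

try:
    import chardet  # third-party package; optional in this sketch
except ImportError:
    chardet = None


def sniff_and_decode(body: bytes, last_resort: str = "iso-8859-1") -> Tuple[str, str]:
    """Guess the encoding from the bytes themselves and decode.

    Returns the decoded text and the encoding that was used.
    """
    # Bytes that decode as strict UTF-8 are rarely anything else, so try that first.
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise ask chardet (if installed) for a statistical guess.
    if chardet is not None:
        guess = chardet.detect(body)["encoding"]  # may be None
        if guess:
            try:
                return body.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass
    # Last resort: a single-byte encoding, which never fails to decode.
    return body.decode(last_resort, errors="replace"), last_resort


text, used = sniff_and_decode("resultado da busca: São Paulo".encode("utf-8"))
```

UnicodeDammit could fill the same role as the chardet call; either way the declared charset in the meta tag is no longer trusted blindly.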

I would like to hear your opinion on this issue, and if you want, I can submit one of those solutions.

BTW, good work building crawley, I'm having a very nice time using it! I hope I can contribute somehow.
