mech-dump --headers forces Latin1 on local files #270

Open
jidanni opened this issue Feb 3, 2019 · 2 comments

@jidanni

jidanni commented Feb 3, 2019

There is absolutely no way to get mech-dump to use the correct character
set for UTF-8 on local files.

$ wget jidanni.org
$ mech-dump --headers index.html | grep Title
Title: [...gobbledygook...] Dan Jacobson

@jidanni

jidanni commented Oct 4, 2020

Apparently no Title normally comes with the HTTP headers, which are all expected to be ASCII,
so when the Title is grabbed as a bonus for local files, nobody remembered that it might not be ASCII.

Let's have another look after dumping parts of the website https://jidanni.org/ onto local disk:
$ cd jidanni.org/
02:55 jidanni.org$ mech-dump --headers index.html
Content-Length: 3495
Content-Type: text/html
Last-Modified: Thu, 09 Jul 2020 12:59:11 GMT
Client-Date: Sun, 04 Oct 2020 18:56:05 GMT
Title: ç©ä¸¹å°¼ Dan Jacobson
X-Meta-Charset: utf-8
X-Meta-Viewport: width=device-width
02:56 jidanni.org$ mech-dump --headers location/paper_mailbox.html
Content-Language: zh-tw
Content-Length: 2109
Content-Type: text/html
Last-Modified: Mon, 27 Jan 2020 21:41:29 GMT
Client-Date: Sun, 04 Oct 2020 18:56:17 GMT
Title: ç´ä¿¡ç®±èªªæ Paper mailbox instructions
X-Meta-Viewport: width=device-width
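
The garbling in the Title lines above is consistent with UTF-8 bytes being read as Latin-1. The following is only an illustration of that mechanism in Python, not mech-dump's actual code. The sample string is an assumption, chosen because its UTF-8 bytes reproduce the "ä¸¹å°¼" fragment seen in the first Title line.

```python
# Illustration only: shows how UTF-8 bytes misread as Latin-1 turn into
# the kind of garbage seen in the Title header. Not mech-dump's code.

# Assumed sample: two CJK characters whose UTF-8 bytes match the
# "ä¸¹å°¼" fragment in the output above.
title = "丹尼 Dan Jacobson"

raw = title.encode("utf-8")        # bytes as they would sit in a UTF-8 file
garbled = raw.decode("latin-1")    # same bytes, wrongly treated as Latin-1
print(garbled)

# As long as no bytes were lost, the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Because Latin-1 assigns a character to every byte value, the wrong decode never raises an error, which may be why the problem goes unnoticed until someone looks at the output.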

Anyway, we see there are tons of clues for mech-dump to pick up on: X-Meta-Charset, etc. But it misses them.
$ mech-dump --version
2.01
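
As a rough sketch of what "picking up on the clues" could look like, the Python below reads the charset declared in a meta tag and uses it to decode the title of a local file. It uses only the standard library. The names `MetaCharsetTitle` and `read_title` are hypothetical, and this says nothing about how mech-dump or its underlying modules are actually structured.

```python
from html.parser import HTMLParser


class MetaCharsetTitle(HTMLParser):
    """Collect the first <meta charset=...> value and the <title> text."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            value = dict(attrs).get("charset")
            if value and self.charset is None:
                self.charset = value.lower()
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def read_title(raw: bytes) -> str:
    """Decode the title of an HTML document using its declared charset."""
    # First pass: Latin-1 never fails, and the ASCII markup stays readable,
    # so the declared charset can be found before the real decode.
    probe = MetaCharsetTitle()
    probe.feed(raw.decode("latin-1"))
    probe.close()
    charset = probe.charset or "latin-1"
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to the byte-preserving decode.
        text = raw.decode("latin-1")
    # Second pass: parse the correctly decoded text.
    final = MetaCharsetTitle()
    final.feed(text)
    final.close()
    return final.title.strip()


# Assumed sample document, standing in for a local index.html.
sample = (
    '<html><head><meta charset="utf-8">'
    "<title>丹尼 Dan Jacobson</title></head></html>"
).encode("utf-8")
print(read_title(sample))
```

The two-pass approach is one design choice among several. It only handles the short `<meta charset>` form and ignores the older `http-equiv` variant.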

@simbabque

I can't figure out where this title header is added, but I am pretty sure at that point the encoding is broken. The response knows this is utf-8 and generally mech-dump turns STDOUT into utf8 anyway.
