
Wrong character encoding #863

Closed
KingDuckZ opened this issue Feb 19, 2020 · 10 comments

Comments

@KingDuckZ (Contributor)

Consider this page, specifically this line from its html:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

From my testing, tidy doesn't respect that encoding; instead, if I read the code correctly, src/clean.c:2316 forcibly replaces that "windows-1252" value with "utf-8".

The problem is that when processing the above html page, the output is not valid utf-8. For example, there is an accented character near the string "des Mondes" (grep for it and you should see it) that gets destroyed.
In my own code, this line

tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252);

fixes the issue and I get valid utf-8 out, with correct accents and all, but I can't hardcode it because then my code would be incorrect for every other html page out there. I also can't recover that information from the cleaned html, because tidy overwrites it.

I think one of these two things should happen:

  1. tidy leaves the encoding as-is, so I can find the relevant meta tag through xpath and convert to utf8 manually using iconv
  2. tidy reads and respects that encoding when converting to utf8

I don't know if I'm doing something wrong in the way I call tidy, but after trying several options I can't get it to give me a correctly converted utf8 html string.

@geoffmcl (Contributor)

@KingDuckZ thank you for your issue...

First, some simple Tidy facts...

  1. Tidy's `default` encoding, both in and out, is UTF-8... see char-encoding, and the other I/O options...
  2. Tidy `ignores` the meta charset in the doc on input... except that it may re-set it on output, as you point out in src/clean.c:2316... also see show-meta-change & add-meta-charset...
  3. Tidy is `not` an encoding converter... use other tools for that...

As you have shown in your own code, you need to set the input encoding, but you must also set the output encoding, else it will default to utf8... which can have conversion problems... as you maybe point out...

See how tidy internally handles this, AdjustCharEncoding... note the default chosen pairs for in and out... and both are set...

I had no problems with the online page you linked to, using a config of -raw, or -win1252... all seemed ok...

With the latter I got entities like &ndash; for 0x96, and &eacute; for 0xE9... etc...
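For the conversion being discussed here (windows-1252 in, utf-8 out), a tidy config file sketch, assuming the documented input-encoding/output-encoding option names, would be:

```
input-encoding: win1252
output-encoding: utf8
```

This pairing is what the `tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252)` call above sets on the input side only; the output side defaults to utf8 unless configured.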

Does this solve your problem?

If not, please advise the character code(s) that are causing a problem, and the config used... maybe there is a bug here, but not seeing it yet...

Look forward to further feedback... thanks...

@KingDuckZ (Contributor, Author)

Hi @geoffmcl, thanks for replying. I'm a bit confused about what I should do. What I want is to normalise my input so that my code only ever deals with utf8. It's ok if I have to use iconv for that, but then the problem is that I need to tell iconv the source charset and the destination charset. The destination is always utf8, so that's easy, but what about the input? In my test, the meta charset tag from tidy always says utf8, even when that's not the case, because, as you pointed out, tidy does not convert characters unless you explicitly ask it to. But it always changes the meta tag.

This is how I set up tidy; if you look at line 108, when the user specifies a charset explicitly I apply it, otherwise I just go with tidy's default. The page linked in my first post only produces correct utf8 output if that src_charset in my code is true and the if branch is taken. If it's false, I get an html file that claims to be utf8 but contains non-utf8 sequences for the accented characters.

I'm trying to figure out if I'm doing something wrong myself, it might be this is indeed the expected behaviour, in which case my question is how can I tell what the original encoding was so I can invoke iconv() correctly? Thanks.

@ler762 (Contributor) commented Feb 22, 2020 via email

@KingDuckZ (Contributor, Author) commented Feb 22, 2020

@ler762 that sounds like a good solution, and to be honest that's what I set out to do at the start. In this case the server just reports Content-Type: text/html, so what matters is the meta tag. I can easily retrieve the first meta tag using xpath (after all, that's what my program does), but before I can do that the html needs to be cleaned and become valid xml. See my problem? After I clean the html it's ready to be fed to the xpath library, but because tidy overwrites the charset field, my xpath query always returns utf-8, no matter what.

If I didn't want to use xpath I'd have to manually parse the html before tidying, deal with invalid input, etc. - in short, duplicate what tidy already does. Anything simpler, like a regex match for example, is doomed to fail sooner or later.

So what I'd need is one of the following:

  1. tell tidy I want it to preserve the original meta charset value, so I can run xpath over the clean xml (I don't like this idea much)
  2. ask tidy what was the charset value before the replacement happened
  3. have tidy invoke iconv() automatically

Instinctively, I'd say option 3 is the one that makes the most sense to me since, if tidy is changing the charset value, I'd expect the output to be self-consistent and encoded the way the new meta charset claims it is. But I understand the desire to keep dependencies to a minimum (there could be a build option to enable/disable this).

I don't like option 1 because although I can grab the charset value through xpath, I'd still be left with the old value after iconv(), resulting once again in an inconsistent html.

Option 2 sounds like a "good enough" solution, in that it's probably quick to implement for you guys, it gets me past my issue and no extra dependencies or work on the build system are required.

Edit You can play around with my project if you want, maybe the problem will be more clear:

# correct output, requires user to manually inspect source html
./duckscraper -f windows-1252 --dump - https://psxdatacenter.com/psp/plist.html "/test"

# wrong output, html claims to be utf-8 but accented characters won't show up
./duckscraper --dump - https://psxdatacenter.com/psp/plist.html "/test"

Output is produced right after tidying the input html, so save for bugs it should be exactly what tidy returned to me.

Edit2 I'm in a bit of a hurry and will double check this later, but it looks to me like the output is not even windows-1252, it's just broken: if I run this command

iconv --from-code=windows-1252 --to-code=utf-8 --output=utf8plist.html plist_clean.html

I still don't get accented characters, more like garbage data: Crois�e

@ler762 (Contributor) commented Feb 22, 2020 via email

@geoffmcl (Contributor)

@KingDuckZ I had started this reply, offline, before two/three more arrived... but what I had started seems still relevant...

Well I am certainly not an expert on character encodings, but the fact that the document contains the character code 0x96, decimal 150, certainly indicates that it is WINDOWS-1252, as the meta charset indicates...

That suggests, at a minimum, tidy must be given a config of --input-encoding win1252, ie tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252); MUST be used on this document...

Now @KingDuckZ, and @ler762, thank you for the further feedback...

@KingDuckZ, I can see your catch-22 - download the html from the wild, feed it to tidy, then check...

Oops, the doc is in an esoteric character set, like WINDOWS-1252 or others - perhaps quite rare these days, with the ubiquity of utf-8, but they exist - which tidy assumes is utf-8, unless told otherwise, and tries its best, making a big mess in the process...

To your points...

  1. Tidy can not leave the original meta charset... since that too is wrong for the utf-8 output...
  2. With show-meta-change set, tidy does tell you what it is replacing...
  3. Tidy can not invoke the likes of iconv... adding a dependency... no...

@ler762, yes perhaps the meta info message is a little confusing, in that, in this case, it is not incorrect, for the input, but it is for the output tidy created...

Maybe there could be a better wording of the message... suggestions welcome... but perhaps that should be a new, different issue, if someone wants to pursue it... thanks...

I guess the bottom line is libtidy might not be the best tool, for html, where you do not yet know the encoding character set... sorry...

I tried building duckscraper, but am missing the PugiXML dependency... in Windows... probably can be solved...

I guess if I was writing a download app using libtidy, like it seems yours is, I would have to pre-process the <head>, as say ascii, or latin1, like say a browser, or validators, etc., have to do... to find the charset, to configure libTidy appropriately... or not use libTidy...

This is as @ler762 suggests... sort of...

But at this moment, do not yet clearly see a Feature Request for libTidy emerging... but...

HTH, and look forward to further feedback... thanks...

@KingDuckZ (Contributor, Author)

Option 2 is probably quicker for you to implement.

@ler762 I meant, ask programmatically. This task shouldn't require a duckscraper user to read through the logs and then re-run their queries.

Let me clarify, users of duckscraper are obviously expected to be able to read html and write xpath, and in fact the -f flag I added recently is a reasonable workaround to the current problem, however I don't think figuring out the charset encoding should fall onto their shoulders by default. Ideally the -f flag should only be needed in some rare circumstances.
Please keep in mind that figuring out the encoding is not straightforward because, as @ler762 correctly pointed out, the http response header might specify a different encoding that overrides the one given in the html, or there might be none at all. Leaving this burden on my users is not acceptable for me.

As I understand it, figuring out the correct encoding is done this way:

  1. if http response specifies a charset, use that one
  2. if html has a meta charset, use that one (maybe use the first if more than one is given?)
  3. choose a default encoding depending on the Content-Type

Point 1 is fine, I can ask curl and it will kindly tell me.
Point 2 is, like I'm saying, problematic. Once again, it would be good if, from code, I could ask tidy what the original charset was, if present at all. Then I could discard the cleaned html and re-clean the raw one, either passing the correct encoding in or invoking iconv() first.
Point 3 is also fine as it involves choosing between some hardcoded values that I already have.

@geoffmcl I suppose I could preprocess the <head> as you suggest, but wouldn't tidy already have all the primitives for doing that? Would it be possible for you to provide some guess_charset() function?
To recap, these are my questions to you:

  1. could you provide a function or an out parameter to retrieve the old encoding after a call to tidyCleanAndRepair()?
  2. if not, could you provide a function that extracts the encoding if present? I'm not trying to offload my work to you, it's just that I think tidy already has all the primitives needed and it's better equipped than my code for parsing non-xml input files

As for building duckscraper, if you're still interested please check out the scraplang branch, not main. You won't need Pugi anymore, but you will need XQilla and Xerces-c instead, which on my system I compiled manually and installed to my /usr/local. Unfortunately I don't know if any of this will work on Windows, I work on Linux only.

@geoffmcl (Contributor)

@KingDuckZ, thank you for your further feedback...

Following your suggestion, I tried building duckscraper/scraplang, but it failed, missing XQilla, Xerces-c, ... I was trying to look at your use of libTidy... but that must wait until I download and build these dependent libs...

But I guess you get a buffer from curl, so it seems a simple thing to search for <meta ...> in the limited head of the buffer, a char *guess_charset(buf,size) function... stop at </head>... returning a ptr to 'utf8', or whatever it is, to pass to tidy's config as a known --input-encoding string...

I am afraid you are asking too much from libTidy to supply this service...

So, again, at this moment, do not yet clearly see a Feature Request for libTidy emerging... but...

HTH, and look forward to further feedback... thanks...

@KingDuckZ (Contributor, Author) commented Sep 29, 2020

Hi geoffmcl, thanks for looking into this.
I can do a quick search, possibly with regex, but that's problematic in the case of a <meta> inside a comment, for example. There's the case of broken html too (there might be no </head>). Newlines and weird indentation might cause problems as well.

Correct me if I'm wrong, but I think all of the above problems are already taken care of in your library? They must be, since iirc the encoding gets replaced. At the point where the old encoding is replaced with the new one, would it be very hard for you to make a copy of the substring that got replaced and expose it through some accessor?

@geoffmcl (Contributor) commented Oct 3, 2020

@KingDuckZ thanks for the further feedback, but, sorry, I think you are, or seem, mistaken...

... I think all of the above problems are already taken care of in your library?

No, libTidy HAS to be told what the input charset is! Like I understand iconv requires... If none given, it defaults to utf-8...

Tidy does not search for <meta...> at all, except as just another html tag... it does not decode it... or use any value it might contain... whether it is right or wrong...

It does write, or modify, the <meta charset=...> on output... it will be set to the output-encoding value... which also defaults to utf-8, unless otherwise instructed...

Yes, it searches, using the given charset, byte by byte, for html code, text, etc... and internally stores all text in utf-8, in the lexer buffer...

This means, if the configured input is windows-1252, iso-8859-1, or latin1, the char 0xEB, namely an &euml;, will be stored as 0xC3,0xAB, U+00EB... and tidy does not remember it was a single 0xEB... so there cannot be an accessor to it...

Conversely, if the configured input is utf-8, on encountering 0xEB, it will flag a warning, and replace it with EF BF BD, the unicode REPLACEMENT CHARACTER, usually shown as a white question mark on a black diamond... and again there is no way to get back what it was originally...

So, libTidy input-encoding must be configured before it reads the data, and nothing in that data will alter the input-encoding... sorry if you got some other idea...

I also tried searching the header, using perl, and that sort of worked... but, yes, you do need to avoid comments, <!-- any thing -->, maybe <style> ... stuff ... </style>, or <script> ... script ... </script>, and maybe others... not exactly easy...

And I hope using the http response specifies a charset was some success also...

One simple additional thought: you have the original text in std::string html, in clean_html()... and are able to run that multiple times through libTidy, so, if you get a tidy warning about replacing invalid UTF-8 bytes, or maybe other invalid char warnings, you change the input-encoding and present it again... and again... just an idea...

But as indicated, I do not think libTidy can help you more... sorry...

No libTidy feature request emerging, so closing this...

As always, HTH, and look forward to further feedback... thanks...

@geoffmcl geoffmcl closed this as completed Oct 3, 2020