Pages encoded UTF-8 with BOM not rendered correctly #1186

Closed
stephendrew opened this issue Mar 28, 2017 · 21 comments

@stephendrew

stephendrew commented Mar 28, 2017

Hello,

I have a simple index markdown file:

# Project

## Overview

And it fails to render correctly in any theme.

Cinder: [screenshot]

ReadTheDocs: [screenshot]

Am I doing something wrong? If I change it to a level 2 heading, the same thing happens.

I can work around it by placing a blank line at the start of the file, but then the heading appears too low:

[screenshot]

Thanks.

@waylan
Member

waylan commented Mar 28, 2017

Have you tried including a blank line between your headings? While not strictly necessary in all situations, failing to do so is generally bad form and can result in weird edge cases.

@stephendrew
Author

Actually, there is a blank line - that was a typo in writing the issue. It is the same either way.

@stephendrew
Author

It seems to only be occurring on the first page (index.md) in this project, although I seem to recall it doing it on multiple pages in another project (could be wrong).

@marcelkieser

marcelkieser commented Jun 3, 2017

Same problem here :( It happens for every document in the repository.

[screenshot: mkdocs bug]

This is the source of the file:

### Bug Template 

'''
[Short description of problem here]

**Reproduction Steps:**

1. [First Step]
2. [Second Step]
3. [Other Steps...]
...

Rendered HTML is

<article class="md-content__inner md-typeset">



  <h1>Bug template</h1>

<p>### Bug Template </p>

Tested with the default theme and with Material in MkDocs 0.16.3, running on Anaconda with Python 2.7.2 on Windows 10.

@facelessuser
Contributor

Can you post your mkdocs.yml file? Maybe this is dependent on certain extensions? I copied and pasted your example, but the syntax is weird in your example, and it doesn't look to be the full source of the image you posted. It isn't using backticks for fences, maybe that was a mistake when you posted the example? When I change the code start and end to ```, it works fine for me. But I don't think your source example is right. Regardless, maybe it has nothing to do with the source and is related to extensions?

[screenshot]

@facelessuser
Contributor

I also can't reproduce the opening post's example in readthedocs. I tried a number of extension combinations and even removed all extensions, and I am not seeing that behavior.

@marcelkieser

Hmm, sorry, I did only post the header of the document, as I thought the header declaration would suffice. This is the complete document:

### Bug Template 

'''
[Short description of problem here]

**Reproduction Steps:**

1. [First Step]
2. [Second Step]
3. [Other Steps...]

**Expected behavior/content:**

[Describe expected behavior here]

**Observed behavior/content:**

[Describe observed behavior here]

**Screenshots and GIFs**

[Screenshots and GIFs which follow reproduction steps to demonstrate the problem]

**Additional information:**

* Problem started happening recently, didn't happen in an older version: [Yes/No]
* Problem can be reliably reproduced, doesn't happen randomly: [Yes/No]
* Problem happens with all files and projects, not only some files or projects: [Yes/No]
...
'''

Note that I have replaced the backticks in the document with ' so I can paste it correctly ;)

But this document was only the easiest example to use to describe the error. The behaviour occurs for every document in our repository, regardless of the heading level. Every first heading is rendered as an HTML heading (<h1>) followed by a paragraph containing the actual source heading (<p>### ...). The repository is big and contains about 60 - 100 markdown files in various folders and sub folders, so I can't provide a comprehensive example :(

The yml file is very basic. I have not created any pages mappings because of the sheer number of documents; I need MkDocs to generate a page for every document in the repository, following the existing folder structure, so I leave that unconfigured.

These are the settings currently configured:

site_name: Docs
site_url: url_to_site
repo_url: repo_url
repo_name: Docs
site_description: Docs
site_author: company_name
copyright: company_name
use_directory_urls: false
theme: 'material'

Note that I also tried the default theme (which we can't use because of the sheer number of documents; Material works better for that) with the same result.

I will try to generate the pages mapping automatically, perhaps this changes the result. Will keep you posted about that.

@marcelkieser

I have tested a bit more and found that the problem lies with the file encoding. If the file is encoded as UTF-8 with BOM (Windows 😞), the rendering issue occurs. With proper UTF-8 (no BOM), the files are generated as expected.
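
To illustrate why a BOM would break the first heading, here is a minimal Python sketch. The regular expression is a simplified stand-in for heading detection (an assumption, not Python-Markdown's actual pattern), and the sample bytes are made up for illustration.

```python
import codecs
import re

# Bytes as a BOM-writing tool would save the file (sample content is illustrative).
raw = codecs.BOM_UTF8 + b"# Project\n\n## Overview\n"

# Simplified stand-in for ATX heading detection (an assumption, not the real parser).
HEADING = re.compile(r"^#{1,6} ")

plain = raw.decode("utf-8")      # the BOM survives as U+FEFF at index 0
clean = raw.decode("utf-8-sig")  # the BOM is dropped during decoding

first_plain = plain.splitlines()[0]
first_clean = clean.splitlines()[0]

print(repr(first_plain[0]))              # the invisible character in front of '#'
print(bool(HEADING.match(first_plain)))  # False: line no longer starts with '#'
print(bool(HEADING.match(first_clean)))  # True: heading is recognized
```

This would also be consistent with the blank-line workaround from the opening post: with a leading blank line, the BOM ends up on a line of its own and the heading line starts with `#` again.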

@facelessuser
Contributor

Well, that's good news. At least the issue has been clearly identified. As far as I know, MkDocs doesn't allow you to specify encoding, so as long as you use proper UTF-8, you should be fine. Maybe this should be more clearly stated in the documentation if it isn't already.

@facelessuser
Contributor

facelessuser commented Jun 5, 2017

MkDocs could read files with encoding utf-8-sig which should strip out BOM if it is found, and if it is not found, it would read the file as a normal utf-8. I personally think this would end a lot of confusion in this regard. Issues like this are always unclear to debug, so just using an encoding that handles with and without BOM may be the easiest path forward to avoid issues like this in the future.

I also think it is perfectly reasonable for MkDocs to just clarify that they only accept utf-8 without BOM. But I suspect people will run into this again and again and have to have it explained to them; it may just be easier to use utf-8-sig and not worry about it.

waylan changed the title from "First heading not rendered correctly" to "Pages encoded UTF-8 **with BOM** not rendered correctly" on Jun 5, 2017
waylan changed the title from "Pages encoded UTF-8 **with BOM** not rendered correctly" to "Pages encoded UTF-8 with BOM not rendered correctly" on Jun 5, 2017
@waylan
Member

waylan commented Jun 5, 2017

If the file is encoded as UTF-8 with BOM (Windows 😞), the rendering issue occurs. With proper UTF-8 (no BOM), the files are generated as expected.

That makes sense. Can't believe I didn't think of asking about encoding before.

I also think it is perfectly reasonable for MkDocs to just clarify that they only accept utf-8 without BOM. But I suspect people will run into this again and again and have to have it explained to them; it may just be easier to use utf-8-sig and not worry about it.

You may have a point. I suspect this has not been done given the following in the Python docs:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program... In UTF-8, the use of the BOM is discouraged and should generally be avoided.

First, I can't imagine anyone actually using Notepad to edit their files. Are there any other editors which actually do that by default? I've never encountered one.

That said, using utf-8-sig would just 'do the right thing' whether there is a BOM or not. In fact, a strict reading of the above is that writing the BOM to utf-8 is discouraged, not accounting for it when reading. As the docs mention:

On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
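
The codec behaviour quoted above can be checked directly on byte strings. This is a small sketch using only the standard library; the sample text is arbitrary.

```python
import codecs

text = "# Project\n"

with_bom = text.encode("utf-8-sig")  # encoder prepends the bytes EF BB BF
without_bom = text.encode("utf-8")   # plain UTF-8, no signature

# Encoding with utf-8-sig writes the three signature bytes first.
print(with_bom[:3] == codecs.BOM_UTF8)

# Decoding with utf-8-sig yields the same text whether or not the BOM is present.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig") == text)

# Decoding with plain utf-8 keeps the BOM as a leading U+FEFF character.
print(with_bom.decode("utf-8") == "\ufeff" + text)
```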

So we could read using utf-8-sig and then write only with utf-8. That way, we can just ignore any BOM.
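
A rough sketch of that read/write split, in Python 3. The helper names `read_source` and `write_output` are hypothetical and are not MkDocs functions; the file names are only examples.

```python
import os
import tempfile


def read_source(path):
    """Read text, tolerating an optional leading BOM (hypothetical helper)."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


def write_output(path, text):
    """Write text as plain UTF-8, never emitting a BOM (hypothetical helper)."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "index.md")
    out = os.path.join(tmp, "index.html")

    # Simulate a source file saved with a BOM.
    with open(src, "wb") as f:
        f.write(b"\xef\xbb\xbf# Project\n")

    text = read_source(src)   # BOM is skipped on read
    write_output(out, text)   # nothing is prepended on write

    with open(out, "rb") as f:
        out_bytes = f.read()

print(repr(text))
print(out_bytes)
```

Reading with `utf-8-sig` and writing with `utf-8` means a BOM in the input is ignored and never propagates to the output.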

@facelessuser
Contributor

So we could read using utf-8-sig and then write only with utf-8. That way, we can just ignore any BOM.

Yes, that is what I was suggesting, I should have emphasized read, but that was my intention with this suggestion.

@marcelkieser

First, I can't imagine anyone actually using Notepad to edit their files. Are there any other editors which actually do that by default? I've never encountered one.

Well, I used PowerShell's Out-File -Encoding utf8 to modify links in the markdown before calling mkdocs. It seems PowerShell's default for utf8 is with BOM ;)
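
For anyone stuck with BOM-prefixed files from a preprocessing step like this, one possible workaround is to strip the BOM before building. The sketch below is an illustration only: `strip_bom` and `strip_bom_in_tree` are hypothetical helper names, and the `.md` suffix is an assumption about which files matter.

```python
import codecs
import os
import tempfile


def strip_bom(path):
    """Remove a leading UTF-8 BOM in place; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True


def strip_bom_in_tree(root, suffix=".md"):
    """Strip the BOM from every file under root ending in suffix; return changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                if strip_bom(path):
                    changed.append(path)
    return changed


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as docs:
    a_path = os.path.join(docs, "a.md")
    b_path = os.path.join(docs, "b.md")
    with open(a_path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"# A\n")  # saved with BOM
    with open(b_path, "wb") as f:
        f.write(b"# B\n")                    # saved without BOM

    changed = [os.path.basename(p) for p in strip_bom_in_tree(docs)]
    with open(a_path, "rb") as f:
        a_bytes = f.read()

print(changed)
print(a_bytes)
```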

@facelessuser
Contributor

Well, I used PowerShell's Out-File -Encoding utf8 to modify links in the markdown before calling mkdocs. It seems PowerShell's default for utf8 is with BOM ;)

Of course that's what Windows is doing 🤦‍♂️. That's such a Windows thing to do.

@stephendrew
Author

First, I can't imagine anyone actually using Notepad to edit their files. Are there any other editors which actually do that by default? I've never encountered one.

Can't remember what editor I was using (usually VS Code or Atom), but I do occasionally make minor edits in Notepad because it opens quicker :) Not sure if autocrlf was interfering or not...

waylan added a commit to waylan/mkdocs that referenced this issue Jun 5, 2017
Python simply discards the BOM with `utf-8-sig`. This way, users of
Microsoft text editors can have their files properly parsed. In all other
ways, this behaves the same as reading files using `utf-8` encoding. For more info see:
https://docs.python.org/2/library/codecs.html#encodings-and-unicode

Fixes mkdocs#1186.
waylan added a commit to waylan/mkdocs that referenced this issue Jun 6, 2017
Python simply discards the BOM with `utf-8-sig`. This way, users of
Microsoft text editors can have their files properly parsed. In all other
ways, this behaves the same as reading files using `utf-8` encoding. For more info see:
https://docs.python.org/2/library/codecs.html#encodings-and-unicode

Fixes mkdocs#1186.
@waylan
Member

waylan commented Jun 6, 2017

Well, I have a fix for this in #1236, but the test fails in PyPy. I should note that only the new test fails. Presumably, everything else is working fine. PyPy simply appears to not like the file with a BOM. I can live with that, but how do I make the test pass? Should I skip the test for PyPy, or do something else? Any thoughts?

@facelessuser
Contributor

Weird. Is it a PyPy 3.5 thing? I only have pypy 2.7 local on my mac, and I am able to write and read utf-8-sig.

@waylan
Member

waylan commented Jun 6, 2017

No, it's the PyPy2 tests. For some reason, the output is blank.

@facelessuser
Contributor

I may need to pull the branch and try and run it in pypy2 because it seems utf-8-sig reads and writes are working in my pypy2. Probably don't have time tonight though.

@waylan
Member

waylan commented Jun 7, 2017

This pypy thing has me stumped. I installed pypy locally and played around a little and AFAICT, everything works except in this test. The test does the following:

  1. Create two temp dirs for docs_dir and site_dir.
  2. Write a Markdown file to docs_dir/index.md with a BOM (using utf-8-sig).
  3. Call mkdocs.build.build with the previously created temp dirs in a config.
  4. After build completes, read site_dir/index.html and confirm it worked properly.

In everything but PyPy, the test works as expected. It fails before the change (when MkDocs reads files with utf-8) and passes after (when MkDocs reads with utf-8-sig). But it never works with PyPy. So I tried some debugging:

First I removed the BOM from index.md in step 2 above. But it still failed. I then removed all references to utf-8-sig and it still failed. Obviously, CPython passed just fine in both scenarios.

Then I tried adding some logging statements to track values through the build process. When building, PyPy is writing out to index.html the correct output. Trying different scenarios, it behaves the same as CPython.

It would seem the only failure is in step 4 above. When reading site_dir/index.html after the build, PyPy gets empty content every time (f.read() returns an empty string). I don't understand why that would be the case. And why only for PyPy?
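
For readers following along, the four test steps listed above can be sketched roughly as below. This is not the actual MkDocs test: `stand_in_build` is a hypothetical placeholder for the real build call and only turns a leading `# ` line into an `<h1>`, and `run_bom_check` is likewise an invented name.

```python
import os
import tempfile


def stand_in_build(docs_dir, site_dir, read_encoding):
    """Hypothetical stand-in for the build step (not the MkDocs API)."""
    with open(os.path.join(docs_dir, "index.md"), encoding=read_encoding) as f:
        lines = f.read().splitlines()
    html = []
    for line in lines:
        if line.startswith("# "):
            html.append("<h1>%s</h1>" % line[2:])
        elif line:
            html.append("<p>%s</p>" % line)
    with open(os.path.join(site_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write("\n".join(html))


def run_bom_check(read_encoding):
    # Step 1: create two temp dirs for docs_dir and site_dir.
    with tempfile.TemporaryDirectory() as docs_dir, \
            tempfile.TemporaryDirectory() as site_dir:
        # Step 2: write a Markdown file with a BOM (using utf-8-sig).
        with open(os.path.join(docs_dir, "index.md"), "w", encoding="utf-8-sig") as f:
            f.write("# Project\n")
        # Step 3: run the (stand-in) build with those dirs.
        stand_in_build(docs_dir, site_dir, read_encoding)
        # Step 4: read the output back and return it for checking.
        with open(os.path.join(site_dir, "index.html"), encoding="utf-8") as f:
            return f.read()


print(repr(run_bom_check("utf-8-sig")))  # heading recognized
print(repr(run_bom_check("utf-8")))      # BOM blocks heading detection
```

The sketch only mirrors the before/after behaviour on CPython. It does not reproduce the empty read seen on PyPy, whose cause is not established in this thread.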

@facelessuser
Contributor

You beat me to it, but I was afraid that was the case. I wonder if the nosetest environment is a bit weird on pypy. I never really use pypy for anything, but it seems it has some quirks.

waylan added a commit to waylan/mkdocs that referenced this issue Oct 13, 2017
Python simply discards the BOM with `utf-8-sig`. This way, users of
Microsoft text editors can have their files properly parsed. In all other
ways, this behaves the same as reading files using `utf-8` encoding. For more info see:
https://docs.python.org/2/library/codecs.html#encodings-and-unicode

Fixes mkdocs#1186.
waylan added a commit that referenced this issue Oct 13, 2017
Python simply discards the BOM with `utf-8-sig`. This way, users of
Microsoft text editors can have their files properly parsed. In all other
ways, this behaves the same as reading files using `utf-8` encoding. For more info see:
https://docs.python.org/2/library/codecs.html#encodings-and-unicode

Fixes #1186.