Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Byte Order Mark (BOM) handling in markdown display and editor. #6716

Closed
1 of 3 tasks
Corey-M opened this issue Apr 23, 2019 · 14 comments · Fixed by #6727
Closed
1 of 3 tasks

Fix Byte Order Mark (BOM) handling in markdown display and editor. #6716

Corey-M opened this issue Apr 23, 2019 · 14 comments · Fixed by #6727
Labels
Milestone

Comments

@Corey-M
Copy link

Corey-M commented Apr 23, 2019

Description

When the README.md file contains a Unicode Byte Order Mark the file the first line of the file is
formatted incorrectly. This happens for all BOMs that I have tested - UTF8, Unicode and Unicode
Big Endian. The text of the file loads correctly in all cases, the only problem is with the BOM
being treated as file content and incorrectly displayed in both the rendered view and the editor.

This is a particular problem for developers using Visual Studio and other environments which
insert a BOM by default, and can be difficult to change without add-ons to the environment.

The various Unicode Byte Order Markers are metadata and should not be treated as file content. At a minimum the markdown renderer should detect and discard BOMs in the content. The markdown editor should likewise detect and remove BOMs during load and either replace them during save or save without markers.

I don't understand Go so I won't be submitting a pull request. From 10 minutes casting about in the code it looks like ToUTF8WithErr and ToUTF8WithFallback might be a place to
start. It looks like those are the main methods used to massage text file content into UTF8 for
both render and edit.

Screenshots

Screenshots taken from try.gitea.io site mentioned above.

  • Bad first line formatting, treats BOM as unknown non-printable character:
    image
  • Editor showing BOM as unknown/bad character:
    image
@silverwind
Copy link
Member

silverwind commented Apr 23, 2019

Would be interested in how GitHub solves this (or not). Generally, I think all BOMs should be abolished on save, but I guess some broken software might rely on it being present (cough Excel), so there might be some value in trying to preserving it (but never rendering it).

@zeripath
Copy link
Contributor

It's probably through some use of the chardet or similar.

It's probably reasonable to pass the stuff through this - but the trouble with the chardet library is that it's not nearly perfect.

@silverwind
Copy link
Member

BOM should be easily detectable via regex \uFEFF, no need for a library.

@zeripath
Copy link
Contributor

Ah but I bet that the next problem will be cp1252 related

@lafriks
Copy link
Member

lafriks commented Apr 23, 2019

Please do not use regexp for this, just check first two bytes

@silverwind
Copy link
Member

silverwind commented Apr 23, 2019

BOM is actually three bytes in UTF-8 and two bytes in UTF-16. Still think it's best to regex-test the unicode string instead like ^\uFEFF, it is a single character in unicode-aware regexp.

@lafriks
Copy link
Member

lafriks commented Apr 23, 2019

It does not matter it still be faster than doing regexp and converting byte array to string

@zeripath
Copy link
Contributor

OK so we are passing this content through a chardet.

OK so we need to adjust our "decoder" for utf-8 to simply remove the BOM if present.

@silverwind
Copy link
Member

silverwind commented Apr 23, 2019

I think the most elegant solution would be:

  1. Strip BOM from rendered markdown and web editor.
  2. When user saves a file in the web editor check if the previous file had a BOM and if so, add it again.

As I said, BOM preserval is important when dealing with certain Microsoft products which use the BOM as indicator whether a file is UTF-8 or ASCII, like Excel does.

@zeripath
Copy link
Contributor

@silverwind - OK I've got removal of the BOM on decoding sorted out. In terms of keeping the previous encoding that's a bit more difficult - we don't currently do that at all - all data created on the editor is assumed to be UTF-8 AFAIU (I certainly didn't write any encoding gadgets - doing it properly is a horrible experience.)

@silverwind
Copy link
Member

We're making the assumption that everything is UTF-8 so adding it back when committing from the UI shouldn't be too hard. Read the old file, check if its first three bytes match (maybe create a shared hasBOM function to use in the template renderer as well) and if they do, add them to the saved content.

Thought I won't block on this, stripping BOM is already an improvement.

@zeripath
Copy link
Contributor

OK done.

@zeripath
Copy link
Contributor

The attached pr will attempt to reencode to the detected charset and upon failure will default to utf8 with or without BOM as per original charset.

@silverwind
Copy link
Member

Thanks, will likely test this tomorrow.

@lafriks lafriks added this to the 1.9.0 milestone Apr 24, 2019
lafriks added a commit to zeripath/gitea that referenced this issue Apr 25, 2019
zeripath added a commit to zeripath/gitea that referenced this issue Apr 25, 2019
lafriks added a commit to zeripath/gitea that referenced this issue Apr 26, 2019
@lafriks lafriks modified the milestones: 1.9.0, 1.8.1 Apr 26, 2019
lafriks added a commit to zeripath/gitea that referenced this issue Apr 26, 2019
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants