Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve BOM in web editor #28935

Merged
merged 3 commits into from Jan 27, 2024
Merged

Preserve BOM in web editor #28935

merged 3 commits into from Jan 27, 2024

Conversation

silverwind
Copy link
Member

@silverwind silverwind commented Jan 26, 2024

The ToUTF8* functions were stripping BOM, while BOM is actually valid in UTF8, so the stripping must be optional depending on use case. This does:

  • Add a options struct to all ToUTF8* functions, that by default will strip BOM to preserve existing behaviour
  • Remove ToUTF8 function, it was dead code
  • Rename ToUTF8WithErr to ToUTF8
  • Preserve BOM in Monaco Editor
  • Remove a unnecessary newline in the textarea value. Browsers did ignore it, it seems but it's better not to rely on this behaviour.

Fixes: #28743
Related: #6716 which seems to have once introduced a mechanism that strips and re-adds the BOM, but from what I can tell, this mechanism was removed at some point after that PR.

The ToUTF8* functions were stripping BOM, while BOM is actually valid in
UTF8, so the stripping must be optional. This does:

- Add a options struct to all ToUTF8* functions, that by default will
  strip BOM to preserve existing behaviour
- Remove ToUTF8 function, it was dead code
- Rename ToUTF8WithErr to ToUTF8
- Preserve BOM in Monaco Editor
- Remove a unnecessary newline in the textarea value. Browsers did
  ignore it, it seems but it's better not to rely on this behaviour.

Fixes: go-gitea#28743
Related: go-gitea#6716
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Jan 26, 2024
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 26, 2024
@lunny lunny added this to the 1.22.0 milestone Jan 26, 2024
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Jan 26, 2024
@lunny lunny added type/bug backport/v1.21 This PR should be backported to Gitea 1.21 labels Jan 26, 2024
@wxiaoguang
Copy link
Contributor

wxiaoguang commented Jan 26, 2024

I think this fix is not complete, and it is also related to #28119 (web editor EOL) and others, they are the sample problem.

The key problem is not "what MonacoEditor does", but it is "what Gitea should do".

Think about some files in a git repo:

  • A file without BOM
  • A file with UTF-8 BOM
  • A file with UTF-16 BOM
  • A file with "\n"
  • A file with "\r\n"

By existing approach, they could all be edited by MonacoEditor. The real problem is when "saving" the files, backend code does incorrect assumption: it only recognizes UTF-8 or "\n". So UTF-16 BOM and "\r\n" files are all changed unexpectedly.

So, the complete approach could be like this IMO: when saving, the backend code should first detect the file kind (BOM, EOL), then convert the submitted text to the expected format (correct BOM, correct EOL), then everything should work well.

@silverwind
Copy link
Member Author

silverwind commented Jan 27, 2024

So, the complete approach could be like this IMO: when saving, the backend code should first detect the file kind (BOM, EOL), then convert the submitted text to the expected format (correct BOM, correct EOL), then everything should work well.

I'm not sure such detection is neccesary. I would just ensure the file round trips through backend and frontend unmodified while editing.

EOL is a more complicated topic because of the possibility for mixed EOL, but BOM is imho quite a clearcut case: If it's there, keep it. Ideally even indicate its precence in the editor.

So for BOM, we should render it without BOM for display (see #6716 regarding rendering artifacts). Only when passing to the editor, preserve it and this is exactly what this PR does. I will test if UTF16 BOM works as well with this PR, I assume it will.

@silverwind
Copy link
Member Author

silverwind commented Jan 27, 2024

It seems UTF-16 rendering is broken in Gitea and a number of other tools. See this file. Gitea, VSCode, Monaco and Sublime Text all seem to convert it to UTF-8 and they render this:

image

GitHub renders the letters "test" correctly but shows a broken BOM rendering:

image

So I think UTF-16 is a diffferent topic for another PR. We should first render it correctly, then we can talk about BOM there.

Copy link
Contributor

@wxiaoguang wxiaoguang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there could be some improvements, eg: Remove ToUTF8WithFallback , and introduce "AutoFallback" option for "ConvertOpts"). And ToUTF8DropErrors looks strange, ideally there could be an option like IgnoreBroken.

And as the comment above, to resolve the UTF-16 BOM and "\r\n" problems, it needs a new approach.

Anyway, they are all legacy problems.

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Jan 27, 2024
@silverwind silverwind added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Jan 27, 2024
@silverwind silverwind enabled auto-merge (squash) January 27, 2024 17:39
@silverwind
Copy link
Member Author

Merging as-is. More improvements can come later. I agree we should move all function variants to the options struct.

@silverwind silverwind merged commit 60e4a98 into go-gitea:main Jan 27, 2024
25 checks passed
GiteaBot pushed a commit to GiteaBot/gitea that referenced this pull request Jan 27, 2024
The `ToUTF8*` functions were stripping BOM, while BOM is actually valid
in UTF8, so the stripping must be optional depending on use case. This
does:

- Add a options struct to all `ToUTF8*` functions, that by default will
strip BOM to preserve existing behaviour
- Remove `ToUTF8` function, it was dead code
- Rename `ToUTF8WithErr` to `ToUTF8`
- Preserve BOM in Monaco Editor
- Remove a unnecessary newline in the textarea value. Browsers did
ignore it, it seems but it's better not to rely on this behaviour.

Fixes: go-gitea#28743
Related: go-gitea#6716 which seems to
have once introduced a mechanism that strips and re-adds the BOM, but
from what I can tell, this mechanism was removed at some point after
that PR.
@GiteaBot GiteaBot added backport/done All backports for this PR have been created and removed reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. labels Jan 27, 2024
silverwind added a commit that referenced this pull request Jan 27, 2024
Backport #28935 by @silverwind

The `ToUTF8*` functions were stripping BOM, while BOM is actually valid
in UTF8, so the stripping must be optional depending on use case. This
does:

- Add a options struct to all `ToUTF8*` functions, that by default will
strip BOM to preserve existing behaviour
- Remove `ToUTF8` function, it was dead code
- Rename `ToUTF8WithErr` to `ToUTF8`
- Preserve BOM in Monaco Editor
- Remove a unnecessary newline in the textarea value. Browsers did
ignore it, it seems but it's better not to rely on this behaviour.

Fixes: #28743
Related: #6716 which seems to
have once introduced a mechanism that strips and re-adds the BOM, but
from what I can tell, this mechanism was removed at some point after
that PR.

Co-authored-by: silverwind <me@silverwind.io>
@silverwind silverwind deleted the bom branch January 27, 2024 22:23
zjjhot added a commit to zjjhot/gitea that referenced this pull request Jan 29, 2024
* giteaofficial/main:
  [skip ci] Updated licenses and gitignores
  Fix bug for generated repository object format (go-gitea#28969)
  Fixing small space missing in sample config file (go-gitea#28967)
  Fix inconsistent naming of OAuth 2.0 `ENABLE` setting (go-gitea#28951)
  Add screenshot for "Profile Readmes" to docs (go-gitea#28964)
  Simplify how git repositories are opened (go-gitea#28937)
  Preserve BOM in web editor (go-gitea#28935)
  Make loading animation less aggressive (go-gitea#28955)
  Fix SSPI user creation (go-gitea#28948)
  Strip `/` from relative links (go-gitea#28932)
  Fix non-alphabetic sorting of repo topics (go-gitea#28938)
  Don't remove all mirror repository's releases when mirroring (go-gitea#28817)
  Use new RPM constants (go-gitea#28931)
  Check for sha256 support to use --object-format flag (go-gitea#28928)
  fix: update enable_prune even if mirror_interval is not provided (go-gitea#28905)
  Implement `MigrateRepository` for the actions notifier (go-gitea#28920)
  Respect branch info for relative links (go-gitea#28909)
henrygoodman pushed a commit to henrygoodman/gitea that referenced this pull request Jan 31, 2024
The `ToUTF8*` functions were stripping BOM, while BOM is actually valid
in UTF8, so the stripping must be optional depending on use case. This
does:

- Add a options struct to all `ToUTF8*` functions, that by default will
strip BOM to preserve existing behaviour
- Remove `ToUTF8` function, it was dead code
- Rename `ToUTF8WithErr` to `ToUTF8`
- Preserve BOM in Monaco Editor
- Remove a unnecessary newline in the textarea value. Browsers did
ignore it, it seems but it's better not to rely on this behaviour.

Fixes: go-gitea#28743
Related: go-gitea#6716 which seems to
have once introduced a mechanism that strips and re-adds the BOM, but
from what I can tell, this mechanism was removed at some point after
that PR.
nrdufour added a commit to nrdufour/home-ops that referenced this pull request Feb 4, 2024
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [docker.io/gitea/gitea](https://github.com/go-gitea/gitea) | patch | `1.21.4` -> `1.21.5` |

---

### Release Notes

<details>
<summary>go-gitea/gitea (docker.io/gitea/gitea)</summary>

### [`v1.21.5`](https://github.com/go-gitea/gitea/releases/tag/v1.21.5)

[Compare Source](go-gitea/gitea@v1.21.4...v1.21.5)

-   SECURITY
    -   Prevent anonymous container access if `RequireSignInView` is enabled ([#&#8203;28877](go-gitea/gitea#28877)) ([#&#8203;28882](go-gitea/gitea#28882))
    -   Update go dependencies and fix go-git ([#&#8203;28893](go-gitea/gitea#28893)) ([#&#8203;28934](go-gitea/gitea#28934))
-   BUGFIXES
    -   Revert "Speed up loading the dashboard on mysql/mariadb ([#&#8203;28546](go-gitea/gitea#28546))" ([#&#8203;29006](go-gitea/gitea#29006)) ([#&#8203;29007](go-gitea/gitea#29007))
    -   Fix an actions schedule bug ([#&#8203;28942](go-gitea/gitea#28942)) ([#&#8203;28999](go-gitea/gitea#28999))
    -   Fix update enable_prune even if mirror_interval is not provided ([#&#8203;28905](go-gitea/gitea#28905)) ([#&#8203;28929](go-gitea/gitea#28929))
    -   Fix uploaded artifacts should be overwritten ([#&#8203;28726](go-gitea/gitea#28726)) backport v1.21 ([#&#8203;28832](go-gitea/gitea#28832))
    -   Preserve BOM in web editor ([#&#8203;28935](go-gitea/gitea#28935)) ([#&#8203;28959](go-gitea/gitea#28959))
    -   Strip `/` from relative links ([#&#8203;28932](go-gitea/gitea#28932)) ([#&#8203;28952](go-gitea/gitea#28952))
    -   Don't remove all mirror repository's releases when mirroring ([#&#8203;28817](go-gitea/gitea#28817)) ([#&#8203;28939](go-gitea/gitea#28939))
    -   Implement `MigrateRepository` for the actions notifier ([#&#8203;28920](go-gitea/gitea#28920)) ([#&#8203;28923](go-gitea/gitea#28923))
    -   Respect branch info for relative links ([#&#8203;28909](go-gitea/gitea#28909)) ([#&#8203;28922](go-gitea/gitea#28922))
    -   Don't reload timeline page when (un)resolving or replying conversation ([#&#8203;28654](go-gitea/gitea#28654)) ([#&#8203;28917](go-gitea/gitea#28917))
    -   Only migrate the first 255 chars of a Github issue title ([#&#8203;28902](go-gitea/gitea#28902)) ([#&#8203;28912](go-gitea/gitea#28912))
    -   Fix sort bug on repository issues list ([#&#8203;28897](go-gitea/gitea#28897)) ([#&#8203;28901](go-gitea/gitea#28901))
    -   Fix `DeleteCollaboration` transaction behaviour ([#&#8203;28886](go-gitea/gitea#28886)) ([#&#8203;28889](go-gitea/gitea#28889))
    -   Fix schedule not trigger bug because matching full ref name with short ref name ([#&#8203;28874](go-gitea/gitea#28874)) ([#&#8203;28888](go-gitea/gitea#28888))
    -   Fix migrate storage bug ([#&#8203;28830](go-gitea/gitea#28830)) ([#&#8203;28867](go-gitea/gitea#28867))
    -   Fix archive creating LFS hooks and breaking pull requests ([#&#8203;28848](go-gitea/gitea#28848)) ([#&#8203;28851](go-gitea/gitea#28851))
    -   Fix reverting a merge commit failing ([#&#8203;28794](go-gitea/gitea#28794)) ([#&#8203;28825](go-gitea/gitea#28825))
    -   Upgrade xorm to v1.3.7 to fix a resource leak problem caused by Iterate ([#&#8203;28891](go-gitea/gitea#28891)) ([#&#8203;28895](go-gitea/gitea#28895))
    -   Fix incorrect PostgreSQL connection string for Unix sockets ([#&#8203;28865](go-gitea/gitea#28865)) ([#&#8203;28870](go-gitea/gitea#28870))
-   ENHANCEMENTS
    -   Make loading animation less aggressive ([#&#8203;28955](go-gitea/gitea#28955)) ([#&#8203;28956](go-gitea/gitea#28956))
    -   Avoid duplicate JS error messages on UI ([#&#8203;28873](go-gitea/gitea#28873)) ([#&#8203;28881](go-gitea/gitea#28881))
    -   Bump `@github/relative-time-element` to 4.3.1 ([#&#8203;28819](go-gitea/gitea#28819)) ([#&#8203;28826](go-gitea/gitea#28826))
-   MISC
    -   Warn that `DISABLE_QUERY_AUTH_TOKEN` is false only if it's explicitly defined ([#&#8203;28783](go-gitea/gitea#28783)) ([#&#8203;28868](go-gitea/gitea#28868))
    -   Remove duplicated checkinit on git module ([#&#8203;28824](go-gitea/gitea#28824)) ([#&#8203;28831](go-gitea/gitea#28831))

Instances on **[Gitea Cloud](https://cloud.gitea.com)** will be automatically upgraded to this version during the specified maintenance window.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4xNjUuMCIsInVwZGF0ZWRJblZlciI6IjM3LjE2NS4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiJ9-->

Reviewed-on: https://git.home/nrdufour/home-ops/pulls/355
Co-authored-by: Renovate <renovate@ptinem.io>
Co-committed-by: Renovate <renovate@ptinem.io>
silverwind added a commit to silverwind/gitea that referenced this pull request Feb 20, 2024
The `ToUTF8*` functions were stripping BOM, while BOM is actually valid
in UTF8, so the stripping must be optional depending on use case. This
does:

- Add a options struct to all `ToUTF8*` functions, that by default will
strip BOM to preserve existing behaviour
- Remove `ToUTF8` function, it was dead code
- Rename `ToUTF8WithErr` to `ToUTF8`
- Preserve BOM in Monaco Editor
- Remove a unnecessary newline in the textarea value. Browsers did
ignore it, it seems but it's better not to rely on this behaviour.

Fixes: go-gitea#28743
Related: go-gitea#6716 which seems to
have once introduced a mechanism that strips and re-adds the BOM, but
from what I can tell, this mechanism was removed at some point after
that PR.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/done All backports for this PR have been created backport/v1.21 This PR should be backported to Gitea 1.21 lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Editor removes BOM
5 participants