Detect charset and convert non UTF-8 files for display #4950

Merged
lunny merged 6 commits into go-gitea:master from lafriks:fix/chardet_view_file on Sep 29, 2018

Conversation

@lafriks
Member

lafriks commented Sep 17, 2018

Fixes #4879

@lafriks lafriks added this to the 1.6.0 milestone Sep 17, 2018

@@ -72,6 +76,21 @@ func DetectEncoding(content []byte) (string, error) {
return result.Charset, err
}
// DetectEncodingAndConvert detects the encoding of content and converts to UTF-8 if possible
func DetectEncodingAndConvert(content []byte) []byte {

@lunny

lunny Sep 18, 2018

Member

We already have the same function in templates.ToUTF8WithErr(buf)

@lafriks

lafriks Sep 18, 2018

Member

Yes, but there are two problems with it. First, it returns a string, so I would need an unnecessary cast to string and back to a byte array. Second, it returns an error, but I need to return the default byte array in case of an error. But I will rename my function and move it to another module then.
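
Editor's note: the behavior described in this comment (work on byte slices and fall back to the original bytes when conversion fails) can be sketched as follows. This is a minimal illustration, not the code from the PR: `decodeFunc`, `toUTF8OrOriginal`, `latin1ToUTF8` and `failingDecoder` are hypothetical names, and a trivial ISO-8859-1 decoder stands in for the real detect-and-convert step so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// decodeFunc is a hypothetical stand-in for a charset decoder: it returns
// the UTF-8 form of content, or an error if it cannot decode it.
type decodeFunc func(content []byte) ([]byte, error)

// toUTF8OrOriginal returns content converted to UTF-8. It returns the input
// unchanged when it is already valid UTF-8 or when decoding fails, so the
// caller never has to handle an error or convert through a string.
func toUTF8OrOriginal(content []byte, decode decodeFunc) []byte {
	if utf8.Valid(content) {
		return content
	}
	converted, err := decode(content)
	if err != nil {
		return content
	}
	return converted
}

// latin1ToUTF8 decodes ISO-8859-1, where every byte maps to the code point
// with the same value. It stands in for a real detect-and-decode step.
func latin1ToUTF8(content []byte) ([]byte, error) {
	runes := make([]rune, len(content))
	for i, b := range content {
		runes[i] = rune(b)
	}
	return []byte(string(runes)), nil
}

// failingDecoder simulates a decoder that cannot handle the input.
func failingDecoder(content []byte) ([]byte, error) {
	return nil, errors.New("unknown charset")
}

func main() {
	latin1 := []byte{'c', 'a', 'f', 0xe9} // "café" encoded as ISO-8859-1
	fmt.Printf("%q\n", toUTF8OrOriginal(latin1, latin1ToUTF8))
	fmt.Printf("%q\n", toUTF8OrOriginal(latin1, failingDecoder))
}
```

The first call yields the converted text, and the second returns the input bytes untouched, which is the "default byte array in case of an error" behavior asked for above.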

@bkcsoft bkcsoft added the lgtm/need 2 label Sep 18, 2018

@lunny lunny referenced this pull request Sep 18, 2018

Closed

Fix chardet bug #4880

@lafriks

Member

lafriks commented Sep 18, 2018

@lunny fixed

@codecov-io

codecov-io commented Sep 21, 2018

Codecov Report

Merging #4950 into master will increase coverage by 0.01%.
The diff coverage is 63.63%.

@@            Coverage Diff            @@
##           master   #4950      +/-   ##
=========================================
+ Coverage   37.39%   37.4%   +0.01%     
=========================================
  Files         306     306              
  Lines       45343   45373      +30     
=========================================
+ Hits        16956   16974      +18     
- Misses      25933   25945      +12     
  Partials     2454    2454

Impacted Files                 Coverage Δ
routers/repo/view.go           45.06% <100%> (+0.17%) ⬆️
modules/templates/helper.go    48.95% <35.71%> (-0.58%) ⬇️
modules/base/tool.go           74.52% <81.25%> (-0.62%) ⬇️
modules/process/manager.go     81.15% <0%> (+4.34%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6780661...92d7a69.

@bkcsoft bkcsoft added lgtm/need 1 and removed lgtm/need 2 labels Sep 21, 2018

@JonasFranzDEV

Member

JonasFranzDEV commented Sep 21, 2018

@lunny Need your review

@lunny

Member

lunny commented Sep 21, 2018

When the content is too short, the detection is randomly wrong. I think we could repeat the content several times before detection.
See:

diff --git a/modules/base/tool.go b/modules/base/tool.go
index 2dfd8ffec..19953c319 100644
--- a/modules/base/tool.go
+++ b/modules/base/tool.go
@@ -59,7 +59,17 @@ func DetectEncoding(content []byte) (string, error) {
                return "UTF-8", nil
        }

-       result, err := chardet.NewTextDetector().DetectBest(content)
+       var detectedContent []byte
+       if len(content) < 1024 {
+               times := 1024 / len(content)
+               detectedContent = make([]byte, 0, times*len(content))
+               for i := 0; i < times; i++ {
+                       detectedContent = append(detectedContent, content...)
+               }
+       } else {
+               detectedContent = content
+       }
+       result, err := chardet.NewTextDetector().DetectBest(detectedContent)
        if err != nil {
                return "", err
        }
@lunny

Member

lunny commented Sep 21, 2018

I also tested changing 1024 to 100. The detection is still correct.
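
Editor's note: the repetition step from the diff above can be written as a small standalone helper. This is a sketch under assumptions, not the merged code: `repeatForDetection` is a hypothetical name, the threshold is a parameter rather than the fixed 1024, and the empty-input guard is added here because the helper has no earlier UTF-8 check to rely on.

```go
package main

import (
	"bytes"
	"fmt"
)

// repeatForDetection returns content repeated a whole number of times so
// that its length approaches minLen. Content that is empty or already at
// least minLen bytes long is returned unchanged.
func repeatForDetection(content []byte, minLen int) []byte {
	if len(content) == 0 || len(content) >= minLen {
		return content
	}
	// Integer division keeps every copy complete, so a multi-byte
	// sequence is never cut in the middle of the sample.
	return bytes.Repeat(content, minLen/len(content))
}

func main() {
	sample := []byte{0xc4, 0xe3, 0xba, 0xc3} // four bytes of non-UTF-8 text
	padded := repeatForDetection(sample, 1024)
	fmt.Println(len(sample), len(padded))
}
```

Repeating whole copies gives a statistical detector more bytes to score without introducing any byte sequence that was not already in the input, apart from the joins between copies.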

@lafriks

Member

lafriks commented Sep 23, 2018

@lunny I think that should be fixed in another PR

@lunny

Member

lunny commented Sep 23, 2018

@lafriks But this PR will be backported to release/v1.5, so I don't think we need two PRs to fix that. This problem is probably a bug in the chardet library, and my code is a temporary workaround. Also, it's not easy to update vendor on release/v1.5.

@lafriks lafriks force-pushed the lafriks:fix/chardet_view_file branch from d62f88d to ffc2b68 Sep 24, 2018

@lafriks

Member

lafriks commented Sep 24, 2018

@lunny added your requested changes

@lunny

lunny approved these changes Sep 25, 2018

@bkcsoft bkcsoft added lgtm/done and removed lgtm/need 1 labels Sep 25, 2018

@lunny

Member

lunny commented Sep 25, 2018

But CI failed.

}
// If there is an error, we concatenate the nicely decoded part and the
// original left over. This way we won't loose data.

@techknowlogick

techknowlogick Sep 28, 2018

Member

typo: loose should be lose
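
Editor's note: the code comment under review describes keeping the successfully decoded prefix and appending the undecoded remainder when decoding stops with an error. A minimal sketch of that idea follows. The names `partialDecoder`, `decodeKeepingRemainder` and `toyDecoder` are hypothetical, and the toy decoder (ISO-8859-1 that rejects byte 0x81) is only an assumption made so the example runs with the standard library.

```go
package main

import (
	"errors"
	"fmt"
)

// partialDecoder is a hypothetical decoder signature: it returns the decoded
// UTF-8 bytes, the number of input bytes consumed, and any error that
// stopped decoding early.
type partialDecoder func(content []byte) (decoded []byte, consumed int, err error)

// decodeKeepingRemainder decodes content and, on error, concatenates the
// cleanly decoded part with the original bytes that were not consumed, so
// no input data is dropped.
func decodeKeepingRemainder(content []byte, decode partialDecoder) []byte {
	decoded, consumed, err := decode(content)
	if err != nil {
		if consumed > len(content) {
			consumed = len(content)
		}
		return append(decoded, content[consumed:]...)
	}
	return decoded
}

// toyDecoder treats input as ISO-8859-1 but fails on byte 0x81. It exists
// only to produce a partial result for the demonstration.
func toyDecoder(content []byte) ([]byte, int, error) {
	out := make([]byte, 0, len(content)*2)
	for i, b := range content {
		if b == 0x81 {
			return out, i, errors.New("undecodable byte 0x81")
		}
		out = append(out, string(rune(b))...)
	}
	return out, len(content), nil
}

func main() {
	input := []byte{0xe9, 'a', 0x81, 'b'}
	fmt.Printf("%q\n", decodeKeepingRemainder(input, toyDecoder))
}
```

The result mixes UTF-8 (the decoded prefix) with raw bytes (the remainder), so it may not be valid UTF-8 as a whole, but every input byte is still represented.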

@techknowlogick

Member

techknowlogick commented Sep 28, 2018

CI is failing because of the following test:

assert.Error(t, err)

It fails to return an error when detecting the encoding of []byte{0xfa}.

I think this is related to the code that has the if len(content) < 1024 { conditional.
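
Editor's note: one way to keep the error for undetectable input while still repeating short content is to run the detector on the original bytes first, which matches the later commit title "Check if original content is valid before duplicating it". The sketch below illustrates that ordering only. `detectFunc`, `detectWithRepeat` and `toyDetect` are hypothetical names, and the toy detector's rule (fail on fewer than two bytes) is an assumption chosen to mimic the failing test, not the behavior of the chardet library.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// detectFunc is a hypothetical stand-in for a charset detector.
type detectFunc func(content []byte) (charset string, err error)

// detectWithRepeat repeats short content before detection, but only after
// the detector has accepted the original bytes, so repetition can never
// turn a detection failure into a confident answer.
func detectWithRepeat(content []byte, minLen int, detect detectFunc) (string, error) {
	if len(content) == 0 {
		return "", errors.New("empty content")
	}
	if len(content) < minLen {
		if _, err := detect(content); err != nil {
			return "", err
		}
		content = bytes.Repeat(content, minLen/len(content))
	}
	return detect(content)
}

// toyDetect fails on inputs shorter than two bytes and otherwise reports a
// fixed charset. It only simulates a detector that needs enough data.
func toyDetect(content []byte) (string, error) {
	if len(content) < 2 {
		return "", errors.New("not enough data to detect charset")
	}
	return "ISO-8859-1", nil
}

func main() {
	single := []byte{0xfa}
	// Without the guard, the repeated input looks classifiable.
	_, errRepeated := toyDetect(bytes.Repeat(single, 1024))
	// With the guard, the original failure is preserved.
	_, errGuarded := detectWithRepeat(single, 1024, toyDetect)
	fmt.Println(errRepeated == nil, errGuarded != nil)
}
```

With this ordering, a test such as assert.Error on []byte{0xfa} keeps passing as long as the detector itself rejects the unrepeated input.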

@lafriks lafriks force-pushed the lafriks:fix/chardet_view_file branch from ffc2b68 to fab01bb Sep 29, 2018

@lafriks

Member

lafriks commented Sep 29, 2018

Fixed tests and comments

@lunny lunny merged commit 81702e6 into go-gitea:master Sep 29, 2018

2 checks passed

approvals/lgtm this commit looks good
continuous-integration/drone/pr the build was successful
@lunny

Member

lunny commented Sep 29, 2018

Please send a backport to v1.5.

@lafriks lafriks deleted the lafriks:fix/chardet_view_file branch Sep 29, 2018

lafriks added a commit to lafriks/gitea that referenced this pull request Sep 29, 2018

Detect charset and convert non UTF-8 files for display (go-gitea#4950)
* Detect charset and convert non UTF-8 files for display

* Refactor and move function to correct module

* Revert unrelated changes

* More unrelated changes

* Duplicate content for small text to have better encoding detection

* Check if original content is valid before duplicating it

lunny added a commit that referenced this pull request Sep 30, 2018

Detect charset and convert non UTF-8 files for display (#4950) (#4994)

aswild added a commit to aswild/gitea that referenced this pull request Oct 24, 2018

Merge tag 'v1.6.0-rc1'
Prepare for wild/v1.6 branch

* BREAKING
  * Respect email privacy option in user search via API (go-gitea#4512)
  * Simply remove tidb and deps (go-gitea#3993)
  * Swagger.v1.json template (go-gitea#3572)
* FEATURE
  * Pull request review/approval and comment on code (go-gitea#3748)
  * Added dependencies for issues (go-gitea#2196) (go-gitea#2531)
  * Add the ability to have built in themes in Gitea and provide dark theme arc-green (go-gitea#4198)
  * Add sudo functionality to the API (go-gitea#4809)
  * Add oauth providers via cli (go-gitea#4591)
  * Disable merging a WIP Pull request (go-gitea#4529)
  * Force user to change password (go-gitea#4489)
  * Add letsencrypt to Gitea (go-gitea#4189)
  * Add push webhook support for mirrored repositories (go-gitea#4127)
  * Add csv file render support defaultly (go-gitea#4105)
  * Add Recaptcha functionality to Gitea (go-gitea#4044)
* BUGFIXES
  * Fix release creation via API (go-gitea#5076)
  * Remove links from topics in edit mode  (go-gitea#5026)
  * Fix missing AppSubUrl in few more templates (fixup) (go-gitea#5021)
  * Fix missing AppSubUrl in some templates (go-gitea#5020)
  * Hide outdated comments in file view (go-gitea#5017)
  * Upgrade gopkg.in/testfixtures.v2 (go-gitea#4999)
  * Disable debug routes unless PPROF is enabled in configuration (go-gitea#4995)
  * Fix user menu item styling (go-gitea#4985)
  * Fix layout of the topics editing form (go-gitea#4971)
  * Fix null pointer dereference in ParseCommitWithSignature (go-gitea#4962)
  * Fix url in discord webhook (go-gitea#4953)
  * Detect charset and convert non UTF-8 files for display (go-gitea#4950)
  * Make sure to catch the right error so it is displayed on the UI (go-gitea#4945)
  * Fix(topics): don't redirect to explore page. (go-gitea#4938)
  * Fix bug forget to remove Stopwatch when remove repository (go-gitea#4928)
  * Fix bug when repo remained bare if multiple branches pushed in single push (go-gitea#4923)
  * Fix: Let's Encrypt configuration settings (go-gitea#4911)
  * Fix: Crippled diff (go-gitea#4726) (go-gitea#4900)
  * Fix trimming of markup section names (go-gitea#4863)
  * Issues api allow pulls and fix go-gitea#4832 (go-gitea#4852)
  * Do not autocreate directory for new users/orgs (go-gitea#4828) (go-gitea#4849)
  * Fix redirect with non-ascii branch names (go-gitea#4764) (go-gitea#4810)
  * Fix missing release title in webhook (go-gitea#4783) (go-gitea#4796)
  * User shouldn't be able to approve or reject his/her own PR (go-gitea#4729)
  * Make sure to reset commit count in the cache on mirror syncing (go-gitea#4720)
  * Fixed bug where team with admin privelege type doesn't get any unit  (go-gitea#4719)
  * Fix incorrect caption of webhook setting (go-gitea#4701) (go-gitea#4717)
  * Allow WIP marker to contains < or > (go-gitea#4709)
  * Hide org/create menu item in Dashboard if user has no rights (go-gitea#4678) (go-gitea#4680)
  * Site admin could create repos even MAX_CREATION_LIMIT=0 (go-gitea#4645)
  * Fix custom templates being ignored (go-gitea#4638)
  * Fix starring icon after semantic ui update (go-gitea#4628)
  * Fix Split-View line adjustment (go-gitea#4622)
  * Fix integer constant overflows in tests (go-gitea#4616)
  * Push whitelist now doesn't apply to branch deletion (go-gitea#4601) (go-gitea#4607)
  * Fix bugs when too many IN variables (go-gitea#4594)
  * Fix failure on creating pull request with assignees (go-gitea#4419) (go-gitea#4583)
  * Fix panic issue on update avatar email (go-gitea#4580) (go-gitea#4581)
  * Fix status code label for a successful webhook (go-gitea#4540)
  * An inactive user shouldn't be able to be added as a collaborator (go-gitea#4535)
  * Don't fail silently if trying to add a collaborator twice (go-gitea#4533)
  * Fix incorrect MergeWhitelistTeamIDs check in CanUserMerge function (go-gitea#4519) (go-gitea#4525)
  * Fix out-of-transaction query in removeOrgUser (go-gitea#4521) (go-gitea#4522)
  * Fix migration from older releases (go-gitea#4495)
  * Accept 'Data:' in commit graph (go-gitea#4487)
  * Update xorm to latest version and fix correct `user` table referencing in sql (go-gitea#4473)
  * Relative URLs for LibreJS page (go-gitea#4460)
  * Redirect to correct page after using scratch token (go-gitea#4458)
  * Fix column droping for MSSQL that need new transaction for that (go-gitea#4440)
  * Replace src with raw to fix image paths (go-gitea#4377)
  * Add default merge options when creating new repository (go-gitea#4369)
  * Fix docker build (go-gitea#4358)
  * Fixes repo membership check in API (go-gitea#4341)
  * Dep upgrade mysql lib (go-gitea#4161)
  * Fix some issues with special chars in branch names (go-gitea#3767)
  * Responsive design fixes (go-gitea#4508)
* ENHANCEMENT
  * Fix milestones sorted wrongly (go-gitea#4987)
  * Allow api to create tags for releases if they don't exist (go-gitea#4890)
  * Fix go-gitea#4877 to follow the OpenID Connect Audiences spec (go-gitea#4878)
  * Enforce token on api routes [fixed critical security issue go-gitea#4357] (go-gitea#4840)
  * Update legacy branch and tag URLs in dashboard to new format (go-gitea#4812)
  * Slack webhook channel name cannot be empty or just contain an hashtag (go-gitea#4786)
  * Add whitespace handling to PR-comparsion (go-gitea#4683)
  * Make reverse proxy auth optional (go-gitea#4643)
  * MySQL TLS (go-gitea#4642)
  * Make sure to set PR split view when creating/previewing a pull request  (go-gitea#4617)
  * Log user in after a successful sign up (go-gitea#4615)
  * Fix typo IsPullReuqestBroken -> IsPullRequestBroken (go-gitea#4578)
  * Allow admin toggle forcing a password change for newly created users (go-gitea#4563)
  * Update jQuery to v1.12.4 (go-gitea#4551)
  * Env var GITEA_PUSHER_EMAIL (go-gitea#4516)
  * Feat(repo): support search repository by topic name (go-gitea#4505)
  * Small improvements to dependency UI (go-gitea#4503)
  * Make max commits in graph configurable (go-gitea#4498)
  * Add valid for lfs oid (go-gitea#4461)
  * Add shortcut to save wiki page (go-gitea#4452)
  * Allow administrator to create repository for any organization (go-gitea#4368)
  * Fix repository last updated time update when delete a user who watched the repo (go-gitea#4363)
  * Switch plaintext scratch tokens to use hash instead (go-gitea#4331)
  * Increase default TOTP secret size to 320 bits (go-gitea#4287)
  * Keep preseeded database password (go-gitea#4284)
  * Implemented hover text showing user FullName (go-gitea#4261)
  * Add ability to delete a token (go-gitea#4235)
  * Fix typos in i18n variable names. (go-gitea#4080)
  * Api: repos/search: add parameters to control the sort order (go-gitea#3964)
  * Add missing path in the Docker app.ini template (go-gitea#2181)
  * Add file name and branch to page title (go-gitea#4902)
  * Offline use of google fonts (go-gitea#4872)
  * Add missing History link to directory listings v2 (go-gitea#4829)
  * Locale for Edit and Remove due date issue (go-gitea#4802)
  * Disable 'May Import Local Repository' when is disabled by setting (Is… (go-gitea#4780)
  * API /admin/users/{username} missing parameter (go-gitea#4775)
  * Display error when adding a user to a team twice (go-gitea#4746)
  * Remove UsePrivilegeSeparation from the Docker sshd_config, see go-gitea#2876 (go-gitea#4722)
  * Focus title input when clicking helper link (go-gitea#4696)
  * Add vendor to user reserved words and format words list according alphabet (go-gitea#4685)
  * Add gitea/issues link to 500 page (go-gitea#4654)
  * Hide home button when landing page is not set to home (go-gitea#4651)
  * Remove link to GitHub issues in 404 template (go-gitea#4639)
  * Cmd/serve: pprof cpu and memory profile dumps to disk (go-gitea#4560)
  * Add flash message after an account has been successfully activated (go-gitea#4510)
  * Prevent html entity escaping on delete branch (go-gitea#4471)
  * Locale for button Edit on protected branch (go-gitea#4442)
  * Update notification icon (go-gitea#4343)
  * Added front-end topics validation (go-gitea#4316)
  * Don't display buttons if there are no system notifications (go-gitea#4280)
  * Issue due date api (go-gitea#3890)
* SECURITY
  * Improve URL validation for external wiki  and external issues (go-gitea#4710)
  * Make cookies HttpOnly and obey COOKIE_SECURE flag (go-gitea#4706)
  * Don't disclose emails of all users when sending out emails (go-gitea#4664)
  * Check that repositories can only be migrated to own user or organizations (go-gitea#4366)
* TRANSLATION
  * Fix punctuation in English translation (go-gitea#4958)
  * Fix translation (go-gitea#4355)

HoffmannP pushed a commit to HoffmannP/gitea that referenced this pull request Nov 14, 2018

Detect charset and convert non UTF-8 files for display (go-gitea#4950)