Update Linguist to v7.26.0 #169

github-actions · 2023-09-06T17:31:18Z

Automated Linguist update 🤖

This PR updates Linguist from bf853f1c to v7.26.0 (b5432ebc)

bzz · 2023-09-06T17:53:17Z

Tests validating the output on Linguist fixtures fail in a single test case:

 --- FAIL: Test_EnryOnLinguistCorpus (4.14s)
     --- FAIL: Test_EnryOnLinguistCorpus/TestLinguistSamples (2.53s)
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/Imperial Units Remover.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/abp-filters-anti-cv.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/anti-facebook.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/fake-news.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/test_rules.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/SQL/drop_stuff.sql"	expected: "SQL"	got: "SQL"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Vim Script/textobj-rubyblock.vba"	expected: "Vim Script"	got: "Vim Help File"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"             	            	
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/ff-man.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"             	            	
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/info.min.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/info.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/man.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:74: 		total files: 2877, ok: 2867, failed: 10, other: 0

This now needs to be debugged to understand the origin of the failures, and then either fixed or documented and added to the known corner cases.

DecimalTurn · 2023-09-06T18:27:00Z

data/content.go

@@ -1579,13 +1690,30 @@ var ContentHeuristics = map[string]*Heuristics{
 	".stl": &Heuristics{
 		rule.Or(
 			rule.MatchingLanguages("STL"),
-			regex.MustCompileRuby(`\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)`),
+			regex.MustCompileRuby(`\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)`),


There is currently a patch to v7.26.0 that is on the way: github-linguist/linguist#6518.
It fixes a performance issue associated with this regex pattern.

I'm wondering whether it's worth waiting for this patch to be incorporated into a future Linguist version, or whether the change suggested there could simply be applied manually by changing \A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s) to \A\s*solid(?:$|\s)[\s\S]*^endsolid(?:$|\s)
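A minimal sketch of how the two patterns behave under Go's standard regexp package (RE2 syntax). It only checks compilation and a tiny made-up ASCII STL sample; it does not measure the Ruby-side performance issue. The helper names `compiles` and `matchesSTL` and the `(?m)` prefix (used here to emulate Ruby's line-anchored ^ and $) are assumptions for illustration, not enry's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the discussion above.
const (
	current  = `\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)`
	proposed = `\A\s*solid(?:$|\s)[\s\S]*^endsolid(?:$|\s)`
)

// compiles reports whether Go's standard regexp package accepts the pattern.
// "(?m)" is prepended so that ^ and $ act as line anchors (an assumption
// made here to mimic Ruby semantics).
func compiles(pattern string) bool {
	_, err := regexp.Compile("(?m)" + pattern)
	return err == nil
}

// matchesSTL applies the proposed pattern to the given content.
func matchesSTL(content string) bool {
	return regexp.MustCompile("(?m)" + proposed).MatchString(content)
}

func main() {
	// Made-up sample, not taken from the Linguist corpus.
	sample := "solid cube\n  facet normal 0 0 1\n  endfacet\nendsolid cube\n"

	fmt.Println("current compiles: ", compiles(current))
	fmt.Println("proposed compiles:", compiles(proposed))
	fmt.Println("proposed matches: ", matchesSTL(sample))
}
```

The current pattern is rejected by the standard engine because of the lookahead `(?=...)`, while the proposed one uses only constructs that RE2 syntax accepts.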

Thanks for digging into it!

Personally, I would wait for the patch to propagate instead of applying rewrites at runtime (Enry's policy is not to change the original regexps during code generation).

That being said, I'll be happy to accept a PR doing a (temporary) re-write at runtime, if we want a release before Linguist v7.26.1 is out.

Seems like the next release is coming soon (most likely early next week), so I'm assuming it's better to wait for it and have everything in order + the new changes coming in v7.27.0.

DecimalTurn · 2023-09-08T13:42:01Z

data/content.go

@@ -1315,16 +1413,29 @@ var ContentHeuristics = map[string]*Heuristics{
 	".plist": &Heuristics{
 		rule.Or(
 			rule.MatchingLanguages("XML Property List"),
-			regex.MustCompileMultiline(`<!DOCTYPE\s+plist`),
+			regex.MustCompileRuby(`^\s*(?:<\?xml\s|<!DOCTYPE\s+plist|<plist(?:\s+version\s*=\s*(["'])\d+(?:\.\d+)?\1)?\s*>\s*$)`),


The problems discussed in this comment are most likely coming from the fact that the new heuristic rule for XML Property List now contains a backreference (\1).
It's there to make sure the pattern matches the same type of quotation marks (single or double).

Seems like replacing \1 with ["'] would be a simple way to handle this. I'm not an expert in the syntaxes at hand, but I don't see how using any quotation marks could create any false positives.
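To make the suggestion concrete, here is a small sketch under Go's standard regexp package. The `relaxed` constant is the rule with \1 replaced by ["'] (and the now-unneeded capture group dropped); the helper names and the sample strings are made up for illustration, and the `(?m)` prefix is an assumption used to emulate Ruby's line anchors.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	// Rule as generated from Linguist v7.26.0, with the backreference.
	withBackref = `^\s*(?:<\?xml\s|<!DOCTYPE\s+plist|<plist(?:\s+version\s*=\s*(["'])\d+(?:\.\d+)?\1)?\s*>\s*$)`
	// Same rule with \1 replaced by a plain quote class.
	relaxed = `^\s*(?:<\?xml\s|<!DOCTYPE\s+plist|<plist(?:\s+version\s*=\s*["']\d+(?:\.\d+)?["'])?\s*>\s*$)`
)

// compilesInGo reports whether Go's standard regexp package accepts the pattern.
func compilesInGo(pattern string) bool {
	_, err := regexp.Compile("(?m)" + pattern)
	return err == nil
}

// isXMLPlist applies the relaxed rule to the given content.
func isXMLPlist(content string) bool {
	return regexp.MustCompile("(?m)" + relaxed).MatchString(content)
}

func main() {
	fmt.Println("with backreference compiles:", compilesInGo(withBackref))
	fmt.Println("relaxed compiles:           ", compilesInGo(relaxed))

	samples := []string{
		`<plist version="1.0">`, // matching quotes
		`<plist version="1.0'>`, // mismatched quotes: accepted by the relaxed rule only
		`{ key = value; }`,      // not XML at all
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, isXMLPlist(s))
	}
}
```

The only behavioural difference is the mismatched-quotes case, which the relaxed rule accepts and the backreference version would reject.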

Example in Rubular (link) using ff-man.plist (introduced here).

Thank you for sharing these details - it's very useful to know that more Linguist upstream features rely on regex syntax that is definitely not going to be supported by the current default Go regex engine.

I don't think that re-writing cases like this on our side is a sustainable policy for enry project: so far as they were few, we just skipped such rules at runtime and documented the divergence for the default configuration (RE2). Clients who need the precision would build their applications with the oniguruma engine (-tags oniguruma, see https://github.com/go-enry/go-enry/#faster-regexp-engine-optional ), which is faster and does support this feature, but comes at the price of going through the native bindings.

As for the rule syntax change - if it does not introduce any changes in the results, not using backreferences is faster and thus it should be proposed as a PR to the Linguist itself.

It means that for the next release we'll probably have to go "the usual route" (skip & document) for this case, as soon as the oniguruma CI profile passes (after this PR is rebased on #170, which is stuck at the moment, as I was planning to look into Python CI next week - any help will be greatly appreciated).

This makes me think that in future versions of enry it'll be important to have an option to use a feature-rich pure-Go regex engine! One option is adding https://github.com/dlclark/regexp2, measuring its runtime performance impact (through our benchmark) and checking how many of the existing divergences it would get rid of, to decide whether it could be used as a default.

Hope this helps!

I don't think that re-writing cases like this on our side is a sustainable policy for enry project: so far as they were few, we just skipped such rules at runtime and documented the divergence for the default configuration (RE2).

Sounds good. When you say skipping, do you mean that all files with that extension will skip the regex-heuristic step and go directly to the Bayesian classifier?

As for the rule syntax change - if it does not introduce any changes in the results, not using backreferences is faster and thus it should be proposed as a PR to the Linguist itself.

Could you please create an issue in the Linguist project itself listing which regex features should be kept to a minimum, such as backreferences and lookarounds, and explaining why it matters for Linguist, Enry and GitHub in general? It would be easier to have an issue to refer to when submitting PRs that try to remove those expressions where applicable.

all files with that extension will skip the regex-heuristic step

That would be the effect of skipping such rules - what I meant is that they are not going to be evaluated (they are skipped) at runtime, even though they are still present in the generated code.

All the incompatibilities are logged during code generation and are available on every CI run for Linguist update.
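The "skip at runtime" behaviour described above can be sketched roughly as follows. This is an illustration only, assuming a rule whose pattern fails to compile is stored as nil and ignored during matching; the `rule` type and the `compileOrSkip` and `matchFirst` helpers are hypothetical and are not enry's actual API.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical stand-in for one content heuristic: a language name
// plus a compiled pattern. A nil pattern means "unsupported, skip".
type rule struct {
	language string
	pattern  *regexp.Regexp
}

// compileOrSkip returns nil (and reports the problem) instead of panicking
// when the pattern uses syntax that Go's standard regexp package rejects.
func compileOrSkip(pattern string) *regexp.Regexp {
	re, err := regexp.Compile("(?m)" + pattern)
	if err != nil {
		fmt.Printf("skipping unsupported pattern: %v\n", err)
		return nil
	}
	return re
}

// matchFirst returns the language of the first rule that matches, ignoring
// rules whose pattern is nil. ok is false when no rule decides, in which case
// a caller would fall through to its next detection strategy.
func matchFirst(rules []rule, content []byte) (language string, ok bool) {
	for _, r := range rules {
		if r.pattern == nil {
			continue // rule is present but not evaluated
		}
		if r.pattern.Match(content) {
			return r.language, true
		}
	}
	return "", false
}

func main() {
	rules := []rule{
		// Simplified pattern with a backreference: rejected, so the rule is skipped.
		{"XML Property List", compileOrSkip(`<plist(?:\s+version\s*=\s*(["'])\d+\1)?\s*>`)},
		// The previous, simpler rule from the diff: supported.
		{"XML Property List", compileOrSkip(`<!DOCTYPE\s+plist`)},
	}

	fmt.Println(matchFirst(rules, []byte("<!DOCTYPE plist PUBLIC>\n")))
	fmt.Println(matchFirst(rules, []byte("<plist version=\"1.0\">\n")))
}
```

In this sketch the second input is left undecided because the only rule that could recognise it was skipped, which is the kind of divergence that would then need to be documented.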

bzz · 2023-09-13T22:07:01Z

Rebased on the latest master.

test plan:
 * ENRY_TEST_REPO=".linguist" go test -run '^TestIsConfiguration$' github.com/go-enry/go-enry/v2

test plan - go test -run '^Test_EnryOnLinguistCorpus$' github.com/go-enry/go-enry/v2

bzz · 2023-09-22T12:46:51Z

Python tests are 🟢 now.

bzz · 2023-09-22T12:51:08Z

Code generation tests are flaky (as the order is not fixed) 🙄

            -	"HOSTS":                        {"Hosts File", "INI"},
            +	"HOSTS":                        {"INI", "Hosts File"},

Going to fix it here and, as soon as CI is 🟢, planning to merge and release v2.8.4

Otherwise, generator tests are flaky.

test plan:
 * make code-generate
 * go test -run '^TestGetLanguagesByFilename$' github.com/go-enry/go-enry/v2
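As an illustration of why the order was not fixed: Go randomizes map iteration order, so any generated slice built while ranging over a map can differ between runs. A minimal sketch of one way to make it deterministic, assuming a hypothetical `languagesByFilename` helper and made-up input data (the actual generator fix may order languages differently):

```go
package main

import (
	"fmt"
	"sort"
)

// languagesByFilename inverts a language -> filenames map into a
// filename -> languages map. Ranging over a map has no fixed order in Go,
// so each resulting slice is sorted to keep the output stable across runs.
func languagesByFilename(filenames map[string][]string) map[string][]string {
	out := make(map[string][]string)
	for lang, names := range filenames {
		for _, name := range names {
			out[name] = append(out[name], lang)
		}
	}
	for name := range out {
		sort.Strings(out[name])
	}
	return out
}

func main() {
	// Illustrative input only, not the real Linguist data.
	in := map[string][]string{
		"INI":        {"HOSTS"},
		"Hosts File": {"HOSTS", "hosts"},
	}
	fmt.Printf("%q\n", languagesByFilename(in)["HOSTS"])
}
```

Sorting alphabetically is just the simplest stable choice for the sketch; any fixed ordering would remove the flakiness.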

bzz · 2023-09-22T13:14:43Z

https://github.com/go-enry/go-enry/releases/tag/v2.8.5 is out 🚀

DecimalTurn · 2023-09-22T16:31:16Z

Edit: Now that the update was made for v7.27.0, this comment is irrelevant.

@bzz Thanks for making the update! I just hope there won't be any issues with large .stl files as discussed here: #169 (comment)

Was there a reason to update to v7.26 instead of the newly released v7.27?

bzz closed this Sep 6, 2023

bzz reopened this Sep 6, 2023

bzz mentioned this pull request Sep 6, 2023

Update Linguist to v7.26.0 #167

Merged

bzz added the enhancement New feature or request label Sep 6, 2023

DecimalTurn suggested changes Sep 6, 2023

DecimalTurn reviewed Sep 8, 2023

Updated Linguist to v7.26.0

84c996d

bzz force-pushed the feature/sync-linguist-bf853f1c branch from 3d8e655 to 84c996d Compare September 13, 2023 22:06

bzz added 2 commits September 22, 2023 14:42

IsConfiguration: add&fix failing Python case to Go

7db593c

test plan:
 * ENRY_TEST_REPO=".linguist" go test -run '^TestIsConfiguration$' github.com/go-enry/go-enry/v2

test: fix Python tests

8b8cc8a

bzz force-pushed the feature/sync-linguist-bf853f1c branch from c293df9 to 8b8cc8a Compare September 22, 2023 12:42

bzz added 4 commits September 22, 2023 14:46

test: usability in err msg on linguist clone

561ffd9

test: output format readability improvement

fc4d2aa

test: drop irrelevant corner case (fixed upstream)

cc878e3

test: add new corner cases for linguist v7.26

bd36304

test plan - go test -run '^Test_EnryOnLinguistCorpus$' github.com/go-enry/go-enry/v2

LanguagesByFilename: fix language order in generated code

dc1110e

Otherwise, generator tests are flaky.

test plan:
 * make code-generate
 * go test -run '^TestGetLanguagesByFilename$' github.com/go-enry/go-enry/v2

bzz merged commit 6d9d5e9 into master Sep 22, 2023
28 checks passed

bzz deleted the feature/sync-linguist-bf853f1c branch September 22, 2023 13:11

github-actions bot commented Sep 6, 2023
bzz commented Sep 6, 2023 • edited
DecimalTurn Sep 6, 2023 • edited
bzz Sep 6, 2023 • edited
DecimalTurn Sep 7, 2023
DecimalTurn Sep 8, 2023 • edited
bzz Sep 9, 2023 • edited
DecimalTurn Sep 10, 2023 • edited
bzz Sep 11, 2023 • edited
bzz commented Sep 13, 2023
bzz commented Sep 22, 2023
bzz commented Sep 22, 2023 • edited
bzz commented Sep 22, 2023
DecimalTurn commented Sep 22, 2023 • edited
