Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Linguist to v7.26.0 #169

Merged
merged 8 commits into from
Sep 22, 2023
Merged

Update Linguist to v7.26.0 #169

merged 8 commits into from
Sep 22, 2023

Conversation

github-actions[bot]
Copy link

@github-actions github-actions bot commented Sep 6, 2023

Automated Linguist update 馃

This PR updates Linguist from bf853f1c to v7.26.0 (b5432ebc)

@bzz bzz closed this Sep 6, 2023
@bzz bzz reopened this Sep 6, 2023
@bzz bzz mentioned this pull request Sep 6, 2023
@bzz
Copy link
Member

bzz commented Sep 6, 2023

Tests validating the output on Linguist fixtures fail on a single case

 --- FAIL: Test_EnryOnLinguistCorpus (4.14s)
     --- FAIL: Test_EnryOnLinguistCorpus/TestLinguistSamples (2.53s)
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/Imperial Units Remover.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/abp-filters-anti-cv.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/anti-facebook.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/fake-news.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Adblock Filter List/test_rules.txt"	expected: "Adblock Filter List"	got: "Text"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/SQL/drop_stuff.sql"	expected: "SQL"	got: "SQL"
         linguist_corpus_test.go:68: 		[corner case] file: "/tmp/linguist-389325045/samples/Vim Script/textobj-rubyblock.vba"	expected: "Vim Script"	got: "Vim Help File"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"             	            	
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/ff-man.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"             	            	
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/info.min.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/info.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:70: 
             	Error Trace:	/home/runner/work/go-enry/go-enry/linguist_corpus_test.go:70
             	Error:      	Not equal: 
             	            	expected: "XML Property List"
             	            	actual  : "OpenStep Property List"
             	Messages:   	file: "/tmp/linguist-389325045/samples/XML Property List/man.plist"	expected: "XML Property List"	got: "OpenStep Property List"
         linguist_corpus_test.go:74: 		total files: 2877, ok: 2867, failed: 10, other: 0

It now need to be debugged to understand the origin of the failure and either fixed, or documented & added to the known corner-cases.

@bzz bzz added the enhancement New feature or request label Sep 6, 2023
@@ -1579,13 +1690,30 @@ var ContentHeuristics = map[string]*Heuristics{
".stl": &Heuristics{
rule.Or(
rule.MatchingLanguages("STL"),
regex.MustCompileRuby(`\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)`),
regex.MustCompileRuby(`\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)`),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently a patch to v7.26.0 that is on the way: github-linguist/linguist#6518.
It fixes a performance issue associated with this regex pattern.

I'm wondering if it's worth waiting for this patch to be incorporated into a future Linguist version or if the change suggested there could be simply applied manually by changing \A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s) to \A\s*solid(?:$|\s)[\s\S]*^endsolid(?:$|\s)

Copy link
Member

@bzz bzz Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into it!

Personally, I would wait for a patch to propagate instead of apply re-writes at runtime (as Enry policy is not to change the original regexps during code generation).

That being said, I'll be happy to accept a PR doing a (temporary) re-write at runtime, if we want a release before Linguist v7.26.1 is out.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the next release is coming soon (most likely early next week), so I'm assuming it's better to wait for it and have everything in order + the new changes coming in v7.27.0.

@@ -1315,16 +1413,29 @@ var ContentHeuristics = map[string]*Heuristics{
".plist": &Heuristics{
rule.Or(
rule.MatchingLanguages("XML Property List"),
regex.MustCompileMultiline(`<!DOCTYPE\s+plist`),
regex.MustCompileRuby(`^\s*(?:<\?xml\s|<!DOCTYPE\s+plist|<plist(?:\s+version\s*=\s*(["'])\d+(?:\.\d+)?\1)?\s*>\s*$)`),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problems discussed in this comment are most likely coming from the fact that the new heuristic rule for XML Property List now contains a backreference (\1).
It's there to make sure the pattern matches the same type of quotation marks (single or double).

Seems like replacing \1 with ["'] would be a simple way to handle this. I'm not an expert in the syntaxes at hand, but I don't see how using any quotation marks could create any false positives.

Example in Rubular (link) using ff-man.plist introduced here):
image

Copy link
Member

@bzz bzz Sep 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing these detailed - it's very useful to know that more Linguist upstream features rely on regex syntax that is definitely not going to be supported by the current default Go regex engine.

I don't think that re-writing cases like this on our side is a sustainable policy for enry project: so far as they were few, we just skipped such rules at runtime and documented the divergence for the default configuration (RE2). The clients who need the precision would build their applications with oniguruma engine (-tag oniguruma, see https://github.com/go-enry/go-enry/#faster-regexp-engine-optional ) that is faster and does support this feature but comes at a price of going though the native bindings.

As for the rule syntax change - if it does not introduce any changes in the results, not using backreferences is faster and thus it should be proposed as a PR to the Linguist itself.

It means that for the next release we'll probably have to go "the usual route" (skip&document) this case, as soon as oniguruma CI profile passes (after this PR is rebased on #170, that ATM is stuck as I was planning to look into Python CI next week - any help will be greatly appreciated).

This makes me think that in the future versions of enry it'll be important to have an option to use a feature-rich pure-Go regex engine! One option is adding https://github.com/dlclark/regexp2, measuring it's runtime performance impact (though our benchmark) and checking how many of the existing divergences that would get rid of to make a decision if it could be used as a default.

Hope this helps!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that re-writing cases like this on our side is a sustainable policy for enry project: so far as they were few, we just skipped such rules at runtime and documented the divergence for the default configuration (RE2).

Sounds good. When you say skipping, do you mean that all files with that extension will skip the regex-heuristic step and go directly to the baysian classifier?

As for the rule syntax change - if it does not introduce any changes in the results, not using backreferences is faster and thus it should be proposed as a PR to the Linguist itself.

Could you please create an issue in the Linguist project itself listing what regex features should be kept to a minimum like backreference and lookarounds and explain why it matters for Linguist, Enry and Github in general? It would be easier to have an issue to refer to when submitting PRs that try to remove those expressions where applicable.

Copy link
Member

@bzz bzz Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all files with that extension will skip the regex-heuristic step

that would be the effect of skipping such rules - what I meant is they are not going to be evaluated (skipped) at the runtime, even though still present in the generated code.

All the incompatibilities are logged during code generation and are available on every CI run for Linguist update.

@bzz bzz force-pushed the feature/sync-linguist-bf853f1c branch from 3d8e655 to 84c996d Compare September 13, 2023 22:06
@bzz
Copy link
Member

bzz commented Sep 13, 2023

Rebased on the latest master.

test plan:
 * ENRY_TEST_REPO=".linguist" \
    go test -run '^TestIsConfiguration$' github.com/go-enry/go-enry/v2
@bzz bzz force-pushed the feature/sync-linguist-bf853f1c branch from c293df9 to 8b8cc8a Compare September 22, 2023 12:42
@bzz
Copy link
Member

bzz commented Sep 22, 2023

Python tests are 馃煝 now.

@bzz
Copy link
Member

bzz commented Sep 22, 2023

Code generation tests are flaky (as the order is not fixed) 馃檮

            -	"HOSTS":                        {"Hosts File", "INI"},
            +	"HOSTS":                        {"INI", "Hosts File"},

Going to fix it here and as soon as CI is 馃煝 - planing to merge and release v2.8.4

Otherwise, generator tests are flaky

test plan
 * make code-generate
 * go test -run '^TestGetLanguagesByFilename$' github.com/go-enry/go-enry/v2
@bzz bzz merged commit 6d9d5e9 into master Sep 22, 2023
28 checks passed
@bzz bzz deleted the feature/sync-linguist-bf853f1c branch September 22, 2023 13:11
@bzz
Copy link
Member

bzz commented Sep 22, 2023

@DecimalTurn
Copy link

DecimalTurn commented Sep 22, 2023

Edit: Now that the update was made for v7.27.0, this comment is irrelevant.

@bzz Thanks for making the update! I just hope there won't be any issues with large .stl files as discussed here: #169 (comment)

Was there a reason to update to v7.26 instead of the newly released v7.27?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

None yet

2 participants