-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Linguist to v7.26.0 #169
Conversation
Tests validating the output on Linguist fixtures fail on a single case
It now need to be debugged to understand the origin of the failure and either fixed, or documented & added to the known corner-cases. |
@@ -1579,13 +1690,30 @@ var ContentHeuristics = map[string]*Heuristics{ | |||
".stl": &Heuristics{ | |||
rule.Or( | |||
rule.MatchingLanguages("STL"), | |||
regex.MustCompileRuby(`\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)`), | |||
regex.MustCompileRuby(`\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently a patch to v7.26.0 that is on the way: github-linguist/linguist#6518.
It fixes a performance issue associated with this regex pattern.
I'm wondering if it's worth waiting for this patch to be incorporated into a future Linguist version or if the change suggested there could be simply applied manually by changing \A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)
to \A\s*solid(?:$|\s)[\s\S]*^endsolid(?:$|\s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into it!
Personally, I would wait for a patch to propagate instead of apply re-writes at runtime (as Enry policy is not to change the original regexps during code generation).
That being said, I'll be happy to accept a PR doing a (temporary) re-write at runtime, if we want a release before Linguist v7.26.1 is out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the next release is coming soon (most likely early next week), so I'm assuming it's better to wait for it and have everything in order + the new changes coming in v7.27.0.
@@ -1315,16 +1413,29 @@ var ContentHeuristics = map[string]*Heuristics{ | |||
".plist": &Heuristics{ | |||
rule.Or( | |||
rule.MatchingLanguages("XML Property List"), | |||
regex.MustCompileMultiline(`<!DOCTYPE\s+plist`), | |||
regex.MustCompileRuby(`^\s*(?:<\?xml\s|<!DOCTYPE\s+plist|<plist(?:\s+version\s*=\s*(["'])\d+(?:\.\d+)?\1)?\s*>\s*$)`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problems discussed in this comment are most likely coming from the fact that the new heuristic rule for XML Property List now contains a backreference (\1
).
It's there to make sure the pattern matches the same type of quotation marks (single or double).
Seems like replacing \1
with ["']
would be a simple way to handle this. I'm not an expert in the syntaxes at hand, but I don't see how using any quotation marks could create any false positives.
Example in Rubular (link) using ff-man.plist
introduced here):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for sharing these detailed - it's very useful to know that more Linguist upstream features rely on regex syntax that is definitely not going to be supported by the current default Go regex engine.
I don't think that re-writing cases like this on our side is a sustainable policy for enry project: so far as they were few, we just skipped such rules at runtime and documented the divergence for the default configuration (RE2). The clients who need the precision would build their applications with oniguruma engine (-tag oniguruma
, see https://github.com/go-enry/go-enry/#faster-regexp-engine-optional ) that is faster and does support this feature but comes at a price of going though the native bindings.
As for the rule syntax change - if it does not introduce any changes in the results, not using backreferences is faster and thus it should be proposed as a PR to the Linguist itself.
It means that for the next release we'll probably have to go "the usual route" (skip&document) this case, as soon as oniguruma CI profile passes (after this PR is rebased on #170, that ATM is stuck as I was planning to look into Python CI next week - any help will be greatly appreciated).
This makes me think that in the future versions of enry it'll be important to have an option to use a feature-rich pure-Go regex engine! One option is adding https://github.com/dlclark/regexp2, measuring it's runtime performance impact (though our benchmark) and checking how many of the existing divergences that would get rid of to make a decision if it could be used as a default.
Hope this helps!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that re-writing cases like this on our side is a sustainable policy for enry project: so far as they were few, we just skipped such rules at runtime and documented the divergence for the default configuration (RE2).
Sounds good. When you say skipping, do you mean that all files with that extension will skip the regex-heuristic step and go directly to the baysian classifier?
As for the rule syntax change - if it does not introduce any changes in the results, not using backreferences is faster and thus it should be proposed as a PR to the Linguist itself.
Could you please create an issue in the Linguist project itself listing what regex features should be kept to a minimum like backreference and lookarounds and explain why it matters for Linguist, Enry and Github in general? It would be easier to have an issue to refer to when submitting PRs that try to remove those expressions where applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all files with that extension will skip the regex-heuristic step
that would be the effect of skipping such rules - what I meant is they are not going to be evaluated (skipped) at the runtime, even though still present in the generated code.
All the incompatibilities are logged during code generation and are available on every CI run for Linguist update.
3d8e655
to
84c996d
Compare
Rebased on the latest master. |
test plan: * ENRY_TEST_REPO=".linguist" \ go test -run '^TestIsConfiguration$' github.com/go-enry/go-enry/v2
c293df9
to
8b8cc8a
Compare
test plan - go test -run '^Test_EnryOnLinguistCorpus$' github.com/go-enry/go-enry/v2
Python tests are 馃煝 now. |
Code generation tests are flaky (as the order is not fixed) 馃檮
Going to fix it here and as soon as CI is 馃煝 - planing to merge and release v2.8.4 |
Otherwise, generator tests are flaky test plan * make code-generate * go test -run '^TestGetLanguagesByFilename$' github.com/go-enry/go-enry/v2
Edit: Now that the update was made for v7.27.0, this comment is irrelevant. @bzz Thanks for making the update! I just hope there won't be any issues with large Was there a reason to update to v7.26 instead of the newly released v7.27? |
Automated Linguist update 馃
This PR updates Linguist from bf853f1c to v7.26.0 (b5432ebc)