
Faster IgnoredError::shouldIgnore() #737

Closed
wants to merge 3 commits

Conversation

staabm
Contributor

@staabm staabm commented Oct 25, 2021

This fix is not very elegant, but since it improves analysis time by 7-8%, I think it's at least worth discussing.

Running phpstan at 1505153 leads to the following profile:

[profile screenshot]

This means we still see a bottleneck caused by regex matching of ignore rules.

This PR suggests a fast-exit condition based on the error message's first character:
if the first character does not match, and it is not a regex meta-character, the subsequent regex-matching logic can never match.
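The idea can be sketched roughly like this (a simplified illustration only, not the actual patch; the function shape and delimiter handling are assumptions):

```php
<?php

// Simplified sketch of the proposed fast-exit condition (illustration
// only, not the actual patch; the function shape is an assumption).
function shouldIgnore(string $errorMessage, string $pattern): bool
{
    // Strip the '#' delimiters and look for a leading '^' anchor to find
    // the first literal character the pattern expects.
    $body = trim($pattern, '#');
    if ($body !== '' && $body[0] === '^') {
        $first = $body[1] ?? '';
        // If that character is a plain literal (not a regex meta-char) and
        // the message starts with something else, the regex can never match.
        $isMeta = $first === '' || strpos('\\.[](){}*+?|$', $first) !== false;
        if (!$isMeta && ($errorMessage === '' || $errorMessage[0] !== $first)) {
            return false; // fast exit: skip preg_match() entirely
        }
    }

    return preg_match($pattern, $errorMessage) === 1;
}
```

The fast path only fires when the pattern is anchored at a literal character, so it can never reject an error that the full regex would have matched.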

This optimization leads to the following comparison profile:

[comparison profile screenshot]

We can see a ~1s improvement while also saving ~20% memory.

@dktapps
Contributor

dktapps commented Oct 25, 2021

I think at this point, considering the prevalence of generated baselines, we should start considering a non-regex-based ignoreErrors, if regex is this much of a bottleneck.

Regex-based error filtering made sense when you had to write all error patterns by hand, but now that we're generating baselines so much, it doesn't make much sense anymore.

@staabm
Contributor Author

staabm commented Oct 25, 2021

Agreed. For the baseline case we should not need regex matching at all. @ondrejmirtes, do you agree?

Would you accept a PR changing baseline generation and matching to use plain strings instead of regexes?

@dktapps
Contributor

dktapps commented Oct 25, 2021

To be clear, regex should still be supported, but using regex for the baseline makes no sense.
Also, supporting plain strings in ignoreErrors would make it easier to manually ignore specific errors (no need to fiddle around manually adding \ to things; just copy-paste the error directly).
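A plain-string entry could be matched alongside regex entries along these lines (a hypothetical sketch; the `rawMessage` key and entry shape are assumptions, not PHPStan's actual config format):

```php
<?php

// Hypothetical sketch: match a baseline entry against an error either as
// a regex (current behaviour) or as a plain string (proposed). The
// 'rawMessage' key is an assumption, not PHPStan's actual config format.
function matchesIgnoreEntry(string $errorMessage, array $entry): bool
{
    if (isset($entry['rawMessage'])) {
        // Plain-string entry: exact comparison. No escaping is needed when
        // the baseline is generated, and no preg_match() at analysis time.
        return $errorMessage === $entry['rawMessage'];
    }

    // Regex entry: today's behaviour, which requires escaping on generation.
    return preg_match($entry['message'], $errorMessage) === 1;
}

// Baseline generation with plain strings would also avoid preg_quote():
//   ['rawMessage' => $error->getMessage()]
// instead of
//   ['message' => '#^' . preg_quote($error->getMessage(), '#') . '$#']
```

This is exactly the copy-paste benefit: messages containing `(`, `$`, or `\` go into the baseline verbatim.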

@staabm staabm closed this Oct 27, 2021
@staabm staabm deleted the faster-ignore branch October 27, 2021 05:49
@staabm
Contributor Author

staabm commented Oct 27, 2021

I will give it a try.

@ondrejmirtes
Member

Hi, how many ignoreErrors do you have? I don't really consider regexes a bottleneck.

@staabm
Contributor Author

staabm commented Oct 27, 2021

> how many ignoreErrors do you have?

~15,000 (as can be seen in the profiles in this PR's description; in my case it's the slowest part of the analysis run)

@ondrejmirtes
Member

Wouldn't it be better to have a lower level and a smaller baseline? As it stands, I'd say you have to regenerate it very often.

In my experience the baseline works really well if you have hundreds of errors in it, not thousands, because it assumes you are able to write new code cleanly, which is hard on levels 7 and 8 if the rest of the application isn't typed well.

@staabm
Contributor Author

staabm commented Oct 27, 2021

To be precise: we have 15,000 lines of baseline, which means ~3,750 errors.
Our codebase is on level 6, and we are in the process of adding types everywhere.

IMO, if the codebase is big enough you will have that many errors. You have to start somewhere, and going below level 6 would leave too much room for problems in new code, IMO.
(We fixed big chunks of errors on levels 1-4 beforehand, though.)

@ondrejmirtes
Member

Alright, so how much time does the analysis of the whole codebase take (without the result cache), and how much of that is spent filtering ignored errors?

@staabm
Contributor Author

staabm commented Oct 27, 2021

php ../phpstan-src/bin/phpstan clear-result-cache -c phpstan.neon.dist &&
blackfire run --ignore-exit-status php ../phpstan-src/bin/phpstan analyze -c phpstan.neon.dist

The whole run takes 27-28 seconds; error ignoring takes 2 seconds on 2e32030.

[profile screenshot]


With a populated result cache, relatively speaking, it's much more:

the whole run takes 11 seconds; error ignoring still takes 2 seconds on 2e32030.

[profile screenshot]

@dktapps
Contributor

dktapps commented Oct 27, 2021

FWIW, I didn't see any clear evidence that the regex is a bottleneck in PM's case (less than 1% of CPU time, even when the result cache is fully hit). But PM has somewhere in the ballpark of 1-2K errors, widely distributed over many files, with usually only a handful of errors per affected file. Given that segregation by file, I'd imagine it wouldn't have much impact unless you have a large ratio of errors to files, since it only tries to match the error patterns designated for that specific file.

Some numbers for context (these are from Windows, both assuming that the nette.configurator cache is not flooded):

  • Full analysis ~13sec (with 16 workers)
  • Analysis with result cache 1.3sec

Less than 1% in either of these cases is pretty insignificant, so I don't really have a dog in this fight.

@staabm
Contributor Author

staabm commented Nov 1, 2021

btw: seems like we are not the only ones with such a big baseline: https://twitter.com/Kalle_/status/1455146801632272393

@dktapps
Contributor

dktapps commented Nov 1, 2021

I don't think the size of the baseline matters. What matters more is the number of ignored errors per file. You can see here that it only attempts to check against the ignoreErrors patterns registered for the error's file:

$errors = array_values(array_filter($errors, function (Error $error) use ($processIgnoreError): bool {
    $filePath = $this->fileHelper->normalizePath($error->getFilePath());
    if (isset($this->ignoreErrorsByFile[$filePath])) {
        foreach ($this->ignoreErrorsByFile[$filePath] as $ignoreError) {
            $i = $ignoreError['index'];
            $ignore = $ignoreError['ignoreError'];
            $result = $processIgnoreError($error, $i, $ignore);
            if (!$result) {
                return false;
            }
        }
    }
    // ...

However, the algorithm used does appear to have an O(n^2) complexity.

Some off-the-bat maths:

  • Assume you have 1 file with 1000 unique errors.
  • You have 1000 unique error patterns for this file.
  • You could end up matching 1,000,000 times (1000 errors × 1000 patterns, worst case).

I think at most PM has maybe 20-30 errors in the worst file, so this doesn't really affect PM's analysis time at all.
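Combining the plain-string idea from upthread with a hash lookup would collapse that worst case from n × m regex matches to one array lookup per error. A minimal sketch, assuming baseline entries are exact messages (the `isIgnored` helper and the data shape are hypothetical):

```php
<?php

// Sketch under the assumption that baseline entries are exact messages:
// the per-file matching loop becomes a hash lookup instead of up to
// n x m preg_match() calls. The data shape here is hypothetical.
// $ignoredMessagesByFile maps file path => set of exact error messages.
function isIgnored(string $filePath, string $message, array $ignoredMessagesByFile): bool
{
    return isset($ignoredMessagesByFile[$filePath][$message]);
}

// 1000 errors against 1000 patterns: 1000 O(1) lookups instead of up to
// 1,000,000 regex matches in the worst case.
$ignored = [
    'src/Foo.php' => [
        'Call to an undefined method Foo::bar().' => true,
    ],
];
```

Regex-based entries would still need the linear scan, so the benefit applies only to the exact-message portion of the baseline.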
