New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Abstract away optional AST traversals (first step) #1426

Merged
merged 4 commits into from Nov 26, 2018

Conversation

Projects
None yet
4 participants
@oowekyala
Copy link
Member

oowekyala commented Nov 2, 2018

First PR for 7.0.0! 馃槃 馃

Basically this is a first step to simplify the way we handle optional AST traversals like typeresolution or symbol table. It's necessary to detect XPath dependencies automatically, and globally will make our life easier.

Why we need a change

Type resolution, DFA, etc. all use a visitor that runs on the AST, wrapped in not-so-useful "fa莽ades" and provided by a LanguageVersionHandler. At the moment:

  • Every language must have all of them, even if it's just a dummy. This is bad because these stages are ultimately language-specific ---e.g. not all languages will implement their type res with a preliminary qname resolution like Java
    • Each language should be able to define themselves the visits they want to schedule independently of how other languages do it
  • They are all executed in the same order for all language (logic is in SourceCodeProcessor), even though the order is an implementation detail which should rest... with the language implementation.
  • Adding a new processing stage which only some rules activate is useful, yet requires a change to the LanguageVersionHandler interface, to its abstract class, to the SourceCodeProcessor, to the Rule interface, etc. The change must cascade from Rule to RuleSet to RuleSets. It also requires a change to the ruleset schema. Hmm spaghetti 馃崫
    • These optional AST visits should be handled uniformly by pmd-core, and adding or removing them only entail changes to the language implementation, not pmd-core
  • Current timing logic and check for dependency is duplicated for dfa, typeres, multifile, etc.

AST processing stages all have in common that:

  • They take an AST + other config (e.g. a classloader) and perform side effects on the AST (e.g. to populate qnames).
  • They all run after the parsing stage and before rule application

This forms a basis for abstracting them.

What's in this PR

  • Processing stages are reified under the interface AstProcessingStage.
  • LanguageVersionHandlers know the full set of AstProcessingStage a language version supports.
  • AstProcessingStages may declare dependencies on other stages. E.g. Java typeres depends on qname resolution.
  • AstProcessingStages are implemented by an enum for each language that supports some. This is quite natural for ordering reasons. Adding or removing AST visits is easy, so is specifying dependencies.
  • The Rule interface has a new method dependsOn(AstProcessingStage), that returns whether the rule depends on the stage or not.
    • Having it there instead of on the AstProcessingStage allows us to implement it differently in XPath rules and Java rules.
    • E.g. Java rules may use annotations and XPath rules may use some analysis of the XPath expression (#437)

What's not in this PR

  • The actual way a rule declares a dependency on an stage is still unclear. Probably there will be a method on the Rule interface that returns the AstProcessingStages the rule depends on. This can then be implemented by abstract rule classes.
    • E.g. Java rules may use annotations and XPath rules may use some analysis of the XPath expression (#437)
    • This is left for future work, so is kind of WIP in this PR.
  • The processing stages are not wired into SourceCodeProcessor for now.
    • TimedOperationCategory needs to be refactored to a class to be extensible. This is probably the next PR.
    • The fact that RuleSets is mutable poses a problem to avoiding unnecessary computation. Either we make it immutable, or we have a new immutable class that takes care of precomputing all that can be done in advance. I'd really like them to make them immutable, because it will be much simpler to build them and reason about them, but this will change the API significantly. Wdyt?
    • This is further explained in a comment in SourceCodeProcessor.
  • Since it's not wired I haven't removed the former API, just deprecated it. Deprecations will need to be ported to the master branch.

@oowekyala oowekyala force-pushed the oowekyala:extensible-ast-processing-stages branch from da64520 to 847f294 Nov 2, 2018

@pmd-test

This comment has been minimized.

Copy link

pmd-test commented Nov 2, 2018

1 Warning
鈿狅笍 Running pmdtester failed, this message is mainly used to remind the maintainers of PMD.

Generated by 馃毇 Danger

oowekyala added some commits Nov 9, 2018

@adangel

This comment has been minimized.

Copy link
Member

adangel commented Nov 20, 2018

Some general thoughts:

They are all executed in the same order for all language (logic is in SourceCodeProcessor), even
though the order is an implementation detail which should rest... with the language implementation.

Fully agree. We have also some unit tests, that manually call the parser and if they need typeres, they then manually need the different facades - and they need to call them in the correct order (e.g. qname before typeres). And it is definitely a language implementation detail.

AstProcessingStages may declare dependencies on other stages. E.g. Java typeres depends on
qname resolution.

Do we need this detail? If the language handler knows all the stages, it should also know the order, in which they are executed. If an AstProcessingStage depends on another stage, maybe we shouldn't have it separated at all?
I'm just trying to avoid the logic of determining the correct ordering of the AstProcessingStages based on the dependency - it would be much easier, if the LanguageHandler impl would just define the correct order.

The Rule interface has a new method dependsOn(AstProcessingStage), that returns whether the
rule depends on the stage or not.

This would allow to have maybe some performance improvements if (and only if) you are only executing rules, that don't need all stages. I would argue for removing this feature:

  • Most likely, you are executing more than one rule - so, the more rules you have in the ruleset, the higher the chances are, that you need all stages. So, the performance benefit would only be visible in unit tests, where you might execute rule by rule.
  • It's error prone to declare for each rule, which stages are needed. And as you pointed out: adding a new stage might result in changing the ruleset schema (if we want some type safety down to this and not just strings).
  • You only get the full power of analysis, if you make use of all stages - so, I'd rather enable all stages by default always. If we have performance issues, we should work on improving the processing and not removing some stages, that we might not need for a specific rule, but the stage would be enabled anyways, because a different rule in the ruleset needs it...

The changes you proposed make sense. I'll look into the code now. Thanks!

* @since 7.0.0
*/
@Experimental
default boolean dependsOn(AstProcessingStage<?> stage) {

This comment has been minimized.

@adangel

adangel Nov 20, 2018

Member

As described in my general comment, I'd argue for not introducing this feature to enable/disable specific stages...

// basically:
// 1. make the union of all stage dependencies of each rule, by language, for the Rulesets
// 2. order them by dependency
// 3. run them and time them if needed

This comment has been minimized.

@adangel

adangel Nov 20, 2018

Member

This would be simplified to just: "run them all and time them"

This comment has been minimized.

@jsotuyod

jsotuyod Jan 13, 2019

Member

This is enticing in theory, but I think we should first address the performance issues, and then enable it by default

// The approach I'd like to take is either
// * to create a new RunnableRulesets class which is immutable, and performs all these preliminary
// computations upon construction.
// * or to modify Ruleset and Rulesets to be immutable. This IMO is a better option because it makes

This comment has been minimized.

@adangel

adangel Nov 20, 2018

Member

Yes, I think, immutability is the way to go. Ruleset is already immutable, isn't it? So, we would just need make Rulesets immutable - this would be then the configured rules, that are to be executed.

Since executing rules runs in multiple threads and rules do not need to be thread-safe, we'll still need to be able to clone a complete Rulesets instance.

This comment has been minimized.

@jsotuyod

jsotuyod Jan 13, 2019

Member

I'd make RuleSets immutable. Deprecate mutation methods and move fordward

*
* @return VisitorStarter
*/
// TODO should we deprecate? Not much use to it.

This comment has been minimized.

@adangel

adangel Nov 20, 2018

Member

DumpFacade: It's more a tool to help during development. We could deprecated it and point to the Designer, that should be used instead to get a look at the AST....

This comment has been minimized.

@jsotuyod

jsotuyod Jan 13, 2019

Member

definitely!

* after parsing is done. Each of these stages implicitly
* depends on the parser stage.
*
* <p>An analysis on a file goes through the following stages:

This comment has been minimized.

@adangel

adangel Nov 20, 2018

Member

Where would you see the multifile analysis? Is this a stage as well?
For multifile analysis, we probably can only run the rules, that need it, after all files have been fully parsed. So, the collection of the data could be done in a stage, that is executed individually for each file, but I think, the rules, that make use of it, need to be executed separately, after rulechain rules and other rules have been executed... Wdyt?

This comment has been minimized.

@oowekyala

oowekyala Nov 23, 2018

Member

I think multifile wouldn't fit neatly into this framework. I'd like it more to be something transparent to the rules.

I imagine we could have a separate class of visitors (not rules) that run after parsing and before rule application. These are declared by rules, and each gather some information relevant to the rule that declares them. This could be a stage, but the actual visitors being applied depends on the contents of the ruleset, so it probably needs special treatment...

Upon rule application the mutlifile rules query the contents of a global index, knowing what keys their data collection visitors registered and being able to access that info for all files of the analysis. Multifile rules wouldn't need an AST to run, so they should also be treated very specially.

I read a bit about IntelliJ's way of doing it, which is very different although very powerful. We could apply some of the things they do to our codebase. We can perhaps talk about it in the next meeting

This comment has been minimized.

@adangel

adangel Nov 23, 2018

Member

Multifile rules wouldn't need an AST to run, so they should also be treated very specially.

Right, multifile rules would be very different to "normal" rules...

public RuleViolationFactory getRuleViolationFactory() {
throw new UnsupportedOperationException("getRuleViolationFactory() is not supported for C++");
protected String getLanguageName() {
return "C++";

This comment has been minimized.

@adangel

adangel Nov 20, 2018

Member

We should probably use here net.sourceforge.pmd.lang.cpp.CppLanguageModule.NAME. Somewhere I've read your question about why we have separate CPD and PMD language modules...
Well, for now, we use the java.util.ServiceLoader<Language> facility to locate the available languages on the classpath - we have a CPD language class net.sourceforge.pmd.cpd.Language and a PMD language class net.sourceforge.pmd.lang.Language. We could unify them and move into the class the information, whether it's CPD only or PMD only.
There was once also the discussion, whether CPD should be part of PMD or a completely separate tool. If we unify them, we merge the two tools even closer together...

This comment has been minimized.

@adangel

adangel Nov 26, 2018

Member

The complete CppHandler class is unnecessary and can be removed with 7.0.0. See #1481.

@adangel adangel self-assigned this Nov 26, 2018

@adangel adangel merged commit 938efd1 into pmd:pmd/7.0.x Nov 26, 2018

1 of 2 checks passed

Codacy/PR Quality Review Hang in there, Codacy is reviewing your Pull request.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@oowekyala oowekyala deleted the oowekyala:extensible-ast-processing-stages branch Nov 26, 2018

* Configuration relevant to e.g. an {@link AstProcessingStage}.
*
* @author Cl茅ment Fournier
* @since 6.10.0

This comment has been minimized.

@jsotuyod

jsotuyod Jan 13, 2019

Member

7.0.0?

* @throws IllegalArgumentException if this rule is not a rule for Java.
*/
@Experimental
public final boolean ruleDependsOnThisStage(Rule rule) {

This comment has been minimized.

@jsotuyod

jsotuyod Jan 13, 2019

Member

this is really weird and cumbersome...

The AbstractJavaRule checks if the stage is a JavaProcessingStage to call this method and pass itself, so we can come here, check if the passed rule is a java rule (even hardwiring the language terse name") to actually call dependsOnImpl which either returns a constant or goes back to the rule to ask if it really needs it...

This comment has been minimized.

@oowekyala

oowekyala Jan 13, 2019

Member

Yeah but it's the only way I know to implement double-dispatch in java... It's just like a visitor pattern tbh, the data flow is also unintuitive there.

This comment has been minimized.

@jsotuyod

jsotuyod Jan 13, 2019

Member

wouldn't it be best if we have the rule directly inform a:

Set<AstProcessingStage<T>> getRequestedProcessingStages() 

T would have to be bound at the base abstract rule per language.

This way, the rule knows the stages it needs, without the stages to need to know the rules; and breaking the tight coupling. This should be fine, as rules are dynamic (ie: can be added through plugins), but processing stages cannot, as they are part of the enum.

I'm unsure we actually need double dispatching here.

This comment has been minimized.

@oowekyala

oowekyala Jan 13, 2019

Member

Double dispatch here is mostly a way to group the implementation of "how does a rule declare a dependency on a stage" with the implementation of the stage, ie in the enum instead of the rule. I thought the code would be clearer that way but ultimately you're right, we only need to dispatch on the rule's type. Ignoring the enum's ruleDependsOnThisStage for XPath rules also seems sloppy. If you'd like to do the change yourself please go ahead, otherwise let me know

I'm not sure adding a type parameter is worth it though. The processing stages will nearly exclusively be handled in pmd-core, where no information on the value of the type argument is available. The extra type safety would disappear, and force us to work around it with wildcards and stuff. So we couldn't avoid runtime checks anyway.


// Also, the benchmarking framework needs a small refactor. TimedOperationCategory needs to be
// made extensible -> probably should be turned to a class with static constants + factory methods
// and not an enum.

This comment has been minimized.

@jsotuyod

jsotuyod Jan 21, 2019

Member

actually, I wouldn't do this. The current implementation allows to track under any given category with a label, and the report will take labeled entries and group them together, along with a grand total (this is how rules are tracked). So, we could simply drop the symbol table, dfa and type resolution members from the enum, have a PREPROCESSING_STAGE or similar, and have each one track itself under a label (terse name). It will even show how many times each stage was actually run by default.

The current implementation should be flexible enough while keeping uniqueness of entries and order (thanks to them being an enum).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment