Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[java] RFC: new layer of abstraction atop PrimaryExpressions #497

Closed
oowekyala opened this issue Jul 11, 2017 · 7 comments
Closed

[java] RFC: new layer of abstraction atop PrimaryExpressions #497

oowekyala opened this issue Jul 11, 2017 · 7 comments
Labels
in:ast About the AST structure or API, the parsing step in:metrics Affects the metrics framework in:type-resolution Affects the type resolution code

Comments

@oowekyala
Copy link
Member

Hi,

For my work on the metrics framework, I need a way to find the qualified names of the methods that are called and fields that are accessed in a block of code. This is necessary to resolve accurately the usage of methods and data across a multi file project, and I believe it's the next big step of my project.

Type resolution could greatly help to do so, but after a discussion with @WinterGrascph, it appears it's not exhaustive enough yet: it currently populates the type of AST nodes, and the parser groups some elements in the same nodes. For instance, an expression such as a.field.foo() would be parsed into Primary[Prefix[a.field.foo], Suffix[()]], and type resolution would only populate the type of the prefix (type of foo method invocation) and the suffix containing the () arguments and the Primary expression would have the same types as well. That means we miss the type of a, and a.field. Similarly, when a similar expression is parenthesized ((a).field.foo()), it would be parsed into Primary[Prefix[Expression[(a)]], Suffix[field], Suffix[foo], Suffix[()]], here the prefix with (a) would have the type of a, suffix field would have field's type assigned and so on.

There's a couple of thing here that deserve attention:

  • These two expressions, while being semantically equivalent, are parsed completely differently. This complicates the task of rules that want to, eg, forbid calls to some methods, as they have to parse the nodes' images and consider every edge case to be exhaustive. Moreover, the parsing rules are quite counter intuitive and hard to grasp for a rule designer.
  • Type resolution is able to resolve the type of the entire chain, however these types are not kept in the first case, just because there's no node to bear them. We therefore don't exploit type resolution to its full power. That's too bad for the metrics framework too, as some field accesses are not considered in this case.

Since we can't change the way the AST is parsed, one solution to these problems would be to create another layer of abstraction atop PrimaryExpressions. This layer could bind a data structure that would represent the expression's structure in a more intuitive way. This would

  • Ensure semantically equivalent expressions have an identical representation even though the AST nodes are not identical
  • Provide a higher level interface for programmers, who wouldn't need to parse images anymore
  • Enable type resolution to store the type of all subexpressions in a call chain (good for metrics), whatever the underlying AST

I imagine such a representation could look like the following for expression a.field.foo():

  • MethodCall
    • image="a.field.foo()"
    • type=<foo's return type>
    • methodName="foo"
    • arguments=...
    • leftHandSide={FieldAccess}
      • image="a.field"
      • type=<type of field>
      • fieldName="field"
      • leftHandSide={Variable}
        • name="a"
        • type=<type of a>

The representation would probably be created and populated by the type resolution visitor. Since the implementation of this feature is necessary for the metrics framework to move forward, I guess I could work on it with @WinterGrascph, which would need some organisation

Comments? @adangel @jsotuyod

@adangel adangel added in:metrics Affects the metrics framework in:type-resolution Affects the type resolution code labels Jul 14, 2017
@oowekyala
Copy link
Member Author

A design question here: would this layer of abstraction be constructed by the type resolution visitor, or by a dedicated visitor? I think the latter would be preferable, because we could still use that representation without type resolution (it would lack type info but could still be usable --- not for metrics though), and it would be less entangled into type resolution. We'd need to see if that visitor would always execute, or would be conditioned on whether it's used by a rule of the ruleset, like type resolution or metrics.

Another point is that using such a feature for metrics would make Metrics depend on type resolution. Benchmarking PMD on its own sources with a simple ruleset showed an execution time of 0.092 seconds for the metrics visitation vs 2.270 seconds for type resolution. So it might make sense to try to limit that dependency to those metrics that need it, keeping in mind that type resolution will see some more performance improvements in the near future

@jsotuyod
Copy link
Member

Since we can't change the way the AST is parsed

I'm not so certain about this assertion. It's true we have refrained from doing so in the past, but given we are now using semantic versioning, we may consider changing the AST on major releases such as 6.0.0. Particularly on cases such as this one, where our parsers is not in line with the grammar defined by the JSL.

I agree with you, the type resolution for intermediate nodes being lost is not ideal. Not only for metrics, but for other rules too.

Non the less, I'd love to hear what @adangel and @ryan-gustafson may have to say.

So it might make sense to try to limit that dependency to those metrics that need it

By default all java rules enable type resolution. This shouldn't be an issue on itself. And yes, PMD will keep on getting better performance as we keep working.

@adangel
Copy link
Member

adangel commented Jul 16, 2017

Hm... yes, I didn't think about changing the AST, but yes - this would indeed be possible for 6.0.0. And if we move towards a more JSL compliant grammar, even better! The existing rule tests should show us, what we need to adjust after changing the grammar.

I'm wondering, whether there is a more light-weight solution (e.g. without a separate visitor) possible: What about enhancing the ASTPrimaryExpression class with a method isMethodCall() for starters and getMethodCallChain() which could return some representation like you described. The methods would analyze the AST ad-hoc (if performance will be a problem, we could cache the result). This will depend on typresolution, too - and also on the symboltable. In your example above, you could lookup the variable a in the symboltable/scope - and take the type of the variable declaration (might be local or a field). Looking up a.field / field would only be possible, if a's type could be resolved...
Instead of a method call chain, you might also find just accesses to fields - field chain. The LawOfDemeterRule already identifies method chains...
If we change the grammar, ideally only the isMethodCall() / getMethodCallChain() methods need to be adjusted.

@oowekyala
Copy link
Member Author

oowekyala commented Jul 17, 2017

@adangel For starters I may use the representation of LawOfDemeterRule and the symbol table to resolve method calls and field types to get a working prototype. But ideally I think a higher level, unified API would be more adapted, so as not to couple metrics with LawOfDemeterRule. I did a first draft for such an API here. In my idea, expressions would be represented as a doubly linked list, which could be navigated both ways with getLeftHandSide and getSuperExpression. The representation of an expression could be obtained for instance with a getVirtualExpression() method invoked on the ASTPrimaryExpression. Downcasting a VirtualExpression to its concrete class would allow us to use the more specific methods of the class.

I like the idea of not using a visitor, and build this representation on demand. That would entail very minimal changes to the type resolution visitor, which would not have the responsibility of building or populating the representation. The changes to the AST could be very lightweight too, at the very least we could store a list of the types accessed in each ASTPrimaryPrefix for example, and let the VirtualExpression use these types to populate its representation when it's being built. If we decided to change the grammar to e.g. add more nodes to carry types, our parsers would get even farther from the JLS as they are now though...

By default all java rules enable type resolution.

@jsotuyod Will this behaviour be continued? I mean, there's less and less ground to disable type resolution by default as the performance of this utility improves, but what would be the point of having a specific xml attribute for it then?

@jsotuyod
Copy link
Member

jsotuyod commented Jul 17, 2017

Looking up a.field / field would only be possible, if a's type could be resolved...

Yes, I foresee a time where symbol table / type resolution will be tightly coupled / merged. Each requires one-another at different times. Having type resolution being able to work on-demand on subtrees may help on this, but that's still to be seen.

If we change the grammar, ideally only the isMethodCall() / getMethodCallChain() methods need to be adjusted.

@adangel this is assuming we don't change the grammar right now, but do so in the future, right? otherwise I am not fully following your comments...

If we decided to change the grammar to e.g. add more nodes to carry types, our parsers would get even farther from the JLS as they are now though...

@oowekyala I'm not sure what changes you are referring to here... the change I'm proposing would actually better follow the JSL.. check this

Primaries are parsed with left recursion.

Your example a.field.foo() would be parsed as:

Primary[Primary[Primary[Primary[a], Suffix[field]], Suffix[foo]], Suffix[()]]

That would retain the same node-separation as the JSL ('though not the same names), and would be actually more similar to what we currently do.

Also, we would "consume" the '.' tokens and not have them as part of the images, which would allow to simplify several rules currently handling this on their own.

@jsotuyod Will this behaviour be continued? I mean, there's less and less ground to disable type resolution by default as the performance of this utility improves, but what would be the point of having a specific xml attribute for it then?

Yes, I foresee that being kept around. The attribute was added back when it was still on a very early stage. I am okay with keeping it around for backwards compatibility (we may deprecate it at some point). Non the less, keeping it around for other languages that have currently no support for type resolution but may in the future would be nice (Scala, Groovy, Kotlin, etc.).

@oowekyala
Copy link
Member Author

the change I'm proposing would actually better follow the JSL

@jsotuyod Ok I misunderstood. The changes you're proposing totally make sense and would make Primaries much easier to parse. Just wondering, could we add specific nodes for each kind of primary? Like, one for method invocation, one for field access, etc. That would allow us to define specific methods for each kind and make each one much easier to use. This could perhaps be done by adding a specific child for each primary, eg for a.field.foo():

Primary[MethodInvocation[Primary[FieldAccess[Primary[Variable[a]], Suffix[field]]], Suffix[foo], Suffix[()]]]

Maybe Suffix nodes are too generic too. Would it be possible to add more specific suffixes, roughly named as in the JLS? We already have some of them, as e.g. a suffix can contain an ASTAllocationExpression, or a ASTArguments, but for an Identifier, the name is in the image of the suffix and would probably better identifiable using a dedicated subnode of type e.g. ASTIdentifier.

keeping it around for other languages

Okay that makes sense to me now. Thanks for explaining :)

@oowekyala
Copy link
Member Author

Closing this, as the new expression grammar is detailed enough to store all types precisely. Refs #1759, #1979. Typeres support for the new grammar is not yet merged, but this lies outside the scope of this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in:ast About the AST structure or API, the parsing step in:metrics Affects the metrics framework in:type-resolution Affects the type resolution code
Projects
None yet
Development

No branches or pull requests

3 participants