[java] RFC: new layer of abstraction atop PrimaryExpressions #497

oowekyala · 2017-07-11T09:35:11Z

Hi,

For my work on the metrics framework, I need a way to find the qualified names of the methods that are called and fields that are accessed in a block of code. This is necessary to resolve accurately the usage of methods and data across a multi file project, and I believe it's the next big step of my project.

Type resolution could greatly help to do so, but after a discussion with @WinterGrascph, it appears it's not exhaustive enough yet: it currently populates the type of AST nodes, and the parser groups some elements in the same nodes. For instance, an expression such as a.field.foo() would be parsed into Primary[Prefix[a.field.foo], Suffix[()]], and type resolution would only populate the type of the prefix (type of foo method invocation) and the suffix containing the () arguments and the Primary expression would have the same types as well. That means we miss the type of a, and a.field. Similarly, when a similar expression is parenthesized ((a).field.foo()), it would be parsed into Primary[Prefix[Expression[(a)]], Suffix[field], Suffix[foo], Suffix[()]], here the prefix with (a) would have the type of a, suffix field would have field's type assigned and so on.

There's a couple of thing here that deserve attention:

These two expressions, while being semantically equivalent, are parsed completely differently. This complicates the task of rules that want to, eg, forbid calls to some methods, as they have to parse the nodes' images and consider every edge case to be exhaustive. Moreover, the parsing rules are quite counter intuitive and hard to grasp for a rule designer.
Type resolution is able to resolve the type of the entire chain, however these types are not kept in the first case, just because there's no node to bear them. We therefore don't exploit type resolution to its full power. That's too bad for the metrics framework too, as some field accesses are not considered in this case.

Since we can't change the way the AST is parsed, one solution to these problems would be to create another layer of abstraction atop PrimaryExpressions. This layer could bind a data structure that would represent the expression's structure in a more intuitive way. This would

Ensure semantically equivalent expressions have an identical representation even though the AST nodes are not identical
Provide a higher level interface for programmers, who wouldn't need to parse images anymore
Enable type resolution to store the type of all subexpressions in a call chain (good for metrics), whatever the underlying AST

I imagine such a representation could look like the following for expression a.field.foo():

MethodCall
- image="a.field.foo()"
- type=<foo's return type>
- methodName="foo"
- arguments=...
- leftHandSide={FieldAccess}
  - image="a.field"
  - type=<type of field>
  - fieldName="field"
  - leftHandSide={Variable}
    - name="a"
    - type=<type of a>

The representation would probably be created and populated by the type resolution visitor. Since the implementation of this feature is necessary for the metrics framework to move forward, I guess I could work on it with @WinterGrascph, which would need some organisation

Comments? @adangel @jsotuyod

The text was updated successfully, but these errors were encountered:

oowekyala · 2017-07-15T16:34:23Z

A design question here: would this layer of abstraction be constructed by the type resolution visitor, or by a dedicated visitor? I think the latter would be preferable, because we could still use that representation without type resolution (it would lack type info but could still be usable --- not for metrics though), and it would be less entangled into type resolution. We'd need to see if that visitor would always execute, or would be conditioned on whether it's used by a rule of the ruleset, like type resolution or metrics.

Another point is that using such a feature for metrics would make Metrics depend on type resolution. Benchmarking PMD on its own sources with a simple ruleset showed an execution time of 0.092 seconds for the metrics visitation vs 2.270 seconds for type resolution. So it might make sense to try to limit that dependency to those metrics that need it, keeping in mind that type resolution will see some more performance improvements in the near future

jsotuyod · 2017-07-15T23:28:41Z

Since we can't change the way the AST is parsed

I'm not so certain about this assertion. It's true we have refrained from doing so in the past, but given we are now using semantic versioning, we may consider changing the AST on major releases such as 6.0.0. Particularly on cases such as this one, where our parsers is not in line with the grammar defined by the JSL.

I agree with you, the type resolution for intermediate nodes being lost is not ideal. Not only for metrics, but for other rules too.

Non the less, I'd love to hear what @adangel and @ryan-gustafson may have to say.

So it might make sense to try to limit that dependency to those metrics that need it

By default all java rules enable type resolution. This shouldn't be an issue on itself. And yes, PMD will keep on getting better performance as we keep working.

adangel · 2017-07-16T17:00:23Z

Hm... yes, I didn't think about changing the AST, but yes - this would indeed be possible for 6.0.0. And if we move towards a more JSL compliant grammar, even better! The existing rule tests should show us, what we need to adjust after changing the grammar.

I'm wondering, whether there is a more light-weight solution (e.g. without a separate visitor) possible: What about enhancing the ASTPrimaryExpression class with a method isMethodCall() for starters and getMethodCallChain() which could return some representation like you described. The methods would analyze the AST ad-hoc (if performance will be a problem, we could cache the result). This will depend on typresolution, too - and also on the symboltable. In your example above, you could lookup the variable a in the symboltable/scope - and take the type of the variable declaration (might be local or a field). Looking up a.field / field would only be possible, if a's type could be resolved...
Instead of a method call chain, you might also find just accesses to fields - field chain. The LawOfDemeterRule already identifies method chains...
If we change the grammar, ideally only the isMethodCall() / getMethodCallChain() methods need to be adjusted.

oowekyala · 2017-07-17T12:14:05Z

@adangel For starters I may use the representation of LawOfDemeterRule and the symbol table to resolve method calls and field types to get a working prototype. But ideally I think a higher level, unified API would be more adapted, so as not to couple metrics with LawOfDemeterRule. I did a first draft for such an API here. In my idea, expressions would be represented as a doubly linked list, which could be navigated both ways with getLeftHandSide and getSuperExpression. The representation of an expression could be obtained for instance with a getVirtualExpression() method invoked on the ASTPrimaryExpression. Downcasting a VirtualExpression to its concrete class would allow us to use the more specific methods of the class.

I like the idea of not using a visitor, and build this representation on demand. That would entail very minimal changes to the type resolution visitor, which would not have the responsibility of building or populating the representation. The changes to the AST could be very lightweight too, at the very least we could store a list of the types accessed in each ASTPrimaryPrefix for example, and let the VirtualExpression use these types to populate its representation when it's being built. If we decided to change the grammar to e.g. add more nodes to carry types, our parsers would get even farther from the JLS as they are now though...

By default all java rules enable type resolution.

@jsotuyod Will this behaviour be continued? I mean, there's less and less ground to disable type resolution by default as the performance of this utility improves, but what would be the point of having a specific xml attribute for it then?

jsotuyod · 2017-07-17T19:03:18Z

Looking up a.field / field would only be possible, if a's type could be resolved...

Yes, I foresee a time where symbol table / type resolution will be tightly coupled / merged. Each requires one-another at different times. Having type resolution being able to work on-demand on subtrees may help on this, but that's still to be seen.

If we change the grammar, ideally only the isMethodCall() / getMethodCallChain() methods need to be adjusted.

@adangel this is assuming we don't change the grammar right now, but do so in the future, right? otherwise I am not fully following your comments...

If we decided to change the grammar to e.g. add more nodes to carry types, our parsers would get even farther from the JLS as they are now though...

@oowekyala I'm not sure what changes you are referring to here... the change I'm proposing would actually better follow the JSL.. check this

Primaries are parsed with left recursion.

Your example a.field.foo() would be parsed as:

Primary[Primary[Primary[Primary[a], Suffix[field]], Suffix[foo]], Suffix[()]]

That would retain the same node-separation as the JSL ('though not the same names), and would be actually more similar to what we currently do.

Also, we would "consume" the '.' tokens and not have them as part of the images, which would allow to simplify several rules currently handling this on their own.

@jsotuyod Will this behaviour be continued? I mean, there's less and less ground to disable type resolution by default as the performance of this utility improves, but what would be the point of having a specific xml attribute for it then?

Yes, I foresee that being kept around. The attribute was added back when it was still on a very early stage. I am okay with keeping it around for backwards compatibility (we may deprecate it at some point). Non the less, keeping it around for other languages that have currently no support for type resolution but may in the future would be nice (Scala, Groovy, Kotlin, etc.).

oowekyala · 2017-07-18T09:52:08Z

the change I'm proposing would actually better follow the JSL

@jsotuyod Ok I misunderstood. The changes you're proposing totally make sense and would make Primaries much easier to parse. Just wondering, could we add specific nodes for each kind of primary? Like, one for method invocation, one for field access, etc. That would allow us to define specific methods for each kind and make each one much easier to use. This could perhaps be done by adding a specific child for each primary, eg for a.field.foo():

Primary[MethodInvocation[Primary[FieldAccess[Primary[Variable[a]], Suffix[field]]], Suffix[foo], Suffix[()]]]

Maybe Suffix nodes are too generic too. Would it be possible to add more specific suffixes, roughly named as in the JLS? We already have some of them, as e.g. a suffix can contain an ASTAllocationExpression, or a ASTArguments, but for an Identifier, the name is in the image of the suffix and would probably better identifiable using a dedicated subnode of type e.g. ASTIdentifier.

keeping it around for other languages

Okay that makes sense to me now. Thanks for explaining :)

oowekyala · 2020-04-25T15:16:21Z

Closing this, as the new expression grammar is detailed enough to store all types precisely. Refs #1759, #1979. Typeres support for the new grammar is not yet merged, but this lies outside the scope of this issue

adangel added in:metrics Affects the metrics framework in:type-resolution Affects the type resolution code labels Jul 14, 2017

oowekyala mentioned this issue Jul 14, 2017

[java] Metrics memoization tests #499

Merged

jsotuyod mentioned this issue Aug 28, 2017

[java] New rule in migrating ruleset: ForLoopCanBeForeach #566

Merged

This was referenced Sep 18, 2017

how to use PMD 5.8.1 Java API to create a rule and to raise a ruleviolation #617

Closed

[java] [apex] Separate multifile analysis from metrics #602

Merged

oowekyala mentioned this issue Apr 6, 2018

[java] Breaking Java Grammar changes for PMD 7.0.0 #1019

Closed

15 tasks

adangel added the in:ast About the AST structure or API, the parsing step label Mar 11, 2019

oowekyala mentioned this issue Jul 1, 2019

[java] About operator nodes #1661

Closed

oowekyala mentioned this issue Aug 15, 2019

[java] Make binary operators left-recursive #1979

Merged

oowekyala closed this as completed Apr 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[java] RFC: new layer of abstraction atop PrimaryExpressions #497

[java] RFC: new layer of abstraction atop PrimaryExpressions #497

oowekyala commented Jul 11, 2017

oowekyala commented Jul 15, 2017

jsotuyod commented Jul 15, 2017

adangel commented Jul 16, 2017

oowekyala commented Jul 17, 2017 •

edited

jsotuyod commented Jul 17, 2017 •

edited

oowekyala commented Jul 18, 2017

oowekyala commented Apr 25, 2020

[java] RFC: new layer of abstraction atop PrimaryExpressions #497

[java] RFC: new layer of abstraction atop PrimaryExpressions #497

Comments

oowekyala commented Jul 11, 2017

oowekyala commented Jul 15, 2017

jsotuyod commented Jul 15, 2017

adangel commented Jul 16, 2017

oowekyala commented Jul 17, 2017 • edited

jsotuyod commented Jul 17, 2017 • edited

oowekyala commented Jul 18, 2017

oowekyala commented Apr 25, 2020

oowekyala commented Jul 17, 2017 •

edited

jsotuyod commented Jul 17, 2017 •

edited