
[apex] Integrate nawforce/ApexLink to build robust Unused rule #2667

Closed
rsoesemann opened this issue Jul 27, 2020 · 17 comments · Fixed by #2830

@rsoesemann
Member

rsoesemann commented Jul 27, 2020

Placeholder for my upcoming discussion with Kevin Jones @nawforce from FinancialForce about making use of ApexLink in PMD.

@rsoesemann rsoesemann added the a:new-rule Proposal to add a new built-in rule label Jul 27, 2020
@rsoesemann rsoesemann self-assigned this Jul 27, 2020
@rsoesemann rsoesemann changed the title from "[apex] Build more robust Unused rule by integrating nawforce/ApexLink" to "[apex] Integrate nawforce/ApexLink to build robust Unused rule" Jul 27, 2020
@rsoesemann rsoesemann assigned rsoesemann and unassigned rsoesemann Jul 27, 2020
@rsoesemann
Member Author

rsoesemann commented Jul 27, 2020

@nawforce maybe you like the idea of discussing a potential cooperation towards a more robust PMD here ;-)
We could integrate ApexLink's Scala code to improve and extend unused-code rules like:
https://pmd.github.io/latest/pmd_rules_apex_bestpractices.html#unusedlocalvariable

@adangel, @jsotuyod, @oowekyala this ticket is about evaluating whether we can overcome the shortcomings of the existing Apex parser by using a second ANTLR parser from @nawforce. Kevin has built his own Apex grammar and parser and managed to build a very good unused-code finder, which I would like to incorporate into PMD.

@nawforce
Contributor

Let's try shall we ;-)

The challenge here, I think, is that the approach I have taken is quite different from a file-by-file analysis of Apex. After parsing Apex and other metadata I run an analysis pass to validate the Apex code. The analysis is exact in the sense that I check that references to other types exist and are in the correct form. For example, on a method call, the analysis will either identify the method that will be called or it will error; similarly for fields, custom objects, etc. To do this 'global' analysis you obviously require that all the needed metadata has been parsed and loaded.

A dependency graph is built between the in-memory representations of the classes as a side effect of the analysis phase. This graph is where I think most of the additional value would come from. For example, the unused method analysis is simply implemented by scanning all methods to determine which, if any, have no incoming dependency edges. Similarly, I have shown in other work that you can easily do call-graph-style analysis, from say an entry point to a DML operation, by following the graph's dependencies.
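To illustrate the idea (with invented names, not my actual API), the unused-method check over such a graph boils down to something like this:

```java
import java.util.*;

// Illustrative sketch only: a tiny dependency graph where nodes are method
// identifiers and an edge caller -> callee means "caller depends on callee".
final class DependencyGraphSketch {
    private final Map<String, Set<String>> incoming = new HashMap<>();

    void addMethod(String method) {
        incoming.computeIfAbsent(method, k -> new HashSet<>());
    }

    void addCall(String caller, String callee) {
        addMethod(caller);
        addMethod(callee);
        incoming.get(callee).add(caller);
    }

    // "Unused" simply means: no incoming dependency edges.
    List<String> unusedMethods() {
        List<String> unused = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : incoming.entrySet()) {
            if (entry.getValue().isEmpty()) {
                unused.add(entry.getKey());
            }
        }
        return unused;
    }
}
```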

When I looked at how this might be integrated with PMD earlier, I thought the best option was likely to make it appear to PMD as though a file-by-file analysis was actually being used. On the largest Salesforce package I work with this would likely mean a hidden delay of up to about 90 seconds when processing the first file. There is a disk cache model in my library that will bring this back down to ~20 seconds for a second analysis on the same/similar code. In either case the max heap required is about 300MB.

Once the initial boot hit has been taken, there are no further costs beyond navigating the parsed metadata. At the moment the library API only provides access to a 'summary' form that helps identify where things like fields, methods and nested classes are in the source file. Work would be needed to expose the internal AST and graph navigation parts to PMD in a stable way. The mechanics of exposing the internal AST should be fairly straightforward, but compatibility with the existing approach would likely take some effort I think.

Of course we may choose to look at this in an entirely different way as well, but this was my thinking so far...

@rsoesemann
Member Author

rsoesemann commented Jul 28, 2020

@nawforce Have a look at #2124, where I already discussed call graph flow ideas with a Salesforce architect. There was an old POC that worked a bit for Java. Maybe something we could steal integration ideas from.

Looks like @adangel and @oowekyala worked on something for PMD 7.0 and an API for multi-pass runs. Maybe both can chime in on the status of issue #1426.

@oowekyala
Member

This sounds very interesting. The fact PMD only processes single files is very limiting... It really needs the ability to process all analysed files ahead of time to build global data structures like indices, or your call graph. This could be useful for any language.

At the moment the most limiting factor is our implementation of incremental analysis. This feature makes the promise that we won't even parse files that haven't changed from one run to another. This forces us to rely on data other than the analysed files themselves if we want to get information about other files. For example in Java, we use compiled classes to get info about other files. The entire cache is invalidated if the classpath has changed. OTOH if a file changes, but not the classpath, only the file is reparsed, but not files that depend on it (eg those that use its methods, or extend a class defined in the file).

So this is both too coarse-grained, as we drop more data than necessary if the classpath changes, and too inflexible, as it doesn't let us register dependencies between analysed files in a language- or application-specific way.

If we step back and ignore incremental analysis for a while, then we can sketch out a way to implement this:

  1. Languages could declare a "preprocessing step". For the PMD runner, the input of this operation is the set of files we will analyse later (not parsed), its output is some side-effects on the Language instance ([core] Language properties #2518 is about making Language instances store analysis-global data).
    • So you could parse and build your in-memory data structure before any file is analysed. Then you save it in the Language.
  2. After this, we process files as we do today (one-by-one). But the pmd-apex API would include some ways to access the global data.
    • For example, as part of the preprocessing step, you can precompute which methods in the call graph are unused. Say you store a list of unique identifiers for these methods in the Language instance. Then when the hypothetical rule UnusedMethod, implemented by a regular visitor rule, visits an ASTMethod, it can ask whether the method is unused. Eg methodNode.isUsed(), where isUsed calls back to the Language instance with the identifier of the method node.

This would be fine for a first prototype I think. It confines your library behind the existing AST API, so that we don't have to use 2 different AST structures to write a rule.
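To make the shape of that concrete, here is a very rough, self-contained sketch of the two phases; all type and method names are invented for illustration and are not the actual pmd-core or pmd-apex API:

```java
import java.util.*;

// Phase 1 (hypothetical preprocessing step): the whole project is analysed up
// front and the results are stored on a per-language holder (cf. #2518).
final class ApexGlobalDataSketch {
    private final Set<String> unusedMethodIds = new HashSet<>();

    void markUnused(String methodId) {
        unusedMethodIds.add(methodId);
    }

    boolean isUsed(String methodId) {
        return !unusedMethodIds.contains(methodId);
    }
}

// Phase 2 (per-file rule): when visiting a method node, the rule only asks the
// precomputed global data, via something like methodNode.isUsed() delegating
// to this check with the node's unique identifier.
final class UnusedMethodRuleSketch {
    private final ApexGlobalDataSketch globalData;

    UnusedMethodRuleSketch(ApexGlobalDataSketch globalData) {
        this.globalData = globalData;
    }

    void visitMethod(String methodId, List<String> violations) {
        if (!globalData.isUsed(methodId)) {
            violations.add("Unused method: " + methodId);
        }
    }
}
```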

The following other points can be expanded later:

  • Your library would be confined behind the existing AST API in a first step. This allows us to adjust later the degree to which it's integrated in PMD (eg, should users have access to the other AST? Should the Jorje parser be replaced by this other parser?)
  • Integration with incremental analysis. We can sort of ignore the big picture in a first step and only focus on Apex. One useful thing is the on-disk cache of your call graph. So on the second run, the preprocessing step would use the cache + reparse the files that have changed and complete the graph. We could compute which files that have not changed should be reparsed and reanalysed anyway (eg which files depend on some changed files transitively; a rough sketch of that computation follows after this list). Finer-grained dependency tracking can be devised later. Language-independent schemes too.
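Sketching that transitive "who needs reanalysis" computation (self-contained, with invented names, not a real pmd-core API):

```java
import java.util.*;

// Sketch only: given reverse dependencies (file -> files that depend on it),
// compute what must be re-analysed: the changed files plus everything that
// transitively depends on them.
final class ReanalysisSetSketch {
    static Set<String> filesToReanalyse(Set<String> changedFiles,
                                        Map<String, Set<String>> dependents) {
        Set<String> result = new HashSet<>(changedFiles);
        Deque<String> worklist = new ArrayDeque<>(changedFiles);
        while (!worklist.isEmpty()) {
            String file = worklist.pop();
            for (String dependent : dependents.getOrDefault(file, Collections.emptySet())) {
                if (result.add(dependent)) {
                    worklist.push(dependent); // newly discovered, follow its dependents too
                }
            }
        }
        return result;
    }
}
```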

In terms of project management, this should be done on the branch for PMD 7, since it will depend on big changes in pmd-core. One downside is we don't currently have an ETA for that release. It also depends on part of #2518

> @nawforce Have a look at #2124, where I already discussed call graph flow ideas with a Salesforce architect. There was an old POC that worked a bit for Java. Maybe something we could steal integration ideas from.

The data flow codebase doesn't build a call graph, and it's so terrible that we'll be removing it entirely (#2681)

> Looks like @adangel and @oowekyala worked on something for PMD 7.0 and an API for multi-pass runs. Maybe both can chime in on the status of issue #1426.

These separate passes are still confined to a single file. In other words, it's only about layering the logic that parses a single file into several named passes. It's not processing all files several times. I'm very much hoping to throw that away before 7.0 and never release it though. It's too complicated a solution, and improvements to the Parser API (as well as removing the DFA code) will make it unnecessary.

@nawforce
Contributor

This sounds like a very smart way to approach this to get some early value.

I have spent quite a bit of time working out how to avoid specifying files individually, but it is often conceptually the better model for people. Much of the work for me has gone into reimplementing handling that Salesforce provides through the sfdx CLI to identify where source files are located in a project and which files should be ignored. I don't think we would need to worry about this now, beyond understanding that some files may not have the additional analysis information because they are excluded due to the way the Salesforce project has been configured.

As an aside, I was thinking I really should split this handling off into a separate library at some point, which would make it possible to consume at a higher level. Part of the problem in this area is that the 'correct' interpretation of the Salesforce metadata is shifting quite rapidly, and I have been trying to minimise the issues caused by handling it in even subtly different ways. My next focus in this area is around multiple package directories, but there are also some new analysis issues to look at with 2nd generation packaging around class/method accessibility.

In the cache layer there is handling for rejection based on dependencies. It will reject cache entries where the source has not changed but a significant dependency has, because that might of course invalidate information currently stored in the cache. Anything that can't be recovered from the cache is then loaded via parsing. There are some bits of Salesforce metadata I don't yet handle this way, but they can be loaded quickly enough that the value of caching them is rather limited. At the moment the rejection model is a bit too aggressive, but that's just something for me to address.
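As a rough illustration only (this is not the real cache format or API), the rejection check amounts to something like:

```java
import java.util.Map;

// Sketch only: a cache entry records a fingerprint of the source and of each
// significant dependency at the time the entry was written; the entry is
// rejected if either the source or any recorded dependency has since changed.
final class CacheEntrySketch {
    private final String sourceHash;
    private final Map<String, String> dependencyHashes; // dependency name -> hash

    CacheEntrySketch(String sourceHash, Map<String, String> dependencyHashes) {
        this.sourceHash = sourceHash;
        this.dependencyHashes = dependencyHashes;
    }

    boolean isStillValid(String currentSourceHash, Map<String, String> currentHashes) {
        if (!sourceHash.equals(currentSourceHash)) {
            return false; // the source itself changed
        }
        for (Map.Entry<String, String> dep : dependencyHashes.entrySet()) {
            // reject if a significant dependency changed (or disappeared)
            if (!dep.getValue().equals(currentHashes.get(dep.getKey()))) {
                return false;
            }
        }
        return true; // safe to reuse the cached analysis
    }
}
```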

Thanks for the help on understanding how to approach this. I will start getting familiar with the existing code and experimenting with where bits will need to go. I have quite a lot of other work related to this project, so it may take me some time, but I think there would be a lot of value in finding a way to make this available.

@rsoesemann
Member Author

You rock @nawforce. Please let me or @oowekyala know when and how we can help you.

@nawforce
Contributor

@rsoesemann @oowekyala Apologies for letting this sit for a while. I have built a little POC with an unused method rule following the approach discussed above. There were a number of niggles in the implementation but nothing really of note, mostly things to do with my API that I can resolve.

The way the POC code hooks into the existing Apex support is very hacky, but that was always going to be the case at this stage given the single- vs. multi-file difference. Looking at #2518 I was wondering if it makes any sense for me to put some time into it, as it feels like it's a blocker to having a design we could be comfortable with. Obviously I don't have much experience working on PMD, but so far it has been pretty straightforward, so I might be able to make some progress on it.

@rsoesemann
Member Author

@nawforce that is awesome! Even if it is just a POC, I think it's a great success to make a rule work with somewhat external code. Not sure what #2518 really is about; @oowekyala will know. Maybe also the API plans for PMD 7, which Clement is driving, could benefit from using ApexLink as a first core API consumer. Not sure.

@jbartolotta-sfdc FYI

@nawforce
Contributor

One thing I was wondering about was what to do with errors that get generated as a side effect of the unused analysis. We could just ignore these, but they tell you about the deployability of the code. Maybe a better way to wrap them up would be to add a single DeployabilityRule that would trigger on any reason we think the code is not in a deployable state (excluding syntax issues that Jorje will catch)?

The technical difference between this and an unused method/field rule is just that we don't know upfront what kind of AST node we need for marking the violation, but I would expect we could always find a way of picking a suitable node in the AST based on overlapping location information.
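For illustration only (over an invented minimal node interface, not PMD's actual one), picking a violation node by overlapping location could look like this:

```java
import java.util.List;

// Sketch only: anchor the violation on the smallest AST node whose source
// range contains the line the error was reported on.
interface LocatableNode {
    int beginLine();
    int endLine();
    List<? extends LocatableNode> children();
}

final class NodeByLocationSketch {
    static LocatableNode findSmallestEnclosing(LocatableNode node, int errorLine) {
        if (errorLine < node.beginLine() || errorLine > node.endLine()) {
            return null; // the error is outside this subtree
        }
        for (LocatableNode child : node.children()) {
            LocatableNode match = findSmallestEnclosing(child, errorLine);
            if (match != null) {
                return match; // a child covers the line, so prefer the deeper node
            }
        }
        return node; // no child covers the line; this node is the best anchor
    }
}
```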

@rsoesemann
Member Author

I like the deployability rule. Looks like we will soon have a dynamic analysis ruleset in the static analysis engine PMD.

@oowekyala
Member

Sorry for the late feedback

> I have built a little POC with an unused method rule following the approach discussed above. There were a number of niggles in the implementation but nothing really of note, mostly things to do with my API that I can resolve.

Sounds great! You can open a PR if you want; I wonder what this looks like.

> Looking at #2518 I was wondering if it makes any sense for me to put some time into it, as it feels like it's a blocker to having a design we could be comfortable with.

I think solving this issue is really too far ahead for now. I think it's ok if you push code now that will be refactored later to fit the improved model

> One thing I was wondering about was what to do with errors that get generated as a side effect of the unused analysis.

PMD 7 will have a concept of "semantic errors":

     * TODO pull that up to PMD core, use for {@link LanguageLevelChecker}
     * as well
     */
    public interface SemanticChecksLogger {

In the Java framework this allows rejecting some programs that have nonsensical semantics before they can cause an error in later analysis passes (including rules). For example, a cycle in an inheritance hierarchy might throw some other code into an infinite loop. The idea is that these warnings/errors are reported after parsing, but before rules apply, and hence are more important than rules. Pulling that up into PMD core is in my local branches (here), but #2807 needs to be merged first.

So maybe these errors could be reported as semantic errors/warnings? Or if you prefer the idea of a rule, maybe just save a list of the lines + error messages, and report them as violations on the root node (no need to fetch a specific AST node)?

@nawforce
Contributor

nawforce commented Oct 1, 2020

Thanks @oowekyala. The semantic errors sound like just what I was searching for. I'm thinking I should create the PR for unused and then maybe look into that later. I will do some tidy-up on the code first, so the PR will likely pop up in a week or so for you to take a look at.

@nawforce
Contributor

Opened a PR for the POC work. The main issue with it, I think, is getting the analysis to run in a better phase. There are plenty of other things to improve, but they all feel pretty straightforward.

@rsoesemann
Member Author

Sorry for my ignorance @oowekyala and @nawforce, but will this be something we plan for PMD 7? Are there any blockers or open questions I can help with?

@nawforce
Contributor

@rsoesemann This was for the version merged back in May. There are a few bits we might want to look at again when #2807 and #3131 are merged, and I have some apex-link API work coming up that should make the integration a bit more efficient.

I have been working on exposing APIs (mainly for use in VSCode) in a couple of areas that might be of interest: the ability to find the definition used in an expression or part of one (this is available), and to find all the callers of a method (this is coming up soon). Taken together, these could be used to walk the call graph, which might be a decent stepping stone towards supporting some forms of inter-procedural dataflow analysis as discussed on #2717.
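As a rough illustration (the names below are invented, not the actual apex-link API), a backwards walk of the call graph using only a 'find callers' operation could look like this:

```java
import java.util.*;

// Sketch only: check whether some target method (e.g. one performing a DML
// operation) is reachable from a given entry point, by repeatedly asking
// "who calls this?" and walking backwards from the target.
interface CallerIndex {
    Set<String> callersOf(String methodId);
}

final class CallGraphWalkSketch {
    static boolean isReachableFrom(String entryPoint, String target, CallerIndex index) {
        Set<String> seen = new HashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        seen.add(target);
        worklist.push(target);
        while (!worklist.isEmpty()) {
            String current = worklist.pop();
            if (current.equals(entryPoint)) {
                return true; // found a call chain entryPoint -> ... -> target
            }
            for (String caller : index.callersOf(current)) {
                if (seen.add(caller)) {
                    worklist.push(caller);
                }
            }
        }
        return false;
    }
}
```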

@rsoesemann
Member Author

@nawforce @oowekyala what is the status with this?

@adangel adangel linked a pull request Jan 13, 2023 that will close this issue
@adangel
Member

adangel commented Jan 13, 2023

This is done with the POC PR #2830 for PMD 7. This added a new rule UnusedMethod.

The rule documentation describes how to configure the root directory via an environment variable.

We might change the implementation later on with respect to the following:

The API within the Apex language module (ApexMultifileAnalysis) is currently marked as experimental and is not yet documented. This documentation is part of #3175.

So, I'll close this one as done for PMD 7.

@adangel adangel closed this as completed Jan 13, 2023
@adangel adangel mentioned this issue Jan 23, 2023