[apex] Integrate nawforce/ApexLink to build robust Unused rule #2667
@nawforce maybe you like the idea to discuss a potential cooperation towards a more robust PMD here ;-) @adangel, @jsotuyod, @oowekyala this ticket is about evaluating if we can overcome the shortcomings of the existing Apex parser by using a second ANTLR parser from @nawforce. Kevin has built his own Apex grammar parser and managed to build a very good unused code finder which I would like to incorporate in PMD.
Let's try shall we ;-) The challenge here I think is that the approach I have taken is quite different from a file-by-file analysis of Apex. After parsing Apex and other metadata I run an analysis pass to validate the Apex code. The analysis is exact in the sense that I check that references to other types exist and are in the correct form. For example, on a method call, the analysis will either identify the method that will be called or it will error; similarly for fields, custom objects, etc. To do this 'global' analysis you obviously require that all the metadata needed has been parsed and loaded.

A dependency graph is built between the in-memory representations of the classes as a side effect of the analysis phase. This graph is where I think most of the additional value would come from. For example, the unused method analysis is simply implemented by scanning all methods to determine which, if any, have no incoming dependency edges. Similarly, I have shown in other work that you can easily do call-graph-type analysis, from say entry point to DML operation, by following the graph's dependencies.

When I looked at how this might be integrated with PMD earlier, I thought the best option was likely to make it appear to PMD as though a file-by-file analysis was actually being used. On the largest Salesforce package I work with this would likely mean a hidden delay of up to about 90 seconds when processing the first file. There is a disk cache model in my library that brings this back to ~20s for a second analysis on the same/similar code. In either case the max heap required is about 300MB. Once the initial boot hit has been taken there is no further cost beyond navigating the parsed metadata.

At the moment the library API only provides access to a 'summary' form that helps identify where things like fields, methods and nested classes are in the source file. Work would be needed to expose the internal AST and graph navigation parts to PMD in a stable way. The mechanics of exposing the internal AST should be fairly straightforward, but compatibility with the existing approach would likely take some effort I think. Of course we may choose to look at this in an entirely different way as well, but this was my thinking so far...
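The unused method analysis described above (scan all methods, flag those with no incoming dependency edges) can be sketched in a few lines. This is a minimal illustration, not ApexLink's actual API; the graph representation and names are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: after the analysis pass has built a dependency
// graph, an unused method is one with no incoming dependency edges.
public class UnusedMethods {

    // callers maps each method to the set of methods that reference it
    static Set<String> findUnused(Map<String, Set<String>> callers) {
        Set<String> unused = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : callers.entrySet()) {
            if (e.getValue().isEmpty()) {
                unused.add(e.getKey());
            }
        }
        return unused;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> callers = new HashMap<>();
        callers.put("Foo.run", new HashSet<>(List.of("Main.main")));
        callers.put("Foo.helper", new HashSet<>());
        System.out.println(findUnused(callers)); // prints [Foo.helper]
    }
}
```

The same graph supports the call-graph walks mentioned above: following edges transitively from an entry point rather than just counting incoming edges.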
@nawforce Have a look at #2124 where an architect at Salesforce and I already discussed call graph flow ideas. There was an old POC that worked a bit for Java. Maybe something where we could steal integration ideas from. Looks like @adangel and @oowekyala worked on something for PMD 7.0 and an API for multi-pass runs. Maybe both can chime in on what the status is on this issue: #1426.
This sounds very interesting. The fact PMD only processes single files is very limiting... It really needs the ability to process all analysed files ahead of time to build global data structures like indices, or your call graph. This could be useful for any language.

At the moment the most limiting factor is our implementation of incremental analysis. This feature makes the promise that we won't even parse files that haven't changed from one run to another. This forces us to rely on data other than the analysed files themselves if we want to get information about other files. For example in Java, we use compiled classes to get info about other files. The entire cache is invalidated if the classpath has changed. OTOH if a file changes, but not the classpath, only that file is reparsed, but not files that depend on it (eg those that use its methods, or extend a class defined in the file). So this is both too coarse-grained, as we drop more data than necessary if the classpath changes, and too inflexible, as it doesn't let us register dependencies between analysed files in a language- or application-specific way.

If we step back and ignore incremental analysis for a while, then we can sketch out a way to implement this:
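The coarse-grained invalidation described above (drop the whole cache on a classpath change, otherwise reparse only files whose own content changed) can be made concrete with a small sketch. The method and parameter names here are illustrative, not pmd-core's actual incremental analysis API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of the invalidation policy described above:
// a classpath change invalidates everything; otherwise only files
// whose own hash changed are reparsed, never their dependents.
public class IncrementalCache {

    static Set<String> filesToReparse(String oldClasspath, String newClasspath,
                                      Map<String, String> oldHashes,
                                      Map<String, String> newHashes) {
        if (!oldClasspath.equals(newClasspath)) {
            // too coarse-grained: unchanged files are dropped as well
            return new TreeSet<>(newHashes.keySet());
        }
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, String> e : newHashes.entrySet()) {
            // too inflexible: a file that depends on a changed file
            // is not reparsed, because no inter-file dependency is recorded
            if (!e.getValue().equals(oldHashes.get(e.getKey()))) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }
}
```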
This would be fine for a first prototype I think. It confines your library behind the existing AST API, so that we don't have to use 2 different AST structures to write a rule. The following other points can be expanded later:
In terms of project management, this should be done on the branch for PMD 7, since it will depend on big changes in pmd-core. One downside is we don't currently have an ETA for that release. It also depends on part of #2518
The data flow codebase doesn't build a call graph, and it's so terrible that we'll be removing it entirely (#2681)
These separate passes are still confined to a single file. In other words, it's only about layering the logic that parses a single file into several named passes. It's not processing all files several times. I'm very much hoping to throw that away before 7.0 and never release it though. It's too complicated a solution, and improvements to the Parser API (as well as removing the DFA code) will make it unnecessary.
This sounds like a very smart way to approach this to get some early value.

I have spent quite a bit of time working out how to avoid specifying files individually, but it is often conceptually the better model for people. Much of the work for me has gone into reimplementing handling that Salesforce provides through the sfdx CLI to identify where source files are located in a project and which files should be ignored. I don't think we would need to worry about this now beyond understanding that some files may not have the additional analysis information because they are excluded by the way the Salesforce project has been configured. As an aside, I was thinking I really should split this handling off into a separate library at some point, which would make it possible to consume at a higher level. Part of the problem in this area is that the 'correct' interpretation of the Salesforce metadata is shifting quite rapidly, and I have been trying to minimise the issues caused by handling it in even subtly different ways. My next focus in this area is around multiple package directories, but there are also some new analysis issues to look at with 2nd generation packaging around class/method accessibility.

In the cache layer there is handling for rejection based on dependencies. It will reject cache entries where the source has not changed but a significant dependency has changed, because that might of course invalidate information currently being stored in the cache. Anything that can't be recovered from the cache is then loaded via parsing. There are some bits of Salesforce metadata I don't yet handle this way, but they can be loaded quickly enough that the value of caching them is rather limited. At the moment the rejection model is a bit too aggressive, but that's just something for me to address.

Thanks for the help on understanding how to approach this. I will start getting familiar with the existing code and experimenting with where bits will need to go. I have quite a lot of other work on related to this project, so it may take me some time to do this, but I think there would be a lot of value in being able to find a way to make this available.
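The dependency-aware cache rejection described above can be sketched as follows: a cached entry records the hash of its source and of each significant dependency at the time it was stored, and is rejected (forcing a re-parse) if any of those hashes no longer match. This is an assumed shape for illustration, not apex-link's actual cache implementation.

```java
import java.util.Map;

// Sketch of dependency-aware cache rejection: an entry is stale if its
// own source changed, or if any significant dependency changed, since
// dependency changes can invalidate the stored analysis information.
public class CacheRejection {

    static boolean isValid(String storedSourceHash,
                           Map<String, String> storedDepHashes,
                           String currentSourceHash,
                           Map<String, String> currentHashes) {
        if (!storedSourceHash.equals(currentSourceHash)) {
            return false; // the source itself changed
        }
        for (Map.Entry<String, String> dep : storedDepHashes.entrySet()) {
            // a dependency changed (or disappeared): reject the entry
            if (!dep.getValue().equals(currentHashes.get(dep.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

Note this is the finer-grained model the incremental analysis discussion above is missing: it records inter-file dependencies instead of invalidating on the whole classpath.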
You rock @nawforce. Please let me or @oowekyala know when and how we can help you.
@rsoesemann @oowekyala Apologies for letting this sit for a while. I have built a little POC with an unused method rule following the approach discussed above. There were a number of niggles in the implementation but nothing really of note, mostly things to do with my API that I can resolve. The way the POC code hooks into the existing Apex support is very hacky, but that was always going to be the case at this stage with the single->multi file difference. Looking at #2518 I was wondering if it makes any sense for me to put some time into it, as it feels like it's a blocker to having a design we could be comfortable with. Obviously I don't have much experience working on PMD, but so far it has been pretty straightforward, so I might be able to make some progress on it.
@nawforce that is awesome! Even if it is just a POC I think it's a great success to make a rule work with somewhat external code. Not sure what #2518 really is about. @oowekyala will know. Maybe also the API plans for PMD 7 which Clement is driving could benefit from using ApexLink as a first core API consumer. Not sure.
One thing I was wondering about was what to do with errors that get generated as a side effect of the unused analysis. We could just ignore these but they tell you about the deployability of the code. Maybe a better way to wrap them up would be to add a single DeployabilityRule which will trigger on any reason we think the code is not in a deployable state (excluding syntax issues that jorje will catch)? The technical difference between this and an unused method/field rule is just that we don't know upfront what kind of AST node we need for marking the violation, but I would expect we could always find a way of picking a suitable node in the AST based on overlapping location information. |
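The node-picking idea at the end of the comment above (find a suitable AST node for the violation based on overlapping location information) amounts to descending to the smallest node whose source range contains the error location. A minimal sketch, with an illustrative `Node` type rather than PMD's actual AST classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative AST node with a source line range and children.
class Node {
    final String name;
    final int beginLine, endLine;
    final List<Node> children = new ArrayList<>();

    Node(String name, int beginLine, int endLine) {
        this.name = name;
        this.beginLine = beginLine;
        this.endLine = endLine;
    }
}

// Sketch of picking a violation node from an error's location: descend
// into whichever child covers the error line, falling back to the
// current node when no child does.
public class LocationPicker {

    static Node pick(Node node, int errorLine) {
        for (Node child : node.children) {
            if (child.beginLine <= errorLine && errorLine <= child.endLine) {
                return pick(child, errorLine); // tighter match, keep descending
            }
        }
        return node; // no child covers the line: report here
    }
}
```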
I like the deployability rule. Looks like we will soon have a dynamic analysis ruleset in the static analysis engine PMD.
Sorry for the late feedback
Sounds great! You can open a PR if you want; I wonder what this looks like.
I think solving this issue is really too far ahead for now. I think it's ok if you push code now that will be refactored later to fit the improved model
PMD 7 will have a concept of "semantic errors" (see lines 13 to 16 in commit 2777daf).
In the Java framework this allows rejecting some programs that have non-sensical semantics before they can cause an error in later analysis passes (including rules). For example, a cycle in an inheritance hierarchy might throw some other code into an infinite loop. The idea is that these warnings/errors are reported after parsing, but before rules apply, and hence are more important than rules. Pulling that up into PMD core is in my local branches (here), but #2807 needs to be merged first. So maybe these errors could be reported as semantic errors/warnings? Or if you prefer the idea of a rule, maybe just save a list of the lines + error messages, and report them as violations on the root node (no need to fetch a specific AST node)?
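The root-node fallback suggested above (save lines plus error messages, report each as a violation on the root) can be sketched like this. The types are illustrative, not PMD's rule API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fallback: each analysis error becomes a violation
// attached to the file's root node, carrying only a line and message
// rather than a located AST node.
public class RootViolations {

    static final class SemanticError {
        final int line;
        final String message;

        SemanticError(int line, String message) {
            this.line = line;
            this.message = message;
        }
    }

    static List<String> report(List<SemanticError> errors) {
        List<String> violations = new ArrayList<>();
        for (SemanticError e : errors) {
            // with no specific node, the recorded line is the only
            // thing distinguishing where the problem is
            violations.add("line " + e.line + ": " + e.message);
        }
        return violations;
    }
}
```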
Thanks @oowekyala. The semantic errors sound like just what I was searching for. Thinking I should create the PR for unused and then maybe look into that later. I will do some tidy up on the code first, so the PR will likely pop up in a week or so for you to take a look at.
Opened a PR for the POC work. The main issue with it I think is getting the analysis to run in a better phase. There are plenty of other things to improve but they all feel pretty straightforward.
Sorry for my ignorance, @oowekyala and @nawforce: is this something we plan for PMD 7? Are there any blockers or open questions where I can help?
@rsoesemann This was for the version merged back in May. There are a few bits we might want to look at again when #2807 and #3131 are merged, and I have some apex-link API work coming up that should make the integration a bit more efficient. I have been working on exposing APIs (mainly for use in VSCode) in a couple of areas that might be of interest. The ability to find the definition used in expressions or parts of (this is available) and to find all the callers of a method (this is coming up soon). Taken together, these could be used to walk the call graph, which might be a decent stepping stone towards supporting some forms of inter-procedural dataflow analysis as discussed on #2717.
@nawforce @oowekyala what is the status with this? |
This is done with the POC PR #2830 for PMD 7. This added a new rule, UnusedMethod. The rule documentation describes how to configure the root directory as an environment variable. We might change the implementation later on. The API within the Apex language module (ApexMultifileAnalysis) is currently marked as experimental and not documented; that documentation is part of #3175. So, I'll close this one as done for PMD 7.
Placeholder for my upcoming discussion with Kevin Jones @nawforce from FinancialForce to make use of ApexLink in PMD.