Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

javasrc2cpg: Handling Generic Types + Type Arguments #2655

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

DavidBakerEffendi
Copy link
Collaborator

@DavidBakerEffendi DavidBakerEffendi commented May 5, 2023

  • Fixed bug where MethodReturn type will be ANY for generic types due to type arguments being added to the search
  • Created a prototype Type->TypeArgument tree for representing nested type arguments
  • Added evalType edges to Ast class

Currently, my Type-AST->TypeArgument(-AST>TypeArgument)* edges are not working... I think this needs to be Type-REF>TypeDecl->TypeParameter(-AST>TypeParameter)*?

I suppose List<Long> and List<Integer> get their own TypeDecl for this kind of schema? Right now, there is only List

* Fixed bug where `MethodReturn` type will be `ANY` for generic types due to type arguments being added to the search
* Created a prototype Type->TypeArgument tree for representing nested type arguments
* Added `evalType` edges to `Ast` class
@DavidBakerEffendi
Copy link
Collaborator Author

DavidBakerEffendi commented May 5, 2023

Okay, I have a plan:

Java will simply give types with fully qualified type arguments (if possible), i.e. instead of java.util.List we have java.util.List<java.lang.Long> and then this will create a unique Type and TypeDecl. Then after TypeUsagePass we have, say, a JavaGenericTypeHandlerPass that can modify the structure of what it detects as generics to build the tree out correctly with TypeParameter and TypeArgument nodes as intended.

Then we can query e.g. method.methodReturn.evalTypeOut.typeArgument.l but it seems like we don't have much query capability around type arguments which will need to be added.

@johannescoetzee does this sound more sane?

@johannescoetzee
Copy link
Contributor

Regarding the type edges, the TypeUsagePass is perhaps a good reference (at least as a start). That creates Type->REF->TypeDecl edges, along with the eval types to match.

@DavidBakerEffendi DavidBakerEffendi changed the title javasrc2cpg: Identifier Decl + Method Type Args javasrc2cpg: Handling Generic Types + Type Arguments May 5, 2023
@DavidBakerEffendi
Copy link
Collaborator Author

TypeUsagePass seems to use the linking util pretty blindly, I don't think polymorphic types will be handled well here. Seems like TypeDecl->TypeParameter e.g. List<Comparable> or whatever, but if I spawn a Type->TypeArgument where I have List<Long> and List<Integer> then I don't think the linking util will check that Long and Integer are both using Comparable as a base class.

But this is a future issue and probably more appropriate for internal generic classes in any case

@johannescoetzee
Copy link
Contributor

Okay, I have a plan:

@DavidBakerEffendi Fabs and I discussed this a while ago in the context of inheritsFrom and decided against including generic types in the typeFullName. This was a while ago, so I can't remember exactly why we decided this. I think it was for policy compatibility with the closed source java frontend. Since our closed source policies don't include the generic types, we omitted those from javasrc *fullNames as well.

It might be possible to add a workaround on the closed source side, but for now I wouldn't go with this approach.

@johannescoetzee
Copy link
Contributor

TypeUsagePass seems to use the linking util pretty blindly, I don't think polymorphic types will be handled well here. Seems like TypeDecl->TypeParameter e.g. List<Comparable> or whatever, but if I spawn a Type->TypeArgument where I have List<Long> and List<Integer> then I don't think the linking util will check that Long and Integer are both using Comparable as a base class.

But this is a future issue and probably more appropriate for internal generic classes in any case

Yeah, it doesn't handle polymorphic types well at all. I was just suggesting it as a reference for the currently expected edge types. There's definitely still a lot of room for improvement for how we handle generics in general.

@DavidBakerEffendi
Copy link
Collaborator Author

@johannescoetzee okay for backwards compatibility reasons I could try register nodes that are eval'ed to generic types and accumulate then on one side, then have TypeUsagePass run over them as normal. Using the generic type info I kept aside, I can generate additional generic type nodes to add on top of that? So we would see both List and List. But the node itself would only have List.

Then if details of a generic type is available we can see it via a query.

@johannescoetzee
Copy link
Contributor

This is perhaps a conversation that @ml86 should be a part of, since he'll have a much better idea of what the implications of this are for the closed source dataflow tracker. I'm not sure how important uniqueness of types is, but I know uniqueness of methodFullNames is a topic that's been brought up.

@ml86
Copy link
Contributor

ml86 commented May 5, 2023

TypeArguments and TypeParameter nodes where added in the early days of the CPG as placeholders, but so far have never been implemented by any frontend. This is also the reason why there are basically no querying facilities, we just never needed them.
Trying to now add TypeArgument and TypeParamter nodes in javasrc2cpg would not disturb production as long as the full names and general structure is unchanged but i would very much expect this to be non straight forward since corresponding node and edge definitions have back than been written up quite quickly. So we will likely need to adjust some of the definitions as we try to implement it.

Changing the full names and structure is pretty much out of question since we need to keep compatibility to the internal byte code frontend.

I am curious why you need the generic types in the CPG? After all in the JVM world the type erased type names are what counts.

@DavidBakerEffendi
Copy link
Collaborator Author

@ml86 One example is generating API definitions from method headers + Spring annotations. If there is Collection<User> in the method return, we could resolve User and expect the following return on the payload:

  { 
   type: "array",
   elementType: {
      name: "string", 
      lastName: "string", 
      email: "string"
     }
  }

In the latest commit, I've gone ahead and added a CodeTree class that will allow use to create a type tree to represent the generics for now. This will produce a 2nd TypeDecl that includes the generics but does not have TypeParameter or TypeArgument since it seems non-trivial to add them in.

But there is code to effectively create the type argument tree in the CPG. Hopefully then from this tree we would be able to query around the arguments.

@DavidBakerEffendi
Copy link
Collaborator Author

If the schema is up for change, then I think I have an okay solution and propose something with a diagram sometime.

@DavidBakerEffendi
Copy link
Collaborator Author

A bit rudimentary diagram, but since the description of Type nodes are that they are an instance of a TypeDecl, then we can keep instances of them that have TypeArgument nodes that can be queried. These can all point to the same TypeDecl as usual.

Then I've also kept the default Type node that would be generated by TypeUsagePass.

image

@DavidBakerEffendi
Copy link
Collaborator Author

DavidBakerEffendi commented May 8, 2023

@ml86 and I had a chat about this one offline. The correct way to use the schema was explained, so I'll be able to handle that, but where we implement this comes with two options, where the main goal is to avoid duplication. It was also mentioned that Type(fullName) can be whatever as long as every unique Type node points to the correct single TypeDecl. E.g. Type(java.util.List<java.lang.String> and Type(java.util.List<java.lang.Integer> must point to TypeDecl(java.util.List<T>).

With that, here are two proposed options:

  1. We register and log the operation for building the generic type structure "in-line" i.e., within the base passes. This avoids memory issues and makes this a non-optional feature. TypeUsagePass is what currently reads typeFullName properties of nodes generated by the AST and use these to generate de-duplicated Type nodes and do all the linking. This pass would need to be modified to understand generics.

  2. This is perhaps simpler but comes with some memory overhead, but we accumulate all the nodes that evaluate to types with generic arguments where we have those details and persist this information in a separate pass, say, before TypeUsagePass. I believe TypeUsagePass is idempotent so whichever types are not accumulated and written, this pass will handle the rest. This is close to what is already in this PR with CodeTree. To negate the memory, I would propose to change the global structure to map CodeTree -> Set[NodeRef] whereas it now stands, it is NodeRef -> CodeTree. This means that we can easily add an off-switch to disable this process and leave TypeUsagePass as it is.

  3. Some combination of 1) and 2) where we accumulate the generic types and have the implementation in TypeUsagePass.

CodeTree is a nice structure as it can generate the respective language dependent type since the syntactic sugar is overridable - so can be useful as a X2Cpg data-structure. On the other hand, the smallest and perhaps most effective change would be 1).

@DavidBakerEffendi
Copy link
Collaborator Author

DavidBakerEffendi commented May 9, 2023

Example for List, minus missing TYPE(List) -REF-> TYPE_DECL(List)
graphviz

@DavidBakerEffendi
Copy link
Collaborator Author

And here is an example with the more complex nested generics graphviz-2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
java Relates to javasrc2cpg
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants