Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support graphs and graph traversals #46

Closed
DerSaidin opened this issue Sep 27, 2017 · 11 comments
Closed

Support graphs and graph traversals #46

DerSaidin opened this issue Sep 27, 2017 · 11 comments

Comments

@DerSaidin
Copy link

There are many graph structures in static analysis which are useful to preserve in results. For example:

  • A value flow graph
    • Show multiple values contributing to a value
  • A call graph
    • Thanks to the graph structure, we can expand callsites to see more information about their side effects

Several of the existing properties of result could be abstracted and generalized in this manner:

  • codeFlows property
  • stacks property
  • relatedLocations property
    Note that all of these have properties in common: location, message, These would be the vertices in the graph. (in the case of stacks, the stackFrame objects are the vertices and the stacks object is providing some of the graph edges/structure).

This would also allow the format to support other information which generally fits into a graph.
Having codeflows and stacks properties show the desire for this generalization/extensibility. What other similar properties will be wanted in the future that are not currently specified?

Each vertex would need some tag to identify what it means (i.e. this vertex is a stackFrame, this vertex is a value flow at an addition) and how vertices are expected to fit together (a stackFrame cannot flow into an addition, these should not appear in the same graph).
Tools doing their own graphs (not specified in SARIF) could still have a graph of vertices with a location and a message and their own meaning.

@michaelcfanning
Copy link
Contributor

This is an interesting suggestion that's worth considering. The SATE format has a notion of linking related results that is another example of this kind of relationship. Their solution was simply to provide a universal identifier, allowing results to store lists of UIDs to other items. You're proposing something more expressive, obviously.

Besides the option of annotating a node/entity in the graph ('stack frame'), it's interesting to consider whether we annotate the edges instead, with most nodes simply being a code location. so a 'calls into' edge could point to a common location referenced by other entities.

All of this resurrects an historical question with SARIF, which is whether locations themselves should be stored as a distinct set of things (like we do with the 'files' object) with results and other things referencing items in the set. That would be a disruptive change to the current spec but would also be a support for what you've proposed here.

@DerSaidin
Copy link
Author

Their solution was simply to provide a universal identifier, allowing results to store lists of UIDs to other items. You're proposing something more expressive, obviously.

More expressive in that it is a graph structure, not a flat list.

Less expressive in that each vertex in the graph can only be one strucure: a location and a message (plus whatever other appropriate field), not a UID referencing anything else in the SARIF file.

The graph structure I have in mind is replacing/generalizing structures within the result object, not allowing edges/references beyond the result object - I think that would be an orthogonal feature.

All of this resurrects an historical question with SARIF, which is whether locations themselves should be stored as a distinct set of things

I would note that (as I understand it) this question is orthogonal to using a graph strucure.

  • Each vertex in the graph structure could either provide the location directly without refering to a distinct set, or refer to locations in a distinct set.
  • Each location in the current structure could also provide the location directly or be changed to refer to locations in a distinct set.

@michaelcfanning michaelcfanning changed the title Does the result object support graph information? Consider: should the result object support graph information? Nov 3, 2017
@AndrewPardoe
Copy link

Different code analysis tools look at vertices and edges. If we went down this path would it make sense to be able to express either?

@michaelcfanning
Copy link
Contributor

Andrew (@DerSaidin), do you have time to comment on this thread in general? How would a graph improve our ability to express a stack? Imagine a graph that represents a control flow graph, how would we represent a flow through it? Can you provide clarifying examples that show the promise of what you're proposing?

@DerSaidin
Copy link
Author

DerSaidin commented Nov 8, 2017

Stack Example

void b() {
    zzz1(); //Dependency1
}
void a() {
    b();
    zzz2(); //Dependency2
    c();
    zzz4(); //Dependency4
    e();
}
void c() {
    d();
}
void d() {
    zzz3(); //Dependency3
}
void e() {
    zzz5(); //Dependency5 //Bug location
}

Without a graph

With the current SARIF proposal...

This array shall include every function call in the stack for which the tool has information, and the entries that are present shall occur in chronological order with the most recent (innermost) call first and the least recent (outermost) call last.

... We can have a list of frames going linearly up the stack, starting at Bug ...

frames: [
    { // stackFrame for e()
        locations = [ Dependency5 ]
    },
    { // stackFrame for a()
        locations = [ Dependency4, Dependency2 ]
    },
]

This might be displayed something like...

|==function a()==  (Frame_a)
|  Dependency2
|  Dependency4  

   -> call

    |==function e()==  (Frame_e)
    |  Dependency5  

As a graph

With a graph, we can capture the location of all the dependencies accurately (The word "Frame" may be a little misused here, but it is convenient)...

Frame_a -> Frame_b -> Dependency1
        -> Dependency2
        -> Frame_c -> Frame_d -> Dependency3
        -> Dependency4
        -> Frame_e -> Dependency5

Note that the nodes in this graph have different types/tags. Some are tagged as frames, some are tagged as leaf dependencies.

And we can take advantage of more structured to display it with context...

|==function a()==  (Frame_a)
|

   -> call

    |==function b()==  (Frame_b)
    |  Dependency1  

   <- return

|==function a()==  (Frame_a continued)
|  Dependency2  

   -> call

    |==function c()==  (Frame_c)

       -> call

        |==function d==  (Frame_d)
        |  Dependency3  

       <- return

   <- return

|==function a()==  (Frame_a continued)
|  Dependency4  

   -> call

    |==function e()==  (Frame_e)
    |  Dependency5  

For example, Dependency3 might be a write to a variable passed in from a().

The graph could also have an edge e.g. from Dependency3 to Dependency5.

This is a key difference between static and dynamic tools - a dynamic callstack is strictly linear, but a static analysis context is not (and not really a "stack").

Graphs of other Things

5.17.10. codeFlows property
5.17.11. stacks property

The codeFlows and stacks are specific cases of graphs.

What if I want to show a callgraph?

A graph structure is generalized, and applicable to other things.

@lcartey
Copy link

lcartey commented Dec 7, 2017

The idea of having graphs associated with the result object is definitely an interesting one. However, I'm not sure that it replaces or generalizes the existing elements. I think the current elements (codeFlows and, to a certain extent, stackFrames) represent paths through some notional graph, rather than representing a graph directly.

As an example, consider a call graph, with the vertices representing functions and edges representing calls. A result could represent a path through that call graph as a codeFlow, as an explanation for that result. That path could even visit the same function vertex in the graph multiple times. However, it is not currently possible to represent the call graph itself in the SARIF file.

@DerSaidin In your example you have what is essentially a call graph, but the structured display you provide is actually a path through that graph. I think that structured display can therefore be encoded as a codeFlow using the current SARIF spec, but the graph itself cannot currently be represented.

I think the question here is whether a graph representation makes sense in addition to the existing path representation. One area I think it could be useful is to help SARIF viewers provide a more sophisticated path display mechanism, in particular when:

  • The path occurs in a nested graph. It is quite natural to represent program structure as a nested graph reflecting the AST, and using this structure can improve the display of results, for example by collapsing a node to hide the edges within that node.
  • Multiple paths occur in the same graph. If provided with a unifying graph, a viewer could display the graph with multiple paths overlaid, rather than listing each path individally.

There may also be cases where a graph is useful independently of a path for a result, but I don't have a concrete example.

@lcartey
Copy link

lcartey commented Dec 7, 2017

I have opened a separate issue (#71) related to allowing edge properties on codeFlows.

@DerSaidin
Copy link
Author

I like the distinction between paths through some a graph, rather than the graph itself. This is true, it is usually paths that are displayed.

If there was graphs in addition to paths, then I think the paths should refer to nodes of a graph. At this point the paths are essentially part of the graph.

I think graph vs path is like vector graphics vs bitmap. The vector graphics has more structured information and details, you can render various bitmaps views from it.

  • Having vector graphics is good because the structure can be useful - you can turn off some layers, or change colors more easily.
  • Having bitmaps is good because it is simple to display; you just send pixels strait to the screen, without the complexity of rendering something first.

As always, it is a trade off. Complexity vs flexibility.

@michaelcfanning
Copy link
Contributor

michaelcfanning commented Jan 25, 2018

As per most recent TC discussion, we are excluding this feature from this SARIF release. Larry will add some additional detail from notes taken during that session.

@ghost
Copy link

ghost commented Jan 25, 2018

The upshot of our discussion in the last TC meeting was that

  1. The full graph information would be huge.
  2. We didn't want to be in the business of specifying a generic graph structure for the various types of graphs we might want to represent.

@ghost ghost added enhancement impact-non-breaking-change CSD.1 Will be fixed in CSD.1. labels Mar 5, 2018
@ghost ghost assigned michaelcfanning Mar 5, 2018
@michaelcfanning michaelcfanning changed the title Consider: should the result object support graph information? Provide support for graphs and graph traversals Mar 18, 2018
ghost pushed a commit that referenced this issue Apr 2, 2018
ghost pushed a commit that referenced this issue Apr 2, 2018
ghost pushed a commit that referenced this issue Apr 5, 2018
@ghost ghost changed the title Provide support for graphs and graph traversals Support graphs and graph traversals Apr 18, 2018
ghost pushed a commit that referenced this issue Apr 18, 2018
@ghost
Copy link

ghost commented Apr 18, 2018

Closed by 20f0042.

@ghost ghost closed this as completed Apr 18, 2018
@ghost ghost added the resolved-fixed label Apr 18, 2018
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants