Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding codeflow.state to capture initial execution state for things like static variables #168

Closed
michaelcfanning opened this issue May 9, 2018 · 23 comments
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. code-flows design-approved The TC approved the design and I can write the change draft e-ballot-2 enhancement impact-non-breaking-change merged Changes merged into provisional draft. p2 Priority 2 issue to close resolved-fixed

Comments

@michaelcfanning
Copy link
Contributor

No description provided.

@michaelcfanning michaelcfanning added CSD.1 Will be fixed in CSD.1. 2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. and removed CSD.1 Will be fixed in CSD.1. labels May 30, 2018
@michaelcfanning
Copy link
Contributor Author

@lgolding, thinking this over, this data is very specific for sure to dynamic analysis tools and not static.
I think we can close this in any case. You can think of this information as 'state' that is provided to a code flow, and so a dynamic analysis results author could populate that property with this class of information.

do you agree?

@ghost
Copy link

ghost commented May 31, 2018

@michaelcfanning I agree that it is dynamic-specific, and for that reason I would agree to relabel this from CSD.2 to future.

At that point, it's moot, but -- I'm not sure that this is "state", if by that, you mean the property codeFlowLocation.state. That property is intended to provide a debugger-like Watch Window experience. You could certainly do this:

{                    # A codeFlowLocation object.
  "state": {
    "Accept": "text/plain",
    ....
  }
}

... but it seems like a bit of an abuse.

If we're not "really" supporting this, IMO the proper place is the result's property bag -- and then the tool can make use of hierarchical property bag property names:

{                    # A result object.
  ...
  "properties": {
    "Header/Accept": "text/plain",
    ...
  }
}

@michaelcfanning
Copy link
Contributor Author

Disagree. Our position is that if we have properties that reasonably can accommodate output from a dynamic analysis tool, then we don't need to require separation for some reason.

What I'm arguing for here is that the forms data, web request details, etc. are state that are associated with a result code flow. This is conceptually no different than a static analysis tool that populates state with some assumptions that are driving the analysis.

What I have in mind in general is populating an unparsed proper, e.g., 'User Agent'., the value of which is the relevant unparsed string. But producers are free to populate as they see fit.

@ghost
Copy link

ghost commented May 31, 2018

What I have in mind in general is populating an unparsed proper[ty?], e.g., 'User Agent'., the value of which is the relevant unparsed string. But producers are free to populate as they see fit.

Then I'm not sure what you're suggesting. You clearly disagree with using result.properties. But I can't tell if you want to use codeFlowLocation.state or something else. You talk about "state ... associated with a result code flow", but as the spec stands, there is no state associated with a codeFlow (or with a threadFlow). The state is associated with a codeFlowLocation, and that seems inappropriate to me, because these request parameters are not location-specific; they'll never change from one location to the next.

@michaelcfanning
Copy link
Contributor Author

How do users specify the 'inaugural' state then? This is a useful question to raise. Consider a code flow that consists of three distinct threads. How should a preliminary value of a shared static variable be specified?

Perhaps we should add a state property somewhere else. Probably at the root of the code flow. For anything thread specific, this can be specified at the first thread flow location?

@ghost
Copy link

ghost commented Jun 20, 2018

I agree with you that the root of the code flow is the right place for this information. I still don't agree that the thing you're describing (forms data and HTTP headers) is "state," because it never changes. If we're going to have to add a new property anyway, I'd prefer not to give it that name. It's more like "parameters". But I'm not going to lie down in front of the "state" train.

Another option is to use "properties":

{                 # A codeFlow object
  "properties": {
    "webScanner/headers/UserAgent": "Mozilla/5.0 (iPhone; ...",
    "webScanner/forms/Address/Street": "555-55th Street"
  }
}

@michaelcfanning
Copy link
Contributor Author

huh? state that does not change is state. an analysis that starts with an assumption around a variable that is never modified during analysis is a very common scenario.

@ghost
Copy link

ghost commented Jun 20, 2018

Ok, I yield. We add codeFlow.state, and include non-normative spec language (a "NOTE") explaining how you can use this for headers and forms data.

@michaelcfanning
Copy link
Contributor Author

:)
Actually, I'm not advocating for any spec text on using this property for headers and forms data. That's a specific use for dynamic analysis. Instead, I've been thinking along the lines of our usual strategy in this case: consider how a dynamic analysis tool would sensibly utilize SARIF constructs. It seems to me that the 'state' property could sensibly hold this data. As you point out, another sensible approach would be to break apart the individual properties and put them in the properties bag.

As part of this discussion, I believe we've uncovered a useful gap in the format from the perspective of a static analysis tool. It is not clear where a code flow would specify initial state for static variables, other data in a multi-thread scenario. So my suggestion is to add codeFlow.state and restrict spec language to the static analysis scenario.

It's an interesting idea to consider some non-normative language around how to utilize SARIF for dynamic analysis scenarios where there's sufficient overlap (and to describe perhaps where that gaps are too extreme). But this seems like a low priority compared to other open items.

@michaelcfanning michaelcfanning changed the title Consider adding placeholder for things like forms data and request headers for dynamic analysis tools Consider adding codeflow.state to capture initial execution state for things like static variables Jun 20, 2018
@ghost
Copy link

ghost commented Jun 20, 2018

Having thought more about this, I suggest codeFlow.initialState for parallelism with graphTraversal.initialState.

@ghost
Copy link

ghost commented Jun 20, 2018

But even so "state" is problematic for the "headers/forms data" scenario. Here's why:

As the user navigates a code flow, the debugger-like "watch window" is going to display the state belonging to the current location. For example, suppose you are on location 1, whose state property is

"state": {
  "x": "42",
  "y": "54"
}

The watch window displays

x    42
y    54

Now you navigate to location 2, whose state property is

"state": {
  "x": "33"
}

What should the viewer display? Should it display

x    33
y    54

... assuming that x has changed but y has not? Or should it display

x    33

... assuming that y has gone out of scope?

I claim that it should just display x. If the SARIF producer wants you to display the unchanged value of y, it should include it in the state property.

What does that imply for an initialState like this?

"initialState": {
  "x": "0",
  "UserAgent": "Mozilla/5.0..."
}

To me it implies that UserAgent will disappear from the watch window as soon as you navigate to the first location.

That's why I think that persistent information like this, which makes sense throughout the code flow, (and which you might want to view throughout your navigation through the code flow), needs a separate slot in the format.

"properties" works for that.

@michaelcfanning
Copy link
Contributor Author

properties does not work for this, because properties could contain some arbitrary data that's not intended to be viewable and does not allow for general viewer consumption.

do me a favor. stop talking about the headers/forms example. you'll note I retitled this issue. let's revisit your example: how do we project that a viewer can most usefully report out on things like static variables that may not be associated with the current selected code location.

This is a good question to answer. The answer is not to use the properties bag. Any thoughts?

@ghost
Copy link

ghost commented Jun 20, 2018

"Static" variables can change from location to location, just as local variables can:

"locations": [
  {
    "x": "42,
    "s_count": "5"
  },
  {
    "x": "42",
    "s_count": "6"
  }
]

The producer should add them to the state property of any location where it makes sense. There's no need to treat them specially by mentioning them in a codeFlow-level property.

But if you mean "constant" variables, rather than "static" ones -- there I think is the answer. We could define a new property codeFlow.constants. The viewer would display them in a window separate from the state. This would also cover the Scenario Who Must Not Be Named.

@michaelcfanning
Copy link
Contributor Author

I am focused on this specific question: how does a viewer differentiate between local vs. static variables. In general, when debugging, all viewable static variables do no appear in watch windows. This data can be quite copious. Users can explicitly add a static to a watch window, but this is an explicit question.

Now consider your step-through scenario. Does we expect all state to be viewable in a common window? Is it useful to distinguish between variables of varying scope? etc. I believe we need to do some thinking here.

You have raised a new concern around constants. Maybe that's valuable, I'm not sure. Some debuggers also have a place for 'automatics', a place to show implied variables that may not be explicit in source code (such as an exception instance that pops up out of nowhere during execution).

I appreciate all the good thinking/speculation here. We would do be well-advised to collect some traces from various tools to drive progress. Probably should create a bucket issue around all the new topics that have turned up as this conversation has turned into a more systematic review of this area.

@ghost
Copy link

ghost commented Jun 20, 2018

One way that a SARIF viewer differs from a debugger is that the SARIF producer gets to decide which variables the user will find interesting at each location in the code flow. So it might be less important to segregate variables by type than it is in a debugger. For example, the producer could say:

const x                 42
return value            null
s_count                 5

@michaelcfanning
Copy link
Contributor Author

This is a good point, you are noting that one position we could have is that a step-through of a code flow is an extremely directed experience, in that the SARIF data is explicitly designed to show only the relevant information. That's a good value. On the other hand, this implies a certain rigidity, we've limited the ability of viewers to provide alternate/competing views.

I actually thought that our spec said that the state shouldn't recapitulate data except for what has changed. This is an optimization that prevents the need to keep repeating state values for data that hasn't changed. Does our spec actually say that? If not, we need yet another issue on code flows.

Which continues to be my most urgent concern. With all due respect to our highly intelligent discussion here, I won't really feel like we're making progress until we're driving SARIF v2 examples through multiple tools through multiple viewers. Have you and Chris taken a look at SDV yet in our current implementation of various things?

@ghost
Copy link

ghost commented Jun 20, 2018

The spec does not say that state should only mention values that have changed. If we want to make it say that, we have to figure out how to tell the viewer "You know that variable y we were just showing? It's not important any more (maybe it went out of scope; maybe we just don't care about it any more), so stop showing it."

With regard to SDV: The sarif-v2 branch of the SARIF SDK includes all the code flow changes, except for (1) the rename of codeFlowLocation to threadFlowLocation, and (2) the addition of threadFlowLocation.timestamp.

That means that the SDV-to-SARIF converter is already up to date with those changes, and the tests demonstrate that the converter produces valid SARIF in the new format. What more SDV-related validation did you have in mind?

@michaelcfanning
Copy link
Contributor Author

I would like someone to please assess all existing SARIF v2 samples, whether produced directly or by conversion, in whatever SARIF v2 code flow consuming mechanisms exist today. These include a web control, a VS Code extension and a VS add-in, minimally. I am not attempting to provide a complete test methodology for evaluating SDV. I am hoping that you, or Chris, or another resource, can take this on.

Send me mail offline if this isn't clear and we need to work against a more clearly articulated plan. This discussion is getting off topic.

@michaelcfanning michaelcfanning added the p2 Priority 2 issue to close label Jan 24, 2019
@michaelcfanning
Copy link
Contributor Author

michaelcfanning commented Mar 15, 2019

E-BALLOT PROPOSAL:

  • In the codeFlow object:
    • Add property immutableState of type object with string-valued properties, to hold information like HTTP headers and forms data. That was the initial motivation for this issue.
    • Add property initialState of type object with string-valued properties (parallel to the existing property graphTraversal.initialState).
  • In the graphTraversal object:
    • Add property immutableState of type object with string-valued properties (parallel to the new property codeFlow.immutableState).
  • In the spec, clarify that threadFlowLocation.state and edgeTraversal.finalState must mention every relevant state variable, even those unchanged since the previous step, because otherwise, if a state variable was present at step n but missing at step n + 1, you couldn't tell if it still existed but its value was unchanged, or if it had gone out of scope.

@ghost ghost added change-draft-available merged Changes merged into provisional draft. labels Mar 15, 2019
@ghost
Copy link

ghost commented Mar 16, 2019

E-BALLOT PROPOSAL

Enable the representation of "immutable state": state that remains constant throughout the traversal of a graph or a code flow. This is useful for capturing, for example, HTTP request headers or forms data.

Also enable the representation of "initial state": mutable state captured before the start of a code flow or graph traversal. The graphTraversal object already defined an initialState property for this purpose.

SCHEMA CHANGES

  • In the codeFlow object:

    • Add property initialState of type object with string-valued properties.
    • Add property immutableState of type object with string-valued properties.
  • In the graphTraversal object:

    • Add property immutableState of type object with string-valued properties.

@kupsch
Copy link

kupsch commented Mar 18, 2019

This could be a separate ticket, but it is a change to the description for the state value, so it relates to this ticket.

For many of the tools that I have seen that include flows or traversals, when explaining a flow or graph traversal state, they specify the value of an expression as a literal value (expressible in this current design), or as a constraint on the value (currently not not expressible). Such as

{
  "i":  "i < 5",
  "j":  "j >= 10 && j < 100",
  "p":  "p != NULL ",
  "r":  "r == NULL",
  "s":  "strlen(s) > 100",
  "t":  "buffer t is not NULL terminated"
}

This state constraint information that many tools report, could incorporate this without changing the schema by describing that a magic string value such as "{expr}" is used to represent a constraint expression. If "{expr}" exists in the value then it is a constraint (the value may substitute the actual expression for {expr}. If "{expr}" does not exist in the value then this is an equality constraint equivalent to "{expr} == VALUE" where VALUE is the value of the JSON property.

The above constraints would then become a state object of

{
  "i":  "{expr} < 5",
  "j":  "{expr} >= 10 && {expr} < 100",
  "p":  "{expr} != NULL ",
  "r":  "NULL",
  "s":  "strlen({expr}) > 100",
  "t":  "buffer {expr} is not NULL terminated"
}

For viewer just displaying the values above would probably be deciphered with the correct meaning by most developers, and advanced viewers could do more to indicate a general constraint instead of just equality.

ghost pushed a commit that referenced this issue Mar 19, 2019
This draft contains all the changes through "e-ballot #2," which
opened on Friday March 15 and will close on Friday March 22. It contains
changes for ballot issues #168, #291, #309, #320, #321, #326, #335,
and #341, as well as for previously approved issue #340.

It does _not_ contain changes for any issues from "e-ballot #3,"
which will open on Friday March 22 and close on Friday March 29.
@ghost ghost added resolved-fixed design-approved The TC approved the design and I can write the change draft labels Mar 29, 2019
@ghost
Copy link

ghost commented Mar 29, 2019

Approved in e-ballot #2.

@ghost ghost closed this as completed Mar 29, 2019
@ghost
Copy link

ghost commented Apr 6, 2019

@kupsch I will open an issue for your "arbitrary expression" suggestion, and I'll take care of it. Since all the state-ish properties are just string-to-string dictionaries, there's no downside to this, it's non-breaking, it's just words in the spec, and I have the time.

@michaelcfanning FYI

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. code-flows design-approved The TC approved the design and I can write the change draft e-ballot-2 enhancement impact-non-breaking-change merged Changes merged into provisional draft. p2 Priority 2 issue to close resolved-fixed
Projects
None yet
Development

No branches or pull requests

2 participants