Provide optional result.rank value of 0.0 to 100.0 #280
Your popularity with this audience is proof against reraising this issue. :) I am personally open to revisiting this issue in any case. There are a handful of limits that we drew in our earlier design/discussion that were helpful to focus efforts on what was most actionable. In some cases, these decisions were skewed by problems we perceived in unifying the outputs of existing tools. This is one of those topics. At this point, however, having (presumably) landed a substantive body of relevant data that covers a broad range of open source, commercial, and private industry tooling, it's worth discussing the best design for direct production of useful concepts, understanding that our decisions will work for some but not all.

I like your proposal. I'd suggest that we call this property 'ranking'. The term 'priority' may be useful. The only rule is that a value for one result which is higher than the value for another means that the former result should be prioritized over the latter. There isn't much difficulty in this sorting mechanism for the output of a single tool (absent radical changes in prioritization between tool versions). In order to handle interleaved sorting for multiple tools, we could recommend that tools distribute their ranking values in some common way. Simply dividing the range into rough quartiles (not very important, possibly important, important, and very important) could be useful. Alternately, we could be very clear that this relative comparison is only relevant to results produced by the same tool. That approach would provide more utility, I think, allowing direct use of things like ML model computations, where the bucketing suggested above might be an arbitrary or impossible-to-realize process.

This topic relates to other open questions around how to categorize results. We really should consider all of these at once as we close on them. I will propose to Larry that we apply 'results-categorization' as a label so we can track them all.
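The interleaved-sorting and quartile-bucketing ideas above can be sketched in a few lines. This is a hypothetical illustration, not real SARIF output or a defined API: the `rank` property name and the 0.0-100.0 range follow the proposal in this thread, and the result dicts and bucket labels are invented for the example.

```python
# Hypothetical sketch: merge results from several tools by a common
# 0.0-100.0 "rank" value, then bucket into the rough quartiles
# suggested above. All names here are illustrative, not SARIF-defined.

QUARTILES = ["not very important", "possibly important",
             "important", "very important"]

def quartile(rank):
    """Map a 0.0-100.0 rank onto the rough quartile buckets."""
    return QUARTILES[min(int(rank // 25), 3)]

def interleave(runs):
    """Merge results from several tools' runs, highest rank first."""
    merged = [r for run in runs for r in run["results"]]
    return sorted(merged, key=lambda r: r["rank"], reverse=True)

runs = [
    {"tool": "toolA", "results": [{"ruleId": "A1", "rank": 87.5},
                                  {"ruleId": "A2", "rank": 12.0}]},
    {"tool": "toolB", "results": [{"ruleId": "B1", "rank": 55.0}]},
]

for result in interleave(runs):
    print(result["ruleId"], quartile(result["rank"]))
# → A1 very important / B1 important / A2 not very important
```

Note that this only works across tools if they distribute rank values comparably; under the alternative reading (ranks comparable only within one tool), the `interleave` step would be invalid.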
@lgolding, your thoughts? I'd like to propose that we develop a change draft for this request. I think we probably should have a float here, as rounding to an integer from 0 - 100 might not provide sufficient precision, particularly for model-generated rankings. So let's start with either a float from 0.0F to 1.0F or from 0.0F to 100.0F. Note that this request clearly falls outside the ambiguity around the degree to which we get inside the results management domain: producing a prioritization/ranking value is something that we expect tool producers to do.
I am open to this idea but first I'd like to clear up a point of confusion that keeps creeping into this discussion. @michaelcfanning wrote:
"Priority" is not the same as "level" or "severity". "Severity" means "how bad is the bug". If a problem is too severe, something bad happens. Your code won't compile, or your device driver won't get the Windows logo, or you won't be able to sell your product in the EU. An analysis tool can assign a value for "severity". "Priority" means "how important is it to fix the bug". Sure, the code won't compile, but we're about to delete the code. Sure, the device driver won't get the Windows logo, but we're not submitting for logo certification until next year. Sure, we can't sell in the EU, but this product won't ship to the EU. A tool cannot assign a value for "priority" -- but a human being, or a tool that is aware of the engineering team's process and context, might. I made this point repeatedly in #20. If you think I'm wrong, please correct me; otherwise, let's leave "priority" out of this discussion. Please note that #20 was about whether we should add
Ok, but this issue (#280) is about severity and how to represent it. So here are my thoughts:
My apologies for not making this clearer. I reject the notion of documenting severity in favor of approving a ranking or prioritization. The problem with severity in a static analysis context is that the presumed risk of an issue (and a ranking for it) cannot easily be separated from the certainty that we've identified an issue. So, a presumed persistent XSS issue would be quite severe indeed, but the reality may be that our detection technique is very rough and likely to produce false positives. A simple prioritization or ranking property avoids making strict assertions about defects while still allowing tools to apply their own internal ranking mechanisms. Several tools have these. This kind of ranking can also be applied post facto by producing ML models that integrate other things (like reliability data) into the computation.

I think this is clear, but I don't find much in your previous reply that's relevant to my specific suggestion. :) My position is that our existing result levels reflect process designations (these errors must be fixed, these errors should be fixed, you may be interested to know about these other errors that you will never fix). There is a coarse relationship to severity, but in practice many issues that could, in fact, be severe if a true positive are dropped down for other reasons. We do have an open issue around result level where it will be useful to recall all your feedback. I've updated the title of this issue to reflect that we're discussing ranking/prioritization.
I would love to have a numeric priority/ranking property, as I made clear in #20. But you are saying you don't want SARIF to represent "severity" at all. As you wrote:
(my emphasis). Rather, you apparently see our existing "severity-centric" level property as serving that role. I have multiple concerns with that:
Sorry again for the confusion. My statement was about this specific issue. In this specific issue, I am discussing prioritization, not severity. I am glad to discuss severity, but would like to keep it distinct from the discussion around ranking/priority. And I don't see level as a surrogate for ranking. #215 is the open issue around result.level, which provides an open door to revisiting this property. Let's please focus this issue on the proposal for prioritizing/ranking issues using a percentile value from 0.0 to 1.0. If you agree on this, great, let's produce a change for that. Glad to discuss severity otherwise in another issue.
I just talked with @michaelcfanning. We just had a misunderstanding. When Michael wrote:
... he did not mean to suggest that we shouldn't represent "severity" in SARIF. He meant that he didn't want to discuss severity in this issue! He wants to limit the discussion in this issue to priority/ranking. phew

So, bottom line: I support the addition of the proposed property. I prefer "priority" to "ranking" because "ranking" doesn't state the criterion according to which the result is being ranked. I prefer 100.0 to 1.0 so that most of the time an integer persisted value will suffice.
I think we all agree (you, Paul and I) that a priority property with a value from 0.0 to 100.0 makes sense. How about producing a change draft? It is interesting to wonder whether it would be helpful to allow users to specify named pairs here (to allow decorating results with multiple rankings). A tool might emit its own, in which case it could simply use its tool name as the priority label (or perhaps the absence of a priority 'name' would indicate it was emitted directly by the tool). Could be a dictionary. Could be an ordered array of unique members, as we wouldn't expect this set to be numerically intimidating.
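The named-pairs idea above might look something like the following. This is purely a sketch of the design being floated, not anything in the SARIF schema: the `rankings` property name, the empty-string convention for the tool's own ranking, and the `mlTriageModel` label are all hypothetical.

```python
# Hypothetical sketch of "multiple named rankings" on a result.
# The convention assumed here: an empty-string key means the ranking
# was emitted directly by the tool; other keys name a post-facto ranker.

result = {
    "ruleId": "CA2100",
    "rankings": {
        "": 72.5,               # emitted directly by the tool
        "mlTriageModel": 31.0,  # applied post facto, e.g. by an ML model
    },
}

def tool_rank(result):
    """Return the tool's own ranking, or None if only post-facto rankers ran."""
    return result["rankings"].get("")

print(tool_rank(result))  # → 72.5
```

A dictionary keyed by ranker name keeps lookups simple; the ordered-array alternative mentioned above would instead need a small search over members.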
Let's do the simple thing (a
This is where 'ranking' more accurately reflects tooling. 0 should be the lowest value. You are correct that pri0 is the highest MS priority, but the only other possible values are 1, 2, 3 and sometimes 4. I prefer "ranking" as a term because that's what tools are generally doing. The priority of an issue can change for lots of reasons outside of tooling control. This is all semantics, however. I care less about the term than I do that 0.0 should be the lowest ranking/priority and 100.0 should be the highest. Like a test score. Agree this property needs to be a non-integer value. Most of our ranking techniques here have greater precision than afforded by an integer range of 0 - 100.
How about we split the difference. My objection to "ranking" was that it didn't say what was being ranked. How about
I like the idea of using the term "ranking" (or just plain "rank"). I look forward to seeing the change draft.
It is, as usual, a property of the result. Would result.rank feel clearer to you? Thinking it through, 'ranking' feels like the right term to me only because I follow sports in a few specific contexts. :) That also happens to be the term used in the prominent internal MS checkers that provide this data. I think 'result.rank' is perfectly clear, if you agree.
I yield. I still think
- Define ruleConfiguration.defaultRank.
- Note that rank values from different tools might not be commensurable.
I hope this isn't going to make me unpopular, but I really want to revisit this issue.
I've re-read the contents of issue #20 and conclude that the standard would be better if we allowed tools to specify a numeric value for the severity of a rule and a result.
In issue #20 it was argued that the "level" property is intended to represent severity. Fair enough, but that property conflates severity with another concept that I consider different: roughly speaking, the response of the checker.
The problem I'm facing is that most of the tools I work with (not just CodeSonar) specify positive results that indicate a potential problem with the code, and they mostly specify a severity that is more precise than can be expressed with level. In the current spec, if the checker yields a warning, then level can be "note", "warning" or "error". None of the other values for level appear to be useful for this category of tools. That only gives me three values, and this isn't nearly enough resolution. Given that we have already conceded that level can indicate severity, why not allow it to be specified with more precision?
I don't buy the argument that it is equally good to put this information in a property bag. As a Sarif importer, that mandates special code for each Sarif generator, because it is unlikely that all tools will agree on the name of the property and the domain of its values.
Consequently I propose that we have a "severity" property that is an integer from zero to 100 where higher values indicate higher severity. I don't feel strongly about the domain. A float from 0.0 to 1.0 would be just as good.