Decide how to handle uncommon line break characters #169

ghost · 2018-05-11T21:29:33Z

Unicode defines line-breaking characters that are not commonly used, such as NEL (U+0085), LS (U+2028), and PS (U+2029). This can cause problems if a SARIF producer recognizes this characters as line breaks (and so counts them in region.startLine and region.endLine), but the SARIF consumer does not (and so highlights the region incorrectly). Or vice versa, a producer might not recognize them but an editor might, and again, it will highlight the wrong region.

Is there anything SARIF can do to help? Like defining a property run.defaultNewlineSequences:string[] that says what the producer considers to be a newline sequence?

@michaelcfanning @kupsch

The text was updated successfully, but these errors were encountered:

ghost · 2018-06-02T17:54:28Z

@michaelcfanning @kupsch

We already partially ducked the issue (in the sense of that we avoided telling consumers what line breaking convention a file uses) by adding this text to §3.2.2, "fileContent.text property":

Notwithstanding any necessary transcoding and escaping, the SARIF producer SHALL preserve the text file’s line breaking convention (for example, "\n" or "\r\n").

We could complete the ducking process by adding this text to §3.22.2, Text regions:

In calculating line numbers, SARIF producers SHALL conform to the text file's line breaking convention (for example, "\n" or "\r\n"). SARIF consumers that interact with text regions (for example, by highlighting them) SHALL also conform to the text file's line breaking convention.

NOTE: This requirement applies even if the text file includes line-breaking characters that are not commonly used, such as NEL (U+0085), LS (U+2028), and PS (U+2029).

Who would consume the proposed property run.defaultNewlineSequences? Do editors commonly have an option to define the set of line breaks? If not, I don't see how this property would help.

michaelcfanning · 2018-06-04T13:17:04Z

I am thinking we are getting a bit out of our charter here. Why are we specifying the producers behavior in regards to text files?

The run.defaultNewlineSequences is a mechanism that provides a clue to how the producer behaved.

Most editors don't provide configuration around this. But the property could be used by a post-processor to recompute line breaks (when knowing a specific editor that the file was headed to).

ghost · 2018-06-04T21:55:23Z

@michaelcfanning No, I'm not out of charter. The behavior I specified above simply required both the SARIF producer and the SARIF consumers to be aware of the file's line break conventions:

The SARIF producer needs to be aware so it can calculate the line numbers where issues occur.
The SARIF consumer needs to be aware so it can render the file properly (perhaps by transforming uncommon newline characters into newline sequences that the target editor can handle).

I'm sure we both agree that the producer needs to know what line number it's on. The only real question is how the consumer figures it out. My proposal above just ducked the issue by telling the consumer that it needed to "conform to" the file's link breaking convention. Your proposal (as I now understand it) is for the producer to give the consumer a hint by populating run.defaultNewlineSequences. A post-processor can help out by translating newline sequences in advance.

Note that you're asking more of the producer than I was. You're asking the producer to:

Be aware of the file's line breaking convention.
Populate run.defaultNewlineSequences if the file uses any uncommon newlines.

I was just asking for #1.

ghost · 2018-06-04T21:57:36Z

@michaelcfanning In addition to run.defaultNewlineSequences, do we want to define file.newlineSequences so that most "normal" files use the defaults, but oddball files can say "Hi, I use NEL!"?

ghost · 2018-06-05T17:33:14Z

@michaelcfanning That would be consistent with the pair of properties run.defaultFileEncoding and file.encoding.

ghost · 2018-08-28T01:49:23Z

@michaelcfanning @kupsch

To be clear, my previous two comments are moot if we decide against introducing run.defaultNewlineSequences (and as my previous comments indicated, I don't favor introducing that property).

kupsch · 2018-08-28T16:02:02Z

I would add the defaultNewLineSequence and newLineSequence properties (or both could be called newLineSequence).

The fundamental problem here is that a viewer (program or human) needs to know the assessment tool's interpretation of line ending, so it can compute the line number in the same was as a tool. This is same reason that a viewer needs to know the tools interpretation of the file's encoding. These don't have to be "correct," but they do need to be consistent between the producer and the viewer. Other than heuristics or backchannel (tightly coupled tool and viewer), I think that it would be useful to have the producer clearly disclose its interpretation.

In issue #93, I proposed adding a lineEndSeqs property on the file, run and sarif object with a default being CRLF, LF, CR, and NEL. To remove ambiguity, I would propose that there at least be a new property on the run object that is an array of possible line ending sequences, and that the default be [CRLF, LF]. If not specified this should produce reasonable results for most uses if the tool uses the standard Linux/UNIX or Windows conventions (the only anomaly would be a windows file with lone LFs which would be unusual). If a producer considers other characters such at NEL, CR, VT, FF, LS, or PS to end a line then they SHALL specify the property. Producers SHOULD populate the property if they compute line endings as only CRLF or only LF.

Having a property on the file object would generalize and allow different files to have different line ending interpretation. It would be unusual for a file to be processed multiple times with different sequences, so no need to support, but it could happen as this isn't really a property of the file, but of the run of the tool and a produced result.

The spec should specify that to determine if the character(s) at an offset is a line ending sequence the processor should proceed as if it tested each element of the array in order for a match and stopped when a match is found; and that the search for the next new line seuquences proceeds from the character following the match. As an example the text "one\r\ntwo" finds one new line sequence (CRLF) with the array [CRLF, CR, LF], and two (CR and LF) with the array [CR, LF, CRLF].

ghost · 2018-11-06T17:57:07Z

A consumer that performs fixes would be use this property.

michaelcfanning · 2018-11-06T18:09:53Z

This property is approved at the run level. The default is CRLF, LF.

ghost · 2018-11-06T21:30:14Z

Confirmed with @michaelcfanning. then run-level property is named newlineSequences (as opposed to defaultNewlineSequences) because we don't anticipate the need for a file-level override.

ghost · 2018-11-06T21:31:59Z

We put the property on run rather than tool because it might be configurable; run rather than invocation because nothing else on invocation has semantics of interpreting the command line arguments.

ghost added discussion-ongoing CSD.1 Will be fixed in CSD.1. labels May 11, 2018

ghost assigned michaelcfanning May 11, 2018

ghost mentioned this issue May 11, 2018

Problems with regions #93

Closed

ghost added enhancement impact-non-breaking-change labels Jun 2, 2018

ghost added 2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. and removed CSD.1 Will be fixed in CSD.1. labels Jun 6, 2018

ghost added triage-approved to-be-written design-approved The TC approved the design and I can write the change draft and removed discussion-ongoing labels Nov 7, 2018

ghost pushed a commit that referenced this issue Nov 7, 2018

Change draft for #169: Handle uncommon line break characters.

54a87b7

ghost self-assigned this Nov 7, 2018

ghost added change-draft-available and removed to-be-written labels Nov 7, 2018

ghost pushed a commit that referenced this issue Nov 16, 2018

Merge #169: newlineSequences.

678b86d

ghost added resolved-fixed and removed change-draft-available labels Nov 16, 2018

ghost closed this as completed Nov 16, 2018

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decide how to handle uncommon line break characters #169

Decide how to handle uncommon line break characters #169

ghost commented May 11, 2018 •

edited by ghost

Loading

ghost commented Jun 2, 2018 •

edited by ghost

Loading

michaelcfanning commented Jun 4, 2018

ghost commented Jun 4, 2018

ghost commented Jun 4, 2018

ghost commented Jun 5, 2018

ghost commented Aug 28, 2018

kupsch commented Aug 28, 2018

ghost commented Nov 6, 2018

michaelcfanning commented Nov 6, 2018

ghost commented Nov 6, 2018 •

edited by ghost

Loading

ghost commented Nov 6, 2018

Decide how to handle uncommon line break characters #169

Decide how to handle uncommon line break characters #169

Comments

ghost commented May 11, 2018 • edited by ghost Loading

ghost commented Jun 2, 2018 • edited by ghost Loading

michaelcfanning commented Jun 4, 2018

ghost commented Jun 4, 2018

ghost commented Jun 4, 2018

ghost commented Jun 5, 2018

ghost commented Aug 28, 2018

kupsch commented Aug 28, 2018

ghost commented Nov 6, 2018

michaelcfanning commented Nov 6, 2018

ghost commented Nov 6, 2018 • edited by ghost Loading

ghost commented Nov 6, 2018

ghost commented May 11, 2018 •

edited by ghost

Loading

ghost commented Jun 2, 2018 •

edited by ghost

Loading

ghost commented Nov 6, 2018 •

edited by ghost

Loading