Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide how to handle uncommon line break characters #169

Closed
ghost opened this issue May 11, 2018 · 11 comments
Closed

Decide how to handle uncommon line break characters #169

ghost opened this issue May 11, 2018 · 11 comments
Assignees
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. design-approved The TC approved the design and I can write the change draft enhancement impact-non-breaking-change resolved-fixed triage-approved

Comments

@ghost
Copy link

ghost commented May 11, 2018

Unicode defines line-breaking characters that are not commonly used, such as NEL (U+0085), LS (U+2028), and PS (U+2029). This can cause problems if a SARIF producer recognizes this characters as line breaks (and so counts them in region.startLine and region.endLine), but the SARIF consumer does not (and so highlights the region incorrectly). Or vice versa, a producer might not recognize them but an editor might, and again, it will highlight the wrong region.

Is there anything SARIF can do to help? Like defining a property run.defaultNewlineSequences:string[] that says what the producer considers to be a newline sequence?

@michaelcfanning @kupsch

@ghost ghost added discussion-ongoing CSD.1 Will be fixed in CSD.1. labels May 11, 2018
@ghost ghost assigned michaelcfanning May 11, 2018
@ghost ghost mentioned this issue May 11, 2018
@ghost
Copy link
Author

ghost commented Jun 2, 2018

@michaelcfanning @kupsch

We already partially ducked the issue (in the sense of that we avoided telling consumers what line breaking convention a file uses) by adding this text to §3.2.2, "fileContent.text property":

Notwithstanding any necessary transcoding and escaping, the SARIF producer SHALL preserve the text file’s line breaking convention (for example, "\n" or "\r\n").

We could complete the ducking process by adding this text to §3.22.2, Text regions:

In calculating line numbers, SARIF producers SHALL conform to the text file's line breaking convention (for example, "\n" or "\r\n"). SARIF consumers that interact with text regions (for example, by highlighting them) SHALL also conform to the text file's line breaking convention.

NOTE: This requirement applies even if the text file includes line-breaking characters that are not commonly used, such as NEL (U+0085), LS (U+2028), and PS (U+2029).

Who would consume the proposed property run.defaultNewlineSequences? Do editors commonly have an option to define the set of line breaks? If not, I don't see how this property would help.

@michaelcfanning
Copy link
Contributor

I am thinking we are getting a bit out of our charter here. Why are we specifying the producers behavior in regards to text files?

The run.defaultNewlineSequences is a mechanism that provides a clue to how the producer behaved.

Most editors don't provide configuration around this. But the property could be used by a post-processor to recompute line breaks (when knowing a specific editor that the file was headed to).

@ghost
Copy link
Author

ghost commented Jun 4, 2018

@michaelcfanning No, I'm not out of charter. The behavior I specified above simply required both the SARIF producer and the SARIF consumers to be aware of the file's line break conventions:

  • The SARIF producer needs to be aware so it can calculate the line numbers where issues occur.
  • The SARIF consumer needs to be aware so it can render the file properly (perhaps by transforming uncommon newline characters into newline sequences that the target editor can handle).

I'm sure we both agree that the producer needs to know what line number it's on. The only real question is how the consumer figures it out. My proposal above just ducked the issue by telling the consumer that it needed to "conform to" the file's link breaking convention. Your proposal (as I now understand it) is for the producer to give the consumer a hint by populating run.defaultNewlineSequences. A post-processor can help out by translating newline sequences in advance.

Note that you're asking more of the producer than I was. You're asking the producer to:

  1. Be aware of the file's line breaking convention.
  2. Populate run.defaultNewlineSequences if the file uses any uncommon newlines.

I was just asking for #1.

@ghost
Copy link
Author

ghost commented Jun 4, 2018

@michaelcfanning In addition to run.defaultNewlineSequences, do we want to define file.newlineSequences so that most "normal" files use the defaults, but oddball files can say "Hi, I use NEL!"?

@ghost
Copy link
Author

ghost commented Jun 5, 2018

@michaelcfanning That would be consistent with the pair of properties run.defaultFileEncoding and file.encoding.

@ghost ghost added 2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. and removed CSD.1 Will be fixed in CSD.1. labels Jun 6, 2018
@ghost
Copy link
Author

ghost commented Aug 28, 2018

@michaelcfanning @kupsch

To be clear, my previous two comments are moot if we decide against introducing run.defaultNewlineSequences (and as my previous comments indicated, I don't favor introducing that property).

@kupsch
Copy link

kupsch commented Aug 28, 2018

I would add the defaultNewLineSequence and newLineSequence properties (or both could be called newLineSequence).

The fundamental problem here is that a viewer (program or human) needs to know the assessment tool's interpretation of line ending, so it can compute the line number in the same was as a tool. This is same reason that a viewer needs to know the tools interpretation of the file's encoding. These don't have to be "correct," but they do need to be consistent between the producer and the viewer. Other than heuristics or backchannel (tightly coupled tool and viewer), I think that it would be useful to have the producer clearly disclose its interpretation.

In issue #93, I proposed adding a lineEndSeqs property on the file, run and sarif object with a default being CRLF, LF, CR, and NEL. To remove ambiguity, I would propose that there at least be a new property on the run object that is an array of possible line ending sequences, and that the default be [CRLF, LF]. If not specified this should produce reasonable results for most uses if the tool uses the standard Linux/UNIX or Windows conventions (the only anomaly would be a windows file with lone LFs which would be unusual). If a producer considers other characters such at NEL, CR, VT, FF, LS, or PS to end a line then they SHALL specify the property. Producers SHOULD populate the property if they compute line endings as only CRLF or only LF.

Having a property on the file object would generalize and allow different files to have different line ending interpretation. It would be unusual for a file to be processed multiple times with different sequences, so no need to support, but it could happen as this isn't really a property of the file, but of the run of the tool and a produced result.

The spec should specify that to determine if the character(s) at an offset is a line ending sequence the processor should proceed as if it tested each element of the array in order for a match and stopped when a match is found; and that the search for the next new line seuquences proceeds from the character following the match. As an example the text "one\r\ntwo" finds one new line sequence (CRLF) with the array [CRLF, CR, LF], and two (CR and LF) with the array [CR, LF, CRLF].

@ghost
Copy link
Author

ghost commented Nov 6, 2018

A consumer that performs fixes would be use this property.

@michaelcfanning
Copy link
Contributor

This property is approved at the run level. The default is CRLF, LF.

@ghost
Copy link
Author

ghost commented Nov 6, 2018

Confirmed with @michaelcfanning. then run-level property is named newlineSequences (as opposed to defaultNewlineSequences) because we don't anticipate the need for a file-level override.

@ghost
Copy link
Author

ghost commented Nov 6, 2018

We put the property on run rather than tool because it might be configurable; run rather than invocation because nothing else on invocation has semantics of interpreting the command line arguments.

@ghost ghost added triage-approved to-be-written design-approved The TC approved the design and I can write the change draft and removed discussion-ongoing labels Nov 7, 2018
@ghost ghost self-assigned this Nov 7, 2018
ghost pushed a commit that referenced this issue Nov 16, 2018
@ghost ghost closed this as completed Nov 16, 2018
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. design-approved The TC approved the design and I can write the change draft enhancement impact-non-breaking-change resolved-fixed triage-approved
Projects
None yet
Development

No branches or pull requests

2 participants