Specify optional property file.sourceLanguage to guide syntax-driven colorization of snippets #286

michaelcfanning · 2018-11-17T02:01:57Z

Our snippets are file contents. Some tools produce formatted snippets. For example, they create a gutter in the left margin that shows line numbers. They provide syntax coloring for the snippet.

SARIF has no way for a tool producer to create these formatted snippets. Instead, the consumer is required to examine the textual snippet and do something sensible with it. We have done this internally in our pipeline with some success.

Currently, we are looking at converting a tool's output that produces a formatted snippet. It occurred to me that we could define another property on fileContent that would be designed to store a formatted version of its contents. I suppose this would be called fileContent.richText in order to follow other places in the format where we allow for markdown-rendered text.

michaelcfanning · 2018-11-20T20:09:41Z

Per offline discussion, there's a question around whether GFM allows for colorization of snippets. Note that the markdown embedded in this note uses HTML to provide colorization. This displays in VS Code but not here.

Item	Status	Notes
Baseline existing source	Done

Here's a CSS snippet using a markdown language hint. Note that it is colorized.

    #button {
        border: none;
    }

ghost · 2018-11-21T23:04:58Z

Two thoughts:

With a plain-text snippet, the location of the snippet together with the location of the result allows a SARIF consumer (such as a Web UI) to easily display the snippet and indicate the position of the result, for example:

    int x = 0;
    int z = y / x;
                ^ divide by zero

But if you allow a "formatted snippet" containing Markdown, it will be difficult for the consumer to locate the actual character position within the marked up source code where the problem occurred:

123    <span style="color:blue">int</span> x = 0;<br/>
124    <span style="color:blue">int</span> z = y / x;
                ^ divide by zero

Relying on the GFM "triple-backtick + language" syntax is a nice idea. But then we don't need a formatted snippet at all -- just an indication of the source code's language. A new property file.sourceLanguage might serve (together with a run-level property defaultSourceLanguage).

But!

That doesn't give you the line numbers.
We'll have a huge discussion about what languages to include in this sourceLanguage property's enumeration, or whether to leave it open (and then leave it to consumers to decide if "C" is the same as "c", or "C++" the same as "cplusplus").

I suggest this feature might not be worth the trouble.

michaelcfanning · 2018-11-23T19:38:31Z

I independently reached some of your points but not your final conclusion. :) First, great point, plaintext snippets are inherently valuable. I'd independently reached the same conclusion as you: what we need here is simply a language designation.

I do think this feature is worth the trouble, mostly due to observing our web development team struggle with the problem of how to handle code snippets in our browser-based results explorer. For this reason, I'm willing to invest time in discussion. :)

We should favor the approach that @fishoak suggested in TC, an open-ended enum populated with some preliminary values. When defining new language values, we can suggest that producers stick with certain conventions:

use an entirely lower-case name
use a hierarchical string if it's helpful to specify a particular language variant or implementation (e.g., 'sql/tsql')
prefer spelling out things like plusplus/++ or sharp/# (or explicitly advise against it; we should make a call on this).
allow for some level of duplication to emerge around use of standard abbreviations (or explicitly advise against it). a consumer interested in vb, for example, might reasonably look for either 'vb' or 'visualbasic'

With the guidance above, if a tool producer happens to decide to author a static analysis capability for the 'Racket' or 'R++' languages, the conventions we provide allow for predictable enum values of 'racket' and 'rplusplus' respectively.
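As a rough illustration of the conventions above, here is a minimal Python sketch of a normalizer. The function name, the replacement table, and the choice to drop whitespace are assumptions made for this example only; the thread has not yet settled whether to spell out '++' and '#'.

```python
import re

# Hypothetical spelled-out replacements; the thread leaves the final call open.
_SPELLED_OUT = {"++": "plusplus", "#": "sharp"}


def normalize_language_name(name: str) -> str:
    """Lower-case a language name and spell out '++' and '#'."""
    result = name.strip().lower()
    for symbol, word in _SPELLED_OUT.items():
        result = result.replace(symbol, word)
    # Assumption: drop whitespace so 'Visual Basic' becomes 'visualbasic'.
    # The '/' separator of hierarchical names such as 'sql/tsql' is kept.
    return re.sub(r"\s+", "", result)


for raw in ("Racket", "R++", "C#", "SQL/TSQL"):
    print(raw, "->", normalize_language_name(raw))
```

A producer following these rules would arrive at 'racket' and 'rplusplus' for the two examples mentioned above without consulting any registry.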

Here's a list of languages/formats that rank highly in current 'most utilized' data. Note that not all of these have significant static analysis tooling ecosystems. I've explicitly avoided adding languages that are not currently highly utilized for which a language value can easily be predicted ('fortran', 'basic', 'lisp', 'vbscript', etc.)

javascript
java
python
ruby
c++ or cplusplus
csharp or c#
c
css
go
html
objectivec
scala
swift
typescript
php
sql, [sql/tsql, sql/psql]
powershell
shell/unix [shell/bash, shell/csh, shell/tcsh, shell/ksh, shell/sh]
plaintext
markdown [markdown/gfm]
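One possible benefit of hierarchical values such as 'sql/tsql' or 'shell/bash' is that a consumer without a highlighter for the specific variant can fall back to the parent. The sketch below assumes this fallback behavior, which the thread does not spell out; `pick_highlighter` and the `supported` set are hypothetical names.

```python
def pick_highlighter(source_language: str, supported: set) -> str:
    """Return the most specific supported name, or 'plaintext' if none match."""
    parts = source_language.split("/")
    # Try 'shell/bash' first, then 'shell'.
    while parts:
        candidate = "/".join(parts)
        if candidate in supported:
            return candidate
        parts.pop()
    # Assumption: 'plaintext' (from the list above) is the safe default.
    return "plaintext"


supported = {"sql", "css", "shell/bash"}
print(pick_highlighter("sql/tsql", supported))    # falls back to the parent
print(pick_highlighter("shell/bash", supported))  # exact match
print(pick_highlighter("racket", supported))      # no highlighter available
```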

ghost · 2018-11-23T23:00:07Z

I suggest "spelling out" the language names. That will allow SDKs that wish to do so to use them as identifiers (for example, enum values).

I'm surprised not to see "html" on the language list. There's lots of static analysis around it. We should add it. (html, html/5?)

With those nits out of the way, here's what I think we agree on:

We do not add region.formattedSnippet.
We add file.sourceLanguage and run.defaultSourceLanguage with values as you suggest, with my modifications.
We do not make any explicit provision for line numbers. The consumer knows the line numbers from the region properties and can display them if it wants to.

That's enough for me to write the change draft.
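For illustration, a consumer might resolve the language for a file roughly as follows. This is a sketch only: it assumes the file-level property takes precedence over the run-level default, and it uses plain dictionaries rather than the real SARIF object shapes.

```python
from typing import Optional


def effective_source_language(file_obj: dict, run_obj: dict) -> Optional[str]:
    """Return file.sourceLanguage, else run.defaultSourceLanguage, else None."""
    # Assumption: an absent or empty file-level value defers to the run default.
    return file_obj.get("sourceLanguage") or run_obj.get("defaultSourceLanguage")


run = {"defaultSourceLanguage": "c"}
print(effective_source_language({"sourceLanguage": "cplusplus"}, run))
print(effective_source_language({}, run))
print(effective_source_language({}, {}))
```

When neither property is present the function returns `None`, and the viewer can simply display the snippet without colorization.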

michaelcfanning · 2018-11-23T23:22:08Z

Yes to all three of your points. I meant to note the same thing as your third point: all line data can be recovered from the region properties that are strongly associated with the snippet.
Yes, 'html' was an oversight and 'xml' probably should be on the list as well.
I don't think we need to provide for language version. The reason is that display for these snippets will mostly entail syntax coloring (not actual interpretation, compilation, etc.). I think most languages evolve in a way where the superset of keywords for the most current language version will allow a decent job. It might not be perfect, but it is sufficient for this scenario.

michaelcfanning · 2018-11-23T23:27:34Z

btw - can't resist, isn't it true that the plain text snippet example you provided could be constructed entirely from other SARIF constructs, suggesting that you shouldn't be doing this? for the same reasons that you shouldn't be injecting line numbers? 'divide by zero' here would presumably derive from the rule id. the spaces and ^ location would derive from region data.

also, note that a viewer might usefully provide syntax coloring for this snippet, while still finding value in using the ^ notation to call out the specific location. so, with our new proposal what you'd do is specify the language, provide only the snippet of source, and then reconstruct the ^ + other formatting at runtime.

basically, all snippets are just that, snippets of a file. don't inject anything else into them. what this means is that if a tool itself has some formatted notion of a result, SARIF doesn't provide a place to flow this formatted content along.

int x = 0;
int z = y / x;
            ^ divide by zero
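The approach described in this comment (specify the language, store only the source text, rebuild the presentation at display time) could look something like this for a viewer that emits GFM. It is a hedged sketch: `to_gfm_fence` is a hypothetical helper, and it assumes the viewer has already mapped the sourceLanguage value to whatever hint its Markdown renderer recognizes.

```python
import re


def to_gfm_fence(snippet: str, language_hint: str = "") -> str:
    """Wrap an unmodified snippet in a GFM fenced code block with a language hint."""
    # Use a fence longer than any backtick run inside the snippet,
    # so the snippet text cannot close the block early.
    longest = max((len(run) for run in re.findall(r"`+", snippet)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}{language_hint}\n{snippet}\n{fence}"


print(to_gfm_fence("int x = 0;\nint z = y / x;", "c"))
```

The snippet itself is passed through untouched; colorization is left entirely to the renderer.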

ghost · 2018-11-23T23:34:26Z

Sorry, I didn't explain my example clearly enough.

In my example, this is the snippet:

int x = 0;
int z = y / x;

Because it's plain text, and because the consumer has the region location, snippet location, and result message properties available, the consumer can easily render the snippet like this:

int x = 0;
int z = y / x;
            ^ divide by zero
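A minimal sketch of that rendering step, assuming 1-based line and column numbers and no tab characters in the snippet. The function and parameter names are hypothetical and stand in for values a consumer would read from the snippet's region, the result's region, and the result message.

```python
def render_snippet(snippet, snippet_start_line, result_line, result_column,
                   message, show_line_numbers=False):
    """Render a plain-text snippet with a caret under the result location."""
    lines = snippet.splitlines()
    # Width of the largest line number, used to align the optional gutter.
    width = len(str(snippet_start_line + len(lines) - 1))
    out = []
    for offset, text in enumerate(lines):
        line_no = snippet_start_line + offset
        gutter = f"{line_no:>{width}}  " if show_line_numbers else ""
        out.append(gutter + text)
        if line_no == result_line:
            # Columns are assumed to be 1-based, so pad by column - 1.
            pad = " " * (len(gutter) + result_column - 1)
            out.append(f"{pad}^ {message}")
    return "\n".join(out)


print(render_snippet("int x = 0;\nint z = y / x;", 123, 124, 13,
                     "divide by zero"))
print(render_snippet("int x = 0;\nint z = y / x;", 123, 124, 13,
                     "divide by zero", show_line_numbers=True))
```

The second call shows that a line-number gutter can be added at display time as well, so neither the caret nor the line numbers need to be stored in the snippet.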

michaelcfanning · 2018-11-26T18:19:20Z

Yes, completely agree, sorry for my confusion.

michaelcfanning · 2018-11-28T19:20:39Z

Open-ended enum
Describe principles for populating member with examples of non-trademarked language names
Add a non-normative appendix that provides a more exhaustive list

katrinaoneil · 2018-11-28T19:44:06Z

we also support the following languages:

ABAP
ActionScript
Apex
COBOL
ColdFusion

Also, would you consider something like JSP a separate "language" for the purposes of snippet colorization?

ghost · 2018-11-28T19:46:17Z

@katrinaoneil Yes. Similarly, the ASP.NET MVC "Razor" syntax.

kupsch · 2018-12-12T16:56:07Z

Attached is a spreadsheet of some research I did on programming languages and identifiers in use for them.

I took 12 sources of programming language lists, including 3 that rank the top 50 or 100, editors, JavaScript syntax highlighters, lines-of-code counters, and giant lists of languages. The spreadsheet is sorted by the 3 ranking sites and then by the number of references to the language in the other sources. It also includes the ids that the references use for the language (if present) and the display name. I think that we should use the commonly used id in SARIF to avoid an impedance mismatch with existing usage. The id, langName, and v_globs columns might be worth including in the appendix.

The list is quite long, but can be reasonably cut off when there are only two non-ranking site listings.

The column descriptions are below:

Tiobe2018-rank - rank in Tiobe 2018 index (https://www.tiobe.com/tiobe-index/)
ieee2018-rank - rank in IEEE 2018 index (https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages)
githutinfo2014-rank - rank in githut 2014 index (https://githut.info)
highlightjs - present in highlight.js (https://highlightjs.org)
vim81 - present in vim 8.1
cloc - present in cloc (https://cloc.sourceforge.net)
loc - present in loc (https://github.com/cgag/loc)
tokei - present in tokei (https://github.com/Aaronepower/tokei)
prismjs2018 - present in PrismJs (https://prismjs.com/)
rosettacode - present in Rosetta Code (http://www.rosettacode.org/wiki/Category:Programming_Languages)
wikipedia - present in Wikipedia (https://en.wikipedia.org/wiki/List_of_programming_languages)
Visualstudio - present in Visual Studio (https://code.visualstudio.com/docs/languages/identifiers)
count - number of non-ranking sources listing language
id - id for the language
langName - Display name for the language
v_globs - glob patterns for files (from vim)
h_display - highlightjs's display name
h_id - highlightjs's primary id
v_id - vim's primary id
h_aliases - highlightjs's id aliases
v_aliases - visual studio's id aliases
p_aliases - PrismJS's id aliases
c_fileExt - cloc's file extensions for language
h_desc - description from highlightjs's
v_desc - description from vim
w_desc - description from wikipedia

languages.xlsx

kupsch · 2018-12-13T17:01:37Z

Here is a new revision of the .xlsx and the .csv file (named .csv.txt to make GitHub happy). The .xlsx file now correctly displays the couple of non-ASCII characters (Excel's default is not UTF-8). It also contains minor corrections. The column descriptions are below:

Tiobe2018-rank - rank in Tiobe 2018 index (https://www.tiobe.com/tiobe-index/)
ieee2018-rank - rank in IEEE 2018 index (https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages)
githutinfo2014-rank - rank in githut 2014 index (https://githut.info)
highlightjs - present in highlight.js (https://highlightjs.org)
vim81 - present in vim 8.1
cloc - present in cloc (https://cloc.sourceforge.net)
loc - present in loc (https://github.com/cgag/loc)
tokei - present in tokei (https://github.com/Aaronepower/tokei)
prismjs2018 - present in PrismJs (https://prismjs.com/)
rosettacode - present in Rosetta Code (http://www.rosettacode.org/wiki/Category:Programming_Languages)
wikipedia - present in Wikipedia (https://en.wikipedia.org/wiki/List_of_programming_languages)
Visualstudio - present in Visual Studio (https://code.visualstudio.com/docs/languages/identifiers)
count - number of non-ranking sources listing language
id - id for the language
langName - Display name for the language
v_globs - glob patterns for files (from vim)
h_display - highlightjs's display name
h_id - highlightjs's primary id
v_id - vim's primary id
p_id - PrismJs's primary id
h_aliases - highlightjs's id aliases
v_aliases - visual studio's id aliases
p_aliases - PrismJS's id aliases
c_fileExt - cloc's file extensions for language
h_desc - description from highlightjs's
v_desc - description from vim
w_desc - description from wikipedia

languages.xlsx

languages.csv.txt

ghost added enhancement impact-non-breaking-change discussion-ongoing 2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. labels Nov 21, 2018

michaelcfanning changed the title ~~Consider providing a formatted version of file contents~~ Consider specifying a n optional language for snippets to assist in syntax-driven colorization Nov 23, 2018

ghost self-assigned this Nov 23, 2018

ghost added triage-approved to-be-written design-approved The TC approved the design and I can write the change draft and removed discussion-ongoing labels Nov 23, 2018

ghost changed the title ~~Consider specifying a n optional language for snippets to assist in syntax-driven colorization~~ Specify optional property file.sourceLanguage to guide in syntax-driven colorization of snippets Nov 24, 2018

ghost pushed a commit that referenced this issue Dec 11, 2018

Change draft for #286: file.sourceLanguage.

254bae7

ghost added change-draft-available and removed to-be-written labels Dec 11, 2018

ghost changed the title ~~Specify optional property file.sourceLanguage to guide in syntax-driven colorization of snippets~~ Specify optional property file.sourceLanguage to guide syntax-driven colorization of snippets Dec 11, 2018

ghost pushed a commit that referenced this issue Dec 11, 2018

#286: Add cross-reference to Appendix.

6f92aa2

ghost pushed a commit that referenced this issue Jan 3, 2019

#286: Revise per TC feedback.

b63ae4e

ghost pushed a commit that referenced this issue Jan 10, 2019

Merge #286 (sourceLanguage) and #304 (unique logicalLocations).

f675742

ghost added resolved-fixed and removed change-draft-available labels Jan 10, 2019

ghost closed this as completed Jan 10, 2019

ghost mentioned this issue Jan 10, 2019

Provide an expanded list of languages in an appendix or other document #307

Closed

This issue was closed.
