Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify optional property file.sourceLanguage to guide syntax-driven colorization of snippets #286

Closed
michaelcfanning opened this issue Nov 17, 2018 · 13 comments
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. design-approved The TC approved the design and I can write the change draft enhancement impact-non-breaking-change resolved-fixed triage-approved

Comments

@michaelcfanning
Copy link
Contributor

Our snippets are file contents. Some tools produce formatted snippets. For example, they create a gutter in the left margin that shows line numbers. They provide syntax coloring for the snippet.

SARIF has no way for a tools producer to create these formatted snippets. Instead, the consumer is required to examine the textual snippet and do something sensible with it. We have done this internally in our pipeline with some success.

Currently, we are looking at converting a tool's output that produces a formatted snippet. It occurred to me that we could define another property on fileContent that would be designed to store a formatted version of its contents. I suppose this would be called fileContent.richText in order to follow other places in the format where we allow for markdown-rendered text.

@michaelcfanning
Copy link
Contributor Author

michaelcfanning commented Nov 20, 2018

Per offline discussion, there's a question around whether GFM allows for colorization of snippets. Note that the markdown embedded in this note uses HTML to provide colorization. This displays in VS code but not here.

Item Status Notes
Baseline existing source Done

Here's a CSS snippet using a markdown language hint. Note that it is colorized.

#button {     border: none; } 

@ghost
Copy link

ghost commented Nov 21, 2018

Two thoughts:

  1. With a plain-text snippet, the location of the snippet together with the location of the result allows a SARIF consumer (such as a Web UI) easily to display the snippet and indicate the position of the result, for example:
    int x = 0;
    int z = y / x;
                ^ divide by zero

But if you allow a "formatted snippet" containing Markdown, it will be difficult for the consumer to locate the actual character position within the marked up source code where the problem occurred:

123    <span style="color:blue">int</span> x = 0;<br/>
124    <span style="color:blue">int</span> z = y / x;
                ^ divide by zero
  1. Relying on the GFM "triple-backtick + language" syntax is a nice idea. But then we don't need a formatted snippet at all -- just an indication of the source code's language. A new property file.sourceLanguage might serve (together with a run-level property defaultSourceLanguage).

But!

  1. That doesn't give you the line numbers.
  2. We'll have a huge discussion about what languages to include in this sourceLanguage property's enumeration, or whether to leave it open (and then leave it to consumers to decide if "C" is the same as "c", or "C++" the same as "cplusplus").

I suggest this feature might not be worth the trouble.

@michaelcfanning
Copy link
Contributor Author

michaelcfanning commented Nov 23, 2018

I independently reached some of your points but not your final conclusion. :) First, great point, plaintext snippets are inherently valuable. I'd independently reached the same conclusion as you: what we need here is simply a language designation.

I do think this feature is worth the trouble, mostly due to observing our web development team struggle with the problem of how to handle code snippets in our browser-based results explorer. For this reason, I'm willing to invest time in discussion. :)

We should favor the approach that @fishoak suggested in TC, an open-ended enum populated with some preliminary values. When defining new language values, we can suggest that producers stick with certain conventions:

  1. use an entirely lower-case name
  2. use a hierarchical string if it's helpful to specify a particular language variant or implementation (e.g., 'sql/tsql
  3. prefer spelling out (or explicitly advise against it) things like plusplus/++ or sharp/# (we should make a calll on this).
  4. allow for some level of duplication to emerge around use of standard abbreviations (or explicitly advice against). a consumer interested in vb, for example, might reasonably look for either 'vb' or 'visualbasic'

With the guidance above, if a tool producer happens to decide to author a static analysis capability for the 'Racket' or 'R++' languages, the conventions we provide allow for predictable enum values of 'racket' and 'rplusplus' respectively.

Here's a list of languages/formats that are extremely current in 'most utilized' data. Note that not all of these have significant static analysis tooling eco-systems. I've explicitly avoided adding languages that are not currently highly utilized for which a language value can easily be predicted ('fortran', 'basic', 'lisp', 'vbscript', etc.)

javascript
java
python
ruby
c++ or cplusplus
csharp or c#
c
css
go
html
objectivec
scala
swift
typescript
php
sql, [sql/tsql, sql/psql]
powershell
shell/unix [shell/bash, shell/csh, shell/tcsh, shell/ksh, shell/sh]
plaintext
markdown [markdown/gfm]

@michaelcfanning michaelcfanning changed the title Consider providing a formatted version of file contents Consider specifying a n optional language for snippets to assist in syntax-driven colorization Nov 23, 2018
@ghost
Copy link

ghost commented Nov 23, 2018

I suggest "spelling out" the language names. That will allow SDKs that wish to do so to use them as identifiers (for example, enum values).

I'm surprised not to see "html" on the language list. There's lots of static analysis around it. We should add it. (html, html/5?)

With those nits out of the way, here what I think we agree on:

  1. We do not add region.formattedSnippet.
  2. We add file.sourceLanguage and run.defaultSourceLanguage with values as you suggest, with my modifications.
  3. We do not make any explicit provision for line numbers. The consumer knows the line numbers from the region properties and can display them if it wants to.

That's enough for me to write the change draft.

@ghost ghost self-assigned this Nov 23, 2018
@ghost ghost added triage-approved to-be-written design-approved The TC approved the design and I can write the change draft and removed discussion-ongoing labels Nov 23, 2018
@michaelcfanning
Copy link
Contributor Author

michaelcfanning commented Nov 23, 2018

  1. Yes to all three of your points. I meant to note the same thing as your Introduce result.taxonomies #3, all line data can be recovered from the region properties that are strongly associated with the snippet.
  2. Yes, 'html' was an oversight and 'xml' probably should be on the list as well.
  3. I don't think we need to provide for language version. The reason is that display for these snippets will mostly entail syntax coloring (not actual interpretation, compilation, etc.). i think most languages evolve in a way where the superset of keywords for most current language will allow a decent job. might not be perfect but sufficient for this scenario,

@michaelcfanning
Copy link
Contributor Author

btw - can't resist, isn't it true that the plain text snippet example you provided could be constructed entirely from other SARIF constructs, suggesting that you shouldn't be doing this? for the same reasons that you shouldn't be injecting line numbers? 'divide by zero' here would presumably derive from the rule id. the spces and ^ location would derive from region data.

also, note that a viewer might usefully provide syntax coloring for this snippet, while still finding value in using the ^ notation to call out the specific location. so, with our new proposal what you'd do is specify the language, provide only the snippet of source, and then reconstruct the ^ + other formatting at runtime.

basically, all snippets are just that, snippets of a file. don't inject anything else into them. what this means is that if a tool itself has some formatted notion of a result, SARIF doesn't provide a place to flow this formatted content along.

int x = 0;
int z = y / x;
            ^ divide by zero

@ghost
Copy link

ghost commented Nov 23, 2018

Sorry, I didn't explain my example clearly enough.

In my example, this is the snippet:

int x = 0;
int z = y / x;

Because it's plain text, and because the consumer has the region location, snippet location, and result message properties available, the consumer can easily render the snippet like this:

int x = 0;
int z = y / x;
            ^ divide by zero

@ghost ghost changed the title Consider specifying a n optional language for snippets to assist in syntax-driven colorization Specify optional property file.sourceLanguage to guide in syntax-driven colorization of snippets Nov 24, 2018
@michaelcfanning
Copy link
Contributor Author

Yes, completely agree, sorry for my confusion.

@michaelcfanning
Copy link
Contributor Author

  1. Open-ended enum
  2. Describe principles for populating member with examples of non-trademarked language names
  3. Add a non-normative appendix that provides a more exhaustive list

@katrinaoneil
Copy link

we also support the following languages:

ABAP
ActionScript
Apex
COBOL
ColdFusion

Also, would you consider something like JSP a separate "language" for the purposes of snippet colorization?

@ghost
Copy link

ghost commented Nov 28, 2018

@katrinaoneil Yes. Similarly, the ASP.NET MVC "Razor" syntax.

ghost pushed a commit that referenced this issue Dec 11, 2018
@ghost ghost changed the title Specify optional property file.sourceLanguage to guide in syntax-driven colorization of snippets Specify optional property file.sourceLanguage to guide syntax-driven colorization of snippets Dec 11, 2018
ghost pushed a commit that referenced this issue Dec 11, 2018
@kupsch
Copy link

kupsch commented Dec 12, 2018

Attached is a spreadsheet of some research I did on programming languages and identifiers in use for them.

I took 12 sources of programming languages lists, including 3 that rank the top 50 or 100, editors, javascript syntax highlighters, lines of code counters, and giant lists of languages. The spreadsheet is sorted by the 3 ranking sites and then the number of references to the language in other sites. It also includes the id's that the references use for the language (if present) and the display name. I think that we would use commonly used id in SARIF to avoid an impedence mismatch with existing usage. the id, langName, and v_globs columns might be worth including in the appendix.

The list is quite long, but can be reasonably cut off when there are only two non-ranking site listings.

The column descriptions are below:

languages.xlsx

@kupsch
Copy link

kupsch commented Dec 13, 2018

Here is new revision of the .xlsx and the .csv file (named .csv.txt to make GitHub happy). The .xlsx file is now correctly displays the couple of non-ascii characters (Excel's default is not uft-8). It also contains minor corrections. The column descriptions are below:

languages.xlsx

languages.csv.txt

ghost pushed a commit that referenced this issue Jan 3, 2019
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. design-approved The TC approved the design and I can write the change draft enhancement impact-non-breaking-change resolved-fixed triage-approved
Projects
None yet
Development

No branches or pull requests

3 participants