Sarif SDK BSOA

Scott Louvau edited this page Nov 10, 2020 · 3 revisions

We've created a version of the Sarif SDK which uses BSOA underneath the object model. This version can be used instead of the normal SDK, but should be tested to ensure correct behavior in your specific scenario.

Advantages

The BSOA object model uses significantly less memory (typically 60% to 90% less) and can be loaded from the SARIF JSON format about twice as fast as the old object model.

SarifLogs can be written in a new BSOA binary form, which is about 33% smaller than unindented JSON SARIF and can be loaded and saved about 100x faster (800 MB/s - 1,900 MB/s). SarifLog.Save() and SarifLog.Load() use this format if the file extension is ".bsoa". The binary format may change, so store JSON copies of important data.

Disadvantages

The BSOA object model does not garbage collect objects individually during use. If you create many Sarif object model objects during program execution, they will not be reclaimed until every object in the log containing them goes out of scope. You can control which log an object is created in by using the constructors which take a SarifLog.

The BSOA object model is slower during use in many cases, particularly when working heavily with strings. String properties must be converted from UTF-8 bytes to .NET strings each time they are used, and the .NET string form isn't cached. You should store strings in local variables when using them repeatedly.

Recommended Scenarios

The BSOA version of the SDK is excellent for read-only usage of logs (like querying them), simple changes (like Multitool rewrite), and working with large logs more quickly. The binary format is very useful for logs which need to be loaded multiple times (like for multiple Multitool steps in succession) or when your code will be the only reader and writer of a set of logs.

Use Sarif.Multitool rewrite YourLog.sarif -o YourLog.bsoa to convert a SARIF log to BSOA (and vice-versa).

Usage Caveats

See BSOA Caveats for details.

Use constructors which take a SarifLog when possible.

BSOA object data is stored in 'tables' under a specific root SarifLog. Use the constructor which takes a SarifLog to ensure the objects are created under the correct root so that the data doesn't have to be copied later.

Fully initialize objects and fully populate collections before setting them.

BSOA copies data from external types (List, Dictionary) to an internal representation on set. Changes made to the collection after setting it won't take effect on the copy in your SarifLog model. BSOA also copies objects from a different SarifLog root on set, so instances can't be reused across SarifLogs and changes made to an object after setting it won't affect both SarifLog copies.
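The copy-on-set behavior can be illustrated with a toy model. This is a sketch in Python, not SDK code: ToyCopyingLog and its tags property are hypothetical names invented for the example, and the only thing it assumes about BSOA is what is stated above (the collection is copied when it is set).

```python
class ToyCopyingLog:
    """Hypothetical stand-in for a root log that copies collections on set."""

    def __init__(self):
        self._tags = ()

    @property
    def tags(self):
        # Hand back a fresh list built from the internal copy.
        return list(self._tags)

    @tags.setter
    def tags(self, values):
        # Copy on set: later changes to `values` are not seen by the log.
        self._tags = tuple(values)


log = ToyCopyingLog()

# Setting first and populating afterwards loses the later change.
tags = ["security"]
log.tags = tags
tags.append("performance")
stale = log.tags

# Populating fully and then setting keeps everything.
complete_tags = ["security", "performance"]
log.tags = complete_tags
fresh = log.tags
```

In the first case the log only ever sees "security"; in the second it sees both values, which is why collections should be complete before they are assigned.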

Cache strings you retrieve.

BSOA stores strings as UTF-8 and must convert them to .NET strings when retrieved. Keep returned strings in a variable rather than getting them repeatedly to avoid extra conversions.
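The cost being avoided can be shown with a toy model. This is a Python sketch rather than SDK code: ToyStringColumn and decode_count are hypothetical, and the model assumes only what is described above (strings are held as UTF-8 bytes and converted on every read, with no caching).

```python
class ToyStringColumn:
    """Hypothetical model: strings kept as UTF-8 bytes, decoded on each read."""

    def __init__(self, values):
        self._raw = [v.encode("utf-8") for v in values]
        self.decode_count = 0

    def get(self, index):
        # Every call pays for a fresh bytes-to-string conversion.
        self.decode_count += 1
        return self._raw[index].decode("utf-8")


column = ToyStringColumn(["EXAMPLE001: sample message text"])

# Getting the value repeatedly: one conversion per access.
for _ in range(3):
    column.get(0).startswith("EXAMPLE")
repeated_cost = column.decode_count

# Keeping the value in a local variable: one conversion in total.
column.decode_count = 0
message = column.get(0)
for _ in range(3):
    message.startswith("EXAMPLE")
cached_cost = column.decode_count
```

The loop that re-reads the value converts it three times, while the loop that uses the local variable converts it once.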

Avoid creating objects you won't use.

BSOA objects take memory on the SarifLog they belong to until the whole log is garbage collected. If you are creating many Results but will only add some to the output log, create a temporary SarifLog to hold the Results and periodically clear it with SarifLog.DB.Clear() (or by re-creating the temporary log and letting the old one be collected) to avoid leaking memory.
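The temporary-log pattern can be sketched with a toy model. This is Python, not SDK code: ToyArenaLog, new_result() and clear() are hypothetical stand-ins (clear() plays the role that SarifLog.DB.Clear() plays above), and the filter condition is arbitrary.

```python
class ToyArenaLog:
    """Hypothetical model: rows stay allocated until the whole log is cleared."""

    def __init__(self):
        self.rows = []

    def new_result(self, message):
        # Creating an object always takes space in the owning log.
        self.rows.append(message)
        return len(self.rows) - 1

    def clear(self):
        self.rows.clear()


output = ToyArenaLog()    # results that will actually be written
scratch = ToyArenaLog()   # temporary home for candidate results
peak_scratch_rows = 0

for batch_start in range(0, 1000, 100):
    for i in range(batch_start, batch_start + 100):
        row = scratch.new_result(f"finding {i}")
        peak_scratch_rows = max(peak_scratch_rows, len(scratch.rows))
        if i % 50 == 0:   # arbitrary filter: keep only a few candidates
            output.new_result(scratch.rows[row])
    # Periodically release everything created in the temporary log.
    scratch.clear()
```

A thousand candidates are created, but the temporary log never holds more than one batch at a time and the output log only holds the candidates that were kept.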

Use Dictionaries efficiently.

For portability, the BSOA Dictionary implementation stores values sorted by key rather than in a hash table. This means retrieving and adding values is somewhat slower in a BSOA Dictionary than in a .NET Dictionary. If you are using Dictionary values, retrieve each one once and store it in a local variable rather than retrieving it repeatedly in different expressions in your code.
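A sorted-key dictionary can be sketched as two parallel lists searched with binary search. This is a Python illustration of the general technique, not the BSOA implementation: SortedKeyDict and its lookups counter are hypothetical, and the real internal layout may differ.

```python
import bisect


class SortedKeyDict:
    """Hypothetical sketch: keys kept sorted, found by binary search."""

    def __init__(self):
        self._keys = []
        self._values = []
        self.lookups = 0

    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # overwrite an existing key
        else:
            self._keys.insert(i, key)        # shift later entries to keep order
            self._values.insert(i, value)

    def __getitem__(self, key):
        self.lookups += 1
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)


props = SortedKeyDict()
props["severity"] = "high"
props["category"] = "security"

# Retrieve once, then reuse the local variable in later expressions.
severity = props["severity"]
label = severity.upper() + ":" + severity
```

Each retrieval is a binary search and each insert may shift existing entries, so holding the value in a local variable keeps the work to a single lookup.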

Other Differences

  • Sarif objects now directly support Equals() and GetHashCode(), which compare the values of all properties under an object.
  • Non-empty collections are always serialized, even when every element is a default object; keep the collections empty or null to avoid serializing them.
  • The binary format serializes enums using the numeric values; ensure these are stable to avoid compatibility problems.