Rust library crate to interact with Semgrep. It allows you to:
- Parse and combine Semgrep rules.
- Create and populate and a new construct (policy) that imitates rulesets.
- Parse Semgrep's JSON output.
- Run Semgrep CLI by passing the rules as a string and get the results in a struct.
I have used it in my Personal Semgrep Server project. See also the examples.
- Add detailed structs and the ability to programmatically create rules.
Handle running and executing Semgrep CLI.Structs to parse the Semgrep output and do post-processing on the results.- Simple plugin system.
Note: Work in progress. The interface is subject to change and might break backwards compatibility.
I have not published the crate yet. You can use the library in two ways right now:
- Add it as a submodule to your repository.
git submodule add -b dev https://github.com/parsiya/semgrep-rs src/semgrep-rs
- Add the following to your project's
Cargo.toml
.[dependencies] semgrep-rs = { path = "src/semgrep-rs" }
Add the following to your project's Cargo.toml
.
[dependencies]
semgrep-rs = { git = "https://github.com/parsiya/semgrep-rs", branch = "dev" }
See the examples or https://github.com/parsiya/personal-semgrep-server.
// Deserialize the rule file into a GenericRuleFile struct.
let rule_file: GenericRuleFile =
GenericRuleFile::from_file("tests/rules/cpp/memcpy-insecure-use.yaml").unwrap();
// A rule file can have multiple rules, the process is the same.
let rule_file2: GenericRuleFile =
GenericRuleFile::from_file("tests/rules/multiple-rules.yaml").unwrap();
// Iterate through all the rules in the file and get their ID.
for rule in &rule_file2.rules {
print!("{}", rule.get_id().unwrap());
}
// You can serialize them back to YAML.
let yaml_string: String = rule_file.to_string().unwrap();
// Usually rule files only have one rule, if you have a GenericRuleFile
// with multiple rules, you can create rule files with only one rule per file
// with .split().
let rules: Vec<GenericRuleFile> = rule_file2.split();
for rule in &rules {
let mut path = "tests/".to_string();
// Each rule file has only one rule so we will just use that one's ID.
path.push_str(rule.rules[0].get_id().unwrap());
// Store the result in a file.
let err = utils::write_string_to_file(&path, &rule.to_string().unwrap());
}
I have more detailed structs in the dev-old branch. These
structs allow you to extract all the rule fields (e.g., pattern
or
metavariable
). The plan is to add them after I have the generic structs
production ready. It will allow you to parse every field in a rule file and also
create rules dynamically.
A rule index contains all the rules and uses their ID as an index. You can create a rule index that contains all the rules in a specific path and its children. If there are files with multiple rules, the library will split them and add each one individually.
Note: The index doesn't care about rule ID collisions and will happily overwrite them. See below for a solution.
let simple_gri: GenericRuleIndex =
GenericRuleIndex::from_path_simple("test/rules").unwrap();
// Get all rules IDs in the index.
let ids: Vec<String> = simple_gri.get_ids();
The _simple
methods only index files with .yaml
and .yml
extensions and
ignores rule test files ending in .test.yaml
, .test.yml
and
.test.fixed.yaml
.
You can use your own Option<Vec<&str>>
for include and exclude.
// Don't include the dot for the include vector.
let include = vec!["ext1", "ext2"];
// For exclude, you need to specify how the file ends.
// Including the dot here helps prevent skipping files like "overflow-test.yml".
let exclude = vec![".test.ext1", ".test.ext2"];
let custom_gri = GenericRuleIndex::from_path(
"test/rules",
Option<include>,
Option<exclude>,
false,
).unwrap();
You can also pass multiple paths by using the from_paths
and
from_paths_simple
methods.
If a file is not accessible or it cannot be deserialized into a struct, the
crate logs it with error!
and continues.
By default, both Semgrep and the portal use complete rule IDs to avoid collisions. A complete rule ID contains the complete path of the rule in addition to the ID in the file.
For example, the normal ID of the rule in
tests/rules/cpp/arrays-out-of-bounds-access.yaml
is
arrays-out-of-bounds-access
. The complete rule ID will be
tests.rules.cpp.arrays-out-of-bounds-access.arrays-out-of-bounds-access
.
Using complete rule IDs is a hassle when we create policies by hand (see policies below). So you can specify if you want simple or complete rule IDs.
In both cases you have to make sure you don't have rule ID collisions.
let custom_gri = GenericRuleIndex::from_path(
"test/rules",
include,
exclude,
true, // Create complete rule IDs.
).unwrap();
A rule index by itself is not that useful. A policy is a collection of one or multiple rules. This is not a Semgrep construct and you cannot pass it to the CLI. The crate is trying to imitate rulesets in the Semgrep portal. You can define a policy by creating a YAML file.
name: policy1 # this should be unique
rules:
- arrays-out-of-bounds-access
- potentially-uninitialized-pointer
- snprintf-insecure-use
Rules in a policy are defined by rule ID. If you have created your rule index
with complete IDs then you should use the same version here. E.g.,
path.to.file.id
.
Similar to rules you can create a Policy
object from a YAML string and
serialize it back to YAML, again.
// Read a policy from a file.
let policy1: Policy = Policy::from_file("tests/policies/policy1.yaml").unwrap();
// We can also create it from a string.
let content: String = utils::read_file_to_string("tests/policies/policy2.yaml").unwrap();
let policy2: Policy = Policy::from_string(content).unwrap();
// Serialize it to a YAML string (note this doesn't create a rule file).
let policy1_string: String = policy1.to_yaml().unwrap();
// Write it to a file (note this doesn't create a rule file).
let res = policy2.to_file("tests/policies/policy2-copy.yaml");
As we saw above, we can create the same YAML file with rule names. This cannot
be passed to the Semgrep CLI because we want to pass policies as a collection of
rules. The GenericRuleIndex
that we created before comes into play. We can
populate
a policy by using the index.
// Note: this method ignores rules that are not found (writes to stderr) and
// panics on YAML de/serialization errors.
policy2.populate(&simple_gri);
This will create a YAML string that contains a rule file with all rules in the
policy that are found in the rule index. This is stored in the content
field.
You can write this to a file to pass to Semgrep.
// You can write this to a file to pass to Semgrep.
let rule_file: String = policy2.get_content();
This is another custom construct. It's an index of all policies in a path where policy name is the key. Users should avoid duplicate names here as the library does not check for collisions. If there's a need to implement complete IDs (similar to rules) here, it can be done quickly.
// Create a policy index from all rules in a path. This will panic on YAML
// de/serialization errors and if there are no valid policies in the path.
let simple_pi: PolicyIndex = PolicyIndex::from_paths_simple("test/rules").unwrap();
// Get a policy by ID.
let pol: Policy = simple_pi.get_policy("policy1").unwrap();
// Get the rule file for it.
let content: String = pol.get_content();
// You can also iuterate through every policy in the index.
for (id, pol) in simple_pi.get_index() {
assert_eq!(id, pol.get_name());
}
The simple version only deserializes files with .yaml/.yml
extensions. You can
set custom include and exclude extensions like the rule index example above.
// No need to include the dot.
let include = vec!["ext1", "ext2"];
// For exclude, you need to specify how the file ends.
let exclude = vec![".test.ext1", ".test.ext2"];
let custom_pi: PolicyIndex::from_paths(
"test/rules",
Option<include>,
Option<exclude>,
).unwrap();
The policy index is useful if you want to create your own server and serve policies to Semgrep.
You can also pass multiple paths by using the from_paths
and
from_paths_simple
methods.
The crate automatically creates an special policy named all
. This policy
contains every rule in the rule index. If you have a policy named all
, it will
be overwritten by this policy. The all
policy is useful when you want to throw
every rule at the code.
let all_policy: Policy = custom_pi.get_policy("all").unwrap();
// Get the rule file that has every rule!
let all_rules: String = all_policy.get_content();
The crate supports parsing Semgrep's output in JSON (not the SARIF one). Use the
--json
flag: semgrep --config p/default --json --output my-results.json
.
use semgrep_rs::CliOutput;
// Parse the file.
let res = CliOutput::from_json_file("result.json").unwrap();
// Print all scanned paths.
for p in res.paths.scanned {
print!("{}", p);
}
// Go through all results.
for r in res.results {
print!("{}", r.check_id); // Print the rule ID and path for each result.
print!("{}", r.path);
}
The structs are based on https://github.com/returntocorp/semgrep-interfaces/blob/main/semgrep_output_v1.jsonschema. I have created an annotated version of it in source-schemas/semgrep_output_v1.jsonschema.
A few structs are parsed as serde_json::Value
because I did not know how to
implement the enums with multiple different field types.
For example core_match_call_trace
or CoreMatchCallTrace
is an enum that can
either be:
- An array with at least two items. The first item must be the string
CoreLoc
and the second is an object of typeLocation
. - An array with at least 2 items:
- The first item is the string
CoreCall
. - The second item is another array with at least 3 items:
- The first item is
Location
. - Second item is a
Vec<CoreMatchIntermediateVar>
(another array). - Third item is a
CoreMatchCallTrace
(which is the parent object).
- The first item is
- The first item is the string
I can make Rust enums but I have no idea how to make these other objects, yet.
Another one is core_error_kind
or CoreErrorKind
. It can be one of:
- 14 hardcoded strings.
- An array with at least two items:
- The first item is the string
Pattern parse error
. - The second item is a
Vec<String>
.
- The first item is the string
- An array with at least two items:
- The first item is the string
PartialParsing
. - The second item is a
Vec<Location>
.
- The first item is the string
The problem here is with choices 2 and 3. We're looking at something like
{"Pattern parse error", {"foo", "bar", "baz"}}
. I don't know how to create a
type like this in the Rust enum. There are no field names so I cannot make an
object.
You can also use this crate to run the Semgrep CLI. You can pass the rules as a
string and get the results in a struct. Create an instance of Args
:
pub struct Args {
/// Semgrep rules as a string.
pub rules: String,
/// value of the Semgrep `metrics` CLI argument, default is `off`
/// (`--metrics=off`). Note metrics will be collected regardless of this
/// field on certain invocations like `-c=p/default`. See the docs at:
/// https://semgrep.dev/docs/metrics/.
metrics: Metrics,
/// other flags, passed to the tool as-is before scan_paths and separated by
/// space.
pub extra: Option<Vec<String>>,
/// paths scanned with Semgrep.
pub paths: Vec<String>,
/// the output format
pub output_format: OutputFormat,
}
Like this:
let rule_as_string = "a string with one or more rules";
let paths = vec!["scan/code1/".to_string(), "scan/code2/".to_string()];
let metrics = false;
let format = semgrep_rs::OutputFormat::from_str("json").unwrap();
// Or just make a format using one of the enums.
let format2 = semgrep_rs::OutputFormat::JSON;
// pass any other cli switches.
let extra = vec!["--time".to_string(), "--timeout=10".to_string()];
let args = semgrep_rs::Args::new(
rules_as_string, paths, metrics, format, extra
);
// See the final command as a string and inspect it.
print!("{}", args.to_string());
// Run Semgrep and get the result in a struct.
let results: semgrep_rs::CliOutput = args.execute().unwrap();
// You can also convert it back to JSON to write to disk.
let bytes: Vec<u8> = results.to_json_bytes().unwrap();
Rust likes dual-licensing like this so here we go.
Licensed under either of Apache License, Version 2.0 or MIT license.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.