add support for analysis of source code/scripted languages #1080

adamstorek · 2022-07-01T12:39:00Z

This enhancement extends capa's functionality to the analysis of potentially malicious scripts and source code. A tree-sitter backend was added to parse the source files into a lightweight AST. Features akin to the PE-Vivisect capa are then extracted:

File-level:

trivial: language, file format
global string literals
global integer literals
namespaces
globally-instantiated imported classes
globally-called imported functions

Function-level:

string literals
integer literals
imported classes
imported functions

To install Tree-sitter:

Pip-install Tree-sitter:
    pip3 install tree-sitter
Install bindings:
    mkdir vendor build
    cd vendor
    git clone git@github.com:tree-sitter/tree-sitter-c-sharp.git
    git clone git@github.com:tree-sitter/tree-sitter-embedded-template.git
    git clone git@github.com:tree-sitter/tree-sitter-html.git
    git clone git@github.com:tree-sitter/tree-sitter-javascript.git

Checklist

No CHANGELOG update needed

No new tests needed

No documentation update needed

github-actions

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

williballenthin

This is a great start towards adding scripting language support to capa! thanks @adamstorek!

The code is already quite good and I don't anticipate any major issues with getting it merged; however, I have added a number of comments on regions I think should be tweaked.

One thing you should know about me is that I prefer to over-communicate review feedback with the understanding that everything is up for discussion. So, if anything feels weird or wrong, don't hesitate to ask for more details or deeper discussion.

General points:

I really like the file range address type, I think that will work well.
I think we can simplify the code a bit by merging the "Script" feature into "Format". thoughts?
Some of the embedded data and configuration can be restructured into python globals.
I'd like to hear a bit about what it takes to embed/depend on the TS languages so we can ensure it's easy for people to download/use.
Please add tests showing the features extracted on various example files

williballenthin · 2022-07-01T15:02:46Z

capa/features/common.py

+class ScriptLanguage(Feature):
+    def __init__(self, value: str, description=None):
+        super().__init__(value, description=description)
+        self.name = "script language"


could we use format for this? e.g. format: C#.

pro:

fewer features to memorize

less duplication

less code

con:

maybe slightly less precise

The problem with overloading the file format feature is that file to language is a one-to-many mapping, e.g. there can be embedded templates that contain multiple different script languages such as C# for server-side scripts and JavaScript for client-side.
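To make the one-to-many point concrete, here is a minimal sketch of how a single file format could yield several script language features. The mapping, the language identifiers, and the function name are all hypothetical and only illustrate the shape of the problem; they are not taken from the PR.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical mapping from a file format to the script languages it may embed.
# An ASPX template can carry server-side C# and client-side JavaScript.
FORMAT_LANGUAGES: Dict[str, Tuple[str, ...]] = {
    "aspx": ("c_sharp", "javascript"),
    "cs": ("c_sharp",),
}


def extract_format_and_languages(file_format: str) -> Iterator[Tuple[str, str]]:
    """Yield one format feature, then one script language feature per embedded language."""
    yield ("format", file_format)
    for language in FORMAT_LANGUAGES.get(file_format, ()):
        yield ("script language", language)
```

With a single `format` feature, the `aspx` case above could not express both languages at once, which is the argument for keeping a separate feature.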

williballenthin · 2022-07-01T15:04:21Z

capa/features/extractors/script.py

+def get_language_from_ext(path: str):
+    _, ext = os.path.splitext(path)
+    if ext == ".cs":
+        return LANG_CS


we should also think about maybe trying to guess the language based on the file contents if there's no extension.

also supporting things like .cs_ which some may use to prevent the file from accidentally getting executed.

Auto-identifying the script language is definitely a worthwhile feature which I have deprioritized for the minimal implementation and instead manually incorporated the extensions (now including the defanged extensions as suggested above). This is also because it might not always be very straightforward to do so (e.g. one file might include multiple scripts; sometimes context might be necessary).
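For illustration, extension-based lookup that also accepts defanged extensions might look like the sketch below. The constant values and the extension table are assumptions, not the PR's actual definitions; only the function name and the use of `os.path.splitext` come from the diff above.

```python
import os
from typing import Optional

# Hypothetical language identifiers and extension table.
LANG_CS = "c_sharp"
LANG_JS = "javascript"

EXT_TO_LANG = {
    ".cs": LANG_CS,
    ".js": LANG_JS,
}


def get_language_from_ext(path: str) -> Optional[str]:
    """Map a file extension to a language, or return None if it is not recognized."""
    _, ext = os.path.splitext(path)
    # Treat a defanged extension such as ".cs_" like ".cs", and ignore case.
    ext = ext.lower().rstrip("_")
    return EXT_TO_LANG.get(ext)
```

Returning `None` for an unknown or missing extension leaves room for a later content-based guess, as suggested in the review comment.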

williballenthin · 2022-07-06T21:05:39Z

i think it would be worthwhile to get the tests running (and passing) in CI. this means:

add the example files to capa-testfiles and get those merged, and
update the github actions workflows to install the TS bindings (temporarily, until we have a better solution)

adamstorek · 2022-07-07T14:49:36Z

add the example files to capa-testfiles and get those merged, and

Just submitted the pull request.

update the github actions workflows to install the TS bindings (temporarily, until we have a better solution)

On it.

mike-hunhoff · 2022-07-19T19:15:37Z

capa/features/extractors/ts/engine.py

+            if self.is_aspx_import_directive(node):
+                namespace = self.get_aspx_namespace(node)


Can we document why we are only/specially handling ASPX here?

This is related to the issue discussed here: #1080 (comment).

…refactored language toolkit code; added extraction of global constants.

…ddressed most of the GH pull request comments/suggestions.

mike-hunhoff

nice updates! I've left a few comments, questions, and suggestions for your review 🚀

mike-hunhoff · 2022-08-04T17:40:02Z

capa/features/extractors/ts/autodetect.py

+            tree = _parse(ts_language, buf)
+        except ValueError:
+            continue
+        if not _contains_errors(ts_language, tree.root_node):


Can we add a comment on what our assumptions are here? I'm not overly familiar with tree-sitter but it appears that we assume it will only throw errors when encountering a language mismatch, e.g. when we attempt to parse Python using tree-sitter C# tooling?

This might be more readable, what do you think?

    def _parse(ts_language: Language, buf: bytes) -> Optional[Tree]:
        try:
            parser = Parser()
            parser.set_language(ts_language)
            return parser.parse(buf)
        except ValueError:
            return None

    def _contains_errors(ts_language, node: Node) -> bool:
        return ts_language.query("(ERROR) @error").captures(node)

    def get_language_ts(buf: bytes) -> str:
        for language, ts_language in TS_LANGUAGES.items():
            tree = _parse(ts_language, buf)
            if tree and not _contains_errors(ts_language, tree.root_node):
                return language
        raise ValueError("failed to parse the language")

mike-hunhoff · 2022-08-04T17:45:23Z

capa/features/extractors/ts/build.py

@@ -0,0 +1,15 @@
+from tree_sitter import Language
+
+build_dir = "build/my-languages.so"


does this mean we only support Linux?

Tree-sitter needs to compile its (C) language bindings. Although I have a limited knowledge of package management, I've suggested to Moritz that we should precompile and package the supported tree-sitter bindings for each platform we support. The current state is a temporary measure.

mike-hunhoff · 2022-08-04T17:48:16Z

capa/features/extractors/ts/engine.py

+    def parse(self) -> Tree:
+        parser = Parser()
+        parser.set_language(self.query.language)
+        return parser.parse(self.buf)


can this call generate any exceptions?

It can raise a TypeError (which I believe we prevent with mypy) and a ValueError if parsing fails completely. In that case the ValueError propagates from the parse method up through the engine, so I can handle it at the Extractor level in the following way:

    try:
        self.language = capa.features.extractors.ts.autodetect.get_language(path)
        self.template_engine = self.get_template_engine(buf)
        self.engines = self.get_engines(buf)
    except ValueError as e:
        raise UnsupportedFormatError(e)

mike-hunhoff · 2022-08-04T17:51:33Z

capa/features/extractors/ts/engine.py

+    def get_range(self, node: Node) -> str:
+        return self.get_byte_range(node).decode("utf-8")


The intended use of this function appears to be decoding a string found in a specific byte range? If so, consider changing the name to something more descriptive like get_str_from_range. Also, do we expect encoding exceptions to be thrown by the decode?

Good question, but I doubt tree-sitter would be able to parse something that we can't decode.

Also changed get_range to get_str.
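If decode errors ever do turn out to be possible, one defensive option is sketched below. It works on plain byte offsets instead of a tree-sitter `Node` so that it stays self-contained; the signatures are therefore hypothetical, and the use of `errors="replace"` is a suggestion, not something the PR does.

```python
def get_byte_range(buf: bytes, start_byte: int, end_byte: int) -> bytes:
    """Return the raw bytes covered by a node's byte range."""
    return buf[start_byte:end_byte]


def get_str(buf: bytes, start_byte: int, end_byte: int) -> str:
    """Decode a byte range as UTF-8 without raising on malformed input."""
    raw = get_byte_range(buf, start_byte, end_byte)
    # errors="replace" substitutes U+FFFD for undecodable bytes,
    # so a malformed sample cannot raise UnicodeDecodeError here.
    return raw.decode("utf-8", errors="replace")
```

The trade-off is that a replaced character silently changes the extracted string, so a strict decode with an explicit `except UnicodeDecodeError` may be preferable if such inputs should be reported.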

mike-hunhoff · 2022-08-04T23:56:18Z

capa/features/extractors/ts/function.py

+def _extract_imported_constants(fn_node: Node, engine: TreeSitterExtractorEngine) -> Iterator[Tuple[Feature, Address]]:
+    for ic_node, ic_name in engine.get_processed_imported_constants(fn_node):
+        for name in get_imports(ic_name, engine.namespaces, engine):
+            yield API(engine.language_toolkit.format_imported_constant(name)), engine.get_address(ic_node)


need more discussion on #1125

mike-hunhoff · 2022-08-04T23:59:55Z

capa/features/extractors/ts/tools.py

+        signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file))
+        return {category: set(namespaces) for category, namespaces in signatures.items()}
+
+    def _is_import(self, name: str) -> bool:


typo w/ _?

No, this is merely a private method to handle import table lookups that the public is_import method uses.
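A rough sketch of the private/public split described here is shown below. The JSON content, the class name, and what the public method does before delegating are all assumptions made for illustration; only the dict-of-sets loading pattern and the `_is_import`/`is_import` names come from the diff and the reply.

```python
import json
from typing import Dict, Set

# Hypothetical signature data standing in for signatures/cs.json.
SIGNATURES_JSON = """
{
    "classes": ["System.Net.WebClient"],
    "functions": ["System.IO.File.ReadAllText"]
}
"""


class LanguageToolkit:
    def __init__(self, signatures_json: str):
        self.import_signatures = self.load_import_signatures(signatures_json)

    @staticmethod
    def load_import_signatures(text: str) -> Dict[str, Set[str]]:
        """Parse the JSON signatures into a category -> set-of-names table."""
        signatures = json.loads(text)
        return {category: set(names) for category, names in signatures.items()}

    def _is_import(self, name: str) -> bool:
        # Private helper: raw lookup across every category of the import table.
        return any(name in names for names in self.import_signatures.values())

    def is_import(self, name: str) -> bool:
        # Public entry point: normalize the name (an assumption), then delegate.
        return self._is_import(name.strip())
```

Storing each category as a set keeps every lookup constant-time per category.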

mike-hunhoff · 2022-08-05T00:13:58Z

capa/features/extractors/ts/tools.py

+                return int(integer, base)
+        return int(integer)


do we expect exceptions to occur here?

This can raise ValueError (and does when TS labels something as an int which is not an int), but the ValueError is handled by the caller (see extract_integers).
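The pattern described here, where the parser raises and the caller skips, can be sketched as follows. The prefix table and both signatures are hypothetical; the real `extract_integers` works on tree-sitter nodes, while this version takes the literal text directly.

```python
from typing import Iterable, Iterator

# Hypothetical prefix table; each language toolkit may define its own.
INTEGER_PREFIXES = (("0x", 16), ("0b", 2), ("0o", 8))


def parse_integer(integer: str) -> int:
    """Convert a literal's text to an int; raises ValueError if it is not an integer."""
    lowered = integer.lower()
    for prefix, base in INTEGER_PREFIXES:
        if lowered.startswith(prefix):
            return int(lowered, base)
    return int(lowered)


def extract_integers(literals: Iterable[str]) -> Iterator[int]:
    """Yield parsed integers, skipping text that was mislabeled as an integer."""
    for literal in literals:
        try:
            yield parse_integer(literal)
        except ValueError:
            # The parser labeled something as an int that int() cannot convert.
            continue
```

Keeping the `try`/`except` in the caller means `parse_integer` stays a simple conversion and a single bad literal does not abort extraction for the rest of the file.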

…es in order to make rules clearer; refactored the codebase to address the latest PR comments/suggestions.

…ace, and modified tests.

…(passes all test cases but by no means perfect); further clean up, especially of the signatures; synced with new Python test cases.

github-actions bot requested changes Jul 1, 2022

williballenthin requested changes Jul 1, 2022

adamstorek force-pushed the capa-scripts branch 3 times, most recently from 2fdcac2 to 0a61d86 on July 5, 2022 16:28

williballenthin mentioned this pull request Jul 6, 2022

how to bundle TreeSitter bindings #1092

Open

adamstorek and others added 22 commits July 19, 2022 10:36

Added initial capa control flow for scripts in C#.

bbd3f70

Implemented some further basic TreeSitter Extractor-related concepts such as byte-range address.

8173397

Modified mypy config file to ignore tree-sitter's missing exports.

428f6bc

Implemented core tree sitter engine component with C# queries that serves as an interface to the language-specific tree-sitter queries.

a6d7ba2

Implemented script global extraction handlers (mostly wrapping existing language-independent extractors).

80bf78b

Reworked format parsing to align better with the rest of capa logic.

cf3dc7e

Implemented a large part of the C# functionality; refactored the TreeSitterEngine class.

9d7f575

Added function-level feature extraction.

3d4b4ec

Bug fixes and code refactoring of the Tree Sitter extractor.

eca7ead

Added tree_sitter to requirements in setup.py.

5fd953f

Added tests for TreeSitterExtractorEngine initialization, new object and function definition parsing for a pure C# sample.

1f79db9

Added more TreeSitterExtractorEngine tests for pure C#.

a58bc0b

Added last remaining tests for the TreeSitterExtractorEngine class and fixed bugs found in the process.

5ddb8ba

Reverted yielding only non-empty strings in order to stay consistent and not introduce unspecified rule-exceptions.

31e2fb9

Removing functions that should not be used in tree-sitter extractor (default to the base extractor level).

5bf3f18

Modifying extraction of global statements to omit local function declarations.

a4529fc

Added script language feature to freeze.

d5de9a1

Added test cases for TS Extractor.

6c10458

Refactored query bindings.

9bd9824

Added support for template parsing.

2594849

Added support for HTML parsing.

619ed94

Implemented the necessary modifications to support embedded templates/html: aspx.

5e23802