investigate optimization of rule matching (May, 2024) #2063
Here's a speedscope trace captured by py-spy when running capa with the BinExport2 backend against mimikatz. Using this backend removes a bunch of noise related to vivisect analysis, which isn't relevant to this thread and isn't something we can easily improve. For example, use the sandwich diagram to identify routines that take up a lot of runtime:
@s-ff take a look at the research and discussion in this issue to get you thinking about our GSoC project. No action beyond reviewing (and posting any thoughts you have) is needed at this time, we'll discuss more in our upcoming meetings 😄
slow mimikatz function
Just 300 features take 1.5s to evaluate! Why is this so slow? Maybe if we investigate this one function we can make fixes that help all functions. Lots of small basic blocks mean there are many instruction scopes and many basic block scopes to evaluate before the function scope is evaluated. 6964 total features.
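The multiplicative cost here can be sketched with a toy model (the function name and all counts below are illustrative assumptions, not capa's real API or measurements): every rule in a scope is evaluated once per instance of that scope, so a function with many small basic blocks pays the per-instruction and per-block cost many times over.

```python
def total_evaluations(n_insns: int, n_blocks: int,
                      insn_rules: int, block_rules: int, func_rules: int) -> int:
    # each instruction-scope rule runs once per instruction,
    # each basic-block-scope rule once per block,
    # and each function-scope rule once for the whole function.
    return n_insns * insn_rules + n_blocks * block_rules + func_rules

# hypothetical counts in the spirit of the slow mimikatz function:
# thousands of instructions spread across many small basic blocks
print(total_evaluations(3000, 600, 400, 300, 200))  # → 1380200
```

Note that the instruction-scope term dominates, which is why shrinking the set of rules evaluated per instruction pays off the most.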
FeatureSet size distributions (from mimikatz using the Ghidra BinExport2 backend). These plots show the distribution of FeatureSet sizes by scope. In summary, instructions usually have fewer than 10 features, basic blocks fewer than 20, and functions fewer than 100. [plots: instruction, basic block, function] We can also use this technique to investigate the number of rules selected for evaluation at each of these scope instances (and then attempt to minimize those numbers). (notes for future willi)
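A minimal sketch of how one might summarize such a distribution; the helper and the toy data below are hypothetical stand-ins, not capa's internals, and the numbers merely mirror the reported trend (instructions < 10 features, basic blocks < 20, functions < 100):

```python
from statistics import median

def size_distribution(feature_sets_by_scope):
    """feature_sets_by_scope: scope name -> list of per-instance feature sets."""
    summary = {}
    for scope, sets in feature_sets_by_scope.items():
        sizes = sorted(len(s) for s in sets)
        summary[scope] = {"min": sizes[0], "median": median(sizes), "max": sizes[-1]}
    return summary

# toy data, shaped like the reported trend
data = {
    "instruction": [{"a"}, {"a", "b"}, {"a", "b", "c"}],
    "basic block": [set(range(5)), set(range(12))],
    "function":    [set(range(40)), set(range(90))],
}
print(size_distribution(data))
```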
Implement the "tighten rule pre-selection" algorithm described here: #2063 (comment). In summary:

> Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".

This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 1.1M (wow!) and capa seems to match around 3x more functions per second (wow wow). When doing large scale runs, capa is about 25% faster when using the vivisect backend (analysis heavy) or 3x faster when using the upcoming BinExport2 backend (minimal analysis).
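The pre-selection idea above can be sketched as follows. This is not capa's implementation; the rule set, feature strings, and helper names are invented for illustration, and "most uncommon" is approximated here by how rarely a feature appears across the rule set (with a lexicographic tie-break so the choice is deterministic):

```python
from collections import Counter

def build_index(rules):
    """rules: rule name -> set of features that MUST be present for it to match.
    For each rule, index only its most selective required feature."""
    counts = Counter(f for feats in rules.values() for f in feats)
    index = {}
    for name, feats in rules.items():
        # rarest feature first; tie-break on the feature string for determinism
        key = min(feats, key=lambda f: (counts[f], f))
        index.setdefault(key, []).append(name)
    return index

def candidate_rules(index, feature_set):
    """Only rules whose indexed (selective) feature is present need full evaluation."""
    out = []
    for f in feature_set:
        out.extend(index.get(f, []))
    return out

rules = {
    "r1": {"api(CreateFile)", "mnemonic(xor)"},
    "r2": {"api(VirtualAlloc)", "mnemonic(xor)"},
    "r3": {"api(CreateFile)", "number(0x40)"},
}
idx = build_index(rules)
print(candidate_rules(idx, {"api(VirtualAlloc)", "mnemonic(xor)"}))  # → ['r2']
```

The win comes from skipping rules entirely: a FeatureSet containing only common features (like `mnemonic(xor)` above) selects zero candidates, because every rule that needs it also needs something rarer that is absent.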
Candidate enhancements have been broken out into their own issues; closing this thread of investigation.
Has anyone ever experienced capa timing out? Also, is there a configuration setting to invoke only the rules that match the file's features?
@Dextera0007 would you mind creating a new issue so that we can have a separate thread of conversation?
As shown in #2061, perhaps 70% of capa runtime is spent evaluating rule logic. That means, if we want to make capa run faster, improvements to the logic evaluation code may have a bigger impact than code analysis changes.
In this thread I'll record findings and ideas for performance enhancements. We can close out this issue when we feel we have a good handle on performance and whether it's worthwhile to make changes to capa.