show the prevalence of rules in the output #520

Open · williballenthin opened this issue Apr 8, 2021 · 23 comments
Labels: enhancement (New feature or request), gsoc (Work related to Google Summer of Code project), usability (Related to using capa and displaying results (CLI/GUI))

Comments

@williballenthin (Collaborator):

we can improve the report by showing how commonly a rule matches globally/against benign samples/against malware. this context can help a user decide if a match is interesting or not. for example, "open a file" matches everywhere, so it's not usually "interesting", while "encrypt with FakeM" is quite uncommon and therefore "interesting".

in order to do this, we need to collect wide-scale statistics on where each capa rule matches. we also need a way to store/provide this information - embed it in the rules? distribute it within the standalone exe? and how does this interact with third-party rules?

@williballenthin added the enhancement label on Apr 8, 2021
@Aayush-Goel-04 (Contributor) commented Jun 9, 2023:

#1502 (comment)

Entropy-Based Approach
One approach is to calculate the entropy of rule matches. Entropy, in this context, refers to a measure of the distribution or variability of rule matches across a dataset.
By calculating the entropy of each rule, we can determine how commonly or rarely a rule matches within the dataset.
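
For concreteness, a minimal sketch of one way to compute this, assuming per-rule match counts over a sample corpus are already available (rule_match_entropy is a hypothetical helper, not part of capa):

    import math

    def rule_match_entropy(match_count: int, total_samples: int) -> float:
        """Binary Shannon entropy of a rule's match distribution.

        Near 0 when a rule matches almost never or almost always;
        near 1 when it matches about half the samples.
        """
        p = match_count / total_samples
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    print(rule_match_entropy(5, 10_000))      # rare rule  -> ~0.006
    print(rule_match_entropy(5_000, 10_000))  # 50/50 rule -> 1.0

Note that binary entropy is symmetric: a rule matching nearly everywhere scores as low as one matching almost never, so entropy alone can't distinguish "boringly common" from "interestingly rare" without also considering the match probability itself.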

Incorporating entropy as metadata for each rule in the capa report allows users to quickly assess the distribution and variability of rule matches. This information can aid in distinguishing between frequently occurring rule matches that may be less interesting and those that are relatively rare and more noteworthy.

We can employ a rule-ranking system: while displaying capa results, we can sort the matched rules based on their entropy levels in metadata and display confidence levels.

While entropy provides a quantitative measure of rule variability, it's important to note that it may not capture all aspects of rule significance. Therefore, additional factors or heuristics might be necessary to provide a more comprehensive assessment. Continuous refinement and improvement of the analysis techniques can enhance the precision and conciseness of the entropy-based information in the capa report.

We can divide the datasets into groups such as "File Operations," "Network Communication," "Process Injection," etc. For each rule, we can assign weights to each group.
We can store rule-match statistics for each group within the executable itself. This ensures that the information is readily available and accessible when generating reports or displaying rule-match details.
While running capa on a file, if we can categorise the file into a group, then we can sort the rule matches based on their entropy for that group and display their interestingness (see the sketch below).
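
A sketch of that per-group ranking, sorting by per-group match probability as a simple stand-in for the entropy ranking described above (the group names and statistics here are invented for illustration):

    # hypothetical per-group match probabilities, collected over labeled corpora
    GROUP_STATS = {
        "network-communication": {"open a file": 0.92, "encrypt data using FakeM": 0.001},
        "file-operations": {"open a file": 0.97, "encrypt data using FakeM": 0.0005},
    }

    def rank_matches_for_group(matched_rules: list[str], group: str) -> list[str]:
        """Sort matched rule names so the rarest (most interesting) come first."""
        stats = GROUP_STATS.get(group, {})
        # rules with no statistics for this group sort last, with the common ones
        return sorted(matched_rules, key=lambda name: stats.get(name, 1.0))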

@williballenthin (Collaborator, Author):

that's an interesting idea @Aayush-Goel-04. I can see how entropy has been used in ML systems like this before, so it might apply to capa rules and features, too.

that being said, i'd suggest that we also consider the simplest, easiest-to-explain approach: running capa against a bunch of files and recording the number of hits per rule. we can collect this data and distribute the results with subsequent releases of capa. once this works (or doesn't), then we can explore ways to enhance the results if necessary, such as with the entropy idea. thoughts @Aayush-Goel-04?
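
A rough sketch of that collection step, assuming capa's JSON output mode (capa -j) emits a result document whose "rules" object is keyed by rule name; the corpus directory and output filename are made up:

    import json
    import subprocess
    from collections import Counter
    from pathlib import Path

    hits: Counter = Counter()
    samples = list(Path("corpus/").glob("*"))  # hypothetical sample directory

    for sample in samples:
        proc = subprocess.run(
            ["capa", "-j", str(sample)], capture_output=True, text=True
        )
        if proc.returncode != 0:
            continue  # skip unsupported/corrupt samples
        doc = json.loads(proc.stdout)
        hits.update(doc["rules"].keys())

    # persist the counts (plus corpus size) for distribution with a capa release
    Path("prevalence.json").write_text(
        json.dumps({"total_samples": len(samples), "hits": dict(hits)}, indent=2)
    )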

@Aayush-Goel-04 (Contributor):

> that being said, i'd suggest that we also consider the simplest, easiest-to-explain approach: running capa against a bunch of files and recording the number of hits per rule. we can collect this data and distribute the results with subsequent releases of capa. once this works (or doesn't), then we can explore ways to enhance the results if necessary, such as with the entropy idea. thoughts @Aayush-Goel-04?

That can be a good start. Also, since capa allows third-party rules, we need to define a standard format or mechanism for third-party rules to include their own statistics or contribute to the overall statistics collection.

@Aayush-Goel-04 (Contributor) commented Jun 12, 2023:

We can add a field in the rule meta for each rule's probability of occurrence.
Also, testing all rules on a large dataset would require a lot of time and compute; to start, we can try this with a small set of rules on a sample dataset, and then run it on a simple exe file.
Any thoughts @williballenthin?

@Aayush-Goel-04 (Contributor):

@williballenthin
I ran some tests for this. Below are the results.

Order of capabilities as currently shown:
[screenshot]

Order of capabilities after probability is integrated:
[screenshot]

The rules are ordered with the least probable at the top.

The file below contains the number of occurrences of each rule across all capa-testfiles.
entropy.xlsx

@Aayush-Goel-04 (Contributor) commented Jul 29, 2023:

@williballenthin
There are two options: we can add an entropy field to the meta of each rule, which will be used while rendering:

rule:
  meta:
    name:
    namespace:
    authors:
    scope: file
    mbc:
    references:
    examples:
    entropy: 10

Or we can store the results within the executable itself.

For third-party rules:

  • we can have users define a probability for each rule
  • or set a default value of 0 or 1 for all third-party rules (see the lookup sketch below).
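
A sketch of the lookup side under the "store results within the executable" option; the prevalence.json filename matches the collection sketch above, and the default of 1.0 for unseen/third-party rules is an assumption, not a decision:

    import json
    from pathlib import Path

    def load_prevalence(path: Path = Path("prevalence.json")) -> dict:
        """Load the bundled per-rule match probabilities, if present."""
        try:
            data = json.loads(path.read_text())
        except FileNotFoundError:
            return {}
        total = data["total_samples"]
        return {name: count / total for name, count in data["hits"].items()}

    PREVALENCE = load_prevalence()

    def rule_probability(rule_name: str) -> float:
        # third-party / unseen rules fall back to a default: 1.0 groups them
        # with "common" rules, 0.0 would surface them with the "rare" ones
        return PREVALENCE.get(rule_name, 1.0)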

@Aayush-Goel-04 (Contributor):

@williballenthin, what are your thoughts on the above comments?

@williballenthin (Collaborator, Author) commented Aug 3, 2023:

> Or we can store the results within the executable itself.

I think I prefer this strategy, since it would be a burden to expect rule authors to collect the prevalence of their rule as soon as they author it. Instead, we can try to periodically collect prevalence information and package it alongside capa for the common use case.

I expect that we'll be able to provide a prevalence table derived from VT; however, this data isn't approved for public release yet. Let's assume it will be available by the time we merge the final representation of this data, and use your example data in the meantime.

> For third-party rules ... or set a default value of 0 or 1 for all third-party rules.

I think this makes sense. And, it may encourage people to contribute their rules to the common set so they can see prevalence information.

@williballenthin (Collaborator, Author):

Thank you @Aayush-Goel-04 for taking the time to update the rendering based on the prevalence. I like how it puts the "more interesting" rules towards the top.

I think if we want to use this format, we should display the prevalence in a column so that users can see why the ordering is the way it is.

Alternatively, I would like to explore finding a cutoff between "common" and "uncommon" and highlighting the rules that are uncommon (via a different output color and/or perhaps a * next to their name). This way, users don't have to guess about how to interpret the prevalence numbers and can rely on capa's recommendations. It also lets us use the existing output format (which is ordered by namespace, which has nice properties, like grouping of similar things).

@Aayush-Goel-04 (Contributor) commented Aug 5, 2023:

> Instead, we can try to periodically collect prevalence information and package it alongside capa for the common use case.

I am aware of one approach: embed the results directly into the executable's resources as JSON or pickle files. However, I'm interested to know whether there are any alternative approaches.

For highlighting rules, I have the following ideas (a bucketing sketch follows this list):

  • first, we separate rules based on probability into three sections (each section ordered by namespace).
  • rare: (0, 0.1), uncommon: (0.1, 0.3), common: (0.3, 1). The ranges can be decided later based on how the final data is calculated.
  • in the rare section, those with probability close to 0 can have a * next to their name, as you said.
  • we can also represent each section with three different colors for visuals.
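
A sketch of that bucketing; the thresholds are the placeholders from the list above, to be tuned once the final data exists:

    def prevalence_bucket(p: float) -> str:
        """Map a rule's match probability to a display section."""
        if p < 0.1:
            return "rare"      # near-zero p: candidates for a '*' marker
        if p < 0.3:
            return "uncommon"
        return "common"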

What are your thoughts @williballenthin?

@Aayush-Goel-04 (Contributor) commented Aug 5, 2023:

@williballenthin, below are sample screenshots of the render. I think this rendering looks better.
rare refers to prob < 0.05, or fewer than 30 matches for a rule across all capa-testfiles; common refers to prob > 0.05.
After filtering based on probability, the rules are ordered by namespace and name.

[screenshot]

[screenshot]

@mr-tz (Collaborator) commented Aug 8, 2023:

I think this is pretty neat! How would you propose to handle new rules with no prevalence data (yet)? Show them as unknown?

@Aayush-Goel-04 (Contributor):

Their entropy value will be taken as zero, they will be ordered by namespace, and their prevalence will be shown as unknown.

@mr-tz (Collaborator) commented Aug 8, 2023:

Ok, I wonder about these alternatives:

  • show two tables: one for rare and one for common and unknown
  • always sort by namespace but highlight rare rules (or tune down common rules)

@Aayush-Goel-04 (Contributor):

> show two tables

Instead of this, I think it would be better to separate them with a line in the table.

@mr-tz (Collaborator) commented Aug 8, 2023:

good idea, that could work well

@Aayush-Goel-04 (Contributor):

> good idea, that could work well

The common (known entropy) and unknown (no prevalence data) ones could also be separated, but then there would be no sense in sorting by name and namespace. I propose only two sections; coloring can be discussed.

[screenshot]

What are your thoughts @williballenthin @mr-tz?

@mr-tz (Collaborator) commented Aug 14, 2023:

I like it! Minor adjustments could be:

  • same color for capabilities
  • different colors for rare, unknown, common

@Aayush-Goel-04 (Contributor):

@mr-tz
rare: blue, common: cyan (the default color for capabilities), unknown: no color. We can decide on the color for rare.

[screenshot]
[screenshot]
[screenshot]

In my opinion, the format in the 2nd image looks better.

@mr-tz (Collaborator) commented Aug 16, 2023:

Agreed, one and two look good. Green may suggest "good" (vs. red as "bad") in some contexts, so we may want to stick to other colors.

@Aayush-Goel-04 (Contributor):

@mr-tz, then I think it would be better to stick with the current coloring, cyan.
Since the rare ones appear in a separate section, they can have the same color as capabilities; in the common section, we can leave unknown rules uncolored and color the common ones (cyan).
[screenshot]

In case no rare rules are present:

[screenshot]

@williballenthin (Collaborator, Author):

i wonder if we should color the rule name the same as the prevalence column. as is, we use color to convey information in one column (prevalence) but in another column (name) it's just for highlighting. i think this is confusing.

alternatively, maybe we could use different colors for names/prevalence, but then we run the risk of introducing too many colors.
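
For illustration, a sketch of the first option (coloring the rule name the same as its prevalence cell), using a termcolor-style API; the color mapping is just the one floated above, not a decision:

    from termcolor import colored

    # assumed mapping from the discussion above; None means "no color"
    PREVALENCE_COLORS = {"rare": "blue", "common": "cyan", "unknown": None}

    def render_row(name: str, prevalence: str) -> str:
        """Color the rule name and its prevalence cell identically."""
        color = PREVALENCE_COLORS.get(prevalence)
        paint = (lambda s: colored(s, color)) if color else (lambda s: s)
        return f"{paint(name)}  {paint(prevalence)}"

    print(render_row("encrypt data using FakeM", "rare"))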

@mr-tz (Collaborator) commented Aug 19, 2023:

good points, Willi, I like different colors if we can find a good selection

@mr-tz added the gsoc and usability labels on May 22, 2024