show the prevalence of rules in the output #520

Open · williballenthin opened this issue Apr 8, 2021 · 23 comments
Labels: enhancement (New feature or request), gsoc (Work related to Google Summer of Code project), usability (Related to using capa and displaying results (CLI/GUI))

Comments

@williballenthin (Collaborator):

we can improve the report by showing how commonly a rule matches globally/against benign samples/against malware. this context can help a user decide if a match is interesting or not. for example, "open a file" matches everywhere, so it's not usually "interesting", while "encrypt with FakeM" is quite uncommon and therefore "interesting".

in order to do this, we need to collect wide-scale statistics on where each capa rule matches. we also need a way to store/provide this information - embed it in the rules? distribute it within the standalone exe? and how does this interact with third-party rules?

@williballenthin added the enhancement label on Apr 8, 2021
@Aayush-Goel-04 (Contributor) commented Jun 9, 2023:

#1502 (comment)

Entropy-Based Approach
One approach is to calculate the entropy of rule matches. Entropy, in this context, refers to a measure of the distribution or variability of rule matches across a dataset.
By calculating the entropy of each rule, we can determine how commonly or rarely a rule matches within the dataset.
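
For concreteness, a minimal sketch of one way to compute this, assuming per-rule match counts over a sample corpus are already available (rule_match_entropy is a hypothetical helper, not part of capa):

    import math

    def rule_match_entropy(match_count: int, total_samples: int) -> float:
        """Binary Shannon entropy of a rule's match distribution.

        Near 0 when a rule matches almost never or almost always;
        near 1 when it matches about half the samples.
        """
        p = match_count / total_samples
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    print(rule_match_entropy(5, 10_000))      # rare rule  -> ~0.006
    print(rule_match_entropy(5_000, 10_000))  # 50/50 rule -> 1.0

Note that binary entropy is symmetric: a rule matching nearly everywhere scores as low as one matching almost never, so entropy alone can't distinguish "boringly common" from "interestingly rare" without also considering the match probability itself.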

Incorporating entropy as metadata for each rule in the capa report allows users to quickly assess the distribution and variability of rule matches. This information can aid in distinguishing between frequently occurring rule matches that may be less interesting and those that are relatively rare and more noteworthy.

We can employ a rule-ranking system: while displaying capa results, we can sort the matched rules based on their entropy levels in metadata and display confidence levels.

While entropy provides a quantitative measure of rule variability, it's important to note that it may not capture all aspects of rule significance. Therefore, additional factors or heuristics might be necessary to provide a more comprehensive assessment. Continuous refinement and improvement of the analysis techniques can enhance the precision and conciseness of the entropy-based information in the capa report.

We can divide the datasets into groups such as "File Operations," "Network Communication," "Process Injection," etc. For each rule, we can assign weights to each group.
We can store rule-match statistics for each group within the executable itself. This ensures that the information is readily available and accessible when generating reports or displaying rule-match details.
While running capa on a file, if we can categorise the file into a group, then we can sort the rule matches based on their entropy for that group and display their interestingness (see the sketch below).
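
A sketch of that per-group ranking, sorting by per-group match probability as a simple stand-in for the entropy ranking described above (the group names and statistics here are invented for illustration):

    # hypothetical per-group match probabilities, collected over labeled corpora
    GROUP_STATS = {
        "network-communication": {"open a file": 0.92, "encrypt data using FakeM": 0.001},
        "file-operations": {"open a file": 0.97, "encrypt data using FakeM": 0.0005},
    }

    def rank_matches_for_group(matched_rules: list[str], group: str) -> list[str]:
        """Sort matched rule names so the rarest (most interesting) come first."""
        stats = GROUP_STATS.get(group, {})
        # rules with no statistics for this group sort last, with the common ones
        return sorted(matched_rules, key=lambda name: stats.get(name, 1.0))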

@williballenthin (Collaborator, Author):

that's an interesting idea @Aayush-Goel-04. I can see how entropy has been used in ML systems like this before, so it might apply to capa rules and features, too.

that being said, i'd suggest that we also consider the simplest, easiest-to-explain approach: running capa against a bunch of files and recording the number of hits per rule. we can collect this data and distribute the results with subsequent releases of capa. once this works (or doesn't), then we can explore ways to enhance the results if necessary, such as with the entropy idea. thoughts @Aayush-Goel-04?
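
A rough sketch of that collection step, assuming capa's JSON output mode (capa -j) emits a result document whose "rules" object is keyed by rule name; the corpus directory and output filename are made up:

    import json
    import subprocess
    from collections import Counter
    from pathlib import Path

    hits: Counter = Counter()
    samples = list(Path("corpus/").glob("*"))  # hypothetical sample directory

    for sample in samples:
        proc = subprocess.run(
            ["capa", "-j", str(sample)], capture_output=True, text=True
        )
        if proc.returncode != 0:
            continue  # skip unsupported/corrupt samples
        doc = json.loads(proc.stdout)
        hits.update(doc["rules"].keys())

    # persist the counts (plus corpus size) for distribution with a capa release
    Path("prevalence.json").write_text(
        json.dumps({"total_samples": len(samples), "hits": dict(hits)}, indent=2)
    )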

@Aayush-Goel-04 (Contributor):

> that being said, i'd suggest that we also consider the simplest, easiest-to-explain approach: running capa against a bunch of files and recording the number of hits per rule. we can collect this data and distribute the results with subsequent releases of capa. once this works (or doesn't), then we can explore ways to enhance the results if necessary, such as with the entropy idea. thoughts @Aayush-Goel-04?

That can be a good start. Also, since capa allows third-party rules, we need to define a standard format or mechanism for third-party rules to include their own statistics or contribute to the overall statistics collection.

@Aayush-Goel-04 (Contributor) commented Jun 12, 2023:

We can add a field in the rule meta for each rule's probability of occurrence.
Also, testing all rules on a large dataset would require a lot of time and compute; to start, we can try this with a small set of rules on a sample dataset, and then run it on a simple exe file.
Any thoughts @williballenthin?

@Aayush-Goel-04 (Contributor):

@williballenthin
I ran some tests for this. Below are the results.

Order of capabilities as currently shown:
[screenshot]

Order of capabilities after probability is integrated:
[screenshot]

The rules are ordered with the least probable at the top.

The file below contains the number of occurrences of each rule across all capa-testfiles.
entropy.xlsx

@Aayush-Goel-04 (Contributor) commented Jul 29, 2023:

@williballenthin
There are two options: we can add an entropy field to the meta of each rule, which will be used while rendering:

rule:
  meta:
    name:
    namespace:
    authors:
    scope: file
    mbc:
    references:
    examples:
    entropy: 10

Or we can store the results within the executable itself.

For third-party rules:

  • we can have users define a probability for each rule
  • or set a default value of 0 or 1 for all third-party rules (see the lookup sketch below).
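
A sketch of the lookup side under the "store results within the executable" option; the prevalence.json filename matches the collection sketch above, and the default of 1.0 for unseen/third-party rules is an assumption, not a decision:

    import json
    from pathlib import Path

    def load_prevalence(path: Path = Path("prevalence.json")) -> dict:
        """Load the bundled per-rule match probabilities, if present."""
        try:
            data = json.loads(path.read_text())
        except FileNotFoundError:
            return {}
        total = data["total_samples"]
        return {name: count / total for name, count in data["hits"].items()}

    PREVALENCE = load_prevalence()

    def rule_probability(rule_name: str) -> float:
        # third-party / unseen rules fall back to a default: 1.0 groups them
        # with "common" rules, 0.0 would surface them with the "rare" ones
        return PREVALENCE.get(rule_name, 1.0)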

@Aayush-Goel-04 (Contributor):

@williballenthin, what are your thoughts on the above comments?

@williballenthin (Collaborator, Author) commented Aug 3, 2023:

> Or we can store the results within the executable itself.

I think I prefer this strategy, since it would be a burden to expect rule authors to collect the prevalence of their rule as soon as they author it. Instead, we can try to periodically collect prevalence information and package it alongside capa for the common use case.

I expect that we'll be able to provide a prevalence table derived from VT; however, this data isn't approved for public release yet. Let's assume it will be available by the time we merge the final representation of this data, and use your example data in the meantime.

> For third-party rules ... or set a default value of 0 or 1 for all third-party rules.

I think this makes sense. And, it may encourage people to contribute their rules to the common set so they can see prevalence information.

@williballenthin (Collaborator, Author):

Thank you @Aayush-Goel-04 for taking the time to update the rendering based on the prevalence. I like how it puts the "more interesting" rules towards the top.

I think if we want to use this format, we should display the prevalence in a column so that users can see why the ordering is the way it is.

Alternatively, I would like to explore finding a cutoff between "common" and "uncommon" and highlighting the rules that are uncommon (via a different output color and/or perhaps a * next to their name). This way, users don't have to guess about how to interpret the prevalence numbers and can rely on capa's recommendations. It also lets us use the existing output format (which is ordered by namespace, which has nice properties, like grouping of similar things).

@Aayush-Goel-04 (Contributor) commented Aug 5, 2023:

> Instead, we can try to periodically collect prevalence information and package it alongside capa for the common use case.

I am aware of one approach: embed the results directly into the executable's resources as JSON or pickle files. However, I'm interested to know whether there are any alternative approaches.

For highlighting rules, I have the following ideas (a bucketing sketch follows this list):

  • first, we separate rules based on probability into three sections (each section ordered by namespace).
  • rare: (0, 0.1), uncommon: (0.1, 0.3), common: (0.3, 1). The ranges can be decided later based on how the final data is calculated.
  • in the rare section, those with probability close to 0 can have a * next to their name, as you said.
  • we can also represent each section with three different colors for visuals.
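
A sketch of that bucketing; the thresholds are the placeholders from the list above, to be tuned once the final data exists:

    def prevalence_bucket(p: float) -> str:
        """Map a rule's match probability to a display section."""
        if p < 0.1:
            return "rare"      # near-zero p: candidates for a '*' marker
        if p < 0.3:
            return "uncommon"
        return "common"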

What are your thoughts @williballenthin?

@Aayush-Goel-04 (Contributor) commented Aug 5, 2023:

@williballenthin, below are sample screenshots of the render. I think this rendering looks better.
rare refers to prob < 0.05, or fewer than 30 matches for a rule across all capa-testfiles; common refers to prob > 0.05.
After filtering based on probability, the rules are ordered by namespace and name.

[screenshot]

[screenshot]

@mr-tz (Collaborator) commented Aug 8, 2023:

I think this is pretty neat! How would you propose to handle new rules with no prevalence data (yet)? Show them as unknown?

@Aayush-Goel-04 (Contributor):

Their entropy value will be taken as zero, they will be ordered by namespace, and their prevalence will be shown as unknown.

@mr-tz (Collaborator) commented Aug 8, 2023:

Ok, I wonder about these alternatives:

  • show two tables: one for rare and one for common and unknown
  • always sort by namespace but highlight rare rules (or tune down common rules)

@Aayush-Goel-04 (Contributor):

> show two tables

Instead of this, I think it would be better to separate them with a line in the table.

@mr-tz (Collaborator) commented Aug 8, 2023:

good idea, that could work well

@Aayush-Goel-04 (Contributor):

> good idea, that could work well

The common (known entropy) and unknown (no prevalence data) ones could also be separated, but then there would be no sense in sorting by name and namespace. I propose only two sections; coloring can be discussed.

[screenshot]

What are your thoughts @williballenthin @mr-tz?

@mr-tz (Collaborator) commented Aug 14, 2023:

I like it! Minor adjustments could be:

  • same color for capabilities
  • different colors for rare, unknown, common

@Aayush-Goel-04 (Contributor):

@mr-tz
rare: blue, common: cyan (the default color for capabilities), unknown: no color. We can decide on the color for rare.

[screenshot]
[screenshot]
[screenshot]

In my opinion, the format in the 2nd image looks better.

@mr-tz (Collaborator) commented Aug 16, 2023:

Agreed, one and two look good. Green may suggest "good" (vs. red as "bad") in some contexts, so we may want to stick to other colors.

@Aayush-Goel-04 (Contributor):

@mr-tz, then I think it would be better to stick with the current coloring, cyan.
Since the rare ones appear in a separate section, they can have the same color as capabilities; in the common section, we can leave unknown rules uncolored and color the common ones (cyan).
[screenshot]

In case no rare rules are present:

[screenshot]

@williballenthin (Collaborator, Author):

i wonder if we should color the rule name the same as the prevalence column. as is, we use color to convey information in one column (prevalence) but in another column (name) it's just for highlighting. i think this is confusing.

alternatively, maybe we could use different colors for names/prevalence, but then we run the risk of introducing too many colors.
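
For illustration, a sketch of the first option (coloring the rule name the same as its prevalence cell), using a termcolor-style API; the color mapping is just the one floated above, not a decision:

    from termcolor import colored

    # assumed mapping from the discussion above; None means "no color"
    PREVALENCE_COLORS = {"rare": "blue", "common": "cyan", "unknown": None}

    def render_row(name: str, prevalence: str) -> str:
        """Color the rule name and its prevalence cell identically."""
        color = PREVALENCE_COLORS.get(prevalence)
        paint = (lambda s: colored(s, color)) if color else (lambda s: s)
        return f"{paint(name)}  {paint(prevalence)}"

    print(render_row("encrypt data using FakeM", "rare"))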

@mr-tz (Collaborator) commented Aug 19, 2023:

good points, Willi, I like different colors if we can find a good selection

@mr-tz added the gsoc and usability labels on May 22, 2024