feature: instruction scope #930

Merged
merged 39 commits into from
Apr 4, 2022

williballenthin
Collaborator

@williballenthin williballenthin commented Mar 28, 2022

closes #767
closes #931

This PR introduces support for a new matching scope "instruction" and two features for matching operand values (immediate constants and memory offsets). The instruction scope is intended to enable matching of mnemonic + operand value combinations, such as cmp ???, 0x11223344. This should enable more precise rules to replace existing logic like:

- basic block:
  - and:
    - mnemonic: cmp
    - number: 0x5E = '^' (Track 1 separator)

which may become:

- instruction:
  - mnemonic: cmp
  - operand[1].number: 0x5E = '^' (Track 1 separator)

You can use the instruction scope in the rule.meta.scope field or as a subscope (via an instruction: block) within another rule (as above). When used as a subscope, a top-level and: is implied, so the compact form:

- instruction:
  - mnemonic: cmp
  - operand[1].number: 0x5E = '^' (Track 1 separator)

is equivalent to the more verbose form:

- instruction:
  - and:
    - mnemonic: cmp
    - operand[1].number: 0x5E = '^' (Track 1 separator)
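
To illustrate the implied top-level and:, here is a minimal Python sketch. It is not capa's rule parser: the helper name `with_explicit_and`, the `LOGIC_KEYS` set, and the modeling of rule entries as single-key dicts are all assumptions made for this example.

```python
# Hypothetical helper, not part of capa. Each rule entry is modeled as the
# single-key dict a YAML parser would produce for one list item.
LOGIC_KEYS = {"and", "or", "not"}


def with_explicit_and(children):
    """Wrap the children of an `instruction:` block in an explicit `and:`.

    If the block already holds exactly one logic node, return it unchanged.
    """
    if len(children) == 1 and next(iter(children[0])) in LOGIC_KEYS:
        return children
    return [{"and": children}]


# The compact and verbose forms from the description above.
compact = [{"mnemonic": "cmp"}, {"operand[1].number": 0x5E}]
verbose = [{"and": [{"mnemonic": "cmp"}, {"operand[1].number": 0x5E}]}]

normalized_compact = with_explicit_and(compact)
normalized_verbose = with_explicit_and(verbose)
```

Under these assumptions, both forms normalize to the same tree, which is what "equivalent" means here.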

And, of course, you can have complex logic in the instruction scope:

- instruction:
  - or:
    - and:
      - description: obvious way to load a constant via immediate
      - mnemonic: mov
      - operand[1].number: 0x5E = '^' (Track 1 separator)
    - and:
      - description: space-optimized form to move a constant relative to a zero register
      - mnemonic: lea
      - operand[1].offset: 0x5E = '^' (Track 1 separator)
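
As a rough illustration of how such nested logic could be evaluated against the features of a single instruction, consider the sketch below. The `evaluate` function and the tuple-based feature sets are hypothetical and much simpler than capa's real matching engine; `description` entries are treated as documentation and skipped.

```python
# Hypothetical evaluator, not capa's engine. A rule node is a single-key dict;
# an instruction's features are modeled as a set of (name, value) tuples.
def evaluate(node, features):
    ((key, value),) = node.items()
    if key in ("and", "or"):
        # `description` entries document the rule and are not matched.
        children = [child for child in value if "description" not in child]
        results = (evaluate(child, features) for child in children)
        return all(results) if key == "and" else any(results)
    return (key, value) in features


RULE = {
    "or": [
        {"and": [
            {"description": "obvious way to load a constant via immediate"},
            {"mnemonic": "mov"},
            {"operand[1].number": 0x5E},
        ]},
        {"and": [
            {"description": "space-optimized form relative to a zero register"},
            {"mnemonic": "lea"},
            {"operand[1].offset": 0x5E},
        ]},
    ]
}

# Assumed feature sets for: mov eax, 0x5E / lea eax, [ebx+0x5E] / cmp eax, 0x5E
mov_imm = {("mnemonic", "mov"), ("number", 0x5E), ("operand[1].number", 0x5E)}
lea_off = {("mnemonic", "lea"), ("offset", 0x5E), ("operand[1].offset", 0x5E)}
cmp_imm = {("mnemonic", "cmp"), ("number", 0x5E), ("operand[1].number", 0x5E)}

matches = [evaluate(RULE, f) for f in (mov_imm, lea_off, cmp_imm)]
```

In this toy model the first two instructions match and the cmp does not, because each and: branch ties the operand value to a specific mnemonic.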

Within the instruction scope, you can reference all the existing features that were already extracted per instruction, like API, number, offset, string, bytes, and many characteristics (like "cross section flow"). You can also use two new features: operand number and operand offset. Both are specified with an operand index, like operand[0].number, which lets you match source and/or destination operands.

operand[{0, 1, 2}].number matches operands that are immediate constants, like 0x123 in the instruction mov eax, 0x123. Like the existing number feature, valid addresses are filtered out.

operand[{0, 1, 2}].offset matches the offset portion of memory reference operands, like 0x10 in the instruction mov eax, [ebx+0x10]. Like the existing offset feature, suspected stack variable references are filtered out.
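
The two operand features can be sketched with a toy operand model. Everything here is assumed for illustration: the `Imm`, `Mem`, and `Reg` classes, the `operand_features` function, and both filters (the caller-supplied address check and the "base register is a stack register" heuristic) are stand-ins, not the logic of the vivisect, IDA, or SMDA extractors.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple, Union


@dataclass
class Imm:  # immediate constant, like 0x123 in `mov eax, 0x123`
    value: int


@dataclass
class Mem:  # memory reference, like [ebx+0x10]
    base: Optional[str]
    disp: int


@dataclass
class Reg:  # plain register operand
    name: str


Operand = Union[Imm, Mem, Reg]

# Assumed heuristic: references relative to these registers are stack variables.
STACK_REGS = {"esp", "ebp", "rsp", "rbp"}


def operand_features(
    operands: List[Operand],
    is_valid_address: Callable[[int], bool] = lambda value: False,
) -> Iterator[Tuple[str, int]]:
    """Yield (feature name, value) pairs for a single instruction's operands."""
    for index, op in enumerate(operands):
        if isinstance(op, Imm):
            if is_valid_address(op.value):
                continue  # like `number`, skip values that look like addresses
            yield (f"operand[{index}].number", op.value)
        elif isinstance(op, Mem):
            if op.base in STACK_REGS:
                continue  # like `offset`, skip suspected stack variables
            yield (f"operand[{index}].offset", op.disp)


mov_imm = list(operand_features([Reg("eax"), Imm(0x123)]))
mov_mem = list(operand_features([Reg("eax"), Mem("ebx", 0x10)]))
mov_stack = list(operand_features([Reg("eax"), Mem("ebp", 0x10)]))
```

Register operands yield nothing in this sketch, mirroring the note that register features are not supported.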

Register, displacement, and computed address features are not supported at this time, since I haven't imagined any common use cases yet.

This is a breaking change because old versions of capa will not understand the instruction scope.

performance impact

When considering mimikatz.exe, the vivisect backend extracts 322,647 features in total before this change. This change adds 49,188 operand features (+15%): 16,661 numbers and 32,527 offsets.

Somehow, the total runtime against mimikatz doesn't change much:
image

Once we convert some rules over to using the instruction scope, we should re-evaluate this.

Maybe 4% slower during feature extraction?
image

In fact, there are fewer evaluations during matching with these changes, probably due to the fix for #931 (don't use global features for optimizer down selection) which perhaps makes up for the additional features generated:

| label | count(evaluations) | min(time) | avg(time) | max(time) |
|---|---|---|---|---|
| fe5d885 base | 16,743,068 | 26.13s | 26.16s | 26.19s |
| 76831e9 with insn scope | 14,846,388 | 26.20s | 26.24s | 26.27s |
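
For readers who want to check the figures quoted in this section, the relative changes can be recomputed from the reported numbers. The helper name `pct_change` is made up for this example; the inputs are copied from the text and the measurements above.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Feature counts reported for mimikatz.exe with the vivisect backend.
total_before = 322_647
operand_numbers = 16_661
operand_offsets = 32_527
operand_total = operand_numbers + operand_offsets

extra_features_pct = operand_total / total_before * 100.0

# Matching measurements: fe5d885 (base) vs 76831e9 (with insn scope).
evaluations_pct = pct_change(16_743_068, 14_846_388)
avg_time_pct = pct_change(26.16, 26.24)

print(f"operand features: {operand_total} ({extra_features_pct:+.1f}%)")
print(f"evaluations: {evaluations_pct:+.1f}%")
print(f"avg time: {avg_time_pct:+.1f}%")
```

So roughly 11% fewer evaluations coincide with an essentially unchanged average runtime, consistent with the observation that the two effects offset each other.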

TODO

  • add new scope name
    • add rule format tests
  • extract inline insn blocks into subscope rules
  • index insn rules by mnem
  • extract features
    • viv
    • ida
    • smda
  • match features
  • performance analysis
  • changelog
  • documentation
  • update rules
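
The TODO item "index insn rules by mnem" is not explained further in this thread. One plausible reading, sketched below under that assumption, is a lookup table so that only rules mentioning an instruction's mnemonic need to be evaluated for it. The function names and the rule representation (rule name mapped to a set of mnemonics) are hypothetical.

```python
from collections import defaultdict


def index_by_mnemonic(rules):
    """Build a mnemonic -> rule names index.

    `rules` maps a rule name to the set of mnemonics it can match; an empty
    set means the rule has no mnemonic constraint and must always be tried.
    """
    by_mnemonic = defaultdict(list)
    unindexed = []
    for name, mnemonics in rules.items():
        if mnemonics:
            for mnemonic in sorted(mnemonics):
                by_mnemonic[mnemonic].append(name)
        else:
            unindexed.append(name)
    return by_mnemonic, unindexed


def candidate_rules(by_mnemonic, unindexed, mnemonic):
    """Rules worth evaluating for an instruction with the given mnemonic."""
    return by_mnemonic.get(mnemonic, []) + unindexed


# Hypothetical rule names, for illustration only.
rules = {
    "compare with track 1 separator": {"cmp"},
    "load constant 0x5E": {"mov", "lea"},
    "any instruction with offset 0x30": set(),
}
index, always = index_by_mnemonic(rules)
for_cmp = candidate_rules(index, always, "cmp")
for_xor = candidate_rules(index, always, "xor")
```

With this layout, an instruction whose mnemonic appears in no rule only pays for the unindexed rules.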

@williballenthin williballenthin added the enhancement New feature or request label Mar 28, 2022

@github-actions github-actions bot left a comment

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@github-actions github-actions bot dismissed their stale review March 29, 2022 18:34

CHANGELOG updated or no update needed, thanks! 😄

@williballenthin
Collaborator Author

williballenthin commented Mar 30, 2022

image

image

sidebar: it would be nice to know the stack delta at each address so that we can filter out those offset accesses. i think this is easily doable with IDA. viv might need some work.

@mike-hunhoff
Collaborator

Really cool! This is definitely something we should eventually merge and a great preview for related call scope syntax.

I think I'd like to see immediate replaced with number to keep in sync with our current feature set. Basically, I'm asking at a high level: which existing features applicable at the instruction and call scopes would benefit from an operand position?

For instruction scope I'm thinking:

- operand[{0,1,n}].number: ...
- operand[{0,1,n}].offset: ...

For call scope I'm thinking:

- operand[{0,1,n}].number: ...
- operand[{0,1,n}].offset: ...
- operand[{0,1,n}].string: ...
- operand[{0,1,n}].substring: ...
- operand[{0,1,n}].bytes: ...

which enables rules like

- call:
  - api: WinExec
  - operand[0].string: "ipconfig.exe /all"

Also, where do the /x32 and /x64 decorators fit? Can I do something like this?

- operand[1].number/x32: 0x100

My preference would be to eliminate the /x32 and /x64 decorators in favor of supporting os and arch features at these smaller scopes:

- instruction:
  - arch: i386
  - mnemonic: mov
  - operand[1].offset: 0x10

versus

- instruction:
  - mnemonic: mov
  - operand[1].offset/x32: 0x10

@williballenthin
Collaborator Author

williballenthin commented Mar 31, 2022

I think I'd like to see immediate replaced with number to keep in sync with our current feature set.

this makes sense to me. I used immediate to be more consistent with the underlying terminology, though i agree that using number fits better with what capa is already doing. we'll just have to be clear in the docs how this maps to immediates.

@williballenthin
Collaborator Author

For call scope I'm thinking:

- operand[{0,1,n}].number: ...
- operand[{0,1,n}].offset: ...
- operand[{0,1,n}].string: ...
- operand[{0,1,n}].substring: ...
- operand[{0,1,n}].bytes: ...

I think this is pretty reasonable. Initially I had imagined trying to infer the type based on the value (e.g. 0x10 is number and "foo" is a string, and /bar/ is a regex, ...), so we could have a single operand[0]: ... to minimize syntax. But, as you point out here, there are potentially many options, so differentiating the types probably does make sense.

I don't know that offset makes sense as an argument type, and substring could perhaps be merged with string, maybe via regex? I'm not really sure, but let's continue this discussion in a different thread.

@williballenthin
Collaborator Author

Also, where do the /x32 and /x64 decorators fit? Can I do something like this?

- operand[1].number/x32: 0x100

These are not supported for operands right now; they're just extracted as number/offset features. I expect and: arch: i386 to be a better fit, as you propose.

My preference would be to eliminate the /x32 and /x64 decorators in favor of supporting os and arch features at ... smaller scopes

I think this might be a good idea now that we have arch features (we didn't when /x32 was introduced). It'll also simplify our feature code, as we don't have to thread bitness everywhere. Here are the rules where we currently use these features:

rg "(offset|number)/x" -l
linking/runtime-linking/access-peb-ldr_data.yml
linking/runtime-linking/get-ntdll-base-address.yml
linking/runtime-linking/get-kernel32-base-address.yml
nursery/log-keystrokes-via-raw-input-data.yml
communication/socket/tcp/send/obtain-transmitpackets-callback-function-via-wsaioctl.yml
host-interaction/hardware/cpu/get-number-of-processors.yml
host-interaction/process/create/create-a-process-with-modified-io-handles-and-window.yml
host-interaction/process/get-process-heap-force-flags.yml
host-interaction/process/get-process-heap-flags.yml
lib/peb-access.yml
anti-analysis/anti-forensic/patch-process-command-line.yml
anti-analysis/anti-debugging/debugger-detection/check-for-peb-ntglobalflag-flag.yml
load-code/pe/enumerate-pe-sections.yml
load-code/pe/rebuild-import-table.yml
load-code/pe/parse-pe-header.yml

It'll also reduce the number of features we extract and match against. Created issue #932 to track this proposal.
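
For anyone without ripgrep at hand, the search above can be approximated in Python. This is a sketch under assumptions: the function name is made up, and unlike rg it only looks at files ending in .yml, which is assumed to cover the rules directory.

```python
import re
from pathlib import Path

# Same regular expression as the `rg` invocation above.
PATTERN = re.compile(r"(offset|number)/x")


def rules_using_bitness_decorators(rules_dir):
    """Return rule paths (relative to rules_dir) that contain a match."""
    root = Path(rules_dir)
    matches = []
    for path in sorted(root.rglob("*.yml")):
        if PATTERN.search(path.read_text(encoding="utf-8")):
            matches.append(path.relative_to(root).as_posix())
    return matches
```

Calling `rules_using_bitness_decorators("path/to/rules")` on a checkout of the rules repository should list the same kind of files as the output shown above, though the ordering will differ since this version sorts the paths.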

@williballenthin williballenthin added the breaking-change introduces a breaking change that should be released in a major version label Mar 31, 2022
@williballenthin williballenthin added this to the 4.0.0 milestone Mar 31, 2022
@williballenthin
Collaborator Author

@mike-hunhoff I've renamed the operand feature from "immediate" to "number" as you suggested.

Collaborator

@mr-tz mr-tz left a comment

Once tests pass, I'm in favor of merging this!

@williballenthin
Collaborator Author

I've xfail'd the SMDA tests in recognition of #937

@williballenthin
Collaborator Author

williballenthin commented Apr 4, 2022

IDA extractor implemented and tested:

image

image

@williballenthin williballenthin marked this pull request as ready for review April 4, 2022 21:20
@mr-tz
Collaborator

mr-tz commented Apr 4, 2022

awesome!

Successfully merging this pull request may close these issues.

- probably don't use global features (arch/os/format) to optimize rule matching
- New feature: Instruction