Performance issue with number_observed #67

Open
isstabb opened this issue Sep 23, 2020 · 2 comments

@isstabb

isstabb commented Sep 23, 2020

The number_observed attribute of observed-data adds a cost to the matcher that grows linearly with its value, which seems to be related to the way it is used internally to multiply the SDO. Below are some profiles of the same test, using a number_observed of 1000 vs 10000. The profiles look a bit different only because the profiler includes more in the call graph when execution takes longer.

The examples below use 1000 & 10000 just to be illustrative with a single SDO (and it makes it easier to capture the relevant bits in the profile). I realize that is extreme, but smaller values of number_observed in a larger SDO list could also add up.

1 SDO with 1000 number_observed * 50 patterns
image

1 SDO with 10000 number_observed * 50 patterns
image

The SDO list looks like

[
    {
        "id": "observed-data--107c9a2d-12e9-4599-8a0c-2021a88b472d",
        "type": "observed-data",
        "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae3ee",
        "last_observed": "2020-08-25T20:01:28.567Z",
        "first_observed": "2020-08-25T20:01:28.567Z",
        "number_observed": 10000,
        "created": "2020-08-26T13:23:57.728Z",
        "modified": "2020-08-26T13:23:57.728Z",
        "objects": {
            "0": {
                "type": "windows-registry-key",
                "key": "HKLM\\SYSTEM\\CurrentControlSet\\Control\\MiniNt",
            },
            "1": {
                "type": "process",
                "name": "powershell.exe",
                "pid": 8816,
                "x_ecs_entity_id": "{747f3d96-6e04-5f45-9d00-000000003800}",
                "binary_ref": "3",
                "x_ecs_event_ref": "6",
            },
            "2": {"type": "process", "child_refs": ["1"]},
            "3": {
                "type": "file",
                "name": "powershell.exe",
                "parent_directory_ref": "4",
            },
            "4": {
                "type": "directory",
                "path": "C:\\Windows\\System32\\WindowsPowerShell\\v1.0",
            },
            "5": {
                "type": "x-ecs-host",
                "hostname": "MSEDGEWIN10",
                "os_name": "Windows 10 Enterprise Evaluation",
                "os_version": "10.0",
                "os_platform": "windows",
                "ip": ["fe80::c50d:519f:96a4:e108", "10.0.2.15"],
                "name": "MSEDGEWIN10",
                "id": "747f3d96-68a7-43f1-8cbe-e8d6dadd0358",
                "mac": ["08:00:27:e6:e5:59"],
                "architecture": "x86_64",
            },
            "6": {
                "type": "x-event",
                "code": 12,
                "provider": "Microsoft-Windows-Sysmon",
                "created": "2020-08-25T20:01:28.591Z",
                "kind": "event",
                "module": "sysmon",
                "action": "CreateKey",
            },
        },
    }
]

Where number_observed is changed between the two tests above.
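
To make the linear cost concrete, here is a toy illustration rather than the matcher's actual code. It assumes that each observed-data is expanded into number_observed copies before a pattern is evaluated; `expand_observation`, `count_matches` and `is_powershell` are hypothetical names introduced only for this sketch.

```python
import copy


def expand_observation(sdo):
    """Hypothetical stand-in: return number_observed independent copies."""
    return [copy.deepcopy(sdo) for _ in range(sdo.get("number_observed", 1))]


def count_matches(sdos, predicate):
    """Evaluate the predicate on every expanded copy and count the work done."""
    evaluations = 0
    matches = 0
    for sdo in sdos:
        for instance in expand_observation(sdo):
            evaluations += 1
            if predicate(instance):
                matches += 1
    return matches, evaluations


def is_powershell(sdo):
    """Toy comparison: does any process object have name powershell.exe?"""
    return any(
        obj.get("type") == "process" and obj.get("name") == "powershell.exe"
        for obj in sdo["objects"].values()
    )


sdo = {
    "type": "observed-data",
    "number_observed": 1000,
    "objects": {"1": {"type": "process", "name": "powershell.exe"}},
}

# Same SDO, only number_observed differs, as in the two profiles above.
small = count_matches([sdo], is_powershell)
large = count_matches([dict(sdo, number_observed=10000)], is_powershell)
```

Under that assumption the number of predicate evaluations is exactly number_observed per pattern, so 50 patterns against the second SDO would mean 500000 evaluations of identical content.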

@clslgrnc
Contributor

clslgrnc commented Oct 6, 2020

One way to mitigate this issue would be to extend the work done in #64 for observation expressions to comparison expressions.
All exitComparisonExpression* methods should be modified to work with generators, so that obs_ids in the following is a generator:

def exitObservationExpressionSimple(self, ctx):
    """
    Consumes the results of the inner comparison expression. See
    exitComparisonExpression().
    Produces: a generator of 1-tuples of the IDs. At this stage, the root
    Cyber Observable object IDs are no longer needed, and are dropped.
    This is a preparatory transformative step, so that higher-level
    processing has consistent structures to work with (always generator of
    tuples).
    """
    debug_label = u"exitObservationExpression (simple)"
    obs_ids = self.__pop(debug_label)
    obs_id_tuples = ((obs_id,) for obs_id in obs_ids.keys())
    self.__push(obs_id_tuples, debug_label)

It might also be better to evaluate all exitComparisonExpression* without expanding the number_observed and only duplicate the observed-data in exitObservationExpressionSimple (if it makes sense).
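
A minimal sketch of that idea, assuming the comparison step can report each matching observed-data once together with its number_observed, and that duplication can be deferred to the observation step. The function names and the `(obs_id, count)` shape are hypothetical and do not reflect the matcher's internal structures. Whether deferring the expansion preserves the matcher's semantics in every case is not settled here.

```python
import itertools


def matching_observations(sdos, predicate):
    """Comparison step: lazily yield (obs_id, number_observed) per matching SDO.

    The predicate runs once per observed-data, not once per duplicate.
    """
    for obs_id, sdo in enumerate(sdos):
        if predicate(sdo):
            yield obs_id, sdo.get("number_observed", 1)


def expand_to_tuples(matches):
    """Observation step: lazily repeat each ID number_observed times as 1-tuples."""
    for obs_id, count in matches:
        yield from itertools.repeat((obs_id,), count)


predicate_calls = []


def has_process(sdo):
    """Toy comparison that also records how often it was evaluated."""
    predicate_calls.append(1)
    return any(o.get("type") == "process" for o in sdo["objects"].values())


sdos = [
    {"number_observed": 3, "objects": {"0": {"type": "process"}}},
    {"number_observed": 10000, "objects": {"0": {"type": "file"}}},
]

bindings = expand_to_tuples(matching_observations(sdos, has_process))
first_two = list(itertools.islice(bindings, 2))  # only the first SDO is examined
calls_after_first_two = len(predicate_calls)
remaining = list(bindings)                       # drains the rest lazily
```

Because both steps are generators, nothing is materialized until a consumer asks for it, and the comparison work is proportional to the number of observed-data objects rather than to the sum of their number_observed values.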

I probably won't have time to do it in the foreseeable future.

@dennispo
Contributor

dennispo commented Mar 14, 2021

When working with real low-level data (like data from EDRs or Sysmon) we are experiencing a huge performance degradation. As an example, there are 15754 observables generated out of 100 original observables. In another example, there are 189023 instances generated out of 300 original observables.

When comparing the two examples above with a version without instance duplication, the timings are as follows:

| Improvement | Time measured for 100 observed_data | % of improvement | Time measured for 300 observed_data | % of improvement |
| --- | --- | --- | --- | --- |
| basic | 0:04:59.641 | | 1:00:40.532 | |
| events deduplication | 0:00:06.043 | 97.98% | 0:00:15.945 | 99.56% |
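
For reference, the improvement percentages can be reproduced from the measured times as the relative reduction in run time. This is a small sketch; `parse_duration` and `improvement_percent` are helper names made up for the example.

```python
from datetime import timedelta


def parse_duration(text):
    """Parse an 'H:MM:SS.mmm' string into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def improvement_percent(before, after):
    """Relative reduction in run time, as a percentage rounded to 2 places."""
    b = parse_duration(before).total_seconds()
    a = parse_duration(after).total_seconds()
    return round((b - a) / b * 100, 2)


pct_100 = improvement_percent("0:04:59.641", "0:00:06.043")
pct_300 = improvement_percent("1:00:40.532", "0:00:15.945")
```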
