In [1]:
import json
import os

Our sandbox dataset contains traces from the **Tencent HABO** (Habo) and **VirusTotal Cuckoofork** (Cuckoo) sandboxes. In our leaderboard and paper, HABO is referred to as *Sandbox-1* and Cuckoo as *Sandbox-2*.

The `sandbox_dataset` directory contains two subdirectories: `Train` (included in both the **Train Data** and the **TrainAndTest Data**) and `Test` (included only in the **TrainAndTest Data**).

We split the samples in our dataset based on their `first_seen` timestamps. The traces under the `Train` subdirectory come from samples first seen before our cut-off date.

Let's first read the trace associated with the sample *93d2014facc5b1f68b04213605e517e28070f6c3e999c73c4e1b52e45a3739ff* from our **Train Data** (`MalwareITW_TrainData`). This repository includes a small set of traces (under `MalwareITW_TrainData_Sample`) to demonstrate the trace format and the structure of our sandbox dataset.

In [2]:
dataset_path = os.path.join('MalwareITW_TrainData_Sample', 'sandbox_dataset', 'Train', 'HABO')
sample_hash = '93d2014facc5b1f68b04213605e517e28070f6c3e999c73c4e1b52e45a3739ff'
with open(os.path.join(dataset_path, f'{sample_hash}.json'), 'r') as fp:
    trace = json.load(fp)
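
To read every trace in a sandbox directory rather than a single sample, a loop over the directory is enough. The following is a minimal sketch, assuming each file is named `<sample_hash>.json` as in the cell above; the helper name `load_traces` is hypothetical and not part of the dataset tooling.

```python
import json
import os


def load_traces(trace_dir):
    """Return {sample_hash: trace_dict} for every .json file in trace_dir.

    Hypothetical helper: assumes files are named '<sample_hash>.json'.
    """
    traces = {}
    for name in sorted(os.listdir(trace_dir)):
        stem, ext = os.path.splitext(name)
        if ext != '.json':
            continue  # skip anything that is not a trace file
        with open(os.path.join(trace_dir, name), 'r') as fp:
            traces[stem] = json.load(fp)
    return traces
```

Calling `load_traces(dataset_path)` with the `dataset_path` defined above would return a dictionary keyed by sample hash. For the full dataset, loading one file at a time may be preferable to holding every trace in memory.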

Each trace file is saved as a JSON dictionary that contains the following keys:

In [3]:
print(trace.keys())

dict_keys(['regs_created', 'regs_deleted', 'mutexes_created', 'processes_created', 'files_created', 'processes_injected'])


**'regs_created'** and **'regs_deleted'** correspond to the Windows registry key creation and deletion actions.

**'mutexes_created'**, **'processes_created'** and **'files_created'** contain the mutexes, processes and files created by the sample.

**'processes_injected'** contains the processes the sample injected into. This action type has vendor-specific definitions, so it may not be defined uniformly across sandboxes.

Each of these keys maps to a list of strings, where each string is an individual action performed by the sample. For example, let's see the registry keys created by this sample:

In [4]:
print(trace['regs_created'])

['registry\\machine\\software\\classes\\clsid\\<winguid>\\uets', 'registry\\machine\\software\\microsoft\\windows\\currentversion\\run\\trickler', 'registry\\machine\\software\\qwertyuio\\trickler\\apppath', 'registry\\machine\\software\\qwertyuio\\trickler\\oldtrickler']


If the sample performed no actions of a given type, the corresponding key maps to an empty list. For example:

In [5]:
print(trace['processes_injected'])

[]
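
The format described so far (six keys, each mapping to a possibly empty list of strings) can be checked before a trace is passed on. This is a minimal sketch under that assumption; `EXPECTED_KEYS` and `check_trace` are hypothetical names introduced for illustration.

```python
# The six action types printed by trace.keys() above.
EXPECTED_KEYS = (
    'regs_created', 'regs_deleted', 'mutexes_created',
    'processes_created', 'files_created', 'processes_injected',
)


def check_trace(trace):
    """Return a list of problems found in a trace dict (empty if none)."""
    problems = []
    for key in EXPECTED_KEYS:
        if key not in trace:
            problems.append(f'missing key: {key}')
            continue
        actions = trace[key]
        is_str_list = isinstance(actions, list) and all(
            isinstance(action, str) for action in actions
        )
        if not is_str_list:
            problems.append(f'{key} is not a list of strings')
    return problems
```

An empty return value means the trace matches the layout shown in this notebook. An empty list under a key is valid, as in the `processes_injected` example.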


This summarizes the standardized trace format that your feature processing routine should accept as input. 
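
As an illustration only, a feature processing routine that accepts this format could start from per-type action counts and a flat list of tokens. The sketch below is one possible starting point, not the method used in our paper; the names `ACTION_TYPES`, `count_actions` and `trace_to_tokens` are hypothetical.

```python
# The six action types present in every trace file.
ACTION_TYPES = (
    'regs_created', 'regs_deleted', 'mutexes_created',
    'processes_created', 'files_created', 'processes_injected',
)


def count_actions(trace):
    """Map each action type to the number of actions recorded for it."""
    return {key: len(trace.get(key, [])) for key in ACTION_TYPES}


def trace_to_tokens(trace):
    """Flatten a trace into 'action_type:action' strings.

    Prefixing each action with its type keeps, for example, a created
    registry key distinct from the same key being deleted.
    """
    tokens = []
    for key in ACTION_TYPES:
        for action in trace.get(key, []):
            tokens.append(f'{key}:{action}')
    return tokens
```

For the sample loaded above, `count_actions(trace)` would report 4 for `regs_created` and 0 for `processes_injected`, matching the printed outputs. The token list can then be fed to any bag-of-words style vectorizer.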