In [2]:
import os
import pickle

The **hash_metadata** file contains the metadata for each hash (sample) collected for our datasets. Let's load it from the pickle file.

In [3]:
with open(os.path.join('metadata', 'hash_metadata.pkl'), 'rb') as fp:
    hash_metadata = pickle.load(fp)

Let's see the metadata for the sample with hash: *8292d65c4e38d3ba09cd6672b8646489728152a4f9c90152a03557661455665b*

In [4]:
hash_metadata['8292d65c4e38d3ba09cd6672b8646489728152a4f9c90152a03557661455665b']

{'first_seen': 1519738688,
 'publisher': 'NO SIGNATURE',
 'vhash': '015076657d1d7d7d7557z11z93z1hz8fz',
 'tlsh': 'none',
 'ssdeep': '3072:VD2fW6PW+2pTdPRRKYs87NBLJANI1vHJPRySANOQy/:Vd6PW+iT5RRBttHJKNO7',
 'dataset_name': ['ep', 'ember'],
 'new': {'fam': 'emotet',
  'label': 1,
  'scan_date': 1598111382,
  'num_detections': 58,
  'label_source': 'latest'},
 'old': {'label': 1,
  'scan_date': 1519738688,
  'num_detections': -1,
  'label_source': 'ember',
  'fam': 'emotet'},
 'file_names': ['myfile.exe',
  '6IAK4ESCR.EXE',
  '8292d65c4e38d3ba09cd6672b8646489728152a4f9c90152a03557661455665b-75482.bin',
  '1407.exe',
  '09256.exe']}

* **'first_seen'**: The epoch timestamp (in seconds) of the date the sample was first seen on VirusTotal (VT).
* **'publisher'**: The publisher information for the sample, extracted from its VT report.
* **'vhash'**: The vhash hash of the binary file, included in its VT report.
* **'tlsh'**: The tlsh hash of the binary file, included in its VT report.
* **'ssdeep'**: The ssdeep hash of the binary file, included in its VT report.
* **'dataset_name'**: The list of datasets in which this sample was seen. Because we merge multiple datasets (SOREL, EMBER and VirusTotal), the same sample can appear in more than one. The dataset name *'ep'* corresponds to samples seen in our endpoint dataset.
* **'file_names'**: The list of file names this sample had in its submissions to VT.
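
Since **'first_seen'** is stored as an epoch timestamp, it can help to convert it into a readable date. A minimal sketch using only the standard library, assuming the values are in seconds and should be read as UTC (the helper name `epoch_to_utc` is ours and not part of the dataset):

```python
from datetime import datetime, timezone

def epoch_to_utc(ts):
    """Convert an epoch timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# 'first_seen' value of the sample shown above
first_seen = epoch_to_utc(1519738688)
print(first_seen.isoformat())  # 2018-02-27T13:38:08+00:00
```

The same helper should work for the **'scan_date'** values inside the label records, since they appear to use the same format.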

There are also two sets of labels in our metadata from two different timestamps.

* **'old'** contains the label information from an older VirusTotal report (if available), and **'new'** contains information from a newer report.

Each label record contains:

* **'fam'**: The malware family tag of the sample (it is simply 'BENIGN' if the sample is benign).
* **'label'**: The ground truth label we assigned to this sample (if the number of detections is over 5, the sample is labeled as malware).
* **'scan_date'**: The timestamp of the detection report we used to label the sample.
* **'num_detections'**: The number of AV engines on VirusTotal that detected this sample as malware. If it is -1, our labeling source was not VirusTotal.
* **'label_source'**: This indicates where the label came from. For example, *latest_copied* as the **'old'** label source means that we didn't have an older detection report for this sample.
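
To make the `-1` sentinel explicit when reading a label record, a small helper along these lines can be used. This is only a sketch: `describe_label` is a hypothetical name, and the record below copies the **'old'** entry of the sample shown earlier.

```python
def describe_label(record):
    """Summarize one label record ('old' or 'new') as a short string."""
    n = record['num_detections']
    # -1 is the sentinel meaning the label did not come from VirusTotal
    detections = 'n/a (non-VT source)' if n == -1 else str(n)
    return (f"label={record['label']}, fam={record['fam']}, "
            f"detections={detections}, source={record['label_source']}")

# Copy of the 'old' record from the example sample above
old_record = {'label': 1, 'scan_date': 1519738688, 'num_detections': -1,
              'label_source': 'ember', 'fam': 'emotet'}
summary = describe_label(old_record)
print(summary)
```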

Now let's find samples in our endpoint dataset.

In [5]:
endpoint_hashes = [h for h, v in hash_metadata.items() if 'ep' in v['dataset_name']]
print(len(endpoint_hashes))

19213
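
The same membership test works for any other dataset name. As a sketch, the snippet below counts how many samples fall into each dataset with a `Counter`. The two-entry `toy_metadata` dictionary and its keys are made up so that the example runs on its own; with the real data you would pass `hash_metadata` instead. Apart from *'ep'* and *'ember'*, which appear in the sample above, the exact dataset name strings are not shown here, so a count like this is a reasonable way to check them before filtering.

```python
from collections import Counter

# Made-up stand-in for hash_metadata (illustrative only)
toy_metadata = {
    'hash_a': {'dataset_name': ['ep', 'ember']},
    'hash_b': {'dataset_name': ['ep']},
}

def count_datasets(metadata):
    """Count how many samples list each dataset name."""
    counts = Counter()
    for v in metadata.values():
        counts.update(v['dataset_name'])
    return counts

dataset_counts = count_datasets(toy_metadata)
print(dataset_counts)
```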


If you would like to collect the binary labels and families for all our endpoint hashes, you can do as follows:

In [6]:
new_labels = []
old_labels = []
new_families = []
old_families = []
publishers = []

for h in endpoint_hashes:
    metadata = hash_metadata[h]
    new_labels.append(metadata['new']['label'])
    old_labels.append(metadata['old']['label'])
    new_families.append(metadata['new']['fam'])
    old_families.append(metadata['old']['fam'])
    publishers.append(metadata['publisher'])
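
Because each sample carries both an old and a new label, the same loop pattern can surface samples whose label changed between the two reports. A hedged sketch on made-up data (`toy_metadata`, `hash_a`, `hash_b` and `changed_labels` are illustrative names, not part of the dataset):

```python
# Made-up stand-in for hash_metadata (illustrative only)
toy_metadata = {
    'hash_a': {'old': {'label': 0, 'fam': 'BENIGN'},
               'new': {'label': 1, 'fam': 'emotet'}},
    'hash_b': {'old': {'label': 1, 'fam': 'emotet'},
               'new': {'label': 1, 'fam': 'emotet'}},
}

def changed_labels(metadata, hashes):
    """Return {hash: (old_label, new_label)} for samples whose label differs."""
    changed = {}
    for h in hashes:
        old = metadata[h]['old']['label']
        new = metadata[h]['new']['label']
        if old != new:
            changed[h] = (old, new)
    return changed

result = changed_labels(toy_metadata, list(toy_metadata))
print(result)  # {'hash_a': (0, 1)}
```

With the real data, the call would be `changed_labels(hash_metadata, endpoint_hashes)`.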

You can obtain the malware family priors and benign publisher priors in the endpoint samples as follows (Label=1 is malware and Label=2 indicates PUP, assigned based on AVClass):

In [7]:
from collections import Counter
import numpy as np

print('Label Counts in our Endpoint Samples Based on Old Labels:')
print(Counter(old_labels))

print('Label Counts in our Endpoint Samples Based on New Labels:')
print(Counter(new_labels))

print('Most common malware families in our Endpoint Samples Based on Old Labels:')
print(Counter(np.asarray(old_families)[np.where(np.asarray(old_labels) == 1)]).most_common(10))

print('Most common malware families in our Endpoint Samples Based on New Labels:')
print(Counter(np.asarray(new_families)[np.where(np.asarray(new_labels) == 1)]).most_common(10))

print('Most common publishers in our Endpoint Samples:')
print(Counter(publishers).most_common(10))

Label Counts in our Endpoint Samples Based on Old Labels:
Counter({0: 16541, 2: 2159, 1: 513})
Label Counts in our Endpoint Samples Based on New Labels:
Counter({0: 16520, 2: 2165, 1: 528})
Most common malware families in our Endpoint Samples Based on Old Labels:
[('GENERIC_MAL', 201), ('wannacry', 39), ('loadmoney', 38), ('emotet', 27), ('chindo', 9), ('vools', 9), ('cnopa', 8), ('high', 7), ('installcore', 6), ('nymeria', 5)]
Most common malware families in our Endpoint Samples Based on New Labels:
[('GENERIC_MAL', 215), ('loadmoney', 40), ('wannacry', 39), ('emotet', 27), ('chindo', 11), ('vools', 9), ('cnopa', 8), ('gandcrab', 7), ('installcore', 6), ('agentb', 5)]
Most common publishers in our Endpoint Samples:
[('NO SIGNATURE', 6634), ('Tencent Technology(Shenzhen) Company Limited', 390), ('Qihoo 360 Software (Beijing) Company Limited', 333), ('Sogou.com', 309), ('IObit Information Technology', 164), ('LANDesk Software, Inc.', 161), ('Beijing Kingsoft Security software Co.,Ltd', 