# Data Source Investigation

Let's use the [ATT&CK Python Client](https://github.com/hunters-forge/ATTACK-Python-Client) to manually examine the techniques, list the data sources, and build a heatmap out of our selected sources.

If you'd rather write less code, or want a more in-depth and fine-grained analysis, check out:

* [DeTT&CT](https://github.com/rabobank-cdc/DeTTECT)
* [ATTACKdatamap](https://github.com/olafhartong/ATTACKdatamap)

*Consider: What have you used to track data sources? What has worked well, and what has not worked so well?*

In [None]:
# Import the packages we'll need

# Some basic python and jupyter stuff
from collections import defaultdict
import json
from IPython.display import FileLink, FileLinks

# Visualization and data libraries
import altair as alt
import pandas as pd

# ATT&CK Python Client, by @HuntersForge (https://github.com/hunters-forge/ATTACK-Python-Client)
from attackcti import attack_client

# Because this is in Jupyter notebooks we need to enable that renderer for the altair charts to work
alt.renderers.enable('notebook')

## Get the ATT&CK Enterprise techniques using the client library

In [None]:
client = attack_client()
all_techniques = client.get_enterprise()['techniques'] # Note - this takes a few seconds to download and parse

print("Got {} techniques".format(len(all_techniques)))

## Analyze the data sources and build a chart to understand the most valuable ones

We'll build up a dictionary that counts data sources by the number of techniques they can help detect.

In [None]:
# Collect unique data sources from the techniques
techniques_by_source = defaultdict(lambda: {'count': 0, 'techniques': []})

# Loop through all techniques, then through all data sources on that technique
for technique in all_techniques:
    for ds in technique.get('x_mitre_data_sources', []):
        techniques_by_source[ds]['count'] += 1
        # External_ID in the first external reference is the T#### number
        techniques_by_source[ds]['techniques'].append(technique.external_references[0].external_id)
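
To see the shape of the resulting dictionary without downloading ATT&CK, here is a minimal sketch that runs the same loop over three made-up technique records. The plain dicts, the `T000x` IDs, and the `sample_` names are illustrative assumptions; the real objects returned by the client carry many more fields.

```python
from collections import defaultdict

# Hypothetical, hand-written stand-ins for ATT&CK technique objects
sample_techniques = [
    {'external_references': [{'external_id': 'T0001'}],
     'x_mitre_data_sources': ['Process monitoring', 'File monitoring']},
    {'external_references': [{'external_id': 'T0002'}],
     'x_mitre_data_sources': ['Process monitoring']},
    {'external_references': [{'external_id': 'T0003'}]},  # no data sources listed
]

sample_by_source = defaultdict(lambda: {'count': 0, 'techniques': []})

for technique in sample_techniques:
    # .get() with a default keeps techniques without data sources from raising
    for ds in technique.get('x_mitre_data_sources', []):
        sample_by_source[ds]['count'] += 1
        sample_by_source[ds]['techniques'].append(
            technique['external_references'][0]['external_id'])

for source, info in sorted(sample_by_source.items()):
    print(source, info['count'], info['techniques'])
```

Each key is a data source name, and each value holds both the count and the list of technique IDs, which is what the later cells rely on.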

In [None]:
# Create a pandas dataframe out of that result and show count column for the top 15
df = pd.DataFrame.from_dict(techniques_by_source, orient='index', columns=['count', 'techniques']).rename_axis('source')
top_15 = df.sort_values('count', ascending=False)[0:15]
top_15[['count']]

## Show the chart in altair

Altair can be used to easily turn pandas dataframes into visualizations. In this case, we just show a horizontal bar chart, sorted by count, that you can scan.

In [None]:
# reset_index() turns the 'source' index into a regular column that Altair can encode
alt.Chart(df.reset_index().sort_values('count', ascending=False)).mark_bar().encode(
    y=alt.Y(
        'source',
        sort=alt.EncodingSortField(
            field="count",
            order="descending"
        )
    ),
    x='count'
)

## Advanced Filtering (BONUS)

How would you alter this chart to only consider some techniques? Perhaps (peeking ahead) we have a list of threat actors or techniques we want to prioritize. Can you generate a chart that prioritizes techniques used by APT1 or APT3?

In [None]:
# TODO: Your code to show a similar chart for APT1 and APT3
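
One possible starting point, sketched under assumptions: if you already have the set of technique IDs you care about, the counting loop only needs one extra membership check. The helper `count_sources_for`, the sample records, and the placeholder IDs below are all hypothetical. Looking up which techniques APT1 or APT3 actually use is not shown here and would have to come from the group relationships in ATT&CK.

```python
from collections import Counter

def count_sources_for(techniques, wanted_ids):
    """Count data sources, considering only techniques whose ID is in wanted_ids."""
    counts = Counter()
    for technique in techniques:
        technique_id = technique['external_references'][0]['external_id']
        if technique_id in wanted_ids:
            counts.update(technique.get('x_mitre_data_sources', []))
    return counts

# Hypothetical stand-ins for technique objects and for a group's technique IDs
sample_techniques = [
    {'external_references': [{'external_id': 'T0001'}],
     'x_mitre_data_sources': ['Process monitoring', 'File monitoring']},
    {'external_references': [{'external_id': 'T0002'}],
     'x_mitre_data_sources': ['Netflow']},
    {'external_references': [{'external_id': 'T0003'}],
     'x_mitre_data_sources': ['Process monitoring']},
]
priority_ids = {'T0001', 'T0003'}

priority_counts = count_sources_for(sample_techniques, priority_ids)
print(priority_counts.most_common())
```

The resulting counts can be turned into a dataframe and charted the same way as the full set.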

## Building a Heatmap

But what if you already collect certain data sources and don't have the flexibility to add others? That's the case for our exercise! How do you know which techniques you can detect with what you have?

We can generate a heatmap based on the data we created earlier. To do that, we map the data sources we know we have onto the data source names that ATT&CK uses.

In [None]:
# First, list the data sources alphabetically so we can figure out which ones we have

df.sort_index()[['count']]

### List the data sources that are available

In the list below, add the data sources that we have available in BRAWL. As a reminder, we have:
* Sysmon
* Windows event logs (common security, authentication, and audit logs)

In [None]:
# Case sensitive!!!
sources_we_have = [
    '' # e.g. 'Web proxy'
]

### Calculate the techniques for which we have some detection capability

In [None]:
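# Count how many of our data sources can help detect each technique
techniques = defaultdict(int)

for ds in sources_we_have:
    # Check membership first: indexing the defaultdict would silently add unknown names
    if ds not in techniques_by_source:
        print("Unknown data source: {!r}".format(ds))
        continue
    for technique in techniques_by_source[ds]['techniques']:
        techniques[technique] += 1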

In [None]:
print("You can detect {} out of {} techniques".format(len(techniques), len(all_techniques)))
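
As a small worked example of the coverage count, the sketch below uses a made-up mapping from data source to technique IDs. The source names, IDs, and the `sample_` variables are assumptions for illustration only. A technique seen through two sources ends up with a count of 2, and a source name that does not appear in the mapping contributes nothing.

```python
from collections import Counter

# Hypothetical mapping from data source to the technique IDs it helps detect
sample_by_source = {
    'Process monitoring': ['T0001', 'T0002'],
    'Windows event logs': ['T0002', 'T0003'],
    'Netflow': ['T0004'],
}
# 'Web proxy' is deliberately absent from the mapping above
sample_sources_we_have = ['Process monitoring', 'Windows event logs', 'Web proxy']

coverage = Counter()
for ds in sample_sources_we_have:
    # .get() skips names that are missing or misspelled instead of raising
    coverage.update(sample_by_source.get(ds, []))

total = 4  # number of distinct techniques in this toy data set
print("{} of {} techniques covered".format(len(coverage), total))
```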

### Generate the heatmap

In [None]:
# Generate that heatmap!

def technique_score(t):
    if t not in techniques:
        return 0.0
    elif techniques[t] > 1:
        return 1.0
    else: # count of sources == 1
        return 0.5

heatmap = {
    'version': "2.1",
    'name': 'Detection Possibilities',
    'domain': "mitre-enterprise",
    'showTacticRowBackground': True,
    'gradient': {
        'colors': [
            '#ffffff',
            '#66b1ff'
        ],
        'minValue': 0.0,
        'maxValue': 1.0
    },
    'techniques': [{'techniqueID': t, 'score': technique_score(t)} for t in techniques]
}

In [None]:
# Write as a JSON file and show a download link
with open('data_sources.json', 'w') as f:
    json.dump(heatmap, f)

FileLink('data_sources.json')
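
Before loading a layer file into the Navigator, it can help to read it back and summarize the scores. The sketch below does this with a tiny made-up layer written to a temporary directory. The technique IDs, counts, and the reduced set of layer fields are assumptions, and nothing here validates the file against the Navigator's layer format.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical number of available data sources per technique
sample_counts = {'T0001': 1, 'T0002': 2, 'T0003': 3}

def score(count):
    # Same thresholds as technique_score: no source 0.0, one source 0.5, more 1.0
    if count == 0:
        return 0.0
    return 1.0 if count > 1 else 0.5

sample_layer = {
    'name': 'Sample layer',
    'techniques': [{'techniqueID': t, 'score': score(c)}
                   for t, c in sample_counts.items()],
}

# Write the layer to a temporary file, then load it back
path = os.path.join(tempfile.mkdtemp(), 'sample_layer.json')
with open(path, 'w') as f:
    json.dump(sample_layer, f)

with open(path) as f:
    loaded = json.load(f)

# Count how many techniques landed on each score
score_summary = Counter(entry['score'] for entry in loaded['techniques'])
print(dict(score_summary))
```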

# Overlaying Priorities with Data Sources

The reason we collect data is of course to help us detect attacks, so let's see how the data that we've collected measures up.

How would you do this?

How would you show gaps in data source coverage?
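
One way to approach the gap question, sketched with made-up data: treat the prioritized techniques and the detectable techniques as sets, take the difference, and then count which missing data sources would close the most gaps. All names and IDs below are hypothetical placeholders rather than real ATT&CK content.

```python
from collections import Counter

# Hypothetical inputs: techniques we care about, and techniques we can already see
priority_techniques = {'T0001', 'T0002', 'T0005'}
detectable_techniques = {'T0001': 2, 'T0003': 1}

# Hypothetical lookup from technique ID to the data sources that could detect it
sources_for_technique = {
    'T0002': ['Netflow'],
    'T0005': ['Netflow', 'Packet capture'],
}

# Gaps are prioritized techniques with no available data source
gaps = sorted(priority_techniques - set(detectable_techniques))

# Rank missing data sources by how many gaps each one would help close
wishlist = Counter()
for technique_id in gaps:
    wishlist.update(sources_for_technique.get(technique_id, []))

print("Gaps:", gaps)
print("Most useful additions:", wishlist.most_common())
```

The same gap list could also be written out as a second heatmap layer, so covered and uncovered priorities can be compared side by side.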