In [2]:
from bokeh.io import output_notebook
output_notebook()

# Introduction

The purpose of this notebook is to document research and analysis done on the Common Attack Pattern Enumeration and Classification (CAPEC) for the ultimate goal of creating a corpus for PERCEIVE.  

CAPEC displays its data in two formats: the [CAPEC Website](http://capec.mitre.org/) and the CAPEC XML file.  The CAPEC 2.9 XML file used here and its accompanying XML Schema Documentation are both available for download on the [CAPEC website](http://capec.mitre.org/data/index.html) under "Release Downloads."
The representation of the data on the [website](http://capec.mitre.org/data/definitions/1000.html) is easier to navigate and make sense of than the XML file.  The website's interface for the Views lets you explore the hierarchical relationships by expanding and collapsing them with the (+) and (-) buttons. For this reason, we use the website to gather general information, but rely on the XML file, which contains the bulk of the information in a convenient machine-readable form, to extract the data itself.

After initial examination of the file, we found that the XML contained four root nodes: Views, Categories, Attack Patterns, and Environments.  Each of these root nodes contained subnodes, which we refer to as individual **Entries**.  These Entries have identification numbers and contain numerous subnodes of their own, which we call **Fields**.  These Fields contain the organized information regarding the Entry and are the main focus of our investigation.  
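
The Entry and Field terminology above can be illustrated with a minimal sketch. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without the CAPEC file, and the sample XML is a hypothetical, heavily simplified stand-in (the element names and IDs are assumptions for illustration, not a copy of the real schema).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the CAPEC XML layout.
SAMPLE = """
<Attack_Pattern_Catalog>
  <Views><View ID="1000"/></Views>
  <Categories><Category ID="100"/></Categories>
  <Attack_Patterns>
    <Attack_Pattern ID="1">
      <Typical_Severity>High</Typical_Severity>
      <Payload>example</Payload>
    </Attack_Pattern>
  </Attack_Patterns>
  <Environments/>
</Attack_Pattern_Catalog>
"""

sample_root = ET.fromstring(SAMPLE)

# Map each root node to its Entries, and each Entry ID to its Field names.
structure = {
    section.tag: {entry.get("ID"): [field.tag for field in entry] for entry in section}
    for section in sample_root
}

for section, entries in structure.items():
    print(section, entries)
```

Each key of `structure` corresponds to one root node, each inner key to an Entry's identification number, and each list to that Entry's Fields.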

The following image attempts to provide an overview of the XML representation of CAPEC.  It visualizes the Root Nodes and the observed Fields used within the entries of those Nodes.  

![](capec_simplified_xml_schema.png)

Note that Environments is linked to the Attack Execution Flow Field used by Attack Pattern Entries.  This is because this Field uses information noted in the Environments Node.  The information in Environments does not appear to be used for anything else.  [This should be investigated further]  After further analysis of the XML file, we found that the four root nodes had a noticeable hierarchical relationship, which is visualized in the diagram below.

## Motivation

As mentioned previously, the CAPEC Website is significantly easier to navigate than the XML file.  However, the website does not document the hierarchical rules explicitly.  As a result, we observed the hierarchical relationships and created the following diagram to provide such documentation to aid us in solidifying our understanding of those rules.
![](types_of_node_and_their_relationships_based_on_mechanisms_and_domains_of_attack_views.png)

CAPEC entries may have relationships among themselves based on Views, which comprise the highest hierarchical level, as well as relationships to other entries in other levels.

The two Views are Mechanisms of Attack and Domains of Attack.  Category Entries have *MemberOf* relationships to these Views and are separated according to whether they pertain to the mechanisms employed in exploiting a vulnerability or the domains in which the attacks are perpetrated.

Below Category Entries are the Attack Pattern Entries.  It is important to note that there are three types of Attack Pattern Entries: Meta, Standard, and Detailed.  These three terms refer to the level of abstraction of a particular Attack Pattern Entry.  

Meta Attack Pattern Entries are directly below Category Entries in the hierarchy and have the *MemberOf* relationship to these Categories.  As the Categories are ways of sorting Attack Patterns, a given Meta Attack Pattern Entry will be a *MemberOf* two categories, one for each View.  Meta Attack Pattern Entries have *Child* nodes that can be either Standard or Detailed Attack Patterns.  These two abstraction types of Attack Patterns do not have a relationship to the Categories.  Standard Attack Pattern Entries may also have their own Child, which will always be a Detailed Attack Pattern.
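
The parent/child rules described above can be written down as a small lookup table. This is only a sketch of our observations: the names `ALLOWED_CHILDREN` and `is_valid_child` are hypothetical, and the empty set for Detailed patterns is an assumption, since we have only observed that Detailed patterns sit at the bottom of the hierarchy.

```python
# Child abstraction levels observed for each parent abstraction level.
ALLOWED_CHILDREN = {
    "Meta": {"Standard", "Detailed"},
    "Standard": {"Detailed"},
    "Detailed": set(),  # assumption: Detailed patterns have no children
}


def is_valid_child(parent_abstraction, child_abstraction):
    """Return True if the parent/child pairing matches the observed rules."""
    return child_abstraction in ALLOWED_CHILDREN.get(parent_abstraction, set())


print(is_valid_child("Meta", "Standard"))
print(is_valid_child("Standard", "Meta"))
```

A check like this could be run over the relationships in the XML to flag any Entry that contradicts the diagram.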

Given that Views and Categories are primarily methods of organizing Attack Patterns, we are specifically interested in the Attack Patterns and the Fields that they contain.  To prepare for extracting information from the text within the Attack Pattern Fields, we must first determine which Fields appear most often, whether the most frequent Fields actually contain the most important or relevant information, and how to extract from the XML the information needed to create a corpus.

# Parsing the XML File

As noted in the introduction, we must determine which fields are the most frequently used among Attack Pattern Entries.  The following Python script uses a list of Fields used by the XML which was created through examining the XML's schema documentation and counts the Fields mentioned to return their frequencies in a dictionary.

We encountered a special case within the XML representation where the fields **Summary** and **Attack Execution Flow** were under a container **Description** and would not be counted, despite appearing as unique fields on the HTML representation.  Although there was a **Summary** field in every **Description**, this was not the case for **Attack Execution Flow** and provided an inaccurate representation of the data.  As such, the script takes the **Description** tag as a special case and extracts its children instead.  
![](attack_flow_strange_column.png)

This appears to be a singular case, but in case MITRE repeats this format in the future, keep in mind that these special cases will have to be manually added to the script.
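
To make the effect of the special case concrete, the sketch below counts fields in a tiny hypothetical sample with and without expanding the container. It uses the standard library's `xml.etree.ElementTree` so it runs without the CAPEC file. The `containers` parameter is an assumption introduced here to show how further special cases could be added by hand.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical two-entry sample; only the first entry has an execution flow.
SAMPLE = """
<Attack_Patterns>
  <Attack_Pattern ID="1">
    <Description>
      <Summary>first</Summary>
      <Attack_Execution_Flow>steps</Attack_Execution_Flow>
    </Description>
    <Typical_Severity>High</Typical_Severity>
  </Attack_Pattern>
  <Attack_Pattern ID="2">
    <Description>
      <Summary>second</Summary>
    </Description>
  </Attack_Pattern>
</Attack_Patterns>
"""


def count_fields(patterns, containers=("Description",)):
    """Count field tags, replacing each container tag by its children."""
    counts = Counter()
    for entry in patterns:
        for field in entry:
            if field.tag in containers:
                counts.update(child.tag for child in field)
            else:
                counts[field.tag] += 1
    return counts


patterns = ET.fromstring(SAMPLE)
naive = count_fields(patterns, containers=())  # no special cases
expanded = count_fields(patterns)              # Description expanded
print(naive)
print(expanded)
```

Without the special case, both entries appear to have one Description field. With it, the counts show that Summary occurs in both entries while Attack Execution Flow occurs in only one.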

In [10]:
def extract_label(node):
    """Count the tag of a field node in the global `frequencies` dictionary.

    Description is a container, so its children (e.g. Summary and
    Attack Execution Flow) are counted instead of the container itself.
    """
    nodes = node if node.tag == 'Description' else [node]
    for label in nodes:
        frequencies[label.tag] = frequencies.get(label.tag, 0) + 1


##### Single Cooccurence logic - test
#for _ in root[2]:
 #   for column in _:
  #          if column.tag  == 'Description':
    #            for column in _:
   #                 if column.tag == 'References':
     #                   pass
      #              else:
       #                 pass
#####

import lxml.etree
tree = lxml.etree.parse('capec2.9.xml')
root = tree.getroot()

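# Remove namespaces from XML tag names
for elem in root.iter():
    # Comments and processing instructions have non-string tags; skip them
    if not hasattr(elem.tag, 'find'):
        continue
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]

# Count fields across all Attack Pattern entries (the third root node)
frequencies = {}
for entry in root[2]:
    for column in entry:
        extract_label(column)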
        


## Plotting the Frequencies

To better visualize the counts returned by parsing the XML file, the following script uses the data stored in the dictionary created previously to plot a histogram.

## Histogram of Field Frequencies

In [None]:
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import Range1d

data = {'Entries': frequencies}

# Sort fields by frequency so the bars appear in ascending order
df_data = pd.DataFrame(data).sort_values(by='Entries', ascending=True)
series = df_data.loc[:, 'Entries']

p = figure(width=800, y_range=series.index.tolist(), title="Attack Pattern Histogram")

p.xaxis.axis_label = 'Frequency'
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '8pt'

p.yaxis.axis_label = 'Field'
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '8pt'

# Draw one horizontal bar per field. Bars start at x=0 and are matched to the
# categorical y-axis by field name rather than by position.
p.hbar(y=series.index.tolist(), right=series.tolist(), height=0.4)

show(p)


# Examination of Frequent Fields

Now that we know the frequencies of fields in the Attack Pattern table, our next step toward extracting the information from the fields is to determine each field's structure and, subsequently, our intended method of text extraction.  For this purpose, we set the line of demarcation at 50 instances and investigated the layout of fields that occurred at least 50 times.  There were 24 fields that fit this criterion and 7 fields that did not.  The 7 fields that occur fewer than 50 times will likely have to be included in our scope of inquiry at a later time, since rarity could indicate greater rather than lesser importance, but for now we have targeted the most frequent fields.  Our findings are reported in the table below.  Type names are subject to change, and some minor differences that have little effect on the extraction method have been omitted for initial grouping purposes.
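
The 50-instance cutoff can be applied to the frequency dictionary with a short helper. This is a sketch: `split_by_threshold` is a hypothetical name, and the counts in `example_counts` are made-up values for illustration, not results from the CAPEC file.

```python
def split_by_threshold(field_counts, threshold=50):
    """Split a {field: count} mapping into frequent and rare fields."""
    frequent = {f: n for f, n in field_counts.items() if n >= threshold}
    rare = {f: n for f, n in field_counts.items() if n < threshold}
    return frequent, rare


# Made-up counts, used only to show the behaviour at the boundary.
example_counts = {"Summary": 500, "Payload": 50, "Obfuscation_Techniques": 49}
frequent, rare = split_by_threshold(example_counts)
print(sorted(frequent))
print(sorted(rare))
```

Keeping the rare fields in a separate dictionary, rather than discarding them, makes it easy to revisit them later.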


|Type | Description |Fields|Example|
|:----:|:----------:|:----:|:--------:|
|General Description |Contains one to a few sentences|Injection Vector, Payload, Payload Activation Impact, Examples-Instances, Probing Techniques |Ability to communicate synchronously or asynchronously with server. Optionally, ability to capture output directly through synchronous communication or other method such as FTP. |
|Table |Contains rows and columns.  Columns are qualities, rows are individual items  |Content History, Attack Motivation-Consequences, CIA Impact, Related Attack Patterns, Related Weaknesses, Technical Context|![](TechnicalContext.png) |
|Single Word or Two Word Descriptor|Low, Medium, High, or Very High | Typical Severity| Medium|
|Labeled Descriptor|Phrase label followed by a level descriptor|Attacker Skills or Knowledge Required|Skill or Knowledge Level: Low|
|Labeled Descriptor with Potential Explanation|Single word label sometimes followed by an explanation (27 instances of the explanation tag in the XML)|Typical Likelihood of Exploit|Likelihood: Low <br> The nature of these type of attacks involve a coordinated effort between well-funded multiple attackers, and sometimes require physical access to successfully complete an attack. As a result these types of attacks are not launched on a large scale against any potential victim, but are typically highly targeted against victims who are often targeted and may have rather sophisticated cyber defenses already in place.| 
|Unordered List|List using bullets|Attack Prerequisites, Methods of Attack, Purposes, Related Security Principles|<ul style="list-style: none"><li>• Injection</li><li>• Protocol Manipulation</li></ul>|
|Citation| Citation format | References|[R.13.2] [REF-3] "Common Weakness Enumeration (CWE)". CWE-20 - Input Validation. Draft. The MITRE Corporation. 2007. <http://cwe.mitre.org/data/definitions/20.html>.|
|Unbulleted List with Qualified Entries| List with no bullets, frequently has entries that start with a type. | Solutions and Mitigations |To mitigate this type of an attack, an organization can monitor incoming packets and look for patterns in the TCP traffic to determine if the network is under an attack. The potential target may implement a rate limit on TCP SYN messages which would provide limited capabilities while under attack. <br> **OR** <br> Design: Limit program privileges, so if metacharacters or other methods circumvent program input validation routines and shell access is attained then it is not running under a privileged account. chroot jails create a sandbox for the application to execute in, making it more difficult for an attacker to elevate privilege even in the case that a compromise has occurred.<br>Implementation: Implement an audit log that is written to a separate host, in the event of a compromise the audit log may be able to provide evidence and details of the compromise.|
|Numbered List and Tables |Numbers and contains a table for each _Attack Step_| Attack Execution Flow|![](AttackExecutionFlow.png) |
|Unbulleted List with Single Table |List items are qualifiers.  Last tag contains a table | Target Attack Surface | ![](TargetAttackSurface.png) |
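
As a first approximation, two of the simpler types in the table (single descriptors and lists) could be extracted with helpers like the ones below. This is a sketch under assumptions: the sample element names are hypothetical, the helper names `field_text` and `list_items` are our own, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs on its own.

```python
import xml.etree.ElementTree as ET


def field_text(field):
    """Join all text inside a field and collapse runs of whitespace."""
    return " ".join("".join(field.itertext()).split())


def list_items(field):
    """Return the flattened text of each direct child, one string per item."""
    return [field_text(item) for item in field]


# Hypothetical sample entry with a descriptor field and a list field.
SAMPLE = """
<Attack_Pattern>
  <Typical_Severity>Medium</Typical_Severity>
  <Methods_of_Attack>
    <Method_of_Attack>Injection</Method_of_Attack>
    <Method_of_Attack>Protocol   Manipulation</Method_of_Attack>
  </Methods_of_Attack>
</Attack_Pattern>
"""

entry = ET.fromstring(SAMPLE)
severity = field_text(entry.find("Typical_Severity"))
methods = list_items(entry.find("Methods_of_Attack"))
print(severity)
print(methods)
```

The table-like and mixed types would need dedicated handling, since flattening them loses the row and column structure.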


# Investigating the Content Matter of Fields

|Field| Description|CAPEC Example|
|:----:|:---------:|:-----------------:|
|Attack Motivation-Consequences| The specific desired technical results that the attacker is hoping to achieve, which could be leveraged to achieve their end objective|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Content History| Identifies the contributor and contributor's comments.  Provides a means of contacting the authors and modifiers for clarification, merging contributions, etc.|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Summary|Provides a summary description of the attack that includes the attack target and sequence of steps|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Related Attack Patterns| Contains attack patterns that are dependent on or applied in conjunction with the current attack pattern|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Typical Severity| Reflects the typical severity of an attack on a scale.  Used to capture an overall typical average value for the type of attack, understanding that it will not be completely accurate for all attacks. |[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Attack Prerequisites| Describes the conditions that must exist or functionality and characteristics that the target software must have, or behavior it must exhibit for the type of attack to succeed|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|References|Contains one or more references, each of which represents a documentary resource used to develop the definition of the attack pattern.  These can provide further reading and insight into the attack pattern|[CAPEC-334](https://capec.mitre.org/data/definitions/334.html)|
|Resources Required| Describes the resources (CPU cycles, IP addresses, tools, etc.) needed by an attacker to effectively execute this attack type|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Solutions and Mitigations|Describes actions or approaches to prevent or mitigate the risk of the attack by improving resilience of the target, reducing the attack surface, or reducing the impact of a successful attack|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Related Weaknesses| Software weaknesses potentially targeted for exploit by the attack pattern.  Specific weaknesses reference CWE.|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Attacker Skills or Knowledge Required| Level of skills or specific knowledge required by an attacker to execute the attack type|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Injection Vector|The mechanism and format of an input-driven attack of the pattern's type.  Considers the attack's grammar, the system's accepted syntax, position of fields, and acceptable ranges of data|[CAPEC-10](https://capec.mitre.org/data/definitions/10.html)|
|Payload|Describes code, configuration, or other data to be executed or activated as part of this type of injection-based attack.|[CAPEC-10](https://capec.mitre.org/data/definitions/10.html)|
|Typical Likelihood of Exploit| Estimated likelihood of a successful attack, sometimes accompanied by an explanation of the estimate.|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)/[CAPEC-101](https://capec.mitre.org/data/definitions/101.html)|
|Payload Activation Impact|Describes the impact that the activation of the attack payload for an injection-based attack of this type would typically have on confidentiality, integrity, or availability of the target software |[CAPEC-10](https://capec.mitre.org/data/definitions/10.html)|
|Examples-Instances| An example instance details an explanatory example or demonstrative exploit instance of the attack.  Used to help the reader understand the nature, context and variability of the attack in practical/concrete terms|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Technical Context| The technical context (architectural paradigms, frameworks, platforms, and languages) for which the pattern is applicable|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Methods of Attack|The defined vectors identifying the mechanisms used in the attack.  Can help define applicable attack surface for the attack|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Purposes|Intended purpose behind the attack pattern relative to a list of attack objectives.  Used to capture pattern composability and assist with normalization and classification in the catalog|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|CIA Impact| Typical relative impact of the pattern on Confidentiality, Integrity, and Availability of the targeted software|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Attack Execution Flow|Composed of Attack Phases.  Phases segment the attack steps: "Explore," "Experiment," and "Exploit."|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Related Security Principles| Security rules or practices that impede the attack pattern.  Defined as rules and standards for good behavior |[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Probing Techniques|Describes methods used to probe and reconnoiter potential vulnerabilities and/or prepare for attack|[CAPEC-1](https://capec.mitre.org/data/definitions/1.html)|
|Target Attack Surface| The locations where the attacker interacts with the target system|[CAPEC-285](https://capec.mitre.org/data/definitions/285.html)|

# Preliminary Impression for Field Groupings

**Undecided - documentation/context-related?** <br>
Content History <br>
References <br>
Examples-Instances<br>
Summary<br>
Related Weaknesses<br>
Related Attack Patterns<br>

**Intent** <br>
Attack Motivation-Consequences<br>
Purposes<br>

**Requirements or Preparatory Steps** <br>
Resources Required<br>
Attacker Skills or Knowledge Required<br>
Probing Techniques (?)<br>
Typical Likelihood of Exploit (?)<br>

**Attack Execution Mechanisms and Location** <br>
Attack Execution Flow<br>
Payload<br>
Methods of Attack <br>
Injection Vector<br>
Target Attack Surface<br>
Technical Context<br>

**Impact of Successful Attack** <br>
Payload Activation Impact <br>
Typical Severity<br>
CIA Impact<br>

**Prevention and Mitigation** <br>
Solutions and Mitigations<br>
Related Security Principles<br>
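
For later processing, the preliminary groupings above can be kept in a dictionary and inverted to look up the group of any field. This is a sketch: the names `FIELD_GROUPS` and `GROUP_OF_FIELD` are hypothetical, the first group's label is shortened, and the groupings themselves remain tentative.

```python
# Preliminary groupings, transcribed from the list above.
FIELD_GROUPS = {
    "Undecided": [
        "Content History", "References", "Examples-Instances",
        "Summary", "Related Weaknesses", "Related Attack Patterns",
    ],
    "Intent": ["Attack Motivation-Consequences", "Purposes"],
    "Requirements or Preparatory Steps": [
        "Resources Required", "Attacker Skills or Knowledge Required",
        "Probing Techniques", "Typical Likelihood of Exploit",
    ],
    "Attack Execution Mechanisms and Location": [
        "Attack Execution Flow", "Payload", "Methods of Attack",
        "Injection Vector", "Target Attack Surface", "Technical Context",
    ],
    "Impact of Successful Attack": [
        "Payload Activation Impact", "Typical Severity", "CIA Impact",
    ],
    "Prevention and Mitigation": [
        "Solutions and Mitigations", "Related Security Principles",
    ],
}

# Invert the mapping so each field points to its group.
GROUP_OF_FIELD = {
    field: group for group, fields in FIELD_GROUPS.items() for field in fields
}

print(GROUP_OF_FIELD["Payload"])
print(len(GROUP_OF_FIELD))
```

Because every field appears in exactly one group, the inverted dictionary has one key per field.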



In [14]:
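# Count co-occurrence between field pairs: search[a][b] is the number of
# Attack Pattern entries that contain both field a and field b.
search = {a: {b: 0 for b in frequencies} for a in frequencies}

for entry in root[2]:
    # Collect the fields present in this entry, expanding the Description
    # container into its children to match extract_label above.
    present = set()
    for field in entry:
        if field.tag == 'Description':
            present.update(child.tag for child in field)
        else:
            present.add(field.tag)
    for a in present:
        for b in present:
            search[a][b] += 1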
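
One possible way to read the co-occurrence counts is to normalize each row by how often its field occurs. The sketch below does this on made-up numbers: `cooccurrence_rates` is a hypothetical helper, and the toy dictionaries only mimic the shape of `search` and `frequencies`.

```python
def cooccurrence_rates(cooccurrence, counts):
    """Return rates[a][b]: the share of occurrences of a that co-occur with b."""
    return {
        a: {b: (n / counts[a] if counts[a] else 0.0) for b, n in row.items()}
        for a, row in cooccurrence.items()
    }


# Made-up counts with the same nested shape as the co-occurrence dictionary.
toy_counts = {"Summary": 4, "Payload": 2}
toy_cooccurrence = {
    "Summary": {"Summary": 4, "Payload": 2},
    "Payload": {"Summary": 2, "Payload": 2},
}

rates = cooccurrence_rates(toy_cooccurrence, toy_counts)
print(rates["Summary"]["Payload"])
print(rates["Payload"]["Summary"])
```

The rates are not symmetric. In the toy data, every entry with a Payload also has a Summary, but only half of the entries with a Summary have a Payload.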
