In [1]:
from bokeh.io import output_notebook
output_notebook()

# Introduction

This notebook documents our research and analysis of the Common Attack Pattern Enumeration and Classification (CAPEC), with the ultimate goal of creating a corpus for PERCEIVE. The primary sources of information were the CAPEC 2.9 XML file and the accompanying XML Schema Documentation, both available for download from the CAPEC website.

After an initial examination of the file, we found that the XML contains four top-level nodes under the document root, which we refer to as root nodes: Views, Categories, Attack Patterns, and Environments. Each of these root nodes contains subnodes, which we refer to as individual "Entries." These Entries have identification numbers and contain numerous subnodes of their own, which we call "Fields." The Fields hold the organized information about an Entry and are the main focus of our investigation. The following link provides a visualization of the root nodes and the Fields observed within the Entries of those nodes.

https://trello-attachments.s3.amazonaws.com/57c66d3ae2c67f66f174d542/5879da1c8f216b9b485f6bf3/4c1b3e2af49e46da07c23ac901a25b65/capec_simplified_xml_schema_v2.png
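
The node, Entry, and Field layout described above can be inspected programmatically. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree` on a tiny inline stand-in document. The element names (`Attack_Pattern_Catalog`, `View`, `Attack_Pattern`, `Description`, and so on) and the sample IDs are illustrative assumptions, not copied from the CAPEC 2.9 file.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the CAPEC file; element names and IDs are assumptions
# made for illustration only.
SAMPLE = """
<Attack_Pattern_Catalog xmlns="http://capec.mitre.org/capec-2">
  <Views>
    <View ID="1000"><View_Structure>Graph</View_Structure></View>
  </Views>
  <Categories>
    <Category ID="100"><Description>Sample category</Description></Category>
  </Categories>
  <Attack_Patterns>
    <Attack_Pattern ID="1"><Description>First</Description><Severity>High</Severity></Attack_Pattern>
    <Attack_Pattern ID="2"><Description>Second</Description></Attack_Pattern>
  </Attack_Patterns>
  <Environments>
    <Environment ID="env-All"><Title>All</Title></Environment>
  </Environments>
</Attack_Pattern_Catalog>
"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def summarize(xml_text):
    """Map each top-level node to {entry ID: [names of its direct Fields]}."""
    root = ET.fromstring(xml_text)
    summary = {}
    for node in root:
        entries = {}
        for entry in node:
            entries[entry.get("ID")] = [local_name(field.tag) for field in entry]
        summary[local_name(node.tag)] = entries
    return summary

summary = summarize(SAMPLE)
for node_name, entries in summary.items():
    print(node_name, entries)
```

Run against the real file, the same loop would list every Entry under each root node together with the Fields it uses directly, which is one way to cross-check the diagram.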

Note that Environments is linked to the "Attack Execution Flow" Field used by Attack Pattern Entries, because this Field uses information recorded in the Environments node. The information in Environments does not appear to be used for anything else (this should be investigated further). After further analysis of the XML file, we found that the four root nodes have a clear hierarchical relationship, which is visualized in the diagram linked below.

https://trello-attachments.s3.amazonaws.com/57c66d3ae2c67f66f174d542/5879da1c8f216b9b485f6bf3/27f9e00556354e211f5b77bfab0cb26e/Types_of_Nodes_and_Their_Relationships_based_on_Mechanisms_and_Domains_of_Attack_Views.png

In this structure, the highest hierarchical level is Views, which comprises Mechanisms of Attack and Domains of Attack. These Views separate the Category Entries into two types, based on whether they pertain to the mechanisms employed in exploiting a vulnerability or to the domains through which attacks are perpetrated. Below the Category Entries are the Attack Pattern Entries, of which there are three types: Meta, Standard, and Detailed. These terms refer to the level of abstraction of a particular Attack Pattern Entry. Meta Attack Pattern Entries sit directly below Category Entries in the hierarchy. Because Categories are ways of sorting Attack Patterns, a given Meta Attack Pattern Entry will be a "MemberOf" two Categories, one for each View. Meta Attack Pattern Entries have Child nodes that can be either Standard or Detailed Attack Patterns. Standard Attack Pattern Entries may also have their own Child nodes, which will always be Detailed Attack Patterns.
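
The parent and child rules in this hierarchy can be written down as a small lookup table. The sketch below encodes them as read from the description above and flags links that break them. The entry IDs and the `invalid_links` helper are hypothetical and are not part of CAPEC or of this notebook's scripts.

```python
# Allowed parent -> child links between levels, as read from the description above.
ALLOWED_CHILDREN = {
    "Category": {"Meta"},
    "Meta": {"Standard", "Detailed"},
    "Standard": {"Detailed"},
    "Detailed": set(),
}

def invalid_links(levels, child_of):
    """Return (child, parent) pairs that break the hierarchy rules.

    levels   -- maps an entry ID to its level ("Category", "Meta", ...)
    child_of -- list of (child ID, parent ID) pairs
    """
    return [
        (child, parent)
        for child, parent in child_of
        if levels[child] not in ALLOWED_CHILDREN[levels[parent]]
    ]

# Hypothetical entries, not taken from CAPEC.
levels = {"C1": "Category", "M1": "Meta", "S1": "Standard",
          "D1": "Detailed", "D2": "Detailed"}
links = [("M1", "C1"), ("S1", "M1"), ("D1", "S1"), ("D2", "M1"), ("S1", "D1")]

print(invalid_links(levels, links))
```

Only the last link, a Standard entry placed under a Detailed one, violates the rules in this sample.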

Given that Views and Categories are primarily methods of organizing Attack Patterns, we are specifically interested in the Attack Patterns and the Fields that they contain. To prepare for extracting information from the text within the Attack Pattern Fields, we must first determine which Fields appear most often, whether the most frequent Fields actually contain the most important or relevant information, and how to extract from the XML the information needed to create a corpus.


# Parsing the XML File

As noted in the introduction, we must determine which Fields are used most frequently among Attack Pattern Entries. The following Python script reads a list of Field names, compiled by examining the XML's schema documentation, and counts how often each one occurs under the Attack Patterns node, storing the frequencies in a dictionary.

In [4]:
import lxml.etree

tree = lxml.etree.parse('capec.xml')
root = tree.getroot()

#Import list of field names into dictionary (one field name per line)
frequencies = {}
with open("fields.txt") as fields:
    for line in fields:
        strippedline = line.strip()
        if strippedline:  #skip blank lines
            frequencies[strippedline] = 0

ns = "{http://capec.mitre.org/capec-2}"

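#Count fields in XML
#root[2] is the third top-level node, Attack Patterns (after Views and Categories)
attack_patterns = root[2]
for field in frequencies:
    #lxml matches namespaced tags in "{namespace}tag" form
    frequencies[field] = sum(1 for _ in attack_patterns.iter(ns + field))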
        
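#Output as text file (uncomment to use)
#sorted_frequencies is not defined in the cells above; it is assumed here to be
#the (field, count) pairs in descending order of count
#==============================================================================
# sorted_frequencies = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)
# with open('sortedfrequencies.txt', 'w') as f:
#     for t in sorted_frequencies:
#         f.write(' '.join(str(x) for x in t) + '\n')
#==============================================================================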
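
Because `iter()` counts every occurrence of a tag, a Field that repeats inside a single Entry is counted more than once. If the question is how many Entries use a Field at all, a per-Entry count may be more informative. The following is a minimal sketch of both counts, using the standard-library `xml.etree.ElementTree` on an inline sample. The element names in the sample and the `count_fields` helper are assumptions for illustration, not part of the CAPEC file or of the script above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://capec.mitre.org/capec-2}"

# Illustrative stand-in for the Attack Patterns node; the element names are
# assumptions, not copied from the CAPEC file.
SAMPLE = """
<Attack_Patterns xmlns="http://capec.mitre.org/capec-2">
  <Attack_Pattern ID="1">
    <Description>First</Description>
    <Related_Weakness>A</Related_Weakness>
    <Related_Weakness>B</Related_Weakness>
  </Attack_Pattern>
  <Attack_Pattern ID="2">
    <Description>Second</Description>
  </Attack_Pattern>
</Attack_Patterns>
"""

def count_fields(attack_patterns):
    """Return (occurrences, entries_using) Counters keyed by field name."""
    occurrences = Counter()
    entries_using = Counter()
    for entry in attack_patterns:
        # Every descendant tag of this entry, namespace prefix removed.
        # entry.iter() also yields the entry itself, so skip it.
        names = [el.tag.replace(NS, "") for el in entry.iter() if el is not entry]
        occurrences.update(names)         # every occurrence counts
        entries_using.update(set(names))  # at most once per entry
    return occurrences, entries_using

occurrences, entries_using = count_fields(ET.fromstring(SAMPLE))
for name, total in occurrences.most_common():
    print(name, total, entries_using[name])
```

This variant also discovers Field names from the data itself, so it can serve as a cross-check on the list in `fields.txt`.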

## Plotting the Frequencies

To better visualize the counts returned by parsing the XML file, the following script takes the dictionary created previously, sorts the Fields by frequency, and plots them as a horizontal bar chart with one bar per Field.

## Horizontal Histogram

In [6]:
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import Range1d

data = {}
data['Entries'] = frequencies

df_data = pd.DataFrame(data).sort_values(by='Entries', ascending=True)
series = df_data.loc[:,'Entries']

p = figure(width=800, y_range=series.index.tolist(), title="Attack Pattern Histogram")

p.xaxis.axis_label = 'Frequency'
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '8pt'

p.yaxis.axis_label = 'Field'
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '8pt'

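#Draw one horizontal bar per field. Passing the field names as y values keeps
#each bar aligned with its categorical axis label; Series.iteritems() was
#removed in pandas 2.0, so the values are passed as plain lists instead.
p.hbar(y=series.index.tolist(), right=series.tolist(), height=0.4)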

show(p)
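
Where Bokeh output is not available, the same ranking can be checked as plain text. The sketch below sorts a frequency dictionary and renders each count as a row of `#` characters scaled to the largest value. The `rank_fields` and `text_bars` helpers and the sample counts are hypothetical and are not results from the CAPEC file.

```python
def rank_fields(frequencies, top=None):
    """Return (field, count) pairs sorted by descending count, then name."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top is None else ranked[:top]

def text_bars(ranked, width=40):
    """Render each count as a row of '#' scaled so the largest spans `width`."""
    largest = max((count for _, count in ranked), default=0)
    rows = []
    for field, count in ranked:
        bar = "#" * (round(width * count / largest) if largest else 0)
        rows.append(f"{field:<20} {count:>5} {bar}")
    return rows

# Hypothetical counts, for illustration only.
sample = {"Description": 500, "Severity": 250, "Notes": 0}
for row in text_bars(rank_fields(sample), width=20):
    print(row)
```

Passing the `frequencies` dictionary built earlier in place of `sample` would print the full ranking, and `top` limits the output to the most frequent Fields.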
