# Matching ISO Policies to Cyber Asset Documentation using DocSim Semantic Similarity Scoring

#### Neel Datta
#### July 2021

This notebook uses the DocSim NLP document-similarity tool to match the documentation of cyber assets (such as AWS IAM) to security policies, in order to check whether a cyber asset sufficiently satisfies the given requirements.
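
To make the idea of a similarity score concrete, here is a minimal sketch. It assumes a common embedding approach (average the word vectors of each text, then compare with cosine similarity) and is not a description of DocSim's internals. The three-dimensional `toy_vectors`, the document names, and the helper functions are all invented for illustration.

```python
import math

# Hypothetical 3-dimensional word vectors; real GloVe vectors have 50 or more dimensions.
toy_vectors = {
    "access":  [0.9, 0.1, 0.0],
    "policy":  [0.8, 0.3, 0.1],
    "role":    [0.7, 0.2, 0.2],
    "billing": [0.0, 0.2, 0.9],
    "invoice": [0.1, 0.1, 0.8],
}

def embed(text):
    """Average the vectors of the known words in `text`."""
    words = [w for w in text.lower().split() if w in toy_vectors]
    if not words:
        raise ValueError("no known words in text")
    dims = len(next(iter(toy_vectors.values())))
    return [sum(toy_vectors[w][i] for w in words) / len(words) for i in range(dims)]

def cosine(a, b):
    """Cosine similarity: close to 1.0 for similar directions, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = embed("access policy")
docs = {"iam_doc": "access role policy", "billing_doc": "billing invoice"}
scores = {name: cosine(query, embed(text)) for name, text in docs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The document whose averaged vector points in nearly the same direction as the query gets the higher score, which is the behaviour the ranking steps later in this notebook rely on.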


In [1]:
import json
import docsim
import re
import csv
from urllib.request import urlopen
from urllib.error import HTTPError
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
%%time

docsim_obj = docsim.DocSim(verbose=True)
# docsim_obj = docsim.DocSim_threaded(verbose=True)

Loading default GloVe word vector model: glove-wiki-gigaword-50
Model loaded
CPU times: user 27.1 s, sys: 302 ms, total: 27.4 s
Wall time: 27.9 s


In [3]:
print(f'Model ready: {docsim_obj.model_ready}')

Model ready: True


## Methods 

In [4]:
# Takes an XML file listing URLs (one per <loc> element) and returns them as a list of strings
def xmlToList(xml):
    with open(xml, 'r') as f:
        temp = f.read()
    # The capture group keeps only the text between <loc> and </loc>
    return re.findall(r"<loc>(.*?)</loc>", temp)
    
# Downloads each URL and stores a [url, page text] pair under the 'data' key of a JSON file
def htmlToJSON(htmlIn, JSONout):
    data = {'data': []}
    for url in htmlIn:
        try:
            page = urlopen(url)
            html = page.read().decode("ISO-8859-1")
            # Name the parser explicitly so BeautifulSoup does not have to guess one
            soup = BeautifulSoup(html, 'html.parser')
            data['data'].append([url, soup.get_text()])
        except HTTPError:
            # Report pages that cannot be fetched and move on to the next URL
            print("HTTPError at url: " + url)
    with open(JSONout, 'w') as outfile:
        json.dump(data, outfile)

# Function that converts the controls/policies in a JupiterOne JSON file (sections -> requirements) into a list of strings
def JSONToList(JSONin):
    policies = pd.read_json(JSONin)
    plist = []
    for sec in policies['sections']:
        for req in sec['requirements']:
            plist.append(sec['title'] + ' ' + req['ref'] + ' : ' + req['title'] + ' : ' + req['summary'])
    return plist
  
    

# Compares each control/policy against the downloaded documents using DocSim and
# writes the 5 most similar URLs for each policy to a CSV file.
# Takes the XML list of URLs to compare, downloads them into the urlJSON file, and
# reads the policies from policyJSON. Only the first n policies are scored.
def finalMatches(xmlIn, urlJSON, policyJSON, CSVOut, n):
    htmlToJSON(xmlToList(xmlIn), urlJSON)
    policies = JSONToList(policyJSON)

    # Currently only testing on the first n policies for runtime/testing purposes
    policies = policies[:n]

    # Load test data
    with open(urlJSON) as in_file:
        urldata = json.load(in_file)
    titles = [item[0] for item in urldata['data']]
    documents = [item[1] for item in urldata['data']]
    print(f'{len(documents)} documents')

    # Output findings into CSV file (newline='' is what the csv module expects for output files)
    with open(CSVOut, mode='w', newline='') as csvfile:
        data = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        data.writerow(['Policy', 'Score', 'URL'])
        for query_string in policies:
            similarities = docsim_obj.similarity_query(query_string, documents)
            for idx, score in sorted(enumerate(similarities), reverse=True, key=lambda x: x[1])[:5]:
                data.writerow([query_string, str(score), titles[idx]])
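
The regular expression in `xmlToList` does not match a `<loc>` element whose text spans several lines, because `.` does not match newlines by default. As a possible alternative, the sketch below extracts the same URLs with the standard library's XML parser. The function name `sitemap_urls` and the sample sitemap are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

def sitemap_urls(xml_text):
    """Return the text of every <loc> element, whatever XML namespace it is in."""
    root = ET.fromstring(xml_text)
    urls = []
    for elem in root.iter():
        # Namespaced tags look like '{namespace}loc', so compare only the local name.
        if elem.tag.rsplit('}', 1)[-1] == 'loc' and elem.text:
            urls.append(elem.text.strip())
    return urls

# Made-up sitemap; the second <loc> is split across lines on purpose.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a.html</loc></url>
  <url>
    <loc>
      https://example.com/b.html
    </loc>
  </url>
</urlset>"""

print(sitemap_urls(sample))
```

A parser-based version also decodes XML entities such as `&amp;` in URLs, which the regular expression leaves untouched.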

## Testing: 

### Testing the final matching method, which takes in a list of policies and a list of URLs and outputs a CSV of each policy with its closest URLs by semantic similarity.

In [14]:
finalMatches('IAMurlsexample.xml', 'IAMurlsexample.JSON',
             'ISO27002J1.JSON', 'examplecsv.txt', 5)
scoring = pd.read_csv('examplecsv.txt')
scoring.head()

HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html
HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/list_awssecretsmanager.html
HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/list_awskeymanagementservice.html
HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/list_identityandaccessmanagement.html
485 documents
485 documents loaded into corpus
485 documents loaded into corpus
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
485 documents loaded into corpus
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
485 documents loaded into corpus
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
485 documents loaded into corpus


| | Policy | Score | URL |
|---|---|---|---|
| 0 | 5 - Information Security Policies 5.1.1 : Poli... | 0.638876 | https://docs.aws.amazon.com/IAM/latest/UserGui... |
| 1 | 5 - Information Security Policies 5.1.1 : Poli... | 0.601462 | https://docs.aws.amazon.com/IAM/latest/UserGui... |
| 2 | 5 - Information Security Policies 5.1.1 : Poli... | 0.597436 | https://docs.aws.amazon.com/IAM/latest/UserGui... |
| 3 | 5 - Information Security Policies 5.1.1 : Poli... | 0.579919 | https://docs.aws.amazon.com/IAM/latest/UserGui... |
| 4 | 5 - Information Security Policies 5.1.1 : Poli... | 0.575602 | https://docs.aws.amazon.com/IAM/latest/UserGui... |
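
One way to use a CSV in this `Policy`, `Score`, `URL` layout is to keep the best-scoring URL per policy and flag policies whose best score looks weak. The sketch below does that with the standard library only. The sample rows, the `best_match_per_policy` helper, and the 0.5 cut-off are assumptions made for illustration, not values derived in this notebook.

```python
import csv
import io

# Hypothetical rows in the same Policy,Score,URL layout that finalMatches writes.
sample_csv = """Policy,Score,URL
Policy A,0.64,https://example.com/doc1
Policy A,0.60,https://example.com/doc2
Policy B,0.41,https://example.com/doc3
Policy B,0.39,https://example.com/doc1
"""

THRESHOLD = 0.5  # assumed cut-off; a real value would need to be calibrated

def best_match_per_policy(csv_text):
    """Return {policy: (score, url)}, keeping the highest-scoring row per policy."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["Score"])
        if row["Policy"] not in best or score > best[row["Policy"]][0]:
            best[row["Policy"]] = (score, row["URL"])
    return best

best = best_match_per_policy(sample_csv)
weak = [policy for policy, (score, _) in best.items() if score < THRESHOLD]
print(best)
print("Needs review:", weak)
```

A high similarity score only shows that the wording of a document resembles the policy text, so a flagged or unflagged policy would still need a human check before concluding that the requirement is met.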


### Testing each individual method.

In [5]:
%%time
#Test xmlToList
IAMurls = xmlToList('IAMurlsexample.xml')
#print(IAMurls)
print(str(len(IAMurls)) + " URLs found.")

#Test htmlToJSON (it writes the JSON file and returns nothing, so there is no result to keep)
htmlToJSON(IAMurls, 'IAMurlsexample.JSON')
#with open('IAMurlsexample.JSON') as json_file:
    # data = json.load(json_file)
    # print(data)
    
    
# Load test data
with open('IAMurlsexample.JSON') as in_file:
    test_data = json.load(in_file)
    
titles = [item[0] for item in test_data['data']]
documents = [item[1] for item in test_data['data']]

print(f'{len(documents)} documents')

489 URLs found.
HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html
HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/list_awssecretsmanager.html
HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/list_awskeymanagementservice.html
HTTPError at url: https://docs.aws.amazon.com/IAM/latest/service-authorization/latest/reference/list_identityandaccessmanagement.html
485 documents
CPU times: user 17 s, sys: 981 ms, total: 18 s
Wall time: 5min 36s


### Testing on one policy: 'An access control policy shall be established, documented and reviewed based on business and information security requirements.'

In [6]:
# Test on one string
query_string = 'An access control policy shall be established, documented and reviewed based on business and information security requirements.'
similarities = docsim_obj.similarity_query(query_string, documents)

# Output the similarity scores for top 5 documents
for idx, score in (sorted(enumerate(similarities), reverse=True, key=lambda x: x[1])[:5]):
    print(f'{idx} \t {score:0.3f} \t {titles[idx]}')

485 documents loaded into corpus
185 	 0.686 	 https://docs.aws.amazon.com/IAM/latest/APIReference/API_CreateRole.html
379 	 0.660 	 https://docs.aws.amazon.com/IAM/latest/APIReference/API_ResetServiceSpecificCredential.html
7 	 0.650 	 https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html
172 	 0.641 	 https://docs.aws.amazon.com/IAM/latest/APIReference/API_CreateUser.html
35 	 0.613 	 https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_terms-and-concepts.html
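
The top-5 selection above sorts every score and then slices. As a small sketch of an equivalent approach, `heapq.nlargest` returns the same ordering without sorting the full list. The `top_k` helper and the sample scores and titles below are made up for illustration.

```python
import heapq

def top_k(similarities, titles, k=5):
    """Return the k highest-scoring (index, score, title) triples, best first."""
    best = heapq.nlargest(k, enumerate(similarities), key=lambda pair: pair[1])
    return [(idx, score, titles[idx]) for idx, score in best]

# Made-up scores and titles standing in for the DocSim output and the URL list.
scores = [0.12, 0.69, 0.40, 0.66, 0.05]
names = ["doc0", "doc1", "doc2", "doc3", "doc4"]

for idx, score, title in top_k(scores, names, k=3):
    print(f'{idx} \t {score:0.3f} \t {title}')
```

With a few hundred documents the speed difference is negligible; the main benefit is that the ranking logic lives in one reusable function instead of being repeated in each cell.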


## So far, the average time is around 5 minutes to compare 1 search query against 1 cyber asset; the timed cell above, which downloads the pages, took 5 min 36 s of wall time.
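
The timed cell that downloads the pages reports 5min 36s of wall time against 18 s of CPU time, which suggests that waiting on the network dominates. One possible mitigation is to reuse the saved JSON file instead of downloading the pages on every run. The sketch below shows that caching pattern; `load_or_build_corpus` and the stand-in `fake_fetch` function are hypothetical and are not part of the notebook.

```python
import json
import os
import tempfile

def load_or_build_corpus(cache_path, urls, fetch_text):
    """Load {'data': [[url, text], ...]} from cache_path, building it only if missing.

    fetch_text is a caller-supplied function that maps a URL to its page text.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as in_file:
            return json.load(in_file)
    data = {'data': [[url, fetch_text(url)] for url in urls]}
    with open(cache_path, 'w') as out_file:
        json.dump(data, out_file)
    return data

# Demonstration with a stand-in fetcher that records every call it receives.
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"text of {url}"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'corpus.json')
    first = load_or_build_corpus(path, ['u1', 'u2'], fake_fetch)
    second = load_or_build_corpus(path, ['u1', 'u2'], fake_fetch)

print(len(calls))  # the fetcher runs only during the first call
```

The cache has to be deleted or rebuilt whenever the documentation pages change, so this trades freshness for speed.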