# Comparing Accuracy of DocSim to TF-IDF (term frequency-inverse document frequency) when Matching Policies to Asset Documentation

#### Neel Datta
#### July 2021

This notebook compares the DocSim nltk tool with the TF-IDF tool to see which returns more accurate results when the Policy-Cyber Asset matching tool is tested on one query (both in its entirety and in a shortened version) against four document links. If the ranking tool works correctly, it should return the cryptography pages with high scores and the other pages with low or near-0% scores.


In [5]:
import json
import docsim
import re
import csv
import nltk
nltk.download('wordnet')
from urllib.request import urlopen
from urllib.error import HTTPError
import pandas as pd
from bs4 import BeautifulSoup
from tfidf import rank_documents

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/neeldatta/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/neeldatta/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
%%time

docsim_obj = docsim.DocSim(verbose=True)
# docsim_obj = docsim.DocSim_threaded(verbose=True)
print(f'Model ready: {docsim_obj.model_ready}')

Loading default GloVe word vector model: glove-wiki-gigaword-50
Model loaded
Model ready: True
CPU times: user 21.6 s, sys: 189 ms, total: 21.8 s
Wall time: 22.1 s


In [7]:
# Function that takes in xml file of a list of URLs and converts to string list where each string is a URL
def xmlToList(xml): 
    with open(xml, 'r') as f:
        temp = f.read()
    temp = re.findall("<loc>.*?</loc>", temp)
    strlist = []
    for s in temp:
        s = s[5:-6]
        strlist.append(s)
    return strlist
    
# Function that fetches each URL and stores a [url, page text] pair under the 'data' key of a JSON file
def htmlToJSON(htmlIn, JSONout):
    data = {'data': []}
    for url in htmlIn:
        try:
            page = urlopen(url)
            html = page.read().decode("ISO-8859-1")
            # Name the parser explicitly so bs4 does not have to guess one
            soup = BeautifulSoup(html, 'html.parser')
            data['data'].append([url, soup.get_text()])
        except HTTPError:
            # Skip pages that cannot be fetched, but report them
            print("HTTPError at url: " + url)
    with open(JSONout, 'w') as outfile:
        json.dump(data, outfile)
        
def txtToJson(txtIn, JSONout):
    data = {}
    data['data'] = []
    
    for url in txtIn:
        while True:
            try:
                dpoint = [url]
                page = urlopen(url)
                html = page.read().decode("ISO-8859-1")
                soup = BeautifulSoup(html)
                dpoint.append(soup.get_text())
                data['data'].append(dpoint)
                break
            except HTTPError:
                print ("HTTPError at url: " + url)
                break
    with open(JSONout, 'w') as outfile:
        json.dump(data, outfile)
    

# Function that converts the list of controls/policies from a JupiterOne PDF into a list of strings
def JSONToList(JSONin):
    policies = pd.read_json(JSONin)
    plist = []
    for sec in policies['sections']:
        for req in sec['requirements']:
            plist.append(sec['title'] + ' ' + req['ref'] + ' : ' + req['title'] + ' : ' + req['summary'])
    return plist
  
    

# Function that iterates through the first n controls, compares each with the JSON documents using docsim,
# and writes the top 5 matches for each control to a CSV file.
# Takes in an XML list of all URLs to compare, loads their text into the urlJSON file,
# and outputs a CSV of each control with the 5 URLs most similar to it.

def finalMatches(xmlIn, urlJSON, policyJSON, CSVOut, n):
    htmlToJSON(xmlToList(xmlIn), urlJSON)

    policies = JSONToList(policyJSON)
    
    # Only the first n policies are tested, to keep runtime manageable
    policies = policies[:n]
    
    # Load test data
    with open(urlJSON) as in_file:
        urldata = json.load(in_file)
    titles = [item[0] for item in urldata['data']]
    documents = [item[1] for item in urldata['data']]
    print(f'{len(documents)} documents')
    
    # Output findings into a CSV file (newline='' avoids blank rows between records on Windows)
    with open(CSVOut, mode='w', newline='') as csvfile:
        data = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        data.writerow(['Policy', 'Score', 'URL'])
        for p in policies:
            query_string = p
            similarities = docsim_obj.similarity_query(query_string, documents)
            for idx, score in (sorted(enumerate(similarities), reverse=True, key=lambda x: x[1])[:5]):
                data.writerow([query_string, str(score), titles[idx]])
    return


#Function that tests inputted policy string(s) against inputted list of URLs.
def testMatches(queries, urls, testJSON):
    htmlToJSON(urls, testJSON)

    with open(testJSON) as in_file:
        urldata = json.load(in_file)
    titles = [item[0] for item in urldata['data']]
    documents = [item[1] for item in urldata['data']]
    print(f'{len(documents)} documents')
    
    # Test on one string
    query_string = queries
    similarities = docsim_obj.similarity_query(query_string, documents)

    # Output the similarity scores for top 5 documents
    for idx, score in (sorted(enumerate(similarities), reverse=True, key=lambda x: x[1])[:5]):
        print(f'{idx} \t {score:0.3f} \t {titles[idx]}')
    return
    

def testMatchesTFIDF(queries, urls, testJSON):
    htmlToJSON(urls, testJSON)

    with open(testJSON) as in_file:
        urldata = json.load(in_file)
    titles = [item[0] for item in urldata['data']]
    documents = [item[1] for item in urldata['data']]
    print(f'{len(documents)} documents')
    
    # Test on one string
    document_scores = rank_documents(queries, documents)

    score_titles = [(score, title) for score, title in zip(document_scores, titles)]

    for score, title in (sorted(score_titles, reverse=True, key=lambda x: x[0])[:5]):
        print(f'{score:0.3f} \t {title}')
    return
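
The `xmlToList` helper above pulls `<loc>` entries out of a sitemap with a regular expression, which misses entries whose URL is wrapped onto its own line. As a hedged alternative, the sketch below does the same extraction with the standard-library XML parser. The function name `extract_sitemap_urls` and the sample sitemap string are illustrative assumptions, not part of the original tool.

```python
import xml.etree.ElementTree as ET


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap string."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; compare the local name only.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name == 'loc' and element.text:
            urls.append(element.text.strip())
    return urls


# Hypothetical sitemap used only to exercise the function.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://aws.amazon.com/ground-station/</loc></url>
  <url>
    <loc>
      https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html
    </loc>
  </url>
</urlset>"""

for url in extract_sitemap_urls(sample_sitemap):
    print(url)
```

The result is a plain list of URL strings, so it could be passed to `htmlToJSON` in the same way as the output of `xmlToList`.
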

## Testing on a cryptography procedure
We will now run a search query on key management procedures against the following four web pages:

### Should return 0% match:
#### https://docs.sumerian.amazonaws.com/tutorials/create/getting-started/light-switch/ 
#### https://aws.amazon.com/ground-station/

### Should return 100% match:
#### https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html
#### https://docs.aws.amazon.com/kms/latest/cryptographic-details/basic-concepts.html 



In [8]:
urllist = ['https://docs.sumerian.amazonaws.com/tutorials/create/getting-started/light-switch/', 
           'https://aws.amazon.com/ground-station/', 
           'https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html',
    'https://docs.aws.amazon.com/kms/latest/cryptographic-details/basic-concepts.html'
]
query = "The objectives of Key management are: Supporting the users with an existing domain. Generating distribution and installation of keying. Controlling set of Keying material. Storage backup and archival of Keying. The key management techniques are: Symmetric. Key Encryption. Public Key Encryption."
testMatches(query, urllist, 'test.json')

4 documents
4 documents loaded into corpus
1 	 0.597 	 https://aws.amazon.com/ground-station/
2 	 0.580 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html
3 	 0.533 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/basic-concepts.html
0 	 0.446 	 https://docs.sumerian.amazonaws.com/tutorials/create/getting-started/light-switch/


## Testing with a shorter query string

In [9]:
query = "The key management techniques are: Symmetric. Key Encryption. Public Key Encryption."
testMatches(query, urllist, 'test.json')

4 documents
4 documents loaded into corpus
3 	 0.815 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/basic-concepts.html
1 	 0.733 	 https://aws.amazon.com/ground-station/
0 	 0.649 	 https://docs.sumerian.amazonaws.com/tutorials/create/getting-started/light-switch/
2 	 0.637 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html


## Testing with the TF-IDF model instead
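
For readers unfamiliar with TF-IDF ranking, the following is a minimal sketch of the general technique: each text becomes a vector of term counts weighted by inverse document frequency, and documents are scored by cosine similarity to the query. It is not the implementation behind `rank_documents` in the `tfidf` module used in this notebook, which is not shown here. The function name `tfidf_rank_sketch`, the tokenizer, the smoothed IDF formula, and the toy documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank_sketch(query, documents):
    """Return one cosine-similarity score per document (illustrative only)."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))

    def idf(term):
        # Smoothed inverse document frequency (an assumed, common variant).
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    def vectorize(tokens):
        # Weight each raw term count by the term's IDF.
        return {term: count * idf(term) for term, count in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    return [cosine(query_vec, vectorize(tokens)) for tokens in doc_tokens]


# Toy documents, invented for this example.
toy_docs = [
    "Key encryption with symmetric key management",
    "Satellite ground station antennas",
]
toy_scores = tfidf_rank_sketch("symmetric key encryption", toy_docs)
print(toy_scores)
```

Because the score depends on shared terms, a document with no query terms in common scores exactly zero under this sketch, which is the kind of behavior expected for the unrelated pages in the test below.
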

In [10]:
testMatchesTFIDF(query, urllist, 'test.json')

4 documents
0.561 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/basic-concepts.html
0.241 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html
0.003 	 https://aws.amazon.com/ground-station/
0.000 	 https://docs.sumerian.amazonaws.com/tutorials/create/getting-started/light-switch/


In [11]:
# Testing with longer query again.
query = "The objectives of Key management are: Supporting the users with an existing domain. Generating distribution and installation of keying. Controlling set of Keying material. Storage backup and archival of Keying. The key management techniques are: Symmetric. Key Encryption. Public Key Encryption."
testMatchesTFIDF(query, urllist, 'test.json')

4 documents
0.368 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/basic-concepts.html
0.174 	 https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html
0.009 	 https://docs.sumerian.amazonaws.com/tutorials/create/getting-started/light-switch/
0.004 	 https://aws.amazon.com/ground-station/


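## Overall, the TF-IDF model with a short query string gives the best results
The TF-IDF model assigned zero or near-zero scores (at most 0.009) to the two unrelated web pages and clearly higher scores (0.174 to 0.561) to the two cryptography pages, ranking both cryptography pages first for either query. The short query separated the two groups most sharply (0.561 and 0.241 versus 0.003 and 0.000). The DocSim tool, by contrast, gave all four pages broadly similar scores (0.446 to 0.815) and in neither run ranked the two cryptography documentation pages as the top two.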
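
One way to make this comparison less dependent on reading the rankings by eye is to compute a simple metric over the recorded scores. The sketch below copies the scores printed in the four runs above and computes precision at 2 (the share of the top two results that are cryptography pages) and a separation margin (lowest cryptography score minus highest unrelated score). The helper names `precision_at_k` and `separation_margin` are illustrative assumptions, not part of the original tool.

```python
SUMERIAN = 'https://docs.sumerian.amazonaws.com/tutorials/create/getting-started/light-switch/'
GROUND = 'https://aws.amazon.com/ground-station/'
KMS_INTRO = 'https://docs.aws.amazon.com/kms/latest/cryptographic-details/intro.html'
KMS_BASIC = 'https://docs.aws.amazon.com/kms/latest/cryptographic-details/basic-concepts.html'

# The two pages expected to match the key management query.
RELEVANT = {KMS_INTRO, KMS_BASIC}

# Scores copied from the cell outputs recorded in this notebook.
RUNS = {
    'DocSim, long query': {GROUND: 0.597, KMS_INTRO: 0.580, KMS_BASIC: 0.533, SUMERIAN: 0.446},
    'DocSim, short query': {KMS_BASIC: 0.815, GROUND: 0.733, SUMERIAN: 0.649, KMS_INTRO: 0.637},
    'TF-IDF, short query': {KMS_BASIC: 0.561, KMS_INTRO: 0.241, GROUND: 0.003, SUMERIAN: 0.000},
    'TF-IDF, long query': {KMS_BASIC: 0.368, KMS_INTRO: 0.174, SUMERIAN: 0.009, GROUND: 0.004},
}


def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring URLs that are in the relevant set."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(url in relevant for url in top_k) / k


def separation_margin(scores, relevant):
    """Lowest relevant score minus highest non-relevant score (positive is good)."""
    lowest_relevant = min(s for url, s in scores.items() if url in relevant)
    highest_other = max(s for url, s in scores.items() if url not in relevant)
    return lowest_relevant - highest_other


for name, scores in RUNS.items():
    p2 = precision_at_k(scores, RELEVANT, 2)
    margin = separation_margin(scores, RELEVANT)
    print(f'{name}: precision@2 = {p2:.2f}, margin = {margin:+.3f}')
```

With the recorded scores, precision at 2 is 0.5 for both DocSim runs and 1.0 for both TF-IDF runs, and only the TF-IDF runs have a positive separation margin. Four documents and one query are a very small sample, so these numbers summarize this test rather than the tools in general.
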