# Property Querying Results

The goal of this document is to discuss the results from `Property_Querying_Tutorial.ipynb`.

To summarize, we queried our CPE-CWE-CVE knowledge graph (KG) for the frequency of every property value on the nodes of its triples. We ran this query on two distinct sets of triples: removed triples and non-removed triples. We then normalized the frequencies to percentages and compared the percentage occurrence of each property value between the two sets.
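The normalize-and-compare step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names `percent_frequencies` and `compare_sets` are hypothetical, and the input lists are toy data standing in for the property values collected from each set.

```python
from collections import Counter


def percent_frequencies(values):
    """Map each property value to its percentage of all occurrences."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100.0 * n / total for value, n in counts.items()}


def compare_sets(removed_values, non_removed_values):
    """Rank property values by the magnitude of their percentage difference.

    Returns (value, removed %, non-removed %, difference) tuples,
    largest absolute difference first.
    """
    removed_pct = percent_frequencies(removed_values)
    non_removed_pct = percent_frequencies(non_removed_values)
    rows = []
    for value in set(removed_pct) | set(non_removed_pct):
        r = removed_pct.get(value, 0.0)
        k = non_removed_pct.get(value, 0.0)
        rows.append((value, r, k, r - k))
    # Sort by magnitude of difference; break ties on the value for stable output.
    rows.sort(key=lambda row: (-abs(row[3]), str(row[0])))
    return rows


# Toy data only (not taken from our KG): one property value per node.
removed = ["a", "a", "a", "h"]
non_removed = ["a", "a", "o", "h"]

for value, r, k, diff in compare_sets(removed, non_removed):
    print(f"{value}: removed={r:.1f}% non-removed={k:.1f}% diff={diff:+.1f}")
```

A value that appears in only one set is treated as 0% in the other, so it still shows up in the ranking.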

The output of our program is a set of `.xlsx` files, where each worksheet displays the results for a different property, sorted by largest magnitude of difference. I have supplied images of the top 10 property values with the largest magnitude of difference for each property, except in the cases where there are fewer than 10 unique values.

The actual `.xlsx` files are also available and posted in the same directory.

# CPE-CVE Triples Results

## CPE Node Properties Results

In our knowledge graph, each CPE node has 6 properties: `Part`, `Target Software`, `Target Hardware`, `Product`, `Vendor`, and the actual `Name` of the node. 

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## CVE Node Properties Results

In our KG, each CVE node only has 1 property: the `Name` of the node. 

![image.png](attachment:image.png)

# CVE-CWE Triples Results

## CVE Node Properties Results

As before, in our KG each CVE node only has 1 property: `Name`.

![image.png](attachment:image.png)

## CWE Node Properties Results

In our KG, each CWE node has 5 properties: `Language`, `Likelihood of Exploit`, `Technology`, `Consequence`, and its `Name`.

For the `Language`, `Technology`, and `Consequence` properties, a node can actually have multiple values for a single property. For example, a CWE entry could have both `C` and `C++` under the `Language` property. 

Our KG groups all the values a node has for that property into a single string value connected to that node. Another way to build a CPE-CVE-CWE knowledge graph would be to separate the values into multiple unique strings/entities connected to the node. However, that is not the case with our KG, so we treat each combination of values as a unique property value.
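To make the distinction above concrete, here is a small sketch of how the two treatments count differently. The sample values and the `", "` delimiter are assumptions for illustration only; the actual string format in our KG may differ.

```python
from collections import Counter

# Hypothetical `Language` values for four CWE nodes (toy data).
# The ", " delimiter between values is an assumption.
language_values = ["C, C++", "C, C++", "Java", "C"]

# Treatment used in this analysis: each combination is one property value.
combined_counts = Counter(language_values)

# Alternative treatment: split each combination and count individual values.
split_counts = Counter(
    part.strip()
    for value in language_values
    for part in value.split(",")
)

print(combined_counts)
print(split_counts)
```

Under the combined treatment, `C, C++` and `C` are unrelated values, so a language that appears in many different combinations has its frequency spread across several rows of the results.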

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

# Conclusion

## Discussion

The tables show us that certain property values have noticeably different frequencies of occurrence between non-removed and removed triples. For example, with CVE-CWE triples, 75.0% of the CWE nodes in the removed set have a value of `*` for the `Technology` property compared to 56.4% in the non-removed set.

In our previous work, we demonstrated that basic KG embedding models in the `Ampligraph` package could somewhat differentiate removed and non-removed triples with their scoring functions. Those results imply that our triples contain some level of information that can distinguish between the two sets. Here, we attempted to describe these differences qualitatively through our querying. The reasoning is that the property differences we found at least partially cause the differences in the embedding models' scores.

Naturally, some properties and property values did not show much difference. For example, our results show that the frequencies of CVE `Name` values do not provide any differentiation between the two sets. In general, property values that had extremely high or low frequencies tend not to be good differentiators, as both sets had similarly extreme frequencies for those property values.


## Shortcomings

This analysis did not query the sets beyond the direct properties of the nodes in each triple. A deeper analysis might reveal more qualitative differences between the sets if we also queried for the frequencies of related nodes, of the corresponding relation types, and of the related nodes' properties. This idea could be taken many layers deep.

Additionally, we infer that the discrepancies we observed between the two sets at least partially lead to the difference in the embedding models' scores. However, we did not do any internal analysis of the embedding models themselves, and their behavior with our triples remains somewhat of a black box. Thus, we cannot definitively say to what extent our queried qualitative differences contributed to the score differences.

Finally, it is hard to judge how much of the queried differences was actually due to the different natures of the sets rather than to chance/variance, especially since the removed triples sets are relatively small. Judging this would most likely require deeper statistical analysis of the variance of the values, which we did not do here.
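As one possible starting point for such an analysis, the sketch below applies a standard two-proportion z-test to a single property value. We did not run this test; it is a swapped-in textbook technique, the function name `two_proportion_z_test` is hypothetical, and the counts in the example are made up (chosen only so the proportions match the 75.0% and 56.4% figures quoted above, since the real set sizes are not restated here).

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1 of n1 items in the first set and x2 of n2 items in the second set
    have the property value of interest. Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both sets are all-or-nothing for this value; no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, NOT our real set sizes:
# 30 of 40 removed (75.0%) vs. 564 of 1000 non-removed (56.4%).
z, p = two_proportion_z_test(30, 40, 564, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The normal approximation behind this test becomes unreliable for very small counts, which matters for the small removed sets; an exact test would be a safer choice in those cases. Testing many property values at once would also call for a multiple-comparison correction.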
