## For running on Google Colab
1. Upload this notebook
2. On the left pane of the Google Colab UI, there will be an option to upload more files
3. Upload the `for_colab.zip` archive
4. Execute the cells below sequentially

In [None]:
!unzip -o for_colab.zip

In [None]:
import os
import json
from collections import defaultdict
import itertools
import importlib
import copy
import numpy as np

# Custom plotting script stored in the "scripts" directory
from scripts import plot_common as my_plt

# Measuring CVE Exposure For All Binaries

### Section 7.3 Figure 6

This notebook generates a cumulative distribution function (CDF) plot of CVE exposure. The file `data/combine_cve_reduction.json` lists, for each binary, the CVEs reachable at the function level and at the ELF level. We compute function-level reachability by running a breadth-first search (BFS) from each function in a binary and recording all vulnerable functions reached; this yields the number of CVEs reachable at the function level. Because the function-level graph is large and dense (~16 million nodes, ~70 million edges), these graph traversals are compute-intensive.
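
The traversal described above can be sketched as follows. This is a minimal illustration under assumed data structures, not the code in `scripts/get_cves_for_all.py`: the function name `reachable_cves`, the adjacency-dict call graph, and the placeholder CVE labels are all hypothetical.

```python
from collections import deque


def reachable_cves(call_graph, entry_functions, cves_by_function):
    """Return the CVE labels attached to any function reachable from the entries.

    call_graph: dict mapping a function name to the functions it calls (assumed shape).
    cves_by_function: dict mapping a vulnerable function to its CVE labels (assumed shape).
    """
    visited = set(entry_functions)
    queue = deque(entry_functions)
    reachable = set()
    while queue:
        func = queue.popleft()
        # Record CVEs attached to the function we just reached.
        reachable.update(cves_by_function.get(func, ()))
        for callee in call_graph.get(func, ()):
            if callee not in visited:
                visited.add(callee)
                queue.append(callee)
    return reachable


# Toy graph: the binary only ever calls lib_read, so lib_vulnerable is never reached.
call_graph = {
    "main": ["parse_args", "copy_file"],
    "parse_args": [],
    "copy_file": ["lib_read"],
    "lib_read": [],
    "lib_unused": ["lib_vulnerable"],
    "lib_vulnerable": [],
}
cves_by_function = {"lib_read": {"CVE-A"}, "lib_vulnerable": {"CVE-B"}}

func_level = reachable_cves(call_graph, ["main"], cves_by_function)
print(func_level)  # {'CVE-A'}
```

In this toy case an ELF-level analysis would report both CVEs, because the binary links against the library containing `lib_vulnerable`, whereas the function-level traversal reports only one.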

### Computing CVE Exposure
The file `data/combine_cve_reduction.json` was generated by the script `scripts/get_cves_for_all.py`. Running this script across all binaries in the dataset (more than 24000) requires significant time and compute resources. Because CVE reachability can be computed independently for each binary, the task parallelizes well. To process over 24000 binaries, we used a 48-core, 372 GB memory server and ran `get_cves_for_all.py` for several days.
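
Since each binary is independent, the work can be distributed over a worker pool. The sketch below is an assumption about how such a fan-out could look, not the actual structure of `get_cves_for_all.py`: `compute_all` and its `worker` argument are hypothetical names, and the default thread pool is used only to keep the example light. For CPU-bound graph traversals one would likely pass `concurrent.futures.ProcessPoolExecutor` instead.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_all(binaries, worker, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Apply `worker` to every binary independently and collect results by name.

    With ProcessPoolExecutor, `worker` must be a picklable top-level function.
    """
    binaries = list(binaries)
    with executor_cls(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with the binaries.
        results = list(pool.map(worker, binaries))
    return dict(zip(binaries, results))


# Toy worker: len() stands in for the per-binary reachability computation.
print(compute_all(["ls", "cat"], len))  # {'ls': 2, 'cat': 3}
```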

#### Performing a scaled-down version of this experiment
For artifact evaluation, we propose a scaled-down experiment: compute reachability for all binaries in the `coreutils` Debian package:
```
$ cd scripts
$ conda activate supply_chain_py311
$ python3 get_cves_for_all.py

# The results will be available in the scripts/results directory
```

#### Regenerating Figure 6
To regenerate Figure 6, which shows CVE reduction across all 24000+ binaries, we can run the cells below. They load the raw CVE reachability data from `data/combine_cve_reduction.json` and generate a CDF. The cells compute the percentage reduction of CVEs reachable for each binary from ELF-level to function-level.
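
As a small worked example of the per-binary metric, the reduction is the fraction of ELF-level CVEs that are no longer reachable at the function level. The helper name `cve_reduction` and the placeholder CVE labels are hypothetical; the zero-CVE convention mirrors the one used in the cells of this notebook.

```python
def cve_reduction(elf_cves, func_cves):
    """Fraction of ELF-level CVEs that are not reachable at the function level."""
    n_elf = len(elf_cves)
    if n_elf == 0:
        # A binary with no ELF-level CVEs is assigned a reduction of 0.
        return 0.0
    return (n_elf - len(func_cves)) / n_elf


example = {"ELF": ["CVE-A", "CVE-B", "CVE-C", "CVE-D"], "FUNC": ["CVE-A"]}
print(cve_reduction(example["ELF"], example["FUNC"]))  # 0.75
```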

In [None]:
cve_reduction_combine_run = {}

In [None]:
# Use a context manager so the file handle is closed after loading
with open("./data/combine_cve_reduction.json", "r") as f:
    cve_reduction_combine_run = json.load(f)

In [None]:
elf_cve_reduction_map = {}
invalid_cases = {}
no_elf_cves = []
elf_cve_reduction_nums = {}
elf_cve_reduction_percentage = {}
elf_cve_reduction_nums_dict = {}
diff_nums = defaultdict(list)
diff_percents = defaultdict(list)

# The values below are to calculate average percentage reduction
total_reduction = 0
total_cves_at_elf_level = 0

for elf, details in cve_reduction_combine_run.items():
    n_elf = len(details['ELF'])
    n_func = len(details['FUNC'])
    if n_elf == 0:
        no_elf_cves.append(elf)
    if n_func > n_elf:
        # This can happen when a binary's file name differs from its SONAME: nodes in the
        # ELF-level graph are based on SONAME, so the ELF-level graph misses the edge
        invalid_cases[elf] = details
        continue
    elf_cve_reduction_map[elf] = copy.deepcopy(details)
    elf_cve_reduction_nums_dict[elf] = {"ELF": n_elf, "FUNC": n_func}
    reduction = n_elf - n_func
    elf_cve_reduction_nums[elf] = reduction

    total_reduction += reduction
    total_cves_at_elf_level += n_elf

    diff_nums[reduction].append(elf)
    # Binaries with no ELF-level CVEs get a reduction of 0 to avoid dividing by zero
    reduction_fraction = reduction / n_elf if n_elf != 0 else 0
    elf_cve_reduction_percentage[elf] = reduction_fraction
    diff_percents[reduction_fraction].append(elf)

### Statistics
We now compute the percentage reductions reported in Section 7.3.

In [None]:
# Percentage of binaries that have a CVE reduction of at least 50 percent
more_than_half = 0
for red, elfs in diff_percents.items():
    if red >= 0.5:
        more_than_half += len(elfs)
print(more_than_half/len(elf_cve_reduction_percentage))

In [None]:
# Percentage and raw number of binaries with no CVE reduction at function-level vs ELF-level
print(len(diff_percents[0])/len(elf_cve_reduction_percentage))
print(len(diff_percents[0]))

In [None]:
# Fraction and raw number of binaries with exactly a 50 percent reduction
print(len(diff_percents[0.5])/len(elf_cve_reduction_percentage))
print(len(diff_percents[0.5]))

In [None]:
# Number of binaries for which there is 100 percent reduction
print(len(diff_percents[1.0]))

### Overall fraction of ELF-level CVEs eliminated at the function level

In [None]:
print(total_reduction/total_cves_at_elf_level)
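
The ratio printed above divides the total number of CVEs eliminated by the total number of ELF-level CVEs, so binaries with many CVEs weigh more. This differs from the mean of the per-binary reduction fractions. A minimal sketch with made-up counts shows both; the binary names and numbers are hypothetical, and the dictionary shape follows `elf_cve_reduction_nums_dict`.

```python
def pooled_reduction(counts):
    """Total CVEs eliminated divided by total ELF-level CVEs, over all binaries."""
    total_elf = sum(c["ELF"] for c in counts.values())
    total_removed = sum(c["ELF"] - c["FUNC"] for c in counts.values())
    return total_removed / total_elf if total_elf else 0.0


def mean_per_binary_reduction(counts):
    """Unweighted mean of each binary's own reduction fraction."""
    fractions = [
        (c["ELF"] - c["FUNC"]) / c["ELF"] if c["ELF"] else 0.0
        for c in counts.values()
    ]
    return sum(fractions) / len(fractions) if fractions else 0.0


# Hypothetical counts: one binary with many CVEs, one with few.
counts = {
    "bin_a": {"ELF": 10, "FUNC": 5},  # 50% reduction
    "bin_b": {"ELF": 2, "FUNC": 0},   # 100% reduction
}
print(pooled_reduction(counts))           # 7 eliminated out of 12
print(mean_per_binary_reduction(counts))  # 0.75
```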

### Plotting Figure 6: CVE Exposure CDF

In [None]:
importlib.reload(my_plt)
os.makedirs('figs', exist_ok=True)
my_plt.plot_cdf_and_hist(input_dict=elf_cve_reduction_percentage,
               xlabel="% Reduction in number of CVEs reported per binary",
               ylabel_cdf="Cumulative % of ELF binaries",
               ylabel_hist="Number of ELF Binaries",            
               x_percent=True,
               xticks=np.arange(0,1.1,0.1),
               xticks_sparse=False,
               legend_loc="upper left",
               aspect=0.70,
               fontsize=13,
               save_path='figs/cve_redn_per_bin_cdf_hist.pdf')
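
For readers without access to `scripts/plot_common.py`, the points of an empirical CDF can be computed as sketched below. This is an assumption about the general technique, not a description of what `plot_cdf_and_hist` does internally; `empirical_cdf` is a hypothetical helper.

```python
def empirical_cdf(values):
    """Return (x, cumulative fraction) pairs for the sorted input values."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest value (1-based) has i/n of the data at or below it.
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Toy per-binary reduction fractions.
points = empirical_cdf([0.5, 0.0, 1.0, 0.5])
print(points)  # [(0.0, 0.25), (0.5, 0.5), (0.5, 0.75), (1.0, 1.0)]
```

Applied to `elf_cve_reduction_percentage.values()`, the second element of each pair gives the cumulative fraction of binaries whose reduction is at most the first element.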