# Introduction

The motivation for this Jupyter notebook is to take Snyk output, import it into Elasticsearch, and render some interesting output.  Ordinarily, I would use the Snyk REST API, but that is not available on the free tier.  Web scraping is difficult because the site uses OAuth, and getting Selenium (to render JavaScript) working with modified headers is a tricky exercise.  At least, it was trickier than I had time to solve.

In this Jupyter notebook, I take the output from Snyk CLI commands and process it with Python.  Part of this processing is to get the data into Elasticsearch.  I have a 2-node Elasticsearch cluster on my Ubuntu machine for some experiments and training.

Once the data is in Elasticsearch, I'll do some transformations both with ES and with Python.  For example, in ES I will try some queries and some data representation.  In Python, I'll use the search capabilities of the client library to create dataframes and maybe even some plots.

I use two types of files.  I'll lead with JSON files from the Snyk CLI and use those as inputs into ES.  I'll also generate SARIF files.  I expect most work to happen with the JSON files.

I tested the following on:
- Ubuntu 24 LTS 
- Elasticsearch 8.15
- Python 3.12
- Snyk CLI 1.1293.1

These are the repositories that I used to generate the output files:
- git@github.com:marcoman/vulnado.git
- git@github.com:marcoman/java-goof.git
- git@github.com:marcoman/goof.git

These are the containers I used to generate output files:
- docker.elastic.co/elasticsearch/elasticsearch:8.15.1
- A local container built from https://github.com/marcoman/goof/tree/develop/todolist

For reference, these are some of the commands I ran to get my files:

```bash
snyk container test --json --json-file-output=container-elastic.json --app-vulns docker.elastic.co/elasticsearch/elasticsearch:8.15.1
snyk container test --json --json-file-output=container-todolist-goof.json --app-vulns todolist-goof:latest

snyk container test --sarif --sarif-file-output=container-todolist-goof.sarif --app-vulns todolist-goof:latest 
snyk container test --sarif --sarif-file-output=container-elastic.sarif --app-vulns docker.elastic.co/elasticsearch/elasticsearch:8.15.1

snyk test --json-file-output=os-goof-todolist.json --json
snyk test --json-file-output=os-java-goof.json --json
snyk test --json-file-output=os-vulnado.json --json

```

In [1]:
import pandas as pd
import numpy as np
import json
from elasticsearch import Elasticsearch, helpers

import urllib3

# This call disables the InsecureRequestWarning for unverified HTTPS requests
# This is a common workaround for disabling SSL certificate verification in Python
# It should not be used in production environments, as it can lead to security risks
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Let's load our files and start to examine them.  The JSON files are large, so we may have to trim them or load only a subset.


In [2]:
# We're going to load in files into a variable.  
# We *might* use the file contents as-is, but more likely just within a JSON object.
json_container_elastic = None
json_container_todolist = None
json_os_goof_todolist = None
json_os_java_goof = None
json_os_vulnado = None

with open('datafiles/container-elastic.json') as f:
    json_container_elastic = json.load(f)

with open('datafiles/container-todolist-goof.json') as f:
    json_container_todolist = json.load(f)

with open('datafiles/os-goof-todolist.json') as f:
    json_os_goof_todolist = json.load(f)

with open('datafiles/os-java-goof.json') as f:
    json_os_java_goof = json.load(f)

with open('datafiles/os-vulnado.json') as f:
    json_os_vulnado = json.load(f)
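
As more scan files are added, the one-variable-per-file pattern above gets repetitive.  A minimal sketch of an alternative, assuming every `*.json` file in a directory is a Snyk output file, is shown here.  The helper name `load_json_dir` is hypothetical and is not used elsewhere in this notebook.

```python
import json
from pathlib import Path


def load_json_dir(directory):
    """Load every *.json file in `directory` into a dict keyed by file stem.

    Hypothetical helper: 'container-elastic.json' would be stored under the
    key 'container-elastic'.  Non-JSON files are ignored.
    """
    loaded = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            loaded[path.stem] = json.load(f)
    return loaded
```

With the layout used in this notebook, `load_json_dir('datafiles')` would return one entry per scan file, which could then be looped over instead of naming each variable by hand.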


This next part reads our API credentials from environment variables.  In my environment, I set these values as environment variables so that I don't have to add them to the code.  My Elasticsearch server is on my own computer, and it is unlikely that the world will be attacking it.  Still, it is a good practice.

In [3]:
import os
ELASTIC_API_URL = os.environ.get('ELASTIC_API_URL')
ELASTIC_API_KEY = os.environ.get('ELASTIC_API_KEY')
# The Authorization header carries the API key (not a username and password)
headers = {
    'Authorization': f'ApiKey {ELASTIC_API_KEY}'
}   


In [4]:
print(ELASTIC_API_URL)


https://172.29.213.51:9200/
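
Two small things can go wrong with this setup: `os.environ.get` silently returns `None` when a variable is missing, and the URL printed above ends with a slash, so appending `/products` produces a double slash.  The sketch below shows one way to guard against both.  The helper names `require_env` and `join_url` are hypothetical.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def join_url(base, path):
    """Join a base URL and a path with exactly one slash between them."""
    return f"{base.rstrip('/')}/{path.lstrip('/')}"


# The base URL printed above has a trailing slash; join_url normalizes it.
print(join_url("https://172.29.213.51:9200/", "products"))
# → https://172.29.213.51:9200/products
```

Failing early with a clear message is usually easier to debug than an `Authorization: ApiKey None` header reaching the server.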


# Test Elasticsearch connection

As a test, let's see if we can access the ES server via a `requests` call.  This is different from using the ES library, which we'll test later.

In [5]:
import requests

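## Fetch the definition (aliases, mappings, settings) of the products index.  This is a GET request to /products
def get_products():
    # Strip any trailing slash from the base URL so the path has a single slash
    url = f"{ELASTIC_API_URL.rstrip('/')}/products"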
    
    # We specify verify=False to match curl's --insecure flag
    response = requests.get(url, headers=headers, verify=False)
    return response.json()


products = get_products()
print (f'Your products are: \n{products}')


Your products are: 
{'products': {'aliases': {}, 'mappings': {'properties': {'created': {'type': 'date', 'format': 'yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis'}, 'description': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}, 'id': {'type': 'long'}, 'in_stock': {'type': 'long'}, 'is_active': {'type': 'boolean'}, 'name': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}, 'price': {'type': 'long'}, 'sold': {'type': 'long'}, 'tages': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}, 'tags': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}, 'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}}, 'number_of_shards': '2', 'provided_name': 'products', 'creation_date': '1726425825891', 'number_of_replicas': '2', 'uuid': 'tPcO96JhRLqzI7bhDZSGXQ', 'version': {'created': '8512000'}}}}}
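
The response above is dense.  As a rough sketch, a helper like the following can reduce a `GET /<index>` response to a field-to-type map.  The name `field_types` is hypothetical, and the `sample` dict is a hand-trimmed excerpt of the output above rather than a full response.

```python
def field_types(index_response, index_name):
    """Map each top-level field to its mapped type from a GET /<index> response.

    Fields without an explicit 'type' are reported as 'object', which is an
    assumption that holds for plain object fields.
    """
    properties = index_response[index_name]["mappings"].get("properties", {})
    return {field: spec.get("type", "object") for field, spec in properties.items()}


# Trimmed excerpt of the response printed above (most fields omitted).
sample = {
    "products": {
        "mappings": {
            "properties": {
                "created": {"type": "date"},
                "name": {"type": "text"},
                "price": {"type": "long"},
            }
        }
    }
}

print(field_types(sample, "products"))
# → {'created': 'date', 'name': 'text', 'price': 'long'}
```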


Back to the Snyk CLI output files.

The general structure of the JSON file is below.  The file is organized into a few different top-level lists, and we'll spend most of our time on the `vulnerabilities` and `applications` lists.  As I work through the examples, I expect to use `projectName` and `path` as identifiers or query criteria.  This means I'm likely to add all vulnerabilities and applications to their respective indices and use the query as my filter.

```json
{
    "vulnerabilities": [],
    ...
    "summary" : "",
    "projectName" : "",
    "path" : "",
    "applications" : [
        {
            "projectName":"",
            "dependencyCount":"",
            "displayTargetFile":"",
            "targetFile":"",
            "path":"",
            "packageManager":"",
            "summary" : "",
            "vulnerabilities":[]
        }
    ]
}
```
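
To make the structure concrete, here is a minimal sketch that walks a dict of this shape and reports how many vulnerabilities sit at the top level and what each application's `summary` claims.  The function names and the `sample_scan` values are hypothetical, and the summary wording (`"3 vulnerable dependency paths"`) is an assumption about the CLI's phrasing.

```python
import re


def parse_vuln_count(summary):
    """Return the leading integer of a summary like '3 vulnerable dependency paths', else 0."""
    match = re.match(r"\s*(\d+)\s+vulnerable dependency path", summary or "")
    return int(match.group(1)) if match else 0


def summarize_scan(scan):
    """Count top-level vulnerabilities and collect per-application summary counts."""
    return {
        "projectName": scan.get("projectName"),
        "top_level_vulnerabilities": len(scan.get("vulnerabilities", [])),
        "applications": [
            {
                "targetFile": app.get("targetFile"),
                "vulns": parse_vuln_count(app.get("summary", "")),
            }
            for app in scan.get("applications", [])
        ],
    }


# Hypothetical, hand-written sample in the shape sketched above.
sample_scan = {
    "vulnerabilities": [{"id": "EXAMPLE-1"}, {"id": "EXAMPLE-2"}],
    "projectName": "docker-image|example",
    "path": "example:latest",
    "applications": [
        {"targetFile": "pom.xml", "summary": "3 vulnerable dependency paths"},
        {"targetFile": "package.json", "summary": "No known vulnerabilities"},
    ],
}

print(summarize_scan(sample_scan))
```

Using `.get` with defaults keeps the walk from failing on open-source scans, which in this notebook are read without an `applications` list.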


In [6]:
# Initialize the Elasticsearch client and use the API key to log on.


from elasticsearch import Elasticsearch
es = Elasticsearch(ELASTIC_API_URL, api_key=ELASTIC_API_KEY, verify_certs=False)



In [17]:
# Clean up the indices so we start from a known state
# Delete only if the indices are present.

if es.indices.exists(index='applications'):
    print("Deleting index applications")
    res = es.indices.delete(index='applications')
    print(res)
else:
    print("Index applications does not exist")

if es.indices.exists(index='vulnerabilities'):
    print("Deleting index vulnerabilities")
    res = es.indices.delete(index='vulnerabilities')
    print(res)
else:
    print("Index vulnerabilities does not exist")


Deleting index applications
{'acknowledged': True}
Deleting index vulnerabilities
{'acknowledged': True}


## Create indices for our test

I create indices for both `applications` and `vulnerabilities` explicitly.  

In [18]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

if not es.indices.exists(index="applications"):
    response = es.indices.create(index="applications", body=index_settings)

if not es.indices.exists(index="vulnerabilities"):
    response = es.indices.create(index="vulnerabilities", body=index_settings)
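
The settings above leave every field to dynamic mapping.  If more control were wanted, a sketch like the following could build an index body with explicit types for a few fields.  The function name `build_index_body` and the choice of fields and types are assumptions for illustration, not what this notebook actually runs.

```python
def build_index_body(shards=1, replicas=0):
    """Return index settings plus an explicit mapping for a few fields.

    Fields not listed here would still be mapped dynamically.
    """
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "properties": {
                # keyword: exact-match values, usable directly in aggregations
                "severity": {"type": "keyword"},
                "cvssScore": {"type": "float"},
                "vulns": {"type": "integer"},
            }
        },
    }


body = build_index_body()
print(body["mappings"]["properties"]["severity"])
# → {'type': 'keyword'}

# With a live client, the body could be passed as:
# es.indices.create(index="vulnerabilities",
#                   settings=body["settings"], mappings=body["mappings"])
```

One consequence of mapping `severity` as `keyword` is that queries and aggregations would reference `severity` directly rather than a `.keyword` subfield.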


## Populate the indices

The easiest solution is to iterate through the different JSON files and add their data to the indices.  As I add more files, I will automate this further.


In [22]:
json_files = [["Elasticsearch container", json_container_elastic, "container"],
              ["Java TODO List container", json_container_todolist, "container"],
              ["Goof Open Source", json_os_goof_todolist, "opensource"],
              ["Java Goof Open Source", json_os_java_goof, "opensource"],
              ["Vulnado Open Source", json_os_vulnado, "opensource"],
              ]

def iterate_through_containers(jsonfile):
    # we want to iterate and report on two lists inside of the Json body named jsonfile.
    # The first is named applications, and the second is named vulnerabilities.  
    # These two lists are independent and at the same level
    # Read through each and print out their contents
    print(f'Operating on {jsonfile[0]}')
    projectName = jsonfile[1]['projectName']
    path = jsonfile[1]['path']
    
    print(f'Working on {projectName} with {path}')

    i = 0
    for app in jsonfile[1]['applications']:
        # print(app)
        # Now load each app named "app" as a new document in ElasticSearch into the index named "applications"
        # There is variation in the records and I need to adjust how they are stored.  For example, the upgradePath is empty or contains values.
        # For this part, I'll create a new record that is just a subset of the original.
        if "vulnerable dependency path" in app['summary']:
            # We know the first token is a number, so let's get it and render it as an integer
            vulns = int(app['summary'].split(' ')[0])
        else:
            vulns = 0
            
        newapp = {
            "projectName" : projectName,
            "path" : path,
            "appProjectName": app['projectName'],
            "targetFile": app['targetFile'],
            "summary": app['summary'],
            "displayTargetFile": app['displayTargetFile'],
            "id" : i,
            "vulns" : vulns
        }
        es.index(index="applications", document=newapp)
        i += 1
    print(f'There are {i} applications')

    i = 0        
    for vuln in jsonfile[1]['vulnerabilities']:
        # print(vuln)
        newvuln = {
            "projectName" : projectName,
            "path" : path,
            "id": vuln['id'],
            "CVSSv3": vuln['CVSSv3'],
            "cvssScore": vuln['cvssScore'],
            "description": vuln['description'],
            "packageName": vuln['packageName'],
        }
        es.index(index="vulnerabilities", document=newvuln)
        i += 1
    print(f'There are {i} vulnerabilities')
    
def iterate_through_opensource(jsonfile):
    print(f'Operating on {jsonfile[0]}')
    projectName = jsonfile[1]['projectName']
    path = jsonfile[1]['path']
    packageManager = jsonfile[1]['packageManager']
    print(f'Working on {projectName} with {path} and {packageManager}')
    i = 0
    for vuln in jsonfile[1]['vulnerabilities']:
        newvuln = {
            "projectName" : projectName,
            "path" : path,
            "packageManager" : packageManager,
            "CVSSv3": vuln['CVSSv3'],
            "cvssScore": vuln['cvssScore'],
            "id": vuln['id'],
            "language" : vuln['language'],
            "packageName" : vuln['packageName'],
            "severity": vuln['severity'],
            "description": vuln['description'],
        }
        es.index(index="vulnerabilities", document=newvuln)
        i += 1
    print(f'There are {i} vulnerabilities')

for jsonfile in json_files:
    if jsonfile[2] == "container":
        iterate_through_containers(jsonfile=jsonfile)
    elif jsonfile[2] == "opensource":
        iterate_through_opensource(jsonfile=jsonfile)
   


Operating on Elasticsearch container
Working on docker-image|docker.elastic.co/elasticsearch/elasticsearch with docker.elastic.co/elasticsearch/elasticsearch:8.15.1/elasticsearch/elasticsearch
There are 94 applications
There are 104 vulnerabilities
Operating on Java TODO List container
Working on docker-image|todolist-goof with todolist-goof:latest
There are 7 applications
There are 2293 vulnerabilities
Operating on Goof Open Source
Working on io.github.snyk:todolist-mvc with /home/marco/code/marcoman/goof/todolist and maven
There are 0 vulnerabilities
Operating on Java Goof Open Source
Working on io.github.snyk:java-goof with /home/marco/code/marcoman/java-goof and maven
There are 0 vulnerabilities
Operating on Vulnado Open Source
Working on com.scalesec:vulnado with /home/marco/code/marcoman/vulnado and maven
There are 126 vulnerabilities
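
The cell above sends one request per document with `es.index`, which is slow for the 2293 vulnerabilities in the todolist container.  The `helpers` module imported at the top of the notebook offers a bulk path.  Below is a minimal sketch of the action generator it expects.  The name `bulk_actions` is hypothetical, and the call to `helpers.bulk` is left as a comment because it needs a live cluster.

```python
def bulk_actions(index_name, documents):
    """Yield one bulk 'index' action per document for the given index."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


# Hypothetical documents in the shape built by the functions above.
docs = [
    {"projectName": "example", "id": "EXAMPLE-1", "packageName": "pkg-a"},
    {"projectName": "example", "id": "EXAMPLE-2", "packageName": "pkg-b"},
]

for action in bulk_actions("vulnerabilities", docs):
    print(action["_index"], action["_source"]["id"])

# With a live client:
# from elasticsearch import helpers
# helpers.bulk(es, bulk_actions("vulnerabilities", docs))
```

Because the generator is lazy, documents are built only as the bulk helper consumes them, which keeps memory use flat for large scans.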


Now that we've imported vulnerabilities from a few containers and a few open-source projects, let's see what extra details or insights we can get.

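NOTE: Some projects did not have vulnerabilities.  Also, with the free tier, we may be limited in the total number of scans we can run.  In addition, `iterate_through_containers` does not copy the `severity` field into the documents it indexes, so severity-based queries will not match the container vulnerabilities.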

Let's start by finding the records that match our search criteria.  Since all of the data is in two different indices, I will be looking for unique values.

In [23]:
# Let's get the unique projectName values in the applications and vulnerabilities indices
# We will use query results from elasticsearch.

# Define the aggregation query
query = {
    "size": 0,
    "aggs": {
        "unique_project_names": {
            "terms": {
                "field": "projectName.keyword",
                "size": 10000  # Adjust the size as needed
            }
        }
    }
}

# Execute the search query
response = es.search(index="applications", body=query)

# Extract the unique project names
app_project_names = [bucket['key'] for bucket in response['aggregations']['unique_project_names']['buckets']]

# Print the unique project names
print("\nUnique project names in applications index:")
for project_name in app_project_names:
    print(project_name)

# now let's do the same for the vulnerabilities index.  Same query
response = es.search(index="vulnerabilities", body=query)
vuln_project_names = [bucket['key'] for bucket in response['aggregations']['unique_project_names']['buckets']]
print("\nUnique project names in vulnerabilities index:")
for project_name in vuln_project_names:
    print(project_name)
    


Unique project names in applications index:
docker-image|docker.elastic.co/elasticsearch/elasticsearch
docker-image|todolist-goof

Unique project names in vulnerabilities index:
docker-image|todolist-goof
com.scalesec:vulnado
docker-image|docker.elastic.co/elasticsearch/elasticsearch
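
The two lists differ because only container scans write to the `applications` index.  A quick set difference makes that visible.  The lists below are copied from the output above.

```python
# Project names copied from the aggregation output above.
app_project_names = [
    "docker-image|docker.elastic.co/elasticsearch/elasticsearch",
    "docker-image|todolist-goof",
]
vuln_project_names = [
    "docker-image|todolist-goof",
    "com.scalesec:vulnado",
    "docker-image|docker.elastic.co/elasticsearch/elasticsearch",
]

# Projects that have vulnerabilities indexed but no application records.
only_in_vulns = sorted(set(vuln_project_names) - set(app_project_names))
print(only_in_vulns)
# → ['com.scalesec:vulnado']
```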


# Elasticsearch query examples

Here are some ideas:

* How many Critical, High, Medium, Low vulnerabilities do we have for each docker-image?
* What is the percentage of each type for the docker-images?


In [27]:
print("Application type count for each of our projects.")
for project_name in app_project_names:
    print(f"Project: {project_name}")
    # Query the applications index for this project's applications that
    # report at least one vulnerability (vulns > 0).
    
    
    app_critical = es.search(index="applications",
                             query= {"bool": {
                                 "must" : [
                                     { "range" : {
                                         "vulns" : {"gt" : 0}
                                         }
                                     },
                                     { "term" : {"projectName.keyword" : project_name}}
                                     ]
                                 }},
                             size=0,
                             aggs={
                                 "vuln_count" : {
                                     "value_count" : {
                                         # vulns is indexed as a number, so it has no .keyword subfield
                                         "field" : "vulns"
                                     }
                                 }
                             },
                             )
    print(f"Apps with vulns: {app_critical['hits']['total']['value']}")

print("\n\nVulnerability type count for each of our projects.")
for unique_project in vuln_project_names:
    print(f"Project: {unique_project}")
    vuln_critical = es.search(index="vulnerabilities",
                             query= {"bool": {
                                 "must" : [
                                     { "match" : {"severity" : "critical"}},
                                     { "term" : {"projectName.keyword" : unique_project}}
                                     ]
                                 }},)
    print(f"Critical: {vuln_critical['hits']['total']['value']}")

    vuln_high = es.search(index="vulnerabilities",
                             query= {"bool": {
                                 "must" : [
                                     { "match" : {"severity" : "high"}},
                                     { "term" : {"projectName.keyword" : unique_project}}
                                     ]
                                 }},)
    print(f"High: {vuln_high['hits']['total']['value']}")

    vuln_med = es.search(index="vulnerabilities",
                             query= {"bool": {
                                 "must" : [
                                     { "match" : {"severity" : "medium"}},
                                     { "term" : {"projectName.keyword" : unique_project}}
                                     ]
                                 }},)
    print(f"Med: {vuln_med['hits']['total']['value']}")

Application type count for each of our projects.
Project: docker-image|docker.elastic.co/elasticsearch/elasticsearch
Apps with vulns: 102
Project: docker-image|todolist-goof
Apps with vulns: 7


Vulnerability type count for each of our projects.
Project: docker-image|todolist-goof
Critical: 0
High: 0
Med: 0
Project: com.scalesec:vulnado
Critical: 4
High: 162
Med: 68
Project: docker-image|docker.elastic.co/elasticsearch/elasticsearch
Critical: 0
High: 0
Med: 0
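
The per-severity searches above can be collapsed into one request per project with a `terms` aggregation, which also gives the totals needed for the percentage question.  This is a sketch under assumptions: `severity.keyword` presumes the dynamic text-plus-keyword mapping (an explicitly mapped `keyword` field would be referenced as plain `severity`), the function names are hypothetical, and `fake_response` holds made-up counts, not results from this cluster.

```python
def severity_query(project_name):
    """Build a search body that counts documents per severity for one project."""
    return {
        "size": 0,
        "query": {"term": {"projectName.keyword": project_name}},
        "aggs": {"by_severity": {"terms": {"field": "severity.keyword"}}},
    }


def severity_percentages(response):
    """Convert the 'by_severity' buckets of a search response into percentages."""
    buckets = response["aggregations"]["by_severity"]["buckets"]
    total = sum(bucket["doc_count"] for bucket in buckets)
    if total == 0:
        return {}
    return {b["key"]: round(100 * b["doc_count"] / total, 1) for b in buckets}


# Made-up response in the shape Elasticsearch returns for a terms aggregation.
fake_response = {
    "aggregations": {
        "by_severity": {
            "buckets": [
                {"key": "high", "doc_count": 2},
                {"key": "critical", "doc_count": 1},
                {"key": "medium", "doc_count": 1},
            ]
        }
    }
}

print(severity_percentages(fake_response))
# → {'high': 50.0, 'critical': 25.0, 'medium': 25.0}

# With a live client:
# response = es.search(index="vulnerabilities", **severity_query("com.scalesec:vulnado"))
```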
