## Finding C and C++ vulnerabilities in Python packages on PyPI and Conda with Dilligencer™

This is a Jupyter notebook demonstrating how easily the Dilligencer platform can address ad-hoc research tasks.

In this example, inspired by our prior work on Java cross-language dependencies in PyPI and by Snyk's recent blog post, we show how to quickly integrate a third-party static code analysis tool, flawfinder, into a compliance or research pipeline.
 
Although flawfinder is licensed under GPL-2.0 and covers only a subset of C/C++ versions and issue types, it makes a good example.

* You can learn more about Dilligencer at [https://dilligencer.io](https://dilligencer.io).
* You can learn more about licens.io, the technical due diligence and valuation company behind Dilligencer, at [https://licens.io](https://licens.io).

In [1]:
# initial imports for platform config
import os
import pathlib
import subprocess
import sys

# platform notebook config
# NOTE: override `PROJECT_PATH` if you have moved notebook root to a non-standard location
NOTEBOOK_ROOT_PATH = os.path.abspath(".")
PROJECT_PATH = os.path.abspath(os.path.join(NOTEBOOK_ROOT_PATH, "../src"))

sys.path.append(PROJECT_PATH)
print("configured with platform source at", PROJECT_PATH)

configured with platform source at /app/src


In [5]:
# platform imports
from apps.python.dal import py_project_model, py_project_release_model, py_project_file_model, PyProjectFileEntity
from apps.python.dal.repository import PyPiRepository
from apps.python.project.dist import PyPiDistParser
from apps.dist.source import BaseDistSourceFactory, TarFileSource

In [3]:
# extra imports for this script
import tempfile
import pandas
from sqlalchemy import or_

# package import for flawfinder (#SPDX: GPL-2.0)
# TODO: PR and enhance flawfinder
import flawfinder
flawfinder.quiet = True
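
The scan loop in the next cell has to bridge a gap: file contents arrive as in-memory bytes, while flawfinder only accepts paths on the local filesystem. A minimal sketch of that bridging pattern, using only the standard library, might look like the following. `run_on_temp_file`, `count_lines`, and the `analyze` callback are hypothetical names for illustration and are not part of Dilligencer or flawfinder.

```python
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def run_on_temp_file(content: bytes, analyze: Callable[[str], T], suffix: str = ".c") -> T:
    """Write `content` to a temporary file and call `analyze` with its path.

    Hypothetical helper: `analyze` stands in for any path-based tool entry point.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix) as f_temp:
        f_temp.write(content)
        # flush so a second reader opening the path sees every byte
        f_temp.flush()
        # the file is deleted when the `with` block exits, so analyze inside it
        return analyze(f_temp.name)


def count_lines(path: str) -> int:
    """Toy analyzer that counts lines; a real pipeline would call the scanner here."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


line_count = run_on_temp_file(b"int main(void) {\n  return 0;\n}\n", count_lines)
```

Reopening a `NamedTemporaryFile` by name while it is still open works on Unix but not on Windows, so this sketch assumes a Unix-like host.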

In [4]:
# create pypi repo object
pypi_repo = PyPiRepository()

# find projects that contain at least one C or C++ source file (.c, .cc, .cpp)
project_pk_with_c = py_project_file_model._query()\
    .filter(or_(PyProjectFileEntity.dist_path.endswith(".c"),
                PyProjectFileEntity.dist_path.endswith(".cc"),
                PyProjectFileEntity.dist_path.endswith(".cpp"),
               ))\
    .distinct("project_id")\
    .values("project_id")

# iterate through projects
for project_pk in project_pk_with_c:
    project = py_project_model.get_by_pk(project_pk)
    
    # iterate through all project releases
    for release in py_project_release_model.get_parsed_releases(project):
        # pull the release from pypi or cached s3
        release_dist_bytes = pypi_repo.load_dist(project, release)
        release_dist_parser = PyPiDistParser(release, pypi_repo)
        
        # get file source and iterate
        with release_dist_parser.create_file_source(
                release_dist_parser._release_path,
                release_dist_bytes,
                f"{project.name}:{release.version}") as release_dist_source:
            for f in release_dist_source.walk():
                # filter only c/c++ files (splitext returns "" when there is no extension)
                f_extension = os.path.splitext(f.get_file_name())[1]
                if not f_extension:
                    continue
                if f_extension.lower() in [".c", ".cc", ".cpp", ".h", ".hh", ".hpp"]:
                    # flawfinder only works with file paths on local filesystem
                    # TODO: update after PR/fork
                    with tempfile.NamedTemporaryFile() as f_temp:
                        # write
                        f_temp.write(f._get_content())
                        f_temp.flush()
                        
                        # empty flawfinder global hitlist and call process
                        flawfinder.hitlist = []
                        flawfinder.process_c_file(f_temp.name, None)
                        
                        # output if results
                        if len(flawfinder.hitlist) > 0:
                            print(f"release={project.name}:{release.version} @ {f.dist_path}:")
                            print(flawfinder.SarifLogger(flawfinder.hitlist).output_sarif())
                            
                            # stop after our first hit so we don't flood the notebook output
                            raise RuntimeError("don't print a billion lines")

release=pybbi:0.2.2 @ src/asParse.c:
{
  "$schema": "https://schemastore.azurewebsites.net/schemas/json/sarif-2.1.0-rtm.5.json",
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "Flawfinder",
          "version": "2.0.19",
          "informationUri": "https://dwheeler.com/flawfinder/",
          "rules": [
            {
              "id": "FF1001",
              "name": "buffer/strcpy",
              "shortDescription": {
                "text": "Does not check for buffer overflows when copying to destination [MS-banned] (CWE-120)."
              },
              "defaultConfiguration": {
                "level": "error"
              },
              "helpUri": "https://cwe.mitre.org/data/definitions/120.html",
              "relationships": [
                {
                  "target": {
                    "id": "CWE-120",
                    "toolComponent": {
                      "name": "CWE",
                      "guid": "FFC64C9

RuntimeError: don't print a billion lines
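
The SARIF log above is verbose, so for triage it can help to reduce it to one row per rule. The sketch below is a minimal, standard-library-only example of that reduction. `summarize_rules` is a hypothetical helper, and `sample_sarif` is a hand-built fragment that copies only fields visible in the truncated output above; it is not a real flawfinder log.

```python
import json
from typing import Dict, List, Optional


def summarize_rules(sarif_text: str) -> List[Dict[str, Optional[str]]]:
    """Return one {id, name, level, description} dict per rule in a SARIF log."""
    sarif = json.loads(sarif_text)
    rows = []
    for run in sarif.get("runs", []):
        driver = run.get("tool", {}).get("driver", {})
        for rule in driver.get("rules", []):
            rows.append({
                "id": rule.get("id"),
                "name": rule.get("name"),
                "level": rule.get("defaultConfiguration", {}).get("level"),
                "description": rule.get("shortDescription", {}).get("text"),
            })
    return rows


# hand-built sample that copies the fields shown in the output above
sample_sarif = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {
            "name": "Flawfinder",
            "rules": [{
                "id": "FF1001",
                "name": "buffer/strcpy",
                "shortDescription": {
                    "text": "Does not check for buffer overflows when copying "
                            "to destination [MS-banned] (CWE-120)."
                },
                "defaultConfiguration": {"level": "error"},
            }],
        }},
    }],
})

rule_rows = summarize_rules(sample_sarif)
```

Passing `rule_rows` to `pandas.DataFrame` (pandas is already imported in this notebook) would give a table that can be sorted or grouped by `level`.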