# DEMO — OSS-Fuzz Research Prototype

## In Brief

This notebook demos a prototype for an **OSS-Fuzz module for researchers** and explains some of the design decisions behind it.

More than demonstrating know-how, my primary goal in developing it was to explore the problem domain: to understand how the OSS-Fuzz project works and, from there, how we could make it more accessible to researchers through a usable, intuitive Python package.

In its current state, the prototype addresses the requirements below, each through the functions listed next to it:

> - Project Information Retrieval ℹ️:
>   1. API to list all projects currently fuzzed by OSS-Fuzz: `list_projects()`.
>   2. API to retrieve details for a specific project (e.g., language, build system, fuzzing engines used): `get_project()`, `get_projects()`, and `cache_projects()`.
>   3. Ability to filter projects based on criteria (e.g., language, library): `match_projects()` and `filter_projects()`.

⚠️ **A word of caution**: This is very much a prototype. It assumes the easy path, responsive APIs, and a cooperative user. It makes design choices that would need discussion. It does barely any error checking and wouldn't stand a chance against a fuzzer. It simply aims to demo basic functionality with clean, minimal code.

If selected, the prototype could be used as a starting point, could be refactored to correct course, or could be discarded to start fresh. I'm not married to it. Whether I'm selected or not, if any of its code could be useful to bootstrap the project, feel free to use it.

## Demo

The happy path through this notebook is to run it top to bottom. If you run the earlier functions a lot, you might hit GitHub's API rate limit for unauthenticated requests, which should reset after an hour. If you're getting results back, but they're not what you expected, please run `clear_cache()` and try again.

In [1]:
import api as ossfuzz

from pprint import pprint as pp

## Requirement 1

> API to list all projects currently fuzzed by OSS-Fuzz.

This requirement implied an important design choice: **What to use as data source?** Should we clone `oss-fuzz` locally and keep it synced? Should we access the project via GitHub's API to always get fresh data? Are appropriate APIs even available?

To answer these questions, I gauged the amount of data we'd need to retrieve, looked for inspiration in projects that integrate with `oss-fuzz`, such as `fuzz-introspector`, and explored GitHub's [REST](https://docs.github.com/en/rest?apiVersion=2022-11-28) and [GraphQL](https://docs.github.com/en/graphql) APIs.

**Design Decision: "Small Data" Source**

For this requirement, I decided to use GitHub's REST API because it's simple, quick, always gets fresh data, and doesn't require authentication for our use in this demo. However, by then, I also knew the REST API wouldn't be sufficient to address the later requirements. Nonetheless, since this is a prototype and exploratory by nature, I thought it interesting to use multiple data sources.

Knowing that this was a subjective design decision, and for the sake of separation of concerns, I deferred data fetching to a dedicated module (`fetcher.py`), to make it easy to swap data sources if we want.
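
As a rough illustration of what such a fetching step might look like, the sketch below extracts project names from a payload shaped like the response of GitHub's REST "contents" endpoint. The function names, the URL constant, and the `limit` handling are assumptions for illustration, not the actual contents of `fetcher.py`. One caveat: GitHub documents a 1,000-entry cap on directory listings from this endpoint, so a complete listing may call for the Git Trees endpoint instead.

```python
import json
import urllib.request
from typing import Any, Optional

# Assumed endpoint: the `projects/` directory of the google/oss-fuzz repository
PROJECTS_URL = "https://api.github.com/repos/google/oss-fuzz/contents/projects"


def fetch_directory_listing(url: str = PROJECTS_URL) -> list[dict[str, Any]]:
    """Download and decode a directory listing (performs a network request)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_project_names(
    entries: list[dict[str, Any]], limit: Optional[int] = None
) -> list[str]:
    """Keep only sub-directories, preserving the order GitHub returned them in."""
    names = [entry["name"] for entry in entries if entry.get("type") == "dir"]
    return names if limit is None else names[:limit]
```

Keeping the parsing separate from the network call means the parsing can be tested offline, and the data source can be swapped without touching it.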

In [2]:
# List 5 projects
ossfuzz.list_projects(limit=5)  # The limit param is mainly for debugging

['abseil-cpp', 'abseil-py', 'ada-url', 'adal', 'aiohttp']

In [3]:
# List all projects
ossfuzz.list_projects()

['abseil-cpp',
 'abseil-py',
 'ada-url',
 'adal',
 'aiohttp',
 'airflow',
 'alembic',
 'ampproject',
 'angular',
 'angus-mail',
 'aniso8601',
 'ansible',
 'antlr3-java',
 'antlr4-java',
 'apache-axis2',
 'apache-commons-bcel',
 'apache-commons-beanutils',
 'apache-commons-cli',
 'apache-commons-codec',
 'apache-commons-collections',
 'apache-commons-compress',
 'apache-commons-configuration',
 'apache-commons-csv',
 'apache-commons-fileupload',
 'apache-commons-geometry',
 'apache-commons-imaging',
 'apache-commons-io',
 'apache-commons-jxpath',
 'apache-commons-lang',
 'apache-commons-logging',
 'apache-commons-math',
 'apache-commons-net',
 'apache-commons-text',
 'apache-commons-validator',
 'apache-cxf',
 'apache-doris',
 'apache-felix-dev',
 'apache-httpd',
 'apache-logging-log4cxx',
 'apache-poi',
 'aptos-core',
 'archaius-core',
 'arduinojson',
 'argcomplete',
 'argo',
 'args',
 'args4j',
 'arrow-java',
 'arrow-py',
 'arrow',
 'askama',
 'asn1crypto',
 'aspectj',
 'aspell',
 'as

## Requirement 2

> API to retrieve details for a specific project (e.g., language, build system, fuzzer engines used).

**Design Decision: Details Extraction**

Most of the project details live in `project.yaml`, so it made sense to start by fetching it. I used GitHub's REST API for this too. The structured nature of the file made it easy to load into a `dataclass`. I introduced two modules for this: `loaders.py` and `models.py`. Even though the codebase is compact at this point, I decided to break the functionality into separate modules, to strengthen decoupling and support modularity from the get-go.
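
To make the loading step concrete, here is a minimal sketch of a dataclass populated from an already-parsed `project.yaml`. It assumes the YAML text has been turned into a plain `dict` beforehand (for example with a YAML library), and `from_mapping` is a hypothetical helper name. The field names mirror those visible in the outputs in this notebook, but the real `models.py` may be organized differently.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Any, Optional


@dataclass
class Project:
    name: str
    language: Optional[str] = None
    homepage: Optional[str] = None
    main_repo: Optional[str] = None
    primary_contact: Optional[str] = None
    vendor_ccs: Optional[list[str]] = None
    fuzzing_engines: Optional[list[str]] = None

    @classmethod
    def from_mapping(cls, name: str, data: dict[str, Any]) -> "Project":
        """Build a Project from parsed YAML, ignoring keys the model doesn't declare."""
        known = {f.name for f in fields(cls)} - {"name"}
        return cls(name=name, **{k: v for k, v in data.items() if k in known})

    def to_json(self) -> str:
        """Serialize all fields as indented JSON."""
        return json.dumps(asdict(self), indent=4)
```

Ignoring unknown keys keeps the loader tolerant of `project.yaml` entries (such as `sanitizers`) that the model doesn't track yet.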

The build system, however, wasn't in `project.yaml`. For this, we had to fetch `build.sh` and parse it to try to deduce the build system. I did this with a simple heuristic that looks for mentions of certain build tools. The heuristic needs to be fleshed out; at this point, it's just a proof of concept. I limited myself to the build tools mentioned by David and Adam in [this article](https://adalogics.com/blog/fuzzing-100-open-source-projects-with-oss-fuzz).
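
A keyword heuristic of this kind could be sketched as follows. The tool list and its priority order are hypothetical placeholders rather than the list from the article or the prototype, and skipping comment lines is an added assumption meant to keep license headers from triggering false matches.

```python
import re
from typing import Optional

# Hypothetical keyword table, checked in priority order.
# Word boundaries keep "make" from matching inside "cmake".
BUILD_TOOL_PATTERNS = [
    ("bazel", r"\bbazel\b"),
    ("cmake", r"\bcmake\b"),
    ("meson", r"\bmeson\b"),
    ("make", r"\bmake\b"),
]


def detect_build_system(build_sh: str) -> Optional[str]:
    """Return the first build tool mentioned in the script, or None if none is found."""
    # Drop comment lines (including the shebang) before searching
    code = "\n".join(
        line for line in build_sh.splitlines() if not line.lstrip().startswith("#")
    )
    for tool, pattern in BUILD_TOOL_PATTERNS:
        if re.search(pattern, code):
            return tool
    return None
```

Because many CMake builds also invoke `make`, the order of the table matters: more specific tools are checked first.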

In [4]:
# Get a specific project and show all its details
pp(ossfuzz.get_project("angular"))

Project(name='angular',
        language='javascript',
        homepage='https://angular.io/',
        main_repo='https://github.com/angular/angular.git',
        primary_contact=None,
        vendor_ccs=['wagner@code-intelligence.com',
                    'yakdan@code-intelligence.com',
                    'patrice.salathe@code-intelligence.com',
                    'hlin@code-intelligence.com',
                    'christopher.krah@code-intelligence.com',
                    'bug-disclosure@code-intelligence.com'],
        fuzzing_engines=['libfuzzer'],
        build_system='bazel',
        project_yaml='homepage: https://angular.io/\n'
                     'language: javascript\n'
                     'main_repo: https://github.com/angular/angular.git\n'
                     'fuzzing_engines:\n'
                     '- libfuzzer\n'
                     'sanitizers:\n'
                     '- none\n'
                     'vendor_ccs:\n'
                     '- "wagner@code-intelligen

In [5]:
# Get a specific project and show a specific attribute, such as its language...
ossfuzz.get_project("curl").language

'c++'

In [6]:
# ... or its build system...
ossfuzz.get_project("angular").build_system

'bazel'

In [7]:
# ... or its fuzzing engines
ossfuzz.get_project("ada-url").fuzzing_engines

['libfuzzer', 'afl', 'honggfuzz', 'centipede']

In [8]:
# We may also output details as JSON
print(ossfuzz.get_project("envoy").to_json())

{
    "name": "envoy",
    "language": "c++",
    "homepage": "https://www.envoyproxy.io/",
    "main_repo": "https://github.com/envoyproxy/envoy.git",
    "primary_contact": "htuch@google.com",
    "vendor_ccs": null,
    "fuzzing_engines": [
        "libfuzzer",
        "honggfuzz"
    ],
    "build_system": "make",
    "project_yaml": "homepage: \"https://www.envoyproxy.io/\"\nlanguage: c++\nprimary_contact: \"htuch@google.com\"\nauto_ccs:\n  - \"mattklein123@gmail.com\"\n  - \"jmarantz@google.com\"\n  - \"lizan@tetrate.io\"\n  - \"envoy-security@googlegroups.com\"\n  - \"yavlasov@google.com\"\n  - \"asraa@google.com\"\n  - \"adip@google.com\"\n  - \"yanjunxiang@google.com\"\n  - \"tyxia@google.com\"\n  - \"krajshiva@google.com\"\n  - \"boteng@google.com\"\n  - \"leonti@google.com\"\n  - \"copybara-watcher-pod-watcher-git@system.gserviceaccount.com\"\n  - \"copybara-worker@system.gserviceaccount.com\"\ncoverage_extra_args: -ignore-filename-regex=.*\\.cache.*envoy_deps_cache.*\nmain_

**Design Decision: "Big Data" Source**

To get the details of a single project, GitHub's REST API does the job. Most notably, it allows up to 60 unauthenticated requests per hour, which is convenient for this demo. **But what if we want the details of all projects at once?** One request gets us one file. We need two files per project, and there are over 1000 projects. Even authenticated, with a higher rate limit, the REST API would need 2000+ separate requests per refresh, which is impractical.

Fortunately, GitHub also offers a **GraphQL API for "big data"**. The queries are a bit verbose, but this API allows us to bulk-fetch all 2000+ files in a dozen queries. Technically, it could just be a single query, but I implemented rudimentary query batching for robustness. With this API, we may refresh the data as often as we want.
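
To illustrate the batching idea, the sketch below splits a list of project names into chunks and builds one aliased GraphQL query per chunk. The helper names, the batch size, and the `HEAD:` expression prefix are assumptions for illustration. It only builds query strings and leaves sending them out.

```python
from typing import Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_query(project_names: list[str], filename: str = "project.yaml") -> str:
    """Build one GraphQL query requesting `filename` for every project in the batch."""
    selections = []
    for index, name in enumerate(project_names):
        # Aliases (p0, p1, ...) let a single query ask for many objects at once
        selections.append(
            f'    p{index}: object(expression: "HEAD:projects/{name}/{filename}") '
            "{ ... on Blob { text } }"
        )
    body = "\n".join(selections)
    return (
        'query {\n  repository(owner: "google", name: "oss-fuzz") {\n'
        f"{body}\n"
        "  }\n}"
    )
```

With around a thousand projects and a batch size of, say, 200, this would yield five queries per file type. Smaller batches mean that one failed or timed-out query costs less to retry.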

There's a drawback: The GraphQL API does require authentication. **You won't need it for this demo**, but a general user would have to procure a GitHub token as follows:

> 1. Go to: **Settings > Developer settings > Personal access tokens > Fine-grained tokens > Generate new token**.
> 2. Create a `.env` file at the root of the project and add the token as follows: `GITHUB_TOKEN=<your-personal-github-token>`.

Since the intended audience for this project is researchers and developers, I think procuring a GitHub token shouldn't pose much of a problem. If it does, the modularity of our prototype would make it easy to swap to alternative data sources, though those would surely also come with tradeoffs.
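
For reference, reading the token described in the steps above could look roughly like this. It is a sketch using only the standard library, and the helper names are hypothetical. A dedicated dotenv library would be a reasonable alternative to the small parser shown here.

```python
import os
from typing import Optional


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_token(env_path: str = ".env") -> Optional[str]:
    """Prefer the process environment, then fall back to the .env file."""
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        return token
    try:
        with open(env_path, encoding="utf-8") as handle:
            return parse_env_text(handle.read()).get("GITHUB_TOKEN")
    except FileNotFoundError:
        return None


def build_auth_headers(token: str) -> dict[str, str]:
    """Headers for an authenticated request to GitHub's API."""
    return {"Authorization": f"Bearer {token}"}
```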

**Design Decision: Data Caching**

While the GraphQL API is powerful, it isn't instant. Fetching all those files takes 30-60s. Naturally, we don't want to have to wait repeatedly. So the data is fetched once and, to keep things simple for now, cached in memory for the duration of the session. Then, we can refresh it at will.

In terms of code, I implemented the cache as a dedicated class in its own module. It's bare at this point, but having it in a separate module from the get-go will make it easy to expand and inject into other modules once we graduate to a distribution-ready project structure.
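
A bare in-memory cache along these lines might look like the sketch below. The class and method names are hypothetical, and the prototype's own cache module may expose a different interface.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class ProjectCache(Generic[T]):
    """Holds one value in memory for the lifetime of the process."""

    def __init__(self) -> None:
        self._data: Optional[T] = None

    def is_empty(self) -> bool:
        return self._data is None

    def get_or_load(self, loader: Callable[[], T]) -> T:
        """Return the cached value, calling `loader` only when the cache is empty."""
        if self._data is None:
            self._data = loader()
        return self._data

    def clear(self) -> None:
        """Forget the cached value so the next access triggers a fresh load."""
        self._data = None
```

Passing the loader in as a callable keeps the cache unaware of where the data comes from, which is what would make it easy to inject into other modules later.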

**Design Decision: Project Structure**

One thing to note is that our prototype has a flat structure, with all files in a single directory. As the project grows, we may want to adopt a tree-like structure and an instance-based implementation. Our functionality could be wrapped into classes, which would facilitate testing. Our decision to split the logic into modules early on would facilitate this transition.

In [9]:
# Cache all projects
ossfuzz.cache_projects()

Please wait: Fetching project data... (~45s, only once)
We're halfway there...
Caching complete!


In [10]:
# Get all projects and show their details (triggers caching if none found)
pp(ossfuzz.get_projects())

{'abseil-cpp': Project(name='abseil-cpp',
                       language='c++',
                       homepage='abseil.io',
                       main_repo='https://github.com/abseil/abseil-cpp.git',
                       primary_contact='dmauro@google.com',
                       vendor_ccs=None,
                       fuzzing_engines=None,
                       build_system='bazel',
                       project_yaml='homepage: "abseil.io"\n'
                                    'language: c++\n'
                                    'primary_contact: "dmauro@google.com"\n'
                                    'main_repo: '
                                    "'https://github.com/abseil/abseil-cpp.git'\n",
                       build_sh='# Copyright 2020 Google Inc.\n'
                                '#\n'
                                '# Licensed under the Apache License, Version '
                                '2.0 (the "License");\n'
                                '# you m

## Requirement 3

> Ability to filter projects based on criteria (e.g., language, library).

**Design Decision: Basic Filtering**

Now that we have 1000+ projects in our cache, we want to be able to search through them. A simple way of doing this is via keyword search. I implemented this as a function that allows users to supply values for project attributes (name, language, etc.) and returns the projects matching those values. In its current form, it allows partial matching at the attribute level.

The implementation uses a closure to build a predicate that is then passed to a filtering function for matching. It's a bit more abstract than our earlier functions, but in my opinion, it makes for clean, Pythonic code. Eventually, we could adopt a full-fledged specification pattern, allowing more flexible and granular matching with AND, OR, NOT, etc., while retaining ease of use.
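
The closure-based approach can be sketched as follows. This is an illustration under assumptions, not the prototype's code: the `Project` model is trimmed to three fields, `make_predicate` and `_contains` are hypothetical helper names, and the projects are passed in explicitly instead of being read from the cache.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Project:
    name: str
    language: Optional[str] = None
    fuzzing_engines: Optional[list[str]] = None


Predicate = Callable[[Project], bool]


def _contains(value: Any, keyword: str) -> bool:
    """Partial match against a string, or against any item of a list."""
    if value is None:
        return False
    items = value if isinstance(value, list) else [value]
    return any(keyword in str(item) for item in items)


def make_predicate(**criteria: str) -> Predicate:
    """Return a closure that is true only when every criterion matches (implicit AND)."""
    def predicate(project: Project) -> bool:
        return all(
            _contains(getattr(project, attr, None), keyword)
            for attr, keyword in criteria.items()
        )
    return predicate


def filter_projects(predicate: Predicate, projects: dict[str, Project]) -> dict[str, Project]:
    """Keep the projects for which the predicate holds."""
    return {key: project for key, project in projects.items() if predicate(project)}


def match_projects(projects: dict[str, Project], **criteria: str) -> dict[str, Project]:
    """Keyword front end: build a predicate, then delegate to filter_projects."""
    return filter_projects(make_predicate(**criteria), projects)
```

The inner `predicate` function captures `criteria` from the enclosing call, so each call to `make_predicate` yields an independent, reusable test.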

In [11]:
# Get all projects whose attrs contain specific keywords
pp(ossfuzz.match_projects(
    language="c++",
    build_system="cmake",
))

{'alembic': Project(name='alembic',
                    language='c++',
                    homepage='https://github.com/alembic/alembic',
                    main_repo='https://github.com/alembic/alembic',
                    primary_contact='miller.lucas@gmail.com',
                    vendor_ccs=None,
                    fuzzing_engines=None,
                    build_system='cmake',
                    project_yaml='homepage: '
                                 '"https://github.com/alembic/alembic"\n'
                                 'language: c++\n'
                                 'primary_contact: "miller.lucas@gmail.com"\n'
                                 'sanitizers:\n'
                                 '  - address\n'
                                 'main_repo: '
                                 "'https://github.com/alembic/alembic'\n",
                    build_sh='#!/bin/bash -eu\n'
                             '# Copyright 2020 Google Inc.\n'
                             

In [12]:
# Currently, the function allows partial matching
pp(ossfuzz.match_projects(
    name="ada",    # Matches 'ada-url', 'adal', etc.
))

{'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.com/ada-url/ada.git',
                    primary_contact='yagiz@nizipli.com',
                    vendor_ccs=None,
                    fuzzing_engines=['libfuzzer',
                                     'afl',
                                     'honggfuzz',
                                     'centipede'],
                    build_system=None,
                    project_yaml='homepage: "https://ada-url.github.io/ada"\n'
                                 'language: c++\n'
                                 'primary_contact: "yagiz@nizipli.com"\n'
                                 'auto_ccs:\n'
                                 '  - "daniel@lemire.me"\n'
                                 'main_repo: '
                                 '"https://github.com/ada-url/ada.git"\n'
                                 'sani

In [13]:
# Currently, the function requires all criteria to match (implicit AND)
pp(ossfuzz.match_projects(
    name="ada",                     # Matches 'ada-url', 'adal', etc.
    fuzzing_engines="honggfuzz",    # Of the 'ada...' projects, only matches 'ada-url'
))

{'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.com/ada-url/ada.git',
                    primary_contact='yagiz@nizipli.com',
                    vendor_ccs=None,
                    fuzzing_engines=['libfuzzer',
                                     'afl',
                                     'honggfuzz',
                                     'centipede'],
                    build_system=None,
                    project_yaml='homepage: "https://ada-url.github.io/ada"\n'
                                 'language: c++\n'
                                 'primary_contact: "yagiz@nizipli.com"\n'
                                 'auto_ccs:\n'
                                 '  - "daniel@lemire.me"\n'
                                 'main_repo: '
                                 '"https://github.com/ada-url/ada.git"\n'
                                 'sani

**Design Decision: More Sophisticated Filtering**

In addition to keyword matching, I implemented a more flexible function for filtering (accidental alliteration!). The function takes a predicate as input and returns any project that satisfies the predicate. For instance, the predicate could be a function that checks if a project is:

> (not-in-python OR build-system-missing) AND name-has-odd-length

This filtering function actually powers the earlier keyword-matching function. The keyword-matching function simply helps users build predicates and passes them along to the filtering function. A disadvantage of this more powerful function is that it requires users to write their own predicates, typically as `lambdas`, which isn't quite as simple as providing keywords. So I think it's good to offer both types of filtering options.
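
One possible way to lower the barrier to writing predicates is a handful of small combinators that compose named checks. The sketch below is illustrative only: `all_of`, `any_of`, and `negate` are hypothetical helpers, not part of the prototype, and the three named checks restate the example predicate above.

```python
from typing import Callable, TypeVar

T = TypeVar("T")
Predicate = Callable[[T], bool]


def all_of(*predicates: Predicate) -> Predicate:
    """True when every given predicate holds (logical AND)."""
    return lambda item: all(p(item) for p in predicates)


def any_of(*predicates: Predicate) -> Predicate:
    """True when at least one given predicate holds (logical OR)."""
    return lambda item: any(p(item) for p in predicates)


def negate(predicate: Predicate) -> Predicate:
    """True when the given predicate does not hold (logical NOT)."""
    return lambda item: not predicate(item)


# Named building blocks for the example predicate
is_python = lambda p: p.language == "python"
no_build_system = lambda p: p.build_system is None
odd_length_name = lambda p: len(p.name) % 2 == 1

# (not-in-python OR build-system-missing) AND name-has-odd-length
wanted = all_of(any_of(negate(is_python), no_build_system), odd_length_name)
```

A composed predicate such as `wanted` is an ordinary callable, so it could be handed to a predicate-based filter in place of a hand-written lambda.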

In [14]:
pp(ossfuzz.filter_projects(
    # We can be as specific as we want
    lambda p: (p.language != "python" or p.build_system is None) and len(p.name) % 2 == 1
))

{'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.com/ada-url/ada.git',
                    primary_contact='yagiz@nizipli.com',
                    vendor_ccs=None,
                    fuzzing_engines=['libfuzzer',
                                     'afl',
                                     'honggfuzz',
                                     'centipede'],
                    build_system=None,
                    project_yaml='homepage: "https://ada-url.github.io/ada"\n'
                                 'language: c++\n'
                                 'primary_contact: "yagiz@nizipli.com"\n'
                                 'auto_ccs:\n'
                                 '  - "daniel@lemire.me"\n'
                                 'main_repo: '
                                 '"https://github.com/ada-url/ada.git"\n'
                                 'sani

This concludes the demo. Thanks for going over it! I hope some of this proves useful.