# DEMO: OSS-Fuzz Research Module Prototype

## In Brief

This notebook showcases a prototype of an **OSS-Fuzz module for researchers** and explains some of the design decisions behind it.

In its current state, it addresses the following requirements:

> - Project Information Retrieval ℹ️:
>   1. API to list all projects currently fuzzed by OSS-Fuzz.
>   2. API to retrieve details for a specific project (e.g., language, build system, fuzzer engines used).
>   3. Ability to filter projects based on criteria (e.g., language, library).

More than demonstrating know-how, my primary goal in developing this prototype was to explore the problem domain: to understand how the OSS-Fuzz project works and, from there, how we could make it more readily accessible through a usable, intuitive Python package.

⚠️ **A word of caution**: This is very much a prototype. It assumes the easy path, responsive APIs, and a cooperative user. It makes design choices that would need discussion. It does barely any error checking and wouldn't stand a chance against a fuzzer. It simply aims to expose basic functionality with minimal, clean code.

If selected, the prototype could be used as a starting point, be partially refactored to correct course, or be entirely discarded to start fresh; I'm not married to it. Whether I'm selected or not, if any of its code could be useful to bootstrap the project, feel free to use it.

## Demo

In [1]:
import api as ossfuzz

from pprint import pprint as pp

## Requirement 1

> API to list all projects currently fuzzed by OSS-Fuzz.

This requirement implied an important design decision: **What to use as data source?** Should we clone `oss-fuzz` locally and keep it synced? Should we access the project via GitHub's API to always get fresh data? Are appropriate APIs even available?

To answer these questions, I gauged the amount of data we'd need to retrieve, looked for inspiration in projects that integrate with `oss-fuzz` (such as `fuzz-introspector`), and put GitHub's [REST](https://docs.github.com/en/rest?apiVersion=2022-11-28) and [GraphQL](https://docs.github.com/en/graphql) APIs to the test.

**Design Decision: Small Data Source**

For this functionality, I decided to use GitHub's REST API because it's simple, quick, always gets fresh data, and doesn't require authentication for our demo use. However, by then, I also knew the REST API wouldn't be sufficient for more advanced functionality. Nonetheless, since this is a prototype and exploratory by nature, I thought it would be interesting to use multiple data sources.

Knowing this was a subjective design decision, and for the sake of separation of concerns, I deferred data fetching to a dedicated module (`fetcher.py`), to make it easy to swap data sources.
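To illustrate the swap-friendly design, here is a minimal sketch, not the actual contents of `fetcher.py`. It assumes a fetcher is any callable that returns directory entries as dicts with `name` and `type` keys (the shape GitHub's contents endpoint uses). The names `Fetcher` and `fake_fetcher` are hypothetical, and `list_projects` takes the fetcher as an argument only to keep the example self-contained.

```python
from typing import Callable, Optional

# Hypothetical contract: a fetcher returns directory entries as dicts
# carrying at least a "name" and a "type" key.
Fetcher = Callable[[], list[dict]]


def list_projects(fetch_entries: Fetcher, limit: Optional[int] = None) -> list[str]:
    """Return sorted project names, keeping only directory entries."""
    names = sorted(e["name"] for e in fetch_entries() if e.get("type") == "dir")
    return names if limit is None else names[:limit]


def fake_fetcher() -> list[dict]:
    """Stand-in data source; a real one might call the REST API or read a local clone."""
    return [
        {"name": "abseil-py", "type": "dir"},
        {"name": "README.md", "type": "file"},
        {"name": "abseil-cpp", "type": "dir"},
    ]


print(list_projects(fake_fetcher, limit=5))  # ['abseil-cpp', 'abseil-py']
```

Because the listing logic only depends on the callable, replacing the REST-backed source with another one would not touch `list_projects` itself.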

In [2]:
# List 5 projects
ossfuzz.list_projects(limit=5)  # The limit param is mainly for debugging

['abseil-cpp', 'abseil-py', 'ada-url', 'adal', 'aiohttp']

In [3]:
# List all projects
ossfuzz.list_projects()

['abseil-cpp',
 'abseil-py',
 'ada-url',
 'adal',
 'aiohttp',
 'airflow',
 'alembic',
 'ampproject',
 'angular',
 'angus-mail',
 'aniso8601',
 'ansible',
 'antlr3-java',
 'antlr4-java',
 'apache-axis2',
 'apache-commons-bcel',
 'apache-commons-beanutils',
 'apache-commons-cli',
 'apache-commons-codec',
 'apache-commons-collections',
 'apache-commons-compress',
 'apache-commons-configuration',
 'apache-commons-csv',
 'apache-commons-fileupload',
 'apache-commons-geometry',
 'apache-commons-imaging',
 'apache-commons-io',
 'apache-commons-jxpath',
 'apache-commons-lang',
 'apache-commons-logging',
 'apache-commons-math',
 'apache-commons-net',
 'apache-commons-text',
 'apache-commons-validator',
 'apache-cxf',
 'apache-doris',
 'apache-felix-dev',
 'apache-httpd',
 'apache-logging-log4cxx',
 'apache-poi',
 'aptos-core',
 'archaius-core',
 'arduinojson',
 'argcomplete',
 'argo',
 'args',
 'args4j',
 'arrow-java',
 'arrow-py',
 'arrow',
 'askama',
 'asn1crypto',
 'aspectj',
 'aspell',
 'as

## Requirement 2

> API to retrieve details for a specific project (e.g., language, build system, fuzzer engines used).

**Design Decision: Details Extraction**

Most of the project details live in `project.yaml`, so it made sense to start by fetching it. I used GitHub's REST API for this too. The structured nature of the file made it easy to load into a `dataclass`. I introduced two modules for this: `loaders.py` and `models.py`. Even though the codebase is compact, I decided to break the functionality into separate modules early on, to strengthen decoupling and support modularity right off the bat.
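A rough sketch of what loading into a `dataclass` could look like follows. The field names are taken from the outputs shown in this notebook. The `load_project` helper and the choice to ignore unknown keys are assumptions, and the input is an already-parsed dict so the example doesn't depend on a YAML library.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class Project:
    name: str
    language: Optional[str] = None
    homepage: Optional[str] = None
    main_repo: Optional[str] = None
    primary_contact: Optional[str] = None
    vendor_ccs: Optional[list[str]] = None
    fuzzing_engines: Optional[list[str]] = None
    build_system: Optional[str] = None

    def to_json(self) -> str:
        """Serialize all fields, including missing ones (as null)."""
        return json.dumps(asdict(self), indent=4)


def load_project(name: str, raw: dict) -> Project:
    """Build a Project from parsed YAML, ignoring keys the model doesn't know."""
    known = {f.name for f in fields(Project)} - {"name"}
    return Project(name=name, **{k: v for k, v in raw.items() if k in known})


# "sanitizers" is not a field of the model, so it is silently dropped.
raw = {"language": "c++", "homepage": "https://example.org", "sanitizers": ["address"]}
project = load_project("demo", raw)
print(project.to_json())
```

Ignoring unknown keys keeps the loader tolerant of `project.yaml` fields the model hasn't caught up with yet; a stricter loader could raise instead.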

The build system, however, wasn't in `project.yaml`. For this, we had to fetch `build.sh` and parse it to try to deduce the build system. I did this with a simple heuristic that looks for mentions of certain build tools. I limited myself to the build tools mentioned by David and Adam in [this article](https://adalogics.com/blog/fuzzing-100-open-source-projects-with-oss-fuzz). The heuristic needs to be fleshed out; at this point, it's just a proof of concept.
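The heuristic could be sketched along these lines. The tool list and its ordering are assumptions made for illustration, not the list from the article, and `guess_build_system` is a hypothetical name.

```python
import re

# Hypothetical, ordered list of tools to look for; earlier entries win.
BUILD_TOOLS = ("bazel", "cmake", "meson", "make")


def guess_build_system(build_sh: str) -> str:
    """Return the first known tool mentioned in build.sh, else 'unknown'."""
    # Drop comment lines so that remarks don't trigger a match.
    code = "\n".join(
        line for line in build_sh.splitlines() if not line.lstrip().startswith("#")
    )
    for tool in BUILD_TOOLS:
        # \b keeps 'make' from matching inside 'cmake'.
        if re.search(rf"\b{re.escape(tool)}\b", code):
            return tool
    return "unknown"


print(guess_build_system("mkdir build && cd build\ncmake .. && make -j4"))  # cmake
print(guess_build_system("./configure\nmake -j$(nproc)"))                   # make
print(guess_build_system("python3 setup.py install"))                       # unknown
```

A script that mentions several tools is attributed to whichever comes first in the list, which is one of the places where the heuristic would need refinement.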

In [4]:
# Get a specific project and show all its details
pp(ossfuzz.get_project("angular"))

Project(name='angular',
        language='javascript',
        homepage='https://angular.io/',
        main_repo='https://github.com/angular/angular.git',
        primary_contact=None,
        vendor_ccs=['wagner@code-intelligence.com',
                    'yakdan@code-intelligence.com',
                    'patrice.salathe@code-intelligence.com',
                    'hlin@code-intelligence.com',
                    'christopher.krah@code-intelligence.com',
                    'bug-disclosure@code-intelligence.com'],
        fuzzing_engines=['libfuzzer'],
        build_system='bazel')


In [5]:
# Get a specific project and show a specific detail, such as its language...
ossfuzz.get_project("curl").language

'c++'

In [6]:
# ... or its build system...
ossfuzz.get_project("angular").build_system

'bazel'

In [7]:
# ... or its fuzzing engines
ossfuzz.get_project("ada-url").fuzzing_engines

['libfuzzer', 'afl', 'honggfuzz', 'centipede']

In [8]:
# We may also output details as JSON
print(ossfuzz.get_project("envoy").to_json())

{
    "name": "envoy",
    "language": "c++",
    "homepage": "https://www.envoyproxy.io/",
    "main_repo": "https://github.com/envoyproxy/envoy.git",
    "primary_contact": "htuch@google.com",
    "vendor_ccs": null,
    "fuzzing_engines": [
        "libfuzzer",
        "honggfuzz"
    ],
    "build_system": "make"
}


**Design Decision: Big Data Source**

To get the details of a single project, GitHub's REST API does the job. Most notably, below 60 requests per hour, it doesn't require authentication, which is convenient for this demo. **But what if we want the details of all projects?** One request gets us one file. We need two files per project, and there are over 1000 projects. Even authenticated, we're bound to hit the rate limit of GitHub's REST API.

Fortunately, GitHub offers an API for "big data": their GraphQL API. The queries are a bit verbose, but this API makes it possible to bulk-fetch all 2000+ files in a dozen queries. Technically, it could just be a single query, but I implemented rudimentary query batching for robustness.
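As an illustration of the batching idea, here is a sketch that only builds the query strings and sends nothing over the network. It assumes files are addressed with `object(expression: "HEAD:<path>")` and read through the `Blob.text` field. The helper names and the batch size are hypothetical, and the prototype's actual queries may differ.

```python
from typing import Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_query(project_names: list[str], filename: str) -> str:
    """Build one GraphQL query requesting `filename` for every project in the batch."""
    # Project names contain hyphens, which are not valid in GraphQL aliases,
    # so each file gets an index-based alias (f0, f1, ...) mapped back by position.
    selections = "\n".join(
        f'    f{i}: object(expression: "HEAD:projects/{name}/{filename}") '
        "{ ... on Blob { text } }"
        for i, name in enumerate(project_names)
    )
    return (
        "query {\n"
        '  repository(owner: "google", name: "oss-fuzz") {\n'
        f"{selections}\n"
        "  }\n"
        "}"
    )


names = ["abseil-cpp", "abseil-py", "ada-url", "adal", "aiohttp"]
queries = [build_query(batch, "project.yaml") for batch in batched(names, 2)]
print(len(queries))  # 3
print(queries[0])
```

Smaller batches mean more round trips but keep each query and response to a manageable size, so a single failed request costs less to retry.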

There's a **drawback**: the GraphQL API requires authentication. This entails creating a token in your GitHub account. You don't need to do this for the demo to work, but a general user would need to do the following:

> 1. Go to: **Settings > Developer settings > Personal access tokens > Fine-grained tokens > Generate new token**.
> 2. Create a `.env` file at the root of the project and add the token as follows: `GITHUB_TOKEN=<your-personal-github-token>`.
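For reference, the token from step 2 could be picked up and turned into a request header roughly as follows. This is a simplified sketch with hypothetical helper names; in practice a library such as `python-dotenv` would normally handle the parsing.

```python
from typing import Optional


def read_token(dotenv_text: str, key: str = "GITHUB_TOKEN") -> Optional[str]:
    """Return the value of `key` from KEY=VALUE lines, or None if absent."""
    for line in dotenv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # Skip blanks, comments, and malformed lines
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip("\"'") or None
    return None


def auth_headers(token: Optional[str]) -> dict[str, str]:
    """Bearer header for GitHub's API; empty when no token is configured."""
    return {"Authorization": f"Bearer {token}"} if token else {}


token = read_token("# local secrets\nGITHUB_TOKEN=abc123\n")
print(auth_headers(token))  # {'Authorization': 'Bearer abc123'}
```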

**Design Decision: Data Caching**

While the GraphQL API is powerful, it isn't instant (the same can be said of any other method for sourcing the data). Getting all those files takes 30-60s. Naturally, we don't want to have to wait repeatedly. So the data is fetched once and cached. Then, we can refresh it at will.

In terms of code, I implemented the cache as a dedicated class in its own module. It's bare at this point, but having it in a separate module from the get-go will make it easy to expand and to inject into the relevant modules.
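A fetch-once, refresh-at-will cache can be sketched in a few lines. The JSON-on-disk storage, the class name `FileCache`, and its `get` signature are assumptions for illustration; the prototype's cache class may be organized differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class FileCache:
    """Tiny JSON-backed cache: fetch once, reuse until refreshed."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def get(self, fetch: Callable[[], dict], refresh: bool = False) -> dict:
        if refresh or not self.path.exists():
            data = fetch()  # The slow call (the 30-60s bulk fetch)
            self.path.write_text(json.dumps(data))
            return data
        return json.loads(self.path.read_text())


calls = []


def slow_fetch() -> dict:
    """Stand-in for the bulk fetch; records each invocation."""
    calls.append(1)
    return {"abseil-cpp": "language: c++"}


cache = FileCache(Path(tempfile.mkdtemp()) / "projects.json")
cache.get(slow_fetch)                # Miss: fetches and writes the file
cache.get(slow_fetch)                # Hit: read from disk, no fetch
cache.get(slow_fetch, refresh=True)  # Explicit refresh: fetches again
print(len(calls))  # 2
```

Passing the fetch function in, rather than hard-coding it, is what lets the same cache serve both `project.yaml` and `build.sh` data.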

**Design Decision: Project Structure**

One thing to note is that our prototype has a flat structure, with all files in a single directory. As the project grows, we may want to adopt a more package-like structure and instance-based implementation. Our functionality could be wrapped into classes (e.g. `App` or `Api`), which would facilitate testing. Our decision to split the logic into modules early on would make this sort of transition easier.

In [9]:
# Cache all projects
ossfuzz.cache_projects()

Please wait: caching project data... (~45s, just once)
 - build.sh files cached...
 - proyect.yaml files cached...
Done


In [10]:
# Get all projects and show their details (triggers caching if none found)
pp(ossfuzz.get_projects())

{'abseil-cpp': Project(name='abseil-cpp',
                       language='c++',
                       homepage='abseil.io',
                       main_repo='https://github.com/abseil/abseil-cpp.git',
                       primary_contact='dmauro@google.com',
                       vendor_ccs=None,
                       fuzzing_engines=None,
                       build_system='bazel'),
 'abseil-py': Project(name='abseil-py',
                      language='python',
                      homepage='https://github.com/abseil/abseil-py',
                      main_repo='https://github.com/abseil/abseil-py',
                      primary_contact=None,
                      vendor_ccs=['david@adalogics.com'],
                      fuzzing_engines=['libfuzzer'],
                      build_system='unknown'),
 'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.co

## Requirement 3

> Ability to filter projects based on criteria (e.g., language, library).

**Design Decision: Basic Filtering**

Now that we have 1000+ projects cached, we want to be able to search through them. A simple way of doing this is via keyword search. I implemented the functionality as a function that allows users to supply values for project attributes (name, language, etc.). Currently, it returns the projects whose attributes include those keywords (including partial matches).

The implementation uses a closure to build a predicate that is then passed to a filtering function for matching. It's a bit more abstract, but in my opinion, it makes for clean, Pythonic code. Eventually, we could adopt a full-fledged specification pattern, allowing more flexible and granular matching with IF, AND, OR, etc.
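The closure-based approach can be sketched as follows. This is an approximation rather than the prototype's code: `build_predicate` is a hypothetical name, the `Project` class is a trimmed-down stand-in, and the projects are passed in explicitly instead of being read from the cache.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Project:  # Trimmed-down stand-in for the notebook's model
    name: str
    language: Optional[str] = None
    fuzzing_engines: Optional[list[str]] = None


def build_predicate(**criteria: str) -> Callable[[Project], bool]:
    """Return a predicate that is true when every keyword matches its attribute."""

    def matches(project: Project) -> bool:
        for attr, keyword in criteria.items():
            value = getattr(project, attr)
            if value is None:
                return False
            # Normalize scalars to a one-item list so both cases share a code path.
            candidates = value if isinstance(value, list) else [value]
            if not any(keyword in str(c) for c in candidates):  # Partial match
                return False
        return True  # All criteria matched (implicit AND)

    return matches


def filter_projects(projects: dict, predicate: Callable[[Project], bool]) -> dict:
    return {name: p for name, p in projects.items() if predicate(p)}


def match_projects(projects: dict, **criteria: str) -> dict:
    return filter_projects(projects, build_predicate(**criteria))


projects = {
    "abseil-cpp": Project("abseil-cpp", "c++"),
    "ada-url": Project("ada-url", "c++", ["libfuzzer", "afl", "honggfuzz", "centipede"]),
    "adal": Project("adal", "python", ["libfuzzer"]),
}
print(list(match_projects(projects, name="ada")))  # ['ada-url', 'adal']
print(list(match_projects(projects, name="ada", fuzzing_engines="honggfuzz")))  # ['ada-url']
```

The closure captures `criteria` once, so the returned predicate can be reused across calls or handed to any other function that accepts a predicate.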

In [11]:
# Get all projects whose attrs contain specific keywords
pp(ossfuzz.match_projects(
    language="c++",
))

{'abseil-cpp': Project(name='abseil-cpp',
                       language='c++',
                       homepage='abseil.io',
                       main_repo='https://github.com/abseil/abseil-cpp.git',
                       primary_contact='dmauro@google.com',
                       vendor_ccs=None,
                       fuzzing_engines=None,
                       build_system='bazel'),
 'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.com/ada-url/ada.git',
                    primary_contact='yagiz@nizipli.com',
                    vendor_ccs=None,
                    fuzzing_engines=['libfuzzer',
                                     'afl',
                                     'honggfuzz',
                                     'centipede'],
                    build_system=None),
 'alembic': Project(name='alembic',
                    language='c++',
   

In [12]:
# Currently, the function allows partial matching
pp(ossfuzz.match_projects(
    name="ada",    # Matches 'ada-url', 'adal', etc.
))

{'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.com/ada-url/ada.git',
                    primary_contact='yagiz@nizipli.com',
                    vendor_ccs=None,
                    fuzzing_engines=['libfuzzer',
                                     'afl',
                                     'honggfuzz',
                                     'centipede'],
                    build_system=None),
 'adal': Project(name='adal',
                 language='python',
                 homepage='https://github.com/AzureAD/azure-activedirectory-library-for-python',
                 main_repo='https://github.com/AzureAD/azure-activedirectory-library-for-python',
                 primary_contact=None,
                 vendor_ccs=['david@adalogics.com',
                             'adam@adalogics.com',
                             'arthur.chan@adalogics.com'],
       

In [13]:
# Currently, the function requires all criteria to match (implicit AND)
pp(ossfuzz.match_projects(
    name="ada",                     # Matches 'ada-url', 'adal', etc.
    fuzzing_engines="honggfuzz",    # Of the 'ada...' projects, only matches 'ada-url'
))

{'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.com/ada-url/ada.git',
                    primary_contact='yagiz@nizipli.com',
                    vendor_ccs=None,
                    fuzzing_engines=['libfuzzer',
                                     'afl',
                                     'honggfuzz',
                                     'centipede'],
                    build_system=None)}


**Design Decision: More Sophisticated Filtering**

In addition to keyword matching, I implemented a more flexible function for filtering (accidental alliteration). The function takes a predicate as input and returns any project that satisfies the predicate. For instance, the predicate could be a function that checks whether a project satisfies:

> (not-in-python OR build-system-missing) AND name-has-odd-length

This filtering function actually powers the earlier keyword-matching function. The keyword-matching function simply helps users build predicates and passes them along to the filtering function. A disadvantage of this more powerful function is that it requires users to write their own predicates, typically as `lambdas`, which might be difficult for some users. So I think it's good to offer both types of filtering.
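One possible way to make predicates easier to write without lambdas is a handful of small combinators over named functions. This is only a sketch of a direction, not something the prototype provides; `all_of`, `any_of`, `negate`, and the three named checks are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Callable

Predicate = Callable[[Any], bool]


def all_of(*preds: Predicate) -> Predicate:
    """True when every predicate holds (AND)."""
    return lambda p: all(pred(p) for pred in preds)


def any_of(*preds: Predicate) -> Predicate:
    """True when at least one predicate holds (OR)."""
    return lambda p: any(pred(p) for pred in preds)


def negate(pred: Predicate) -> Predicate:
    """Invert a predicate (NOT)."""
    return lambda p: not pred(p)


def in_python(p: Any) -> bool:
    return p.language == "python"


def build_system_missing(p: Any) -> bool:
    return p.build_system is None


def name_has_odd_length(p: Any) -> bool:
    return len(p.name) % 2 == 1


# (not-in-python OR build-system-missing) AND name has an odd number of characters
predicate = all_of(any_of(negate(in_python), build_system_missing), name_has_odd_length)

# Stand-in objects with just the attributes the checks read
ada_url = SimpleNamespace(name="ada-url", language="c++", build_system=None)
adal = SimpleNamespace(name="adal", language="python", build_system="unknown")
print(predicate(ada_url), predicate(adal))  # True False
```

The resulting `predicate` is an ordinary callable, so it could be passed to a filtering function in the same way as a lambda.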

In [14]:
pp(ossfuzz.filter_projects(
    # We can be as specific as we want
    lambda p: (p.language != "python" or p.build_system is None) and len(p.name) % 2 == 1
))

{'ada-url': Project(name='ada-url',
                    language='c++',
                    homepage='https://ada-url.github.io/ada',
                    main_repo='https://github.com/ada-url/ada.git',
                    primary_contact='yagiz@nizipli.com',
                    vendor_ccs=None,
                    fuzzing_engines=['libfuzzer',
                                     'afl',
                                     'honggfuzz',
                                     'centipede'],
                    build_system=None),
 'alembic': Project(name='alembic',
                    language='c++',
                    homepage='https://github.com/alembic/alembic',
                    main_repo='https://github.com/alembic/alembic',
                    primary_contact='miller.lucas@gmail.com',
                    vendor_ccs=None,
                    fuzzing_engines=None,
                    build_system='cmake'),
 'angular': Project(name='angular',
                    language='javascript',

Thanks for reading! I hope some of this proves useful.