Mark Coletti edited this page May 2, 2023 · 9 revisions

Gremlin, an adversarial evolutionary algorithm that discovers biases or weaknesses in machine learners

Gremlin uses an adversarial evolutionary algorithm (EA) to learn where a given machine learning (ML) model performs poorly. The EA searches for the worst-performing feature sets so that a practitioner can then, say, tune the training data to include more examples of those feature sets. The model can then be retrained on the updated training set in the hope that the additional examples are sufficient for it to perform better on those sets.
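The idea can be sketched in a few lines of plain Python. This is only an illustration of adversarial search, not Gremlin's or LEAP's actual implementation: `model_accuracy` is a hypothetical stand-in for evaluating a trained model on one feature set, and the two-feature representation, mutation scheme, and parameter values are all assumptions made for the example.

```python
import random


def model_accuracy(features):
    """Hypothetical stand-in for scoring a trained model on one feature set.

    Accuracy dips near x = 7, y = 2 to mimic a poorly learned region.
    """
    x, y = features
    distance_sq = (x - 7.0) ** 2 + (y - 2.0) ** 2
    return 1.0 - 0.6 / (1.0 + distance_sq)


def mutate(features, rng, sigma=0.5, low=0.0, high=10.0):
    """Gaussian mutation, clipped to the allowed feature range."""
    return tuple(min(high, max(low, v + rng.gauss(0.0, sigma))) for v in features)


def adversarial_search(evaluate, pop_size=20, generations=40, seed=0):
    """Evolve feature sets toward LOWER model accuracy."""
    rng = random.Random(seed)
    population = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        # Sorting ascending by accuracy keeps the worst-performing feature sets
        # drawn from parents and offspring combined.
        population = sorted(population + offspring, key=evaluate)[:pop_size]
    return population


worst = adversarial_search(model_accuracy)
print(worst[0], model_accuracy(worst[0]))
```

The feature sets left in `worst` cluster around the region where the stand-in model is weakest, which is the kind of information a practitioner would use to decide where more training examples are needed.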

Gremlin is a 2022 R&D 100 Award Winner!

Requires

Installation

  1. Activate your conda or virtual environment
  2. cd into top-level gremlin directory
  3. pip install .

Configuration

Gremlin is essentially a thin convenience wrapper around [LEAP](https://github.com/AureumChaos/LEAP). Instead of writing a LEAP script, you point the gremlin executable at a YAML file that describes which LEAP classes, subclasses, and functions to use, along with other salient run-time characteristics. gremlin parses the YAML file and generates a CSV file containing the individuals from the run. This CSV file should contain information that can be used to tune the training data.
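As a rough sketch of how such a CSV file might be mined, the snippet below picks out the lowest-scoring individuals using only the standard library. The column names (`generation`, `digit`, `fitness`), the sample rows, and the assumption that a lower fitness means worse model performance are all hypothetical; the real columns depend on your configuration and problem representation.

```python
import csv
import io

# Hypothetical sample standing in for a Gremlin output CSV. Real column names
# and values depend on the configuration and the problem representation.
SAMPLE_CSV = """generation,digit,fitness
0,3,0.97
0,7,0.97
1,5,0.62
1,5,0.58
2,5,0.55
"""


def worst_individuals(csv_file, fitness_column="fitness", n=3):
    """Return the n rows with the lowest fitness (assumed to be model accuracy)."""
    rows = list(csv.DictReader(csv_file))
    return sorted(rows, key=lambda row: float(row[fitness_column]))[:n]


worst = worst_individuals(io.StringIO(SAMPLE_CSV))
for row in worst:
    print(row)
```

For a real run, replace `io.StringIO(SAMPLE_CSV)` with an open file handle on the CSV that gremlin wrote, and adjust `fitness_column` to match its header.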

More information on how to create a configuration file can be found here, and detailed documentation of the configuration parameters, with examples, can be found here.

Examples

Example code and configuration for a real problem can be found in examples/MNIST. In this problem, Gremlin discovers that one of the digits in the MNIST training data is poorly represented.

To run it, change into the examples/MNIST directory and execute:

$ gremlin config/common.yml config/bygen.yml

Versions

More detailed explanations for version changes can be found in CHANGELOG.

  • v0.6, 3/3/23
    • Allow for using Dask Client subclasses, such as SSHCluster or SlurmCluster, which should make it easier to deploy on clusters, supercomputers, and in the cloud.

    • Re-organized how Dask distributed configuration is handled in YAML files.

    • The bygen algorithm, which is a traditional by-generational evolutionary algorithm, now supports distributed evaluations via Dask. One can also refer to the parents in pipeline operators; e.g., this is useful for truncation selection, which needs to take the best of offspring and parents.

    • Broke out how YAML configuration files are handled into separate modules. See examples/MNIST/run.sh for examples.

  • v0.5, 2/3/23
    • The main installed executable is now gremlin rather than gremlin.py. Added optional async.with_client config section. Improvements made to setup.py.
  • v0.4, 9/30/22
    • Added config variable async.with_client that allows for interacting with Dask before the EA runs; e.g., client.wait_for_workers() or client.upload_file()
    • Replaced imports with preamble in YAML config files, which gives more flexibility for importing dependencies and also allows for defining functions and variables that may be referred to in, say, the pipeline.
  • v0.3, 3/9/22
    • Added support for the config variable algorithm, which denotes whether to use a traditional by-generation EA or an asynchronous steady-state EA
  • v0.2dev, 2/17/22
    • revamped config system and heavily refactored/simplified code
  • v0.1dev, 10/14/21
    • initial raw release
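The v0.6 note about truncation selection needing both offspring and parents can be illustrated with a small plain-Python sketch. This is not LEAP's own operator; the dictionary-based individuals, the `truncation_selection` helper defined here, and the convention that higher fitness is better are assumptions chosen for the example.

```python
def truncation_selection(offspring, parents, size, key):
    """Keep the `size` best individuals from offspring and parents combined."""
    return sorted(offspring + parents, key=key, reverse=True)[:size]


# Hypothetical individuals; in this sketch a higher fitness is better.
parents = [
    {"genome": [1, 2], "fitness": 0.9},
    {"genome": [3, 4], "fitness": 0.4},
]
offspring = [
    {"genome": [5, 6], "fitness": 0.7},
    {"genome": [7, 8], "fitness": 0.2},
]

survivors = truncation_selection(
    offspring, parents, size=2, key=lambda ind: ind["fitness"]
)
print([ind["fitness"] for ind in survivors])
```

Because the pool includes the parents, a strong parent survives even when every offspring is weaker, which is why the operator needs access to both groups.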

Sub-directories

  • gremlin/ -- main gremlin code
  • examples/ -- examples for using gremlin; currently only has MNIST example

Main web site

The gremlin GitHub repository is https://github.com/markcoletti/gremlin. main is the release branch, and active work occurs on the develop branch.