HAPI

A longitudinal database of ML API predictions.

Getting Started | Website | Contributing | About

💡 What is HAPI?

History of APIs (HAPI) is a large-scale, longitudinal database of commercial ML API predictions. It contains 1.7 million predictions collected from 2020 to 2022, spanning APIs from Amazon, Google, IBM, and Microsoft. The database covers diverse machine learning tasks, including image tagging, speech recognition, and text mining.

⚡️ Quickstart

We provide a lightweight Python package for getting started with HAPI.

Read the guide below or follow along in Google Colab.

pip install "hapi @ git+https://github.com/lchen001/hapi@main"

Import the library and download the data, optionally specifying the directory for the download. If the directory is not specified, the data will be downloaded to ~/.hapi.

>> import hapi
>> hapi.config.data_dir = "/path/to/data/dir"
>> hapi.download()

You can permanently set the data directory by adding the variable HAPI_DATA_DIR to your environment.
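For example, assuming hapi reads HAPI_DATA_DIR when it resolves the data directory, a sketch of setting it from Python before import (adding it to a shell profile is the more permanent route):

import os

# Hypothetical: point HAPI at a custom data directory via the environment
# variable mentioned above, before the package is imported.
os.environ["HAPI_DATA_DIR"] = "/path/to/data/dir"

import hapi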

Once we've downloaded the database, we can list the available APIs, datasets, and tasks with hapi.summary(). This returns a Pandas DataFrame with the columns task, dataset, api, date, path, and cost_per_10k.

>> df = hapi.summary()
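Because the summary is an ordinary Pandas DataFrame, you can slice it with standard Pandas operations. A small sketch (illustrative, using only the columns listed above):

import hapi

df = hapi.summary()

# List every API with predictions for the "mic" task.
print(df[df["task"] == "mic"]["api"].unique())

# List the dates on which "pascal" predictions were collected.
print(df[df["dataset"] == "pascal"]["date"].unique())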

To load the predictions into memory we use hapi.get_predictions(). The keyword arguments allow us to load predictions for a subset of tasks, datasets, APIs, and/or dates.

>> predictions = hapi.get_predictions(task="mic", dataset="pascal", api=["google_mic", "ibm_mic"])

The predictions are returned as a dictionary mapping from "{task}/{dataset}/{api}/{date}" to lists of dictionaries, each with keys "example_id", "predicted_label", and "confidence". For example:

{
    "mic/pascal/google_mic/20-10-28": [
        {
            'confidence': 0.9798267782,
            'example_id': '2011_000494',
            'predicted_label': ['bird', 'bird']
        },
        ...
    ],
    "mic/pascal/microsoft_mic/20-10-28": [...],
    ...
}
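Since the return value is a plain dictionary, no special API is needed to work with it. For instance, a quick sketch (not part of the package) that summarizes each snapshot:

# Count predictions and average the confidence for each
# "{task}/{dataset}/{api}/{date}" key returned above.
for key, preds in predictions.items():
    mean_conf = sum(p["confidence"] for p in preds) / len(preds)
    print(f"{key}: {len(preds)} examples, mean confidence {mean_conf:.3f}")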

To load the labels into memory we use hapi.get_labels(). The keyword arguments allow us to load labels for a subset of tasks and datasets.

>> labels = hapi.get_labels(task="mic", dataset="pascal")

The labels are returned as a dictionary mapping from "{task}/{dataset}" to lists of dictionaries, each with keys "example_id" and "true_label".
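A common next step is joining predictions to labels on "example_id". The sketch below estimates exact-match accuracy for one snapshot; the key follows the "{task}/{dataset}/{api}/{date}" format shown earlier, and the metric (treating true_label as a list, like predicted_label, and requiring the label sets to match exactly) is an assumption to adapt to your task:

# Build a lookup from example_id to ground-truth label.
label_map = {l["example_id"]: l["true_label"] for l in labels["mic/pascal"]}

key = "mic/pascal/google_mic/20-10-28"
preds = predictions[key]

# Count predictions whose label set exactly matches the ground truth.
correct = sum(
    set(p["predicted_label"]) == set(label_map[p["example_id"]])
    for p in preds
    if p["example_id"] in label_map
)
print(f"{key}: exact-match accuracy {correct / len(preds):.3f}")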

💾 Manual Downloading

In this section, we discuss how to download the database without the HAPI Python API.

The database is stored in a GCP bucket named hapi-data. All model predictions are stored in hapi.tar.gz (Compressed size: 205.3MB, Full size: 1.2GB).

From the command line, you can download and extract the predictions with:

    wget https://storage.googleapis.com/hapi-data/hapi.tar.gz && tar -xzvf hapi.tar.gz 
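If you would rather stay in Python without installing the package, a rough equivalent using only the standard library (same URL as above):

import tarfile
import urllib.request

# Fetch the archive from the public bucket and unpack it in place.
urllib.request.urlretrieve(
    "https://storage.googleapis.com/hapi-data/hapi.tar.gz", "hapi.tar.gz"
)
with tarfile.open("hapi.tar.gz", "r:gz") as tar:
    tar.extractall()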

However, we recommend downloading using the Python API as described above.

🌍 Datasets

In this section, we discuss how to download the benchmark datasets used in HAPI.

The predictions in HAPI are made on benchmark datasets from across the machine learning community. For example, HAPI includes predictions on PASCAL, a popular object detection dataset. Unfortunately, we lack the permissions required to redistribute these datasets, so we do not include the raw data in the download described above.

We provide instructions on how to download each of the datasets and, for a growing number of them, we provide automated scripts that can download the dataset. These scripts are implemented in the Meerkat Dataset Registry – a registry of machine learning datasets (similar to Torchvision Datasets).

To download a dataset and load it into memory, use hapi.get_dataset():

>> import hapi
>> dp = hapi.get_dataset("pascal")

This returns a Meerkat DataPanel – a DataFrame-like object that houses the dataset. See the Meerkat User Guide for more information. The DataPanel will have an "example_id" column that corresponds to the "example_id" key in the outputs of hapi.get_predictions() and hapi.get_labels().
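Because both sides share "example_id", you can line dataset rows up with API outputs. A hypothetical sketch (only the "example_id" correspondence is documented above; the column assignment is illustrative):

import hapi

dp = hapi.get_dataset("pascal")
preds = hapi.get_predictions(task="mic", dataset="pascal", api=["google_mic"])

# Pick one "{task}/{dataset}/{api}/{date}" snapshot and build a lookup.
key = next(iter(preds))
pred_map = {p["example_id"]: p["predicted_label"] for p in preds[key]}

# Attach that snapshot's predictions as a new column, aligned on example_id.
dp["predicted_label"] = [pred_map.get(eid) for eid in dp["example_id"]]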

If the dataset is not yet available through the Meerkat Dataset Registry, a ValueError will be raised containing instructions for manually downloading the dataset. For example:

>> dp = hapi.get_dataset("cmd")

ValueError: Data download for 'cmd' not yet available through the HAPI Python API. Please download manually following the instructions below:
 
CMD is a spoken command recognition dataset. 

It can be downloaded here: https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.datasets.google_speech_commands.html.

✉️ About

HAPI was developed at Stanford in the Zou Group. Reach out to Lingjiao Chen (lingjiao [at] stanford [dot] edu) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved!
