Skip to content

harp-lab/Assemblage

Repository files navigation

Assemblage

Assemblage is a distributed binary corpus discovery, generation, and archival tool built to provide high-quality labeled metadata for the purposes of building training data for machine learning applications of binary analysis and other applications (static / dynamic analysis, reverse engineering, etc...).

The code in this repository is published under MIT license.

Cloud infrastructure support

We have run Assemblage over the course of several months within the research computing cluster at Syracuse University and Amazon Web Services.

Worker Requirement

This is the public repository of Assemblage, and it is hosts a general template for booting Assemblage on any cloud infrastructure. We will soon include stable/old versions we customized and ran on AWS, please check out beanches under name of fomat {linux|windows}_{github|vcpkg}. For example, code we used to generate Windows binaries from GitHub data will locate at branch naming windows_github (though the credentials are sanitized).

We provide Dockerfile and build script to build Docker images for Linux worker, and the Docker compose file can be used to specify the resource each worker can access.
Due to the commercial license of the Wiundows, we only provide the boot script and environment specification for workers, locating at the Windows readme

Meanwhile, a brief introduction to the APIs is provided at this link.

Dataset Availability

We include only the subset of binaries for which permissive licenses can be ascertained.

Pdb files are too large to be included, but datasets with pdb files are also available upon request.

1.Windows GitHub dataset (Processed to SQLite database, 62k, last updated: Apr 14th 2024):

2.Windows vcpkg dataset (Processed to SQLite database, 29k):

3.Linux GitHub dataset (Processed to SQLite database, 211k):

Releases

No releases published

Packages

No packages published