Project Ideas SoftwareHeritage API Client Library

Philippe Ombredanne edited this page Mar 13, 2020 · 3 revisions

Software Heritage API Client Library

Software Heritage (SWH) is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

To access this data effectively and efficiently, we need to create a Python library that can access the Software Heritage data via its REST API.

The goal is to craft a well-documented library that provides a clean abstraction for the SWH API. The working title for this project is "HeritedCode".

SWH is expected to eventually hold over 400 TB of FOSS source code and billions of files, together with their corresponding origin metadata.

The primary purpose of HeritedCode would be to provide basic API access, with a specific focus on identifying the origin of third-party FOSS files and packages present in a codebase.

Some of the use cases to consider would include:

  • given some ScanCode-like file-level information (paths, names, sha1/sha256/sha1_git checksums, size, etc.), query the API to find whether these files exist there and, where they do, return the file and package data and metadata
  • given many such files, find the most likely package
  • given URLs, Package URLs, or download URLs (see also https://gitter.im/aboutcode-org/fetchcode ), check whether these origins exist in SWH
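
The use cases above could be approached roughly as in the following sketch. It is not an existing HeritedCode API: every function name here is hypothetical, and the base URL and endpoint paths (`/content/<hash_type>:<hash>/` and `/origin/<url>/get/`) are assumptions taken from the public SWH web API documentation that should be checked against the current API before use. The origin tally is a plain majority vote, chosen only to illustrate "most likely package"; its input (candidate origins per file) is assumed to come from some earlier lookup step.

```python
from collections import Counter
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL and endpoint layout as described in the SWH web API docs.
API_ROOT = "https://archive.softwareheritage.org/api/1"

# Checksum types named in the use cases above.
HASH_TYPES = {"sha1", "sha256", "sha1_git"}


def content_url(hash_type, hash_value):
    """Build the (assumed) lookup URL for a file identified by a checksum."""
    if hash_type not in HASH_TYPES:
        raise ValueError(f"unsupported hash type: {hash_type}")
    return f"{API_ROOT}/content/{hash_type}:{hash_value.lower()}/"


def origin_url(origin):
    """Build the (assumed) lookup URL for an origin such as a repository URL."""
    # Keep ':' and '/' unescaped so the origin URL stays readable in the path.
    return f"{API_ROOT}/origin/{quote(origin, safe=':/')}/get/"


def exists(url, timeout=10):
    """Return True on HTTP 200, False on HTTP 404; re-raise any other error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except HTTPError as error:
        if error.code == 404:
            return False
        raise


def most_likely_origin(origins_per_file):
    """Pick the origin that the largest number of files point to.

    `origins_per_file` is a hypothetical input: one iterable of candidate
    origin names per scanned file. Returns None when there are no candidates.
    """
    votes = Counter()
    for origins in origins_per_file:
        votes.update(set(origins))  # each file votes at most once per origin
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

A caller would typically combine these, for example `exists(content_url("sha1_git", checksum))` for each scanned file, then pass the collected candidates to `most_likely_origin`. Only `exists` touches the network; the other functions are pure and easy to unit test.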

This has to be new code that does not reuse SWH code, as we want the HeritedCode library to be Apache-licensed (SWH code is GPL- and AGPL-licensed).