This repository contains Invest in Open Infrastructure's core data scrapers for the annual State of Open Infrastructure report. It provides the data on funding, projects, and tools in the open infrastructure space.
The 2024 report, when released at the end of May, will be available online and as a PDF for download.
If you have questions or feedback, please get in touch with Invest in Open Infrastructure at info@investinopen.org.
This project is a Python-based ETL pipeline that obtains publicly available data from scientific funding organizations, then normalizes it to a common schema for analysis. There are two types of pipelines:
- Website scrapers that obtain the data from web pages (such as grant catalogs / portals)
- Notebook-based scripts that process data from APIs or file downloads, such as bulk exports of grants
Both types of pipelines output data against the same common schema: an attrs-style data class that validates the data and ensures it is consistent across all funders.
The data for each funder is output as JSON Lines files in the [data](data) directory as `<funderid>_<grant_type>.jsonl`. When a funder's data exceeds 100 MB, it is split into multiple files, e.g. `sshrc-ca.split00.jsonl`.
Additional details on the structure of the data can be found in DATA.md. Documentation on the code used to produce it can be found in CONTRIBUTING.md, along with instructions on running the code to update the data yourself.