This repository contains Invest in Open Infrastructure's core data scrapers for the annual State of Open Infrastructure report. It provides the data on funding, projects, and tools in the open infrastructure space.
The 2024 report, when released at the end of May, will be available online and as a PDF for download.
If you have questions or feedback, please get in touch with Invest in Open Infrastructure at info@investinopen.org.
This project is a Python-based ETL pipeline that obtains publicly available data from scientific funding organizations, then normalizes it to a common schema for analysis. There are two types of pipelines:
- Website scrapers that obtain the data from web pages (such as grant catalogs / portals)
- Notebook-based scripts that process data from APIs or file downloads, such as bulk exports of grants
Both types of pipelines output data against the same common schema: an attrs-style data class that validates the data and ensures it is consistent across all funders.
The data for each funder is output as JSON Lines files in the [data](data) directory as `<funderid>_<grant_type>.jsonl`. When a funder's data exceeds 100 MB, it is split into multiple files, e.g. `sshrc-ca.split00.jsonl`.
Additional details on the structure of the data can be found in DATA.md. Documentation on the code used to produce it can be found in CONTRIBUTING.md, along with instructions on running the code to update the data yourself.