
grantseeker/gs-opendata


Grantseeker | Open Data Project for Global Charity Data

An AWS OpenData Project maintained by Grantseeker, Inc. featuring developer-friendly IRS exempt organization data.

This is a successor to the "first" AWS IRS 990 OpenData project (maintained by the IRS themselves):
https://registry.opendata.aws/irs990/
https://github.com/awslabs/open-data-registry/blob/main/datasets/irs990.yaml
https://github.com/awslabs/open-data-docs/tree/main/docs/irs-990

We are committed to ensuring 990 data is open, elegantly machine-accessible, and free for everyone who wants to build upon it.

Getting Started [NOTE: USING GRANTSEEKER URLS FOR NOW UNTIL WE GET OUR NEW AWS BUCKET]

The easiest way to get started is to access our open bucket:

gs private bucket --> to migrate to 'pristine' AWS account once approved

For example, you can get the 2022 source XML file for PRO PUBLICA INC. (EIN: 142007220) like this:

# ILLUSTRATION USING CURRENT GRANTSEEKER OPEN API

# Open in a browser
https://opendata.grantseeker.io/data/202242699349300499_public.xml

# Or in terminal
curl https://opendata.grantseeker.io/data/202242699349300499_public.xml

The idea is that you can then build whatever data processing pipeline you want from there:

# Grab the "TotalAssetsBOYAmt" and log it in a file 'total_assets.txt'
curl https://opendata.grantseeker.io/data/202242699349300499_public.xml | grep -oP '<TotalAssetsBOYAmt>\K[^<]+' > total_assets.txt
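The same idea extends to pulling any field from a filing. Below is a minimal sketch, assuming the filing is plain XML whose target elements carry no attributes or namespace prefixes; `xml_field` is a hypothetical helper name, not something this project ships. It uses `grep -o` plus `sed` rather than `grep -P`, because `-P` is a GNU extension that is not available in every grep (for example the BSD grep shipped with macOS).

```shell
# xml_field TAG
# Reads XML on stdin and prints the text of every <TAG>...</TAG> element,
# one value per line. This is a plain-text match, not a real XML parser:
# elements with attributes or nested children are not matched.
xml_field() {
  grep -o "<$1>[^<]*</$1>" | sed -e 's/<[^>]*>//g'
}

# Usage against the filing from the example above (needs network access):
#   curl -s https://opendata.grantseeker.io/data/202242699349300499_public.xml \
#     | xml_field TotalAssetsEOYAmt > total_assets_eoy.txt
```

For anything beyond a quick look at one or two fields, a proper XML parser is the safer choice, since text matching breaks as soon as an element gains an attribute.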

Schema

To help you navigate, here is the design of our full bucket:

Bucket Schema [Proposed]:

    
    README.md
    LICENSE
    
    # All IRS source data in here
    /irs

        # Core 990 data 
        /990
            /index
                /latest.csv
                /<pub_year>.csv
            /data
                /xml
                    /<object_id>_public.xml
                /pdf
                    /<object_id>_public.pdf
        
            /schema
                /xml
                    /<version>

        # Publication 78 Data
        /pub78

            # Index of file versions
            /index
                index.csv
                index.json
            
            # Source files (txt format)
            /data
                latest.txt
                <dated-version>.txt

        # Business Master File Extract files (from IRS)
        /extracts
            /bmf
                /latest.csv
                /<year_partition_x>.csv

    # Metadata
    /metadata

        # File with project metadata - e.g. date of last sync, errata, notes, etc
        project.json

    # Folder for developers / users
    /dev
        /utils      # Utilities for munging data; to ensure continuity
        /examples   # Code Examples 
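To make the layout concrete, here is a minimal sketch of how a client might turn an object ID or publication year into an object key. It assumes the proposed schema above is adopted unchanged and that object IDs are purely numeric (as in the example filing); `filing_xml_key` and `index_key` are hypothetical helper names, and no bucket name or base URL is assumed.

```shell
# filing_xml_key OBJECT_ID
# Prints the key of a filing's XML under the PROPOSED layout
# (irs/990/data/xml/<object_id>_public.xml). Rejects non-numeric IDs.
filing_xml_key() {
  case "$1" in
    ''|*[!0-9]*) echo "object id must be numeric: '$1'" >&2; return 1 ;;
  esac
  printf 'irs/990/data/xml/%s_public.xml\n' "$1"
}

# index_key [PUB_YEAR]
# Prints the key of the index CSV for a publication year, or of
# latest.csv when no year is given (again, per the proposed layout).
index_key() {
  printf 'irs/990/index/%s.csv\n' "${1:-latest}"
}
```

Keeping key construction in one place means that if the proposed layout changes before launch, only these two helpers need updating.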

Grantseeker Open API

For convenience, Grantseeker also provides a simple API to query resources and more [in Early Access]. Please inquire if interested.

Target Users & Product Value Proposition

This project aims to provide the most complete, developer-friendly, and open source of IRS 990 data (and related publications).

Users: Anyone interested in building with or consuming this data for the betterment of US nonprofits and the communities they serve around the world.

Core Product
[ ] S3 | Public storage bucket of raw, cloud-optimized IRS source data, with 100% fidelity and 48hr latency (from date of irs.gov publication):
  • 990 XML Files
  • 990 PDF legacy files
  • Pub 78 File
  • BMF/SOI Extracts
  • Index Files (source + concatenated)

[ ] API | Open and permissionless API index for all resources in hosted storage

[ ] README | Core documentation and community reference implementations for getting started with the datasets

Project Stewards

Organization | Website | Representative | Role
Grantseeker, Inc. | https://grantseeker.io | Nathaniel B. Chase @seekerchase | Lead Sponsor
Fluxx Labs, Inc. | https://fluxx.io | [tbd contact] | Advisory Council

...others welcome! Please reach out to the team:

opendata@grantseeker.io

Project Timeline

PHASE 0: Trial Period (6mo)

Step 0 (Q3 2023) Apply for and receive AWS sponsorship to stand up the project. Complete onboarding and transfer the full XML file bucket from GS-S3 to the new "pristine" project bucket.

Step 1 (Q3 2023) Soft launch / announcement to core members of community, past project users / contributors. Formation of basic project governance, including update cadence, team roles / redundancy, and core deliverables.

Step 2 (Q4 2023) Complete first cycle of data updates and product (data) maintenance cycle. Solicit 2-5 new or legacy AWS 990 OpenData projects to onboard to the project.

Step N+1 ...wherever the spirit takes us!

If you are interested in building on this dataset, please reach out! We are open to any and all collaborators.

Why?

While the IRS data is free and available for download by anyone, it is not in an easily consumable format for anyone looking to build software or data projects using it.

  • 990 data is held in two separate datasets: an XML format (for those filing electronically) and a legacy PDF format that is not fully OCR'd or easily machine readable
  • The XML 990s are stored in .zip files grouped chronologically by year, with a varying number of files per year and a non-standard naming system across the years
  • Index files are stored as .csv (also downloadable), one for each year. These are not easily queryable, leaving users to build their own index if they want to identify a filing (e.g. by EIN)
  • 990 schemas vary across years and by filing type (e.g. 990EZ, 990PF, etc.), requiring a variable schema mapping if you want normalized data across years
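As an illustration of the index problem, here is a minimal sketch of the kind of lookup each user currently has to write. It assumes a yearly index CSV with a header row containing `EIN` and `OBJECT_ID` columns (the column names are an assumption about the file, and `ein_lookup` is a hypothetical helper name). It also splits naively on commas, so a quoted field containing a comma would need a real CSV parser.

```shell
# ein_lookup EIN [INDEX_CSV]
# Prints the OBJECT_ID of every index row whose EIN matches.
# Reads INDEX_CSV, or stdin when no file is given.
# ASSUMPTION: the header row names the columns "EIN" and "OBJECT_ID".
ein_lookup() {
  awk -F',' -v ein="$1" '
    { sub(/\r$/, "") }                        # tolerate CRLF line endings
    NR == 1 {                                 # header: map column name -> position
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("EIN" in col) || !("OBJECT_ID" in col)) {
        print "index is missing an EIN or OBJECT_ID column" > "/dev/stderr"
        exit 2
      }
      next
    }
    # Appending "" forces a string comparison, so leading zeros are preserved.
    ($(col["EIN"]) "") == (ein "") { print $(col["OBJECT_ID"]) }
  ' "${2:--}"
}

# Usage (the file name is hypothetical):
#   ein_lookup 142007220 index_2022.csv
```

Each object ID printed this way can then be used to fetch the matching `<object_id>_public.xml` filing.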

Continuation of AWS IRS 990 OpenData Project

This project is a successor project to the original AWS Opendata project described here: https://registry.opendata.aws/irs990/

On December 31, 2021, the IRS deprecated its support of the IRS open data, in favor of hosting files solely at irs.gov: https://www.irs.gov/newsroom/irs-makes-tax-exempt-organization-search-primary-source-to-get-exempt-organization-data

The AWS OpenData team noted in its deprecation notice that it was open to someone taking on stewardship. On June 14, 2023, @seekerchase and pws@amazon.com spoke about Grantseeker, Inc. picking up stewardship of the prior open data project.

Source Data Files

Related Source Data

Currently Available 990 Data Sources / Services

There are many wonderful and open projects that have parsed and made available 990 data to the public.

Below is a summary of the leading ones (please PR more!):

Organization | Project | Website | Notes
ProPublica | NonProfit Explorer | https://projects.propublica.org/nonprofits/ | Good open API; XML files require signed url(?); 12mo behind latest IRS publications
Candid | GuideStar/Search | https://www.guidestar.org/search | IRS data behind register + paywall; good enriched data, but no access to source 990; some stale
Economic Research Institute | 990 Finder | https://www.erieri.com/form990finder | Basic site; 990 data stale @ FY2019 on 2023-06

Reference Projects

There have been a number of interesting and productive data munging projects over the years, probably more than can be listed here. A few worth noting:

FROM AWS YAML (https://github.com/awslabs/open-data-registry/blob/main/datasets/irs990.yaml)

DataAtWork:

Tutorials:

Tools & Applications:

======= PROPUBLICA DATA NOTES https://projects.propublica.org/nonprofits/api
