This repository consists of a number of 'recipes' for using IRSx to achieve common tasks. Most recipes live in their own notebook; some have additional dependencies beyond IRSx. The notebooks are written to be readable rather than to be models of exemplary Python.
If you're just looking to dump all the filings into a relational database (~180 tables), there's a Django project for that here.
To run these as notebooks on your own machine, clone the repo, then install the requirements with
$ pip install -r requirements.txt
and start a Jupyter notebook server with
$ jupyter notebook
Moving files around
IRSx will happily download files, one at a time, but if you've got millions, consider whether there's a faster way.
Amazon distributes a command line tool, the AWS CLI (see more here), that allows bulk copying and speeds up downloads, especially to another Amazon bucket. It requires Amazon (AWS) credentials. There are lots of other ways of getting files from Amazon too, from FTP tools to s3cmd.
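As a rough illustration of the bulk-copy approach, here is a minimal sketch that assembles an `aws s3 sync` command from Python, so it can be reviewed before anything is transferred. The bucket name `example-source-bucket`, the `filings/` prefix and the helper names are placeholders invented for this example, not real locations or part of IRSx. Running the command assumes the AWS CLI is installed and configured with credentials.

```python
import shlex
import subprocess


def build_sync_command(source, destination, include=None, dry_run=True):
    """Build an `aws s3 sync` command as an argument list.

    `source` and `destination` may be local paths or s3:// URIs.
    The values used below are hypothetical placeholders.
    """
    command = ["aws", "s3", "sync", source, destination]
    if include:
        # Exclude everything first, then re-include only the wanted pattern.
        command += ["--exclude", "*", "--include", include]
    if dry_run:
        # --dryrun lists what would be copied without transferring anything.
        command.append("--dryrun")
    return command


def run_sync(command):
    """Run the command; raises CalledProcessError if the AWS CLI fails."""
    return subprocess.run(command, check=True)


# Placeholder bucket and prefix: substitute the real source location.
cmd = build_sync_command(
    "s3://example-source-bucket/filings/", "./filings/", include="*.xml"
)
print(shlex.join(cmd))  # inspect the command before calling run_sync(cmd)
```

Defaulting to a dry run is a deliberate choice here: with millions of files, it is worth checking what would be copied before starting the transfer. Pass `dry_run=False` and call `run_sync(cmd)` once the listing looks right.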
IRSx is focused on reading XML forms using well-defined metadata; that metadata was created using irs990_admin.