Refactor entire system, adapt to new FISC website #15
A note to any readers: PDFs are now being reliably downloaded and versioned as of 7e0d216. The downloader will always download PDFs it hasn't seen, and by default will only re-download PDFs it has seen if the ETags don't match. It's up to my crontab to enforce the schedule by which it will check the full archive for mismatched ETags (once a day?), and the schedule by which it will check the full archive and re-download everything (once a week?).

Also, my archiving attempts show that the Court's website is not especially resilient. The first attempt to download everything failed for ~57 PDFs, the next one got all but 5, the next all but 2, the next all but 1, and then that last one took a while. So re-downloading everything might need to be re-jiggered in the future, to store the last time a PDF was downloaded, and make sure that it will re-download anything that hasn't been successfully re-downloaded in a while.
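To make the download rules in this note concrete, here is a minimal Ruby sketch. It is not the code from this PR: `SeenPdf`, `download?`, `stale?`, and the 7-day default are hypothetical names and values chosen for illustration. It shows the two decisions described above: fetch anything unseen or whose ETag changed, and (as possible future work) flag PDFs that haven't been successfully re-downloaded in a while.

```ruby
# Hypothetical record of what was stored the last time a PDF was fetched.
SeenPdf = Struct.new(:etag, :downloaded_at)

# Should this PDF be fetched on this run?
#   seen        - a SeenPdf, or nil if the PDF has never been downloaded
#   remote_etag - the ETag the server currently reports for the PDF
#   force       - true for a "re-download everything" run
def download?(seen, remote_etag, force: false)
  return true if seen.nil? # always download PDFs we haven't seen
  return true if force     # full re-download, regardless of ETag
  seen.etag != remote_etag # default: only when the ETags don't match
end

# Possible future refinement: treat a PDF as stale when its last
# successful download is older than max_age_days (an assumed threshold).
def stale?(seen, now:, max_age_days: 7)
  seen.nil? || (now - seen.downloaded_at) > max_age_days * 86_400
end
```

With something like this, a daily cron run would call `download?` with the default arguments, a weekly run would pass `force: true`, and `stale?` could catch PDFs that a flaky full run silently skipped.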
This is a complete refactor of the @FISACourt alert system, spurred by the FISC's entirely new website.
The old website is gone (it redirects to the new one), and the old approach of tracking HTML changes is no good: the HTML of the new site has various timestamps/CDN artifacts that make changes unpredictable and ephemeral. So, the system is now a proper scraper that outputs and versions structured data about each filing published by the Court.
Major points:
- Only one script (`check`) takes destructive action. All other Ruby scripts can be safely `require`'d without triggering any meaningful execution.
- Instead of versioning `fisa.html`, we now version `filings/*.yml`, where each YAML file is the metadata for a particular filing. (Title, landing page URL, PDF URL, dockets, "ID".)
- A `check archive` mode that fetches all public filings, going back to the beginning of what the FISC has posted.
- Switched from the `git` gem to the `rugged` gem for manipulating git.

I also updated the README and added a note to the FISC about what they could improve, though it all needs some more work.
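As a rough illustration of the per-filing YAML files mentioned in the major points, the sketch below serializes one filing's metadata. The key names (`id`, `title`, `landing_url`, `pdf_url`, `dockets`), the helper names, and all the values are assumptions made up for this example. The actual schema in `filings/*.yml` may differ.

```ruby
require "yaml"

# Hypothetical metadata for a single filing. Every value is a placeholder.
filing = {
  "id"          => "example-filing-1",
  "title"       => "Example Order",
  "landing_url" => "https://example.com/filings/example-filing-1",
  "pdf_url"     => "https://example.com/filings/example-filing-1.pdf",
  "dockets"     => ["Misc. 00-00"]
}

# Where this filing's YAML file would live, assuming one file per ID.
def filing_path(filing, dir = "filings")
  File.join(dir, "#{filing["id"]}.yml")
end

# Serialize the metadata to a YAML string. Writing one small file per
# filing means a change to any single filing shows up as its own diff.
def filing_to_yaml(filing)
  YAML.dump(filing)
end

puts filing_path(filing)
puts filing_to_yaml(filing)
```

Keeping each filing in its own small file means version control records additions and edits per filing, rather than as noise in one large HTML snapshot.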
Meaningful future work for me:
- `unitedstates/fisacourt`. That helps the @unitedstates project, and makes it easier for others like CourtListener to potentially use our data.