This repository has been archived by the owner. It is now read-only.

Refactor entire system, adapt to new FISC website #15

Merged
33 commits merged into master from new-site on May 26, 2014

Conversation

@konklone
Owner

konklone commented May 26, 2014

This is a complete refactor of the @FISACourt alert system, spurred by the FISC's entirely new website.

[Image: fisc-header-2 — screenshot of the new FISC website's header]

The old website is gone (it redirects to the new one), and the old approach of tracking HTML changes is no good: the HTML of the new site contains various timestamps and CDN artifacts that change unpredictably, making diffs noisy and ephemeral. So the system is now a proper scraper that outputs and versions structured data about each filing published by the Court.

Major points:

  • Broke out the code into 4 sections: general FISC config/checking, git interaction, alert interaction, and a small kickoff script (now called check) that takes destructive action. All other Ruby scripts can be safely require'd without triggering any meaningful execution.
  • Instead of versioning fisa.html, we now version filings/*.yml, where each YAML file is the metadata for a particular filing. (Title, landing page URL, PDF URL, dockets, "ID".)
  • I scrapped Docker integration. It was clumsy, added complexity to the instructions, and wasn't that useful. Oh well.
  • Added a check archive mode that fetches all public filings, going back to the beginning of what the FISC has posted.
  • Added generic exception handling to the check task, that emails the admin if something goes awry.
  • Switched from using the (now deprecated) git gem to the rugged gem for manipulating git.
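As a rough illustration of the `filings/*.yml` approach above, here's a minimal sketch of serializing one filing's metadata to its own YAML file. The field names and the example filing are hypothetical, not the repo's actual schema:

```ruby
require 'yaml'
require 'fileutils'

# Write one filing's metadata to filings/<id>.yml, so git diffs show
# meaningful structured changes instead of HTML/CDN noise.
# (Illustrative sketch only; field names are assumptions.)
def save_filing(filing, dir: "filings")
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{filing["id"]}.yml")
  File.write(path, YAML.dump(filing))
  path
end

filing = {
  "id"          => "br-14-01-misc",
  "title"       => "Opinion and Order",
  "landing_url" => "http://www.fisc.uscourts.gov/public-filings/example",
  "pdf_url"     => "http://www.fisc.uscourts.gov/sites/default/files/example.pdf",
  "dockets"     => ["BR 14-01"]
}
save_filing(filing) # => "filings/br-14-01-misc.yml"
```

One file per filing keeps each commit's diff scoped to exactly the filings that changed.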

I also updated the README and added a note to the FISC about what they could improve, though it all needs some more work.
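The generic exception handling mentioned above might look something like this minimal sketch, where `admin_email` is a hypothetical stand-in for the real alert delivery code:

```ruby
SENT = []

# Stand-in for the real alert/email delivery; illustrative only.
def admin_email(subject, body)
  SENT << [subject, body]
end

# Hypothetical sketch of generic exception handling on the check task:
# run the work, and email the admin if anything goes awry.
def run_check
  yield
  :ok
rescue => ex
  admin_email("FISC check failed: #{ex.class}",
              "#{ex.message}\n#{Array(ex.backtrace).join("\n")}")
  :failed
end

run_check { "all good" }        # => :ok
run_check { raise "site down" } # => :failed (and one admin email recorded)
```

The point is that any unanticipated failure surfaces as an email rather than a silently dead cron job.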

Meaningful future work for me:

  • Actually download the PDFs the FISC publishes, and version them. Treat changes to them as alert-worthy, like anything else. (The FISC has changed a PDF post-publish at least once, without any notice.)
  • Grab the description for each filing from its landing page. Right now, scraped information is limited to what's present on the listings of results.
  • Separate the scraper->data system further from the check/alert code, and transfer that core to unitedstates/fisacourt. That helps the @unitedstates project, and makes it easier for others like CourtListener to potentially use our data.
  • Possibly grab more accurate dates from PDF metadata. This isn't usually a great strategy, but the FISC post dates from 4/30 on back are so obviously wrong that it will probably be more accurate.
  • Propose an actual schema for the FISC to publish structured data in.

konklone added a commit that referenced this pull request May 26, 2014

Merge pull request #15 from konklone/new-site
Refactor entire system, adapt to new FISC website

@konklone konklone merged commit 30f52cf into master May 26, 2014

@konklone konklone deleted the new-site branch May 26, 2014

konklone added a commit that referenced this pull request May 26, 2014

@konklone

Owner

konklone commented Jun 4, 2014

A note to any readers: PDFs are now being reliably downloaded and versioned as of 7e0d216.

The downloader always downloads PDFs it hasn't seen, and by default only re-downloads PDFs it has seen if their ETags don't match. An everything command can be given that forces re-downloads of whatever the script is fetching (by default page 1; with archive, all pages).

It's up to my crontab to enforce the schedules: check the full archive for mismatched ETags (once a day?), and re-download the full archive outright (once a week?).
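The re-download decision described above reduces to a small predicate. This is a hedged sketch, assuming one stored ETag per PDF (the actual downloader's bookkeeping isn't shown here):

```ruby
# A PDF gets downloaded if we've never seen it, if the server's ETag no
# longer matches our stored one, or if an "everything" run forces it.
# (Illustrative sketch; not the repo's actual downloader code.)
def should_download?(stored_etag, remote_etag, everything: false)
  return true if everything        # forced full re-download
  return true if stored_etag.nil?  # never seen this PDF before
  stored_etag != remote_etag       # re-fetch only on ETag mismatch
end

should_download?(nil, '"abc123"')                          # => true  (new PDF)
should_download?('"abc123"', '"abc123"')                   # => false (unchanged)
should_download?('"abc123"', '"def456"')                   # => true  (changed upstream)
should_download?('"abc123"', '"abc123"', everything: true) # => true  (forced)
```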

Also, my archiving attempts show that the Court's website is not especially resilient. The first attempt to download everything failed for ~57 PDFs, the next one got all but 5, the next all but 2, the next all but 1, and then that last one took a while.

So re-downloading everything might need to be re-jiggered in the future, to store the last time a PDF was downloaded, and make sure that it will re-download anything that hasn't been successfully re-downloaded in a while.
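That re-jiggering could be sketched as tracking a last-successful-download timestamp per PDF and flagging anything past a cutoff. The record format and cutoff here are assumptions for illustration:

```ruby
require 'time'

# Given a map of PDF name => last successful download time (nil if never),
# return the ones due for a retry: never fetched, or fetched too long ago.
# (Hypothetical sketch; cutoff defaults to one week.)
def stale_pdfs(last_downloaded, cutoff: 7 * 24 * 3600, now: Time.now)
  last_downloaded.select do |_name, time|
    time.nil? || (now - time) > cutoff
  end.keys
end

now = Time.parse("2014-06-04 12:00:00 UTC")
records = {
  "a.pdf" => now - 3600,            # fetched an hour ago: fine
  "b.pdf" => now - 10 * 24 * 3600,  # ten days ago: stale
  "c.pdf" => nil                    # never successfully fetched
}
stale_pdfs(records, now: now) # => ["b.pdf", "c.pdf"]
```

A scheme like this would make the flaky-server problem self-healing: each run retries only what's missing or overdue, instead of hammering the whole archive.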
