This repository has been archived by the owner. It is now read-only.

Refactor entire system, adapt to new FISC website #15

Merged
33 commits merged into master from new-site on May 26, 2014

Conversation

@konklone
Owner

konklone commented May 26, 2014

This is a complete refactor of the @FISACourt alert system, spurred by the FISC's entirely new website.

[Image: fisc-header-2 — screenshot of the new FISC website's header]

The old website is gone (it redirects to the new one), and the old approach of tracking HTML changes is no good: the HTML of the new site contains various timestamps and CDN artifacts that change unpredictably, making diffs noisy and ephemeral. So the system is now a proper scraper that outputs and versions structured data about each filing published by the Court.

Major points:

  • Broke out the code into 4 sections: general FISC config/checking, git interaction, alert interaction, and a small kickoff script (now called check) that takes destructive action. All other Ruby scripts can be safely require'd without triggering any meaningful execution.
  • Instead of versioning fisa.html, we now version filings/*.yml, where each YAML file is the metadata for a particular filing. (Title, landing page URL, PDF URL, dockets, "ID".)
  • I scrapped Docker integration. It was clumsy, added complexity to the instructions, and wasn't that useful. Oh well.
  • Added a check archive mode that fetches all public filings, going back to the beginning of what the FISC has posted.
  • Added generic exception handling to the check task, that emails the admin if something goes awry.
  • Switched from using the (now deprecated) git gem to the rugged gem for manipulating git.
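As a rough illustration of the `filings/*.yml` approach above, here's a minimal sketch of serializing one filing's metadata to its own YAML file. The field names and the example filing are hypothetical, not the repo's actual schema:

```ruby
require 'yaml'
require 'fileutils'

# Write one filing's metadata to filings/<id>.yml, so git diffs show
# meaningful structured changes instead of HTML/CDN noise.
# (Illustrative sketch only; field names are assumptions.)
def save_filing(filing, dir: "filings")
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{filing["id"]}.yml")
  File.write(path, YAML.dump(filing))
  path
end

filing = {
  "id"          => "br-14-01-misc",
  "title"       => "Opinion and Order",
  "landing_url" => "http://www.fisc.uscourts.gov/public-filings/example",
  "pdf_url"     => "http://www.fisc.uscourts.gov/sites/default/files/example.pdf",
  "dockets"     => ["BR 14-01"]
}
save_filing(filing) # => "filings/br-14-01-misc.yml"
```

One file per filing keeps each commit's diff scoped to exactly the filings that changed.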

I also updated the README and added a note to the FISC about what they could improve, though it all needs some more work.
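The generic exception handling mentioned above might look something like this minimal sketch, where `admin_email` is a hypothetical stand-in for the real alert delivery code:

```ruby
SENT = []

# Stand-in for the real alert/email delivery; illustrative only.
def admin_email(subject, body)
  SENT << [subject, body]
end

# Hypothetical sketch of generic exception handling on the check task:
# run the work, and email the admin if anything goes awry.
def run_check
  yield
  :ok
rescue => ex
  admin_email("FISC check failed: #{ex.class}",
              "#{ex.message}\n#{Array(ex.backtrace).join("\n")}")
  :failed
end

run_check { "all good" }        # => :ok
run_check { raise "site down" } # => :failed (and one admin email recorded)
```

The point is that any unanticipated failure surfaces as an email rather than a silently dead cron job.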

Meaningful future work for me:

  • Actually download the PDFs the FISC publishes, and version them. Treat changes to them as alert-worthy, like anything else. (The FISC has changed a PDF post-publish at least once, without any notice.)
  • Grab the description for each filing from its landing page. Right now, scraped information is limited to what's present on the listings of results.
  • Separate the scraper->data system further from the check/alert code, and transfer that core to unitedstates/fisacourt. That helps the @unitedstates project, and makes it easier for others like CourtListener to potentially use our data.
  • Possibly grab more accurate dates from PDF metadata. This isn't usually a great strategy, but the FISC post dates from 4/30 on back are so obviously wrong that it will probably be more accurate.
  • Propose an actual schema for the FISC to publish structured data in.

konklone added a commit that referenced this pull request May 26, 2014

Merge pull request #15 from konklone/new-site
Refactor entire system, adapt to new FISC website

@konklone konklone merged commit 30f52cf into master May 26, 2014

@konklone konklone deleted the new-site branch May 26, 2014

konklone added a commit that referenced this pull request May 26, 2014

@konklone

Owner

konklone commented Jun 4, 2014

A note to any readers: PDFs are now being reliably downloaded and versioned as of 7e0d216.

The downloader always downloads PDFs it hasn't seen, and by default only re-downloads PDFs it has seen if their ETags don't match. An everything command can be given that forces re-downloads of whatever the script is fetching (by default page 1; with archive, all pages).

It's up to my crontab to enforce the schedules: check the full archive for mismatched ETags (once a day?), and re-download the full archive outright (once a week?).
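The re-download decision described above reduces to a small predicate. This is a hedged sketch, assuming one stored ETag per PDF (the actual downloader's bookkeeping isn't shown here):

```ruby
# A PDF gets downloaded if we've never seen it, if the server's ETag no
# longer matches our stored one, or if an "everything" run forces it.
# (Illustrative sketch; not the repo's actual downloader code.)
def should_download?(stored_etag, remote_etag, everything: false)
  return true if everything        # forced full re-download
  return true if stored_etag.nil?  # never seen this PDF before
  stored_etag != remote_etag       # re-fetch only on ETag mismatch
end

should_download?(nil, '"abc123"')                          # => true  (new PDF)
should_download?('"abc123"', '"abc123"')                   # => false (unchanged)
should_download?('"abc123"', '"def456"')                   # => true  (changed upstream)
should_download?('"abc123"', '"abc123"', everything: true) # => true  (forced)
```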

Also, my archiving attempts show that the Court's website is not especially resilient. The first attempt to download everything failed for ~57 PDFs, the next one got all but 5, the next all but 2, the next all but 1, and then that last one took a while.

So re-downloading everything might need to be re-jiggered in the future, to store the last time a PDF was downloaded, and make sure that it will re-download anything that hasn't been successfully re-downloaded in a while.
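That re-jiggering could be sketched as tracking a last-successful-download timestamp per PDF and flagging anything past a cutoff. The record format and cutoff here are assumptions for illustration:

```ruby
require 'time'

# Given a map of PDF name => last successful download time (nil if never),
# return the ones due for a retry: never fetched, or fetched too long ago.
# (Hypothetical sketch; cutoff defaults to one week.)
def stale_pdfs(last_downloaded, cutoff: 7 * 24 * 3600, now: Time.now)
  last_downloaded.select do |_name, time|
    time.nil? || (now - time) > cutoff
  end.keys
end

now = Time.parse("2014-06-04 12:00:00 UTC")
records = {
  "a.pdf" => now - 3600,            # fetched an hour ago: fine
  "b.pdf" => now - 10 * 24 * 3600,  # ten days ago: stale
  "c.pdf" => nil                    # never successfully fetched
}
stale_pdfs(records, now: now) # => ["b.pdf", "c.pdf"]
```

A scheme like this would make the flaky-server problem self-healing: each run retries only what's missing or overdue, instead of hammering the whole archive.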
