AWS-based application for mining the Maryland Judiciary Case Search

Case Harvester

Case Harvester is a project designed to mine the Maryland Judiciary Case Search (MJCS) and build a near-complete database of Maryland court cases that can be queried and analyzed without the limitations of the MJCS interface. It is designed to leverage Amazon Web Services (AWS) for scalability and performance.

If you are a researcher or journalist and would like access to our database, please reach out to us at info@openjusticebaltimore.org.

Architecture

Case Harvester is split into three main components: spider, scraper, and parser. Each component is one stage of a pipeline that finds, downloads, and parses case data from the MJCS. The following diagram shows at a high level how these components interact:

High level diagram

Spider

The spider component discovers case numbers by submitting search queries to the MJCS and iterating through the results. Because the MJCS returns at most 500 results per query, a query that returns 500 results may have been truncated, so the search algorithm splits it into a set of narrower queries and submits those instead. Each narrower query is split again if it also returns 500 results, and so on, until the MJCS has been searched exhaustively for case numbers. Each discovered case number is stored in a PostgreSQL database and then added to a queue for scraping:

Spider diagram
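
The splitting strategy described above can be sketched as a recursive search. This is only an illustration under stated assumptions: the README does not say how queries are narrowed, so the sketch assumes they are narrowed by halving the date range, and `search` is a hypothetical stand-in for an MJCS query rather than a function from this repository.

```python
from datetime import date, timedelta

MAX_RESULTS = 500  # cap on results per MJCS query, per the description above


def discover_cases(search, start, end):
    """Return the set of case numbers filed between start and end (inclusive).

    `search(start, end)` is a hypothetical callable that returns at most
    MAX_RESULTS case numbers for the given date range.
    """
    results = search(start, end)
    if len(results) < MAX_RESULTS or start == end:
        # Below the cap, the result set is complete. A single day cannot be
        # halved further, so a real spider would need to narrow on some
        # other field at this point (assumption: not shown here).
        return set(results)
    # Hitting the cap means the results may be truncated, so halve the
    # date range and search each half separately.
    midpoint = start + (end - start) // 2
    return (discover_cases(search, start, midpoint)
            | discover_cases(search, midpoint + timedelta(days=1), end))
```

Halving the range keeps the number of queries roughly proportional to the number of cases, since sparse date ranges are resolved with a single query while dense ranges are split further.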

Scraper

The scraper component downloads and stores the case details for every case number discovered by the spider. The full HTML for each case (example) is added to an S3 bucket. Version information is kept for each case, including a timestamp of when each version was downloaded, so changes to a case can be recorded and referenced.

Scraper diagram

The scraper is a Lambda function that runs once an hour, as well as whenever the scraper queue has items in it. When the scraper is invoked by one of these triggers, it spawns a limited number of worker functions, each of which can scrape up to 10 cases from the queue. Each worker function spawns another worker function upon completion, until the scraper queue is empty. The scraper is usually configured to run only 1-2 concurrent worker functions in order to limit the load on the MJCS.
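
The worker hand-off described above can be simulated locally. The sketch below is a simplification under stated assumptions: it models a single worker chain with an in-memory queue, and `run_worker`, `run_chain`, and `scrape` are hypothetical names. In the deployed system each hand-off would be a new Lambda invocation rather than a loop iteration.

```python
from collections import deque

BATCH_SIZE = 10  # each worker scrapes up to 10 cases, per the description above


def run_worker(queue, scrape):
    """Scrape one batch; return True if a successor worker should be spawned."""
    for _ in range(min(BATCH_SIZE, len(queue))):
        scrape(queue.popleft())
    # A successor is only needed while the queue still has items in it.
    return bool(queue)


def run_chain(case_numbers, scrape):
    """Run one worker chain until the queue is empty; return the worker count."""
    queue = deque(case_numbers)
    workers = 1  # the initial trigger starts the first worker
    while run_worker(queue, scrape):
        workers += 1  # the finished worker spawns its successor
    return workers
```

Because each worker handles a bounded batch and only one successor is spawned per chain, the number of simultaneous requests to the MJCS stays equal to the number of chains started by the trigger.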

Parser

The parser component is another Lambda function that parses the case details from the HTML for each case, and stores those in the PostgreSQL database. Each new item added to the scraper S3 bucket spawns a new parser function, which allows for significant scaling.

Parser diagram
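
As a rough illustration of the S3 trigger, the sketch below pulls the bucket name and object key out of an S3 event notification, which is the payload Lambda receives when an object is created. The `handler` body is a placeholder and an assumption: the real parser would fetch the HTML from S3 and write rows to PostgreSQL, which is not shown here.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Yield (bucket, key) for each object named in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handler(event, context):
    """Hypothetical Lambda entry point: list the objects that need parsing."""
    to_parse = []
    for bucket, key in objects_from_event(event):
        # Placeholder: a real parser would download and parse the HTML here.
        to_parse.append((bucket, key))
    return to_parse
```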

Case details in the MJCS are displayed differently depending on the county and type of case (e.g. district vs circuit court, criminal vs civil, etc.). MJCS assigns a code to each of these different case types, which can be thought of as schemas for rendering case details. Case Harvester currently has parsers for the following schemas:

  • CC: Circuit court civil cases
  • DSCIVIL: District court civil cases
  • DSCR: District court criminal cases
  • DSK8: Circuit court criminal cases
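
One way to organize per-schema parsers is a registry keyed by the MJCS code. This is a minimal sketch, not the repository's actual layout: `parser_for`, `parse_case`, and the stub parser bodies are hypothetical, and only the four schema codes come from the list above.

```python
PARSERS = {}  # maps an MJCS schema code to its parser function


def parser_for(code):
    """Decorator that registers a parser function for one schema code."""
    def register(func):
        PARSERS[code] = func
        return func
    return register


@parser_for("CC")
def parse_cc(html):
    return {"schema": "CC", "court": "circuit", "kind": "civil"}


@parser_for("DSCIVIL")
def parse_dscivil(html):
    return {"schema": "DSCIVIL", "court": "district", "kind": "civil"}


@parser_for("DSCR")
def parse_dscr(html):
    return {"schema": "DSCR", "court": "district", "kind": "criminal"}


@parser_for("DSK8")
def parse_dsk8(html):
    return {"schema": "DSK8", "court": "circuit", "kind": "criminal"}


def parse_case(code, html):
    """Dispatch to the parser registered for this schema code."""
    try:
        parser = PARSERS[code]
    except KeyError:
        raise ValueError(f"no parser for schema {code!r}") from None
    return parser(html)
```

With this shape, supporting a new schema means adding one decorated function, and cases with an unsupported code fail loudly instead of being parsed with the wrong layout.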

Each different parser breaks down the case details to a granular level and stores the data in a number of database tables. This schematic diagram illustrates how this data is represented in the database.

Installation

Case Harvester can be run or deployed from any workstation running Python 3, GNU Make, and jq. The required Python 3 modules are in requirements.txt and can be installed with pip3 install -r requirements.txt.

Next, configure the AWS CLI with aws configure so that it can deploy Case Harvester. Here you'll use an Access Key ID and Secret Access Key for either your root AWS account or an IAM user or role that has sufficient permissions.

Deploy to AWS

Case Harvester uses CloudFormation stacks to deploy, configure, and connect all of the needed AWS resources. There are separate stacks for static resources (VPC, S3 bucket, RDS instance), spider, scraper, and parser. The first step is to set strong, unique passwords for the database users in secrets.json:

{
  "development":{
    "DatabaseMasterUsername":"root",
    "DatabaseMasterPassword":"badpassword",
    "DatabaseUsername":"db_user",
    "DatabasePassword":"badpassword",
    "DatabaseReadOnlyUsername":"ro_user",
    "DatabaseReadOnlyPassword":"badpassword"
  },
  "production":{
    "DatabaseMasterUsername":"root",
    "DatabaseMasterPassword":"badpassword",
    "DatabaseUsername":"db_user",
    "DatabasePassword":"badpassword",
    "DatabaseReadOnlyUsername":"ro_user",
    "DatabaseReadOnlyPassword":"badpassword"
  }
}
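
Before deploying, it can help to confirm that no placeholder passwords remain. The check below is a hypothetical helper, not part of the Makefile or the repository; it assumes the key names shown in the example above and treats "badpassword" as the placeholder value.

```python
import json

REQUIRED_KEYS = {
    "DatabaseMasterUsername", "DatabaseMasterPassword",
    "DatabaseUsername", "DatabasePassword",
    "DatabaseReadOnlyUsername", "DatabaseReadOnlyPassword",
}
PLACEHOLDER = "badpassword"  # placeholder value used in the example above


def check_secrets(secrets):
    """Return a list of problems found in a parsed secrets.json mapping."""
    problems = []
    for env, values in secrets.items():
        # Report any expected key that is absent for this environment.
        for key in sorted(REQUIRED_KEYS - set(values)):
            problems.append(f"{env}: missing {key}")
        # Report any password that was never changed from the placeholder.
        for key, value in values.items():
            if key.endswith("Password") and value == PLACEHOLDER:
                problems.append(f"{env}: {key} is still the placeholder")
    return problems


def check_secrets_file(path="secrets.json"):
    """Load secrets.json from disk and check it."""
    with open(path) as f:
        return check_secrets(json.load(f))
```

Calling check_secrets_file() from the repository root returns an empty list when every environment has all six keys and no placeholder passwords.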

You can then deploy the CloudFormation stacks to AWS by running:

make deploy

Once this is finished, you can initialize the database and configure Case Harvester to use the newly deployed resources by running:

make init

Note that the above commands deploy and initialize a development environment. To deploy to a production environment:

make deploy_production
make init_production

More make targets (such as deploying a specific stack or generating documentation) can be found by looking in the Makefile.

Usage

The default deployment of Case Harvester sets up the scraper and parser to automatically run when new case numbers are submitted by the spider. The spider can be run from any workstation, though for convenience it is usually run on an EC2 instance since the search process can take a long time.

Run the spider by specifying a search time range and county:

./src/case_harvester.py spider --start-date 1/1/2000 --end-date 12/31/2000 --county 'BALTIMORE CITY'

By default, case_harvester.py runs in your development AWS environment (see Deploy to AWS). To run in your production environment, add the --environment production CLI flag:

./src/case_harvester.py spider --environment production -s 1/1/2000 -e 12/31/2000 --county 'BALTIMORE CITY'

Questions

For questions or more information, email info@openjusticebaltimore.org.