CAP database scripts.

Capstone

This repository contains Capstone and CAPAPI, the applications written by the Harvard Law School Library Innovation Lab to manage and serve court opinions. Other than several cases used for our automated testing, this repository does not contain case data. Case data may be obtained through our API, or in certain instances where it's necessary for qualifying non-commercial research, in bulk from us directly.

Project Background

The Caselaw Access Project is a large-scale digitization project hosted by the Harvard Law School Library Innovation Lab. Visit case.law for more details.

The Data

  1. Format Documentation and Samples
  2. Obtaining Real Data
  3. Reporting Data Errors
  4. Errata

Format Documentation and Samples

The output of the project consists of page images, marked-up case XML files, ALTO XML files, and METS XML files. This repository has a more detailed explanation of the format, and two volumes' worth of sample data:

CAP Samples and Format Documentation

Obtaining Real Data

This data, with some temporary restrictions, is available to all. Please see our project site for more information about how to access the API or get bulk access to the data:

https://case.law/

Reporting Data Errors

This is a living, breathing corpus of data. While we've taken great pains to ensure its accuracy and integrity, two large components of this project, namely OCR and human review, are utterly fallible. When we were designing Capstone, we knew that one of its primary functions would be to facilitate safe, accountable updates. If you find any errors in the data, we would be extraordinarily grateful for your taking a moment to create an issue in this GitHub repository's issue tracker to report them. If you notice a large pattern of problems that would be better fixed programmatically, or have a very large number of modifications, describe it in an issue. If we need more information, we'll ask. We'll close the issue when the error has been corrected.

Errata

These are known issues, so there's no need to file an issue if you come across one.

  • Missing Judges Tag: In many volumes, elements which should have the tag name <judges> instead have the tag name <p>. We're working on this one.
  • Nominative Case Citations: In many cases that come from nominative volumes, the citation format is wrong. We hope to have this corrected soon.
  • Jurisdictions: Though the jurisdiction values in our API metadata entries are normalized, we have not propagated those changes to the XML.
  • Court Name: We've seen some inconsistencies in the court name. We're trying to get this normalized in the data, and we'll also publish a complete court name list when we're done.
  • OCR errors: There will be OCR errors on nearly every page. We're still trying to figure out how best to address this. If you've got some killer OCR correction strategies, get at us.

The Capstone Application

Capstone is a Django application with a PostgreSQL database which stores and manages the non-image data output of the CAP project. This includes:

  • Original XML data
  • Normalized metadata extracted from the XML
  • External metadata, such as the Reporter database
  • Changelog data, tracking changes and corrections

CAPAPI

CAPAPI is the API with which users can access CAP data.

Installing Capstone and CAPAPI

Hosts Setup

Add the following to /etc/hosts:

127.0.0.1       case.test
127.0.0.1       api.case.test
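As a quick sanity check, the following sketch parses hosts-file text and reports which of the two hostnames above are not yet mapped to 127.0.0.1. It is not part of Capstone; the function name `missing_host_entries` is hypothetical.

```python
REQUIRED_HOSTS = ("case.test", "api.case.test")


def missing_host_entries(hosts_text, required=REQUIRED_HOSTS, address="127.0.0.1"):
    """Return the required hostnames that hosts_text does not map to address."""
    mapped = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into "address name [alias ...]".
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == address:
            mapped.update(fields[1:])
    return [name for name in required if name not in mapped]
```

To use it, pass in the contents of /etc/hosts, for example `missing_host_entries(open("/etc/hosts").read())`; an empty list means both entries are present.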

Manual Local Setup

  1. Install global system requirements
  2. Clone the repository
  3. Set up python virtualenv
  4. Install application requirements
  5. Set up the postgres database and load test data
  6. Running the capstone server

1. Install global system requirements

  • Python 3.5.4: While there shouldn't be any issues with using a more recent version, we will only accept PRs that are fully compatible with 3.5.4.
  • MySQL: On Macs with homebrew, the version installed with brew install mysql works fine. On Linux, apt-get does the job.
  • Redis: (Instructions)
  • Postgres > 9.5: (Instructions) For Mac developers, Postgres.app is a nice, simple way to get an instant postgres dev installation.
  • Git: (Instructions)
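Before continuing, it can help to confirm that the tools listed above are actually on your PATH. The sketch below is one way to do that. The executable names (`mysql`, `redis-server`, `psql`, `git`) are assumptions about a typical install rather than anything Capstone prescribes, and both helper functions are hypothetical.

```python
import shutil
import sys

# Executable names are assumptions about a typical install; adjust as needed.
REQUIRED_TOOLS = ("mysql", "redis-server", "psql", "git")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_matches(version_info=sys.version_info, required=(3, 5)):
    """True when the interpreter's major.minor version equals `required`."""
    return tuple(version_info[:2]) == tuple(required)
```

Calling `missing_tools()` returns the names that still need installing, and `python_matches()` tells you whether the running interpreter is a 3.5 release.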

2. Clone the repository

$ git clone https://github.com/harvard-lil/capstone.git

3. Set up Python virtualenv (optional)

$ cd capstone/capstone  # move to Django subdirectory
$ mkvirtualenv -p python3 capstone

This will make a virtualenv named "capstone". You can tell that you're inside the virtualenv because your shell prompt will now include the string (capstone).

4. Install application requirements

(capstone)$ pip install -r requirements.txt

5. Set up the postgres database and load test data

(capstone)$ psql -c "CREATE DATABASE capdb;"
(capstone)$ psql -c "CREATE DATABASE capapi;"
(capstone)$ fab init_dev_db  # one time -- set up database tables and development Django admin user, migrate databases
(capstone)$ fab load_test_data  # load in our test data

6. Running the capstone server

(capstone)$ fab run      # start up Django server

Capstone should now be running at 127.0.0.1:8000.
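If you want to confirm from a script that something is listening on that address, a plain TCP connection attempt is enough. This is a minimal sketch using only the standard library. The function name is hypothetical, and it checks only that the port accepts connections, not that Django is the process answering.

```python
import socket


def server_is_listening(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```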

Docker Setup

We have initial support for local development via docker compose. Docker setup looks like this:

$ docker-compose up &
$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE capdb;"
$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE capapi;"
$ docker-compose exec web fab init_dev_db
$ docker-compose exec web fab load_test_data

Capstone should now be running at 127.0.0.1:8000.

Tip: these commands can be shortened by adding something like this to .bash_profile:

alias d="docker-compose exec"
alias dfab="d web fab"

Administering and Developing Capstone

Testing

We use pytest for tests. Some notable flags:

Run all tests:

(capstone)$ pytest

Run one test:

(capstone)$ pytest -k test_name

Run tests without capturing stdout, to allow debugging with pdb:

(capstone)$ pytest -s

Run tests in parallel for speed:

(capstone)$ pytest -n <number of processes>
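Note that `-k` takes an expression that is matched against test names as a substring, so it can select more than one test. The hypothetical module below (not part of Capstone) illustrates this: `pytest -k normalize` should select both tests, while `pytest -k strips` should select only the second.

```python
# test_example.py -- hypothetical module, for illustration only


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def test_normalize_collapses_runs():
    assert normalize_whitespace("1  Mass.   1") == "1 Mass. 1"


def test_normalize_strips_ends():
    # With `pytest -s`, a breakpoint placed here (import pdb; pdb.set_trace())
    # is usable because pytest is not capturing the terminal.
    assert normalize_whitespace("  1 Mass. 1 ") == "1 Mass. 1"
```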

Requirements

Top-level requirements are stored in requirements.in. After updating that file, you should run

(capstone)$ pip-compile

to freeze all subdependencies into requirements.txt.

To ensure that your environment matches requirements.txt, you can run

(capstone)$ pip-sync

This will add any missing packages and remove any extra ones.
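To make the relationship between the two files concrete, the sketch below compares requirements-style text and lists top-level packages that have no pinned entry. It is a simplified illustration with hypothetical function names. It handles plain `name==version` style lines only and does not attempt to parse URLs, environment markers, or other advanced requirement syntax.

```python
import re

NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_names(text):
    """Return normalized package names from requirements-style text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):  # skip blanks, comments, options
            continue
        match = NAME.match(line)
        if match:
            # PEP 503 normalization: lowercase, runs of - _ . become "-".
            names.add(re.sub(r"[-_.]+", "-", match.group(1)).lower())
    return names


def unpinned(requirements_in, requirements_txt):
    """Top-level names from requirements.in with no entry in requirements.txt."""
    return sorted(requirement_names(requirements_in) - requirement_names(requirements_txt))
```

A non-empty result from `unpinned(...)` suggests that requirements.in was edited without re-running pip-compile.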

Applying model changes

Use Django to apply migrations. After you change models.py:

(capstone)$ ./manage.py makemigrations

This will write a migration script to cap/migrations. Then apply:

(capstone)$ fab migrate

Stored Postgres functions

Some Capstone features depend on stored functions that allow Postgres to deal with XML and JSON fields. See set_up_postgres.py for documentation.

Running Command Line Scripts

Command line scripts are defined in fabfile.py. You can list all available commands using fab -l, and run a command with fab command_name.

Local debugging tools

django-extensions is enabled by default, including the very handy ./manage.py shell_plus command.

django-debug-toolbar is not automatically enabled, but if you run pip install django-debug-toolbar it will be detected and enabled by settings_dev.py.
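Detection of an optional package like this is commonly done by checking whether its module can be imported. The following is a sketch of that general pattern, assuming nothing about how settings_dev.py actually does it. The helper name `enable_if_installed` is hypothetical, while `debug_toolbar` and its middleware path are the names the package itself documents.

```python
import importlib.util


def enable_if_installed(installed_apps, middleware,
                        app="debug_toolbar",
                        app_middleware="debug_toolbar.middleware.DebugToolbarMiddleware"):
    """Return (apps, middleware) with the optional app appended if importable."""
    if importlib.util.find_spec(app) is None:
        # Package is not installed: leave both lists unchanged.
        return list(installed_apps), list(middleware)
    return list(installed_apps) + [app], list(middleware) + [app_middleware]
```

In a settings module this could be applied as `INSTALLED_APPS, MIDDLEWARE = enable_if_installed(INSTALLED_APPS, MIDDLEWARE)`. A complete toolbar setup also needs URL configuration and `INTERNAL_IPS`, which this sketch leaves out.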

Model versioning

For database versioning we use the Postgres temporal tables approach inspired by SQL:2011's temporal databases.

See this blog post for an explanation of temporal tables and how to use them in Postgres.

We use django-simple-history to manage creation, migration, and querying of the historical tables.

Data is kept in sync through the temporal_tables Postgres extension and the triggers created in our scripts/set_up_postgres.py file.

Installing the temporal_tables extension is recommended for performance. If not installed, a pure postgres version will be installed by set_up_postgres.py; this is handy for development.
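The core idea of temporal tables is that every change to a live row first archives the previous version together with the period during which it was valid. The toy model below illustrates that idea in pure Python. It is a conceptual sketch only, with hypothetical names, and does not reflect the actual triggers, the temporal_tables extension, or the django-simple-history API.

```python
from datetime import datetime, timezone


class VersionedTable:
    """Toy model: `current` holds live rows, `history` holds superseded ones."""

    def __init__(self):
        self.current = {}   # primary key -> (row dict, valid_from)
        self.history = []   # (primary key, row dict, valid_from, valid_to)

    def upsert(self, key, row, now=None):
        now = now or datetime.now(timezone.utc)
        if key in self.current:
            old_row, valid_from = self.current[key]
            # Archive the old version with a closed validity period.
            self.history.append((key, old_row, valid_from, now))
        self.current[key] = (dict(row), now)

    def as_of(self, key, moment):
        """Return the row version that was valid at `moment`, or None."""
        for hist_key, row, valid_from, valid_to in self.history:
            if hist_key == key and valid_from <= moment < valid_to:
                return row
        if key in self.current and self.current[key][1] <= moment:
            return self.current[key][0]
        return None
```

The half-open period (`valid_from <= moment < valid_to`) means a version stops being valid at the exact moment its replacement takes effect, so every point in time maps to at most one version.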

Download real data locally

These instructions are likely only going to be useful for internal users with access to our production databases and data stores, but there's no reason you couldn't set up an s3 bucket with the expected structure to ingest volumes. If you have any interest in working on something that requires this, file an issue to request that we extend the documentation. We've found very few instances where our test cases did not fully meet our dev needs.

To write test data and fixtures for a given volume and case, run the fab command fab add_test_case with a volume barcode (for example, fab add_test_case:32044057891608_0001).

  • In settings.py, you will need to point DATABASES['tracking_tool'] to the real tracking tool db
  • You will also need to point STORAGES['ingest_storage'] to the real harvard-ftl-shared

Documentation

This readme, code comments, and the API usage docs are the only docs we have. If you want something documented more thoroughly, file an issue and we'll get back to you.