This is a demo of how to analyze a large number of cases without reading them all, using some machine learning to speed up the analysis. Feel free to reuse, modify, or distribute it however you'd like.

First, I import some useful Python libraries. 

In [13]:
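# Imports
# Note: this notebook uses the spaCy 2.x training API and gensim < 4.0
# (the gensim.summarization module was removed in gensim 4.0).

!pip install wget

from collections import Counter
import json
import random
from textwrap import TextWrapper

from gensim.summarization.summarizer import summarize
import spacy
from spacy.util import minibatch, compounding
import wget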



In this demo, I'll explore how cases adjudicating traffic accidents have changed over the years. I am going to compare cases containing the terms **proximate cause** and **intersection** from 1910-20 and 2008-18. I chose these terms because they return a set of cases about accidents at and near intersections where the question of who was responsible is at issue. I wanted two periods of the same length separated by 100 years, but cases after 2018 are sparse in the case.law database, so I've settled for a 98-year separation instead. 

I have already retrieved these cases from the [case.law API](https://api.case.law/v1/), but if you want to make your own API call to retrieve the cases, you can use the following sample API call: `https://api.case.law/v1/cases/?page_size=10&search=%22proximate+cause%22+intersection&decision_date_min=1910-01-01&decision_date_max=1920-01-01`. 
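
If you would rather build that URL in code than by hand, here is a minimal sketch using only the standard library. The helper name `build_search_url` is my own and is not part of the case.law API; the query parameters are simply the ones that appear in the sample call above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.case.law/v1/cases/"

def build_search_url(search, date_min, date_max, page_size=10):
    """Build a case search URL (hypothetical helper that mirrors the sample call)."""
    params = {
        "page_size": page_size,
        "search": search,
        "decision_date_min": date_min,
        "decision_date_max": date_max,
    }
    # urlencode percent-encodes the quotes and turns spaces into '+'
    return f"{BASE_URL}?{urlencode(params)}"

url = build_search_url('"proximate cause" intersection', "1910-01-01", "1920-01-01")
print(url)
```

Changing the two dates gives the corresponding query for the later period.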

In [10]:
# Download list of cases from 1910-20
old_case_list_file = 'https://ketchupduck.s3.amazonaws.com/cars_cases_json_list_old.json'
wget.download(old_case_list_file)

# Download list of cases from 2008-18
new_case_list_file = 'https://ketchupduck.s3.amazonaws.com/cars_cases_json_list_new.json'
wget.download(new_case_list_file)

'cars_cases_json_list_new.json'

In [12]:
# Load case list into local memory 

def read_case_list(filename): 
  with open(filename, 'r') as case_list_file:
    return json.load(case_list_file)

old_case_list = read_case_list('cars_cases_json_list_old.json')
print(f"Number of cases from 1910-20: {len(old_case_list)}")
new_case_list = read_case_list('cars_cases_json_list_new.json')
print(f"Number of cases from 2008-18: {len(new_case_list)}")

Number of cases from 1910-20: 596
Number of cases from 2008-18: 863
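
A quick way to check that each list covers the intended period is to tally cases by decision year. The sketch below assumes each record has a `decision_date` string that starts with a four-digit year, as in the record printed further down. The function `count_cases_by_year` and the sample records are made up for illustration.

```python
from collections import Counter

def count_cases_by_year(cases):
    """Tally cases by the first four characters of their 'decision_date'."""
    return Counter(case["decision_date"][:4] for case in cases)

# Tiny hypothetical sample standing in for old_case_list
sample_cases = [
    {"id": 1, "decision_date": "1918-12-04"},
    {"id": 2, "decision_date": "1918-03-11"},
    {"id": 3, "decision_date": "1912"},
]
year_counts = count_cases_by_year(sample_cases)
for year, count in sorted(year_counts.items()):
    print(year, count)
```

With the real data, `count_cases_by_year(old_case_list)` would give the per-year counts for 1910-20.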


Let's look at what a case in my case list looks like. Notice that I don't have the case text yet, only the case metadata. I didn't want to download the case text for all the cases, only the ones I find interesting, so I'll download the case text for some of these cases later in this file. 

In [13]:
print(json.dumps(old_case_list[0], indent=2))

{
  "id": 1998050,
  "url": "https://api.case.law/v1/cases/1998050/",
  "name": "REID WHITELAW, Respondent, v. GIDEON D. McGilliard, Appellant",
  "name_abbreviation": "Whitelaw v. McGilliard",
  "decision_date": "1918-12-04",
  "docket_number": "L. A. No. 4482",
  "first_page": "349",
  "last_page": "353",
  "citations": [
    {
      "type": "official",
      "cite": "179 Cal. 349"
    }
  ],
  "volume": {
    "barcode": "32044078594405",
    "volume_number": "179",
    "url": "https://api.case.law/v1/volumes/32044078594405/"
  },
  "reporter": {
    "id": 414,
    "full_name": "California Reports",
    "url": "https://api.case.law/v1/reporters/414/"
  },
  "court": {
    "name": "Supreme Court of California",
    "slug": "cal-1",
    "id": 9021,
    "name_abbreviation": "Cal.",
    "url": "https://api.case.law/v1/courts/cal-1/"
  },
  "jurisdiction": {
    "whitelisted": false,
    "name": "Cal.",
    "slug": "cal",
    "id": 30,
    "name_long": "California",
    "url": "https://ap

I am curious about who the parties in each case are. The `spacy` library has the capability to find and label people, organizations, and geopolitical entities, so let's use that to find and label the case parties here. 

In [0]:
# Function to assign and print spacy labels 
def print_labels_check(cases, model, start_idx, end_idx):
  for case in cases[start_idx:end_idx]: 
    case_name = case['name']
    parsed_case_name = model(case_name) 
    print(case_name)
    for ent in parsed_case_name.ents:
      print("   ", ent.text, ent.label_)

In [16]:
nlp = spacy.load('en_core_web_sm')
print_labels_check(new_case_list, nlp, 0, 20)

Linda M. Brown, as Administratrix of the Estate of Wayne Brown, Deceased, Appellant, v. State of New York, Respondent; Linda M. Brown, Appellant, v. State of New York, Respondent
    Linda M. Brown PERSON
    Administratrix PERSON
    Deceased PERSON
    State ORG
    New York GPE
    Respondent GPE
    Linda M. Brown PERSON
    Appellant PERSON
    State ORG
    New York GPE
James M. McILROY v. GIBSON’S APPLE ORCHARD
    James M. McILROY PERSON
    GIBSON ORG
BENNETT v. GEORGIA DEPARTMENT OF TRANSPORTATION; JOHNSON v. GEORGIA DEPARTMENT OF TRANSPORTATION
    BENNETT ORG
    GEORGIA DEPARTMENT OF TRANSPORTATION ORG
    JOHNSON ORG
    GEORGIA DEPARTMENT OF TRANSPORTATION ORG
Kevin Chang, Appellant, v. City of New York et al., Respondents, et al., Defendant
    Kevin Chang PERSON
    Appellant PERSON
    New York GPE
Kristy HUMPHERY as Personal Representative of the Estate of Charles Mandrell, Jr., Appellant-Plaintiff, v. DUKE ENERGY INDIANA, INC., Appellee-Defendant
    Kristy HUMPHERY

Alright, so above is what `spacy` gave me. This is a good start, but it is clear that there are many errors in the labels. For example, look at this set of labels: 
```
Linda M. Brown, as Administratrix of the Estate of Wayne Brown, Deceased, Appellant, v. State of New York, Respondent; Linda M. Brown, Appellant, v. State of New York, Respondent
    Linda M. Brown PERSON
    Administratrix PERSON
    Deceased PERSON
    State ORG
    New York GPE
    Respondent GPE
    Linda M. Brown PERSON
    Appellant PERSON
    State ORG
    New York GPE
```
`spacy` thinks that "Administratrix" and "Appellant" are people, while I know that these are just roles that people assume when in court. `spacy` also labelled "State" as an organization and "New York" as a geopolitical entity, but "State of New York" should have been labelled together as a geopolitical entity. 

The pretrained `spacy` model I loaded (`en_core_web_sm`) was trained on general-purpose text such as news articles and web pages, so the errors mainly stem from the fact that it has not seen many legal case names before and does not have a good idea of how to handle them. I need to train `spacy` with case names so it knows how to better handle them. For example, for the above case name, I will tell `spacy` that the correct labelling is: 
```python
(
  # First, tell spacy the case name 
  "Linda M. Brown, as Administratrix of the Estate of Wayne Brown, Deceased, Appellant, v. State of New York, Respondent; Linda M. Brown, Appellant, v. State of New York, Respondent", 
  # Then, tell spacy the entities in the case 
  {"entities": [
    # Linda M. Brown (characters 0 to 14 in the case name) is a person 
    (0, 14, "PERSON"), 
    # Wayne Brown (characters 51 to 62 in the case name) is a person 
    (51, 51+11, "PERSON"), 
    # State of New York is a geopolitical entity
    (88, 88+17, "GPE"), 
    # Linda M. Brown, again, is a person
    (119, 119+14, "PERSON"),
    # State of New York, again is a geopolitical entity 
    (149, 149+17, "GPE")
  ]}
),

```
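
Counting characters by hand is error-prone, so one option is to let Python find the offsets. This is a minimal sketch and not part of `spacy`: `find_entity_spans` is a hypothetical helper that looks up each name with `str.find` and returns `(start, end, label)` tuples in the format shown above.

```python
def find_entity_spans(text, labelled_names):
    """Return sorted (start, end, label) tuples for every occurrence of each name."""
    spans = []
    for name, label in labelled_names:
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name), label))
            # keep searching after this match to catch repeated names
            start = text.find(name, start + len(name))
    return sorted(spans)

case_name = ("Linda M. Brown, as Administratrix of the Estate of Wayne Brown, "
             "Deceased, Appellant, v. State of New York, Respondent; "
             "Linda M. Brown, Appellant, v. State of New York, Respondent")
spans = find_entity_spans(case_name, [
    ("Linda M. Brown", "PERSON"),
    ("Wayne Brown", "PERSON"),
    ("State of New York", "GPE"),
])
print(spans)
```

This only works when a name appears verbatim in the case name. A short name that is a substring of a longer one (for example "Brown" inside "Wayne Brown") would also be matched inside the longer one, so the results still deserve a quick look.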

Below is a list of the correct labels for several cases. 

In [0]:
# Training data
TRAIN_DATA = [
    ("N. R. Walters et al., Respondents, v. The City of Seattle, Appellant", {"entities": [(0, 13, "PERSON"), (38, 38+19, "GPE")]}),
    ("Fannie Cusick, Appellee, v. W. F. Miller, Appellant", {"entities": [(0, 13, "PERSON"), (28, 28+12, "PERSON")]}),
    ("Celieve Barton, an infant, by his Guardian Ad Litem etc., Appellant, v. J. H. Van Gesen, Respondent", 
      {"entities": [(0, 14, "PERSON"), (34, 34+17, "PERSON"), (72, 72+15, "PERSON")]}),
    ("C. L. LAWRENCE, Appellant, v. T. M. GOODWILL, Respondent", {"entities": [(0, 14, "PERSON"), (30, 30+14, "PERSON")]}),
    ("Glatz, Appellant, vs. Kroeger Brothers Company, Respondent", {"entities": [(0, 5, "PERSON"), (22, 22+24, "ORG")]}),
    ("Karpeles v. City Ice Delivery Co.", {"entities": [(0, 8, "PERSON"), (12, 12+21, "ORG")]}),
    ("Asserina Neilson vs. City of Worcester", {"entities": [(0, 16, "PERSON"), (21, 21+17, "GPE")]}),
    ("Ph. Glickman, Appellee, v. Crane Company, Appellant", {"entities": [(0, 12, "PERSON"), (27, 27+13, "ORG")]}),
    ("CHESTER BIDWELL, Respondent, v. LOS ANGELES AND SAN DIEGO BEACH RAILWAY COMPANY (a Corporation), Appellant", 
      {"entities": [(0, 15, "PERSON"), (32, 32+47, "ORG")]}),
    ("GRAVES v. PORTLAND RY., LIGHT & POWER CO.", {"entities": [(0, 6, "PERSON"), (10, 10+31, "ORG")]}),
    ("John E. Waterhouse, Appellee, v. City of Waterloo, Appellant", {"entities": [(0, 18, "PERSON"), (33, 33+16, "GPE")]}),
    ("MATTIE ESKRIDGE, Respondent, v. METROPOLITAN STREET RAILWAY COMPANY, Appellant", {"entities": [(0, 15, "PERSON"), (32, 32+35, "ORG")]}),
    ("GIBSON v. UTAH LIGHT & TRACTION CO.", {"entities": [(0, 6, "PERSON"), (10, 10+25, "ORG")]}),
    ("CLARK v. JONES", {"entities": [(0, 5, "PERSON"), (9, 9+5, "PERSON")]}),
    ("FRESNO TRACTION COMPANY (a Corporation), Respondent, v. ATCHISON, TOPEKA & SANTA FE RAILWAY COMPANY (a Corporation), Appellant", 
      {"entities": [(0, 23, "ORG"), (56, 56+43, "ORG")]}),
    ("PERLE C. PEMBERTON, Respondent, v. EDWARD ARNY, Appellant", {"entities": [(0, 18, "PERSON"), (35, 35+11, "PERSON")]}),
    ("Fort Wayne and Northern Indiana Traction Company et al. v. Parish", {"entities": [(0, 48, "ORG"), (59, 59+6, "PERSON")]}),
    ("AHONEN v. HRYSZKO", {"entities": [(0, 6, "PERSON"), (10, 10+7, "PERSON")]}),
    ("FREDERICK E. OPITZ, Respondent, v. PAUL W. SCHENCK, Appellant", {"entities": [(0, 18, "PERSON"), (35, 35+15, "PERSON")]}),
    ("CARTWRIGHT v. NEW ORLEANS RY. & LIGHT CO. et al.", {"entities": [(0, 10, "PERSON"), (14, 14+27, "ORG")]}),
    ("Johnson, Administratrix v. Mobile & Ohio Railroad Company, et al.", {"entities": [(0, 7, "PERSON"), (27, 27+30, "ORG")]}),
    ("Linda M. Brown, as Administratrix of the Estate of Wayne Brown, Deceased, Appellant, v. State of New York, Respondent; Linda M. Brown, Appellant, v. State of New York, Respondent", {"entities": [(0, 14, "PERSON"), (51, 51+11, "PERSON"), (88, 88+17, "GPE"), (119, 119+14, "PERSON"), (149, 149+17, "GPE")]}),
    ("James M. McILROY v. GIBSON’S APPLE ORCHARD", {"entities": [(0, 16, "PERSON"), (20, 20+22, "ORG")]}),
    ("BENNETT v. GEORGIA DEPARTMENT OF TRANSPORTATION; JOHNSON v. GEORGIA DEPARTMENT OF TRANSPORTATION", {"entities": [(0, 7, "PERSON"), (11, 11+36, "GPE"), (49, 49+7, "PERSON"), (60, 60+36, "GPE")]}),
    ("Kevin Chang, Appellant, v. City of New York et al., Respondents, et al., Defendant", {"entities": [(0, 11, "PERSON"), (27, 27+16, "GPE")]}),
    ("Kristy HUMPHERY as Personal Representative of the Estate of Charles Mandrell, Jr., Appellant-Plaintiff, v. DUKE ENERGY INDIANA, INC., Appellee-Defendant", {"entities": [(0, 15, "PERSON"), (60, 60+21, "PERSON"), (107, 107+25, "ORG")]}),
    ("HAYES et al. v. CRAWFORD", {"entities": [(0, 5, "PERSON"), (16, 16+8, "PERSON")]}),
    ("Joel Kim et al., Appellants, v. Carlos F. Acosta, Respondent", {"entities": [(0, 8, "PERSON"), (32, 32+16, "PERSON")]}),
    ("Debra Watson, Respondent, v. Jade Luxury Transportation Corp. et al., Appellants, et al., Defendants", {"entities": [(0, 12, "PERSON"), (29, 29+32, "ORG")]}),
    ("Darlene Todd, Respondent, v. PLSIII, LLC—We Care et al., Appellants and Oscar Hasley, Jr., Respondent. (Action No. 1.); Oscar Hasley, Jr., Respondent, v. PLSIII, LLC—We Care et al., Appellants. (Action No. 2.)", {"entities": [(0, 12, "PERSON"), (29, 29+19, "ORG"), (72, 72+17, "PERSON"), (120, 120+17, "PERSON"), (154, 154+19, "ORG")]}),
    ("Glen FLETCHER, Plaintiff—Appellant, and Lucille Fletcher; Lucy Fletcher, Plaintiffs, v. PIZZA HUT OF AMERICA, INCORPORATED, Defendant—Appellee, and Yum! Brands, Incorporated, Defendant", {"entities": [(0, 13, "PERSON"), (40, 40+16, "PERSON"), (58, 58+13, "PERSON"), (88, 88+34, "ORG"), (148, 148+25, "ORG")]}),
    ("Peter Poveromo et al., Respondents, v. Town of Cortlandt et al., Appellants", {"entities": [(0, 14, "PERSON"), (39, 39+17, "GPE")]}),
    ("THE PEOPLE OF THE STATE OF ILLINOIS, Plaintiff-Appellee, v. OCTAVIUS L. JOHNSON, Defendant-Appellant", {"entities": [(0, 35, "GPE"), (60, 60+19, "PERSON")]}),
    ("Antoinette McIntosh, an Incapacitated Person, by Her Guardian, Andrea Martin, et al., Appellants, v. Village of Freeport, Respondent, et al., Defendants. (And a Third-Party Action.)", {"entities": [(0, 19, "PERSON"), (63, 63+13, "PERSON"), (101, 101+19, "GPE")]}),
    ("Michelle R. Bailey et al., Respondents, v. County of Tioga et al., Appellants", {"entities": [(0, 18, "PERSON"), (43, 43+15, "GPE")]}),
    ("Susan M. Coffed, as Administrator of the Estate of James B. Coffed, Deceased, Respondent, v. John N. McCarthy et al., Appellants", {"entities": [(0, 15, "PERSON"), (51, 51+15, "PERSON"), (93, 93+16, "PERSON")]}),
    ("Rolf Ohlhausen, Respondent, v. City of New York et al., Defendants, and New York City Transit Authority, Appellant", {"entities": [(0, 14, "PERSON"), (31, 31+16, "GPE"), (72, 72+31, "GPE")]}),
    ("Marlyn Przesiek et al., Respondents, v. State of New York, Appellant", {"entities": [(0, 15, "PERSON"), (40, 40+17, "GPE")]}),
    ("Lori Noller et al., Appellants, v. Miguel Peralta et al., Defendants, and Town of Cornwall, Respondent", {"entities": [(0, 11, "PERSON"), (35, 35+14, "PERSON"), (74, 74+16, "GPE")]}),
    ("Nathaniel Martinez, Respondent, v. Dwight Wascom, Appellant", {"entities": [(0, 18, "PERSON"), (35, 35+13, "PERSON")]}),
    ("SMITH v. THE STATE", {"entities": [(0, 5, "PERSON"), (9, 9+9, "GPE")]})
]
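
Because these offsets are typed by hand, it can be worth running a quick sanity check before training. The sketch below is my own helper, not a `spacy` function: it flags spans that run past the end of the text, include stray whitespace, or start or end in the middle of a word. As far as I know, spans that do not line up with token boundaries are likely to be ignored or to cause errors during training, so catching them early is cheap insurance.

```python
def find_bad_spans(train_data):
    """Return (text, start, end, reason) for every suspicious entity span."""
    problems = []
    for text, annotations in train_data:
        for start, end, _label in annotations["entities"]:
            if not 0 <= start < end <= len(text):
                problems.append((text, start, end, "out of range"))
            elif text[start:end] != text[start:end].strip():
                problems.append((text, start, end, "leading or trailing whitespace"))
            elif (start > 0 and text[start - 1].isalnum()) or (
                end < len(text) and text[end].isalnum()
            ):
                problems.append((text, start, end, "cuts a word in half"))
    return problems

# Hypothetical examples: the first is correct, the other two are deliberately wrong
sample_data = [
    ("CLARK v. JONES", {"entities": [(0, 5, "PERSON"), (9, 14, "PERSON")]}),
    ("CLARK v. JONES", {"entities": [(0, 6, "PERSON")]}),   # includes a trailing space
    ("CLARK v. JONES", {"entities": [(9, 20, "PERSON")]}),  # runs past the end
]
bad_spans = find_bad_spans(sample_data)
for text, start, end, reason in bad_spans:
    print(f"{text!r} [{start}:{end}] -> {reason}")
```

Running `find_bad_spans(TRAIN_DATA)` on the list above should return an empty list if every span is well formed.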

In [18]:
# Let's train our own model with the above training data. 
# Code based on https://github.com/explosion/spaCy/blob/master/examples/training/train_ner.py 

def train_model(n_iter=100):
  nlp = spacy.load('en_core_web_sm')  # Load existing spaCy model

  ner = nlp.get_pipe("ner")
  # Add labels
  for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
      ner.add_label(ent[2])

  # get names of other pipes to disable them during training
  pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
  other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
  with nlp.disable_pipes(*other_pipes):  # only train NER
      for itn in range(n_iter):
          random.shuffle(TRAIN_DATA)
          losses = {}
          # batch up the examples using spaCy's minibatch
          batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
          for batch in batches:
              texts, annotations = zip(*batch)
              nlp.update(
                  texts,  # batch of texts
                  annotations,  # batch of annotations
                  drop=0.5,  # dropout - make it harder to memorise data
                  losses=losses,
              )
          print("Losses", losses)

  # sanity check: run the trained model on the training examples (not held-out data)
  for text, _ in TRAIN_DATA:
    doc = nlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
  
  # TODO save new model? If so, orig code has that template. 
  return nlp 

better_nlp = train_model()

Losses {'ner': 424.8287523006771}
Losses {'ner': 332.3183234319513}
Losses {'ner': 292.2799712356578}
Losses {'ner': 258.71377059009797}
Losses {'ner': 287.0007966591465}
Losses {'ner': 251.94647533451388}
Losses {'ner': 243.77187064938698}
Losses {'ner': 225.17264195625285}
Losses {'ner': 200.54564825511218}
Losses {'ner': 218.81177878024755}
Losses {'ner': 202.52273260562748}
Losses {'ner': 150.15123056791344}
Losses {'ner': 189.62786061842291}
Losses {'ner': 184.2965066613524}
Losses {'ner': 158.64318904952816}
Losses {'ner': 161.90379864536226}
Losses {'ner': 187.5768583726522}
Losses {'ner': 159.18490618034957}
Losses {'ner': 139.4120782030161}
Losses {'ner': 127.55121107371815}
Losses {'ner': 126.48282271646252}
Losses {'ner': 120.12303268601558}
Losses {'ner': 125.84701170474094}
Losses {'ner': 118.51145071678432}
Losses {'ner': 97.97596760998923}
Losses {'ner': 105.45962459111524}
Losses {'ner': 101.13602304463271}
Losses {'ner': 95.73615609676871}
Losses {'ner': 106.9829561356

You can see that `spacy` prints out losses when it is training the new model, like so: 
```
Losses {'ner': 424.8287523006771}
Losses {'ner': 332.3183234319513}
Losses {'ner': 292.2799712356578}
Losses {'ner': 258.71377059009797}
Losses {'ner': 287.0007966591465}
Losses {'ner': 251.94647533451388}
Losses {'ner': 243.77187064938698}
Losses {'ner': 225.17264195625285}
Losses {'ner': 200.54564825511218}
Losses {'ner': 218.81177878024755}
Losses {'ner': 202.52273260562748}
Losses {'ner': 150.15123056791344}
Losses {'ner': 189.62786061842291}
Losses {'ner': 184.2965066613524}
Losses {'ner': 158.64318904952816}
```

The loss is a measure of how far the model's predictions are from the labels in the training data I provided; lower is better. The losses fluctuate but generally decrease as training proceeds, which means the new model is gradually getting better at handling my training data. Now that the new model is ready, let's see if it handles the case names better. Let's ask it to label some cases that were not in the training data. 
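
Since the individual losses bounce around, a moving average makes the downward trend easier to see. This is just an illustrative sketch: `moving_average` is my own helper, and the numbers are the first 15 losses from the output above, rounded to two decimals.

```python
def moving_average(values, window=5):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# First 15 losses from the training output, rounded to two decimals
losses = [424.83, 332.32, 292.28, 258.71, 287.00, 251.95, 243.77, 225.17,
          200.55, 218.81, 202.52, 150.15, 189.63, 184.30, 158.64]
smoothed = moving_average(losses)
print([round(value, 1) for value in smoothed])
```

Every smoothed value is lower than the one before it, even though the raw losses occasionally tick upward.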

In [19]:
print_labels_check(new_case_list, better_nlp, 20, 40)

Nankumarie Gobin et al., Appellants, v. Brenda Y. Delgado, Respondent
    Nankumarie Gobin PERSON
    Brenda Y. Delgado PERSON
Rosario Caruso, Respondent, et al., Plaintiff, v. Nikolajs Gnatjuks et al., Appellants
    Rosario Caruso PERSON
    Nikolajs Gnatjuks PERSON
Laurel E. Gause, Plaintiff/Counterclaim Defendant-Respondent, and Darryl L. Gause, Respondent, v. Carlos Martinez, Defendant/Counterclaim Plaintiff-Appellant
    Laurel E. Gause PERSON
    Darryl L. Gause PERSON
    Carlos Martinez PERSON
Darlene Amaral, Appellant, v. Brittany Reph et al., Respondents
    Darlene Amaral PERSON
    Brittany Reph ORG
Stella Galagotis, Appellant, v. Joseph L. Armenti et al., Respondents
    Stella Galagotis PERSON
    Joseph L. Armenti PERSON
Linda M. Brown, as Administratrix of the Estate of Wayne Brown, Deceased, Respondent, v. State of New York, Appellant. (Appeal No. 2.)
    Linda M. Brown PERSON
    Wayne Brown PERSON
    State of New York GPE
Young Rae Kim, Appellant, v. Heon Young Cho

Not bad! `spacy` is no longer making as many errors as it was before. For example, it labels the below case correctly: 
```
Mark Dodge et al., Respondents, v. County of Erie, Appellant, et al., Defendants
    Mark Dodge PERSON
    County of Erie GPE
```

Before training, `spacy` would probably have labelled "Respondents" and "Appellant" as people, and may have split "County" and "Erie" into two. 

Now let's get `spacy` to label all parties in the case lists I have. 

In [22]:
def label_parties(cases):
  people_count = 0 
  org_count = 0 
  gpe_count = 0 
  cases_with_org = 0
  cases_with_gpe = 0
  for case in cases: 
    case_name = case['name']
    parsed_case_name = better_nlp(case_name) 
    people_count += sum([1 for ent in parsed_case_name.ents if ent.label_ == 'PERSON'])
    this_case_org_count = sum([1 for ent in parsed_case_name.ents if ent.label_ == 'ORG'])
    org_count += this_case_org_count
    this_case_gpe_count = sum([1 for ent in parsed_case_name.ents if ent.label_ == 'GPE'])
    gpe_count += this_case_gpe_count 
    if (this_case_org_count):
      cases_with_org += 1
    if (this_case_gpe_count):
      cases_with_gpe += 1
    #   print(case['id'])

  total_count = people_count + org_count + gpe_count
  print(f"{people_count} people were named, which is {(people_count*100)/total_count}% of all the parties named")
  print(f"{org_count} organizations were named, which is {(org_count*100)/total_count}% of all the parties named")
  print(f"{(cases_with_org*100)/len(cases)}% of cases name an organization")
  print(f"{gpe_count} geopolitical entities were named, which is {(gpe_count*100)/total_count}% of all the parties named")
  print(f"{(cases_with_gpe*100)/len(cases)}% of cases name a geopolitical entity")
  print("\n")

print("Cases from 1910-20:")
label_parties(old_case_list)
print("Cases from 2008-18:")
label_parties(new_case_list)

Cases from 1910-20:
858 people were named, which is 63.64985163204748% of all the parties named
415 organizations were named, which is 30.78635014836795% of all the parties named
63.92617449664429% of cases name an organization
75 geopolitical entities were named, which is 5.563798219584569% of all the parties named
12.248322147651006% of cases name a geopolitical entity


Cases from 2008-18:
2252 people were named, which is 71.01860611794386% of all the parties named
536 organizations were named, which is 16.903185115105646% of all the parties named
38.12282734646582% of cases name an organization
383 geopolitical entities were named, which is 12.078208766950489% of all the parties named
32.79258400926999% of cases name a geopolitical entity




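Interesting! Moving from 1910-20 to 2008-18, I see the following changes: 
- Organizations are named less often: they fall from about 31% to 17% of named parties, and from about 64% to 38% of cases 
- Geopolitical entities are named more often: they rise from about 6% to 12% of named parties, and from about 12% to 33% of cases 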
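
Note that the raw counts of both organizations and geopolitical entities went up, simply because the 2008-18 list names many more parties overall. The comparison that matters is the share of named parties, which this small sketch recomputes from the counts printed above (the `shares` helper is my own):

```python
# Entity counts copied from the output above
counts = {
    "1910-20": {"PERSON": 858, "ORG": 415, "GPE": 75},
    "2008-18": {"PERSON": 2252, "ORG": 536, "GPE": 383},
}

def shares(period_counts):
    """Convert raw counts into percentages of all named parties."""
    total = sum(period_counts.values())
    return {label: 100 * n / total for label, n in period_counts.items()}

old_shares = shares(counts["1910-20"])
new_shares = shares(counts["2008-18"])
for label in ("PERSON", "ORG", "GPE"):
    change = new_shares[label] - old_shares[label]
    print(f"{label}: {old_shares[label]:.1f}% -> {new_shares[label]:.1f}% ({change:+.1f} points)")
```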

Because I skimmed through the cases when collecting them, I have a guess for the first trend. My guess is that many cases involving organizations in 1910-20 were about railways. Since the number of accidents involving trains has gone down since then, fewer railways are named in cases in 2008-18, which would account for why organizations are named less often in 2008-18. 

However, the second trend is unexpected. Why are more geopolitical entities being named? I want to dig deeper into this. By uncommenting `#   print(case['id'])` in the function above, I can get the case IDs of the cases involving geopolitical entities. I can then put all those case IDs in a file called `case_list` and download the full case text for those case IDs using the following script: 

```python
import jsonlines
import requests

with jsonlines.open('gpe_cases_old.jsonl', mode='w') as writer:
    with open('case_list', 'r') as case_list:
        for line in case_list:
            case_no = line.strip()
            if not case_no:
                continue  # skip blank lines
            response = requests.get(
                f'https://api.case.law/v1/cases/{case_no}/?full_case=true',
                headers={'Authorization': 'Token my_caselaw_token'}
            )
            response.raise_for_status()  # stop on authentication or rate-limit errors
            writer.write(response.json())
```
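
The `case_list` file that the script reads could also be produced in code rather than by copying printed IDs. Here is a minimal sketch under a few assumptions: `write_case_ids` is a hypothetical helper, the sample records and their IDs are invented, and the keyword-based `looks_like_gpe` predicate is only a stand-in so the example runs without a trained model.

```python
import os
import tempfile

def write_case_ids(cases, name_has_gpe, path):
    """Write the ID of every case whose name passes `name_has_gpe`, one per line."""
    ids = [case["id"] for case in cases if name_has_gpe(case["name"])]
    with open(path, "w") as id_file:
        for case_id in ids:
            id_file.write(f"{case_id}\n")
    return ids

# Stand-in predicate for demonstration only. In this notebook it could instead be:
# lambda name: any(ent.label_ == "GPE" for ent in better_nlp(name).ents)
def looks_like_gpe(name):
    return "City of" in name or "State of" in name

# Invented records that only carry the two fields the helper needs
sample_cases = [
    {"id": 101, "name": "Kevin Chang, Appellant, v. City of New York et al., Respondents"},
    {"id": 102, "name": "CLARK v. JONES"},
]
path = os.path.join(tempfile.mkdtemp(), "case_list")
written_ids = write_case_ids(sample_cases, looks_like_gpe, path)
print(written_ids)
```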

To save time and to avoid hitting my 500-case daily limit, I have already downloaded those cases and saved them. Let's load them now. 

In [2]:
# Load fullbody cases into local memory 
gpe_cases_old_file = 'https://ketchupduck.s3.amazonaws.com/gpe_cases_old.jsonl'
wget.download(gpe_cases_old_file)
gpe_cases_new_file = 'https://ketchupduck.s3.amazonaws.com/gpe_cases_new.jsonl'
wget.download(gpe_cases_new_file)

'gpe_cases_new.jsonl'

In [0]:
def load_case_body(filename):
  cases = []

  with open(filename, 'r') as case_file:
    for line in case_file:
      record = json.loads(line)
      cases.append(record)

  print(f'Number of cases: {len(cases)}')
  return cases

In [4]:
print("Cases from 1910-20:")
gpe_cases_old = load_case_body(gpe_cases_old_file.split('/')[-1])
print("\nCases from 2008-18:")
gpe_cases_new = load_case_body(gpe_cases_new_file.split('/')[-1])

Cases from 1910-20:
Number of cases: 69

Cases from 2008-18:
Number of cases: 266


Let's see what one of the cases looks like. Note that now, I have the case text as part of my data. 

In [29]:
print(json.dumps(gpe_cases_new[0], indent=2))

{
  "id": 4001922,
  "url": "https://api.case.law/v1/cases/4001922/",
  "name": "Linda M. Brown, as Administratrix of the Estate of Wayne Brown, Deceased, Appellant, v. State of New York, Respondent; Linda M. Brown, Appellant, v. State of New York, Respondent",
  "name_abbreviation": "Brown v. State",
  "decision_date": "2010-12-30",
  "docket_number": "Claim No. 108961; Claim No. 110037",
  "first_page": "1579",
  "last_page": "1587",
  "citations": [
    {
      "type": "official",
      "cite": "79 A.D.3d 1579"
    },
    {
      "type": "parallel",
      "cite": "914 N.Y.S.2d 512"
    }
  ],
  "volume": {
    "volume_number": "79",
    "url": "https://api.case.law/v1/volumes/32044132256744/",
    "barcode": "32044132256744"
  },
  "reporter": {
    "url": "https://api.case.law/v1/reporters/109/",
    "id": 109,
    "full_name": "Appellate Division Reports"
  },
  "court": {
    "url": "https://api.case.law/v1/courts/ny-app-div-3/",
    "name_abbreviation": "N.Y. App. Div.",
    "sl

Now I have about 330 cases that I find interesting, but I don't want to spend a week reading through all of them. Thankfully, the `gensim` library is pretty good at summarizing text. It does this by extracting the most "central" sentences in a text. [The documentation provides more explanation of how it does this.](https://radimrehurek.com/gensim/summarization/summariser.html)
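
To give a feel for what "central" means, here is a deliberately simplified stand-in. It is not `gensim`'s actual algorithm (its documentation describes a variation of TextRank); it just scores each sentence by how many words it shares with the other sentences and keeps the top scorers. All function names and the sample text are my own.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_set(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def centrality_summary(text, n_sentences=2):
    """Keep the sentences that overlap most with the rest of the text."""
    sentences = split_sentences(text)
    words = [word_set(s) for s in sentences]
    scores = []
    for i, wi in enumerate(words):
        score = 0.0
        for j, wj in enumerate(words):
            if i != j and wi and wj:
                # Jaccard similarity: shared words / all words in either sentence
                score += len(wi & wj) / len(wi | wj)
        scores.append(score)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    # Present the chosen sentences in their original order
    return " ".join(sentences[i] for i in sorted(top))

sample_text = ("The road was wet. The driver lost control on the wet road. "
               "The city had not repaired the road. Lunch was served at noon.")
toy_summary = centrality_summary(sample_text)
print(toy_summary)
```

The off-topic sentence about lunch shares almost no words with the others, so it scores lowest and is dropped.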

In my imports above, I already imported the summarize function with `from gensim.summarization.summarizer import summarize`. Now, I can call that function below with `summarize(case_text, word_count=500)`, which tells `gensim` to give me a summary with about 500 words. 

I am more curious about the fact pattern of each case than about the legal reasoning employed in each case, and from my past experience, I know that the fact pattern is usually discussed in the first half of a case, while the legal reasoning is usually discussed in the second half of a case. So for any case longer than 3,000 characters, I summarize only the first half of the text. 



In [0]:
wrapper = TextWrapper()
# Function to summarize cases 
def summarize_case(case):
  case_text = case['casebody']['data']['opinions'][0]['text']
  # Rough heuristic - description of case is in first half of text 
  len_case_text = len(case_text)
  if (len_case_text > 3000):
    len_case_text //= 2 
  case_text = case_text[0:len_case_text]
  summary = summarize(case_text, word_count=500)
  print(f"Case ID: {case['id']}")
  print(f"Full case text is at: {case['frontend_url']}")
  print(wrapper.fill(summary))
  print('\n-------\n')

Let's see a few summaries. 

In [8]:
for case in gpe_cases_old[:5]:
  summarize_case(case)

Case ID: 570526
Full case text is at: https://cite.case.law/wash/97/657/
Lake Dell avenue and East Alder street, both within the city of
Seattle, form what is commonly known as Lake Dell drive. East Alder
street forms the east end of Lake Dell drive, and runs approximately
east and west and intersects Erie street and Lakeside avenue. East
Alder street, from Erie street to Lakeside avenue, descends at a grade
of twelve per cent. The planking at the intersection point was,
according to the contention of the respondent, left broken, rough, and
uneven, and there was a hole about fifty feet in length, from six to
eight inches in depth, and one and one-half to two feet in width,
which, at the time of the accident, was full of water. In getting to
and from this landing it is necessary for trucks to use the Lake Dell
drive and Lakeside avenue. He made the run on the right side of the
street without difficulty until he came to the East Alder street
section of the drive, at a point a distance of

I then read through these summaries and removed cases that seemed irrelevant. For example, I removed any criminal cases, boating-related cases, or cases that did not involve a geopolitical entity. The below files contain only the cases that remained after this step. 

In [9]:
# Load revised cases into local memory 
gpe_cases_old_file_v1 = 'https://ketchupduck.s3.amazonaws.com/gpe_cases_old_v1.jsonl'
wget.download(gpe_cases_old_file_v1)
gpe_cases_new_file_v1 = 'https://ketchupduck.s3.amazonaws.com/gpe_cases_new_v1.jsonl'
wget.download(gpe_cases_new_file_v1)

'gpe_cases_new_v1.jsonl'

In [11]:
print("Cases from 1910-20:")
gpe_cases_old_v1 = load_case_body(gpe_cases_old_file_v1.split('/')[-1])
print("\nCases from 2008-18:")
gpe_cases_new_v1 = load_case_body(gpe_cases_new_file_v1.split('/')[-1])

Cases from 1910-20:
Number of cases: 53

Cases from 2008-18:
Number of cases: 171


Now I can compare randomly-chosen old cases to randomly-chosen new cases. 
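
If you want the same random selection on every run, one option is to sample with a fixed seed. This is a small sketch rather than what I originally ran: `sample_cases` is a hypothetical helper, and the list of integers stands in for real case records.

```python
import random

def sample_cases(cases, k=5, seed=0):
    """Pick up to k distinct cases; the same seed always gives the same picks."""
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    return rng.sample(cases, min(k, len(cases)))

sample_ids = list(range(100, 120))  # hypothetical stand-ins for case records
first = sample_cases(sample_ids, seed=42)
second = sample_cases(sample_ids, seed=42)
print(first == second)
```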

In [12]:
# Summarize 5 randomly-chosen old cases
# (random.sample picks without replacement, so no case is summarized twice)
print("Summaries of randomly-chosen cases from 1910-20:") 
for random_case in random.sample(gpe_cases_old_v1, 5):
  summarize_case(random_case)

# Summarize 5 randomly-chosen new cases 
print("\n\n------***********------\n\n")
print("Summaries of randomly-chosen cases from 2008-18:")
for random_case in random.sample(gpe_cases_new_v1, 5):
  summarize_case(random_case)

Summaries of randomly-chosen cases from 1910-20:
Case ID: 2170272
Full case text is at: https://cite.case.law/iowa/146/624/
Prior to the injuries complained of, surface water falling upon said
lots and upon parts of West Walnut and West Fifteenth streets flowed
naturally and uninterruptedly across plaintiff’s lots and found exit
at the southwest corner thereof, and, by, means of a ditch along the
railway embankment running westward, the surface water coming from the
lot ran westward until it came to a drain tile or pipe passing through
the railroad embankment, from whence it ran across the right of way
and upon the bottoms south of. The petition charges that “said
embankment was negligently and carelessly constructed, in that it
hindered, obstructed, and prevented access to plaintiff’s property
aforesaid and to the alley thereto adjacent; in that it cut off and
destroyed the natural drainage and outlet to the surface water
aforesaid; in that it dammed up the ditch and filled the drain 

I am not sure what summaries you see above, since the cases to be summarized are randomly chosen. From what I have read so far, I think that the scope of what governments are supposed to do to prevent road accidents has expanded in the last 100 years. 

In the cases from 1910-20, the fact patterns often focus on the government's failure to properly maintain the surface of the road. For example, in the five cases I drew, the fact patterns describe a road without proper drainage (case ID 2170272), a road with a ditch (case ID 398997), and street-side gutters that pedestrians could not safely jump over (case ID 8874155). The other two cases concern injuries caused by a streetcar and a city's failure to provide handrails on a bridge. 

In the cases from 2008-18, the fact patterns often focus on the government's failure to properly design the roads, where design encompasses not only the surface of the road, but also the system of signage and lights surrounding it. For example, the fact patterns I drew describe foliage that obstructed drivers' view (case ID 12295149), sidewalk defects marked by vague signage (case ID 12451669), and trucks being allowed to run on steep slopes (case ID 12416411). These fact patterns sometimes also focus on how the government's agents improperly use the roads. For example, two of them describe accidents caused by a USPS truck (case ID 4306156) and a police car (case ID 12303398). 

This suggests that the number of cases involving geopolitical entities has increased because governments are now expected to do more to keep people safe on roads. 