# Caselaw Access Project Exercise
From Harvard Law School's website: "The Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law School Library."

In this Jupyter notebook, I will pull data from the CAP API and reconfigure the data into a dataframe that is easier to understand.

### Loading and Configuring the Data

In [1]:
import json, requests # import necessary packages

In [2]:
url = "https://api.case.law/v1/cases/" # API URL

In [76]:
response = requests.get(url).text # request the first page; .text is the raw JSON string

In [77]:
case_data = json.loads(response) # parse the JSON string into a Python dictionary

In [78]:
type(case_data)
case_data.keys() # check that data is in "dict" format and look at the keys segmenting the data

dict_keys(['count', 'next', 'previous', 'results'])
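
The four keys look like a standard pagination envelope. The sketch below assumes (this notebook does not confirm it) that `count` is the total number of matching cases, `next` and `previous` are page URLs or `None`, and `results` is the list of cases on the current page. `describe_page` is a hypothetical helper, and `sample_page` is a made-up stand-in for `case_data` so the example runs offline.

```python
def describe_page(page):
    """Summarize one response page, assuming the pagination layout described above."""
    return {
        'total_cases': page['count'],
        'cases_on_page': len(page['results']),
        'has_next': page['next'] is not None,
        'has_previous': page['previous'] is not None,
    }

# Made-up page; the 'next' value is a placeholder, not a real API URL.
sample_page = {
    'count': 3,
    'next': 'placeholder-url-of-next-page',
    'previous': None,
    'results': [{'id': 1}, {'id': 2}],
}
summary = describe_page(sample_page)
print(summary)
```

Calling `describe_page(case_data)` on the real response would show at a glance how far one page is from the full dataset.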

In [79]:
import pandas as pd # import package for data manipulation

df = [] # empty list of rows (converted into a dataframe in the next cell)

for x in case_data['results']:
    df.append([x['id'], x['url'], x['name_abbreviation'], x['decision_date'],
    x['jurisdiction']['name_long']]) # for each case in the dictionary, pull desired key-value pairs


In [80]:
df = pd.DataFrame(data = df, columns = ['id', 'url', 'name_abbreviation', 
'decision_date', 'jurisdiction']) # convert key-value pairs into dataframe

df

Unnamed: 0,id,url,name_abbreviation,decision_date,jurisdiction
0,1021505,https://api.case.law/v1/cases/1021505/,Stone v. Boreman,1658,Maryland
1,1021450,https://api.case.law/v1/cases/1021450/,Gerard v. Willan,1659-02-28,Maryland
2,1021509,https://api.case.law/v1/cases/1021509/,Abington v. Lowry,1662-12,Maryland
3,1021488,https://api.case.law/v1/cases/1021488/,Ringgold v. Hinson,1666-10,Maryland
4,1021539,https://api.case.law/v1/cases/1021539/,Ringgold v. Purs,1666-10,Maryland
...,...,...,...,...,...
95,12084019,https://api.case.law/v1/cases/12084019/,Holbrooke v. Holbrooke,1673-07-29,Massachusetts
96,12084051,https://api.case.law/v1/cases/12084051/,Rose v. Young,1673-07-29,Massachusetts
97,12084072,https://api.case.law/v1/cases/12084072/,Hoare v. Atkinson,1673-10-28,Massachusetts
98,12084104,https://api.case.law/v1/cases/12084104/,Carver v. Wright,1673-10-28,Massachusetts
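
Indexing each record with `x['id']`, `x['url']` and so on raises a `KeyError` as soon as one record lacks a field. A more forgiving variant, sketched here under the assumption that missing fields should simply become `None`, uses `dict.get`. `flatten_case` is a hypothetical helper, and the second sample record is made up to show the missing-field case.

```python
def flatten_case(case):
    """Keep the five fields used above; any missing field becomes None."""
    jurisdiction = case.get('jurisdiction') or {}
    return {
        'id': case.get('id'),
        'url': case.get('url'),
        'name_abbreviation': case.get('name_abbreviation'),
        'decision_date': case.get('decision_date'),
        'jurisdiction': jurisdiction.get('name_long'),
    }

sample_results = [
    # First row of the table above, in the nested shape the loop expects.
    {'id': 1021505, 'url': 'https://api.case.law/v1/cases/1021505/',
     'name_abbreviation': 'Stone v. Boreman', 'decision_date': '1658',
     'jurisdiction': {'name_long': 'Maryland'}},
    # Made-up record with missing fields.
    {'id': 0, 'name_abbreviation': 'Hypothetical v. Record'},
]
rows = [flatten_case(c) for c in sample_results]
print(rows[1])
```

Because each row is a dictionary, `pd.DataFrame(rows)` picks up the column names without a separate `columns` argument.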


That code returns only the first page of 100 results (so only cases from 1658 to 1673!). The next step is to develop code that pages through everything in the API.

In [85]:
url = "https://api.case.law/v1/cases/"

all_case_data = [] # empty list that will collect one row per case

response = requests.get(url).text

case_data = json.loads(response)

hasNext = case_data['next'] # URL of the next page of results, or None if there is none
 

The idea is a while loop: as long as the API reports a next page, request it, parse the JSON, and keep only the columns we want. Because there are 6 million case records stored in the API, we want to test the loop for a set amount of time before committing to downloading all the data.
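
The idea in the paragraph above can be sketched as a small function. It is only an illustration under assumptions: `collect_cases` and its `fetch` argument are hypothetical names, the page layout (`results`, `next`) is taken from the keys printed earlier, and `next` is assumed to be `None` on the last page. The demonstration uses two made-up pages so it runs without network access.

```python
import time

def collect_cases(start_url, max_seconds, fetch):
    """Follow 'next' links from start_url until they run out or time is up.

    fetch is any callable that takes a URL and returns the parsed page dict.
    """
    rows = []
    url = start_url
    deadline = time.monotonic() + max_seconds
    while url is not None and time.monotonic() < deadline:
        page = fetch(url)
        for x in page['results']:
            rows.append([x['id'], x['url'], x['name_abbreviation'],
                         x['decision_date'], x['jurisdiction']['name_long']])
        url = page['next']  # advance to the next page; None ends the loop
    return rows

def _case(i):
    """Build a made-up case record with the fields the loop reads."""
    return {'id': i, 'url': f'case-{i}', 'name_abbreviation': f'Case {i}',
            'decision_date': '1700', 'jurisdiction': {'name_long': 'Nowhere'}}

# Two made-up pages keyed by placeholder URLs.
fake_pages = {
    'page-1': {'next': 'page-2', 'results': [_case(1), _case(2)]},
    'page-2': {'next': None, 'results': [_case(3)]},
}
demo_rows = collect_cases('page-1', max_seconds=5, fetch=fake_pages.__getitem__)
print(len(demo_rows))  # prints 3
```

Against the real API, `fetch` could be something like `lambda u: requests.get(u).json()`. Passing it in as an argument keeps the paging logic testable without downloading anything.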

In [91]:
import time

start_time = time.time() # current time
seconds = 120 # run program for two minutes (can be changed)

In [92]:
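# NOTE: url and hasNext are never updated inside this loop, so every pass
# requests the same first page until the time limit is reached.
while hasNext is not None:
    
    response = requests.get(url).text

    case_data = json.loads(response)

    for x in case_data['results']:
        all_case_data.append([x['id'], x['url'], 
            x['name_abbreviation'], x['decision_date'],
            x['jurisdiction']['name_long']]) # append the selected fields to the all_case_data list

    current_time = time.time()
    elapsed_time = current_time - start_time # seconds elapsed since start_time

    if elapsed_time > seconds:
        break # stop once the time limit is exceeded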


The time limit stopped the loop as intended! Now convert `all_case_data` (a list, not a file) into a dataframe to see how many rows it collected.

In [93]:
df = pd.DataFrame(data = all_case_data, columns = ['id', 'url', 'name_abbreviation', 
'decision_date', 'jurisdiction']) # convert key-value pairs into dataframe

df

Unnamed: 0,id,url,name_abbreviation,decision_date,jurisdiction
0,1021505,https://api.case.law/v1/cases/1021505/,Stone v. Boreman,1658,Maryland
1,1021450,https://api.case.law/v1/cases/1021450/,Gerard v. Willan,1659-02-28,Maryland
2,1021509,https://api.case.law/v1/cases/1021509/,Abington v. Lowry,1662-12,Maryland
3,1021488,https://api.case.law/v1/cases/1021488/,Ringgold v. Hinson,1666-10,Maryland
4,1021539,https://api.case.law/v1/cases/1021539/,Ringgold v. Purs,1666-10,Maryland
...,...,...,...,...,...
177495,12084019,https://api.case.law/v1/cases/12084019/,Holbrooke v. Holbrooke,1673-07-29,Massachusetts
177496,12084051,https://api.case.law/v1/cases/12084051/,Rose v. Young,1673-07-29,Massachusetts
177497,12084072,https://api.case.law/v1/cases/12084072/,Hoare v. Atkinson,1673-10-28,Massachusetts
177498,12084104,https://api.case.law/v1/cases/12084104/,Carver v. Wright,1673-10-28,Massachusetts


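The loop collected 177,500 rows, and the data still ends in the 17th century! Note, though, that the last rows are the same 1673 Massachusetts cases that closed the first 100-row table: the loop never replaces `url` with the `next` link, so these rows are repeated downloads of the first page rather than 177,500 distinct cases. The next cell runs the loop for a longer duration in an attempt to reach more recent cases.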
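
A quick way to check how many distinct cases a list like `all_case_data` holds is to compare the row count with the number of unique ids. The sketch uses a few shortened, made-up rows that imitate one page appended twice; `count_unique_ids` is a hypothetical helper that assumes the id sits in position 0 of each row, as it does in `all_case_data`.

```python
def count_unique_ids(rows):
    """Return (total rows, distinct ids), reading the id from position 0 of each row."""
    return len(rows), len({row[0] for row in rows})

# Shortened rows imitating a list in which the same page was appended twice.
page = [[1021505, 'Stone v. Boreman'], [1021450, 'Gerard v. Willan']]
sample_rows = page + page
total, distinct = count_unique_ids(sample_rows)
print(total, distinct)  # prints: 4 2
```

On a dataframe the same check is `df['id'].nunique()`, and `df.drop_duplicates(subset='id')` keeps one row per case.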


In [94]:
import time

start_time = time.time() # current time
seconds = 1200 # run program for 20 minutes

while hasNext != None:
    
    response = requests.get(url).text

    case_data = json.loads(response)

    for x in case_data['results']:
        all_case_data.append([x['id'], x['url'], 
            x['name_abbreviation'], x['decision_date'],
            x['jurisdiction']['name_long']]) # append the selected fields to the all_case_data list

    current_time = time.time() # current time
    elapsed_time = current_time - start_time # seconds elapsed since start_time

    if elapsed_time > seconds: 
        break # cuts off the loop when the time elapsed hits 20 minutes

In [95]:
df = pd.DataFrame(data = all_case_data, columns = ['id', 'url', 'name_abbreviation', 
'decision_date', 'jurisdiction']) # convert key-value pairs into dataframe

df

Unnamed: 0,id,url,name_abbreviation,decision_date,jurisdiction
0,1021505,https://api.case.law/v1/cases/1021505/,Stone v. Boreman,1658,Maryland
1,1021450,https://api.case.law/v1/cases/1021450/,Gerard v. Willan,1659-02-28,Maryland
2,1021509,https://api.case.law/v1/cases/1021509/,Abington v. Lowry,1662-12,Maryland
3,1021488,https://api.case.law/v1/cases/1021488/,Ringgold v. Hinson,1666-10,Maryland
4,1021539,https://api.case.law/v1/cases/1021539/,Ringgold v. Purs,1666-10,Maryland
...,...,...,...,...,...
1398795,12084019,https://api.case.law/v1/cases/12084019/,Holbrooke v. Holbrooke,1673-07-29,Massachusetts
1398796,12084051,https://api.case.law/v1/cases/12084051/,Rose v. Young,1673-07-29,Massachusetts
1398797,12084072,https://api.case.law/v1/cases/12084072/,Hoare v. Atkinson,1673-10-28,Massachusetts
1398798,12084104,https://api.case.law/v1/cases/12084104/,Carver v. Wright,1673-10-28,Massachusetts


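After running the loop for 20 minutes, we still have cases from the 1670s. An important note: the earliest date that pandas can represent as a nanosecond-resolution `Timestamp` is September 21, 1677. This is a pandas limit rather than a Python one, since Python's own `datetime.date` goes back to year 1.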

In [101]:
pd.Timestamp.min

Timestamp('1677-09-21 00:12:43.145224193')

Therefore, we're going to "cheat" a little bit and pull records from the API that are later than that date and continue with the process, running the loop for 30 minutes.
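
The filtered URL can also be assembled from parts rather than typed by hand. This is a minimal sketch using only the standard library; it reuses the `decision_date__gte` parameter from the next cell and hard-codes the date part of `pd.Timestamp.min` as printed above. `BASE_URL`, `timestamp_min_date` and `first_safe_day` are illustrative names.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://api.case.law/v1/cases/"

# Date part of pd.Timestamp.min as printed above, typed in by hand.
timestamp_min_date = date(1677, 9, 21)
first_safe_day = timestamp_min_date + timedelta(days=1)

params = {'decision_date__gte': first_safe_day.isoformat()}
filtered_url = BASE_URL + "?" + urlencode(params)
print(filtered_url)
# prints: https://api.case.law/v1/cases/?decision_date__gte=1677-09-22
```

Keeping the filter in a dictionary makes it easy to change the cutoff date without editing a long URL string.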

In [102]:
url = "https://api.case.law/v1/cases/?decision_date__gte=1677-09-22"

all_case_data = [] # empty list that will collect one row per case

import time

start_time = time.time() # current time
seconds = 1800 # run program for 30 minutes

while hasNext != None:
    
    response = requests.get(url).text

    case_data = json.loads(response)

    for x in case_data['results']:
        all_case_data.append([x['id'], x['url'], 
            x['name_abbreviation'], x['decision_date'],
            x['jurisdiction']['name_long']]) # append the selected fields to the all_case_data list

    current_time = time.time() # current time
    elapsed_time = current_time - start_time # seconds elapsed since start_time

    if elapsed_time > seconds: 
        break # cuts off the loop when the time elapsed hits 30 minutes

In [103]:
df = pd.DataFrame(data = all_case_data, columns = ['id', 'url', 'name_abbreviation', 
'decision_date', 'jurisdiction']) # convert key-value pairs into dataframe

df

Unnamed: 0,id,url,name_abbreviation,decision_date,jurisdiction
0,12089029,https://api.case.law/v1/cases/12089029/,Waterhouse v. Usher,1677-10-30,Massachusetts
1,12089067,https://api.case.law/v1/cases/12089067/,Tayler v. Usher,1677-10-30,Massachusetts
2,12089107,https://api.case.law/v1/cases/12089107/,Allein v. Vsher,1677-10-30,Massachusetts
3,12089144,https://api.case.law/v1/cases/12089144/,Ballard v. Watts,1677-10-30,Massachusetts
4,12089188,https://api.case.law/v1/cases/12089188/,Raynsfords v. Green,1677-10-30,Massachusetts
...,...,...,...,...,...
1720095,12091357,https://api.case.law/v1/cases/12091357/,Dowden v. Hayman,1678-01-29,Massachusetts
1720096,12091396,https://api.case.law/v1/cases/12091396/,Hayman v. Dowden,1678-01-29,Massachusetts
1720097,12091438,https://api.case.law/v1/cases/12091438/,Dowden v. Dell,1678-01-29,Massachusetts
1720098,12091489,https://api.case.law/v1/cases/12091489/,Jones v. Wilcocks,1678-01-29,Massachusetts
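
If the cases before September 1677 are needed later, one option is to keep `decision_date` out of pandas `Timestamp` and parse it with Python's `datetime.date` instead. The sketch below assumes the three formats visible in the tables above (`1658`, `1662-12`, `1659-02-28`) and, as a further assumption, fills a missing month or day with 1. `parse_decision_date` is a hypothetical helper.

```python
from datetime import date

def parse_decision_date(text):
    """Parse 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'; a missing month or day defaults to 1."""
    parts = [int(p) for p in text.split('-')]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected decision_date format: {text!r}")
    year, month, day = (parts + [1, 1])[:3]
    return date(year, month, day)

# Values copied from the decision_date column shown earlier.
parsed = [parse_decision_date(s) for s in ['1658', '1662-12', '1659-02-28']]
print(parsed)
```

Padding with the first of the month or year makes partial dates sortable, at the cost of implying a precision the source does not have, so it may be worth keeping the original string alongside the parsed value.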
