In [1]:
# Module imports
import requests
import docx2txt
from io import BytesIO
from bs4 import BeautifulSoup
import pickle
import docx
import pdfminer



I started by generating a list of URLs. The Court of Appeals case numbers range from B001000 to B303696 (as of January 17, 2020). Not every number in this sequence has a criminal case associated with it, and not every criminal case has an unpublished opinion available. However, URLs for cases *with* unpublished opinions generally follow a pattern, so my approach is to list every URL in the possible range and try to reach each one. A quick search of cases around B100000 turned up none with PDFs available, so to save time I only generate URLs above this case number. In the future, I should check PDF availability for case numbers below B100000.

It's worth noting that this excludes cases with *published* opinions. They're much rarer, and they're also much easier to obtain through the Appeals Court website, LexisNexis, etc. In the long run, I'm not sure whether to include them in this analysis; if I do, I'll collect them separately.

A known URL is included here for testing purposes.

In [2]:
urls = []
# range() excludes its upper bound, so stop at 303697 to include B303696
for i in range(100000, 303697):
    # it appears no cases below 100000 have documents attached
    # 303696 is the highest number as of Jan 17, 2020
    url = 'https://www.courts.ca.gov/opinions/nonpub/B%d.PDF' % i
    urls.append(url)
print(urls[5])
# Fetch one known opinion to confirm that the URL pattern serves a PDF
people_v_hicks = 'https://www.courts.ca.gov/opinions/nonpub/B282282.PDF'
hicks = requests.get(people_v_hicks)
print(hicks.headers["Content-Type"])

https://www.courts.ca.gov/opinions/nonpub/B100005.PDF
application/pdf


The function below was used to make shorter lists and to store case numbers alongside the URLs to which they correspond.

In [3]:
def make_case_list(s, f):
    """Return a list of {"url", "appealsID"} dicts for case numbers s through f - 1."""
    case_list = []
    for i in range(s, f):
        url = 'https://www.courts.ca.gov/opinions/nonpub/B%d.PDF' % i
        casenum = 'B%d' % i
        case = {
            "url": url,
            "appealsID": casenum,
        }
        case_list.append(case)
    return case_list
        
    

Goal: get lists of possible case numbers divided into ranges, then clean them up (keep only the cases whose URLs resolve).

In [4]:
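# make_case_list excludes its upper bound, so each list ends one past the
# last case number wanted (otherwise B150000, B200000, ... would be skipped)
cl_10 = make_case_list(100001, 150001)
cl_15 = make_case_list(150001, 200001)
cl_20 = make_case_list(200001, 250001)
cl_25 = make_case_list(250001, 300001)
cl_30 = make_case_list(300001, 303700)

all_cls = [cl_10, cl_15, cl_20, cl_25, cl_30]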

In [5]:
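# Rebuild all_cls as chunks of 100 consecutive case numbers, newest first.
# Note: this overwrites any earlier value of all_cls.
all_cls = []
divisions = 100
hi = 303700
low = hi - divisions  # make_case_list excludes hi, so each chunk is [low, hi)
while low >= 100000:
    cl = make_case_list(low, hi)
    all_cls.append(cl)
    low -= divisions
    hi -= divisions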
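
Because `range` (and therefore `make_case_list`) excludes its upper bound, chunk boundaries are easy to get off by one. Below is a minimal sketch of one way to check chunk coverage offline, without touching the network. `chunk_bounds` is a hypothetical helper, not part of the notebook, and it assumes half-open `[low, high)` bounds like those `range` uses.

```python
def chunk_bounds(start, stop, size):
    """Yield half-open (low, high) bounds covering range(start, stop), highest chunk first."""
    high = stop
    while high > start:
        low = max(start, high - size)
        yield (low, high)
        high = low


# Same limits as the notebook: case numbers 100000 through 303699, in chunks of 100
bounds = list(chunk_bounds(100000, 303700, 100))

# Flatten the chunks back into case numbers to confirm nothing is skipped or repeated
covered = sorted(n for low, high in bounds for n in range(low, high))
print(covered == list(range(100000, 303700)))
```

Each pair could then be passed to `make_case_list(low, high)` to build the corresponding chunk.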


In [6]:
# Cases whose PDFs were downloaded successfully (appended to by retrieve_cases)
extant_cases = []


In [7]:
def clean_case_list(case_list):
    """Return only the cases whose URL responds with HTTP 200."""
    clean_list = []
    for case in case_list:
        r = requests.get(case["url"])
        # If the document can be retrieved, the status code is 200; if not, it is 404
        if r.status_code == 200:
            clean_list.append(case)
    return clean_list
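
Checking availability with a full GET downloads every PDF just to read its status code. The sketch below shows a lighter variant that sends a HEAD request instead. It assumes the server answers HEAD requests the same way it answers GET, which I have not verified for this site. `url_exists` and `filter_existing` are hypothetical names, and the standard library's `urllib` is used so the sketch runs without extra packages; `requests.head` would be the equivalent call in this notebook.

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for url answers 200.

    Assumes the server supports HEAD. Connection failures are not caught
    here and will propagate as urllib.error.URLError.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Raised for 4xx/5xx responses such as 404 (no opinion at this URL)
        return False


def filter_existing(case_list, exists=url_exists):
    """Keep only the cases for which exists(case["url"]) is true."""
    return [case for case in case_list if exists(case["url"])]


# Offline demonstration with a stand-in for the network check
demo_cases = [
    {"url": "https://example.invalid/B1.PDF", "appealsID": "B1"},
    {"url": "https://example.invalid/B2.PDF", "appealsID": "B2"},
]
demo_kept = filter_existing(demo_cases, exists=lambda url: url.endswith("B2.PDF"))
print([case["appealsID"] for case in demo_kept])  # prints ['B2']
```

Passing the check in as a parameter keeps the filtering logic testable without network access.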
    
    

In [None]:
def retrieve_cases(case_list):
    """Download each available opinion and log the outcome to retrieval_log.txt."""
    with open("retrieval_log.txt", "a") as s:
        for case in case_list:
            try:
                r = requests.get(case["url"])
                if r.status_code == 200:
                    extant_cases.append(case)
                    # NOTE: the output folder is hard-coded, whichever list is passed in
                    filename = 'Opinions/cl_10/%s.pdf' % case["appealsID"]
                    with open(filename, 'wb') as f:
                        f.write(r.content)
                    s.write('Case %s retrieved\n' % case["appealsID"])
                elif r.status_code == 404:
                    s.write('Case %s not found\n' % case["appealsID"])
            except TimeoutError:
                # NOTE: requests raises its own exception types, so connection
                # failures are not caught here (see the errors in the runs below)
                print('Timed out during case %s' % case["appealsID"])

In [10]:
retrieve_cases(cl_25[0:999])

Case B250001 slurped
Case B250002 not found.
Case B250003 not found.
Case B250004 slurped
Case B250005 slurped
Case B250006 not found.
Case B250007 slurped
Case B250008 not found.
Case B250009 slurped
Case B250010 not found.
Case B250011 not found.
Case B250012 not found.
Case B250013 not found.
Case B250014 not found.
Case B250015 slurped
Case B250016 slurped
Case B250017 slurped
Case B250018 slurped
Case B250019 not found.
Case B250020 not found.
Case B250021 slurped
Case B250022 slurped
Case B250023 slurped
Case B250024 not found.
Case B250025 slurped
Case B250026 not found.
Case B250027 slurped
Case B250028 slurped
Case B250029 slurped
Case B250030 not found.
Case B250031 not found.
Case B250032 slurped
Case B250033 slurped
Case B250034 not found.
Case B250035 not found.
Case B250036 not found.
Case B250037 not found.
Case B250038 not found.
Case B250039 slurped
Case B250040 slurped
Case B250041 not found.
Case B250042 slurped
Case B250043 not found.
Case B250044 slurped
Case B2500

From cffi callback <function _verify_callback at 0x00000171B33F2C18>:
Traceback (most recent call last):
  File "C:\Users\leodb\Anaconda3\envs\thesis\lib\site-packages\OpenSSL\SSL.py", line 311, in wrapper
    @wraps(callback)
KeyboardInterrupt


SSLError: HTTPSConnectionPool(host='www.courts.ca.gov', port=443): Max retries exceeded with url: /opinions/nonpub/B250244.PDF (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

In [11]:
retrieve_cases(cl_25[243:999])

Case B250244 not found.
Case B250245 slurped
Case B250246 slurped
Case B250247 slurped
Case B250248 not found.
Case B250249 slurped
Case B250250 slurped
Case B250251 slurped
Case B250252 not found.
Case B250253 not found.
Case B250254 slurped
Case B250255 not found.
Case B250256 not found.
Case B250257 not found.
Case B250258 slurped
Case B250259 not found.
Case B250260 not found.
Case B250261 not found.
Case B250262 not found.
Case B250263 not found.
Case B250264 not found.
Case B250265 not found.
Case B250266 not found.
Case B250267 not found.
Case B250268 not found.
Case B250269 slurped
Case B250270 not found.
Case B250271 not found.
Case B250272 not found.
Case B250273 not found.
Case B250274 not found.
Case B250275 not found.
Case B250276 not found.
Case B250277 slurped
Case B250278 slurped
Case B250279 not found.
Case B250280 not found.
Case B250281 slurped
Case B250282 not found.
Case B250283 slurped
Case B250284 not found.
Case B250285 not found.
Case B250286 not found.
Case B2

KeyboardInterrupt: 

In [16]:
retrieve_cases(cl_25[319:999])

In [23]:
retrieve_cases(cl_25[0:5000])

Retrieval of cl_25 0-5000 complete.

In [24]:
# Note: index 5000 (case B255001) falls between this slice and the previous one and is skipped
retrieve_cases(cl_25[5001:10000])

Retrieval of cl_25 5001-10000 complete.

In [25]:
# Note: index 10000 (case B260001) falls between this slice and the previous one and is skipped
retrieve_cases(cl_25[10001:20000])

ConnectionError: HTTPSConnectionPool(host='www.courts.ca.gov', port=443): Max retries exceeded with url: /opinions/nonpub/B267714.PDF (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000171BDE66348>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

In [27]:
retrieve_cases(cl_25[17759:30000])

In [9]:
retrieve_cases(cl_25[33735:])

ConnectionError: HTTPSConnectionPool(host='www.courts.ca.gov', port=443): Max retries exceeded with url: /opinions/nonpub/B297834.PDF (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000015A83A795C8>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

In [11]:
retrieve_cases(cl_25[47866:])

In [13]:
retrieve_cases(cl_20[0:5000])

In [14]:
retrieve_cases(cl_20[5000:10000])

In [15]:
retrieve_cases(cl_20[10000:15000])

ConnectionError: HTTPSConnectionPool(host='www.courts.ca.gov', port=443): Max retries exceeded with url: /opinions/nonpub/B214007.PDF (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000015A83BC0288>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

In [16]:
retrieve_cases(cl_20[13582:15000])

In [17]:
retrieve_cases(cl_20[15000:20000])

In [18]:
retrieve_cases(cl_20[20000:25000])

In [19]:
retrieve_cases(cl_20[25000:30000])

ConnectionError: HTTPSConnectionPool(host='www.courts.ca.gov', port=443): Max retries exceeded with url: /opinions/nonpub/B229556.PDF (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000015A83AAF108>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

In [20]:
retrieve_cases(cl_20[29548:30000])

In [21]:
retrieve_cases(cl_20[30000:35000])

In [22]:
retrieve_cases(cl_20[35000:40000])

In [23]:
retrieve_cases(cl_20[40000:45000])

In [24]:
retrieve_cases(cl_20[45000:])

ConnectionError: HTTPSConnectionPool(host='www.courts.ca.gov', port=443): Max retries exceeded with url: /opinions/nonpub/B245231.PDF (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000015A83B06888>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

In [26]:
retrieve_cases(cl_20[45265:])

In [28]:
retrieve_cases(cl_30)

In [38]:
retrieve_cases(cl_10[38131:])

In [48]:
def retrieve_cases(folder='default', case_list=cl_15):
    """Download each available opinion into Opinions/<folder>/ and log the outcome."""
    with open("retrieval_log.txt", "a") as s:
        for case in case_list:
            try:
                # timeout is in seconds; 30 is an arbitrary choice
                r = requests.get(case["url"], timeout=30)
                if r.status_code == 200:
                    extant_cases.append(case)
                    filename = 'Opinions/%s/%s.pdf' % (folder, case["appealsID"])
                    with open(filename, 'wb') as f:
                        f.write(r.content)
                    s.write('Case %s retrieved\n' % case["appealsID"])
                elif r.status_code == 404:
                    s.write('Case %s not found\n' % case["appealsID"])
            except requests.exceptions.RequestException as e:
                # requests raises its own exceptions (ConnectionError, Timeout,
                # SSLError), none of which is the built-in TimeoutError, so the
                # earlier `except TimeoutError` never matched.
                print('Request failed during case %s: %s' % (case["appealsID"], e))
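
The restarts above pick each slice index by hand after a connection error. Below is a minimal sketch of deriving that index from the log instead. It assumes the log lines have the form written by `retrieve_cases` (`Case B250244 retrieved` or `Case B250244 not found`). `next_index` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches the two line formats that retrieve_cases writes to retrieval_log.txt
LOG_LINE = re.compile(r"^Case (B\d+) (retrieved|not found)")


def next_index(log_lines, case_list):
    """Return the index in case_list of the first case that has no log entry."""
    seen = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match:
            seen.add(match.group(1))
    for index, case in enumerate(case_list):
        if case["appealsID"] not in seen:
            return index
    return len(case_list)


# Offline demonstration with made-up log lines in the same format
demo_cases = [{"appealsID": "B%d" % n} for n in range(250001, 250006)]
demo_log = ["Case B250001 retrieved\n", "Case B250002 not found\n"]
print(next_index(demo_log, demo_cases))  # prints 2
```

In the notebook, the log file could be opened and passed in directly, for example `start = next_index(open("retrieval_log.txt"), cl_25)`, followed by `retrieve_cases(folder='cl_25', case_list=cl_25[start:])`.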

In [54]:
retrieve_cases(folder='cl_15',case_list=cl_15[49725:])

A key step after downloading the files will be restricting the dataset to criminal cases, particularly felonies, that originated in Los Angeles Superior Court. The good news is that the opinions I've read so far name the superior court in which the original judgment was entered. The trial court case number also encodes whether the case involves a felony: it does if the second character of the lower court's case number is an A.
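
A minimal sketch of that filter, applied to text already extracted from an opinion. It assumes the caption always contains a line of the form `Super. Ct. No. <number>` with a single case number made of capital letters and digits, as in People v. Hicks (B282282); opinions that list several trial court numbers or use a different layout would need more handling. The function names are hypothetical.

```python
import re

# Assumed caption format: "Super. Ct. No. TA140515"
TRIAL_COURT_NO = re.compile(r"Super\.\s*Ct\.\s*No\.\s*([A-Z0-9]+)")


def trial_court_number(opinion_text):
    """Return the trial court case number from the caption, or None if absent."""
    match = TRIAL_COURT_NO.search(opinion_text)
    return match.group(1) if match else None


def looks_like_felony(case_number):
    """Apply the rule described above: the second character is an A."""
    return len(case_number) >= 2 and case_number[1] == "A"


def is_los_angeles(opinion_text):
    """Check whether the caption names Los Angeles County."""
    return "Los Angeles County" in opinion_text


# Caption fragment as it appears in the extracted text of B282282
sample = """      B282282

      (Los Angeles County
      Super. Ct. No. TA140515)
"""
number = trial_court_number(sample)
print(number)                     # prints TA140515
print(looks_like_felony(number))  # prints True
print(is_los_angeles(sample))     # prints True
```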

In [78]:
with open("clean_cls_1_41.txt", "wb") as file:
    pickle.dump(clean_cls, file)


In [81]:
clean_cls_30 = clean_case_list(cl_30)

In [83]:
with open("clean_cls_30.txt", "wb") as file:
    pickle.dump(clean_cls_30, file)

In [85]:
clean_cls_30


[{'url': 'https://www.courts.ca.gov/opinions/nonpub/B300189.PDF',
  'appealsID': 'B300189'},
 {'url': 'https://www.courts.ca.gov/opinions/nonpub/B300258.PDF',
  'appealsID': 'B300258'},
 {'url': 'https://www.courts.ca.gov/opinions/nonpub/B300885.PDF',
  'appealsID': 'B300885'},
 {'url': 'https://www.courts.ca.gov/opinions/nonpub/B301088.PDF',
  'appealsID': 'B301088'}]

Q: Why does this produce only four entries?
A: Because only four cases after B300000 have a PDF associated with them.


Below is an example from the pdfminer documentation, used for testing.

In [41]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('B282282.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Filed 1/11/18  P. v. Hicks CA2/5 

NOT TO BE PUBLISHED IN THE OFFICIAL REPORTS 

California Rules of Court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions 
not certified for publication or ordered published, except as specified by rule 8.1115(b).  This opinion 
has not been certified for publication or ordered published for purposes of rule 8.1115. 

 

IN THE COURT OF APPEAL OF THE STATE OF CALIFORNIA 

SECOND APPELLATE DISTRICT 

 

 

 

DIVISION FIVE 

 
 

Plaintiff and Respondent, 

      B282282 
 
      (Los Angeles County 
      Super. Ct. No. TA140515) 

v. 

THE PEOPLE, 
 
 
 
 
 
DAVION HICKS, 
 
 
 
 

Defendant and Appellant. 

 

APPEAL from a judgment of the Superior Court of the 

County of Los Angeles, Teresa P. Magno, Judge.  Affirmed. 

 

William J. Capriola, under appointment by the Court of 

Appeal, for Defendant and Appellant. 

 

Xavier Becerra, Attorney General, Gerald A. Engler, Chief 

Assistant Attorney General, Lance E. W