In [3]:
# Module imports
import requests
import os
from io import BytesIO
from bs4 import BeautifulSoup
import pickle
import PyPDF2
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import matplotlib.pyplot as plt



Guidance and sample code taken from https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f and https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb

We need to get text out of the PDFs we downloaded earlier, and we use the PyPDF2 library to do it. The function below takes the path to a PDF file and returns its text as a single string.

In [5]:
def pdf_to_text(filename):
    with open(filename,'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # this is the readable object to parse
        numpages = pdfReader.numPages
        count = 0
        text = ""
        while count < numpages:
            pageObj = pdfReader.getPage(count)
            count += 1
            text += pageObj.extractText()
        assert text != "", "Text not readable from %s" % filename
    return text

Once we have an opinion in plain-text form, we need to clean it up. First, though, we check that the extraction works on a single opinion.

In [6]:
# Well, first let's see if it works on one of the opinions. 
# Get the current directory:
currentdir = os.path.abspath('')
# Make a filepath to pass to pdf_to_text:
testpath = os.path.join(currentdir, 'Opinions', 'cl_20', 'B228808.pdf')
testpdf = pdf_to_text(testpath)
print(testpdf)

Filed 2/28/12
  
B
ITH
, LLC v. Mikail CA2/8
 
NOT TO BE PUBLISHED IN THE OFFICIAL REPORTS
 
 
California Rules of Court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for 
publication or ordered published, except as specified by rule 8.1115(b).
  
This opinion has not been certified for publication 
or ordered published for purposes of rule 8.1115.
 
 
IN THE COURT OF APPEAL OF THE STATE OF CALIFORNIA
 
 
SECOND APPELLATE DISTRICT
 
 
DIVISION EIGHT
 
 
 
BITH, LLC,
 
 
 
Plaintiff and Appellant,
 
 
 
v.
 
 
RAMIN MIKAIL
,
 
 
 
Defendant and Respondent.
 
 
      
B228808
 
 
      
(Los Angeles County
 
      
Super. Ct. No. BC 379597)
 
 
 
 
 
APPEAL from a judgment of the Superior Court of Los Angeles County, 
Elizabeth Allen White, Judge.  Affirmed.
 
 
 
Urtnowski & Associates, J.
 
Brian Urt
nowski, Lisamarie Graham and Jered
 
T. Ede 
for Plaintiff and Appellant.
 
 
 
Law Offices of Deborah De Niro, Deborah De Niro; The Ehrlic

Okay, excellent! Comparing the readout to the original PDF, there are some extraneous line breaks and spaces, but the text appears to be mostly present. Some things we might have to fix: footnotes interrupting other text, special characters that cause the following words not to be read, and spaces or line breaks that split names and keywords.

In [16]:
print(repr(testpdf))


'Filed 2/28/12\n  \nB\nITH\n, LLC v. Mikail CA2/8\n \nNOT TO BE PUBLISHED IN THE OFFICIAL REPORTS\n \n \nCalifornia Rules of Court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for \npublication or ordered published, except as specified by rule 8.1115(b).\n  \nThis opinion has not been certified for publication \nor ordered published for purposes of rule 8.1115.\n \n \nIN THE COURT OF APPEAL OF THE STATE OF CALIFORNIA\n \n \nSECOND APPELLATE DISTRICT\n \n \nDIVISION EIGHT\n \n \n \nBITH, LLC,\n \n \n \nPlaintiff and Appellant,\n \n \n \nv.\n \n \nRAMIN MIKAIL\n,\n \n \n \nDefendant and Respondent.\n \n \n      \nB228808\n \n \n      \n(Los Angeles County\n \n      \nSuper. Ct. No. BC 379597)\n \n \n \n \n \nAPPEAL from a judgment of the Superior Court of Los Angeles County, \nElizabeth Allen White, Judge.  Affirmed.\n \n \n \nUrtnowski & Associates, J.\n \nBrian Urt\nnowski, Lisamarie Graham and Jered\n \nT. Ede \nfor Plaintiff and Appel

Documentation from PyPDF2 for the extractText() function: 

extractText()

    Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
    Returns:	a unicode string object.

It looks like if I'm relying on PyPDF2 to extract the text, I'm going to have to work with incomplete text for right now. Maybe I can remove all the line breaks and compare word counts for the documents to check how many words got missed?
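One way to make that comparison concrete is sketched below. It assumes a trusted reference text for the same opinion is available (for example, text copied by hand from a PDF viewer). The function name `word_coverage` and both sample strings are hypothetical and not part of this notebook's pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_coverage(extracted, reference):
    """Return the fraction of reference words that also appear in the extraction.

    Counts are compared as multisets, so a word that appears three times in
    the reference must appear three times in the extraction to be fully covered.
    """
    ref_counts = Counter(tokenize(reference))
    ext_counts = Counter(tokenize(extracted))
    total = sum(ref_counts.values())
    if total == 0:
        return 1.0
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((ref_counts & ext_counts).values())
    return matched / total


# Hypothetical example: the extraction split "Urtnowski" across a line break.
reference = "Brian Urtnowski for Plaintiff and Appellant"
extracted = "Brian Urt\nnowski for Plaintiff and Appellant"
coverage = word_coverage(extracted, reference)
print(round(coverage, 2))
```

A low score flags opinions where extraction dropped or split many words. It does not say where in the document the damage is.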


In [19]:
# Pickle output for later use
pickletest = os.path.join(currentdir, 'textopinions','B228808.txt')
with open(pickletest, "wb") as file:
    pickle.dump(testpdf, file)

Drawing from https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb

In [4]:
# Initial round of cleanup
def cleantext1(text): 
    # Make text lowercase: 
    text = text.lower()
    # Strip all newlines: 
    text = re.sub('\n', '', text)
    return text

In [25]:
print(cleantext1(testpdf))

filed 2/28/12  bith, llc v. mikail ca2/8 not to be published in the official reports  california rules of court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for publication or ordered published, except as specified by rule 8.1115(b).  this opinion has not been certified for publication or ordered published for purposes of rule 8.1115.  in the court of appeal of the state of california  second appellate district  division eight   bith, llc,   plaintiff and appellant,   v.  ramin mikail,   defendant and respondent.        b228808        (los angeles county       super. ct. no. bc 379597)     appeal from a judgment of the superior court of los angeles county, elizabeth allen white, judge.  affirmed.   urtnowski & associates, j. brian urtnowski, lisamarie graham and jered t. ede for plaintiff and appellant.   law offices of deborah de niro, deborah de niro; the ehrlich law firm and jeffrey isaac ehrlich for defendant and respondent.   * * *

In [26]:
testclean = cleantext1(testpdf)
# Pickle output for later use
pickletestclean = os.path.join(currentdir, 'textopinions','B228808_clean.txt')
with open(pickletestclean, "wb") as file:
    pickle.dump(testclean, file)

Here, we use scikit-learn's feature extraction module to count each word in the text, ignoring common English stop words like "the" and "and".

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(input='content',stop_words = 'english')
data_cv = cv.fit_transform([testclean])
data_dtm = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names())
# sort by frequency
data_sort = data_dtm.sort_values(by=0,axis=1, ascending=False)
# The matrix has one row per document, so it is wide; transpose it so each word gets its own row.
data_sort = data_sort.transpose()

Now that we've checked one file, time to do the other 20,000.


In [18]:
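def batchconvert(folder):
    '''Pickle the extracted text of every PDF in folder into textopinions.'''
    for file in os.listdir(folder):
        if not file.endswith('.pdf'):
            continue
        casenum = os.path.splitext(file)[0]
        savepath = os.path.join(currentdir, 'textopinions', '%s.txt' % casenum)
        try:
            # Extract first so a failed conversion does not leave an empty file behind
            text = pdf_to_text(os.path.join(folder, file))
            with open(savepath, "wb") as textfile:
                pickle.dump(text, textfile)
        except Exception:
            # Report the failure and move on to the next file
            print('Error occurred during %s ' % casenum)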

In [50]:
folder = os.path.join(currentdir, 'Opinions','cl_15')
batchconvert(folder)

In [51]:
folder = os.path.join(currentdir, 'Opinions','cl_20')
batchconvert(folder)

TODO: Make it easy to start up again if it stops, handle errors, or both. Ideally both.
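A minimal sketch of one way to cover both parts of this TODO follows. It is not the code that produced the runs in this notebook. The name `batchconvert_resumable`, the `outdir` and `convert` parameters, and the `.tmp` suffix are all assumptions made for illustration. The converter is passed in as a callable so that `pdf_to_text` (or any replacement) can be plugged in.

```python
import os
import pickle


def batchconvert_resumable(folder, outdir, convert):
    """Convert every PDF in `folder`, skipping ones that already have output.

    `convert` is any callable that takes a PDF path and returns a string.
    Returns a list of (case number, error message) pairs for failed files.
    """
    os.makedirs(outdir, exist_ok=True)
    failures = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.pdf'):
            continue
        casenum = os.path.splitext(name)[0]
        savepath = os.path.join(outdir, casenum + '.txt')
        if os.path.exists(savepath):
            # Already converted on an earlier run, so skip it.
            continue
        try:
            text = convert(os.path.join(folder, name))
        except Exception as err:
            # Record the failure and keep going with the next file.
            failures.append((casenum, str(err)))
            continue
        # Write to a temporary name first so an interrupted run never
        # leaves a half-written file that looks finished.
        tmppath = savepath + '.tmp'
        with open(tmppath, 'wb') as textfile:
            pickle.dump(text, textfile)
        os.replace(tmppath, savepath)
    return failures
```

With this shape, a call such as `batchconvert_resumable(folder, os.path.join(currentdir, 'textopinions'), pdf_to_text)` could be rerun after a crash without redoing finished files, and the returned list shows which opinions need a second look.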

In [19]:
folder = os.path.join(currentdir, 'Opinions','cl_25')
batchconvert(folder)

Error occurred during B294547 


In [21]:
folder = os.path.join(currentdir, 'Opinions','cl_30')
batchconvert(folder)

In [22]:
folder = os.path.join(currentdir, 'Opinions','cl_10')
batchconvert(folder)

At this point, we should have all the opinions in text form. Now we can load the pickled files and associate each case number with its corpus, check to see that it came from the Los Angeles Superior Court and is a felony, and add the word-frequency matrix.

For testing this part, we want to have a felony case. B250042 is a felony which originated in the Los Angeles Superior Court.

In [None]:
with open(os.path.join(currentdir, 'textopinions', 'B250042.txt'), "rb") as file:
    felony_case = pickle.load(file)
    
print(type(felony_case))


We're looking for a string of the form "Super. Ct. No. NA088447" - specifically, the lower court case number should be [letter] A [6-digit number]. I am not sure whether all cases use 6 digits following the A. We're also looking for the words "Los Angeles County" - and we want both to appear on the first page of the document.
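The check described here could be sketched with regular expressions as follows. The pattern assumes exactly six digits after the "A", which is unconfirmed. `find_la_felony_number` and its `head_chars` cutoff (a rough stand-in for "the first page") are hypothetical names chosen for illustration.

```python
import re

# Assumed pattern: one letter, then "A", then six digits (e.g. NA088447).
# Whether every felony case number uses exactly six digits is not confirmed.
LOWER_COURT_RE = re.compile(
    r'super\.\s*ct\.\s*no\.\s*([a-z]a\d{6})\b', re.IGNORECASE)
COUNTY_RE = re.compile(r'los\s+angeles\s+county', re.IGNORECASE)


def find_la_felony_number(text, head_chars=3000):
    """Return the lower-court case number if the opening of `text` looks like
    a Los Angeles County felony appeal, otherwise None.

    `head_chars` approximates the first page by character count.
    """
    head = text[:head_chars]
    if not COUNTY_RE.search(head):
        return None
    match = LOWER_COURT_RE.search(head)
    return match.group(1).upper() if match else None


# Excerpts shaped like the opinions shown in this notebook.
felony = "b300885        (los angeles county        super. ct. no. ba198121)"
civil = "B228808 (Los Angeles County Super. Ct. No. BC 379597)"
print(find_la_felony_number(felony))
print(find_la_felony_number(civil))
```

Using `\s+` and `\s*` between words makes the match tolerant of the stray spaces and line breaks that the PDF extraction leaves behind.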

In [7]:
# test how os.path.split() works
testpath = r'C:\Users\leodb\Documents\THESIS\ldbw\textopinions\B300885.txt'
print(os.path.split(testpath)[1])
print(os.path.splitext(os.path.split(testpath)[1]))

B300885.txt
('B300885', '.txt')


Upon further consideration, it may be worthwhile to process all the opinions and look for the case number afterward. 

In [40]:
# What form should the set of opinions take? 
# Dictionary where key = casenum, value = dataframe with corpus and wordcount matrix?
# each case is object with properties?
# Let's try setting up a class Case with properties "corpus", "wordcount", "casenum"
import os
import pickle
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

class Case:
    def __init__(self, textfile):
        assert os.path.isfile(textfile), "No such file: %s" % textfile
        # We start out with just the raw text string from the pdf
        with open(textfile, "rb") as file: 
            self.text = pickle.load(file)
        # The case number is the filename without its extension
        self.casenumber = os.path.splitext(os.path.basename(textfile))[0]
    # We should have a method for doing text cleanup and a method for 
    # count vectorization
    def make_corpus(self):
        '''Adds corpus attribute to the Case object by cleaning
        the raw text string'''
        # Make text lowercase: 
        text = self.text
        text = text.lower()
        # Strip all newlines: 
        text = re.sub('\n', '', text)
        self.corpus = text
    def make_unigram_matrix(self):
        '''Adds frequency-matrix-of-unigrams attribute to Case object. Requires
        that corpus attribute already exists.'''
        # currently redundant with ngram matrix
        cv = CountVectorizer(input='content',stop_words = 'english')
        freq_cv = cv.fit_transform([self.corpus])
        freq_matrix = pd.DataFrame(freq_cv.toarray(), columns = cv.get_feature_names())
        # Sort the words by frequency, most frequent first
        # freq_matrix = freq_matrix.sort_values(by=0,axis=1, ascending=False)
        # freq_matrix = freq_matrix.transpose()
        self.unigram_matrix = freq_matrix
    def make_ngram_matrix(self, n):
        '''Adds frequency-matrix-of-ngrams attribute to Case object. Requires
        that corpus attribute already exists.'''
        cv = CountVectorizer(input='content',stop_words = 'english',strip_accents='unicode',ngram_range=(1,n))
        freq_cv = cv.fit_transform([self.corpus])
        freq_matrix = pd.DataFrame(freq_cv.toarray(), columns = cv.get_feature_names())
        # Sort the words by frequency, most frequent first
        # freq_matrix = freq_matrix.sort_values(by=0,axis=1, ascending=False)
        # freq_matrix = freq_matrix.transpose()
        # note: transposing matrix makes the to_string method produce
        # a nice output, but makes indexing by ngram slightly harder?
        self.ngram_matrix = freq_matrix

  
        

In [41]:
# testing initialization method for class
b = Case(testpath)
print(b.casenumber)
# print(b.text)
b.make_corpus()
# print(b.corpus)
b.make_unigram_matrix()
b.make_ngram_matrix(5)
print(b.unigram_matrix)
print(b.ngram_matrix)

B300885
   10  11  1115  11278  11280  1128111284  1128411285  1170  1192  12  ...  \
0   2   1     3      1      1           1           1     1     1   3  ...   

   witnesses  wl  woodell  working  writ  writs  xavier  year  years  yun  
0          1   1        2        1     4      1       1     3      5    1  

[1 rows x 709 columns]
   10  10 348  10 348 352353  10 348 352353 gallardo  \
0   2       1              1                       1   

   10 348 352353 gallardo intended  10 years  10 years prior  \
0                                1         1               1   

   10 years prior felony  10 years prior felony enhancements  11  ...  \
0                      1                                   1   1  ...   

   years life robbery 25 years  years prior  years prior felony  \
0                            1            1                   1   

   years prior felony enhancements  years prior felony enhancements 667  yun  \
0                                1                     

In [28]:
b.ngram_matrix


Unnamed: 0,0
court,54
trial,47
gallardo,43
petitioner,42
conviction,41
...,...
demonstrate conviction type subjects defendant,1
demonstrate conviction type subjects,1
demonstrate conviction type,1
demonstrate conviction,1


In [42]:
print(b.ngram_matrix.to_string())

   10  10 348  10 348 352353  10 348 352353 gallardo  10 348 352353 gallardo intended  10 years  10 years prior  10 years prior felony  10 years prior felony enhancements  11  11 record  11 record sufficient  11 record sufficient demonstrate  11 record sufficient demonstrate conviction  1115  1115 court  1115 court appeal  1115 court appeal state  1115 court appeal state california  1115 opinion  1115 opinion certified  1115 opinion certified publication  1115 opinion certified publication ordered  1115 prohibits  1115 prohibits courts  1115 prohibits courts parties  1115 prohibits courts parties citing  11278  11278 gallardo  11278 gallardo does  11278 gallardo does apply  11278 gallardo does apply retroactively  11280  11280 2019  11280 2019 wl  11280 2019 wl 6485068  11280 2019 wl 6485068 milton  1128111284  1128111284 teague  1128111284 teague analysis  1128111284 teague analysis 1128411285  1128111284 teague analysis 1128411285 johnson  1128411285  1128411285 johnson  1128411285 j

In [43]:
print('los angeles county super' in b.ngram_matrix.to_string())
print('los angeles county super' in b.ngram_matrix)
print('los angeles county super' in b.corpus)
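# The first two checks succeed because the ngram column labels join
# words with single spaces (`in` on a DataFrame tests its column labels).
# The corpus check fails because the corpus has a run of spaces
# between "county" and "super".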

True
True
False
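If substring checks against the corpus itself are wanted, one option is to collapse runs of whitespace first. The helper below is a sketch, not part of the `Case` class as written. `collapse_whitespace` is a hypothetical name, and the sample string is an excerpt shaped like the corpus printed in this notebook.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


# Excerpt in the same shape as the corpus printed in this notebook.
sample = "b300885        (los angeles county        super. ct. no. ba198121)"
normalized = collapse_whitespace(sample)
print('los angeles county super' in sample)
print('los angeles county super' in normalized)
```

The first check fails on the raw excerpt and the second succeeds after normalization, which matches the difference seen between the corpus and the ngram matrix.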


In [32]:
type(b.corpus)

str

In [33]:
print(b.corpus)

filed 12/24/19  in re chamberlain ca2/4 not to be published in the official reports  california rules of court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for publication or ordered published, except as specified by rule 8.1115(b).  this opinion has not been certified for publication or ordered published for purposes of rule 8.1115.  in the court of appeal of the state of california  second appellate district  division four   in re   lee chamberlain,   on habeas corpus.        b300885        (los angeles county        super. ct. no. ba198121)    original proceedings; petition for writ of habeas corpus.  craig j. mitchell, judge.  petition denied.  maureen l. fox, under appointment by the court of appeal, for defendant and appellant.  xavier becerra, attorney general, lance e. winters, chief assistant attorney general, susan sullivan pithey, michael r. johnsen and yun k. lee, deputy attorneys general, for plaintiff and respondent.      

In [44]:
b.ngram_matrix['super ct']

0    1
Name: super ct, dtype: int64