In [10]:
# Module imports
import requests
import os
from io import BytesIO
from bs4 import BeautifulSoup
import pickle
import PyPDF2
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import matplotlib.pyplot as plt



ModuleNotFoundError: No module named 'matplotlib'

Guidance and sample code taken from https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f and https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb

We need to get text out of the PDFs we downloaded earlier. To do that, we use the PyPDF2 library. The function below takes the path to a PDF file and returns its text as a single string.

In [5]:
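def PDFtotext(filename):
    # Note: this uses the legacy PyPDF2 API (PdfFileReader, numPages, getPage,
    # extractText), which was removed in PyPDF2 3.0.
    with open(filename, 'rb') as pdfFileObj:
        # This is the readable object to parse
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        text = ""
        # Concatenate the extracted text of every page, in page order
        for count in range(pdfReader.numPages):
            text += pdfReader.getPage(count).extractText()
        # Fail loudly if nothing could be extracted (e.g. a scanned PDF)
        assert text != "", "Text not readable from %s" % filename
    return text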

Once we have an opinion in plain text form, we need to clean it up.

In [15]:
# Well, first let's see if it works on one of the opinions. 
# Get the current directory:
currentdir = os.path.abspath('')
# Make a filepath to pass to PDFtotext:
testpath = os.path.join(currentdir, 'Opinions', 'cl_20', 'B228808.pdf')
testpdf = PDFtotext(testpath)
print(testpdf)

Filed 2/28/12
  
B
ITH
, LLC v. Mikail CA2/8
 
NOT TO BE PUBLISHED IN THE OFFICIAL REPORTS
 
 
California Rules of Court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for 
publication or ordered published, except as specified by rule 8.1115(b).
  
This opinion has not been certified for publication 
or ordered published for purposes of rule 8.1115.
 
 
IN THE COURT OF APPEAL OF THE STATE OF CALIFORNIA
 
 
SECOND APPELLATE DISTRICT
 
 
DIVISION EIGHT
 
 
 
BITH, LLC,
 
 
 
Plaintiff and Appellant,
 
 
 
v.
 
 
RAMIN MIKAIL
,
 
 
 
Defendant and Respondent.
 
 
      
B228808
 
 
      
(Los Angeles County
 
      
Super. Ct. No. BC 379597)
 
 
 
 
 
APPEAL from a judgment of the Superior Court of Los Angeles County, 
Elizabeth Allen White, Judge.  Affirmed.
 
 
 
Urtnowski & Associates, J.
 
Brian Urt
nowski, Lisamarie Graham and Jered
 
T. Ede 
for Plaintiff and Appellant.
 
 
 
Law Offices of Deborah De Niro, Deborah De Niro; The Ehrlic

Okay, excellent! Comparing the readout to the original PDF, there are some extraneous line breaks and spaces, but the text appears to be mostly present. Some things we might have to fix: footnotes interrupting other text, special characters causing following words not to be read, spaces breaking up names/keywords.

In [16]:
print(repr(testpdf))


'Filed 2/28/12\n  \nB\nITH\n, LLC v. Mikail CA2/8\n \nNOT TO BE PUBLISHED IN THE OFFICIAL REPORTS\n \n \nCalifornia Rules of Court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for \npublication or ordered published, except as specified by rule 8.1115(b).\n  \nThis opinion has not been certified for publication \nor ordered published for purposes of rule 8.1115.\n \n \nIN THE COURT OF APPEAL OF THE STATE OF CALIFORNIA\n \n \nSECOND APPELLATE DISTRICT\n \n \nDIVISION EIGHT\n \n \n \nBITH, LLC,\n \n \n \nPlaintiff and Appellant,\n \n \n \nv.\n \n \nRAMIN MIKAIL\n,\n \n \n \nDefendant and Respondent.\n \n \n      \nB228808\n \n \n      \n(Los Angeles County\n \n      \nSuper. Ct. No. BC 379597)\n \n \n \n \n \nAPPEAL from a judgment of the Superior Court of Los Angeles County, \nElizabeth Allen White, Judge.  Affirmed.\n \n \n \nUrtnowski & Associates, J.\n \nBrian Urt\nnowski, Lisamarie Graham and Jered\n \nT. Ede \nfor Plaintiff and Appel

Documentation from PyPDF2 for the extractText() function: 

extractText()

    Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
    Returns:	a unicode string object.

It looks like if I'm relying on PyPDF2 to extract the text, I'm going to have to work with incomplete text for right now. Maybe I can remove all the line breaks and compare word counts for the documents to check how many words got missed?
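Here is a rough sketch of how that check could work: collapse all whitespace, split into words, and compare the count against a reference. The helper names `word_count` and `coverage` are made up for this sketch, and the reference count is assumed to come from somewhere else (say, a word count taken from the original PDF by another tool); neither helper is part of PyPDF2.

```python
import re

def word_count(text):
    """Count whitespace-separated tokens after collapsing line breaks."""
    # Replace every run of whitespace (including '\n') with a single space.
    flattened = re.sub(r"\s+", " ", text).strip()
    return len(flattened.split(" ")) if flattened else 0

def coverage(extracted_text, reference_count):
    """Return extracted word count as a fraction of a reference count.

    reference_count is an assumption: it has to be obtained independently
    of PyPDF2, otherwise there is nothing to compare against.
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return word_count(extracted_text) / reference_count

# First few lines of the extracted opinion, as shown by repr() above.
sample = "Filed 2/28/12\n  \nB\nITH\n, LLC v. Mikail CA2/8\n \n"
print(word_count(sample))
```

One caveat: a word that was split by a stray line break (like "B" / "ITH" above) counts as two tokens here, so this approach can overcount as well as undercount. It is only a coarse signal for spotting documents where a lot of text went missing.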


In [19]:
# Pickle output for later use
pickletest = os.path.join(currentdir, 'textopinions','B228808.txt')
with open(pickletest, "wb") as file:
    pickle.dump(testpdf, file)

Drawing from https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb

In [24]:
# Initial round of cleanup
def cleantext1(text): 
    # Make text lowercase: 
    text = text.lower()
    # Strip all newlines: 
    text = re.sub('\n', '', text)
    return text

In [25]:
print(cleantext1(testpdf))

filed 2/28/12  bith, llc v. mikail ca2/8 not to be published in the official reports  california rules of court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for publication or ordered published, except as specified by rule 8.1115(b).  this opinion has not been certified for publication or ordered published for purposes of rule 8.1115.  in the court of appeal of the state of california  second appellate district  division eight   bith, llc,   plaintiff and appellant,   v.  ramin mikail,   defendant and respondent.        b228808        (los angeles county       super. ct. no. bc 379597)     appeal from a judgment of the superior court of los angeles county, elizabeth allen white, judge.  affirmed.   urtnowski & associates, j. brian urtnowski, lisamarie graham and jered t. ede for plaintiff and appellant.   law offices of deborah de niro, deborah de niro; the ehrlich law firm and jeffrey isaac ehrlich for defendant and respondent.   * * *

In [26]:
testclean = cleantext1(testpdf)
# Pickle output for later use
pickletestclean = os.path.join(currentdir, 'textopinions','B228808_clean.txt')
with open(pickletestclean, "wb") as file:
    pickle.dump(testclean, file)

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(input='content',stop_words = 'english')
data_cv = cv.fit_transform([testclean])

In [32]:
print(type(testclean))

<class 'str'>


In [35]:
print(data_cv)
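# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions use get_feature_names_out() instead.
data_dtm = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names())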


  (0, 424)	4
  (0, 60)	1
  (0, 9)	5
  (0, 227)	73
  (0, 521)	6
  (0, 546)	40
  (0, 239)	1
  (0, 663)	3
  (0, 583)	1
  (0, 702)	1
  (0, 241)	4
  (0, 729)	1
  (0, 311)	27
  (0, 727)	3
  (0, 8)	3
  (0, 653)	1
  (0, 312)	1
  (0, 608)	1
  (0, 264)	1
  (0, 694)	2
  (0, 588)	1
  (0, 251)	2
  (0, 662)	2
  (0, 593)	2
  (0, 764)	1
  :	:
  (0, 601)	1
  (0, 803)	1
  (0, 133)	1
  (0, 366)	1
  (0, 519)	1
  (0, 331)	1
  (0, 581)	1
  (0, 703)	2
  (0, 476)	1
  (0, 811)	1
  (0, 336)	1
  (0, 330)	2
  (0, 677)	1
  (0, 380)	1
  (0, 450)	1
  (0, 586)	1
  (0, 844)	1
  (0, 707)	1
  (0, 571)	1
  (0, 202)	1
  (0, 687)	1
  (0, 429)	1
  (0, 287)	1
  (0, 225)	1
  (0, 726)	1


In [36]:
data_dtm

   00  000  10  106  1093  11  110  1114  1115  12  ...  white  wire  wired  withholds  witkin  woods  worth  wrongful  young  zerin
0   1    7   3    1     1   2    1     1     3   5  ...      1     6      1          1       3      2      1         7      2      5


In [42]:
# sort by frequency
data_sort = data_dtm.sort_values(by=0,axis=1, ascending=False)
data_sort = data_sort.transpose()
data_sort.head(30)

             0
bith        73
mikail      40
cal         37
complaint   32
conversion  32
app         27
escrow      27
court       27
amended     25
funds       25


Now that we've checked one file, time to do the other 20,000.


In [49]:
def batchconvert(folder):
    for file in os.listdir(folder):
        if file.endswith('.pdf'):
            casenum = os.path.splitext(file)[0]
            casename = '%s.txt' % casenum
            savepath = os.path.join(currentdir, 'textopinions', casename)
            with open(savepath, "wb") as textfile:
                pickle.dump(PDFtotext(os.path.join(folder, file)),textfile)
    return

In [50]:
folder = os.path.join(currentdir, 'Opinions','cl_15')
batchconvert(folder)

In [51]:
folder = os.path.join(currentdir, 'Opinions','cl_20')
batchconvert(folder)

TODO: Make it easy to start up again if it stops, handle errors, or both. Ideally both.
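A minimal sketch of what that could look like, assuming the same folder layout and pickle-to-.txt convention as batchconvert above. The name `batchconvert_resumable` and its `outdir` and `convert` parameters are hypothetical; `convert` stands in for PDFtotext so the loop can be tried out without any PDFs. The loop skips files that already have an output (so a restarted run picks up where it left off) and collects errors instead of stopping on the first unreadable PDF.

```python
import os
import pickle

def batchconvert_resumable(folder, outdir, convert):
    """Convert every PDF in `folder`, skipping files already converted.

    `convert` is any callable mapping a PDF path to a text string
    (for example, the PDFtotext function defined earlier).
    Returns a list of (filename, error message) pairs for failed files.
    """
    os.makedirs(outdir, exist_ok=True)
    failures = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith('.pdf'):
            continue
        casenum = os.path.splitext(name)[0]
        savepath = os.path.join(outdir, '%s.txt' % casenum)
        # Resume support: skip anything a previous run already finished.
        if os.path.exists(savepath):
            continue
        try:
            text = convert(os.path.join(folder, name))
        except Exception as err:
            # Also catches the AssertionError raised for empty text.
            failures.append((name, str(err)))
            continue
        # Write to a temporary file first, then rename, so an interrupted
        # run never leaves a half-written output that would be skipped later.
        tmppath = savepath + '.tmp'
        with open(tmppath, 'wb') as textfile:
            pickle.dump(text, textfile)
        os.replace(tmppath, savepath)
    return failures
```

With the names used in this notebook, a call might look like `failures = batchconvert_resumable(folder, os.path.join(currentdir, 'textopinions'), PDFtotext)`, and the returned list shows which opinions need a second look. Because failed files produce no output, they get retried automatically on the next run.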