# PDFs to text

We're going to use [pdfminer.six](https://pdfminersix.readthedocs.io/en/latest/) to convert PDFs to text.

```
pip install pdfminer.six
```

In [1]:
! pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer.six-20221105-py3-none-any.whl (5.6 MB)
Collecting cryptography>=36.0.0
  Downloading cryptography-38.0.4-cp36-abi3-macosx_10_10_universal2.whl (5.4 MB)
Installing collected packages: cryptography, pdfminer.six
Successfully installed cryptography-38.0.4 pdfminer.six-20221105


In [2]:
from pdfminer.high_level import extract_text

In [6]:
# Extract the text from a PDF
text = extract_text("new-york/nys-bill.pdf")

# Show the first 1,000 characters of the text
text[:1000]

'No. 115. An act relating to commercial catering licenses, the export of\nvinous beverages, and outside consumption permits.\n\n(H.506)\n\nIt is hereby enacted by the General Assembly of the State of Vermont:\n\nSec. 1. 7 V.S.A. § 2 is amended to read:\n\n§ 2. DEFINITIONS\n\nThe following words as used in this title, unless a contrary meaning is\n\nrequired by the context, shall have the following meaning:\n\n* * *\n\n(6) “Caterer’s permit license”: a permit license issued by the liquor\n\ncontrol board authorizing the holder of a first class license or first and third\n\nclass licenses for a cabaret, restaurant, or hotel premises to serve malt or\n\nvinous beverages or spirituous liquors at a function located on premises other\n\nthan those occupied by a first, first and third, or second class licensee to sell\n\nalcoholic beverages.\n\n(7) “Club”: an unincorporated association or a corporation authorized to\n\ndo business in this state, that has been in existence for at least two con
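
The raw output keeps the PDF's line breaks, so a single sentence ends up split by `\n` and `\n\n`. Here is a minimal sketch of one way to tidy that up, assuming you want plain running text and don't need the paragraph breaks. `collapse_whitespace` is a made-up helper name, not part of pdfminer.six.

```python
import re

def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    return re.sub(r"\s+", " ", text).strip()

# A short piece of the extracted text shown above
sample = (
    "The following words as used in this title, unless a contrary meaning is\n\n"
    "required by the context, shall have the following meaning:"
)
print(collapse_whitespace(sample))
```

This throws away paragraph boundaries along with the stray line breaks, so keep the original `text` around if you need them later.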

## I have a whole bunch of documents... what to do?

Often we have a folder full of documents and want to do the same thing to every PDF in it. How do we get the text out of all of them?

### Step 1: Get a list of PDFs

In [9]:
import glob

# Get a list of all the PDFs in the 'new-york' folder whose names start with 'nys-bill'
filenames = glob.glob("new-york/nys-bill*.pdf")
filenames

['new-york/nys-bill.pdf',
 'new-york/nys-bill copy.pdf',
 'new-york/nys-bill copy 3.pdf',
 'new-york/nys-bill copy 2.pdf',
 'new-york/nys-bill copy 5.pdf',
 'new-york/nys-bill copy 4.pdf']
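
Notice that `glob.glob` returned the files in no particular order ("copy 3" comes before "copy 2"). If you want the same order every time, one option is to sort the list. The sketch below uses `pathlib` from the standard library instead of `glob`. `list_pdfs` is a hypothetical helper name chosen for this example.

```python
from pathlib import Path

def list_pdfs(folder, pattern="*.pdf"):
    """Return the paths in `folder` that match `pattern`, as sorted strings."""
    return sorted(str(path) for path in Path(folder).glob(pattern))
```

For the folder above you would call it as `filenames = list_pdfs("new-york", "nys-bill*.pdf")`.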

### Use a list comprehension

Make a new list of texts by extracting the text from each of the PDFs.

In [13]:
texts = [extract_text(filename) for filename in filenames]
len(texts)

6
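
A list comprehension stops at the first PDF that fails to parse, which is annoying in a big folder. Here is a rough sketch of a loop that keeps going and records which files failed instead. `extract_all` is a hypothetical helper, and it takes the extraction function as an argument so you can pass in `extract_text` or anything else that turns a path into a string.

```python
def extract_all(filenames, extract):
    """Run `extract` on each file and return (texts, failures).

    texts maps filename -> extracted text.
    failures maps filename -> a description of the error.
    """
    texts = {}
    failures = {}
    for filename in filenames:
        try:
            texts[filename] = extract(filename)
        except Exception as error:  # broad on purpose: keep the batch going
            failures[filename] = repr(error)
    return texts, failures
```

With the names used above, the call would be `texts_by_file, failures = extract_all(filenames, extract_text)`. Check `failures` afterwards to see which PDFs need a closer look.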

In [14]:
import pandas as pd

df = pd.DataFrame({
    'filename': filenames,
    'contents': texts
})
df.head()

|   | filename | contents |
|---|---|---|
| 0 | new-york/nys-bill.pdf | No. 115. An act relating to commercial caterin... |
| 1 | new-york/nys-bill copy.pdf | No. 115. An act relating to commercial caterin... |
| 2 | new-york/nys-bill copy 3.pdf | No. 115. An act relating to commercial caterin... |
| 3 | new-york/nys-bill copy 2.pdf | No. 115. An act relating to commercial caterin... |
| 4 | new-york/nys-bill copy 5.pdf | No. 115. An act relating to commercial caterin... |


### Use df.apply

In [17]:
# Start with a dataframe that only has the filenames
df = pd.DataFrame({
    'filename': filenames,
})
df.head()

|   | filename |
|---|---|
| 0 | new-york/nys-bill.pdf |
| 1 | new-york/nys-bill copy.pdf |
| 2 | new-york/nys-bill copy 3.pdf |
| 3 | new-york/nys-bill copy 2.pdf |
| 4 | new-york/nys-bill copy 5.pdf |


In [18]:
# Run extract_text on every filename and save the result in a new column
df['content'] = df['filename'].apply(extract_text)
df.head()

|   | filename | content |
|---|---|---|
| 0 | new-york/nys-bill.pdf | No. 115. An act relating to commercial caterin... |
| 1 | new-york/nys-bill copy.pdf | No. 115. An act relating to commercial caterin... |
| 2 | new-york/nys-bill copy 3.pdf | No. 115. An act relating to commercial caterin... |
| 3 | new-york/nys-bill copy 2.pdf | No. 115. An act relating to commercial caterin... |
| 4 | new-york/nys-bill copy 5.pdf | No. 115. An act relating to commercial caterin... |


# OCR

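Optical character recognition (OCR) is what you need when a PDF is a scanned image rather than embedded text, because `extract_text` can't read those. Some options: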

* **App:** Adobe Acrobat, ABBYY FineReader
* **Cloud:** AWS Textract, Google Cloud
* **Python/etc:** OCRmyPDF, EasyOCR, PaddleOCR
* **Vaguely in-between:** Apple Vision Kit
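
How do you know which PDFs to send to one of these tools? A rough heuristic, not a guarantee: if the extracted text comes back with almost no visible characters, the file is probably a scan. The sketch below assumes that rule of thumb. `probably_needs_ocr` and the `min_chars` cutoff of 20 are made up for this example, so adjust them for your own documents.

```python
def probably_needs_ocr(text, min_chars=20):
    """Guess whether a PDF was a scan, based on the text extracted from it."""
    # Count characters that are not spaces, newlines, or page breaks
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars
```

For example, `probably_needs_ocr(extract_text("new-york/nys-bill.pdf"))` should come back `False` for the bill above, since it has thousands of characters of text.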