# Extract Text from PDF

The first step is to get the text out of the PDF. Sadly, PDFs have no notion of logical text structure, so we cannot programmatically pull paragraphs and sections out of an arbitrary PDF. All we have are words, which become lines, which become boxes, which become pages. Text that flows from one page to the next? A novelty!

## PyPDF

In [29]:
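# PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed.
from PyPDF2 import PdfReader

def extract_information(pdf_path):
    """Print and return the metadata stored in the PDF's document info dictionary."""
    with open(pdf_path, 'rb') as f:
        pdf = PdfReader(f)
        information = pdf.metadata
        number_of_pages = len(pdf.pages)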

    txt = f"""
    Information about {pdf_path}: 

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

extract_information("2025.pdf")

Xref table not zero-indexed. ID numbers for objects will be corrected.



    Information about 2025.pdf: 

    Author: None
    Creator: Adobe InDesign 18.3 (Macintosh)
    Producer: Adobe PDF Library 17.0
    Subject: None
    Title: None
    Number of pages: 920
    


{'/CreationDate': "D:20230711123642-04'00'",
 '/Creator': 'Adobe InDesign 18.3 (Macintosh)',
 '/ModDate': "D:20230711123716-04'00'",
 '/Producer': 'Adobe PDF Library 17.0',
 '/Trapped': '/False'}
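
The `/CreationDate` and `/ModDate` values above are PDF date strings, not Python datetimes. As a minimal sketch, assuming only the full `D:YYYYMMDDHHmmSS` form with an optional offset (shorter forms are not handled), a hypothetical helper `parse_pdf_date` could convert them:

```python
import re
from datetime import datetime, timedelta, timezone

# Matches D:YYYYMMDDHHmmSS plus an optional "Z" or an offset like -04'00'.
# This is a simplification: truncated date strings are rejected.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Convert a full-length PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"unsupported PDF date string: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    zulu, sign, off_h, off_m = match.groups()[6:]
    tz = None  # stays naive when the string carries no offset
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20230711123642-04'00'")
print(created.isoformat())  # 2023-07-11T12:36:42-04:00
```

The input string is the `/CreationDate` value from the output above; everything else in the block is illustrative.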

## PDFminer

In [20]:
from pdfminer.high_level import extract_text

text = extract_text("cowdell.pdf")

print(type(text))
print(text[0:101])

<class 'str'>
 Folklore 125 (Aprii 2014): 80-91

 http://dx.doi.org/l0.1080/0015587X.2013.853516

 RESEARCH ARTICLE


In [17]:
from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

output_string = StringIO()

with open('cowdell.pdf', 'rb') as fin:
    extract_text_to_fp(fin, output_string, laparams=LAParams(),
                       output_type='html', codec=None)

In [18]:
print(output_string.getvalue()[0:100])

<html><head>
<meta http-equiv="Content-Type" content="text/html">
</head><body>
<span style="positio


In [55]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.utils import open_filename


def iter_text_per_page(pdf_file, password='', page_numbers=None, maxpages=0,
                       caching=True, codec='utf-8', laparams=None):
    """Yield (page_number, text) for each page of pdf_file, numbering from 1."""
    if laparams is None:
        laparams = LAParams()

    with open_filename(pdf_file, "rb") as fp:
        rsrcmgr = PDFResourceManager(caching=caching)

        pages = PDFPage.get_pages(
            fp,
            page_numbers,
            maxpages=maxpages,
            password=password,
            caching=caching,
        )
        # idx counts the pages yielded, starting at 1; it is not the PDF's own page label.
        for idx, page in enumerate(pages, start=1):
            # A fresh buffer and converter per page keeps each page's text separate.
            with StringIO() as output_string:
                device = TextConverter(rsrcmgr, output_string, codec=codec,
                                       laparams=laparams)
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                interpreter.process_page(page)
                yield idx, output_string.getvalue()

In [None]:
cowdell_pages = iter_text_per_page("cowdell.pdf")
cowdells = list(cowdell_pages)

# for page in cowdells[10]:
#     print(type(page), page)
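
Each item in the list is a `(page_number, text)` tuple, so looping over a single item walks through its two fields rather than through pages. A small sketch of inspecting one page, using a made-up stand-in list (`sample_pages`) in place of the real `cowdells`:

```python
# Stand-in for list(iter_text_per_page(...)); the texts here are invented.
sample_pages = [(1, "first page text"), (2, "second page text"), (3, "third page text")]

# Unpack the tuple instead of iterating over it.
page_number, text = sample_pages[1]
print(page_number, text[:20])

# List positions are 0-based while page numbers start at 1,
# so a dict keyed by page number avoids off-by-one lookups.
text_by_page = dict(sample_pages)
print(text_by_page[3])
```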

## Data into Dataframe

In [65]:
import pandas as pd

In [66]:
df = pd.DataFrame(cowdells, columns=['page', 'text'])
df.head()

|   | page | text |
|---|------|------|
| 0 | 1 | Folklore 125 (Aprii 2014): 80-91\n\n http://d... |
| 1 | 2 | Ghosts and their Relationship with the Age of... |
| 2 | 3 | 82 Paul Cowdell\n\n tendency- possibly even a... |
| 3 | 4 | Ghosts and their Relationship with the Age of... |
| 4 | 5 | 84 Paul Cowdell\n\n not. She remained very in... |


In [67]:
%%time

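# This only creates the generator; no page is parsed until it is consumed,
# so the timing below does not measure the extraction itself.
document = iter_text_per_page("2025.pdf")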

CPU times: user 7 µs, sys: 1e+03 ns, total: 8 µs
Wall time: 11.9 µs
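
The microsecond timing above reflects how generators work: calling a generator function runs none of its body. A minimal sketch with a stand-in generator (`slow_pages`, which is hypothetical and only sleeps instead of parsing a PDF) shows where the time actually goes:

```python
import time


def slow_pages(n):
    """Stand-in generator: pretends each page takes a moment to extract."""
    for idx in range(1, n + 1):
        time.sleep(0.01)
        yield idx, f"text of page {idx}"


start = time.perf_counter()
lazy_document = slow_pages(5)        # no body code has run yet
created_in = time.perf_counter() - start

start = time.perf_counter()
consumed = list(lazy_document)       # this drives the generator to completion
consumed_in = time.perf_counter() - start

print(f"create: {created_in:.6f}s, consume: {consumed_in:.6f}s")
```

To time the real extraction, the `list(...)` call is the step to measure.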


In [68]:
pages = list(document)
df = pd.DataFrame(pages, columns=['page', 'text'])
print(df.shape)
print(df.head())

|   | page | text |
|---|------|------|
| 0 | 1 | Project 2025\n\nPRESIDENTIAL TRANSITION PROJ... |
| 1 | 2 | © 2023 by The Heritage Foundation\n214 Massach... |
| 2 | 3 | Foreword by Kevin D. Roberts, PhD\nEdited by P... |
| 3 | 4 | \n |
| 4 | 5 | Contents\n\nACKNOWLEDGMENTS. . . . . . . . . .... |
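
Page 4 above holds nothing but a newline. If blank pages are unwanted, one option is to filter them out before building the dataframe. This is a sketch on made-up stand-in tuples (`sample_pages`), not the real document:

```python
# Stand-in for the (page_number, text) tuples; page 4 mimics the blank page above.
sample_pages = [(3, "Foreword by ..."), (4, "\n"), (5, "Contents\n\n...")]

# Keep a page only if something remains after stripping whitespace.
non_blank = [(number, text) for number, text in sample_pages if text.strip()]
print([number for number, _ in non_blank])  # [3, 5]
```

The original page numbers survive the filter, so references back to the PDF stay valid. On the dataframe itself, `df[df['text'].str.strip() != '']` should have the same effect.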


In [71]:
# Replace each newline character with a space.
# regex=False makes this a literal replacement regardless of pandas version.
df['text'] = df['text'].str.replace('\n', ' ', regex=False)

In [72]:
df.head()

|   | page | text |
|---|------|------|
| 0 | 1 | Project 2025 PRESIDENTIAL TRANSITION PROJEC... |
| 1 | 2 | © 2023 by The Heritage Foundation 214 Massachu... |
| 2 | 3 | Foreword by Kevin D. Roberts, PhD Edited by Pa... |
| 3 | 4 |  |
| 4 | 5 | Contents ACKNOWLEDGMENTS. . . . . . . . . . .... |
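
Replacing each newline with a single space turns a double newline into two spaces and leaves tabs untouched. If cleaner text is wanted, a sketch of a stricter alternative is to collapse every run of whitespace; `normalize_whitespace` is a hypothetical helper and the sample string is made up:

```python
import re


def normalize_whitespace(text):
    """Collapse each run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Project 2025\n\nPRESIDENTIAL  TRANSITION\tPROJECT\n"
print(normalize_whitespace(sample))  # Project 2025 PRESIDENTIAL TRANSITION PROJECT
```

On the dataframe this could be applied with `df['text'] = df['text'].map(normalize_whitespace)`. A page that holds only whitespace becomes an empty string.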


In [73]:
# index=False keeps the row index from being written as an extra unnamed column.
# df.to_csv("2025.csv", index=False)