In [1]:
import re
import sys
from pathlib import Path

import pdfplumber

# Make the project's src/ directory importable from this notebook.
src_path = str(Path.cwd().parent / "src")
sys.path.append(src_path)
from pdf_processing import PDFHandler

TODO: 
 - Q&As / interviews
 - check more PDFs for differences in format
 - locations with multiple commas (e.g. when Washington, D.C. is included)
 - locations not included after the date
 - write documentation

Get the path to the directory in which the PDFs are stored.

In [2]:
pdf_dir = Path.cwd().parent / "pdfs"

Collect the file paths of all PDFs in the folder `pdf_dir`. This list can be iterated over to extract every speech and store the results in a dataframe.

In [3]:
pdfs = list(pdf_dir.glob('*.pdf'))  
print("current number of PDFs:", len(pdfs))

current number of PDFs: 436
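The order in which `glob` yields files depends on the file system, so a positional index such as `pdfs[400]` may point to a different file on another machine. A minimal sketch of a more reproducible selection, assuming the file stem is a usable key; the temporary directory and the names `demo_dir`, `demo_pdfs`, and `by_title` are illustrative only:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    # Create a few empty stand-in files (hypothetical names, no real PDF content).
    for name in ["Guantanamo_Bay_Closing", "Beau_Biden_Eulogy", "Flint_Michigan_Community"]:
        (demo_dir / f"{name}.pdf").touch()
    (demo_dir / "notes.txt").touch()  # ignored by the '*.pdf' pattern

    # Sorting makes the order independent of the file system.
    demo_pdfs = sorted(demo_dir.glob("*.pdf"))
    stems = [p.stem for p in demo_pdfs]

    # Looking a file up by its stem avoids relying on a position in the list.
    by_title = {p.stem: p for p in demo_pdfs}
    chosen = by_title["Beau_Biden_Eulogy"].name

print(stems)
print(chosen)
```

Selecting by name also documents which speech a cell is working on.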


Select the file path of the PDF you want to process.

In [4]:
filepath = pdfs[400]

Create a `PDFHandler` object for the given file path.

In [6]:
pdf = PDFHandler(filepath)

Print the first page of the PDF before it has been processed.

In [7]:
print(pdf.original_page(0))

  
AA RR
mmeerriiccaann hheettoorriicc..ccoomm  
 
Barack Obama 
Eulogy for Beau Biden III 
delivered 6 June 2015, St. Anthony of Padua Church Wilmington, Delaware 
 
 
AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio 
“A man,” wrote an Irish poet, “is original when he speaks the truth that has always been 
known to all good men.”  Beau Biden was an original.  He was a good man.  A man of 
 
character.  A man who loved deeply, and was loved in return.
Your Eminences, your Excellencies, General Odierno, distinguished guests; to Hallie, Natalie 
and Hunter; to Hunter, Kathleen, Ashley, Howard; the rest of Beau’s beautiful family, friends, 
colleagues; to Jill and to Joe -- we are here to grieve with you, but more importantly, we are 
 
here because we love you.
Without love, life can be cold and it can be cruel.  Sometimes cruelty is deliberate -- the 
action of bullies or bigots, or the inaction of those indifferent to another’s pain.  But often, 
cruelty is si

Print the second page of the PDF before it has been processed.

In [8]:
print(pdf.original_page(1))

  
AA RR
mmeerriiccaann hheettoorriicc..ccoomm  
 
To know Beau Biden is to know which choice he made in his life.  To know Joe and the rest of 
the Biden family is to understand why Beau lived the life he did.  For Beau, a cruel twist of 
fate came early -- the car accident that took his mom and his sister, and confined Beau and 
 
Hunter, then still toddlers, to hospital beds at Christmastime.
But Beau was a Biden.  And he learned early the Biden family rule:  If you have to ask for 
help, it’s too late.  It meant you were never alone; you don’t even have to ask, because 
 
someone is always there for you when you need them.
And so, after the accident, Aunt Valerie rushed in to care for the boys, and remained to help 
raise them.  Joe continued public service, but shunned the parlor games of Washington, 
choosing instead the daily commute home, maintained for decades, that would let him meet 
his most cherished duty -- to see his kids off to school, to kiss them at night, to let them

Print the last page of the PDF before it has been processed.

In [9]:
print(pdf.original_page(-1))

  
AA RR
mmeerriiccaann hheettoorriicc..ccoomm  
 
Beau figured that out so early in life.  What an inheritance Beau left us.  What an example he 
 
set.
“Through our great good fortune, in our youth our hearts were touched with fire,” said Oliver 
Wendell Holmes, Jr.  “But, above all, we have learned that whether a man accepts from 
Fortune her spade, and will look downward and dig, or from Aspiration her axe and cord, and 
will scale the ice, the one and only success which it is his to command is to bring to his work a 
 
mighty heart.”
Beau Biden brought to his work a mighty heart.  He brought to his family a mighty heart. 
 
What a good man.  What an original.
May God bless his memory, and the lives of all he touched. 
AmericanRhetoric.com         Page 6 


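Define a regular expression that captures the date, location, and content of the speech, then use it to extract the entire speech from the PDF. The `start` pattern matches the doubled letters of the site header exactly as they appear in the raw page text printed above, and `end` matches the footer at the bottom of a page.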

In [10]:
start = r"(?:hheettoorriicc\.\.ccoomm)"
date = r"(.*[dD]elivered\s+(?P<day>[0-9]{1,2})\s+(?P<mon>[A-Z][a-z]+)\s+(?P<year>[0-9]{2,4})"
loc = r"(,\s+(?P<location_small>[A-Za-z0-9. ]+),\s+(?P<location_big>[A-Za-z0-9., ]+))?"
auth = r"(?:\s+AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio))?"
content = r"\s+(?P<content>.*)\n+"
end = r"(?:(Transcription\s+by\s+.*)?(Property\s+of\s+)?AmericanRhetoric\.com)"

pat = re.compile(start + date + loc + auth + content + end, re.DOTALL)

speech = pdf.extract_speech(pat)
print(speech)

“A man,” wrote an Irish poet, “is original when he speaks the truth that has always been 
known to all good men.”  Beau Biden was an original.  He was a good man.  A man of 
 
character.  A man who loved deeply, and was loved in return.
Your Eminences, your Excellencies, General Odierno, distinguished guests; to Hallie, Natalie 
and Hunter; to Hunter, Kathleen, Ashley, Howard; the rest of Beau’s beautiful family, friends, 
colleagues; to Jill and to Joe -- we are here to grieve with you, but more importantly, we are 
 
here because we love you.
Without love, life can be cold and it can be cruel.  Sometimes cruelty is deliberate -- the 
action of bullies or bigots, or the inaction of those indifferent to another’s pain.  But often, 
cruelty is simply born of life, a matter of fate or God’s will, beyond our mortal powers to 
comprehend.  To suffer such faceless, seemingly random cruelty can harden the softest 
hearts, or shrink the sturdiest.  It can make one mean, or bitter, or full of 
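To see what each named group captures, the same pattern can be run on a short hand-made page with plain `re`. This is only a sketch: the `sample` string is a shortened imitation of the first page shown earlier, and `demo_pat`, `sample`, and `fields` are names introduced here for illustration. How `PDFHandler.extract_speech` applies the pattern internally is not shown in this notebook.

```python
import re

# Pattern pieces copied from the cell above, joined in the same order.
demo_pat = re.compile(
    r"(?:hheettoorriicc\.\.ccoomm)"
    r"(.*[dD]elivered\s+(?P<day>[0-9]{1,2})\s+(?P<mon>[A-Z][a-z]+)\s+(?P<year>[0-9]{2,4})"
    r"(,\s+(?P<location_small>[A-Za-z0-9. ]+),\s+(?P<location_big>[A-Za-z0-9., ]+))?"
    r"(?:\s+AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio))?"
    r"\s+(?P<content>.*)\n+"
    r"(?:(Transcription\s+by\s+.*)?(Property\s+of\s+)?AmericanRhetoric\.com)",
    re.DOTALL,
)

# Shortened imitation of a raw first page (illustrative, not real extractor output).
sample = (
    "AA RR\n"
    "mmeerriiccaann hheettoorriicc..ccoomm  \n"
    " \n"
    "Barack Obama \n"
    "Eulogy for Beau Biden III \n"
    "delivered 6 June 2015, St. Anthony of Padua Church Wilmington, Delaware \n"
    " \n"
    "AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio \n"
    "Beau Biden was an original.  He was a good man. \n"
    "AmericanRhetoric.com         Page 1 \n"
)

m = demo_pat.search(sample)
# Strip surrounding whitespace, since the character classes also accept spaces.
fields = {name: value.strip() for name, value in m.groupdict().items() if value}
for name, value in fields.items():
    print(f"{name}: {value}")
```

Because the whole date, location, and certification block sits inside one optional group, pages without a title block (every page after the first) still match, with only `content` filled in.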

Print the relevant information extracted from the PDF.

In [11]:
pdf.print_info()

Title: Beau_Biden_Eulogy
Number of pages: 6
Date: ['6', 'June', '2015']
Location: ['St. Anthony of Padua Church Wilmington', 'Delaware']
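If the speeches end up in a dataframe, the three date parts are easier to sort and filter as a single `datetime.date`. A minimal sketch, assuming the parts always arrive as day, English month name, and four-digit year as in the output above; `MONTHS` and `to_date` are hypothetical helpers, not part of `pdf_processing`:

```python
from datetime import date

# English month names, spelled out so the result does not depend on the locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def to_date(parts):
    """Convert ['6', 'June', '2015'] into a datetime.date."""
    day, month_name, year = parts
    month = MONTHS.index(month_name) + 1  # raises ValueError for unknown names
    return date(int(year), month, int(day))

speech_date = to_date(["6", "June", "2015"])
print(speech_date.isoformat())
```

Two-digit years, which the regular expression also accepts, would need an extra rule that this sketch does not attempt.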


Clean the speech by replacing or deleting some characters. Each pattern in `old` is paired with the replacement at the same position in `new`.

In [12]:
old = [r'-+', r'\.{2,}', r'[’‘]', r'"', r'’’', r'‘‘', r'“', r'”', r',', r'\[sic\]', r'\s+']
new = [r' ' , r' '     , r"'"   , r'' , r''  , r''  , r'' , r'' , r',', r' '      , r' '  ]

clean_speech = pdf.replace(speech, old, new)
print(clean_speech)
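The paired-list cleaning step can be sketched with plain `re`. The implementation of `PDFHandler.replace` is not shown in this notebook, so `replace_all` below is an assumed equivalent that applies the substitutions one after another; the shortened pattern lists and the `raw` string are illustrative.

```python
import re

def replace_all(text, old, new):
    """Apply each regex in `old` with its paired replacement in `new`, in order."""
    if len(old) != len(new):
        raise ValueError("old and new must have the same length")
    for pattern, replacement in zip(old, new):
        text = re.sub(pattern, replacement, text)
    return text

# Shortened variant of the lists above; whitespace is collapsed last so that
# gaps left by earlier substitutions are merged into single spaces.
old = [r'-+', r'\.{2,}', r'[’‘]', r'[“”"]', r'\[sic\]', r'\s+']
new = [r' ',  r' ',      r"'",    r'',      r' ',       r' ']

raw = "“Beau’s  family” -- we   are here... [sic]\nbecause we love you."
cleaned = replace_all(raw, old, new).strip()
print(cleaned)
```

Order matters when substitutions are applied sequentially: a pattern only sees what the earlier replacements left behind, so a rule placed after one that already consumed its characters never fires.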



Old code, used for debugging.

In [None]:
pdfs = ["Farewell_to_Staff_and_Supporters", "Flint_Michigan_Community", "Guantanamo_Bay_Closing", "Post_G7_Presser_Japan"]

pdf_dir = Path.cwd().parent / "pdfs"
file_to_open = pdfs[2] + ".pdf" 
pdf = pdfplumber.open(pdf_dir / file_to_open)

print('Title:', pdf.metadata['Title'])
print("number of pages:", len(pdf.pages))

In [None]:
text = pdf.pages[0].extract_text()
    
#start = r"hheettoorriicc\.\.ccoomm(?:.*AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio)?"
start = r"(?:hheettoorriicc\.\.ccoomm)"
date = r"(.*delivered\s+(?P<day>[ 123][0-9])\s+(?P<mon>[A-Z][a-z]+)\s+(?P<year>[0-9][0-9][0-9][0-9]),"
loc = r"\s+(?P<location_small>[A-Za-z0-9 ]+),\s+(?P<location_big>.*)\n+"
auth = r"(?:.*AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio))?"
mid = r"\s*\n+(?P<content>.*)\s*\n+"
end = r"(?:(?:Property\s+of\s+)?AmericanRhetoric\.com)"

core_pat = re.compile(start + date+loc+auth + mid + end, re.DOTALL)

print("ORIGINAL:\n")
print(text)
print("\n\n" + 100*"-" + "\n\n") 

search = re.search(core_pat, text)
core = search.group("content")

dash = re.compile(r"-+")
no_dash_core = dash.sub(r" ", core)

dots = re.compile(r"\.{2,}")
no_dots_core = dots.sub(r" ", no_dash_core)

spaces = re.compile(r"\s+")
single_space_core = spaces.sub(r" ", no_dots_core)

print("PROCESSED:\n")
print(single_space_core)

In [None]:
def extract_speech(pages):
    """Concatenate the cleaned speech text found on each page."""
    full_text = ""

    start = r"hheettoorriicc\.\.ccoomm(?:.*AUTHENTICITY CERTIFIED: Text version below transcribed directly from audio)?"
    mid = r"\s+(?P<content>.*)\s+"
    end = r"(?:Property of )?AmericanRhetoric\.com"
    core_pat = re.compile(start + mid + end, re.DOTALL)

    for page in pages:
        text = page.extract_text() or ""

        match = core_pat.search(text)
        if match is None:
            # Skip pages without the expected header and footer
            # instead of failing with an AttributeError.
            continue

        # Replace dashes, runs of dots, and runs of whitespace with one space.
        core = replace(match.group("content"), [r"-+", r"\.{2,}", r"\s+"], r" ")

        full_text += core + " "

    return full_text


def replace(text, old, new):
    """Replace every match of each pattern in `old` with the string `new`."""
    for pattern in old:
        text = re.sub(pattern, new, text)
    return text


In [None]:
full_speech = extract_speech(pdf.pages)
print(full_speech)

In [None]:
pdf.close()