### Disclaimer

Copyright and other regulations surrounding programmatic gathering of data vary by website and jurisdiction. It is your responsibility to check all applicable laws and terms of use of any website or service before attempting any web harvesting or similar activities.

# PDF Extraction

To extract information from text-based PDFs, we can use the PyPDF2 package (https://pythonhosted.org/PyPDF2/).

## Example - US Unemployment Insurance Initial Claims (Weekly)

<img src="assets/PDF-screenshot.png" style="width: 720px"/>



#### Step 1: Locate Weekly Files

Starting from https://www.dol.gov, follow: `Topics` => `statistics` => `Employment and Unemployment` => `Unemployment Insurance Data and Statistics` => `Weekly Claims Report` => `Choose New Release and click Submit`

Pattern found! 

Weekly files are available at `https://oui.doleta.gov/press/<YYYY>/<MM><DD><YY>.pdf`
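
As a quick check of this pattern, here is a minimal sketch that builds the URL for a given release date. The helper name `weekly_claims_url` and the constant `URL_PATTERN` are hypothetical, introduced only for illustration; `<YYYY>` is taken to be the four-digit year and `<YY>` the two-digit year.

```python
import datetime

# Hypothetical names; the URL pattern itself is the one quoted above.
URL_PATTERN = "https://oui.doleta.gov/press/{year}/{mmddyy}.pdf"


def weekly_claims_url(release_date: datetime.date) -> str:
    """Build the report URL for a given release date."""
    return URL_PATTERN.format(
        year=release_date.year,                  # <YYYY>
        mmddyy=release_date.strftime("%m%d%y"),  # <MM><DD><YY>
    )


print(weekly_claims_url(datetime.date(2020, 2, 6)))
# https://oui.doleta.gov/press/2020/020620.pdf
```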

In [None]:
!pip install -U PyPDF2==1.26.0

In [None]:
import io
import pandas as pd
import requests
import PyPDF2

In [None]:
example_pdf = 'https://oui.doleta.gov/press/2020/020620.pdf'

#### Step 2: Download PDF files using `requests` package

In [None]:
pdf_data = requests.get(example_pdf).content
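
A failed download (for example an HTML error page) would only surface later as a confusing PDF parsing error. The sketch below shows one way to validate a response before parsing it. The helper name `check_pdf_response` is hypothetical, and looking for the `%PDF-` marker in the first 1024 bytes is a heuristic, not a full validity check.

```python
def check_pdf_response(status_code: int, content: bytes) -> bytes:
    """Return content if it looks like a successfully downloaded PDF.

    Raises ValueError otherwise. This is a heuristic sketch, not a validator.
    """
    if status_code != 200:
        raise ValueError("unexpected HTTP status: {}".format(status_code))
    # PDF files start with a "%PDF-" header; allow for a little leading junk.
    if b"%PDF-" not in content[:1024]:
        raise ValueError("response body does not look like a PDF")
    return content
```

With `requests`, this could be used as `resp = requests.get(example_pdf, timeout=30)` followed by `pdf_data = check_pdf_response(resp.status_code, resp.content)`.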

#### Step 3: Load PDF file with `PyPDF2.PdfFileReader`

See https://pythonhosted.org/PyPDF2/PdfFileReader.html for more usage details.

In [None]:
reader = PyPDF2.PdfFileReader(io.BytesIO(pdf_data))

num_pages = reader.getNumPages()

print("This PDF file has {} pages in total".format(num_pages))

#### Step 4: Extract a specific page using `PdfFileReader.getPage(page_num)`

A `PageObject` represents a single page in the PDF file (https://pythonhosted.org/PyPDF2/PageObject.html). Page numbers passed to `getPage` are zero-based, so the fourth page is `getPage(3)`.


In [None]:
page4 = reader.getPage(3)

contents = page4.extractText()
contents

#### Step 5: Extract required fields

A simple rule: split the text on the newline character `'\n'` and pick each field by its position.

In [None]:
components = contents.split("\n")

week_ending = components[2].strip()
initial_claims_sa = components[8].strip()

print("Week Ending: ", week_ending)
print("Initial Claims (SA): ", initial_claims_sa)

#### Step 6: Create a function to perform automatic extractions from a given URL

Input: pdf URL

Output: a dictionary with the URL, the extracted `week_ending`, and the `initial_claims_sa` value

In [None]:
import io
import datetime
import requests
import PyPDF2


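def extract_us_weekly_initial_claims(url):
    """Download a weekly claims PDF and return the fields read from page 4."""
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors such as 404

    reader = PyPDF2.PdfFileReader(io.BytesIO(response.content))

    # Pages are zero-indexed, so index 3 is the fourth page
    contents = reader.getPage(3).extractText()

    # Positional rule: fields are located by counting newline-separated parts
    components = contents.split("\n")
    week_ending = components[2].strip()
    initial_claims_sa = components[8].strip()

    return {
        'url': url,
        'Week Ending': week_ending,
        'Initial Claims (SA)': initial_claims_sa
    }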

In [None]:
extract_us_weekly_initial_claims('https://oui.doleta.gov/press/2020/020620.pdf')

In [None]:
extract_us_weekly_initial_claims('https://oui.doleta.gov/press/2020/022720.pdf')

## Batch Example

In [None]:
def get_us_weekly_initial_claims_url(date):
    assert isinstance(date, datetime.date)
    return 'https://oui.doleta.gov/press/{year}/{month:02d}{day:02d}{yy:02d}.pdf'.format(
                year=date.year,
                month=date.month,
                day=date.day,
                yy=(date.year % 100)  # two-digit year, e.g. 2021 -> 21
            )

In [None]:
get_us_weekly_initial_claims_url(datetime.date(2020, 2, 20))

In [None]:
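def thursdays(from_date, to_date):
    """Return all Thursdays between from_date and to_date (both inclusive)."""
    assert isinstance(from_date, datetime.date)
    assert isinstance(to_date, datetime.date)

    # isoweekday(): Monday=1 ... Sunday=7, so Thursday=4
    first_thursday = from_date + datetime.timedelta(days=(11 - from_date.isoweekday()) % 7)
    # Floor division gives an empty result when to_date precedes first_thursday
    num_thursdays = (to_date - first_thursday).days // 7 + 1

    return [first_thursday + datetime.timedelta(days=7*i) for i in range(num_thursdays)]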

In [None]:
thursdays(datetime.date(2020,2,1), datetime.date(2020,5,20))

In [None]:
begin_date = datetime.date(2020,2,1)
end_date = datetime.date(2020,5,20)

pdf_urls = [get_us_weekly_initial_claims_url(date) for date in thursdays(begin_date, end_date)]

pdf_urls

In [None]:
results = [extract_us_weekly_initial_claims(url) for url in pdf_urls]

pd.DataFrame(results)
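
A single bad URL in the list aborts the whole list comprehension, for instance if a release was published on a different weekday or a PDF has an unexpected layout. The sketch below shows one way to keep going and record failures separately. The helper name `extract_all` is hypothetical, and the demo uses a stand-in extractor rather than real downloads.

```python
def extract_all(urls, extractor):
    """Apply extractor to each URL, collecting successes and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(extractor(url))
        except Exception as exc:  # broad on purpose: network, parsing, indexing errors
            failures.append({"url": url, "error": repr(exc)})
    return results, failures


# Stand-in extractor for demonstration only: fails for URLs containing "bad".
def fake_extractor(url):
    if "bad" in url:
        raise ValueError("could not parse")
    return {"url": url}


ok, failed = extract_all(["a.pdf", "bad.pdf", "c.pdf"], fake_extractor)
print(len(ok), len(failed))
# 2 1
```

In the notebook this could be called as `results, failures = extract_all(pdf_urls, extract_us_weekly_initial_claims)`, after which `failures` can be inspected by hand.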

## A More Robust Extraction Algorithm With Regular Expressions

For background on regular expressions, see https://en.wikipedia.org/wiki/Regular_expression
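
To see how the two patterns used in this section behave, the sketch below runs them on a made-up string. The sample text and its numbers are invented for illustration and do not come from a real report; the helper `first_group` is hypothetical and returns `None` instead of raising when nothing matches.

```python
import re

# Made-up sample text for illustration only; not taken from a real report.
sample = "WEEK ENDING JANUARY 1 DECEMBER 25 CHANGE INITIAL CLAIMS (SA) 123,456 120,000"

WEEK_ENDING_RE = re.compile(r"WEEK ENDING\s+(\w+\s+\d+)")
INITIAL_CLAIMS_RE = re.compile(r"INITIAL CLAIMS \(SA\)\s+([0-9,]+)")


def first_group(pattern, text):
    """Return the first captured group, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


week_ending = first_group(WEEK_ENDING_RE, sample)
initial_claims_sa = first_group(INITIAL_CLAIMS_RE, sample)
print(week_ending, "|", initial_claims_sa)
# JANUARY 1 | 123,456
```

Each pattern anchors on a label in the text and captures the first value after it, so the result no longer depends on how many newline-separated parts come before the field.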

In [None]:
import re

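def extract_us_weekly_initial_claims(url):
    """Download a weekly claims PDF and extract fields from page 4 by label."""
    pdf_data = requests.get(url).content

    reader = PyPDF2.PdfFileReader(io.BytesIO(pdf_data))

    # Pages are zero-indexed, so index 3 is the fourth page
    contents = reader.getPage(3).extractText()

    # Flatten to one upper-case line so the patterns ignore line breaks and case
    contents = contents.replace("\n", " ").upper()

    # Capture the first "<month> <day>" after the "WEEK ENDING" label;
    # [0] raises IndexError if the label is not found
    week_ending = re.findall(r"WEEK ENDING\s+(\w+\s+\d+)", contents)[0]

    # Capture the first comma-grouped number after "INITIAL CLAIMS (SA)"
    initial_claims_sa = re.findall(r"INITIAL CLAIMS \(SA\)\s+([0-9,]+)", contents)[0]

    return {
        'url': url,
        'Week Ending': week_ending,
        'Initial Claims (SA)': initial_claims_sa
    }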


In [None]:
results = [extract_us_weekly_initial_claims(url) for url in pdf_urls]

pd.DataFrame(results)
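
The extracted claims values are still strings such as `'123,456'`. A minimal sketch for converting them to integers follows. The helper name `parse_claims` is hypothetical, and it assumes the value contains only digits and thousands separators.

```python
def parse_claims(value: str) -> int:
    """Convert a comma-grouped number string such as '123,456' to an int."""
    return int(value.replace(",", "").strip())


print(parse_claims("123,456"))
# 123456
```

On the resulting DataFrame this could be applied with `df['Initial Claims (SA)'].map(parse_claims)`, where `df` stands for `pd.DataFrame(results)`.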