<h1>Extract Structured Data from a PDF Document</h1>

<p>In this article, I will show how to extract headers and paragraphs from a PDF document with PyMuPDF and save them in JSON format. PDF is one of the most common document formats, so it is really useful to know how to read text from it.</p>

<h3>Create an environment with conda</h3>
<p>With conda, we can create, export, list, remove, and update environments that have different versions of Python and/or packages installed in them.</p>

In [26]:
# conda create --name pdf_extractor

# virtualenv --python python venv


<p>Replace <code>pdf_extractor</code> with your environment name.</p>
<p>To activate an environment:</p>

In [27]:
# conda activate pdf_extractor

<h3>Install Jupyter Notebook</h3>

In [28]:
# conda install -c conda-forge notebook
# conda install -c conda-forge nb_conda_kernels

<h3>Start Jupyter Notebook</h3>

In [29]:
# jupyter notebook

<h3>Install libraries</h3>

In [30]:
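# pip install pymupdf
# PyMuPDF itself provides the fitz module; the separate "fitz" package on PyPI is unrelated, so it is not installed here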

<h3>Read PDF Document and Extract Text</h3>

In [1]:
from operator import itemgetter
import fitz
import json

<h3>Open a PDF Document</h3>
<p>Open a PDF document with the <code>PyMuPDF</code> package, which is imported as <code>fitz</code>.</p>

In [2]:
filename = 'Australia legislation credit cards F2014L01710ES.pdf'
doc = fitz.open(filename)

<p>Document Methods and Attributes:</p>

In [3]:
# The number of pages
doc.page_count

38

In [4]:
# The table of contents
doc.get_toc()

[[1, 'Bookmarks', 1],
 [2, 'EXPLANATORY STATEMENT', 1],
 [2, 'SELECT LEGISLATIVE INSTRUMENT NO. 207, 2014', 1],
 [3, 'Issued by authority of the Treasurer', 1],
 [4, 'ATTACHMENT A', 3],
 [4, 'Statement of Compatibility with Human Rights', 3],
 [5, 'Overview of the Legislative Instrument', 4],
 [5, 'Human rights implications', 4],
 [5, 'Conclusion', 4],
 [4, 'CONTENTS', 6],
 [2, 'Limiting Access to ADIs is Preventing New Entry', 12],
 [2, 'Prudential Framework', 13],
 [2, 'Option 1: Maintain the Status Quo', 16],
 [2,
  'Option 2: Remove the APRA SCCI Regime, but Retain Some Controls via the Access Regimes',
  16],
 [2, 'Option 3: Remove All Access Regulation', 17],
 [2, 'Options Not Considered to be Feasible', 17],
 [2, 'Note: Visa Debit Access Regime', 17],
 [2, 'Option 1: Maintain the Status Quo', 19],
 [3, 'Benefits', 19],
 [3, 'Costs', 20],
 [4, 'Regulatory costs', 20],
 [4, 'Reduced participation and competition', 20],
 [4, 'Public sector costs', 20],
 [2,
  'Option 2: Remove the 
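<p>Each entry returned by <code>doc.get_toc()</code> is a list of the form <code>[level, title, page]</code>, so the nesting can be shown as an indented outline. The following is a minimal sketch: <code>format_toc</code> is a hypothetical helper (not part of PyMuPDF), and the sample entries are copied from the output above.</p>

```python
def format_toc(toc):
    """Render [level, title, page] entries as an indented outline.

    format_toc is a hypothetical helper, not a PyMuPDF function.
    """
    lines = []
    for level, title, page in toc:
        indent = "  " * (level - 1)  # two spaces per nesting level
        lines.append(f"{indent}{title} (p. {page})")
    return "\n".join(lines)


# A few entries copied from the get_toc() output
sample_toc = [
    [1, 'Bookmarks', 1],
    [2, 'EXPLANATORY STATEMENT', 1],
    [3, 'Issued by authority of the Treasurer', 1],
    [4, 'ATTACHMENT A', 3],
]
print(format_toc(sample_toc))
```

<p>With a real document, <code>format_toc(doc.get_toc())</code> would be the equivalent call.</p>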

<h3>Working with Pages</h3>

<p>Called with the <code>"dict"</code> option, the <code>get_text()</code> method returns a dictionary that contains the page's text content:</p>
<ul>
    <li>A page consists of a list of block dictionaries.</li>
    <li>A block consists of a list of line dictionaries.</li>
    <li>A line consists of a list of span dictionaries.</li>
    <li>A span consists of the text itself, font, size, color, etc.</li>
</ul>

<p>A <b>text</b> page consists of blocks.</p>
<p>A <b>block</b> consists of either lines and their characters, or an image.</p>
<p>A <b>line</b> consists of spans.</p>
<p>A <b>span</b> consists of characters with identical font properties: name, size, flags and color.</p>

<h3>DICT</h3>
<p><code>page.get_text("dict")</code> returns the structure of the page and provides the content of every block, line, and span.</p>
<img src="image.png">
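<p>The block, line, and span nesting can be walked with three loops. The sketch below uses a small hand-written dictionary as a stand-in for the result of <code>page.get_text("dict")</code>, so it runs without a PDF. <code>iter_spans</code> and <code>sample_page</code> are hypothetical names, and real spans carry more keys than the three shown here.</p>

```python
def iter_spans(page_dict):
    """Yield every span of every text block in a page dictionary.

    iter_spans is a hypothetical helper, not a PyMuPDF function.
    """
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # type 0 is text; image blocks have no "lines"
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span


# Hand-written stand-in for page.get_text("dict")
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [
                {"text": "Prudential Framework", "size": 14.0,
                 "font": "Palatino Linotype,Bold"},
            ]},
            {"spans": [
                {"text": "Some body text.", "size": 11.0, "font": "Calibri"},
            ]},
        ]},
        {"type": 1},  # image block, skipped by iter_spans
    ]
}

for span in iter_spans(sample_page):
    print(span["size"], span["font"], repr(span["text"]))
```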

In [6]:
styles = {}
font_counts = {}

for page in doc:
    blocks = page.get_text("dict")['blocks']
    for block in blocks:
        # Text block
        if block['type']==0:
            for line in block['lines']:
                for span in line['spans']:
                    span_id = "{0}".format(round(span['size']))
                    styles[span_id] = {'size': round(span['size']), 'font': span['font']}
                    font_counts[span_id] = font_counts.get(span_id, 0) + 1

In [7]:
font_counts = sorted(font_counts.items(), key=itemgetter(1), reverse=True)
print(font_counts)

[('11', 1087), ('12', 210), ('10', 170), ('7', 133), ('9', 62), ('8', 38), ('20', 30), ('14', 25), ('16', 21), ('6', 20)]


In [52]:
styles

{'14': {'size': 14, 'font': 'Palatino Linotype,Bold'},
 '12': {'size': 12, 'font': 'Times New Roman,Bold'},
 '8': {'size': 8, 'font': 'Times-Roman'},
 '7': {'size': 7, 'font': 'Palatino Linotype,Bold'},
 '20': {'size': 20, 'font': 'Palatino Linotype,Bold'},
 '16': {'size': 16, 'font': 'Palatino Linotype,Bold'},
 '11': {'size': 11, 'font': 'Calibri'},
 '6': {'size': 6, 'font': 'Calibri'},
 '9': {'size': 9, 'font': 'Calibri'},
 '10': {'size': 10, 'font': 'Calibri,Bold'}}
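<p>Because <code>styles</code> is keyed by the rounded size only, it keeps just the last font seen for each size. If that matters, one possible variation is to count <code>(size, font)</code> pairs with <code>collections.Counter</code>. This is a sketch under assumptions: <code>count_styles</code> and the sample spans are hypothetical, and unlike the loop above it skips whitespace-only spans.</p>

```python
from collections import Counter


def count_styles(spans):
    """Count non-empty spans per (rounded size, font name) pair.

    count_styles is a hypothetical helper for illustration.
    """
    counts = Counter()
    for span in spans:
        if span["text"].strip():  # ignore whitespace-only spans
            counts[(round(span["size"]), span["font"])] += 1
    return counts


# Hypothetical spans, shaped like those in page.get_text("dict")
sample_spans = [
    {"text": "Heading", "size": 14.0, "font": "Palatino Linotype,Bold"},
    {"text": "Body one", "size": 11.04, "font": "Calibri"},
    {"text": "Body two", "size": 10.98, "font": "Calibri"},
    {"text": "   ", "size": 11.0, "font": "Calibri"},
]
counts = count_styles(sample_spans)
print(counts.most_common(1))  # the most frequent style is a paragraph candidate
```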

In [8]:
p_style = styles[font_counts[0][0]]  # get style for most used font by count (paragraph)
p_size = float(p_style['size'])  # get the paragraph's size

# sorting the font sizes high to low, so that we can append the right integer to each tag
font_sizes = []
for (font_size, count) in font_counts:
    font_sizes.append(float(font_size))
font_sizes.sort(reverse=True)

In [9]:
font_sizes

[20.0, 16.0, 14.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0]

In [10]:
# aggregating the tags for each font size
idx = 0
size_tag = {}

for size in font_sizes:
    idx += 1
    if size > p_size:
        size_tag[size] = '<h{0}>'.format(idx)
    elif size == p_size:
        size_tag[p_size] = '<p>'
        idx = 0
    elif size < p_size:
        size_tag[size] = '<s{0}>'.format(idx)

In [11]:
size_tag

{20.0: '<h1>',
 16.0: '<h2>',
 14.0: '<h3>',
 12.0: '<h4>',
 11.0: '<p>',
 10.0: '<s1>',
 9.0: '<s2>',
 8.0: '<s3>',
 7.0: '<s4>',
 6.0: '<s5>'}
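<p>The same mapping can be wrapped in a function so that it is easy to reuse and check. The sketch below is one way to do it; <code>build_size_tags</code> is a hypothetical name, and it assumes that sizes larger than the paragraph size are headers and smaller ones are small text, as in the cell above.</p>

```python
def build_size_tags(font_sizes, p_size):
    """Map each font size to <hN>, <p>, or <sN> relative to the paragraph size.

    build_size_tags is a hypothetical helper for illustration.
    """
    ordered = sorted(set(font_sizes), reverse=True)
    larger = [s for s in ordered if s > p_size]
    smaller = [s for s in ordered if s < p_size]
    tags = {p_size: "<p>"}
    for rank, size in enumerate(larger, start=1):
        tags[size] = f"<h{rank}>"  # largest size becomes <h1>
    for rank, size in enumerate(smaller, start=1):
        tags[size] = f"<s{rank}>"  # first size below the paragraph becomes <s1>
    return tags


tags = build_size_tags(
    [20.0, 16.0, 14.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0], 11.0
)
print(tags[20.0], tags[11.0], tags[6.0])
```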

In [12]:
header_paragraph = []  # list with headers and paragraphs
first = True  # flag: no span has been processed yet
previous_s = {}  # previous span

for page in doc:
    blocks = page.get_text("dict")["blocks"]
    for b in blocks:  # iterate through the text blocks
        if b['type'] == 0:  # this block contains text

            # REMEMBER: multiple fonts and sizes are possible IN one block

            block_string = ""  # text found in block
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    if s['text'].strip():  # skip whitespace-only spans
                        if first:
                            previous_s = s
                            first = False
                            block_string = size_tag[round(s['size'])] + s['text']
                        else:
                            if s['size'] == previous_s['size']:

                                if block_string and all((c == "|") for c in block_string):
                                    # block_string only contains pipes, so restart it with the size tag
                                    block_string = size_tag[round(s['size'])] + s['text']
                                elif block_string == "":
                                    # new block has started, so prepend the size tag
                                    block_string = size_tag[round(s['size'])] + s['text']
                                else:  # in the same block, so concatenate strings
                                    block_string += " " + s['text']

                            else:
                                header_paragraph.append(block_string)
                                block_string = size_tag[round(s['size'])] + s['text']

                            previous_s = s

                # end of a line, marked with a pipe
                block_string += "|"

            header_paragraph.append(block_string)
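<p>Each string in <code>header_paragraph</code> starts with a size tag and uses pipes as line separators, for example <code>&lt;h2&gt;Prudential Framework|</code>. The sketch below shows one way to split such a string into its tag and its text. <code>split_tagged</code> is a hypothetical helper, and replacing pipes with spaces (rather than removing them) is a choice made here for illustration.</p>

```python
def split_tagged(block_string):
    """Split '<tag>text|more|' into (tag, text), joining the lines with spaces.

    split_tagged is a hypothetical helper for illustration.
    """
    if not block_string.startswith("<") or ">" not in block_string:
        # no size tag at the start: return the cleaned text only
        return None, block_string.replace("|", " ").strip()
    tag, _, text = block_string.partition(">")  # split at the first '>'
    text = " ".join(part.strip() for part in text.split("|") if part.strip())
    return tag + ">", text


print(split_tagged("<h2>Prudential Framework|"))
print(split_tagged("<p>First line| second line|"))
```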

In [13]:
document = []  # one {tag: {"text": ..., "paragraph": [...]}} entry per header

for h in header_paragraph:
    # keep only headers and paragraphs; smaller text (<s...>) is skipped
    if not (h.startswith('<h') or h.startswith('<p')):
        continue
    tag, _, text = h.partition('>')  # split at the first '>' so text containing '>' survives
    tag += '>'
    text = text.replace('|', '')
    if tag != '<p>' and len(h.split(" ")) < 20:
        # a short header-sized string starts a new section
        document.append({tag: {"text": text, "paragraph": []}})
    elif document:
        # a paragraph, or a long header-sized string, is attached to the latest section;
        # text that appears before the first header is skipped instead of raising IndexError
        last_element = document[-1]
        last_element[list(last_element.keys())[0]]["paragraph"].append(text)

with open("doc.json", 'w') as json_out:
    json.dump(document, json_out, indent=2)

In [59]:
#document
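<p>The saved JSON is a list of single-key dictionaries, one per header, each holding the header text and its paragraphs. The sketch below shows how such a file could be read back and summarized. It uses hypothetical sample data and a temporary file rather than <code>doc.json</code>, and <code>summarize</code> is a hypothetical helper.</p>

```python
import json
import os
import tempfile

# Hypothetical data in the same shape as `document`
sample_document = [
    {"<h2>": {"text": "Prudential Framework",
              "paragraph": ["First paragraph.", "Second paragraph."]}},
    {"<h3>": {"text": "Benefits", "paragraph": []}},
]


def summarize(sections):
    """Return (tag, heading, paragraph count) for each section."""
    summary = []
    for section in sections:
        for tag, content in section.items():
            summary.append((tag, content["text"], len(content["paragraph"])))
    return summary


# Round-trip through a temporary file instead of doc.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w") as f:
        json.dump(sample_document, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

print(summarize(loaded))
```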