
In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


## Extracting headers and paragraphs from a PDF using PyMuPDF

In [2]:
# PyMuPDF provides the `fitz` module. The PyPI package that is itself named
# `fitz` is an unrelated project, so only PyMuPDF is installed here.
!pip install --quiet PyMuPDF

## Steps used

* Use PyMuPDF to find the most frequently used font size in the document. Text in that size is treated as a paragraph, anything larger as a header, and anything smaller as a subscript.
* Create a dictionary that maps each font size to an HTML-style element tag for headers, paragraphs, and subscripts (`<h1>`, `<p>`, `<s1>`, and so on).
* Prefix each piece of text with its element tag.
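
The three steps above can be sketched on hand-made data, without opening a PDF. The `spans` list and the `label_spans` helper are illustrative assumptions, not part of the notebook; they only demonstrate the rule "most frequent size is a paragraph, larger is a header, smaller is a subscript".

```python
from collections import Counter

# Hypothetical (font_size, text) pairs standing in for spans read from a PDF.
spans = [
    (18.0, "Nature and Scope"),
    (10.5, "You have already studied geography."),
    (10.5, "Do you recall the contents?"),
    (7.0, "1"),
]


def label_spans(spans):
    """Label each span relative to the most frequently used font size."""
    size_counts = Counter(size for size, _ in spans)
    paragraph_size = size_counts.most_common(1)[0][0]
    labelled = []
    for size, text in spans:
        if size > paragraph_size:
            kind = "header"
        elif size < paragraph_size:
            kind = "subscript"
        else:
            kind = "paragraph"
        labelled.append((kind, text))
    return labelled


labelled = label_spans(spans)
for kind, text in labelled:
    print(f"{kind:10} {text}")
```

The functions in the following cells apply the same idea, but number the header and subscript levels and read the sizes from the PDF itself.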

#### Collect every style with its attributes, plus a list of (font_size, count) pairs sorted by usage

In [4]:
def fonts(doc, granularity=False):
    """Extracts fonts and their usage in PDF documents.
    :param doc: PDF document to iterate through
    :type doc: <class 'fitz.fitz.Document'>
    :param granularity: also use 'font', 'flags' and 'color' to discriminate text
    :type granularity: bool
    :rtype: [(font_size, count), (font_size, count)], dict
    :return: most used fonts sorted by count, font style information
    """
    styles = {}
    font_counts = {}

    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:  # iterate through the text blocks
            if b['type'] == 0:  # block contains text
                for l in b["lines"]:  # iterate through the text lines
                    for s in l["spans"]:  # iterate through the text spans
                        if granularity:
                            identifier = "{0}_{1}_{2}_{3}".format(s['size'], s['flags'], s['font'], s['color'])
                            styles[identifier] = {'size': s['size'], 'flags': s['flags'], 'font': s['font'],
                                                  'color': s['color']}
                        else:
                            identifier = "{0}".format(s['size'])
                            styles[identifier] = {'size': s['size'], 'font': s['font']}

                        font_counts[identifier] = font_counts.get(identifier, 0) + 1  # count the fonts usage

    font_counts = sorted(font_counts.items(), key=itemgetter(1), reverse=True)

    if len(font_counts) < 1:
        raise ValueError("Zero discriminating fonts found!")

    return font_counts, styles
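
To make the traversal in `fonts` concrete, the sketch below runs the same block, line, and span walk over a hand-written dictionary shaped like the output of `page.get_text("dict")`. The `fake_page` data is an assumption, reduced to the keys the function reads; a real page dictionary carries more fields, such as bounding boxes.

```python
from collections import Counter

# Hand-written stand-in for page.get_text("dict"); only the keys used by
# fonts() are included. Type 0 marks a text block, type 1 an image block.
fake_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"size": 18.0, "font": "Helvetica-Bold", "text": "Title"}]},
        ]},
        {"type": 1},  # image block: it has no "lines" key and is skipped below
        {"type": 0, "lines": [
            {"spans": [{"size": 10.5, "font": "Helvetica", "text": "First line"}]},
            {"spans": [{"size": 10.5, "font": "Helvetica", "text": "Second line"}]},
        ]},
    ]
}

size_counts = Counter()
for block in fake_page["blocks"]:
    if block["type"] != 0:
        continue
    for line in block["lines"]:
        for span in line["spans"]:
            # Same identifier as fonts() uses when granularity is False.
            size_counts[str(span["size"])] += 1

# Same shape as the first return value of fonts(): [(identifier, count), ...]
font_counts = size_counts.most_common()
print(font_counts)
```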

#### Create Element tag dictionary

In [5]:
def font_tags(font_counts, styles):
    """Returns dictionary with font sizes as keys and tags as value.
    :param font_counts: (font_size, count) for all fonts occurring in document
    :type font_counts: list
    :param styles: all styles found in the document
    :type styles: dict
    :rtype: dict
    :return: all element tags based on font-sizes
    """
    p_style = styles[font_counts[0][0]]  # get style for most used font by count (paragraph)
    p_size = p_style['size']  # get the paragraph's size

    # sorting the font sizes high to low, so that we can append the right integer to each tag
    # (assumes fonts() was called with granularity=False, so each identifier is a plain size)
    font_sizes = []
    for (font_size, count) in font_counts:
        font_sizes.append(float(font_size))
    font_sizes.sort(reverse=True)

    # aggregating the tags for each font size
    idx = 0
    size_tag = {}
    for size in font_sizes:
        idx += 1
        if size == p_size:
            idx = 0  # restart the numbering for sizes below the paragraph size
            size_tag[size] = '<p>'
        elif size > p_size:
            size_tag[size] = '<h{0}>'.format(idx)
        else:
            size_tag[size] = '<s{0}>'.format(idx)

    return size_tag
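
A worked example of the numbering, assuming four hypothetical font sizes of which 10.5 is the paragraph size: sizes are ranked from largest to smallest, headers count up from `<h1>`, and the counter restarts after `<p>` so that smaller sizes become `<s1>`, `<s2>`, and so on. The `tag_sizes` helper is a compact restatement for illustration, not the notebook's function.

```python
def tag_sizes(sizes, paragraph_size):
    """Map each font size to a tag, ranking sizes from largest to smallest."""
    size_tag = {}
    rank = 0
    for size in sorted(set(sizes), reverse=True):
        rank += 1
        if size == paragraph_size:
            size_tag[size] = "<p>"
            rank = 0  # restart the counter for sizes below the paragraph
        elif size > paragraph_size:
            size_tag[size] = "<h{0}>".format(rank)
        else:
            size_tag[size] = "<s{0}>".format(rank)
    return size_tag


# Hypothetical sizes; 10.5 is assumed to be the most used (paragraph) size.
tags = tag_sizes([10.5, 24.0, 7.0, 14.0], paragraph_size=10.5)
print(tags)
```

The largest size always receives `<h1>`, so the tags reflect relative rank within one document rather than absolute point sizes.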

#### Extracting headers and paragraphs

In [6]:
# tags: header --> <h1>, <h2>, ...; paragraph --> <p>; subscript --> <s1>, <s2>, ...
def headers_para(doc, size_tag):
    """Scrapes headers & paragraphs from PDF and returns texts with element tags.
    :param doc: PDF document to iterate through
    :type doc: <class 'fitz.fitz.Document'>
    :param size_tag: textual element tags for each size
    :type size_tag: dict
    :rtype: list
    :return: texts with prepended element tags
    """
    header_para = []  # list with headers and paragraphs
    first = True  # flag: True until the first non-empty span has been seen
    previous_s = {}  # previous span

    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:  # iterate through the text blocks
            if b['type'] == 0:  # this block contains text

                # REMEMBER: multiple fonts and sizes are possible IN one block

                block_string = ""  # text found in block
                for l in b["lines"]:  # iterate through the text lines
                    for s in l["spans"]:  # iterate through the text spans
                        if s['text'].strip():  # skip whitespace-only spans
                            if first:
                                previous_s = s
                                first = False
                                block_string = size_tag[s['size']] + s['text']
                            else:
                                if s['size'] == previous_s['size']:

                                    if block_string and all((c == "|") for c in block_string):
                                        # block_string only contains pipes
                                        block_string = size_tag[s['size']] + s['text']
                                    elif block_string == "":
                                        # new block has started, so append size tag
                                        block_string = size_tag[s['size']] + s['text']
                                    else:  # in the same block, so concatenate strings
                                        block_string += " " + s['text']

                                else:
                                    header_para.append(block_string)
                                    block_string = size_tag[s['size']] + s['text']

                                previous_s = s

                    # end of a text line reached, mark it with a pipe
                    block_string += "|"

                header_para.append(block_string)

    return header_para
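
Each string returned by `headers_para` starts with a size tag and uses `|` to mark line ends. A small parser for that convention might look like the following; `split_tagged` and the sample strings are hypothetical and only illustrate the format.

```python
import re

# Matches a leading tag such as <h1>, <p> or <s2> and captures the rest.
TAG_PATTERN = re.compile(r"^<(h\d+|p|s\d+)>(.*)$", re.DOTALL)


def split_tagged(item):
    """Return (tag, text) for a tagged string, or (None, text) if untagged."""
    match = TAG_PATTERN.match(item)
    if match is None:
        return None, item.replace("|", " ").strip()
    tag, body = match.groups()
    # Pipes mark line ends; join the non-empty lines with single spaces.
    text = " ".join(part.strip() for part in body.split("|") if part.strip())
    return tag, text


samples = ["<h1>Nature and Scope|", "<p>You have already| studied geography.|", "|"]
parsed = [split_tagged(s) for s in samples]
print(parsed)
```

Entries that contain only pipes, like the last sample, carry no text and can be dropped by the caller.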

## Function to retrieve header and text from PDFs

In [79]:
def getPdfHeaderText(path):
  # 1. read the document
  doc = fitz.open(path)
  # 2. retrieve the font usage counts and styles from the document
  font_counts, styles = fonts(doc, granularity=False)
  # 3. map each font size to an element tag (header, paragraph, subscript)
  size_tag = font_tags(font_counts, styles)
  # 4. retrieve the tagged headers and paragraphs
  header_para = headers_para(doc, size_tag)
  #print(header_para)
  # segregate the headers and text with the help of the <h>, <p> and <s> tags
  header = []
  text = []
  for items in header_para:
    if items.startswith('<h'):
      hdr_txt = items.split("|")[0]
      #print(hdr_txt)
      #header.append(items[items.index(">")+1:].replace("|",""))
      header.append(hdr_txt[hdr_txt.index(">")+1:].replace("|",""))
      text.append(np.nan)
    elif items.startswith('<p'):
      header.append(np.nan)
      tmp = items[items.index(">")+1:].replace("|","")
      tmp = tmp.replace("\t"," ")
      text.append(tmp)
    elif items.startswith('<s'):
      header.append(np.nan)
      text.append(items[items.index(">")+1:].replace("|",""))
  #Create a dataframe
  df = pd.DataFrame({"Header":header,'Text': text})
  #df.to_csv("headers.csv",index=False)
  # forward fill so that every paragraph row carries the header that precedes it
  df['Header'] = df['Header'].ffill()
  #print(df)
  #
  header = df['Header'].values.tolist()
  text = df['Text'].values.tolist()
  #initialize dictionary to hold header and corresponding text
  header_dict = {}
  # headers to filter out
  tag_list = ["Sources:", "Source:", "Tags-", "Tags:","CONTENTS","ANNEX","EXERCISES","Project/Activity"]
  for h,t in zip(header,text):
    # rows before the first header have a NaN header; isinstance skips them
    # (a comparison such as h != np.nan is always True and filters nothing)
    if (isinstance(h, str)
        and h not in tag_list
        and not h.startswith(("Table ", "Figure ", "EXERCISES"))):
      h = h.replace("\t"," ").strip()
      # header rows carry NaN text; treat it as empty rather than stripping
      # the substring "nan", which would also corrupt words such as "finance"
      t = "" if pd.isna(t) else str(t).replace("\t","")
      header_dict[h] = header_dict.get(h, "") + t
  # filter out the headers having empty text associated
  filtered_dict = {k:v for k,v in header_dict.items() if len(v) > 0}
  return filtered_dict
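
The forward fill in `getPdfHeaderText` attaches every paragraph to the most recent header above it. The same grouping can be sketched in plain Python, without pandas. `group_by_header` and its input are illustrative assumptions; unlike the notebook, the sketch joins text pieces with a space and skips the header filtering.

```python
def group_by_header(tagged_items):
    """Collect paragraph text under the most recent header seen so far."""
    grouped = {}
    current_header = None
    for item in tagged_items:
        tag, _, rest = item.partition(">")
        text = rest.replace("|", "").replace("\t", " ").strip()
        if tag.startswith("<h"):
            current_header = text
        elif current_header is not None and text:
            # Paragraphs and subscripts both go under the current header.
            grouped.setdefault(current_header, []).append(text)
    return {header: " ".join(parts) for header, parts in grouped.items()}


# Hypothetical output of headers_para for a two-section document.
items = [
    "<h1>Nature and Scope|",
    "<p>Geography is integrative.|",
    "<p> It is also empirical.|",
    "<h1>Human Geography Defined|",
    "<s1>Ratzel|",
]
sections = group_by_header(items)
print(sections)
```

Text that appears before the first header has nothing to attach to and is dropped, which matches the effect of the forward fill on leading rows.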


## Execute the function to retrieve headers and corresponding paragraphs

In [82]:
import fitz
import pandas as pd
import numpy as np
from operator import itemgetter
#path = "/content/drive/MyDrive/ZeoanAI_Poc/PDF/Economic Survey 2021-22.pdf"
path = "/content/drive/MyDrive/ZeoanAI_Poc/PDF/legy101.pdf"
header_dict = getPdfHeaderText(path)
print(header_dict)

{'Nature and Scope': 'You have already studied ‘Geography as a Discipline’ in Chapter I of the book, Fundamentals of Physical Geography  (NCERT, 2006). Do you recall the contents? This chapter has broadly covered and introduced you to the nature of geography. You are also acquainted with the important branches that sprout from the body of geography. If you re-read the chapter you will be able to recall the link of human geography with the mother discipline i.e. geography. As you know geography as a field of study is integrative, empirical, and practical. Thus, the reach of geography is extensive and each and every event or phenomenon which varies over space and time can be studied geographically. How do you see the earth’s surface? Do you realise that the earth comprises two major components: nature (physical environment) and life forms including human beings? Make a list of physical and human components of your surroundings. Physical geography studies physical environment and human ge

In [83]:
header_dict

{'Chapter-1': '2021-222 Fundamentals of Human Geographyphenomena are described in metaphors using symbols from the human anatomy. We often talk of  the ‘face’ of the earth, ‘eye’ of the storm, ‘mouth’ of the river, ‘snout’ (nose) of the glacier, ‘neck’ of the isthmus and ‘profile’ of the soil. Similarly regions, villages, towns have been described as ‘organisms’. German geographers describe the ‘state/country’ as a ‘living organism’. Networks of road, railways and water ways have often been described as “arteries of circulation”. Can you collect such terms and expressions from your own language? The basic questions now arises, can we separate nature and human when they are so intricately intertwined?',
 'Human Geography Defined': '• “Human geography is the synthetic study of relationship between human societies and earth’s surface”.                                  RatzelSynthesis has been emphasised in the above definition.• “Human geography is the study of “the changing relationship 