Headings #9

MrUnknown789556 · 2023-06-10T01:09:07Z

If headings (and subheadings) are defined internally in the PDF file, they can be easily distinguished and extracted from the JSON file that Burdoc generates. If no headings (or subheadings) are defined internally, all the headings appear in the JSON file mixed together with the descriptive text, tables, etc.

How can I distinguish and extract (for instance, using regex) ONLY the headings (and subheadings) from the generated JSON file when no headings (or subheadings) are defined internally in the PDF file and listed in the "toc" section?

Investigation.pdf
Investigation.txt

jennis0 · 2023-07-01T08:21:15Z

Sorry for the delayed reply! Burdoc actually already makes a best-effort attempt to do this! If you look in the produced JSON file, there is a top-level entry called 'page_hierarchy', which contains all headings detected in the text. Unfortunately, Burdoc currently doesn't handle maths, so in your example a lot of formulas are extracted as 'h6' headings, but it looks like it correctly identifies all other headings, aside from the document title, as 'h5' entries.

The following code shows how to get the page headings and then get all content between the first heading found and the 2nd.
Hope this helps!

import json

# Load the Burdoc output
with open("Investigation.txt") as f:
    extract = json.load(f)

# Get all 'h5' headings from the document
headings = []
for page, page_headings in extract['page_hierarchy'].items():
    for h in page_headings:
        if h['assigned_heading'] == 'h5':
            headings.append(h)

# Start at the position of the first heading
first_content = []
page = headings[0]['page']
index = headings[0]['index'][0]

# Iterate over all items until we reach the next heading.
# Note: if 'content' is keyed by page-number strings after json.load,
# look pages up with str(page) instead of page.
while page < headings[1]['page'] or index < headings[1]['index'][0]:
    if index >= len(extract['content'][page]):
        # Ran off the end of this page, so continue on the next one
        page += 1
        index = 0
        continue

    first_content.append(extract['content'][page][index])
    index += 1
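
The same idea can be generalised to any pair of headings. The following is only a sketch, not part of Burdoc: the helper names (heading_position, collect_headings, content_between) are hypothetical, and it assumes the layout implied by the snippet above, i.e. 'page_hierarchy' maps page keys to lists of entries with 'assigned_heading', 'page' and 'index', and 'content' is a dict keyed by page number (strings after json.load). The sample data is made up, with plain strings standing in for real content items.

```python
def heading_position(heading):
    """Return (page, item index) for a heading entry."""
    return int(heading["page"]), heading["index"][0]


def collect_headings(extract, level="h5"):
    """Collect all headings of one level, sorted by position in the document."""
    found = []
    for page_headings in extract["page_hierarchy"].values():
        found.extend(h for h in page_headings if h["assigned_heading"] == level)
    return sorted(found, key=heading_position)


def content_between(extract, start, end=None):
    """Return items from heading `start` up to, but not including, heading `end`.

    If `end` is None, everything to the end of the document is returned.
    """
    # Assumption: 'content' is a dict keyed by page number; normalise keys to int
    pages = {int(key): items for key, items in extract["content"].items()}
    start_pos = heading_position(start)
    end_pos = heading_position(end) if end else (max(pages) + 1, 0)

    selected = []
    for page in sorted(pages):
        for index, item in enumerate(pages[page]):
            # Tuple comparison orders positions by page first, then by index
            if start_pos <= (page, index) < end_pos:
                selected.append(item)
    return selected


# Made-up sample in the assumed layout (real content items are not plain strings)
sample = {
    "page_hierarchy": {
        "0": [{"assigned_heading": "h5", "page": 0, "index": [1]}],
        "1": [
            {"assigned_heading": "h6", "page": 1, "index": [0]},  # e.g. a formula
            {"assigned_heading": "h5", "page": 1, "index": [1]},
        ],
    },
    "content": {
        "0": ["Title", "Intro", "para A"],
        "1": ["formula", "Method", "para C"],
    },
}

sample_headings = collect_headings(sample)
first_section = content_between(sample, sample_headings[0], sample_headings[1])
last_section = content_between(sample, sample_headings[1])
print(first_section)  # ['Intro', 'para A', 'formula']
print(last_section)   # ['Method', 'para C']
```

As in the snippet above, each section starts with the heading item itself. Comparing (page, index) tuples avoids the manual page roll-over bookkeeping, at the cost of scanning every page.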
