
Headings #9

Open
MrUnknown789556 opened this issue Jun 10, 2023 · 1 comment

Comments

@MrUnknown789556

If headings (and subheadings) are defined internally in the PDF file, they can be easily distinguished and extracted from the JSON file that Burdoc generates. If the PDF has no such internal headings, all the headings appear in the JSON file mixed in with the body text, tables, and so on.

How can I distinguish and extract ONLY the headings (and subheadings) from the generated JSON file (for instance with a regex) when the PDF file has no internally defined headings and nothing is listed in the "toc" section?

Investigation.pdf
Investigation.txt

@jennis0
Owner

jennis0 commented Jul 1, 2023

Sorry for the delayed reply! Burdoc already makes a best-effort attempt to do this. The JSON file it produces has a top-level entry called 'page_hierarchy', which contains all headings detected in the text. Unfortunately, Burdoc doesn't currently handle maths, so in your example a lot of formulas are extracted as 'h6' headings, but it looks like it correctly identifies all other headings, aside from the document title, as 'h5' entries.
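
Before picking a level to filter on, it can help to tally which heading levels were assigned. This is only a rough sketch: the `sample` dict is made-up data that mirrors the fields named in this thread ('page_hierarchy', 'assigned_heading'), while the `text` field and the helper name `count_heading_levels` are hypothetical and not taken from Burdoc itself.

```python
from collections import Counter


def count_heading_levels(extract):
    """Tally how many 'page_hierarchy' entries carry each assigned level."""
    counts = Counter()
    for page_headings in extract["page_hierarchy"].values():
        for h in page_headings:
            counts[h["assigned_heading"]] += 1
    return counts


# Made-up sample; real Burdoc output may contain more fields per heading.
sample = {
    "page_hierarchy": {
        "0": [
            {"assigned_heading": "h5", "text": "Introduction"},
            {"assigned_heading": "h6", "text": "x = y + 1"},
        ],
        "1": [{"assigned_heading": "h5", "text": "Method"}],
    }
}

level_counts = count_heading_levels(sample)
print(dict(level_counts))
```

A level that is dominated by formulas (such as 'h6' in the example document) can then be skipped when collecting headings.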

The following code shows how to get the page headings and then collect all content between the first heading found and the second.
Hope this helps!

import json

# Load the JSON file produced by Burdoc
with open("investigation.txt") as f:
    extract = json.load(f)

# Collect all 'h5' headings from the document
headings = []
for page, page_headings in extract['page_hierarchy'].items():
    for h in page_headings:
        if h['assigned_heading'] == 'h5':
            headings.append(h)

# Start at the position of the first heading
first_content = []
page = headings[0]['page']
index = headings[0]['index'][0]

# Iterate over all items until we reach the next heading
while page < headings[1]['page'] or index < headings[1]['index'][0]:
    # Past the last item on this page: move to the start of the next page
    if index >= len(extract['content'][page]):
        page += 1
        index = 0
        continue

    first_content.append(extract['content'][page][index])
    index += 1
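
The same walk can be wrapped in a function so that it works for every pair of consecutive headings, not just the first two. The following is a sketch under assumptions: `content_between` is a hypothetical helper name, the sample data is invented, and it assumes 'content' can be indexed by an integer page number and that only the first element of 'index' matters, as in the snippet above. If your JSON keys pages differently (for example by string), the lookups would need adjusting.

```python
def content_between(extract, start, end=None):
    """Return content items from heading `start` up to, but not including,
    heading `end`. If `end` is None, collect to the end of the document.
    The heading item at `start` is itself included, as in the loop above."""
    content = extract["content"]
    page, index = start["page"], start["index"][0]
    if end is None:
        end_page, end_index = len(content), 0
    else:
        end_page, end_index = end["page"], end["index"][0]

    items = []
    while page < end_page or (page == end_page and index < end_index):
        # Past the last item on this page: move to the next page
        if index >= len(content[page]):
            page += 1
            index = 0
            continue
        items.append(content[page][index])
        index += 1
    return items


# Invented sample: two pages of content and two detected 'h5' headings.
sample = {
    "content": [
        ["Title", "Introduction", "para a", "para b"],
        ["para c", "Method", "para d"],
    ],
}
headings = [
    {"assigned_heading": "h5", "page": 0, "index": [1]},
    {"assigned_heading": "h5", "page": 1, "index": [1]},
]

# Pair each heading with the one after it; the last pairs with None.
sections = [
    content_between(sample, h, nxt)
    for h, nxt in zip(headings, headings[1:] + [None])
]
print(sections)
```

Pairing the last heading with `None` means its section runs to the end of the document rather than being dropped.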
