If headings (and subheadings) are defined internally in the PDF file, they can be easily distinguished and extracted from the JSON file that Burdoc generates. If there are no such internal headings (and subheadings), all the headings appear in the JSON file mixed together with the body text, tables, and so on.
How is it possible to distinguish and extract (for instance by using regex) ONLY the headings (and subheadings) from the generated JSON file, if no headings (and subheadings) are defined internally in the PDF file and listed in the "toc" section?
Sorry for the delayed reply! Burdoc already makes a best-effort attempt to do this. If you look in the produced JSON file, there is a top-level entry called 'page_hierarchy', which contains all headings detected in the text. Unfortunately, Burdoc currently doesn't handle maths, so in your example a lot of formulas are extracted as 'h6' headings, but it looks like it correctly identifies all other headings, aside from the document title, as 'h5' entries.
The following code shows how to get the page headings and then collect all content between the first heading found and the second.
Hope this helps!
import json

# Load data
with open("investigation.txt") as f:
    extract = json.load(f)

headings = []
# Get all 'h5' headings from the document
for page, page_headings in extract['page_hierarchy'].items():
    for h in page_headings:
        if h['assigned_heading'] == 'h5':
            headings.append(h)

first_content = []
page = headings[0]['page']
index = headings[0]['index'][0]
# Iterate over all items until we reach the next heading
while page < headings[1]['page'] or index < headings[1]['index'][0]:
    # Past the last item on this page: move to the start of the next page
    if index >= len(extract['content'][page]):
        page += 1
        index = 0
        continue
    first_content.append(extract['content'][page][index])
    index += 1
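If you want to reuse the filtering step, for example to keep several heading levels or to drop the 'h6' entries that come from formulas, it could be wrapped in a small helper along these lines. This is only a sketch: `collect_headings` and `sample_hierarchy` are made-up names for illustration, the sample data is hypothetical, and it assumes each entry carries the 'assigned_heading', 'page' and 'index' fields used above.

```python
def collect_headings(page_hierarchy, levels=("h5",)):
    """Return heading entries whose 'assigned_heading' is in `levels`.

    Hypothetical helper: assumes `page_hierarchy` maps each page to a list
    of entries with 'assigned_heading', 'page' and 'index' fields.
    """
    found = [
        h
        for page_headings in page_hierarchy.values()
        for h in page_headings
        if h["assigned_heading"] in levels
    ]
    # Sort by page, then by the first element of 'index', so the result
    # follows reading order regardless of the order of the dict.
    return sorted(found, key=lambda h: (int(h["page"]), h["index"][0]))


# Hypothetical stand-in for extract['page_hierarchy']
sample_hierarchy = {
    "0": [
        {"page": 0, "index": [4], "assigned_heading": "h5"},
        {"page": 0, "index": [2], "assigned_heading": "h6"},
    ],
    "1": [
        {"page": 1, "index": [0], "assigned_heading": "h5"},
    ],
}

main_headings = collect_headings(sample_hierarchy)
formula_like = collect_headings(sample_hierarchy, levels=("h6",))
```

With real output you would pass `extract['page_hierarchy']` instead of the sample, and widen `levels` if your document uses more than one heading level.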
Attachments: Investigation.pdf, Investigation.txt