Headings (Table of Content, TOC) #10

MrUnknown789556 · 2023-06-22T13:28:01Z

Trying to extract the table of contents ("Introduction", ..., "References"), I looked at the HTML file that Burdoc generated. It distinguished the headings from other items in the text fairly well. Burdoc extracted all the named outline entries correctly, but it also extracted one additional item, "Table 4", which is not part of the TOC.

To find the TOC, I search the generated HTML file for the string "".

It seems to make no difference whether or not I run Burdoc with the "--no-ml-tables" parameter.

The.pdf

jennis0 · 2023-07-01T08:27:02Z

Ah, I think this one might be challenging as it's a false positive for one of the rules used to identify headings (a short bold piece of text directly preceding a standard paragraph and visually spaced from any prior text). Arguably it is a heading, albeit not one that'd be presented in a standard ToC.
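
To illustrate why "Table 4" slips through, here is a minimal sketch of a rule of that shape. It is not Burdoc's actual implementation: the `Block` type, the `looks_like_heading` function, and the `max_words` and `min_gap` thresholds are all hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """Hypothetical text block; not a Burdoc data structure."""
    text: str
    bold: bool
    gap_above: float  # vertical space to the previous block, in points


def looks_like_heading(block: Block, next_block: Optional[Block],
                       max_words: int = 8, min_gap: float = 6.0) -> bool:
    """Assumed rule: short, bold, spaced from prior text, followed by a normal paragraph."""
    is_short = len(block.text.split()) <= max_words
    followed_by_paragraph = next_block is not None and not next_block.bold
    return block.bold and is_short and block.gap_above >= min_gap and followed_by_paragraph


# Made-up sample layout: a real heading and a bold table label look identical to the rule.
blocks = [
    Block("Introduction", True, 12.0),
    Block("This paper describes ...", False, 2.0),
    Block("Table 4", True, 10.0),
    Block("Results for each configuration ...", False, 2.0),
]
headings = [b.text for b, nxt in zip(blocks, blocks[1:] + [None])
            if looks_like_heading(b, nxt)]
print(headings)  # ['Introduction', 'Table 4']
```

Both "Introduction" and "Table 4" satisfy every condition, so a rule based only on layout and font weight cannot tell them apart.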

I wouldn't expect --no-ml-tables to change this. Turning off table-finding only means we don't try to identify tables in the text; the text they contain still goes through the main text parsing pipeline. Burdoc also doesn't yet identify captions associated with tables, so it wouldn't make a difference even if the table had been found.
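
Until captions are identified, one possible workaround is to filter caption-like labels out on the consumer side. The sketch below uses only the Python standard library and assumes the generated HTML marks headings with `h1` to `h6` tags, which I haven't verified here against Burdoc's output; `HeadingCollector`, `extract_toc`, and the `CAPTION_LABEL` pattern are hypothetical names for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for caption-like labels such as "Table 4" or "Figure 2".
CAPTION_LABEL = re.compile(r"^(table|figure|fig\.?)\s+\d+\b", re.IGNORECASE)
HEADING_TAG = re.compile(r"h[1-6]")


class HeadingCollector(HTMLParser):
    """Collects the text of h1..h6 elements (assumes headings use these tags)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buffer = None  # list of text chunks while inside a heading

    def handle_starttag(self, tag, attrs):
        if HEADING_TAG.fullmatch(tag):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if HEADING_TAG.fullmatch(tag) and self._buffer is not None:
            self.headings.append("".join(self._buffer).strip())
            self._buffer = None


def extract_toc(html: str) -> list:
    """Return heading texts, dropping empty ones and caption-like labels."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return [h for h in parser.headings if h and not CAPTION_LABEL.match(h)]


# Made-up sample input, not taken from the attached PDF.
sample = "<h2>Introduction</h2><p>Text</p><h3>Table 4</h3><p>Text</p><h2>References</h2>"
print(extract_toc(sample))  # ['Introduction', 'References']
```

This would also drop a genuine section whose title happens to start with "Table 4", so the pattern may need adjusting per document.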
