Extraction of headings from academic journal articles #11

Open
MrUnknown789556 opened this issue Jul 7, 2023 · 0 comments


I fully understand that Burdoc is currently a beta version under development. As it stands, Burdoc cannot reliably extract headings from academic journal articles. With some articles it does a great job, but most of the time the result is unusable. I hope this topic will be given priority as Burdoc moves beyond beta.

When Burdoc is not working properly, it either puts no headings where they should be found in the JSON file, or it gives all the headings but mixed with many other objects from the article in the extracted JSON file.

I use either "h5" or "h6" to look for headings in the generated JSON file. I use the CLI.

h5 = extractBetween(JSONfile, '"h5", "block_text": "' , '", "items":');
h6 = extractBetween(JSONfile, '"h6", "block_text": "' , '", "items":');
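The same lookup can also be done by parsing the JSON instead of matching substrings, which avoids depending on key order or spacing. The following is a minimal sketch in Python (not the MATLAB used above). It assumes only what the excerpts in this issue show: blocks are JSON objects with a "type" and a "block_text" key. The top-level "content" key in the sample and the helper names are hypothetical, not taken from Burdoc.

```python
import json


def iter_blocks(node):
    """Yield every object in the parsed JSON that has "type" and "block_text"."""
    if isinstance(node, dict):
        if "type" in node and "block_text" in node:
            yield node
        # Keep descending: blocks may be nested at any depth.
        for value in node.values():
            yield from iter_blocks(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_blocks(item)


def extract_headings(json_text, heading_types=("h5", "h6")):
    """Return (type, text) pairs for all blocks whose type is a heading type."""
    data = json.loads(json_text)
    return [
        (block["type"], block["block_text"])
        for block in iter_blocks(data)
        if block["type"] in heading_types
    ]


# Hypothetical sample; the real Burdoc output layout may differ.
sample = json.dumps({
    "content": [
        {"type": "h5", "block_text": "1. Introduction", "items": []},
        {"type": "paragraph", "block_text": "Body text.", "items": []},
        {"type": "h6", "block_text": "1.1 Background", "items": []},
    ]
})

headings = extract_headings(sample)
```

Because the walk is recursive, it does not matter where in the file the heading blocks sit, only that they carry the expected "type" value.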

Sometimes the headings are also only found like here: {"type": "paragraph", "block_text": "UDEC MODELLING OF P-WAVE PROPAGATION ACROSS JOINTS", "items": [{"spans": [{"text": "UDEC MODELLING OF P-WAVE", "font": {"name": "font", "font":.
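As a stopgap for this case, paragraph blocks that look like headings could be flagged on the user side. The sketch below is in Python and is only a heuristic of my own, not something Burdoc does: it treats a short, all-uppercase "paragraph" block as a heading candidate. The function names and the 120-character limit are assumptions, and the heuristic will miss headings written in mixed case.

```python
import json


def looks_like_heading(text, max_chars=120):
    """Heuristic: short text that contains letters and is entirely uppercase."""
    text = text.strip()
    has_letters = any(ch.isalpha() for ch in text)
    return has_letters and len(text) <= max_chars and text == text.upper()


def candidate_headings(node):
    """Collect block_text of "paragraph" blocks that look like headings."""
    found = []
    if isinstance(node, dict):
        text = node.get("block_text")
        if node.get("type") == "paragraph" and isinstance(text, str):
            if looks_like_heading(text):
                found.append(text)
        # Blocks may be nested, so search all values.
        for value in node.values():
            found.extend(candidate_headings(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(candidate_headings(item))
    return found


# Hypothetical sample built from the block shown in this issue.
sample = json.loads(json.dumps([
    {"type": "paragraph",
     "block_text": "UDEC MODELLING OF P-WAVE PROPAGATION ACROSS JOINTS",
     "items": []},
    {"type": "paragraph",
     "block_text": "The joints were modelled as planar interfaces.",
     "items": []},
]))

candidates = candidate_headings(sample)
```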

My impression is that Burdoc finds nearly all headings in an article, but a program cannot pick them out of the JSON file, because they are not stored systematically: they are scattered through the JSON and "hidden" among ordinary text blocks.

I attach a log file from a test run on 816 randomly chosen academic articles. Some are older, some are more recent.

I also attach a few of the articles where the extracted headings are not as expected. One article where the headings were extracted as expected ("Theoretical and Numerical Research on V-Cut Parameters and Auxiliary Cuthole Criterion in Tunnelling") is attached as well. Further PDF articles listed in the log file can be supplied on request (frank230458@yahoo.dk).

Once this error in extracting headings from academic journal articles is fixed, all headings should be found in one place in the JSON, under either 'h5' or 'h6', rather than some under h5, others under h6, and others under quite different IDs.

2023.07.06 Test of TOC.txt
The effect of impact velocity and target thickness on ballistic performance of layered plates using Taguchi method-compressed.pdf
The effect of shell material and load coefficient on the expansion of shell driven by detonation.pdf
The effects of axial length on the fracture and fragmentation of expanding rings.pdf
The energy absorption enhancement in aramid fiber-reinforced poly(benzoxazine-co-urethane) composite armors under ballistic impacts.pdf
The influence of asymmetries in shaped charge performance.pdf
Theoretical and Numerical Research on V-Cut Parameters and Auxiliary Cuthole Criterion in Tunnelling.pdf
