Link footnotes in the text #944

Merged
merged 14 commits into from Sep 27, 2022

lfoppiano
Collaborator

This PR aims to link footnotes (already extracted by the segmentation model) to the text.
The current implementation uses a heuristic: it takes the footnote number and searches for the corresponding marker in the paragraph text on the same page as the footnote.
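
The lookup described above can be sketched as follows. This is a minimal illustration in Python, not the Java code of this PR: the `Footnote` and `Paragraph` classes and the `link_footnotes` function are hypothetical names, and the plain regex search is an assumption standing in for the actual matching.

```python
import re
from dataclasses import dataclass


@dataclass
class Footnote:
    label: str  # footnote number as printed, e.g. "5"
    page: int   # page on which the footnote appears


@dataclass
class Paragraph:
    text: str
    page: int


def link_footnotes(footnotes, paragraphs):
    """Return (footnote index, paragraph index, character offset) triples.

    For each footnote, only paragraphs on the same page are searched, and
    only the first standalone occurrence of the label is reported.
    """
    links = []
    for f_idx, note in enumerate(footnotes):
        # The label must not be glued to other digits ("5" must not match in "15").
        pattern = re.compile(r"(?<!\d)" + re.escape(note.label) + r"(?!\d)")
        for p_idx, para in enumerate(paragraphs):
            if para.page != note.page:
                continue
            match = pattern.search(para.text)
            if match:
                links.append((f_idx, p_idx, match.start()))
                break
    return links


paragraphs = [
    Paragraph("typically one to three paragraphs in length.5 Many problems", page=2),
    Paragraph("as shown in 5 earlier studies", page=3),
]
footnotes = [Footnote(label="5", page=2)]
print(link_footnotes(footnotes, paragraphs))
```

With no typographic constraint, a lookup like this accepts any standalone occurrence of the number, including list numbering such as "2.".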

As an example:
[image]

The result is injected into the output XML as follows (using the same strategy as for the references):

The building blocks for these theories are phrasal or clausal units, and the targets of the analyses are usually very short texts, typically one to three paragraphs in length.
                    <ref type="foot" target="#b5">5</ref> Many problems in discourse analysis, such as dialogue generation and turntaking 
                    <ref type="bibr" target="#b47">(Moore and Pollack 1992;</ref>

What still needs to be verified:

  • When we look up the marker in the text, we might link to the wrong place. False positives usually occur when the text contains a numbered list. Example:

[image: Untitled]

and the output is linked to the list item:

```xml
<p>the relative length of the document,
    <ref type="foot" target="#b2">2</ref>. the frequency of the term sets in the document, and 3. the distribution of the term sets with respect to the document and to each other.
</p>
```

  • The subscript/superscript attributes are not reliable, and I did not find a consistent alternative way to tell when a marker in the text is not a footnote marker.

@coveralls

coveralls commented Aug 27, 2022

Coverage Status

Coverage increased (+0.08%) to 39.961% when pulling 655ccdf on features/footnotes into 54d1c29 on master.

@kermitt2
Owner

Thanks a lot, Luca!

I think that without the superscript constraint on footnote callouts, this approach cannot work (too many false attachments).

Normally the superscript attribute is reliable when it is set to true, but coverage is incomplete: there are several cases where pdfalto does not yet detect superscript. However, as pdfalto improves on this, the coverage of a heuristic with a superscript condition will improve too.

Do you have examples of superscript attributes incorrectly set to true? This would be useful for pdfalto, as I don't have any for the moment.

Note: Grobid types the reference callouts at document level (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/engines/citations/CalloutAnalyzer.java#L21). This would tell us whether the reference callouts are superscript too, and prevent some false positives in the rare case where both footnotes and references are superscript numbers and, unluckily, a reference and a footnote share the same number on the same page.
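
To illustrate what document-level typing of reference callouts could look like, here is a small Python sketch. It is not the logic of `CalloutAnalyzer`: the style names, the `callout_style` and `dominant_style` functions, and the majority-vote rule are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical style names; they are not taken from CalloutAnalyzer.
BRACKET = "bracket"
PARENTHESIS = "parenthesis"
SUPERSCRIPT = "superscript"
UNKNOWN = "unknown"


def callout_style(text, is_superscript):
    """Classify a single reference callout by its surface form."""
    stripped = text.strip()
    if stripped.startswith("[") and stripped.endswith("]"):
        return BRACKET
    if stripped.startswith("(") and stripped.endswith(")"):
        return PARENTHESIS
    if is_superscript and re.fullmatch(r"\d+([,\-]\d+)*", stripped):
        return SUPERSCRIPT
    return UNKNOWN


def dominant_style(callouts):
    """Return the most frequent style over (text, is_superscript) pairs."""
    counts = Counter(callout_style(text, sup) for text, sup in callouts)
    counts.pop(UNKNOWN, None)  # unclassified callouts do not vote
    if not counts:
        return UNKNOWN
    return counts.most_common(1)[0][0]


def footnotes_may_collide_with_references(callouts):
    """Superscript footnote markers can only be confused with references
    when the references themselves are superscript numbers."""
    return dominant_style(callouts) == SUPERSCRIPT
```

If the dominant reference style is bracketed or parenthesized, a superscript number in the body is unlikely to be a reference, so it can be considered as a footnote callout with less risk.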

@lfoppiano
Collaborator Author

lfoppiano commented Aug 30, 2022

Thanks!
When looking for the footnote marker in the text, I've added the constraint that the marker must have superscript = true.
I only have examples of documents where the superscript attributes are not set at all.
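
A minimal Python sketch of the superscript constraint, for illustration only: the `Token` class and `find_footnote_callouts` function are hypothetical, and the real tokens come from the PDF extraction layer rather than being built by hand.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    superscript: bool = False  # as reported by the PDF extraction layer


def find_footnote_callouts(tokens, label):
    """Return indexes of tokens that equal the label and are superscript."""
    return [
        i for i, tok in enumerate(tokens)
        if tok.text == label and tok.superscript
    ]


# The first "2" opens a list item; the superscript "2" is the real callout.
tokens = [
    Token("document,"), Token("2"), Token("."), Token("the"),
    Token("frequency"), Token("matters"), Token("2", superscript=True),
]
print(find_footnote_callouts(tokens, "2"))  # [6]
```

The trade-off is precision over recall: when the superscript attribute is not set at all, nothing is linked, but the list item is no longer matched by mistake.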

I forgot to check the code in your note; I will have a look after I land in JP.

Moreover, here is the list of all the segmentation model training files containing footnotes:

022152v1.training.segmentation.tei.xml
056150v1.training.segmentation.tei.xml
0807.3577.training.segmentation.tei.xml
0811_0088.training.segmentation.tei.xml
0911_5430.training.segmentation.tei.xml
100.v84-264.training.segmentation.tei.xml
1003._0908.0095.training.segmentation.tei.xml
1013._technote.training.segmentation.tei.xml
1022._GoNaPhSe07_CustomersGeneralProperties.training.segmentation.tei.xml
1027._PhyRevD77-064013.training.segmentation.tei.xml
104.v84-299.training.segmentation.tei.xml
1046._p33-hearst.training.segmentation.tei.xml
1050._CLEF08Working_Notes_QA_Overview.training.segmentation.tei.xml
1105.training.segmentation.tei.xml
119.v83-444.training.segmentation.tei.xml
12.v88-171.training.segmentation.tei.xml
120._10.1.1.31.3616.training.segmentation.tei.xml
121._10.1.1.47.6586.training.segmentation.tei.xml
128._61008.training.segmentation.tei.xml
130._10.1.1.31.8153.training.segmentation.tei.xml
1309.7222.training.segmentation.tei.xml
131._10.1.1.45.9641.training.segmentation.tei.xml
135._sigcomm98.training.segmentation.tei.xml
145._shade.training.segmentation.tei.xml
146._10.1.1.43.8658.training.segmentation.tei.xml
150._75067.training.segmentation.tei.xml
1512.00014.training.segmentation.tei.xml
155._10.1.1.52.3535.training.segmentation.tei.xml
167._10.1.1.25.1950.training.segmentation.tei.xml
17.v88-048.training.segmentation.tei.xml
2.v91-008.training.segmentation.tei.xml
2020.02.17.20023747v2.full.training.segmentation.tei.xml
27.v87-316.training.segmentation.tei.xml
270._45580.training.segmentation.tei.xml
3.v90-330.training.segmentation.tei.xml
31.v87-066.training.segmentation.tei.xml
368._10.1.1.49.6162.training.segmentation.tei.xml
390._woolf.training.segmentation.tei.xml
394._10.1.1.47.8740.training.segmentation.tei.xml
439._GoyalVT98.training.segmentation.tei.xml
440._10.1.1.41.3430.training.segmentation.tei.xml
452._29904.training.segmentation.tei.xml
474._10.1.1.117.8006.training.segmentation.tei.xml
491._FrH8.training.segmentation.tei.xml
492._10.1.1.62.4528.training.segmentation.tei.xml
50.v86-097.training.segmentation.tei.xml
55001267.training.segmentation.tei.xml
55001337.training.segmentation.tei.xml
60.v85-592.training.segmentation.tei.xml
71.v85-432.training.segmentation.tei.xml
9.v89-169.training.segmentation.tei.xml
9911409.training.segmentation.tei.xml
Amilhat_Parinas.training.segmentation.tei.xml
AUSSANT2014INTER.training.segmentation.tei.xml
Bioinformatics-2007-Rivals-401-7.training.segmentation.tei.xml
C02-1160.training.segmentation.tei.xml
C12-1005.training.segmentation.tei.xml
Document_image_zone_classification_A_simple_high-p.training.segmentation.tei.xml
E14-1007.training.segmentation.tei.xml
E14-1075.training.segmentation.tei.xml
ecdl_dilia.training.segmentation.tei.xml
exception-analysis-resilience-ist.training.segmentation.tei.xml
f7247483-2721-4a2f-ace0-6113d752418a.training.segmentation.tei.xml
HCII07-LongSteph.training.segmentation.tei.xml
ims.training.segmentation.tei.xml
ipamin2014_paper4.training.segmentation.tei.xml
JAP0897669-CC.training.segmentation.tei.xml
MNRAS-2015-Richard-L16-20.training.segmentation.tei.xml
nihms743075.training.segmentation.tei.xml
P98-2139.training.segmentation.tei.xml
PMC4317227.training.segmentation.tei.xml
SSRN-id1425692-2.training.segmentation.tei.xml
W00-0734.training.segmentation.tei.xml
W09-1401.training.segmentation.tei.xml
W09-1403.training.segmentation.tei.xml
W09-1417.training.segmentation.tei.xml
W12-4305.training.segmentation.tei.xml
Wang-paperAVE2008.training.segmentation.tei.xml

@lfoppiano
Collaborator Author

I've reviewed the code of the CalloutAnalyzer and my own code, and I think it's ready for review.

@lfoppiano lfoppiano marked this pull request as ready for review September 7, 2022 03:44
@kermitt2
Owner

kermitt2 commented Sep 24, 2022

I made quite a lot of changes:

  • I removed some unused code and some redundant code. There are two types of notes, foot notes and margin notes. Margin notes were using the old code and foot notes the new one. I simplified this into a single object, Note, covering both types and using the same methods.

  • The TEI inline serialization of foot notes did not support more than one footnote callout match per paragraph and was "eating" some paragraph content in one case. I rewrote it to support several matches, sort them by position and rebuild the paragraph part by part.

  • Added a post-processing step for foot notes that are not correctly segmented (it seems that the segmentation model tends to agglutinate several foot notes together when they follow each other).
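
The last two points can be sketched in Python as below. This is an illustration under stated assumptions, not the Java code merged here: `serialize_paragraph` and `split_merged_notes` are hypothetical names, matches are assumed to be character offsets, and the splitting rule (a line starting with the next expected number opens a new note) is a guess at one workable approach.

```python
import re
from xml.sax.saxutils import escape


def serialize_paragraph(text, matches):
    """Rebuild a paragraph with inline footnote references.

    `matches` holds (start, end, target) tuples: character offsets of a
    callout in `text` and the id of the note it points to.
    """
    parts = []
    cursor = 0
    for start, end, target in sorted(matches):
        if start < cursor:
            continue  # overlapping match: keep the earlier one
        parts.append(escape(text[cursor:start]))
        parts.append(
            '<ref type="foot" target="#%s">%s</ref>'
            % (target, escape(text[start:end]))
        )
        cursor = end
    parts.append(escape(text[cursor:]))  # trailing text is never dropped
    return "<p>" + "".join(parts) + "</p>"


def split_merged_notes(block):
    """Split a block of merged notes into (label, text) pairs.

    A line opens a new note when it starts with the number that follows
    the previous label; any other line continues the current note.
    """
    notes = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.match(r"(\d+)\s+(\S.*)", line)
        previous = notes[-1][0] if notes else ""
        starts_note = match is not None and (
            previous == "" or int(match.group(1)) == int(previous) + 1
        )
        if starts_note:
            notes.append([match.group(1), match.group(2)])
        elif notes:
            notes[-1][1] += " " + line
        else:
            notes.append(["", line])  # text before any numbered note
    return [tuple(note) for note in notes]
```

Walking the sorted matches with a single cursor is what keeps the text between and after callouts intact, whatever order the matches were found in.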

There's still one thing to do to get it working: most of the superscript numbers will be recognized as bibliographical markers. These are filtered based on the value of MarkerType. If bibliographical markers are mainly in parentheses, in brackets or in superscript, MarkerType is first set to this marker style, to avoid mixing them with table markers, figure markers and footnote markers (which must be of a different style in a proper document). So most note markers will not appear labeled as paragraph text, but as bibliographical markers, which are then filtered out because they do not look like the bibliographical style of the current document.

-> So what needs to be done: match the note labels with the filtered-out superscript bibliographical markers.
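
A possible shape for this matching step, sketched in Python. Everything here is hypothetical: the `Callout` class, the `recover_note_callouts` function, and in particular the same-page condition, which is carried over from the heuristic in the PR description rather than taken from the merged code.

```python
from dataclasses import dataclass


@dataclass
class Callout:
    text: str          # marker as it appears in the body, e.g. "3"
    page: int
    superscript: bool


def recover_note_callouts(discarded, note_labels_by_page):
    """Split discarded reference callouts into note callouts and the rest.

    `discarded` are callouts rejected because their style differs from the
    document's reference style. `note_labels_by_page` maps a page number
    to the set of note labels found on that page.
    """
    recovered, remaining = [], []
    for callout in discarded:
        labels = note_labels_by_page.get(callout.page, set())
        if callout.superscript and callout.text.strip() in labels:
            recovered.append(callout)
        else:
            remaining.append(callout)
    return recovered, remaining
```

Callouts that match no note label stay discarded, so a stray superscript number is not promoted to a footnote callout.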

@kermitt2
Owner

kermitt2 commented Sep 24, 2022

I added the "recovery" of bibliographical callouts as footnote callouts.
It probably needs a bit more testing, and there is a minor issue with the space after the footnote callout in the TEI.

For instance, in this PDF, CIKM_2021_final_1085.pdf, we have 14 foot notes. We were matching only 3 in the text body. Now we match 10, and the 4 missing ones are foot notes not recognized by the segmentation model, so they cannot be matched.

@kermitt2 kermitt2 merged commit f9dc68f into master Sep 27, 2022
@lfoppiano lfoppiano deleted the features/footnotes branch September 28, 2022 04:44