Finalize getting Layout token lists for header fields #746

lfoppiano · 2021-04-21T07:56:15Z

In this PR, I
~~- added a method to fill up the layoutTokens map that contains the layout tokens of the various component (title, abstract, etc..) and it's filled up when the clusteror is processed~~
- added getter of titleLayoutToken, abstractLayoutToken and authorsLayoutToken lists.
(probably these two items are duplicated but they are used in different part so I did not feel like to remove one of them...

call the method generalResultMapping() in resultExtraction of HeaderParser
remove lists of layout tokens for abstract, title and authors, from the BiblioItem class
various cosmetic improvements, removed some not-needed nulls-checks
removed unused parameters in certain methods
added a way to better understand the exception when calling glutton

The use case is that I process the header to get, let's say title, abstract, keywords and I want to process through their layout tokens..

Update: See comment

…nd add getter for title, authors and abstract layout token lists

coveralls · 2021-04-21T08:08:52Z

Coverage decreased (-0.005%) to 38.953% when pulling 9338e2e on bugfix/header-layout-tokens-results into 0b42a3c on master.

kermitt2 · 2021-04-21T15:53:22Z

Hi @lfoppiano !

To get the layout tokens corresponding to any fields of a parsed header, the idea is to use the following already implemented stuff:

resHeader = new BiblioItem();
resHeader.generalResultMapping(doc, labeledResult, tokenizationHeader);

// title
List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);

// abstract
List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);

// keywords
List<LayoutToken> keywordTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_KEYWORD);

etc.

This is generic for every labels, and similar to the segmentation model for instance.

Those structures in BiblioItem should be remove, and the approach above generalized in Grobid and grobid modules, but when I worked on it (when I updated the header model last year), I got stuck with other things.

lfoppiano · 2021-04-21T23:21:20Z

Thanks @kermitt2.

I saw that method but I could not find out how to get the labels and the tokens. I need both the clean bibliographic information and the layout tokens, and I would prefer not to call the label() method of the HeaderParser twice.

here a snippet of my code, see the entire class here:

// from the header, we are interested in title, abstract and keywords
        SortedSet<DocumentPiece> headerDocumentParts = doc.getDocumentPart(SegmentationLabels.HEADER);
        if (headerDocumentParts != null) {
            BiblioItem resHeader = new BiblioItem();
            //TODO: check if to use consolidation or not
            GrobidAnalysisConfig config = GrobidAnalysisConfig.builder().consolidateHeader(1).build();
            parsers.getHeaderParser().processingHeaderSection(config, doc, resHeader, false);

           // title
            List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);
            if (isNotEmpty(titleTokens)) {
                documentBlocks.add(new DocumentBlock(DocumentBlock.SECTION_HEADER,
                    DocumentBlock.SUB_SECTION_TITLE,
                    normaliseAndCleanup(titleTokens)));
                biblioInfo.setTitle(resHeader.getTitle());
            }
       [...]
...

but I could not find how to get the list of the labels, and the tokenized input.

kermitt2 · 2021-04-21T23:44:00Z

This is almost good, you just need to call generalResultMapping() after labeling. In the future calling generalResultMapping() will be done in resultExtraction() and nothing more will be necessary. I will try to finalize that soon now that I remember it's not finished :D.

            SortedSet<DocumentPiece> documentParts = doc.getDocumentPart(SegmentationLabels.HEADER);
            if (documentParts != null) {
                Pair<String,List<LayoutToken>> headerFeatured = parsers.getHeaderParser().getSectionHeaderFeatured(doc, documentParts);
                String header = headerFeatured.getLeft();
                List<LayoutToken> headerTokenization = doc.getTokenizationParts(documentParts, doc.getTokenizations());
                // or following your taste:
                // List<LayoutToken> headerTokenization = headerFeatured.getRight();

                if ((header != null) && (header.trim().length() > 0)) {
                    String labeledResult = parsers.getHeaderParser().label(header);

                    // below to have the resHeader object with metadata ready for consolidation too
                    BiblioItem resHeader = resultExtraction(labeledResult, headerTokenization, resHeader, doc);

                    // to be able to get the list of LayoutToken list for any header field
                    resHeader.generalResultMapping(doc, labeledResult, headerTokenization);

                    // title
                    List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);
                    if (titleTokens != null) {
                        // do something
                    } 

                    // abstract
                    List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
                    if (abstractTokens != null) {
                        // do something
                    } 

                    // keywords
                    List<LayoutToken> keywordTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_KEYWORD);
                    if (keywordTokens != null) {
                        // do something
                    }
                }
            }

kermitt2 · 2021-04-21T23:46:58Z

You can use something like the code excerpt or wait that I update resultExtraction() to have it available without an additional call to generalResultMapping().

The titleLayoutTokens, authorsLayoutTokens and abstractLayoutTokens will be removed from BiblioItem, it's really hacky and not extensible.

lfoppiano · 2021-04-21T23:57:39Z

Ah, I see, unfortunately, with the snippet you suggest I will miss all the code that is extracting and post-processing the various bibliographic information (e.g. extracting authors and affiliations and link them together). I wanted to avoid copy-pasta the same code as in the processingHeaderSection.

Anyway, I can modify this PR to have such modifications:

remove the lists of layout tokens
call generalResultMapping

would that be ok?

kermitt2 · 2021-04-22T00:05:06Z

Sure if you do the work I won't complain :D

In resultExtraction() there's a few calls to the setters for these lists to be removed too. And in FullTextParser() there's a call to getAbstractTokens() to be replaced by resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT); normally that's it...

This might impact some other grobid modules if they use this getAbstractTokens() too...

…essing and add getter for title, authors and abstract layout token lists" This reverts commit 0306aa0

lfoppiano · 2021-04-22T00:53:47Z

OK, it should be fine now. There are a couple of cosmetic improvements in the consolidation.java that I have been going through.

I removed the argument Document in a couple of places, where it was passed but not used.
I've also removed some of the null checks where the object was supposed to be never null (thanks to the IDE's inspection 😸 )

I also tried not to format the code 😅

grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java

…ed code

grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java

kermitt2 · 2021-04-22T02:31:53Z

All look good thanks ! I am doing some background tests...

kermitt2 · 2021-04-22T19:17:11Z

I was too quick when merging this PR, there's a regression in the abstract recognition in the PMC set:

Before:

abstract             96.82        86.81        85.4         86.1         1911

After:

abstract             94.37        74.84        73.63        74.23        1911

So we need to review the changes regarding abstract, I might have missed something in the way the abstract is further structured and the link with the old abstractLayoutTokens thingy.

kermitt2 · 2021-04-22T20:05:04Z

OK I found the issue. For the abstract, we don't allow discontinuous abstract, we only keep the first abstract sequence (otherwise we gather junk, given our low amount of training data for headers).

so:

            if (clusterLabel.equals(TaggingLabels.HEADER_ABSTRACT)) {
                if (biblio.getAbstract() != null) {
                    // this will need to be reviewed with more training data, for the moment
                    // avoid concatenation for abstracts as it brings more noise than correct pieces
                    //biblio.setAbstract(biblio.getAbstract() + " " + clusterContent);
                } else {
                    biblio.setAbstract(clusterContent);
                    List<LayoutToken> tokens = getLayoutTokens(cluster);
                    biblio.addAbstractTokens(tokens);
                }
            }

the layout token list for abstract is not equivalent to List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT); which will return all the abstract sequences (so more tokens).

Solution: unfortunately, having also a "working" layout token list for abstract :/

kermitt2 · 2021-04-22T20:09:50Z

pushing a fix in a few minutes after more tests...

lfoppiano · 2021-04-23T00:43:55Z

Oh, sorry about that.

I will think about it how to find a way to pass working / modified copies of the various components (abstract, authors) around, without having them in the BiblioInfo which represent the bibiligraphic output data

kermitt2 · 2021-04-23T01:32:55Z

For the abstract it's a temporary fix and the idea is to remove it when more training data will be available (it's why the call to getLayoutTokens() was in comment in FulltextParser class). With only 600 headers, there are very few examples of non-continuous abstracts, so the junk happens for this particular case.

lfoppiano · 2021-04-23T05:46:54Z

We still have the issue that we are using BiblioItem for passing internal data to the process. I think with the new naming convention and the comment I should remember next time I tumble on it ;-)

lfoppiano added 2 commits April 21, 2021 16:51

fill up the layout token lists of the header's part when processing a…

0306aa0

…nd add getter for title, authors and abstract layout token lists

show the exception in the log when the consolidation fails

48117ef

lfoppiano requested a review from kermitt2 April 21, 2021 08:06

lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Apr 21, 2021

kermitt2 removed the bug From Hemiptera and especially its suborder Heteroptera label Apr 21, 2021

lfoppiano added 2 commits April 22, 2021 09:13

Revert "fill up the layout token lists of the header's part when proc…

b68b96f

…essing and add getter for title, authors and abstract layout token lists" This reverts commit 0306aa0

remove authors, abstract and title layout tokens

af229cb

lfoppiano added this to the 0.7.0 milestone Apr 22, 2021

kermitt2 requested changes Apr 22, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java Show resolved Hide resolved

revert authorLayoutTokens and with a different name, remove some unus…

de7abbe

…ed code

lfoppiano commented Apr 22, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java Show resolved Hide resolved

cosmetics

9338e2e

lfoppiano commented Apr 22, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java Show resolved Hide resolved

kermitt2 approved these changes Apr 22, 2021

View reviewed changes

kermitt2 changed the title ~~Header 's layout token lists of title, authors and abstract are empty or without getter~~ Finalize getting Layout token lists for header fields Apr 22, 2021

kermitt2 merged commit 4114e74 into master Apr 22, 2021

lfoppiano deleted the bugfix/header-layout-tokens-results branch April 22, 2021 06:17

kermitt2 added a commit that referenced this pull request Apr 22, 2021

fix abstract field regression due to PR #746

a5f46ba

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finalize getting Layout token lists for header fields #746

Finalize getting Layout token lists for header fields #746

lfoppiano commented Apr 21, 2021 •

edited

Loading

coveralls commented Apr 21, 2021 •

edited

Loading

kermitt2 commented Apr 21, 2021 •

edited

Loading

lfoppiano commented Apr 21, 2021 •

edited

Loading

kermitt2 commented Apr 21, 2021 •

edited

Loading

kermitt2 commented Apr 21, 2021

lfoppiano commented Apr 21, 2021

kermitt2 commented Apr 22, 2021

lfoppiano commented Apr 22, 2021

kermitt2 commented Apr 22, 2021

kermitt2 commented Apr 22, 2021

kermitt2 commented Apr 22, 2021 •

edited

Loading

kermitt2 commented Apr 22, 2021

lfoppiano commented Apr 23, 2021

kermitt2 commented Apr 23, 2021

lfoppiano commented Apr 23, 2021

Finalize getting Layout token lists for header fields #746

Finalize getting Layout token lists for header fields #746

Conversation

lfoppiano commented Apr 21, 2021 • edited Loading

coveralls commented Apr 21, 2021 • edited Loading

kermitt2 commented Apr 21, 2021 • edited Loading

lfoppiano commented Apr 21, 2021 • edited Loading

kermitt2 commented Apr 21, 2021 • edited Loading

kermitt2 commented Apr 21, 2021

lfoppiano commented Apr 21, 2021

kermitt2 commented Apr 22, 2021

lfoppiano commented Apr 22, 2021

kermitt2 commented Apr 22, 2021

kermitt2 commented Apr 22, 2021

kermitt2 commented Apr 22, 2021 • edited Loading

kermitt2 commented Apr 22, 2021

lfoppiano commented Apr 23, 2021

kermitt2 commented Apr 23, 2021

lfoppiano commented Apr 23, 2021

lfoppiano commented Apr 21, 2021 •

edited

Loading

coveralls commented Apr 21, 2021 •

edited

Loading

kermitt2 commented Apr 21, 2021 •

edited

Loading

lfoppiano commented Apr 21, 2021 •

edited

Loading

kermitt2 commented Apr 21, 2021 •

edited

Loading

kermitt2 commented Apr 22, 2021 •

edited

Loading