Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finalize getting Layout token lists for header fields #746

Merged
merged 6 commits into from
Apr 22, 2021

Conversation

lfoppiano
Copy link
Collaborator

@lfoppiano lfoppiano commented Apr 21, 2021

In this PR, I
- added a method to fill up the layoutTokens map that contains the layout tokens of the various component (title, abstract, etc..) and it's filled up when the clusteror is processed
- added getter of titleLayoutToken, abstractLayoutToken and authorsLayoutToken lists.
(probably these two items are duplicated but they are used in different part so I did not feel like to remove one of them...

  • call the method generalResultMapping() in resultExtraction of HeaderParser
  • remove lists of layout tokens for abstract, title and authors, from the BiblioItem class
  • various cosmetic improvements, removed some not-needed nulls-checks
  • removed unused parameters in certain methods
  • added a way to better understand the exception when calling glutton

The use case is that I process the header to get, let's say title, abstract, keywords and I want to process through their layout tokens..

Update: See comment

@lfoppiano lfoppiano requested a review from kermitt2 April 21, 2021 08:06
@lfoppiano lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Apr 21, 2021
@coveralls
Copy link

coveralls commented Apr 21, 2021

Coverage Status

Coverage decreased (-0.005%) to 38.953% when pulling 9338e2e on bugfix/header-layout-tokens-results into 0b42a3c on master.

@kermitt2
Copy link
Owner

kermitt2 commented Apr 21, 2021

Hi @lfoppiano !

To get the layout tokens corresponding to any fields of a parsed header, the idea is to use the following already implemented stuff:

resHeader = new BiblioItem();
resHeader.generalResultMapping(doc, labeledResult, tokenizationHeader);

// title
List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);

// abstract
List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);

// keywords
List<LayoutToken> keywordTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_KEYWORD);

etc.

This is generic for every labels, and similar to the segmentation model for instance.

Those structures in BiblioItem should be remove, and the approach above generalized in Grobid and grobid modules, but when I worked on it (when I updated the header model last year), I got stuck with other things.

@lfoppiano
Copy link
Collaborator Author

lfoppiano commented Apr 21, 2021

Thanks @kermitt2.

I saw that method but I could not find out how to get the labels and the tokens. I need both the clean bibliographic information and the layout tokens, and I would prefer not to call the label() method of the HeaderParser twice.

here a snippet of my code, see the entire class here:

// from the header, we are interested in title, abstract and keywords
        SortedSet<DocumentPiece> headerDocumentParts = doc.getDocumentPart(SegmentationLabels.HEADER);
        if (headerDocumentParts != null) {
            BiblioItem resHeader = new BiblioItem();
            //TODO: check if to use consolidation or not
            GrobidAnalysisConfig config = GrobidAnalysisConfig.builder().consolidateHeader(1).build();
            parsers.getHeaderParser().processingHeaderSection(config, doc, resHeader, false);

           // title
            List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);
            if (isNotEmpty(titleTokens)) {
                documentBlocks.add(new DocumentBlock(DocumentBlock.SECTION_HEADER,
                    DocumentBlock.SUB_SECTION_TITLE,
                    normaliseAndCleanup(titleTokens)));
                biblioInfo.setTitle(resHeader.getTitle());
            }
       [...]
...

but I could not find how to get the list of the labels, and the tokenized input.

@kermitt2 kermitt2 removed the bug From Hemiptera and especially its suborder Heteroptera label Apr 21, 2021
@kermitt2
Copy link
Owner

kermitt2 commented Apr 21, 2021

This is almost good, you just need to call generalResultMapping() after labeling. In the future calling generalResultMapping() will be done in resultExtraction() and nothing more will be necessary. I will try to finalize that soon now that I remember it's not finished :D.

            SortedSet<DocumentPiece> documentParts = doc.getDocumentPart(SegmentationLabels.HEADER);
            if (documentParts != null) {
                Pair<String,List<LayoutToken>> headerFeatured = parsers.getHeaderParser().getSectionHeaderFeatured(doc, documentParts);
                String header = headerFeatured.getLeft();
                List<LayoutToken> headerTokenization = doc.getTokenizationParts(documentParts, doc.getTokenizations());
                // or following your taste:
                // List<LayoutToken> headerTokenization = headerFeatured.getRight();

                if ((header != null) && (header.trim().length() > 0)) {
                    String labeledResult = parsers.getHeaderParser().label(header);

                    // below to have the resHeader object with metadata ready for consolidation too
                    BiblioItem resHeader = resultExtraction(labeledResult, headerTokenization, resHeader, doc);

                    // to be able to get the list of LayoutToken list for any header field
                    resHeader.generalResultMapping(doc, labeledResult, headerTokenization);

                    // title
                    List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);
                    if (titleTokens != null) {
                        // do something
                    } 

                    // abstract
                    List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
                    if (abstractTokens != null) {
                        // do something
                    } 

                    // keywords
                    List<LayoutToken> keywordTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_KEYWORD);
                    if (keywordTokens != null) {
                        // do something
                    }
                }
            }

@kermitt2
Copy link
Owner

You can use something like the code excerpt or wait that I update resultExtraction() to have it available without an additional call to generalResultMapping().

The titleLayoutTokens, authorsLayoutTokens and abstractLayoutTokens will be removed from BiblioItem, it's really hacky and not extensible.

@lfoppiano
Copy link
Collaborator Author

Ah, I see, unfortunately, with the snippet you suggest I will miss all the code that is extracting and post-processing the various bibliographic information (e.g. extracting authors and affiliations and link them together). I wanted to avoid copy-pasta the same code as in the processingHeaderSection.

Anyway, I can modify this PR to have such modifications:

  • remove the lists of layout tokens
  • call generalResultMapping

would that be ok?

@kermitt2
Copy link
Owner

Sure if you do the work I won't complain :D

In resultExtraction() there's a few calls to the setters for these lists to be removed too. And in FullTextParser() there's a call to getAbstractTokens() to be replaced by resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT); normally that's it...

This might impact some other grobid modules if they use this getAbstractTokens() too...

…essing and add getter for title, authors and abstract layout token lists"

This reverts commit 0306aa0
@lfoppiano
Copy link
Collaborator Author

OK, it should be fine now. There are a couple of cosmetic improvements in the consolidation.java that I have been going through.

I removed the argument Document in a couple of places, where it was passed but not used.
I've also removed some of the null checks where the object was supposed to be never null (thanks to the IDE's inspection 😸 )

I also tried not to format the code 😅

@lfoppiano lfoppiano added this to the 0.7.0 milestone Apr 22, 2021
@kermitt2
Copy link
Owner

All look good thanks ! I am doing some background tests...

@kermitt2 kermitt2 changed the title Header 's layout token lists of title, authors and abstract are empty or without getter Finalize getting Layout token lists for header fields Apr 22, 2021
@kermitt2 kermitt2 merged commit 4114e74 into master Apr 22, 2021
@lfoppiano lfoppiano deleted the bugfix/header-layout-tokens-results branch April 22, 2021 06:17
@kermitt2
Copy link
Owner

I was too quick when merging this PR, there's a regression in the abstract recognition in the PMC set:

Before:

abstract             96.82        86.81        85.4         86.1         1911   

After:

abstract             94.37        74.84        73.63        74.23        1911   

So we need to review the changes regarding abstract, I might have missed something in the way the abstract is further structured and the link with the old abstractLayoutTokens thingy.

@kermitt2
Copy link
Owner

kermitt2 commented Apr 22, 2021

OK I found the issue. For the abstract, we don't allow discontinuous abstract, we only keep the first abstract sequence (otherwise we gather junk, given our low amount of training data for headers).

so:

            if (clusterLabel.equals(TaggingLabels.HEADER_ABSTRACT)) {
                if (biblio.getAbstract() != null) {
                    // this will need to be reviewed with more training data, for the moment
                    // avoid concatenation for abstracts as it brings more noise than correct pieces
                    //biblio.setAbstract(biblio.getAbstract() + " " + clusterContent);
                } else {
                    biblio.setAbstract(clusterContent);
                    List<LayoutToken> tokens = getLayoutTokens(cluster);
                    biblio.addAbstractTokens(tokens);
                }
            }

the layout token list for abstract is not equivalent to List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT); which will return all the abstract sequences (so more tokens).

Solution: unfortunately, having also a "working" layout token list for abstract :/

@kermitt2
Copy link
Owner

pushing a fix in a few minutes after more tests...

kermitt2 added a commit that referenced this pull request Apr 22, 2021
@lfoppiano
Copy link
Collaborator Author

Oh, sorry about that.

I will think about it how to find a way to pass working / modified copies of the various components (abstract, authors) around, without having them in the BiblioInfo which represent the bibiligraphic output data

@kermitt2
Copy link
Owner

For the abstract it's a temporary fix and the idea is to remove it when more training data will be available (it's why the call to getLayoutTokens() was in comment in FulltextParser class). With only 600 headers, there are very few examples of non-continuous abstracts, so the junk happens for this particular case.

@lfoppiano
Copy link
Collaborator Author

We still have the issue that we are using BiblioItem for passing internal data to the process. I think with the new naming convention and the comment I should remember next time I tumble on it ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants