Finalize getting Layout token lists for header fields #746
…nd add getter for title, authors and abstract layout token lists
Hi @lfoppiano! To get the layout tokens corresponding to any field of a parsed header, the idea is to use the following, which is already implemented:
resHeader = new BiblioItem();
resHeader.generalResultMapping(doc, labeledResult, tokenizationHeader);
// title
List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);
// abstract
List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
// keywords
List<LayoutToken> keywordTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_KEYWORD);
etc. This is generic for every label, and similar to the segmentation model, for instance. Those structures in |
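A minimal, self-contained sketch of the idea in this comment: tokens are grouped under their label while the labeled clusters are walked once, so any field can be looked up later by label. `LabelTokenIndex`, `Cluster`, `fill` and `getTokens` are hypothetical names written for illustration, and plain strings stand in for `LayoutToken`; none of this is GROBID code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT GROBID classes.
final class LabelTokenIndex {

    /** A contiguous run of tokens that received the same label. */
    static final class Cluster {
        final String label;
        final List<String> tokens;

        Cluster(String label, List<String> tokens) {
            this.label = label;
            this.tokens = tokens;
        }
    }

    private final Map<String, List<String>> tokensByLabel = new LinkedHashMap<>();

    /** Walks the clusters once and appends each cluster's tokens under its label. */
    void fill(List<Cluster> clusters) {
        for (Cluster cluster : clusters) {
            tokensByLabel
                .computeIfAbsent(cluster.label, key -> new ArrayList<>())
                .addAll(cluster.tokens);
        }
    }

    /** Returns the tokens for a label, or an empty list if the label was never seen. */
    List<String> getTokens(String label) {
        return tokensByLabel.getOrDefault(label, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        LabelTokenIndex index = new LabelTokenIndex();
        index.fill(Arrays.asList(
            new Cluster("<title>", Arrays.asList("Deep", "parsing")),
            new Cluster("<author>", Arrays.asList("A.", "Smith")),
            new Cluster("<title>", Arrays.asList("of", "headers"))));
        System.out.println(index.getTokens("<title>"));   // [Deep, parsing, of, headers]
        System.out.println(index.getTokens("<keyword>")); // []
    }
}
```

Returning an empty list for an unknown label is a choice made for this sketch; the snippets in this thread check the real getter's result against `null`, so callers of the actual API should keep that check.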
Thanks @kermitt2. I saw that method, but I could not find out how to get the labels and the tokens. I need both the clean bibliographic information and the layout tokens, and I would prefer not to call the
Here is a snippet of my code (see the entire class here):
// from the header, we are interested in title, abstract and keywords
SortedSet<DocumentPiece> headerDocumentParts = doc.getDocumentPart(SegmentationLabels.HEADER);
if (headerDocumentParts != null) {
    BiblioItem resHeader = new BiblioItem();
    // TODO: check whether to use consolidation or not
    GrobidAnalysisConfig config = GrobidAnalysisConfig.builder().consolidateHeader(1).build();
    parsers.getHeaderParser().processingHeaderSection(config, doc, resHeader, false);

    // title
    List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);
    if (isNotEmpty(titleTokens)) {
        documentBlocks.add(new DocumentBlock(DocumentBlock.SECTION_HEADER,
                DocumentBlock.SUB_SECTION_TITLE,
                normaliseAndCleanup(titleTokens)));
        biblioInfo.setTitle(resHeader.getTitle());
    }
    [...]
... but I could not find how to get the list of the labels, and the tokenized input. |
This is almost good, you just need to call
SortedSet<DocumentPiece> documentParts = doc.getDocumentPart(SegmentationLabels.HEADER);
if (documentParts != null) {
    Pair<String,List<LayoutToken>> headerFeatured = parsers.getHeaderParser().getSectionHeaderFeatured(doc, documentParts);
    String header = headerFeatured.getLeft();
    List<LayoutToken> headerTokenization = doc.getTokenizationParts(documentParts, doc.getTokenizations());
    // or following your taste:
    // List<LayoutToken> headerTokenization = headerFeatured.getRight();
    if ((header != null) && (header.trim().length() > 0)) {
        String labeledResult = parsers.getHeaderParser().label(header);

        // below to have the resHeader object with metadata ready for consolidation too
        // (declared first: a variable cannot be referenced in its own initializer)
        BiblioItem resHeader = new BiblioItem();
        resHeader = resultExtraction(labeledResult, headerTokenization, resHeader, doc);

        // to be able to get the list of LayoutToken list for any header field
        resHeader.generalResultMapping(doc, labeledResult, headerTokenization);

        // title
        List<LayoutToken> titleTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_TITLE);
        if (titleTokens != null) {
            // do something
        }

        // abstract
        List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
        if (abstractTokens != null) {
            // do something
        }

        // keywords
        List<LayoutToken> keywordTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_KEYWORD);
        if (keywordTokens != null) {
            // do something
        }
    }
} |
You can use something like the code excerpt, or wait until I update The |
Ah, I see. Unfortunately, with the snippet you suggest I will miss all the code that extracts and post-processes the various bibliographic information (e.g. extracting authors and affiliations and linking them together). I wanted to avoid copy-pasting the same code as in the Anyway, I can modify this PR to have such modifications:
would that be ok? |
Sure if you do the work I won't complain :D In This might impact some other grobid modules if they use this |
…essing and add getter for title, authors and abstract layout token lists" This reverts commit 0306aa0
OK, it should be fine now. There are a couple of cosmetic improvements in the consolidation.java that I have been going through. I removed the argument I also tried not to format the code 😅 |
All looks good, thanks! I am doing some background tests... |
I was too quick when merging this PR, there's a regression in the abstract recognition in the PMC set: Before:
After:
So we need to review the changes regarding abstract, I might have missed something in the way the abstract is further structured and the link with the old abstractLayoutTokens thingy. |
OK I found the issue. For the abstract, we don't allow discontinuous abstracts: we only keep the first abstract sequence (otherwise we gather junk, given our low amount of training data for headers). So:
if (clusterLabel.equals(TaggingLabels.HEADER_ABSTRACT)) {
    if (biblio.getAbstract() != null) {
        // this will need to be reviewed with more training data, for the moment
        // avoid concatenation for abstracts as it brings more noise than correct pieces
        //biblio.setAbstract(biblio.getAbstract() + " " + clusterContent);
    } else {
        biblio.setAbstract(clusterContent);
        List<LayoutToken> tokens = getLayoutTokens(cluster);
        biblio.addAbstractTokens(tokens);
    }
}
the layout token list for abstract is not equivalent to
Solution: unfortunately, having also a "working" layout token list for abstract :/ |
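To make the first-sequence-only rule concrete, here is a small self-contained sketch. It assumes a hypothetical `FirstAbstractOnly` holder with plain strings standing in for `LayoutToken`; it is not GROBID's `BiblioItem`. The point is that the kept abstract text and its token list are filled together from the first abstract cluster only, whereas a list that accumulates tokens from every abstract-labeled cluster would also contain the later fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; this is NOT GROBID's BiblioItem.
final class FirstAbstractOnly {

    private String abstractText; // stays null until the first abstract cluster arrives
    private final List<String> abstractTokens = new ArrayList<>();

    /** Accepts the first abstract cluster and ignores any later, discontinuous one. */
    void onAbstractCluster(String clusterContent, List<String> clusterTokens) {
        if (abstractText != null) {
            return; // later fragment: dropped for both the text and the tokens
        }
        abstractText = clusterContent;
        abstractTokens.addAll(clusterTokens);
    }

    String getAbstractText() {
        return abstractText;
    }

    List<String> getAbstractTokens() {
        return abstractTokens;
    }

    public static void main(String[] args) {
        FirstAbstractOnly item = new FirstAbstractOnly();
        item.onAbstractCluster("We present a parser.",
            Arrays.asList("We", "present", "a", "parser", "."));
        item.onAbstractCluster("Received 3 May",
            Arrays.asList("Received", "3", "May"));
        System.out.println(item.getAbstractText());   // We present a parser.
        System.out.println(item.getAbstractTokens()); // [We, present, a, parser, .]
    }
}
```

Keeping the text and the tokens behind the same guard is what keeps them aligned; any consumer that needs coordinates for the retained abstract should read this dedicated list rather than a per-label accumulation.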
pushing a fix in a few minutes after more tests... |
Oh, sorry about that. I will think about how to pass working / modified copies of the various components (abstract, authors) around without having them in the BiblioInfo, which represents the bibliographic output data |
For the abstract it's a temporary fix, and the idea is to remove it when more training data is available (it's why the call to |
We still have the issue that we are using BiblioItem for passing internal data to the process. I think with the new naming convention and the comment I should remember next time I stumble on it ;-) |
In this PR, I
- added a method to fill up the layoutTokens map that contains the layout tokens of the various components (title, abstract, etc.); it is filled up when the clusteror is processed
- added getters for the titleLayoutToken, abstractLayoutToken and authorsLayoutToken lists (probably these two items are duplicated, but they are used in different parts so I did not feel like removing one of them...)
- generalResultMapping() in resultExtraction of HeaderParser

The use case is that I process the header to get, let's say, title, abstract and keywords, and I want to process through their layout tokens.
Update: See comment