
Hyphen at line break removed #180

Open
de-code opened this issue May 3, 2017 · 16 comments
Labels
bug, enhancement
Milestone

Comments

@de-code
Copy link
Collaborator

de-code commented May 3, 2017

In the first pubmed evaluation manuscript, a number of times 'α2-integrin' is at a line break, e.g.:
"was mediated through the inhibition of expression of α2-
integrin (1,2). Integrins are receptors that mediate attachment"

In the output it becomes:
"was mediated through the inhibition of expression of α 2integrin..."
(The space is another issue #179)

In some cases it may be desirable to remove the hyphen, but not in this case. Probably never when a number is adjacent?
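One simple guard along those lines would be to keep the hyphen whenever a digit touches it on either side. A minimal sketch, assuming a word split across a line break; the class and method names are illustrative, not GROBID's actual API:

```java
// Hypothetical sketch: never drop a line-break hyphen when either side of
// the break touches a digit (e.g. "α2-" + "integrin" -> "α2-integrin").
public class HyphenHeuristic {
    static String joinAtLineBreak(String before, String after) {
        // 'before' ends with the hyphen that sat at the end of the line
        String stem = before.substring(0, before.length() - 1);
        boolean digitAdjacent = !stem.isEmpty() && !after.isEmpty()
                && (Character.isDigit(stem.charAt(stem.length() - 1))
                    || Character.isDigit(after.charAt(0)));
        // keep the hyphen next to digits, otherwise glue the halves together
        return digitAdjacent ? stem + "-" + after : stem + after;
    }
}
```

This would still dehyphenize ordinary breaks like "inhibi- tion" while preserving "α2-integrin".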

@de-code
Copy link
Collaborator Author

de-code commented May 3, 2017

Actually, there is another, less clear-cut word hyphenation example:
"Mean PK parameters CL, V, and F were calculated by non-
compartmental analysis. The tumor growth experiments in"

Becomes:
"Mean PK parameters CL, V, and F were calculated by noncompartmental analysis...."

In the nxml file it is annotated as 'non-compartmental'. I believe both versions are valid. So I would probably not treat that as a bug. (But I thought it was worth mentioning anyhow)

@kermitt2
Copy link
Owner

kermitt2 commented May 3, 2017

Thanks for the issue!

Dehyphenization is tricky ;). I was aware of these issues, but it's very useful to have a dedicated issue to discuss them.

So far what is implemented is very simplistic - it does not require dictionaries or resources, but it produces the errors you are mentioning. Basically, when there is a hyphen at the end of a line, we always dehyphenize.

We could introduce additional rules to improve it, such as checking for digits in the tokens before concatenating them, and/or using a language-specific list of prefixes (like anti-, non-, post-). We might still have some rare errors/exceptions. Adding rules manually is endless and not really in the spirit of GROBID...
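A hedged sketch of the prefix-list rule mentioned above; the prefix set and names are purely illustrative, not GROBID code:

```java
import java.util.Set;

// Illustrative rule: keep the hyphen when the first half of a broken word
// is a known language-specific prefix.
public class PrefixRule {
    static final Set<String> PREFIXES = Set.of("anti", "non", "post", "pre", "self");

    static String dehyphenize(String before, String after) {
        // strip the trailing line-break hyphen if present
        String stem = before.endsWith("-")
                ? before.substring(0, before.length() - 1) : before;
        if (PREFIXES.contains(stem.toLowerCase())) {
            return stem + "-" + after;  // e.g. "non-" + "compartmental"
        }
        return stem + after;            // e.g. "mor-" + "tality"
    }
}
```

As noted, such a list would never be exhaustive, which is exactly the maintenance problem with manual rules.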

Another approach could be to use a machine-learning-based text cleaner/normalizer, for instance a seq2seq model, but then the problem is having enough training data. The advantage would be that it could maybe tackle other text cleaning problems at the same time, like diacritic combinations, invalid spacing, etc.

@de-code
Copy link
Collaborator Author

de-code commented May 4, 2017

Intuitively I was going to claim that words being broken across lines is uncommon, but scanning through the same PDF shows it is actually quite common. Another approach could be to look at other occurrences within the same document. I checked a few examples (around 7); the only one I couldn't find a second time so far was 'NON-MEM'. If it were up to me, I would like the option to include an element for hyphens at line boundaries. Then I could do some post-processing.

Would you be happy to use the PMC dataset as training data?

Otherwise, that seems to be training data that should be fairly easy to extract from any PDF with accompanying XML / text data (at least in English). Maybe even without a PDF.
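The "look at other occurrences within the same document" idea could be sketched roughly as follows; all names here are hypothetical, not part of GROBID:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative heuristic: prefer whichever form (joined or hyphenated)
// already occurs elsewhere in the same document's text.
public class CorpusVote {
    static String resolve(String stem, String tail, String documentText) {
        long joined = count(documentText, stem + tail);
        long hyphenated = count(documentText, stem + "-" + tail);
        return hyphenated > joined ? stem + "-" + tail : stem + tail;
    }

    private static long count(String text, String word) {
        // Pattern.quote so the literal "-" is not treated as a regex token
        Matcher m = Pattern.compile(Pattern.quote(word)).matcher(text);
        long n = 0;
        while (m.find()) n++;
        return n;
    }
}
```

As observed above, this fails for words that appear only once in the document (like 'NON-MEM'), so it could only be a tie-breaker, not a complete solution.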

@liar666
Copy link
Collaborator

liar666 commented May 4, 2017 via email

@kermitt2
Copy link
Owner

As a complement, here is a blog post from one of the founders of Authorea about the proportion of scientific papers written with LaTeX (~18% of all articles, but, as expected, largely dominating in a couple of domains).

https://www.authorea.com/users/3/articles/107393-how-many-scholarly-articles-are-written-in-latex/_show_article

@borkdude
Copy link

borkdude commented Mar 12, 2018

We are also running into this issue. Example PDF: http://ecp.acponline.org/sepoct01/kent.pdf
There are a lot of hyphens in the text at the ends of lines, breaking words:
[screenshot from 2018-03-12 showing the extracted text with hyphenation artifacts]
E.g. "modeling techniques" becomes "modeling tech- niques".

Is it possible to at least return the newlines, so we can modify the result ourselves based on some rules?

@kermitt2
Copy link
Owner

@borkdude thank you for reporting these errors - I think the best would be to have a more robust dehyphenization process (normally it does not work that badly...).

The problems with outputting the End Of Line (EOL) in the final TEI result are:

  1. sometimes EOL are pure garbage in some PDFs (like one word per line), and often there are many more EOL in the actual PDF stream than what we see, so it might not be so useful for post-processing (GROBID, on the other hand, has all the coordinate information in each token to improve dehyphenation)

  2. the principle of TEI is to give the logical structure of the document, abstracting away from any presentation information. So this would require writing some sort of alternative debug TEI output, and everybody would want different information out for post-processing, which would be a pain to develop and maintain, I think. So the best, imho, is to get the dehyphenization as good as possible in GROBID, because dehyphenation is really part of its job.

@lfoppiano hello Luca, would you have some time to look at these dehyphenization errors? My bad excuse: you're the last one who modified it :D :D

@lfoppiano lfoppiano self-assigned this Mar 27, 2018
@lfoppiano lfoppiano added bug, enhancement labels Mar 27, 2018
@lfoppiano lfoppiano added this to the 0.6.0 milestone Mar 27, 2018
@lfoppiano
Copy link
Collaborator

FYI I'm checking the pdf kent.pdf:

  1. the text coming from the abstract (BiblioItem.getAbstract()) is missing line breaks, which makes the dehyphenization fail (given the naive assumption that hyphen + line break (in whatever form) = dehyphenization). See the raw text before the call to dehyphenize():
CONTEXT. A meta-analysis found that primary percutaneous transluminal coronary angioplasty (PTCA) was more effective than thrombolytic therapy in reducing mor-tality from acute myocardial infarction. However, fewer than 20% of U.S. hospitals have facilities to perform PTCA and many clinicians must choose between immedi-ate thrombolytic therapy and delayed PTCA. COUNT. The number of minutes of PTCA-related delay that would nullify its bene-fits. CALCULATION. For 10 published randomized trials, we calculated the following: PTCA-related delay = median "door-to-balloon" time -median "door-to-needle" time Survival benefit = 30-day mortality after thrombolytic therapy -30-day mortality after PTCA The relationship between delay and benefit was assessed with linear regression. RESULTS. The reported PTCA-related delay ranged from 7 to 59 minutes, while the absolute survival benefit ranged from -2.2% (favoring thrombolytic therapy) to 7.4% (favoring PTCA). Across trials, the survival benefit decreased as the PTCA-related delay increased: For each additional 10-minute delay, the benefit was predicted to decrease 1.7% (P< 0.001). Linear regression showed that at a PTCA-related delay of 50 minutes, PTCA and thrombolytic therapy yielded equivalent reductions in mor-tality . CONCLUSIONS. In clinical trials with short PTCA-related delays, PTCA produced better outcomes, while trials with longer delays favored thrombolytic therapy. A more precise estimate of the time interval to equipoise between the two therapies needs to be modeled with patient-level data. At experienced cardiac centers, PTCA is probably still preferable, even with delays longer than 50 minutes.
  2. In the fulltext the dehyphenization is not applied (it was removed, perhaps because it wasn't working so well?).

What I would do is:

  1. apply dehyphenization using LayoutToken and not text (the method taking text can always tokenize on the fly)
  2. review (and migrate using the Clusteror) how the abstract is extracted (perhaps this should be another task?) because, at first glance, it looks like some line breaks are lost
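Point 1 could look roughly like the following; the Token record here is a simplified stand-in for GROBID's LayoutToken (I am assuming an end-of-line flag per token, which is not the real API):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of token-level dehyphenization over a mock token stream.
public class TokenDehyphenizer {
    // Stand-in for LayoutToken: text plus whether the token ends a line.
    record Token(String text, boolean endOfLine) {}

    static List<String> dehyphenize(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            // a "-" carrying the end-of-line flag glues the previous
            // and next tokens together
            if (t.text().equals("-") && t.endOfLine()
                    && !out.isEmpty() && i + 1 < tokens.size()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + tokens.get(++i).text());
            } else {
                out.add(t.text());
            }
        }
        return out;
    }
}
```

Working on tokens rather than flat text would avoid the lost-`\n` problem entirely, since the line-break information travels with the token.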

@kermitt2
Copy link
Owner

  1. yes, that's why in this case we apply another dehyphenization method (dehyphenizeHard()) which does not expect a line break. It explains why mor- tality is correctly dehyphenized as mortality in the abstract of the pdf example.
    The problem with the header model is that it is very old and does not work with LayoutToken. The best would be to update this complete model and put it in line with the other models, but that's quite a lot of work.

  2. Ah, it seems it was my mistake, but I don't remember exactly why I removed it. Probably I wanted to have it explicitly called where it is relevant (even in the full text, in some fields, like formulas, we don't want to dehyphenize, but still normalize the text), via TextUtilities.dehyphenize() in the appropriate fields in TEIFormatter.java.

About the dehyphenization: the current method using LayoutToken is not working well. Dehyphenization using text is much better for the moment because it is more flexible with the surrounding spaces, which is why that one was used. The method using LayoutToken should be reviewed/extended, I think.

Be careful that dehyphenize must only be called on certain fields where we are sure to have only text; performing it at clusteror level does not seem the right moment, because we do not yet know the exact type of the current labelled segment.

@lfoppiano
Copy link
Collaborator

> yes, that's why in this case we apply another dehyphenization method (dehyphenizeHard()) which does not expect a line break. It explains why mor- tality is correctly dehyphenized as mortality in the abstract of the pdf example.
> The problem with the header model is that it is very old and does not work with LayoutToken. The best would be to update this complete model and put it in line with the other models, but that's quite a lot of work.

  1. the issue with mor- tality occurs before the dehyphenization, since the \n is lost. That is the first reason for my suggestion to work directly on the layout tokens.

> Ah, it seems it was my mistake, but I don't remember exactly why I removed it. Probably I wanted to have it explicitly called where it is relevant (even in the full text, in some fields, like formulas, we don't want to dehyphenize, but still normalize the text), via TextUtilities.dehyphenize() in the appropriate fields in TEIFormatter.java.

  2. Yes, indeed it has to be applied only to text

> About the dehyphenization: the current method using LayoutToken is not working well. Dehyphenization using text is much better for the moment because it is more flexible with the surrounding spaces, which is why that one was used. The method using LayoutToken should be reviewed/extended, I think.

The current dehyphenization method using layout tokens is incomplete. I would aim to merge the three methods into a single one using layout tokens, with the option of a more aggressive approach.

> Be careful that dehyphenize must only be called on certain fields where we are sure to have only text; performing it at clusteror level does not seem the right moment, because we do not yet know the exact type of the current labelled segment.

The idea was to use the clusteror to extract, and apply the dehypenisation after the text is recomposed, not at the same moment.

@kermitt2
Copy link
Owner

OK I see, you were talking about the abstract for the clusteror. As I said, the old-fashioned header model does not use LayoutToken for decoding the CRF results; it follows a different logic where the EOL are (voluntarily) not preserved - they were actually used to represent two discontinuous segments of the same field, for instance for keyword or author fields... hence the different dehyphenization method (which works fine in the kent.pdf example).

It would be necessary to entirely rewrite the method HeaderParser.resultExtraction() (with the clusteror for decoding CRF results) and pay attention to some other things in BiblioItem (there is a special hack to propagate LayoutToken for authors, in order to have bounding boxes for authors present in the TEI - we would need to find a way to generalize that, so as to keep the layout tokens for any field when creating the corresponding bounding boxes).

For me it was a different task, issue #136, to have every aspect updated at the same time - which is why I mentioned that it is quite a lot of work (and also why it has been an open issue for a year and a half ;) ). Then the textual fields extracted from the header would be aligned with all the other models, and ready to use the common dehyphenization method.

@lfoppiano
Copy link
Collaborator

OK, so I will focus on the dehyphenize() method using LayoutTokens, and we could have a version taking text and tokenizing it under the hood. I will see whether to also merge in the aggressive version or not.

@lfoppiano
Copy link
Collaborator

I've implemented something to fix the dehyphenization. I'm sure it will require a couple of iterations.
Could someone test it, focusing only on the body (not the abstract)?

@lfoppiano
Copy link
Collaborator

I've run the pubmed end-to-end evaluation.

======= Header metadata ======= 

Evaluation on 1942 random PDF files out of 1943 PDF (ratio 1.0).

======= Strict Matching ======= (exact matches)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             82.02        15.67        14.45        15.04  
authors              93.12        69.59        67.61        68.58  
first_author         97.99        93.65        90.56        92.08  
keywords             93.28        69.29        56.27        62.1   
title                93.34        71.44        68.68        70.03  

all fields           91.95        64.1         59.86        61.91   (micro average)
                     91.95        63.93        59.51        61.57   (macro average)


======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             88.33        48.38        44.61        46.42  
authors              93.22        70.06        68.08        69.06  
first_author         98.07        94.03        90.92        92.45  
keywords             94.08        75.8         61.57        67.95  
title                94.66        77.92        74.91        76.39  

all fields           93.67        73.34        68.49        70.83   (micro average)
                     93.67        73.24        68.02        70.45   (macro average)


==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             94.67        81.26        74.92        77.96  
authors              95.84        82.75        80.4         81.56  
first_author         98.08        94.08        90.97        92.5   
keywords             95.7         89.02        72.3         79.79  
title                96.11        84.99        81.71        83.32  

all fields           96.08        86.26        80.56        83.31   (micro average)
                     96.08        86.42        80.06        83.03   (macro average)


= Ratcliff/Obershelp Matching = (Minimum Ratcliff/Obershelp similarity at 0.95)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             93.63        75.87        69.95        72.79  
authors              94.05        74.1         72           73.03  
first_author         97.99        93.65        90.56        92.08  
keywords             95.15        84.46        68.6         75.71  
title                95.3         81.03        77.9         79.43  

all fields           95.22        81.66        76.26        78.87   (micro average)
                     95.22        81.82        75.8         78.61   (macro average)

===== Instance-level results =====

Total expected instances:       1941
Total correct instances:        146 (strict) 
Total correct instances:        437 (soft) 
Total correct instances:        874 (Levenshtein) 
Total correct instances:        710 (ObservedRatcliffObershelp) 

Instance-level recall:  7.52    (strict) 
Instance-level recall:  22.51   (soft) 
Instance-level recall:  45.03   (Levenshtein) 
Instance-level recall:  36.58   (RatcliffObershelp) 

======= Citation metadata ======= 

Evaluation on 1942 random PDF files out of 1943 PDF (ratio 1.0).

======= Strict Matching ======= (exact matches)

===== Field-level results =====

label                accuracy     precision    recall       f1     

authors              97.24        81.29        70.09        75.27  
date                 98.82        92.2         77.57        84.25  
first_author         98.29        88.91        76.55        82.27  
inTitle              95.86        71.08        66.85        68.9   
issue                99.37        83.63        77.86        80.64  
page                 98.47        92.11        78.65        84.85  
title                96.79        77.43        69.43        73.21  
volume               99.05        94.81        82.16        88.03  

all fields           97.98        85.15        74.56        79.51   (micro average)
                     97.98        85.18        74.89        79.68   (macro average)


======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)

===== Field-level results =====

label                accuracy     precision    recall       f1     

authors              97.32        81.85        70.58        75.8   
date                 98.82        92.2         77.57        84.25  
first_author         98.3         89.03        76.65        82.38  
inTitle              97.37        81.62        76.76        79.12  
issue                99.37        83.63        77.86        80.64  
page                 98.47        92.11        78.65        84.85  
title                98.31        88.56        79.41        83.74  
volume               99.05        94.81        82.16        88.03  

all fields           98.37        88.33        77.34        82.47   (micro average)
                     98.37        87.98        77.45        82.35   (macro average)


==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

authors              98.09        87.39        75.35        80.93  
date                 98.82        92.2         77.57        84.25  
first_author         98.33        89.21        76.8         82.54  
inTitle              97.52        82.68        77.76        80.15  
issue                99.37        83.63        77.86        80.64  
page                 98.47        92.11        78.65        84.85  
title                98.66        91.16        81.73        86.19  
volume               99.05        94.81        82.16        88.03  

all fields           98.54        89.65        78.5         83.7    (micro average)
                     98.54        89.15        78.49        83.45   (macro average)


= Ratcliff/Obershelp Matching = (Minimum Ratcliff/Obershelp similarity at 0.95)

===== Field-level results =====

label                accuracy     precision    recall       f1     

authors              97.65        84.27        72.66        78.04  
date                 98.82        92.2         77.57        84.25  
first_author         98.29        88.93        76.56        82.28  
inTitle              97.17        80.23        75.46        77.77  
issue                99.37        83.63        77.86        80.64  
page                 98.47        92.11        78.65        84.85  
title                98.54        90.26        80.93        85.34  
volume               99.05        94.81        82.16        88.03  

all fields           98.42        88.69        77.66        82.81   (micro average)
                     98.42        88.3         77.73        82.65   (macro average)

===== Instance-level results =====

Total expected instances:               89789
Total extracted instances:              86507
Total correct instances:                35610 (strict) 
Total correct instances:                46360 (soft) 
Total correct instances:                50433 (Levenshtein) 
Total correct instances:                47359 (RatcliffObershelp) 

Instance-level precision:       41.16 (strict) 
Instance-level precision:       53.59 (soft) 
Instance-level precision:       58.3 (Levenshtein) 
Instance-level precision:       54.75 (RatcliffObershelp) 

Instance-level recall:  39.66   (strict) 
Instance-level recall:  51.63   (soft) 
Instance-level recall:  56.17   (Levenshtein) 
Instance-level recall:  52.74   (RatcliffObershelp) 

Instance-level f-score: 40.4 (strict) 
Instance-level f-score: 52.59 (soft) 
Instance-level f-score: 57.21 (Levenshtein) 
Instance-level f-score: 53.73 (RatcliffObershelp) 

Matching 1 :    62566

Matching 2 :    3384

Matching 3 :    2786

Matching 4 :    665

Total matches : 69401

======= Fulltext structures ======= 

Evaluation on 1942 random PDF files out of 1943 PDF (ratio 1.0).

======= Strict Matching ======= (exact matches)

===== Field-level results =====

label                accuracy     precision    recall       f1     

figure_title         96.55        28.3         23.14        25.46  
reference_citation   57.14        55.99        52.72        54.31  
reference_figure     94.58        61.04        61.1         61.07  
reference_table      99.08        82.87        82.21        82.54  
section_title        94.47        74.91        66.88        70.67  
table_title          97.44        7.91         8.22         8.06   

all fields           89.88        58.19        54.69        56.38   (micro average)
                     89.88        51.84        49.05        50.35   (macro average)


======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)

===== Field-level results =====

label                accuracy     precision    recall       f1     

figure_title         98.4         74.09        60.57        66.65  
reference_citation   59.53        60.14        56.63        58.33  
reference_figure     94.53        62.06        62.12        62.09  
reference_table      99.08        83.39        82.73        83.06  
section_title        95.08        79.14        70.65        74.65  
table_title          97.57        15.51        16.13        15.81  

all fields           90.7         63.24        59.44        61.28   (micro average)
                     90.7         62.39        58.14        60.1    (macro average)

====================================================================================

@kermitt2 do you see any differences (hopefully a little improvement) with the previous e2e measures?

@kermitt2
Copy link
Owner

kermitt2 commented May 4, 2018

There are differences; in particular I see a loss in citation metadata and an improvement on abstracts. However, the only way to be sure is to run it on the same architecture with and without the fixes (in case you have a branch). It also depends on whether you use consolidation or not.

@lfoppiano lfoppiano removed their assignment Jul 3, 2019
@lfoppiano
Copy link
Collaborator

lfoppiano commented Oct 18, 2019

I've checked based on the first comment, and with 0.5.6 the hyphens are safe :-)

I also checked the comment from @borkdude and we improved the results on kent.pdf. @borkdude, could you have a look, especially if you have other cases?

de-code pushed a commit to elifesciences/grobid that referenced this issue Nov 29, 2019