ignore invalid utf-8 sequences by default #475

de-code · 2019-08-05T20:45:55Z

This is a workaround for #472

By default it will ignore invalid UTF-8 sequences as it did in GROBID 0.5.3 and before.
The validation can be turned on via the properties file (grobid.3rdparty.pdf2xml.validation.enabled).
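
For illustration, a flag like this could be read as a boolean that defaults to off. This is a minimal sketch only: the class and method names are hypothetical, it uses plain `java.util.Properties` instead of GROBID's own property handling, and the `false` default is inferred from the description above.

```java
import java.util.Properties;

// Hypothetical helper, not the code added by this PR.
final class ValidationFlag {

    // Property key as named in the PR description.
    static final String KEY = "grobid.3rdparty.pdf2xml.validation.enabled";

    // Returns true only if the property is explicitly set to "true".
    static boolean isValidationEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(isValidationEnabled(props)); // prints "false"
        props.setProperty(KEY, "true");
        System.out.println(isValidationEnabled(props)); // prints "true"
    }
}
```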

/cc @lfoppiano

de-code · 2019-08-05T20:46:32Z

grobid-core/src/main/java/org/grobid/core/utilities/GrobidPropertyKeys.java

@@ -15,6 +15,7 @@
    String PROP_3RD_PARTY_PDFTOXML = "grobid.3rdparty.pdf2xml.path";
    String PROP_3RD_PARTY_PDFTOXML_MEMORY_LIMIT = "grobid.3rdparty.pdf2xml.memory.limit.mb";
    String PROP_3RD_PARTY_PDFTOXML_TIMEOUT_SEC = "grobid.3rdparty.pdf2xml.memory.timeout.sec";
+    String PROP_3RD_PARTY_PDFTOXML_VALIDATION_ENABLED = "grobid.3rdparty.pdf2xml.validation.enabled";


I am assuming this also belongs to "third party"? (just trying to be consistent)

kermitt2 · 2019-08-06

No, it's the XML parsing part inside GROBID, not third party. But I would suggest being robust to encoding problems all the time, and so not introducing this additional property.

de-code · 2019-08-05T20:48:16Z

grobid-core/src/main/java/org/grobid/core/utilities/GrobidProperties.java

@@ -569,6 +569,12 @@ public static Integer getPdfToXMLTimeoutMs() {
        return Integer.parseInt(getPropertyValue(GrobidPropertyKeys.PROP_3RD_PARTY_PDFTOXML_TIMEOUT_SEC, "60"), 10) * 1000;
    }

+    public static boolean isPdfToXMLValidationEnabled() {


Not quite sure why we spell it Pdf but XML. The Java naming convention is all caps, but personally I prefer Pdf and Xml.

kermitt2 · 2019-08-06

Not quite sure either :)
Maybe a compromise between the different conventions :D

coveralls · 2019-08-05T20:56:29Z

Coverage increased (+0.01%) to 38.029% when pulling d41a05a on elifesciences:ignore-invalid-utf-8-sequences into 206583a on kermitt2:master.

lfoppiano

In principle I'm fine on how the change has been done.

I'm wondering whether we should replace these invalid characters with some placeholder or just remove them.
@kermitt2 what do you think?

de-code · 2019-08-06T05:59:17Z

> I'm wondering whether we should replace these invalid characters with some placeholder or just remove them.

Yes, I think that would be better. Actually GROBID 0.5.3 was also displaying some odd character.

de-code · 2019-08-06T07:33:28Z

I am also wondering whether there should be an option to save the erroneous XML files. I had to hack the code to do that and inspect the XML. Or is there an option not to delete the temp files?

kermitt2 · 2019-08-06T08:17:18Z

So as I mentioned in the issue, it's something wrong with pdfalto. The contract between pdfalto and grobid now is somehow that pdfalto must neutralize all these encoding problems before sending the ALTO file to grobid.

Of course, having GROBID robust to that is also very nice. I find the property name misleading: in the context of XML, validation has a different meaning, namely XML validation (something we really don't want to do, or to allow, with the ALTO files), whereas here it's just Unicode correctness in the XML parsing (well-formedness).

What about ignoring the Unicode problem all the time in the XML parsing? It's not clear what the benefit is of keeping the check and having an option to turn it on or off.

About the temporary ALTO XML files (the .lxml extension), they are indeed always removed. For inspecting those files, I was simply re-generating them with the pdfalto command line, which is very simple and avoids adding options?

de-code · 2019-08-06T08:18:16Z

I have revised it to replace with question marks. Previously it didn't completely ignore the bytes but skipped bytes until the UTF-8 sequence was valid. It now displays a question mark for every invalid byte. In the example above it ends up with ??5 in the output.
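
The difference between dropping and replacing invalid bytes can be sketched with the JDK's `CharsetDecoder`. This is an illustration only, not the code in this PR: the class and method names are made up, and the sample bytes (0xFF 0xFE followed by `5`) were picked to reproduce the `??5` shape rather than taken from the example document.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of GROBID.
final class Utf8ErrorActionDemo {

    // Decodes UTF-8, silently dropping any invalid byte sequence.
    static String decodeIgnoring(byte[] bytes) {
        return decode(bytes, StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.IGNORE)
            .onUnmappableCharacter(CodingErrorAction.IGNORE));
    }

    // Decodes UTF-8, substituting the given string for each invalid sequence.
    static String decodeReplacing(byte[] bytes, String replacement) {
        return decode(bytes, StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .replaceWith(replacement));
    }

    private static String decode(byte[] bytes, CharsetDecoder decoder) {
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Only reachable with CodingErrorAction.REPORT, which is not used here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xFF, (byte) 0xFE, '5'};
        System.out.println(decodeIgnoring(bytes));       // prints "5"
        System.out.println(decodeReplacing(bytes, "?")); // prints "??5"
    }
}
```

If `replaceWith` is not called, the decoder's default replacement is U+FFFD, the Unicode replacement character.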

de-code · 2019-08-06T08:28:33Z

> So as I mentioned in the issue, it's something wrong with pdfalto. The contract between pdfalto and grobid now is somehow that pdfalto must neutralize all these encoding problems before sending the ALTO file to grobid.

I would agree that the XML generated by pdfalto should ideally be valid.

> Of course, having GROBID robust to that is also very nice. I find the property name misleading: in the context of XML, validation has a different meaning, namely XML validation (something we really don't want to do, or to allow, with the ALTO files), whereas here it's just Unicode correctness in the XML parsing (well-formedness).

I agree, validate is an overloaded term and in the XML world usually associated with schema validation like you said. I couldn't think of a better name, I didn't try very hard to be honest. Any suggestion?

> What about ignoring the Unicode problem all the time in the XML parsing? It's not clear what the benefit is of keeping the check and having an option to turn it on or off.

In principle, rejecting badly encoded files seems to be a fair choice. This is really a workaround, as it is not clear what to do with the badly encoded bytes. Once pdfalto fulfils the contract, some people might prefer to turn it on. But I wouldn't be entirely opposed to removing the configuration.

> About the temporary ALTO XML files (the .lxml extension), they are indeed always removed. For inspecting those files, I was simply re-generating them with the pdfalto command line, which is very simple and avoids adding options?

Okay, fair enough. It is just that it tells me there is an error with that particular XML file but by the time I could look at it, it's gone. Running it manually I would need to make sure that I call it with the same parameters. But it's not a big issue. Let's leave it out.

kermitt2 · 2019-08-06T08:46:22Z

> I have revised it to replace with question marks. Previously it didn't completely ignore the bytes but skipped bytes until the UTF-8 sequence was valid. It now displays a question mark for every invalid byte. In the example above it ends up with ??5 in the output.

OK, I think it has to be consistent with the way placeholders from pdfalto are managed in GROBID, otherwise it's going to be messy.

pdfalto generates placeholders within a certain Unicode character range (in theory it should do the same in this case too, but for some reason there's a problem here). For me, the way those placeholders are used will depend on the application. For instance, the placeholders could be kept for further processing because they are structurally useful (in dictionaries, for example, we have special characters that act as separators but fail encoding resolution), or we could replace them all with a special character, or we could ignore them all. In grobid-core, I think we could ignore them.

I don't like putting hard coded question marks. How can we differentiate a normal question mark from an encoding issue then?

I think the best would be to always ignore encoding issues when parsing XML, as you introduced here (we don't want to reject badly encoded ALTO files when we can process them), and just log the encoding issue so that we keep track of the error and address the problem in pdfalto.

I would simply ignore those characters in grobid-core, for instance by calling a single method that determines what to do for all placeholders and the "CodingErrorAction"?
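
The idea of one method that decides what happens to every placeholder could look roughly like the sketch below. Everything in it is hypothetical: the class, the enum and the method do not exist in GROBID, and U+FFFD is used only as a stand-in because the placeholder range that pdfalto emits is not stated in this thread.

```java
// Hypothetical sketch, not GROBID code.
final class PlaceholderPolicy {

    // Assumption: U+FFFD stands in for whatever placeholder pdfalto emits.
    static final char PLACEHOLDER = '\uFFFD';

    // What an application wants to do with placeholders.
    enum Action { KEEP, REMOVE, REPLACE }

    // Single entry point deciding what happens to every placeholder in a text.
    // The substitute character is only used for Action.REPLACE.
    static String apply(String text, Action action, char substitute) {
        switch (action) {
            case REMOVE:
                return text.replace(String.valueOf(PLACEHOLDER), "");
            case REPLACE:
                return text.replace(PLACEHOLDER, substitute);
            default:
                return text; // KEEP: leave placeholders for downstream processing
        }
    }

    public static void main(String[] args) {
        String text = "12 " + PLACEHOLDER + "m";
        System.out.println(apply(text, Action.REMOVE, ' '));  // prints "12 m"
        System.out.println(apply(text, Action.REPLACE, '#')); // prints "12 #m"
    }
}
```

With this shape, grobid-core could pass `REMOVE` while another application keeps or substitutes the placeholders, which matches the point that the right handling depends on the application.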

de-code · 2019-08-06T10:58:59Z

> > I have revised it to replace with question marks. Previously it didn't completely ignore the bytes but skipped bytes until the UTF-8 sequence was valid. It now displays a question mark for every invalid byte. In the example above it ends up with ??5 in the output.

> OK, I think it has to be consistent with the way placeholders from pdfalto are managed in GROBID, otherwise it's going to be messy.

> pdfalto generates placeholders within a certain Unicode character range (in theory it should do the same in this case too, but for some reason there's a problem here). For me, the way those placeholders are used will depend on the application. For instance, the placeholders could be kept for further processing because they are structurally useful (in dictionaries, for example, we have special characters that act as separators but fail encoding resolution), or we could replace them all with a special character, or we could ignore them all. In grobid-core, I think we could ignore them.

Using a special character that is unlikely to be used otherwise makes sense to me. In some cases the original character could have been important. In the example document it is μ (micro) as a unit. The impact of misinterpreting the invalid byte or character may not even be clear to the application, only to the user. Using that special character would keep the problem visible to the user, unless the application really knows what to do with it (in which case it could replace it).

> I don't like putting hard coded question marks. How can we differentiate a normal question mark from an encoding issue then?

I agree, a question mark is not the best choice. Maybe use a designated unicode character. I don't know what pdfalto is using?

> I think the best would be to always ignore encoding issues when parsing XML, as you introduced here (we don't want to reject badly encoded ALTO files when we can process them), and just log the encoding issue so that we keep track of the error and address the problem in pdfalto.

> I would simply ignore those characters in grobid-core, for instance by calling a single method that determines what to do for all placeholders and the "CodingErrorAction"?

CodingErrorAction also has a REPORT option:

> Action indicating that a coding error is to be reported, either by returning a CoderResult object or by throwing a CharacterCodingException, whichever is appropriate for the method implementing the coding process.

I am not quite sure how to control that it should log and ignore. Do you?
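
One possible way to both log and ignore is to configure the decoder with REPORT and drive it by hand: each time `decode` returns an error result, record it and skip the offending bytes before continuing. The sketch below only illustrates that technique. The class name is hypothetical, and it logs to `System.err` and an in-memory list rather than to whatever logger GROBID uses.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not GROBID code.
final class LoggingUtf8Decoder {

    // Byte offsets at which invalid input was skipped.
    final List<Integer> skippedOffsets = new ArrayList<>();

    String decode(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never decodes to more chars than there are bytes.
        CharBuffer out = CharBuffer.allocate(bytes.length);

        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (!result.isError()) {
                break; // underflow: all input has been consumed
            }
            // With REPORT the input position sits at the start of the bad bytes.
            int offset = in.position();
            skippedOffsets.add(offset);
            System.err.println("Invalid UTF-8 at byte offset " + offset
                + " (" + result.length() + " byte(s) skipped)");
            in.position(offset + result.length());
        }
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        LoggingUtf8Decoder decoder = new LoggingUtf8Decoder();
        String text = decoder.decode(new byte[] {(byte) 0xFF, (byte) 0xFE, '5'});
        System.out.println(text);                   // prints "5"
        System.out.println(decoder.skippedOffsets); // prints "[0, 1]"
    }
}
```

The trade-off is that this decodes a whole byte array in memory. It does not plug directly into an `InputStreamReader`, which applies the decoder's configured action itself and offers no hook for logging.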

de-code · 2019-08-08T16:52:45Z

I couldn't easily find which placeholder character pdfalto uses.

Could we maybe agree, for now, to change it as follows?

- use the placeholder character used by pdfalto (to be defined)
- remove the configuration option and always replace invalid UTF-8 byte sequences with it

Ideally this shouldn't happen anyway and I raised kermitt2/pdfalto#68

removed no longer used line added line feed

kermitt2 · 2020-11-29T22:24:02Z

I've manually merged this PR without the property (commit 2bd9f00): the invalid UTF-8 characters are now always ignored, to avoid failing on the whole document because of a couple of crappy character codes :)
Thanks a lot @de-code !
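
As a rough illustration of "always ignored", an XML parser can be handed a `Reader` whose UTF-8 decoder drops invalid bytes, so the parser never sees them. The sketch below is an assumption about the general shape of such a fix, not a copy of commit 2bd9f00. The class name and the tiny `<t>` document are hypothetical, and a real ALTO handler would do far more than collect text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not GROBID code.
final class LenientXmlParsing {

    // Parses the XML and returns its character content, ignoring invalid UTF-8 bytes.
    static String textContent(InputStream xml) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.IGNORE)
            .onUnmappableCharacter(CodingErrorAction.IGNORE);
        // The parser reads characters from this reader, so it never decodes bytes itself.
        Reader reader = new InputStreamReader(xml, decoder);

        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(reader), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "<t>", two invalid bytes, "5", "</t>"
        byte[] xml = {'<', 't', '>', (byte) 0xFF, (byte) 0xFE, '5', '<', '/', 't', '>'};
        System.out.println(textContent(new ByteArrayInputStream(xml))); // prints "5"
    }
}
```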

lfoppiano reviewed Aug 6, 2019

This was referenced Sep 9, 2019

ignore invalid utf 8 sequences elifesciences/grobid#7

Merged

ignore invalid utf 8 sequences elifesciences/grobid#11

Merged

kermitt2 force-pushed the master branch from 4a7e4f3 to 9ad861e Compare October 26, 2019 18:54

de-code force-pushed the ignore-invalid-utf-8-sequences branch from 35dc2c4 to a270f58 Compare November 29, 2019 09:20

de-code force-pushed the ignore-invalid-utf-8-sequences branch from 413cd89 to 73f452c Compare April 17, 2020 15:05

ignore invalid utf-8 sequences by default

d41a05a

removed no longer used line added line feed

de-code force-pushed the ignore-invalid-utf-8-sequences branch from 73f452c to d41a05a Compare April 17, 2020 15:08

kermitt2 added this to the 0.6.1 milestone Apr 24, 2020

lfoppiano mentioned this pull request Jul 30, 2020

Processing failed with error 500. .MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence. #612

Open

kermitt2 modified the milestones: 0.6.1, 0.6.2 Aug 12, 2020

kermitt2 added a commit that referenced this pull request Nov 29, 2020

merging manually PR #475 without option in properties

2bd9f00

kermitt2 closed this Nov 29, 2020
